This post is intended for data scientists, others working in the data space, and data enthusiasts.
Introduction
The chief objective of this post is to review patterns in current COVID-19 prediction models: the pros and cons of statistical and machine learning-based models, some prominent models currently in use, and what kinds of models may be used in the near future.
Since work in this area is moving forward very quickly, few relevant peer-reviewed academic papers are available yet. The majority of sources referred to for this piece come from prominent, trusted publications such as The New York Times, and the sources they link to. Additionally, this piece focuses primarily on models that predict the ‘curve’, i.e. the total number of confirmed cases of COVID-19 and related measures such as the number of active cases and the number of confirmed cases per day. Models are also being developed to diagnose patients from chest X-rays, to study the biology of the virus, and to explore the chemical composition of the drugs needed to fight it, but that work is at too early a stage, and making well-informed claims about it would require expertise in biology.
Trend #1 – The Most ‘Influential’ Models Are All Statistical
This piece defines an influential model as one that plays a role in informing a government’s policies for containing the spread of COVID-19. The methods used to fit the curve are generally some form of regression analysis.
The most well-known of these is the Institute for Health Metrics and Evaluation’s (IHME) model. It has played a role in shaping policy decisions pertaining to COVID-19 in the United States, most European countries and some Central American countries. It uses the Gaussian error function (erf) to fit a curve predicting the number of deaths per day, the overall number of deaths and hospital bed usage in the United States. In the early stages of the outbreak in the US, the model was trained on patient data from China. It is believed that the stricter restrictions in China at the time the data was recorded caused the model to output more optimistic numbers, leading it to be treated as a ‘best-case’ outcome.
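As a rough illustration of this kind of curve fitting (not the IHME team’s actual code or parameterization), the sketch below fits an erf-shaped cumulative-deaths curve to synthetic data with scipy; all numbers are made up for the example.

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

# Synthetic cumulative-death counts, for illustration only
rng = np.random.default_rng(0)
days = np.arange(60)
observed = 1000 * (1 + erf(0.08 * (days - 30))) / 2 + rng.normal(0, 10, days.size)

def erf_curve(t, p, alpha, beta):
    """Cumulative deaths modeled as a scaled, shifted error function."""
    return p * (1 + erf(alpha * (t - beta))) / 2

# Fit the three parameters: p (total deaths), alpha (steepness), beta (peak day)
params, _ = curve_fit(erf_curve, days, observed, p0=[observed.max(), 0.1, days.mean()])
p, alpha, beta = params
print(f"projected total deaths ~ {p:.0f}, deaths-per-day peak around day {beta:.0f}")
```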
The model has come under criticism for its purely statistical nature: it ignores the manner in which the infection is transmitted and the cultural factors that affect transmission, and it does not properly account for how policy decisions impact the spread. Detractors have also noted that its numbers, while fairly accurate for New York State, are very inaccurate for the rest of the country.
Another prominent statistical model is the Swiss Data Science Center (SDSC) model, which, unlike the IHME model, forecasts the number of cases for nearly every country in the world. It uses case counts from the European Centre for Disease Prevention and Control.
The IHME Model
The model calculates the growth rate of cumulative cases between the current date and two days prior. If the growth rate is greater than 5%, an exponential model is used to forecast the cumulative number of cases; if it is less than 5%, a linear model is used instead.
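A minimal sketch of this decision rule, assuming a simple per-day growth-rate calculation and hypothetical case counts (the SDSC implementation details may differ):

```python
import numpy as np

def forecast_cases(cumulative, horizon=7, threshold=0.05):
    """Toy version of the decision rule: compare today's cumulative count with the
    count two days earlier; if the implied per-day growth rate exceeds 5%,
    extrapolate exponentially, otherwise linearly."""
    today, two_days_ago = cumulative[-1], cumulative[-3]
    growth_rate = (today / two_days_ago) ** 0.5 - 1      # per-day rate over the 2-day window
    future_days = np.arange(1, horizon + 1)
    if growth_rate > threshold:
        return today * (1 + growth_rate) ** future_days  # exponential forecast
    daily_increase = (today - two_days_ago) / 2
    return today + daily_increase * future_days          # linear forecast

# Hypothetical cumulative case counts for the last five days
print(forecast_cases(np.array([900, 950, 1000, 1070, 1150]), horizon=3))
```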
Some other prominent statistical models come from the Los Alamos National Laboratory and UT Austin.
Trend #2 – Some Models Aren’t Purely Statistical – But Don’t Involve AI
An alternative to purely statistical models is mechanistic or diffusion models, which attempt to simulate the transmission of the virus. An important parameter in these models is R0, the basic reproduction number, i.e. the average number of people an infected person goes on to infect. These models are built on assumptions drawn from epidemiology, the cultural practices of a region and the effects of different laws and guidelines. These effects primarily manifest as changes in the rate of transmission, and as such the models can produce a range of results that help policy makers understand the impact of different restrictions.
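To make the idea concrete, here is a bare-bones SIR-style simulation, a common form of mechanistic model rather than any specific group’s implementation, showing how lowering R0 changes the projected peak; all parameters are illustrative.

```python
def simulate_sir(population, r0, infectious_days=10, initial_infected=10, days=180):
    """Minimal SIR simulation: R0 and the infectious period determine the
    transmission rate (beta) and recovery rate (gamma). All numbers are illustrative."""
    gamma = 1 / infectious_days
    beta = r0 * gamma
    s, i, r = population - initial_infected, float(initial_infected), 0.0
    active = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        active.append(i)
    return active

# Compare an unmitigated scenario with one where distancing lowers R0
peak_unmitigated = max(simulate_sir(1_000_000, r0=2.5))
peak_distancing = max(simulate_sir(1_000_000, r0=1.3))
print(f"peak active infections: {peak_unmitigated:,.0f} vs {peak_distancing:,.0f}")
```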
The best known of these models comes from Imperial College London, which made headlines for predicting an estimated 2.2 million deaths from COVID-19 in the United States, a stark contrast to the much more optimistic IHME model. However, this figure came from a scenario that assumed no social distancing measures would be implemented, and thus used a high rate of transmission. The models also included lower figures for more optimistic scenarios with more social distancing measures in place and a larger hospital capacity. The model’s main drawback was that its transmission component was originally built for influenza, even though COVID-19 is not directly comparable to influenza.
There is an ongoing effort to move to diffusion models to estimate the spread of COVID-19, since they take biological and cultural factors into account, which should in theory make for more accurate predictions. Work in this area is expected to progress slowly owing to the computational complexity of graph analytics and its relative inaccessibility compared to the methods employed by statistical models. Some other prominent models of this type come from Columbia University, MIT (a mechanistic model) and Northeastern University.
Visualization of R0 (Source: Triplebyte)
Trend #3 – Most Models Aren’t Very Accurate
As seen with the Imperial College London and especially the IHME models, the predictions these models make are often very inaccurate. In the case of the ICL model, this is because the model was not trying to pinpoint a single number but to forecast a wide range of scenarios. Accuracy matters more for the IHME model, however, since policymakers in the United States refer to it when drawing up social distancing and other related guidelines.
The main reason for inaccurate predictions in statistical models is the high number of unknowns. The data used in these models is often aggregated from multiple sources and then collated and standardized by an organization such as Johns Hopkins or the Centers for Disease Control and Prevention. This means that if, for example, 70% of a country’s regions record the gender of COVID-19 patients, the aggregation must either make assumptions about the remaining 30% or not use gender as a factor at all, which presumably adds to the margin of error. Additionally, for models that make predictions over a longer period of time, for example four weeks, the effects of social distancing measures may result in lower-than-predicted numbers. Alternatively, if the effects of social distancing measures are accounted for but in practice are not followed effectively, the actual numbers may come in higher than predicted.
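As a toy illustration of the aggregation tradeoff described above (not based on any real dataset), the sketch below either drops the partially reported gender field or fills it in under an explicit assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical regional case counts; two of four regions do not report gender
regions = pd.DataFrame({
    "region": ["A", "B", "C", "D"],
    "cases": [500, 300, 200, 400],
    "female_cases": [260, 150, np.nan, np.nan],
})

# Option 1: ignore gender and aggregate only the fields every region reports
total_cases = regions["cases"].sum()

# Option 2: keep gender, assuming unreported regions follow the reported share
reported = regions.dropna(subset=["female_cases"])
assumed_share = reported["female_cases"].sum() / reported["cases"].sum()
estimated_female = regions["female_cases"].fillna(regions["cases"] * assumed_share).sum()

print(total_cases, round(assumed_share, 2), int(estimated_female))
```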
Diffusion models have the same problems as statistical models, combined with additional data- and assumption-related problems due to their increased complexity. The biology of the virus is still being researched, so models of its transmission may be based on incorrect or outdated information. Diffusion models that also consider the social makeup and cultural practices of a group when calculating spread (e.g. grandparents in Country A being more likely to spend time with grandchildren than those in Country B) rest on even more assumptions. In theory, these models still provide a more comprehensive picture than statistical models, but they also come with a higher margin of error.
Trend #4 – Machine Learning Is Not Being Used At Higher Levels
There are almost no influential machine learning-based models being used to inform policy decisions or in media reporting of the pandemic. The only AI model listed in the COVID-19 Forecast Hub, an aggregator of verified COVID-19 forecast models, comes from MIT; it uses data from the CDC to formulate a diffusion model that initially assumes an uninfected population and then tracks the spread of infection through a transmission rate. Unlike other models, it does not attempt to fit a curve to preexisting data but rather models the spread of infection through a community based on infection parameters. The model claims to be able to accurately predict mortality rates as well as pinpoint exactly when the effects of social distancing are felt.
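To illustrate the general idea of combining a mechanistic model with data-driven parameter estimation (a generic sketch, not the MIT group’s method), the code below learns a transmission rate by minimizing the error between a simulated and an observed case trajectory; the data and the assumed recovery rate are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical daily active-case counts (illustration only)
observed = np.array([10, 14, 20, 27, 36, 48, 63, 82, 105, 133])
POPULATION = 1_000_000
GAMMA = 0.1  # assumed recovery rate

def simulate_active_cases(beta, days):
    """Forward-simulate a bare-bones SIR model and return active cases per day."""
    s, i = POPULATION - observed[0], float(observed[0])
    trajectory = []
    for _ in range(days):
        trajectory.append(i)
        new_infections = beta * s * i / POPULATION
        s -= new_infections
        i += new_infections - GAMMA * i
    return np.array(trajectory)

def loss(beta):
    """Squared error between the simulated and the observed trajectory."""
    return np.sum((simulate_active_cases(beta, len(observed)) - observed) ** 2)

fit = minimize_scalar(loss, bounds=(0.01, 1.0), method="bounded")
print(f"learned transmission rate ~ {fit.x:.2f}, implied R0 ~ {fit.x / GAMMA:.1f}")
```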
However, this model too has the same accuracy problems as the other types of models. For example, it predicted that the total number of cases in India on the 28th of April would range from 99,000 to 288,000, while the actual official count as of that date was less than 30,000.
Discussion
How Much Does It Matter That Models Aren’t Precise?
While it is true that most models are not especially accurate, it can be argued that models, especially those based on the transmission of the virus, exist to present a range of possibilities depending on factors such as the rules being implemented and increases in the number of available hospital beds and ventilators. In that case, these models help policy makers understand the scale of measures that need to be taken, the amount of money required to fund them, and so on. The aim of these models is to provide decision-makers with information that enables them to take actions that prevent the worst possible outcomes.
However, while this argument holds for transmission-based models, it is not as strong for statistical models. Statistical models by definition project past and current trends into the future to make predictions within a margin of error. When the numbers in reality fall significantly outside that margin of error, the model has failed. These inaccuracies could be attributed to the fast-changing landscape of the pandemic, where even a few days of action or inaction can alter the course of the outbreak in a region, but the inherent nature of statistical models ignores these external factors. This is the main reason for the push away from statistical models, as well as calls from experts to disregard or at least be skeptical of forecasts.
Why Aren’t There More Machine Learning Based Models?
The Brookings Institution, a US-based research organization, recently published a report urging skepticism towards Artificial Intelligence-based solutions for problems posed by the COVID-19 crisis. Some of its points also help explain the lack of machine learning-based prediction models in the public eye. The report points to a lack of coordination between the designers of machine learning models and Subject Matter Experts (SMEs), in this case epidemiologists. For a machine learning model to be accurate as well as explainable, it needs a solid grounding in the theory of how the disease spreads. Additionally, there have been calls to consult other SMEs, such as anthropologists, to understand how different cultures would react to the outbreak.
Another point raised is that good models ideally require a large quantity of training data, and while most government agencies around the world are making this kind of data available, there isn’t enough of it yet, leaving aside the problems of standardizing it. An outbreak of this scale, at a time of such great interconnectedness, has never occurred before, leaving analysts with no precedents to work with. Models trained on this relatively limited data can therefore achieve high performance only in specific scenarios; using them in the ‘real world’ often leads to lower performance.
It is also important to note that accuracy is not the only metric by which to evaluate models. In a high-stakes situation such as a pandemic, false positives and false negatives can have devastating consequences, so metrics that capture them become paramount.
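A generic example (not tied to any model discussed here) of why accuracy alone can mislead in an imbalanced, high-stakes setting:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening labels: 8 infected people out of 100 (1 = infected)
y_true = [1] * 8 + [0] * 92
y_pred = [0] * 100  # a model that simply predicts "not infected" for everyone

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.92 -- looks strong
print(f"recall:   {recall_score(y_true, y_pred):.2f}")    # 0.00 -- misses every real case
```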
Should There Be More Machine Learning-Based Models?
Given the case against using machine learning-based models, it can be tempting to rule out AI-based prediction models for now. However, the points of skepticism regarding AI also contain suggestions for moving forward. With greater coordination with SMEs, models can be built on a better theoretical grounding and, unlike statistical models, can iteratively improve based on evaluation metrics. As the outbreak progresses, more data will become available and a clearer picture of the outbreak will emerge, helping models become more accurate and more useful in predicting ‘second-wave’ outbreaks such as the recent one in Singapore.
Although it is beyond the scope of this piece, attention must be drawn to the other applications of AI to problems raised by the pandemic, which may well be better suited to these techniques. These include applications in the medical sector, such as diagnosing COVID-19 patients from X-rays and 3-D printing drugs based on AI-assisted research, but also simpler, more utilitarian tools, such as one that summarizes the large number of research papers written during the outbreak. The same arguments for skepticism arise here, along with concerns about biases in the data, privacy and the need for human input, but there is a general wave of optimism surrounding these uses of Artificial Intelligence in the fight against COVID-19.
Conclusion
Predicting the exact numbers of an outbreak is not an easy undertaking, and the novelty of the COVID-19 pandemic, the lack of historical data and the large number of unknowns at the moment lead to a great deal of uncertainty in every prediction model, regardless of its type. Statistical models have seen a great deal of use because they are relatively easy to set up and understand, but they do not take epidemiological and socio-cultural factors into account and simply extrapolate current trends into the future, thus assuming that those trends will hold.
Mechanistic or transmission-based models take epidemiologists’ knowledge into account and use a parameter, R0, to signify the number of people an infected person may pass the virus on to. They are used to visualize a range of scenarios based on R0, which is affected by the implementation (or lack) of social distancing measures, but the downside is that they require a great many assumptions and operate in an area with a great deal of unknowns.
Despite the skepticism surrounding AI models, the general consensus seems to be that AI will come to play a steadily larger part in providing solutions and forecasts for the outbreak, but it needs more standardized data and greater coordination with Subject Matter Experts. Additionally, concerns about biases in data and data privacy must be addressed.