Coronavirus Predictions and COVID-19 Infections

According to the Centers for Disease Control and Prevention, the prevalence of coronavirus infections is more than ten times higher than the official number of cases in six regions of the United States. Prevalence here means ‘the population share affected by a medical condition such as a coronavirus at a specific time.’ Coronavirus predictions often involve prevalence estimations because, when the prevalence reaches 60%, herd immunity is achieved, or so we hope, assuming that immunity lasts and that prevalence is well calculated. Recent research studies from Spain and England, however, suggest that immunity fades away.

Herd immunity and silent infection: friends or foes? 

There have been more than 10.7 million diagnosed cases worldwide, including over 516 thousand deaths, since October 2019. The figures published by the Centers for Disease Control and Prevention (C.D.C.) suggest that these official counts might be well short of the real number of people infected.

Does this matter in a world where we seek herd immunity? Yes, it does, because of silent infection: the number of people unknowingly carrying the virus seems to be larger than we initially imagined. Each of these unknowingly infected persons is likely to infect others merely by going out and about. However, even taking these additional infections into account, the actual numbers are well beneath the 60% threshold. The C.D.C. tested blood samples collected at commercial laboratories from people who came in for standard screenings, such as diabetes tests. It analyzed these specimens for antibodies to the virus, which would show previous infection even in the absence of symptoms. The gap between officially recorded infections and actual prevalence in this study was widest in Missouri, where 2.65 percent of the population had been infected with the virus as of April 26, according to the C.D.C. study. While many of these people might not have had symptoms, this estimate is approximately 24 times the official count: nearly 162,000 infections compared with the 6,800 officially reported cases. Still, 2.65 percent is considerably beneath 60%. In other words, we are in an unfavorable position: infections advance aggressively because many people are silently infected and undetected, yet we are far from the herd immunity threshold. We might have to wait for a vaccine.

As research progresses, we realize that when and where prevalence is measured plays a significant role in the resulting values. For example, New York City reported 53,803 cases by April 1. The actual number of infections was probably about twelve times higher, nearly 642,000, according to the C.D.C. study. However, the 6.93 percent prevalence estimated in that study for New York City is well under the 21 percent found by the state’s survey in April. That survey studied people recruited at supermarkets, so the sample might have been biased: it contained exclusively people who were out shopping during a pandemic, presumably young people or those who had already had the virus and felt safe.
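The two comparisons above come down to one division each. The following is a minimal sketch of that arithmetic using only the figures quoted in the text; the function name `underreporting_factor` and the dictionary layout are illustrative choices, not part of the C.D.C. study.

```python
def underreporting_factor(estimated_infections: float, reported_cases: float) -> float:
    """Ratio of estimated infections to officially reported cases."""
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    return estimated_infections / reported_cases


# (estimated infections, officially reported cases), as quoted in the text.
# Both estimates are rounded figures, so the ratios are approximate.
regions = {
    "Missouri (April 26)": (162_000, 6_800),
    "New York City (April 1)": (642_000, 53_803),
}

factors = {
    name: underreporting_factor(estimated, reported)
    for name, (estimated, reported) in regions.items()
}

for name, factor in factors.items():
    print(f"{name}: about {factor:.1f} times the official count")
```

With these inputs the ratios come out at roughly 23.8 for Missouri and 11.9 for New York City, which matches the "approximately 24 times" and "twelve times" quoted above.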

Are some people living targets for COVID-19 infections?

In this context, vulnerability and risk factors for COVID-19 infection are fundamental research issues that can help build more reliable coronavirus predictions and prevalence estimations. As a team of data scientists, we have considered two ways to understand who might be more prone to infection. The first is purely statistical, while the second uses a Machine Learning tool fine-tuned to perform well on small amounts of data. It is not a big data tool (i.e., one built for millions of records) but rather a small data one, hence well suited to medical topics. The database we have used is made publicly available by the Mexican government in accordance with its Open Data regulation. It is used in numerous research studies and papers, including an article by the medical researcher Omar Yaxmehen Bello-Chavolla and his team; he also kindly reviewed the present article. The records in this database have been gathered from Mexican hospitals, which, once more, introduces a bias. Just as recruiting participants at a supermarket biases the sample, so does drawing on the population that checks in at a hospital.

Furthermore, the bias is stronger still when considering patients with COVID-19 symptoms. We have therefore not investigated prevalence per se. Instead, we have concentrated on the relationship between pre-existing conditions and COVID-19 infection to contribute to more accurate coronavirus predictions. The pre-existing conditions reported in the database include asthma, immunosuppression, diabetes, hypertension, cardiovascular disease, obesity, smoking, and chronic renal disease.

Looking at the statistics of this database, we see that the proportions of female and male subjects are approximately the same: 48.75% versus 51.25%. For every five-year age range (e.g., between 24 and 29), around 10% of the records fall in the interval. The younger and older patients are spread differently: about 10% of patients are aged 0 to 24, and 10% are aged 65 to 120. To get started, we have looked up the proportion of people infected with COVID-19 in the overall database, bearing in mind that this database came from hospitals. The result was 34%. Next, we have looked at the proportion of people infected among those with specific comorbidities. We are focusing on those characteristics that induce a difference in infection rate of at least 10% relative to the 34% average reference figure. The factors which increase infection (compared to the average) are:

  • being a man, 
  • being older than 46, 
  • diabetes, 
  • hypertension, 
  • obesity. 

38% of men are infected, as are more than 40% of people aged over 46, 47% of people with diabetes, 42% of people with hypertension, and 43% of obese people. Surprisingly enough, according to these modest preliminary ratios, the following criteria had little if any impact on the rate of infection: cardiovascular disease, smoking, and chronic kidney disease. Smoking being ‘neutral’ is all the more surprising given that several conflicting reports exist regarding the effects of smoking on the risk of coronavirus infection. Even more striking, when looking at the raw numbers, it appears that being pregnant, having asthma, and being immunosuppressed “protect” slightly from being infected, with respective rates of infection of 25%, 23%, and 26%. Or it might only be that people with these conditions are more cautious.
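The screening rule described above can be sketched in a few lines. This is an illustration only: it assumes the 10% difference is measured relative to the 34% baseline (the reading under which 38% for men counts as an increase), and the names `BASELINE`, `relative_difference`, and `classify` are hypothetical. The rate for people over 46 is given as "more than 40%", so 0.40 is used as a lower bound.

```python
BASELINE = 0.34  # share of positive records in the whole database

# Infection rates quoted in the text. Factors described only as having
# little impact (cardiovascular disease, smoking, chronic kidney disease)
# have no published rate here and are therefore left out.
rates = {
    "male": 0.38,
    "older than 46": 0.40,  # lower bound: the text says "more than 40%"
    "diabetes": 0.47,
    "hypertension": 0.42,
    "obesity": 0.43,
    "pregnancy": 0.25,
    "asthma": 0.23,
    "immunosuppression": 0.26,
}


def relative_difference(rate: float, baseline: float = BASELINE) -> float:
    """Difference between a group rate and the baseline, relative to the baseline."""
    return (rate - baseline) / baseline


def classify(rate: float, baseline: float = BASELINE, threshold: float = 0.10) -> str:
    """Label a group rate as 'higher', 'lower', or 'neutral' versus the baseline.

    Assumption: the threshold is a relative difference, not percentage points.
    """
    diff = relative_difference(rate, baseline)
    if diff >= threshold:
        return "higher"
    if diff <= -threshold:
        return "lower"
    return "neutral"


labels = {factor: classify(rate) for factor, rate in rates.items()}

for factor, label in labels.items():
    print(f"{factor}: {rates[factor]:.0%} ({relative_difference(rates[factor]):+.0%} vs baseline) -> {label}")
```

Under this reading, any rate between roughly 30.6% and 37.4% would count as neutral. These are raw ratios, so a "higher" or "lower" label says nothing about causation or about confounding between factors such as age and diabetes.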

Machine Learning for Coronavirus predictions of infections

Now, what does Machine Learning have to say about these data? Can it help predict coronavirus infections and prevalence based on this information?

We have fed the database mentioned above to our Machine Learning tool and have run it a hundred times to get a hundred models. For each model, our tool has selected seven criteria to predict coronavirus infection. The following results are what we get when we average the one hundred models. First, we get an accuracy of 60%, which means that in 60% of cases, the prediction of infection was right based on the available data. This seems a rather low number, but it is not unexpected since the database does not include symptoms. In theory, it should not permit any prediction of infection at all. Among the 100 models generated, 93 use age as a discriminant, 52 gender, 35 obesity, 22 diabetes, 21 immunosuppression, and 18 smoking. We know which criteria were employed, but we do not know whether they were used to include or exclude subjects. We can assume from the interpretation of the statistics that age, gender, obesity, and diabetes were used to predict an infection. In contrast, immunosuppression was probably used to help predict a non-infection. Smoking comes as a surprise, as it did not appear to influence the global statistics.
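The averaging step described above can be sketched as follows. The output format of the Machine Learning tool is not described in this article, so the `runs` structure, its field names, and the four toy runs below are hypothetical and made up purely for illustration; they are not results from the study.

```python
from collections import Counter
from statistics import mean

# Hypothetical output of four runs (the study used one hundred).
# Accuracies and criteria are invented to show the mechanics only.
runs = [
    {"accuracy": 0.61, "criteria": ["age", "gender", "obesity"]},
    {"accuracy": 0.59, "criteria": ["age", "diabetes"]},
    {"accuracy": 0.60, "criteria": ["age", "gender", "smoking"]},
    {"accuracy": 0.60, "criteria": ["gender", "immunosuppression"]},
]


def summarize(runs):
    """Return (mean accuracy, number of models using each criterion)."""
    # set() guards against a criterion being listed twice within one model.
    usage = Counter(c for run in runs for c in set(run["criteria"]))
    return mean(run["accuracy"] for run in runs), usage


avg_accuracy, usage = summarize(runs)

print(f"average accuracy: {avg_accuracy:.2f}")
for criterion, count in usage.most_common():
    print(f"{criterion}: used by {count} of {len(runs)} models")
```

A count of this kind only says how often a criterion appears in a model. As noted above, it does not reveal whether the criterion pushed the prediction toward infection or away from it.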

Next, we wanted to take the analysis a step further, so we decided to concentrate on younger people (aged under 45) without obesity. We looked up the raw statistics: 26% of the nonobese people under 45 in this database were COVID-19 positive. In this case, the aggravating factors were age, gender, diabetes, hypertension, and kidney disease, with respective rates of infection of 40%, 29%, 41%, 33%, and 30%. The “protecting” factors were pregnancy, asthma, and immunosuppression, with infection rates of 22%, 20%, and 16%. Smoking was neutral here again.

Next, we produced 100 models again, with a comparable accuracy of 59%, and we looked up the factors used by the Machine Learning tool to construct the models. All 100 models use age, even though we focus entirely on people under 45. 32 models use smoking, 27 use diabetes, and 27 use asthma. 26 use kidney disease, and 23 use hypertension. These results are exciting. Two of the most influential factors for predicting infection here, namely smoking and kidney disease, seemed to be neutral on the overall database. Smoking was also neutral on this particular subset of the database, both statistically and from the perspective of Machine Learning. Yet both have a substantial impact when concentrating on nonobese younger people. Smoking is a great surprise because, statistically, it looks “neutral,” yet combined with the additional factors, it is a recurring determinant.
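To make the shift between the two experiments easier to see, the model counts quoted in the text can be placed side by side. This is a minimal sketch; the dictionary names and the `usage_shift` helper are illustrative. A criterion missing from one list was simply not reported for that run, which is not the same as being used by zero models.

```python
# Number of models (out of 100) using each criterion, as quoted in the text.
full_database = {
    "age": 93, "gender": 52, "obesity": 35,
    "diabetes": 22, "immunosuppression": 21, "smoking": 18,
}
young_nonobese = {
    "age": 100, "smoking": 32, "diabetes": 27,
    "asthma": 27, "kidney disease": 26, "hypertension": 23,
}


def usage_shift(before: dict, after: dict) -> dict:
    """Change in model count for the criteria reported in both runs."""
    shared = before.keys() & after.keys()
    return {criterion: after[criterion] - before[criterion] for criterion in sorted(shared)}


shift = usage_shift(full_database, young_nonobese)
only_in_subset = sorted(young_nonobese.keys() - full_database.keys())

for criterion, delta in shift.items():
    print(f"{criterion}: {delta:+d} models")
print("reported only for the under-45 nonobese subset:", ", ".join(only_in_subset))
```

Among the criteria reported for both runs, smoking shows the largest increase (from 18 to 32 models), while asthma, hypertension, and kidney disease appear only in the list for the subset.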

Takeaway: what’s next?

In this article, we used both statistics and a Machine Learning tool to explore factors that influence vulnerability to COVID-19 and to contribute insights into making coronavirus predictions. As the COVID-19 pandemic is still progressing, identifying prognostic factors remains a global challenge. In this study, we have come across the ‘usual suspects’ in terms of factors: age and gender. It also appears here that diabetes plays a significant part in the risk of being infected (and not only in the risk of experiencing severe complications once infected). The factors associated with infection also include smoking (presumably as a shield). This finding is extremely controversial, since smokers are generally more susceptible to infectious respiratory diseases and are at higher risk of developing severe complications from these infections. This study is a first step in investigating this Mexican database using both a statistical and a Machine Learning approach. We are very thankful to the Mexican authorities for making such precious information public. We will provide follow-up studies on the same topic. Stay tuned!
