Here is how Data Science can help in the prediction of pandemics such as COVID-19. The main objective of the event was to answer the question of how this discipline can be used in different areas to visualize, predict and solve problems, but also to make clear to those interested that the barrier to entry to this technology is minimal and that people with very diverse profiles can take it up.
What most people find really attractive is learning the keys that Data Science uses to face a situation as delicate as the current one. It inevitably generates great interest, and proof of this is the many initiatives being carried out to combat this health crisis.
On Kaggle, one of the world’s leading communities for data scientists, more than 44,000 professional research papers related to the coronavirus have been published. The amateur community has also contributed, publishing more than 5,000 notebooks. And in just a few weeks, the figures have only increased.
Visualize The Problem (Data Analysis)
Top-tier institutions, universities and companies have created dozens of dashboards that offer visual representations of data on the current state of the pandemic. Users and communities whose only interest is to participate and contribute have also been able to create these useful tools.
This is only a small sample of what, in Data Science, is known as data analysis. To understand a problem, we must first be able to visualize it. Only in this way will we know its magnitude. This is why dashboards are so useful, and why the visual representation of data becomes essential.
Furthermore, the apparent complexity of these representations does not hide a discipline suitable only for a few. Anyone with minimal knowledge and a lot of interest can create their own. A few small blocks of code written in Python, one of the most popular programming languages today and the one preferred by a good part of the data science community, are enough to create a dashboard that helps to understand the dimension of a problem.
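As a rough illustration of how little code a first visualization needs, here is a minimal sketch that uses only the Python standard library. The case counts are made-up numbers chosen for the example, and the function names `daily_new_cases` and `text_bar_chart` are hypothetical helpers. A real dashboard would normally rely on a plotting library and real data.

```python
# Hypothetical cumulative case counts for one week (made-up numbers, illustration only).
cumulative_cases = [10, 15, 24, 40, 61, 90, 130]


def daily_new_cases(cumulative):
    """Turn a cumulative series into day-by-day new cases."""
    return [cumulative[0]] + [
        today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])
    ]


def text_bar_chart(values, width=40):
    """Return one text bar per value, scaled so the largest value fills `width` characters."""
    largest = max(values) or 1  # avoid dividing by zero when every value is 0
    lines = []
    for day, value in enumerate(values, start=1):
        bar = "#" * round(value / largest * width)
        lines.append(f"day {day:>2} | {bar} {value}")
    return lines


new_cases = daily_new_cases(cumulative_cases)
for line in text_bar_chart(new_cases):
    print(line)
```

Even this crude chart makes the acceleration in daily cases visible at a glance, which is the point of a dashboard.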
Predict Its Evolution (Forecasting)
Once we see and understand what is going on, it is time to ‘predict’ what will happen next. This is the forecasting phase. Continuing with the case of COVID-19, one might think that predicting its evolution requires complex techniques such as Machine Learning, Deep Learning, neural networks or other complex algorithms.
The reality is quite different. Most of the prediction models that we can find, which in many cases are the same ones that government administrations have used to try to get ahead of the virus, correspond to the well-known “SIR” model. It is a mathematical model for contagious diseases created at the beginning of the 20th century by Ronald Ross and William Hamer, and it is part of a set of models known as compartmental models.
SIR, in its simplest form, classifies the population into three distinct groups: those susceptible to contracting the infection, those who are already infected, and those who have recovered. This last group encompasses both those who survive the disease and those who die from it.
Once the classification is made, the first step is to define the probabilities of passing from one state to another. If, for example, a subject is in the infected block on a certain day, they will move to the recovered block depending on how long the disease lasts. If it lasts 14 days, we would say that each infected person has, every day, a 1/14 chance of recovering.
To know the probability of going from the susceptible block to the infected block, we should take into account three factors:
- The probability of meeting someone
- The probability that that person is infected
- The probability of that person infecting another
The probability that a person is infected depends on the number of infected people relative to the total population. The probability of encountering someone and the probability that, if that person is ill, they spread the disease are two different factors that, mathematically, are modelled as one, commonly written as a single transmission rate.
Thus, we can create a system of differential equations that describes how the population moves from one state to another. As we have said, many of the models and their variations are based on the SIR model.
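The system described above can be sketched in a few lines of Python. This is a minimal sketch, not the model used by any administration. It assumes the standard SIR equations, a single transmission rate `beta` standing in for the combined contact and contagion factors, a recovery rate `gamma` of 1/14 taken from the 14-day example, and simple one-day Euler steps. The function name `simulate_sir`, the population size and the value of `beta` are illustrative assumptions.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Integrate the SIR equations with one-day Euler steps.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        # Susceptible people who meet an infected person and catch the disease today.
        new_infections = beta * s * i / population
        # Infected people who leave the infected block today.
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Assumed values: gamma = 1/14 follows the 14-day example; beta and population are illustrative.
history = simulate_sir(
    population=1_000_000, initial_infected=10, beta=0.3, gamma=1 / 14, days=200
)
peak_day = max(range(len(history)), key=lambda day: history[day][1])
print(f"Infections peak on day {peak_day} with about {history[peak_day][1]:,.0f} people infected")
```

Changing `beta` (for example, to mimic fewer contacts) shifts and flattens the peak, which is why different assumptions lead to different forecasts.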
As a current and close example, we have the one created by Myriam Barnés, a data scientist at K-Fund. Using a model similar to SIR, but also taking hospital needs into account, it allows us to predict the peak of hospitalizations in different countries of the world. You can check the tool at this link.
Another example is the one created by Gabriel Goh, Machine Learning Researcher at OpenAI, which is based on the SEIR model and also allows us to predict the peak.
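For readers unfamiliar with SEIR, it extends SIR with an "exposed" block for people who have caught the disease but are not yet infectious. The sketch below shows only the textbook form of that idea, not the implementation behind either of the tools mentioned above. The function name `simulate_seir`, the incubation rate `sigma` of 1/5 and the other parameter values are assumptions chosen for illustration.

```python
def simulate_seir(population, initial_exposed, beta, sigma, gamma, days):
    """Integrate the textbook SEIR equations with one-day Euler steps.

    dS/dt = -beta * S * I / N
    dE/dt =  beta * S * I / N - sigma * E
    dI/dt =  sigma * E - gamma * I
    dR/dt =  gamma * I
    """
    s = float(population - initial_exposed)
    e = float(initial_exposed)
    i = 0.0
    r = 0.0
    history = [(s, e, i, r)]
    for _ in range(days):
        new_exposed = beta * s * i / population  # susceptible people who catch the disease
        new_infectious = sigma * e               # exposed people who become infectious
        new_recovered = gamma * i                # infectious people who leave that block
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative assumptions: 5-day incubation (sigma = 1/5), 14-day illness (gamma = 1/14).
seir_history = simulate_seir(
    population=1_000_000, initial_exposed=10, beta=0.3, sigma=1 / 5, gamma=1 / 14, days=300
)
seir_peak_day = max(range(len(seir_history)), key=lambda day: seir_history[day][2])
print(f"Infectious cases peak on day {seir_peak_day}")
```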
Either way, this is the most complex part. Different models based on different variables and data will yield different predictions. As George Box said, “all models are wrong, but some are useful.”
Solve The Problem
Once we have managed to make certain predictions, can Data Science contribute anything to solve the problem? Of course. In the webinar, Mario López highlighted some of the initiatives that, based on this technology, help decisively in the fight against the pandemic.
On Kaggle there is a challenge that tries to answer 10 key questions about the coronavirus. To do this, it uses the largest machine-readable dataset developed so far: the more than 44,000 research papers and the more than 5,000 notebooks published in the Kaggle community itself. Obviously, if a researcher needs information, no matter how structured and organized it is, it will be difficult to find it among such a large number of resources. Thanks to natural language processing systems, the hope is to find the research work that gives an adequate answer to each question.
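To give a feel for how text can be matched to a question, here is a minimal sketch of one classic technique, TF-IDF ranking, written with the Python standard library only. It is not the method used in the Kaggle challenge, which the source does not describe. The three document titles and the question are hypothetical examples, and `tokenize` and `rank_documents` are illustrative helper names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_documents(question, documents):
    """Return document indices sorted from most to least relevant to the question."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_count = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    question_words = set(tokenize(question))

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for word in question_words:
            if counts[word]:
                tf = counts[word] / len(tokens)  # how prominent the word is in this document
                idf = math.log(doc_count / doc_freq[word]) + 1.0  # rarer words weigh more
                total += tf * idf
        return total

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(doc_count), key=lambda index: scores[index], reverse=True)


# Hypothetical titles, invented for this example.
documents = [
    "Incubation period estimates for the novel coronavirus",
    "Effect of school closures on influenza transmission",
    "Survival of the virus on surfaces and in the air",
]
question = "What is the incubation period of the virus?"
ranking = rank_documents(question, documents)
print("Best match:", documents[ranking[0]])
```

Production systems typically go well beyond word counts, but the principle is the same: turn text into numbers and rank documents by how closely they match the question.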
Another recurring application at this time is the use of Computer Vision to help diagnose patients with coronavirus. With access to a set of medical images, such as scans and X-rays of patients with and without coronavirus, models learn patterns that serve to distinguish people who have developed symptoms caused by the virus from those whose symptoms have another cause.
While it may still take us a long time to find an effective remedy for COVID-19, there is a field of study using Machine Learning called Drug Repurposing, which looks for commercially available chemical compounds, such as those found in pharmacies and hospitals, that have the potential to be effective against the infection. The aim is for these compounds to bind to some of the coronavirus proteins, changing their structure and making the virus less effective.
For this, there are banks and databases in which both chemical compounds and proteins are registered with their encoded characteristics. Thanks to supervised and unsupervised learning algorithms, these characteristics can be cross-referenced to infer which combinations may be effective and which may not.
Data Science has emerged as one of the most decisive disciplines in the fight against pandemics. We are facing a technology whose learning curve is very gentle and which, far from requiring the intervention of excessively technical profiles, can be learned and exploited by professionals from very diverse areas. This specialization can be pursued to a lesser or greater depth. And the data supports us: the professionals who choose Data Science are the most in demand in the labour market.