- A cancer researcher has data about the genotype of 90 patients and wants to understand who might develop cancer.
- A marketing manager has gathered data on about 20,000 customers, of whom 100 have churned.
These are examples of the small datasets, with no known data distribution, that professionals actually have at hand. Does working with small datasets that lack a known data distribution rule out Machine Learning?
At MyDataModels, we have designed a Machine Learning Platform, TADA. It can be used by anybody and performs well on small data sets without a known data distribution. In this article, we’re going to share with you how we’re doing it.
We have no prior knowledge of the type of data used. Still, we believe that our users all deserve to have access to a powerful data analysis tool. That’s why the first pillar of our approach is to turn data into formulas, i.e., use symbolic regression.
We explore the space of possible mathematical formulas and look for the ones that best predict the output variable from the input variables. We begin from a set of base functions like addition, multiplication, trigonometric functions, and square root. Even though it is challenging to create symbolic models, they have some very beneficial features.
For starters, a symbolic model is explicit, making it understandable and providing insight into the data. It is also simple, given that the optimization process will actively try to keep the formulas as short as possible. From a technical point of view, a symbolic model is very portable. Anyone can quickly implement it in any programming language without the need for complex data structures. TADA provides the code for the formulas generated in C++, Java and Python.
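To make the idea of turning data into formulas concrete, here is a minimal sketch in Python. It is not TADA's algorithm: it simply enumerates every formula that can be built from a tiny, hypothetical set of base functions, scores each one on the data, and keeps the most accurate, preferring the shorter formula on ties. The names `UNARY`, `BINARY`, `build_formulas` and `best_formula`, as well as the toy data, are assumptions made for this illustration.

```python
import itertools
import math

# Hypothetical base functions for this sketch (inputs are assumed positive,
# so sqrt is always defined).
UNARY = {"sqrt": math.sqrt, "sin": math.sin}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def build_terms(n_vars):
    """Return (text, function, size) for each variable and each unary(variable)."""
    terms = []
    for i in range(n_vars):
        terms.append((f"x{i}", lambda row, i=i: row[i], 1))
        for name, fn in UNARY.items():
            terms.append((f"{name}(x{i})", lambda row, i=i, fn=fn: fn(row[i]), 2))
    return terms


def build_formulas(n_vars):
    """Return all single terms plus every binary combination of two terms."""
    terms = build_terms(n_vars)
    formulas = list(terms)
    pairs = itertools.combinations_with_replacement(terms, 2)
    for (text_a, fn_a, size_a), (text_b, fn_b, size_b) in pairs:
        for symbol, op in BINARY.items():
            formulas.append((
                f"({text_a} {symbol} {text_b})",
                lambda row, fa=fn_a, fb=fn_b, op=op: op(fa(row), fb(row)),
                size_a + size_b + 1,
            ))
    return formulas


def mean_squared_error(fn, rows, targets):
    return sum((fn(row) - t) ** 2 for row, t in zip(rows, targets)) / len(rows)


def best_formula(rows, targets):
    """Return (error, size, text) of the best formula; ties favor smaller size."""
    scored = [
        (mean_squared_error(fn, rows, targets), size, text)
        for text, fn, size in build_formulas(len(rows[0]))
    ]
    return min(scored)


# Toy data set: the hidden relationship is y = x0 * sqrt(x1).
ROWS = [(1.0, 4.0), (2.0, 9.0), (3.0, 1.0), (4.0, 16.0), (5.0, 2.0)]
TARGETS = [x0 * math.sqrt(x1) for x0, x1 in ROWS]

error, size, text = best_formula(ROWS, TARGETS)
print(f"best formula: {text} (error={error:.3g}, size={size})")
```

The result is a plain formula held in a string, which is what makes a symbolic model easy to read and to port. Exhaustive enumeration only works for very small formula spaces; anything realistic needs a smarter search strategy.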
Global Optimization Under Constraints
We still assume that we do not know the data distribution at hand, whether it is exponential or anything else. That is why we use the concepts of global optimization under constraints, our second pillar. Why?
Because the constraint we choose is a limit on the complexity of the formulas generated. The goal of global optimization is to reach the globally best model in the presence of multiple local optima. A constraint is a hard limit placed on a variable, which prevents the search from going on forever in specific directions.
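The two notions, local versus global optima and a hard limit on a variable, can be sketched on a one-dimensional toy problem. The code below is a generic illustration, not TADA's optimizer: the objective function, the bounds `LOWER` and `UPPER`, and the helper names are all assumptions chosen for the example. A single descent gets trapped in a local optimum, while restarting the same descent from several points inside the bounds finds the globally best solution.

```python
def objective(x):
    """Toy function with two minima inside [-2, 2]; only one of them is global."""
    return x ** 4 - 3 * x ** 2 + x


def gradient(x):
    return 4 * x ** 3 - 6 * x + 1


def projected_descent(start, lower, upper, step=0.01, iterations=2000):
    """Gradient descent that clips x back into [lower, upper] after every step."""
    x = start
    for _ in range(iterations):
        x = x - step * gradient(x)
        x = min(max(x, lower), upper)  # the constraint: a hard limit on x
    return x


def multi_start(lower, upper, n_starts=9):
    """Run the local search from evenly spaced starts and keep the best result."""
    starts = [lower + i * (upper - lower) / (n_starts - 1) for i in range(n_starts)]
    results = [projected_descent(s, lower, upper) for s in starts]
    return min(results, key=objective)


LOWER, UPPER = -2.0, 2.0

single = projected_descent(UPPER, LOWER, UPPER)  # one run, started at the upper bound
best = multi_start(LOWER, UPPER)                 # several runs, best one kept

print(f"single start: x={single:.3f}, f(x)={objective(single):.3f}")
print(f"multi start:  x={best:.3f}, f(x)={objective(best):.3f}")
```

Multi-start descent is only one simple way to look for a global optimum, picked here because it is short and deterministic.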
The third pillar is the explicit use of non-parametric approaches. Parametric approaches assume a known data distribution, for instance, a normal distribution of values, also known as a “bell-shaped curve.” For example, weight is roughly normally distributed. If you were to graph the weights of a population, you would observe a typical bell-shaped curve.
We use non-parametric approaches in cases where parametric methods, e.g., those that assume a Poisson distribution, are not appropriate. Non-parametric techniques can often be as powerful as parametric ones.
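As a small illustration of what "non-parametric" means in practice, the sketch below compares a rank-based statistic (Spearman correlation, which only uses the ordering of the values) with the ordinary Pearson correlation on heavily skewed data. This is a textbook technique chosen for the example, not a description of what TADA computes internally. The helper names and the toy data are assumptions, and the `ranks` function assumes there are no tied values.

```python
import math


def ranks(values):
    """Return 1-based ranks of the values (simplification: assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(xs, ys):
    """Ordinary linear correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Rank correlation: Pearson applied to ranks, so no distribution is assumed."""
    return pearson(ranks(xs), ranks(ys))


# Toy data: y grows exponentially with x, so it is far from bell-shaped.
XS = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0]
YS = [math.exp(x) for x in XS]

print(f"Pearson:  {pearson(XS, YS):.3f}")
print(f"Spearman: {spearman(XS, YS):.3f}")
```

Because the rank-based measure ignores the actual magnitudes, it detects the perfectly monotonic relationship even though the values are extremely skewed, whereas the linear measure is dominated by the largest point.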
When doing data science on data whose distribution is unknown, we can use evolutionary algorithms.
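For readers unfamiliar with evolutionary algorithms, here is a minimal sketch of the general principle: keep a population of candidate solutions, select the best ones, and create new candidates by randomly mutating them. It is a generic illustration and not TADA's implementation. To stay short, it evolves the two coefficients of a straight line rather than whole formulas, and every name and setting (`evolve`, `elite_size`, `sigma`, the toy data) is an assumption made for the example.

```python
import random


def mean_squared_error(params, xs, ys):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def evolve(xs, ys, generations=200, population_size=20, elite_size=5,
           sigma=0.3, seed=0):
    """Evolve (a, b) so that a * x + b fits the data; return (best, history)."""
    rng = random.Random(seed)  # seeded so that the run is reproducible

    def fitness(params):
        return mean_squared_error(params, xs, ys)

    # Start from random candidates: nothing is assumed about the data.
    population = [(rng.uniform(-5, 5), rng.uniform(-5, 5))
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        elite = population[:elite_size]  # selection: the best candidates survive
        children = []
        while len(elite) + len(children) < population_size:
            a, b = rng.choice(elite)
            # Mutation: a small random change to a surviving candidate.
            children.append((a + rng.gauss(0, sigma), b + rng.gauss(0, sigma)))
        population = elite + children
    best = min(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy data generated from y = 2x + 1.
XS = [float(x) for x in range(10)]
YS = [2.0 * x + 1.0 for x in XS]

best, history = evolve(XS, YS)
print(f"best candidate: a={best[0]:.2f}, b={best[1]:.2f}")
print(f"error went from {history[0]:.3f} to {history[-1]:.5f}")
```

Since the best candidates are always carried over to the next generation, the best error can never get worse from one generation to the next. The search only ever evaluates candidates against the data, which is why no assumption about the data distribution is needed.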
In a nutshell, how come TADA is so performant on small datasets?
It uses symbolic regression, combined with global optimization under constraints and a non-parametric approach to the data distribution, all under the umbrella of its evolutionary algorithms.