Mikhail Kamalov is a Ph.D. student at INRIA. We first interviewed him last November, when he published his first scientific paper, on semi-supervised learning for document classification.
Another of his articles has been accepted for publication this May at a prestigious conference, EUSIPCO. We want to learn more about this article and, in particular, about his novel use of evolutionary algorithms for classification.
Hi Mikhail, nice seeing you again. Congratulations on your paper being accepted for EUSIPCO.
Thanks. We’re happy about it.
Was it easier this time than for the first article to get a paper accepted?
I think that it is never easy to get a paper accepted. As scientists, we are humbled by the challenges we face. We had to rework it a couple of times based on the feedback we received. But at least I am getting better at explaining what I do. It is a great learning experience.
Is the topic “Evolutionary algorithm for classification”?
It’s more than that. It deals with evolutionary algorithms for classification, but not only that.
What does it mean? Does it deal with data analytics? Predictive analytics? Augmented analytics?
Indeed, the whole analytics vocabulary is very trendy, whether data analytics, augmented analytics, or predictive analytics. But as a scientist, these are use cases to me. These are examples of how what I do can be used.
Do you mean that you do not care about data analytics but only about evolutionary algorithms?
No, I care about the use cases for my research. But my contribution focuses on the algorithms; the use cases are applications of this work.
OK, what is your contribution about? Semi-supervised learning algorithms ?
In part, yes.
What is the other part? Genetic algorithms?
Yes. It is about both.
But they have nothing in common. You cannot use both, can you?
Yes, you can.
It is indeed possible to combine algorithms. The technique is called “stacking”. Model stacking is like Lego applied to AI models. There are “base” models (the Lego blocks), and there is a “superset” model (the final construct). The base models form the building blocks of the superset structure. At the ground level, there are one or more base models. Their outputs feed the next layer of models, building a “pyramid” or pipeline of models. The base models can be built from different AI algorithms.
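To make the idea concrete, here is a minimal Python sketch of stacking. It is an illustration only, not the model from the paper: the toy dataset, the class names CentroidModel and LookupMetaModel, and the helper functions are all assumptions made for this example. Two base models each look at a single feature, and a top-level model learns from their outputs. For brevity, the top-level model is trained on the base models' predictions for the same training data; a careful implementation would normally use held-out predictions instead.

```python
from collections import Counter, defaultdict


class CentroidModel:
    """Base model (toy): nearest class centroid using a single feature."""

    def __init__(self, feature):
        self.feature = feature
        self.centroids = {}

    def fit(self, X, y):
        sums = defaultdict(float)
        counts = Counter()
        for row, label in zip(X, y):
            sums[label] += row[self.feature]
            counts[label] += 1
        self.centroids = {c: sums[c] / counts[c] for c in counts}
        return self

    def predict(self, X):
        return [
            min(self.centroids,
                key=lambda c: abs(row[self.feature] - self.centroids[c]))
            for row in X
        ]


class LookupMetaModel:
    """Top-level model (toy): maps each tuple of base outputs to the
    label most often seen with it during training."""

    def fit(self, Z, y):
        votes = defaultdict(Counter)
        for z, label in zip(Z, y):
            votes[tuple(z)][label] += 1
        self.table = {z: c.most_common(1)[0][0] for z, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, Z):
        return [self.table.get(tuple(z), self.default) for z in Z]


def base_outputs(base_models, X):
    # One row per sample, one column per base model.
    return list(zip(*[m.predict(X) for m in base_models]))


def fit_stack(base_models, meta_model, X, y):
    for m in base_models:
        m.fit(X, y)
    meta_model.fit(base_outputs(base_models, X), y)


def predict_stack(base_models, meta_model, X):
    return meta_model.predict(base_outputs(base_models, X))


# Toy data: the label is 1 only when both features are 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

base_models = [CentroidModel(0), CentroidModel(1)]
meta_model = LookupMetaModel()
fit_stack(base_models, meta_model, X, y)

single_pred = base_models[0].predict(X)
stacked_pred = predict_stack(base_models, meta_model, X)
print("base model 0:", single_pred)
print("stacked     :", stacked_pred)
```

On this toy data each base model gets one of the four points wrong, while the stacked model gets all four right, because the top-level model can combine the two partial views.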
OK, I got it. You can pipeline together different models, i.e., algorithms. But what’s the point?
The point is to obtain a stacked model whose performance is higher than that of each individual model.
Do you mean that each time you stack two algorithms, the performance of the stacked algorithm is higher than the performance of the individual algorithms?
It is not systematic, no. Sometimes, the overall performance is higher, and sometimes it is not.
So what did you stack here?
We stacked a semi-supervised learning algorithm with a genetic algorithm.
Why would you do this?
Because our genetic algorithm, ZGP, needs fully labeled data as input. Each data point fed into ZGP for training has to be “tagged” as belonging to a category, i.e., a class. It is thanks to this fully labeled input data that ZGP “learns” the classification. That is one of the strengths of ZGP: it is good at sorting. It is an evolutionary algorithm for classification (at least in this context).
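For readers unfamiliar with evolutionary algorithms, here is a minimal, generic Python sketch of how such an algorithm can learn a classifier from fully labeled data. It is not ZGP, whose internals are not described in this interview. The one-dimensional toy data, the function names, and all parameter values are assumptions made for the example. A population of candidate thresholds is scored against the labels, the better half survives, and mutated copies replace the rest.

```python
import random


def accuracy(threshold, samples):
    """Fraction of samples for which 'value > threshold' matches the label."""
    hits = sum((value > threshold) == bool(label) for value, label in samples)
    return hits / len(samples)


def evolve_threshold(samples, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [rng.uniform(0.0, 10.0) for _ in range(pop_size)]
    history = []  # best accuracy at the start of each generation
    for _ in range(generations):
        # Selection: rank candidates by how well they fit the labels.
        population.sort(key=lambda t: accuracy(t, samples), reverse=True)
        history.append(accuracy(population[0], samples))
        survivors = population[: pop_size // 2]
        # Mutation: each survivor produces one slightly perturbed child.
        children = [t + rng.gauss(0.0, 0.5) for t in survivors]
        population = survivors + children
    best = max(population, key=lambda t: accuracy(t, samples))
    return best, history


# Fully labeled toy data: (measurement, class).
labeled = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
best, history = evolve_threshold(labeled)
print("accuracy of best threshold:", accuracy(best, labeled))
```

Note that the fitness function needs a label for every training point, which is why such an algorithm cannot be trained directly on mostly unlabeled data.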
And once it has learned, it can classify new data by itself?
OK, so why the stacking?
Because in real situations, we collect data, but it is cumbersome and time-consuming to label each data point manually.
Oh, you mean that the ‘classification’ is done manually?
Yes, it is a human being who labels, one by one, each data point in the data set.
Is it the same whether you do predictive analytics, data analytics, or augmented analytics?
Does it depend on the use case?
It is time-consuming to label data, whatever the data, whatever the use case.
OK, so that’s why the semi-supervised learning algorithm?
Yes, with a semi-supervised learning algorithm such as PRPCA, we can automatically label a whole dataset from just a few labeled inputs.
Yes, we try to combine the strengths of the semi-supervised classification algorithm PRPCA (the famous algorithm from Google) and the resilience of ZGP.
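The two-stage idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: a nearest-seed rule stands in for PRPCA, a nearest-centroid classifier stands in for ZGP, and the points and the label names "normal" and "imbalance" are made up for the example. Stage one spreads a handful of labels over the whole dataset; stage two trains a supervised classifier on the now fully labeled data.

```python
import math


def propagate_labels(points, seeds):
    """Stage 1 (stand-in for PRPCA): give each unlabeled point the label
    of its nearest labeled seed."""
    labels = []
    for p in points:
        nearest = min(seeds, key=lambda s: math.dist(p, s[0]))
        labels.append(nearest[1])
    return labels


def fit_centroids(points, labels):
    """Stage 2 (stand-in for ZGP): supervised training on fully labeled
    data, here by computing one centroid per class."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {
        lab: tuple(sum(coord) / len(coord) for coord in zip(*pts))
        for lab, pts in groups.items()
    }


def classify(centroids, point):
    return min(centroids, key=lambda lab: math.dist(point, centroids[lab]))


# Only two points carry a human-provided label.
seeds = [((0.0, 0.0), "normal"), ((5.0, 5.0), "imbalance")]
unlabeled = [(0.2, 0.1), (0.4, -0.3), (4.8, 5.1), (5.3, 4.6), (-0.1, 0.5)]

pseudo_labels = propagate_labels(unlabeled, seeds)
all_points = [s[0] for s in seeds] + unlabeled
all_labels = [s[1] for s in seeds] + pseudo_labels

centroids = fit_centroids(all_points, all_labels)
prediction = classify(centroids, (4.0, 4.5))
print(pseudo_labels)
print(prediction)
```

The second stage never sees which labels came from a human and which were inferred, so its quality depends on how well the first stage labels the data.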
So you stack them?
How does it impact performance?
The stacked model shows outstanding performance, much better than PRPCA alone or ZGP alone.
How do you measure this?
We used three datasets. We generated the first one, DC motor, experimentally. The other two, WII and UWave, are publicly available.
How did you generate the DC motor dataset?
To work on a real dataset of motor failures, we conducted our own experiment to simulate anomalies of DC motors in a production environment. We generated motor axis imbalance by loading weights onto a disk plate mounted on top of the motor, at varying distances from its axis.
Interesting, and what did you measure?
We measured the accelerations, the rotations, and the magnetic fields using an accelerometer, a gyroscope, and a magnetometer.
Wait a second. If you want to predict motor failure, then this is a predictive analytics use case!
True. It is a combination of genetic algorithms, optimization algorithms, and predictive analytics.
And you applied the semi-supervised learning algorithm PRPCA stacked with the genetic algorithm ZGP to the three datasets, i.e., data analytics?
And how did the stacking perform?
Very well. When we apply PRPCA alone to the DC motor dataset, we obtain accuracies ranging from 66% to 71%.
OK. That’s pretty good already.
It’s pretty good, but we can do better.
What about the genetic algorithm ZGP alone?
ZGP alone reaches accuracies ranging from 26% to 62% on the DC motor dataset.
OK, and what about when you stack both?
Stacking PRPCA (the semi-supervised algorithm) and ZGP (the genetic algorithm) yields accuracies ranging from 94.2% to 98.8%.
That’s probably why the committee in charge of the selection noticed the article.
And now, this resulting model can be applied to any topic and any use case.
To a wide variety, yes. It can be used for data analytics, augmented analytics, predictive analytics. It is a semi-supervised evolutionary algorithm for classification.