Semi-supervised learning for document classification


Mikhail Kamalov is a Ph.D. student at INRIA, a French research institute, and at MyDataModels, a start-up specializing in machine learning for small amounts of data. MyDataModels funds his Ph.D. His supervisor at INRIA is Konstantin Avrachenkov, a senior scientist with the NEO research team. His supervisor at MyDataModels is Carlo Fanara, a senior scientist in charge of the Research Department.

Hi Mikhail, you recently published a paper at a scientific conference?

Yes, I published a paper titled “GenPR: Generative PageRank Framework for Semi-supervised Learning on Citation Graphs.” I presented it at the 9th Conference on Artificial Intelligence and Natural Language (AINL). It was held online, between October 7th and October 9th, 2020.

That doesn’t sound very easy. What is it about?

It is about the application of semi-supervised learning algorithms to document classification.

What is a semi-supervised algorithm?

To understand semi-supervised algorithms, we first need to understand supervised machine learning.

OK, what is supervised machine learning?

In supervised machine learning, human users give the algorithm a data set divided into classes. Every data point in the set is labeled, i.e., tagged as belonging to a class, so the model is trained on a fully labeled data set. Once trained, the model automatically assigns new, unlabeled data to one of the classes. In a nutshell, all the training data is labeled.

And now, back to semi-supervised machine learning?

In semi-supervised learning, only a small amount of data is labeled, i.e., associated with a class. Typically, one or two elements per class are labeled, alongside large amounts of unlabeled data. The training method copes with both labeled and unlabeled data: the model learns to group similar data.
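As a rough illustration of the idea (not the method from the paper), the sketch below labels one document per class and gives every other document the label of its closest labeled neighbor. The document names, the 2-D coordinates, and the class names are all made up for the example.

```python
import math

# Hypothetical documents, each reduced to a 2-D feature point for illustration.
points = {
    "doc1": (0.0, 0.1),
    "doc2": (0.2, 0.0),
    "doc3": (0.1, 0.3),
    "doc4": (5.0, 5.1),
    "doc5": (5.2, 4.9),
    "doc6": (4.8, 5.3),
}

# Only one document per class carries a label; the rest are unlabeled.
labels = {"doc1": "physics", "doc4": "biology"}


def nearest_labeled(name):
    """Return the labeled document closest to `name` (Euclidean distance)."""
    return min(labels, key=lambda known: math.dist(points[name], points[known]))


# Every document inherits the label of its nearest labeled document.
predicted = {name: labels[nearest_labeled(name)] for name in points}
print(predicted)
```

Real semi-supervised methods use the unlabeled data far more cleverly than this nearest-neighbor rule, but the input has the same shape: a few labels and many unlabeled points.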

What about unsupervised learning?

It also exists. In this case, no data is labeled. It is up to the unsupervised learning algorithm to create classes and assign data to each class. Unsupervised learning consumes time and computing resources. Supervised learning consumes human resources to build the labeled data set. Semi-supervised learning is a compromise between the human time and the computing resources consumed.

Is the choice of semi-supervised learning algorithm a question of resource allocation?

Not only. In most real-world use cases, few data points are labeled. Semi-supervised learning is well suited to these situations: it automatically assigns the missing labels.

And what was the use case you intended to address with this paper?

The classification of documents based on semi-supervised algorithms.

Which means?

We have a massive set of documents to classify into several classes. With only a couple of labeled papers per class, we proposed an algorithm that performs well and can assign every document in that massive set to a category.

How did you manage to do this?

We used other pieces of information present in the text to compensate for the lack of labeled data. In particular, we used NLP.

What is NLP?

Natural Language Processing is the discipline in data science that allows for language analysis and understanding. Document classification (the goal of our paper) is a subdomain of NLP.

Could you be more specific?

We have worked with two concepts: ‘bag of words’ and ‘citation network.’

What are these concepts about?

A document uses a finite set of words. Depending on its topic and length, a standard article uses on average between 2,000 and 5,000 different terms. These terms make up the bag of words. Today, the bag of words is used as is, without questioning or modification. We challenged this in our article.
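A minimal sketch of a bag of words, assuming a very simple tokenizer (lowercase runs of letters) and two invented one-line documents. The function names are hypothetical and only serve the illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Turn a bag of words into a count vector over a fixed vocabulary."""
    return [bag[word] for word in vocabulary]


# Two invented documents standing in for full articles.
docs = [
    "Graph learning on citation graphs",
    "Learning to classify documents",
]

bags = [bag_of_words(d) for d in docs]
# The shared vocabulary is every distinct word seen in any document.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

print(vocabulary)
print(vectors)
```

Each document becomes a vector with one entry per vocabulary word, which is why a real corpus with thousands of distinct terms produces very long, mostly empty vectors.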

Interesting. What about the citation network?

Each scientific article cites sources. Each source quotes other sources. And so on. It creates a mesh of related articles cited by one another. It is what we call the citation network.
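A citation network can be stored as a simple graph. The sketch below assumes four hypothetical papers and treats a citation as a link in both directions, which is a common simplification and not necessarily the choice made in the paper.

```python
# Hypothetical citation lists: each paper maps to the papers it cites.
citations = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": [],
    "paper_d": ["paper_c"],
}


def build_undirected_graph(citations):
    """Link two papers whenever one of them cites the other."""
    neighbors = {paper: set() for paper in citations}
    for paper, cited in citations.items():
        for other in cited:
            # A cited paper may not have its own entry in `citations`.
            neighbors.setdefault(other, set())
            neighbors[paper].add(other)
            neighbors[other].add(paper)
    return neighbors


graph = build_undirected_graph(citations)
for paper in sorted(graph):
    print(paper, "->", sorted(graph[paper]))
```

Here `paper_c` ends up connected to the three papers that cite it, even though it cites nothing itself.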

So you used the bag of words and the citation network on the unlabeled data?

We used the bag of words and the citation network on all data, labeled and unlabeled, to find commonalities between the articles and classify them together.

We did more than that. We modified the bag of words using neural networks.

Why?

With a smaller, more pertinent bag of words, it is easier to find correlations between similar articles. We used a variational autoencoder for this reduction.

The model learns from the labels, the bag of words, and the citation network. Training improves it step by step by reducing the bag of words, improving the citation network, and labeling more data.

Did you use neural networks in another place?

To classify the articles, we used the PageRank algorithm embedded in neural networks.
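To give an idea of how PageRank can spread labels over a citation graph, here is a plain personalized PageRank sketch with no neural network involved. It is not the GenPR method from the paper. The graph, the class names, the restart probability `alpha`, and the iteration count are assumptions chosen for the example.

```python
def personalized_pagerank(neighbors, seeds, alpha=0.15, iterations=50):
    """Score nodes by how easily a random walk restarting at `seeds` reaches them."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in neighbors}
    score = dict(restart)
    for _ in range(iterations):
        # With probability alpha the walk jumps back to a seed node.
        new_score = {n: alpha * restart[n] for n in neighbors}
        for node, linked in neighbors.items():
            if not linked:
                continue  # isolated node: nothing to pass on
            # Otherwise it moves to a neighbor chosen uniformly at random.
            share = (1 - alpha) * score[node] / len(linked)
            for other in linked:
                new_score[other] += share
        score = new_score
    return score


def classify(neighbors, labeled):
    """Give each node the class whose labeled seeds reach it most strongly."""
    classes = sorted(set(labeled.values()))
    scores = {
        c: personalized_pagerank(
            neighbors, {n for n, label in labeled.items() if label == c}
        )
        for c in classes
    }
    return {n: max(classes, key=lambda c: scores[c][n]) for n in neighbors}


# Hypothetical graph: two tightly linked groups joined by one citation (a3-b3).
neighbors = {
    "a1": {"a2", "a3"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2", "b3"},
    "b1": {"b2", "b3"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2", "a3"},
}
labeled = {"a1": "ml", "b1": "biology"}  # one labeled paper per class

result = classify(neighbors, labeled)
print(result)
```

With a single labeled paper per group, every paper ends up with the label of the seed in its own group, because the random walk stays mostly inside densely connected regions.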

What is the performance of your algorithm?

We tested it on three data sets: articles from PubMed, Citeseer, and Cora. It outperformed existing algorithms on all three article databases.

Congratulations. Is this the reason why your paper was published?

The performance of the proposed algorithm is critical, of course, but it is not the only criterion.

What else?

The topic of the article itself: document classification is a subdomain of NLP. And this conference is dedicated to NLP.

What was the biggest challenge for you in getting this paper accepted?

Well, as a Ph.D. student, I was very proud when this paper was published. But it happened after several rejections, and I learned a great deal from them. I learned that, even though the scientific content I was writing about was good, it was not enough. The paper had to be didactic and understandable. My first papers were not. But thanks to the mentoring I got from my supervisors, I learned to improve this. It is an achievement because I share the result of months of hard work in this paper, and I expose myself to peer review. Quite a challenge! Quite a thrill too!

Mikhail's article can be found at:

https://link.springer.com/chapter/10.1007/978-3-030-59082-6_12
