Machine Learning can help us detect spam and classify our e-mails.
Our professional inboxes are stuffed with e-mails: conversations with colleagues, negotiations with third parties, but also newsletters and spam are the daily bread of millions of people around the world.
This never-ending flow of information must be sorted by importance, urgency and other factors. Such tasks are very time-consuming and divert us from our priorities.
Predictive models can help workers classify their e-mails automatically and reduce processing time, so that they can focus on the tasks that actually matter.
Problems to solve
- How can you detect spam?
- How can you set up your inbox to avoid spam?
- How can predictive models help classify your e-mails automatically?
Benefits of TADA
Most professionals around the world have an e-mail address and receive e-mails daily. However, most of them are not data scientists and don’t have the right Machine Learning and code skills to create models.
Moreover, even if it seems large to the recipient, the number of e-mails is limited: it forms a Small Data set, which traditional Machine Learning tools handle poorly.
To answer these issues, MyDataModels offers TADA, a solution designed for Small Data to help professionals build predictive models out of their Small Data sets.
Professionals don’t need data science training to use TADA. Domain experts can use their own Small Data sets without normalization or preprocessing.
A self-service solution for domain experts with no data science knowledge, TADA provides them with convincing results in less than a minute on a laptop.
A more efficient classification of professionals’ e-mails can improve their productivity by allowing them to focus on their most important tasks. Most notably, TADA can improve spam detection and limit fraud, phishing and the spread of computer viruses.
Professionals can use TADA predictive models to complement their expertise and decide efficiently which e-mails must be handled first.
Automated Machine Learning tools help users to predict the future thanks to historical data. To predict a future result, you must compile your descriptive data and the past results obtained.
TADA allows you to easily create a relevant predictive model from your data and apply it to future data.
Here, the descriptive data comes from a sample of received e-mails. TADA must determine whether or not each e-mail is spam: this is a binary classification.
This example can be extended to several e-mail categories, which would make the prediction a multiclass classification.
You can generate a model in just 4 steps:
- Step 1: Create your project and upload your data as a CSV file (with data in rows and variables in columns).
- Step 2: Select the variable you want to predict, called “Goal”. In this use case, the goal is the “Spam” variable.
- Step 3: Select your data for the model generation. This step is called "Creating the Variable set" and allows you to manually select the descriptive variables you want to use. By default, they are all selected.
TADA identifies the relevant descriptive variables by itself, and the number of variables it has to examine affects the calculation time required to create the model.
The fewer variables you select, the faster the model is created.
- Step 4: Create your model. When creating your model, some default values are proposed for the name of the model, the size of the population and the number of iterations.
You can start your model generation by validating the default values or editing them according to your preferences. You’ll find best practices at your disposal to guide you in the choice of these parameters in the TADA UI.
Depending on the size of the file, this step can take between a few seconds and ten minutes. Once the model is created, you have access to metrics and graphs to evaluate its relevance.
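To illustrate the file layout expected in Step 1, here is a minimal Python sketch that writes a small CSV with one e-mail per row and one variable per column. The column names `word_freq_free` and `word_freq_money` and the values are hypothetical examples following the “word_freq_WORD” pattern; only the “Spam” goal column comes from this use case.

```python
import csv
import io

# Hypothetical sample: one e-mail per row, one variable per column.
# The last column is the Goal ("Spam": 1 = spam, 0 = not spam).
COLUMNS = ["word_freq_free", "word_freq_money", "Spam"]
ROWS = [
    {"word_freq_free": 2.5, "word_freq_money": 1.2, "Spam": 1},
    {"word_freq_free": 0.0, "word_freq_money": 0.0, "Spam": 0},
]


def to_csv(rows, columns):
    """Serialize rows to CSV text, with a header line of variable names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(ROWS, COLUMNS)
print(csv_text)
```

The resulting text can be saved to a `.csv` file and uploaded when creating the project.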
How can we go further?
You have various options to put your model into practice:
- Use the “Predict” feature of TADA: upload a CSV file with the data to predict. In return, TADA will generate a CSV file with the calculated predictions.
- Retrieve the associated mathematical formula and apply it (for instance in Excel).
- Retrieve the source code of the mathematical formula and use it in your own apps. The source code is available in R, Java and C++, with Python coming soon. (This option is only available in TADA Premium and Pro.)
The screenshot below shows an extract of the dataset. Each row is an e-mail and each column is a variable that can be used by the model.
The dataset is made of 48 continuous variables whose values range from 0 to 100. They are of the “word_freq_WORD” type and represent the percentage of words in the e-mail text that match WORD.
A “word” is defined as a string of alphanumeric characters bounded by non-alphanumeric characters or the end of the string.
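As a rough illustration of how a “word_freq_WORD” value could be computed, the sketch below applies the definition of a word given above. The function name `word_freq` and the case-insensitive matching are assumptions made for the example, not a description of how the dataset was actually built.

```python
import re


def word_freq(text, word):
    """Return the percentage (0 to 100) of words in `text` equal to `word`."""
    # A word is a run of alphanumeric characters bounded by
    # non-alphanumeric characters or the end of the string.
    # Assumption: matching is case-insensitive.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)


sample = "Free money! Claim your free prize now."
print(word_freq(sample, "free"))
```

In the sample sentence, 2 of the 7 words are “free”, so the value is 2/7 expressed as a percentage.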
Model type: binary classification
Number of rows: 2972
Number of columns: 48
Goal: Spam class (with values 1, 0)
Class balance: positive class (0) 57%, negative class (1) 43%
The results show how the predictive model performs.
The predictive model type and its metrics are linked to the Goal and its values. The model type is shown on the model results display.
Three types of prediction are possible, depending on the Goal data (here, our goal is “Spam”):
- Binary classification: a discrete value taking only two values, such as Yes/No.
- Multiclass classification: a discrete value with more than two values, such as the status of a system with values like “On”, “At Risk”, “Down”, etc.
- Regression: a continuous value that can take an infinite number of values, such as a temperature, a pressure, a turnover or the price of a house.
When generating the model, and in line with standard Machine Learning practice, TADA divides your data into three parts:
- Part 1: A Training part which represents 40% of the data and is used to train a certain number of models,
- Part 2: A Validation part which represents 30% of the data and is used to validate and select the best models found in the previous step,
- Part 3: A Test part which represents 30% of the data and is used to test the model approved during the validation step.
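The 40% / 30% / 30% division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_dataset`, the random shuffling and the rounding rule are hypothetical, and TADA's actual procedure may differ.

```python
import random


def split_dataset(rows, seed=0):
    """Shuffle rows and split them into 40% / 30% / 30% parts."""
    shuffled = list(rows)
    # Assumption: rows are shuffled before splitting (fixed seed for repeatability).
    random.Random(seed).shuffle(shuffled)
    n_train = round(0.4 * len(shuffled))
    n_valid = round(0.3 * len(shuffled))
    train = shuffled[:n_train]                   # used to train candidate models
    valid = shuffled[n_train:n_train + n_valid]  # used to select the best models
    test = shuffled[n_train + n_valid:]          # held out to measure performance
    return train, valid, test


# 2972 is the number of rows in the spam dataset of this use case.
train, valid, test = split_dataset(range(2972))
print(len(train), len(valid), len(test))
```

Every row ends up in exactly one part, so the Test part is never seen while models are trained or selected.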
The performance measurement and the model evaluation must be done on the Test part (according to Machine Learning standards) as the data used during this phase was not used to build the model and is just used to measure its performance.
Here, different metrics show that the model has good results.
First, Accuracy (ACC) shows that the model is right in 91.6% of its predictions.
Moreover, TADA separates the negative class (True Negative Rate of 88.98%) and the positive class (True Positive Rate of 93.55%) well. These results are confirmed by the Matthews Correlation Coefficient (MCC) of 0.82, which indicates a good allocation among classes.
The confusion matrix is a visual way to read the metrics.
Here, TADA predicted 521 times that an incoming e-mail was not spam. Among these predictions, 479 e-mails were indeed not spam (True Positive) while 42 of them were spam (False Positive).
Meanwhile, TADA predicted 372 times that an incoming e-mail was spam. Among these predictions, 339 e-mails were indeed spam (True Negative) while 33 of them were not (False Negative).
Accuracy (ACC) is the overall accuracy rate of the model: it is the percentage of predictions that are correct (here, 91.6%).
Recall or True Positive Rate (TPR) is the share of actual positive cases that the model correctly predicts as positive.
Specificity or True Negative Rate (TNR) is the share of actual negative cases that the model correctly predicts as negative.
The Matthews Correlation Coefficient (MCC) is an indicator of the general quality of the model and shows how well the values are allocated between the two classes.
Ready to use TADA?
Don't have data on hand?
No problem, sample data is available to make your trial as relevant as possible! Try it now!