Quality can be seen as the outcome of a complex process involving many components. By analyzing these components (or variables) and understanding the impact of each on the final outcome, professionals can predict the quality of a product or service.
This case study is about building a model to predict the quality of a red wine from several variables.
Problems to solve
- How to predict, early in the production process, the quality of the output?
- How to detect the quality of a future product?
- How to help quality assurance (QA) to maintain high standards?
- Can machine learning help in these matters, and how accurate can predictive models be at predicting quality?
Benefits of TADA
Wine makers in particular and production managers in general could use predictive models to get actionable predictive quality insights and help them control the quality of their production.
However, they are not data scientists, and they may have neither the machine learning skills nor the coding experience required to build models. Most of the data these professionals handle is Small Data: historical data often covers a limited number of different wines (or products), rarely the thousands or more found in Big Data. Traditional machine learning tools work well with Big Data but do not perform well with Small Data.
MyDataModels allows domain experts, in this case Production or QA managers, to build predictive models from Small Data automatically and without training. They can use their collected data directly, without normalization, outlier management, or feature engineering. Thanks to this limited data preparation, the results for this specific dataset were obtained with a few clicks in 3 minutes on a regular laptop.
MyDataModels brings a self-service solution for those who have Small Data and no data scientists.
Conclusion
The computerization of industrial machinery is increasing, and more and more sensors are connected via Internet of Things (IoT) platforms. Machine learning makes it possible to move from passive to preventive monitoring, helping production and QA managers, operators, and process engineers optimize quality and minimize rework and warranty claims by forecasting issues and avoiding recalls.
In this red wine quality prediction use case, the results obtained from MyDataModels’ predictive models are more than satisfying, with an RMSE of 0.63. This means that when estimating the grade between 0 and 10, the typical error is about 0.63 points out of 10.
By using an automated machine learning solution like TADA, manufacturers can proactively identify quality issues by running a root cause analysis. This analysis enables them to detect which variable along the production process is likely to affect the overall quality of their production.
Case study
Solution
Automated Machine Learning solutions consist of predicting the future with historical data. To predict a future result, you must bring your descriptive data and the past result obtained.
TADA allows you to simply create a relevant predictive model from your data and apply it to future data.
In this case, the descriptive data are the wine’s physicochemical measurements.
The goal is to predict the wine’s quality grade, which makes this a regression task.
To generate a model, the steps are the following:
- Create your project and load your data as a CSV table (with data in rows and variables in columns).
- Select the variable you want to predict, called Goal.
In this case, the Goal is the variable "Quality" (a visualization of the variable is provided).
- Select your data for the model generation. This step is called "Creating the Variable set" and allows you to manually select the descriptive variables you want to use. By default, they are all selected.
TADA identifies the relevant descriptive variables by itself, which affects the calculation time required to create the model.
The fewer variables selected, the faster the model creation.
- Create your model.
At creation, default values are proposed to you: Name of models, Population, Iteration. You only need to validate the default values to start model generation.
‘Best practices’ are at your disposal to guide you in the choice of these parameters.
Depending on the size of the descriptive data file, this step can take between a few seconds and ten minutes.
Once the model is created, you can see the results of the model using metrics and charts so you can judge its relevance.
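The data layout described in the steps above (one row per wine, one column per variable, one column chosen as the Goal, and an optional Variable set) can be sketched in a few lines of Python. This is only an illustration of the table structure, not TADA's own loader: the function name `load_table`, its parameters, and the three sample rows are made up for the example.

```python
import csv
import io

# Made-up sample rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """fixed acidity,volatile acidity,alcohol,quality
7.0,0.60,9.5,5
8.1,0.35,11.2,6
6.9,0.48,10.1,5
"""

GOAL = "quality"  # the variable to predict


def load_table(text, goal, variable_set=None):
    """Return (feature_rows, goal_values) parsed from CSV text.

    variable_set lists the descriptive columns to keep;
    None keeps every column except the goal (the default selection).
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = [c for c in reader.fieldnames if c != goal]
    if variable_set is not None:
        columns = [c for c in columns if c in variable_set]
    features, goals = [], []
    for row in reader:
        features.append({c: float(row[c]) for c in columns})
        goals.append(float(row[goal]))
    return features, goals


features, goals = load_table(SAMPLE_CSV, GOAL)
print(list(features[0]))  # descriptive variables kept
print(goals)              # Goal values, one per wine
```

Passing `variable_set={"alcohol"}` would keep only that descriptive column, which mirrors a manual selection in the Variable set step.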
Note:
To apply a model that you think is relevant, you can:
- Retrieve the associated mathematical formula and apply it (for instance in Excel)
- Retrieve the source code of the formula and use it yourself (available only with paid TADA offers). The source code is available in R, Java, C++ and soon Python.
- Use the "Predict" feature in the product: upload the file containing the data to be predicted, and you will get back a downloadable file containing that data together with the calculated predictions.
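As a rough sketch of what applying a retrieved formula to new data could look like, the Python example below appends a prediction column to a CSV table. Everything specific in it is an assumption: `predict_quality` is a hypothetical stand-in with invented coefficients (it is not a formula produced by TADA), and the two input rows and the output column name are made up.

```python
import csv
import io


def predict_quality(row):
    """Hypothetical stand-in for a formula exported from a model.

    The coefficients are invented for illustration; they are not TADA output.
    """
    return 3.0 + 0.3 * row["alcohol"] - 1.0 * row["volatile acidity"]


# Made-up rows to be predicted (no Goal column).
NEW_DATA = """volatile acidity,alcohol
0.50,10.0
0.30,12.0
"""


def add_predictions(text, formula, column="predicted quality"):
    """Return the CSV text with one extra column holding the predictions."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + [column], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        values = {name: float(value) for name, value in row.items()}
        row[column] = f"{formula(values):.2f}"
        writer.writerow(row)
    return out.getvalue()


result = add_predictions(NEW_DATA, predict_quality)
print(result)
```

The output keeps the given data unchanged and adds the calculated prediction at the end of each row, which is the same shape as the downloadable file described above.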
Dataset information
The screenshot below shows an extract of the public dataset.
Each row is a red wine and each column is a variable which can be used in the model.
The dataset is related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones), so we preferred regression over multiclass classification.
Dataset summary
Task Type: Regression
Number of variables: 12
Number of rows: 1119
Goal: Quality grade from 0 to 10.
The Input variables, based on physicochemical tests, are:
1 - fixed acidity (Numeric)
2 - volatile acidity (Numeric)
3 - citric acid (Numeric)
4 - residual sugar (Numeric)
5 - chlorides (Numeric)
6 - free sulfur dioxide (Numeric)
7 - total sulfur dioxide (Numeric)
8 - density (Numeric)
9 - pH (Numeric)
10 - sulphates (Numeric)
11 - alcohol (Numeric)
Goal variable (based on sensory data):
12 - quality grade from 3-to-8 (Numeric)
Results
The results of the model are available following the generation of the model.
They present the performance of the predictive model.
The type of predictive model and the measurement indicators of the associated model are related to the Goal (Variable to be predicted) and the values of this variable.
The type of model you make is shown on the model results display.
According to the type of the Goal (in our case, the Goal is "Quality"), we can make three types of predictions:
- Binary classification: Discrete value taking only two values (yes / no for instance)
- Multiclass classification: Discrete value taking more than two values (for instance a status of state with values like: On, Risk of breakdown, Down, etc.)
- Regression: Continuous value that can take an infinite number of values (a temperature, a pressure, a turnover, the price of a house, etc.)
When the model is generated, TADA divides your dataset into three parts, following standard machine learning practice:
- A training part, which represents 40% of your dataset and is used to train a number of candidate formulas,
- A validation part, which represents 30% of your dataset and is used to validate and select the best formulas found in the previous step,
- A test part, which represents the last 30% of your dataset and is used to test the formulas approved in the preceding step. Performance measurement and evaluation of your model should mainly be done on this partition, as is standard in machine learning, because its data were not used in the training and validation phases and serve only to measure the model's performance.
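The 40% / 30% / 30% partition can be illustrated with a short Python sketch. It assumes a random shuffle and simple rounding, and the function name `split_dataset` and its parameters are hypothetical; how TADA actually orders or rounds the partitions is not described here.

```python
import random


def split_dataset(rows, train=0.4, validation=0.3, seed=0):
    """Shuffle rows and cut them into training, validation and test parts.

    The test part receives whatever remains after the first two parts.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


# Row indices standing in for the 1119 wines of the dataset.
rows = list(range(1119))
train_part, validation_part, test_part = split_dataset(rows)
print(len(train_part), len(validation_part), len(test_part))
```

Because the three parts are disjoint, a metric computed on the test part reflects rows the formulas never saw during training or selection.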
MAPE = Mean Absolute Percentage Error is the average of the absolute errors expressed as a percentage of the actual values. Here, each prediction is off by 9.091% on average.
MAE = Mean Absolute Error is the average of the absolute errors. Here, each prediction is off by 0.5 points on average.
RMSE = Root Mean Square Error is the square root of the average of the squared errors. Here it is 0.63. Because errors are squared before averaging, this metric is more sensitive to outliers than MAE.
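The three metrics can be computed by hand with the standard definitions, as in the Python sketch below. The `actual` and `predicted` values are made up to show the arithmetic; they are not the model's predictions and do not reproduce the figures reported above.

```python
import math


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean Absolute Error: average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up grades for illustration only.
actual = [5, 6, 5, 7]
predicted = [5.5, 5.0, 5.0, 6.5]

print(f"MAPE = {mape(actual, predicted):.3f}%")
print(f"MAE  = {mae(actual, predicted):.3f}")
print(f"RMSE = {rmse(actual, predicted):.3f}")
```

In this example the single error of 1.0 point pulls RMSE above MAE, which shows why RMSE reacts more strongly to large individual errors.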
Ready to use TADA?
Don't have data at hand?
No problem, data are available to make your trial as relevant as possible!
Try it now!