Quality can be seen as the outcome of a complex process involving many components. By analyzing these components (or variables) and understanding the impact of each of them on the final outcome, professionals can predict the quality of a product or service.
This case study is about building a model to predict the quality of a red wine depending on several variables.
Problems to solve
How to predict, early in the production process, the quality of the output?
How to detect the quality of a future product?
How to help quality assurance (QA) to maintain high standards?
Can machine learning help in these matters, and how accurate can predictive models be at predicting quality?
Like humans, machines can learn to make predictions by analyzing past information (historical data). Machines can quickly identify patterns in a set of data and produce a mathematical formula (model) using the variables from this historical data.
To demonstrate the performance of the MyDataModels solution on this type of problem, we chose a specific case study.
This case study is based on a public dataset* available from Kaggle and UCI.
The objective in this case study is to predict the quality of a red wine depending on several variables. The quality is graded from 0 to 10.
* See the Dataset information section for details on this dataset.
The graph below shows an extract of the public dataset.
Each line is a red wine and each column (aka feature) is a variable which can be used in the model.
Features used:
Input variables (based on physicochemical tests):
- fixed acidity (Numeric)
- volatile acidity (Numeric)
- citric acid (Numeric)
- residual sugar (Numeric)
- chlorides (Numeric)
- free sulfur dioxide (Numeric)
- total sulfur dioxide (Numeric)
- density (Numeric)
- pH (Numeric)
- sulphates (Numeric)
- alcohol (Numeric)
Output variable (based on sensory data):
- quality: grades range from 3 to 8 in this dataset, on a scale of 0 to 10 (Numeric)
To create predictive models, MyDataModels offers an efficient solution called TADA.
To build a model, experts need to:
1) upload the historical dataset into TADA,
2) set what they want to predict (here “quality”),
3) select the other variables to use (here, all the other columns).
Below are the statistics of a model obtained with TADA within 3 minutes.
MAPE = Mean Absolute Percentage Error
MAE = Mean Absolute Error
RMSE = Root Mean Square Error
R2 = Coefficient of Determination (R squared)
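As a rough illustration of what these four statistics measure, here is a minimal Python sketch that computes them from a list of actual grades and a list of predicted grades. R2 is computed as the coefficient of determination. The function name `regression_metrics` and the sample grades are assumptions made up for this example; they are not taken from TADA or from the dataset.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, RMSE, MAPE (in percent) and R2 for two equal-length lists."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    # MAE: average size of the error, in grade points.
    mae = sum(abs(e) for e in errors) / n
    # RMSE: square root of the average squared error; penalizes large misses.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE: average error relative to the actual value (undefined if an actual is 0).
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    # R2: share of the variance of the actual grades explained by the predictions.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Made-up grades for illustration only (not taken from the dataset).
actual = [5, 6, 5, 7, 4]
predicted = [5.5, 5.5, 5.0, 6.0, 5.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which is why the two are usually reported together.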
How to use this model?
Once a model has been built from historical data, professionals can easily make predictions by using information from their current wines.
To use a model, domain experts need to create a new dataset in which each line is a wine they want a prediction for and each column is a variable used in the model.
Then they need to import this file into TADA (see screen below) and click on “Generate score”.
A new file will be generated with a new column (“prediction”) giving the predicted quality of each wine: a grade between 0 and 10.
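The shape of the scoring file and of the generated file can be sketched in Python as follows. This is only an illustration of the file layout, not TADA's implementation: the helper names, the sample values, and `placeholder_predict` (which returns a constant grade) are all hypothetical, and the column names are assumed to match the feature list given earlier.

```python
import csv
import io

# Column names as written in the feature list of this case study.
MODEL_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

def missing_columns(header):
    """Return the model variables that are absent from a scoring file header."""
    return [name for name in MODEL_COLUMNS if name not in header]

def add_prediction_column(csv_text, predict):
    """Return CSV text with a 'prediction' column appended to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = list(reader.fieldnames or [])
    missing = missing_columns(header)
    if missing:
        raise ValueError(f"scoring file lacks columns: {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header + ["prediction"],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["prediction"] = predict(row)
        writer.writerow(row)
    return out.getvalue()

def placeholder_predict(row):
    # Hypothetical stand-in for a trained model: always returns the same grade.
    return 5.0

# One wine per line, one column per model variable (illustrative values).
scoring_csv = (
    ",".join(MODEL_COLUMNS) + "\n"
    + "7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4\n"
)
scored = add_prediction_column(scoring_csv, placeholder_predict)
print(scored)
```

Checking the header before scoring catches the most common mistake early: a scoring file that omits or renames one of the variables the model was built on.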
Benefits of TADA
Wine makers in particular, and production managers in general, could use predictive models to get actionable insights into quality and to help them control the quality of their production.
However, they are not data scientists, and they may have neither the machine learning skills nor the coding experience required to build models. Most data handled by these professionals is Small Data, meaning that the historical data often covers a limited number of different wines (or products), rarely the thousands or more found in Big Data. Traditional machine learning tools work well with Big Data but tend to perform poorly with Small Data.
MyDataModels allows domain experts, in this case Production or QA managers, to automatically build predictive models from Small Data. No training is required, and they can use collected data directly without needing to normalize it or handle outliers. No feature engineering is required. Thanks to this limited data preparation, the above results were obtained from this specific dataset in a few clicks and in 3 minutes on a standard laptop.
MyDataModels brings a self-service solution for those who have Small Data and no data scientists.
The computerization of industrial machinery is increasing, and more and more sensors are connected via Internet of Things (IoT) platforms. Machine learning makes it possible to move from passive to preventive monitoring, helping production and QA managers, operators and process engineers to optimize quality and to minimize rework and warranty claims by forecasting issues and avoiding recalls.
In this red wine quality prediction use case, the results obtained from MyDataModels’ predictive model are more than satisfactory, with an RMSE of 0.63. This means that when estimating a grade between 0 and 10, the typical prediction error is about 0.63 point. Because RMSE squares the errors before averaging them, large misses weigh more heavily than they would in a plain average.
By using an automated machine learning solution like TADA, manufacturers can proactively identify quality issues by running a root cause analysis. This analysis enables them to detect which variable along the production process is likely to affect the overall quality of their production.
Dataset information
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones), so we preferred regression over multiclass classification.
Task: Regression
Number of features: 12 (11 input variables plus the quality target)
Size of data: 1119 Train samples, 480 Score samples
Target: Quality grade from 0 to 10.
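The sample counts above add up to 1599 wines, which corresponds to a split of roughly 70% for training and 30% for scoring. The source does not say how the rows were assigned to each part; the Python sketch below assumes a simple random shuffle, and `split_indices` and its `seed` parameter are hypothetical names used only for this illustration.

```python
import random

TRAIN_SIZE = 1119  # train samples reported in this case study
SCORE_SIZE = 480   # score samples reported in this case study
TOTAL = TRAIN_SIZE + SCORE_SIZE

# Share of the data used for training (about 0.70).
train_fraction = TRAIN_SIZE / TOTAL

def split_indices(n_rows, n_train, seed=0):
    """Shuffle row indices and split them into a train part and a score part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # assumed random assignment
    return indices[:n_train], indices[n_train:]

train_idx, score_idx = split_indices(TOTAL, TRAIN_SIZE)
print(TOTAL, round(train_fraction, 2), len(train_idx), len(score_idx))
```

Fixing the seed makes the split reproducible, so the same wines end up in the score set each time the sketch is run.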
Input variables (based on physicochemical tests):
- fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
- citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
- residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
- chlorides: the amount of salt in the wine
- free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
- sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- alcohol: the percent alcohol content of the wine
- quality: output variable (based on sensory data)
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.