How to prepare a dataset?

Within a few years, artificial intelligence and machine learning have become key technologies that professionals and organizations must master to stay in the game and ahead of the competition. Organizations are starting to invest heavily in machine learning, and we already see highly positive results.

What is a dataset in machine learning?

In simple terms, a dataset is a collection of data. It is usually organized as a table with column names and rows of data, not very different from what you are used to working with in Excel. Columns are also referred to as ‘features’ or ‘variables’.
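To make the table picture concrete, here is a minimal sketch in Python that reads a tiny dataset and lists its columns. The data and the column names (`product`, `warehouse`, `units_in_stock`, `out_of_stock`) are invented for illustration and do not come from a real file.

```python
import csv
import io

# Hypothetical data: a tiny table with a header row, as it might be exported from Excel.
RAW = """product,warehouse,units_in_stock,out_of_stock
A,Lyon,120,no
B,Lyon,0,yes
C,Nice,35,no
"""

# DictReader treats the first row as the column names (the features or variables).
reader = csv.DictReader(io.StringIO(RAW))
rows = list(reader)
columns = reader.fieldnames

print(columns)                    # the column names
print(len(rows))                  # the number of data records
print(rows[1]["out_of_stock"])    # one cell, addressed by column name
```

Each row is one record, and each column name identifies one feature of that record. Note that values read from a CSV file arrive as text and usually need converting before analysis.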

Predictive modeling is the process that uses a historical dataset to build a mathematical solution able to predict outcomes from a new dataset.

In order to build a predictive model, you’ll need a dataset with historical data. This dataset will also contain a target feature – or ‘goal’ – you’ll want to predict once you’ve built your predictive model.

For example, if you work in a supply chain department and you want to predict when one of your products will be out of stock, you will first build a predictive model based on your historical dataset, which records when each product (say, product A or product C) was out of stock.

Once your model is built, you’ll be able to use a new dataset with the same structure but without the goal variable (say, the ‘out-of-stock’ column). In this case, the machine learning model will predict the risk of each product going out of stock.
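The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the feature names `units_in_stock` and `weekly_sales`, and the choice of a simple k-nearest-neighbours rule are all hypothetical, and this is not the algorithm used by any particular product.

```python
import math

# Hypothetical historical dataset: features plus the target column 'out_of_stock'.
historical = [
    {"units_in_stock": 5,   "weekly_sales": 40, "out_of_stock": 1},
    {"units_in_stock": 10,  "weekly_sales": 60, "out_of_stock": 1},
    {"units_in_stock": 200, "weekly_sales": 30, "out_of_stock": 0},
    {"units_in_stock": 150, "weekly_sales": 20, "out_of_stock": 0},
    {"units_in_stock": 20,  "weekly_sales": 50, "out_of_stock": 1},
    {"units_in_stock": 180, "weekly_sales": 45, "out_of_stock": 0},
]

TARGET = "out_of_stock"
FEATURES = ["units_in_stock", "weekly_sales"]


def predict_risk(train, new_row, k=3):
    """Share of the k most similar historical rows that were out of stock."""
    def distance(row):
        return math.dist([row[f] for f in FEATURES],
                         [new_row[f] for f in FEATURES])

    nearest = sorted(train, key=distance)[:k]
    return sum(row[TARGET] for row in nearest) / k


# New dataset: same structure, but without the target column.
new_data = [
    {"units_in_stock": 8,   "weekly_sales": 55},
    {"units_in_stock": 170, "weekly_sales": 25},
]

risks = [predict_risk(historical, row) for row in new_data]
print(risks)
```

The point is the shape of the data rather than the model: the historical rows carry the target, the new rows do not, and the prediction fills in that missing column. A real model would also need far more data and properly scaled features.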

Can a dataset be prepared by a non-data scientist?

Machine learning depends heavily on data. The quality of your dataset will affect the quality of your predictive model. This is why it is crucial to prepare your dataset correctly. This does not mean that you need terabytes of information. But if the data records do not make sense to you, a model trained on them will be nearly useless or perhaps even harmful.

Understanding data is much easier for a domain expert than for a data scientist, who lacks domain expertise and may spend the vast majority of their time exploring and visualizing a dataset, trying to understand it.

A data analyst can be of help in preparing your dataset, prior to machine learning analysis.

Data preparation tasks

When preparing a dataset, data scientists face a number of problems, such as the format of the data, the presence of outliers or missing values and, perhaps, other types of formal inconsistencies.
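As an illustration of these checks, the following sketch looks for badly formatted cells, missing values and outliers in a single column. The values are invented, and the interquartile-range rule used here is just one common convention for flagging outliers, not a prescribed method.

```python
import statistics

# Hypothetical column of stock levels as raw strings, as read from a CSV file.
raw_values = ["120", "95", "", "110", "n/a", "105", "9000", "98", "101", "115"]


def to_number(text):
    """Return a float, or None when the cell is empty or not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


parsed = [to_number(v) for v in raw_values]

# Positions of cells that are missing or in the wrong format.
missing_rows = [i for i, v in enumerate(parsed) if v is None]
present = [v for v in parsed if v is not None]

# Flag outliers with the interquartile-range (IQR) rule.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]

print(missing_rows)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that usually needs domain knowledge, which is why the code only reports candidates and does not delete anything.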

Additionally, they may face contextual difficulties, such as understanding what some of the data mean. This can call the relevance of one or more variables into question, and appropriate variable selection techniques may then be required.

To complicate things further, there might be a need for feature engineering (construction of new variables from the existing ones). In practice, feature engineering may involve several tedious tasks. As you can see, the level of complexity and sophistication may increase.
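A small sketch may help show what constructing new variables from existing ones looks like. The records and the two derived features below (`weeks_of_cover` and `delivery_weekday`) are hypothetical examples chosen for the supply chain scenario, not a recommended feature set.

```python
from datetime import date

# Hypothetical records; the column names are illustrative only.
records = [
    {"product": "A", "units_in_stock": 120, "weekly_sales": 40,
     "last_delivery": date(2024, 3, 4)},
    {"product": "C", "units_in_stock": 15, "weekly_sales": 30,
     "last_delivery": date(2024, 3, 9)},
]


def add_features(row):
    """Return a copy of the row with two derived variables added."""
    enriched = dict(row)
    # Ratio feature: weeks the current stock would last at the current sales rate.
    enriched["weeks_of_cover"] = row["units_in_stock"] / row["weekly_sales"]
    # Calendar feature: weekday of the last delivery (Monday == 0).
    enriched["delivery_weekday"] = row["last_delivery"].weekday()
    return enriched


engineered = [add_features(r) for r in records]
for row in engineered:
    print(row["product"], row["weeks_of_cover"], row["delivery_weekday"])
```

Neither new column adds information that was not already in the table, but each presents it in a form a model may find easier to use. A production version would also have to handle edge cases such as a product with zero weekly sales.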

A schematic of the different steps involved in the preparation of data is shown in picture 1.

Each task may lead to more detailed subtasks.

A mixed set of skills is therefore required: some of a technical nature, others of domain expertise.

Data Preparation: MyDataModels

In contrast, with MyDataModels technology, you do not need to deal with most of these difficulties. In fact, MyDataModels automated machine learning software does not require feature engineering, outlier management or data normalization.

Once the data has been properly collected, it is good practice to browse through the dataset to check for potential inconsistencies. Once finished, a comma-separated values (“.csv”) file can be easily uploaded to the platform to build a predictive model in a few clicks.
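Browsing a large file by eye is error-prone, so a short script can help with this final check. The sketch below is one possible approach, using invented data: it looks for rows with the wrong number of fields, exact duplicate rows, and labels that differ only by case or stray spaces. It says nothing about what any platform validates on upload.

```python
import csv
import io

# Hypothetical export; in practice this would be read from your own ".csv" file.
RAW = """product,warehouse,out_of_stock
A,Lyon,yes
B,lyon ,No
C,Nice
A,Lyon,yes
"""

reader = csv.reader(io.StringIO(RAW))
header = next(reader)
rows = list(reader)

# Rows whose number of fields differs from the header (data starts on line 2).
bad_lines = [n for n, row in enumerate(rows, start=2) if len(row) != len(header)]

# Exact duplicate rows, which may or may not be legitimate repeats.
seen, duplicate_lines = set(), []
for n, row in enumerate(rows, start=2):
    key = tuple(row)
    if key in seen:
        duplicate_lines.append(n)
    seen.add(key)

# Labels that differ only by case or stray spaces, e.g. "Lyon" vs "lyon ".
warehouses = {row[1] for row in rows if len(row) == len(header)}
normalized = {w.strip().lower() for w in warehouses}
inconsistent_labels = len(normalized) < len(warehouses)

print(bad_lines, duplicate_lines, inconsistent_labels)
```

The reported line numbers match what a spreadsheet or text editor would show, which makes it easy to go back and correct the source file by hand.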

You can start using MyDataModels technology for free: create your account now or learn more about our packages.