Written by Satvik Tripathi
The aim of a data science project is to turn raw data into insights that make sense to the people who rely on the analysis. A data scientist or machine learning engineer takes several steps to get there. Two of the most vital steps, which directly affect the accuracy of machine learning models, are data preprocessing (cleaning, formatting, scaling, and normalization) and data visualization through various plots.
Also, this is going to be a fairly long article, but an interesting one, so hang on tight!
The purpose of this article is to clarify these terminologies and their functions in machine learning implementation and to address their impacts on different business applications.
We’ll be using the Chocolate Bar Dataset (sounds yummy, right?). This dataset includes chocolate ratings, origins, percentage of cocoa, and the variety of beans used and where the beans were grown.
The dataset is so full of information that I bet most of you are wondering: what do we do with it, and what can we derive from it?
There's a variety of things we can do with this data, but for this particular article, we're going to analyze it to address the following questions, using visualization tools such as the distribution plot, box plot, KDE, and violin plot:
What is the average rating of blended and pure chocolates?
What countries make the highest quality chocolate bars?
Find the distribution of cocoa percentage across the data set (different data points).
Before seeking answers to the above questions, some data pre-processing steps (cleaning, formatting, etc.) are needed to visualize the data more clearly.
Data Preparation: Cleaning & Formatting Data
The data pipeline begins with the data collection and finishes with the transmission of the results. The process isn't as easy as it sounds. Several steps are involved — one of the most critical steps is data pre-processing.
Data pre-processing itself has several steps, and the number of steps depends on the type of data file, the quality of the data, the different types of values, and more.
Meet Data Pre-processing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. It prepares raw data for further processing and is used in database-driven applications such as customer relationship management, as well as in model-based applications like neural networks.
So what makes pre-processing data so relevant in machine learning or in any data science project?
Importance of Data Preprocessing
Let's take a quick example: a couple went to the hospital for a paternity test — both the man and the woman have to go through the procedure. When the test results returned, they suggest that the man is pregnant. Pretty weird, huh?
Now try and relate this to a machine learning problem — classification. We have 1000+ couples’ pregnancy test data, and for 60% of the data, we know who’s pregnant. For the remaining 40%, we need to predict the results on the basis of previously recorded tests. Let’s say, out of this 60%, 1% suggests that the man is pregnant.
While building a machine learning model, if we haven’t done any pre-processing, like correcting outliers, handling missing values, normalization and scaling of data, or feature engineering, we might end up training on that 1% of results that are false.
The machine learning model is nothing but a piece of code; a practitioner or data scientist makes it intelligent through training with data. So if you give garbage to the model, you will get garbage in return, i.e. the trained model will provide false or wrong predictions for the people (40%) whose results are unknown.
This is just one example of incorrect data. People might end up receiving absurd values (e.g. negative salary of an employee), sometimes missing values. This can all result in misleading predictions/answers for the unknowns, which is one of the main goals of machine learning models.
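As a sketch of that idea, here is how an absurd value like a negative salary might be caught and repaired before modeling. The data and the median-fill strategy below are purely illustrative; other imputation strategies are equally valid.

```python
import pandas as pd
import numpy as np

# Toy salary data (hypothetical values) with one absurd negative entry
# and one missing entry
salaries = pd.DataFrame({
    "employee": ["a", "b", "c", "d"],
    "salary": [50000.0, -3000.0, 72000.0, np.nan],
})

# Treat impossible values as missing so they can be handled the same way
salaries.loc[salaries["salary"] < 0, "salary"] = np.nan

# One simple strategy: fill the gaps with the median of the valid values
salaries["salary"] = salaries["salary"].fillna(salaries["salary"].median())
```

After this step, the model never sees the impossible negative salary; it sees a plausible stand-in instead.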
Getting Started with Data Pre-processing
Data pre-processing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, and more. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present well-known algorithms for each step of data pre-processing.
Let’s load our chocolate data and explore if it needs any data pre-processing.
# Import the necessary libraries
import pandas as pd
import numpy as np
# seaborn and matplotlib are used for the plots later in the article
import seaborn as sns
import matplotlib.pyplot as plt

# Load the chocolate data - keep the data file in the same folder as your Python code
chocolate_data = pd.read_csv("flavors_of_cacao.csv")

# Have a look at the data
chocolate_data.head()
# Let's have a look at how many values are missing
chocolate_data.isnull().sum()
It seems like we can ignore the one missing value in the Bean Type column, so no imputation (filling in values) is required.
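If that column did need imputation, a common sketch for a categorical column is to fill missing entries with the most frequent value. The bean names below are illustrative, not taken from the dataset.

```python
import pandas as pd
import numpy as np

# Toy categorical column resembling BeanType, with one missing value
beans = pd.Series(["Criollo", "Trinitario", np.nan, "Criollo"])

# Mode imputation: fill gaps with the most common category
filled = beans.fillna(beans.mode()[0])
```

For numeric columns, the mean or median would play the same role as the mode does here.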
Let’s pause here and look at the column names in the output above. Specifically, we’re looking at the structure of the dataset:
# Let's have a look at the data and identify object/categorical values and continuous values
chocolate_data.dtypes
Some column names contain \n, which will cause errors during data analysis. Let’s format the column names:
original_col = chocolate_data.columns
new_col = ['Company', 'Species', 'REF', 'ReviewDate', 'CocoaPercent',
           'CompanyLocation', 'Rating', 'BeanType', 'Country']
chocolate_data = chocolate_data.rename(columns=dict(zip(original_col, new_col)))
chocolate_data.head()
The column CocoaPercent contains a % sign, which will also cause errors later. So we need to format this, too.
# Remove the % sign from the CocoaPercent column and convert to a fraction
chocolate_data['CocoaPercent'] = chocolate_data['CocoaPercent'].str.replace('%', '').astype(float) / 100
chocolate_data.head()
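As a quick sanity check on this kind of conversion, a toy version with made-up percentage strings shows the expected 0-1 fractions:

```python
import pandas as pd

# Toy percentage strings (illustrative values, not from the dataset)
pct = pd.Series(["63%", "70%", "100%"])

# Same recipe as above: strip the sign, cast to float, scale to a fraction
as_fraction = pct.str.replace("%", "", regex=False).astype(float) / 100
```

Every converted value should now lie between 0 and 1, which is easy to verify with `as_fraction.between(0, 1).all()`.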
Let’s create a new column, BlendNotBlend. This column will provide information on whether the chocolate is made with a mixture of flavors or is pure. We’ll talk about the reason behind creating this column in the next section.
chocolate_data['BlendNotBlend'] = np.where(
    chocolate_data['Species'].str.lower().str.contains(',|blend|;')
    | (chocolate_data['Country'].str.len() == 1)
    | chocolate_data['Country'].str.lower().str.contains(','),
    1, 0)
chocolate_data.head()
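To see the rule in action, here is the same logic applied to a few made-up rows (the species and countries are illustrative): a comma or the word “blend” in Species, a comma in Country, or a single stray character in Country all flag a bar as a blend.

```python
import pandas as pd
import numpy as np

# Made-up rows to exercise the blend rule
toy = pd.DataFrame({
    "Species": ["Criollo", "Criollo, Trinitario", "Blend"],
    "Country": ["Peru", "Peru", "\xa0"],  # '\xa0' is a single stray character
})

blend_flag = np.where(
    toy["Species"].str.lower().str.contains(",|blend|;")
    | (toy["Country"].str.len() == 1)
    | toy["Country"].str.lower().str.contains(","),
    1, 0)
```

Only the first toy row, a single bean with a clean origin, comes out as pure.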
We’ve cleaned and formatted the data. Now we want to see the presentation of this data using some visualization tools and answer the questions we discussed in the introduction.
Data visualization is an integral part of any data science project. Understanding insights using excel spreadsheets or files becomes more difficult when the size of the dataset increases. It’s certainly not fun to scroll up/down to do an analysis. Let’s understand visualization and its importance in machine learning modeling. We’ll also try to explore the chocolate bar dataset using a few of these tools.
Visualize the data
Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data. To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message.
In data visualization, we use different graphs and plots to visualize complex data to ease the discovery of data patterns. How does this visualization help in machine learning modeling, or even before we start modeling?
Importance of Visualization
A CSV file, or the pandas DataFrame built from it, can be really difficult to get insights from just by reading it, whether or not the data is formatted correctly.
According to SAS Data Visualization’s webpage,
Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner — and you can experiment with different scenarios by making slight adjustments.
Data visualization also helps identify areas that need attention, e.g. outliers, which can later impact our machine learning model. It also helps us understand which factors have more impact on the results: for example, in house price prediction, the price is influenced more by the size of the house than by its style.
Visualization doesn’t just help before the modeling but even after it, too. For instance, it could help in identifying different clusters in a dataset, which is obviously very difficult to see through just simple files, without having proper visualization.
Visualization impacts modeling in many ways, but it’s especially handy in the EDA (Exploratory Data Analysis) phase, where you try to understand patterns in the data. For this particular exercise, we’ll visualize the distribution of chocolate bar data using some popular techniques.
The chocolate bar dataset has different kinds of values — Categorical and Continuous/Numeric. We’ll only be focusing on visualizing the distribution of continuous variables. Let’s jump into plotting.
1. Histogram Plot
A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable).
The main question here is: which data should we pick, and whose distribution should we check? After reading the above definition, one might say, “Oh! Except for object or categorical variables/values, we can plot a histogram for anything.” That’s a valid point, but are we certain that all continuous values tell a meaningful story?
Let’s start with the Rating column.
# Let's see the distribution of the Rating column
sns.distplot(chocolate_data['Rating'], kde=False)
plt.show()
The number of chocolates receiving each rating is counted and plotted. The bars are displayed next to each other because the variable being measured, on the x-axis, is continuous. What’s the story behind this plot? We can see that around 390 chocolates received a rating of 3.5.
Now, the REF column,
sns.distplot(chocolate_data['REF'], kde=False)
plt.show()
The REF column is a reference number for each review; a higher number corresponds to a more recent review.
The next continuous variable is CocoaPercent. A lot of people like dark chocolates (I don’t), so we want to see the distribution of the darkness included in the chocolates.
sns.distplot(chocolate_data['CocoaPercent'], kde=False)
plt.show()
2. Box Plot
In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles.
Box plots give an impression of the underlying distribution. But that’s what Histograms do, too. Then why do we need box plots? In histograms, when you compare many distributions, they do not overlay well and take up a lot of space to show them side-by-side.
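Concretely, the box in a box plot is drawn from the quartiles. A toy computation with made-up ratings (using numpy's default linear interpolation between data points):

```python
import numpy as np

# Made-up ratings, already sorted for readability
ratings = np.array([2.0, 2.75, 3.0, 3.25, 3.5, 3.5, 4.0, 5.0])

# Q1, median, and Q3 define the box; the whiskers typically extend to
# the most extreme points within 1.5 * IQR of the box
q1, median, q3 = np.percentile(ratings, [25, 50, 75])
iqr = q3 - q1  # interquartile range
```

Anything outside the whiskers is drawn as an individual point, which is why box plots are so handy for spotting outliers.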
Here we’re going to create a box plot for chocolate manufacturing facilities and the ratings given by customers.
# Look at a boxplot over the countries, blends included
fig, ax = plt.subplots(figsize=[6, 16])
sns.boxplot(data=chocolate_data, y='Country', x='Rating')
ax.set_title('Boxplot, Rating for countries (+blends)')
In the above plot, you can clearly see the ratings given to chocolate bars for each individual country. This visualization can help us understand the distribution of ratings throughout the dataset according to each country and further help in finding which country has more popularity than others.
It also explains which country is more profitable to the sellers and potential regions to target. We can further calculate the average rating and sort the data before box plotting. But for this post, we aren’t going into too many details here.
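As a sketch of that extension, seaborn's boxplot accepts an order parameter, which we can fill with countries sorted by their mean rating. The toy data below is illustrative, not from the dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up ratings per country
toy = pd.DataFrame({
    "Country": ["Peru", "Peru", "Ghana", "Ghana", "Brazil"],
    "Rating": [3.5, 4.0, 2.75, 3.0, 3.25],
})

# Sort countries by their mean rating, highest first,
# and pass that ordering to the plot
order = (toy.groupby("Country")["Rating"].mean()
            .sort_values(ascending=False)
            .index.tolist())
ax = sns.boxplot(data=toy, y="Country", x="Rating", order=order)
```

With the real dataset, the same two lines of groupby/sort would put the highest-rated countries at the top of the figure.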
3. Violin plot
Recently I came across violin plots, and yes, they do resemble the instrument. Let’s see what they can tell us about the data.
A violin plot is a method of plotting numeric data. It is similar to a box plot with a rotated kernel density plot on each side.
Pretty complicated, right? In order to simplify this, let’s try and plot in steps.
Remember how earlier we created a column BlendNotBlend. Well here, we’re going to use it. We’re going to see how Blended or Pure chocolate did by comparing the ratings received.
1. Box plot (a small one, unlike the box plot above): The plot below shows that blended chocolate did better than pure chocolate. So, from the data, it seems more people like chocolate with a mixture of different flavors.
ax = sns.boxplot(data=chocolate_data, x='BlendNotBlend', y='Rating')
ax.set_title('Boxplot, Rating by Blend/Pure')
2. KDE (kernel density plot): Let’s try and plot the same thing using a KDE plot.
Blended = chocolate_data.loc[chocolate_data.BlendNotBlend == 1]
NotBlended = chocolate_data.loc[chocolate_data.BlendNotBlend == 0]
ax = sns.kdeplot(Blended.Rating, shade=True, shade_lowest=False, label="Blend")
ax = sns.kdeplot(NotBlended.Rating, shade=True, shade_lowest=False, label="Pure")
A KDE is a non-parametric method to estimate the probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate: a rectangle represents each observation, and a bin’s stack grows taller the more observations fall into it.
So the above plot covers the area of observations/column values and gets bigger with more data points. The rationale behind this is that each value can be thought of as being representative of a greater number of observations. We can sum all of the kernels to give a smoothed distribution.
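That summing of kernels can be sketched directly in numpy: each observation contributes one Gaussian bump, and the normalized sum integrates to 1. The bandwidth and ratings below are made up for illustration.

```python
import numpy as np

def gaussian_kde_1d(x_grid, data, bandwidth=0.2):
    # One Gaussian kernel per observation, evaluated on the grid and summed
    diffs = (x_grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    # Normalize so the density integrates to 1
    return kernels.sum(axis=1) / (len(data) * bandwidth)

ratings = np.array([2.75, 3.0, 3.25, 3.5, 3.5, 3.75, 4.0])
grid = np.linspace(0.0, 6.0, 2001)
density = gaussian_kde_1d(grid, ratings)
```

A smaller bandwidth produces a wigglier curve that hugs individual points; a larger one smooths them into one broad bump, which is exactly the trade-off seaborn's kdeplot makes internally.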
3. Violin Plot: We will now put together the box plot and KDE plot.
ax = sns.violinplot(x="BlendNotBlend", y="Rating", data=chocolate_data, hue="BlendNotBlend")
The violin plot shows a clear smooth curve: the combination of the box plot and the KDE plot. In the above plot, you can easily see that the “Blend” violin covers a larger area of ratings, i.e. blends received more reviews than pure bars, spread across a wider range of ratings. The benefit of this plot is that there’s no need to read a lot of separate plots to make sense of the data.
Throughout this post, we’ve explored how data preprocessing and data visualization can impact the complex machine learning model building phase. We learned about different data pre-processing techniques and tried out a few on the chocolate bar dataset.
With respect to this data, imagine we want to learn more about the distribution of current and future ratings/reviews so that companies can improve their production and strategy of making bars. If we don’t handle missing values or correct the incorrect/corrupted data, this will result in inaccurate decision making during the modeling phase.
We also explored a few data visualization tools and discussed how visualization can impact modeling itself. Each visualization tool has its own significance in storytelling, and it’s important to understand which ones can be used with particular types of data.
I hope you enjoyed reading this article and now you are ready to implement all these various techniques. Good Luck.