How to perform EDA for machine learning?

 

Typically, when we start working on a Machine Learning (ML) project, competition, or a simple task, the first step is to process the data we are going to work with, so that we can better understand its content and extract important hidden information and insights. This is usually done with descriptive statistics, which provide the mathematical tools to describe the basic features of the data through formulas and graphs. This process of exploring and investigating the data is referred to as Exploratory Data Analysis (EDA).
Fortunately, many frameworks and libraries have built-in functions to perform descriptive statistics and to produce different kinds of visualisations. In this post, I will only use the basic Python libraries for EDA, which are enough to convey the main idea:
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
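These four libraries are conventionally imported together under standard aliases. Here is a minimal setup sketch; the tiny data set is purely hypothetical, just something concrete that the kind of examples below could run against:

```python
# Conventional import aliases for the four libraries used throughout this post.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # low-level plotting
import seaborn as sns            # statistical visualisation built on matplotlib

# A tiny hypothetical data set to illustrate the basic tooling.
df = pd.DataFrame({"price": [221900, 538000, 180000],
                   "sqft_living": [1180, 2570, 770]})
print(df.shape)  # (3, 2)
```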

The EDA process can be tedious at times, especially when numerous data sets are involved or when both structured and unstructured data are available. So in this post, I will present a comprehensive method, with a detailed thought process, that you can follow to perform EDA on any type of data.

Why do we need EDA for Machine Learning?

The main purpose of EDA is to help you review the data before starting the preprocessing phase of machine learning model creation. It can help identify both quality and tidiness errors. Quality errors are related to the content of the data, e.g. duplicated instances, wrong variable types, and missing values. Tidiness errors are related to the structure of the data, e.g. unnecessary features, multiple data sets that should be combined, or columns that should be combined.
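To make the tidiness category concrete, here is a small sketch using invented data: one logical data set split across two tables, and two variables packed into a single column (the table names and the "bed_bath" column are hypothetical):

```python
import pandas as pd

# Hypothetical tidiness errors: the same observations split across two tables,
# and two variables ("bedrooms" and "bathrooms") packed into one column.
sales_2014 = pd.DataFrame({"price": [221900, 538000], "bed_bath": ["3/1", "3/2"]})
sales_2015 = pd.DataFrame({"price": [604000], "bed_bath": ["4/3"]})

# Fix 1: tables with the same structure belong in one data set.
sales = pd.concat([sales_2014, sales_2015], ignore_index=True)

# Fix 2: one column per variable.
sales[["bedrooms", "bathrooms"]] = (
    sales["bed_bath"].str.split("/", expand=True).astype(int)
)
sales = sales.drop(columns="bed_bath")
print(sales)
```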

Moreover, EDA helps us better understand patterns in the data, detect outliers or unusual events, and discover interesting relationships between variables. This is applicable to both structured and unstructured data.

After completing a thorough EDA, we can then pass to the preprocessing step to clean the data based on the observations we made and prepare it to train an ML model.

Structured & Unstructured Data 

As mentioned above, both types of data can be analyzed using approximately the same process. For structured (tabular) data, the process is direct. However, for unstructured data, we first need to calculate/extract variables (not the kind of feature extraction done with deep learning) that can then be analyzed as a structured data set. This additional work is a natural step, since statistical formulas are calculated on numbers, hence the need for variable extraction.
For instance, if we have a data set of images for a classification task, we can create a data frame that contains the following features for each image: width, height, brightness, contrast, skewness, kurtosis, number of channels, class, etc.
And if we have audio data, we can calculate its spectral properties and use them as a structured data set for EDA.
So, we need to calculate the appropriate variables with respect to the type of data we have. But in the end, the EDA will always be executed on structured data.
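The image example above can be sketched with NumPy alone. The statistics chosen here (brightness as mean intensity, contrast as standard deviation, plus skewness and kurtosis) are one reasonable set, not a fixed standard, and the random arrays merely stand in for real images:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def image_features(img, label):
    """Turn one raw image array into one row of a structured data set."""
    h, w = img.shape[:2]
    channels = img.shape[2] if img.ndim == 3 else 1
    flat = img.astype(float).ravel()
    mean, std = flat.mean(), flat.std()
    centred = flat - mean
    skew = (centred ** 3).mean() / std ** 3          # asymmetry of intensities
    kurt = (centred ** 4).mean() / std ** 4 - 3      # excess kurtosis
    return {"width": w, "height": h, "channels": channels,
            "brightness": mean, "contrast": std,
            "skewness": skew, "kurtosis": kurt, "class": label}

# Two synthetic RGB "images" standing in for a real classification data set.
images = [rng.integers(0, 256, (32, 48, 3)), rng.integers(0, 256, (64, 64, 3))]
labels = ["cat", "dog"]
df = pd.DataFrame([image_features(im, lb) for im, lb in zip(images, labels)])
print(df[["width", "height", "channels", "class"]])
```

From here on, the image data set can be explored exactly like any tabular one.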

EDA Process

There are 3 main steps in the EDA process, which I will present in this section:
  1. Global
  2. Univariate
  3. Multivariate
The first step consists of a general inspection of the data so that we better understand its components (variables and instances), while noting the tidiness issues that should be fixed later on.
Both the second and third steps have two components:
  • Nongraphical
  • Graphical
Nongraphical univariate:
This is the simplest form of data analysis, in which the analyzed data consists of a single variable. Since there is only one variable, no causes or relationships are involved. The main purpose of univariate analysis is to describe the data and discover patterns in it.
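In pandas, this boils down to summary statistics on a single Series. A quick sketch, with an invented price variable (values in thousands) that deliberately contains one extreme observation:

```python
import pandas as pd

# Hypothetical single variable: house prices in thousands.
prices = pd.Series([180, 221, 265, 290, 310, 310, 450, 980], name="price")

# Summary statistics describe the variable without any graph.
print(prices.describe())      # count, mean, std, min, quartiles, max
print(prices.mode().iloc[0])  # most frequent value: 310
print(prices.skew())          # positive: a long right tail (the 980 outlier)
```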

Graphical univariate:
Nongraphical methods cannot provide a complete picture of the data, so a graphical approach is also needed. Common types of univariate charts include:
  • Histograms and bar graphs, where each bar represents the frequency (count) or proportion (count / total count) of a range of values.
  • Box plots, which graph the five-number summary: minimum, first quartile, median, third quartile, and maximum.
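Before plotting, it helps to see the numbers these two charts actually draw. A sketch with NumPy, reusing the same hypothetical prices: `np.histogram` gives the bar heights of a histogram, and `np.percentile` gives the five-number summary a box plot is built from (in practice you would pass the raw values to `plt.hist` or `sns.boxplot`):

```python
import numpy as np

values = np.array([180, 221, 265, 290, 310, 310, 450, 980])  # hypothetical prices

# Histogram: how many values fall into each bin.
counts, edges = np.histogram(values, bins=4)

# Box plot: the five-number summary it draws.
five = np.percentile(values, [0, 25, 50, 75, 100])
print(counts)  # frequency per bin
print(five)    # min, Q1, median, Q3, max
```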
Nongraphical multivariate:
Multivariate data come from multiple variables. Multivariate nongraphical EDA generally shows the relationship between two or more variables in the data. Many techniques can be used to explore the relationships between variables, such as correlation analysis, principal component analysis (PCA), and multivariate analysis of variance (MANOVA).

Graphical multivariate:
Graphical multivariate EDA uses graphs to show the relationship between two or more variables. The most commonly used charts are grouped (clustered) bar charts, in which each group represents the level of one variable and each bar within a group represents the level of another variable.
Other common types of multivariate graphics include:
  • Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are depicted by color.

Global

In this first step, we need to inspect the data set as a whole, meaning that no individual variable inspection is carried out. There are 4 main objectives to keep in mind in this step of the EDA:
  • Discovering missing values.
  • Discovering duplicated instances.
  • Checking variables type.
  • Calculating basic statistics for all the variables.
Furthermore, we should note down, first, a strategy for dealing with the problems discovered within the data set, and second, any remarks and insights that can be helpful during the preprocessing or model creation stage.
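The four objectives above map almost one-to-one onto pandas calls. A sketch on an invented raw table that deliberately contains one duplicated row, one missing value, and a date stored as text:

```python
import pandas as pd

# Hypothetical raw data with the kinds of issues the global pass should surface.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [221900.0, 538000.0, 538000.0, None],
    "date": ["20141013", "20141209", "20141209", "20150225"],  # stored as text
})

print(df.isna().sum())              # 1. missing values per column
print(df.duplicated().sum())        # 2. fully duplicated instances: 1
print(df.dtypes)                    # 3. variable types: "date" should be datetime
print(df.describe(include="all"))   # 4. basic statistics for every variable
```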

Univariate

As the title implies, the focus in this step of the EDA is on each variable individually. The aim of univariate EDA is to provide summary statistics for each feature in the data set, and to use plotting tools to visualize the distribution of each variable so that we gain a deeper understanding of it. In practice, many features are often not important to the task at hand. For instance, the "id" variable isn't relevant for a house price prediction task (see the example below).

Most commonly used visualizations: bar chart, line chart, histogram, area chart, and box plot.

Multivariate

This phase of the EDA is dedicated to inspecting the interactions between 2 or more features in the data set. In the case of an ML classification or regression task, it is always important to investigate the correlation between the target variable and all the other features in the data set.
 
Most commonly used visualizations: scatter plot, stacked univariate visuals, heat map, and bubble chart.
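A common first move in this phase is ranking all features by their correlation with the target. A sketch on synthetic regression data (the feature names and coefficients are invented; in practice `df` would be your real data set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
sqft = rng.uniform(500, 3500, n)
grade = rng.integers(3, 13, n).astype(float)
price = 120 * sqft + 40000 * grade + rng.normal(0, 50000, n)  # target

df = pd.DataFrame({"sqft_living": sqft, "grade": grade, "price": price})

# Correlation of every feature with the target, sorted by strength:
# a quick first look at which variables matter for the regression task.
target_corr = df.corr()["price"].drop("price").sort_values(ascending=False)
print(target_corr)
```

The same Series, passed to `sns.heatmap` or a bar chart, gives the usual visual summary of feature relevance.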

House Sales King County Dataset Example

The EDA for this data set is available in this notebook.
Exploratory Data Analysis
July 19, 2021