Exploratory Data Analysis

Explore how to conduct exploratory data analysis on numeric and categorical variables using histograms, bar charts, scatter plots, and hue mapping. Understand data distribution and relationships essential for preparing regression models with PyCaret.

We’ll now perform EDA on our data. As mentioned earlier, EDA is a method that helps us understand the properties of a dataset by using descriptive statistics and visualization. It is an important part of every machine learning or data science project because we need to understand the dataset before we use it.
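As a quick illustration of the descriptive statistics side of EDA, the sketch below calls the pandas describe() method on a tiny, made-up DataFrame. The toy values are invented for illustration only; they merely borrow the column names used in this lesson and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the course dataset
toy = pd.DataFrame({
    "age": [20, 30, 40],
    "charges": [2000.0, 3000.0, 10000.0],
    "smoker": ["no", "no", "yes"],
})

# Numeric columns: count, mean, std, min, quartiles, and max
num_stats = toy[["age", "charges"]].describe()
print(num_stats)

# A categorical column: count, number of unique values, most frequent value
cat_stats = toy["smoker"].describe()
print(cat_stats)
```

On the real data, the same two calls give a first overview before any plotting.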

Histogram of numeric variables

The distribution of numeric variables can be visualized with histograms, which are easy to create with the hist() method of a pandas DataFrame.

Python 3.5
# Histogram of numeric variables
# Assumes `data` is the DataFrame loaded earlier in the course
import matplotlib.pyplot as plt

numeric = ['age', 'bmi', 'children', 'charges']
data[numeric].hist(bins=20, figsize=(10, 5))
plt.show()

As we can see in the output, some of the variables have right-skewed distributions that may cause problems with regression models, so we’ll have to deal with that later.
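To put a number on what "right-skewed" means, we can compute the sample skewness with the pandas skew() method. The sketch below uses a small invented sample (not the course data) and also shows a log transform, which is one common way to reduce right skew. It is not necessarily the approach taken later in this course.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample, NOT the course dataset
toy = pd.DataFrame({
    "charges": [1200.0, 1500.0, 1800.0, 2100.0, 2500.0, 3000.0, 9000.0, 40000.0]
})

# Positive skewness indicates a long right tail
skew_before = toy["charges"].skew()

# log1p compresses large values, which usually reduces right skew
skew_after = np.log1p(toy["charges"]).skew()

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness close to zero suggests a roughly symmetric distribution, while larger positive values indicate a heavier right tail.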

Bar charts of categorical variables

Bar charts are the standard way of plotting categorical variables. We can create them easily by chaining the pandas value_counts() and plot() methods.

Python 3.5
# Bar charts of categorical variables
categorical = ['smoker', 'sex', 'region']
color = ['C0', 'C1', 'C2', 'C3']
fig, axes = plt.subplots(2, 2, figsize=(9, 7))
axes[1, 1].set_axis_off()  # only three variables, so hide the fourth panel
for ax, col in zip(axes.flatten(), categorical):
    # Count the rows in each category and draw one bar per category
    data[col].value_counts().plot(kind='bar', ax=ax, color=color)
    ax.set_xlabel(col)
plt.tight_layout()
plt.show()

As we can see in the output, the smoker variable is unevenly distributed, with only 20% of people being smokers. On the other hand, the sex and region variables are evenly distributed across their categories.
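Proportions like the share of smokers can also be read off numerically instead of from the bar heights. The sketch below passes normalize=True to value_counts() on an invented ten-row column that was built to contain 20% smokers; it is not the course dataset.

```python
import pandas as pd

# Hypothetical column with 2 smokers out of 10 rows, NOT the course dataset
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "no", "yes", "no", "no", "no", "yes", "no"]
})

# normalize=True returns fractions of the total instead of raw counts
shares = toy["smoker"].value_counts(normalize=True)
print(shares)
```

Running the same call on each categorical column of the real data is a quick way to check for imbalanced categories.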

Numeric and categorical features

The histplot() Seaborn function lets us visualize the relationship between numeric and categorical variables using hue mapping.

Python 3.5
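# Histograms of the target, split by each categorical feature
# (reuses `categorical` from the bar chart example)
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(10, 7))
axes[1, 1].set_axis_off()  # only three features, so hide the fourth panel
for ax, col in zip(axes.flatten(), categorical):
    sns.histplot(data, x='charges', hue=col, multiple='stack', ax=ax)
plt.show()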

In this case, we plot the target variable histogram, colored differently for every category of the smoker, sex, and region variables. Smokers get significantly higher charges compared to non-smokers. This is expected because the health risks associated with smoking are numerous and well-documented.
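A group-wise summary can back up what the stacked histograms show. The sketch below computes the mean and median of charges per smoker category with groupby() on invented values, which were chosen only to mimic the pattern described above and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the course dataset
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "yes"],
    "charges": [2000.0, 3000.0, 4000.0, 20000.0, 30000.0],
})

# Mean and median of the target within each category
summary = toy.groupby("smoker")["charges"].agg(["mean", "median"])
print(summary)
```

Reporting the median alongside the mean is useful here because the target is right-skewed, so a few very large values can pull the mean upward.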

Scatter plots

Scatter plots are a type of visualization that helps us understand the relationship between numeric variables. The pairplot() Seaborn function creates a grid of scatter plots for all pairs of numeric variables in a given dataset.

Python 3.5
# Scatter plots
cols = ['age', 'bmi', 'charges', 'smoker']
sns.pairplot(data[cols], hue='smoker')
plt.show()

The diagonal contains distribution plots of the variables, such as histograms or kernel density estimation (KDE) plots, as in this case. A KDE plot visualizes the probability density of a continuous variable, much like a smoothed histogram. Once again, we use hue mapping to highlight the differences between smokers and non-smokers. As we can see in the output, age is correlated with charges: people get higher charges as they grow older. Despite that, being a non-smoker keeps most people’s costs lower, regardless of their age. Furthermore, people with a high BMI don’t seem to get significantly higher charges, unless they’re also smokers.
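Relationships seen in a scatter plot grid can be complemented with a correlation matrix. The sketch below calls the pandas corr() method, which computes Pearson correlation by default, on invented values in which charges rise with age. The numbers are illustrative only and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the course dataset
toy = pd.DataFrame({
    "age": [19, 25, 31, 40, 47, 52, 58, 63],
    "bmi": [27.9, 33.8, 22.7, 28.9, 25.7, 33.4, 27.7, 29.8],
    "charges": [1700.0, 2600.0, 4400.0, 6200.0, 8200.0, 10800.0, 11900.0, 14000.0],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy[["age", "bmi", "charges"]].corr()
print(corr.round(2))
```

Note that a single coefficient measures only linear association over the whole sample, so it can hide group effects such as the smoker versus non-smoker split that hue mapping makes visible.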