Exploratory Data Analysis

Explore how to conduct exploratory data analysis on numeric and categorical variables using histograms, bar charts, scatter plots, and hue mapping. Understand data distribution and relationships essential for preparing regression models with PyCaret.

We'll cover the following...

Histogram of numeric variables
Bar charts of categorical variables
Numeric and categorical features
Scatter plots

Here, we plot a histogram of the target variable, colored by category for each of the smoker, sex, and region variables. Smokers incur significantly higher charges than non-smokers. This is expected, because the health risks associated with smoking are numerous and well documented.

Scatter plots

Scatter plots are a type of visualization that helps us understand the relationship between numeric variables. The pairplot() Seaborn function creates a grid of scatter plots for all pairs of numeric variables in a given dataset.

The diagonal contains distribution plots of the variables, such as histograms or kernel density estimation (KDE) plots in this case. A KDE plot visualizes the probability density of a continuous variable and is similar to a histogram. Once again, we use hue mapping to highlight the differences between smokers and non-smokers. As we can see in the output, age is correlated with charges: people get higher charges as they grow older. Despite that, being a non-smoker keeps most people’s costs lower, regardless of their age. Furthermore, people with a high BMI don’t seem to get significantly higher charges, unless they’re also smokers.
