Patch the Gaps
Learn how to handle missing data, duplicates, and outliers to make the dataset more reliable.
One of our fundamental challenges as data scientists is working with imperfect data. Imagine trying to solve a jigsaw puzzle, but some pieces are missing, others are duplicated, and a few don’t belong. We could still try to make sense of the picture, but what we end up with might be misleading or incomplete.
That’s what working with messy data feels like.
So far, we’ve explored why data often arrives in imperfect form and how to load and explore it using pandas. Now it’s time for the hands-on work of making it usable. This lesson focuses on three major issues that break down analysis:
Missing values
Duplicates
Outliers
Fixing these issues isn’t just about filling in gaps or deleting data. It’s about making informed, careful decisions that balance completeness with accuracy. By mastering these skills, we ensure the quality and reliability of the datasets that form the backbone of our analyses and predictive models.
Missing data
Missing data can compromise the accuracy of results, particularly when key variables are affected. It is important to identify these gaps and apply suitable imputation strategies.
Identify missing values
We begin by scanning the dataset for missing entries. One of the most effective ways to do this in pandas is combining the isna() method with sum(). This shows how many values are missing in each column, helping us spot potential issues early on.
In the following example, we work with a transaction-level dataset containing fields such as TransactionID, CustomerID, Quantity, TotalAmount, Product, and Date. Our goal is to identify the missing value count in each column.
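A minimal sketch of this check might look like the following. The lesson's actual dataset isn't shown here, so the small DataFrame below is a hypothetical stand-in: the column names follow the fields listed above, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names follow the lesson, values are invented.
df = pd.DataFrame({
    "TransactionID": [1001, 1002, 1003, 1004, 1005],
    "CustomerID": ["C01", "C02", None, "C04", "C05"],
    "Quantity": [2, np.nan, 1, 5, np.nan],
    "TotalAmount": [39.98, 15.50, np.nan, 120.00, 8.75],
    "Product": ["Laptop Sleeve", "Mouse", "Keyboard", None, "USB Cable"],
    "Date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", None, "2024-01-08", "2024-01-09"]
    ),
})

# isna() marks each cell as True (missing) or False (present);
# sum() then counts the True values column by column.
missing_counts = df.isna().sum()
print(missing_counts)
```

Because isna() treats None, NaN, and NaT all as missing, the same call covers text, numeric, and date columns. If proportions are more useful than raw counts, swapping sum() for mean() gives the fraction of missing values in each column.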