Patch the Gaps
Learn how to handle missing data, duplicates, and outliers to make the dataset more reliable.
As data engineers, one of our fundamental challenges is working with imperfect data. Imagine solving a jigsaw puzzle where some pieces are missing, others are duplicated, and a few don’t belong. We could still try to make sense of the picture, but what we end up with might be misleading or incomplete.
That’s what working with messy data feels like.
So far, we’ve explored why data often arrives in imperfect form and how to load and explore it using pandas. Now it’s time for the hands-on work of making it usable. This lesson focuses on three major issues that can disrupt analysis:
Missing values
Duplicates
Outliers
Missing data
Sometimes, data just… isn’t recorded. In logs, events, or customer databases, it’s common to find fields left blank or marked as “null.” Our job is to detect these gaps and patch them wisely.
Identify missing values
We begin by scanning the dataset for missing entries. In pandas, one of the most effective ways to do this is by combining the isna() method with sum(). This shows us how many values are missing in each column, helping us spot potential issues early.
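To see why this combination works, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not part of the lesson's dataset. isna() returns a table of True/False flags, and sum() treats each True as 1, so the result is a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "name": ["Ana", None, "Chen"],
    "score": [90.0, np.nan, np.nan],
})

# Step 1: boolean mask, True wherever a value is missing (None or NaN).
mask = df.isna()

# Step 2: summing a boolean column counts its True values.
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```

In practice the two steps are usually chained as df.isna().sum().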
Missing values often appear during ingestion—think of corrupted CSVs, broken API calls, or optional fields in web forms. Knowing where they’re common helps build preventive checks upstream.
In the following example, we work with a dataset containing transaction-level data, including fields such as TransactionID, CustomerID, Quantity, TotalAmount, Product, and Date. Our goal is to identify the missing value count in each column.
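A sketch of how this could look is shown here. The lesson's actual data is not reproduced, so the rows are hypothetical sample values built in memory. Only the column names come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the real transaction data.
transactions = pd.DataFrame({
    "TransactionID": [1001, 1002, 1003, 1004, 1005],
    "CustomerID": ["C01", None, "C03", "C04", None],
    "Quantity": [2, 1, np.nan, 5, 3],
    "TotalAmount": [39.98, 15.00, 120.50, np.nan, 45.00],
    "Product": ["Mouse", "Cable", "Monitor", None, "Keyboard"],
    "Date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", None, "2024-01-08", "2024-01-09"]
    ),
})

# Count missing values in each column.
missing_counts = transactions.isna().sum()
print(missing_counts)

# Optionally, keep only the columns that have at least one gap.
columns_with_gaps = missing_counts[missing_counts > 0]
print(columns_with_gaps)
```

With real data, the DataFrame would typically come from a loader such as pd.read_csv() rather than being built by hand, and the same isna().sum() call would apply unchanged.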