Missing Data Representation
Learn the different ways in which missing data is represented in pandas.
Introduction
Dealing with missing data is an essential aspect of data analysis. The data we receive is often incomplete, with missing values that need to be managed. Given that missing data can significantly affect the outcomes of our analysis or models, it’s important that we know how to work with missing values so that their negative impact is minimized.
Over the next few lessons, we’ll discover how to leverage the robust methods in pandas to represent, detect, analyze, and manage missing data.
Representation of missing data
Let's start by exploring how missing data is represented and displayed in pandas.
General representations
The two common missing data representations in pandas are NaN (an acronym for not a number) and None. Although NaN is considered the default missing value indicator for reasons of computational speed and convenience, it’s important to understand both representations because they have some key differences in their underlying data types.
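To make the data type difference concrete, here is a minimal sketch. The variable names (nan_type, numeric, mixed) are illustrative only, and the example assumes NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# NaN is a floating-point value; None is Python's own null object.
nan_type = type(np.nan)
none_type = type(None)
print(nan_type, none_type)

# In a numeric Series, pandas converts None to NaN and uses a float dtype.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)

# With an explicit object dtype, None is stored as-is.
mixed = pd.Series(["a", None, "c"], dtype=object)
print(mixed.iloc[1] is None)

# isna() flags both representations as missing.
print(numeric.isna().tolist())
print(mixed.isna().tolist())
```

Requesting dtype=object in the second Series is a deliberate choice for this sketch: it keeps None visible instead of letting pandas pick a dtype for us.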
Here are some details about each missing data representation:
NaN:
- A special floating-point value from NumPy that specifically represents missing numerical data.
- The default missing value marker in pandas for real or floating-point values. It is based on the IEEE 754 floating-point standard.
- It’s of the floating-point type (rather than a Python object like None).
- NaN is contagious in computations, which means that almost any operation involving NaN will also result in NaN. For example, if we perform an arithmetic operation such as addition or multiplication with NaN and another number, the result is NaN. This phenomenon is also known as the propagation of NaN in mathematical operations, which will be discussed in the next lesson.
The following code shows two ways we can generate NaN values:
    import numpy as np

    # Method 1 - using numpy
    print(np.nan)
    print('=' * 10)

    # Method 2 - using Python's in-built float function
    print(float('nan'))
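As a brief illustration of the points about NaN, the sketch below checks its type and shows how it propagates through simple arithmetic. The variable names are hypothetical, and math.isnan is used here only as one convenient way to confirm the result.

```python
import math
import numpy as np

value = np.nan

# NaN is an ordinary Python float, not a special object.
value_type = type(value)
print(value_type)

# Arithmetic with NaN yields NaN: the missing value propagates.
total = value + 1
product = value * 0
print(total, product)

# Confirm that both results are NaN.
total_is_nan = math.isnan(total)
product_is_nan = math.isnan(product)
print(total_is_nan, product_is_nan)
```

Note that even multiplying by zero does not clear the NaN, which is why it needs to be detected and handled explicitly.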