Accessors and Operations
Learn the accessors and operations for handling sparse arrays.
Introduction
Now that we’ve seen how sparse data can be represented as SparseArray objects in pandas, let’s look at the accessors and operations we can apply to these sparse arrays. We’ll use a sparse dataset of movie ratings, scored from 1 to 5 by different viewers, where NaN means that the viewer hasn’t rated the movie yet:
Movie Ratings By Viewers
Viewer | Movie 1 | Movie 2 | Movie 3 | Movie 4 | Movie 5 | Movie 6 |
Viewer 1 | NaN | 3.0 | NaN | 5.0 | 3.0 | NaN |
Viewer 2 | NaN | NaN | 3.0 | NaN | NaN | 3.0 |
Viewer 3 | 2.0 | 1.0 | 1.0 | NaN | NaN | 1.0 |
Viewer 4 | 5.0 | NaN | NaN | NaN | NaN | 5.0 |
Viewer 5 | NaN | NaN | NaN | 2.0 | NaN | NaN |
Viewer 6 | 2.0 | NaN | NaN | NaN | NaN | NaN |
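For readers who want to follow along, the table above can be rebuilt as a regular (dense) DataFrame. This is only a minimal sketch: the variable names `ratings`, `viewers`, and `df` are assumptions chosen for illustration, and the values are copied directly from the table.

```python
import numpy as np
import pandas as pd

# Ratings copied from the table above, one list per movie (column).
# NaN marks a movie that the viewer has not rated yet.
ratings = {
    "Movie 1": [np.nan, np.nan, 2.0, 5.0, np.nan, 2.0],
    "Movie 2": [3.0, np.nan, 1.0, np.nan, np.nan, np.nan],
    "Movie 3": [np.nan, 3.0, 1.0, np.nan, np.nan, np.nan],
    "Movie 4": [5.0, np.nan, np.nan, np.nan, 2.0, np.nan],
    "Movie 5": [3.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    "Movie 6": [np.nan, 3.0, 1.0, 5.0, np.nan, np.nan],
}

# Row labels "Viewer 1" through "Viewer 6"
viewers = [f"Viewer {i}" for i in range(1, 7)]

df = pd.DataFrame(ratings, index=viewers)
print(df)
```

Only 13 of the 36 cells hold a rating, which is what makes this dataset a reasonable candidate for a sparse representation.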
Accessors
A Series or DataFrame backed by SparseArray data supports the .sparse accessor for sparse-specific methods and attributes. It’s similar to the other accessors we have seen before, such as .str for string data and .dt for datetime data. First, let’s convert the original DataFrame into a fully sparse representation:
# Convert df to sparse representation for all columns
df_sparse = df.copy()
for col in df_sparse.columns:
    df_sparse[col] = pd.arrays.SparseArray(df_sparse[col])

# View dtypes
print(df_sparse.dtypes)
We can then use the .sparse accessor to find attributes, such as fill and non-fill values of a SparseArray and the density of a DataFrame (i.e., the proportion of non-fill values).
# Get fill value of a DataFrame column
print('Fill value of Movie 1 col:', df_sparse['Movie 1'].sparse.fill_value)

# Get non-fill values of a DataFrame column
print('Non-fill values of Movie 1 col:', df_sparse['Movie 1'].sparse.sp_values)

# Get density of Sparse DataFrame
print('Density:', df_sparse.sparse.density)
In the example above, the fill_value and sp_values attributes are read at the column level, from a Series whose values are a SparseArray (i.e., an array with SparseDtype). The density attribute, on the other hand, comes from the DataFrame.sparse accessor, which pandas provides in addition to the Series-level one, so it describes the entire sparse DataFrame.
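To make the meaning of density concrete, the sketch below compares the value reported by the accessor with a manual count of non-fill cells. It uses a small hypothetical frame (`df_small`) instead of the full ratings data, and it assumes NaN as the fill value.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the ratings data
df_small = pd.DataFrame({
    "Movie 1": [np.nan, 2.0, 5.0, np.nan],
    "Movie 2": [3.0, np.nan, np.nan, np.nan],
})

# Convert every column to a sparse dtype with NaN as the fill value
df_small_sparse = df_small.astype(pd.SparseDtype("float64", np.nan))

# Column level: density of each sparse column via Series.sparse
col_density = {
    col: df_small_sparse[col].sparse.density
    for col in df_small_sparse.columns
}

# Frame level: density of the whole frame via DataFrame.sparse
frame_density = df_small_sparse.sparse.density

# Manual check: stored (non-NaN) cells divided by all cells
manual_density = df_small.notna().sum().sum() / df_small.size

print("Per-column density:", col_density)
print("Frame density:", frame_density)
print("Manual density:", manual_density)
```

Here three of the eight cells are stored, so both the accessor and the manual count give 0.375.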
The DataFrame.sparse accessor also lets us perform conversions to other formats. For instance, the following code shows how to convert a sparse DataFrame into a sparse SciPy COO (Coordinate Format) matrix:
# Replace NaN with 0 first; otherwise, NaN would be stored as an explicit non-fill value
df_sparse = df.fillna(0)

# Ensure every sparse array has a fill value of 0, which to_coo() requires
for col in df_sparse.columns:
    df_sparse[col] = pd.arrays.SparseArray(df_sparse[col], fill_value=0)

# Convert to SciPy COO matrix
coo_matrix = df_sparse.sparse.to_coo()
print(f'SciPy COO matrix:\n{coo_matrix}\n')
The COO representation is a sparse matrix format for efficiently storing ...
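As a rough illustration of what a COO matrix holds, the sketch below prints the (row, column, value) triplets of a small converted frame and then rebuilds a sparse DataFrame from it. The frame `dense` is hypothetical, it uses 0 rather than NaN for missing ratings, and SciPy must be installed.

```python
import pandas as pd

# Hypothetical 3x2 ratings frame where 0 means "not rated"
dense = pd.DataFrame({
    "Movie 1": [0.0, 2.0, 5.0],
    "Movie 2": [3.0, 0.0, 0.0],
})

# Sparse representation with a fill value of 0, as to_coo() requires
sparse_df = dense.astype(pd.SparseDtype("float64", 0.0))
coo = sparse_df.sparse.to_coo()

# A COO matrix keeps three parallel arrays: row indices, column indices, values
print("Shape:", coo.shape, "| stored values:", coo.nnz)
for r, c, v in zip(coo.row, coo.col, coo.data):
    print(f"row={int(r)}, col={int(c)}, value={v}")

# Round trip: rebuild a sparse DataFrame from the SciPy matrix
round_trip = pd.DataFrame.sparse.from_spmatrix(coo, columns=dense.columns)
print(round_trip.sparse.to_dense())
```

Only the three non-zero ratings are stored, and converting back with `from_spmatrix` followed by `.sparse.to_dense()` recovers the original values.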