Statistics and Counts
Explore how to gather key statistics and counts from data using Pandas in Python. Understand data types, convert data as needed, examine unique values, apply grouping functions, calculate correlations, and generate percentiles. This lesson helps you use these techniques to better describe and analyze datasets for clearer insights.
In the last chapter, while studying how to read data from CSV files, we used Pandas. Now, we will look deeper at Pandas to process data.
To do so, we will be using the same data set we used when learning about how to read in CSV data. Here are the first 5 rows to refresh your memory.
Gathering statistics on data
A good place to start is just looking at your data using some pandas functions to better understand what issues there might be. Describe will give you counts and some statistics for continuous variables.
For all of our numeric values, we now have the mean, the std, the min, the max, and a few different percentiles.
Note: It is good to remember that the mean value will be influenced more by outliers than the median. Also, you can always square the standard deviation to get the variance.
You may have noticed that some columns are missing. The only columns describe() function fetched for us are the ones that hold numeric data.
Finding the data types
To see what types of data we have, we can use the info() function:
In our dataframe, we have two data types, object and int64. You can think of an object as a string value and int64 as an integer.
Converting data types
If a column doesn’t seem to have the correct type, it is easy to convert it to different types using .to_() functions:
to_numeric()to_datetime()to_string()
For example:
df['numeric_column'] = pd.to_numeric(df['string_column'])
You also get the count of non-null values per column and the memory usage of your dataframe from the info command used above.
Finding unique values
Another useful step is to look at unique values for columns. Here is an example for the relationship column:
Now we can see all the unique values above and can check the counts of unique values for relationships:
This shows us all the unique values for relationships as well as their counts. So, we have a lot of values as husband in the relationship column, but less Other-relative.
Grouping the data
We can also do these types of counts by specific groups by using the groupby() function. This function takes a list of columns by which you would like to group your dataframe. It then performs the requested calculations on each group individually and returns the results by group. Here is an example:
What we did above was group by the variable, relationship, and then perform value counts on the variable label. For these data, label is whether you make more than 50k. We can see above that 55% of husbands make more than 50k. We received percentages because we used the normalize=True parameter.
You can do many types of calculations on groups using Pandas. For example, here we can see the mean hours worked per week by workclass.
From the above, it looks like Federal government workers work more than local workers on average. Never-worked average about 28 hours.
Finding the correlation
Another useful statistic is the correlation. If you need a refresher on correlation, please check out Wikipedia. You can calculate all the pair-wise correlations in your dataframe by using the corr function.
We can quickly see that compared to all of the correlations, there is a higher correlation between “hours per week” and “education num”, but it is not very high. You will notice, though since our label is an object, it isn’t included here. Knowing how variables correlate with our label would be useful, so let’s take care of that:
There seems to be some decent correlation with the label and education num. One thing to note, though, is that our label is categorical, so correlation doesn’t really apply, our groupby frequencies are probably a better method.
Note: Categorical variables are variables with categories with no intrinsic order. For example, gender.
Also, keep in mind, these are just univariate correlations (between one variable) and don’t account for multi-variate effects (between multiple variables). You can also calculate the correlation using the scipy package which has the added benefit of p-values. This was discussed in the “Scipy an External Library” lesson.
Generating percentiles
Lastly, the describe function of Pandas gives some percentiles, but it is easy to add more:
A percentile is the value below which a given percent of the data falls.
We pass the percentile values we want using the percentiles parameter shown above.
In the next lesson, we will take a look at various ways to reshape our dataframe.