Statistics and Counts

Explore how to gather key statistics and counts from data using Pandas in Python. Understand data types, convert data as needed, examine unique values, apply grouping functions, calculate correlations, and generate percentiles. This lesson helps you use these techniques to better describe and analyze datasets for clearer insights.

We'll cover the following...

- Gathering statistics on data
- Finding the data types
  - Converting data types
- Finding unique values
- Grouping the data
- Finding the correlation
- Generating percentiles

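Gathering statistics on data

The describe() function summarizes a dataframe. For every numeric column, it returns the count, the mean, the standard deviation (std), the min, the max, and the 25th, 50th, and 75th percentiles.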

Note: The mean is influenced more by outliers than the median is. Also, you can always square the standard deviation to get the variance.

You may have noticed that some columns are missing. By default, the describe() function only returns the columns that hold numeric data.
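As a minimal sketch of the points above, the following uses a small made-up dataframe (the column names and values are assumptions, not the lesson's dataset) to show the default numeric-only summary, the include="all" option, the effect of an outlier on the mean, and the relationship between std and variance.

```python
import pandas as pd

# Hypothetical sample data; the lesson's real dataset is not reproduced here.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 90],
    "hours_per_week": [40, 45, 40, 60, 20],
    "relationship": ["Husband", "Wife", "Husband", "Own-child", "Unmarried"],
})

# By default, describe() summarizes only the numeric columns.
numeric_summary = df.describe()
print(numeric_summary)

# include="all" also summarizes non-numeric columns (count, unique, top, freq).
full_summary = df.describe(include="all")
print(full_summary)

# The outlier (90) pulls the mean above the median.
mean_age = df["age"].mean()
median_age = df["age"].median()
print(mean_age, median_age)

# Squaring the standard deviation gives the variance.
variance_from_std = df["age"].std() ** 2
print(variance_from_std, df["age"].var())
```

Passing include="all" is one way to bring the missing text columns back into the summary, although most of the numeric statistics are not defined for them.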

Finding the data types

To see what types of data we have, we can use the info() function:

In our dataframe, we have two data types, object and int64. You can think of an object as a string value and int64 as an integer.
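The sketch below shows one way to inspect data types, using a small hypothetical dataframe. The exact name reported for text columns depends on the pandas version: it is commonly object, while newer releases may report a dedicated string type.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "relationship": ["Husband", "Wife", "Own-child"],
})

# info() prints each column's dtype, its non-null count, and the memory usage.
df.info()

# dtypes returns just the type of each column as a Series.
dtypes = df.dtypes
print(dtypes)

# select_dtypes() keeps only the columns of a given kind, here the numeric ones.
numeric_columns = list(df.select_dtypes(include="number").columns)
print(numeric_columns)
```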

Converting data types

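If a column doesn’t seem to have the correct type, it is easy to convert it. Pandas provides top-level pd.to_*() functions for numbers and dates, and the astype() method for other types such as strings:

pd.to_numeric()
pd.to_datetime()
astype(str)

Note that to_string() does not convert a column's type. It renders the whole object as a single printable string.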

For example:

df['numeric_column'] = pd.to_numeric(df['string_column'])
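A slightly fuller sketch of these conversions follows. The dataframe and the date_text column are hypothetical, and errors="coerce" is an optional choice that turns unparseable values into NaN instead of raising an error. For strings, the sketch uses astype(str), since to_string() renders an object as text rather than converting a column's type.

```python
import pandas as pd

# Hypothetical sample data with numbers and dates stored as text.
df = pd.DataFrame({
    "string_column": ["1", "2", "not a number"],
    "date_text": ["2021-01-05", "2021-02-10", "2021-03-15"],
})

# errors="coerce" replaces values that cannot be parsed with NaN.
df["numeric_column"] = pd.to_numeric(df["string_column"], errors="coerce")

# to_datetime() parses text into datetime values.
df["date_column"] = pd.to_datetime(df["date_text"])

# astype(str) converts each value in the column to a string.
df["numeric_as_text"] = df["numeric_column"].astype(str)

print(df.dtypes)
```

Without errors="coerce", the third row would cause to_numeric() to raise a ValueError, which can be the safer default when bad values should not pass silently.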

You also get the count of non-null values per column and the memory usage of your dataframe from the info() function used above.

Finding unique values

Another useful step is to look at unique values for columns. Here is an example for the relationship column:
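A minimal sketch, assuming a relationship column with made-up values, of three common ways to look at unique values: the distinct values themselves, how many there are, and how often each one occurs.

```python
import pandas as pd

# Hypothetical values for the relationship column.
df = pd.DataFrame({
    "relationship": ["Husband", "Wife", "Husband", "Own-child", "Husband", "Wife"],
})

# unique() returns the distinct values in order of first appearance.
unique_values = df["relationship"].unique()
print(unique_values)

# nunique() returns how many distinct values there are.
n_unique = df["relationship"].nunique()
print(n_unique)

# value_counts() returns the frequency of each value, most common first.
counts = df["relationship"].value_counts()
print(counts)
```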

Finding the correlation

There seems to be a decent correlation between the label and education num. One thing to note, though, is that our label is categorical, so correlation doesn’t strictly apply. Our groupby frequencies are probably a better method.

Note: Categorical variables take values from a limited set of categories. Many of them, such as gender, have no intrinsic order.

Also, keep in mind that these are just univariate correlations (each variable is compared with the label on its own) and don’t account for multivariate effects (several variables acting together). You can also calculate the correlation using the scipy package, which has the added benefit of p-values. This was discussed in the “Scipy an External Library” lesson.
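The following sketch illustrates both approaches on made-up data. The column names education_num, age, and label, and the 0/1 encoding of the label, are assumptions for illustration. It computes the Pearson correlation matrix with corr() and then, as a groupby alternative for the categorical label, the share of positive labels at each education level.

```python
import pandas as pd

# Hypothetical data; the label is assumed to be encoded as 0 or 1.
df = pd.DataFrame({
    "education_num": [9, 10, 13, 14, 9, 16, 10, 13],
    "age": [25, 38, 41, 52, 23, 47, 30, 36],
    "label": [0, 0, 1, 1, 0, 1, 0, 1],
})

# corr() computes pairwise Pearson correlations between numeric columns.
corr_matrix = df.corr()
print(corr_matrix)

# Correlation of every other column with the label, strongest first.
label_corr = corr_matrix["label"].drop("label").sort_values(ascending=False)
print(label_corr)

# Groupby alternative: the mean of a 0/1 label is the share of 1s per group.
positive_rate = df.groupby("education_num")["label"].mean()
print(positive_rate)
```

The grouped shares are often easier to interpret for a categorical label than a single correlation coefficient, because they show how the label is distributed at each level.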

Generating percentiles

Lastly, the describe function of Pandas gives some percentiles, but it is easy to add more:
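As a minimal sketch, assuming a single numeric column named age filled with the values 1 to 100, the percentiles argument of describe() adds extra percentiles to the summary, and quantile() returns a single one directly.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 100.
df = pd.DataFrame({"age": list(range(1, 101))})

# percentiles takes fractions between 0 and 1; each becomes a row such as "90%".
summary = df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
print(summary)

# quantile() computes one percentile for a single column.
p90 = df["age"].quantile(0.9)
print(p90)
```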
