by: Raguvir Kunani for Data 100, Summer 2019
This notebook is meant to (very briefly) introduce the concepts of grouping and merging. More importantly, I want to touch on common use cases as well as mistakes I've seen students make. Hope this is helpful for you!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Throughout the notebook, we will be using data collected by the State of Connecticut on SAT scores and participation. You can find the data here.
data_path = './data/SAT_School_Participation_and_Performance__2012-2013.csv'
sat = pd.read_csv(data_path)
# Drop columns we don't need
drop_cols = ['Participation Rate (estimate): Change%', 'Percent Meeting Benchmark: Change%', "Test-takers: Change%"]
sat = sat.drop(labels=drop_cols, axis='columns')
Often, we have data that contains multiple groups within it. For example, the US Census (which primarily serves to count how many people are in the US) contains data about people of different ages, races, and education levels. In cases like this, we might want to draw conclusions about each group within the data separately.
For example, we may ask "How does income differ among people with different education levels?" or "How does unemployment differ among people in different geographical areas?". Questions like these are most easily answered by using grouping.
Grouping is the act of partitioning the elements of our data into groups, based on a certain attribute (or set of attributes) each element has. Census data, for example can be grouped by age, by state, or by income level.
Grouping in `pandas` is syntactically simple. To group a DataFrame `df` by the column `col_name`, we simply write `df.groupby('col_name')`. This divides the rows of `df` into groups based on the value each row has for the column `col_name`. Each unique value of `col_name` gets its own group.
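To make this concrete, here is a minimal sketch on a toy DataFrame (hypothetical districts and test-taker counts, not the SAT data):

```python
import pandas as pd

# Toy data (not the SAT dataset): three hypothetical schools in two districts
df = pd.DataFrame({
    'District': ['A', 'A', 'B'],
    'Test-takers': [100, 200, 50],
})

grouped = df.groupby('District')
print(type(grouped))    # a DataFrameGroupBy object, not a DataFrame
print(grouped.ngroups)  # 2 groups: one per unique district
```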
Let's take a look at our SAT data to see what groups could exist within it.
sat.head()
It looks like each row in `sat` corresponds to one school. Each school belongs to a district, so we can group `sat` by district:
sat.groupby('District')
Notice that the result of `.groupby` is a `DataFrameGroupBy` object, not a `DataFrame`. This is a commonly forgotten property of `.groupby` in `pandas`; remembering it will save you a lot of debugging headaches in the future.
`DataFrameGroupBy` objects have an important method that you will use extensively in this class: `.agg(func)`.

`.agg(func)` aggregates each group with `func`, computing a single value for each column of each group. For example, `.agg(sum)` computes the sum of each column within each group. In the end, you have 1 value for each column of each group.

Note: `func` can be any function, even functions you define. Plugging in custom functions is something this class makes heavy use of, and the most convenient way to use custom functions with `.agg` is lambda functions (I know, your favorite functions).
Another note: `DataFrameGroupBy` objects also have a `.filter(func)` method, which keeps only the groups that satisfy the criterion specified by `func` (all rows of a group that fails the criterion are dropped). You probably won't need `.filter` much in this class, but it's useful in general, so it's worth mentioning here.
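The contrast between `.agg` and `.filter` can be sketched on toy data (hypothetical districts, not the SAT dataset):

```python
import pandas as pd

# Toy data (not the SAT dataset)
df = pd.DataFrame({
    'District': ['A', 'A', 'B', 'B', 'B'],
    'Test-takers': [10, 20, 5, 5, 5],
})

# .agg collapses each group to one value per column
totals = df.groupby('District').agg('sum')

# .filter keeps the original rows, but only from groups that pass the test:
# here, districts whose total test-takers exceed 20
big = df.groupby('District').filter(lambda g: g['Test-takers'].sum() > 20)
```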
Using the `.agg` method

Now that we have grouped `sat` by district, we can draw conclusions about the SAT participation and scores of each district. Let's see what the average number of test-takers in 2012 was for each district.
avg_test_takers_2012 = sat.groupby('District')['Test-takers: 2012'].agg('mean')
# You can also do the same thing with a lambda function
# avg_test_takers_2012 = sat.groupby('District')["Test-takers: 2012"].agg(lambda group: sum(group)/len(group))
avg_test_takers_2012.head()
Let's unpack what just happened by looking at each part of `sat.groupby('District')['Test-takers: 2012'].agg('mean')`:

- `sat.groupby('District')` groups the rows (schools) by district
- `sat.groupby('District')['Test-takers: 2012']` selects only the `'Test-takers: 2012'` column from each group
- `sat.groupby('District')['Test-takers: 2012'].agg('mean')` aggregates the `'Test-takers: 2012'` column within each group by taking its average

Note: The order of aggregating and selecting the desired column(s) does not matter; the only requirement is that you group first. This means that `sat.groupby('District').agg('mean')['Test-takers: 2012']` gives the same output.
Tip: Whenever you want to compute a single value (e.g. average, sum, max, min, etc.) for each group in your data, you should use .agg
with the appropriate aggregating function. In this example, the single value we wanted to compute was the average number of test-takers in 2012.
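The select-then-aggregate versus aggregate-then-select equivalence can be checked on a toy DataFrame (hypothetical numbers, not the SAT data):

```python
import pandas as pd

# Toy data (not the SAT dataset)
df = pd.DataFrame({
    'District': ['A', 'A', 'B'],
    'Test-takers': [100, 200, 50],
})

# Select-then-aggregate and aggregate-then-select give the same Series
a = df.groupby('District')['Test-takers'].agg('mean')
b = df.groupby('District').agg('mean')['Test-takers']
print(a.equals(b))  # True
```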
Quiz: Write code that computes the number of schools in each district. The result should be a Series
where the index is a district and the value is the number of schools in that district.
# This is the answer to the question above. Please attempt to solve the question before looking at the solution;
# it's the best way to learn.
num_schools_per_district = sat.groupby('District')['School'].agg(lambda group: len(group))
num_schools_per_district.head()
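For counting group sizes specifically, `pandas` also has built-in aggregators that give the same answer as the lambda; a sketch on toy data (hypothetical school names):

```python
import pandas as pd

# Toy data (not the SAT dataset)
df = pd.DataFrame({
    'District': ['A', 'A', 'B'],
    'School': ['School 1', 'School 2', 'School 3'],
})

# Three equivalent ways to count schools per district
via_lambda = df.groupby('District')['School'].agg(lambda group: len(group))
via_count = df.groupby('District')['School'].agg('count')
via_size = df.groupby('District').size()
```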
Treating a `DataFrameGroupBy` object like a `DataFrame`

Many students try to see the result of what happens right after grouping by a column. Specifically, a lot of students try to use the `.head()` method of a `DataFrame` to check whether the groups look correct. While a `DataFrameGroupBy` object happens to also have a `.head()` method, it works slightly differently than the `.head()` of a `DataFrame`: it returns the first few rows of each group, not of the whole table.
You won't be so lucky if you try to get the `.shape` of a `DataFrameGroupBy` object: it has no `.shape` attribute at all.
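Both behaviors can be sketched on toy data: `.head(1)` returns the first row of each group, and asking for `.shape` raises an error (hypothetical numbers, not the SAT data):

```python
import pandas as pd

# Toy data (not the SAT dataset)
df = pd.DataFrame({
    'District': ['A', 'A', 'B'],
    'Test-takers': [100, 200, 50],
})
grouped = df.groupby('District')

# .head(n) on a DataFrameGroupBy returns the first n rows *of each group*
first_rows = grouped.head(1)  # one row for district A, one for district B

# Asking for .shape raises an AttributeError
try:
    grouped.shape
except AttributeError as e:
    print(e)
```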
Calling a function on the `DataFrameGroupBy` object instead of using `.agg(func)`

Many students have the right intention of using a lambda function to compute the desired value after grouping, but they often write the lambda function to take the `DataFrameGroupBy` object as its input rather than passing the lambda to `.agg`. Even if you could hack this approach into giving the right answer, the authors of `pandas` have already written and optimized the `.agg` method, and it is much easier to use.
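A sketch of the wrong and right patterns on toy data (hypothetical numbers, not the SAT data):

```python
import pandas as pd

# Toy data (not the SAT dataset)
df = pd.DataFrame({
    'District': ['A', 'A', 'B'],
    'Test-takers': [100, 200, 50],
})
grouped = df.groupby('District')['Test-takers']

# Wrong: calling the lambda directly on the GroupBy object
# (lambda g: sum(g) / len(g))(grouped)   # does not compute per-group means

# Right: hand the lambda to .agg, which applies it to each group separately
means = grouped.agg(lambda g: sum(g) / len(g))
```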
Grouping is best used for categorical or otherwise discrete variables. If the idea of grouping is to split data into groups based on what value each element of the data takes on for a certain variable, it makes sense that we want the variable to take on a limited number of values. Otherwise, we will have too many groups (and each group will be relatively uninformative).
Often, we want to know something about our data that our data doesn't capture. In order to find this extra information, we find another source of data. In the Census example, we might want to understand the demographics of states that tend to vote Republican versus Democratic. To do this, we would have to find information about how each state historically votes.
In order for us to use that extra information to make conclusions, we have to merge our data with the additional data. Merging is the act of combining our data with another separate dataset, where the combination exploits a relationship between one variable in our data and one variable in the other dataset.
`pandas` has 2 different ways of merging `DataFrame`s:

- `pd.merge(left, right, how, left_on, right_on)`: merges `left` with `right` using the join specified by `how`, matching rows where the value of `left_on` equals the value of `right_on`
- `df.merge(right, how, left_on, right_on)`: same as `pd.merge`, with `df` playing the role of `left`
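Here is a minimal sketch of both call styles on toy tables (hypothetical school names, not the real datasets):

```python
import pandas as pd

# Toy tables (hypothetical school names)
left = pd.DataFrame({'School': ['X High School', 'Y High School'],
                     'Test-takers': [120, 80]})
right = pd.DataFrame({'School Name': ['X High School', 'Z High School'],
                      'Index': [88.0, 71.0]})

# Both calls produce the same inner join
m1 = pd.merge(left, right, how='inner', left_on='School', right_on='School Name')
m2 = left.merge(right, how='inner', left_on='School', right_on='School Name')
print(m1.equals(m2))  # True
```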
Let's see some examples of these 2 merging methods in action. To do this, we will merge our SAT data with data about school quality. You can find the school quality data here.
data_path = './data/State_Department_of_Education_Next_Generation_School_Accountability_data_2014-2015.csv'
acc = pd.read_csv(data_path)
# Filtering using boolean arrays
acc = acc[acc["School Name"].str.contains("High School")]
# Data cleaning using regex
acc["School Name"] = acc["School Name"].str.replace("_[0-9]+", "", regex=True)
acc.head()
Using `pd.merge`:

pandas_merge = pd.merge(left=sat, right=acc, how='inner', left_on="School", right_on="School Name")
pandas_merge.head()
Let's unpack what just happened:

- `left=sat` and `right=acc`. This tells `pandas` that the left table is `sat` and the right table is `acc`. We have to designate a left and a right table because of how joining works; in most cases it doesn't matter which table you set to be `left` or `right`, as long as you use the appropriate join.
- `left_on="School"` and `right_on="School Name"`. This tells `pandas` what the relationship between the left and right tables is. In this case, we are saying that the `"School"` column of the left table is related to the `"School Name"` column of the right table.
- `how='inner'`. This tells `pandas` we want to do an inner join, meaning we want our merged `DataFrame` to only contain rows where the value of `left_on` matches a value of `right_on`. Inner joins are the most common type of join, but there are others too. The 'Joins' section of this part of the textbook does a great job explaining the differences between types of joins.

Using `df.merge`:

We can do the exact same merge using `df.merge`. The only difference is that `df` plays the role of `left` from `pd.merge`.
df_merge = sat.merge(right=acc, how='inner', left_on='School', right_on='School Name')
df_merge.head()
We can also confirm that `pandas_merge` and `df_merge` are the same:
pandas_merge.equals(df_merge)
Now that we have a merged table, we can answer more questions about our data than we could have with just our original data.
Let's see whether there seems to be a relationship between the SAT participation rate of a school and their Accountability Index, which aims to measure how well a school prepares a student for success beyond high school.
sns.lmplot(x="Participation Rate (estimate): 2013", y='Accountability Index', data=pandas_merge);
Except for a few outliers, there seems to be a positive relationship between participation rate and Accountability Index. Thus, we have evidence (though not strong) for the claim that in Connecticut, students who took the SAT were more prepared for life after high school than those who didn't.
With the original data, we were unable to make such a claim. But with merging, we made use of information outside our original data to make broader claims.
The most common mistake students make in merging is using the wrong type of join. The only way to avoid this mistake is to really understand what the differences between types of joins are, which is really well described in this section of the textbook.
Tip: In this course (and in general), you will most likely be using an inner join or a left join.
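The difference between those two joins can be sketched on toy tables (hypothetical school names): an inner join keeps only matched rows, while a left join keeps every row of the left table and fills unmatched columns with NaN.

```python
import pandas as pd

# Toy tables (hypothetical school names)
left = pd.DataFrame({'School': ['X High School', 'Y High School'],
                     'Test-takers': [120, 80]})
right = pd.DataFrame({'School Name': ['X High School'],
                      'Index': [88.0]})

# Inner join: only schools present in both tables survive
inner = left.merge(right, how='inner', left_on='School', right_on='School Name')

# Left join: every row of the left table survives; missing matches become NaN
left_join = left.merge(right, how='left', left_on='School', right_on='School Name')
```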
Misunderstanding `left_on` and `right_on`

Many students forget that merging 2 tables is not just attaching one table to the end of another. The reason we merge tables is that there is some relationship between those 2 tables. When we merge, we use that relationship to match rows of one table with rows of the other table. In `pandas`, we specify the relationship between tables using the `left_on` and `right_on` arguments.
In our example, both tables had a column containing the name of the school, so we used those columns to merge the tables.
I really hope this notebook clarified any doubts/misconceptions you may have had about grouping or merging.
If you have any questions or feedback, you can email me (rkunani@berkeley.edu) or post on Piazza!