Pandas GroupBy
So you have a messy dataset and want to split it into groups, do something with each group, and combine the results. That is exactly what groupby does. Think of it like a split-apply-combine workflow.
Basic GroupBy
The simplest form of groupby groups rows by a column and applies an aggregation function. Here is how it works:
import pandas as pd
sales = pd.DataFrame({
'region': ['North', 'South', 'North', 'South', 'North'],
'product': ['A', 'B', 'A', 'B', 'A'],
'revenue': [100, 200, 150, 250, 120]
})
print(sales.groupby('region')['revenue'].sum())
print(sales.groupby('region')['revenue'].mean())
Multiple Aggregations
You can apply multiple aggregation functions at once using the agg method. This gives you a rich summary of your grouped data.
print(sales.groupby('region')['revenue'].agg(['sum', 'mean', 'count']))
print(sales.groupby(['region', 'product']).agg(
total_revenue=('revenue', 'sum'),
avg_revenue=('revenue', 'mean'),
transaction_count=('revenue', 'count')
))
Transform
Transform returns a result the same size as the input. This is useful for adding group-level calculations as new columns without collapsing the data.
sales['region_avg'] = sales.groupby('region')['revenue'].transform('mean')
sales['deviation'] = sales['revenue'] - sales['region_avg']
print(sales)
Filter
Filter keeps or drops entire groups based on a condition. This is useful for focusing on specific subsets of your data.
high_revenue = sales.groupby('region').filter(lambda x: x['revenue'].sum() > 300)
print(high_revenue)
Try it Yourself →
Key Takeaways
- GroupBy follows a split-apply-combine pattern
- Use agg for multiple summary statistics at once
- Use transform to add group-level calculations without collapsing rows
- Use filter to keep only groups that meet a condition