GroupBy Basics
Let me show you one of the most useful features in pandas: GroupBy. Think of it as "split, apply, combine." You split the data into groups, apply an operation to each group, then combine the results into a single object.
GroupBy One Column
The simplest GroupBy operation:
```python
import pandas as pd

# Five employees across three departments
data = {'Department': ['Sales', 'Sales', 'HR', 'HR', 'Engineering'],
        'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'Salary': [50000, 60000, 55000, 65000, 80000]}
df = pd.DataFrame(data)

# Mean salary per department
print(df.groupby('Department')['Salary'].mean())
```
This groups by Department, selects Salary, and calculates the mean for each group. It's like a pivot table in Excel but way more flexible.
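To make the "split, apply, combine" idea concrete, here is a rough sketch that does the same calculation by hand. The names `manual_means`, `manual`, and `builtin` are just illustrative; in practice you would use the one-liner.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'Engineering'],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Salary': [50000, 60000, 55000, 65000, 80000],
})

# Split: iterating over a GroupBy yields (group key, sub-DataFrame) pairs.
# Apply: compute the mean salary of each sub-DataFrame.
manual_means = {}
for department, group in df.groupby('Department'):
    manual_means[department] = group['Salary'].mean()

# Combine: gather the per-group results into a single Series.
manual = pd.Series(manual_means, name='Salary')
manual.index.name = 'Department'

# The built-in one-liner, for comparison.
builtin = df.groupby('Department')['Salary'].mean()

print(manual)
print(builtin)
```

Both printouts should show the same per-department means. The loop is only there to show what `groupby()` is doing for you behind the scenes.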
Multiple Aggregations
Want more than just the mean?
```python
print(df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count']))
```
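The `agg()` method lets you apply multiple functions at once, and each function becomes a column in the result. One function that confused me at first was `count()`: it counts non-null values in each group, not total rows. If you want the number of rows including missing values, use `size()` instead.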
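The difference between counting non-null values and counting rows only shows up when data is missing, so here is a small sketch with one salary deliberately set to `NaN`. The missing value is made up for illustration. `size()` is the GroupBy method that counts every row in a group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second Sales salary is missing (NaN).
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'Engineering'],
    'Salary': [50000, np.nan, 55000, 65000, 80000],
})

grouped = df.groupby('Department')['Salary']

# count() skips NaN values; size() counts every row in the group.
summary = pd.DataFrame({
    'count': grouped.count(),
    'size': grouped.size(),
})
print(summary)
```

For Sales, `count` is 1 while `size` is 2, because one of the two rows has no salary. The other departments have no missing values, so both columns agree.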
GroupBy Multiple Columns
Need to group by more than one column? Just pass a list:
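```python
# A new dataset with a second grouping column, Level
data = {'Department': ['Sales', 'Sales', 'HR', 'HR'],
        'Level': ['Senior', 'Junior', 'Senior', 'Junior'],
        'Salary': [60000, 50000, 65000, 55000]}
df = pd.DataFrame(data)

# Mean salary for each (Department, Level) pair
print(df.groupby(['Department', 'Level'])['Salary'].mean())
```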
The result is a Series with a hierarchical (multi-level) index, which pandas calls a `MultiIndex`. Think of it as grouping by Department first, then by Level within each department.
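A multi-level index can feel awkward at first, so here is a short sketch of three common ways to work with the grouped result. The variable names `means`, `wide`, and `flat` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR'],
    'Level': ['Senior', 'Junior', 'Senior', 'Junior'],
    'Salary': [60000, 50000, 65000, 55000],
})

means = df.groupby(['Department', 'Level'])['Salary'].mean()

# Look up one group with a (Department, Level) tuple.
print(means.loc[('HR', 'Senior')])

# Pivot the inner level (Level) into columns for a wide table.
wide = means.unstack()
print(wide)

# Or flatten the index back into ordinary columns.
flat = means.reset_index()
print(flat)
```

`unstack()` is handy when you want to compare levels side by side, while `reset_index()` gives you a plain DataFrame that is easier to export or merge.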
Key Takeaways
- `groupby()` splits data into groups based on column values
- Chain with aggregation functions like `mean()`, `sum()`, `count()`
- `agg()` applies multiple functions at once
- Group by multiple columns using a list