Data Transformation
Raw data rarely comes in the format you need. Data transformation is about reshaping, aggregating, and combining datasets to prepare them for analysis. Think of it as sculpting — you start with a rough block and carve it into something useful.
Merging Datasets
In the real world, data is spread across multiple tables. Merging (or joining) combines them based on common columns. It's just like SQL joins, but in Python.
import pandas as pd
customers = pd.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"]
})
orders = pd.DataFrame({
"customer_id": [1, 2, 2, 3],
"product": ["A", "B", "A", "C"]
})
merged = pd.merge(customers, orders, left_on="id", right_on="customer_id")
print(merged)
Use left_on and right_on when column names differ. Use how to control the join type: "inner", "left", "right", or "outer".
Grouping and Aggregating
Grouping is one of the most powerful operations in pandas. It splits your data into groups, applies a function to each, and combines the results.
import pandas as pd
sales = pd.DataFrame({
"region": ["East", "West", "East", "West", "East"],
"product": ["A", "A", "B", "B", "A"],
"revenue": [100, 150, 200, 120, 180]
})
result = sales.groupby("region")["revenue"].sum()
print(result)
result2 = sales.groupby(["region", "product"])["revenue"].mean()
print(result2)
You can group by multiple columns and apply different aggregation functions. Common ones: .sum(), .mean(), .count(), .max().
Pivoting Data
Pivoting reshapes data from long format to wide format. Think of it like rotating a table — rows become columns and vice versa.
import pandas as pd
long_data = pd.DataFrame({
"date": ["2024-01", "2024-01", "2024-02", "2024-02"],
"product": ["A", "B", "A", "B"],
"sales": [100, 200, 150, 250]
})
pivoted = long_data.pivot(index="date", columns="product", values="sales")
print(pivoted)
Pivoting is essential when you need to create crosstabs or compare categories side by side. The opposite operation is .melt(), which goes from wide to long format.
Key Takeaways
- Merging combines datasets based on common columns
- GroupBy splits data into groups and applies aggregations
- Pivoting reshapes data from long to wide format
- Always inspect the result after transformation — mistakes are common