Handling Duplicates
Let me show you how to find and remove duplicate rows. Duplicate data is everywhere, and cleaning it up is essential for accurate analysis.
Finding Duplicates
Use `duplicated()` to see which rows are duplicates:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)
print(df.duplicated())
This returns a boolean Series that is True for every row that repeats an earlier row. By default, the first occurrence is marked as False; it's the "original."
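Since the result is a boolean Series, you can also use it as a mask to look at the duplicated rows themselves. Here is a minimal sketch using the same sample data; the variable names `mask`, `repeats`, and `all_copies` are just illustrative choices, and `keep=False` is passed to `duplicated()` to flag every copy, including the first one:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
                   'Age': [25, 30, 25, 35, 30]})

# True where a row repeats an earlier row (default keep='first')
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, True]

# Only the rows flagged as repeats
repeats = df[mask]
print(repeats.index.tolist())  # [2, 4]

# Every row that has a twin somewhere, first occurrences included
all_copies = df[df.duplicated(keep=False)]
print(all_copies.index.tolist())  # [0, 1, 2, 4]
```

Looking at `all_copies` is handy when you want to inspect the full group of matching rows before deciding what to drop.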
Counting Duplicates
How many duplicates do you have? Because True counts as 1, summing the result gives the number of rows flagged as duplicates:
print(df.duplicated().sum())
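If you also want to know how often each distinct row appears, one option is to group on all the columns and count the group sizes. This is a sketch with the same sample data, not the only way to do it; `n_duplicates` and `counts` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
                   'Age': [25, 30, 25, 35, 30]})

# Number of rows that repeat an earlier row
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)  # 2

# How many times each distinct (Name, Age) combination appears
counts = df.groupby(['Name', 'Age']).size()
print(counts)
```

Here `counts` shows 2 for Alice and Bob and 1 for Charlie, which tells you which rows are responsible for the duplicates.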
Removing Duplicates
Use `drop_duplicates()` to remove them:
df_clean = df.drop_duplicates()
print(df_clean)
This keeps the first occurrence and removes the rest. The result is a new DataFrame; `df` itself is left unchanged. But what if you want to keep the last one? Or check only specific columns?
The keep Parameter
The `keep` parameter gives you control over which occurrence survives:
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)
`keep='first'` (the default) keeps the first occurrence. `keep='last'` keeps the last. `keep=False` drops every row that has a duplicate, including the first occurrence. One more option that is easy to miss: the `subset` parameter lets you check only specific columns for duplicates.
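To see how `subset` and `keep` interact, here is a small sketch. The data is a hypothetical variation of the earlier example in which Bob appears twice with different ages, so the two Bob rows match on `Name` but not on the full row:

```python
import pandas as pd

# Hypothetical data: the second Bob has a different Age
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
                   'Age': [25, 30, 25, 35, 31]})

# Compare whole rows: only the second Alice is a duplicate
full_row = df.drop_duplicates()
print(full_row.index.tolist())  # [0, 1, 3, 4]

# Compare only the Name column, keeping the first row per name
by_name = df.drop_duplicates(subset=['Name'])
print(by_name.index.tolist())  # [0, 1, 3]

# Same comparison, but keep the last row per name
by_name_last = df.drop_duplicates(subset=['Name'], keep='last')
print(by_name_last.index.tolist())  # [2, 3, 4]

# keep=False drops every name that appears more than once
unique_names_only = df.drop_duplicates(subset=['Name'], keep=False)
print(unique_names_only.index.tolist())  # [3]
```

Notice that the original index labels are preserved, which makes it easy to see which rows survived each call.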
Key Takeaways
- `duplicated()` marks rows that repeat an earlier row as True
- `drop_duplicates()` returns a new DataFrame with the duplicates removed
- The `keep` parameter controls which occurrence is kept: `'first'`, `'last'`, or `False` for none
- The `subset` parameter checks only specific columns for duplicates