Removing Duplicates

Finding and dropping duplicate rows.

Handling Duplicates

Let me show you how to find and remove duplicate rows. Duplicate data is everywhere, and cleaning it up is essential for accurate analysis.

Finding Duplicates

Use `duplicated()` to see which rows are duplicates:


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

print(df.duplicated())

This returns a boolean Series that is True for each row that duplicates an earlier row. By default, the first occurrence is marked False, since it is treated as the "original."
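To see the flagged rows themselves rather than just True and False, you can use the result as a filter. The sketch below reuses the same sample data; the variable names `mask`, `repeats`, and `all_copies` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
})

# Default behaviour (keep='first'): only the later copies are flagged.
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, True]

# Boolean indexing shows the flagged rows (index 2 and 4).
repeats = df[mask]
print(repeats)

# keep=False flags every row that has a twin, including the first occurrence,
# which is handy when you want to inspect all copies side by side.
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

Looking at `all_copies` before dropping anything is a cheap way to check that the rows really are identical in the way you expect.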

Counting Duplicates

How many duplicates do you have?


print(df.duplicated().sum())

Removing Duplicates

Use `drop_duplicates()` to remove them:


df_clean = df.drop_duplicates()
print(df_clean)

This keeps the first occurrence and removes the rest. But what if you want to keep the last one? Or check only specific columns?
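Before changing the defaults, it may help to confirm what the plain call does to the data. This small sketch, using the same sample data, shows that `drop_duplicates()` returns a new DataFrame, leaves `df` untouched, and keeps the original index labels. Renumbering with `reset_index(drop=True)` is an optional extra step, shown here as one possible follow-up.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
})

df_clean = df.drop_duplicates()

# The original DataFrame is unchanged; the result is a new object.
print(len(df), len(df_clean))   # 5 3

# Surviving rows keep their original index labels, so there are gaps.
print(df_clean.index.tolist())  # [0, 1, 3]

# Optional: renumber the rows from 0 after cleaning.
df_renumbered = df_clean.reset_index(drop=True)
print(df_renumbered.index.tolist())  # [0, 1, 2]
```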

The keep Parameter

The `keep` parameter controls which occurrence is retained:


df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)

`keep='first'` (the default) keeps the first occurrence of each duplicated row. `keep='last'` keeps the last. `keep=False` drops every row that has a duplicate, including the first occurrence. To compare only certain columns instead of whole rows, pass their names to the `subset` parameter.
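Here is a rough sketch of how `subset` and `keep` combine. The data is a made-up variation of the earlier example in which Alice appears with two different ages, so whole-row and name-only comparisons give different results. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: the two Alice rows differ in Age.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 26, 35, 30],
})

# Whole-row comparison: only the second Bob row is an exact duplicate.
full = df.drop_duplicates()
print(full.index.tolist())         # [0, 1, 2, 3]

# Compare on Name only: the second Alice row is now dropped as well.
by_name = df.drop_duplicates(subset=['Name'])
print(by_name.index.tolist())      # [0, 1, 3]

# Same comparison, but keep the last row for each name.
latest = df.drop_duplicates(subset=['Name'], keep='last')
print(latest.index.tolist())       # [2, 3, 4]

# keep=False drops every row whose Name appears more than once.
unique_only = df.drop_duplicates(subset=['Name'], keep=False)
print(unique_only.index.tolist())  # [3]
```

Choosing `subset` is a judgment call about what counts as "the same record" in your data. Here, matching on Name alone treats both Alice rows as one person, which may or may not be what you want.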

Key Takeaways

`duplicated()` marks duplicate rows as True
`drop_duplicates()` removes duplicates
`keep` parameter controls which duplicate to keep
`subset` parameter checks only specific columns for duplicates
