Data Cleaning
Here's a secret that experienced data scientists know: most of your time goes into cleaning data. The figure usually quoted is 80%, which is a rule of thumb rather than a measurement, but the point stands. Real-world data is messy: missing values, duplicates, inconsistent formatting, outliers. Cleaning it is not glamorous, but it's where the real work happens.
Handling Missing Values
Missing data is the most common problem you'll face. You can drop rows with missing values, fill them with a default, or use more sophisticated imputation methods.
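import pandas as pd
df = pd.read_csv("data.csv")
# Count missing values in each column
print(df.isnull().sum())
# Option 1: drop every row that has at least one missing value
df_dropped = df.dropna()
# Option 2: fill gaps with a per-column default (median age, placeholder city)
df_filled = df.fillna({"age": df["age"].median(), "city": "Unknown"})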
The .isnull().sum() combo shows how many values are missing in each column. Always check this first, because it determines your strategy. Dropping is fine when data is abundant and only a small share of rows is affected; filling is better when every row matters.
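To make the drop-or-fill decision concrete, it can help to look at the share of missing values rather than the raw count. The following is a minimal sketch on a small made-up DataFrame; the column names and values are illustrative assumptions, not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["Oslo", "Lima", None, "Oslo", "Lima"],
})

# Fraction of missing values per column (0.0 to 1.0)
missing_share = df.isnull().mean()
print(missing_share)

# Fill numeric gaps with the median and categorical gaps with a placeholder
df_filled = df.fillna({"age": df["age"].median(), "city": "Unknown"})

# Drop rows only where "age" is missing, ignoring gaps in other columns
df_age_known = df.dropna(subset=["age"])
print(len(df), len(df_age_known))
```

Passing subset to dropna() is a middle ground between the two strategies: rows are removed only when a column you actually need is empty.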
Removing Duplicates
Duplicate rows can skew your analysis. Sometimes they're errors; sometimes they're legitimate. Either way, you need to identify and handle them.
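import pandas as pd
df = pd.read_csv("transactions.csv")
# Count rows that repeat an earlier row in every column
print(df.duplicated().sum())
# Keep the first occurrence of each row and drop the repeats
df_unique = df.drop_duplicates()
print(f"Removed {len(df) - len(df_unique)} duplicates")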
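Use .duplicated() to find duplicates and .drop_duplicates() to remove them. By default, both methods compare all columns and treat the first occurrence as the original. You can restrict the comparison to certain columns with the subset parameter, and choose which occurrence to keep with the keep parameter.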
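The subset and keep parameters are easier to follow on a small example. This sketch uses a made-up transactions table; the order_id and amount columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions; order_id and amount are illustrative columns.
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "amount": [10.0, 10.0, 25.0, 7.5, 8.0],
})

# Rows identical to an earlier row in every column (default behaviour)
exact_dupes = df.duplicated().sum()

# Rows that repeat an order_id, even if other columns differ
id_dupes = df.duplicated(subset=["order_id"]).sum()

# Keep the last row seen for each order_id instead of the first
df_latest = df.drop_duplicates(subset=["order_id"], keep="last")

# keep=False flags every member of a duplicate group, useful for review
to_review = df[df.duplicated(subset=["order_id"], keep=False)]

print(exact_dupes, id_dupes)
print(df_latest)
print(to_review)
```

Inspecting to_review before dropping anything is a cheap way to tell errors from legitimate repeats.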
Fixing Data Types
Pandas sometimes guesses wrong on data types. Numbers end up stored as strings, and dates as generic object columns. These need to be corrected before analysis.
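import pandas as pd
df = pd.read_csv("orders.csv")
# Convert to a numeric dtype; values that cannot be parsed become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# Convert to datetime64; this raises if a value cannot be parsed
df["date"] = pd.to_datetime(df["date"])
# Confirm the resulting column types
print(df.dtypes)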
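The errors="coerce" parameter turns unparseable values into NaN instead of raising an error. This is safer than letting your code crash on bad data, but it also hides the bad values, so count the missing values afterwards. Note that the to_datetime call above does not pass errors="coerce" and will still raise on a malformed date; pass it there too if you want invalid dates converted to NaT.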
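Because coercion is silent, it is worth checking which values were lost in the conversion. The sketch below does this on a made-up orders table; the sample values and the explicit date format are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV (all strings)
df = pd.DataFrame({
    "price": ["19.99", "5", "n/a", "12.50"],
    "date": ["2024-01-05", "2024-02-30", "2024-03-10", "not a date"],
})

# Convert, turning anything unparseable into NaN / NaT
price = pd.to_numeric(df["price"], errors="coerce")
date = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Values that were present before conversion but are missing after it
bad_prices = df.loc[price.isna() & df["price"].notna(), "price"]
bad_dates = df.loc[date.isna() & df["date"].notna(), "date"]
print(bad_prices.tolist())
print(bad_dates.tolist())

# Store the converted columns once the losses have been reviewed
df["price"] = price
df["date"] = date
print(df.dtypes)
```

Converting into separate variables first keeps the original strings available, so the rejected values can be reported or fixed at the source.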
Key Takeaways
- Always check for missing values with .isnull().sum()
- Choose between dropping and filling missing data based on context
- Duplicate removal prevents skewed analysis results
- Correct data types before performing calculations