Exploratory Data Analysis (EDA)
EDA is where you put everything together. It's the systematic process of investigating a dataset to discover patterns, spot anomalies, and form hypotheses. Think of it as detective work: you keep questioning the data until the patterns emerge.
The EDA Checklist
Most EDA follows roughly the same steps. Once you know this workflow, you have a reliable starting point for any new dataset.
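import pandas as pd
# Load the dataset (assumes titanic.csv is in the working directory)
df = pd.read_csv("titanic.csv")
# Size of the data as (rows, columns)
print(f"Shape: {df.shape}")
# Data type of each column
print(f"\nColumn types:\n{df.dtypes}")
# Number of missing values per column
print(f"\nMissing values:\n{df.isnull().sum()}")
# Summary statistics for the numeric columns
print(f"\nBasic stats:\n{df.describe()}")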
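Start with the basics: how much data do you have, what types are the columns, and what's missing? These three questions set the stage for everything else. The describe() call then adds a quick numeric summary (count, mean, standard deviation, min, quartiles, max) on top.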
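The three questions above can also be answered in a single table. The following is a minimal sketch, not part of the original checklist: the `overview` helper is a hypothetical name, and the small `toy` DataFrame uses made-up values so the block runs without titanic.csv.

```python
import pandas as pd

# Toy data standing in for a real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "sex": ["male", "female", "female", None],
})

def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype and missing-value info."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),                       # column type as text
        "n_missing": df.isnull().sum(),                       # count of missing cells
        "pct_missing": (df.isnull().mean() * 100).round(1),   # share of missing cells
        "n_unique": df.nunique(),                             # distinct non-missing values
    })

summary = overview(toy)
print(summary)
```

Seeing the missing share as a percentage next to the type often makes it easier to decide which columns need attention first. On a real dataset you would call `overview(df)` on the loaded frame instead of `toy`.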
Univariate Analysis
Look at each variable individually. Distributions tell you about the shape, central tendency, and spread of your data.
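import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("titanic.csv")
# One figure with a 2x2 grid of axes, one panel per variable
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Continuous variable: histogram with a kernel density estimate overlaid
sns.histplot(df["age"].dropna(), kde=True, ax=axes[0, 0])
axes[0, 0].set_title("Age Distribution")
# Categorical variable: number of rows per category
sns.countplot(data=df, x="sex", ax=axes[0, 1])
axes[0, 1].set_title("Gender Count")
# Continuous variable: box plot showing spread and potential outliers
sns.boxplot(data=df, y="fare", ax=axes[1, 0])
axes[1, 0].set_title("Fare Distribution")
# Ordinal variable: number of passengers per class
sns.countplot(data=df, x="pclass", ax=axes[1, 1])
axes[1, 1].set_title("Passenger Class")
plt.tight_layout()
plt.show()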
Use histograms for continuous variables, count plots for categorical variables, and box plots to flag potential outliers. This 2x2 grid gives you a quick overview of four key features: age, sex, fare, and passenger class.
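A box plot flags outliers using a simple rule that you can also compute by hand. With default settings, matplotlib and seaborn draw the whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles and plot anything further away as an individual point. The sketch below applies that rule to a made-up series of fares; the values are illustrative and not taken from the Titanic data.

```python
import pandas as pd

# Made-up fares for illustration; one value is deliberately extreme.
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0], name="fare")

# Quartiles and interquartile range
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are candidate outliers
outliers = fares[(fares < lower) | (fares > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme (such as a very expensive ticket) is a judgment you make from context.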
Bivariate Analysis
Now look at relationships between variables. How does survival relate to passenger class? Does age affect fare?
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("titanic.csv")
# One row of three panels
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Mean of the 0/1 survived column per class, i.e. the survival rate
sns.barplot(data=df, x="pclass", y="survived", ax=axes[0])
axes[0].set_title("Survival Rate by Class")
# Two continuous variables, colored by survival
sns.scatterplot(data=df, x="age", y="fare", hue="survived", ax=axes[1])
axes[1].set_title("Age vs Fare")
# Pairwise correlations between the numeric columns only
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm", ax=axes[2])
axes[2].set_title("Correlation Matrix")
plt.tight_layout()
plt.show()
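The bar plot shows the survival rate in each class: because survived is coded as 0 or 1, its mean per group is the share of survivors, and the error bars show the uncertainty around that mean. Scatter plots reveal relationships between continuous variables. The correlation matrix highlights which numeric variables move together linearly; it will not pick up non-linear relationships, and correlation alone does not show causation.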
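The numbers behind the bar plot and the heatmap can be computed directly, which helps when you want exact values rather than a picture. The sketch below uses a small made-up DataFrame with the same column names as the examples above; the rows are illustrative and are not real Titanic records.

```python
import pandas as pd

# Made-up passengers for illustration only; not real Titanic records.
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
    "fare":     [80.0, 60.0, 30.0, 25.0, 8.0, 7.0, 9.0, 8.0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per class.
rate_by_class = toy.groupby("pclass")["survived"].mean()
print(rate_by_class)

# Pearson correlation (the pandas default) between two numeric columns.
r = toy["pclass"].corr(toy["survived"])
print(f"pclass vs survived: r = {r:.2f}")
```

In this toy data the survival rate falls as the class number rises, so the correlation between pclass and survived comes out negative.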
Key Takeaways
- Always start EDA with shape, types, and missing values
- Univariate analysis examines each variable individually
- Bivariate analysis reveals relationships between variables
- Document your findings as you go, because patterns are easy to forget