Exploratory Data Analysis (EDA)

The complete workflow for analyzing any dataset.

EDA is where you put everything together. It's the systematic process of investigating a dataset to discover patterns, spot anomalies, and form hypotheses. Think of it as being a detective — you interrogate the data until it reveals its secrets.

The EDA Checklist

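Every EDA follows roughly the same steps: get an overview of the data, examine each variable on its own, then examine how variables relate to each other. Once the workflow is familiar, you can apply it to almost any tabular dataset.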


import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv("titanic.csv")

# Rows and columns, column types, missing values per column, numeric summary
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")

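Start with the basics: how much data do you have, what types do the columns hold, and what is missing? These three questions set the stage for everything else. The describe() call adds a quick summary (count, mean, spread, quartiles) of the numeric columns.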
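
Raw missing-value counts are easier to judge as a share of the rows. The sketch below is one possible way to fold column types and missing values into a single table. It uses a tiny made-up DataFrame rather than the real file, and the name overview is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real file; values are illustrative only.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "sex": ["male", "female", "female", None],
})

# One row per column: dtype, missing count, and missing share in percent.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isnull().sum(),
    "missing_pct": (df.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(overview)
```

Sorting by the missing count puts the columns that need the most attention at the top.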

Univariate Analysis

Look at each variable individually. Distributions tell you about the shape, central tendency, and spread of your data.


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.read_csv("titanic.csv")
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

sns.histplot(df["age"].dropna(), kde=True, ax=axes[0, 0])
axes[0, 0].set_title("Age Distribution")

sns.countplot(data=df, x="sex", ax=axes[0, 1])
axes[0, 1].set_title("Gender Count")

sns.boxplot(data=df, y="fare", ax=axes[1, 0])
axes[1, 0].set_title("Fare Distribution")

sns.countplot(data=df, x="pclass", ax=axes[1, 1])
axes[1, 1].set_title("Passenger Class")

plt.tight_layout()
plt.show()

Use histograms for continuous variables, count plots for categorical variables, and box plots to spot outliers. This 2x2 grid gives a quick overview of four key columns: age, sex, fare, and passenger class.
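
A box plot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same check can be sketched numerically, which helps when you want the outlier rows themselves rather than a picture. The fares below are made up for illustration, and the 1.5 multiplier is the usual convention, not a fixed rule.

```python
import pandas as pd

# Made-up fares for illustration; not taken from the real dataset.
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="fare")

# Quartiles and interquartile range
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fare[(fare < lower) | (fare > upper)]

# Median and skewness summarize central tendency and asymmetry
print(f"median={fare.median()}, skew={fare.skew():.2f}")
print(outliers)
```

A large gap between the mean and the median, or a strongly positive skew, is a hint that a few large values dominate the column.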

Bivariate Analysis

Now look at relationships between variables. How does survival relate to passenger class? Does age affect fare?


import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("titanic.csv")
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.barplot(data=df, x="pclass", y="survived", ax=axes[0])
axes[0].set_title("Survival Rate by Class")

sns.scatterplot(data=df, x="age", y="fare", hue="survived", ax=axes[1])
axes[1].set_title("Age vs Fare")

sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm", ax=axes[2])
axes[2].set_title("Correlation Matrix")

plt.tight_layout()
plt.show()

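The bar plot shows the mean of survived for each class; because survived is coded 0 or 1, that mean is the survival rate. The scatter plot reveals relationships between continuous variables, here age and fare. The correlation matrix highlights which numeric variables move together, though it captures only linear relationships.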
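
The numbers behind these plots can also be computed directly. The sketch below assumes the same column names as the plots (pclass, survived, fare) but runs on a handful of made-up rows, so the printed values say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
    "fare":     [80.0, 60.0, 20.0, 15.0, 8.0, 7.5, 9.0, 7.0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per class.
# The count shows how many rows each rate is based on.
rate_by_class = df.groupby("pclass")["survived"].agg(["mean", "count"])
print(rate_by_class)

# Pairwise Pearson correlation, the same matrix the heatmap visualizes.
corr = df.corr()
print(corr.round(2))
```

Keeping the group sizes next to the rates is a deliberate choice: a rate computed from a few rows deserves less trust than one computed from hundreds.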

Key Takeaways

Always start EDA with shape, types, and missing values
Univariate analysis examines each variable individually
Bivariate analysis reveals relationships between variables
Document your findings as you go — patterns are easy to forget
