Descriptive Statistics
Before you build models or run tests, you need to understand your data at a basic level. Descriptive statistics summarize your data in a few key numbers. Think of them as the vital signs of your dataset: quick, essential diagnostics.
Measures of Central Tendency
These tell you where the "center" of your data is. Mean, median, and mode each answer this question differently — and the right one depends on your data.
import numpy as np
from scipy import stats
# Five salaries, one of which is an extreme outlier
salaries = [30000, 35000, 40000, 42000, 150000]
print(f"Mean: {np.mean(salaries)}")
print(f"Median: {np.median(salaries)}")
# 2 and 3 both appear twice; stats.mode returns the smallest of the tied values
print(f"Mode: {stats.mode([1, 2, 2, 3, 3], keepdims=False).mode}")
The mean is sensitive to outliers: the single $150K salary pulls the mean up to $59,400, which is higher than four of the five salaries, while the median stays at $40,000. The median is more robust. Always check both.
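To see how far one outlier moves each measure, the sketch below recomputes both statistics with and without the $150K salary. The names `without_outlier`, `mean_shift`, and `median_shift` are illustrative and do not appear in the original example.

```python
import numpy as np

salaries = [30000, 35000, 40000, 42000, 150000]
without_outlier = salaries[:-1]  # drop the $150K salary (illustrative name)

# How far does each statistic move when the outlier is included?
mean_shift = np.mean(salaries) - np.mean(without_outlier)
median_shift = np.median(salaries) - np.median(without_outlier)

print(f"Mean without outlier:   {np.mean(without_outlier):.0f}")
print(f"Mean with outlier:      {np.mean(salaries):.0f}")
print(f"Median without outlier: {np.median(without_outlier):.0f}")
print(f"Median with outlier:    {np.median(salaries):.0f}")
print(f"Mean shift: {mean_shift:.0f}, median shift: {median_shift:.0f}")
```

The outlier moves the mean by $22,650 but the median by only $2,500.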
Measures of Spread
Knowing the center isn't enough — you need to know how spread out the data is. Standard deviation and variance measure this.
import numpy as np
data = [10, 12, 14, 15, 18, 22, 25]
print(f"Range: {max(data) - min(data)}")
print(f"Variance: {np.var(data, ddof=1):.2f}")
print(f"Std Dev: {np.std(data, ddof=1):.2f}")
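The ddof=1 parameter applies Bessel's correction: the sum of squared deviations is divided by n - 1 instead of n, which gives the sample variance. Without it, NumPy defaults to ddof=0 and returns the population variance. Use ddof=1 whenever your data is a sample drawn from a larger population, which is the usual case in data science work.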
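As a rough illustration of what ddof changes, the sketch below computes both variances by hand and compares them with NumPy's results. The names `squared_devs`, `population_var`, and `sample_var` are chosen here for illustration only.

```python
import numpy as np

data = np.array([10, 12, 14, 15, 18, 22, 25])
n = len(data)

# Sum of squared deviations from the mean
squared_devs = np.sum((data - data.mean()) ** 2)

population_var = squared_devs / n    # ddof=0: divide by n
sample_var = squared_devs / (n - 1)  # ddof=1: divide by n - 1 (Bessel's correction)

print(f"Population variance: {population_var:.2f}")
print(f"Sample variance:     {sample_var:.2f}")

# The hand-computed values should agree with NumPy
print(np.isclose(population_var, np.var(data)))      # NumPy's default is ddof=0
print(np.isclose(sample_var, np.var(data, ddof=1)))
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.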
Percentiles and Quartiles
Percentiles tell you the value below which a certain percentage of data falls. Quartiles split your data into four equal parts.
import numpy as np
# Simulate 1,000 test scores (mean 75, standard deviation 10)
# The seed value is arbitrary; it only makes the output reproducible
rng = np.random.default_rng(42)
scores = rng.normal(75, 10, 1000)
# Compute all three quartiles in one call
q1, median, q3 = np.percentile(scores, [25, 50, 75])
print(f"25th percentile: {q1:.1f}")
print(f"50th percentile (median): {median:.1f}")
print(f"75th percentile: {q3:.1f}")
iqr = q3 - q1
print(f"IQR: {iqr:.1f}")
The Interquartile Range (IQR) is the width of the middle 50% of your data, that is, Q3 - Q1. It's used to detect outliers: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered unusual.
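The 1.5 * IQR rule can be sketched as a small helper. The function name `iqr_outliers` and its multiplier argument `k` are hypothetical, introduced only for this example; the data reuses the salary list from the central tendency section.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Keep only the values that fall outside the two fences
    mask = (values < lower_fence) | (values > upper_fence)
    return values[mask]

salaries = [30000, 35000, 40000, 42000, 150000]
outliers = iqr_outliers(salaries)
print(f"Outliers: {outliers}")
```

Here Q1 is 35,000 and Q3 is 42,000, so the fences sit at 24,500 and 52,500 and only the $150K salary is flagged.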
Key Takeaways
- Mean is sensitive to outliers — always check the median too
- Standard deviation measures how spread out data is from the mean
- IQR measures the spread of the middle 50% of your data
- Use ddof=1 for sample variance and standard deviation
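Most of the numbers in these takeaways can be pulled from a single pandas call. The sketch below assumes pandas is installed and reuses the salary data from earlier; `summary` is an illustrative variable name.

```python
import pandas as pd

salaries = pd.Series([30000, 35000, 40000, 42000, 150000])

# describe() reports count, mean, std, min, quartiles, and max in one call
summary = salaries.describe()
print(summary)

# pandas computes std with ddof=1 by default, i.e. the sample standard deviation
iqr = summary["75%"] - summary["25%"]
print(f"IQR: {iqr:.0f}")
```

Comparing the `mean` and `50%` rows of the output is a quick way to spot the skew caused by an outlier.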