
Descriptive Statistics

Mean, median, mode, standard deviation — summarizing data.

Before you build models or run tests, you need to understand your data at a basic level. Descriptive statistics summarize your data into a few key numbers. Think of them as the vital signs of your dataset: quick, essential diagnostics.

Measures of Central Tendency

These tell you where the "center" of your data is. Mean, median, and mode each answer this question differently — and the right one depends on your data.


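import numpy as np
from scipy import stats

salaries = [30000, 35000, 40000, 42000, 150000]

print(f"Mean: {np.mean(salaries)}")      # 59400.0
print(f"Median: {np.median(salaries)}")  # 40000.0

# 2 and 3 both appear twice; stats.mode returns only one of the tied values (here 2).
# The keepdims argument requires SciPy 1.9 or later.
print(f"Mode: {stats.mode([1, 2, 2, 3, 3], keepdims=False).mode}")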

The mean is sensitive to outliers: the single $150K salary pulls the mean up to 59,400, higher than four of the five salaries, while the median stays at 40,000. The median is more robust. Always check both.
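
To make the outlier effect concrete, here is a minimal sketch that recomputes both measures with and without the largest salary. It uses Python's standard-library statistics module rather than NumPy, and the variable names (without_outlier, mean_trimmed, and so on) are illustrative assumptions, not part of the original example.

```python
import statistics

salaries = [30000, 35000, 40000, 42000, 150000]
without_outlier = salaries[:-1]  # same list minus the 150000 salary

mean_all = statistics.mean(salaries)
mean_trimmed = statistics.mean(without_outlier)
median_all = statistics.median(salaries)
median_trimmed = statistics.median(without_outlier)

# The mean shifts far more than the median when the outlier is dropped
print(f"Mean:   {mean_all} with outlier, {mean_trimmed} without")
print(f"Median: {median_all} with outlier, {median_trimmed} without")

# multimode (Python 3.8+) returns every most-common value, not just one
modes = statistics.multimode([1, 2, 2, 3, 3])
print(f"Modes: {modes}")
```

Dropping one value moves the mean by more than 20,000 but the median by only 2,500. The last line also shows that a dataset can have more than one mode, which a single-value mode function hides.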

Measures of Spread

Knowing the center isn't enough — you need to know how spread out the data is. Standard deviation and variance measure this.


import numpy as np

data = [10, 12, 14, 15, 18, 22, 25]

print(f"Range: {max(data) - min(data)}")
print(f"Variance: {np.var(data, ddof=1):.2f}")
print(f"Std Dev: {np.std(data, ddof=1):.2f}")
    

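Setting ddof=1 applies Bessel's correction: the sum of squared deviations is divided by n - 1 instead of n, which gives the sample variance. NumPy's default is ddof=0, which gives the population variance. When your data is a sample from a larger population, as in most data science work, use ddof=1.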
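
The difference between the two divisors is easier to see when the variance is computed by hand. The sketch below is an illustration only: it uses the same data as above, checks the results against the standard-library statistics module instead of NumPy, and the variable names are assumptions made for this example.

```python
import math
import statistics

data = [10, 12, 14, 15, 18, 22, 25]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
squared_devs = sum((x - mean) ** 2 for x in data)

population_var = squared_devs / n       # divisor n, like ddof=0
sample_var = squared_devs / (n - 1)     # divisor n - 1, like ddof=1
sample_std = math.sqrt(sample_var)

print(f"Population variance: {population_var:.2f}")
print(f"Sample variance:     {sample_var:.2f}")
print(f"Sample std dev:      {sample_std:.2f}")

# Cross-check against the standard library's implementations
assert math.isclose(population_var, statistics.pvariance(data))
assert math.isclose(sample_var, statistics.variance(data))
```

Because n - 1 is smaller than n, the sample variance is always the larger of the two, and the gap shrinks as the dataset grows.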

Percentiles and Quartiles

Percentiles tell you the value below which a certain percentage of data falls. Quartiles split your data into four equal parts.


import numpy as np

# Seeded generator so the numbers are reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
scores = rng.normal(75, 10, 1000)

# A single call returns all three quartiles
q1, median, q3 = np.percentile(scores, [25, 50, 75])

print(f"25th percentile: {q1:.1f}")
print(f"50th percentile (median): {median:.1f}")
print(f"75th percentile: {q3:.1f}")

iqr = q3 - q1
print(f"IQR: {iqr:.1f}")

The Interquartile Range (IQR) is the distance between Q1 and Q3, that is, the width of the middle 50% of your data. It's used to detect outliers: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered unusual.
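
As a rough sketch of that rule, the example below applies the 1.5 * IQR fences to the salary data from earlier. It uses statistics.quantiles from the standard library (Python 3.8+) with method="inclusive", which interpolates the same way as NumPy's default percentile method. The names lower_fence and upper_fence are assumptions made for this example.

```python
import statistics

salaries = [30000, 35000, 40000, 42000, 150000]

# n=4 returns the three quartile cut points: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# Anything outside these fences is flagged as a potential outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

Only the 150000 salary falls outside the fences. The 1.5 multiplier is a convention rather than a strict test, so flagged values are candidates for a closer look, not automatic deletions.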

Key Takeaways

  • Mean is sensitive to outliers — always check the median too
  • Standard deviation measures how spread out data is from the mean
  • IQR measures the spread of the middle 50% of your data and is the basis for the 1.5 * IQR outlier rule
  • Use ddof=1 for sample variance and standard deviation