Correlation and Regression
Correlation tells you whether two variables move together, and how strongly. Regression goes further: it estimates how much one variable changes when another changes, which lets you make predictions. Together, they're the foundation of predictive modeling.
Correlation
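The correlation coefficient (Pearson's r) ranges from -1 (the variables move in perfectly opposite directions) to 1 (they move perfectly together). Zero means no linear relationship, although the variables may still be related in a nonlinear way. And remember: correlation does not imply causation.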
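import pandas as pd
import numpy as np
np.random.seed(42)  # fix the seed so the example is reproducible
# 50 students with study hours drawn uniformly between 1 and 10
df = pd.DataFrame({
    "study_hours": np.random.uniform(1, 10, 50),
})
# Scores rise by 8 points per study hour, plus normally distributed noise (sd = 5)
df["exam_score"] = df["study_hours"] * 8 + np.random.normal(0, 5, 50)
# Series.corr computes the Pearson correlation by default
corr = df["study_hours"].corr(df["exam_score"])
print(f"Correlation: {corr:.4f}")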
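A correlation close to 1, such as 0.95, means the variables are strongly positively related: as study hours increase, exam scores tend to increase too. The exact value printed here depends on the random draw, but because the scores are generated as a straight-line function of study hours plus modest noise, it comes out well above 0.9.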
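To make the definition concrete, here is a minimal sketch that computes Pearson's r by hand and applies it to two made-up relationships. The helper `pearson_r` and the arrays `linear` and `curved` are illustrative names, not part of the example above. The point it tries to show is that a coefficient near zero only rules out a straight-line relationship, not a relationship of any kind.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.linspace(-3, 3, 61)  # values symmetric around zero
linear = 2 * x + 1          # an exact straight line
curved = x ** 2             # fully determined by x, but not linear

print(f"linear: {pearson_r(x, linear):.4f}")
print(f"curved: {pearson_r(x, curved):.4f}")
print(f"numpy check: {np.corrcoef(x, linear)[0, 1]:.4f}")
```

The straight line gives a coefficient of 1, while the parabola gives a value that is effectively zero even though `curved` is completely determined by `x`.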
Simple Linear Regression
Linear regression fits a straight line through your data. The "best" line is the one that minimizes the sum of squared vertical distances (residuals) between the predicted and actual values, which is why the method is called least squares.
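from scipy import stats
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.3, 5.8, 8.1, 9.9, 12.0, 14.2, 15.8, 18.1, 20.0])
# linregress returns the least-squares slope and intercept, the correlation r,
# a p-value for the null hypothesis of zero slope, and the slope's standard error
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
# For a single predictor, squaring r gives R-squared
print(f"R-squared: {r_value**2:.4f}")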
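The slope tells you how much y changes, on average, for each one-unit increase in x; here it is about 1.99, so y rises by roughly 2 per unit of x. The intercept is the predicted value of y when x is 0. The R-squared value tells you what proportion of the variance in y is explained by x, from 0 (none) to 1 (all of it).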
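As a rough check on what the slope, intercept, and R-squared mean, the sketch below recomputes them with the textbook closed-form least-squares formulas and compares the results with `stats.linregress`. It reuses the same `x` and `y` values. The intermediate names (`dx`, `ss_res`, `ss_tot`) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.3, 5.8, 8.1, 9.9, 12.0, 14.2, 15.8, 18.1, 20.0])

# Closed-form least-squares estimates for a single predictor
dx = x - x.mean()
slope = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
intercept = y.mean() - slope * x.mean()

# R-squared: 1 minus (unexplained variation / total variation)
predicted = intercept + slope * x
ss_res = ((y - predicted) ** 2).sum()  # squared residuals around the line
ss_tot = ((y - y.mean()) ** 2).sum()   # squared deviations around the mean
r_squared = 1 - ss_res / ss_tot

result = stats.linregress(x, y)
print(f"by hand: slope={slope:.4f}, intercept={intercept:.4f}, R^2={r_squared:.4f}")
print(f"scipy:   slope={result.slope:.4f}, intercept={result.intercept:.4f}, "
      f"R^2={result.rvalue ** 2:.4f}")
```

Both lines should print the same numbers, which suggests that `linregress` is doing nothing more exotic than ordinary least squares on one predictor.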
Multiple Regression
Real-world outcomes usually depend on multiple factors. Multiple regression lets you include several predictors in one model.
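import pandas as pd
import statsmodels.api as sm
# Toy housing data: two predictors (sqft, bedrooms) and one outcome (price)
df = pd.DataFrame({
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [1, 2, 3, 3, 4],
    "price": [200000, 300000, 400000, 450000, 550000]
})
# statsmodels does not add an intercept automatically, so add a constant column
X = sm.add_constant(df[["sqft", "bedrooms"]])
# Fit ordinary least squares: price ~ const + sqft + bedrooms
model = sm.OLS(df["price"], X).fit()
print(model.summary())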
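The summary output reports the coefficients, their p-values, and R-squared. Each coefficient estimates how much price changes for a one-unit increase in that predictor while the other predictors are held constant. With only five rows and three estimated parameters (the constant plus two predictors), this example leaves just two residual degrees of freedom, so treat its p-values as an illustration of the output rather than as reliable evidence; statsmodels may also warn that some diagnostic tests need more observations.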
Key Takeaways
- Correlation measures the strength and direction of linear relationships, not causation
- R-squared tells you how much variance the model explains
- Multiple regression handles several predictors at once
- Always visualize data before fitting a regression line
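For the last takeaway, a quick scatter plot with the fitted line drawn over it is usually enough to spot curvature or outliers that a single coefficient would hide. Here is a minimal sketch using matplotlib with the data from the simple regression example. The output filename `scatter_check.png` is a placeholder chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.3, 5.8, 8.1, 9.9, 12.0, 14.2, 15.8, 18.1, 20.0])
fit = stats.linregress(x, y)

fig, ax = plt.subplots()
ax.scatter(x, y, label="observed")  # raw data points
ax.plot(x, fit.intercept + fit.slope * x, color="tab:red", label="fitted line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("scatter_check.png")  # placeholder filename
```

If the points bend away from the line or a few of them sit far from the rest, a straight-line fit is probably not the right summary of the data.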