Linear Regression

Predicting continuous values with lines.

Linear regression is the hello world of machine learning. It finds the best-fitting straight line through your data. Simple, interpretable, and surprisingly powerful for many real-world problems.

How It Works

The algorithm finds the line that minimizes the sum of squared differences between predicted and actual values (equivalently, their average). These differences are called residuals, and the method is called Ordinary Least Squares (OLS).


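from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Seed NumPy's global RNG so the run is reproducible
np.random.seed(42)

# Synthetic data: one feature, true slope 3, true intercept 7, Gaussian noise
X = np.random.uniform(1, 10, (50, 1))
y = 3 * X.squeeze() + 7 + np.random.normal(0, 2, 50)

# Hold out 20% of the rows to evaluate on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)

# coef_ holds one slope per feature; intercept_ is a single number
print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
# score() returns R-squared on the data passed to it
print(f"R-squared: {model.score(X_test, y_test):.4f}")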

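The coefficient is the slope: how much y changes for each one-unit increase in x. The intercept is the predicted y at x = 0, where the line crosses the y-axis. Together they define the prediction equation y = coefficient * x + intercept. Because the data was generated with slope 3 and intercept 7, the fitted values should land close to those numbers.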
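For a single feature, the OLS slope and intercept have a closed-form solution, which makes it easier to see what the fit computes. The sketch below is a minimal illustration, not how scikit-learn is implemented internally. The data is generated the same way as above (with a different random generator), and the `predict` helper is a hypothetical name used only here.

```python
import numpy as np

# Synthetic data generated the same way as above: true slope 3, true intercept 7.
rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 50)
y = 3 * x + 7 + rng.normal(0, 2, 50)

# Closed-form OLS for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_centered = x - x.mean()
y_centered = y - y.mean()
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

# Cross-check against NumPy's degree-1 least-squares fit.
# np.polyfit returns coefficients from highest degree to lowest.
ref_slope, ref_intercept = np.polyfit(x, y, 1)

def predict(x_new):
    """Hypothetical helper: apply the prediction equation y = slope * x + intercept."""
    return slope * x_new + intercept

print(f"Slope: {slope:.2f} (np.polyfit: {ref_slope:.2f})")
print(f"Intercept: {intercept:.2f} (np.polyfit: {ref_intercept:.2f})")
print(f"Prediction at x = 5: {predict(5.0):.2f}")
```

Both routes minimize the same sum of squared residuals, so they should agree up to floating-point error.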

Evaluating the Model

R-squared tells you what fraction of the variance in y the model explains, but it says nothing about how large the errors are in absolute terms. Look at error metrics as well to understand prediction accuracy.


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Same synthetic data and split as in the first example
np.random.seed(42)
X = np.random.uniform(1, 10, (50, 1))
y = 3 * X.squeeze() + 7 + np.random.normal(0, 2, 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows only
preds = model.predict(X_test)

# RMSE is the square root of the mean squared error
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds)):.2f}")
print(f"MAE: {mean_absolute_error(y_test, preds):.2f}")

RMSE (Root Mean Squared Error) squares each error before averaging, so it penalizes large errors more heavily than MAE does. MAE (Mean Absolute Error) is the average size of the errors, ignoring their sign. Both are expressed in the same units as your target variable.
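The difference between the two metrics is easiest to see on a toy case. The sketch below uses made-up numbers and two hypothetical helper functions, `rmse` and `mae`, written with plain NumPy. Both sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, take the root."""
    errors = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average the absolute errors."""
    errors = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(errors)))

# Made-up values for illustration only.
y_true = np.array([10.0, 10.0, 10.0, 10.0])
even_preds = np.array([11.0, 11.0, 11.0, 11.0])   # four errors of size 1
spiky_preds = np.array([10.0, 10.0, 10.0, 14.0])  # one error of size 4

print(mae(y_true, even_preds), rmse(y_true, even_preds))    # 1.0 1.0
print(mae(y_true, spiky_preds), rmse(y_true, spiky_preds))  # 1.0 2.0
```

MAE is identical in both cases, while RMSE doubles when the error is concentrated in one prediction. A large gap between RMSE and MAE on real data is therefore a hint that a few predictions are far off.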

Assumptions and Limitations

Linear regression makes several assumptions. When they are violated, the model can give misleading results.

Linearity — the relationship between X and y is linear
Independence — observations are not correlated
Homoscedasticity — constant variance of residuals
Normality — residuals are normally distributed
No multicollinearity — predictors are not highly correlated with each other

If your data has strong nonlinear relationships, consider polynomial regression, decision trees, or other nonlinear models.
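As a rough illustration of a violated linearity assumption, the sketch below fits a straight line and a degree-2 polynomial regression to data that follows a curve. The quadratic dataset is an assumption made up for this example, and both models are scored on the training data for brevity, so the numbers overstate how well they would generalize.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with a U-shaped relationship: y = x^2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = X.squeeze() ** 2 + rng.normal(0, 0.5, 200)

# A straight line cannot follow the curve.
linear = LinearRegression().fit(X, y)

# Polynomial regression: expand x into [1, x, x^2], then fit a linear model
# on the expanded features.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
quadratic_r2 = quadratic.score(X, y)

print(f"Straight line R-squared: {linear_r2:.3f}")
print(f"Degree-2 polynomial R-squared: {quadratic_r2:.3f}")
```

The straight line explains almost none of the variance here, while the polynomial fit explains most of it. Polynomial regression is still linear regression under the hood: it is linear in the coefficients, only the features are transformed.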

Key Takeaways

Linear regression finds the best-fitting straight line through data
Coefficients tell you the direction and magnitude of relationships
R-squared measures how well the model explains variance in the data
Check assumptions before trusting the results
