Linear Regression
Linear regression is the hello world of machine learning. It finds the best-fitting straight line through your data. Simple, interpretable, and surprisingly powerful for many real-world problems.
How It Works
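The algorithm finds the line that minimizes the sum of squared differences between predictions and actual values, which is equivalent to minimizing their average. These differences are called residuals, and the method is called Ordinary Least Squares (OLS).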
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic data: y = 3x + 7 plus Gaussian noise (standard deviation 2)
np.random.seed(42)
X = np.random.uniform(1, 10, (50, 1))
y = 3 * X.squeeze() + 7 + np.random.normal(0, 2, 50)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"R-squared: {model.score(X_test, y_test):.4f}")
```
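The coefficient is the slope: how much the predicted y changes for a one-unit increase in x. The intercept is the predicted y at x = 0, where the line crosses the y-axis. Together they define the prediction equation y = coefficient * x + intercept. Because the synthetic data was generated as y = 3x + 7 plus noise, the fitted values should land close to 3 and 7.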
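To make the least-squares fit less of a black box, here is a minimal sketch that computes the slope and intercept of a one-feature model directly with NumPy and compares them with `np.polyfit`. The helper name `fit_line` and the tiny noise-free dataset are illustrative assumptions, not part of the tutorial's own code.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit for a single feature (hypothetical helper)."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: makes the line pass through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Illustrative data lying exactly on y = 3x + 7
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3 * x + 7

slope, intercept = fit_line(x, y)
print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")

# np.polyfit with degree 1 solves the same least-squares problem
ref_slope, ref_intercept = np.polyfit(x, y, 1)
print(np.isclose(slope, ref_slope), np.isclose(intercept, ref_intercept))
```

With several features the same idea generalizes to a matrix problem, which is the case a library such as scikit-learn handles for you.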
Evaluating the Model
R-squared tells you what proportion of the variance in y the model explains, but it says nothing about how large the errors are. Look at error metrics as well to understand prediction accuracy.
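As a rough sketch of what "variance explained" means, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the sample numbers below are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot (hypothetical helper)."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

# Made-up actual values and predictions
y_true = np.array([10.0, 13.0, 16.0, 19.0, 22.0])
y_pred = np.array([11.0, 12.0, 16.0, 20.0, 21.0])
print(f"R-squared: {r_squared(y_true, y_pred):.4f}")

# Predicting the mean for every point explains none of the variance
baseline = np.full_like(y_true, y_true.mean())
print(f"Baseline R-squared: {r_squared(y_true, baseline):.4f}")
```

A model that does worse than always predicting the mean gets a negative value, so R-squared on test data is not bounded below by zero.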
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Same synthetic data as before: y = 3x + 7 plus Gaussian noise
np.random.seed(42)
X = np.random.uniform(1, 10, (50, 1))
y = 3 * X.squeeze() + 7 + np.random.normal(0, 2, 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test split
preds = model.predict(X_test)

# RMSE is the square root of the mean squared error
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds)):.2f}")
print(f"MAE: {mean_absolute_error(y_test, preds):.2f}")
```
RMSE (Root Mean Squared Error) squares each error before averaging, so it penalizes large errors more heavily than MAE does. MAE (Mean Absolute Error) is the average size of the errors, ignoring their sign. Both are expressed in the same units as your target variable.
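The difference between the two metrics is easiest to see on a toy case. The sketch below uses two made-up sets of errors with the same total absolute error; the helper names `rmse` and `mae` are illustrative, not scikit-learn functions.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error from an array of prediction errors."""
    return np.sqrt(np.mean(errors ** 2))

def mae(errors):
    """Mean absolute error from an array of prediction errors."""
    return np.mean(np.abs(errors))

even_errors = np.array([1.0, 1.0, 1.0, 1.0])   # four small misses
spiky_errors = np.array([0.0, 0.0, 0.0, 4.0])  # one large miss

for name, errors in [("even", even_errors), ("spiky", spiky_errors)]:
    print(f"{name}: MAE={mae(errors):.2f}, RMSE={rmse(errors):.2f}")
# even: MAE=1.00, RMSE=1.00
# spiky: MAE=1.00, RMSE=2.00
```

MAE treats both cases the same, while RMSE doubles when the error is concentrated in a single prediction.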
Assumptions and Limitations
Linear regression makes several assumptions. When they are violated, the model can give misleading results.
- Linearity: the relationship between X and y is linear
- Independence: observations, and therefore their errors, are independent of one another
- Homoscedasticity: the residuals have constant variance across the range of X
- Normality: the residuals are normally distributed (this matters mainly for confidence intervals and p-values)
- No multicollinearity: predictors are not highly correlated with each other
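Some of these assumptions can be spot-checked numerically from the residuals. The sketch below is only a rough illustration on synthetic data that deliberately violates homoscedasticity. Comparing the two halves split at the median is a crude heuristic chosen here for brevity, not a formal statistical test, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
# Noise whose spread grows with x, deliberately violating homoscedasticity
y = 3 * x + 7 + rng.normal(0, 0.5 * x)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With an intercept in the model, least-squares residuals average to about zero
print(f"Mean residual: {residuals.mean():.4f}")

# Crude homoscedasticity check: compare residual spread for small vs large x
low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]
print(f"Residual std (low x):  {low.std():.2f}")
print(f"Residual std (high x): {high.std():.2f}")
```

A clearly larger spread in one half suggests non-constant variance. Plotting residuals against x or against the predictions is the more common way to look for this pattern.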
If your data has strong nonlinear relationships, consider polynomial regression, decision trees, or other nonlinear models.
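As one possible illustration of the polynomial option, the sketch below fits a straight line and a degree-2 polynomial to synthetic data with a curved relationship. The data, the choice of degree 2, and the variable names are assumptions made for this example. Scores are computed on the training data for brevity, so they overstate how well either model would generalize.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (100, 1))
y = X.squeeze() ** 2 + rng.normal(0, 0.5, 100)  # curved relationship plus noise

# Plain straight-line fit
linear = LinearRegression().fit(X, y)

# Degree-2 polynomial features followed by ordinary linear regression
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
poly_r2 = poly.score(X, y)
print(f"Straight line R-squared: {linear_r2:.3f}")
print(f"Degree-2 fit R-squared:  {poly_r2:.3f}")
```

Polynomial regression is still linear regression under the hood: the model stays linear in its coefficients, and only the input features are expanded.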
Key Takeaways
- Linear regression finds the best-fitting straight line through data
- Coefficients tell you the direction and magnitude of relationships
- R-squared measures how well the model explains variance in the data
- Check assumptions before trusting the results