Feature Engineering
Raw data rarely gives you the best model performance. Feature engineering is the practice of creating new, more informative features from your existing data. Think of it like turning raw ingredients into a finished meal: the same inputs, prepared so the model can actually use them.
Why Feature Engineering Matters
A well-chosen feature can let a simple model perform well, while noisy or misleading features can drag down even a sophisticated one. Feature engineering skill is often what separates an average data scientist from a great one.
Common Techniques
The example below shows three common moves: combining columns arithmetically, extracting parts of a date, and deriving a binary flag.
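import pandas as pd
import numpy as np
# Small sample dataset
df = pd.DataFrame({
    'price': [100, 200, 150, 300],
    'quantity': [2, 5, 3, 8],
    'date': pd.date_range('2024-01-01', periods=4)
})
# Arithmetic features: a product and a ratio of two existing columns
# (which one is meaningful depends on whether `price` is a unit price or an order total)
df['total'] = df['price'] * df['quantity']
df['price_per_unit'] = df['price'] / df['quantity']
# Date features: day of week (Monday=0 ... Sunday=6) and month number
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
# Binary flag: 1 for Saturday or Sunday, 0 otherwise
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)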
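The date-derived columns above can be bundled into a reusable helper. The following is a minimal sketch: the function name `add_date_features` and the one-week sample are illustrative and not part of the original example.

```python
import pandas as pd


def add_date_features(frame: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Return a copy of `frame` with day-of-week, month, and weekend-flag columns."""
    out = frame.copy()
    out["day_of_week"] = out[date_col].dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = out[date_col].dt.month
    out["is_weekend"] = out["day_of_week"].isin([5, 6]).astype(int)
    return out


# Hypothetical sample: one full week starting Monday 2024-01-01
sample = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=7)})
features = add_date_features(sample)
```

The four-row sample used earlier runs Monday through Thursday, so its `is_weekend` column is all zeros. A full week makes the flag visible on the last two rows.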
Scaling & Normalization
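Many algorithms perform better when features are on the same scale, especially distance-based ones such as k-nearest neighbors and gradient-based ones such as linear models and neural networks. Min-max scaling maps values onto the range 0 to 1. Standardization rescales values to a mean of 0 and a standard deviation of 1.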
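To make the two definitions concrete, here is a minimal NumPy sketch that applies each formula by hand to the `price` values from the earlier example. The variable names are illustrative.

```python
import numpy as np

price = np.array([100.0, 200.0, 150.0, 300.0])

# Min-max scaling: (x - min) / (max - min), which maps the data onto [0, 1]
price_minmax = (price - price.min()) / (price.max() - price.min())

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1.
# np.std defaults to the population standard deviation (ddof=0).
price_standardized = (price - price.mean()) / price.std()
```

Min-max scaling yields 0, 0.5, 0.25 and 1 for these four prices. scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same arithmetic, with `StandardScaler` also using the population standard deviation: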
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardize `price` to mean 0 and standard deviation 1
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']]).ravel()
# Rescale `quantity` onto the range 0 to 1
minmax = MinMaxScaler()
df['quantity_norm'] = minmax.fit_transform(df[['quantity']]).ravel()
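When the data is split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse its statistics on the test portion, so no information from unseen data leaks into preprocessing. The sketch below assumes a hypothetical split of the four prices; the split itself is not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split: the first three prices are training data, the last is unseen
train = np.array([[100.0], [200.0], [150.0]])
test = np.array([[300.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics
```

Because the test value lies outside the training range, its scaled value falls well above the training values. That is expected and is not a sign that the scaler should be refit.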
Encoding Categorical Variables
Most models need numbers, not strings. One-hot encoding creates one binary column per category. Label encoding assigns an integer to each category, which implies an ordering the data may not actually have.
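from sklearn.preprocessing import LabelEncoder
# Illustrative categorical columns (the DataFrame above does not define them)
df['category'] = ['A', 'B', 'A', 'C']
df['color'] = ['red', 'blue', 'green', 'red']
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')  # one binary column per category
le = LabelEncoder()
df['color_label'] = le.fit_transform(df['color'])  # one integer per distinct value, in sorted order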
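To see what each encoder produces, here is a small self-contained sketch. The `demo` DataFrame and its values are hypothetical; only the column names match the snippet above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data; the column names match the snippet above
demo = pd.DataFrame({
    "category": ["A", "B", "A", "C"],
    "color": ["red", "blue", "green", "red"],
})

# One-hot: `category` is replaced by the columns cat_A, cat_B, cat_C
demo_encoded = pd.get_dummies(demo, columns=["category"], prefix="cat")

# Label encoding: classes are sorted, so blue=0, green=1, red=2
le = LabelEncoder()
demo["color_label"] = le.fit_transform(demo["color"])
```

scikit-learn documents `LabelEncoder` as intended for target labels. For feature columns, `OrdinalEncoder` is the counterpart and lets you specify the category order explicitly.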
Key Takeaways
- Feature engineering often matters more than model selection
- Domain knowledge helps you create the most impactful features
- Always scale features for distance-based algorithms
- One-hot encoding is your go-to for categorical variables with no natural order