
Transfer Learning

Reusing pre-trained models to solve new problems.

Imagine you spent months training a model to recognize cats and dogs. Now you need a model that recognizes horses. Do you start from scratch? Absolutely not! Transfer learning lets you take what your model already learned and apply it to a new, related problem.

Think of it like this: a chef who mastered Italian cooking can pick up French cuisine much faster than someone who's never been in a kitchen. The foundational skills transfer over.

How Transfer Learning Works

Many deep learning models, image classifiers especially, can be split into two main parts: the feature extractor (early layers) and the classifier (final layers). The feature extractor learns general-purpose patterns like edges, textures, and shapes. The classifier learns task-specific decisions.


    Pre-trained Model (e.g., trained on ImageNet)
    ┌──────────────────────────────────────────┐
    │  Early Layers          │  Final Layers   │
    │  ┌───────────────┐     │  ┌───────────┐  │
    │  │ Edges         │     │  │ Cat       │  │
    │  │ Textures      │     │  │ Dog       │  │
    │  │ Shapes        │     │  │ Bird      │  │
    │  │ Patterns      │     │  │ ...       │  │
    │  └───────────────┘     │  └───────────┘  │
    │  Feature Extractor     │  Classifier     │
    │  (KEEP - generalize)   │  (REPLACE)      │
    └──────────────────────────────────────────┘
              │
              ▼ Transfer
    ┌──────────────────────────────────────────┐
    │ Adapted Model (e.g., for medical imaging)│
    │  ┌───────────────┐     │  ┌───────────┐  │
    │  │ Edges         │     │  │ Tumor     │  │
    │  │ Textures      │     │  │ Healthy   │  │
    │  │ Shapes        │     │  │           │  │
    │  │ Patterns      │     │  │           │  │
    │  └───────────────┘     │  └───────────┘  │
    │  (FROZEN weights)      │  (NEW weights)  │
    └──────────────────────────────────────────┘
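
The keep-and-replace idea in the diagram can be sketched in a few lines of PyTorch. This is a minimal illustration, not a recipe: the tiny network below is a hypothetical stand-in with random weights (a real pre-trained model would be loaded from a checkpoint), and the layer sizes and the two-class head are assumptions chosen to mirror the tumor/healthy example.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained network. Its weights are random
# here so the sketch runs offline; in practice they come from a checkpoint.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
old_classifier = nn.Linear(8, 1000)  # e.g. 1000 source classes
pretrained = nn.Sequential(feature_extractor, old_classifier)

# KEEP the feature extractor and freeze its weights.
for param in feature_extractor.parameters():
    param.requires_grad = False

# REPLACE the classifier with a freshly initialised head for the new task.
new_classifier = nn.Linear(8, 2)  # e.g. tumor vs. healthy
adapted = nn.Sequential(feature_extractor, new_classifier)

trainable = [name for name, p in adapted.named_parameters() if p.requires_grad]
logits = adapted(torch.randn(4, 3, 32, 32))  # batch of 4 fake RGB images

print(trainable)     # ['1.weight', '1.bias'] -> only the new head will train
print(logits.shape)  # torch.Size([4, 2])
```

Only the new head's weight and bias remain trainable, which is what "FROZEN weights" and "NEW weights" mean in the diagram.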
    

Two Main Strategies

Feature Extraction: Freeze the pre-trained layers and only train a new classifier on top. Fast and great when your new dataset is small.

Fine-Tuning: Unfreeze some or all layers and continue training with a low learning rate. Better when you have more data, but risky: the model can lose what it originally learned (called "catastrophic forgetting").


    Strategy Comparison
    ─────────────────────────────────────────────
    │ Scenario              │ Best Strategy     │
    ─────────────────────────────────────────────
    │ Small dataset,        │ Feature           │
    │ similar task          │ Extraction        │
    ─────────────────────────────────────────────
    │ Medium dataset,       │ Fine-tune top     │
    │ similar task          │ few layers        │
    ─────────────────────────────────────────────
    │ Large dataset,        │ Fine-tune all     │
    │ different domain      │ (lower lr)        │
    ─────────────────────────────────────────────
    │ Very small dataset,   │ Feature           │
    │ different domain      │ Extraction only   │
    ─────────────────────────────────────────────
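
In code, the rows of this table differ mainly in how many layers are left trainable. The sketch below assumes a model built as an `nn.Sequential`; the helper names `set_trainable_top` and `count_trainable` and the toy layer sizes are made up for illustration and are not part of any library.

```python
from torch import nn


def set_trainable_top(model: nn.Sequential, n_top: int) -> None:
    """Freeze every child module, then unfreeze the last `n_top` children."""
    children = list(model.children())
    for child in children:
        for p in child.parameters():
            p.requires_grad = False
    for child in children[len(children) - n_top:]:
        for p in child.parameters():
            p.requires_grad = True


def count_trainable(model: nn.Module) -> int:
    """Number of scalar parameters that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Toy stand-in for a pre-trained network whose last layer is the classifier.
model = nn.Sequential(
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 3),
)

counts = {}
for label, n_top in [("feature extraction", 1),
                     ("fine-tune top few", 3),
                     ("fine-tune all", 5)]:
    set_trainable_top(model, n_top)
    counts[label] = count_trainable(model)

print(counts)
# {'feature extraction': 51, 'fine-tune top few': 323, 'fine-tune all': 595}
```

The more layers you unfreeze, the more parameters your data has to constrain, which is why the table pairs deeper fine-tuning with larger datasets.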
    

When to Use Transfer Learning

Transfer learning shines in these situations:

Limited data: You have only 500 images but need to classify 10 types of flowers. A model pre-trained on millions of images already knows what petals and leaves look like.

Similar domains: Your source and target tasks share structure. A model trained on natural images often transfers reasonably well to medical images, but poorly to radar signals.

Computational constraints: Training a large model from scratch on ImageNet can take days or weeks on multiple GPUs. Fine-tuning often takes hours on a single GPU.

Start small, iterate fast: Even if you have enough data, starting with transfer learning gives you a strong baseline quickly. You can always train from scratch later if needed.
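
A quick back-of-the-envelope calculation shows why feature extraction is cheap and workable with little data. The numbers below are assumptions for illustration: ResNet-50 as the backbone, using its commonly reported torchvision parameter count and its 2048-wide final feature vector, and the 10 flower classes from the example above.

```python
# Assumed figures for torchvision's ResNet-50 with its 1000-class ImageNet head.
RESNET50_TOTAL = 25_557_032
FEATURE_DIM = 2048  # size of the feature vector feeding the classifier


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


old_head = linear_params(FEATURE_DIM, 1000)  # the ImageNet classifier
backbone = RESNET50_TOTAL - old_head         # everything we freeze
new_head = linear_params(FEATURE_DIM, 10)    # 10 flower classes
fraction = new_head / (backbone + new_head)

print(f"trainable: {new_head:,} of {backbone + new_head:,} ({fraction:.3%})")
# trainable: 20,490 of 23,528,522 (0.087%)
```

With the backbone frozen, under 0.1% of the parameters are trained, so a few hundred images per task are far less likely to be swamped by the model's capacity.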

Popular Pre-trained Models

The AI community has built an incredible library of pre-trained models:


    Model Zoo
    ────────────────────────────────────────
    │ Domain        │ Popular Models       │
    ────────────────────────────────────────
    │ Vision        │ ResNet, VGG,         │
    │               │ EfficientNet, ViT    │
    ────────────────────────────────────────
    │ NLP           │ BERT, GPT,           │
    │               │ T5, RoBERTa          │
    ────────────────────────────────────────
    │ Audio         │ wav2vec, Whisper     │
    ────────────────────────────────────────
    │ Multimodal    │ CLIP, DALL-E,        │
    │               │ Flamingo             │
    ────────────────────────────────────────
    

Hugging Face is the go-to hub for pre-trained NLP models (and hosts vision and audio models too), while torchvision and timm cover computer vision. These libraries make it dead simple to download and use state-of-the-art models.

Common Pitfalls

Transfer learning isn't magic. Watch out for these issues:

Domain mismatch: If your source and target domains are too different, transferred features may be useless or harmful. A model trained on paintings is unlikely to help much with satellite imagery.

Catastrophic forgetting: When fine-tuning, aggressive learning rates can wipe out useful pre-trained knowledge. Use a much smaller learning rate than you would for training from scratch (like 1e-5).

Dataset bias: Your pre-trained model might carry biases from its original dataset. If it was trained on mostly white faces, it may perform poorly on darker skin tones.
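
For the catastrophic forgetting pitfall, one common way to keep updates gentle is to give the pre-trained layers a smaller learning rate than the new head, using the optimizer's parameter groups. This is a sketch under assumptions: the two toy modules stand in for a real backbone and head, and the head's 1e-3 rate is an illustrative choice, not a recommendation from this article.

```python
import torch
from torch import nn

backbone = nn.Sequential(nn.Linear(32, 16), nn.ReLU())  # stand-in for pre-trained layers
head = nn.Linear(16, 4)                                  # new task-specific layer

optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},  # gentle updates protect old knowledge
        {"params": head.parameters(), "lr": 1e-3},      # new layer can move faster (assumed value)
    ]
)

for group in optimizer.param_groups:
    n_params = sum(p.numel() for p in group["params"])
    print(group["lr"], n_params)
# 1e-05 528
# 0.001 68
```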

Practical Example

Here's the general workflow for fine-tuning a pre-trained model:


    Step 1: Load pre-trained model
            ↓
    Step 2: Replace final classification layer
            ↓
    Step 3: Freeze early layers (optional)
            ↓
    Step 4: Train with small learning rate
            ↓
    Step 5: Gradually unfreeze layers if needed
            ↓
    Step 6: Evaluate and iterate
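
The six steps above can be sketched end to end in PyTorch. Everything concrete here is an assumption made so the example runs offline in a few seconds: the "pre-trained" model has random weights, the data is synthetic with random labels, and the layer sizes, learning rates, and epoch counts are placeholders.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Step 1: "load" a pre-trained model (random stand-in, no download).
backbone = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
model = nn.Sequential(backbone, nn.Linear(16, 100))

# Step 2: replace the final classification layer (100 source -> 3 target classes).
model[1] = nn.Linear(16, 3)

# Step 3: freeze the early layers.
for p in backbone.parameters():
    p.requires_grad = False

# Synthetic target-task data, purely for illustration.
x = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))
loss_fn = nn.CrossEntropyLoss()


def train(model: nn.Module, lr: float, epochs: int) -> float:
    """Full-batch training of whichever parameters are currently trainable."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


# Step 4: train the new head while the backbone stays frozen.
loss_head = train(model, lr=1e-3, epochs=50)

# Step 5: unfreeze the last backbone layer and continue with a smaller rate.
for p in backbone[2].parameters():
    p.requires_grad = True
loss_finetune = train(model, lr=1e-5, epochs=50)

# Step 6: evaluate (on the training data only, to keep the sketch short).
with torch.no_grad():
    accuracy = (model(x).argmax(dim=1) == y).float().mean().item()

print(f"head-only loss {loss_head:.3f}, "
      f"fine-tuned loss {loss_finetune:.3f}, accuracy {accuracy:.2f}")
```

In a real project, step 6 would use a held-out validation set, and the labels would carry signal; with random labels the accuracy here is only a plumbing check.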
    

The key insight: you're not teaching the model what features to look for; you're teaching it what those features mean in your specific context. That's why transfer learning is so powerful and efficient.
