
Beginner's Guide to Deep Learning: From CNNs to Transformers

A beginner-friendly yet comprehensive guide to deep learning architectures — covering Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers — with practical PyTorch examples, best practices, and pro tips for 2026.

May 5, 2026 · 16 min read · Niraj Kumar

Deep learning has transformed nearly every field it has touched — from how we diagnose diseases to how we translate languages in real time. Yet for many developers, the journey into deep learning feels intimidating. Where do you start? What architecture do you use? How do all the pieces fit together?

This guide will walk you through the three cornerstone architectures of modern deep learning: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers — with hands-on PyTorch code examples, real-world use cases, and actionable advice.

Whether you're completely new to neural networks or you've trained a few models and want to solidify your foundations, this guide is for you.


🧠 What Is Deep Learning, Really?

Deep learning is a subfield of machine learning that uses artificial neural networks with many layers (hence "deep") to learn representations of data. Instead of manually engineering features, deep learning models learn hierarchical representations directly from raw data.

A deep neural network stacks layers of mathematical operations. Each layer transforms its input and passes the result to the next. Through a process called backpropagation and gradient descent, the network learns to minimize prediction errors by adjusting millions (or billions) of internal parameters called weights.

Here's the mental model:

  • Input Layer: Raw data (pixels, words, sensor readings)
  • Hidden Layers: Learned feature representations
  • Output Layer: Final prediction (class label, next word, bounding box)

Before diving into architectures, make sure you have PyTorch installed:

pip install torch torchvision torchaudio
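
With that in place, here's the input → hidden → output mental model as a minimal PyTorch sketch (the layer sizes are arbitrary toy choices):

import torch
import torch.nn as nn

# Input layer → hidden layers → output layer
mlp = nn.Sequential(
    nn.Linear(20, 64),   # raw features → first learned representation
    nn.ReLU(),
    nn.Linear(64, 32),   # deeper, more abstract representation
    nn.ReLU(),
    nn.Linear(32, 3),    # final prediction: scores for 3 classes
)

x = torch.randn(8, 20)   # a batch of 8 examples with 20 features each
logits = mlp(x)
print(logits.shape)      # torch.Size([8, 3])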

🖼️ Part 1: Convolutional Neural Networks (CNNs)

What Problem Do CNNs Solve?

Imagine feeding an image — say, 224×224 pixels with 3 color channels — into a standard fully-connected neural network. That's 150,528 input features per image. If your first hidden layer has 1,000 neurons, you already have 150 million parameters to learn from the very first layer. This is computationally brutal and prone to overfitting.

CNNs solve this by exploiting two key properties of images:

  1. Locality: Nearby pixels are more related than distant ones.
  2. Translation Invariance: A cat looks like a cat whether it's in the top-left or bottom-right of the image.

How CNNs Work

A CNN is built from three main types of layers:

1. Convolutional Layers

A convolutional layer slides a small filter (also called a kernel) — typically 3×3 or 5×5 — across the input image, computing dot products at each position. This produces a feature map that highlights patterns the filter is tuned to detect.

Early layers detect low-level features (edges, colors). Deeper layers detect high-level features (eyes, wheels, faces).

2. Pooling Layers

Pooling (usually max pooling) reduces the spatial dimensions of feature maps, making the representation smaller, faster, and more robust to small spatial shifts.

3. Fully-Connected Layers

After several conv + pool stages, the output is flattened and passed through fully-connected layers to produce final class predictions.
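
To make the shapes concrete, here's a minimal sketch tracing one image through a single conv + pool stage (the channel count of 16 is an arbitrary choice):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                  # one RGB image, 32×32

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2, 2)

feat   = conv(x)                               # (1, 16, 32, 32) — 16 feature maps
pooled = pool(feat)                            # (1, 16, 16, 16) — spatial dims halved
flat   = pooled.flatten(1)                     # (1, 4096) — ready for a linear layer

print(feat.shape, pooled.shape, flat.shape)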

CNN in PyTorch: Image Classification

Let's build a simple CNN to classify images from the CIFAR-10 dataset (10 classes: airplanes, cars, birds, cats, etc.):

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# ── Data Loading ──────────────────────────────────────────────
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])

train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=64, shuffle=True, num_workers=2
)

# ── Model Architecture ────────────────────────────────────────
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()

        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 32×32 → 32×32
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # 32×32 → 16×16
            nn.Dropout2d(0.25),

            # Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 16×16 → 16×16
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # 16×16 → 8×8
            nn.Dropout2d(0.25),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# ── Training Loop ─────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()
    return total_loss / len(loader.dataset), correct / len(loader.dataset)

for epoch in range(30):
    loss, acc = train_one_epoch(model, train_loader, optimizer, criterion, device)
    scheduler.step()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:3d} | Loss: {loss:.4f} | Acc: {acc*100:.2f}%")

Real-World CNN Applications

  • Medical Imaging: Detecting tumors in X-rays and MRI scans
  • Autonomous Vehicles: Identifying pedestrians, road signs, and lane markings
  • Facial Recognition: Unlocking phones and tagging people in photos
  • Quality Control: Detecting defects on manufacturing lines

Architecture  | Year | Key Innovation
AlexNet       | 2012 | Proved deep CNNs work at scale
VGGNet        | 2014 | Very deep networks with 3×3 convs
ResNet        | 2015 | Residual (skip) connections
EfficientNet  | 2019 | Neural architecture search scaling
ConvNeXt      | 2022 | CNN redesigned with Transformer insights

For most practical tasks in 2026, start with ResNet-50 or EfficientNet-B0 as your CNN backbone — they're fast, well-supported, and pretrained on ImageNet.


🔁 Part 2: Recurrent Neural Networks (RNNs)

The Problem: Sequential Data

CNNs are brilliant for grid-like data (images), but what about sequences? Consider:

  • "The bank by the river is beautiful." (bank = riverbank)
  • "The bank rejected my loan application." (bank = financial institution)

The meaning of "bank" depends on its context — words that came before it. Standard feedforward networks have no concept of order or history. This is where RNNs come in.

How RNNs Work

An RNN processes a sequence one element at a time. At each step t, it takes:

  • The current input xₜ
  • The hidden state hₜ₋₁ from the previous step

And produces a new hidden state hₜ:

hₜ = tanh(Wₓ · xₜ + Wₕ · hₜ₋₁ + b)

This hidden state acts as the network's "memory," carrying information from past steps into future ones.
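
In code, this recurrence is just a loop over time steps. A minimal sketch using nn.RNNCell (dimensions are made up):

import torch
import torch.nn as nn

input_dim, hidden_dim, seq_len = 10, 20, 5
cell = nn.RNNCell(input_dim, hidden_dim)    # computes hₜ = tanh(Wₓ·xₜ + Wₕ·hₜ₋₁ + b)

x = torch.randn(seq_len, 1, input_dim)      # a 5-step sequence, batch size 1
h = torch.zeros(1, hidden_dim)              # initial hidden state (the "memory")

for t in range(seq_len):
    h = cell(x[t], h)                       # carry memory forward one step at a time

print(h.shape)                              # torch.Size([1, 20])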

The Vanishing Gradient Problem

Plain RNNs suffer from a critical weakness: vanishing gradients. When backpropagating through many time steps, gradients shrink exponentially, making it nearly impossible to learn long-range dependencies.

Long Short-Term Memory (LSTM) networks solve this with a gating mechanism — input gate, forget gate, and output gate — that selectively retains or discards information over long sequences.

Gated Recurrent Units (GRUs) are a simplified, faster alternative that often performs comparably to LSTMs.

Sentiment Analysis with LSTM in PyTorch

import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, dropout=0.3):
        super(SentimentLSTM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True,          # Read sequence forward AND backward
        )

        self.dropout = nn.Dropout(dropout)
        # ×2 because bidirectional
        self.fc = nn.Linear(hidden_dim * 2, 1)

    def forward(self, x, lengths):
        # x: (batch, seq_len)  lengths: actual sequence lengths
        embedded = self.dropout(self.embedding(x))

        # Pack padded sequences for efficiency
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        packed_out, (hidden, _) = self.lstm(packed)

        # Concatenate the last hidden states from both directions
        # hidden shape: (num_layers * 2, batch, hidden_dim)
        hidden_fwd = hidden[-2]   # Last layer, forward direction
        hidden_bwd = hidden[-1]   # Last layer, backward direction
        combined = torch.cat([hidden_fwd, hidden_bwd], dim=1)

        out = self.dropout(combined)
        logits = self.fc(out).squeeze(1)
        return logits   # Raw logits; apply sigmoid for probability


# ── Example Usage ─────────────────────────────────────────────
VOCAB_SIZE   = 25_000
EMBED_DIM    = 128
HIDDEN_DIM   = 256
NUM_LAYERS   = 2
DROPOUT      = 0.3

model = SentimentLSTM(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Dummy forward pass
batch_size, seq_len = 32, 50
x       = torch.randint(1, VOCAB_SIZE, (batch_size, seq_len))
lengths = torch.randint(10, seq_len + 1, (batch_size,))
logits  = model(x, lengths)
print(f"Output shape: {logits.shape}")   # (32,)

Honest Note on RNNs in 2026

RNNs were the dominant sequence model before Transformers arrived. Today, for most NLP and time-series tasks, Transformers (or their efficient variants) outperform RNNs. However, RNNs still shine in:

  • Edge/embedded deployments where memory is constrained
  • Streaming inference where you process one token at a time
  • Certain time-series tasks with very long sequences

Understanding RNNs also gives you foundational intuition for why Transformers were designed the way they were.


⚡ Part 3: Transformers — The Architecture That Changed Everything

The Attention Revolution

In 2017, the landmark paper "Attention Is All You Need" introduced the Transformer architecture. It dispensed with recurrence entirely and replaced it with a mechanism called self-attention, which allows every position in a sequence to directly attend to every other position in a single step.

This means:

  • No more vanishing gradients over long sequences
  • Parallelizable training (RNNs must process tokens sequentially)
  • Better long-range dependency modeling

The result? Transformers became the backbone of GPT, BERT, T5, LLaMA, Stable Diffusion, and virtually every state-of-the-art model today.

Self-Attention: The Core Idea

Given a sequence of tokens, self-attention computes three vectors for each token:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I carry?"

Putting these together, attention over the whole sequence is computed as:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V

The softmax produces a probability distribution over all tokens, determining how much each token should "attend to" every other token. The result is a weighted sum of Value vectors.
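
Here's that formula as a few lines of single-head PyTorch (toy dimensions, no masking):

import torch
import math

seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)     # queries
k = torch.randn(seq_len, d_k)     # keys
v = torch.randn(seq_len, d_k)     # values

scores  = q @ k.T / math.sqrt(d_k)          # (4, 4) raw similarity scores
weights = torch.softmax(scores, dim=-1)     # each row is a probability distribution
out     = weights @ v                       # weighted sum of Value vectors

print(weights.sum(dim=-1))   # tensor([1., 1., 1., 1.])
print(out.shape)             # torch.Size([4, 8])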

Multi-Head Attention runs this process in parallel across multiple "heads," each learning different relationship types (syntactic, semantic, positional, etc.).

Positional Encoding

Unlike RNNs, Transformers process all tokens simultaneously. To preserve order, positional encodings are added to token embeddings, injecting information about each token's position in the sequence.
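
One classic choice from the original paper is fixed sinusoidal encodings; the mini model below uses learned position embeddings instead, which works just as well at this scale. A sketch of the sinusoidal version (assumes an even d_model):

import torch
import math

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len).unsqueeze(1)                                  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                              # added to the token embeddings

pe = sinusoidal_positional_encoding(max_len=512, d_model=256)
print(pe.shape)   # torch.Size([512, 256])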

Building a Mini Transformer in PyTorch

import torch
import torch.nn as nn
import math


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim  = embed_dim // num_heads
        self.scale     = math.sqrt(self.head_dim)

        self.qkv_proj  = nn.Linear(embed_dim, embed_dim * 3, bias=False)
        self.out_proj  = nn.Linear(embed_dim, embed_dim)
        self.dropout   = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape   # batch, seq_len, embed_dim

        # Project to Q, K, V and split into heads
        qkv = self.qkv_proj(x)                        # (B, T, 3*C)
        qkv = qkv.reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)             # (3, B, heads, T, head_dim)
        q, k, v = qkv.unbind(0)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) / self.scale  # (B, heads, T, T)
        if mask is not None:
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(attn, dim=-1)
        attn = self.dropout(attn)

        out = (attn @ v)                               # (B, heads, T, head_dim)
        out = out.transpose(1, 2).reshape(B, T, C)    # (B, T, C)
        return self.out_proj(out)


class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn   = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
        self.ff     = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout),
        )
        self.norm1  = nn.LayerNorm(embed_dim)
        self.norm2  = nn.LayerNorm(embed_dim)

    def forward(self, x, mask=None):
        # Pre-LayerNorm residual (more stable than post-LN)
        x = x + self.attn(self.norm1(x), mask)
        x = x + self.ff(self.norm2(x))
        return x


class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim,
                 num_layers, max_seq_len, num_classes, dropout=0.1):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed   = nn.Embedding(max_seq_len, embed_dim)
        self.dropout     = nn.Dropout(dropout)

        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_layers)
        ])

        self.norm      = nn.LayerNorm(embed_dim)
        self.head      = nn.Linear(embed_dim, num_classes)

    def forward(self, x, mask=None):
        B, T = x.shape
        positions = torch.arange(T, device=x.device).unsqueeze(0)  # (1, T)

        # Token + positional embeddings
        x = self.dropout(self.token_embed(x) + self.pos_embed(positions))

        for block in self.blocks:
            x = block(x, mask)

        x = self.norm(x)
        # Pool by taking the first token's representation for classification
        # (it plays the role of a [CLS] token if you prepend one to your inputs)
        return self.head(x[:, 0])


# ── Instantiate and test ──────────────────────────────────────
model = MiniTransformer(
    vocab_size=10_000,
    embed_dim=256,
    num_heads=8,
    ff_dim=1024,
    num_layers=4,
    max_seq_len=512,
    num_classes=2,         # e.g., binary sentiment
)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters: {total_params:,}")

batch = torch.randint(0, 10_000, (16, 128))  # (batch=16, seq_len=128)
logits = model(batch)
print(f"Output shape: {logits.shape}")         # (16, 2)

Encoder vs. Decoder vs. Encoder-Decoder

Modern Transformer-based models fall into three families:

Family          | Examples                 | Best For
Encoder-only    | BERT, RoBERTa            | Classification, NER, embeddings
Decoder-only    | GPT-4, LLaMA 3, Mistral  | Text generation, chat, reasoning
Encoder-Decoder | T5, BART, Whisper        | Translation, summarization, ASR

For text classification or feature extraction, use an encoder model (BERT-family).
For text generation, use a decoder model (GPT-family).
For sequence-to-sequence tasks, use an encoder-decoder (T5-family).

Vision Transformers (ViT)

Transformers aren't limited to text. Vision Transformers (ViT) split an image into fixed-size patches, linearly project each patch into an embedding, and process the sequence of patch embeddings with standard Transformer blocks. In 2026, hybrid CNN + Transformer architectures (like ConvNeXt V2 and MaxViT) dominate many vision benchmarks.
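
The patch-embedding step is just a strided convolution. A minimal sketch (patch size and embedding dimension are arbitrary):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)           # one RGB image

patch_size, embed_dim = 16, 256
# A conv with kernel = stride = patch size slices the image into non-overlapping
# patches and linearly projects each one in a single operation
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

patches = patch_embed(img)                   # (1, 256, 14, 14) — a 14×14 grid of patches
tokens  = patches.flatten(2).transpose(1, 2) # (1, 196, 256) — a sequence of 196 patch tokens

print(tokens.shape)   # ready for standard Transformer blocks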


🏆 Best Practices

Data Practices

  • Always normalize your inputs. For images, use dataset-specific mean and standard deviation. For text, use subword tokenization (BPE or WordPiece).
  • Use data augmentation aggressively for CNNs: random crops, flips, color jitter, mixup, and cutmix.
  • Validate your data pipeline first. Visualize a batch before training to catch preprocessing bugs early.
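
A quick way to do that last check, assuming the train_loader and train_dataset from Part 1 and that matplotlib is installed:

import matplotlib.pyplot as plt
import torchvision

# Grab one batch and eyeball the augmented images before training
images, labels = next(iter(train_loader))
grid = torchvision.utils.make_grid(images[:16], nrow=4, normalize=True)

plt.imshow(grid.permute(1, 2, 0))   # CHW → HWC for matplotlib
plt.title("One augmented CIFAR-10 training batch")
plt.axis("off")
plt.show()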

Training Practices

  • Start with a pretrained model whenever possible. Fine-tuning a pretrained ResNet or BERT will almost always outperform training from scratch on limited data.
  • Use learning rate scheduling. Warmup followed by cosine decay is the gold standard in 2026.
  • Monitor validation loss, not training loss. Training loss always goes down; validation loss tells you whether you're generalizing.
  • Use mixed-precision training (torch.autocast) to speed up training and roughly halve GPU memory usage.
  • Gradient clipping (torch.nn.utils.clip_grad_norm_) prevents exploding gradients, especially in RNNs and Transformers.
# Mixed precision + gradient clipping best practice
scaler = torch.amp.GradScaler('cuda')   # torch.cuda.amp.GradScaler is deprecated

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()

    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(images)
        loss = criterion(outputs, labels)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
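
A minimal sketch of the warmup + cosine schedule mentioned above, reusing the optimizer from earlier (the 5/25 epoch split is an arbitrary choice):

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# 5 epochs of linear warmup, then cosine decay for the remaining 25
warmup    = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine    = CosineAnnealingLR(optimizer, T_max=25)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(30):
    # ... train one epoch as above ...
    scheduler.step()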

Architecture Practices

  • Use BatchNorm for CNNs, LayerNorm for Transformers. These are not interchangeable.
  • Residual connections are almost always beneficial. When in doubt, add skip connections.
  • Dropout placement matters: After embedding layers in Transformers; after pooling layers in CNNs; avoid dropout in batch-normalized layers.
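
On the skip-connection point, a minimal residual block looks like this (the channel count is arbitrary):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: add the input back

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])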

⚠️ Common Mistakes

Mistake 1: Not Shuffling Training Data

Always shuffle your training data between epochs (DataLoader(shuffle=True) handles this). Without shuffling, consecutive batches can be highly correlated, for example when the dataset is stored sorted by class, which destabilizes gradient updates and hurts generalization.

Mistake 2: Data Leakage

Fitting your data scaler (e.g., mean/std for normalization) on the entire dataset — including the validation and test sets — inflates your evaluation metrics. Fit only on training data, then apply to validation/test.
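
For image normalization, this means deriving the mean and standard deviation from the training split only; the CIFAR-10 statistics used in Part 1 come from exactly this kind of computation. A sketch (loads the full training split into memory, roughly 600 MB):

import torch
import torchvision
import torchvision.transforms as transforms

# Compute normalization stats from the TRAINING split only
train_raw = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transforms.ToTensor()
)
images = torch.stack([img for img, _ in train_raw])   # (50000, 3, 32, 32)

mean = images.mean(dim=(0, 2, 3))
std  = images.std(dim=(0, 2, 3))
print(mean, std)   # reuse these in transforms.Normalize for train, val, AND test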

Mistake 3: Learning Rate Too High or Too Low

A learning rate that's too high causes training to diverge (loss explodes). Too low, and training stalls. Always do a learning rate range test or start with 1e-3 for Adam/AdamW and adjust.

Mistake 4: Ignoring Class Imbalance

If 95% of your samples are class A and 5% are class B, a model that always predicts A achieves 95% accuracy but is useless. Use torch.nn.CrossEntropyLoss(weight=...) or oversampling techniques.
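
A minimal sketch of a weighted loss for that imbalanced case (the counts are made up):

import torch
import torch.nn as nn

class_counts = torch.tensor([950.0, 50.0])   # 95% class A, 5% class B
weights = class_counts.sum() / (len(class_counts) * class_counts)
# tensor([0.5263, 10.0000]) — each rare-class sample contributes ~19× more to the loss

criterion = nn.CrossEntropyLoss(weight=weights)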

Mistake 5: Evaluating on the Test Set Too Early

Your test set should be touched exactly once — at the very end. Using it for model selection or hyperparameter tuning turns it into a validation set, invalidating your final evaluation.

Mistake 6: Forgetting model.eval() During Inference

Dropout and BatchNorm behave differently during training and inference. Always call model.eval() before running predictions, and wrap inference in torch.no_grad() to skip gradient tracking and save memory.

model.eval()
with torch.no_grad():
    predictions = model(test_inputs)

🚀 Pro Tips

  • Use torchinfo (formerly torchsummary) to print a clean summary of your model's layers, output shapes, and parameter counts: pip install torchinfo, then from torchinfo import summary; summary(model, input_size=(1, 3, 224, 224)).

  • Profile before optimizing. Use PyTorch's built-in profiler (torch.profiler) to identify actual bottlenecks before rewriting code.

  • For small datasets with Transformers, always use heavy regularization: weight decay, dropout, and data augmentation. Alternatively, use a pretrained model from Hugging Face — even with 1,000 samples, fine-tuning BERT can be highly effective.

  • The Hugging Face ecosystem is your friend. transformers, datasets, accelerate, and peft (for LoRA fine-tuning) can dramatically reduce the engineering overhead of working with large models.

# Fine-tuning a pretrained BERT for classification with Hugging Face
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "bert-base-uncased"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

texts  = ["This movie was fantastic!", "I hated every minute of it."]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)
    print(predictions)   # tensor([1, 0]) → positive, negative
  • Watch your GPU memory with nvidia-smi -l 1 in a terminal during training. Running out of VRAM mid-epoch is frustrating. Reduce batch size or enable gradient checkpointing if needed.

  • Log everything. Use Weights & Biases (wandb) or TensorBoard to track loss curves, learning rate schedules, and validation metrics. Debugging a model you didn't instrument is nearly impossible.

  • Read papers on Papers With Code (paperswithcode.com). Every benchmark has a leaderboard showing the top methods with links to code — invaluable for understanding the state of the art.


📌 Key Takeaways

  • CNNs are the go-to architecture for image and spatial data. They exploit locality and translation invariance through learned convolutional filters.

  • RNNs and LSTMs were built for sequential data and introduced the concept of memory into neural networks. While largely superseded by Transformers for NLP, they remain relevant for edge inference and streaming applications.

  • Transformers are the dominant architecture across NLP, vision, audio, and multi-modal AI. Self-attention allows every token to directly interact with every other, enabling powerful long-range reasoning and highly parallelizable training.

  • Always start with pretrained models. Transfer learning is the single most effective technique for practitioners with limited data and compute.

  • Data quality matters more than architecture. A well-curated dataset with a simpler model almost always beats a complex model trained on noisy data.

  • The field moves fast. In 2026, efficient Transformer variants (FlashAttention 3, Mamba-style state space models, Mixture of Experts) are pushing boundaries. Keep reading.

  • PyTorch is the de facto framework for research and increasingly for production. Mastering it will serve you for years.


Conclusion

Deep learning is vast, but it's not impenetrable. By understanding these three foundational architectures — CNNs, RNNs, and Transformers — you've built the mental scaffolding to understand most of the models you'll encounter in research papers, open-source repos, and production systems.

The best way to solidify this knowledge is to build things. Pick a dataset you care about, implement one of the architectures from this guide, and iterate. Break things, read error messages, profile your code, and experiment with hyperparameters. That hands-on loop is irreplaceable.

Deep learning rewards curiosity and persistence. The tools have never been more accessible, the pretrained models have never been more capable, and the community has never been more welcoming. Go build something.

