Deep Learning Cheatsheet — BitWithBite

Deep Learning Complete Cheatsheet

Neural networks, CNNs, RNNs, Transformers, PyTorch and Keras — complete deep learning guide.

📖 10 sections

⏱ 28 min read

✅ Quizzes included

01 Neural Network Basics

Neuron (perceptron)

Input → weighted sum + bias → activation → output. Loosely modelled on a biological neuron.

Layer

Input layer → hidden layers → output layer. Depth = number of hidden layers.

Forward pass

Input flows through network to produce prediction.

Backpropagation

Compute gradients via chain rule. Propagate error backward to update weights.

Epoch

One full pass through the training dataset. Training typically needs many epochs.

Batch

Subset of training data processed together. Larger batches improve hardware throughput but need more memory.

Mini-batch gradient descent

Most common: 32-256 samples. Balances speed and stability.

Weight initialization

Xavier/Glorot for sigmoid/tanh. He init for ReLU. Both keep activation variance stable across layers, which helps prevent vanishing or exploding gradients.
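The forward pass, backpropagation and weight update described above can be sketched for a single sigmoid neuron in plain Python. The input values, target, learning rate and squared-error loss below are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward pass: weighted sum plus bias, then activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def backward(w, b, x, y):
    """Backpropagation for the squared error L = (a - y)^2 via the chain rule."""
    a = forward(w, b, x)
    dL_da = 2.0 * (a - y)              # loss w.r.t. activation
    da_dz = a * (1.0 - a)              # sigmoid derivative
    delta = dL_da * da_dz              # loss w.r.t. pre-activation z
    grad_w = [delta * xi for xi in x]  # dz/dw_i = x_i
    grad_b = delta                     # dz/db = 1
    return grad_w, grad_b

# Illustrative values (not taken from any dataset)
x, y = [1.0, 2.0], 1.0
w, b = [0.1, -0.2], 0.0
lr = 0.5

loss_before = (forward(w, b, x) - y) ** 2
for epoch in range(100):
    grad_w, grad_b = backward(w, b, x, y)
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]  # gradient-descent update
    b = b - lr * grad_b
loss_after = (forward(w, b, x) - y) ** 2
```

A real network repeats the same three steps per layer, with the gradient of each layer computed from the one after it.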

Network architecture notation

INPUT (784 for 28x28 image)
  ↓
DENSE (128 units, ReLU)     # first hidden layer
  ↓
DROPOUT (0.3)               # regularization
  ↓
DENSE (64 units, ReLU)      # second hidden layer
  ↓
BATCH NORM                  # normalize activations
  ↓
DENSE (10 units, Softmax)   # output: 10 classes

Params: 784×128 + 128×64 + 64×10 = 109,184 weights (plus 128 + 64 + 10 = 202 biases)
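The parameter count for the dense layers in the diagram can be checked with a few lines of Python. This sketch counts only dense-layer weights and biases; the helper name `dense_params` is made up for the example.

```python
def dense_params(n_in, n_out):
    """Return (weights, biases) for a fully connected layer."""
    return n_in * n_out, n_out

# (inputs, units) for the three dense layers in the diagram
layers = [(784, 128), (128, 64), (64, 10)]

total_weights = sum(dense_params(i, o)[0] for i, o in layers)
total_biases = sum(dense_params(i, o)[1] for i, o in layers)
total_params = total_weights + total_biases
```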

02 Activation Functions

ReLU

max(0, x)

Most common. Fast. Dead neuron problem.

Leaky ReLU

max(0.01x, x)

Fixes dead neurons.

Sigmoid

1/(1+e^-x) → (0,1)

Binary output. Vanishing gradients.

Tanh

(e^x-e^-x)/(e^x+e^-x) → (-1,1)

Zero-centered. Still vanishes.

Softmax

e^xi / Σe^xj → probabilities

Multiclass output. Sums to 1.

GELU

x · Φ(x)

Used in Transformers (BERT, GPT). Smooth alternative to ReLU. Φ is the standard normal CDF.

Swish

x · sigmoid(x)

Self-gated. Good for deep networks.

💡

Hidden layers: ReLU or Leaky ReLU. Binary output: Sigmoid. Multiclass output: Softmax. Regression output: Linear (no activation).
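The formulas in this section translate almost directly into code. The following is a minimal scalar sketch using only the standard library; frameworks apply the same functions element-wise to tensors.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)

def softmax(xs):
    # Subtracting the max avoids overflow without changing the result
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # three class scores -> three probabilities
```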

03 Loss & Optimisation

Loss functions

MSE

Mean Squared Error. Regression. Penalises large errors heavily.

MAE

Mean Absolute Error. Regression. Robust to outliers.

Binary Cross-entropy

Binary classification. Log-loss on a sigmoid output: −[y·log(p) + (1−y)·log(1−p)].

Categorical Cross-entropy

Multiclass. Softmax output. Loss = −log(predicted probability of the true class).

Sparse Categorical CE

Same loss, but labels are integer class indices rather than one-hot vectors.

Hinge loss

SVMs and margin classifiers.

Focal loss

Imbalanced datasets. Down-weights easy examples.
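The loss functions listed above can be written out in a few lines each. This is a sketch for scalars and short lists, assuming labels in {0, 1} for the cross-entropy and focal losses and in {-1, +1} for hinge loss; the `eps` clamp is an assumption added to avoid log(0).

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sparse_categorical_cross_entropy(label, probs, eps=1e-12):
    # label is an integer class index, probs is a softmax output
    return -math.log(max(probs[label], eps))

def hinge(y, score):
    # y is -1 or +1, score is the raw classifier output
    return max(0.0, 1.0 - y * score)

def focal(y, p, gamma=2.0, eps=1e-12):
    # Binary focal loss: the (1 - p_t)^gamma factor shrinks the loss of easy examples
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# One outlier (error of 10) dominates MSE far more than MAE
errors_true, errors_pred = [0.0, 0.0, 0.0], [0.0, 0.0, 10.0]
mse_value = mse(errors_true, errors_pred)
mae_value = mae(errors_true, errors_pred)
```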

Optimizers

SGD

Stochastic Gradient Descent. Simple. Noisy. lr=0.01-0.1.

SGD + Momentum

Adds velocity term. Faster convergence. momentum=0.9.

Adam

Adaptive learning rates. Most popular default. lr=0.001.

AdaGrad

Per-parameter lr. Good for sparse data. lr decays fast.

RMSprop

Like AdaGrad but decays old gradients. Good for RNNs.

AdamW

Adam with decoupled weight decay. Usually regularises better than an L2 penalty added to the loss under Adam. Standard for Transformers.

Learning rate

Controls step size. Too high: training diverges. Too low: training is slow. Typical starting range: 1e-5 to 1e-3.

LR schedule

Cosine annealing, step decay, warmup. Often critical for Transformers.

Gradient clipping

Rescale gradients when their global norm exceeds a threshold, e.g. max_norm = 1.0. Prevents exploding gradients in RNNs/LSTMs.
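The update rules behind the optimizer cards can be sketched on a one-parameter toy problem. The objective f(w) = (w − 3)², the step counts and the learning rates below are assumptions picked so the toy converges quickly; they are not recommended defaults. The warmup-plus-cosine schedule and the norm-clipping helper are likewise illustrative, not a copy of any library's implementation.

```python
import math

def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=300, lr=0.1, momentum=0.9):
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(w)  # decayed sum of past gradients
        w -= lr * velocity
    return w

def adam(w, steps=2000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def warmup_cosine_lr(step, total_steps, base_lr=3e-4, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

w_sgd = sgd_momentum(0.0)    # both optimizers start at w = 0 and approach w = 3
w_adam = adam(0.0)
clipped, norm_before = clip_by_global_norm([3.0, 4.0])  # norm 5 is scaled down to about 1
```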

04 CNNs

Python: CNN architecture

# Convolutional layers learn spatial features
# Conv → Pool → Conv → Pool → Flatten → Dense

import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # 3ch→32ch
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # /2 spatially
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 8 * 8, 256),  # assumes 3×32×32 input: two 2×2 pools leave 8×8
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        x = self.features(x)      # conv layers
        x = x.flatten(start_dim=1) # flatten
        return self.classifier(x)  # dense layers

Kernel

Filter matrix sliding over input. Detects features (edges, textures).

Stride

Step size of kernel. Stride=2 roughly halves each spatial dimension.

Padding

'same' pads so the output size matches the input (at stride 1). 'valid' uses no padding, so the output shrinks.

MaxPool

Takes max in window. Reduces size, keeps strongest features.

Transfer learning

Use pretrained weights (ResNet, EfficientNet, VGG) as starting point. Fine-tune on your data.
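Kernel size, stride and padding together determine the output size of a convolution or pooling layer: out = floor((in + 2·padding − kernel) / stride) + 1. The sketch below applies that formula to the layers of the CNN above, assuming a 32×32 input image (an assumption implied by the 64 * 8 * 8 input of the first linear layer, not stated in the source).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                           # assumed input height/width
size = conv_out(size, kernel=3, stride=1, padding=1)  # Conv2d(3, 32): stays 32
size = conv_out(size, kernel=2, stride=2)             # MaxPool2d(2, 2): 16
size = conv_out(size, kernel=3, stride=1, padding=1)  # Conv2d(32, 64): stays 16
size = conv_out(size, kernel=2, stride=2)             # MaxPool2d(2, 2): 8

flat_features = 64 * size * size                    # channels × height × width
```

With a different input resolution, `flat_features` changes and the first `nn.Linear` layer has to be resized to match.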

05 RNNs & LSTMs

Python: RNN and LSTM

# Recurrent networks: hidden state carries memory across timesteps
import torch.nn as nn

# Simple RNN (vanishing gradient problem)
rnn = nn.RNN(input_size=50, hidden_size=128, batch_first=True)

# LSTM: mitigates vanishing gradients with gates
lstm = nn.LSTM(
    input_size=100,    # embedding dimension
    hidden_size=256,   # hidden state size
    num_layers=2,      # stacked LSTMs
    dropout=0.3,       # between layers
    bidirectional=True, # process forward AND backward; output size = 2 × hidden_size
    batch_first=True
)

# GRU — simpler than LSTM, similar performance
gru = nn.GRU(input_size=100, hidden_size=256, batch_first=True)

# LSTM gates:
# Forget gate:  what to remove from cell state
# Input gate:   what new info to store
# Output gate:  what to output from cell state

Cell state

Long-term memory in LSTM. Updated additively through the forget and input gates, so information can persist across many timesteps.

Hidden state

Short-term memory. Output at each timestep.

Bidirectional

Process sequence left-to-right AND right-to-left. Better context.

Seq2Seq

Encoder reads input sequence → context vector → Decoder generates output. Translation, summarisation.
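To make the gates, cell state and hidden state concrete, here is a sketch of one LSTM timestep with hidden size 1, so every quantity is a scalar. The parameter values are hand-picked assumptions (not trained weights) that force the forget gate near 1 and the input gate near 0, which shows the cell state carrying its value across timesteps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One timestep of a scalar LSTM cell (hidden size 1, for readability)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # what to keep from the old cell state
    i = gate("input", sigmoid)         # how much new information to store
    g = gate("candidate", math.tanh)   # candidate values to write
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # cell state: long-term memory
    h = o * math.tanh(c)               # hidden state: output at this timestep
    return h, c

# Hypothetical parameters: (input weight, recurrent weight, bias) per gate
params = {
    "forget": (0.0, 0.0, 10.0),     # large bias: gate close to 1, keep everything
    "input": (0.0, 0.0, -10.0),     # gate close to 0, store almost nothing new
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 0.8
for x in [0.5, -1.0, 2.0, 0.3]:
    h, c = lstm_step(x, h, c, params)
# c stays close to its initial value 0.8 despite the changing inputs
```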

06 Transformers & Attention

Python: Transformer and Attention

# Self-attention: each token attends to all other tokens
# Q = Query, K = Key, V = Value

# Attention(Q,K,V) = softmax(QK^T / sqrt(dk)) × V

# Multi-head attention: multiple parallel attention heads
# Each head learns different relationship patterns

import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Transformer encoder block:
# Input → MultiHeadAttention → Add&Norm → FFN → Add&Norm

# BERT: bidirectional encoder (good for classification, NER)
# GPT:  autoregressive decoder (good for text generation)
# T5:   encoder-decoder (good for seq2seq tasks)

# Positional encoding: adds position info since attention has no order
# Sinusoidal (original Transformer), learned absolute (BERT, GPT-2), or rotary (RoPE in LLaMA)

# Key hyperparameters:
# num_heads: 8, 12, 16  |  d_model: 512, 768, 1024
# num_layers: 6, 12, 24 |  d_ff (FFN): 4×d_model

💡

Transformers process all tokens in parallel (unlike RNNs). This enables massive parallelism and scaling. The attention mechanism is the key innovation.
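The attention formula softmax(QK^T / sqrt(dk)) × V can be written out with plain lists. This sketch leaves out the learned projection matrices that a real layer applies to produce Q, K and V, and the token embeddings are made-up values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors
        outputs.append([sum(wj * v[col] for wj, v in zip(w, V))
                        for col in range(len(V[0]))])
    return outputs, weights

# Three toy tokens with 2-dimensional embeddings
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(X, X, X)   # self-attention: Q = K = V = X
```

Each row of `attn` is a probability distribution over the input tokens, and each query row is independent of the others, which is what makes the computation easy to parallelise.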

07 Training Tricks

Batch Normalization

Normalize each feature across the samples in a batch. Speeds training, reduces sensitivity to init.

Layer Normalization

Normalize across the features of each sample (not across the batch). Used in Transformers. Independent of batch size, so it suits variable-length sequences.

Dropout

Randomly zero neurons (rate=0.1-0.5) and rescale the rest by 1/(1−rate). Prevents overfitting. Applied only during training.

Weight decay (L2)

Add λ||w||² to loss. Penalises large weights. Prevents overfitting.

Early stopping

Stop training when validation loss stops improving. Best model = lowest val loss.

Data augmentation

Random crop, flip, rotation, colour jitter. Creates artificial training variety.

Learning rate warmup

Start with tiny lr, gradually increase. Critical for Transformers.

Gradient accumulation

Simulate large batches by accumulating gradients over multiple small batches.
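Gradient accumulation works because the gradient of a mean loss over a large batch equals the average of the gradients over equally sized micro-batches. The sketch below checks this on a toy linear model y = w·x with made-up data; in PyTorch the same idea is commonly written as calling `(loss / accum_steps).backward()` per micro-batch and `optimiser.step()` once every `accum_steps` batches.

```python
def batch_grad(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 2.0 * x) for x in range(1, 9)]  # 8 samples of y = 2x
w = 0.5

# One large batch of 8 samples
full_grad = batch_grad(w, data)

# Four micro-batches of 2 samples, accumulated before a single update
accum_steps = 4
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += batch_grad(w, micro_batch) / accum_steps  # scale so the sum is a mean

lr = 0.01
w_new = w - lr * accumulated   # one optimizer step after all micro-batches
```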

08 PyTorch Basics

Python: PyTorch training loop

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# MyModel, train_loader and val_loader are assumed to be defined elsewhere
num_epochs = 10
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MyModel().to(device)
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=num_epochs)  # anneal over the full run

for epoch in range(num_epochs):
    # ── Training ──
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimiser.zero_grad()       # 1. clear gradients
        pred = model(X)             # 2. forward pass
        loss = criterion(pred, y)   # 3. compute loss
        loss.backward()             # 4. backprop
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 5. clip
        optimiser.step()            # 6. update weights
    scheduler.step()

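    # ── Validation ──
    model.eval()                    # dropout off, batch norm uses running stats
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():           # disable gradient tracking
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)
            val_pred = model(X)
            val_loss += criterion(val_pred, y).item() * y.size(0)
            correct += (val_pred.argmax(dim=1) == y).sum().item()
            total += y.size(0)
    print(f"epoch {epoch + 1}: val_loss={val_loss / total:.4f}  val_acc={correct / total:.3f}")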
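The validation pass is where early stopping (section 07) usually hooks in. The following framework-free sketch shows the patience logic only; the function name, the `min_delta` parameter and the validation curve are made up for the example, and a real loop would save a checkpoint whenever the best loss improves.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0    # a real loop would checkpoint here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch      # stop and restore the best weights
    return len(val_losses) - 1, best_epoch

# Made-up validation curve: improves, then starts to overfit
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```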

09 Keras/TensorFlow

Python: Keras quick model

import tensorflow as tf
from tensorflow import keras

# Sequential API
model = keras.Sequential([
    keras.Input(shape=(784,)),  # preferred over input_shape= on the first layer in Keras 3
    keras.layers.Dense(256, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    keras.callbacks.ModelCheckpoint('best.keras', save_best_only=True)
]

history = model.fit(
    X_train, y_train,
    epochs=50, batch_size=64,
    validation_split=0.2,
    callbacks=callbacks
)

model.evaluate(X_test, y_test)
predictions = model.predict(X_new)

10 Mini Quizzes

❓ Quiz 1

What problem do LSTMs solve that simple RNNs cannot?

Simple RNNs suffer from vanishing gradients: gradients shrink exponentially during backpropagation through time, which makes long-range dependencies very hard to learn. The LSTM cell state is updated additively, and its gates (forget, input, output) control what is kept, written and exposed, so gradients can flow across many timesteps with far less attenuation, enabling long-term memory.
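The exponential shrinkage can be seen numerically on a scalar RNN h_t = tanh(w · h_{t−1} + x_t). By the chain rule, the gradient of the last hidden state with respect to the first is a product with one factor w · (1 − h_t²) per timestep. The weight and inputs below are arbitrary illustrative values.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # chain rule: one factor per timestep
    return grad

short_grad = gradient_through_time(w=0.5, inputs=[0.1] * 5)    # 5 timesteps
long_grad = gradient_through_time(w=0.5, inputs=[0.1] * 50)    # 50 timesteps
```

Every factor here is below 0.5 in magnitude, so the 50-step gradient is many orders of magnitude smaller than the 5-step one.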

❓ Quiz 2

What is the key innovation in the Transformer architecture?

The Transformer's self-attention mechanism computes relationships between all pairs of tokens in parallel. Unlike RNNs, there is no step-by-step recurrence: within a layer all positions are processed simultaneously (autoregressive generation still emits tokens one at a time), enabling massive parallelism during training and scaling to very large models.