Deep Learning Complete Cheatsheet
Neural networks, CNNs, RNNs, Transformers, PyTorch and Keras — complete deep learning guide.
01
Neural Network Basics
Neuron (perceptron)
Input → weighted sum + bias → activation → output. Mimics a brain neuron.
Layer
Input layer → hidden layers → output layer. Depth = number of hidden layers.
Forward pass
Input flows through network to produce prediction.
Backpropagation
Compute gradients via chain rule. Propagate error backward to update weights.
Epoch
One full pass through training dataset. Typically need many epochs.
Batch
Subset of training data processed together. Batch size = efficiency vs memory tradeoff.
Mini-batch gradient descent
Most common: 32-256 samples. Balances speed and stability.
Weight initialization
Xavier/Glorot for sigmoid/tanh. He init for ReLU. Keeps activation variance stable across layers, which helps against vanishing and exploding gradients.
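The neuron and forward-pass entries above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`neuron`, `dense_layer`) and all weights are made up for the example, and real frameworks do the same arithmetic with matrix operations.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Forward pass through a tiny 2 -> 2 -> 1 network (illustrative weights)
x = [2.0, 1.0]
hidden = dense_layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.5, -1.0], relu)
output = neuron(hidden, [1.0, -0.25], 0.0, sigmoid)
print(hidden, output)   # [0.5, 2.0] 0.5
```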
PYTHON: Network architecture notation
INPUT (784 for 28x28 image)
  ↓
DENSE (128 units, ReLU)      # first hidden layer
  ↓
DROPOUT (0.3)                # regularization
  ↓
DENSE (64 units, ReLU)       # second hidden layer
  ↓
BATCH NORM                   # normalize activations
  ↓
DENSE (10 units, Softmax)    # output: 10 classes

Params: 784×128 + 128×64 + 64×10 = 109,184 weights (biases and batch-norm parameters not counted)
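The parameter arithmetic for this stack can be checked with a short script. It is a sketch that assumes plain fully connected layers with one bias per unit and a batch-norm layer with a learnable scale and shift per feature; `dense_params` is a helper name invented here.

```python
def dense_params(n_in, n_out, bias=True):
    """Trainable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_sizes = [784, 128, 64, 10]
pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))

weights_only = sum(dense_params(i, o, bias=False) for i, o in pairs)
with_biases = sum(dense_params(i, o) for i, o in pairs)
batch_norm = 2 * 64   # learnable scale and shift for the 64 normalised features

print(weights_only)               # 109184
print(with_biases)                # 109386
print(with_biases + batch_norm)   # 109514
```

Dropout adds no parameters, so it does not appear in the count.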
02
Activation Functions
ReLU
max(0, x)
Most common. Fast. Dead neuron problem.
Leaky ReLU
max(0.01x, x)
Fixes dead neurons.
Sigmoid
1/(1+e^-x) → (0,1)
Binary output. Saturates for large |x|, so gradients vanish.
Tanh
(e^x-e^-x)/(e^x+e^-x) → (-1,1)
Zero-centered. Still saturates, so gradients still vanish.
Softmax
e^xi / Σe^xj → probabilities
Multiclass output. Sums to 1.
GELU
x · Φ(x), where Φ is the standard normal CDF
Used in Transformers. Smooth, ReLU-like curve.
Swish
x · sigmoid(x)
Self-gated. Good for deep networks.
💡
Hidden layers: ReLU or Leaky ReLU. Binary output: Sigmoid. Multiclass output: Softmax. Regression output: Linear (no activation).
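The formulas in this section translate almost directly into code. The sketch below uses only the standard library and scalar inputs; the function names are chosen for this example, and the GELU shown is the exact `x · Φ(x)` form rather than the tanh approximation some libraries use.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

for name, fn in [("relu", relu), ("leaky_relu", leaky_relu), ("sigmoid", sigmoid),
                 ("tanh", math.tanh), ("gelu", gelu), ("swish", swish)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], round(sum(probs), 6))
```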
03
Loss & Optimisation
Loss functions
MSE
Mean Squared Error. Regression. Penalises large errors heavily.
MAE
Mean Absolute Error. Regression. Robust to outliers.
Binary Cross-entropy
Binary classification. Log loss on a sigmoid output: −[y·log(p) + (1−y)·log(1−p)].
Categorical Cross-entropy
Multiclass. Softmax output. Loss = −log(predicted probability of the true class).
Sparse Categorical CE
Same but labels are integers not one-hot.
Hinge loss
SVMs and margin classifiers.
Focal loss
Imbalanced datasets. Down-weights easy examples.
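A minimal sketch of several of these losses in plain Python may make the definitions concrete. The function names and sample values are invented for the example; the focal loss shown is the binary form without the optional alpha class weight, and probabilities are clamped by an assumed `eps` to avoid `log(0)`.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clamp so log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def sparse_categorical_cross_entropy(labels, prob_rows, eps=1e-12):
    # labels are integer class ids; each row is a probability vector
    return sum(-math.log(max(row[k], eps)) for k, row in zip(labels, prob_rows)) / len(labels)

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p_t = p if t == 1 else 1.0 - p    # probability assigned to the true class
        p_t = min(max(p_t, eps), 1.0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(y_true)

# One prediction is off by 3: MSE reacts much more strongly than MAE
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]), mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))   # 3.0 1.0

# An easy example (p = 0.9 for the true class) is down-weighted by focal loss
print(binary_cross_entropy([1], [0.9]), focal_loss([1], [0.9]))
```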
Optimizers
SGD
Stochastic Gradient Descent. Simple. Noisy. lr=0.01-0.1.
SGD + Momentum
Adds velocity term. Faster convergence. momentum=0.9.
Adam
Adaptive learning rates. Most popular default. lr=0.001.
AdaGrad
Per-parameter lr. Good for sparse data. lr decays fast.
RMSprop
Like AdaGrad but decays old gradients. Good for RNNs.
AdamW
Adam with decoupled weight decay. Usually regularises better than an L2 penalty added to the loss under Adam. Standard for Transformers.
Learning rate
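Controls step size. Too high: training diverges. Too low: training is slow. Typical range to try: 1e-5 to 1e-3.
LR schedule
Cosine annealing, step decay, warmup. Often critical for Transformers.
Gradient clipping
Rescale gradients when their norm exceeds a threshold (e.g. max_norm = 1.0). Guards against exploding gradients, especially in RNNs/LSTMs.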
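The update rules behind a few of these entries can be sketched for a single scalar weight. This is an illustration under assumptions, not a reference implementation: the momentum step follows the convention where velocity accumulates raw gradients, the Adam step uses the usual bias-corrected moments, and all function names are invented here.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    # Velocity accumulates past gradients; the weight moves along the velocity
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t (1-based)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def clip_by_norm(grads, max_norm=1.0):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, velocity = 0.0, 0.0
for _ in range(200):
    w, velocity = sgd_momentum_step(w, 2 * (w - 3), velocity, lr=0.05)
print(round(w, 3))

# The first Adam step has size close to lr, whatever the gradient scale
w_first, _, _ = adam_step(0.0, -6.0, 0.0, 0.0, t=1)
print(w_first)

print(clip_by_norm([3.0, 4.0]))   # norm 5 rescaled to norm 1
```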
04
CNNs
PYTHON: CNN architecture
# Convolutional layers learn spatial features
# Conv → Pool → Conv → Pool → Flatten → Dense
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 3ch→32ch
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # /2 spatially
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            # 64 * 8 * 8 assumes 32x32 inputs: two 2x2 pools give 32 → 16 → 8
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),                  # raw logits, no softmax
        )

    def forward(self, x):
        x = self.features(x)           # conv layers
        x = x.flatten(start_dim=1)     # flatten all but the batch dimension
        return self.classifier(x)      # dense layers
Kernel
Filter matrix sliding over input. Detects features (edges, textures).
Stride
Step size of kernel. Stride=2 halves spatial dimensions.
Padding
'same' pads so the output size matches the input (at stride 1). 'valid' uses no padding, so the output shrinks.
MaxPool
Takes max in window. Reduces size, keeps strongest features.
Transfer learning
Use pretrained weights (ResNet, EfficientNet, VGG) as starting point. Fine-tune on your data.
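How kernel size, stride and padding determine the spatial size can be worked out with the standard formula floor((n + 2p − k) / s) + 1. The sketch below applies it to a Conv → Pool → Conv → Pool stack like the one in this section, assuming a 32x32 input (an assumption; the source does not state the input size). `conv_out` is a helper name invented for the example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

side = 32                                    # assumed 32x32 input
side = conv_out(side, kernel=3, padding=1)   # 3x3 conv with padding 1 keeps 32
side = conv_out(side, kernel=2, stride=2)    # 2x2 max-pool halves to 16
side = conv_out(side, kernel=3, padding=1)   # second conv keeps 16
side = conv_out(side, kernel=2, stride=2)    # second pool halves to 8
flat_features = 64 * side * side             # 64 channels after the second conv
print(side, flat_features)                   # 8 4096
```

With a 28x28 input the same stack would end at 7x7, so the first linear layer would need 64 * 7 * 7 inputs instead.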
05
RNNs & LSTMs
PYTHON: RNN and LSTM
# Recurrent networks: hidden state carries memory across timesteps
import torch.nn as nn

# Simple RNN (suffers from vanishing gradients on long sequences)
rnn = nn.RNN(input_size=50, hidden_size=128, batch_first=True)

# LSTM: gates mitigate the vanishing gradient problem
lstm = nn.LSTM(
    input_size=100,       # embedding dimension
    hidden_size=256,      # hidden state size
    num_layers=2,         # stacked LSTMs
    dropout=0.3,          # between layers (not after the last one)
    bidirectional=True,   # process forward AND backward; output size = 2 * hidden_size
    batch_first=True,
)

# GRU: simpler than LSTM, often similar performance
gru = nn.GRU(input_size=100, hidden_size=256, batch_first=True)

# LSTM gates:
#   Forget gate: what to remove from cell state
#   Input gate:  what new info to store
#   Output gate: what to output from cell state
Cell state
Long-term memory in LSTM. Flows with minor modifications.
Hidden state
Short-term memory. Output at each timestep.
Bidirectional
Process sequence left-to-right AND right-to-left. Better context.
Seq2Seq
Encoder reads input sequence → context vector → Decoder generates output. Translation, summarisation.
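To see how the gates, cell state and hidden state interact, here is a single-unit LSTM step written out by hand. It is a sketch with scalar weights chosen purely for illustration (the `params` dictionary and `lstm_cell_step` are not part of any library); real layers apply the same equations to vectors and matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, p):
    """One timestep of a single-unit LSTM. p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose

    c = f * c_prev + i * g            # cell state: long-term memory
    h = o * math.tanh(c)              # hidden state: output at this timestep
    return h, c

# Illustrative parameters: a forget bias of +4 keeps the forget gate almost fully open
params = {
    "forget": (0.0, 0.0, 4.0),
    "input": (1.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.0] * 10:                  # ten timesteps with no new input
    h, c = lstm_cell_step(x, h, c, params)
print(round(c, 3), round(h, 3))
```

With zero input the candidate is zero, so the cell state is only multiplied by the forget gate at each step and most of the initial value survives the ten steps.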
06
Transformers & Attention
PYTHON: Transformer and Attention
# Self-attention: each token attends to all other tokens
# Q = Query, K = Key, V = Value
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

# Multi-head attention: multiple parallel attention heads
# Each head learns different relationship patterns
import torch.nn as nn
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Transformer encoder block:
# Input → MultiHeadAttention → Add&Norm → FFN → Add&Norm

# BERT: bidirectional encoder (good for classification, NER)
# GPT: autoregressive decoder (good for text generation)
# T5: encoder-decoder (good for seq2seq tasks)

# Positional encoding: adds position info since attention itself is order-agnostic
# Sinusoidal (original Transformer), learned absolute (BERT), or rotary/relative (RoPE in LLaMA)

# Key hyperparameters:
# num_heads: 8, 12, 16 | d_model: 512, 768, 1024
# num_layers: 6, 12, 24 | d_ff (FFN): 4×d_model
💡
Transformers process all tokens of a sequence in parallel during training (unlike RNNs); autoregressive generation still emits one token at a time. This enables massive parallelism and scaling. The attention mechanism is the key innovation.
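The attention formula softmax(QK^T / sqrt(d_k)) V can be traced by hand on a toy example. The sketch below is plain Python over lists of rows with made-up 2-dimensional embeddings; it omits the learned Q/K/V projections, masking and multiple heads that a real Transformer layer adds.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small matrices stored as lists of rows."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value rows
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with 2-dimensional embeddings; self-attention uses Q = K = V = X
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens, so every row sums to 1.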
07
Training Tricks
Batch Normalization
Normalize activations within a batch. Speeds training, reduces sensitivity to init.
Layer Normalization
Normalize across features (not batch). Used in Transformers. Better for variable-length sequences.
Dropout
Randomly zero neurons (rate=0.1-0.5). Prevents overfitting. Applied only during training.
Weight decay (L2)
Add λ||w||² to loss. Penalises large weights. Prevents overfitting.
Early stopping
Stop training when validation loss stops improving. Best model = lowest val loss.
Data augmentation
Random crop, flip, rotation, colour jitter. Creates artificial training variety.
Learning rate warmup
Start with tiny lr, gradually increase. Critical for Transformers.
Gradient accumulation
Simulate large batches by accumulating gradients over multiple small batches.
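Two of the tricks above, learning rate warmup and early stopping, are easy to sketch without any framework. The code below is one possible formulation: linear warmup followed by cosine decay, and a patience counter on validation loss. The names `lr_at` and `EarlyStopping`, the default values, and the validation curve are all invented for the example.

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class EarlyStopping:
    """Signal a stop after `patience` epochs without a new best validation loss."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

print([lr_at(s, total_steps=1000) for s in (0, 99, 550, 1000)])

stopper = EarlyStopping(patience=2)
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]   # made-up validation curve
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)   # 4 0.65
```

Note the trade-off visible in the made-up curve: with a patience of 2 the loop stops before the later improvement to 0.64, which is why patience is usually tuned rather than fixed.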
08
PyTorch Basics
PYTHON: PyTorch training loop
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# MyModel, train_loader, val_loader and num_epochs are placeholders defined elsewhere
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MyModel().to(device)
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()   # expects raw logits, not softmax outputs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=num_epochs)

for epoch in range(num_epochs):
    # ── Training ──
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimiser.zero_grad()                               # 1. clear gradients
        pred = model(X)                                     # 2. forward pass
        loss = criterion(pred, y)                           # 3. compute loss
        loss.backward()                                     # 4. backprop
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # 5. clip
        optimiser.step()                                    # 6. update weights
    scheduler.step()                                        # once per epoch

    # ── Validation ──
    model.eval()
    val_loss = 0.0
    with torch.no_grad():                                   # disable gradient tracking
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)
            val_pred = model(X)
            val_loss += criterion(val_pred, y).item() * X.size(0)
    val_loss /= len(val_loader.dataset)                     # mean validation loss
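For a version of the loop that runs end to end, the sketch below fills in the missing pieces with stand-ins: a synthetic dataset, a two-layer `nn.Sequential` model and a CPU-only run. None of these come from the source; they are assumptions chosen so the loop can be executed as written.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic binary task (made up for the example): the label is the sign of feature 0
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()
train_loader = DataLoader(TensorDataset(X[:384], y[:384]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[384:], y[384:]), batch_size=64)

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

def evaluate(loader):
    """Mean loss and accuracy over a loader, with gradients disabled."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            logits = model(xb)
            total_loss += criterion(logits, yb).item() * xb.size(0)
            correct += (logits.argmax(dim=1) == yb).sum().item()
            count += xb.size(0)
    return total_loss / count, correct / count

initial_val_loss, _ = evaluate(val_loader)

for epoch in range(10):
    model.train()
    for xb, yb in train_loader:
        optimiser.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimiser.step()

final_val_loss, final_val_acc = evaluate(val_loader)
print(f"val loss {initial_val_loss:.3f} -> {final_val_loss:.3f}, acc {final_val_acc:.2f}")
```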
09
Keras/TensorFlow
PYTHON: Keras quick model
import tensorflow as tf
from tensorflow import keras

# X_train, y_train, X_test, y_test and X_new are placeholder arrays defined elsewhere
# Sequential API
model = keras.Sequential([
    keras.Input(shape=(784,)),                       # flattened 28x28 image
    keras.layers.Dense(256, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',          # integer labels, not one-hot
    metrics=['accuracy'],
)

# All three callbacks monitor val_loss by default
callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    keras.callbacks.ModelCheckpoint('best.keras', save_best_only=True),
]

history = model.fit(
    X_train, y_train,
    epochs=50, batch_size=64,
    validation_split=0.2,
    callbacks=callbacks,
)

model.evaluate(X_test, y_test)
predictions = model.predict(X_new)
10
Mini Quizzes
❓ Quiz 1
What problem do LSTMs solve that simple RNNs cannot?
Simple RNNs suffer from vanishing gradients: gradients shrink exponentially during backpropagation through time, making long-range dependencies very hard to learn. LSTM gates (forget, input, output) control an additive cell-state path along which gradients can flow with little attenuation, enabling long-term memory.
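The exponential shrinkage is easy to see numerically. In this back-of-the-envelope sketch the gradient is multiplied by a constant factor at every timestep; the two factors (0.5 for a saturating simple RNN, 0.99 for an LSTM forget gate held near 1) are assumed values picked for illustration, not measurements.

```python
def gradient_scale(per_step_factor, steps):
    """Size of a gradient after being multiplied by the same factor at every timestep."""
    return per_step_factor ** steps

steps = 100
rnn_like = gradient_scale(0.5, steps)     # assumed factor for a saturating simple RNN
lstm_like = gradient_scale(0.99, steps)   # assumed forget-gate value close to 1
print(f"{rnn_like:.2e}  {lstm_like:.3f}")
```

After 100 steps the first gradient is effectively zero, while the second keeps roughly a third of its size.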
❓ Quiz 2
What is the key innovation in the Transformer architecture?
The Transformer's self-attention mechanism computes relationships between all pairs of tokens in parallel. Unlike RNNs, there is no step-by-step recurrence over the sequence during training: all positions are processed simultaneously, enabling massive parallelism and scaling to very large models.