🤖 AI & ML
Machine Learning Complete Cheatsheet
From supervised learning to neural networks — master ML for data science.
📖 10 sections
⏱ 25 min read
✅ Quizzes included
01 ML Fundamentals
Supervised
Labelled data → predict output. Classification & Regression.
Unsupervised
No labels → find patterns. Clustering & Dimensionality reduction.
Reinforcement
Agent learns from rewards in environment.
Overfitting
Model memorizes training data — poor generalization.
Underfitting
Too simple — poor on both training and test.
Bias-Variance
High bias = underfitting. High variance = overfitting.
ℹ️
ML Workflow: Collect → Clean → Feature engineer → Split → Train → Evaluate → Tune → Deploy
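To make the overfitting and underfitting definitions above concrete, here is a minimal sketch that compares training and test accuracy for decision trees of three depths. The synthetic dataset, the label-noise level `flip_y=0.2`, and the depth values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem (assumed for illustration only).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for name, depth in [("stump", 1), ("moderate", 4), ("unpruned", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each depth.
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name:9s} train={results[name][0]:.2f} test={results[name][1]:.2f}")
```

A large gap between training and test accuracy points to high variance (overfitting). Low accuracy on both points to high bias (underfitting).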
02 Supervised Learning
Logistic Regression
Despite name, a classifier. Probability via sigmoid function.
Decision Tree
Tree of if-else rules. Interpretable, overfits without pruning.
Random Forest
Ensemble of trees trained on bootstrap samples. Robust; whether missing values are handled natively depends on the implementation.
SVM
Maximum-margin hyperplane separating classes. Works well on small to medium datasets.
KNN
k-Nearest Neighbors. Simple but slow on large data.
XGBoost
Gradient-boosted trees. Often the strongest choice on tabular data and a frequent Kaggle winner.
💡
Start with Logistic Regression or Random Forest as baseline, then try XGBoost.
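The baseline-first advice can be sketched as a small cross-validated comparison. The synthetic data, the `max_iter` setting, and the choice of accuracy as the metric are assumptions. XGBoost is left out so the example only needs scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy binary classification data (assumed for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

baselines = {
    # Logistic regression benefits from scaled inputs, so wrap it in a pipeline.
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

baseline_scores = {}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    baseline_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Any more complex model tried afterwards should beat these numbers to justify its extra cost.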
03 Unsupervised Learning
K-Means
Partition into k clusters. Sensitive to outliers and initial centroids.
DBSCAN
Density-based. Finds arbitrary shapes, detects outliers automatically.
PCA
Reduces dimensions while preserving maximum variance.
t-SNE
Visualizes high-dimensional data in 2D/3D. Non-linear.
Hierarchical
Builds cluster tree (dendrogram). No need to specify k.
Autoencoders
Neural network learns compressed representation.
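A minimal sketch of three of the methods above using scikit-learn. The blob dataset and the parameter values (`n_clusters=3`, `eps=2.0`, `min_samples=5`) are assumptions for illustration; real data usually needs these tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Three Gaussian blobs in 5 dimensions (assumed toy data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5,
                  cluster_std=1.0, random_state=0)

# K-Means: k is chosen up front; n_init restarts reduce sensitivity
# to the initial centroids.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no k required; points labelled -1 are treated as noise/outliers.
dbscan_labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X)
n_noise = int(np.sum(dbscan_labels == -1))

# PCA: project to 2 dimensions and report how much variance is kept.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = float(pca.explained_variance_ratio_.sum())
print(X_2d.shape, round(explained, 3), n_noise)
```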
04 Model Evaluation
Evaluation Metrics
Python: Key metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)

# Classification
print(classification_report(y_test, y_pred))
auc = roc_auc_score(y_test, y_prob)

# Regression
mse  = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of MSE; the squared=False argument was removed in newer scikit-learn
mae  = mean_absolute_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)
Accuracy
Correct/Total; misleading for imbalanced classes
Precision
TP/(TP+FP) — when FP costly
Recall
TP/(TP+FN) — when FN costly
F1
Harmonic mean of precision and recall; suited to imbalanced data
AUC-ROC
Ranking quality of a binary classifier across all thresholds
RMSE
Regression error in original units
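The classification formulas above can be checked by hand on a tiny example. The labels below are made up for illustration; the hand-computed values are compared against scikit-learn's functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 2 true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 1 false positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 2 false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # 5 true negatives

accuracy = (tp + tn) / len(y_true)                   # 7/10
precision = tp / (tp + fp)                           # 2/3
recall = tp / (tp + fn)                              # 2/4
f1 = 2 * precision * recall / (precision + recall)   # 4/7

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

Accuracy is 0.7 here even though half of the positives are missed, which is why recall and F1 are reported alongside it.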
05 Feature Engineering
Python: Feature engineering
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use fit from train!

# One-hot encoding
dummies = pd.get_dummies(df["category"], drop_first=True)  # returns a new DataFrame; keep the result

# Handle missing values
df = df.dropna(subset=["target"])            # drop rows with a missing target first
df = df.fillna(df.mean(numeric_only=True))   # then mean-impute the numeric columns

# Feature creation
df["age_sq"] = df["age"] ** 2
df["name_len"] = df["name"].str.len()
df["is_weekend"] = df["date"].dt.dayofweek >= 5
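The same transformations, run end to end on a tiny frame so the results can be inspected. The column values are invented for illustration, and the `category_` prefix is an assumption made to keep the dummy column names readable.

```python
import pandas as pd

# Invented toy data: one missing age, dates on a Saturday, Monday, and Sunday.
df = pd.DataFrame({
    "age": [20, 35, None],
    "name": ["Ada", "Grace", "Alan"],
    "date": pd.to_datetime(["2024-01-06", "2024-01-08", "2024-01-07"]),
    "category": ["a", "b", "a"],
})

df["age"] = df["age"].fillna(df["age"].mean())    # mean of 20 and 35 is 27.5
df["age_sq"] = df["age"] ** 2
df["name_len"] = df["name"].str.len()
df["is_weekend"] = df["date"].dt.dayofweek >= 5   # Monday is 0, so 5 and 6 are the weekend

# One-hot encode and attach the result; drop_first removes the "a" column.
dummies = pd.get_dummies(df["category"], prefix="category", drop_first=True)
df = pd.concat([df.drop(columns="category"), dummies], axis=1)
print(df)
```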
06 Common Algorithms
Logistic Regression
O(n·d) train, interpretable, good baseline
SVM
O(n²) to O(n³) train; slow on large n, strong on high-dimensional data with few samples
Random Forest
O(k·n·d·log n) — robust, parallelizable
XGBoost
Best on tabular data, handles nulls, has feature importance
KNN
O(1) train but O(n·d) per prediction; slow for large datasets
Neural Networks
Best for images, text, audio — needs lots of data
07 Neural Networks
Python: Keras neural network
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),        # one input per feature
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.3),                     # zero 30% of activations during training
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")    # binary output probability
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
ReLU
max(0,x); the most common hidden-layer activation
Sigmoid
Output 0-1 — binary classification
Softmax
Multi-class output probabilities
Dropout
Randomly zeros activations during training; reduces overfitting
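For reference, the four building blocks above can be written in a few lines of NumPy. This is a sketch of the standard definitions, not Keras internals; the function names and the "inverted dropout" rescaling by `1 / (1 - rate)` are choices made here.

```python
import numpy as np

def relu(x):
    # max(0, x) applied element-wise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability; outputs sum to 1.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def dropout(x, rate, rng):
    # Training-time inverted dropout: zero a fraction `rate` of the values
    # and rescale the rest so the expected activation is unchanged.
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(sigmoid(z))
print(softmax(z))
print(dropout(np.ones(8), rate=0.3, rng=np.random.default_rng(0)))
```

At inference time dropout is switched off, so the input passes through unchanged.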
08 Scikit-learn Pipeline
Python: Scikit-learn pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler  # used by the pipeline's "scaler" step

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100))
])

pipe.fit(X_train, y_train)
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")  # CV on training data only; "f1" assumes binary labels

# Hyperparameter tuning
params = {"clf__n_estimators": [50,100,200], "clf__max_depth": [None,5,10]}
grid = GridSearchCV(pipe, params, cv=5).fit(X_train, y_train)
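A self-contained sketch of what to do after the grid search: read off the best parameters and score the refit pipeline once on the held-out test set. The synthetic data, the smaller grid, and `cv=3` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy binary data (assumed for illustration only).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# "step__param" addresses a parameter of a named pipeline step.
params = {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]}
grid = GridSearchCV(pipe, params, cv=3, scoring="f1").fit(X_train, y_train)

print(grid.best_params_)   # best combination found by cross-validation
print(grid.best_score_)    # mean cross-validated F1 for that combination

# GridSearchCV refits the best pipeline on all of X_train by default,
# so it can be used directly for the final test-set evaluation.
test_f1 = f1_score(y_test, grid.predict(X_test))
print(test_f1)
```

Because the scaler lives inside the pipeline, it is refit on each training fold, which avoids leaking validation statistics into preprocessing.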
09 Pandas Essentials
Python: Pandas essentials
import pandas as pd
df = pd.read_csv("data.csv")

df.head(5)             # first 5 rows
df.info()              # dtypes + nulls
df.describe()          # statistics
df.isnull().sum()      # null count per col
df["country"].value_counts()  # frequency of each value in a column

# Filter
df[df["age"] > 25]
df[(df["age"] > 18) & (df["country"] == "PK")]

# GroupBy
df.groupby("country")["salary"].mean()
df.groupby("country").agg({"salary": ["mean","max"]})

# Merge
pd.merge(df1, df2, on="user_id", how="left")
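The filter, group-by, and merge calls above can be tried on two small frames. The `users` and `salaries` tables and their values are invented for illustration.

```python
import pandas as pd

# Invented toy tables; user 4 has no salary record.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["PK", "PK", "US", "US"],
    "age": [17, 30, 26, 40],
})
salaries = pd.DataFrame({
    "user_id": [1, 2, 3],
    "salary": [1000, 3000, 5000],
})

# Left join keeps every user; missing matches become NaN.
merged = pd.merge(users, salaries, on="user_id", how="left")

# Combine conditions with & and wrap each one in parentheses.
adults_pk = merged[(merged["age"] > 18) & (merged["country"] == "PK")]

# mean() skips NaN by default, so user 4 does not affect the US average.
mean_salary = merged.groupby("country")["salary"].mean()

print(merged)
print(adults_pk)
print(mean_salary)
```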
10 Mini Quizzes
❓ Quiz 1
What does the F1 score measure?
F1 = 2×(P×R)/(P+R). Balances precision and recall. Essential for imbalanced datasets where accuracy is misleading.
❓ Quiz 2
What is overfitting?
Overfitting = model learns noise in the training data and generalizes poorly. Fix: more data, regularization (L1/L2), dropout, or a simpler model. Use cross-validation to detect it.