🤖 AI & ML
Machine Learning Complete Cheatsheet
From supervised learning to neural networks — master ML for data science.
01
ML Fundamentals
Supervised
Labelled data → predict output. Classification & Regression.
Unsupervised
No labels → find patterns. Clustering & Dimensionality reduction.
Reinforcement
Agent learns from rewards in environment.
Overfitting
Model memorizes training data — poor generalization.
Underfitting
Too simple — poor on both training and test.
Bias-Variance
High bias = underfitting. High variance = overfitting.
ℹ️
ML Workflow: Collect → Clean → Feature engineer → Split → Train → Evaluate → Tune → Deploy
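A minimal sketch of the Split → Train → Evaluate steps, assuming scikit-learn is installed. The synthetic dataset, the tree depths, and the `results` name are illustrative assumptions, not part of the workflow itself. A large gap between train and test accuracy points to overfitting; low accuracy on both points to underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y) so a deep tree has noise to memorize.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Split: hold out 30% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train and evaluate trees of increasing capacity.
results = {}
for depth in (1, 4, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f} test={results[depth][1]:.2f}")
```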
02
Supervised Learning
Logistic Regression
Despite name, a classifier. Probability via sigmoid function.
Decision Tree
Tree of if-else rules. Interpretable, overfits without pruning.
Random Forest
Ensemble of decision trees trained on bootstrap samples. Robust to noise; native missing-value handling depends on the implementation.
SVM
Finds the maximum-margin hyperplane separating classes. Works well on small datasets.
KNN
k-Nearest Neighbors. Simple but slow on large data.
XGBoost
Gradient-boosted trees. Often the strongest choice on tabular data and common in winning Kaggle solutions.
💡
Start with Logistic Regression or Random Forest as baseline, then try XGBoost.
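One way to act on this tip is to cross-validate both baselines before reaching for anything heavier. This is a sketch under assumptions: the breast cancer dataset, the `baselines` dictionary, and the F1 scoring are illustrative choices, and XGBoost is left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Logistic regression benefits from scaled features; the forest does not need scaling.
baselines = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold F1 per model; any later model should beat these numbers.
baseline_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in baselines.items()
}
for name, score in baseline_scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```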
03
Unsupervised Learning
K-Means
Partition into k clusters. Sensitive to outliers and initial centroids.
DBSCAN
Density-based. Finds arbitrary shapes, detects outliers automatically.
PCA
Reduces dimensions while preserving maximum variance.
t-SNE
Visualizes high-dimensional data in 2D/3D. Non-linear.
Hierarchical
Builds cluster tree (dendrogram). No need to specify k.
Autoencoders
Neural network learns compressed representation.
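The two most common entries above, K-Means and PCA, can be sketched together with scikit-learn. The blob dataset and the variable names are assumptions made for illustration; on real data, k usually has to be chosen by inspection or a score such as silhouette.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# 300 unlabeled points in 5 dimensions, generated around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# K-Means: n_init=10 reruns with different initial centroids and keeps the best fit,
# which reduces the sensitivity to initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# PCA: project to the 2 directions of highest variance, e.g. for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
print("variance kept by 2 components:", pca.explained_variance_ratio_.sum())
```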
04
Model Evaluation
Evaluation Metrics
Python: Key metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)
# Classification: y_pred holds predicted labels, y_prob positive-class probabilities
print(classification_report(y_test, y_pred))
auc = roc_auc_score(y_test, y_prob)  # e.g. y_prob = model.predict_proba(X_test)[:, 1]
# Regression
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # squared=False is deprecated since scikit-learn 1.4
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
Accuracy
Correct/Total; misleading on imbalanced classes
Precision
TP/(TP+FP) — when FP costly
Recall
TP/(TP+FN) — when FN costly
F1
Harmonic mean of precision and recall; useful on imbalanced data
AUC-ROC
Binary classification ranking quality
RMSE
Regression error in original units
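The classification formulas above can be checked by hand from confusion-matrix counts. The counts below are made up for illustration (an imbalanced problem with 12 positives out of 100); they show how accuracy can look strong while recall is weak.

```python
# Hypothetical confusion-matrix counts for an imbalanced binary problem.
tp, fp, fn, tn = 8, 2, 4, 86

accuracy = (tp + tn) / (tp + fp + fn + tn)   # Correct/Total
precision = tp / (tp + fp)                   # TP/(TP+FP)
recall = tp / (tp + fn)                      # TP/(TP+FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.94 precision=0.80 recall=0.67 f1=0.73
```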
05
Feature Engineering
Python: Feature engineering
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the scaler fitted on train!
# One-hot encoding
dummies = pd.get_dummies(df["category"], drop_first=True)
# Handle missing values (both calls return a new DataFrame)
df = df.dropna(subset=["target"])  # drop rows where the target is missing
df = df.fillna(df.mean(numeric_only=True))  # mean imputation, numeric columns only
# Feature creation
df["age_sq"] = df["age"] ** 2
df["name_len"] = df["name"].str.len()
df["is_weekend"] = df["date"].dt.dayofweek >= 5  # needs a datetime column; 5 = Sat, 6 = Sun
06
Common Algorithms
Logistic Regression
O(n·d) per training iteration, interpretable, good baseline
SVM
O(n²) to O(n³) training: slow on large n, strong on high-dimensional data with small n
Random Forest
O(k·n·d·log n) — robust, parallelizable
XGBoost
Best on tabular data, handles nulls, has feature importance
KNN
O(1) training (just stores the data) but O(n·d) per prediction: slow for large datasets
Neural Networks
Best for images, text, audio — needs lots of data
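To see why KNN costs nothing to train but pays at prediction time, here is a brute-force sketch in NumPy. The `knn_predict` function and the toy data are hypothetical; scikit-learn's `KNeighborsClassifier` can use tree-based indexes instead of scanning every point.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Distance from the query to every training point: n rows times d features.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = np.bincount(y_train[nearest])    # count labels among the neighbors
    return int(np.argmax(votes))

# "Training" is just keeping the data around.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.2])))  # 1
```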
07
Neural Networks
Python: Keras neural network
from tensorflow import keras
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),  # one input per feature
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")  # binary
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
ReLU
max(0, x); the most common hidden-layer activation
Sigmoid
Output 0-1 — binary classification
Softmax
Multi-class output probabilities
Dropout
Randomly zeros activations during training; reduces overfitting
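The four definitions above fit in a few lines of NumPy. This is a sketch for intuition rather than how Keras implements them internally; the function names are illustrative, and the dropout shown is the common "inverted" variant that rescales surviving activations by 1/(1 - rate).

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max first so exp() cannot overflow; the result is unchanged.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def dropout(x, rate, rng):
    # Keep each activation with probability (1 - rate), then rescale the survivors.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

logits = np.array([-1.0, 0.0, 2.0])
print(relu(logits))     # negative values become 0
print(sigmoid(0.0))     # 0.5
print(softmax(logits))  # probabilities that sum to 1
print(dropout(logits, rate=0.5, rng=np.random.default_rng(0)))
```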
08
Scikit-learn Pipeline
Python: Scikit-learn pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100))
])
pipe.fit(X_train, y_train)
# Cross-validate on the training split only so the test set stays unseen
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")  # "f1" assumes binary labels
# Hyperparameter tuning
params = {"clf__n_estimators": [50, 100, 200], "clf__max_depth": [None, 5, 10]}
grid = GridSearchCV(pipe, params, cv=5).fit(X_train, y_train)
09
Pandas Essentials
Python: Pandas essentials
import pandas as pd
df = pd.read_csv("data.csv")
df.head(5) # first 5 rows
df.info() # dtypes + nulls
df.describe() # statistics
df.isnull().sum() # null count per col
df["country"].value_counts() # frequency of each value in a column
# Filter
df[df["age"] > 25]
df[(df["age"] > 18) & (df["country"] == "PK")]
# GroupBy
df.groupby("country")["salary"].mean()
df.groupby("country").agg({"salary": ["mean","max"]})
# Merge
pd.merge(df1, df2, on="user_id", how="left")
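The merge and groupby calls above are easier to read on concrete data. The two small frames below are invented for illustration; they show that a left merge keeps unmatched rows with NaN, and that `mean()` skips those NaN values.

```python
import pandas as pd

# Hypothetical toy tables: user 3 has no salary record.
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["PK", "PK", "US"]})
salaries = pd.DataFrame({"user_id": [1, 2], "salary": [100, 300]})

# Left merge keeps every row of `users`; missing matches become NaN.
merged = pd.merge(users, salaries, on="user_id", how="left")
print(merged)

# Mean salary per country; a group with only NaN values yields NaN.
mean_by_country = merged.groupby("country")["salary"].mean()
print(mean_by_country)
```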
10
Mini Quizzes
❓ Quiz 1
What does the F1 score measure?
F1 = 2×(P×R)/(P+R). Balances precision and recall. Essential for imbalanced datasets where accuracy is misleading.
❓ Quiz 2
What is overfitting?
Overfitting = model learns noise in training data, so it scores well on training data but poorly on unseen data. Fix: more data, regularization (L1/L2), dropout, simpler model. Use cross-validation to detect it.