Statistics for AI
Probability, distributions, hypothesis testing, Bayesian thinking, and information theory for ML practitioners.
01 Probability Foundations
Sample space
Set of all possible outcomes.
Event
Subset of sample space.
P(A)
0 to 1. P(certain)=1, P(impossible)=0.
Independence
P(A,B)=P(A)*P(B). Knowing that A occurred does not change the probability of B.
Conditional
P(A|B)=P(A,B)/P(B), defined for P(B)>0. Probability of A given that B occurred.
Bayes' Theorem
P(A|B)=P(B|A)*P(A)/P(B). Update beliefs with evidence.
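The definitions above can be checked by brute-force enumeration on a small sample space. The following is a minimal sketch, assuming two fair six-sided dice; the events `A`, `B`, `C` and the helpers `prob`, `joint`, and `conditional` are illustrative names, not part of this sheet.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of two fair dice (illustrative assumption).
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) for equally likely outcomes; `event` is a predicate on an outcome."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

# Hypothetical events chosen for illustration.
A = lambda o: o[0] % 2 == 0        # first die is even
B = lambda o: o[1] == 6            # second die shows 6
C = lambda o: o[0] + o[1] >= 10    # total is at least 10

def joint(e1, e2):
    """P(e1, e2): both events occur."""
    return prob(lambda o: e1(o) and e2(o))

def conditional(e1, e2):
    """P(e1 | e2) = P(e1, e2) / P(e2)."""
    return joint(e1, e2) / prob(e2)

# Independence: P(A,B) equals P(A)*P(B) because the dice do not influence each other.
print(joint(A, B), prob(A) * prob(B))

# Dependence: knowing B (second die is 6) makes C (total >= 10) more likely.
print(prob(C), conditional(C, B))

# Bayes' theorem: P(B|C) = P(C|B) * P(B) / P(C).
print(conditional(B, C), conditional(C, B) * prob(B) / prob(C))
```

Using `Fraction` keeps every probability exact, so the two sides of each identity can be compared with `==` rather than with a floating-point tolerance.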
STATS Bayes example
# Medical test: 1% have disease (prior)
# Test: 99% true positive rate, 1% false positive rate
P_disease = 0.01
P_pos_given_disease = 0.99
P_pos_given_healthy = 0.01
# Total probability of a positive test
P_pos = P_pos_given_disease*P_disease + P_pos_given_healthy*(1 - P_disease)
P_disease_given_pos = P_pos_given_disease*P_disease/P_pos
# Result: 0.5, so only a 50% chance of disease after a positive test!
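To see how strongly the result depends on the prior, the same calculation can be wrapped in a function. This is a sketch only; the function name `posterior` and the 10% prevalence scenario are assumptions added for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Law of total probability: P(pos) = P(pos|D)*P(D) + P(pos|not D)*P(not D)
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_pos

# Same numbers as the medical test example: 1% prevalence, 99% sensitivity,
# 1% false positive rate.
print(posterior(0.01, 0.99, 0.01))

# Hypothetical what-if: a higher-risk population with 10% prevalence.
print(posterior(0.10, 0.99, 0.01))
```

With the test itself unchanged, raising the prior from 1% to 10% lifts the posterior from 50% to roughly 92%, which is why base rates matter when reading classifier outputs on rare classes.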
02 Key Distributions
| Distribution | PMF/PDF | Mean | Variance | ML use |
|---|---|---|---|---|
| Bernoulli(p) | p^x*(1-p)^(1-x) | p | p(1-p) | Binary classification |
| Binomial(n,p) | C(n,k)*p^k*(1-p)^(n-k) | np | np(1-p) | Count successes in n trials |
| Gaussian N(mu,sigma^2) | 1/(sigma*sqrt(2*pi))*exp(-((x-mu)/sigma)^2/2) | mu | sigma^2 | Noise models, weight initialization |
| Poisson(lambda) | lambda^k*e^(-lambda)/k! | lambda | lambda | Event counts |
| Exponential(lambda) | lambda*e^(-lambda*x) | 1/lambda | 1/lambda^2 | Time between events |
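
The mean and variance columns can be sanity-checked by sampling. The sketch below assumes NumPy's `Generator` API; the parameter values are arbitrary choices for illustration, not values from the table.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_samples = 200_000

# Illustrative parameter choices (assumptions, not from the table).
p, n, mu, sigma, lam = 0.3, 10, 2.0, 1.5, 4.0

# name -> (samples, theoretical mean, theoretical variance)
checks = {
    "Bernoulli":   (rng.binomial(1, p, n_samples), p, p * (1 - p)),
    "Binomial":    (rng.binomial(n, p, n_samples), n * p, n * p * (1 - p)),
    "Gaussian":    (rng.normal(mu, sigma, n_samples), mu, sigma**2),
    "Poisson":     (rng.poisson(lam, n_samples), lam, lam),
    # NumPy parameterizes the exponential by scale = 1/lambda, not by the rate.
    "Exponential": (rng.exponential(1 / lam, n_samples), 1 / lam, 1 / lam**2),
}

for name, (samples, mean, var) in checks.items():
    print(f"{name:12s} mean {samples.mean():.4f} (theory {mean:.4f})  "
          f"var {samples.var():.4f} (theory {var:.4f})")
```

The sample statistics should land close to the table's formulas, with small differences that shrink as `n_samples` grows.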
03 Statistical Inference
STATS Confidence intervals & tests
# Confidence interval for the mean (known population std sigma)
CI = x_bar +- z*(sigma/sqrt(n))
95% CI: z=1.96, 99% CI: z=2.576
# t-test (unknown population std; use sample std s)
t = (x_bar - mu0) / (s/sqrt(n))
degrees of freedom = n-1
# p-value interpretation (significance level alpha=0.05):
p < 0.05: reject H0 (statistically significant)
p >= 0.05: fail to reject H0 (this is not evidence that H0 is true)
# Effect size (Cohen's d)
d = (mean1 - mean2) / pooled_std
Small: d=0.2, Medium: d=0.5, Large: d=0.8
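The formulas above translate directly into standard-library Python. This is a minimal sketch: the two samples are made-up accuracy numbers, and the helper names `mean_ci`, `one_sample_t`, and `cohen_d` are hypothetical.

```python
import math
import statistics

# Hypothetical measurements, e.g. model accuracy (%) over repeated runs.
baseline = [71.2, 69.8, 70.5, 72.1, 70.9, 71.5, 69.9, 70.7]
variant = [72.4, 71.9, 73.0, 72.8, 71.5, 73.3, 72.2, 72.6]

def mean_ci(sample, z=1.96):
    """Approximate 95% CI for the mean using the normal critical value z.

    The sample std stands in for sigma; for small n a t critical value
    with n-1 degrees of freedom would give a slightly wider interval.
    """
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # s/sqrt(n)
    return x_bar - z * se, x_bar + z * se

def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) with t = (x_bar - mu0) / (s/sqrt(n))."""
    n = len(sample)
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    return t, n - 1

def cohen_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

print("CI for variant mean:", mean_ci(variant))
print("t, df against mu0=71:", one_sample_t(variant, 71.0))
print("Cohen's d (variant - baseline):", cohen_d(variant, baseline))
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_1samp` is one option for that step.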
❓ Quiz
In ML, what does a p-value < 0.05 indicate?
p < 0.05 means the result is statistically significant at the 5% level: if H0 were true, data at least as extreme as what was observed would occur less than 5% of the time. It is not the probability that H0 is true.
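A quick simulation makes the "5% if H0 is true" reading concrete. The sketch below is an assumption-laden illustration: it uses a two-sided z-test with known standard deviation (chosen because its p-value needs only `math.erfc`), and the helper name `z_test_p_value` is hypothetical.

```python
import math
import random

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, with known population std sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal, via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)  # fixed seed so the run is repeatable
n_experiments, n, alpha = 5000, 30, 0.05

# Every experiment is drawn with H0 true (mean 0, std 1),
# so each rejection is a false alarm.
false_alarms = sum(
    z_test_p_value([random.gauss(0.0, 1.0) for _ in range(n)], mu0=0.0, sigma=1.0) < alpha
    for _ in range(n_experiments)
)
false_alarm_rate = false_alarms / n_experiments
print(f"Rejected a true H0 in {false_alarm_rate:.1%} of experiments")
```

The rejection rate should sit near 5%, which is what the threshold controls: how often a true H0 gets rejected, not how likely H0 is after seeing the data.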
04 Information Theory
Entropy H(X)
-sum(p*log2(p)). Measures uncertainty/randomness.
High entropy
Uniform distribution. Maximum uncertainty.
Low entropy
Concentrated distribution. Predictable.
Cross-entropy
H(p,q)=-sum(p*log(q)), with p the true distribution and q the predicted one. Standard loss function in classification.
KL divergence
D_KL(P||Q)=sum(p*log(p/q)). How different Q is from P. Not symmetric: D_KL(P||Q) != D_KL(Q||P) in general.
Mutual information
I(X;Y)=H(Y)-H(Y|X). How much knowing X reduces uncertainty about Y.
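The quantities above are closely linked: cross-entropy decomposes as H(p,q) = H(p) + D_KL(p||q), and mutual information is the KL divergence between a joint distribution and the product of its marginals. The following sketch checks both numerically; the distributions `p` and `q` are made up, and everything uses log base 2 (bits), whereas ML loss functions usually use the natural log, which differs only by a constant factor.

```python
import math

def entropy(p):
    """H(p) = -sum p_i * log2(p_i), treating 0*log(0) as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log2(q_i); assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) from a joint probability table (rows index X, columns index Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical distributions over three classes.
p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model's predicted distribution

print(entropy([1/3, 1/3, 1/3]), entropy(p))                    # uniform has the higher entropy
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))   # the two agree
print(kl_divergence(p, q), kl_divergence(q, p))                # KL is not symmetric

# Independent fair coins share no information; a perfect copy shares one full bit.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```

Because H(p) is fixed by the data, minimizing cross-entropy during training is equivalent to minimizing the KL divergence from the true distribution to the model's predictions.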
STATS Entropy calculation
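# Binary entropy: p=0.5 vs p=0.9
import numpy as np
def entropy(p):
    # Clip p away from 0 and 1: 0*log2(0) would otherwise evaluate to nan
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p*np.log2(p)-(1-p)*np.log2(1-p)
entropy(0.5) # 1.0 (maximum)
entropy(0.9) # 0.469 (less uncertain)
entropy(1.0) # ~0.0 (certain)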
# Cross-entropy loss (classification)
loss = -np.sum(y_true * np.log(y_pred))
05 Correlation & Regression
STATS Correlation types
# Pearson r: linear correlation
r = cov(X,Y)/(std(X)*std(Y))
# Spearman: rank correlation (captures monotonic, possibly non-linear relationships)
from scipy.stats import spearmanr
r_s, p = spearmanr(x, y)
# Linear regression (OLS)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)
# R-squared: proportion of variance explained
model.score(X_test, y_test)
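To make the Pearson/Spearman difference and the R-squared definition concrete, here is a NumPy-only sketch. The data are synthetic, the helpers `pearson` and `spearman` are hypothetical names, and the hand-rolled rank correlation and `np.linalg.lstsq` fit are swapped in for `spearmanr` and `LinearRegression` purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y grows monotonically but non-linearly with x.
x = np.linspace(0.0, 5.0, 200)
y = np.exp(x)

def pearson(a, b):
    """r = cov(a, b) / (std(a) * std(b))."""
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    """Pearson correlation of the ranks (this simple ranking assumes no ties)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

print("Pearson :", pearson(x, y))    # below 1: the relationship is not linear
print("Spearman:", spearman(x, y))   # 1 up to rounding: the relationship is monotonic

# OLS on noisy linear data: y2 = 3*x + 2 + noise (made-up coefficients).
y2 = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=x.size)
X = np.column_stack([x, np.ones_like(x)])   # design matrix with an intercept column
(slope, intercept), *_ = np.linalg.lstsq(X, y2, rcond=None)

# R-squared = 1 - SS_res / SS_tot
residuals = y2 - (slope * x + intercept)
r_squared = 1.0 - np.sum(residuals**2) / np.sum((y2 - y2.mean())**2)
print("slope, intercept, R^2:", slope, intercept, r_squared)
```

Here R-squared is computed on the same data used for fitting; scoring on held-out data, as `model.score(X_test, y_test)` does, is the better estimate of how the fit generalizes.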
⚠ Correlation does NOT imply causation. Always check for confounding variables before drawing conclusions.
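A small simulation shows how a confounder can manufacture a correlation. In this sketch (synthetic data, made-up coefficients, hypothetical helper `residualize`), a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: a confounder z drives both x and y; x has no effect on y.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = -1.5 * z + rng.normal(size=n)

def residualize(v, control):
    """Remove the part of v that a straight-line fit on `control` explains."""
    slope, intercept = np.polyfit(control, v, deg=1)
    return v - (slope * control + intercept)

raw_r = np.corrcoef(x, y)[0, 1]
# Partial correlation: correlate what is left of x and y after controlling for z.
partial_r = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"corr(x, y)             = {raw_r:.3f}")
print(f"corr(x, y | z removed) = {partial_r:.3f}")
```

The raw correlation is strongly negative, yet it nearly vanishes once `z` is controlled for. This only works when the confounder is measured; an unobserved confounder cannot be regressed out this way.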