Statistics for Data Science Cheatsheet
Descriptive stats, hypothesis testing, regression, Bayesian stats and Python scipy.
📖 10 sections
⏱ 26 min read
✅ Quizzes included
01 Descriptive Statistics
Central Tendency
Mean
x̄ = Σxᵢ/n
Sum of all values divided by count
Median
Middle value when sorted
Better for skewed data; robust to outliers
Mode
Most frequent value
Can be multimodal
Geometric mean
(x₁×x₂×...×xₙ)^(1/n)
Use for growth rates, ratios
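The four measures above can be compared on one small dataset. This is a minimal sketch: the values and growth factors are made up for illustration, and `statistics.multimode` is used so that multimodal data returns every mode.

```python
import statistics
import numpy as np
from scipy import stats

values = [2, 3, 3, 5, 7, 10, 40]       # illustrative data with one large outlier

mean = np.mean(values)                  # pulled upward by the outlier
median = np.median(values)              # middle value of the sorted data
modes = statistics.multimode(values)    # a list, because data can be multimodal

# Geometric mean of growth factors: +10%, +50%, -20%
growth = [1.10, 1.50, 0.80]
gmean = stats.gmean(growth)             # average per-period growth factor

print(f"mean={mean:.2f}, median={median}, modes={modes}, gmean={gmean:.4f}")
```

The mean lands well above the median here because a single value of 40 dominates the sum, which is why the median is usually reported for skewed data.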
Spread / Variability
Variance
σ² = Σ(xᵢ-μ)²/N (population); s² = Σ(xᵢ-x̄)²/(n-1) (sample)
Divide by N for a population, by n-1 for a sample (Bessel's correction)
Std deviation
σ = √variance
Same units as data
IQR
Q3 - Q1
Interquartile range — robust to outliers
Range
max - min
Sensitive to outliers
Z-score
z = (x-μ)/σ
How many std devs from mean
CV
σ/μ × 100%
Coefficient of variation — relative spread
💡
Always report BOTH central tendency AND spread. Mean alone is misleading for skewed distributions. Use median + IQR for skewed data.
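The spread measures above map onto a few NumPy calls. In this sketch the data values are illustrative. Note that NumPy's default `ddof=0` gives the population formula, so the sample versions need `ddof=1`.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 13, 16, 11, 14, 15, 12])

var_pop = np.var(data)               # divides by n (ddof=0 is NumPy's default)
var_sample = np.var(data, ddof=1)    # divides by n-1
std_sample = np.std(data, ddof=1)    # same units as the data

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                        # robust to outliers
data_range = data.max() - data.min() # sensitive to outliers

z_scores = (data - data.mean()) / std_sample   # distance from the mean in std devs
cv = std_sample / data.mean() * 100            # relative spread, in percent

print(f"var_pop={var_pop:.3f}, var_sample={var_sample:.3f}, IQR={iqr}, CV={cv:.1f}%")
```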
02 Probability Distributions
Uniform
All outcomes equally likely. U(a,b): mean=(a+b)/2.
Bernoulli
Single trial: success (p) or failure (1-p). Mean=p, Var=p(1-p).
Binomial
n trials, P(X=k) = C(n,k)×pᵏ×(1-p)ⁿ⁻ᵏ. Mean=np, Var=np(1-p).
Poisson
Events in time/space. P(X=k) = e^(-λ)·λᵏ/k!. Mean = Var = λ.
Normal
Bell curve. N(μ,σ²). 68-95-99.7 rule. Central to statistics.
t-distribution
Like normal but heavier tails. Use when σ unknown + small n. Approaches normal as n→∞.
Chi-square
Sum of squared standard normals. Used for independence tests.
F-distribution
Ratio of two chi-squares, each divided by its degrees of freedom. Used in ANOVA and regression F-test.
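Most of the distributions listed above are available in `scipy.stats`. This short sketch evaluates the binomial and Poisson formulas and checks the 68-95-99.7 rule numerically. The parameter values are arbitrary choices for illustration.

```python
from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
p_binom = stats.binom.pmf(3, n=10, p=0.5)      # C(10,3) / 2**10

# Poisson: probability of exactly 2 events when the mean rate is 4
p_pois = stats.poisson.pmf(2, mu=4)            # e^(-4) * 4**2 / 2!

# Normal: share of probability within 1, 2 and 3 standard deviations of the mean
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

print(f"binomial={p_binom:.4f}, poisson={p_pois:.4f}")
for k, share in within.items():
    print(f"within ±{k}σ: {share:.4f}")
```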
68-95-99.7 rule
μ±σ (68%), μ±2σ (95%), μ±3σ (99.7%)
Normal distribution
CLT
x̄ ≈ N(μ, σ²/n) for large n
Central Limit Theorem — basis of inference
Standard error
SE = σ/√n
Standard deviation of sampling distribution of mean
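A quick simulation illustrates the CLT and the standard error formula. This sketch assumes an exponential population (mean 2, standard deviation 2), chosen because it is clearly non-normal. The sample size, number of repetitions, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
n, n_samples = 50, 20_000

# Exponential(scale=2) is strongly skewed: mean 2, standard deviation 2
samples = rng.exponential(scale=2.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)   # one mean per simulated sample

theoretical_se = 2.0 / np.sqrt(n)     # SE = σ/√n
observed_se = sample_means.std(ddof=1)

print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"SE theory={theoretical_se:.3f}, simulated={observed_se:.3f}")
```

A histogram of `sample_means` should look close to a bell curve even though the underlying data are skewed.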
03 Hypothesis Testing
Hypothesis testing framework
STEPS:
1. State H₀ (null) and H₁ (alternative)
2. Choose significance level α (usually 0.05)
3. Collect data and calculate test statistic
4. Calculate p-value: P(data at least this extreme | H₀ true)
5. Decision: p < α → reject H₀ | p ≥ α → fail to reject H₀

TYPES OF ERROR:
  Type I (α):  Reject H₀ when it's true  (false positive)
  Type II (β): Fail to reject H₀ when false (false negative)
  Power = 1 - β = P(correctly rejecting false H₀)

ONE-TAILED vs TWO-TAILED:
  H₁: μ > μ₀   (one-tailed, right)
  H₁: μ < μ₀   (one-tailed, left)
  H₁: μ ≠ μ₀   (two-tailed — most common)

COMMON TEST STATISTICS:
  z-test (σ known): z = (x̄-μ₀)/(σ/√n)
  t-test (σ unknown): t = (x̄-μ₀)/(s/√n), df=n-1
  Chi-square: χ² = Σ(O-E)²/E
  F-test: F = MSbetween/MSwithin
⚠️
p-value is NOT the probability H₀ is true. It's the probability of getting results at least this extreme IF H₀ were true.
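The five steps can be traced by hand for a one-sample t-test and checked against SciPy. This is a sketch under stated assumptions: the data and the hypothesised mean `mu0 = 12` are illustrative, and the test is two-tailed.

```python
import numpy as np
from scipy import stats

data = np.array([12, 15, 14, 10, 13, 16, 11, 14, 15, 12])
mu0 = 12          # step 1: H0: μ = 12, H1: μ ≠ 12 (value chosen for illustration)
alpha = 0.05      # step 2: significance level

# Step 3: test statistic t = (x̄ - μ0) / (s / √n)
n = len(data)
t_manual = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))

# Step 4: two-tailed p-value from the t-distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in test
t_scipy, p_scipy = stats.ttest_1samp(data, popmean=mu0)

# Step 5: decision
decision = "reject H0" if p_manual < alpha else "fail to reject H0"
print(f"t={t_manual:.3f}, p={p_manual:.3f} -> {decision}")
```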
04 Confidence Intervals
Confidence intervals
# General formula:
# CI = estimate ± critical_value × standard_error

# For population mean (σ known):
# CI = x̄ ± z_{α/2} × σ/√n
# z_{α/2}: 1.645 (90%), 1.960 (95%), 2.576 (99%)

# For population mean (σ unknown, use t):
# CI = x̄ ± t_{α/2, n-1} × s/√n

# For proportion:
# CI = p̂ ± z_{α/2} × √(p̂(1-p̂)/n)

# Python
from scipy import stats
import numpy as np

data = [12, 15, 14, 10, 13, 16, 11, 14, 15, 12]
n = len(data)
xbar = np.mean(data)   # 13.2
s = np.std(data, ddof=1)  # sample std

# 95% CI
ci = stats.t.interval(0.95, df=n-1, loc=xbar, scale=s/np.sqrt(n))
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
Interpretation
95% CI: if we repeated this experiment 100 times, ~95 intervals would contain the true parameter.
Width
Narrower CI = more precise. Wider with smaller n, larger σ, higher confidence level.
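The proportion formula listed earlier in this section has no code of its own, so here is a minimal sketch of it. The helper name `proportion_ci` and the counts are hypothetical. It uses the normal-approximation (Wald) interval, which can be inaccurate when n is small or p̂ is near 0 or 1.

```python
import numpy as np
from scipy import stats

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)    # e.g. 1.960 for 95%
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

small = proportion_ci(40, 100)                   # p̂ = 0.4, n = 100
large = proportion_ci(400, 1000)                 # same p̂, ten times the data
wide = proportion_ci(40, 100, confidence=0.99)   # same data, higher confidence

for label, (lo, hi) in [("n=100", small), ("n=1000", large), ("99%", wide)]:
    print(f"{label}: ({lo:.3f}, {hi:.3f}), width={hi - lo:.3f}")
```

The printed widths show both effects named under Width: more data narrows the interval, and a higher confidence level widens it.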
05 Regression Analysis
Regression analysis
# Simple linear regression: y = β₀ + β₁x + ε
# β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
# β₀ = ȳ - β₁x̄

# Multiple regression: y = β₀ + β₁x₁ + β₂x₂ + ... + ε

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# X_train, X_test, y_train, y_test are assumed to exist already (e.g. from train_test_split)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation metrics
r2   = r2_score(y_test, y_pred)           # ≤ 1, higher is better (can be negative on test data)
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                        # same units as y
mae  = np.mean(np.abs(y_test - y_pred))

print(f'R²: {r2:.3f}')
print(f'RMSE: {rmse:.3f}')
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

# Logistic regression (binary classification)
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
probs = log_model.predict_proba(X_test)[:, 1]  # prob of class 1
R²
1 - SS_res/SS_tot. Proportion of variance explained by the model
1=perfect, 0=no better than mean
Adjusted R²
Penalises adding irrelevant predictors
Use for multiple regression
F-statistic
Tests if model as a whole is significant
p-value < 0.05 = significant
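The closed-form slope and intercept, R², and adjusted R² can be computed directly with NumPy. In this sketch the five data points are made up. The adjusted R² line uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), with k the number of predictors.

```python
import numpy as np

# Small illustrative dataset (made up for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²  and  β₀ = ȳ - β₁x̄
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

ss_res = np.sum((y - y_hat) ** 2)       # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r2 = 1 - ss_res / ss_tot

n, k = len(y), 1                        # k = number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"slope={beta1:.3f}, intercept={beta0:.3f}, R²={r2:.4f}, adj R²={adj_r2:.4f}")
```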
06 Correlation
Pearson r
Measures LINEAR relationship. -1 to +1. Sensitive to outliers. r = Σ(x-x̄)(y-ȳ) / ((n-1)·sₓ·sᵧ)
Spearman ρ
Rank correlation. Non-parametric. Handles non-linear monotonic relationships.
Causation ≠ Correlation
r=0.9 does NOT mean X causes Y. Confounders or reverse causation can produce the same correlation.
|r| interpretation
Rule of thumb (thresholds vary by field): 0.0-0.3: weak, 0.3-0.5: moderate, 0.5-0.7: strong, 0.7-1.0: very strong
Multicollinearity
High correlation between predictors. Inflates coefficient standard errors. Check VIF.
Autocorrelation
Correlation of variable with itself at different time lags. Check Durbin-Watson.
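The difference between Pearson and Spearman shows up clearly on data that are monotonic but not linear. This sketch uses an exponential curve as an illustrative example.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = np.exp(x / 2.0)          # monotonic but strongly non-linear in x

r, _ = stats.pearsonr(x, y)      # measures linear association only
rho, _ = stats.spearmanr(x, y)   # works on ranks, so any monotonic trend scores high

print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman's ρ is 1, while Pearson's r stays below 1 because the points do not lie on a straight line.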
07 Bayesian Statistics
Bayesian statistics basics
# Bayes' Theorem:
# P(A|B) = P(B|A) × P(A) / P(B)
#
# In modelling:
# Posterior = Likelihood × Prior / Evidence
# P(θ|data) ∝ P(data|θ) × P(θ)
#
# θ = parameter(s)
# Prior P(θ): belief about parameters BEFORE seeing data
# Likelihood P(data|θ): how likely data given parameters
# Posterior P(θ|data): updated belief AFTER seeing data

# Example: Coin flip
# Prior: belief about θ = P(heads) centred on 0.5 (assume fair to start)
# Observe: 8 heads out of 10 flips
# Update: posterior shifts toward θ > 0.5

# Key differences from Frequentist:
# Bayesian: parameters have distributions, not fixed values
# Frequentist: parameters fixed, data is random

# Credible interval vs Confidence interval:
# Credible: P(θ in [a,b] | data) = 95%  ← direct probability!
# Confidence: If repeated, 95% of intervals contain true θ
💡
Bayesian approach: start with prior belief, update with evidence to get posterior. More intuitive interpretation than p-values.
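The coin-flip example can be made concrete with a conjugate Beta-Binomial update. This is a sketch under one loud assumption: it swaps in a uniform Beta(1, 1) prior on θ, which is a common default but not the prior stated in the notes above. A prior concentrated near 0.5 would pull the posterior closer to fair.

```python
from scipy import stats

heads, flips = 8, 10

# Assumption for this sketch: a uniform Beta(1, 1) prior on θ = P(heads)
prior_a, prior_b = 1, 1

# Conjugate update: Beta prior + binomial data -> Beta posterior
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
posterior = stats.beta(post_a, post_b)

post_mean = posterior.mean()                      # a / (a + b)
ci_low, ci_high = posterior.ppf([0.025, 0.975])   # 95% equal-tailed credible interval
prob_biased = posterior.sf(0.5)                   # P(θ > 0.5 | data)

print(f"posterior mean={post_mean:.3f}")
print(f"95% credible interval=({ci_low:.3f}, {ci_high:.3f})")
print(f"P(θ > 0.5 | data)={prob_biased:.3f}")
```

The credible interval can be read directly as "θ lies in this range with 95% probability, given the data and the prior", which is the contrast with confidence intervals drawn above.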
08 Statistical Tests
t-test (1-sample)
Test if sample mean = hypothesized value. H₀: μ = μ₀.
t-test (2-sample)
Compare means of two independent groups. H₀: μ₁ = μ₂.
Paired t-test
Compare before/after same subjects. H₀: μ_diff = 0.
ANOVA
Compare means of 3+ groups. H₀: all means equal. F-statistic.
Chi-square (independence)
Test if two categorical variables are independent.
Mann-Whitney U
Non-parametric alternative to 2-sample t-test.
Kruskal-Wallis
Non-parametric alternative to ANOVA.
Shapiro-Wilk
Test for normality. p > 0.05: no evidence against normality (fail to reject H₀: data are normal).
Common statistical tests
from scipy import stats

# One-sample t-test
t_stat, p_val = stats.ttest_1samp(data, popmean=10)

# Two-sample t-test
t_stat, p_val = stats.ttest_ind(group1, group2)  # assumes equal variances; equal_var=False gives Welch's test

# Paired t-test
t_stat, p_val = stats.ttest_rel(before, after)

# Chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Normality test
stat, p = stats.shapiro(data)  # p > 0.05: no evidence against normality

# Correlation
r, p = stats.pearsonr(x, y)
rho, p = stats.spearmanr(x, y)
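The two non-parametric tests from the list above have direct SciPy counterparts. This sketch runs them on synthetic skewed data. The group sizes, scales, and seed are arbitrary and only there to make the example self-contained.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                # synthetic data, illustration only
group1 = rng.exponential(scale=1.0, size=40)   # skewed, so normality is doubtful
group2 = rng.exponential(scale=1.5, size=40)
group3 = rng.exponential(scale=3.0, size=40)

# Mann-Whitney U: non-parametric alternative to the two-sample t-test
u_stat, p_mw = stats.mannwhitneyu(group1, group2, alternative="two-sided")

# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
h_stat, p_kw = stats.kruskal(group1, group2, group3)

print(f"Mann-Whitney U={u_stat:.1f}, p={p_mw:.4f}")
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_kw:.4f}")
```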
09 Python for Stats
Statistics in Python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

# Descriptive stats
df['salary'].describe()  # count,mean,std,min,quartiles,max
np.percentile(data, [25, 50, 75])
stats.kurtosis(data)     # excess kurtosis: tail heaviness (0 for a normal)
stats.skew(data)         # asymmetry

# OLS Regression with statsmodels
X = sm.add_constant(X)   # add intercept
model = sm.OLS(y, X).fit()
print(model.summary())    # full regression table

# Bootstrap confidence interval
def bootstrap_ci(data, n_boot=1000, ci=95):
    boot_means = [np.mean(np.random.choice(data, len(data), replace=True))
                  for _ in range(n_boot)]
    lo = (100-ci)/2
    return np.percentile(boot_means, [lo, 100-lo])

# Power analysis (TTestPower covers one-sample and paired t-tests)
from statsmodels.stats.power import TTestPower
analysis = TTestPower()
power = analysis.power(effect_size=0.5, nobs=30, alpha=0.05)
# Or: find required sample size
n = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
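As a usage sketch for the bootstrap idea, the version below resamples with a seeded generator so results are repeatable, then compares the percentile interval with the t-based interval. The `seed` parameter and the vectorised index resampling are assumptions added for this example, not part of the function above.

```python
import numpy as np
from scipy import stats

def bootstrap_ci(data, n_boot=5000, ci=95, seed=0):
    """Percentile bootstrap CI for the mean (seed is a hypothetical convenience parameter)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample indices with replacement: one row per bootstrap replicate
    idx = rng.integers(0, len(data), size=(n_boot, len(data)))
    boot_means = data[idx].mean(axis=1)
    lo = (100 - ci) / 2
    return np.percentile(boot_means, [lo, 100 - lo])

data = [12, 15, 14, 10, 13, 16, 11, 14, 15, 12]
boot_lo, boot_hi = bootstrap_ci(data)

# t-based interval for comparison; stats.sem uses ddof=1 by default
t_lo, t_hi = stats.t.interval(0.95, df=len(data) - 1,
                              loc=np.mean(data), scale=stats.sem(data))

print(f"bootstrap: ({boot_lo:.2f}, {boot_hi:.2f})")
print(f"t-based:   ({t_lo:.2f}, {t_hi:.2f})")
```

With only ten observations the percentile bootstrap tends to come out somewhat narrower than the t interval, so the two should be read as rough cross-checks rather than interchangeable answers.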
10 Mini Quizzes
❓ Quiz 1
What does a p-value of 0.03 mean?
The p-value is P(observed data or more extreme | H₀ true). A p-value of 0.03 means: assuming H₀ is true, there's a 3% chance of getting data at least this extreme by chance alone. It does NOT mean H₀ has 3% probability of being true.
❓ Quiz 2
What does R² = 0.85 mean in regression?
R² (coefficient of determination) measures proportion of variance in the dependent variable that is explained by the independent variables. R²=0.85 means 85% of the variation in y is explained by the model's predictors.