Statistics for Data Science Cheatsheet
Descriptive stats, hypothesis testing, regression, Bayesian stats and Python scipy.
01
Descriptive Statistics
Central Tendency
Mean
x̄ = Σxᵢ/n
Sum of all values divided by count
Median
Middle value when sorted
Better for skewed data; robust to outliers (far less affected than the mean)
Mode
Most frequent value
Can be multimodal
Geometric mean
(x₁×x₂×...×xₙ)^(1/n)
Use for growth rates, ratios
Spread / Variability
Variance
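σ² = Σ(xᵢ-μ)²/N (population) or s² = Σ(xᵢ-x̄)²/(n-1) (sample)
Population and sample formulas differ: dividing by n-1 (Bessel's correction) makes the sample variance unbiased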
Std deviation
σ = √variance
Same units as data
IQR
Q3 - Q1
Interquartile range — robust to outliers
Range
max - min
Sensitive to outliers
Z-score
z = (x-μ)/σ
How many std devs from mean
CV
σ/μ × 100%
Coefficient of variation — relative spread
💡
Always report BOTH central tendency AND spread. Mean alone is misleading for skewed distributions. Use median + IQR for skewed data.
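The measures above can be computed in a few lines. This is a minimal sketch on a made-up, right-skewed sample (the values in `data` are hypothetical), using NumPy plus `scipy.stats.gmean`. It shows how one large value drags the mean away from the median, and how `ddof` switches between the population and sample formulas.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: one large value pulls the mean upward.
data = np.array([2, 3, 3, 4, 5, 6, 40])

# Central tendency
mean = data.mean()
median = np.median(data)
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]        # most frequent value
geo_mean = stats.gmean(data)          # needs strictly positive values

# Spread
pop_var = data.var(ddof=0)            # divide by n
sample_var = data.var(ddof=1)         # divide by n - 1
sample_std = data.std(ddof=1)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
z_scores = (data - mean) / sample_std
cv = sample_std / mean * 100          # coefficient of variation, in percent

print(f"mean={mean:.2f}  median={median}  mode={mode}  geometric mean={geo_mean:.2f}")
print(f"sample std={sample_std:.2f}  IQR={iqr}  CV={cv:.1f}%")
```

For a sample like this, median and IQR describe the bulk of the data better than mean and standard deviation, which the single outlier dominates.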
02
Probability Distributions
Uniform
All outcomes equally likely. U(a,b): mean=(a+b)/2.
Bernoulli
Single trial: success (p) or failure (1-p). Mean=p, Var=p(1-p).
Binomial
n trials, P(X=k) = C(n,k)×pᵏ×(1-p)ⁿ⁻ᵏ. Mean=np, Var=np(1-p).
Poisson
Events in time/space. P(X=k) = e⁻λλᵏ/k!. Mean=λ=Var.
Normal
Bell curve. N(μ,σ²). 68-95-99.7 rule. Central to statistics.
t-distribution
Like normal but heavier tails. Use when σ unknown + small n. Approaches normal as n→∞.
Chi-square
Sum of squared standard normals. Used for independence tests.
F-distribution
Ratio of two chi-square variables, each divided by its degrees of freedom. Used in ANOVA and regression F-test.
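The formulas on these cards can be checked against `scipy.stats`. The sketch below uses hypothetical parameters (`n`, `p`, `lam`). It compares the binomial PMF formula with SciPy's value, confirms the stated means and variances, and illustrates the heavier tails of the t-distribution.

```python
from math import comb
from scipy import stats

n, p, lam = 10, 0.3, 4.0   # hypothetical parameters for illustration

# Binomial: P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
binom = stats.binom(n, p)
pmf_formula = comb(n, 3) * p**3 * (1 - p) ** (n - 3)
pmf_scipy = binom.pmf(3)
print(f"P(X=3): formula={pmf_formula:.4f}, scipy={pmf_scipy:.4f}")
print(f"Binomial mean={binom.mean():.2f}, var={binom.var():.2f}")   # np and np(1-p)

# Poisson: mean and variance are both lambda
poisson = stats.poisson(lam)
print(f"Poisson mean={poisson.mean():.2f}, var={poisson.var():.2f}")

# t-distribution: heavier tails than the normal, converging as df grows
tail_normal = stats.norm.sf(3)        # P(Z > 3)
tail_t5 = stats.t(df=5).sf(3)         # P(T > 3) with 5 degrees of freedom
t_crit_large_df = stats.t(df=1000).ppf(0.975)
z_crit = stats.norm.ppf(0.975)
print(f"P(>3): normal={tail_normal:.4f}, t(5)={tail_t5:.4f}")
print(f"97.5% quantile: t(1000)={t_crit_large_df:.3f}, normal={z_crit:.3f}")
```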
68-95-99.7 rule
μ±σ (68%), μ±2σ (95%), μ±3σ (99.7%)
Normal distribution
CLT
x̄ ≈ N(μ, σ²/n) for large n
Central Limit Theorem — basis of inference
Standard error
SE = σ/√n
Standard deviation of sampling distribution of mean
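A small simulation can make the CLT and the standard error concrete. This sketch assumes a skewed exponential population with mean 2 (so σ = 2). The sample size, number of repetitions and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed population: exponential with mean 2 (its standard deviation is also 2)
mu, sigma, n = 2.0, 2.0, 50
sample_means = rng.exponential(scale=mu, size=(20_000, n)).mean(axis=1)

theoretical_se = sigma / np.sqrt(n)          # SE = sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)      # spread of the simulated sample means
print(f"SE: theory={theoretical_se:.4f}, simulation={empirical_se:.4f}")

# Fraction of sample means within 2 standard errors of mu (about 95% under the CLT)
within_2se = np.mean(np.abs(sample_means - mu) < 2 * theoretical_se)
print(f"within 2 SE: {within_2se:.3f}")

# 68-95-99.7 rule computed from the standard normal CDF
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
print({k: round(v, 4) for k, v in coverage.items()})
```

The individual observations are far from normal, but the means of 50 observations already behave approximately normally.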
03
Hypothesis Testing
Hypothesis testing framework
STEPS:
1. State H₀ (null) and H₁ (alternative)
2. Choose significance level α (usually 0.05)
3. Collect data and calculate test statistic
4. Calculate p-value: P(data at least this extreme | H₀ true)
5. Decision: p < α → reject H₀ | p ≥ α → fail to reject H₀
TYPES OF ERROR:
Type I (α): Reject H₀ when it's true (false positive)
Type II (β): Fail to reject H₀ when it's false (false negative)
Power = 1 - β = P(correctly rejecting a false H₀)
ONE-TAILED vs TWO-TAILED:
H₁: μ > μ₀ (one-tailed, right)
H₁: μ < μ₀ (one-tailed, left)
H₁: μ ≠ μ₀ (two-tailed, most common)
COMMON TEST STATISTICS:
z-test (σ known): z = (x̄-μ₀)/(σ/√n)
t-test (σ unknown): t = (x̄-μ₀)/(s/√n), df=n-1
Chi-square: χ² = Σ(O-E)²/E
F-test: F = MSbetween/MSwithin
⚠️
p-value is NOT the probability H₀ is true. It's the probability of getting results this extreme IF H₀ were true.
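The five steps can be walked through for a one-sample t-test. This sketch uses a hypothetical sample and a hypothetical null value `mu0 = 12`. It computes the t statistic and two-tailed p-value from the formula, then cross-checks them with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; H0: mu = 12 vs H1: mu != 12 (two-tailed)
sample = np.array([12, 15, 14, 10, 13, 16, 11, 14, 15, 12])
mu0, alpha = 12.0, 0.05

# Step 3: test statistic t = (xbar - mu0) / (s / sqrt(n)), df = n - 1
n = len(sample)
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Step 4: two-tailed p-value = P(|T| >= |t| given H0 true)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check with SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)

# Step 5: decision
decision = "reject H0" if p_manual < alpha else "fail to reject H0"
print(f"t={t_manual:.3f}, p={p_manual:.3f} -> {decision}")
```

"Fail to reject" is not the same as accepting H₀. It only means this sample does not provide strong enough evidence against it at the chosen α.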
04
Confidence Intervals
Confidence intervals
# General formula:
# CI = estimate ± critical_value × standard_error
# For population mean (σ known):
# CI = x̄ ± z_{α/2} × σ/√n
# z_{α/2}: 1.645 (90%), 1.960 (95%), 2.576 (99%)
# For population mean (σ unknown, use t):
# CI = x̄ ± t_{α/2, n-1} × s/√n
# For proportion:
# CI = p̂ ± z_{α/2} × √(p̂(1-p̂)/n)
# Python
from scipy import stats
import numpy as np
data = [12, 15, 14, 10, 13, 16, 11, 14, 15, 12]
n = len(data)
xbar = np.mean(data) # 13.2
s = np.std(data, ddof=1) # sample std
# 95% CI
ci = stats.t.interval(0.95, df=n-1, loc=xbar, scale=s/np.sqrt(n))
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
Interpretation
95% CI: if the experiment were repeated many times, about 95% of the intervals built this way would contain the true parameter. It is not a 95% probability that this one interval contains it.
Width
Narrower CI = more precise. Wider with smaller n, larger σ, higher confidence level.
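The proportion formula above has no code counterpart in the snippet, so here is a minimal sketch of it. The helper name `proportion_ci` and the counts (120 successes out of 400) are hypothetical. It uses the normal-approximation (Wald) interval, which can be inaccurate for small n or for p̂ near 0 or 1.

```python
import numpy as np
from scipy import stats

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval: p_hat ± z * sqrt(p_hat(1-p_hat)/n)."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # 1.960 for 95%
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical survey: 120 "yes" answers out of 400 respondents
lo95, hi95 = proportion_ci(120, 400, confidence=0.95)
lo99, hi99 = proportion_ci(120, 400, confidence=0.99)
print(f"95% CI: ({lo95:.4f}, {hi95:.4f})")
print(f"99% CI: ({lo99:.4f}, {hi99:.4f})")   # higher confidence gives a wider interval
```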
05
Regression Analysis
Regression analysis
# Simple linear regression: y = β₀ + β₁x + ε
# β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
# β₀ = ȳ - β₁x̄
# Multiple regression: y = β₀ + β₁x₁ + β₂x₂ + ... + ε
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
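# X_train, X_test, y_train, y_test are assumed to come from an earlier split (e.g. sklearn's train_test_split)
model = LinearRegression()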
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluation metrics
r2 = r2_score(y_test, y_pred) # at most 1, higher is better; can be negative on held-out data
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse) # same units as y
mae = np.mean(np.abs(y_test - y_pred))
print(f'R²: {r2:.3f}')
print(f'RMSE: {rmse:.3f}')
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')
# Logistic regression (binary classification)
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
probs = log_model.predict_proba(X_test)[:, 1] # prob of class 1
R²
1 - SS_res/SS_tot. Proportion of variance explained by model
1=perfect, 0=no better than predicting the mean; can be negative on held-out data
Adjusted R²
Penalises adding irrelevant predictors
Use for multiple regression
F-statistic
Tests if model as a whole is significant
p-value < 0.05 = significant
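The closed-form estimates for β₁ and β₀ and the R² definition can be reproduced directly with NumPy. This is a sketch on synthetic data (true intercept 3, true slope 2, and the noise level are arbitrary choices). The adjusted R² line uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), which is not spelled out above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 3 + 2x + noise (values chosen only for illustration)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=100)

# Closed-form simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# R^2 = 1 - SS_res / SS_tot
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 with k predictors (k = 1 here)
n, k = len(x), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Cross-check the coefficients with NumPy's least-squares polynomial fit
slope_np, intercept_np = np.polyfit(x, y, deg=1)
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, R2={r2:.3f}, adj R2={adj_r2:.3f}")
```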
06
Correlation
Pearson r
Measures LINEAR relationship. -1 to +1. Sensitive to outliers. r = Σ(x-x̄)(y-ȳ) / ((n-1)·sₓ·sᵧ)
Spearman ρ
Rank correlation. Non-parametric. Handles non-linear monotonic relationships.
Causation ≠ Correlation
r=0.9 does NOT mean X causes Y. Confounders, reverse causation.
|r| interpretation
0.0-0.3: weak, 0.3-0.5: moderate, 0.5-0.7: strong, 0.7-1.0: very strong
Multicollinearity
High correlation between predictors. Inflates coefficient standard errors. Check VIF.
Autocorrelation
Correlation of a variable with itself at different time lags. The Durbin-Watson statistic checks lag-1 autocorrelation in regression residuals (a value near 2 suggests none).
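The difference between Pearson and Spearman shows up clearly on a monotonic but non-linear relationship. The sketch below uses a made-up exponential curve. It also recomputes Pearson's r from the formula to confirm it matches `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y increases with x, but exponentially rather than linearly
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3)

r, _ = stats.pearsonr(x, y)       # linear association
rho, _ = stats.spearmanr(x, y)    # rank (monotonic) association

# Pearson r from the formula: sum((x-xbar)(y-ybar)) / ((n-1) * s_x * s_y)
n = len(x)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    (n - 1) * x.std(ddof=1) * y.std(ddof=1)
)

print(f"Pearson r={r:.3f} (manual {r_manual:.3f}), Spearman rho={rho:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman's ρ reaches 1 while Pearson's r stays below 1.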
07
Bayesian Statistics
Bayesian statistics basics
# Bayes' Theorem:
# P(A|B) = P(B|A) × P(A) / P(B)
#
# In modelling:
# Posterior = Likelihood × Prior / Evidence
# P(θ|data) ∝ P(data|θ) × P(θ)
#
# θ = parameter(s)
# Prior P(θ): belief about parameters BEFORE seeing data
# Likelihood P(data|θ): how likely data given parameters
# Posterior P(θ|data): updated belief AFTER seeing data
# Example: Coin flip
# Prior: P(θ=fair) = 0.5 (assume fair to start)
# Observe: 8 heads out of 10 flips
# Update: posterior shifts toward θ > 0.5
# Key differences from Frequentist:
# Bayesian: parameters have distributions, not fixed values
# Frequentist: parameters fixed, data is random
# Credible interval vs Confidence interval:
# Credible: P(θ in [a,b] | data) = 95% ← direct probability!
# Confidence: If repeated, 95% of intervals contain true θ
💡
Bayesian approach: start with prior belief, update with evidence to get posterior. More intuitive interpretation than p-values.
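The coin-flip update can be sketched with a conjugate Beta-Binomial model. This is one common way to do it, not necessarily the one the example above has in mind. It assumes a uniform Beta(1, 1) prior on θ (the probability of heads) rather than a prior on "fair vs not fair", and the variable names are hypothetical.

```python
from scipy import stats

# Assumption: uniform Beta(1, 1) prior on theta = P(heads)
prior_a, prior_b = 1, 1
heads, flips = 8, 10

# Conjugate update: posterior is Beta(a + heads, b + tails)
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
posterior = stats.beta(post_a, post_b)

post_mean = posterior.mean()                       # a / (a + b)
ci_low, ci_high = posterior.ppf([0.025, 0.975])    # 95% equal-tailed credible interval
p_heads_biased = posterior.sf(0.5)                 # P(theta > 0.5 | data)

print(f"posterior mean={post_mean:.3f}")
print(f"95% credible interval=({ci_low:.3f}, {ci_high:.3f})")
print(f"P(theta > 0.5 | data)={p_heads_biased:.3f}")
```

The credible interval is a direct probability statement about θ given the data and the prior. A different prior would give a different posterior.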
08
Statistical Tests
t-test (1-sample)
Test whether the population mean equals a hypothesized value, based on the sample mean. H₀: μ = μ₀.
t-test (2-sample)
Compare means of two independent groups. H₀: μ₁ = μ₂.
Paired t-test
Compare before/after same subjects. H₀: μ_diff = 0.
ANOVA
Compare means of 3+ groups. H₀: all means equal. F-statistic.
Chi-square (independence)
Test if two categorical variables are independent.
Mann-Whitney U
Non-parametric alternative to 2-sample t-test.
Kruskal-Wallis
Non-parametric alternative to ANOVA.
Shapiro-Wilk
Test for normality. H₀: data are normally distributed. p > 0.05: no evidence against normality (not proof of it).
Common statistical tests (Python)
from scipy import stats
# One-sample t-test
t_stat, p_val = stats.ttest_1samp(data, popmean=10)
# Two-sample t-test (assumes equal variances; pass equal_var=False for Welch's test)
t_stat, p_val = stats.ttest_ind(group1, group2)
# Paired t-test
t_stat, p_val = stats.ttest_rel(before, after)
# Chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)
# Normality test
stat, p = stats.shapiro(data) # p > 0.05: no evidence against normality
# Correlation
r, p = stats.pearsonr(x, y)
rho, p = stats.spearmanr(x, y)
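The two non-parametric tests listed above, Mann-Whitney U and Kruskal-Wallis, are also available in `scipy.stats`. The sketch below runs both on small made-up groups (all values are hypothetical), where `group_a` sits clearly below the other two.

```python
from scipy import stats

# Hypothetical measurements for three independent groups
group_a = [1.1, 2.3, 0.7, 1.9, 3.0, 0.4, 1.5, 2.2]
group_b = [4.8, 6.1, 3.9, 7.5, 5.2, 9.0, 4.1, 6.6]
group_c = [4.5, 5.9, 8.2, 3.6, 7.1, 5.0, 6.3, 4.4]

# Mann-Whitney U: non-parametric alternative to the 2-sample t-test
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

print(f"Mann-Whitney U={u_stat}, p={p_mw:.5f}")
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_kw:.5f}")
```

Both tests work on ranks, so they do not need normality, but they still assume independent observations.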
09
Python for Stats
Statistics in Python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
# Descriptive stats
df['salary'].describe() # count,mean,std,min,quartiles,max
np.percentile(data, [25, 50, 75])
stats.kurtosis(data) # excess kurtosis: tail heaviness, 0 for a normal distribution
stats.skew(data) # asymmetry
# OLS Regression with statsmodels
X = sm.add_constant(X) # add intercept
model = sm.OLS(y, X).fit()
print(model.summary()) # full regression table
# Bootstrap confidence interval
def bootstrap_ci(data, n_boot=1000, ci=95):
    boot_means = [np.mean(np.random.choice(data, len(data), replace=True))
                  for _ in range(n_boot)]
    lo = (100-ci)/2
    return np.percentile(boot_means, [lo, 100-lo])
# Power analysis (TTestPower covers one-sample and paired t-tests; use TTestIndPower for two independent groups)
from statsmodels.stats.power import TTestPower
analysis = TTestPower()
power = analysis.power(effect_size=0.5, nobs=30, alpha=0.05)
# Or: find required sample size
n = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
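Power and the Type I error rate can also be estimated by simulation, which is a useful sanity check on an analytic power calculation. This sketch uses the same settings as the power call above (effect size 0.5, n = 30, α = 0.05) for a two-sided one-sample t-test. The number of simulated studies and the seed are arbitrary, and it uses only NumPy and SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
effect_size, n, alpha, n_sims = 0.5, 30, 0.05, 5000

# H1 true: each row is one simulated study with mean = effect_size, sd = 1
alt_samples = rng.normal(loc=effect_size, scale=1.0, size=(n_sims, n))
_, p_alt = stats.ttest_1samp(alt_samples, popmean=0.0, axis=1)
power_sim = np.mean(p_alt < alpha)        # share of studies that reject H0

# H0 true: mean = 0, so every rejection is a Type I error
null_samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
_, p_null = stats.ttest_1samp(null_samples, popmean=0.0, axis=1)
type1_rate = np.mean(p_null < alpha)      # should be close to alpha

print(f"simulated power={power_sim:.3f}, simulated Type I rate={type1_rate:.3f}")
```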
10
Mini Quizzes
❓ Quiz 1
What does a p-value of 0.03 mean?
The p-value is P(observed data or more extreme | H₀ true). A p-value of 0.03 means: assuming H₀ is true, there's a 3% chance of getting data at least this extreme. It does NOT mean H₀ has a 3% probability of being true.
❓ Quiz 2
What does R² = 0.85 mean in regression?
R² (coefficient of determination) measures proportion of variance in the dependent variable that is explained by the independent variables. R²=0.85 means 85% of the variation in y is explained by the model's predictors.