Statistics Cheatsheet — BitWithBite

Statistics for Data Science Cheatsheet

Descriptive stats, hypothesis testing, regression, Bayesian stats and Python scipy.

📖 10 sections

⏱ 26 min read

✅ Quizzes included

01 Descriptive Statistics

Central Tendency

Mean

x̄ = Σxᵢ/n

Sum of all values divided by count

Median

Middle value when sorted

Better for skewed data; robust to outliers

Mode

Most frequent value

Can be multimodal

Geometric mean

(x₁×x₂×...×xₙ)^(1/n)

Use for growth rates and ratios (positive values only)

Spread / Variability

Variance

σ² = Σ(xᵢ-μ)²/N (population) or s² = Σ(xᵢ-x̄)²/(n-1) (sample)

The sample formula divides by n-1 (Bessel's correction) so the estimate is unbiased

Std deviation

σ = √variance

Same units as data

IQR

Q3 - Q1

Interquartile range — robust to outliers

Range

max - min

Sensitive to outliers

Z-score

z = (x-μ)/σ

How many std devs from mean

Coefficient of variation

CV = σ/μ × 100%

Relative spread; compares variability across different scales

💡

Always report BOTH central tendency AND spread. Mean alone is misleading for skewed distributions. Use median + IQR for skewed data.
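The measures above can be computed in a few lines. This is a minimal sketch on a made-up sample with one large outlier (the values are illustrative, not from any real dataset); it uses the sample standard deviation for the z-scores and CV, which is an assumption since the formulas above are written with σ and μ.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with one large outlier (illustrative values only)
data = np.array([2, 4, 4, 5, 7, 9, 40], dtype=float)

# Central tendency
mean = data.mean()
median = np.median(data)
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]           # most frequent value
geo_mean = stats.gmean(data)             # geometric mean (positive values only)

# Spread
pop_var = data.var(ddof=0)               # divide by n
sample_var = data.var(ddof=1)            # divide by n-1
sample_std = data.std(ddof=1)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
value_range = data.max() - data.min()

# Standardised measures (sample std used in place of sigma)
z_scores = (data - mean) / sample_std
cv = sample_std / mean * 100             # coefficient of variation, in %

print(f"mean={mean:.2f}  median={median:.2f}  mode={mode:.0f}")
print(f"sample std={sample_std:.2f}  IQR={iqr:.2f}  range={value_range:.0f}")
```

The outlier pulls the mean well above the median, while the median and IQR barely react to it.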

02 Probability Distributions

Uniform

All outcomes equally likely. U(a,b): mean=(a+b)/2.

Bernoulli

Single trial: success (p) or failure (1-p). Mean=p, Var=p(1-p).

Binomial

n trials, P(X=k) = C(n,k)×pᵏ×(1-p)ⁿ⁻ᵏ. Mean=np, Var=np(1-p).

Poisson

Events in time/space. P(X=k) = e⁻λλᵏ/k!. Mean=λ=Var.

Normal

Bell curve. N(μ,σ²). 68-95-99.7 rule. Central to statistics.

t-distribution

Like normal but heavier tails. Use when σ unknown + small n. Approaches normal as n→∞.

Chi-square

Sum of squared standard normals. Used for independence tests.

F-distribution

Ratio of two independent chi-squares, each divided by its degrees of freedom. Used in ANOVA and the regression F-test.
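As a quick check of the Binomial and Poisson formulas above, the sketch below evaluates each PMF by hand and compares it with `scipy.stats`. The parameter values (n, p, k, λ) are arbitrary choices for illustration.

```python
from math import comb, exp, factorial
from scipy import stats

# Hypothetical parameters, chosen only for illustration
n, p, k = 10, 0.3, 3
lam = 2.5

# Binomial: P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
binom_manual = comb(n, k) * p**k * (1 - p) ** (n - k)
binom_scipy = stats.binom.pmf(k, n, p)

# Poisson: P(X=k) = e^(-lam) * lam^k / k!
pois_manual = exp(-lam) * lam**k / factorial(k)
pois_scipy = stats.poisson.pmf(k, lam)

# Mean and variance reported by scipy should match np and np(1-p)
b_mean, b_var = stats.binom.stats(n, p, moments="mv")

print(f"Binomial: manual={binom_manual:.6f}  scipy={binom_scipy:.6f}")
print(f"Poisson:  manual={pois_manual:.6f}  scipy={pois_scipy:.6f}")
print(f"Binomial mean={float(b_mean):.2f}  var={float(b_var):.2f}")
```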

68-95-99.7 rule

μ±σ (68%), μ±2σ (95%), μ±3σ (99.7%)

Normal distribution

CLT

x̄ ≈ N(μ, σ²/n) for large n

Central Limit Theorem: holds for any population with finite variance; basis of inference

Standard error

SE = σ/√n (estimated by s/√n when σ is unknown)

Standard deviation of the sampling distribution of the mean
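The 68-95-99.7 rule and the CLT can both be checked numerically. The sketch below reads the coverage figures off the standard normal CDF, then draws many samples from a skewed exponential population (an arbitrary choice, with seed and sample sizes picked for illustration) and compares the spread of the sample means with σ/√n.

```python
import numpy as np
from scipy import stats

# 68-95-99.7 rule: probability mass within k standard deviations of the mean
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} sd: {prob:.4f}")

# CLT: sample means from a skewed population (exponential, mean 1, sd 1)
rng = np.random.default_rng(0)
n, n_samples = 50, 20_000
sigma = 1.0
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)   # spread of the simulated means
theoretical_se = sigma / np.sqrt(n)       # SE = sigma / sqrt(n)

print(f"empirical SE={empirical_se:.4f}  theoretical SE={theoretical_se:.4f}")
print(f"skewness of sample means={stats.skew(sample_means):.3f}")
```

Even though the population is strongly skewed, the sample means cluster around μ with a spread close to σ/√n.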

03 Hypothesis Testing

STATS: Hypothesis testing framework

STEPS:
1. State H₀ (null) and H₁ (alternative)
2. Choose significance level α (usually 0.05)
3. Collect data and calculate test statistic
4. Calculate p-value (P(data at least this extreme | H₀ true))
5. Decision: p < α → reject H₀ | p ≥ α → fail to reject H₀

TYPES OF ERROR:
  Type I (α):  Reject H₀ when it's true  (false positive)
  Type II (β): Fail to reject H₀ when false (false negative)
  Power = 1 - β = P(correctly rejecting false H₀)

ONE-TAILED vs TWO-TAILED:
  H₁: μ > μ₀   (one-tailed, right)
  H₁: μ < μ₀   (one-tailed, left)
  H₁: μ ≠ μ₀   (two-tailed — most common)

COMMON TEST STATISTICS:
  z-test (σ known): z = (x̄-μ₀)/(σ/√n)
  t-test (σ unknown): t = (x̄-μ₀)/(s/√n), df=n-1
  Chi-square: χ² = Σ(O-E)²/E
  F-test: F = MSbetween/MSwithin

⚠️

p-value is NOT the probability H₀ is true. It's the probability of getting results at least this extreme IF H₀ were true.
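The five steps can be traced by hand for a one-sample, two-tailed t-test. In this sketch the sample, the null value μ₀ = 12 and α = 0.05 are all illustrative assumptions; the manual result is compared with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Step 1: H0: mu = 12, H1: mu != 12 (two-tailed). mu0 is a hypothetical value.
mu0 = 12
# Step 2: significance level
alpha = 0.05
# Step 3: data and test statistic, t = (xbar - mu0) / (s / sqrt(n))
data = np.array([12, 15, 14, 10, 13, 16, 11, 14, 15, 12])  # illustrative sample
n = len(data)
xbar = data.mean()
s = data.std(ddof=1)
t_manual = (xbar - mu0) / (s / np.sqrt(n))
# Step 4: two-tailed p-value from the t-distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)
# Step 5: decision
reject = p_manual < alpha

# Cross-check against scipy
res = stats.ttest_1samp(data, popmean=mu0)

print(f"t={t_manual:.3f}  p={p_manual:.3f}  reject H0: {reject}")
print(f"scipy: t={res.statistic:.3f}  p={res.pvalue:.3f}")
```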

04 Confidence Intervals

STATS: Confidence intervals

# General formula:
# CI = estimate ± critical_value × standard_error

# For population mean (σ known):
# CI = x̄ ± z_{α/2} × σ/√n
# z_{α/2}: 1.645 (90%), 1.960 (95%), 2.576 (99%)

# For population mean (σ unknown, use t):
# CI = x̄ ± t_{α/2, n-1} × s/√n

# For proportion:
# CI = p̂ ± z_{α/2} × √(p̂(1-p̂)/n)

# Python
from scipy import stats
import numpy as np

data = [12, 15, 14, 10, 13, 16, 11, 14, 15, 12]
n = len(data)
xbar = np.mean(data)   # 13.2
s = np.std(data, ddof=1)  # sample std

# 95% CI
ci = stats.t.interval(0.95, df=n-1, loc=xbar, scale=s/np.sqrt(n))
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

Interpretation

95% CI: if we repeated this experiment 100 times, ~95 of the resulting intervals would contain the true parameter. It does not mean there is a 95% probability that this one interval contains it.

Width

Narrower CI = more precise. Wider with smaller n, larger σ, higher confidence level.
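The proportion formula has no code above, so here is a minimal sketch of it. The counts (56 successes in 200 trials) are made up for illustration. This is the simple normal-approximation (Wald) interval, which can be unreliable for small n or when p̂ is near 0 or 1.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, for illustration only
successes, n = 56, 200
p_hat = successes / n

conf = 0.95
z = stats.norm.ppf(1 - (1 - conf) / 2)      # critical value z_{alpha/2}
se = np.sqrt(p_hat * (1 - p_hat) / n)       # standard error of p_hat
ci = (p_hat - z * se, p_hat + z * se)

print(f"p_hat={p_hat:.3f}  z={z:.3f}  SE={se:.4f}")
print(f"{conf:.0%} CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```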

05 Regression Analysis

STATS: Regression analysis

# Simple linear regression: y = β₀ + β₁x + ε
# β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
# β₀ = ȳ - β₁x̄

# Multiple regression: y = β₀ + β₁x₁ + β₂x₂ + ... + ε

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

# Assumes X_train, X_test, y_train, y_test already exist,
# e.g. from sklearn.model_selection.train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation metrics
r2   = r2_score(y_test, y_pred)            # <= 1, higher better (can be negative on test data)
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                        # same units as y
mae  = mean_absolute_error(y_test, y_pred) # same units as y

print(f'R²: {r2:.3f}')
print(f'RMSE: {rmse:.3f}')
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

# Logistic regression (binary classification)
# Here y_train must hold class labels (e.g. 0/1), not a continuous target
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
probs = log_model.predict_proba(X_test)[:, 1]  # prob of class 1

R²

1 - SS_res/SS_tot. Proportion of variance explained by model

1 = perfect, 0 = no better than predicting the mean; can be negative on held-out data

Adjusted R²

Penalises adding irrelevant predictors

Use for multiple regression

F-statistic

Tests if model as a whole is significant

p-value < α (e.g. 0.05): reject H₀ that all slope coefficients are zero
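The closed-form β₁ and β₀ formulas and the R² definition can be verified directly with NumPy. The x and y values below are made up for illustration. The adjusted R² formula used, 1 - (1-R²)(n-1)/(n-k-1) with k predictors, is the standard one but is not spelled out in the cheatsheet.

```python
import numpy as np

# Hypothetical data, roughly y = 2x (illustrative only)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Closed-form least-squares estimates
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# R^2 = 1 - SS_res / SS_tot
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 with k predictors (standard formula, stated here as an assumption)
n, k = len(x), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Cross-check the coefficients against NumPy's least-squares polynomial fit
slope, intercept = np.polyfit(x, y, 1)

print(f"beta1={beta1:.4f}  beta0={beta0:.4f}")
print(f"R2={r2:.4f}  adjusted R2={adj_r2:.4f}")
```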

06 Correlation

Pearson r

Measures LINEAR relationship. -1 to +1. Sensitive to outliers. r = Σ(x-x̄)(y-ȳ) / [(n-1)·sₓ·sᵧ]

Spearman ρ

Rank correlation. Non-parametric. Handles non-linear monotonic relationships.

Correlation ≠ Causation

r=0.9 does NOT mean X causes Y. Check for confounders and reverse causation.

|r| interpretation

Rough rule of thumb (thresholds vary by field): 0.0-0.3: weak, 0.3-0.5: moderate, 0.5-0.7: strong, 0.7-1.0: very strong

Multicollinearity

High correlation between predictors. Inflates coefficient standard errors. Check VIF.

Autocorrelation

Correlation of variable with itself at different time lags. Check Durbin-Watson.
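To see the difference between Pearson and Spearman, the sketch below uses a made-up relationship that is perfectly monotonic but strongly non-linear (y = e^(x/3), an arbitrary choice). It also checks the Pearson formula by hand against `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: monotonic but non-linear (illustrative only)
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3)

# Pearson r by hand: sum((x-xbar)(y-ybar)) / ((n-1) * s_x * s_y)
n = len(x)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    (n - 1) * x.std(ddof=1) * y.std(ddof=1)
)

r, p_r = stats.pearsonr(x, y)        # linear association
rho, p_rho = stats.spearmanr(x, y)   # rank (monotonic) association

print(f"Pearson r (manual)={r_manual:.4f}  (scipy)={r:.4f}")
print(f"Spearman rho={rho:.4f}")
```

Spearman reports a perfect monotonic relationship, while Pearson is lower because the relationship is not a straight line.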

07 Bayesian Statistics

STATS: Bayesian statistics basics

# Bayes' Theorem:
# P(A|B) = P(B|A) × P(A) / P(B)
#
# In modelling:
# Posterior = Likelihood × Prior / Evidence
# P(θ|data) ∝ P(data|θ) × P(θ)
#
# θ = parameter(s)
# Prior P(θ): belief about parameters BEFORE seeing data
# Likelihood P(data|θ): how likely data given parameters
# Posterior P(θ|data): updated belief AFTER seeing data

# Example: Coin flip (θ = probability of heads)
# Prior: belief centred on θ = 0.5 (assume fair to start)
# Observe: 8 heads out of 10 flips
# Update: posterior shifts toward θ > 0.5

# Key differences from Frequentist:
# Bayesian: parameters have distributions, not fixed values
# Frequentist: parameters fixed, data is random

# Credible interval vs Confidence interval:
# Credible: P(θ in [a,b] | data) = 95%  ← direct probability!
# Confidence: If repeated, 95% of intervals contain true θ

💡

Bayesian approach: start with prior belief, update with evidence to get posterior. More intuitive interpretation than p-values.
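The coin-flip update can be made concrete with a conjugate Beta prior. This sketch assumes a uniform Beta(1, 1) prior on θ = P(heads), which is one possible choice and not one the cheatsheet specifies; with a Beta(a, b) prior and h heads in n flips, the posterior is Beta(a + h, b + n - h).

```python
from scipy import stats

# Assumed prior: Beta(1, 1), i.e. uniform over theta in [0, 1]
a_prior, b_prior = 1, 1

# Observed data from the example: 8 heads out of 10 flips
heads, flips = 8, 10

# Conjugate update: posterior is Beta(a + heads, b + tails)
a_post = a_prior + heads
b_post = b_prior + (flips - heads)
posterior = stats.beta(a_post, b_post)

post_mean = posterior.mean()                          # a_post / (a_post + b_post)
cred_lo, cred_hi = posterior.ppf([0.025, 0.975])      # 95% equal-tailed credible interval
p_theta_gt_half = posterior.sf(0.5)                   # P(theta > 0.5 | data)

print(f"posterior: Beta({a_post}, {b_post}), mean={post_mean:.3f}")
print(f"95% credible interval: ({cred_lo:.3f}, {cred_hi:.3f})")
print(f"P(theta > 0.5 | data) = {p_theta_gt_half:.3f}")
```

A prior concentrated near 0.5, such as Beta(20, 20), would pull the posterior mean back toward 0.5, which shows how much the prior matters with only 10 flips.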

08 Statistical Tests

t-test (1-sample)

Test whether the population mean equals a hypothesized value. H₀: μ = μ₀.

t-test (2-sample)

Compare means of two independent groups. H₀: μ₁ = μ₂.

Paired t-test

Compare before/after measurements on the same subjects. H₀: μ_diff = 0.

ANOVA

Compare means of 3+ groups. H₀: all means equal. F-statistic.

Chi-square (independence)

Test if two categorical variables are independent.

Mann-Whitney U

Non-parametric alternative to 2-sample t-test.

Kruskal-Wallis

Non-parametric alternative to ANOVA.

Shapiro-Wilk

Test for normality. H₀: data are normal. p > 0.05: no evidence against normality (this does not prove the data are normal).

PYTHON: Common statistical tests

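from scipy import stats

# Assumes data, group1..group3, before, after, x, y are 1-D numeric arrays
# and contingency_table is a 2-D array of observed counts

# One-sample t-test
t_stat, p_val = stats.ttest_1samp(data, popmean=10)

# Two-sample t-test (assumes equal variances by default)
t_stat, p_val = stats.ttest_ind(group1, group2)
# Welch's t-test (does not assume equal variances)
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# Paired t-test (before and after must have the same length)
t_stat, p_val = stats.ttest_rel(before, after)

# Chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Normality test
stat, p = stats.shapiro(data)  # p > 0.05: no evidence against normality

# Correlation
r, p = stats.pearsonr(x, y)
rho, p = stats.spearmanr(x, y)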
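Mann-Whitney U and Kruskal-Wallis are listed above without code, so here is a minimal sketch of both. The three groups are simulated from skewed exponential distributions with arbitrary scales and seed, purely for illustration; Welch's t-test is included only for comparison.

```python
import numpy as np
from scipy import stats

# Simulated skewed groups (hypothetical data, illustrative parameters)
rng = np.random.default_rng(42)
group1 = rng.exponential(scale=1.0, size=40)
group2 = rng.exponential(scale=2.0, size=40)
group3 = rng.exponential(scale=3.0, size=40)

# Mann-Whitney U: non-parametric alternative to the 2-sample t-test
u_stat, p_mw = stats.mannwhitneyu(group1, group2, alternative="two-sided")

# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
h_stat, p_kw = stats.kruskal(group1, group2, group3)

# Welch's t-test on the same two groups, for comparison
t_stat, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Mann-Whitney U={u_stat:.1f}  p={p_mw:.4f}")
print(f"Kruskal-Wallis H={h_stat:.2f}  p={p_kw:.4f}")
print(f"Welch t={t_stat:.2f}  p={p_welch:.4f}")
```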

09 Python for Stats

PYTHON: Statistics in Python

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

# Descriptive stats
df['salary'].describe()  # count,mean,std,min,quartiles,max
np.percentile(data, [25, 50, 75])
stats.kurtosis(data)     # excess kurtosis: tail heaviness (0 for a normal)
stats.skew(data)         # asymmetry

# OLS Regression with statsmodels
X = sm.add_constant(X)   # add intercept
model = sm.OLS(y, X).fit()
print(model.summary())    # full regression table

# Bootstrap confidence interval
def bootstrap_ci(data, n_boot=1000, ci=95):
    boot_means = [np.mean(np.random.choice(data, len(data), replace=True))
                  for _ in range(n_boot)]
    lo = (100-ci)/2
    return np.percentile(boot_means, [lo, 100-lo])

# Power analysis (TTestPower covers one-sample and paired t-tests;
# use TTestIndPower for two independent groups)
from statsmodels.stats.power import TTestPower
analysis = TTestPower()
power = analysis.power(effect_size=0.5, nobs=30, alpha=0.05)
# Or: find required sample size
n = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)

10 Mini Quizzes

❓ Quiz 1

What does a p-value of 0.03 mean?

The p-value is P(observed data or more extreme | H₀ true). A p-value of 0.03 means: assuming H₀ is true, there's a 3% chance of getting data this extreme by chance. It does NOT mean H₀ has 3% probability of being true.
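A small simulation can make this interpretation tangible. The sketch below runs many one-sample t-tests on data generated with H₀ true (normal data with mean 0; the seed, sample size and number of experiments are arbitrary) and counts how often the p-value falls below a threshold.

```python
import numpy as np
from scipy import stats

# Simulate experiments in which H0 (mu = 0) is actually true
rng = np.random.default_rng(1)
n_experiments, n = 5000, 30
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))

# One t-test per row
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Under H0, p-values are roughly uniform on [0, 1]
frac_below_005 = np.mean(p_values < 0.05)   # Type I error rate at alpha = 0.05
frac_below_003 = np.mean(p_values < 0.03)

print(f"share of p < 0.05 under H0: {frac_below_005:.3f}")
print(f"share of p < 0.03 under H0: {frac_below_003:.3f}")
```

About 3% of experiments produce p < 0.03 even though H₀ is true in every one of them, which is why a small p-value says nothing directly about the probability that H₀ is true.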

❓ Quiz 2

What does R² = 0.85 mean in regression?

R² (coefficient of determination) measures proportion of variance in the dependent variable that is explained by the independent variables. R²=0.85 means 85% of the variation in y is explained by the model's predictors.