STATISTICS Complete Formula Sheet


ESSENTIAL FORMULAS & CONCEPTS
Descriptive • Probability • Distributions • Inference • Regression • Correlation
Descriptive Statistics
Central Tendency
Mean: $\bar{x} = \frac{\sum x_i}{n}$
Median: Middle value (50th percentile)
Mode: Most frequent value
Measures of Spread
Range: Max - Min
Variance: $s^2 = \frac{\sum(x_i-\bar{x})^2}{n-1}$
Std Dev: $s = \sqrt{s^2}$
IQR: $Q_3 - Q_1$
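The measures of center and spread above can be checked with a minimal Python sketch that uses only the standard library's `statistics` module. The data values are made up for illustration, and quartile conventions differ between textbooks and software, so the IQR computed here may not match other tools exactly.

```python
import statistics

# Hypothetical sample, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # sum(x_i) / n
median = statistics.median(data)      # middle value (50th percentile)
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
data_range = max(data) - min(data)    # Max - Min

# quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3]
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode, variance, std_dev, data_range, iqr)
```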
Shape Measures
Skewness: Asymmetry (0 = symmetric, > 0 = right-skewed, < 0 = left-skewed)
Kurtosis: Tail weight
Data Visualization
Histogram
Frequency distribution of continuous data
Class width = (Max-Min)/# classes
Shows shape, center, spread
Box Plot
Min, $Q_1$, Median, $Q_3$, Max
Outliers: < $Q_1 - 1.5 \cdot IQR$ or > $Q_3 + 1.5 \cdot IQR$
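The 1.5 · IQR outlier rule can be sketched as a small function. This is an illustrative helper (the name `tukey_outliers` and the sample data are assumptions, not part of the sheet), and the quartiles come from Python's default `statistics.quantiles` method, which may differ slightly from other quartile conventions.

```python
import statistics

def tukey_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one clearly extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 50]
print(tukey_outliers(sample))
```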
Scatter Plot
Two quantitative variables
Shows relationship/correlation
Stem-and-Leaf
Preserves individual values
Probability Basics
Definitions
Sample Space: All possible outcomes
Event: Subset of sample space
Probability: $0 \le P(A) \le 1$
Basic Rules
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$P(A^c) = 1 - P(A)$
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
Axioms
1. $P(A) \ge 0$ for all events
2. $P(S) = 1$ (certain event)
3. Additivity: $P(A \cup B) = P(A) + P(B)$ for disjoint events
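The basic rules can be verified on a small equally likely sample space. The sketch below assumes one roll of a fair six-sided die; the events `A` and `B` are arbitrary choices for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (illustrative assumption)
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event: roll is even
B = {4, 5, 6}  # event: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

p_union = prob(A) + prob(B) - prob(A & B)  # addition rule
p_complement = 1 - prob(A)                 # complement rule
p_a_given_b = prob(A & B) / prob(B)        # conditional probability

print(p_union, p_complement, p_a_given_b)
```

Counting the outcomes of `A | B` directly gives the same value as the addition rule, which is a quick sanity check on the formula.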
Discrete Distributions
Binomial
$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$
$E[X] = np$, $Var(X) = np(1-p)$
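A minimal sketch of the binomial formulas, assuming illustrative values $n = 10$ and $p = 0.5$. It evaluates the PMF directly and recovers the mean and variance by summing over all outcomes, which should agree with $np$ and $np(1-p)$ up to floating-point error.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the PMF itself
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(binomial_pmf(5, n, p), mean, var)
```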
Poisson
$P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$
$E[X] = \lambda$, $Var(X) = \lambda$
Rare events in fixed interval
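As a sketch of how the Poisson PMF is used, the snippet below assumes a hypothetical rate of $\lambda = 2$ events per interval and computes the probability of at least two events through the complement rule.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0  # hypothetical average number of events per interval

p_at_most_one = poisson_pmf(0, lam) + poisson_pmf(1, lam)
p_at_least_two = 1 - p_at_most_one  # complement rule

print(p_at_most_one, p_at_least_two)
```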
Geometric
$P(X=k) = (1-p)^{k-1}p$
$E[X] = \frac{1}{p}$, $Var(X) = \frac{1-p}{p^2}$
Trials until first success
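The geometric mean and variance can be checked numerically by summing the PMF over many trials. This sketch assumes $p = 0.25$ and truncates the infinite sum at 500 terms, which is an approximation (the omitted tail is negligible for this $p$).

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.25               # illustrative success probability
trials = range(1, 500)  # truncated support; the true support is infinite

mean = sum(k * geometric_pmf(k, p) for k in trials)
var = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in trials)

print(mean, var)  # should be close to 1/p and (1-p)/p^2
```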
Hypergeometric
Sampling without replacement
$P(X=k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$
$N$ = population size, $K$ = successes in population, $n$ = number of draws
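A short sketch of the hypergeometric PMF, using a card-drawing scenario as an assumed example: a 52-card deck ($N = 52$) with 4 aces ($K = 4$) and a 5-card hand ($n = 5$).

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws without replacement
    from a population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Illustrative scenario: number of aces in a 5-card hand
N, K, n = 52, 4, 5
p_one_ace = hypergeom_pmf(1, N, K, n)
total = sum(hypergeom_pmf(k, N, K, n) for k in range(min(K, n) + 1))

print(p_one_ace, total)
```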
Continuous Distributions
Normal (Gaussian)
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
68-95-99.7 rule: about 68%, 95%, 99.7% of values lie within 1, 2, 3 $\sigma$ of $\mu$
$Z = \frac{X-\mu}{\sigma}$ (standardize)
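The empirical rule and standardization can be reproduced with `statistics.NormalDist` from the Python standard library. The values $\mu = 100$, $\sigma = 15$, $x = 130$ are assumptions picked for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Standardize a hypothetical observation and find the upper-tail probability
mu, sigma, x = 100, 15, 130
z = (x - mu) / sigma
p_above = 1 - std_normal.cdf(z)

print(within, z, p_above)
```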
Exponential
$f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
$E[X] = \frac{1}{\lambda}$, $Var(X) = \frac{1}{\lambda^2}$
Waiting time between events
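A minimal sketch for the exponential distribution, assuming a hypothetical rate of $\lambda = 0.5$ events per minute. The tail probability $P(X > x) = e^{-\lambda x}$ is not listed on the sheet; it follows from integrating the density above.

```python
from math import exp

def exponential_pdf(x, lam):
    """Density f(x) = lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def exponential_tail(x, lam):
    """P(X > x) = exp(-lam * x), obtained by integrating the density."""
    return exp(-lam * x)

lam = 0.5                              # hypothetical rate: events per minute
mean_wait = 1 / lam                    # E[X] = 1 / lambda
p_wait_over_3 = exponential_tail(3, lam)

print(mean_wait, p_wait_over_3)
```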
Uniform
$f(x) = \frac{1}{b-a}$ for $a \le x \le b$
$E[X] = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$
T-Distribution
$df = n-1$ (degrees of freedom)
Heavier tails than normal
Sampling Distributions
Central Limit Theorem
$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$ (approximately, for large $n$)
Sample means are approximately normal for large $n$ (rule of thumb: n ≥ 30)
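The Central Limit Theorem can be illustrated with a small simulation. This sketch assumes samples of size 30 drawn from a Uniform(0, 1) population (so $\sigma^2 = 1/12$) and compares the spread of the simulated sample means with $\sigma/\sqrt{n}$. The seed and the number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30              # size of each sample
num_samples = 2000  # number of simulated sample means

sample_means = [
    statistics.fmean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

sigma = (1 / 12) ** 0.5           # population SD of Uniform(0, 1)
predicted_se = sigma / n ** 0.5   # sigma / sqrt(n)
observed_sd = statistics.stdev(sample_means)

print(statistics.fmean(sample_means), predicted_se, observed_sd)
```

The observed standard deviation of the sample means should land close to the predicted value, and a histogram of `sample_means` would look roughly bell-shaped even though the population is flat.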
Standard Error
$SE(\bar{x}) = \frac{s}{\sqrt{n}}$
$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$
Sample Proportion
$\hat{p} \sim N(p, \frac{p(1-p)}{n})$
Valid when $np \ge 10$, $n(1-p) \ge 10$
Difference of Means
$E[\bar{X}_1 - \bar{X}_2] = \mu_1 - \mu_2$
$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Confidence Intervals
General Form
$\text{Estimate} \pm \text{ME}$
$ME = z_{\alpha/2} \cdot SE$ or $t_{\alpha/2} \cdot SE$
Mean (Known σ)
$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
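A sketch of the known-σ interval, with the critical value taken from `statistics.NormalDist.inv_cdf`. The function name `z_interval` and the inputs ($\bar{x} = 50$, $\sigma = 10$, $n = 100$) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = z_interval(x_bar=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))
```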
Mean (Unknown σ)
$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$
Proportion
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
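The proportion interval can be sketched the same way. The counts below (240 successes out of 400) are hypothetical, and the helper name `proportion_interval` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(successes=240, n=400)
print(round(low, 3), round(high, 3))
```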
Diff of Means
$(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2} SE$
Sample Size
$n = \frac{z_{\alpha/2}^2\sigma^2}{ME^2}$ (round up to the next integer)
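A minimal sketch of the sample-size formula, assuming $\sigma = 15$, a desired margin of error of 3, and 95% confidence. The result is rounded up because a fractional observation is not possible and rounding down would miss the target margin.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * sigma**2 / margin**2)

n_needed = required_sample_size(sigma=15, margin=3)
print(n_needed)
```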
Hypothesis Testing
Framework
1. Set $H_0$ vs $H_a$
2. Choose significance level $\alpha$ (commonly 0.05)
3. Compute test statistic
4. Find p-value
5. Conclude: reject or fail to reject
Error Types
Type I ($\alpha$): Reject true $H_0$
Type II ($\beta$): Fail to reject false $H_0$
Power: $1-\beta$ = correctly reject false $H_0$
Z-Test (σ Known)
$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$
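The z-test can be sketched end to end: compute the statistic, convert it to a two-sided p-value, and compare with $\alpha$. All inputs ($\bar{x} = 52$, $\mu_0 = 50$, $\sigma = 10$, $n = 100$, $\alpha = 0.05$) are assumed values for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs
x_bar, mu_0, sigma, n = 52.0, 50.0, 10.0, 100
alpha = 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = p_value < alpha
print(z, round(p_value, 4), reject_h0)
```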
T-Test (σ Unknown)
$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$
$df = n-1$
Z-Tests & T-Tests
One-Sample Z-Test
$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$
One-Sample T-Test
$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$
Two-Sample T-Test
$t = \frac{(\bar{x}_1-\bar{x}_2)-0}{SE}$
Pooled: $SE = s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$, $s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$
$df = n_1+n_2-2$ (pooled)
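A sketch of the pooled two-sample t statistic. It assumes the standard pooled variance $s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$ with $df = n_1+n_2-2$; the two small groups are made-up data. The Python standard library has no t distribution, so the snippet stops at the statistic, and the p-value would come from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group1, group2):
    """Return (t, df) for a pooled-variance two-sample t-test of H0: mu1 = mu2."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    t = (mean(group1) - mean(group2) - 0) / se
    return t, n1 + n2 - 2

# Hypothetical measurements
t_stat, df = pooled_t([5, 7, 9], [1, 3, 5])
print(t_stat, df)
```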
Paired T-Test
$t = \frac{\bar{d}-0}{s_d/\sqrt{n}}$
$df = n-1$ (differences)
Proportion Z-Test
$z = \frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$
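A minimal sketch of the one-proportion z-test with a two-sided p-value. The counts (56 successes in 100 trials) and the null value $p_0 = 0.5$ are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs
successes, n, p_0 = 56, 100, 0.5

p_hat = successes / n
# The standard error uses p_0 because the test assumes H0 is true
z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 3), round(p_value, 4))
```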
Chi-Square & ANOVA
Chi-Square Test
$\chi^2 = \sum \frac{(O_i-E_i)^2}{E_i}$
Goodness of Fit: 1 categorical variable, $df = k-1$ ($k$ categories)
Independence: 2 categorical variables, $df = (\text{rows}-1)(\text{cols}-1)$
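The test of independence can be sketched for a contingency table stored as a list of rows. The expected counts use the standard formula $E_{ij} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$, which the sheet does not spell out; the 2×2 table is made up.

```python
def chi_square_independence(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical 2x2 table of observed counts
chi2, df = chi_square_independence([[10, 20], [20, 10]])
print(round(chi2, 4), df)
```

With $df = 1$ the 5% critical value is about 3.841, so a statistic above that would lead to rejecting independence at $\alpha = 0.05$.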
One-Way ANOVA
Compare means of 3+ groups
$F = \frac{MS_{between}}{MS_{within}}$
$MS = \frac{SS}{df}$
F-Statistic
$F_{df_1, df_2}$ distribution
$df_1 = k-1$, $df_2 = n-k$ ($k$ groups, $n$ total observations)
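A sketch of the one-way ANOVA F statistic computed from raw groups. It assumes the usual sums of squares ($SS_{between} = \sum n_j(\bar{x}_j - \bar{x})^2$ and $SS_{within} = \sum\sum (x_{ij} - \bar{x}_j)^2$), which are not written out on the sheet; the three groups are made-up data.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df1, df2 = k - 1, n - k
    f_stat = (ss_between / df1) / (ss_within / df2)  # MS_between / MS_within
    return f_stat, df1, df2

# Hypothetical data: three groups of three observations
f_stat, df1, df2 = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df1, df2)
```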
Regression & Correlation
Linear Regression
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$\hat{\beta}_1 = r\frac{s_y}{s_x}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$
Residuals & Fit
$e_i = y_i - \hat{y}_i$
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
Least squares: minimize $\sum e_i^2$
Pearson Correlation
$r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}$
$-1 \le r \le 1$
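The correlation and regression formulas above fit together, as this minimal sketch shows on five made-up points. It computes $r$, then the slope and intercept from $r$, $s_x$, $s_y$, and finally $R^2$ from the residuals. For simple linear regression $R^2$ equals $r^2$, which serves as a consistency check.

```python
from statistics import mean, stdev

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Pearson correlation
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Least-squares slope and intercept
beta1 = r * s_y / s_x
beta0 = y_bar - beta1 * x_bar

# Residuals and coefficient of determination
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
ss_res = sum(e**2 for e in residuals)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(round(r, 4), round(beta0, 4), round(beta1, 4), round(r_squared, 4))
```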
Multiple Regression
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
Critical Values & Quick Reference
Common Z-Values
Level: z
90%: 1.645
95%: 1.96
99%: 2.576
Alpha Levels
$\alpha$: $\alpha/2$
0.10: 0.05
0.05: 0.025
0.01: 0.005
Effect Size (Cohen)
Size: d
Small: 0.2
Medium: 0.5
Large: 0.8
Correlation Strength
$|r|$: Strength
< 0.3: Weak
0.3-0.7: Moderate
> 0.7: Strong
Essential Probability & Statistics Summary
Bayes' Theorem
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
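Bayes' theorem is easiest to see with a diagnostic-test example. The numbers below (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical, and $P(B)$ is expanded with the law of total probability, $P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)$.

```python
# Hypothetical inputs: A = has condition, B = tests positive
p_a = 0.01              # P(A), prevalence
p_b_given_a = 0.95      # P(B|A), sensitivity
p_b_given_not_a = 0.05  # P(B|A^c), false-positive rate

# Law of total probability for the denominator P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_b, 4), round(p_a_given_b, 4))
```

Even with an accurate test, the posterior probability stays fairly low here because the condition is rare.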
Expected Value
$E[X] = \sum x_i P(X=x_i)$
Variance Formula
$Var(X) = E[X^2] - (E[X])^2$
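The expected value and variance formulas can be checked on a fair six-sided die, an assumed example. Exact fractions are used so the two ways of computing the variance can be compared without rounding.

```python
from fractions import Fraction

# Fair die: each face 1..6 has probability 1/6 (illustrative distribution)
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

e_x = sum(x * p for x, p in pmf.items())       # E[X]
e_x2 = sum(x**2 * p for x, p in pmf.items())   # E[X^2]
var_shortcut = e_x2 - e_x**2                   # E[X^2] - (E[X])^2

# Definition of variance, for comparison: E[(X - E[X])^2]
var_definition = sum((x - e_x) ** 2 * p for x, p in pmf.items())

print(e_x, var_shortcut, var_definition)
```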
Standard Normal
$Z \sim N(0,1)$