Statistics – Complete Formula Sheet

Descriptive Statistics

Central Tendency

Mean: $\bar{x} = \frac{\sum x_i}{n}$

Median: Middle value (50th percentile)

Mode: Most frequent value

Measures of Spread

Range: Max - Min

Variance: $s^2 = \frac{\sum(x_i-\bar{x})^2}{n-1}$

Std Dev: $s = \sqrt{s^2}$

IQR: $Q_3 - Q_1$
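The central-tendency and spread formulas above can be checked with Python's standard `statistics` module. This is a minimal sketch; the sample `data` is made up for illustration and does not come from this sheet.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample (assumption)

mean = statistics.mean(data)          # sum(x_i) / n
median = statistics.median(data)      # middle value (average of the two middle values here)
mode = statistics.mode(data)          # most frequent value
value_range = max(data) - min(data)   # Max - Min
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

print(mean, median, mode, value_range, variance, std_dev)
```

Note that `statistics.variance` and `statistics.stdev` use the $n-1$ denominator shown above; `pvariance` and `pstdev` divide by $n$ instead.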

Shape Measures

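Skewness: Asymmetry of the distribution (0 = symmetric, > 0 = right-skewed, < 0 = left-skewed)

Kurtosis: Tail weight (normal distribution = 3; excess kurtosis = kurtosis - 3)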

Data Visualization

Histogram

Frequency distribution of continuous data

Class width ≈ (Max - Min) / number of classes (round up)

Shows shape, center, spread

Box Plot

Min, $Q_1$, Median, $Q_3$, Max

Outliers: < $Q_1 - 1.5 \cdot IQR$ or > $Q_3 + 1.5 \cdot IQR$
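The 1.5 · IQR outlier rule can be sketched as follows. Quartile conventions differ between textbooks and software, so the `method="inclusive"` choice below is an assumption, as is the sample data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # illustrative sample with one extreme value

# Quartiles (Q1, median, Q3); "inclusive" is one of several common conventions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With a different quartile method the fences shift slightly, so borderline points may or may not be flagged.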

Scatter Plot

Two quantitative variables

Shows relationship/correlation

Stem-and-Leaf

Preserves individual values

Probability Basics

Definitions

Sample Space: All possible outcomes

Event: Subset of sample space

Probability: $0 \le P(A) \le 1$

Basic Rules

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$P(A^c) = 1 - P(A)$

$P(A|B) = \frac{P(A \cap B)}{P(B)}$
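As a quick check of the three rules above, here is a small sketch using one roll of a fair die. The events `A` (even) and `B` (greater than 3) are chosen only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                 # one roll of a fair die
A = {x for x in sample_space if x % 2 == 0}       # even outcome
B = {x for x in sample_space if x > 3}            # outcome greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_union = prob(A) + prob(B) - prob(A & B)   # addition rule
p_complement = 1 - prob(A)                  # complement rule
p_a_given_b = prob(A & B) / prob(B)         # conditional probability

print(p_union, p_complement, p_a_given_b)
```

Using `Fraction` keeps the results exact, which makes it easy to confirm that the addition rule agrees with counting the outcomes in `A | B` directly.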

Axioms

1. $P(A) \ge 0$ for all events

2. $P(S) = 1$ (certain event)

3. Additivity: $P(A \cup B) = P(A) + P(B)$ for disjoint events

Discrete Distributions

Binomial

$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$

$E[X] = np$, $Var(X) = np(1-p)$
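The binomial formulas can be verified numerically. The sketch below builds the full PMF for illustrative values `n = 10`, `p = 0.3` (assumptions) and recovers $np$ and $np(1-p)$ by direct summation.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                                   # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))                     # should match n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))   # should match n * p * (1 - p)

print(total, mean, variance)
```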

Poisson

$P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$

$E[X] = \lambda$, $Var(X) = \lambda$

Rare events in fixed interval
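A minimal sketch of the Poisson PMF, with an illustrative rate `lam = 2.0` (an assumption). The sum is truncated at 60 terms, which is more than enough for a rate this small.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0  # illustrative rate: average number of events per interval
pmf = [poisson_pmf(k, lam) for k in range(60)]  # truncated support

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(total, mean, variance)  # mean and variance should both be close to lam
```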

Geometric

$P(X=k) = (1-p)^{k-1}p$

$E[X] = \frac{1}{p}$, $Var(X) = \frac{1-p}{p^2}$

Trials until first success
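The geometric mean and variance can be approximated by summing the PMF over a long but finite range. This is a sketch with an illustrative success probability `p = 0.25`; the cutoff at 500 trials is an assumption that leaves a negligible tail.

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.25            # illustrative success probability
ks = range(1, 501)  # truncated support; the remaining tail mass is negligible

mean = sum(k * geometric_pmf(k, p) for k in ks)
variance = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)

print(mean, 1 / p)               # compare with E[X] = 1/p
print(variance, (1 - p) / p**2)  # compare with Var(X) = (1-p)/p^2
```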

Hypergeometric

Sampling without replacement

$P(X=k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$ ($N$ = population size, $K$ = successes in population, $n$ = draws, $k$ = successes drawn)
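A short sketch of the hypergeometric formula, using a card-drawing example chosen for illustration: the probability of exactly one ace in a 5-card hand from a standard 52-card deck.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement; K successes among N items)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 52, 4, 5  # deck size, aces in the deck, cards drawn (illustrative)

p_one_ace = hypergeom_pmf(1, N, K, n)
total = sum(hypergeom_pmf(k, N, K, n) for k in range(min(K, n) + 1))

print(p_one_ace, total)  # total should be 1
```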

Continuous Distributions

Normal (Gaussian)

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

68-95-99.7 rule: about 68%, 95%, and 99.7% of values lie within 1, 2, and 3 $\sigma$ of $\mu$

$Z = \frac{X-\mu}{\sigma}$ (standardize)
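The density, the 68-95-99.7 rule, and standardization can be checked with `statistics.NormalDist` from the standard library. The values of `mu`, `sigma`, and `x` below are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density f(x), written out from the formula above."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability of falling within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Standardizing a hypothetical observation.
mu, sigma, x = 100.0, 15.0, 130.0
z = (x - mu) / sigma

print(within, z, normal_pdf(0.0, 0.0, 1.0))
```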

Exponential

$f(x) = \lambda e^{-\lambda x}$, $x \ge 0$

$E[X] = \frac{1}{\lambda}$, $Var(X) = \frac{1}{\lambda^2}$

Waiting time between events

Uniform

$f(x) = \frac{1}{b-a}$ for $a \le x \le b$

$E[X] = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$

T-Distribution

$df = n-1$ (degrees of freedom)

Heavier tails than normal

Sampling Distributions

Central Limit Theorem

$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$

Sample means are approximately normal for large $n$ (rule of thumb: $n \ge 30$)
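The Central Limit Theorem can be illustrated with a small simulation. This sketch draws repeated samples of size 30 from a skewed exponential population and compares the spread of the sample means with $\sigma/\sqrt{n}$. The rate, sample size, replication count, and seed are all arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

lam = 1.0    # exponential rate: population mean and std dev are both 1/lam
n = 30       # size of each sample
reps = 5000  # number of samples drawn

sample_means = [
    statistics.fmean(random.expovariate(lam) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)  # should be near mu = 1/lam
sd_of_means = statistics.stdev(sample_means)    # should be near sigma / sqrt(n)
theoretical_se = (1 / lam) / n**0.5

print(mean_of_means, sd_of_means, theoretical_se)
```

A histogram of `sample_means` would look roughly bell-shaped even though the underlying exponential population is strongly right-skewed.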

Standard Error

$SE(\bar{x}) = \frac{s}{\sqrt{n}}$

$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$

Sample Proportion

$\hat{p} \sim N(p, \frac{p(1-p)}{n})$

Valid when $np \ge 10$, $n(1-p) \ge 10$

Difference of Means

$E[\bar{X}_1 - \bar{X}_2] = \mu_1 - \mu_2$

$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Confidence Intervals

General Form

$\text{Estimate} \pm \text{ME}$

$ME = z_{\alpha/2} \cdot SE$ or $t_{\alpha/2} \cdot SE$

Mean (Known σ)

$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$

Mean (Unknown σ)

$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$

Proportion

$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
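The z-based intervals above can be sketched with the standard library, which provides the normal quantile through `NormalDist.inv_cdf`. All sample figures below are hypothetical. A t-based interval would need a t quantile, which the standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist

confidence = 0.95
alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}

# Mean with known sigma (hypothetical numbers).
x_bar, sigma, n = 50.0, 10.0, 100
me_mean = z * sigma / sqrt(n)
ci_mean = (x_bar - me_mean, x_bar + me_mean)

# Proportion (hypothetical numbers).
p_hat, n_p = 0.52, 1000
me_prop = z * sqrt(p_hat * (1 - p_hat) / n_p)
ci_prop = (p_hat - me_prop, p_hat + me_prop)

print(z, ci_mean, ci_prop)
```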

Diff of Means

$(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2} SE$

Sample Size

$n = \frac{z_{\alpha/2}^2\sigma^2}{ME^2}$ (round up to the next whole number)
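A minimal sketch of the sample-size calculation for a 95% interval, assuming a hypothetical $\sigma = 15$ and a target margin of error of 2. The result is rounded up so the margin is not exceeded.

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # z_{alpha/2} for 95% confidence
sigma = 15.0                     # assumed population standard deviation
margin = 2.0                     # desired margin of error

n_exact = (z**2 * sigma**2) / margin**2
n_required = ceil(n_exact)       # always round up to a whole observation

print(n_exact, n_required)
```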

Hypothesis Testing

Framework

1. Set $H_0$ vs $H_a$

2. Choose significance level $\alpha$ (commonly 0.05)

3. Compute test statistic

4. Find p-value

5. Conclude: reject or fail to reject

Error Types

Type I ($\alpha$): Reject true $H_0$

Type II ($\beta$): Fail to reject false $H_0$

Power: $1-\beta$ = probability of correctly rejecting a false $H_0$

Z-Test (σ Known)

$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$

T-Test (σ Unknown)

$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$

$df = n-1$
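The one-sample t statistic can be computed directly from the formula. This sketch uses made-up data and a hypothetical null value `mu_0`; it stops at the statistic and degrees of freedom, since the p-value needs a t table or a t-distribution routine that the standard library does not include.

```python
from math import sqrt
import statistics

data = [12, 14, 15, 13, 16]  # illustrative sample
mu_0 = 12                    # hypothesized mean under H0 (assumption)

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)   # sample standard deviation (n - 1 denominator)

t_stat = (x_bar - mu_0) / (s / sqrt(n))
df = n - 1

print(t_stat, df)
```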

Z-Tests & T-Tests

One-Sample Z-Test

$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$

One-Sample T-Test

$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$

Two-Sample T-Test

$t = \frac{(\bar{x}_1-\bar{x}_2)-0}{SE}$

Pooled: $SE = s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$, where $s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$ and $df = n_1+n_2-2$
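A sketch of the pooled two-sample t statistic. It uses the standard pooled variance, a weighted average of the two sample variances with weights $n_i - 1$, and assumes equal population variances. The two groups are invented for illustration.

```python
from math import sqrt
import statistics

group_1 = [12, 14, 15, 13, 16]  # illustrative data
group_2 = [10, 11, 13, 12, 9]

n1, n2 = len(group_1), len(group_2)
var1 = statistics.variance(group_1)
var2 = statistics.variance(group_2)

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)

t_stat = (statistics.mean(group_1) - statistics.mean(group_2)) / se
df = n1 + n2 - 2

print(t_stat, df)
```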

Paired T-Test

$t = \frac{\bar{d}-0}{s_d/\sqrt{n}}$

$df = n-1$ (differences)
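For the paired test, the calculation reduces to a one-sample t statistic on the differences. The before/after scores below are hypothetical.

```python
from math import sqrt
import statistics

before = [80, 75, 90, 85]  # illustrative paired measurements
after = [84, 78, 91, 89]

diffs = [a - b for a, b in zip(after, before)]  # one difference per pair
n = len(diffs)

d_bar = statistics.mean(diffs)
s_d = statistics.stdev(diffs)

t_stat = (d_bar - 0) / (s_d / sqrt(n))
df = n - 1

print(diffs, t_stat, df)
```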

Proportion Z-Test

$z = \frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$
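The proportion z-test, including a two-sided p-value, can be sketched with `NormalDist`. The observed proportion, null value, and sample size are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.56  # observed sample proportion (assumption)
p_0 = 0.50    # proportion under H0 (assumption)
n = 100

se = sqrt(p_0 * (1 - p_0) / n)  # standard error uses p_0, not p_hat
z = (p_hat - p_0) / se

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(z, p_value)
```

At $\alpha = 0.05$ this p-value is well above the threshold, so the sketch ends in "fail to reject $H_0$".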

Chi-Square & ANOVA

Chi-Square Test

$\chi^2 = \sum \frac{(O_i-E_i)^2}{E_i}$

Goodness of Fit: 1 categorical variable, $df = k-1$ ($k$ categories)

Independence: 2 categorical variables, $df = (\text{rows}-1)(\text{cols}-1)$
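A sketch of the chi-square test of independence on a 2×2 table of made-up counts. Expected counts use the usual rule, row total × column total / grand total.

```python
observed = [
    [30, 10],
    [20, 40],
]  # illustrative contingency table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi2, df)
```

The statistic is then compared with a chi-square critical value for the given `df`; that lookup is left to a table here.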

One-Way ANOVA

Compare means of 3+ groups

$F = \frac{MS_{between}}{MS_{within}}$

$MS = \frac{SS}{df}$

F-Statistic

$F_{df_1, df_2}$ distribution

$df_1 = k-1$, $df_2 = n-k$ ($k$ groups, $n$ total observations)
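The one-way ANOVA F statistic can be assembled from the between-group and within-group sums of squares. The three groups below are invented for illustration.

```python
import statistics

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # illustrative data, k = 3 groups

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = statistics.fmean(x for g in groups for x in g)

# Between-group variation: group means around the grand mean.
ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
# Within-group variation: observations around their own group mean.
ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

df1, df2 = k - 1, n - k
ms_between = ss_between / df1
ms_within = ss_within / df2
f_stat = ms_between / ms_within

print(f_stat, df1, df2)
```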

Regression & Correlation

Linear Regression

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

$\hat{\beta}_1 = r\frac{s_y}{s_x}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$

Residuals & Fit

$e_i = y_i - \hat{y}_i$

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

Least squares: minimize $\sum e_i^2$

Pearson Correlation

$r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}$

$-1 \le r \le 1$
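The correlation, slope, intercept, and $R^2$ formulas fit together as shown in this sketch, written out by hand rather than with a fitting library. The `x` and `y` values are illustrative.

```python
import statistics

x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Pearson correlation.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Least-squares slope and intercept.
beta_1 = r * s_y / s_x
beta_0 = y_bar - beta_1 * x_bar

# Residuals and coefficient of determination.
y_hat = [beta_0 + beta_1 * xi for xi in x]
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(r, beta_0, beta_1, r_squared)
```

For simple linear regression with one predictor, `r_squared` equals `r ** 2`, which gives a handy consistency check.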

Multiple Regression

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

Confidence Level	$z_{\alpha/2}$
90%	1.645
95%	1.96
99%	2.576

α	α/2
0.10	0.05
0.05	0.025
0.01	0.005

Effect Size	Cohen's $d$
Small	0.2
Medium	0.5
Large	0.8

|r|	Strength
<0.3	Weak
0.3-0.7	Moderate
>0.7	Strong
