Hypothesis Testing

Statistics is not just about describing data. Often you need to make a decision: does a new drug work? Is one algorithm faster than another? Has the average changed? Hypothesis testing gives you a structured framework for answering these questions using data.
The idea is simple: assume nothing has changed (the "null hypothesis"), then check whether the data is so extreme that this assumption becomes hard to believe.
The null hypothesis ($H_0$) is the default claim, usually a statement of "no effect" or "no difference." For example: "the average delivery time is still 30 minutes" or "the new model is no better than the old one."
The alternative hypothesis ($H_1$ or $H_a$) is what you suspect might be true instead: "the average delivery time has changed" or "the new model is better."
You never prove $H_1$ directly. Instead, you ask: if $H_0$ were true, how likely is it that I would see data this extreme? If it is very unlikely, you reject $H_0$ in favour of $H_1$.
The test statistic is a single number that summarises how far your sample result is from what $H_0$ predicts. Different tests use different formulas, but the logic is always the same: measure the distance between observed and expected.
The p-value is the probability of observing a test statistic at least as extreme as yours, assuming $H_0$ is true. A small p-value means the data is surprising under $H_0$.
The significance level ($\alpha$) is the threshold you set before looking at the data. If $p \le \alpha$, you reject $H_0$. Common choices are $\alpha = 0.05$ (5%) and $\alpha = 0.01$ (1%).

[Figure: normal curve with the rejection regions shaded, the test statistic marked, and the p-value area highlighted]

The shaded tails are the rejection regions. If your test statistic lands there, the data is surprising enough under $H_0$ that you reject it. The green area shows the p-value for a particular test statistic.
Here is the step-by-step procedure:
- Step 1: State $H_0$ and $H_1$
- Step 2: Choose a significance level $\alpha$
- Step 3: Collect data and compute the test statistic
- Step 4: Find the p-value (or compare the test statistic to a critical value)
- Step 5: If $p \le \alpha$, reject $H_0$. Otherwise, fail to reject $H_0$
Worked example: a factory claims its bolts have a mean length of 10 cm. You measure 36 bolts and find a sample mean of 10.3 cm. The population standard deviation is known to be 0.9 cm. Is there evidence that the mean has changed?
$H_0$: $\mu = 10$, $H_1$: $\mu \neq 10$, $\alpha = 0.05$
Test statistic (z-test, since $\sigma$ is known and $n$ is large):

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{10.3 - 10}{0.9 / \sqrt{36}} = \frac{0.3}{0.15} = 2.0$$

For a two-tailed test at $\alpha = 0.05$, the critical values are $\pm 1.96$. Our $z = 2.0 > 1.96$, so we reject $H_0$. The p-value is approximately 0.046, which is less than 0.05.
Conclusion: there is statistically significant evidence that the mean bolt length differs from 10 cm.
A one-tailed test checks for an effect in one specific direction ($H_1$: $\mu > 10$ or $\mu < 10$). The entire $\alpha$ goes into one tail, making it easier to reject $H_0$ in that direction but impossible to detect an effect in the opposite direction.
A two-tailed test checks for any difference ($H_1$: $\mu \neq 10$). The $\alpha$ is split between both tails ($\alpha/2$ each). This is more conservative but catches effects in either direction.
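The difference between the two kinds of test can be made concrete with the bolt example's $z = 2.0$. The following is a minimal sketch, assuming `jax.scipy.stats.norm` for the normal CDF and its inverse; the variable names are illustrative.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

z = 2.0        # test statistic from the bolt example
alpha = 0.05

# Critical values: all of alpha in one tail vs alpha/2 in each tail
crit_one = norm.ppf(1 - alpha)
crit_two = norm.ppf(1 - alpha / 2)

# p-values for the same z under each alternative hypothesis
p_upper = 1 - norm.cdf(z)                 # H1: mu > 10
p_lower = norm.cdf(z)                     # H1: mu < 10 (data points the wrong way)
p_two = 2 * (1 - norm.cdf(jnp.abs(z)))    # H1: mu != 10

print(f"critical value, one-tailed: {float(crit_one):.3f}")
print(f"critical value, two-tailed: {float(crit_two):.3f}")
print(f"p (mu > 10):  {float(p_upper):.4f}")
print(f"p (mu < 10):  {float(p_lower):.4f}")
print(f"p (mu != 10): {float(p_two):.4f}")
```

The one-tailed p-value in the direction of the data is half the two-tailed one, while the one-tailed test in the opposite direction gives a p-value close to 1.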
Even with a good procedure, mistakes happen. There are exactly two types of errors:

[Figure: 2x2 grid of reality vs decision, showing where Type I and Type II errors occur]

Type I Error (false positive): you reject $H_0$ when it is actually true. The probability of this is $\alpha$, which you control by choosing your significance level. Like a fire alarm going off when there is no fire.
Type II Error (false negative): you fail to reject $H_0$ when it is actually false. The probability of this is $\beta$. Like a fire alarm staying silent during a real fire.
Power is $1 - \beta$, the probability of correctly rejecting a false $H_0$. Higher power means you are better at detecting real effects. Power increases when:
- The true effect size is larger (bigger differences are easier to detect)
- The sample size is larger (more data = more precision)
- The significance level $\alpha$ is larger (but this raises Type I error risk)
- The variability is lower (less noise)
There is a tension between Type I and Type II errors. Lowering $\alpha$ (being more cautious about false positives) increases $\beta$ (more false negatives). You cannot minimise both simultaneously with a fixed sample size.
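For a two-tailed z-test the power can be written in closed form, which makes the four factors above easy to check. The sketch below is illustrative: `z_test_power` is a hypothetical helper, and the baseline numbers reuse the bolt example under the assumption that the true mean is 10.3 cm.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed z-test when the true mean is mu_0 + effect."""
    z_crit = norm.ppf(1 - alpha / 2)
    # Under H1 the z statistic is centred here instead of at 0
    shift = effect / (sigma / jnp.sqrt(n))
    # Probability that z falls in the upper or the lower rejection region
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

base = z_test_power(effect=0.3, sigma=0.9, n=36)
bigger_effect = z_test_power(effect=0.5, sigma=0.9, n=36)
bigger_n = z_test_power(effect=0.3, sigma=0.9, n=100)
stricter_alpha = z_test_power(effect=0.3, sigma=0.9, n=36, alpha=0.01)
less_noise = z_test_power(effect=0.3, sigma=0.6, n=36)

print(f"baseline power:       {float(base):.3f}")
print(f"larger effect:        {float(bigger_effect):.3f}")
print(f"larger sample:        {float(bigger_n):.3f}")
print(f"alpha = 0.01:         {float(stricter_alpha):.3f}")
print(f"lower variability:    {float(less_noise):.3f}")
```

Lowering $\alpha$ from 0.05 to 0.01 reduces the power, which is the Type I versus Type II trade-off in numbers.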
Parametric tests assume the data follows a specific distribution (usually normal). They are more powerful when the assumptions hold.
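Z-test: compares a sample mean to a hypothesised value when the population standard deviation $\sigma$ is known. If the data are not normally distributed, a large sample ($n \ge 30$ is a common rule of thumb) lets the central limit theorem justify the normal approximation. Test statistic: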

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

T-test: like the z-test, but for when $\sigma$ is unknown (estimated from the sample) or $n$ is small. Uses the t-distribution, which has heavier tails than the normal. The heavier tails account for the extra uncertainty from estimating $\sigma$.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

The t-distribution has a parameter called degrees of freedom ($df$). For a one-sample t-test, $df = n - 1$. As $df$ increases, the t-distribution approaches the normal distribution.
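As a small illustration, here is a one-sample t-test computed by hand. The delivery times are made-up data, and the critical value is copied from a standard t-table rather than computed, so treat this as a sketch rather than a complete implementation.

```python
import jax.numpy as jnp

# Hypothetical delivery times in minutes (illustrative data, not from the text)
times = jnp.array([31.2, 29.8, 32.5, 30.9, 33.1, 30.4, 31.8, 32.3])
mu_0 = 30.0                      # H0: mean delivery time is 30 minutes

n = times.shape[0]
df = n - 1                       # degrees of freedom for a one-sample test
s = jnp.std(times, ddof=1)       # sample standard deviation (divides by n - 1)
t_stat = (jnp.mean(times) - mu_0) / (s / jnp.sqrt(n))

# Two-tailed critical value for df = 7 and alpha = 0.05, taken from a t-table
t_crit = 2.365
reject = bool(jnp.abs(t_stat) > t_crit)

print(f"t = {float(t_stat):.3f}, df = {df}, reject H0: {reject}")
```

Note the `ddof=1` argument: without it the standard deviation divides by $n$ instead of $n - 1$ and the statistic comes out too large.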
There are several flavours of t-test:
- One-sample t-test: is the sample mean different from a specific value?
- Independent two-sample t-test: are the means of two separate groups different?
- Paired t-test: are the means of two related measurements different (e.g. before and after treatment on the same subjects)?
ANOVA (Analysis of Variance): tests whether three or more group means are equal. Instead of running multiple t-tests (which inflates the Type I error rate), ANOVA does a single test by comparing the variance between groups to the variance within groups.

$$F = \frac{\text{variance between groups}}{\text{variance within groups}}$$

A large $F$ ratio means the groups differ more than you would expect from random variation alone.
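The $F$ ratio can be computed directly from its definition. The sketch below uses three small made-up groups; the "variances" in the formula are the between-group and within-group mean squares, and the critical value is taken from an F-table rather than computed.

```python
import jax.numpy as jnp

# Three hypothetical groups (illustrative data)
groups = [
    jnp.array([4.0, 5.0, 6.0, 5.0]),
    jnp.array([6.0, 7.0, 8.0, 7.0]),
    jnp.array([9.0, 8.0, 10.0, 9.0]),
]
k = len(groups)
n_total = sum(g.shape[0] for g in groups)
grand_mean = jnp.concatenate(groups).mean()

# Between groups: how far each group mean sits from the grand mean
ss_between = sum(g.shape[0] * (g.mean() - grand_mean) ** 2 for g in groups)
# Within groups: spread of the observations around their own group mean
ss_within = sum(jnp.sum((g - g.mean()) ** 2) for g in groups)

ms_between = ss_between / (k - 1)          # df = k - 1
ms_within = ss_within / (n_total - k)      # df = N - k
f_stat = ms_between / ms_within

f_crit = 4.26   # F-table value for (2, 9) degrees of freedom at alpha = 0.05
print(f"F = {float(f_stat):.2f}, reject H0: {bool(f_stat > f_crit)}")
```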
Non-parametric tests make fewer assumptions about the data distribution. Many of them work on ranks rather than raw values, which makes them robust to outliers and non-normality.
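To see why ranks are robust, compare what one extreme value does to the mean and to the ranks. This is a minimal sketch with made-up numbers; the `ranks` helper is hypothetical and assumes there are no tied values.

```python
import jax.numpy as jnp

def ranks(x):
    """Ranks 1..n of the values in x, assuming no ties."""
    return jnp.argsort(jnp.argsort(x)) + 1

clean = jnp.array([12.0, 15.0, 11.0, 14.0, 13.0])
with_outlier = clean.at[1].set(1500.0)   # same data with one extreme value

print("means:", float(clean.mean()), float(with_outlier.mean()))
print("ranks:", ranks(clean).tolist(), ranks(with_outlier).tolist())
```

The mean moves a long way, but the ranks are identical, so any statistic built only from ranks is unaffected by how extreme the largest value is.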
Chi-square test ($\chi^2$): tests whether observed frequencies match expected frequencies. Used for categorical data. For example: do the proportions of red, blue, and green cars match the manufacturer's claimed proportions?

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
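Applied to the car colour example, the statistic looks like this. The counts and claimed proportions are invented for illustration, and the critical value comes from a chi-square table.

```python
import jax.numpy as jnp

# Hypothetical counts of red, blue and green cars in a sample of 100
observed = jnp.array([50.0, 30.0, 20.0])
# Hypothetical proportions claimed by the manufacturer
claimed = jnp.array([0.4, 0.4, 0.2])

expected = claimed * observed.sum()          # expected counts under H0
chi2_stat = jnp.sum((observed - expected) ** 2 / expected)
df = observed.shape[0] - 1                   # categories minus one

chi2_crit = 5.991   # chi-square table value for df = 2 at alpha = 0.05
print(f"chi2 = {float(chi2_stat):.3f}, df = {df}, "
      f"reject H0: {bool(chi2_stat > chi2_crit)}")
```

Here the statistic falls just short of the critical value, so the sample is not strong enough evidence against the claimed proportions at the 5% level.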

Mann-Whitney U test: the non-parametric alternative to the independent two-sample t-test. It tests whether one group tends to have larger values than the other by comparing ranks.
Wilcoxon signed-rank test: the non-parametric alternative to the paired t-test. Compares paired observations by looking at the magnitude and direction of differences.
Kruskal-Wallis test: the non-parametric alternative to one-way ANOVA. Tests whether multiple groups come from the same distribution by comparing ranks across all groups.
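As one example of a rank-based statistic, the Kruskal-Wallis $H$ can be computed from the rank sum $R_j$ of each group: $H = \frac{12}{N(N+1)} \sum_j \frac{R_j^2}{n_j} - 3(N+1)$. The sketch below assumes no tied values (ties need a correction that is omitted here), uses made-up data, and compares $H$ to a chi-square critical value, which is only an approximation for groups this small.

```python
import jax.numpy as jnp

# Three hypothetical groups with no tied values (illustrative data)
groups = [
    jnp.array([2.3, 4.2, 1.1]),
    jnp.array([3.0, 5.5, 7.4]),
    jnp.array([6.1, 8.0, 9.6]),
]

pooled = jnp.concatenate(groups)
n_total = pooled.shape[0]
# Rank all observations together: 1 = smallest, N = largest
all_ranks = jnp.argsort(jnp.argsort(pooled)) + 1

total = 0.0
start = 0
for g in groups:
    n_j = g.shape[0]
    rank_sum = jnp.sum(all_ranks[start:start + n_j])   # R_j for this group
    total = total + rank_sum ** 2 / n_j
    start += n_j

h_stat = 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

chi2_crit = 5.991   # chi-square table value for df = 2 at alpha = 0.05
print(f"H = {float(h_stat):.3f}, reject H0: {bool(h_stat > chi2_crit)}")
```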
Goodness-of-fit tests check whether your data follows a specific theoretical distribution. The chi-square goodness-of-fit test compares observed bin counts to expected counts under the hypothesised distribution.
Normality tests specifically check whether data is normally distributed. Common ones include the Shapiro-Wilk test (powerful for small samples) and the Kolmogorov-Smirnov test (compares the sample CDF to the theoretical CDF).
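The Kolmogorov-Smirnov statistic is simply the largest vertical gap between the two CDFs. The sketch below tests against a fully specified standard normal; `ks_statistic` is a hypothetical helper. If the mean and standard deviation were estimated from the same sample, the usual critical values would no longer apply.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def ks_statistic(sample):
    """Largest gap between the sample CDF and the standard normal CDF."""
    x = jnp.sort(sample)
    n = x.shape[0]
    theoretical = norm.cdf(x)
    upper = jnp.arange(1, n + 1) / n    # sample CDF just after each point
    lower = jnp.arange(0, n) / n        # sample CDF just before each point
    return jnp.maximum(jnp.max(upper - theoretical),
                       jnp.max(theoretical - lower))

key = jax.random.PRNGKey(0)
sample = jax.random.normal(key, shape=(200,))

d_normal = ks_statistic(sample)          # data really is N(0, 1)
d_shifted = ks_statistic(sample + 1.0)   # same data with the mean moved to 1

# Rough large-sample critical value at alpha = 0.05
d_crit = 1.36 / jnp.sqrt(sample.shape[0])
print(f"D (normal data):  {float(d_normal):.3f}")
print(f"D (shifted data): {float(d_shifted):.3f}")
print(f"critical value:   {float(d_crit):.3f}")
```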
In ML, hypothesis testing appears when you compare model performance. If model A achieves 92% accuracy and model B achieves 91%, is the difference real or just noise? A paired t-test on the per-fold cross-validation scores of the two models is one common way to check.
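A paired t-test is a one-sample t-test on the per-fold differences. The sketch below uses invented fold accuracies for two hypothetical models and a critical value from a t-table.

```python
import jax.numpy as jnp

# Hypothetical accuracies on the same 5 cross-validation folds
scores_a = jnp.array([0.92, 0.93, 0.91, 0.94, 0.92])
scores_b = jnp.array([0.91, 0.91, 0.90, 0.92, 0.91])

# Pairing: compare the models fold by fold, then test whether the mean difference is 0
diffs = scores_a - scores_b
n = diffs.shape[0]
df = n - 1
t_stat = jnp.mean(diffs) / (jnp.std(diffs, ddof=1) / jnp.sqrt(n))

t_crit = 2.776   # two-tailed t-table value for df = 4 at alpha = 0.05
print(f"mean difference = {float(jnp.mean(diffs)):.3f}")
print(f"t = {float(t_stat):.2f}, df = {df}, reject H0: {bool(jnp.abs(t_stat) > t_crit)}")
```

Cross-validation folds share training data, so the differences are not fully independent and the resulting p-value should be read as approximate.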

Coding Tasks (use Colab or a notebook)

Perform a z-test for the bolt factory example from the text. Compute the test statistic, p-value, and make a decision.

import jax.numpy as jnp
from jax.scipy.stats import norm

x_bar = 10.3    # sample mean
mu_0 = 10.0     # null hypothesis value
sigma = 0.9     # known population std
n = 36          # sample size
alpha = 0.05

# Test statistic: distance from mu_0 in standard-error units
z = (x_bar - mu_0) / (sigma / jnp.sqrt(n))
print(f"z = {z:.4f}")

# Two-tailed p-value from the standard normal CDF
# For |z| = 2.0, p ≈ 0.0455
p_value = 2 * (1 - norm.cdf(jnp.abs(z)))
print(f"p-value = {p_value:.4f}")
print(f"Reject H₀? {bool(p_value <= alpha)}")

Simulate Type I error: when $H_0$ is true, how often do we mistakenly reject it? Run 10,000 experiments and check that the rejection rate matches $\alpha$.

import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

key = jax.random.PRNGKey(0)
mu_0 = 50.0
sigma = 10.0
n = 30
alpha = 0.05
n_experiments = 10_000

# Draw every experiment at once: one row per experiment, with H0 true in all of them
samples = mu_0 + sigma * jax.random.normal(key, shape=(n_experiments, n))

# z statistic and two-tailed p-value for each experiment
z = (samples.mean(axis=1) - mu_0) / (sigma / jnp.sqrt(n))
p_values = 2 * (1 - norm.cdf(jnp.abs(z)))

# Fraction of experiments that wrongly reject H0
rejection_rate = jnp.mean(p_values <= alpha)
print(f"Rejection rate: {rejection_rate:.4f}")
print(f"Expected (α):   {alpha}")

Compare a t-test and a Mann-Whitney U test on two groups. Generate data where one group has a slightly higher mean and see which test detects the difference.

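import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

key = jax.random.PRNGKey(99)
k1, k2 = jax.random.split(key)

group_a = jax.random.normal(k1, shape=(25,)) * 5 + 100
group_b = jax.random.normal(k2, shape=(25,)) * 5 + 103  # slightly higher mean

# Two-sample t-test (equal variance assumed)
n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = group_a.mean(), group_b.mean()
# ddof=1 gives the unbiased sample variances that the pooled formula expects
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
se = jnp.sqrt(pooled_var * (1/n_a + 1/n_b))
t_stat = (mean_a - mean_b) / se
print(f"T-test statistic: {t_stat:.4f}  (|t| > 2.01 rejects at α = 0.05, df = 48)")

# Mann-Whitney U: number of (a, b) pairs in which the group_b value is larger
u_stat = jnp.sum(group_a[:, None] < group_b[None, :])
# Normal approximation to U under H0 (continuous data, so no ties)
u_mean = n_a * n_b / 2
u_std = jnp.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
p_u = 2 * (1 - norm.cdf(jnp.abs((u_stat - u_mean) / u_std)))
print(f"Mann-Whitney U:   {u_stat}  (approx. two-tailed p = {p_u:.4f})")
print(f"\nGroup A mean: {mean_a:.2f}, Group B mean: {mean_b:.2f}")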
