The sample statistic (δ) is the difference between the sample mean and the comparison value (μ₀).
We create a null distribution by recentering and bootstrapping. We shift our data so its average equals the comparison value (μ₀), then resample from that shifted data (with replacement) hundreds of times. This creates a world where the population mean is the comparison value.
Think of this as being a world where the true mean is μ₀. Importantly, this doesn’t mean that every bootstrapped sample mean will be exactly μ₀. There is variation in the data, and that variation is reflected in the null world. What it means is that in the null world, the sample mean is μ₀ ± some amount.
Here’s what this null world looks like:
Next we put our observed sample mean inside that null world and see how comfortably it fits there.
Is it surprising to see the red line in this null world? Is the line way out to one of the sides, or is it near the middle with the rest of the null world?
We can actually quantify the probability of seeing that red line in a null world. This is a p-value—the probability of seeing a sample mean at least that far from μ₀ in a world where the true mean is μ₀.
Finally, we have to decide if the p-value meets an evidentiary standard or threshold that would provide us with enough evidence that we aren’t in the null world (or, in more statsy terms, enough evidence to reject the null hypothesis).
There are lots of possible thresholds. By convention, most people use a threshold (often shortened to α) of 0.05, or 5%. But that’s not required! You could have a lower standard with an α of 0.1 (10%), or a higher standard with an α of 0.01 (1%).
When thinking about p-values and thresholds, I like to imagine myself as a judge or a member of a jury. Many legal systems around the world have formal evidentiary thresholds or standards of proof. If prosecutors provide evidence that meets a threshold (i.e. goes beyond a reasonable doubt, or shows evidence on a balance of probabilities), the judge or jury can rule guilty. If there’s not enough evidence to clear the standard or threshold, the judge or jury has to rule not guilty.
With p-values:
Importantly, if the difference is not significant, that does not mean that there is no difference. It just means that we can’t detect one if there is. If a prosecutor doesn’t provide sufficient evidence to clear a standard or threshold, it does not mean that the defendant didn’t do whatever they’re charged with*—it means that the judge or jury can’t detect guilt.
* Kind of—in common law systems, defendants are presumed innocent until proven guilty, so if there’s not enough evidence to prove guilt, they are innocent by definition.
Many legal systems have different levels of evidentiary standards: