Regular p-values

So far all the examples on this site calculated p-values through simulation—shuffling and resampling to build null worlds.

In real life, you won’t actually do this!

Regular statistical tests like R’s t.test(), prop.test(), and lm() skip the simulation and instead use known theoretical distributions (like the t, F, and \(\chi^2\) distributions) to approximate null worlds and calculate p-values. This theoretical, mathematical p-value is what you see in regular statistical output.

Even though they’re not based on simulations, the intuition is the same: a p-value is still the probability of seeing a \(\delta\) at least that large in a world where there is no difference.

Code

library(tidyverse)
library(scales)
library(ggdist)
library(ggbeeswarm)
library(parameters)
library(tinytable)

penguins <- penguins |> drop_na(sex)

theme_set(theme_minimal(base_size = 14))

# From MoMAColors::moma.colors("OKeeffe")
clrs <- c(
  "#f3d567",
  "#ee9b43",
  "#e74b47",
  "#b80422",
  "#172767",
  "#19798b"
)

Difference in means

This is the same thing the difference in means example calculates through simulation—just computed with a formula instead.

Here’s the average body mass across species:

Code

avg_weight |> 
  tt()

species	avg_weight
Adelie	3706.164
Chinstrap	3733.088
Gentoo	5092.437

Code

ggplot(penguins, aes(x = species, y = body_mass, color = species)) +
  geom_beeswarm(side = -1, size = 1, cex = 1.5) +
  stat_pointinterval(.width = 0.95, position = position_nudge(x = 0.2)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("g"))) +
  scale_color_manual(values = c(clrs[2], clrs[4], clrs[5]), guide = "none") +
  labs(x = "Species", y = "Body mass")

We can ask a couple questions here about the differences in means.

Is there a difference in body mass between Chinstrap and Gentoo penguins?

We’re looking at the difference between these two values:

Code

avg_weight |> 
  tt() |> 
  style_tt(i = 2:3, background = clrs[1])

species	avg_weight
Adelie	3706.164
Chinstrap	3733.088
Gentoo	5092.437

Code

avg_weight |> 
  mutate(highlight = species %in% c("Chinstrap", "Gentoo")) |> 
  ggplot(aes(x = species, y = avg_weight, fill = species)) +
  geom_col(aes(color = highlight), linewidth = 2) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("g"))) +
  scale_color_manual(values = c(NA, clrs[1]), guide = "none") +
  scale_fill_manual(values = c(clrs[2], clrs[4], clrs[5]), guide = "none") +
  labs(x = "Species", y = "Average weight")

Code

t.test(
  body_mass ~ species,
  data = filter(penguins, species %in% c("Chinstrap", "Gentoo"))
)


    Welch Two Sample t-test

data:  body_mass by species
t = -20.765, df = 169.62, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Chinstrap and group Gentoo is not equal to 0
95 percent confidence interval:
 -1488.578 -1230.120
sample estimates:
mean in group Chinstrap    mean in group Gentoo 
               3733.088                5092.437

Parameter	Difference (SE)	p	Group	species = Chinstrap	species = Gentoo
Alternative hypothesis: true difference in means between group Chinstrap and group Gentoo is not equal to 0
body_mass	-1359.35	<0.001	species	3733.09	5092.44

The p-value here is essentially zero (p < 2.2 × 10⁻¹⁶). In a world where Chinstrap and Gentoo penguins had the same average body mass, it would be virtually impossible to see a difference this large.

Is there a difference in body mass between Adelie and Chinstrap penguins?

We’re looking at the difference between these two values:

Code

avg_weight |> 
  tt() |> 
  style_tt(i = 1:2, background = clrs[1])

species	avg_weight
Adelie	3706.164
Chinstrap	3733.088
Gentoo	5092.437

Code

avg_weight |> 
  mutate(highlight = species %in% c("Chinstrap", "Adelie")) |> 
  ggplot(aes(x = species, y = avg_weight, fill = species)) +
  geom_col(aes(color = highlight), linewidth = 2) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("g"))) +
  scale_color_manual(values = c(NA, clrs[1]), guide = "none") +
  scale_fill_manual(values = c(clrs[2], clrs[4], clrs[5]), guide = "none") +
  labs(x = "Species", y = "Average weight")

Code

t.test(
  body_mass ~ species,
  data = filter(penguins, species %in% c("Adelie", "Chinstrap"))
)


    Welch Two Sample t-test

data:  body_mass by species
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -145.66494   91.81724
sample estimates:
   mean in group Adelie mean in group Chinstrap 
               3706.164                3733.088

Parameter	Difference (SE)	p	Group	species = Adelie	species = Chinstrap
Alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
body_mass	-26.92	0.655	species	3706.16	3733.09

One-sample mean

This is the theoretical version of the one-sample mean simulation, where we bootstrap a null world centered at a hypothesized value.

Is the average body mass of all penguins in the dataset different from 4000 g?

Code

mass_summary |> tt()

n	Sample mean	μ₀
333	4207.057	4000

Code

ggplot(penguins, aes(x = body_mass)) +
  geom_histogram(
    binwidth = 200,
    fill = clrs[5], color = "white"
  ) +
  geom_vline(
    xintercept = 4000,
    color = clrs[3], linewidth = 1.5
  ) +
  annotate(
    "label",
    x = 4000, y = Inf, vjust = 1.5,
    label = "μ₀ = 4000 g", size = 4
  ) +
  labs(x = "Body mass (g)", y = "Count")

Code

t.test(penguins$body_mass, mu = 4000)


    One Sample t-test

data:  penguins$body_mass
t = 4.6925, df = 332, p-value = 3.952e-06
alternative hypothesis: true mean is not equal to 4000
95 percent confidence interval:
 4120.256 4293.858
sample estimates:
mean of x 
 4207.057

Parameter	Mean (SE)	p	mu
Alternative hypothesis: true mean is not equal to 4000
body_mass	4207.06	<0.001	4000

The p-value tells us the probability of seeing a sample mean this far from 4000 g in a world where the true average really is 4000 g. The small p-value gives us evidence that the true mean is not 4000 g.

Difference in proportions

This is the theoretical version of the difference in proportions simulation.

Is the proportion of female penguins the same in Adelie and Gentoo species?

Code

penguins_prop |> tt()

species	n_female	n	prop_female
Adelie	73	146	0.500000
Gentoo	58	119	0.487395

Code

ggplot(
  penguins_prop,
  aes(x = species, y = prop_female, fill = species)
) +
  geom_col() +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_manual(
    values = c(clrs[2], clrs[5]),
    guide = "none"
  ) +
  labs(x = "Species", y = "Proportion female")

Code

prop.test(
  x = penguins_prop$n_female,
  n = penguins_prop$n
)


    2-sample test for equality of proportions with continuity correction

data:  penguins_prop$n_female out of penguins_prop$n
X-squared = 0.0065013, df = 1, p-value = 0.9357
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1160296  0.1412397
sample estimates:
  prop 1   prop 2 
0.500000 0.487395

Difference (SE)	p	Proportion
Alternative hypothesis: two.sided
1.26%	0.936	50.00% / 48.74%

This time the p-value is large—both species have roughly the same proportion of females. In a world where the two species truly had the same proportion of females, it would be completely unsurprising to see a difference this small. There is not enough evidence to say the proportions are different; the result is not statistically significant.

Not every test produces a tiny p-value! A large p-value doesn’t mean there’s no difference—just that the data don’t provide enough evidence to distinguish a real difference from random variation.

Regression

This is the theoretical version of the regression slope simulation, where we shuffle the outcome to build a null world where the slope is zero.

Does flipper length predict body mass?

Code

reg_summary |> tt()

n	Cor(x, y)
333	0.8729789

Code

ggplot(
  penguins,
  aes(x = flipper_len, y = body_mass)
) +
  geom_point(color = clrs[5], alpha = 0.5) +
  geom_smooth(
    method = "lm",
    color = clrs[3], se = FALSE
  ) +
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)"
  )

`geom_smooth()` using formula = 'y ~ x'

Code

penguin_model <- lm(
  body_mass ~ flipper_len,
  data = penguins
)

summary(penguin_model)


Call:
lm(formula = body_mass ~ flipper_len, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1057.33  -259.79   -12.24   242.97  1293.89 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5872.09     310.29  -18.93   <2e-16 ***
flipper_len    50.15       1.54   32.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared:  0.7621,    Adjusted R-squared:  0.7614 
F-statistic:  1060 on 1 and 331 DF,  p-value: < 2.2e-16

Parameter	Coefficient (SE)	p
(Intercept)	-5872.09 (310.29)	<0.001
flipper len	50.15 ( 1.54)	<0.001

In regression output, each coefficient has its own p-value. The p-value for flipper_len tests whether the slope is different from zero—in a world where flipper length had no relationship with body mass (a slope of zero), it would be essentially impossible to see a slope this steep (p < 2.2 × 10⁻¹⁶).