Difference in means

Difference in means

◎◉○

Live simulation

Differences in penguin weights

For this example, we’ll look at the body mass of penguins near Palmer Station, Antarctica. At first glance, it looks like there are some definite species-based differences in average penguin weight:

Species Average weight (g)
Adelie 3706
Chinstrap 3733
Gentoo 5092

Let’s look specifically at the difference in the average body mass for just Chinstrap and Gentoo penguins. First, we’ll load some packages:

library(tidyverse)
library(infer)
library(parameters)

penguins <- penguins |> drop_na(sex)

With some filtering, grouping, and summarizing, we can find the difference in means and see that Chinstraps seem to be a lot lighter than Gentoos, on average:

penguins |>
  filter(species %in% c("Chinstrap", "Gentoo")) |>
  group_by(species) |>
  summarize(avg_weight = mean(body_mass)) |>
  mutate(difference = c(NA, avg_weight[1] - avg_weight[2]))
# A tibble: 2 × 3
  species   avg_weight difference
  <fct>          <dbl>      <dbl>
1 Chinstrap      3733.        NA 
2 Gentoo         5092.     -1359.

But is that difference real? Could it potentially be zero? We need to do some hypothesis testing.

Null hypothesis inference with {infer}

Null hypothesis inference with t.test()

In practice, most people do not simulate null worlds. Instead, they’ll use a built-in test that uses a known theoretical distribution of what the null world is assumed to look like (like the t, F, and \(\chi^2\) distributions), and calculate a p-value based on that null distribution. This theoretical, mathematical p-value is what you see in regular statistical output.

Even though these values are not based on simulations, the intuition is the same: a p-value is still the probability of seeing a \(\delta\) at least that extreme in a world where there is no difference.

To find the difference in group means, we can conduct a t-test with t.test():

t.test(
  body_mass ~ species,
  data = filter(penguins, species %in% c("Chinstrap", "Gentoo"))
)

    Welch Two Sample t-test

data:  body_mass by species
t = -20.765, df = 169.62, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Chinstrap and group Gentoo is not equal to 0
95 percent confidence interval:
 -1488.578 -1230.120
sample estimates:
mean in group Chinstrap    mean in group Gentoo 
               3733.088                5092.437 

Buried in that giant wall of text is the p-value: p < 2.2e-16, or p < 2.2 × 10−16. That’s really tiny. In a world where Chinstrap and Gentoo penguins had the same average body mass, it would be virtually impossible to see a difference as extreme as −1,359. We have enough evidence to declare that difference is statistically significant.

If you don’t like all that text output, you can feed the results of t.test() to the model_parameters() function from the {parameters} package. This creates a nice little table with the different group means, the difference, the confidence interval, and the p-value. Feed that into display() and you’ll get a nicely rendered table.

t.test(
  body_mass ~ species,
  data = filter(penguins, species %in% c("Chinstrap", "Gentoo"))
) |>
  model_parameters() |>
  display(caption = "")
Parameter Group species = Chinstrap species = Gentoo Difference 95% CI t(169.62) p
Alternative hypothesis: true difference in means between group Chinstrap and group Gentoo is not equal to 0
body_mass species 3733.09 5092.44 -1359.35 (-1488.58, -1230.12) -20.76 < .001
WarningSide caveat: Different “flavors” of tests

Since we’re not simulating the null world, we’re making strong assumptions about what the null world looks like. In the case of this t-test, we’re assuming it follows a t-distribution with 169.6 degrees of freedom. That looks like this:

We can then put a standardized version of our observed \(\delta\)—the t value of -20.8—in that mathematical null world and find the exact probability of seeing it there:

We can only assume that this precise t-distribution represents a null world under certain conditions. This is why there are so many statistical test flowcharts—you have to make sure you choose the test that matches the conditions and characteristics of your data.

For example, different versions of the t-test make different assumptions about whether the two groups have equal variances. By default, R uses Welch’s t-test, which is designed for group means with unequal variances, while the Student t-test assumes that group means have equal variances. Technically, in order to check which flavor of t-test we need to run, we’d need to test for equality of variances, which involves a separate statistical test with its own null hypothesis. If the variances are equal, we’d use t.test(..., var.equal = TRUE); if the variances are unequal, we’d use t.test(..., var.equal = FALSE). The only way to remember all these assumption checks and all the different versions of statistical tests is to consult a flowchart. It can be miserable.

If you simulate, you can skip all that. Your null world is based on the qualities of your observed data, not an idealized mathematical distribution.

Footnotes

  1. Kind of—in common law systems, defendants are presumed innocent until proven guilty, so if there’s not enough evidence to prove guilt, they are innocent by definition. ↩︎