Theory based statistical inference

Difference in proportions

Packages

library(palmerpenguins)
library(tidyverse)
library(tidymodels)

In this document, we are going to walk through, step-by-step, the fundamentals of hypothesis testing when working with two categorical variables.

Context

In class on Sep 25, we went through how to conduct a hypothesis test and create a confidence interval using simulation-based techniques. In this walkthrough, we will mirror those analyses, but with theory-based methodology. The context for the penguin study can be found below.

We are interested in exploring if the species of the penguin impacts the sex of the penguin on the Palmer island. We will be looking at the Chinstrap and Gentoo species of penguin. We are interested in researching if there are more male Gentoo penguins than male Chinstrap penguins.

Types of variables

When identifying types of variables, we first ask the question “What variables are we working with?” We see from the context that we are working with species and sex. Once the variables are identified, we ask ourselves two more questions: Are they categorical or quantitative? Which one is the explanatory variable, and which one is the response?

Species has the levels Gentoo and Chinstrap (bins). This is a categorical variable. Sex has the levels male and female (bins). This is also a categorical variable. To answer the second question, we need to think about which variable we are actually interested in vs. which variable we hope explains the changes we see. The context starts off by saying “species … impacts sex.” We can interpret this as species explaining the differences we see in the sex of the penguins. We get an additional clue when reading “We are interested in researching if there are more male Gentoo penguins than male Chinstrap penguins.” We are interested in male penguins. This is what we are going to take the proportion of, across the grouping variable that is species. Therefore, species is our explanatory variable, and sex is our response variable.

Hypothesis Testing

In class, we set up the following null and alternative hypotheses:

\[H_0: \pi_g - \pi_c = 0\]

\[H_a: \pi_g - \pi_c > 0\]

You can also think about this hypothesis test as a test for independence. That is, is species independent of sex? Therefore, the assumption that we will make about the population is that species is independent of sex, or equivalently that the true proportion of Gentoo penguins that are male is the same as the true proportion of Chinstrap penguins that are male. Our research question in the context of the problem suggests we are looking to see if a larger proportion of Gentoo penguins are male, so our alternative hypothesis sign is >.

Calculating the sample statistic

In class, we used the following code to calculate our sample statistic. Note that we now have two groups, which means we need to calculate two sample proportions!

penguins |>
  filter(species %in% c("Chinstrap", "Gentoo"),
         (!is.na(sex))) |>
  group_by(species, sex) |>
  summarise(data = n())

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   species [2]
  species   sex     data
  <fct>     <fct>  <int>
1 Chinstrap female    34
2 Chinstrap male      34
3 Gentoo    female    58
4 Gentoo    male      61

In this context, a success is being male. That is what we are taking the proportion of for each group.

\[\hat{p_g} = \frac{61}{119}= 0.513\]

\[\hat{p_c} = \frac{34}{68}= 0.5\]

\[\hat{p_g} - \hat{p_c} = 0.013\]

Note that our order of subtraction is consistent with how we set up our null and alternative hypotheses! Order matters!
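As a rough sketch, the same arithmetic can be reproduced in R directly from the counts in the table above. The object names here (`male_gentoo`, `n_gentoo`, and so on) are made up for illustration and are not from the class code.

```r
# Counts copied from the summary table above (illustrative object names)
male_gentoo    <- 61
n_gentoo       <- 61 + 58   # male + female Gentoo
male_chinstrap <- 34
n_chinstrap    <- 34 + 34   # male + female Chinstrap

# Sample proportion of males in each group
p_hat_g <- male_gentoo / n_gentoo
p_hat_c <- male_chinstrap / n_chinstrap

# Subtract in the same order as the hypotheses: Gentoo minus Chinstrap
diff_hat <- p_hat_g - p_hat_c

round(c(p_hat_g = p_hat_g, p_hat_c = p_hat_c, diff = diff_hat), 3)
```

Swapping the order of subtraction would flip the sign of the statistic, so the alternative hypothesis would need to flip with it.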

Assumptions for theory based

We again need to check assumptions to ensure that the central limit theorem kicks in, and that the sampling distribution of our difference in proportions under the assumption of the null hypothesis is approximately normally distributed.

Independence: As always, we need to check the independence assumption. For this situation, we need to check both within and across groups. That is, the penguins, within and between species, do not influence one another.
Success-Failure condition: Recall that these central limit theorem assumptions are for the null distribution. Thus, we need to check this condition under the assumption that the null hypothesis is true… or that species is completely independent from sex. Because of this, we are going to pool our data together as if species didn’t matter.

\[ \hat{p_\text{pool}} = \frac{\text{total success}}{\text{total sample size}} \]

\[ \hat{p_\text{pool}} = \frac{95}{187} = 0.508 \]

Now, we multiply the pooled proportion (our best guess for the proportion of males if species and sex are really independent) by each group's sample size to get the number of successes we would expect in that group. We use 1 - \(\hat{p}_\text{pool}\) in the same way to get the expected number of failures.

119 * 0.508 = 60.45 > 10

119 * 0.492 = 58.55 > 10

68 * 0.508 = 34.54 > 10

68 * 0.492 = 33.46 > 10

Because all of these values are greater than 10 (and we can assume independence), we can treat the sampling distribution under the assumption of the null hypothesis as approximately normal.
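The success-failure check above can be sketched in R as follows. This is only an illustration built from the counts in the table; the names `successes`, `n`, and `expected` are assumptions, not objects from the class code.

```r
# Males (successes) and sample sizes per species, from the summary table
successes <- c(gentoo = 61, chinstrap = 34)
n         <- c(gentoo = 119, chinstrap = 68)

# Pooled proportion: pretend species does not matter
p_pool <- sum(successes) / sum(n)

# Expected successes and failures in each group under the null hypothesis
expected <- rbind(success = n * p_pool,
                  failure = n * (1 - p_pool))
expected

# TRUE when every expected count is at least 10
all(expected >= 10)
```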

Estimating the Standard Error

Much like when working with just a single categorical variable, we now are going to estimate the standard error of our sampling distribution under the assumption of the null hypothesis. We use the following formula to do so:

\[ SE(null) = \sqrt{\frac{\hat{p_\text{pool}}*(1-\hat{p_\text{pool}})}{n_1} + \frac{\hat{p_\text{pool}}*(1-\hat{p_\text{pool}})}{n_2}} \]

\[ SE(null) = \sqrt{\frac{0.508*(1-0.508)}{119} + \frac{0.508*(1-0.508)}{68}} = 0.076 \]
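A minimal sketch of this calculation in R, assuming the counts from the table above (95 males out of 187 penguins, with group sizes 119 and 68). The names `p_pool` and `se_null` are illustrative.

```r
# Pooled proportion of males across both species
p_pool <- 95 / 187

# Group sizes: Gentoo, then Chinstrap
n1 <- 119
n2 <- 68

# Standard error of the difference in proportions under the null hypothesis
se_null <- sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)

round(se_null, 3)
```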

Standardized statistic

Note: We can calculate the p-value from here using pnorm()

pnorm(0.013, 0, 0.076, lower.tail = FALSE)

[1] 0.4320912

However, you are often asked to standardize your test statistic and perform a Z test. This formula looks very similar:

\[ Z = \frac{\hat{p_g} - \hat{p_c} - \pi_\text{diff}}{SE} \]

\[ Z = \frac{0.013 - 0}{0.076} = 0.17 \]

This is interpreted as: Our statistic of 0.013 is 0.17 SEs above the null hypothesis of 0.

The associated p-value using our Z statistic and the standard normal distribution is:

pnorm(0.17, 0, 1, lower.tail = FALSE)

[1] 0.4325051

This matches what we calculated above before standardizing, up to the rounding of Z to 0.17! This p-value is close to what we got using simulation techniques, and would be even closer if we ran more and more repetitions.
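For reference, here is a sketch of the whole test carried out in R without rounding the intermediate values, so the result may differ slightly in the third decimal place from the hand calculation. The object names are illustrative. As a cross-check, it also calls base R's `prop.test()` with `correct = FALSE` (no continuity correction), which should give the same one-sided p-value.

```r
# Males (successes) and sample sizes per species, from the summary table
x <- c(gentoo = 61, chinstrap = 34)
n <- c(gentoo = 119, chinstrap = 68)

p_hat  <- x / n             # sample proportions
p_pool <- sum(x) / sum(n)   # pooled proportion under the null

# Null standard error and standardized statistic (null difference is 0)
se_null <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- unname((p_hat["gentoo"] - p_hat["chinstrap"] - 0) / se_null)

# One-sided p-value: area above z under the standard normal curve
p_value <- pnorm(z, lower.tail = FALSE)

# Cross-check with the built-in two-sample proportion test
check <- prop.test(x = x, n = n, alternative = "greater", correct = FALSE)

c(z = z, p_value = p_value, prop_test_p = check$p.value)
```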

From here, you can write a decision, conclusion, and interpretation of the p-value.

Confidence interval

Now suppose we want to estimate \(\pi_g - \pi_c\), or the difference in the true proportion of males between the gentoo and chinstrap species of penguins. We can use a confidence interval for that!

Checking assumptions

We DO NOT have a null hypothesis to assume. Instead, we use our original data. Let’s remind ourselves of the counts for the successes and failures within each group.

penguins |>
  filter(species %in% c("Chinstrap", "Gentoo"),
         (!is.na(sex))) |>
  group_by(species, sex) |>
  summarise(data = n())

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   species [2]
  species   sex     data
  <fct>     <fct>  <int>
1 Chinstrap female    34
2 Chinstrap male      34
3 Gentoo    female    58
4 Gentoo    male      61

58 > 10

61 > 10

34 > 10

34 > 10

Therefore, we can treat the sampling distribution of our sample statistic as approximately normally distributed (assuming independence again)!

Estimating the Standard Error

Because we do not have a null hypothesis to assume, we use our data to estimate the standard error of the sampling distribution.

\[ SE(\hat{p_g}-\hat{p_c}) = \sqrt{\frac{{\hat{p_g}*(1-\hat{p_g})}}{n_1} + \frac{{\hat{p_c}*(1-\hat{p_c})}}{n_2}} \]

Plugging in these values, we get:

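\[ SE(\hat{p_g}-\hat{p_c}) = \sqrt{\frac{0.513*0.487}{119} + \frac{0.5*0.5}{68}} = 0.076 \]

Note: This does not always come out to be the same as the standard error for the null distribution. Here the two happen to agree to three decimal places because both sample proportions are very close to the pooled proportion.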
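To see the two standard errors side by side, here is a small sketch in R using the counts from the table. The names `se_ci` and `se_null` are illustrative.

```r
# Males (successes) and sample sizes per species, from the summary table
x <- c(gentoo = 61, chinstrap = 34)
n <- c(gentoo = 119, chinstrap = 68)
p_hat <- x / n

# Standard error for the confidence interval: each group uses its own p-hat
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / n))

# Standard error for the test: both groups share the pooled proportion
p_pool  <- sum(x) / sum(n)
se_null <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

round(c(se_ci = se_ci, se_null = se_null), 4)
```

With groups whose sample proportions differ more, the two values would separate.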

Confidence interval calculation

Our formula is again going to look very similar:

\(\hat{p_g} - \hat{p_c} \pm \text{margin of error}\)

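\(\hat{p_g} - \hat{p_c} \pm z^* \times SE\)

Let’s assume we want to make a 90% confidence interval. We can use the empirical rule, or we can calculate z* using qnorm(). A 90% interval leaves 5% in each tail, so we ask for the 0.95 quantile of the standard normal distribution.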

qnorm(0.95, 0, 1)

[1] 1.644854

The multiplier for a 90% confidence interval is 1.64. Let’s plug in the rest of the values!

\(0.013 \pm 1.64 \times 0.076\) = (-0.112, 0.138)
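The interval can also be sketched in R from the counts, without rounding along the way. The names `conf_level`, `z_star`, and `ci` are illustrative assumptions.

```r
# Males (successes) and sample sizes per species, from the summary table
x <- c(gentoo = 61, chinstrap = 34)
n <- c(gentoo = 119, chinstrap = 68)
p_hat <- x / n

# Point estimate: Gentoo minus Chinstrap
diff_hat <- unname(p_hat["gentoo"] - p_hat["chinstrap"])

# Standard error from the sample proportions (no null hypothesis assumed)
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / n))

# Multiplier for the chosen confidence level
conf_level <- 0.90
z_star <- qnorm(1 - (1 - conf_level) / 2)

# Point estimate plus or minus the margin of error
ci <- diff_hat + c(lower = -1, upper = 1) * z_star * se_ci
round(ci, 3)
```

Changing `conf_level` to 0.95 or 0.99 widens the interval, since the multiplier grows.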

Interpretation

Finally, we can interpret our results.

Interpretation of our confidence interval: We are 90% confident that the true proportion of Gentoo penguins that are male is between 0.112 lower and 0.138 higher than the true proportion of Chinstrap penguins that are male.

Notice the direction of lower and higher associated with the sign of the value in the confidence interval. Notice that we reference the population level by saying true!