HW 4: Recreating R Output + Chi-Square

Homework

Note

Add the following code to your document so that it runs and you can recreate the output shown in the assignment.

library(MASS)

survey <- na.omit(survey)

Packages

library(tidyverse)
library(MASS)

Data

For this homework, we are going to be using the survey data set from the MASS package. This data frame contains the responses of 237 Statistics I students at the University of Adelaide to a number of questions. You can find the variable names and descriptions by pulling up the help file for survey.

We are going to use a cleaned-up data set (as we did in class). Your data set should look like this; if it doesn’t, please feel free to include the following code in your homework document.

survey <- na.omit(survey)
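One way to confirm that the cleaning step worked is to compare the data before and after it. The sketch below is illustrative only. The name survey_clean is an assumption introduced here so that the original data stay available for comparison; the homework itself overwrites survey.

```r
library(MASS)

# ?survey   # in an interactive session, opens the help file with variable descriptions

# Hypothetical name: keep the cleaned copy separate so both versions can be compared.
survey_clean <- na.omit(survey)

nrow(survey)         # number of students before removing incomplete rows
nrow(survey_clean)   # number of students with no missing responses
anyNA(survey_clean)  # FALSE once every row containing an NA has been dropped
```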

Exercise 1

In this exercise, we are going to investigate the relationship between how students express their height and their age. In this survey, students were asked to express their height, and the researcher recorded whether they expressed it in imperial (feet/inches) or metric (centimeters/meters) units. The researcher wanted to know if there is a difference in mean age between students who expressed their height in imperial units and those who expressed it in metric units.

Justify, in as much detail as possible, why it is appropriate to use difference in means methodology to analyze these data.
Set up the null AND alternative hypotheses, using proper notation, for the difference in means test described above. Note, you only need to write these out in proper notation (not words).

\(H_0:\)

\(H_a:\)

Suppose that the assumptions to conduct a theory-based test were met. The researcher used the function t.test to calculate the appropriate p-value. The results can be seen below.


    Welch Two Sample t-test

data:  Age by M.I
t = 1.2452, df = 65.081, p-value = 0.2175
alternative hypothesis: true difference in means between group Imperial and group Metric is not equal to 0
95 percent confidence interval:
 -0.9451082  4.0753596
sample estimates:
mean in group Imperial   mean in group Metric 
              21.45836               19.89324

Question: Using the survey data set, recreate the sample means (it’s okay if they are rounded) for group Imperial and group Metric. Report these means in a 2x2 table by adding to the existing pipeline that starts by filtering out all NA values for the expressed height variable.

survey |>
  filter(!is.na(M.I)) |>
  # insert code here

Now, report the standard deviations for each group by adding code to the pipeline below.

survey |>
  filter(!is.na(M.I)) |>
  # insert code here
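As a reminder of the general pattern for grouped summaries, here is a minimal sketch on a different built-in data set (ToothGrowth), not the homework data. The column names mean_len, sd_len, and n are illustrative choices, and the sketch loads dplyr directly rather than the full tidyverse.

```r
library(dplyr)

# Illustration on the built-in ToothGrowth data, not the homework data.
# `supp` is the grouping variable and `len` is the numeric response.
tooth_summary <- ToothGrowth |>
  group_by(supp) |>
  summarize(
    mean_len = mean(len),  # sample mean within each group
    sd_len   = sd(len),    # sample standard deviation within each group
    n        = n()         # number of observations in each group
  )

tooth_summary
```

The same group_by() and summarize() steps apply to any grouping variable and numeric response.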

Use the information from parts c and d to recreate the t-statistic that was calculated using t.test. Show your work.
Using the pt() function, recreate the p-value calculated in the t.test function. Hint: Think about your alternative hypothesis.

pt()
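The sketch below illustrates the mechanics on a different built-in data set (ToothGrowth), not the homework data. It computes a Welch t-statistic by hand as the difference in sample means divided by \(\sqrt{s_1^2/n_1 + s_2^2/n_2}\), checks it against t.test, and then shows which tail area pt() returns. All object names (oj, vc, se, t_by_hand, fit, df_welch) are illustrative.

```r
# Illustration on the built-in ToothGrowth data, not the homework data.
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Welch t-statistic: difference in sample means over its estimated standard error.
se <- sqrt(var(oj) / length(oj) + var(vc) / length(vc))
t_by_hand <- (mean(oj) - mean(vc)) / se

# t.test() uses the Welch version by default (var.equal = FALSE).
fit <- t.test(len ~ supp, data = ToothGrowth)
all.equal(t_by_hand, unname(fit$statistic))

# pt(q, df) returns the area to the LEFT of q under a t distribution;
# lower.tail = FALSE returns the area to the RIGHT instead.
df_welch <- unname(fit$parameter)
pt(t_by_hand, df = df_welch)                      # P(T <= t)
pt(t_by_hand, df = df_welch, lower.tail = FALSE)  # P(T >  t)
```

Which tail area, or combination of tail areas, gives the p-value depends on the alternative hypothesis.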

The book used in 511 suggests estimating the degrees of freedom as \(\min(n_1 - 1, n_2 - 1)\). Is our estimate of the degrees of freedom the same as what t.test calculates? Justify your answer by reporting the degrees of freedom estimated by the book’s formula.
Without doing the calculation, explain whether the p-value would increase, decrease, or stay the same if we calculated it using the degrees of freedom from part g.
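For reference, the non-integer degrees of freedom reported by t.test for a Welch test come from the Welch-Satterthwaite approximation. Here \(s_1\) and \(s_2\) denote the two sample standard deviations and \(\nu\) the approximate degrees of freedom; these symbols are introduced here and are not part of the assignment's notation.

\[
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
\]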

Exercise 2

In this exercise, we are again going to use the survey data set. We are going to test if a student’s smoking status is independent of how often they exercise. You can find the variable names and descriptions by pulling up the help file for survey.

Make an observed table (tibble output) that counts the frequency of every exercise by smoking combination. Your answer should be an 11 x 3 tibble. Add to the existing pipeline that filters out missing values for smoking when reporting your tibble.

survey |>
  filter(!is.na(Smoke)) |>
  # insert code here
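As a reminder of how frequency counts for two categorical variables look in tibble form, here is a minimal sketch on a different built-in data set (mtcars), not the homework data. The name combo_counts is illustrative, and the sketch loads dplyr directly rather than the full tidyverse.

```r
library(dplyr)

# Illustration on the built-in mtcars data, not the homework data.
combo_counts <- as_tibble(mtcars) |>
  count(cyl, gear)  # one row per observed cyl-by-gear combination, frequency stored in `n`

combo_counts
dim(combo_counts)
```

Note that count() lists only combinations that actually occur. mtcars has three cyl values and three gear values, but no 8-cylinder cars with 4 gears, so the result has 8 rows rather than 9.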

Below is a table of expected counts. These are the counts that we would expect to see if exercise really has no impact on smoking status.

           survey$Smoke
survey$Exer     Heavy    Never    Occas    Regul
       Freq 3.5416667 67.79762 6.577381 7.083333
       None 0.5833333 11.16667 1.083333 1.166667
       Some 2.8750000 55.03571 5.339286 5.750000

Question: Using the values from part a, recreate the expected count of those who smoke heavily and exercise frequently. Show your work.
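Under independence, the expected count for a cell is its row total times its column total, divided by the grand total. The sketch below checks that rule against chisq.test on a different built-in data set (mtcars), not the homework data. The object names observed, expected_by_hand, and expected_r are illustrative.

```r
# Illustration on the built-in mtcars data, not the homework data.
observed <- table(mtcars$cyl, mtcars$gear)

# Expected count under independence: (row total * column total) / grand total.
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chisq.test() stores the same quantities; the small-count warning is suppressed
# here because only the expected counts are of interest.
expected_r <- suppressWarnings(chisq.test(observed))$expected

all.equal(as.vector(expected_by_hand), as.vector(expected_r))
```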

Are you justified in using theory-based procedures to carry out a chi-square test of independence? Justify your answer.
Regardless of your answer, the results from a chi-square test of independence can be seen below…


    Pearson's Chi-squared test

data:  survey$Exer and survey$Smoke
X-squared = 1.7861, df = 6, p-value = 0.9383

Question: Please explain how df = 6 was calculated when carrying out this test.

Using the results from the output in part d, report the test statistic, p-value, and distribution the p-value was calculated from. Be specific.

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Log in with your school credentials.
Click on your STA 511 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

linking all pages appropriately on Gradescope
putting your name in the YAML at the top of the document
Pipes %>%, |> and ggplot layers + should be followed by a new line
You should be consistent with stylistic choices, e.g. %>% vs |>

Grading for HW-4

Exercise 1: 25 points
Exercise 2: 15 points
Workflow + formatting: 5 points
Total: 45 points