HW 4: Recreating R Output + Chi-Square
Note
Add the following code to your document so that it runs and you can recreate the output shown in the assignment.
Packages
Data
For this homework, we are going to be using the survey data set from the MASS package. This data frame contains the responses of 237 Statistics I students at the University of Adelaide to a number of questions. You can find the variable names and descriptions by pulling up the help file for survey.
We are going to use a cleaned up data set (like we did in class). Your data set should look like this. If it doesn’t, please feel free to include the following code in your homework document.
survey <- na.omit(survey)
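A minimal setup sketch, assuming the tidyverse is the package collection used in class (an assumption; only MASS is named above). MASS is loaded first so that its select() does not mask the dplyr version.

```r
# Assumed setup: MASS is named in the assignment, tidyverse is an assumption
library(MASS)       # provides the survey data frame
library(tidyverse)  # dplyr verbs and tibble printing; loaded after MASS on purpose

survey <- na.omit(survey)  # keep only rows with no missing values
glimpse(survey)            # quick look at the variables and their types
```

Running help(survey) after this chunk opens the variable descriptions mentioned above.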
Exercise 1
In this exercise, we are going to investigate the relationship between how students express their height and their age. In this survey, students were asked to report their height, and the researcher recorded whether they expressed it in imperial (feet/inches) or metric (centimeters/meters) units. The researcher wanted to know if mean age differs between students who expressed their height in imperial units and those who used metric units.
Justify, in as much detail as possible, why it is appropriate to use difference in means methodology to analyze these data.
Set up the null AND alternative hypotheses, using proper notation, for the difference in means test described above. Note, you only need to write this out in proper notation (not words).
\(H_0:\)
\(H_a:\)
- Suppose that the assumptions to conduct a theory based test were met. The researchers used the function t.test to calculate the appropriate p-value. The results can be seen below.
Welch Two Sample t-test
data: Age by M.I
t = 1.2452, df = 65.081, p-value = 0.2175
alternative hypothesis: true difference in means between group Imperial and group Metric is not equal to 0
95 percent confidence interval:
-0.9451082 4.0753596
sample estimates:
mean in group Imperial   mean in group Metric
              21.45836               19.89324
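For reference, a two-sample t-test that does not assume equal variances (the Welch test named in the output above) is usually written as follows. The symbols \(\bar{x}_i\), \(s_i\), and \(n_i\) are notation introduced here for the sample mean, sample standard deviation, and sample size of group \(i\); they are not defined elsewhere in the assignment.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

The second expression is the Welch-Satterthwaite approximation, which is why the degrees of freedom in the output need not be a whole number.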
Question: Using the survey data set, recreate the sample means (it’s okay if they are rounded) for group Imperial and group Metric. Report these means in a 2x2 table by adding to the existing pipeline that starts by filtering out all NA values for the expressed height variable.
survey |>
  filter(!is.na(M.I)) |>
  # insert code here
- Now, report the standard deviations for each group by adding to the code below.
survey |>
  filter(!is.na(M.I)) |>
  # insert code here
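As a reminder of the general pattern for grouped summaries, here is a small sketch on made-up data. The tibble toy and its columns group and value are hypothetical and are not part of the survey data; the sketch only illustrates group_by() followed by summarize() from dplyr.

```r
library(dplyr)

# Toy data, not the survey data: two hypothetical groups with made-up values
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# One row per group, one column per summary statistic
toy_summary <- toy |>
  group_by(group) |>
  summarize(
    mean_value = mean(value),
    sd_value = sd(value)
  )

toy_summary
```

Each named argument inside summarize() becomes one column of the result, so the shape of the output depends on how many statistics you request.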
- Use the information from parts c and d to recreate the t-statistic that was calculated using t.test. Show your work.
- Using the pt() function, recreate the p-value calculated by the t.test function. Hint: Think about your alternative hypothesis.
pt()
- The book used in 511 suggests estimating the degrees of freedom as \(\min(n_1 - 1, n_2 - 1)\). Is this estimate of the degrees of freedom the same as what t.test calculates? Justify your answer by reporting the degrees of freedom estimated by the book’s formula.
- Without doing the calculation, justify whether the p-value would increase, decrease, or stay the same if we calculated it using the degrees of freedom from part g.
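The sketch below shows how pt() turns a t-statistic into tail areas. The values of t_stat and df are hypothetical and chosen only for illustration; they are not the values from this assignment.

```r
# Hypothetical values for illustration only; not taken from the assignment
t_stat <- 2.0
df <- 10

# P(T <= t_stat): pt() returns the lower-tail area by default
lower_area <- pt(t_stat, df = df)

# P(T >= t_stat): upper-tail area
upper_area <- pt(t_stat, df = df, lower.tail = FALSE)

# P(|T| >= |t_stat|): both tails combined
both_tails <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(lower = lower_area, upper = upper_area, both = both_tails)
```

Which of these areas matches a p-value depends on the direction of the alternative hypothesis.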
Exercise 2
In this exercise, we are again going to use the survey data set. We are going to test if a student’s smoking status is independent of how often they exercise. You can find the variable names and descriptions by pulling up the help file for survey.
- Make an observed table (tibble output) that counts the frequency of every exercise by smoking combination. Your answer should be an 11 x 3 tibble. Add to the existing pipeline that filters out missing values for smoking when reporting your tibble.
survey |>
  filter(!is.na(Smoke)) |>
  # insert code here
- Below is a table of expected counts. These are the counts that we would expect to see if exercise really has no impact on smoking status.
           survey$Smoke
survey$Exer     Heavy    Never    Occas    Regul
       Freq 3.5416667 67.79762 6.577381 7.083333
       None 0.5833333 11.16667 1.083333 1.166667
       Some 2.8750000 55.03571 5.339286 5.750000
Question: Using the values from a, recreate the expected count of students who smoke heavily and exercise frequently. Show your work.
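For reference, under independence the expected count in a cell of a two-way table is usually written as follows. The symbols \(E_{ij}\), \(R_i\), \(C_j\), and \(n\) are notation introduced here (expected count in row \(i\) and column \(j\), total of row \(i\), total of column \(j\), and overall table total); they do not appear elsewhere in the assignment.

```latex
E_{ij} = \frac{R_i \times C_j}{n}
```

The row and column totals come from the observed counts, so the formula needs nothing beyond the observed table.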
Are you justified in using theory based procedures to carry out a chi-square test of independence? Justify your answer.
Regardless of your answer, the results from a chi-square test of independence can be seen below…
Pearson's Chi-squared test
data: survey$Exer and survey$Smoke
X-squared = 1.7861, df = 6, p-value = 0.9383
Question: Please explain how df = 6 was calculated when carrying out this test.
- Using the results from the output in part d, report the test statistic, p-value, and distribution the p-value was calculated from. Be specific.
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Log in with your school credentials.
- Click on your STA 511 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:
- linking all pages appropriately on Gradescope
- putting your name in the YAML at the top of the document
- Pipes %>%, |> and ggplot layers + should be followed by a new line
- You should be consistent with stylistic choices, e.g. %>% vs |>
Grading for HW-4
- Exercise 1: 25 points
- Exercise 2: 15 points
- Workflow + formatting: 5 points
- Total: 45 points