HW 6 - Interaction Models

Important

This homework is due Tuesday, Nov 26 at 11:59pm.

Packages

library(tidyverse)
library(tidymodels)
library(gapminder)

Data

For the first exercise, we will be using the legos dataset, which includes data about Lego sets on sale on Amazon.

legos <- read_csv("data/lego_sample.csv")

We are going to use the following three variables:

amazon_price: the price for the Lego set listed on Amazon;
pieces: the number of Lego pieces in the set;
theme: the category the set belongs to.

Exercise-1: Interaction model output

a. Fit an interaction model with theme and number of pieces as your explanatory variables to investigate Amazon price. Name this model price_fit_int. Report your model output below.

b. Write out the entire estimated model in proper notation.

c. Write out the estimated model for the DUPLO theme in proper notation.

d. Write out the estimated model for the City theme in proper notation.

e. Without doing any calculations, state whether the coefficient of determination for the price_fit_int model will be larger than, smaller than, or impossible to compare (not enough information) with the coefficient of determination for an additive model with theme and pieces as explanatory variables and amazon_price as the response. Justify your answer.

f. Without doing any calculations, state whether the coefficient of determination for the price_fit_int model will be larger than, smaller than, or impossible to compare (not enough information) with the coefficient of determination for an additive model with theme and pieces as explanatory variables and pages of the manual as the response. Justify your answer.
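As a reminder of how interaction-model output maps to a separate line for each category, here is a minimal sketch on made-up data (not the legos data). The names toy, size, group, price, and toy_fit_int are hypothetical, and base R's lm() is used only to keep the example self-contained; the same idea should carry over to a model fit with tidymodels.

```r
# Toy data, made up for illustration only
set.seed(1)
toy <- data.frame(
  size  = rep(c(10, 20, 30, 40, 50), times = 2),
  group = rep(c("A", "B"), each = 5)
)
# Group A follows price = 5 + 1 * size; group B follows price = 2 + 2 * size
toy$price <- ifelse(toy$group == "A", 5 + 1 * toy$size, 2 + 2 * toy$size) +
  rnorm(nrow(toy), sd = 0.5)

# `size * group` expands to size + group + size:group
toy_fit_int <- lm(price ~ size * group, data = toy)
coefs <- coef(toy_fit_int)

# Baseline level (A): the intercept and slope are read off directly
line_a <- c(
  intercept = unname(coefs["(Intercept)"]),
  slope     = unname(coefs["size"])
)

# Level B: add the group offset to the intercept
# and the interaction term to the slope
line_b <- c(
  intercept = unname(coefs["(Intercept)"] + coefs["groupB"]),
  slope     = unname(coefs["size"] + coefs["size:groupB"])
)

line_a
line_b
```

Plugging 0 or 1 into each indicator variable of the full estimated model is what produces these per-level equations.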

Exercise-2: Should we fit a lm?

Data

Gapminder is a “fact tank” that uses publicly available world data to produce data visualizations and teaching resources on global development. We will use an excerpt of their data to explore relationships among world health metrics across countries and regions between the years 1952 and 2007. The data set is called gapminder, from the gapminder package. A table of variables can be found below.

We are going to use the data just from the year 2007 using the following code.

gapminder_2007 <- gapminder |>
  filter(year == 2007)

Use the gapminder_2007 data set to answer the following questions.

Hint: pull up the help file for gapminder for a variable name key.

We are interested in learning more about life expectancy (y), and we’ll start with exploring the relationship between life expectancy and GDP (x). Create a scatter plot of gdpPercap vs. lifeExp.
While working with a fellow student, they claim that they don’t think it’s appropriate to fit a linear model to explore this relationship. In 1-2 sentences, explain why your fellow student is correct.
Because of the concern your fellow student raised, we are going to take the natural log of gdpPercap and use it as our new explanatory variable (lnGDP), using the code below.

gapminder_2007_log <- gapminder_2007 |>
  mutate(lnGDP = log(gdpPercap))

For the remainder of this question, we are going to use the gapminder_2007_log data set.

Create a scatter plot of lnGDP vs. lifeExp. Then, comment on whether you think it is appropriate to fit a linear model to analyze this relationship.
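The effect of a log transformation on the shape of a scatter plot can be sketched on made-up data (not the gapminder data). The names toy, x, ln_x, y, p_raw, and p_log are hypothetical, and y is constructed so that it rises by a fixed amount each time x doubles.

```r
library(ggplot2)

# Made-up data: y increases by a constant amount each time x doubles
toy <- data.frame(x = 2^(0:9))
toy$y <- 40 + 3 * log(toy$x)
toy$ln_x <- log(toy$x)

# Scatter plot on the raw scale
p_raw <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  labs(x = "x", y = "y")

# Scatter plot after taking the natural log of x
p_log <- ggplot(toy, aes(x = ln_x, y = y)) +
  geom_point() +
  labs(x = "ln(x)", y = "y")

# Call print(p_raw) and print(p_log) to display the plots

# Correlation measures linear association, so compare it on both scales
cor(toy$x, toy$y)
cor(toy$ln_x, toy$y)
```

Real data will not line up this cleanly, so the plots still need to be judged by eye.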

Exercise-3: Model selection

Another student comes to you and suggests that continent could also be a useful variable for explaining the variability in life expectancy. In this exercise, we are going to investigate which model is the most appropriate for modeling life expectancy.

a. Using the gapminder_2007_log data set, fit the three models below, and report their summary output. For each of these models, assume lifeExp is the response variable.

SLR model with lnGDP as the explanatory variable

Additive model with lnGDP and continent

Interaction model with lnGDP and continent
b. Now, calculate the adjusted R-squared value for each model and use it to determine which of the three models is the “best” model for life expectancy.

c. Based on your answer to part b, select the most appropriate interpretation of your model below…

Our model suggests that the relationship between ln of GDP per cap and life expectancy does not depend on continent.
Our model suggests that the relationship between ln of GDP per cap and life expectancy does depend on continent.
Our model suggests that continent is not a useful predictor when modeling the relationship between ln of GDP and life expectancy.
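Adjusted R-squared penalizes extra terms: it equals 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictor terms. The sketch below shows one way to pull it out of several fitted models at once, using made-up data (not the gapminder data). The names toy, x, g, y, fits, and adj_r2 are hypothetical, and base R's lm() and summary() are an assumption about tooling rather than the required workflow.

```r
# Made-up data in which the grouping variable g plays no real role
set.seed(2)
toy <- data.frame(
  x = runif(60, min = 0, max = 10),
  g = rep(c("A", "B", "C"), each = 20)
)
toy$y <- 1 + 2 * toy$x + rnorm(60)

# Three nested candidate models with the same response
fits <- list(
  slr         = lm(y ~ x, data = toy),
  additive    = lm(y ~ x + g, data = toy),
  interaction = lm(y ~ x * g, data = toy)
)

# Extract adjusted R-squared from each model's summary
adj_r2 <- sapply(fits, function(fit) summary(fit)$adj.r.squared)
adj_r2

# Name of the model with the largest adjusted R-squared
names(which.max(adj_r2))
```

Which model comes out ahead here depends on the random draw; the point is only the mechanics of extracting and comparing the values.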

Exercise-4

Communication is a critical yet often overlooked part of research. When we engage with our audience and capture their interest, we can ultimately better communicate what we are trying to share.

Please watch the following video: Hans Rosling: 200 years in 4 minutes.

Then, write a paragraph (4-5 sentences) addressing the following:

What did you enjoy about the presentation of data? What did you find interesting?

Were there any aspects of the presentation that were hard to follow? If so, what?

What are your general takeaways from this presentation?

What are your general takeaways from how this presentation was given?

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Log in with your school credentials.
Click on your STA 511 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

linking all pages appropriately on Gradescope
putting your name in the YAML at the top of the document
Pipes %>%, |> and ggplot layers + should be followed by a new line
You should be consistent with stylistic choices, e.g. %>% vs |>
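The two style points above might look like the following sketch, which uses the built-in mtcars data. The names heavy_cars, wt_kg, and p are hypothetical, and choosing |> over %>% here is arbitrary; what matters is using one of them throughout.

```r
library(dplyr)
library(ggplot2)

# Each pipe ends its line, and only one pipe style (|>) is used
heavy_cars <- mtcars |>
  filter(wt > 3) |>
  mutate(wt_kg = wt * 453.592) # wt is recorded in units of 1000 lbs

# Each ggplot layer is added with a + at the end of the line
p <- ggplot(heavy_cars, aes(x = wt_kg, y = mpg)) +
  geom_point() +
  labs(x = "Weight (kg)", y = "Miles per gallon")
```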

Grading

Exercise-1: 20 points

Exercise-2: 15 points

Exercise-3: 20 points

Exercise-4: 10 points

Workflow & formatting: 5 points