HW 6 - Interaction Models
This homework is due Tuesday, Nov 26 at 11:59pm.
Packages
Data
For the first exercise, we will be using the legos dataset, which includes data about Lego sets on sale on Amazon.
legos <- read_csv("data/lego_sample.csv")
We are going to use the following three variables:
amazon_price: the price for the Lego set listed on Amazon;
pieces: the number of Lego pieces in the set;
theme: the category the set belongs to.
Exercise-1: Interaction model output
a. Fit an interaction model with theme and number of pieces as your explanatory variables to investigate Amazon price. Name this model price_fit_int. Report your model output below.
b. Write out the entire estimated model in proper notation.
c. Write out the estimated model for the DUPLO theme in proper notation.
d. Write out the estimated model for the City theme in proper notation.
e. Without doing any calculations, state whether the coefficient of determination for the price_fit_int model will be larger than, smaller than, or impossible to compare (not enough information) with the coefficient of determination for an additive model with theme and pieces as explanatory variables and amazon_price as the response. Justify your answer.
f. Without doing any calculations, state whether the coefficient of determination for the price_fit_int model will be larger than, smaller than, or impossible to compare (not enough information) with the coefficient of determination for an additive model with theme and pieces as explanatory variables and pages (of the manual) as the response. Justify your answer.
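As a syntax reminder only, here is a minimal sketch of how an interaction model differs from an additive one in an R model formula. It uses base R lm() and a made-up data frame (toy, with hypothetical columns x, group, and y), not the Lego data; the fitting interface used in class may differ.

```r
# Hypothetical toy data (not the Lego data): two groups with different slopes
set.seed(1)
toy <- data.frame(
  x = rep(1:10, times = 2),
  group = rep(c("A", "B"), each = 10)
)
toy$y <- ifelse(toy$group == "A", 2 + 1 * toy$x, 5 + 3 * toy$x) +
  rnorm(20, sd = 0.5)

# Interaction model: x * group expands to x + group + x:group,
# so each group gets its own intercept and its own slope
toy_fit_int <- lm(y ~ x * group, data = toy)

# Additive model: groups share one slope and differ only in intercept
toy_fit_add <- lm(y ~ x + group, data = toy)

# The interaction fit has one extra coefficient, x:groupB
coef(toy_fit_int)
coef(toy_fit_add)
```

The x:groupB term is the difference in slope between group B and the baseline group A, which is what lets the fitted lines be non-parallel.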
Exercise-2: Should we fit a lm?
Data
Gapminder is a “fact tank” that uses publicly available world data to produce data visualizations and teaching resources on global development. We will use an excerpt of their data to explore relationships among world health metrics across countries and regions between the years 1952 and 2007. The data set is called gapminder, from the gapminder package. A table of variables can be found below.
We are going to use the data just from the year 2007 using the following code.
gapminder_2007 <- gapminder |>
  filter(year == 2007)
Use the gapminder_2007 data set to answer the following questions.
Hint: pull up the help file for gapminder for a variable name key.
We are interested in learning more about life expectancy (y), and we’ll start with exploring the relationship between life expectancy and GDP (x). Create a scatter plot of gdpPercap vs. lifeExp.
A fellow student you are working with claims that it is not appropriate to fit a linear model to explore this relationship. In 1-2 sentences, explain why your fellow student is correct.
Because of the concern your fellow student raised, we are going to take the natural log of gdpPercap, and use this as our new explanatory variable (lnGDP) using the code below.
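The code chunk referred to here is not shown. A minimal sketch of what it plausibly looks like, assuming the dplyr and gapminder packages are installed and that the new column is created with mutate() (the exact original code is an assumption):

```r
library(dplyr)
library(gapminder)

# Same 2007 subset as defined earlier in the assignment
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# Assumed reconstruction: add lnGDP as the natural log of gdpPercap.
# log() in R is the natural log by default.
gapminder_2007_log <- gapminder_2007 |>
  mutate(lnGDP = log(gdpPercap))
```

All original columns are kept, so lifeExp and continent remain available alongside lnGDP.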
For the remainder of this question, we are going to use the gapminder_2007_log data set.
Create a scatter plot of lnGDP vs. lifeExp. Then, comment on whether you think it is appropriate to fit a linear model to analyze this relationship.
Exercise-3: Model selection
Another student comes to you and suggests that continent could also be a useful variable for explaining the variability in lifeExp. In this exercise, we are going to investigate which model would be the most appropriate to model lifeExp.
a. Using the gapminder_2007_log data set, fit the three models below, and report their summary output. For each of these models, assume lifeExp is the response variable.
- SLR model with lnGDP as the explanatory variable
- Additive model with lnGDP and continent
- Interaction model with lnGDP and continent
b. Now, calculate and use the adjusted r-squared value for each model to determine which model (of the three models fit) is the “best” model for life expectancy.
c. Based on your answer to part b, select the most appropriate interpretation of your model below…
Our model suggests that the relationship between ln of GDP per cap and life expectancy does not depend on continent.
Our model suggests that the relationship between ln of GDP per cap and life expectancy does depend on continent.
Our model suggests that continent is not a useful predictor when modeling the relationship between ln of GDP and life expectancy.
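For part b, one way to pull adjusted r-squared values out of several fitted models at once is sketched below. It uses base R lm() on simulated data (toy, with hypothetical columns x, group, and y), not the gapminder data, so the numbers say nothing about the homework answer; the fitting interface used in class may differ.

```r
# Hypothetical simulated data, unrelated to gapminder
set.seed(2)
toy <- data.frame(
  x = runif(60, min = 0, max = 10),
  group = rep(c("A", "B", "C"), each = 20)
)
toy$y <- 1 + 0.5 * toy$x + rnorm(60)

# Fit the three model forms on the same data and response
fits <- list(
  slr = lm(y ~ x, data = toy),
  additive = lm(y ~ x + group, data = toy),
  interaction = lm(y ~ x * group, data = toy)
)

# summary() of an lm fit stores adjusted r-squared in $adj.r.squared
adj_r2 <- sapply(fits, function(fit) summary(fit)$adj.r.squared)
adj_r2
```

Adjusted r-squared penalizes extra terms, which is why it, rather than plain r-squared, is used to compare models with different numbers of predictors.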
Exercise-4
Communication is a critical yet often overlooked part of research. When we engage with our audience and capture their interest, we can ultimately better communicate what we are trying to share.
Please watch the following video: Hans Rosling: 200 years in 4 minutes.
Then, write a paragraph (4-5 sentences) addressing the following:
What did you enjoy about the presentation of data? What did you find interesting?
Were there any aspects of the presentation that were hard to follow? If so, what?
What are your general takeaways from this presentation?
What are your general takeaways from how this presentation was given?
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Log in with your school credentials.
- Click on your STA 511 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:
- linking all pages appropriately on Gradescope
- putting your name in the YAML at the top of the document
- Pipes %>%, |> and ggplot layers + should be followed by a new line
- You should be consistent with stylistic choices, e.g. %>% vs |>
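As an illustration of the two style points above, here is a short sketch using the built-in mtcars data and the ggplot2 package (the data set and variable names are chosen only for illustration and are not part of this assignment):

```r
library(ggplot2)

# Each pipe ends its line; the next step starts on a new, indented line.
# Only |> is used throughout, never mixed with %>%.
small_cars <- mtcars |>
  subset(cyl == 4) |>
  transform(wt_kg = wt * 453.6)  # wt is in units of 1000 lbs

# Each + ends its line; the next ggplot layer starts on a new line
p <- ggplot(small_cars, aes(x = wt_kg, y = mpg)) +
  geom_point() +
  labs(x = "Weight (kg)", y = "Miles per gallon")
```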
Grading
Exercise-1: 20 points
Exercise-2: 15 points
Exercise-3: 20 points
Exercise-4: 10 points
Workflow + Format: 5 points