HW 1 - Summary Stats + Data visualization

solutions

Homework

Important

This homework is due Sunday, Sep 8 at 11:59pm.

Logistics

Packages

library(tidyverse)
library(openintro)

Tips

Remember that developing a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be reminders in this assignment for you to Render your document. The last thing you want is to work through an entire assignment before realizing that an error somewhere prevents your document from compiling. Render after each completed question.

Note

Note: Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. You may lose points if you do not.

Exercises

Data 1: Duke Forest houses

Note

Use the duke_forest dataset for Exercises 1 and 2.

For the following two exercises you will work with data on houses that were sold in the Duke Forest neighborhood of Durham, NC in November 2020. The duke_forest dataset comes from the openintro package. You can see a list of the variables on the package website or by running ?duke_forest in your console.

Exercise 1

We will first explore the duke_forest dataset by calculating a variety of summary statistics. Calculate the summary statistic associated with each scenario below. For this question, your code and its output are enough to earn full credit.

Create a data frame that displays the mean house price across all categories of bedroom.

## Solutions 

duke_forest |>
  group_by(bed) |>
  summarise(mean(price))

# A tibble: 5 × 2
    bed `mean(price)`
  <dbl>         <dbl>
1     2       349250 
2     3       491650 
3     4       570982.
4     5       707500 
5     6      1250000
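
The column header `mean(price)` in the output above comes from leaving the summary unnamed. A minimal sketch of naming the column instead, using a small made-up tibble (`toy_houses` and its values are hypothetical, not the duke_forest data):

```r
library(dplyr)

# Hypothetical toy data standing in for duke_forest (values are made up)
toy_houses <- tibble(
  bed   = c(2, 2, 3, 3, 3),
  price = c(300000, 400000, 450000, 500000, 550000)
)

# Naming the summary gives a column that is easier to reference later
price_by_bed <- toy_houses |>
  group_by(bed) |>
  summarise(mean_price = mean(price))

price_by_bed
```

The same pattern should apply to duke_forest by swapping in that data frame; either form answers the question.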

Create a data frame that displays the minimum and maximum lot area, in acres. Name your columns min_lot and max_lot.

duke_forest |>
  summarise(
    min_lot = min(lot, na.rm = TRUE),
    max_lot = max(lot, na.rm = TRUE)
  )

# A tibble: 1 × 2
  min_lot max_lot
    <dbl>   <dbl>
1    0.15    1.47

Create a data frame that gives the number of homes for each combination of cooling system AND number of bathrooms. You will receive a bonus point if you use the functions arrange() and desc() to put your data frame in descending order of count. Name your count column n_count.

duke_forest |>
  group_by(cooling, bath) |>
  summarise(n_count = n()) |>
  arrange(desc(n_count)) # +1 extra credit

`summarise()` has grouped output by 'cooling'. You can override using the
`.groups` argument.

# A tibble: 12 × 3
# Groups:   cooling [2]
   cooling  bath n_count
   <fct>   <dbl>   <int>
 1 other     3        22
 2 central   3        19
 3 central   4        14
 4 other     2         9
 5 other     4         9
 6 central   2         9
 7 other     2.5       6
 8 other     1         3
 9 central   5         3
10 other     5         2
11 other     4.5       1
12 other     6         1
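
The message about grouped output appears because `summarise()` drops only the last grouping variable. As a sketch, here are two ways to get an ungrouped result without the message, shown on a made-up tibble (`toy_homes` is hypothetical, not the duke_forest data):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_homes <- tibble(
  cooling = c("central", "central", "other", "other", "other"),
  bath    = c(2, 2, 3, 3, 3)
)

# Option A: drop the grouping explicitly, which also silences the message
counts_a <- toy_homes |>
  group_by(cooling, bath) |>
  summarise(n_count = n(), .groups = "drop") |>
  arrange(desc(n_count))

# Option B: count() groups, tallies, and ungroups in one step
counts_b <- toy_homes |>
  count(cooling, bath, name = "n_count", sort = TRUE)

counts_a
counts_b
```

Both options should give the same counts as the solution above; they differ only in whether the result stays grouped.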

Exercise 2

Usually, we expect that within any market, larger houses will have higher prices. We can also expect a relationship between the age of a house and its price. However, in some markets newer houses might be more expensive, while in other markets antique houses have ‘more character’ than newer ones and sell for higher prices. In this question, we will explore the relationships among the age, size, and price of houses.

Your family friend asks: “In Duke Forest, do houses that are bigger and more expensive tend to be newer than smaller and cheaper ones?”

Once again, data visualization skills to the rescue!

Create a scatter plot to explore the relationship between price and area that also displays information about year_built (that is, condition on year_built, your z variable).
Use size = 3 within the appropriate geom function used to make a scatter plot to make your points bigger.
Layer on geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data and color the points by year_built.
Include informative title, axis, and legend labels.
Discuss each of the following claims (1-2 sentences per claim). Use elements you observe in your plot as evidence for or against each claim.
- Claim 1: Larger houses are priced higher.
- Claim 2: Bigger and more expensive houses tend to be newer ones than smaller and cheaper ones.

duke_forest |>
  ggplot(
  aes(x = area, 
      y = price, 
      color = year_built)
) +
  geom_point(size = 3) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Area of Home",
    y = "Price of Home",
    title = "Relationship between Price and Area by Year Built",
    color = "Year Built"
  )

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Claim 1: Yes, there seems to be evidence of a positive relationship between the price of the home and the area of the home. As area increases, so does price.
Claim 2: No, there does not seem to be any evidence to suggest that larger, more expensive homes are newer than cheaper, smaller ones. Points that are lighter colored (newer homes) are not concentrated in the top right of the plot.
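
The warning about the dropped colour aesthetic arises because `color = year_built` is mapped globally, so `geom_smooth()` inherits a continuous colour it cannot use for a single curve. One possible way to avoid it, sketched here on made-up data (`toy_houses` is hypothetical, not the duke_forest data), is to map colour only inside `geom_point()`:

```r
library(ggplot2)

# Hypothetical toy data (deterministic and made up; not the duke_forest data)
toy_houses <- data.frame(
  area       = seq(1000, 4000, length.out = 30),
  year_built = rep(c(1950, 1970, 1990, 2010), length.out = 30)
)
toy_houses$price <- 150 * toy_houses$area + 50000 * sin(seq_len(30))

# Colour is mapped only for the points, so the smoother fits one curve
p <- ggplot(toy_houses, aes(x = area, y = price)) +
  geom_point(aes(color = year_built), size = 3) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Area of Home",
    y = "Price of Home",
    title = "Relationship between Price and Area by Year Built",
    color = "Year Built"
  )

p
```

The plot in the solution above is still acceptable; this variant only changes where the colour mapping lives.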

Data 2: BRFSS

Note

Use this dataset for Exercises 3 through 5.

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

Source: cdc.gov/brfss

In the following exercises we will work with data from the 2020 BRFSS survey. The data originally come from here, though we will work with a random sample of responses and a small number of variables from the data provided. These have already been sampled for you and the dataset you’ll use can be found in the data folder of your repo. It’s called brfss.csv.

brfss <- read_csv("https://st511-01.github.io/data/brfss.csv")

Exercise 3

How many rows are in the brfss dataset? What does each row represent?
How many columns are in the brfss dataset?
Include the code and resulting output used to support your answer.

Write a sentence along with your code to answer the question.

glimpse(brfss)

Rows: 2,000
Columns: 4
$ state          <chr> NA, "CO", "MN", "VA", "UT", "KS", "UT", "TX", "OR", "OH…
$ general_health <chr> "Fair", "Good", "Very good", "Excellent", "Very good", …
$ smoke_freq     <chr> "Not at all", "Some days", "Every day", "Not at all", "…
$ sleep          <dbl> 6, 7, 6, 8, 7, 10, 7, 6, 8, 8, 8, 6, 9, 8, 7, 7, 8, 6, …

There are 2000 rows and 4 columns in the data set. Each row represents a respondent to the survey.
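
As an alternative to reading the dimensions off `glimpse()`, `nrow()` and `ncol()` return them directly. A small sketch on a made-up data frame (`toy_survey` is hypothetical, not the brfss data):

```r
# Hypothetical toy data (values are made up)
toy_survey <- data.frame(
  state = c("CO", "MN", "VA"),
  sleep = c(7, 6, 8)
)

n_rows <- nrow(toy_survey)  # number of respondents (rows)
n_cols <- ncol(toy_survey)  # number of variables (columns)

n_rows
n_cols
```

Whichever function you use, you still need to state the numbers in a sentence.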

Exercise 4

Do people who smoke more tend to have worse health conditions?

Use a segmented bar chart to visualize the relationship between smoking (smoke_freq) and general health (general_health). Put smoke_freq on the x-axis.
- Below is sample code for releveling general_health. Here we first convert general_health to a factor (how R stores categorical data) and then order the levels from Excellent to Poor. The same is done to smoke_freq, with the ordering being from Not at all to Every day.
You will add to the existing pipeline (code) to make the segmented bar chart.

brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good", "Good", "Fair", "Poor")
  ) # add a pipe here to start creating your bar chart

brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good", "Good", "Fair", "Poor"),
    smoke_freq = as.factor(smoke_freq),
    smoke_freq = fct_relevel(smoke_freq, "Not at all", "Some days", "Every day")
    ) |>
  ggplot(aes(x = smoke_freq, 
             fill = general_health)) + 
  geom_bar(position = "fill") +
  labs( 
    x = "Smoking Frequency",
    y = "Proportion", 
    title = "Relationship between Smoking and Health",
    fill = "General Health"
    )

Include informative title, axis, and legend labels.
Comment on the motivating question based on evidence from the visualization: Do people who smoke more tend to have worse health conditions?

From the graph, you can see that as smoking frequency increases, general health tends to get worse. That is, we see fewer "Very good" and "Excellent" responses, and more "Good", "Fair", and "Poor" responses, as smoking frequency increases.
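
The proportions that `position = "fill"` draws can also be computed as a table, which may help when backing up a comment with numbers. A minimal sketch on made-up data (`toy_brfss` is hypothetical, not the brfss data):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_brfss <- tibble(
  smoke_freq = c("Not at all", "Not at all", "Not at all", "Not at all",
                 "Every day", "Every day"),
  general_health = c("Excellent", "Good", "Good", "Fair", "Good", "Poor")
)

# Within each smoking group, the share of each health category
health_props <- toy_brfss |>
  count(smoke_freq, general_health) |>
  group_by(smoke_freq) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

health_props
```

Within each smoking group the `prop` values sum to 1, matching the full-height bars in a segmented bar chart.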

Exercise 5

How are sleep and general health associated?

Create a visualization displaying the relationship between sleep and general_health.
Include informative title and axis labels.
Modify your plot to use a different theme than the default.
Comment on the motivating question based on evidence from the visualization: How are sleep and general health associated?

Now is a good time to save and render.

# Answers will vary. Could be histogram or side-by-side boxplot 

brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Poor", "Fair", "Good", "Very good",  "Excellent")
  ) |>
  ggplot(aes(x = general_health, y = sleep, fill = general_health)) + 
  geom_boxplot(alpha = 0.8, show.legend = FALSE) + 
  theme_minimal() + 
  labs( 
     x = "General Health",
     y = "Sleep", 
     title = "Sleep and Health"
     )

There is slight evidence to suggest that those who sleep less tend to have worse health. However, the boxes (the IQRs around the medians) all overlap, and the Fair through Excellent groups have extremely similar medians.
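
The medians and IQRs that the boxplot displays can be computed directly to support a comment like the one above. A minimal sketch on made-up data (`toy_brfss` is hypothetical, not the brfss data):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_brfss <- tibble(
  general_health = c("Poor", "Poor", "Poor", "Good", "Good", "Good"),
  sleep          = c(4, 6, 8, 6, 7, 8)
)

# Median and interquartile range of sleep within each health category
sleep_summary <- toy_brfss |>
  group_by(general_health) |>
  summarise(
    median_sleep = median(sleep, na.rm = TRUE),
    iqr_sleep    = IQR(sleep, na.rm = TRUE)
  )

sleep_summary
```

Because `general_health` is a character column here, the groups are sorted alphabetically; releveling it as a factor first, as in the solution, would keep the Poor-to-Excellent order.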

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Log in with your school credentials.
Click on your STA 511 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.

Grading

Exercise 1: 6 points
Exercise 2: 10 points
Exercise 3: 4 points
Exercise 4: 10 points
Exercise 5: 9 points
Workflow + formatting: 5 points
Total: 44 points

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

linking all pages appropriately on Gradescope
putting your name in the YAML at the top of the document
Pipes %>%, |> and ggplot layers + should be followed by a new line
You should be consistent with stylistic choices, e.g. %>% vs |>