Exercise - Groupings

SNAP History 1970-2019

The Supplemental Nutrition Assistance Program (SNAP), formerly known as the Food Stamp Program, provides food assistance to low-income families in the form of a debit card. Data about the program from 1969 to 2019 are in SNAP_history_1969_2019.csv. A brief description of each variable in the data set is here (you may need to scroll down some).

Load these data into R and perform the following wranglings:

Create simpler variable names.
Remove the data for 1969 (so the first year will be 1970).
Create a new variable that contains the decade for each observation (e.g., the 1970s decade will contain all years from 1970 to 1979).
Create a new variable that is the difference between total benefits paid and total costs of the program (construct the difference such that negative numbers represent more costs than benefits paid).
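
As a rough illustration of the wrangling steps above, the following sketch uses a small made-up data frame rather than the SNAP file. The column names (`Fiscal Year`, `Total Benefits`, `Total Costs`) and all values are hypothetical and chosen only for demonstration; the integer-division approach to forming decades is one of several options and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data; names and values are NOT those in the SNAP CSV
toy <- tibble(`Fiscal Year` = c(1969, 1970, 1979, 1980, 2019),
              `Total Benefits` = c(1, 5, 8, 9, 20),
              `Total Costs` = c(2, 6, 10, 12, 25))

toy2 <- toy %>%
  rename(year = `Fiscal Year`,                 # simpler variable names
         benefits = `Total Benefits`,
         costs = `Total Costs`) %>%
  filter(year >= 1970) %>%                     # drop the 1969 row
  mutate(decade = floor(year / 10) * 10,       # 1970-1979 all map to 1970
         diff = benefits - costs)              # negative when costs exceed benefits

toy2
```

Subtracting costs from benefits (rather than the reverse) is what makes a negative value mean that costs exceeded benefits paid.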

Answer the following questions with the newly wrangled data frame.

Summarize the annual average number of participants by decade (summary should include a measure of center and dispersion). Comment on any patterns over time.
Summarize the annual difference in benefits paid and total costs by decade (summary should include a measure of center and dispersion). Comment on any patterns over time.
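
A by-decade summary with a measure of center and a measure of dispersion might look like the sketch below. The data frame, the `participants` column, and the choice of mean and standard deviation are assumptions made for illustration; the median and IQR would serve equally well.

```r
library(dplyr)

# Hypothetical toy data standing in for the wrangled SNAP data frame
toy <- tibble(decade = c(1970, 1970, 1970, 1980, 1980, 1980),
              participants = c(4, 6, 8, 20, 22, 24))

toy_sum <- toy %>%
  group_by(decade) %>%
  summarize(n = n(),                    # number of years in the decade
            avg = mean(participants),   # measure of center
            sdev = sd(participants))    # measure of dispersion

toy_sum
```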

Quarterbacks

In this exercise you created a data frame of statistics for all NCAA quarterbacks in 2019 and 2020. Get those data for this exercise but reduce them to just those quarterbacks in “Power 5” conferences (as defined in this exercise). Then use those data to perform the following tasks.

Construct a data frame that has the sample size and mean passer rating for each conference for each year.
Construct a data frame that has the sample size and mean passer rating for each conference (across both years).
Construct a data frame that has the sample size and mean passer rating for each year (across all five conferences).
Construct a data frame that has all of the results from the previous three data frames sorted by year within each conference. [Hints: (1) Modify the last two data frames so that all three data frames have the same number of variables with the same names. (2) Insert NAs for elements that don’t exist in the modified final data frame. (3) Your final data frame should look like that below.]

## # A tibble: 17 x 5
## # Groups:   Year [3]
##    Year  Conf        n valid_n  mean
##    <chr> <chr>   <int>   <int> <dbl>
##  1 2019  ACC        10      10  138.
##  2 2020  ACC        13      13  139.
##  3 <NA>  ACC        23      23  139.
##  4 2019  Big 12     10      10  144.
##  5 2020  Big 12     10      10  133.
##  6 <NA>  Big 12     20      20  138.
##  7 2019  Big Ten    11      11  143.
##  8 2020  Big Ten    12      12  131.
##  9 <NA>  Big Ten    23      23  137.
## 10 2019  Pac-12     10      10  150.
## 11 2020  Pac-12     10      10  136.
## 12 <NA>  Pac-12     20      20  143.
## 13 2019  SEC         9       9  141.
## 14 2020  SEC        12      12  145.
## 15 <NA>  SEC        21      21  143.
## 16 2019  <NA>       50      50  143.
## 17 2020  <NA>       57      57  137.
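
The stacking idea in the hints can be sketched on a tiny made-up data set. Everything here (the two conferences "A" and "B", the `rating` values, the object names) is hypothetical, and the sketch omits the valid_n column shown above; it only illustrates adding an NA column so that the three summaries share the same variables before being combined with bind_rows().

```r
library(dplyr)

# Hypothetical toy data; not the quarterback data from the exercise
toy <- tibble(Year = c("2019", "2019", "2020", "2020", "2019", "2020"),
              Conf = c("A", "A", "A", "A", "B", "B"),
              rating = c(100, 120, 130, 150, 140, 160))

# Summary by conference and year
by_both <- toy %>%
  group_by(Conf, Year) %>%
  summarize(n = n(), mean = mean(rating), .groups = "drop")

# Summary by conference only; Year does not exist here, so fill with NA
by_conf <- toy %>%
  group_by(Conf) %>%
  summarize(n = n(), mean = mean(rating)) %>%
  mutate(Year = NA_character_)

# Summary by year only; Conf does not exist here, so fill with NA
by_year <- toy %>%
  group_by(Year) %>%
  summarize(n = n(), mean = mean(rating)) %>%
  mutate(Conf = NA_character_)

# Stack the three summaries, then sort by year within conference
all_sum <- bind_rows(by_both, by_conf, by_year) %>%
  select(Year, Conf, n, mean) %>%
  arrange(Conf, Year)

all_sum
```

Because arrange() places NA values last, the across-year row follows the yearly rows within each conference and the across-conference rows fall at the bottom.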

Sums-of-Squares

The standard deviation of a variable \(x\) is calculated with

\[ \sqrt{\frac{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}{n-1}} \] where \(x_{i}\) is the \(i\)th observation of the variable \(x\), \(\bar{x}\) is the mean of \(x\), and \(n\) is the number of observations of \(x\). A list of six steps for this calculation and a table that helps with computing the standard deviation are described here.
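
To make the formula concrete, the sketch below works through it term by term for a short made-up vector (the values are hypothetical and are not the exercise data) and compares the result to R's built-in sd().

```r
x <- c(2, 4, 6, 8)          # hypothetical values, not the exercise data

n    <- length(x)           # number of observations
xbar <- mean(x)             # mean of x
dev  <- x - xbar            # deviation of each observation from the mean
dev2 <- dev^2               # squared deviations
ss   <- sum(dev2)           # sum-of-squares (numerator under the root)
s    <- sqrt(ss / (n - 1))  # divide by n-1, then take the square root

# The hand calculation should agree with the built-in function
all.equal(s, sd(x))
```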

Use the following data frame called df for the questions below.

library(dplyr)  # provides tibble(), mutate(), summarize(), and group_by()

df <- tibble(group=rep(c("A","B"),c(4,5)),
             value=c(10,22,14,18,22,25,28,21,24))

Construct a data frame that is like Table 4.3 (from the link above) but for the data in df. [Hint: The table will not have the “Sum” row.]
Summarize the data frame created above to compute the standard deviation for these data. [Hint: this will use summarize() and mutate() but not group_by(). The standard deviation you calculate should equal 5.61.]
Construct a data frame that is like Table 4.3 but will ultimately allow you to compute the standard deviation for both groups. [Hint: this will require use of group_by().]
Summarize the last data frame created above to compute the standard deviation for both groups in these data. [Hint: the standard deviations you calculate should equal 5.16 and 2.74.]
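
One way to check the hand-built calculations is to compare them against R's built-in sd(), which the exercise itself does not ask for. The sketch below is only such a check, assuming dplyr is installed; the object name `check` is illustrative.

```r
library(dplyr)

df <- tibble(group=rep(c("A","B"),c(4,5)),
             value=c(10,22,14,18,22,25,28,21,24))

# Built-in standard deviation across all observations
sd_all <- sd(df$value)

# Built-in standard deviation within each group
check <- df %>%
  group_by(group) %>%
  summarize(sdev = sd(value))

round(sd_all, 2)
check
```

If the sums-of-squares tables were built correctly, their results should agree with these values after rounding.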