Join Identification I

Suppose that you have two simple data frames G and H.

Data frame G
id val
A 1
C 2
C 3
B 4
Data frame H
id ue
E 9
A 8
B 7
B 6
D 5

For each join situation below identify how many rows and columns you would expect in the resulting data frame. [Think deeply about your answers here but note that you will not be graded on your answers to these questions.]

  1. inner_join(G,H,by="id")
  2. left_join(G,H,by="id")
  3. right_join(G,H,by="id")
  4. full_join(G,H,by="id")
  5. semi_join(G,H,by="id")
  6. anti_join(G,H,by="id")

 

Join Identification II

Create data frames in R that match the G and H data frames from the “Join Identification I” question. Check your expected number of row and columns for each join type using the requisite functions in R. Were each of your answers correct? Which ones (if any) were not correct and where do you think your reasoning went awry?

 

Join Identification III

Suppose that you have two simple data frames A and B.

Data frame A
uniq meas1 meas2
A13 34 sun
A15 87 sun
A15 65 moon
A16 33 moon
A16 12 sun
Data frame B
uniq meas3
A14 am
A15 pm
A16 am
A17 pm

What kind of join was applied to A and B to create each data frame below? [Think deeply about your answers here but note that you will not be graded on your answers to these questions.]

  1. Data frame Z
  2. uniq meas1 meas2 meas3
    A15 87 sun pm
    A15 65 moon pm
    A16 33 moon am
    A16 12 sun am
  3. Data frame Y
  4. uniq meas1 meas2
    A15 87 sun
    A15 65 moon
    A16 33 moon
    A16 12 sun
  5. Data frame X
  6. uniq meas1 meas2 meas3
    A15 87 sun pm
    A15 65 moon pm
    A16 33 moon am
    A16 12 sun am
    A14 NA NA am
    A17 NA NA pm
  7. Data frame W
  8. uniq meas1 meas2
    A13 34 sun
  9. Data frame V
  10. uniq meas1 meas2 meas3
    A13 34 sun NA
    A15 87 sun pm
    A15 65 moon pm
    A16 33 moon am
    A16 12 sun am
    A14 NA NA am
    A17 NA NA pm
  11. Data frame U
  12. uniq meas1 meas2 meas3
    A13 34 sun NA
    A15 87 sun pm
    A15 65 moon pm
    A16 33 moon am
    A16 12 sun am
  13. Data frame T
  14. uniq meas3 meas1 meas2
    A14 am NA NA
    A15 pm 87 sun
    A15 pm 65 moon
    A16 am 33 moon
    A16 am 12 sun
    A17 pm NA NA

 

Join Identification IV

  1. Create data frames in R that match the A and B data frames from the “Join Identification III” question.
  2. Use join commands in R to create each of the Z, Y, W, V, U, and T data frames in the “Join Identification III” questions from the A and B data frames. Show your code and the complete resulting data frame.
  3. Reflect on your answers in “Join Identification III” and the results here. Were each of your answers in “Join Identification III” correct? Which ones (if any) were not correct and where do you think your reasoning went awry?

 

Ski Club

A group of 16 competitive but amateur skiers formed a loosely-knit “club” to motivate themselves to keep skiing. The record keeper for the club developed a data organization scheme that uses four sheets in this Excel file. The “skiers” sheet has information about each skier including a unique bibID code. The “races” sheet has information about each ski race that members of the club partook in and includes a unique “raceID” code. To date, two races have been entered into the spreadsheet. The “race19” sheet has results from the first race (i.e., raceID==1) for each of the members that participated in that race. The “race20” sheet is the same for the second race (i.e., raceID==2). Download the Excel file and load each sheet into a separate dataframe in R. Use these data frames to perform the following tasks. Show your R code and the resulting data frame.

  1. Show results from both races for racers that were in both races.

  2. Append the racers’ personal information to the data frame from the previous question. As a matter of aesthetics do this in a manner such that the personal information is left most in the resulting data frame. [Note that there should be no missing values in this data frame.]

  3. Note that some of the variables in the data frame from the previous question have suffixes appended to the name. Explain why you think this happened and why it is needed.

  4. Show results for the first race only for racers that were in both races.

  5. Show results for the first race only for racers that were only in the first race.

  6. Show “results” for both races for all racers that were in the first race. [There may be missing data in this data frame.]

  7. Construct a “master race file” that has results for all racers for all races.

 

Food Insecurity Surveys

Ashland County administrators want to assess factors that lead to food insecurity among residents of Ashland County. Towards this end they will distribute a food insecurity survey to as many households in the county as possible. As it is not known who is food insecure they have decided to distribute the surveys at locations where those that are food insecure are likely to frequent. At these location, the surveys will either be placed where respondents can take a survey or not, or the administrators will physically hand some surveys directly to people.

To monitor the return rate of the surveys, the administrators have uniquely numbered each of 1000 surveys and will keep track of where, when, and how each survey was delivered. This information is recorded in the “Surveys” sheet of FoodInsecuritySurvey.xlsx. Information, including the unique survey number, on the returned surveys was entered separately in the “Results” sheet of the same Excel workbook. Meta data for the two sheets is in the “meta” sheet of the same workbook. Load data from the “Surveys” and “Results” sheets into separate data frames in R.

  1. Use at inner, full, left, or right join to create a new data frame that has all variables from both data frames present but only for surveys that have been returned.

  2. Repeat the previous question using a different type of join.

  3. Explain why the other two joins are not appropriate for this task.