Exercise - Relational Data

Join Identification I

Suppose that you have two simple data frames G and H.

Data frame G
id	val
A	1
C	2
C	3
B	4

Data frame H
id	ue
E	9
A	8
B	7
B	6
D	5

For each join situation below identify how many rows and columns you would expect in the resulting data frame. [Think deeply about your answers here but note that you will not be graded on your answers to these questions.]

inner_join(G,H,by="id")
left_join(G,H,by="id")
right_join(G,H,by="id")
full_join(G,H,by="id")
semi_join(G,H,by="id")
anti_join(G,H,by="id")

Join Identification II

Create data frames in R that match the G and H data frames from the “Join Identification I” question. Check your expected number of row and columns for each join type using the requisite functions in R. Were each of your answers correct? Which ones (if any) were not correct and where do you think your reasoning went awry?

Join Identification III

Suppose that you have two simple data frames A and B.

Data frame A
uniq	meas1	meas2
A13	34	sun
A15	87	sun
A15	65	moon
A16	33	moon
A16	12	sun

Data frame B
uniq	meas3
A14	am
A15	pm
A16	am
A17	pm

What kind of join was applied to A and B to create each data frame below? [Think deeply about your answers here but note that you will not be graded on your answers to these questions.]

Data frame Z

uniq	meas1	meas2	meas3
A15	87	sun	pm
A15	65	moon	pm
A16	33	moon	am
A16	12	sun	am

Data frame Y

uniq	meas1	meas2
A15	87	sun
A15	65	moon
A16	33	moon
A16	12	sun

Data frame X

uniq	meas1	meas2	meas3
A15	87	sun	pm
A15	65	moon	pm
A16	33	moon	am
A16	12	sun	am
A14	NA	NA	am
A17	NA	NA	pm

Data frame W

uniq	meas1	meas2
A13	34	sun

Data frame V

uniq	meas1	meas2	meas3
A13	34	sun	NA
A15	87	sun	pm
A15	65	moon	pm
A16	33	moon	am
A16	12	sun	am
A14	NA	NA	am
A17	NA	NA	pm

Data frame U

uniq	meas1	meas2	meas3
A13	34	sun	NA
A15	87	sun	pm
A15	65	moon	pm
A16	33	moon	am
A16	12	sun	am

Data frame T

uniq	meas3	meas1	meas2
A14	am	NA	NA
A15	pm	87	sun
A15	pm	65	moon
A16	am	33	moon
A16	am	12	sun
A17	pm	NA	NA

Join Identification IV

Create data frames in R that match the A and B data frames from the “Join Identification III” question.
Use join commands in R to create each of the Z, Y, W, V, U, and T data frames in the “Join Identification III” questions from the A and B data frames. Show your code and the complete resulting data frame.
Reflect on your answers in “Join Identification III” and the results here. Were each of your answers in “Join Identification III” correct? Which ones (if any) were not correct and where do you think your reasoning went awry?

Ski Club

A group of 16 competitive but amateur skiers formed a loosely-knit “club” to motivate themselves to keep skiing. The record keeper for the club developed a data organization scheme that uses four sheets in this Excel file. The “skiers” sheet has information about each skier including a unique bibID code. The “races” sheet has information about each ski race that members of the club partook in and includes a unique “raceID” code. To date, two races have been entered into the spreadsheet. The “race19” sheet has results from the first race (i.e., raceID==1) for each of the members that participated in that race. The “race20” sheet is the same for the second race (i.e., raceID==2). Download the Excel file and load each sheet into a separate dataframe in R. Use these data frames to perform the following tasks. Show your R code and the resulting data frame.

Show results from both races for racers that were in both races.
Append the racers’ personal information to the data frame from the previous question. As a matter of aesthetics do this in a manner such that the personal information is left most in the resulting data frame. [Note that there should be no missing values in this data frame.]
Note that some of the variables in the data frame from the previous question have suffixes appended to the name. Explain why you think this happened and why it is needed.
Show results for the first race only for racers that were in both races.
Show results for the first race only for racers that were only in the first race.
Show “results” for both races for all racers that were in the first race. [There may be missing data in this data frame.]
Construct a “master race file” that has results for all racers for all races.

Food Insecurity Surveys

Ashland County administrators want to assess factors that lead to food insecurity among residents of Ashland County. Towards this end they will distribute a food insecurity survey to as many households in the county as possible. As it is not known who is food insecure they have decided to distribute the surveys at locations where those that are food insecure are likely to frequent. At these location, the surveys will either be placed where respondents can take a survey or not, or the administrators will physically hand some surveys directly to people.

To monitor the return rate of the surveys, the administrators have uniquely numbered each of 1000 surveys and will keep track of where, when, and how each survey was delivered. This information is recorded in the “Surveys” sheet of FoodInsecuritySurvey.xlsx. Information, including the unique survey number, on the returned surveys was entered separately in the “Results” sheet of the same Excel workbook. Meta data for the two sheets is in the “meta” sheet of the same workbook. Load data from the “Surveys” and “Results” sheets into separate data frames in R.

Use at inner, full, left, or right join to create a new data frame that has all variables from both data frames present but only for surveys that have been returned.
Repeat the previous question using a different type of join.
Explain why the other two joins are not appropriate for this task.