Module 4 Relational Data
4.1 Nature of Relational Data
Some situations will have multiple data frames with related data. A particular analysis may require combining these data frames into a single data frame. If the data in the separate data frames are connected by a key variable then the data are said to be relational – i.e., they relate to each other through a common variable (or variables).
As an example a college may have the following four data frames with respect to its students.
- Personal information (hometown, age, etc.)
- Financial aid information (family income, Pell Grant aid amount, etc.)
- Academic information (standing, major, gpa, etc.)
- Current course information (i.e., which courses a students is registered for)
Each of these data frames would also contain a student ID variable so that a student’s personal information can be connect with the students financial aid, academic, or course information. This student ID variable is the key variable for these relational data frames.
Data from related data frames can be joined in a variety of ways. This module will explain several types of joins and how to accomplish those joins in R.
Relational Data: Data tables that are connected by a common key variable.
Key Variable: A variable in a data table that connects related data. This is often a unique “identification” variable.
Relational data can be more efficient and, thus, lead to fewer data entry errors. For example consider a very simple set of data where the occurrence of plant species in plots is recorded, along with information about the plot. For example, the data for one plot that had five species may look like this …
plotID | lat | long | loc | species |
---|---|---|---|---|
101 | 43.458 | -91.2304 | Baffle Is. | Big Bluestem |
101 | 43.458 | -91.2304 | Baffle Is. | Fox Sedge |
101 | 43.458 | -91.2304 | Baffle Is. | Indian Grass |
101 | 43.458 | -91.2304 | Baffle Is. | Prairie Dropseed |
101 | 43.458 | -91.2304 | Baffle Is. | Tufted Hair Grass |
However, with this data entry the four variables related to the plot (e.g., plotID
, lat
, long
, and loc
) must be repeated multiple times for each species
. This repetition is both inefficient and error-prone. A better structure is to create one data frame, with plotID
, with the characteristics for each plot …
plotID | lat | long | loc |
---|---|---|---|
101 | 43.458 | -91.2304 | Baffle Is. |
… and then a second data frame, again with plotID
, with the species observed …
plotID | species |
---|---|
101 | Big Bluestem |
101 | Fox Sedge |
101 | Indian Grass |
101 | Prairie Dropseed |
101 | Tufted Hair Grass |
With this structure the only variable that is repeated is plotID
, which will save time and reduce errors upon data entry.
If you find yourself repeat values when entering data, stop and determine if you can use a relational structure to be more efficient and reduce errors.
4.2 Join Concepts
To illustrate the various joins, suppose that a simple data frame x
exists that has an id
key variable and a val1
measurement variable.15
id | val1 |
---|---|
101 | x1 |
102 | x2 |
102 | x3 |
103 | x4 |
Further suppose that a second data frame y
has the same id
key variable, though it may have different values, and different val2
and val3
measurement variables.16
id | val2 | val3 |
---|---|---|
101 | y1 | z1 |
102 | y2 | z2 |
104 | y3 | z3 |
104 | y4 | z4 |
It is instructive when learning about joins to visualize all combinations of observations in the two data frames.17 As some key variable values may not be present in one of the data frames, combinations with a missing key variable value (and the measurement variables set to NA
) must also be considered. All combinations of the rows in x
and y
with missing key variable values are shown on the left for each join type in the subsections below.
4.2.1 Inner Join
An inner join is the simplest join. It returns values from both data frames where the key variable(s) match in both data frames. In our simple data frames an inner join returns the rows from all combinations of rows (see left below) where the id
variables (i.e., colors) from x
and y
match (see center below). The final result is these rows with the duplicated id
key variable removed (see right below).
All combinations from x and y
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Retained for ‘inner join’
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Final ‘inner join’ result
|
|||
---|---|---|---|
id | val1 | val2 | val3 |
101 | x1 | y1 | z1 |
102 | x2 | y2 | z2 |
102 | x3 | y2 | z2 |
Inner Join: Rows from both data frames where the key variable values match.
4.2.2 Left Join
A left join returns the same rows as an inner join (i.e., all rows where the key variables match) AND all rows from the first data frame that don’t have a key variable match in the second data frame. The values for the variables in the second data frame for key values in the first data frame without a match are replaced with NA
s. So, a left join will include rows for all key variables from the first data frame, but only rows from the second data frame that had a key variable match with the first data frame.
All combinations from x and y
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Retained for ‘left join’
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Final ‘left join’ result
|
|||
---|---|---|---|
id | val1 | val2 | val3 |
101 | x1 | y1 | z1 |
102 | x2 | y2 | z2 |
102 | x3 | y2 | z2 |
103 | x4 | NA | NA |
Left Join: All rows from the first data frame with variables from the second data frame where the key variable values match (NA otherwise).
4.2.3 Right Join
A right join works just like a left join except that the result will include all rows from the second data frame that don’t have a key variable match in the first data frame. A right join can also be accomplished with a left join by reversing the order of the two data frames.
Right Join: All rows from the second data frame with variables from the first data frame where the key variable values match (NA otherwise).
4.2.4 Full Join
A full join returns the same rows as an inner join (i.e., all rows where the key variable match) AND all rows from each data frame that don’t have a key variable match in the other data frame. So a full join will include rows for all key variable values from both data frames.
All combinations from x and y
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Retained for ‘full join’
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Final ‘full join’ result
|
|||
---|---|---|---|
id | val1 | val2 | val3 |
101 | x1 | y1 | z1 |
102 | x2 | y2 | z2 |
102 | x3 | y2 | z2 |
103 | x4 | NA | NA |
104 | NA | y3 | z3 |
104 | NA | y4 | z4 |
Full Join: All rows from both data frames with variables from the other data frame where the key variable columns match (NA otherwise).
4.2.5 Semi Join
In a semi join only values from the first data frame that have a key variable match in the second data frame are retained. Thus, the final result will only have variables from the first data frame for rows that had a key variable match in the second data frame. This is the same result as an inner join but without including the variables from the second data frame.
All combinations from x and y
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Retained for ‘semi join’
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Final ‘semi join’ result
|
|
---|---|
id | val1 |
101 | x1 |
102 | x2 |
102 | x3 |
Semi Join: All variables from the first data frame for rows where the key variable column has a match in the second data frame.
4.2.6 Anti Join
In an anti join only values from the first data frame that DO NOT have a key variable match in the second data frame are retained. Thus, the final result will only have variables from the first data frame for rows without a key variable match in the second data frame.
All combinations from x and y
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Retained for ‘anti join’
|
||||
---|---|---|---|---|
id.x | val1 | id.y | val2 | val3 |
101 | x1 | 101 | y1 | z1 |
101 | x1 | 102 | y2 | z2 |
101 | x1 | 104 | y3 | z3 |
101 | x1 | 104 | y4 | z4 |
101 | x1 | NA | NA | |
102 | x2 | 101 | y1 | z1 |
102 | x2 | 102 | y2 | z2 |
102 | x2 | 104 | y3 | z3 |
102 | x2 | 104 | y4 | z4 |
102 | x2 | NA | NA | |
102 | x3 | 101 | y1 | z1 |
102 | x3 | 102 | y2 | z2 |
102 | x3 | 104 | y3 | z3 |
102 | x3 | 104 | y4 | z4 |
102 | x3 | NA | NA | |
103 | x4 | 101 | y1 | z1 |
103 | x4 | 102 | y2 | z2 |
103 | x4 | 104 | y3 | z3 |
103 | x4 | 104 | y4 | z4 |
103 | x4 | NA | NA | |
NA | 101 | y1 | z1 | |
NA | 102 | y2 | z2 | |
NA | 104 | y3 | z3 | |
NA | 104 | y4 | z4 |
Final ‘anti join’ result
|
|
---|---|
id | val1 |
103 | x4 |
Anti Join: All variables from the first data frame for rows where the key variable column does NOT have a match in the second data frame.
4.3 Joins in R
Performing the joins described in the previous section is straightforward with dplyr
(part of the tidyverse
) using, conveniently enough, inner_join()
, left_join()
, right_join()
, full_join()
, semi_join()
, and anti_join()
. The first two arguments to each of these functions are the two data frames to join. In addition, the name of the key variable should be given, in quotes, to by=
.18
The two data frames used in the previous section are created below as objects in R.
<- data.frame(id=c(101,102,102,103),val1=paste0("x",1:4))
x x
#R> id val1
#R> 1 101 x1
#R> 2 102 x2
#R> 3 102 x3
#R> 4 103 x4
<- data.frame(id=c(101,102,104,104),val2=paste0("y",1:4),val3=paste0("z",1:4))
y y
#R> id val2 val3
#R> 1 101 y1 z1
#R> 2 102 y2 z2
#R> 3 104 y3 z3
#R> 4 104 y4 z4
The six joins discussed in the previous section are completed below. You should compare the results here to the visual results above.
<- inner_join(x,y,by="id")
ij ij
#R> id val1 val2 val3
#R> 1 101 x1 y1 z1
#R> 2 102 x2 y2 z2
#R> 3 102 x3 y2 z2
<- left_join(x,y,by="id")
lj lj
#R> id val1 val2 val3
#R> 1 101 x1 y1 z1
#R> 2 102 x2 y2 z2
#R> 3 102 x3 y2 z2
#R> 4 103 x4 <NA> <NA>
<- right_join(x,y,by="id")
rj rj
#R> id val1 val2 val3
#R> 1 101 x1 y1 z1
#R> 2 102 x2 y2 z2
#R> 3 102 x3 y2 z2
#R> 4 104 <NA> y3 z3
#R> 5 104 <NA> y4 z4
<- left_join(y,x,by="id") # right_join as left_join with x & y reversed
rj2 rj2
#R> id val2 val3 val1
#R> 1 101 y1 z1 x1
#R> 2 102 y2 z2 x2
#R> 3 102 y2 z2 x3
#R> 4 104 y3 z3 <NA>
#R> 5 104 y4 z4 <NA>
<- full_join(x,y,by="id")
fj fj
#R> id val1 val2 val3
#R> 1 101 x1 y1 z1
#R> 2 102 x2 y2 z2
#R> 3 102 x3 y2 z2
#R> 4 103 x4 <NA> <NA>
#R> 5 104 <NA> y3 z3
#R> 6 104 <NA> y4 z4
<- semi_join(x,y,by="id")
sj sj
#R> id val1
#R> 1 101 x1
#R> 2 102 x2
#R> 3 102 x3
<- anti_join(x,y,by="id")
aj aj
#R> id val1
#R> 1 103 x4
4.4 Examples With Context
The following examples demonstrate different types of joins within fictitious, but realistic, contexts. Please examine each data frame and the joined results carefully to help further understand what each type of join does.
4.4.1 Student Data (One-to-One)
In large institutions or in complicated data environments, data about specific individuals may be housed in a variety of departments, each of which maintains its own database. Preferably these data sources can be related via a primary key variable, such as a unique student ID number. As an example, suppose that a college’s admissions office maintains a database of personal information about every student at the college. For example it might look like that below for a fictitious five students.
<- tibble(studentID=c(34535,45423,73424,89874,98222),
personal first_nm=c("Rolando","Catherine","James","Rachel","Esteban"),
last_nm=c("Blackman","Johnson","Carmichael","Brown","Perez"),
hometown=c("Windsor","Eden Prairie","Marion","Milwaukee","El Paso"),
homestate=c("MI","MN","IA","WI","TX"))
personal
#R> # A tibble: 5 x 5
#R> studentID first_nm last_nm hometown homestate
#R> <dbl> <chr> <chr> <chr> <chr>
#R> 1 34535 Rolando Blackman Windsor MI
#R> 2 45423 Catherine Johnson Eden Prairie MN
#R> 3 73424 James Carmichael Marion IA
#R> 4 89874 Rachel Brown Milwaukee WI
#R> 5 98222 Esteban Perez El Paso TX
In addition the financial aid office may have a database of financial aid information.
<- tibble(studentID=c(34535,45423,73424,89874,98222),
finaid income_cat=c(4,5,3,2,3),
pell_elig=c(TRUE,FALSE,TRUE,TRUE,TRUE),
work_study=c(TRUE,FALSE,FALSE,FALSE,TRUE))
finaid
#R> # A tibble: 5 x 4
#R> studentID income_cat pell_elig work_study
#R> <dbl> <dbl> <lgl> <lgl>
#R> 1 34535 4 TRUE TRUE
#R> 2 45423 5 FALSE FALSE
#R> 3 73424 3 TRUE FALSE
#R> 4 89874 2 TRUE FALSE
#R> 5 98222 3 TRUE TRUE
Furthermore the registrar’s office has a database of academic information.
<- tibble(studentID=c(34535,45423,73424,89874,98222),
academics standing=c("FY","FY","SO","SR","JR"),
major=c("undecided","NRS","Biology","SCD","SCD"),
cum_gpa=c(0,0,3.12,3.67,2.89))
academics
#R> # A tibble: 5 x 4
#R> studentID standing major cum_gpa
#R> <dbl> <chr> <chr> <dbl>
#R> 1 34535 FY undecided 0
#R> 2 45423 FY NRS 0
#R> 3 73424 SO Biology 3.12
#R> 4 89874 SR SCD 3.67
#R> 5 98222 JR SCD 2.89
Note how each of these databases has the studentID
variable that will serve as the key variable to connect each student’s information across each of the databases.
These databases form what is called a one-to-one relationship because each observation record in each database can be connected to one and only one observation record in the other databases. In this example, each database would ideally have an entry for every student.
These data frames can generally be joined with inner, left, or right joins depending on the purpose. For example, an institutional researcher may want to examine whether student gpa differed between students that were eligible for a Pell Grant or not. In this case, the researcher would join the finaid
and academics
data frames so that the pell_elig
and cum_gpa
variables for each student would be in one data frame.
<- inner_join(finaid,academics,by="studentID")
tmp tmp
#R> # A tibble: 5 x 7
#R> studentID income_cat pell_elig work_study standing major cum_gpa
#R> <dbl> <dbl> <lgl> <lgl> <chr> <chr> <dbl>
#R> 1 34535 4 TRUE TRUE FY undecided 0
#R> 2 45423 5 FALSE FALSE FY NRS 0
#R> 3 73424 3 TRUE FALSE SO Biology 3.12
#R> 4 89874 2 TRUE FALSE SR SCD 3.67
#R> 5 98222 3 TRUE TRUE JR SCD 2.89
An inner join was used here because the researcher only wants to include students that are in both databases (i.e., would likely have an entry for both pell_elig
and cum_gpa
). Note, however, that a left join or a right join would accomplish the same task as long as both databases had entries for every student (i.e., the databases had the same set of students).
Further suppose that academic advisors would like to have the students’ names attached to these records so that they could reach out to students who could use some help academically.
<- inner_join(personal,tmp,by="studentID")
tmp tmp
#R> # A tibble: 5 x 11
#R> studentID first_nm last_nm hometown homestate income_cat pell_elig work_study
#R> <dbl> <chr> <chr> <chr> <chr> <dbl> <lgl> <lgl>
#R> 1 34535 Rolando Blackman Windsor MI 4 TRUE TRUE
#R> 2 45423 Catherine Johnson Eden Pr~ MN 5 FALSE FALSE
#R> 3 73424 James Carmichael Marion IA 3 TRUE FALSE
#R> 4 89874 Rachel Brown Milwauk~ WI 2 TRUE FALSE
#R> 5 98222 Esteban Perez El Paso TX 3 TRUE TRUE
#R> # ... with 3 more variables: standing <chr>, major <chr>, cum_gpa <dbl>
Continuing with this example, suppose that the registrar’s office also maintains a database that contains each students’ current class schedule.
<- tibble(studentID=c(34535,34535,34535,34535,
schedules 45423,45423,45423,45423,45423,
73424,73424,73424,73424,
89874,89874,89874,
98222,98222,98222,98222),
course=c("MTH107","BIO115","CHM110","IDS101",
"SCD110","PSY110","MTH140","OED212","IDS101",
"BIO234","CHM220","BIO370","SCD110",
"SCD440","PSY370","IDS490",
"SCD440","SCD330","SOC480","ART220"))
schedules
#R> # A tibble: 20 x 2
#R> studentID course
#R> <dbl> <chr>
#R> 1 34535 MTH107
#R> 2 34535 BIO115
#R> 3 34535 CHM110
#R> 4 34535 IDS101
#R> 5 45423 SCD110
#R> 6 45423 PSY110
#R> 7 45423 MTH140
#R> 8 45423 OED212
#R> 9 45423 IDS101
#R> 10 73424 BIO234
#R> 11 73424 CHM220
#R> 12 73424 BIO370
#R> 13 73424 SCD110
#R> 14 89874 SCD440
#R> 15 89874 PSY370
#R> 16 89874 IDS490
#R> 17 98222 SCD440
#R> 18 98222 SCD330
#R> 19 98222 SOC480
#R> 20 98222 ART220
The registrar’s office also maintains a database of information about every course taught at the college. A partial example of such a database is shown below.
<- tibble(course=c("ART220","ART330","BIO115","BIO234","BIO370","BIO490",
courses "CHM110","CHM220","CHM360","IDS101","IDS490","MTH107",
"MTH140","MTH230","OED212","OED330","OED360","PSY110",
"PSY370","SCD110","SCD330","SCD440","SOC111","SOC480"),
credits=c(3,3,4,4,4,4,4,4,4,3,4,4,4,4,3,3,3,4,4,3,3,4,4,4),
instructor=c("Duffy","Terry","Johnson","Goyke","Anich","Anich",
"Carlson","Robertson","Carlson","Goyke","Hannickel","Ogle",
"Jensen","Jensen","Andre","Andre","Coulson","Sneyd",
"Sneyd","Tochterman","Tochterman","Foster",
"Schanning","Schanning"))
courses
#R> # A tibble: 24 x 3
#R> course credits instructor
#R> <chr> <dbl> <chr>
#R> 1 ART220 3 Duffy
#R> 2 ART330 3 Terry
#R> 3 BIO115 4 Johnson
#R> 4 BIO234 4 Goyke
#R> 5 BIO370 4 Anich
#R> 6 BIO490 4 Anich
#R> 7 CHM110 4 Carlson
#R> 8 CHM220 4 Robertson
#R> 9 CHM360 4 Carlson
#R> 10 IDS101 3 Goyke
#R> # ... with 14 more rows
These two data frames are related via the common course
variable; thus for these two data frames course
is the key variable. This type of database organization is particularly useful because the information about any one course only needs to be entered once in courses
even though the actual course may appear many times in schedules
. The specific course information (credits and instructor) from courses
can be added to the student’s course information in schedules
with a left_join()
.
<- left_join(schedules,courses,by="course")
schedules2 schedules2
#R> # A tibble: 20 x 4
#R> studentID course credits instructor
#R> <dbl> <chr> <dbl> <chr>
#R> 1 34535 MTH107 4 Ogle
#R> 2 34535 BIO115 4 Johnson
#R> 3 34535 CHM110 4 Carlson
#R> 4 34535 IDS101 3 Goyke
#R> 5 45423 SCD110 3 Tochterman
#R> 6 45423 PSY110 4 Sneyd
#R> 7 45423 MTH140 4 Jensen
#R> 8 45423 OED212 3 Andre
#R> 9 45423 IDS101 3 Goyke
#R> 10 73424 BIO234 4 Goyke
#R> 11 73424 CHM220 4 Robertson
#R> 12 73424 BIO370 4 Anich
#R> 13 73424 SCD110 3 Tochterman
#R> 14 89874 SCD440 4 Foster
#R> 15 89874 PSY370 4 Sneyd
#R> 16 89874 IDS490 4 Hannickel
#R> 17 98222 SCD440 4 Foster
#R> 18 98222 SCD330 3 Tochterman
#R> 19 98222 SOC480 4 Schanning
#R> 20 98222 ART220 3 Duffy
Note here that if we consider schedules
as the primary data frame of interest then this relationship is still a one-to-one relationship because each course in schedules
can be connected to only one course record in courses
.
The student’s personal information can also be added to these results with a left_join()
but now using the studentID
key variable.
<- left_join(personal,schedules2,by="studentID")
schedules2A schedules2A
#R> # A tibble: 20 x 8
#R> studentID first_nm last_nm hometown homestate course credits instructor
#R> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#R> 1 34535 Rolando Blackman Windsor MI MTH107 4 Ogle
#R> 2 34535 Rolando Blackman Windsor MI BIO115 4 Johnson
#R> 3 34535 Rolando Blackman Windsor MI CHM110 4 Carlson
#R> 4 34535 Rolando Blackman Windsor MI IDS101 3 Goyke
#R> 5 45423 Catherine Johnson Eden Prairie MN SCD110 3 Tochterman
#R> 6 45423 Catherine Johnson Eden Prairie MN PSY110 4 Sneyd
#R> 7 45423 Catherine Johnson Eden Prairie MN MTH140 4 Jensen
#R> 8 45423 Catherine Johnson Eden Prairie MN OED212 3 Andre
#R> 9 45423 Catherine Johnson Eden Prairie MN IDS101 3 Goyke
#R> 10 73424 James Carmichael Marion IA BIO234 4 Goyke
#R> 11 73424 James Carmichael Marion IA CHM220 4 Robertson
#R> 12 73424 James Carmichael Marion IA BIO370 4 Anich
#R> 13 73424 James Carmichael Marion IA SCD110 3 Tochterman
#R> 14 89874 Rachel Brown Milwaukee WI SCD440 4 Foster
#R> 15 89874 Rachel Brown Milwaukee WI PSY370 4 Sneyd
#R> 16 89874 Rachel Brown Milwaukee WI IDS490 4 Hannickel
#R> 17 98222 Esteban Perez El Paso TX SCD440 4 Foster
#R> 18 98222 Esteban Perez El Paso TX SCD330 3 Tochterman
#R> 19 98222 Esteban Perez El Paso TX SOC480 4 Schanning
#R> 20 98222 Esteban Perez El Paso TX ART220 3 Duffy
These data frames represent a one-to-many relationship because studentID
in personal
is connected to many studentID
records in schedules2
(one for each course the student is enrolled in).
4.4.2 Resource Sampling Data (One-to-Many Relationship)
In sampling of natural resources it is common to have one database for information about the unit of sampling and another database specific to items within that sampling unit. For example, in fisheries we may have one database to record information about a particular net (e.g., where it is located, the date it was set) and a second database that records the species of fish caught and number of the species caught. You may be tempted to do this all in one database with a separate field for each fish species but this is inefficient as you may not know which species you may catch. Thus, every time you catch a new species you would need to add a new field or column to your database. In addition, this would be highly inefficient if you were to record information about individual fish (e.g., length and weight) as the amount of this information may vary from net to net.
In this simple example, information about five specific settings of a net is stored in nets
, which has a unique identifier for each net setting called net_num
.
<- data.frame(net_num=1:5,
nets lake=c("Eagle","Hart","Hart","Eagle","Millicent"),
date=c("3-Jul-21","3-Jul-21","5-Jul-21","6-Jul-21","6-Jul-21"))
nets
#R> net_num lake date
#R> 1 1 Eagle 3-Jul-21
#R> 2 2 Hart 3-Jul-21
#R> 3 3 Hart 5-Jul-21
#R> 4 4 Eagle 6-Jul-21
#R> 5 5 Millicent 6-Jul-21
In a separate data frame the researchers recorded the species and number of each species caught in each net. Here there is a separate row for each species and its number caught with each row indexed to the specific net with the net_num
key variable.
<- data.frame(net_num=c(1,1,2,2,2,4,4,5),
catch species=c("Bluegill","Largemouth Bass",
"Bluegill","Largemouth Bass","Bluntnose Minnow",
"Bluegill","Largemouth Bass",
"Largemouth Bass"),
number=c(7,3,19,2,56,3,6,3))
catch
#R> net_num species number
#R> 1 1 Bluegill 7
#R> 2 1 Largemouth Bass 3
#R> 3 2 Bluegill 19
#R> 4 2 Largemouth Bass 2
#R> 5 2 Bluntnose Minnow 56
#R> 6 4 Bluegill 3
#R> 7 4 Largemouth Bass 6
#R> 8 5 Largemouth Bass 3
These data frames illustrate a one-to-many relationship as each record in nets
may be connected to multiple records in catch
.
The catch data will be joined to the net data using a left_join()
because it is important to keep track of nets that also did not catch fish. An inner_join()
would only return nets where some fish were caught.
<- left_join(nets,catch,by="net_num")
fishcatch fishcatch
#R> net_num lake date species number
#R> 1 1 Eagle 3-Jul-21 Bluegill 7
#R> 2 1 Eagle 3-Jul-21 Largemouth Bass 3
#R> 3 2 Hart 3-Jul-21 Bluegill 19
#R> 4 2 Hart 3-Jul-21 Largemouth Bass 2
#R> 5 2 Hart 3-Jul-21 Bluntnose Minnow 56
#R> 6 3 Hart 5-Jul-21 <NA> NA
#R> 7 4 Eagle 6-Jul-21 Bluegill 3
#R> 8 4 Eagle 6-Jul-21 Largemouth Bass 6
#R> 9 5 Millicent 6-Jul-21 Largemouth Bass 3
The visualizations in this section are modified from https://twitter.com/hadleywickham/status/684407629259526148/photo/1.↩︎
In this treatment, the unique
id
values are also uniquely colored to help track individual observations in the descriptions below.↩︎Finding all combinations, however, is not needed to actually join two data frames.↩︎
If
by=
is not explicitly set by the user then the two data frames will be joined using the variable(s) that the two data frames have in common forby=
. A message will be displayed about which variable(s) was used; check this message carefully to make sure you are joining by the variables that you want to join by.↩︎