Module 4 Relational Data

4.1 Nature of Relational Data

Some situations will have multiple data frames with related data. A particular analysis may require combining these data frames into a single data frame. If the data in the separate data frames are connected by a key variable then the data are said to be relational – i.e., they relate to each other through a common variable (or variables).

As an example a college may have the following four data frames with respect to its students.

Personal information (hometown, age, etc.)
Financial aid information (family income, Pell Grant aid amount, etc.)
Academic information (standing, major, gpa, etc.)
Current course information (i.e., which courses a students is registered for)

Each of these data frames would also contain a student ID variable so that a student’s personal information can be connect with the students financial aid, academic, or course information. This student ID variable is the key variable for these relational data frames.

Data from related data frames can be joined in a variety of ways. This module will explain several types of joins and how to accomplish those joins in R.

Relational Data: Data tables that are connected by a common key variable.

Key Variable: A variable in a data table that connects related data. This is often a unique “identification” variable.

Relational data can be more efficient and, thus, lead to fewer data entry errors. For example consider a very simple set of data where the occurrence of plant species in plots is recorded, along with information about the plot. For example, the data for one plot that had five species may look like this …

plotID	lat	long	loc	species
101	43.458	-91.2304	Baffle Is.	Big Bluestem
101	43.458	-91.2304	Baffle Is.	Fox Sedge
101	43.458	-91.2304	Baffle Is.	Indian Grass
101	43.458	-91.2304	Baffle Is.	Prairie Dropseed
101	43.458	-91.2304	Baffle Is.	Tufted Hair Grass

However, with this data entry the four variables related to the plot (e.g., plotID, lat, long, and loc) must be repeated multiple times for each species. This repetition is both inefficient and error-prone. A better structure is to create one data frame, with plotID, with the characteristics for each plot …

plotID	lat	long	loc
101	43.458	-91.2304	Baffle Is.

… and then a second data frame, again with plotID, with the species observed …

plotID	species
101	Big Bluestem
101	Fox Sedge
101	Indian Grass
101	Prairie Dropseed
101	Tufted Hair Grass

With this structure the only variable that is repeated is plotID, which will save time and reduce errors upon data entry.

If you find yourself repeat values when entering data, stop and determine if you can use a relational structure to be more efficient and reduce errors.

4.2 Join Concepts

To illustrate the various joins, suppose that a simple data frame x exists that has an id key variable and a val1 measurement variable.¹⁵

id	val1
101	x1
102	x2
102	x3
103	x4

Further suppose that a second data frame y has the same id key variable, though it may have different values, and different val2 and val3 measurement variables.¹⁶

id	val2	val3
101	y1	z1
102	y2	z2
104	y3	z3
104	y4	z4

It is instructive when learning about joins to visualize all combinations of observations in the two data frames.¹⁷ As some key variable values may not be present in one of the data frames, combinations with a missing key variable value (and the measurement variables set to NA) must also be considered. All combinations of the rows in x and y with missing key variable values are shown on the left for each join type in the subsections below.

4.2.1 Inner Join

An inner join is the simplest join. It returns values from both data frames where the key variable(s) match in both data frames. In our simple data frames an inner join returns the rows from all combinations of rows (see left below) where the id variables (i.e., colors) from x and y match (see center below). The final result is these rows with the duplicated id key variable removed (see right below).

All combinations from x and y
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Retained for ‘inner join’
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Final ‘inner join’ result
id	val1	val2	val3
101	x1	y1	z1
102	x2	y2	z2
102	x3	y2	z2

Inner Join: Rows from both data frames where the key variable values match.

4.2.2 Left Join

A left join returns the same rows as an inner join (i.e., all rows where the key variables match) AND all rows from the first data frame that don’t have a key variable match in the second data frame. The values for the variables in the second data frame for key values in the first data frame without a match are replaced with NAs. So, a left join will include rows for all key variables from the first data frame, but only rows from the second data frame that had a key variable match with the first data frame.

All combinations from x and y
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Retained for ‘left join’
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Final ‘left join’ result
id	val1	val2	val3
101	x1	y1	z1
102	x2	y2	z2
102	x3	y2	z2
103	x4	NA	NA

Left Join: All rows from the first data frame with variables from the second data frame where the key variable values match (NA otherwise).

4.2.3 Right Join

A right join works just like a left join except that the result will include all rows from the second data frame that don’t have a key variable match in the first data frame. A right join can also be accomplished with a left join by reversing the order of the two data frames.

Right Join: All rows from the second data frame with variables from the first data frame where the key variable values match (NA otherwise).

4.2.4 Full Join

A full join returns the same rows as an inner join (i.e., all rows where the key variable match) AND all rows from each data frame that don’t have a key variable match in the other data frame. So a full join will include rows for all key variable values from both data frames.

All combinations from x and y
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Retained for ‘full join’
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Final ‘full join’ result
id	val1	val2	val3
101	x1	y1	z1
102	x2	y2	z2
102	x3	y2	z2
103	x4	NA	NA
104	NA	y3	z3
104	NA	y4	z4

Full Join: All rows from both data frames with variables from the other data frame where the key variable columns match (NA otherwise).

4.2.5 Semi Join

In a semi join only values from the first data frame that have a key variable match in the second data frame are retained. Thus, the final result will only have variables from the first data frame for rows that had a key variable match in the second data frame. This is the same result as an inner join but without including the variables from the second data frame.

All combinations from x and y
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Retained for ‘semi join’
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Final ‘semi join’ result
id	val1
101	x1
102	x2
102	x3

Semi Join: All variables from the first data frame for rows where the key variable column has a match in the second data frame.

4.2.6 Anti Join

In an anti join only values from the first data frame that DO NOT have a key variable match in the second data frame are retained. Thus, the final result will only have variables from the first data frame for rows without a key variable match in the second data frame.

All combinations from x and y
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Retained for ‘anti join’
id.x	val1	id.y	val2	val3
101	x1	101	y1	z1
101	x1	102	y2	z2
101	x1	104	y3	z3
101	x1	104	y4	z4
101	x1		NA	NA
102	x2	101	y1	z1
102	x2	102	y2	z2
102	x2	104	y3	z3
102	x2	104	y4	z4
102	x2		NA	NA
102	x3	101	y1	z1
102	x3	102	y2	z2
102	x3	104	y3	z3
102	x3	104	y4	z4
102	x3		NA	NA
103	x4	101	y1	z1
103	x4	102	y2	z2
103	x4	104	y3	z3
103	x4	104	y4	z4
103	x4		NA	NA
	NA	101	y1	z1
	NA	102	y2	z2
	NA	104	y3	z3
	NA	104	y4	z4

Final ‘anti join’ result
id	val1
103	x4

Anti Join: All variables from the first data frame for rows where the key variable column does NOT have a match in the second data frame.

4.3 Joins in R

Performing the joins described in the previous section is straightforward with dplyr (part of the tidyverse) using, conveniently enough, inner_join(), left_join(), right_join(), full_join(), semi_join(), and anti_join(). The first two arguments to each of these functions are the two data frames to join. In addition, the name of the key variable should be given, in quotes, to by=.¹⁸

The two data frames used in the previous section are created below as objects in R.

x <- data.frame(id=c(101,102,102,103),val1=paste0("x",1:4))
x

#R>     id val1
#R>  1 101   x1
#R>  2 102   x2
#R>  3 102   x3
#R>  4 103   x4

y <- data.frame(id=c(101,102,104,104),val2=paste0("y",1:4),val3=paste0("z",1:4))
y

#R>     id val2 val3
#R>  1 101   y1   z1
#R>  2 102   y2   z2
#R>  3 104   y3   z3
#R>  4 104   y4   z4

The six joins discussed in the previous section are completed below. You should compare the results here to the visual results above.

ij <- inner_join(x,y,by="id")
ij

#R>     id val1 val2 val3
#R>  1 101   x1   y1   z1
#R>  2 102   x2   y2   z2
#R>  3 102   x3   y2   z2

lj <- left_join(x,y,by="id")
lj

#R>     id val1 val2 val3
#R>  1 101   x1   y1   z1
#R>  2 102   x2   y2   z2
#R>  3 102   x3   y2   z2
#R>  4 103   x4 <NA> <NA>

rj <- right_join(x,y,by="id")
rj

#R>     id val1 val2 val3
#R>  1 101   x1   y1   z1
#R>  2 102   x2   y2   z2
#R>  3 102   x3   y2   z2
#R>  4 104 <NA>   y3   z3
#R>  5 104 <NA>   y4   z4

rj2 <- left_join(y,x,by="id")  # right_join as left_join with x & y reversed
rj2

#R>     id val2 val3 val1
#R>  1 101   y1   z1   x1
#R>  2 102   y2   z2   x2
#R>  3 102   y2   z2   x3
#R>  4 104   y3   z3 <NA>
#R>  5 104   y4   z4 <NA>

fj <- full_join(x,y,by="id")
fj

#R>     id val1 val2 val3
#R>  1 101   x1   y1   z1
#R>  2 102   x2   y2   z2
#R>  3 102   x3   y2   z2
#R>  4 103   x4 <NA> <NA>
#R>  5 104 <NA>   y3   z3
#R>  6 104 <NA>   y4   z4

sj <- semi_join(x,y,by="id")
sj

#R>     id val1
#R>  1 101   x1
#R>  2 102   x2
#R>  3 102   x3

aj <- anti_join(x,y,by="id")
aj

#R>     id val1
#R>  1 103   x4

4.4 Examples With Context

The following examples demonstrate different types of joins within fictitious, but realistic, contexts. Please examine each data frame and the joined results carefully to help further understand what each type of join does.

4.4.1 Student Data (One-to-One)

In large institutions or in complicated data environments, data about specific individuals may be housed in a variety of departments, each of which maintains its own database. Preferably these data sources can be related via a primary key variable, such as a unique student ID number. As an example, suppose that a college’s admissions office maintains a database of personal information about every student at the college. For example it might look like that below for a fictitious five students.

personal <- tibble(studentID=c(34535,45423,73424,89874,98222),
                   first_nm=c("Rolando","Catherine","James","Rachel","Esteban"),
                   last_nm=c("Blackman","Johnson","Carmichael","Brown","Perez"),
                   hometown=c("Windsor","Eden Prairie","Marion","Milwaukee","El Paso"),
                   homestate=c("MI","MN","IA","WI","TX"))

personal

#R>  # A tibble: 5 x 5
#R>    studentID first_nm  last_nm    hometown     homestate
#R>        <dbl> <chr>     <chr>      <chr>        <chr>    
#R>  1     34535 Rolando   Blackman   Windsor      MI       
#R>  2     45423 Catherine Johnson    Eden Prairie MN       
#R>  3     73424 James     Carmichael Marion       IA       
#R>  4     89874 Rachel    Brown      Milwaukee    WI       
#R>  5     98222 Esteban   Perez      El Paso      TX

In addition the financial aid office may have a database of financial aid information.

finaid <- tibble(studentID=c(34535,45423,73424,89874,98222),
                 income_cat=c(4,5,3,2,3),
                 pell_elig=c(TRUE,FALSE,TRUE,TRUE,TRUE),
                 work_study=c(TRUE,FALSE,FALSE,FALSE,TRUE))
finaid

#R>  # A tibble: 5 x 4
#R>    studentID income_cat pell_elig work_study
#R>        <dbl>      <dbl> <lgl>     <lgl>     
#R>  1     34535          4 TRUE      TRUE      
#R>  2     45423          5 FALSE     FALSE     
#R>  3     73424          3 TRUE      FALSE     
#R>  4     89874          2 TRUE      FALSE     
#R>  5     98222          3 TRUE      TRUE

Furthermore the registrar’s office has a database of academic information.

academics <- tibble(studentID=c(34535,45423,73424,89874,98222),
                    standing=c("FY","FY","SO","SR","JR"),
                    major=c("undecided","NRS","Biology","SCD","SCD"),
                    cum_gpa=c(0,0,3.12,3.67,2.89))
academics

#R>  # A tibble: 5 x 4
#R>    studentID standing major     cum_gpa
#R>        <dbl> <chr>    <chr>       <dbl>
#R>  1     34535 FY       undecided    0   
#R>  2     45423 FY       NRS          0   
#R>  3     73424 SO       Biology      3.12
#R>  4     89874 SR       SCD          3.67
#R>  5     98222 JR       SCD          2.89

Note how each of these databases has the studentID variable that will serve as the key variable to connect each student’s information across each of the databases.

These databases form what is called a one-to-one relationship because each observation record in each database can be connected to one and only one observation record in the other databases. In this example, each database would ideally have an entry for every student.

These data frames can generally be joined with inner, left, or right joins depending on the purpose. For example, an institutional researcher may want to examine whether student gpa differed between students that were eligible for a Pell Grant or not. In this case, the researcher would join the finaid and academics data frames so that the pell_elig and cum_gpa variables for each student would be in one data frame.

tmp <- inner_join(finaid,academics,by="studentID")
tmp

#R>  # A tibble: 5 x 7
#R>    studentID income_cat pell_elig work_study standing major     cum_gpa
#R>        <dbl>      <dbl> <lgl>     <lgl>      <chr>    <chr>       <dbl>
#R>  1     34535          4 TRUE      TRUE       FY       undecided    0   
#R>  2     45423          5 FALSE     FALSE      FY       NRS          0   
#R>  3     73424          3 TRUE      FALSE      SO       Biology      3.12
#R>  4     89874          2 TRUE      FALSE      SR       SCD          3.67
#R>  5     98222          3 TRUE      TRUE       JR       SCD          2.89

An inner join was used here because the researcher only wants to include students that are in both databases (i.e., would likely have an entry for both pell_elig and cum_gpa). Note, however, that a left join or a right join would accomplish the same task as long as both databases had entries for every student (i.e., the databases had the same set of students).

Further suppose that academic advisors would like to have the students’ names attached to these records so that they could reach out to students who could use some help academically.

tmp <- inner_join(personal,tmp,by="studentID")
tmp

#R>  # A tibble: 5 x 11
#R>    studentID first_nm  last_nm    hometown homestate income_cat pell_elig work_study
#R>        <dbl> <chr>     <chr>      <chr>    <chr>          <dbl> <lgl>     <lgl>     
#R>  1     34535 Rolando   Blackman   Windsor  MI                 4 TRUE      TRUE      
#R>  2     45423 Catherine Johnson    Eden Pr~ MN                 5 FALSE     FALSE     
#R>  3     73424 James     Carmichael Marion   IA                 3 TRUE      FALSE     
#R>  4     89874 Rachel    Brown      Milwauk~ WI                 2 TRUE      FALSE     
#R>  5     98222 Esteban   Perez      El Paso  TX                 3 TRUE      TRUE      
#R>  # ... with 3 more variables: standing <chr>, major <chr>, cum_gpa <dbl>

Continuing with this example, suppose that the registrar’s office also maintains a database that contains each students’ current class schedule.

schedules <- tibble(studentID=c(34535,34535,34535,34535,
                                45423,45423,45423,45423,45423,
                                73424,73424,73424,73424,
                                89874,89874,89874,
                                98222,98222,98222,98222),
                    course=c("MTH107","BIO115","CHM110","IDS101",
                             "SCD110","PSY110","MTH140","OED212","IDS101",
                             "BIO234","CHM220","BIO370","SCD110",
                             "SCD440","PSY370","IDS490",
                             "SCD440","SCD330","SOC480","ART220"))
schedules

#R>  # A tibble: 20 x 2
#R>     studentID course
#R>         <dbl> <chr> 
#R>   1     34535 MTH107
#R>   2     34535 BIO115
#R>   3     34535 CHM110
#R>   4     34535 IDS101
#R>   5     45423 SCD110
#R>   6     45423 PSY110
#R>   7     45423 MTH140
#R>   8     45423 OED212
#R>   9     45423 IDS101
#R>  10     73424 BIO234
#R>  11     73424 CHM220
#R>  12     73424 BIO370
#R>  13     73424 SCD110
#R>  14     89874 SCD440
#R>  15     89874 PSY370
#R>  16     89874 IDS490
#R>  17     98222 SCD440
#R>  18     98222 SCD330
#R>  19     98222 SOC480
#R>  20     98222 ART220

The registrar’s office also maintains a database of information about every course taught at the college. A partial example of such a database is shown below.

courses <- tibble(course=c("ART220","ART330","BIO115","BIO234","BIO370","BIO490",
                           "CHM110","CHM220","CHM360","IDS101","IDS490","MTH107",
                           "MTH140","MTH230","OED212","OED330","OED360","PSY110",
                           "PSY370","SCD110","SCD330","SCD440","SOC111","SOC480"),
                  credits=c(3,3,4,4,4,4,4,4,4,3,4,4,4,4,3,3,3,4,4,3,3,4,4,4),
                  instructor=c("Duffy","Terry","Johnson","Goyke","Anich","Anich",
                               "Carlson","Robertson","Carlson","Goyke","Hannickel","Ogle",
                               "Jensen","Jensen","Andre","Andre","Coulson","Sneyd",
                               "Sneyd","Tochterman","Tochterman","Foster",
                               "Schanning","Schanning"))
courses

#R>  # A tibble: 24 x 3
#R>     course credits instructor
#R>     <chr>    <dbl> <chr>     
#R>   1 ART220       3 Duffy     
#R>   2 ART330       3 Terry     
#R>   3 BIO115       4 Johnson   
#R>   4 BIO234       4 Goyke     
#R>   5 BIO370       4 Anich     
#R>   6 BIO490       4 Anich     
#R>   7 CHM110       4 Carlson   
#R>   8 CHM220       4 Robertson 
#R>   9 CHM360       4 Carlson   
#R>  10 IDS101       3 Goyke     
#R>  # ... with 14 more rows

These two data frames are related via the common course variable; thus for these two data frames course is the key variable. This type of database organization is particularly useful because the information about any one course only needs to be entered once in courses even though the actual course may appear many times in schedules. The specific course information (credits and instructor) from courses can be added to the student’s course information in schedules with a left_join().

schedules2 <- left_join(schedules,courses,by="course")
schedules2

#R>  # A tibble: 20 x 4
#R>     studentID course credits instructor
#R>         <dbl> <chr>    <dbl> <chr>     
#R>   1     34535 MTH107       4 Ogle      
#R>   2     34535 BIO115       4 Johnson   
#R>   3     34535 CHM110       4 Carlson   
#R>   4     34535 IDS101       3 Goyke     
#R>   5     45423 SCD110       3 Tochterman
#R>   6     45423 PSY110       4 Sneyd     
#R>   7     45423 MTH140       4 Jensen    
#R>   8     45423 OED212       3 Andre     
#R>   9     45423 IDS101       3 Goyke     
#R>  10     73424 BIO234       4 Goyke     
#R>  11     73424 CHM220       4 Robertson 
#R>  12     73424 BIO370       4 Anich     
#R>  13     73424 SCD110       3 Tochterman
#R>  14     89874 SCD440       4 Foster    
#R>  15     89874 PSY370       4 Sneyd     
#R>  16     89874 IDS490       4 Hannickel 
#R>  17     98222 SCD440       4 Foster    
#R>  18     98222 SCD330       3 Tochterman
#R>  19     98222 SOC480       4 Schanning 
#R>  20     98222 ART220       3 Duffy

Note here that if we consider schedules as the primary data frame of interest then this relationship is still a one-to-one relationship because each course in schedules can be connected to only one course record in courses.

The student’s personal information can also be added to these results with a left_join() but now using the studentID key variable.

schedules2A <- left_join(personal,schedules2,by="studentID")
schedules2A

#R>  # A tibble: 20 x 8
#R>     studentID first_nm  last_nm    hometown     homestate course credits instructor
#R>         <dbl> <chr>     <chr>      <chr>        <chr>     <chr>    <dbl> <chr>     
#R>   1     34535 Rolando   Blackman   Windsor      MI        MTH107       4 Ogle      
#R>   2     34535 Rolando   Blackman   Windsor      MI        BIO115       4 Johnson   
#R>   3     34535 Rolando   Blackman   Windsor      MI        CHM110       4 Carlson   
#R>   4     34535 Rolando   Blackman   Windsor      MI        IDS101       3 Goyke     
#R>   5     45423 Catherine Johnson    Eden Prairie MN        SCD110       3 Tochterman
#R>   6     45423 Catherine Johnson    Eden Prairie MN        PSY110       4 Sneyd     
#R>   7     45423 Catherine Johnson    Eden Prairie MN        MTH140       4 Jensen    
#R>   8     45423 Catherine Johnson    Eden Prairie MN        OED212       3 Andre     
#R>   9     45423 Catherine Johnson    Eden Prairie MN        IDS101       3 Goyke     
#R>  10     73424 James     Carmichael Marion       IA        BIO234       4 Goyke     
#R>  11     73424 James     Carmichael Marion       IA        CHM220       4 Robertson 
#R>  12     73424 James     Carmichael Marion       IA        BIO370       4 Anich     
#R>  13     73424 James     Carmichael Marion       IA        SCD110       3 Tochterman
#R>  14     89874 Rachel    Brown      Milwaukee    WI        SCD440       4 Foster    
#R>  15     89874 Rachel    Brown      Milwaukee    WI        PSY370       4 Sneyd     
#R>  16     89874 Rachel    Brown      Milwaukee    WI        IDS490       4 Hannickel 
#R>  17     98222 Esteban   Perez      El Paso      TX        SCD440       4 Foster    
#R>  18     98222 Esteban   Perez      El Paso      TX        SCD330       3 Tochterman
#R>  19     98222 Esteban   Perez      El Paso      TX        SOC480       4 Schanning 
#R>  20     98222 Esteban   Perez      El Paso      TX        ART220       3 Duffy

These data frames represent a one-to-many relationship because studentID in personal is connected to many studentID records in schedules2 (one for each course the student is enrolled in).

4.4.2 Resource Sampling Data (One-to-Many Relationship)

In sampling of natural resources it is common to have one database for information about the unit of sampling and another database specific to items within that sampling unit. For example, in fisheries we may have one database to record information about a particular net (e.g., where it is located, the date it was set) and a second database that records the species of fish caught and number of the species caught. You may be tempted to do this all in one database with a separate field for each fish species but this is inefficient as you may not know which species you may catch. Thus, every time you catch a new species you would need to add a new field or column to your database. In addition, this would be highly inefficient if you were to record information about individual fish (e.g., length and weight) as the amount of this information may vary from net to net.

In this simple example, information about five specific settings of a net is stored in nets, which has a unique identifier for each net setting called net_num.

nets <- data.frame(net_num=1:5,
                   lake=c("Eagle","Hart","Hart","Eagle","Millicent"),
                   date=c("3-Jul-21","3-Jul-21","5-Jul-21","6-Jul-21","6-Jul-21"))
nets

#R>    net_num      lake     date
#R>  1       1     Eagle 3-Jul-21
#R>  2       2      Hart 3-Jul-21
#R>  3       3      Hart 5-Jul-21
#R>  4       4     Eagle 6-Jul-21
#R>  5       5 Millicent 6-Jul-21

In a separate data frame the researchers recorded the species and number of each species caught in each net. Here there is a separate row for each species and its number caught with each row indexed to the specific net with the net_num key variable.

catch <- data.frame(net_num=c(1,1,2,2,2,4,4,5),
                    species=c("Bluegill","Largemouth Bass",
                              "Bluegill","Largemouth Bass","Bluntnose Minnow",
                              "Bluegill","Largemouth Bass",
                              "Largemouth Bass"),
                    number=c(7,3,19,2,56,3,6,3))
catch

#R>    net_num          species number
#R>  1       1         Bluegill      7
#R>  2       1  Largemouth Bass      3
#R>  3       2         Bluegill     19
#R>  4       2  Largemouth Bass      2
#R>  5       2 Bluntnose Minnow     56
#R>  6       4         Bluegill      3
#R>  7       4  Largemouth Bass      6
#R>  8       5  Largemouth Bass      3

These data frames illustrate a one-to-many relationship as each record in nets may be connected to multiple records in catch.

The catch data will be joined to the net data using a left_join() because it is important to keep track of nets that also did not catch fish. An inner_join() would only return nets where some fish were caught.

fishcatch <- left_join(nets,catch,by="net_num")
fishcatch

#R>    net_num      lake     date          species number
#R>  1       1     Eagle 3-Jul-21         Bluegill      7
#R>  2       1     Eagle 3-Jul-21  Largemouth Bass      3
#R>  3       2      Hart 3-Jul-21         Bluegill     19
#R>  4       2      Hart 3-Jul-21  Largemouth Bass      2
#R>  5       2      Hart 3-Jul-21 Bluntnose Minnow     56
#R>  6       3      Hart 5-Jul-21             <NA>     NA
#R>  7       4     Eagle 6-Jul-21         Bluegill      3
#R>  8       4     Eagle 6-Jul-21  Largemouth Bass      6
#R>  9       5 Millicent 6-Jul-21  Largemouth Bass      3

The visualizations in this section are modified from https://twitter.com/hadleywickham/status/684407629259526148/photo/1.↩︎
In this treatment, the unique id values are also uniquely colored to help track individual observations in the descriptions below.↩︎
Finding all combinations, however, is not needed to actually join two data frames.↩︎
If by= is not explicitly set by the user then the two data frames will be joined using the variable(s) that the two data frames have in common for by=. A message will be displayed about which variable(s) was used; check this message carefully to make sure you are joining by the variables that you want to join by.↩︎