Module 2 Data Structures
2.1 Vectors
The vector is the primary unit for storing data in R. You can think of a vector as a set of similar items or elements. Vectors are created in R by combining or concatenating the individual elements into a single set with c(). For example, the code below creates a vector of county names stored in an object called cn.
cn <- c("Ashland","Bayfield","Douglas","Iron")
Similarly, the code below creates a vector of population sizes for the four counties in an object called pop.
pop <- c(15512,15056,43164,5687)
Individual elements in a vector are accessed by following the vector’s object name with square brackets that contain the numeric position of the element. For example, the second county in cn and the third population size in pop are extracted below.
cn[2]
#R> [1] "Bayfield"
pop[3]
#R> [1] 43164
Multiple elements are accessed by combining their position indices into a vector.
cn[c(2,3)]
#R> [1] "Bayfield" "Douglas"
Vector: A sequence of data elements of the same basic type.
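As a small sketch of positional indexing, the code below shows that length() returns the number of elements in a vector and that position indices may be supplied in any order. The object name cn_demo is hypothetical and is used only so that this example does not overwrite cn.

```r
# cn_demo is a hypothetical copy of the county-name vector from above
cn_demo <- c("Ashland","Bayfield","Douglas","Iron")
length(cn_demo)    # number of elements in the vector
#R> [1] 4
cn_demo[c(3,1)]    # elements are returned in the order requested
#R> [1] "Douglas" "Ashland"
```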
2.2 Data Classes
Vectors must contain the same “type” or class of items. There are four main classes of data in R.
- numeric: Numbers that may have decimals; e.g., 12.3.
- integer: Numbers that do not have decimals; e.g., 12.
- character: Words; e.g., “Bayfield.”
- logical: Logical values that must be either TRUE or FALSE.
The primary difference between the numeric and integer classes is how the data are stored in memory. For most of our purposes this will be irrelevant, so there is no practical difference between these two classes for our work. However, integer values are entered into a vector by appending an “L” to the value.
nabors <- c(4L,3L,3L,3L)
The values in a logical vector must be either TRUE or FALSE.
cheqbay <- c(TRUE,TRUE,FALSE,FALSE)
The class (i.e., type) of data in a vector is found with class().
class(cn)
#R> [1] "character"
class(pop)
#R> [1] "numeric"
class(nabors)
#R> [1] "integer"
class(cheqbay)
#R> [1] "logical"
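The effect of the “L” suffix can be checked directly with class(). The short sketch below is illustrative only; it uses the single value 12 from the class descriptions above rather than any object from this module.

```r
class(12)     # a number typed without a suffix is stored as numeric
#R> [1] "numeric"
class(12L)    # the L suffix requests integer storage
#R> [1] "integer"
12 == 12L     # the two values still compare as equal
#R> [1] TRUE
```

This is consistent with the note above that the distinction rarely matters in practice: the values are equal even though they are stored differently.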
A factor is a special class of data where character items are specifically classified as representing groups or levels of items. A vector can be converted to a factor class with factor().
fcn <- factor(cn)
fcn
#R> [1] Ashland Bayfield Douglas Iron
#R> Levels: Ashland Bayfield Douglas Iron
class(fcn)
#R> [1] "factor"
Factors have useful properties that will be discussed in more detail in Module 10.
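As a brief preview, the groups recorded in a factor can be listed with levels() and counted with nlevels(). The sketch below uses hypothetical objects named party_demo and fparty_demo, which are not part of the examples above.

```r
# party_demo and fparty_demo are hypothetical objects used only for illustration
party_demo <- c("Dem","Dem","Dem","Rep","Rep")
fparty_demo <- factor(party_demo)
levels(fparty_demo)     # the distinct groups
#R> [1] "Dem" "Rep"
nlevels(fparty_demo)    # the number of groups
#R> [1] 2
```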
As stated above, a vector should consist of items of the same class type. For example, this code does not make sense in most instances.
huh <- c("Ashland",15512,TRUE,3.65)
However, this will not produce an error, though it likely will not be what you want it to be. For example, examine the class of this object.
class(huh)
#R> [1] "character"
R uses hierarchical rules to assign a class for these odd situations. Rather than focusing on these rules it is more beneficial to remember that each vector should be of the same class type.
Items in vectors should all be the same class type.
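For the curious, the sketch below illustrates the ordering among the four classes described above: when classes are mixed in c(), the result takes the most general class present, with logical giving way to integer, integer to numeric, and numeric to character. The values are arbitrary and chosen only for illustration.

```r
class(c(TRUE,2L))            # logical mixed with integer
#R> [1] "integer"
class(c(2L,3.65))            # integer mixed with numeric
#R> [1] "numeric"
class(c(3.65,"Ashland"))     # numeric mixed with character
#R> [1] "character"
```

This is why huh was a character vector: it contained at least one character item, so every other item was converted to character.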
2.3 Data Frames
Vectors are useful for small numbers of items that have a single purpose. However, a data frame is more useful if you have multiple types of items (e.g., variables) recorded on a large number of individuals. Here we explore small data frames; larger data frames will be imported from external data sources in Module 3.
A data frame is a rectangular data structure in which each column is a vector of a single class that represents a variable, and each row represents an individual on which those variables were recorded. Simple data frames can be constructed with data.frame() with named arguments set equal to vectors of data. For example, the following code produces a data frame object called counties that has three variables called name, pop, and party.
counties <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                       pop=c(15512,15056,43164,5687,16746),
                       party=c("Dem","Dem","Dem","Rep","Rep"))
Type the name of the data frame object to see its contents.
counties
#R> name pop party
#R> 1 Ashland 15512 Dem
#R> 2 Bayfield 15056 Dem
#R> 3 Douglas 43164 Dem
#R> 4 Iron 5687 Rep
#R> 5 Sawyer 16746 Rep
Columns of data frames correspond to variables whereas rows correspond to individuals.
Use str() to examine the structure of the data frame object, which will show that the object is a data.frame, show the number of individuals (labeled as obs for observations) and variables, and show the name of each column/variable along with its class type abbreviation and a snapshot of the first few items in each column.
str(counties)
#R> 'data.frame': 5 obs. of 3 variables:
#R> $ name : chr "Ashland" "Bayfield" "Douglas" "Iron" ...
#R> $ pop : num 15512 15056 43164 5687 16746
#R> $ party: chr "Dem" "Dem" "Dem" "Rep" ...
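The individual pieces of information reported by str() can also be retrieved separately. The sketch below uses nrow(), ncol(), and names() on a hypothetical copy of the data frame called counties_demo, which is recreated here so that the example stands on its own.

```r
# counties_demo is a hypothetical copy of the counties data frame
counties_demo <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                            pop=c(15512,15056,43164,5687,16746),
                            party=c("Dem","Dem","Dem","Rep","Rep"))
nrow(counties_demo)     # number of individuals (rows)
#R> [1] 5
ncol(counties_demo)     # number of variables (columns)
#R> [1] 3
names(counties_demo)    # names of the variables
#R> [1] "name"  "pop"   "party"
```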
As data frames are rectangular, individual items are accessed by using both the row and column positions within square brackets after the data frame object name.
counties[1,2] # first row, second column
#R> [1] 15512
counties[3,1] # third row, first column
#R> [1] "Douglas"
Entire rows or columns are accessed by providing the numerical position of the row or column and leaving the other index blank.
counties[1,] # First row
#R> name pop party
#R> 1 Ashland 15512 Dem
counties[,1] # First column
#R> [1] "Ashland" "Bayfield" "Douglas" "Iron" "Sawyer"
Note that selecting one or more rows, or more than one column, returns a data frame because the result will likely contain data of different classes.
class(counties[1,]) # one row is a data frame
#R> [1] "data.frame"
class(counties[,c(1,2)]) # two columns is a data frame
#R> [1] "data.frame"
However, choosing one column will return a vector of items all of the same class.
class(counties[,1]) # one column is a vector
#R> [1] "character"
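If a single column needs to remain a data frame, the drop= argument to the square brackets can be set to FALSE. The sketch below is a minimal illustration using a hypothetical copy of the data frame called counties_demo; it assumes a version of R (4.0 or later) in which data.frame() keeps character columns as character, as in the output shown above.

```r
# counties_demo is a hypothetical copy of the counties data frame
counties_demo <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                            pop=c(15512,15056,43164,5687,16746),
                            party=c("Dem","Dem","Dem","Rep","Rep"))
class(counties_demo[,1])               # default: one column is simplified to a vector
#R> [1] "character"
class(counties_demo[,1,drop=FALSE])    # drop=FALSE keeps the data frame structure
#R> [1] "data.frame"
```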
As columns are named, we can also use the name to access a specific column.
counties[,"pop"]
#R> [1] 15512 15056 43164 5687 16746
This same column can be accessed by separating the data frame object name from the column name with a $.
counties$pop
#R> [1] 15512 15056 43164 5687 16746
Again, a column is simply a vector, so you access single items in this vector in the usual way.
counties$pop[3]
#R> [1] 43164
A $ is only used to separate a data frame name from the variable name within that data frame.
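The sketch below checks that the three ways of extracting a column return the same vector, and shows that the extracted vector can be handed to a function such as sum(). The object counties_demo is a hypothetical copy of the data frame, recreated so that the example stands on its own.

```r
# counties_demo is a hypothetical copy of the counties data frame
counties_demo <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                            pop=c(15512,15056,43164,5687,16746),
                            party=c("Dem","Dem","Dem","Rep","Rep"))
identical(counties_demo$pop,counties_demo[,"pop"])    # by name with $ and with [ ]
#R> [1] TRUE
identical(counties_demo$pop,counties_demo[,2])        # by name and by position
#R> [1] TRUE
sum(counties_demo$pop)                                # total population of the five counties
#R> [1] 96165
```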
2.3.1 Tibbles
Tibbles are a special form of data frame that was introduced as part of the “tidyverse.” Tibbles are created using tibble() in the same way that data.frame() was used previously.
library(tibble) # tibble() comes from the tibble package, which library(tidyverse) also loads
counties2 <- tibble(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                    pop=c(15512,15056,43164,5687,16746),
                    party=c("Dem","Dem","Dem","Rep","Rep"))
For small data frames a tibble will behave exactly as a data frame. For example,
counties2
#R> # A tibble: 5 x 3
#R> name pop party
#R> <chr> <dbl> <chr>
#R> 1 Ashland 15512 Dem
#R> 2 Bayfield 15056 Dem
#R> 3 Douglas 43164 Dem
#R> 4 Iron 5687 Rep
#R> 5 Sawyer 16746 Rep
counties2$pop
#R> [1] 15512 15056 43164 5687 16746
There are, however, differences between tibbles and data frames as described in this introduction to tibbles. The primary difference that you will notice in this course is when you examine the contents of a tibble with a larger number of rows, columns, or both. When a large tibble is displayed only the first 10 rows and as many columns as will fit on the width of your display are shown. In the example below, 141 rows and one variable are not shown as seen in the note at the bottom.
tibex
#R> # A tibble: 151 x 11
#R> fishID tl scale1 scale2 scaleC finray1 finray2 finrayC otolith1 otolith2
#R> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#R> 1 1 345 3 3 3 3 3 3 3 3
#R> 2 2 334 4 3 4 3 3 3 3 3
#R> 3 3 348 7 5 6 3 3 3 3 3
#R> 4 4 300 4 3 4 3 2 3 3 3
#R> 5 5 330 3 3 3 4 3 4 3 3
#R> 6 6 316 4 4 4 2 3 3 6 5
#R> 7 7 508 6 7 7 6 6 6 9 10
#R> 8 8 475 4 5 5 9 9 9 11 12
#R> 9 9 340 3 3 3 2 3 3 3 4
#R> 10 10 173 1 1 1 2 1 1 1 1
#R> # ... with 141 more rows, and 1 more variable: otolithC <int>
Tibbles will be encountered frequently in subsequent modules as some tidyverse functions return tibbles by default. A tibble can be converted to a data frame with as.data.frame().
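One further difference that may matter when extracting columns is that selecting a single column of a tibble with square brackets returns a tibble rather than a vector. The sketch below illustrates this and the conversion with as.data.frame(); it assumes that the tibble package is installed, and the object names tib_demo and df_demo are hypothetical.

```r
library(tibble)    # assumes the tibble package is installed
# tib_demo is a hypothetical two-county tibble used only for illustration
tib_demo <- tibble(name=c("Ashland","Bayfield"),
                   pop=c(15512,15056))
inherits(tib_demo[,1],"tbl_df")    # one column of a tibble is still a tibble
#R> [1] TRUE
df_demo <- as.data.frame(tib_demo)
class(df_demo)                     # converted to a plain data frame
#R> [1] "data.frame"
class(df_demo[,1])                 # one column of a data frame is a vector
#R> [1] "character"
```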
2.4 Tidy Data
Tidy Data was a term introduced here in 2011 to describe a strict data organization that leads to consistency and efficiencies in data analyses. Tidy data is described briefly below and in more detail in the R for Data Science book.
Data can be organized in different ways. For example, below is one representation of the simple data frame created in Section 2.3.
#R> name pop party
#R> 1 Ashland 15512 Dem
#R> 2 Bayfield 15056 Dem
#R> 3 Douglas 43164 Dem
#R> 4 Iron 5687 Rep
#R> 5 Sawyer 16746 Rep
However, these same data could be organized as below (among other possible organizations).
#R> county variable value
#R> 1 Ashland pop 15512
#R> 2 Ashland party Dem
#R> 3 Bayfield pop 15056
#R> 4 Bayfield party Dem
#R> 5 Douglas pop 43164
#R> 6 Douglas party Dem
#R> 7 Iron pop 5687
#R> 8 Iron party Rep
#R> 9 Sawyer pop 16746
#R> 10 Sawyer party Rep
The first data frame is “tidy” and is fairly easy to work with. However, the second data frame is not “tidy” and is much more difficult to use.
Tidy data frames follow three simple rules (Figure 2.1):
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
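To make the rules concrete, the sketch below rebuilds a tidy data frame from a two-county version of the second organization shown above, using only base R. It is one possible approach under stated assumptions rather than a recommended workflow: the object names (long_demo, pops, parties, tidy_demo) are hypothetical, and the row selection with == and the merge() function are not otherwise covered in this module. Note that the value column must be character because it mixes numbers and words.

```r
# long_demo is a hypothetical two-county version of the second layout above
long_demo <- data.frame(county=c("Ashland","Ashland","Bayfield","Bayfield"),
                        variable=c("pop","party","pop","party"),
                        value=c("15512","Dem","15056","Dem"))
# Separate the rows for each variable, keeping the county and its value
pops <- long_demo[long_demo$variable=="pop",c("county","value")]
parties <- long_demo[long_demo$variable=="party",c("county","value")]
# Rename the value column after the variable that it holds
names(pops)[2] <- "pop"
names(parties)[2] <- "party"
# Join by county so that each county has one row, then return pop to numeric
tidy_demo <- merge(pops,parties,by="county")
tidy_demo$pop <- as.numeric(tidy_demo$pop)
tidy_demo
```

The result has one row per county and one column per variable (county, pop, and party), which satisfies the three rules.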
A common “challenge” when entering data in a tidy format occurs when data is recorded on individuals in separate groups. For example, the following data are methyl mercury levels recorded in mussels from two locations labeled as “impacted” and “reference.”
impacted 0.011 0.054 0.056 0.095 0.051 0.077
reference 0.031 0.040 0.029 0.066 0.018 0.042 0.044
In this case, one “observation” is a methyl mercury measurement on a mussel AND the group to which the mussel belongs. Thus, each observation results in the recording of two variables. For example, the first mussel had a methyl mercury level of 0.011 AND it was at the impacted site. With this understanding these data are entered in a tidy format as follows.
mussels <- tibble(loc=c("impacted","impacted","impacted","impacted","impacted","impacted",
                        "reference","reference","reference","reference",
                        "reference","reference","reference"),
                  merc=c(0.011,0.054,0.056,0.095,0.051,0.077,
                         0.031,0.040,0.029,0.066,0.018,0.042,0.044))
mussels
#R> # A tibble: 13 x 2
#R> loc merc
#R> <chr> <dbl>
#R> 1 impacted 0.011
#R> 2 impacted 0.054
#R> 3 impacted 0.056
#R> 4 impacted 0.095
#R> 5 impacted 0.051
#R> 6 impacted 0.077
#R> 7 reference 0.031
#R> 8 reference 0.04
#R> 9 reference 0.029
#R> 10 reference 0.066
#R> 11 reference 0.018
#R> 12 reference 0.042
#R> 13 reference 0.044
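Typing the group label once for every individual becomes tedious with larger data. As a hedged alternative, the sketch below builds the same group labels with rep(), which repeats each label the number of times given in times=. It uses data.frame() rather than tibble() so that it runs without extra packages, and the object name mussels_demo is hypothetical.

```r
# mussels_demo is a hypothetical data frame version of the mussel data
mussels_demo <- data.frame(loc=rep(c("impacted","reference"),times=c(6,7)),
                           merc=c(0.011,0.054,0.056,0.095,0.051,0.077,
                                  0.031,0.040,0.029,0.066,0.018,0.042,0.044))
nrow(mussels_demo)                        # one row per mussel
#R> [1] 13
sum(mussels_demo$loc=="impacted")         # number of impacted mussels
#R> [1] 6
sum(mussels_demo$loc=="reference")        # number of reference mussels
#R> [1] 7
```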
Tidy data will facilitate data wrangling in subsequent modules and data analysis and graphing in other courses.