Module 2 Data Structures

2.1 Vectors

The vector is the primary unit for storing data in R. You can think of a vector as a set of similar items or elements. Vectors are created in R by combining or concatenating together the individual elements into a single set with c(). For example, the code below creates a vector of county names stored in an object called cn.⁶

cn <- c("Ashland","Bayfield","Douglas","Iron")

Similarly, the code below creates a vector of population sizes for the four counties in an object called pop.

pop <- c(15512,15056,43164,5687)

Individual elements in a vector are accessed by following the vector’s object name with square brackets that contain the numeric position of the element. For example, the second county in cn and the third population size in pop are extracted below.

cn[2]

#R>  [1] "Bayfield"

pop[3]

#R>  [1] 43164

Multiple elements are accessed by combining their position indices into a vector.

cn[c(2,3)]

#R>  [1] "Bayfield" "Douglas"
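As a further sketch (not one of the original examples), the positions can also be stored in their own vector first and then used with more than one vector. The object name keep is hypothetical, and length() is used only to confirm how many elements a vector holds.

```r
# Recreate the vectors from above so this block runs on its own
cn <- c("Ashland","Bayfield","Douglas","Iron")
pop <- c(15512,15056,43164,5687)

# 'keep' is a hypothetical name for a vector of positions
keep <- c(1,4)
cn[keep]     # county names in the first and fourth positions
pop[keep]    # population sizes in the same positions

length(cn)   # number of elements in cn
```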

Vector: A sequence of data elements of the same basic type.⁷

2.2 Data Classes

Vectors must contain the same “type” or class of items. There are four main classes of data in R.

numeric: Numbers that may have decimals; e.g., 12.3.
integer: Numbers that do not have decimals; e.g., 12.
character: Words; e.g., “Bayfield.”
logical: Values that must be either TRUE or FALSE.

The primary difference between the numeric and integer classes is how the data are stored in memory. For most of our purposes this difference is irrelevant, so we will rarely need to distinguish between the two classes. Note, however, that a value is entered as an integer by appending an “L” to it.

nabors <- c(4L,3L,3L,3L)
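The effect of the “L” can be checked directly. The following is a minimal sketch, with the hypothetical object names x_num and x_int, showing that the same value is stored as a different class depending on how it is entered.

```r
x_num <- 4    # entered without an "L", so stored as numeric
x_int <- 4L   # entered with an "L", so stored as integer

class(x_num)
class(x_int)

x_num == x_int   # the two are still equal as values
```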

The values in a logical vector must be either TRUE or FALSE.⁸

cheqbay <- c(TRUE,TRUE,FALSE,FALSE)

The class (i.e., type) of data in a vector is found with class().

class(cn)

#R>  [1] "character"

class(pop)

#R>  [1] "numeric"

class(nabors)

#R>  [1] "integer"

class(cheqbay)

#R>  [1] "logical"

A factor is a special class of data where character items are specifically classified as representing groups or levels of items. A vector can be converted to a factor class with factor().

fcn <- factor(cn)
fcn

#R>  [1] Ashland  Bayfield Douglas  Iron    
#R>  Levels: Ashland Bayfield Douglas Iron

class(fcn)

#R>  [1] "factor"

Factors have useful properties that will be discussed in more detail in Module 10.
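As a brief preview, and only as a sketch, the groups in a factor can be listed with levels() and counted with nlevels(). The party vector below is an assumed example with repeated values, so there are fewer groups than items.

```r
# Assumed example: five items that fall into two groups
party <- c("Dem","Dem","Dem","Rep","Rep")
fparty <- factor(party)

levels(fparty)    # the distinct groups
nlevels(fparty)   # the number of distinct groups
```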

As stated above, a vector should consist of items of the same class type. For example, this code does not make sense in most instances.

huh <- c("Ashland",15512,TRUE,3.65)

This code will not produce an error, but the result is likely not what you intended. For example, examine the class of this object.

class(huh)

#R>  [1] "character"

R uses hierarchical rules to assign a class for these odd situations. Rather than focusing on these rules it is more beneficial to remember that each vector should be of the same class type.
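For the curious, the sketch below illustrates the general direction of these rules: when classes are mixed, every item is converted to the most flexible class present. The combinations shown are illustrative assumptions rather than examples from the text above.

```r
class(c(TRUE, 3L))          # logical with integer   -> integer
class(c(3L, 3.65))          # integer with numeric   -> numeric
class(c(3.65, "Ashland"))   # numeric with character -> character

# Every item in huh has been converted to a character string
huh <- c("Ashland",15512,TRUE,3.65)
huh
```

Note that the numbers in huh are now text, so they can no longer be used in calculations without being converted back.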

Items in vectors should all be the same class type.

2.3 Data Frames

Vectors are useful for small numbers of items that have a single purpose. However, a data frame is more useful if you have multiple types of items (e.g., variables) recorded on a large number of individuals. Here we explore small data frames; larger data frames will be imported from external data sources in Module 3.

A data frame is a rectangular data structure in which each column is a vector, with all items in that column of the same class, that represents a variable recorded on individuals, and each row represents one individual. Different columns may be of different classes. Simple data frames can be constructed with data.frame() with named arguments set equal to vectors of data. For example, the following code produces a data frame object called counties that has three variables called name, pop, and party.

counties <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                       pop=c(15512,15056,43164,5687,16746),
                       party=c("Dem","Dem","Dem","Rep","Rep"))

Type the name of the data frame object to see its contents.

counties

#R>        name   pop party
#R>  1  Ashland 15512   Dem
#R>  2 Bayfield 15056   Dem
#R>  3  Douglas 43164   Dem
#R>  4     Iron  5687   Rep
#R>  5   Sawyer 16746   Rep

Columns of data frames correspond to variables whereas rows correspond to individuals.

Use str() to examine the structure of the data frame object. The result shows that the object is a data.frame, the number of individuals (labeled as obs for observations) and variables, and the name of each column/variable along with its class type abbreviation and a snapshot of the first few items in each column.

str(counties)

#R>  'data.frame':  5 obs. of  3 variables:
#R>   $ name : chr  "Ashland" "Bayfield" "Douglas" "Iron" ...
#R>   $ pop  : num  15512 15056 43164 5687 16746
#R>   $ party: chr  "Dem" "Dem" "Dem" "Rep" ...
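The individual pieces of this summary can also be retrieved one at a time. The following is a minimal sketch using the base R functions nrow(), ncol(), and names(), which are not otherwise discussed in this module.

```r
# Recreate the data frame so this block runs on its own
counties <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                       pop=c(15512,15056,43164,5687,16746),
                       party=c("Dem","Dem","Dem","Rep","Rep"))

nrow(counties)    # number of individuals (rows)
ncol(counties)    # number of variables (columns)
names(counties)   # names of the variables
```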

As data frames are rectangular, individual items are accessed by using both the row and column positions within square brackets after the data frame object name.

counties[1,2]  # first row, second column

#R>  [1] 15512

counties[3,1]  # third row, first column

#R>  [1] "Douglas"

Entire rows or columns are accessed by providing the numerical position of the row or column and leaving the other index blank.

counties[1,]  # First row

#R>       name   pop party
#R>  1 Ashland 15512   Dem

counties[,1]  # First column

#R>  [1] "Ashland"  "Bayfield" "Douglas"  "Iron"     "Sawyer"

Note that choosing a row, or more than one column, will return a data frame because the result will likely contain data of different classes.

class(counties[1,])      # one row is a data frame

#R>  [1] "data.frame"

class(counties[,c(1,2)]) # two columns is a data frame

#R>  [1] "data.frame"

However, choosing one column will return a vector of items all of the same class.

class(counties[,1])      # one column is a vector

#R>  [1] "character"
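Row and column selections can be combined by supplying a vector of positions for each, just as with vectors. The following sketch assumes a hypothetical object name, sub, for the result.

```r
# Recreate the data frame so this block runs on its own
counties <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                       pop=c(15512,15056,43164,5687,16746),
                       party=c("Dem","Dem","Dem","Rep","Rep"))

# First and third rows, first and second columns
sub <- counties[c(1,3),c(1,2)]
sub
class(sub)   # more than one column, so still a data frame
```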

As columns are named we can also use the name to access a specific column.

counties[,"pop"]

#R>  [1] 15512 15056 43164  5687 16746

This same column can be accessed by separating the data frame object name from the column name with a $.

counties$pop

#R>  [1] 15512 15056 43164  5687 16746

Again, a column is simply a vector, so single items in this vector are accessed in the usual way.

counties$pop[3]

#R>  [1] 43164

A $ is only used to separate a data frame name from the variable name within that data frame.
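The different ways of reaching the same items can be compared directly. The sketch below uses identical(), a base R function not otherwise covered here, to confirm that the approaches return the same result.

```r
# Recreate the data frame so this block runs on its own
counties <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                       pop=c(15512,15056,43164,5687,16746),
                       party=c("Dem","Dem","Dem","Rep","Rep"))

# Column by name in brackets versus column with $
identical(counties[,"pop"], counties$pop)

# Row and column positions versus $ followed by a position
identical(counties[3,2], counties$pop[3])
```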

2.3.1 Tibbles

Tibbles are a special form of data frame that was introduced as part of the “tidyverse.” Tibbles are created using tibble() in the same way that data.frame() was used previously.

library(tibble)  # provides tibble(); also loaded by library(tidyverse)
counties2 <- tibble(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                    pop=c(15512,15056,43164,5687,16746),
                    party=c("Dem","Dem","Dem","Rep","Rep"))

For small data frames a tibble will behave very much like a data frame. For example,

counties2

#R>  # A tibble: 5 x 3
#R>    name       pop party
#R>    <chr>    <dbl> <chr>
#R>  1 Ashland  15512 Dem  
#R>  2 Bayfield 15056 Dem  
#R>  3 Douglas  43164 Dem  
#R>  4 Iron      5687 Rep  
#R>  5 Sawyer   16746 Rep

counties2$pop

#R>  [1] 15512 15056 43164  5687 16746

There are, however, differences between tibbles and data frames as described in this introduction to tibbles. The primary difference that you will notice in this course is when you examine the contents of a tibble with a larger number of rows, columns, or both. When a large tibble is displayed, only the first 10 rows and as many columns as will fit on the width of your display are shown. In the example below, 141 rows and one variable are not shown, as indicated in the note at the bottom.

tibex

#R>  # A tibble: 151 x 11
#R>     fishID    tl scale1 scale2 scaleC finray1 finray2 finrayC otolith1 otolith2
#R>      <int> <int>  <int>  <int>  <int>   <int>   <int>   <int>    <int>    <int>
#R>   1      1   345      3      3      3       3       3       3        3        3
#R>   2      2   334      4      3      4       3       3       3        3        3
#R>   3      3   348      7      5      6       3       3       3        3        3
#R>   4      4   300      4      3      4       3       2       3        3        3
#R>   5      5   330      3      3      3       4       3       4        3        3
#R>   6      6   316      4      4      4       2       3       3        6        5
#R>   7      7   508      6      7      7       6       6       6        9       10
#R>   8      8   475      4      5      5       9       9       9       11       12
#R>   9      9   340      3      3      3       2       3       3        3        4
#R>  10     10   173      1      1      1       2       1       1        1        1
#R>  # ... with 141 more rows, and 1 more variable: otolithC <int>

Tibbles will be encountered frequently in subsequent modules as some tidyverse functions return tibbles by default. A tibble can be converted to a data frame with as.data.frame().
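The conversion, and one of the differences it removes, can be sketched as follows. This assumes the tibble package is installed, and counties3 is a hypothetical name for the converted object.

```r
library(tibble)  # assumed to be installed

counties2 <- tibble(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                    pop=c(15512,15056,43164,5687,16746),
                    party=c("Dem","Dem","Dem","Rep","Rep"))

class(counties2)        # a tibble has extra classes in addition to data.frame
class(counties2[,1])    # one column chosen with brackets is still a tibble

counties3 <- as.data.frame(counties2)
class(counties3)        # now only a data.frame
class(counties3[,1])    # one column is a vector again, as in Section 2.3
```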

2.4 Tidy Data

“Tidy data” is a term introduced in 2011 to describe a strict data organization that leads to consistency and efficiency in data analyses. Tidy data is described briefly below and in more detail in the R for Data Science book.

Data can be organized in different ways. For example, below is one representation of the simple data frame created in Section 2.3.

#R>        name   pop party
#R>  1  Ashland 15512   Dem
#R>  2 Bayfield 15056   Dem
#R>  3  Douglas 43164   Dem
#R>  4     Iron  5687   Rep
#R>  5   Sawyer 16746   Rep

However, these same data could be organized as below (among other possible organizations).

#R>       county variable value
#R>  1   Ashland      pop 15512
#R>  2   Ashland    party   Dem
#R>  3  Bayfield      pop 15056
#R>  4  Bayfield    party   Dem
#R>  5   Douglas      pop 43164
#R>  6   Douglas    party   Dem
#R>  7      Iron      pop  5687
#R>  8      Iron    party   Rep
#R>  9    Sawyer      pop 16746
#R>  10   Sawyer    party   Rep

The first data frame is “tidy” and is fairly easy to work with. However, the second data frame is not “tidy” and is much more difficult to use.
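To see why, consider computing the mean population size from each organization. The sketch below is an illustration only; the object names tidy and long are hypothetical, and the logical comparison inside the brackets uses a technique not covered in this module.

```r
# The tidy organization
tidy <- data.frame(name=c("Ashland","Bayfield","Douglas","Iron","Sawyer"),
                   pop=c(15512,15056,43164,5687,16746),
                   party=c("Dem","Dem","Dem","Rep","Rep"))

# The second organization; value must be character because it mixes
# population sizes with party names
long <- data.frame(county=rep(c("Ashland","Bayfield","Douglas","Iron","Sawyer"),each=2),
                   variable=rep(c("pop","party"),times=5),
                   value=c("15512","Dem","15056","Dem","43164","Dem",
                           "5687","Rep","16746","Rep"))

# Tidy: the variable is already a numeric column
mean(tidy$pop)

# Not tidy: first find the population rows, then convert text to numbers
class(long$value)
mean(as.numeric(long$value[long$variable=="pop"]))
```

Both calculations return the same mean, but the second requires two extra steps that are easy to get wrong.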

Tidy data frames follow three simple rules (Figure 2.1):

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Figure 2.1: Schematic illustration of the structure of tidy data (from [RStudio Data Wrangling Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf))

A common “challenge” when entering data in a tidy format occurs when data are recorded on individuals in separate groups. For example, the following data are methyl mercury levels recorded in mussels from two locations labeled as “impacted” and “reference.”

  impacted   0.011  0.054  0.056  0.095  0.051  0.077
  reference  0.031  0.040  0.029  0.066  0.018  0.042  0.044

In this case, one “observation” is a methyl mercury measurement on a mussel AND to which group the mussel belongs. Thus, each observation results in the recording of two variables. For example, the first mussel had a methyl mercury level of 0.011 AND it was at the impacted site. With this understanding these data are entered in a tidy format as follows.

mussels <- tibble(loc=c("impacted","impacted","impacted","impacted","impacted","impacted",
                        "reference","reference","reference","reference",
                        "reference","reference","reference"),
                  merc=c(0.011,0.054,0.056,0.095,0.051,0.077,
                         0.031,0.040,0.029,0.066,0.018,0.042,0.044))
mussels

#R>  # A tibble: 13 x 2
#R>     loc        merc
#R>     <chr>     <dbl>
#R>   1 impacted  0.011
#R>   2 impacted  0.054
#R>   3 impacted  0.056
#R>   4 impacted  0.095
#R>   5 impacted  0.051
#R>   6 impacted  0.077
#R>   7 reference 0.031
#R>   8 reference 0.04 
#R>   9 reference 0.029
#R>  10 reference 0.066
#R>  11 reference 0.018
#R>  12 reference 0.042
#R>  13 reference 0.044
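Typing each group label repeatedly is error-prone. As an alternative sketch, the base R function rep() can build the same loc column from the two labels and their counts. The object name mussels2 is hypothetical, and data.frame() is used here only so that the block does not depend on the tibble package.

```r
# Repeat "impacted" six times and "reference" seven times
loc <- rep(c("impacted","reference"),times=c(6,7))
merc <- c(0.011,0.054,0.056,0.095,0.051,0.077,
          0.031,0.040,0.029,0.066,0.018,0.042,0.044)

mussels2 <- data.frame(loc=loc,merc=merc)
str(mussels2)   # 13 observations of 2 variables
```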

Tidy data will facilitate data wrangling in subsequent modules and data analysis and graphing in other courses.

⁶ Perhaps short for “county names.”
⁷ Sometimes the elements will be coerced to be of the same basic type.
⁸ Make sure to note that both of these values are in all capital letters.