Data Structures
Overview
Teaching: 50 min
Exercises: 30 min
Questions
What are the basic data types in R?
How do I represent categorical information in R?
Objectives
To be aware of the different types of data.
To begin exploring data frames, and understand how they are related to vectors, factors and lists.
To be able to ask R questions about the type, class, and structure of an object.
Disclaimer: This lesson is based on a chapter from the Advanced R book.
Data structures
R’s base data structures can be organized by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types):
R’s base data structures
      Homogeneous     Heterogeneous
1d    Atomic vector   List
2d    Matrix          Data frame
nd    Array
Almost all other objects are built upon these foundations. Note that R has no 0-dimensional, or scalar types. Individual numbers or strings are vectors of length one.
Given an object, the best way to understand what data structures it’s composed of is to use str(). str() is short for structure and it gives a compact, human readable description of any R data structure.
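As a small illustration of str(), the sketch below applies it to two made-up objects (numbers and mixed are hypothetical names, not part of the lesson data). capture.output() is used only to keep the printed summary in a variable.

```r
# Hypothetical example objects, not taken from the lesson
numbers <- c(1.5, 2, 3.25)
mixed <- list(id = 1L, label = "a", flags = c(TRUE, FALSE))

# str() prints a compact description of each object
str(numbers)
str(mixed)

# capture.output() stores what str() prints, one string per printed line
numbers_summary <- capture.output(str(numbers))
```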
Vectors
The basic data structure in R is the vector. Vectors come in two flavors: atomic vectors and lists. They have three common properties:
- Type, typeof(), what it is.
- Length, length(), how many elements it contains.
- Attributes, attributes(), additional arbitrary metadata.
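The three properties can be inspected on any vector. The following is a minimal sketch using a hypothetical named integer vector v:

```r
# Hypothetical named integer vector
v <- c(first = 10L, second = 20L, third = 30L)

v_type <- typeof(v)        # the type: "integer"
v_length <- length(v)      # the number of elements: 3
v_attrs <- attributes(v)   # a list holding the only attribute here, names

v_type
v_length
v_attrs
```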
They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
Atomic vectors
There are four common types of atomic vectors: logical, integer, double (often called numeric), and character.
Atomic vectors are usually created with c(), short for combine.
dbl_var <- c(1, 2.5, 4.5)
# With the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")
Atomic vectors are always flat, even if you nest c()’s: try c(1, c(2, c(3, 4))).
Missing values are specified with NA, which is a logical vector of length 1. NA will always be coerced to the correct type if used inside c().
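Both points can be checked directly. In this sketch the variable names (nested, na_in_double, na_in_chr) are made up for illustration:

```r
# Nested calls to c() still produce one flat vector
nested <- c(1, c(2, c(3, 4)))
length(nested)             # 4

# On its own, NA is logical
typeof(NA)                 # "logical"

# Inside c(), NA takes on the type of the other elements
na_in_double <- c(1.5, NA)
na_in_chr <- c("a", NA)
typeof(na_in_double)       # "double"
typeof(na_in_chr)          # "character"
```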
Coercion
All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type. The coercion rules go: logical -> integer -> double -> complex -> character, where -> can be read as are transformed into. You can try to force coercion against this flow using the as.*() functions.
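Forcing coercion against the flow might look like the sketch below. The input values are invented for illustration, and suppressWarnings() is used only to hide the warning R gives when a string cannot be read as a number.

```r
# double -> integer: the fractional part is dropped (truncation toward zero)
dbl_to_int <- as.integer(c(1.9, -1.9))

# integer -> logical: 0 becomes FALSE, any other number becomes TRUE
int_to_lgl <- as.logical(c(0L, 1L, 5L))

# character -> double: strings that are not numbers become NA (with a warning)
chr_to_dbl <- suppressWarnings(as.double(c("2.5", "abc")))

dbl_to_int
int_to_lgl
chr_to_dbl
```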
Challenge 1
Create the following vectors and predict their type:
a <- c("a", 1)
b <- c(TRUE, 1)
c <- c(1L, 10)
d <- c(a, b, c)
Solution to Challenge 1
typeof(a); typeof(b); typeof(c); typeof(d)
[1] "character"
[1] "double"
[1] "double"
[1] "character"
Coercion often happens automatically. Most mathematical functions (+, log, abs, etc.) will coerce to a double or integer, and most logical operations (&, |, any, etc.) will coerce to a logical. You will usually get a warning message if the coercion might lose information. If confusion is likely, explicitly coerce with as.character(), as.double(), as.integer(), or as.logical().
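A few quick checks of automatic coercion, as a sketch with made-up variable names:

```r
# Arithmetic on logicals coerces them to integer
sum_of_flags <- TRUE + TRUE
typeof(sum_of_flags)               # "integer"

# log() coerces its argument to double
log_of_flag <- log(TRUE)
typeof(log_of_flag)                # "double"

# Logical operators coerce numbers to logical: 0 is FALSE, anything else TRUE
and_result <- c(0, 2, -1) & TRUE
and_result
```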
TIP
When a logical vector is coerced to an integer or double, TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction with sum() and mean(), which will calculate the total number and proportion of TRUEs, respectively.
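For instance, counting how many values pass a test. The temps vector and the threshold of 15 below are hypothetical:

```r
# Hypothetical temperature readings
temps <- c(12.1, 18.4, 25.0, 30.2, 9.8)

# A comparison returns a logical vector
is_warm <- temps > 15

n_warm <- sum(is_warm)        # number of TRUEs
share_warm <- mean(is_warm)   # proportion of TRUEs

n_warm
share_warm
```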
Lists
Lists are one-dimensional data structures that are different from atomic vectors because their elements can be of any type, including lists. We construct lists by using list() instead of c():
x <- c(1,2,3)
y <- list(1,2,3)
z <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
RESULTS
str(x); str(y); str(z)
num [1:3] 1 2 3
List of 3
 $ : num 1
 $ : num 2
 $ : num 3
List of 4
 $ : int [1:3] 1 2 3
 $ : chr "a"
 $ : logi [1:3] TRUE FALSE TRUE
 $ : num [1:2] 2.3 5.9
Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.
x <- list(list(list(list())))
str(x)
List of 1
$ :List of 1
..$ :List of 1
.. ..$ : list()
is.recursive(x)
[1] TRUE
c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them.
Example
Compare the results of list() and c():
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x); str(y)
List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ : num [1:2] 3 4
List of 4
 $ : num 1
 $ : num 2
 $ : num 3
 $ : num 4
The typeof() of a list is "list". You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().
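These functions can be tried on a small list. The objects below (num_list, as_list, flat_num, mixed_flat) are hypothetical examples:

```r
num_list <- list(1, 2, 3)
typeof(num_list)                 # "list"
is.list(num_list)                # TRUE

# as.list() turns each element of a vector into a list element
as_list <- as.list(1:3)

# unlist() flattens a list into an atomic vector
flat_num <- unlist(num_list)

# With mixed types, unlist() coerces to the most flexible type, as c() does
mixed_flat <- unlist(list(1L, "a", TRUE))
typeof(mixed_flat)               # "character"
```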
Discussion 1
What are the common types of atomic vector? How does a list differ from an atomic vector?
What will the commands is.vector(list(1,2,3)) and is.numeric(c(1L,2L,3L)) produce? How about typeof(as.numeric(c(1L,2L,3L)))?
Test your knowledge of vector coercion rules by predicting the output of the following uses of c():
c(1, FALSE)
c("a", 1)
c(list(1), "a")
c(TRUE, 1L)
Why do you need to use unlist() to convert a list to an atomic vector? Why doesn’t as.vector() work?
Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?
Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA) vs. c(FALSE, NA_character_).)
Attributes
All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes().
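A minimal sketch of setting and reading an attribute. The attribute name "my_attribute" and its value are arbitrary choices made for this example:

```r
y <- 1:10

# Set a single attribute by name
attr(y, "my_attribute") <- "This is a vector"

# Read it back individually
attr(y, "my_attribute")

# Or retrieve all attributes at once, as a named list
str(attributes(y))
```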
The three most important attributes:
- Names, a character vector giving each element a name, described in Names.
- Dimensions, used to turn vectors into matrices and arrays, described in Matrices and arrays.
- Class, used to implement the S3 object system.
Each of these attributes has a specific accessor function to get and set values: names(x), dim(x), and class(x).
Names
You can name elements in a vector in three ways:
- When creating it:
x <- c(a = 1, b = 2, c = 3)
- By modifying an existing vector in place:
x <- 1:3; names(x) <- c("a", "b", "c")
- By creating a modified copy of a vector:
x <- 1:3; x <- setNames(x, c("a", "b", "c"))
You can see these names by just typing the vector’s name. You can access them by using the names(x) function.
x <- 1:3;
x <- setNames(x, c("a", "b", "c"))
x
a b c
1 2 3
names(x)
[1] "a" "b" "c"
Names don’t have to be unique and not all elements of a vector need to have a name. If some names are missing, names() will return an empty string for those elements. If all names are missing, names() will return NULL.
You can create a new vector without names using unname(x), or remove names in place with names(x) <- NULL.
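The behaviour with missing names can be seen in a short sketch (partly_named and stripped are hypothetical variable names):

```r
# Only the first element is given a name
partly_named <- c(a = 1, 2, 3)
names(partly_named)          # "a" "" ""

# A vector with no names at all returns NULL
names(1:3)

# unname() returns a copy without names; the original is unchanged
stripped <- unname(partly_named)
names(stripped)              # NULL

# names(x) <- NULL removes the names in place
names(partly_named) <- NULL
```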
Factors
One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.
Examples
x <- c("a", "b", "b", "a")
x
[1] "a" "b" "b" "a"
x <- factor(x)
x
[1] a b b a
Levels: a b
class(x)
[1] "factor"
levels(x)
[1] "a" "b"
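To see the integer vector and the two attributes underneath a factor, one can look at it with typeof(), as.integer() and attributes(). This is a sketch with hypothetical variable names. suppressWarnings() hides the warning R gives when a value outside the levels is assigned.

```r
f <- factor(c("a", "b", "b", "a"))

f_type <- typeof(f)         # the underlying storage type: "integer"
f_codes <- as.integer(f)    # integer codes pointing into the levels: 1 2 2 1
f_attrs <- attributes(f)    # the levels and class attributes

# Assigning a value that is not a level produces NA (and a warning)
g <- f
suppressWarnings(g[2] <- "c")
g
```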
Factors are useful when you know the possible values a variable may take. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)
sex_char
m
3
table(sex_factor)
sex_factor
m f
3 0
Factors crop up all over R, and occasionally cause headaches for new R users.
Matrices and arrays
Adding a dim() attribute to an atomic vector allows it to behave like a multi-dimensional array. An array with two dimensions is called a matrix. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.
Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim().
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
# You can also modify an object in place by setting dim()
c <- 1:12
dim(c) <- c(3, 4)
c
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
dim(c) <- c(4, 3)
c
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
dim(c) <- c(2, 3, 2)
c
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
You can test if an object is a matrix or array using is.matrix() and is.array(), or by looking at the length of the dim(). as.matrix() and as.array() make it easy to turn an existing vector into a matrix or array.
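A short sketch of these tests and conversions on a plain vector (v, m and arr are hypothetical names):

```r
v <- 1:6
is.matrix(v)        # FALSE: a plain vector has no dim attribute
is.array(v)         # FALSE

# as.matrix() turns a vector into a one-column matrix
m <- as.matrix(v)
dim(m)              # 6 1
length(dim(m))      # 2, so m is a matrix

# as.array() turns a vector into a one-dimensional array
arr <- as.array(v)
dim(arr)            # 6
is.array(arr)       # TRUE
is.matrix(arr)      # FALSE: only one dimension
```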
Discussion 2
What does dim() return when applied to a vector?
If is.matrix(x) is TRUE, what will is.array(x) return?
How would you describe the following three objects? What makes them different to 1:5?
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))
Data frames
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list.
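The list-like and matrix-like sides of a data frame can be seen in a small sketch. df_demo is a hypothetical data frame, and stringsAsFactors = FALSE is set only so that the result is the same on older versions of R.

```r
df_demo <- data.frame(x = 1:3, y = c("a", "b", "c"),
                      stringsAsFactors = FALSE)

# Like a list: its length is the number of columns,
# and [[ ]] extracts one column as a vector
length(df_demo)        # 2
df_demo[["y"]]

# Like a matrix: it has two dimensions and can be indexed by row and column
dim(df_demo)           # 3 2
df_demo[2, "x"]
```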
Useful Data Frame Functions
head() - shows first 6 rows
tail() - shows last 6 rows
dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
nrow() - number of rows
ncol() - number of columns
str() - structure of data frame - name, type and preview of data in each column
names() - shows the names attribute for a data frame, which gives the column names.
sapply(dataframe, class) - shows the class of each column in the data frame
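The functions listed above can be tried on a small data frame. The scores data frame below is made up for this sketch:

```r
# Hypothetical data frame with 8 rows and 3 columns
scores <- data.frame(
  id = 1:8,
  group = rep(c("a", "b"), times = 4),
  score = c(71.5, 64, 88, 92.5, 55, 79, 83.5, 60),
  stringsAsFactors = FALSE
)

head(scores)       # first 6 rows
tail(scores, 2)    # last 2 rows (the default is 6)
dim(scores)        # 8 3
nrow(scores)       # 8
ncol(scores)       # 3
str(scores)        # name, type and preview of each column
names(scores)      # column names

# Class of each column, returned as a named character vector
col_classes <- sapply(scores, class)
col_classes
```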
Creation
You create a data frame using data.frame(), which takes named vectors as input:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: chr "a" "b" "c"
In R versions before 4.0.0, data.frame()’s default behavior was to turn strings into factors. Use stringsAsFactors = FALSE to suppress this behavior!
df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE)
str(df)
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: chr "a" "b" "c"
Testing and coercion
Because a data.frame is an S3 class, its type reflects the underlying vector used to build it: the list. To check if an object is a data frame, use class() or test explicitly with is.data.frame():
Examples
is.vector(df)
[1] FALSE
is.list(df)
[1] TRUE
is.data.frame(df)
[1] TRUE
typeof(df)
[1] "list"
class(df)
[1] "data.frame"
You can coerce an object to a data frame with as.data.frame():
- A vector will create a one-column data frame.
- A list will create one column for each element; it’s an error if they’re not all the same length.
- A matrix will create a data frame with the same number of columns and rows as the matrix.
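The three cases above can be sketched as follows. The inputs are invented for illustration, and tryCatch() is used only to capture the error message from the unequal-length list instead of stopping the script.

```r
# A vector becomes a one-column data frame
from_vector <- as.data.frame(1:3)
dim(from_vector)                     # 3 1

# A list becomes one column per element
from_list <- as.data.frame(list(x = 1:3, y = c("a", "b", "c")))
dim(from_list)                       # 3 2

# A matrix keeps its numbers of rows and columns
from_matrix <- as.data.frame(matrix(1:6, nrow = 2))
dim(from_matrix)                     # 2 3

# List elements of lengths 3 and 2 cannot be combined: this is an error
unequal <- tryCatch(
  as.data.frame(list(x = 1:3, y = 1:2)),
  error = function(e) conditionMessage(e)
)
unequal
```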
Discussion 3
What attributes does a data frame possess?
What does as.matrix() do when applied to a data frame with columns of different types?
Can you have a data frame with 0 rows? What about 0 columns?
Key Points
Atomic vectors are usually created with c(), short for combine;
Lists are constructed by using list();
Data frames are created with data.frame(), which takes named vectors as input;
The basic data types in R are double, integer, complex, logical, and character;
All objects can have arbitrary additional attributes, used to store metadata about the object;
Adding a dim() attribute to an atomic vector creates a multi-dimensional array.