Data Import
Overview
Teaching: 20 min
Exercises: 10 min
Questions
- How do you read data into R?
- What are some potential problems with reading data into R?
- How do you write data to files?
Objectives
- To understand how to read data into R
- To learn the most commonly used functions in readr for reading files
- To understand and be able to fix common problems in reading data files
- To know how to write data to a file
Data import
Introduction
Before transforming data we have to learn how to read them into R. We’ll start by learning how to read plain-text rectangular files into R. We’ll finish with a few pointers to packages that are useful for other types of data.
The tutorial is based on Chapter 10 of the R for Data Science book by Garrett Grolemund and Hadley Wickham. Check this book for more details.
Prerequisites
We’ll be using the readr package to load flat files into R. This package is part of the core tidyverse, “an opinionated collection of R packages designed for data science”.
# Install the tidyverse if it's not already installed, then load it
if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
Getting started
The four most commonly used functions in readr are read_csv(), read_csv2(), read_tsv(), and read_delim().
- read_csv() reads comma-delimited files,
- read_csv2() reads semicolon-separated files (common in countries where , is used as the decimal mark),
- read_tsv() reads tab-delimited files, and
- read_delim() reads files with any delimiter.
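For example, the same two-row table can be read from a semicolon-separated string with decimal commas, or from a pipe-delimited string (a minimal sketch using inline strings rather than files):

# Semicolons as separators, commas as decimal marks
read_csv2("a;b\n1,5;2,5")
# An arbitrary delimiter, passed via delim
read_delim("a|b\n1|2", delim = "|")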
These functions all have similar syntax; we will use read_csv() as an example.
The first argument to read_csv() is the path to the file to read.
heights <- read_csv("../data/heights.csv")
Error: '../data/heights.csv' does not exist in current working directory ('/Users/dlavrov/GitHub/Teaching/BCB546-all/BCB546-R/_episodes_rmd').
When you run read_csv() it prints out a column specification that gives the name and type of each column.
You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:
read_csv("a,b,c
1,2,3
4,5,6")
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:
- Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines, or use comment = "#" to drop all lines that start with (e.g.) #:

read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)

# A tibble: 1 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3

read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")

# A tibble: 1 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
- The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label the columns sequentially from X1 to Xn. Alternatively you can pass col_names a character vector which will be used as the column names:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

# A tibble: 2 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
Another option that commonly needs tweaking is na: it specifies the value (or values) used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
# A tibble: 1 x 3
a b c
<dbl> <dbl> <lgl>
1 1 2 NA
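You can also supply several missing-value strings at once (a small sketch; here both "." and "-99" are treated as NA):

read_csv("a,b\n1,.\n-99,2", na = c(".", "-99"))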
Compared to base R
If you’ve used R before, you might wonder why we’re not using read.csv(). There are a few good reasons to favour readr functions over the base equivalents:
- They are typically much faster (~10x) than their base equivalents. There is an even faster alternative, data.table::fread(), but it doesn’t fit quite so well into the tidyverse.
- They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names (see the sketch after this list). These are common sources of frustration with the base R functions.
- They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
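As a quick illustration of the column-name point (a minimal sketch; read.csv()'s text argument lets us use an inline string):

read.csv(text = "my var,x\n1,2")   # base R munges the name to my.var
read_csv("my var,x\n1,2")          # readr keeps it as `my var`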
Writing to a file
readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv().
Both functions increase the chances of the output file being read back in correctly by:
- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
If you want to export a csv file to Excel, use write_excel_csv(): this writes a special character (a “byte order mark”) at the start of the file, which tells Excel that you’re using the UTF-8 encoding.
The most important arguments are x (the data frame to save) and path (the location to save it). You can also specify how missing values are written with na, and whether you want to append to an existing file.
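For instance, you could write missing values as a dash instead of the default "NA" (a quick sketch; the "-" and the file name are arbitrary choices):

write_csv(challenge, "challenge-na.csv", na = "-")

With the defaults, saving is as simple as: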
write_csv(challenge, "challenge.csv")
Note that the type information is lost when you save to csv (here challenge is a tibble that we will build in the “Parsing a file” section below; its y column holds dates):
challenge
# A tibble: 2,000 x 2
x y
<dbl> <date>
1 404 NA
2 4172 NA
3 3004 NA
4 787 NA
5 37 NA
6 2332 NA
7 2489 NA
8 1449 NA
9 3665 NA
10 3863 NA
# … with 1,990 more rows
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
x = col_double(),
y = col_logical()
)
# A tibble: 2,000 x 2
x y
<dbl> <lgl>
1 404 NA
2 4172 NA
3 3004 NA
4 787 NA
5 37 NA
6 2332 NA
7 2489 NA
8 1449 NA
9 3665 NA
10 3863 NA
# … with 1,990 more rows
This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two alternatives:
- write_rds() and read_rds() are uniform wrappers around the base functions saveRDS() and readRDS(). These store data in R’s custom binary format called RDS:

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")

# A tibble: 2,000 x 2
x y
<dbl> <date>
1 404 NA
2 4172 NA
3 3004 NA
4 787 NA
5 37 NA
6 2332 NA
7 2489 NA
8 1449 NA
9 3665 NA
10 3863 NA
# … with 1,990 more rows
- The feather package implements a fast binary file format that can be shared across programming languages:

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 <NA>
#> 2 4172 <NA>
#> 3 3004 <NA>
#> 4 787 <NA>
#> 5 37 <NA>
#> 6 2332 <NA>
#> # ... with 1,994 more rows
Feather tends to be faster than RDS and is usable outside of R. However, RDS supports list-columns while feather currently does not.
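Since RDS supports list-columns, a tibble like this can be round-tripped through write_rds()/read_rds() (a minimal sketch):

df <- tibble(x = 1:2, y = list(1:3, letters))
write_rds(df, "listcol.rds")
read_rds("listcol.rds")   # the list-column y survives intact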
Other types of data
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:
- haven reads SPSS, Stata, and SAS files.
- readxl reads Excel files (both .xls and .xlsx).
- DBI, along with a database-specific backend (e.g. RMySQL, RSQLite, RPostgreSQL, etc.), allows you to run SQL queries against a database and return a data frame (see the sketch after this list).
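As a rough sketch of the DBI workflow (assuming RSQLite is installed; the in-memory database and the query are illustrative):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # a throwaway in-memory database
dbWriteTable(con, "mtcars", mtcars)               # copy a built-in data frame into it
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)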
For hierarchical data: use jsonlite (by Jeroen Ooms) for json, and xml2 for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/.
For other file types, try the R data import/export manual and the rio package.
Homework
Parse the UNIX assignment files (fang_et_al_genotypes.txt and snp_position.txt) into R.
- What problems did you encounter?
- Did R guess the data types correctly?
- If not, how did you fix it?
Try to use the t() function to transpose the first file.
- Did it work?
- Can you figure out how to make it work?
Parsing a vector (read if you run into problems or if you like details)
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the parse_*() functions. These functions take a character vector and return a more specialised vector like a logical, integer, or date. They are useful in their own right, but are also an important building block for readr. Like all functions in the tidyverse, the parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing:
parse_integer(c("1", "231", ".", "456"), na = ".")
[1] 1 231 NA 456
If parsing fails, you’ll get a warning:
x <- parse_integer(c("123", "345", "abc", "123.45"))
Warning: 2 parsing failures.
row col expected actual
3 -- an integer abc
4 -- no trailing characters 123.45
And the failures will be missing in the output:
x
[1] 123 345 NA NA
attr(,"problems")
# A tibble: 2 x 4
row col expected actual
<int> <int> <chr> <chr>
1 3 NA an integer abc
2 4 NA no trailing characters 123.45
There are several important parsers, but we will look at just two types:
- number parsers: parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser.
- character parser: parse_character().
The following sections describe these parsers in more detail.
Numbers
It seems like it should be straightforward to parse a number, but three problems make it tricky:
- People write numbers differently in different parts of the world.
- Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
- Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.
To address the first problem, readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument:
parse_double("1.23")
[1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
[1] 1.23
parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
parse_number("$100")
[1] 100
parse_number("20%")
[1] 20
parse_number("It cost $123.45")
[1] 123.45
The final problem is addressed by the combination of parse_number() and the locale: parse_number() will ignore the “grouping mark”:
# Used in America
parse_number("$123,456,789")
[1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
[1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
[1] 123456789
Strings
It seems like parse_character() should be really simple, but it’s not, as there are multiple ways to represent the same string. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():
charToRaw("Hadley")
[1] 48 61 64 6c 65 79
Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the American Standard Code for Information Interchange.
Things get more complicated for languages other than English as there are many competing standards for encoding non-English characters. Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. This is a good default, but will fail for data produced by older systems that don’t understand UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish. For example:
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x1
[1] "El Ni\xf1o was particularly bad this year"
x2
[1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
To fix the problem you need to specify the encoding in parse_character():
parse_character(x1, locale = locale(encoding = "Latin1"))
[1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
[1] "こんにちは"
How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. If not, readr provides guess_encoding() to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start.
guess_encoding(charToRaw(x1))
# A tibble: 2 x 2
encoding confidence
<chr> <dbl>
1 ISO-8859-1 0.46
2 ISO-8859-9 0.23
guess_encoding(charToRaw(x2))
# A tibble: 1 x 2
encoding confidence
<chr> <dbl>
1 KOI8-R 0.42
The first argument to guess_encoding() can either be a path to a file or, as in this case, a raw vector (useful if the strings are already in R).
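For example, you can point it at a file on disk (here one of the example files bundled with readr):

guess_encoding(readr_example("mtcars.csv"))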
Parsing a file
To figure out the type of each column, readr reads the first 1000 rows and applies some (moderately conservative) heuristics. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess(), which uses that guess to parse the column:
guess_parser("2010-10-01")
[1] "date"
guess_parser("15:01")
[1] "time"
guess_parser(c("TRUE", "FALSE"))
[1] "logical"
guess_parser(c("1", "5", "9"))
[1] "double"
guess_parser(c("12,352,561"))
[1] "number"
str(parse_guess("2010-10-10"))
Date[1:1], format: "2010-10-10"
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
- integer: contains only numeric characters (and -).
- double: contains only valid doubles (including numbers like 4.5e-5).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default time_format.
- date: matches the default date_format.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
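For example, a column that mixes digits and words falls through all the rules and stays character (a quick check):

guess_parser(c("1", "5", "banana"))
[1] "character"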
Problems
These defaults don’t always work for larger files. There are two basic problems:
- The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.
- The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
head(read_csv(readr_example("challenge.csv")))
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
x = col_double(),
y = col_logical()
)
Warning: 1000 parsing failures.
row col expected actual file
1001 y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1002 y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1003 y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1004 y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1005 y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
.... ... .................. .......... ..................................................................
See problems(...) for more details.
# A tibble: 6 x 2
x y
<dbl> <lgl>
1 404 NA
2 4172 NA
3 3004 NA
4 787 NA
5 37 NA
6 2332 NA
challenge <- read_csv(readr_example("challenge.csv"))
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
x = col_double(),
y = col_logical()
)
Warning: 1000 parsing failures.
row col expected actual file
1001 y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1002 y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1003 y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1004 y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
1005 y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Users/dlavrov/Library/R/4.0/library/readr/extdata/challenge.csv'
.... ... .................. .......... ..................................................................
See problems(...) for more details.
(Note the use of readr_example(), which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It’s always a good idea to explicitly pull out the problems(), so you can explore them in more depth:
problems(challenge)
# A tibble: 1,000 x 5
row col expected actual file
<int> <chr> <chr> <chr> <chr>
1 1001 y 1/0/T/F/TRUE/… 2015-01-… '/Users/dlavrov/Library/R/4.0/library/r…
2 1002 y 1/0/T/F/TRUE/… 2018-05-… '/Users/dlavrov/Library/R/4.0/library/r…
3 1003 y 1/0/T/F/TRUE/… 2015-09-… '/Users/dlavrov/Library/R/4.0/library/r…
4 1004 y 1/0/T/F/TRUE/… 2012-11-… '/Users/dlavrov/Library/R/4.0/library/r…
5 1005 y 1/0/T/F/TRUE/… 2020-01-… '/Users/dlavrov/Library/R/4.0/library/r…
6 1006 y 1/0/T/F/TRUE/… 2016-04-… '/Users/dlavrov/Library/R/4.0/library/r…
7 1007 y 1/0/T/F/TRUE/… 2011-05-… '/Users/dlavrov/Library/R/4.0/library/r…
8 1008 y 1/0/T/F/TRUE/… 2020-07-… '/Users/dlavrov/Library/R/4.0/library/r…
9 1009 y 1/0/T/F/TRUE/… 2011-04-… '/Users/dlavrov/Library/R/4.0/library/r…
10 1010 y 1/0/T/F/TRUE/… 2010-05-… '/Users/dlavrov/Library/R/4.0/library/r…
# … with 990 more rows
A good strategy is to work column by column until there are no problems remaining. Here we can see that all of the parsing problems are in the y column: the first 1000 rows contain only NAs, so readr guessed the column was logical, but the later rows contain dates. To fix the call, start by copying and pasting the column specification into your original call. Then you can tweak the type of the y column, for example reading it in as character:
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_character()
)
)
That stops the parsing failures, but if we look at the last few rows, you’ll see that the dates are stored in a character vector:
tail(challenge)
# A tibble: 6 x 2
x y
<dbl> <chr>
1 0.805 2019-11-21
2 0.164 2018-03-29
3 0.472 2014-08-04
4 0.718 2015-08-16
5 0.270 2020-02-04
6 0.608 2019-01-06
You can fix that by specifying that y is a date column:
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
# A tibble: 6 x 2
x y
<dbl> <date>
1 0.805 2019-11-21
2 0.164 2018-03-29
3 0.472 2014-08-04
4 0.718 2015-08-16
5 0.270 2020-02-04
6 0.608 2019-01-06
Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already; you use col_xyz() when you want to tell readr how to load the data.
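For example, the same date can be handled either way (a small sketch):

parse_date("2010-10-01")   # parse a character vector already in R
read_csv("x\n2010-10-01", col_types = cols(x = col_date()))   # tell readr how to load the column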
It’s a good idea to supply col_types from the print-out provided by readr in your read_csv() call. This ensures that you have a consistent and reproducible data import script. If you rely on the default guesses and your data changes, readr will continue to read it in. If you want to be really strict, use stop_for_problems(): that will throw an error and stop your script if there are any parsing problems.
Key Points
- Use read_csv() to read in CSV files.
- Use read_tsv() to read in TSV files.
- Use write_csv() and write_tsv() to write such files.
- Supply col_types to read functions in your script to ensure consistency.
- Use guess_encoding() to guess the encoding of strings in older files.