Tibbles¶
Introduction¶
Throughout this book we work with "tibbles" instead of R's traditional data.frame.
Tibbles are data frames, but they tweak some older behaviours to make life a little easier.
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the tibble package, which provides opinionated data frames that make working in the tidyverse a little easier.
In most places, I'll use the terms tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them data.frames.
If this chapter leaves you wanting to learn more about tibbles, you might enjoy vignette("tibble").
Prerequisites¶
In this chapter we'll explore the tibble package, part of the core tidyverse.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Creating tibbles¶
Almost all of the functions that you'll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse.
Most other R packages use regular data.frames, so you might want to coerce a data.frame to a tibble.
You can do that with as_tibble():
as_tibble(mtcars)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
You can create a new tibble from individual vectors with tibble().
tibble() will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown in this example:
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
x | y | z |
---|---|---|
<int> | <dbl> | <dbl> |
1 | 1 | 2 |
2 | 1 | 5 |
3 | 1 | 10 |
4 | 1 | 17 |
5 | 1 | 26 |
If you're already familiar with data.frame(), note that tibble() does less: it never changes the names of variables and it never creates row names.
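To make the contrast with data.frame() concrete, here is a minimal sketch. The column name `my var` and the named vector are made up for illustration; the behaviour shown on the base R side relies on data.frame()'s defaults (check.names = TRUE, and row names taken from a named vector).

```r
library(tibble)

# A named vector; the values and names are illustrative only
scores <- c(a = 1, b = 2)

# data.frame() repairs the non-syntactic name and turns the vector names into row names
df_base <- data.frame(`my var` = scores)
names(df_base)     # "my.var"
rownames(df_base)  # "a" "b"

# tibble() keeps the name exactly as written and creates no row names
df_tbl <- tibble(`my var` = scores)
names(df_tbl)         # "my var"
has_rownames(df_tbl)  # FALSE
```

If you are converting an existing data.frame whose row names carry information, as_tibble() accepts a rownames argument that stores them in a regular column instead, for example as_tibble(mtcars, rownames = "model").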
Another way to create a tibble is with tribble(), short for transposed tibble.
tribble() is customized for data entry in code: column headings start with ~, and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
x | y | z |
---|---|---|
<chr> | <dbl> | <dbl> |
a | 2 | 3.6 |
b | 1 | 8.5 |
Non-syntactic names¶
It's possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names.
For example, they might not start with a letter, or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `:
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
:) |   | 2000 |
---|---|---|
<chr> | <chr> | <chr> |
smile | space | number |
You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
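As a small sketch of what that looks like in dplyr, the example below rebuilds the same tibble and then selects and renames two of its columns. The new names year and smile are arbitrary choices for illustration.

```r
library(tibble)
library(dplyr)

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

# Backticks are needed wherever a non-syntactic column is named in code
renamed <- tb |>
  select(`2000`, `:)`) |>
  rename(year = `2000`, smile = `:)`)

names(renamed)  # "year" "smile"
```

Renaming to syntactic names early on is often the least painful option, since every later step can then drop the backticks.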
Tibbles vs. data.frame¶
There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.
Printing¶
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature borrowed from str():
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
a | b | c | d | e |
---|---|---|---|---|
<dttm> | <date> | <int> | <dbl> | <chr> |
2022-08-04 01:55:01 | 2022-08-28 | 1 | 0.6214954893 | e |
2022-08-03 23:28:30 | 2022-08-16 | 2 | 0.7172978630 | k |
2022-08-03 12:30:32 | 2022-08-14 | 3 | 0.0611727890 | b |
2022-08-03 06:39:35 | 2022-08-20 | 4 | 0.4621063569 | y |
2022-08-03 20:34:28 | 2022-08-28 | 5 | 0.0047366079 | k |
2022-08-04 01:15:22 | 2022-08-17 | 6 | 0.0005293433 | d |
2022-08-03 11:27:17 | 2022-08-14 | 7 | 0.9520139829 | j |
2022-08-03 05:42:59 | 2022-08-18 | 8 | 0.8445059608 | z |
2022-08-03 21:23:24 | 2022-08-28 | 9 | 0.5152816789 | f |
2022-08-04 01:02:06 | 2022-08-04 | 10 | 0.7695424075 | a |
2022-08-03 18:55:06 | 2022-08-16 | 11 | 0.4667063369 | s |
2022-08-03 17:33:59 | 2022-08-23 | 12 | 0.6463241733 | j |
2022-08-04 02:12:13 | 2022-09-01 | 13 | 0.1948386112 | z |
2022-08-03 23:31:40 | 2022-08-17 | 14 | 0.0158112978 | c |
2022-08-03 07:18:33 | 2022-08-30 | 15 | 0.8494394217 | h |
2022-08-03 11:20:52 | 2022-08-26 | 16 | 0.0404442837 | l |
2022-08-04 00:58:59 | 2022-08-30 | 17 | 0.5670637824 | h |
2022-08-03 10:50:35 | 2022-08-30 | 18 | 0.6687090767 | c |
2022-08-03 16:04:43 | 2022-08-15 | 19 | 0.7714697982 | y |
2022-08-03 15:40:41 | 2022-08-03 | 20 | 0.9826351046 | y |
2022-08-03 11:41:33 | 2022-08-21 | 21 | 0.3696225001 | d |
2022-08-03 18:01:38 | 2022-08-04 | 22 | 0.1337647925 | l |
2022-08-03 13:32:47 | 2022-08-16 | 23 | 0.6028851504 | k |
2022-08-03 06:14:53 | 2022-08-21 | 24 | 0.7475312802 | q |
2022-08-03 14:59:35 | 2022-08-25 | 25 | 0.7109023386 | a |
2022-08-03 05:30:53 | 2022-08-28 | 26 | 0.7201127883 | v |
2022-08-03 21:29:59 | 2022-08-07 | 27 | 0.6366676912 | x |
2022-08-03 19:05:08 | 2022-08-07 | 28 | 0.7875030979 | o |
2022-08-04 02:41:57 | 2022-08-06 | 29 | 0.6847760833 | g |
2022-08-03 13:07:38 | 2022-08-11 | 30 | 0.8229427468 | e |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
2022-08-04 00:36:14 | 2022-08-05 | 971 | 0.25075892 | m |
2022-08-03 06:47:54 | 2022-08-30 | 972 | 0.08228306 | s |
2022-08-03 17:35:24 | 2022-08-17 | 973 | 0.15429027 | a |
2022-08-03 14:56:08 | 2022-08-17 | 974 | 0.38258004 | g |
2022-08-03 04:50:35 | 2022-08-03 | 975 | 0.86589112 | s |
2022-08-03 13:25:36 | 2022-08-13 | 976 | 0.61569196 | o |
2022-08-03 06:11:32 | 2022-08-11 | 977 | 0.33144681 | a |
2022-08-03 22:58:08 | 2022-08-17 | 978 | 0.63312409 | v |
2022-08-03 06:59:51 | 2022-08-20 | 979 | 0.84533735 | s |
2022-08-04 02:51:46 | 2022-08-21 | 980 | 0.51375382 | y |
2022-08-03 16:25:34 | 2022-08-16 | 981 | 0.89179696 | b |
2022-08-03 13:55:37 | 2022-08-30 | 982 | 0.00278528 | n |
2022-08-03 17:15:52 | 2022-08-26 | 983 | 0.16919621 | b |
2022-08-04 01:34:53 | 2022-08-31 | 984 | 0.22647501 | g |
2022-08-03 15:56:44 | 2022-08-21 | 985 | 0.16249420 | e |
2022-08-03 08:23:27 | 2022-08-12 | 986 | 0.71163972 | x |
2022-08-03 07:55:24 | 2022-08-19 | 987 | 0.91189797 | k |
2022-08-03 20:02:56 | 2022-08-29 | 988 | 0.08494227 | y |
2022-08-03 23:05:17 | 2022-08-23 | 989 | 0.65302098 | g |
2022-08-03 09:59:53 | 2022-08-16 | 990 | 0.22611852 | a |
2022-08-03 14:13:47 | 2022-08-12 | 991 | 0.28215355 | v |
2022-08-03 10:44:28 | 2022-09-01 | 992 | 0.66075576 | i |
2022-08-03 15:01:06 | 2022-08-06 | 993 | 0.24564900 | s |
2022-08-03 07:56:06 | 2022-08-03 | 994 | 0.99811822 | k |
2022-08-03 14:09:59 | 2022-08-07 | 995 | 0.44342410 | y |
2022-08-03 17:24:22 | 2022-08-28 | 996 | 0.07096956 | n |
2022-08-03 13:36:18 | 2022-08-13 | 997 | 0.68727876 | h |
2022-08-03 11:59:56 | 2022-08-09 | 998 | 0.14232666 | e |
2022-08-03 12:32:03 | 2022-08-06 | 999 | 0.24661607 | a |
2022-08-03 21:56:50 | 2022-08-08 | 1000 | 0.39282402 | b |
Where possible, they also use color to draw your eye to important differences.
One of the most important distinctions is between the string "NA" and the missing value, NA:
tibble(x = c("NA", NA))
x |
---|
<chr> |
NA |
NA |
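When the output is rendered without color, the two values can look identical. One way to tell them apart programmatically is is.na(), sketched here on the same tibble (the object name tb_na is just for illustration):

```r
library(tibble)

tb_na <- tibble(x = c("NA", NA))

# Only the second element is a genuine missing value
is.na(tb_na$x)  # FALSE TRUE
```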
Tibbles are designed to avoid overwhelming your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.
First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display.
width = Inf will display all columns:
nycflights13::flights |>
print(n = 10, width = Inf)
You can also control the default print behavior by setting options:
options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
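The options above can be tried out without permanently changing your session. This is a rough sketch: the values 20 and 5 are arbitrary, and it assumes your installed tibble version honors the tibble.* option names described here.

```r
library(tibble)

# options() returns the previous values, so they can be restored afterwards
old <- options(tibble.print_max = 20, tibble.print_min = 5, tibble.width = Inf)

# mtcars has 32 rows, which is more than print_max, so only print_min rows are shown
print(as_tibble(mtcars))

# Restore the earlier settings
options(old)
```

Saving the result of options() and passing it back is a general base R idiom, so the same pattern works for any other option you want to change temporarily.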
You can see a complete list of options by looking at the package help with package?tibble.
A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.
nycflights13::flights |>
View()
Subsetting¶
So far all the tools you’ve learned have worked with complete data frames.
If you want to pull out a single variable, you need some new tools, $ and [[.
[[ can extract by name or position; $ only extracts by name but is a little less typing.
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
- 0.188726971391588
- 0.276726307347417
- 0.850530292373151
- 0.130424492061138
- 0.975597662851214
df[["x"]]
- 0.188726971391588
- 0.276726307347417
- 0.850530292373151
- 0.130424492061138
- 0.975597662851214
# Extract by position
df[[1]]
- 0.188726971391588
- 0.276726307347417
- 0.850530292373151
- 0.130424492061138
- 0.975597662851214
To use these in a magrittr pipe (%>%), you’ll need to use the special placeholder .:
df %>% .$x
- 0.188726971391588
- 0.276726307347417
- 0.850530292373151
- 0.130424492061138
- 0.975597662851214
df %>% .[["x"]]
- 0.188726971391588
- 0.276726307347417
- 0.850530292373151
- 0.130424492061138
- 0.975597662851214
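The . placeholder belongs to magrittr's %>%; the native pipe |> used in the printing examples does not recognise it. One alternative, offered here as a suggestion rather than something the text above covers, is dplyr::pull(), which extracts a single column and works with either pipe:

```r
library(tibble)
library(dplyr)

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# pull() returns the column as a plain vector, just like df$x
x_values <- df |> pull(x)

identical(x_values, df$x)  # TRUE
```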
Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
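A short sketch of the difference, using a made-up single-column example (the names df_base and df_tbl are for illustration only):

```r
library(tibble)

df_base <- data.frame(abc = 1)
df_tbl <- tibble(abc = 1)

# Base R silently completes the partial name "a" to "abc"
base_result <- df_base$a

# The tibble refuses to guess: it returns NULL and warns about an unknown column.
# The warning is suppressed here only so the result can be stored quietly.
tbl_result <- suppressWarnings(df_tbl$a)

base_result  # 1
tbl_result   # NULL
```

The silent partial match is convenient at the console but can hide typos in scripts, which is the behaviour tibbles are designed to surface.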
Interacting with older code¶
Some older functions don't work with tibbles.
If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:
class(as.data.frame(tb))
The main reason that some older functions don't work with tibbles is the [ function.
We don't use [ much in this book because for data frames, dplyr::filter() and dplyr::select() typically allow you to solve the same problems with clearer code.
With base R data.frames, [ sometimes returns a data.frame, and sometimes returns a vector.
With tibbles, [ always returns another tibble.
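The difference is easiest to see by checking what comes back from a one-column and a two-column subset. The small data frame below is invented for illustration:

```r
library(tibble)

df_base <- data.frame(x = 1:3, y = c("a", "b", "c"))
df_tbl <- as_tibble(df_base)

# Base R: the return type depends on how many columns are selected
is.data.frame(df_base[, "x"])          # FALSE, a single column is dropped to a vector
is.data.frame(df_base[, c("x", "y")])  # TRUE

# Tibble: the return type is always a tibble
is_tibble(df_tbl[, "x"])               # TRUE
is_tibble(df_tbl[, c("x", "y")])       # TRUE
```

Code written for data.frames sometimes relies on the single-column case returning a vector, which is why converting with as.data.frame() can be needed before calling such functions.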
Exercises¶
1. How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data.frame.)
2. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data.frame behaviours cause you frustration?
   df <- data.frame(abc = 1, xyz = "a")
   df$x
   df[, "xyz"]
   df[, c("abc", "xyz")]
3. If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
4. Practice referring to non-syntactic names in the following data frame by:
   a. Extracting the variable called 1.
   b. Plotting a scatterplot of 1 vs 2.
   c. Creating a new column called 3 which is 2 divided by 1.
   d. Renaming the columns to one, two and three.
   annoying <- tibble(
     `1` = 1:10,
     `2` = `1` * 2 + rnorm(length(`1`))
   )
5. What does tibble::enframe() do? When might you use it?
6. What option controls how many additional column names are printed at the footer of a tibble?