Tidy data¶
Introduction¶
"Happy families are all alike; every unhappy family is unhappy in its own way." --– Leo Tolstoy
"Tidy datasets are all alike, but every messy dataset is messy in its own way." --– Hadley Wickham
In this chapter, you will learn a consistent way to organise your data in R, an organisation called tidy data. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the tidyr package. If you'd like to learn more about the underlying theory, you might enjoy the Tidy Data paper published in the Journal of Statistical Software, http://www.jstatsoft.org/v59/i10/paper.
Prerequisites¶
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.
library(tidyverse)
Warning message in system("timedatectl", intern = TRUE): “running command 'timedatectl' had status 1” ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ✔ ggplot2 3.3.6 ✔ purrr 0.3.4 ✔ tibble 3.1.7 ✔ dplyr 1.0.9 ✔ tidyr 1.2.0 ✔ stringr 1.4.0 ✔ readr 2.1.2 ✔ forcats 0.5.1 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
Tidy data¶
You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables country, year, population, and cases, but each dataset organises the values in a different way.
table1
country | year | cases | population |
---|---|---|---|
<chr> | <int> | <int> | <int> |
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
table2
country | year | type | count |
---|---|---|---|
<chr> | <int> | <chr> | <int> |
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
table3
country | year | rate | |
---|---|---|---|
<chr> | <int> | <chr> | |
1 | Afghanistan | 1999 | 745/19987071 |
2 | Afghanistan | 2000 | 2666/20595360 |
3 | Brazil | 1999 | 37737/172006362 |
4 | Brazil | 2000 | 80488/174504898 |
5 | China | 1999 | 212258/1272915272 |
6 | China | 2000 | 213766/1280428583 |
# Spread across two tibbles
table4a # cases
country | 1999 | 2000 | |
---|---|---|---|
<chr> | <int> | <int> | |
1 | Afghanistan | 745 | 2666 |
2 | Brazil | 37737 | 80488 |
3 | China | 212258 | 213766 |
table4b # population
country | 1999 | 2000 | |
---|---|---|---|
<chr> | <int> | <int> | |
1 | Afghanistan | 19987071 | 20595360 |
2 | Brazil | 172006362 | 174504898 |
3 | China | 1272915272 | 1280428583 |
These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
The following figure shows the rules visually.
These three rules are interrelated because it's impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:
- Put each dataset in a tibble.
- Put each variable in a column.
In this example, only table1
is tidy. It's the only representation where each column is a variable.
Why ensure that your data is tidy? There are two main advantages:
There's a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity.
There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with table1
.
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
country | year | cases | population | rate |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | <dbl> |
Afghanistan | 1999 | 745 | 19987071 | 0.372741 |
Afghanistan | 2000 | 2666 | 20595360 | 1.294466 |
Brazil | 1999 | 37737 | 172006362 | 2.193930 |
Brazil | 2000 | 80488 | 174504898 | 4.612363 |
China | 1999 | 212258 | 1272915272 | 1.667495 |
China | 2000 | 213766 | 1280428583 | 1.669488 |
# Compute cases per year
table1 %>%
count(year, wt = cases)
year | n |
---|---|
<int> | <int> |
1999 | 250740 |
2000 | 296920 |
# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
Exercises¶
Using prose, describe how the variables and observations are organised in each of the sample tables.
Compute the
rate
fortable2
, andtable4a
+table4b
. You will need to perform four operations:- Extract the number of TB cases per country per year.
- Extract the matching population per country per year.
- Divide cases by population, and multiply by 10000.
- Store back in the appropriate place.
Which representation is easiest to work with? Which is hardest? Why?
Recreate the plot showing change in cases over time using
table2
instead oftable1
. What do you need to do first?
Pivoting¶
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:
Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a lot of time working with data.
Data is often organised to facilitate some use other than analysis. For example, data is often organised to make entry as easy as possible.
This means for most real analyses, you'll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky! To fix these problems, you'll need the two most important functions in tidyr: pivot_longer()
and pivot_wider()
.
Longer¶
A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Take table4a
: the column names 1999
and 2000
represent values of the year
variable, the values in the 1999
and 2000
columns represent values of the cases
variable, and each row represents two observations, not one.
table4a
country | 1999 | 2000 | |
---|---|---|---|
<chr> | <int> | <int> | |
1 | Afghanistan | 745 | 2666 |
2 | Brazil | 37737 | 80488 |
3 | China | 212258 | 213766 |
To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables. To describe that operation we need three parameters:
The set of columns whose names are values, not variables. In this example, those are the columns
1999
and2000
.The name of the variable to move the column names to. Here it is
year
.The name of the variable to move the column values to. Here it's
cases
.
Together those parameters generate the call to pivot_longer()
:
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
country | year | cases |
---|---|---|
<chr> | <chr> | <int> |
Afghanistan | 1999 | 745 |
Afghanistan | 2000 | 2666 |
Brazil | 1999 | 37737 |
Brazil | 2000 | 80488 |
China | 1999 | 212258 |
China | 2000 | 213766 |
The columns to pivot are specified with dplyr::select()
style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names (because they don't start with a letter) so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see select.
year
and cases
do not exist in table4a
so we put their names in quotes.
In the final result, the pivoted columns are dropped, and we get new year
and cases
columns. Otherwise, the relationships between the original variables are preserved.
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns. I don’t believe it makes sense to describe a dataset as being in “long form”. Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B.
We can use pivot_longer()
to tidy table4b
in a similar fashion. The only difference is the variable stored in the cell values:
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
country | year | population |
---|---|---|
<chr> | <chr> | <int> |
Afghanistan | 1999 | 19987071 |
Afghanistan | 2000 | 20595360 |
Brazil | 1999 | 172006362 |
Brazil | 2000 | 174504898 |
China | 1999 | 1272915272 |
China | 2000 | 1280428583 |
To combine the tidied versions of table4a
and table4b
into a single tibble, we need to use dplyr::left_join()
, which you'll learn about in [relational data].
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
country | year | cases | population |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
Wider¶
pivot_wider()
is the opposite of pivot_longer()
. You use it when an observation is scattered across multiple rows. For example, take table2
: an observation is a country in a year, but each observation is spread across two rows.
table2
country | year | type | count |
---|---|---|---|
<chr> | <int> | <chr> | <int> |
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
To tidy this up, we first analyse the representation in similar way to pivot_longer()
. This time, however, we only need two parameters:
The column to take variable names from. Here, it's
type
.The column to take values from. Here it's
count
.
Once we've figured that out, we can use pivot_wider()
, as shown programmatically below, and visually in next figure.
table2 %>%
pivot_wider(names_from = type, values_from = count)
country | year | cases | population |
---|---|---|---|
<chr> | <int> | <int> | <int> |
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
As you might have guessed from their names, pivot_wider()
and pivot_longer()
are complements. pivot_longer()
makes wide tables narrower and longer; pivot_wider()
makes long tables shorter and wider.
Exercises¶
- Why are
pivot_longer()
andpivot_wider()
not perfectly symmetrical?
Carefully consider the following example:
```r
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
```
(Hint: look at the variable types and think about column names.)
pivot_longer()
has a names_ptypes
argument, e.g.
names_ptypes = list(year = double())
. What does it do?
Why does this code fail?
table4a %>% pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases") #> Error: Can't subset columns that don't exist. #> ✖ Locations 1999 and 2000 don't exist. #> ℹ There are only 3 columns.
- What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
```r
people <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
```
- Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?
```r
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
```
Separating and uniting¶
So far you've learned how to tidy table2
and table4
, but not table3
. table3
has a different problem: we have one column (rate
) that contains two variables (cases
and population
). To fix this problem, we'll need the separate()
function. You'll also learn about the complement of separate()
: unite()
, which you use if a single variable is spread across multiple columns.
Separate¶
separate()
pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3
:
table3
country | year | rate | |
---|---|---|---|
<chr> | <int> | <chr> | |
1 | Afghanistan | 1999 | 745/19987071 |
2 | Afghanistan | 2000 | 2666/20595360 |
3 | Brazil | 1999 | 37737/172006362 |
4 | Brazil | 2000 | 80488/174504898 |
5 | China | 1999 | 212258/1272915272 |
6 | China | 2000 | 213766/1280428583 |
The rate
column contains both cases
and population
variables, and we need to split it into two variables. separate()
takes the name of the column to separate, and the names of the columns to separate into, as shown in next figure and the code below.
table3 %>%
separate(rate, into = c("cases", "population"))
country | year | cases | population |
---|---|---|---|
<chr> | <int> | <chr> | <chr> |
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
{r}
table3 %>%
separate(rate, into = c("cases", "population"))
By default, separate()
will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, separate()
split the values of rate
at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the sep
argument of separate()
. For example, we could rewrite the code above as:
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
country | year | cases | population |
---|---|---|---|
<chr> | <int> | <chr> | <chr> |
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
(Formally, sep
is a regular expression, which you'll learn more about in [strings].)
Look carefully at the column types: you'll notice that cases
and population
are character columns. This is the default behaviour in separate()
: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask separate()
to try and convert to better types using convert = TRUE
:
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
country | year | cases | population |
---|---|---|---|
<chr> | <int> | <int> | <int> |
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
You can also pass a vector of integers to sep
. separate()
will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings. When using integers to separate strings, the length of sep
should be one less than the number of names in into
.
You can use this arrangement to separate the last two digits of each year. This make this data less tidy, but is useful in other cases, as you'll see in a little bit.
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
country | century | year | rate |
---|---|---|---|
<chr> | <chr> | <chr> | <chr> |
Afghanistan | 19 | 99 | 745/19987071 |
Afghanistan | 20 | 00 | 2666/20595360 |
Brazil | 19 | 99 | 37737/172006362 |
Brazil | 20 | 00 | 80488/174504898 |
China | 19 | 99 | 212258/1272915272 |
China | 20 | 00 | 213766/1280428583 |
Unite¶
unite()
is the inverse of separate()
: it combines multiple columns into a single column. You'll need it much less frequently than separate()
, but it's still a useful tool to have in your back pocket.
We can use unite()
to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5
. unite()
takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select()
style:
table5 %>%
unite(new, century, year)
country | new | rate |
---|---|---|
<chr> | <chr> | <chr> |
Afghanistan | 19_99 | 745/19987071 |
Afghanistan | 20_00 | 2666/20595360 |
Brazil | 19_99 | 37737/172006362 |
Brazil | 20_00 | 80488/174504898 |
China | 19_99 | 212258/1272915272 |
China | 20_00 | 213766/1280428583 |
In this case we also need to use the sep
argument. The default will place an underscore (_
) between the values from different columns. Here we don't want any separator so we use ""
:
table5 %>%
unite(new, century, year, sep = "")
country | new | rate |
---|---|---|
<chr> | <chr> | <chr> |
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
Exercises¶
- What do the
extra
andfill
arguments do inseparate()
? Experiment with the various options for the following two toy datasets.
```r
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
Both
unite()
andseparate()
have aremove
argument. What does it do? Why would you set it toFALSE
?Compare and contrast
separate()
andextract()
. Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
Missing values¶
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:
- Explicitly, i.e. flagged with
NA
. - Implicitly, i.e. simply not present in the data.
Let's illustrate this idea with a very simple data set:
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
There are two missing values in this dataset:
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains
NA
.The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:
stocks %>%
pivot_wider(names_from = year, values_from = return)
qtr | 2015 | 2016 |
---|---|---|
<dbl> | <dbl> | <dbl> |
1 | 1.88 | NA |
2 | 0.59 | 0.92 |
3 | 0.35 | 0.17 |
4 | NA | 2.66 |
Because these explicit missing values may not be important in other representations of the data, you can set values_drop_na = TRUE
in pivot_longer()
to turn explicit missing values implicit:
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
qtr | year | return |
---|---|---|
<dbl> | <chr> | <dbl> |
1 | 2015 | 1.88 |
2 | 2015 | 0.59 |
2 | 2016 | 0.92 |
3 | 2015 | 0.35 |
3 | 2016 | 0.17 |
4 | 2016 | 2.66 |
Another important tool for making missing values explicit in tidy data is complete()
:
stocks %>%
complete(year, qtr)
year | qtr | return |
---|---|---|
<dbl> | <dbl> | <dbl> |
2015 | 1 | 1.88 |
2015 | 2 | 0.59 |
2015 | 3 | 0.35 |
2015 | 4 | NA |
2016 | 1 | NA |
2016 | 2 | 0.92 |
2016 | 3 | 0.17 |
2016 | 4 | 2.66 |
complete()
takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NA
s where necessary.
There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
You can fill in these missing values with fill()
. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
treatment %>%
fill(person)
person | treatment | response |
---|---|---|
<chr> | <dbl> | <dbl> |
Derrick Whitmore | 1 | 7 |
Derrick Whitmore | 2 | 10 |
Derrick Whitmore | 3 | 9 |
Katherine Burke | 1 | 4 |
Exercises¶
Compare and contrast the
fill
arguments topivot_wider()
andcomplete()
.What does the direction argument to
fill()
do?
Case Study¶
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The tidyr::who
dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:
who
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | ⋯ | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
Afghanistan | AF | AFG | 1980 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1981 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1982 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1983 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1984 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1985 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1986 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1987 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1988 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1989 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1990 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1991 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1992 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1993 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1994 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1995 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1996 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1997 | 0 | 10 | 6 | 3 | 5 | 2 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1998 | 30 | 129 | 128 | 90 | 89 | 64 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1999 | 8 | 55 | 55 | 47 | 34 | 21 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2000 | 52 | 228 | 183 | 149 | 129 | 94 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2001 | 129 | 379 | 349 | 274 | 204 | 139 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2002 | 90 | 476 | 481 | 368 | 246 | 241 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2003 | 127 | 511 | 436 | 284 | 256 | 288 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2004 | 139 | 537 | 568 | 360 | 358 | 386 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2005 | 151 | 606 | 560 | 472 | 453 | 470 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2006 | 193 | 837 | 791 | 574 | 572 | 572 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2007 | 186 | 856 | 840 | 597 | 566 | 630 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2008 | 187 | 941 | 773 | 545 | 570 | 630 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 2009 | 200 | 906 | 705 | 499 | 491 | 596 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | ZW | ZWE | 1984 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1985 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1986 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1987 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1988 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1989 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1990 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1991 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1992 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1993 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1994 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1995 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1996 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1997 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1998 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 1999 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2000 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2001 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2002 | 191 | 600 | 2548 | 1662 | 744 | 315 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2003 | 133 | 874 | 3048 | 2228 | 981 | 367 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2004 | 187 | 833 | 2908 | 2298 | 1056 | 366 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2005 | 210 | 837 | 2264 | 1855 | 762 | 295 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2006 | 215 | 736 | 2391 | 1939 | 896 | 348 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2007 | 138 | 500 | 3693 | 0 | 716 | 292 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2008 | 127 | 614 | 0 | 3316 | 704 | 263 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2009 | 125 | 578 | NA | 3471 | 681 | 293 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Zimbabwe | ZW | ZWE | 2013 | NA | NA | NA | NA | NA | NA | ⋯ | 2349 | 1206 | 1208 | 1252 | 2069 | 4649 | 3526 | 1453 | 811 | 725 |
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who
is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll usually need to string together multiple verbs into a pipeline.
The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got:
It looks like
country
,iso2
, andiso3
are three variables that redundantly specify the country.year
is clearly also a variable.We don't know what all the other columns are yet, but given the structure in the variable names (e.g.
new_sp_m014
,new_ep_m014
,new_ep_f014
) these are likely to be values, not variables.
So we need to gather together all the columns from new_sp_m014
to newrel_f65
. We don't know what those values represent yet, so we'll give them the generic name "key"
. We know the cells represent the count of cases, so we'll use the variable cases
. There are a lot of missing values in the current representation, so for now we'll use values_drop_na
just so we can focus on the values that are present.
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
country | iso2 | iso3 | year | key | cases |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <int> | <chr> | <int> |
Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_m1524 | 10 |
Afghanistan | AF | AFG | 1997 | new_sp_m2534 | 6 |
Afghanistan | AF | AFG | 1997 | new_sp_m3544 | 3 |
Afghanistan | AF | AFG | 1997 | new_sp_m4554 | 5 |
Afghanistan | AF | AFG | 1997 | new_sp_m5564 | 2 |
Afghanistan | AF | AFG | 1997 | new_sp_m65 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_f014 | 5 |
Afghanistan | AF | AFG | 1997 | new_sp_f1524 | 38 |
Afghanistan | AF | AFG | 1997 | new_sp_f2534 | 36 |
Afghanistan | AF | AFG | 1997 | new_sp_f3544 | 14 |
Afghanistan | AF | AFG | 1997 | new_sp_f4554 | 8 |
Afghanistan | AF | AFG | 1997 | new_sp_f5564 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_f65 | 1 |
Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30 |
Afghanistan | AF | AFG | 1998 | new_sp_m1524 | 129 |
Afghanistan | AF | AFG | 1998 | new_sp_m2534 | 128 |
Afghanistan | AF | AFG | 1998 | new_sp_m3544 | 90 |
Afghanistan | AF | AFG | 1998 | new_sp_m4554 | 89 |
Afghanistan | AF | AFG | 1998 | new_sp_m5564 | 64 |
Afghanistan | AF | AFG | 1998 | new_sp_m65 | 41 |
Afghanistan | AF | AFG | 1998 | new_sp_f014 | 45 |
Afghanistan | AF | AFG | 1998 | new_sp_f1524 | 350 |
Afghanistan | AF | AFG | 1998 | new_sp_f2534 | 419 |
Afghanistan | AF | AFG | 1998 | new_sp_f3544 | 194 |
Afghanistan | AF | AFG | 1998 | new_sp_f4554 | 118 |
Afghanistan | AF | AFG | 1998 | new_sp_f5564 | 61 |
Afghanistan | AF | AFG | 1998 | new_sp_f65 | 20 |
Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8 |
Afghanistan | AF | AFG | 1999 | new_sp_m1524 | 55 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | ZW | ZWE | 2012 | new_sn_f5564 | 516 |
Zimbabwe | ZW | ZWE | 2012 | new_sn_f65 | 432 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m014 | 233 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m1524 | 214 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m2534 | 658 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m3544 | 789 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m4554 | 331 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m5564 | 178 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m65 | 182 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f014 | 208 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f1524 | 319 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f2534 | 710 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f3544 | 579 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f4554 | 228 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f5564 | 140 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f65 | 143 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m014 | 1315 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m1524 | 1642 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m2534 | 5331 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m3544 | 5363 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m4554 | 2349 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m5564 | 1206 |
Zimbabwe | ZW | ZWE | 2013 | newrel_m65 | 1208 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f014 | 1252 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f1524 | 2069 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f2534 | 4649 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f3544 | 3526 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f4554 | 1453 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f5564 | 811 |
Zimbabwe | ZW | ZWE | 2013 | newrel_f65 | 725 |
We can get some hint of the structure of the values in the new key
column by counting them:
who1 %>%
count(key)
key | n |
---|---|
<chr> | <int> |
new_ep_f014 | 1032 |
new_ep_f1524 | 1021 |
new_ep_f2534 | 1021 |
new_ep_f3544 | 1021 |
new_ep_f4554 | 1017 |
new_ep_f5564 | 1017 |
new_ep_f65 | 1014 |
new_ep_m014 | 1038 |
new_ep_m1524 | 1026 |
new_ep_m2534 | 1020 |
new_ep_m3544 | 1024 |
new_ep_m4554 | 1020 |
new_ep_m5564 | 1015 |
new_ep_m65 | 1018 |
new_sn_f014 | 1040 |
new_sn_f1524 | 1022 |
new_sn_f2534 | 1016 |
new_sn_f3544 | 1020 |
new_sn_f4554 | 1018 |
new_sn_f5564 | 1017 |
new_sn_f65 | 1019 |
new_sn_m014 | 1045 |
new_sn_m1524 | 1030 |
new_sn_m2534 | 1022 |
new_sn_m3544 | 1025 |
new_sn_m4554 | 1027 |
new_sn_m5564 | 1021 |
new_sn_m65 | 1020 |
new_sp_f014 | 3174 |
new_sp_f1524 | 3194 |
new_sp_f2534 | 3200 |
new_sp_f3544 | 3199 |
new_sp_f4554 | 3204 |
new_sp_f5564 | 3195 |
new_sp_f65 | 3197 |
new_sp_m014 | 3173 |
new_sp_m1524 | 3209 |
new_sp_m2534 | 3206 |
new_sp_m3544 | 3219 |
new_sp_m4554 | 3223 |
new_sp_m5564 | 3218 |
new_sp_m65 | 3209 |
newrel_f014 | 190 |
newrel_f1524 | 184 |
newrel_f2534 | 182 |
newrel_f3544 | 183 |
newrel_f4554 | 183 |
newrel_f5564 | 183 |
newrel_f65 | 185 |
newrel_m014 | 190 |
newrel_m1524 | 182 |
newrel_m2534 | 183 |
newrel_m3544 | 184 |
newrel_m4554 | 184 |
newrel_m5564 | 185 |
newrel_m65 | 182 |
You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
rel
stands for cases of relapseep
stands for cases of extrapulmonary TBsn
stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp
stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (
m
) and females (f
).The remaining numbers gives the age group. The dataset groups cases into seven age groups:
014
= 0 -- 14 years old1524
= 15 -- 24 years old2534
= 25 -- 34 years old3544
= 35 -- 44 years old4554
= 45 -- 54 years old5564
= 55 -- 64 years old65
= 65 or older
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel
we have newrel
(it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about str_replace()
in [strings], but the basic idea is pretty simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
country | iso2 | iso3 | year | key | cases |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <int> | <chr> | <int> |
Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_m1524 | 10 |
Afghanistan | AF | AFG | 1997 | new_sp_m2534 | 6 |
Afghanistan | AF | AFG | 1997 | new_sp_m3544 | 3 |
Afghanistan | AF | AFG | 1997 | new_sp_m4554 | 5 |
Afghanistan | AF | AFG | 1997 | new_sp_m5564 | 2 |
Afghanistan | AF | AFG | 1997 | new_sp_m65 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_f014 | 5 |
Afghanistan | AF | AFG | 1997 | new_sp_f1524 | 38 |
Afghanistan | AF | AFG | 1997 | new_sp_f2534 | 36 |
Afghanistan | AF | AFG | 1997 | new_sp_f3544 | 14 |
Afghanistan | AF | AFG | 1997 | new_sp_f4554 | 8 |
Afghanistan | AF | AFG | 1997 | new_sp_f5564 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_f65 | 1 |
Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30 |
Afghanistan | AF | AFG | 1998 | new_sp_m1524 | 129 |
Afghanistan | AF | AFG | 1998 | new_sp_m2534 | 128 |
Afghanistan | AF | AFG | 1998 | new_sp_m3544 | 90 |
Afghanistan | AF | AFG | 1998 | new_sp_m4554 | 89 |
Afghanistan | AF | AFG | 1998 | new_sp_m5564 | 64 |
Afghanistan | AF | AFG | 1998 | new_sp_m65 | 41 |
Afghanistan | AF | AFG | 1998 | new_sp_f014 | 45 |
Afghanistan | AF | AFG | 1998 | new_sp_f1524 | 350 |
Afghanistan | AF | AFG | 1998 | new_sp_f2534 | 419 |
Afghanistan | AF | AFG | 1998 | new_sp_f3544 | 194 |
Afghanistan | AF | AFG | 1998 | new_sp_f4554 | 118 |
Afghanistan | AF | AFG | 1998 | new_sp_f5564 | 61 |
Afghanistan | AF | AFG | 1998 | new_sp_f65 | 20 |
Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8 |
Afghanistan | AF | AFG | 1999 | new_sp_m1524 | 55 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | ZW | ZWE | 2012 | new_sn_f5564 | 516 |
Zimbabwe | ZW | ZWE | 2012 | new_sn_f65 | 432 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m014 | 233 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m1524 | 214 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m2534 | 658 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m3544 | 789 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m4554 | 331 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m5564 | 178 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_m65 | 182 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f014 | 208 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f1524 | 319 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f2534 | 710 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f3544 | 579 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f4554 | 228 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f5564 | 140 |
Zimbabwe | ZW | ZWE | 2012 | new_ep_f65 | 143 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m014 | 1315 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m1524 | 1642 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m2534 | 5331 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m3544 | 5363 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m4554 | 2349 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m5564 | 1206 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_m65 | 1208 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f014 | 1252 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f1524 | 2069 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f2534 | 4649 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f3544 | 3526 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f4554 | 1453 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f5564 | 811 |
Zimbabwe | ZW | ZWE | 2013 | new_rel_f65 | 725 |
We can separate the values in each code with two passes of separate()
. The first pass will split the codes at each underscore.
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
country | iso2 | iso3 | year | new | type | sexage | cases |
---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <int> |
Afghanistan | AF | AFG | 1997 | new | sp | m014 | 0 |
Afghanistan | AF | AFG | 1997 | new | sp | m1524 | 10 |
Afghanistan | AF | AFG | 1997 | new | sp | m2534 | 6 |
Afghanistan | AF | AFG | 1997 | new | sp | m3544 | 3 |
Afghanistan | AF | AFG | 1997 | new | sp | m4554 | 5 |
Afghanistan | AF | AFG | 1997 | new | sp | m5564 | 2 |
Afghanistan | AF | AFG | 1997 | new | sp | m65 | 0 |
Afghanistan | AF | AFG | 1997 | new | sp | f014 | 5 |
Afghanistan | AF | AFG | 1997 | new | sp | f1524 | 38 |
Afghanistan | AF | AFG | 1997 | new | sp | f2534 | 36 |
Afghanistan | AF | AFG | 1997 | new | sp | f3544 | 14 |
Afghanistan | AF | AFG | 1997 | new | sp | f4554 | 8 |
Afghanistan | AF | AFG | 1997 | new | sp | f5564 | 0 |
Afghanistan | AF | AFG | 1997 | new | sp | f65 | 1 |
Afghanistan | AF | AFG | 1998 | new | sp | m014 | 30 |
Afghanistan | AF | AFG | 1998 | new | sp | m1524 | 129 |
Afghanistan | AF | AFG | 1998 | new | sp | m2534 | 128 |
Afghanistan | AF | AFG | 1998 | new | sp | m3544 | 90 |
Afghanistan | AF | AFG | 1998 | new | sp | m4554 | 89 |
Afghanistan | AF | AFG | 1998 | new | sp | m5564 | 64 |
Afghanistan | AF | AFG | 1998 | new | sp | m65 | 41 |
Afghanistan | AF | AFG | 1998 | new | sp | f014 | 45 |
Afghanistan | AF | AFG | 1998 | new | sp | f1524 | 350 |
Afghanistan | AF | AFG | 1998 | new | sp | f2534 | 419 |
Afghanistan | AF | AFG | 1998 | new | sp | f3544 | 194 |
Afghanistan | AF | AFG | 1998 | new | sp | f4554 | 118 |
Afghanistan | AF | AFG | 1998 | new | sp | f5564 | 61 |
Afghanistan | AF | AFG | 1998 | new | sp | f65 | 20 |
Afghanistan | AF | AFG | 1999 | new | sp | m014 | 8 |
Afghanistan | AF | AFG | 1999 | new | sp | m1524 | 55 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | ZW | ZWE | 2012 | new | sn | f5564 | 516 |
Zimbabwe | ZW | ZWE | 2012 | new | sn | f65 | 432 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m014 | 233 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m1524 | 214 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m2534 | 658 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m3544 | 789 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m4554 | 331 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m5564 | 178 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | m65 | 182 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f014 | 208 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f1524 | 319 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f2534 | 710 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f3544 | 579 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f4554 | 228 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f5564 | 140 |
Zimbabwe | ZW | ZWE | 2012 | new | ep | f65 | 143 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m014 | 1315 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m1524 | 1642 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m2534 | 5331 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m3544 | 5363 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m4554 | 2349 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m5564 | 1206 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | m65 | 1208 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f014 | 1252 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f1524 | 2069 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f2534 | 4649 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f3544 | 3526 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f4554 | 1453 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f5564 | 811 |
Zimbabwe | ZW | ZWE | 2013 | new | rel | f65 | 725 |
Then we might as well drop the new
column because it's constant in this dataset. While we're dropping columns, let's also drop iso2
and iso3
since they're redundant.
who3 %>%
count(new)
who4 <- who3 %>%
select(-new, -iso2, -iso3)
new | n |
---|---|
<chr> | <int> |
new | 76046 |
Next we'll separate sexage
into sex
and age
by splitting after the first character:
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
country | year | type | sex | age | cases |
---|---|---|---|---|---|
<chr> | <int> | <chr> | <chr> | <chr> | <int> |
Afghanistan | 1997 | sp | m | 014 | 0 |
Afghanistan | 1997 | sp | m | 1524 | 10 |
Afghanistan | 1997 | sp | m | 2534 | 6 |
Afghanistan | 1997 | sp | m | 3544 | 3 |
Afghanistan | 1997 | sp | m | 4554 | 5 |
Afghanistan | 1997 | sp | m | 5564 | 2 |
Afghanistan | 1997 | sp | m | 65 | 0 |
Afghanistan | 1997 | sp | f | 014 | 5 |
Afghanistan | 1997 | sp | f | 1524 | 38 |
Afghanistan | 1997 | sp | f | 2534 | 36 |
Afghanistan | 1997 | sp | f | 3544 | 14 |
Afghanistan | 1997 | sp | f | 4554 | 8 |
Afghanistan | 1997 | sp | f | 5564 | 0 |
Afghanistan | 1997 | sp | f | 65 | 1 |
Afghanistan | 1998 | sp | m | 014 | 30 |
Afghanistan | 1998 | sp | m | 1524 | 129 |
Afghanistan | 1998 | sp | m | 2534 | 128 |
Afghanistan | 1998 | sp | m | 3544 | 90 |
Afghanistan | 1998 | sp | m | 4554 | 89 |
Afghanistan | 1998 | sp | m | 5564 | 64 |
Afghanistan | 1998 | sp | m | 65 | 41 |
Afghanistan | 1998 | sp | f | 014 | 45 |
Afghanistan | 1998 | sp | f | 1524 | 350 |
Afghanistan | 1998 | sp | f | 2534 | 419 |
Afghanistan | 1998 | sp | f | 3544 | 194 |
Afghanistan | 1998 | sp | f | 4554 | 118 |
Afghanistan | 1998 | sp | f | 5564 | 61 |
Afghanistan | 1998 | sp | f | 65 | 20 |
Afghanistan | 1999 | sp | m | 014 | 8 |
Afghanistan | 1999 | sp | m | 1524 | 55 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | 2012 | sn | f | 5564 | 516 |
Zimbabwe | 2012 | sn | f | 65 | 432 |
Zimbabwe | 2012 | ep | m | 014 | 233 |
Zimbabwe | 2012 | ep | m | 1524 | 214 |
Zimbabwe | 2012 | ep | m | 2534 | 658 |
Zimbabwe | 2012 | ep | m | 3544 | 789 |
Zimbabwe | 2012 | ep | m | 4554 | 331 |
Zimbabwe | 2012 | ep | m | 5564 | 178 |
Zimbabwe | 2012 | ep | m | 65 | 182 |
Zimbabwe | 2012 | ep | f | 014 | 208 |
Zimbabwe | 2012 | ep | f | 1524 | 319 |
Zimbabwe | 2012 | ep | f | 2534 | 710 |
Zimbabwe | 2012 | ep | f | 3544 | 579 |
Zimbabwe | 2012 | ep | f | 4554 | 228 |
Zimbabwe | 2012 | ep | f | 5564 | 140 |
Zimbabwe | 2012 | ep | f | 65 | 143 |
Zimbabwe | 2013 | rel | m | 014 | 1315 |
Zimbabwe | 2013 | rel | m | 1524 | 1642 |
Zimbabwe | 2013 | rel | m | 2534 | 5331 |
Zimbabwe | 2013 | rel | m | 3544 | 5363 |
Zimbabwe | 2013 | rel | m | 4554 | 2349 |
Zimbabwe | 2013 | rel | m | 5564 | 1206 |
Zimbabwe | 2013 | rel | m | 65 | 1208 |
Zimbabwe | 2013 | rel | f | 014 | 1252 |
Zimbabwe | 2013 | rel | f | 1524 | 2069 |
Zimbabwe | 2013 | rel | f | 2534 | 4649 |
Zimbabwe | 2013 | rel | f | 3544 | 3526 |
Zimbabwe | 2013 | rel | f | 4554 | 1453 |
Zimbabwe | 2013 | rel | f | 5564 | 811 |
Zimbabwe | 2013 | rel | f | 65 | 725 |
The who
dataset is now tidy!
I've shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn't how you'd work interactively. Instead, you'd gradually build up a complex pipe:
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
) %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
country | year | var | sex | age | cases |
---|---|---|---|---|---|
<chr> | <int> | <chr> | <chr> | <chr> | <int> |
Afghanistan | 1997 | sp | m | 014 | 0 |
Afghanistan | 1997 | sp | m | 1524 | 10 |
Afghanistan | 1997 | sp | m | 2534 | 6 |
Afghanistan | 1997 | sp | m | 3544 | 3 |
Afghanistan | 1997 | sp | m | 4554 | 5 |
Afghanistan | 1997 | sp | m | 5564 | 2 |
Afghanistan | 1997 | sp | m | 65 | 0 |
Afghanistan | 1997 | sp | f | 014 | 5 |
Afghanistan | 1997 | sp | f | 1524 | 38 |
Afghanistan | 1997 | sp | f | 2534 | 36 |
Afghanistan | 1997 | sp | f | 3544 | 14 |
Afghanistan | 1997 | sp | f | 4554 | 8 |
Afghanistan | 1997 | sp | f | 5564 | 0 |
Afghanistan | 1997 | sp | f | 65 | 1 |
Afghanistan | 1998 | sp | m | 014 | 30 |
Afghanistan | 1998 | sp | m | 1524 | 129 |
Afghanistan | 1998 | sp | m | 2534 | 128 |
Afghanistan | 1998 | sp | m | 3544 | 90 |
Afghanistan | 1998 | sp | m | 4554 | 89 |
Afghanistan | 1998 | sp | m | 5564 | 64 |
Afghanistan | 1998 | sp | m | 65 | 41 |
Afghanistan | 1998 | sp | f | 014 | 45 |
Afghanistan | 1998 | sp | f | 1524 | 350 |
Afghanistan | 1998 | sp | f | 2534 | 419 |
Afghanistan | 1998 | sp | f | 3544 | 194 |
Afghanistan | 1998 | sp | f | 4554 | 118 |
Afghanistan | 1998 | sp | f | 5564 | 61 |
Afghanistan | 1998 | sp | f | 65 | 20 |
Afghanistan | 1999 | sp | m | 014 | 8 |
Afghanistan | 1999 | sp | m | 1524 | 55 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | 2012 | sn | f | 5564 | 516 |
Zimbabwe | 2012 | sn | f | 65 | 432 |
Zimbabwe | 2012 | ep | m | 014 | 233 |
Zimbabwe | 2012 | ep | m | 1524 | 214 |
Zimbabwe | 2012 | ep | m | 2534 | 658 |
Zimbabwe | 2012 | ep | m | 3544 | 789 |
Zimbabwe | 2012 | ep | m | 4554 | 331 |
Zimbabwe | 2012 | ep | m | 5564 | 178 |
Zimbabwe | 2012 | ep | m | 65 | 182 |
Zimbabwe | 2012 | ep | f | 014 | 208 |
Zimbabwe | 2012 | ep | f | 1524 | 319 |
Zimbabwe | 2012 | ep | f | 2534 | 710 |
Zimbabwe | 2012 | ep | f | 3544 | 579 |
Zimbabwe | 2012 | ep | f | 4554 | 228 |
Zimbabwe | 2012 | ep | f | 5564 | 140 |
Zimbabwe | 2012 | ep | f | 65 | 143 |
Zimbabwe | 2013 | rel | m | 014 | 1315 |
Zimbabwe | 2013 | rel | m | 1524 | 1642 |
Zimbabwe | 2013 | rel | m | 2534 | 5331 |
Zimbabwe | 2013 | rel | m | 3544 | 5363 |
Zimbabwe | 2013 | rel | m | 4554 | 2349 |
Zimbabwe | 2013 | rel | m | 5564 | 1206 |
Zimbabwe | 2013 | rel | m | 65 | 1208 |
Zimbabwe | 2013 | rel | f | 014 | 1252 |
Zimbabwe | 2013 | rel | f | 1524 | 2069 |
Zimbabwe | 2013 | rel | f | 2534 | 4649 |
Zimbabwe | 2013 | rel | f | 3544 | 3526 |
Zimbabwe | 2013 | rel | f | 4554 | 1453 |
Zimbabwe | 2013 | rel | f | 5564 | 811 |
Zimbabwe | 2013 | rel | f | 65 | 725 |
Exercises¶
In this case study I set
values_drop_na = TRUE
just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What's the difference between anNA
and zero?What happens if you neglect the
mutate()
step? (mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
)I claimed that
iso2
andiso3
were redundant withcountry
. Confirm this claim.For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.
Non-tidy data¶
Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:
Alternative representations may have substantial performance or space advantages.
Specialised fields have evolved their own conventions for storing data that may be quite different to the conventions of tidy data.
Either of these reasons means you'll need something other than a tibble (or data frame). If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to use other structures; tidy data is not the only way.
If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: http://simplystatistics.org/2016/02/17/non-tidy-data/