-
Notifications
You must be signed in to change notification settings - Fork 230
/
tidy.Rmd
786 lines (598 loc) · 26.6 KB
/
tidy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
# Tidy data {#tidy-data .r4ds-section}
## Introduction {#introduction-6 .r4ds-section}
```{r setup,message=FALSE,cache=FALSE}
library("tidyverse")
```
## Tidy data {#tidy-data-1 .r4ds-section}
### Exercise 12.2.1 {.unnumbered .exercise data-number="12.2.1"}
<div class="question">
Using prose, describe how the variables and observations are organized in each of the sample tables.
</div>
<div class="answer">
In table `table1`, each row represents a (country, year) combination.
The columns `cases` and `population` contain the values for those variables.
```{r}
table1
```
In `table2`, each row represents a (country, year, variable) combination.
The column `count` contains the values of variables `cases` and `population` in separate rows.
```{r}
table2
```
In `table3`, each row represents a (country, year) combination.
The column `rate` provides the values of both `cases` and `population` in a string formatted like `cases / population`.
```{r}
table3
```
Table 4 is split into two tables, one table for each variable.
The table `table4a` contains the values of cases and `table4b` contains the values of population.
Within each table, each row represents a country, each column represents a year, and the cells are the value of the table's variable for that country and year.
```{r}
table4a
```
```{r}
table4b
```
</div>
### Exercise 12.2.2 {.unnumbered .exercise data-number="12.2.2"}
<div class="question">
Compute the `rate` for `table2`, and `table4a` + `table4b`.
You will need to perform four operations:
1. Extract the number of TB cases per country per year.
1. Extract the matching population per country per year.
1. Divide cases by population, and multiply by 10000.
1. Store back in the appropriate place.
Which representation is easiest to work with?
Which is hardest?
Why?
</div>
<div class="answer">
To calculate cases per person, we need to divide cases by population for each country and year.
This is easiest if the cases and population variables are two columns in a data frame in which rows represent (country, year) combinations.
Table 2: First, create separate tables for cases and population and ensure that they are sorted in the same order.
```{r}
t2_cases <- filter(table2, type == "cases") %>%
rename(cases = count) %>%
arrange(country, year)
t2_population <- filter(table2, type == "population") %>%
rename(population = count) %>%
arrange(country, year)
```
Then create a new data frame with the population and cases columns, and calculate the cases per capita in a new column.
```{r}
t2_cases_per_cap <- tibble(
year = t2_cases$year,
country = t2_cases$country,
cases = t2_cases$cases,
population = t2_population$population
) %>%
mutate(cases_per_cap = (cases / population) * 10000) %>%
select(country, year, cases_per_cap)
```
To store this new variable in the appropriate location, we will add new rows to `table2`.
```{r}
t2_cases_per_cap <- t2_cases_per_cap %>%
mutate(type = "cases_per_cap") %>%
rename(count = cases_per_cap)
```
```{r}
bind_rows(table2, t2_cases_per_cap) %>%
arrange(country, year, type, count)
```
Note that after adding the `cases_per_cap` rows, the type of `count` is coerced to `numeric` (double) because `cases_per_cap` is not an integer.
For `table4a` and `table4b`, create a new table for cases per capita, which we'll name `table4c`, with country rows and year columns.
```{r}
table4c <-
tibble(
country = table4a$country,
`1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
`2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
)
table4c
```
Neither table is particularly easy to work with.
Since `table2` has separate rows for cases and population we needed to generate a table with columns for cases and population where we could
calculate cases per capita.
`table4a` and `table4b` split the cases and population variables into different tables which
made it easy to divide cases by population.
However, we had to repeat this calculation for each row.
The ideal format of a data frame to answer this question is one with columns `country`, `year`, `cases`, and `population`.
Then problem could be answered with a single `mutate()` call.
</div>
### Exercise 12.2.3 {.unnumbered .exercise data-number="12.2.3"}
<div class="question">
Recreate the plot showing change in cases over time using `table2` instead of `table1`.
What do you need to do first?
</div>
<div class="answer">
Before creating the plot with change in cases over time, we need to filter `table` to only include rows representing cases of TB.
```{r}
table2 %>%
filter(type == "cases") %>%
ggplot(aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_x_continuous(breaks = unique(table2$year)) +
ylab("cases")
```
</div>
## Pivoting {#pivoting .r4ds-section}
This code is reproduced from the chapter because it is needed by the exercises.
```{r}
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
```
### Exercise 12.3.1 {.unnumbered .exercise data-number="12.3.1"}
<div class="question">
Why are `pivot_longer()` and `pivot_wider()` not perfectly symmetrical?
Carefully consider the following example:
Carefully consider the following example:
```{r}
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
```
(Hint: look at the variable types and think about column names.)
`pivot_longer()` has a `names_ptype` argument, e.g. `names_ptype = list(year = double())`. What does it do?
</div>
<div class="answer">
The functions `pivot_longer()` and `pivot_wider()` are not perfectly symmetrical because column type information is lost when a data frame is converted from wide to long.
The function `pivot_longer()` stacks multiple columns which may have had multiple data types into a single column with a single data type.
This transformation throws away the individual data types of the original columns.
The function `pivot_wider()` creates column names from values in column.
These column names will always be treated as `character` values by `pivot_longer()` so if the original variable used to create the column names did not have a `character` data type, then the round-trip will not reproduce the same dataset.
In the provided example, columns have the following data types:
```{r}
glimpse(stocks)
```
The `pivot_wider()` expression pivots the table to create a data frame with years as column names, and the values in `return` as the column values.
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return)
```
The `pivot_longer()` expression unpivots the table, returning it to a tidy data frame with columns for `half`, `year`, and `return`.
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return)%>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
```
There is one difference, in the new data frame, `year` has a data type of `character` rather than `numeric`.
The `names_to` column created from column names by `pivot_longer()` will be character by default, which is usually a safe assumption, since syntactically valid-column names can only be character values.
The original data types of column which `pivot_wider()` used to create the column names was not stored, so `pivot_longer()` has no idea that the column names in this case should be numeric values.
<!--
The `names_ptype` argument of `pivot_longer()` provide that information.
Using `names_ptype = list(year = double())` informs `pivot_longer()` that the column created from column names should be a numeric vector.
Adding this argument to `pivot_longer()` in the example above will make the output of the experession the same as the original `stocks` data frame.
-->
In the current version of `tidyr`, the `names_ptype` argument does not convert the `year` column to a numeric vector, and it will raise an error.
```{r error=TRUE}
stocks %>%
pivot_wider(names_from = year, values_from = return)%>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return",
names_ptype = list(year = double()))
```
Instead, use the `names_transform` argument to `pivot_longer()`, which provides a function to coerce the column to a different data type.
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return)%>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return",
names_transform = list(year = as.numeric))
```
</div>
### Exercise 12.3.2 {.unnumbered .exercise data-number="12.3.2"}
<div class="question">
Why does this code fail?
```{r error=TRUE}
table4a %>%
pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
```
</div>
<div class="answer">
The code fails because the column names `1999` and `2000` are not non-syntactic variable names.[^non-syntactic]
When selecting variables from a data frame, tidyverse functions will interpret numbers, like `1999` and `2000`, as column numbers.
In this case, `pivot_longer()` tries to select the 1999th and 2000th column of the data frame.
To select the columns `1999` and `2000`, the names must be surrounded in backticks (`\``) or provided as strings.
```{r}
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```
```{r}
table4a %>%
pivot_longer(c("1999", "2000"), names_to = "year", values_to = "cases")
```
</div>
### Exercise 12.3.3 {.unnumbered .exercise data-number="12.3.3"}
<div class="question">
What would happen if you widen this table?
Why?
How could you add a new column to uniquely identify each value?
```{r}
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
glimpse(people)
```
</div>
<div class="answer">
Widening this data frame using `pivot_wider()` produces columns that are lists of numeric vectors because the `name` and `key` columns do not uniquely identify rows.
In particular, there are two rows with values for the age of "Phillip Woods".
```{r}
pivot_wider(people, names_from="name", values_from = "value")
```
We could solve the problem by adding a row with a distinct observation count for each combination of name and key.
```{r}
people2 <- people %>%
group_by(name, key) %>%
mutate(obs = row_number())
people2
```
We can make `people2` wider because the combination of `name` and `obs` will uniquely identify the rows in the wide data frame.
```{r}
pivot_wider(people2, names_from="name", values_from = "value")
```
Another way to solve this problem is by keeping only distinct rows of the name and key values, and dropping duplicate rows.
```{r}
people %>%
distinct(name, key, .keep_all = TRUE) %>%
pivot_wider(names_from="name", values_from = "value")
```
However, before doing this understand why there are duplicates in the data.
The duplicate values may not be just a nuisance, but may indicate deeper problems with the data.
</div>
### Exercise 12.3.4 {.unnumbered .exercise data-number="12.3.4"}
<div class="question">
Tidy the simple tibble below. Do you need to make it wider or longer?
What are the variables?
```{r}
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
```
</div>
<div class="answer">
To tidy the `preg` table use `pivot_longer()` to create a long table.
The variables in this data are:
- `sex` ("female", "male")
- `pregnant` ("yes", "no")
- `count`, which is a non-negative integer representing the number of observations.
The observations in this data are unique combinations of sex and pregnancy status.
```{r}
preg_tidy <- preg %>%
pivot_longer(c(male, female), names_to = "sex", values_to = "count")
preg_tidy
```
Remove the (male, pregnant) row with a missing value to simplify the tidied data frame.
```{r}
preg_tidy2 <- preg %>%
pivot_longer(c(male, female), names_to = "sex", values_to = "count", values_drop_na = TRUE)
preg_tidy2
```
This an example of turning an explicit missing value into an implicit missing value, which is discussed in the upcoming section, [Missing Values](https://r4ds.had.co.nz/tidy-data.html#missing-values-3) section.
The missing (male, pregnant) row represents an implicit missing value because the value of `count` can be inferred from its absence.
In the tidy data, we can represent rows with missing values of `count` either explicitly with an `NA` (as in `preg_tidy`) or implicitly by the absence of a row (as in `preg_tidy2`).
But in the wide data, the missing values can only be represented explicitly.
Though we have already done enough to make the data tidy, there are some other transformations that can clean the data further.
If a variable takes two values, like `pregnant` and `sex`, it is often preferable to store them as logical vectors.
```{r}
preg_tidy3 <- preg_tidy2 %>%
mutate(
female = sex == "female",
pregnant = pregnant == "yes"
) %>%
select(female, pregnant, count)
preg_tidy3
```
In the previous data frame, I named the logical variable representing the sex `female`, not `sex`.
This makes the meaning of the variable self-documenting.
If the variable were named `sex` with values `TRUE` and `FALSE`, without reading the documentation, we wouldn't know whether `TRUE` means male or female.
Apart from some minor memory savings, representing these variables as logical vectors results in more clear and concise code.
Compare the `filter()` calls to select non-pregnant females from `preg_tidy2` and `preg_tidy`.
```{r}
filter(preg_tidy2, sex == "female", pregnant == "no")
filter(preg_tidy3, female, !pregnant)
```
</div>
## Separating and uniting {#separating-and-uniting .r4ds-section}
### Exercise 12.4.1 {.unnumbered .exercise data-number="12.4.1"}
<div class="question">
What do the extra and fill arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
</div>
<div class="answer">
The `extra` argument tells `separate()` what to do if there are too many pieces, and the `fill` argument tells it what to do if there aren't enough.
By default, `separate()` drops extra values with a warning.
```{r}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
```
Adding the argument, `extra = "drop"`, produces the same result as above but without the warning.
```{r}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "drop")
```
Setting `extra = "merge"`, then the extra values are not split, so `"f,g"` appears in column three.
```{r}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "merge")
```
In this example, one of the values, `"d,e"`, has too few elements.
The default for `fill` is similar to those in `separate()`;
it fills columns with missing values but emits a warning.
In this example, the 2nd row of column `three` is `NA`.
```{r}
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
Alternative options for the `fill` are `"right"`, to fill with missing values from the right, but without a warning
```{r}
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "right")
```
The option `fill = "left"` also fills with missing values without emitting a warning, but this time from the left side.
Now, the 2nd row of column `one` will be missing, and the other values in that row are shifted right.
```{r}
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "left")
```
</div>
### Exercise 12.4.2 {.unnumbered .exercise data-number="12.4.2"}
<div class="question">
Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
</div>
<div class="answer">
The `remove` argument discards input columns in the result data frame. You would set it to `FALSE` if you want to create a new variable, but keep the old one.
</div>
### Exercise 12.4.3 {.unnumbered .exercise data-number="12.4.3"}
<div class="question">
Compare and contrast `separate()` and `extract()`, Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
</div>
<div class="answer">
The function `separate()`, splits a column into multiple columns by separator, if the `sep` argument is a character vector, or by character positions, if `sep` is numeric.
```{r}
# example with separators
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
separate(x, c("variable", "into"), sep = "_")
# example with position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
separate(x, c("variable", "into"), sep = c(1))
```
The function `extract()` uses a regular expression to specify groups in character vector and split that single character vector into multiple columns.
This is more flexible than `separate()` because it does not require a common
separator or specific column positions.
```{r}
# example with separators
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z])_([0-9])")
# example with position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
# example that separate could not parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
```
Both `separate()` and `extract()` convert a single column to many columns.
However, `unite()` converts many columns to one, with a choice of a separator to include between column values.
```{r}
tibble(variable = c("X", "X", "Y", "Y"), id = c(1, 2, 1, 2)) %>%
unite(x, variable, id, sep = "_")
```
In other words, with `extract()` and `separate()` only one column can be chosen,
but there are many choices how to split that single column into different columns.
With `unite()`, there are many choices as to which columns to include, but only one
choice as to how to combine their contents into a single vector.
</div>
## Missing values {#missing-values-3 .r4ds-section}
### Exercise 12.5.1 {.unnumbered .exercise data-number="12.5.1"}
<div class="question">
Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
</div>
<div class="answer">
The `values_fill` argument in `pivot_wider()` and the `fill` argument to `complete()` both set vales to replace `NA`.
Both arguments accept named lists to set values for each column.
Additionally, the `values_fill` argument of `pivot_wider()` accepts a single value.
In `complete()`, the fill argument also sets a value to replace `NA`s but it is named list, allowing for different values for different variables.
Also, both cases replace both implicit and explicit missing values.
For example, this will fill in the missing values of the long data frame with `0` `complete()`:
```{r}
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks %>%
pivot_wider(names_from = year, values_from = return,
values_fill = 0)
```
```{r}
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks %>%
pivot_wider(names_from = year, values_from = return,
values_fill = 0)
```
For example, this will fill in the missing values of the long data frame with `0` `complete()`:
```{r}
stocks %>%
complete(year, qtr, fill=list(return=0))
```
</div>
### Exercise 12.5.2 {.unnumbered .exercise data-number="12.5.2"}
<div class="question">
What does the `direction` argument to `fill()` do?
</div>
<div class="answer">
With `fill`, the `direction` determines whether `NA` values should be replaced by the previous non-missing value (`"down"`) or the next non-missing value (`"up"`).
</div>
## Case Study {#case-study .r4ds-section}
This code is repeated from the chapter because it is needed by the exercises.
```{r}
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
```
```{r}
who2 <- who1 %>%
mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who2
```
```{r}
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
```
```{r}
who3 %>%
count(new)
```
```{r}
who4 <- who3 %>%
select(-new, -iso2, -iso3)
```
```{r}
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
```
### Exercise 12.6.1 {.unnumbered .exercise data-number="12.6.1"}
<div class="question">
In this case study, I set `na.rm = TRUE` just to make it easier to check that we had the correct values.
Is this reasonable?
Think about how missing values are represented in this dataset.
Are there implicit missing values?
What’s the difference between an `NA` and zero?
</div>
<div class="answer">
The reasonableness of using `na.rm = TRUE` depends on how missing values are represented in this dataset.
The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases.
Here are some things we should look for to help distinguish between these cases.
- If there are no 0 values in the data, then missing values may be used to indicate no cases.
- If there are both explicit and implicit missing values, then it suggests that missing values
are being used differently. In that case, it is likely that explicit missing values would
mean no cases, and implicit missing values would mean no data on the number of cases.
First, I'll check for the presence of zeros in the data.
```{r}
who1 %>%
filter(cases == 0) %>%
nrow()
```
There are zeros in the data, so it appears that cases of zero TB are explicitly indicated, and the value of`NA` is used to indicate missing data.
Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.
```{r}
pivot_longer(who, c(new_sp_m014:newrel_f65), names_to = "key", values_to = "cases") %>%
group_by(country, year) %>%
mutate(prop_missing = sum(is.na(cases)) / n()) %>%
filter(prop_missing > 0, prop_missing < 1)
```
From the results above, it looks like it is possible for a (country, year) row to contain some, but not all, missing values in its columns.
Finally, I will check for implicit missing values.
Implicit missing values are (`year`, `country`) combinations that do not appear in the data.
```{r}
nrow(who)
who %>%
complete(country, year) %>%
nrow()
```
Since the number of complete cases of (`country`, `year`) is greater than the number of rows in `who`, there are some implicit values.
But that doesn't tell us what those implicit missing values are.
To do this, I will use the `anti_join()` function introduced in the later [Relational Data](https://r4ds.had.co.nz/relational-data.html#filtering-joins) chapter.
```{r}
anti_join(complete(who, country, year), who, by = c("country", "year")) %>%
select(country, year) %>%
group_by(country) %>%
# so I can make better sense of the years
summarise(min_year = min(year), max_year = max(year))
```
All of these refer to (`country`, `year`) combinations for years prior to the existence of the country.
For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.
To summarize:
- `0` is used to represent no cases of TB.
- Explicit missing values (`NA`s) are used to represent missing data for (`country`, `year`) combinations in which the country existed in that year.
- Implicit missing values are used to represent missing data because a country did not exist in that year.
</div>
### Exercise 12.6.2 {.unnumbered .exercise data-number="12.6.2"}
<div class="question">
What happens if you neglect the `mutate()` step?
(`mutate(key = str_replace(key, "newrel", "new_rel")`)
</div>
<div class="answer">
The `separate()` function emits the warning "too few values".
If we check the rows for keys beginning with `"newrel_"`, we see that `sexage` is missing,
and `type = m014`.
```{r}
who3a <- who1 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
filter(who3a, new == "newrel") %>% head()
```
</div>
### Exercise 12.6.3 {.unnumbered .exercise data-number="12.6.3"}
<div class="question">
I claimed that `iso2` and `iso3` were redundant with country.
Confirm this claim.
</div>
<div class="answer">
If `iso2` and `iso3` are redundant with `country`, then, within each country,
there should only be one distinct combination of `iso2` and `iso3` values, which is the case.
```{r}
select(who3, country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
```
This makes sense, since `iso2` and `iso3` contain the 2- and 3-letter country abbreviations for the country.
The `iso2` variable contains each country's [ISO 3166 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), and the `iso3` variable contains each country's [ISO 3166 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) abbreviation.
You may recognize the ISO 3166-2 abbreviations, since they are almost identical to internet [country-code top level domains](https://en.wikipedia.org/wiki/Country_code_top-level_domain), such as `.uk` (United Kingdom), `.ly` (Libya), `.tv` (Tuvalu), and `.io` (British Indian Ocean Territory).
</div>
### Exercise 12.6.4 {.unnumbered .exercise data-number="12.6.4"}
<div class="question">
For each country, year, and sex compute the total number of cases of TB.
Make an informative visualization of the data.
</div>
<div class="answer">
```{r}
who5 %>%
group_by(country, year, sex) %>%
filter(year > 1995) %>%
summarise(cases = sum(cases)) %>%
unite(country_sex, country, sex, remove = FALSE) %>%
ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
geom_line()
```
A small multiples plot faceting by country is difficult given the number of countries.
Focusing on those countries with the largest changes or absolute magnitudes after providing the context above is another option.
</div>
## Non-tidy data {#non-tidy-data .r4ds-section}
`r no_exercises()`
[ex-12.2.2]: It would be better to join these tables using the methods covered in the [Relational Data](https://r4ds.had.co.nz/relational-data.html).
We could use `inner_join(t2_cases, t2_population, by = c("country", "year"))`.
[non-syntactic]: See the [Creating Tibbles](https://r4ds.had.co.nz/tibbles.html#tibbles) section.