# 2020-03
# Data manipulation in R
# This workshop focuses on the basics of data manipulation
# using the dplyr and tidyr packages
# Other data manipulation in R options are:
# The apply() family of functions;
# ave() and aggregate() functions;
# The data.table package.
# You may find another of these options fits your
# coding style better, but I find dplyr
# approachable for folks with non-programming backgrounds
# The workshop is more-or-less broken into 3 parts:
# 1. Reviewing functions in dplyr for basic
# data manipulation/cleaning/munging
# 2. An intro to *reshaping* datasets using functions
# from the tidyr package
# 3. Merging datasets using dplyr functions
# We have small practice exercises after
# parts 1 and 2
# The common data manipulation tasks we will cover:
# Making summary datasets by group
# Filtering datasets
# Selecting only some variables in a dataset
# Adding variables to datasets
# Sorting datasets
# Reshaping datasets
# Merging/joining datasets
# Getting help ----
# Most of the help on data manipulation in R
# (including dplyr and tidyr)
# I use is from stack overflow
# http://stackoverflow.com/questions/tagged/r
# but also see the RStudio community
# https://community.rstudio.com/
# Both packages are relatively young and so
# there are still things that do/will change
# However, the basic functions
# we will be covering have been stable
# The introductory vignettes for the packages are periodically updated
# and are good resources on the basics
# https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
# https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
# Also see the RStudio cheat sheets:
# https://www.rstudio.com/resources/cheatsheets/
# The packages ----
# The current versions of the packages we'll be working with are
# dplyr 0.8.3 and tidyr 1.0.2
# check if your version is up to date using packageVersion()
packageVersion("dplyr")
packageVersion("tidyr")
# If one of these is not up-to-date, re-install
# You can install via code such as install.packages("tidyr") or
# use the Install Button in the RStudio Packages pane
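# For example (commented out here so the script doesn't re-install on every run):
# install.packages("dplyr")
# install.packages("tidyr")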
# Once your versions are current, load dplyr and tidyr
library(dplyr)
library(tidyr)
# We are going to be using the mtcars dataset that comes with R
# in the first part of the workshop
# To learn more about this dataset, see the help page
?mtcars
# Let's take a look at the dataset
head(mtcars) # The first six lines
str(mtcars) # The structure of the dataset
# Part 1: Data summarizing and cleaning ----
# Calculating summaries by group ----
# I'm going to start by talking about summarising datasets
# by grouping variables, as this is a very common
# task that I see folks struggle with in R
# Summarising by groups (and other similar tasks) is generally referred to as
# "split-apply-combine" tasks in R and can be done in dplyr
# as well as with the tools I listed at the beginning of the workshop
# Grouping a dataset ----
# In dplyr, the key to such split-apply-combine tasks is *grouping*
# where we define which categorical variable/variables
# we want to make separate summaries for
# We make a grouped dataset using the group_by() function
# For example, if we want to calculate summary statistics separately
# for each cylinder category,
# we need to *group by* the number of cylinders
# The first argument in group_by() is the dataset,
# and the second is the grouping variable
bycyl = group_by(mtcars, cyl)
# The result is a "grouped" data.frame, which you can
# determine because the grouping variable is listed
# when you look at the dataset
head(bycyl)
# You can also see this in the dataset structure,
# where the object is of class "grouped_df"
str(bycyl)
# Once you have a grouped dataset, any summaries are
# done within the defined groups
# Summaries by group ----
# For example, maybe we want to calculate
# mean engine displacement for each cylinder category
# We can use the summarise() function for this
# (summarize() is an alternative spelling)
# Note that throughout the first part of the workshop I am printing results
# to the console but not assigning names
# You certainly can (and likely will want to) assign names
# to your new object
# We'll cover this once
# we combine some of the data manipulation steps together later today
# Like in group_by(), the first argument of summarise() is a dataset
# We use our grouped dataset "bycyl" for this
# Then we simply define the function and variable
# we want to use for summarizing
summarise( bycyl, mean(disp) )
# We can summarise multiple variables at once or
# get different types of summaries by simply
# adding summary functions/variables with
# commas separating each new argument
# For example, let's calculate the average engine displacement
# and the average horsepower by each cylinder category
summarise( bycyl, mean(disp), mean(hp) )
# That's easy, but the column names leave something to be desired
# We can set those directly in summarise()
# when we calculate new variables
# Maybe we want mean and sd of engine displacement,
# with the names "mdisp" and "sdisp", respectively
summarise( bycyl, mdisp = mean(disp), sdisp = sd(disp) )
# Multiple grouping variables ----
# We can also group a dataset by multiple variables
# This is very common when we have more than one factor of interest
# or have a nested study design
# (e.g., plots nested in transects nested in sites)
# Along these lines, we will next calculate summary stats
# for each cylinder category
# separately for each transmission type (am)
# To add more grouping variables,
# we include more variables in group_by(), separated by commas
byam.cyl = group_by(mtcars, cyl, am)
# The data summary is for the combination of groups
summarise( byam.cyl, mdisp = mean(disp) )
# Ungrouping a dataset ----
# You can see that the dataset we just created
# is still grouped by the "cyl" variable
# When we have finished the work with groups,
# it is safest to "ungroup" everything
# Working with a dataset that you've forgotten to
# ungroup (and don't realize it) can cause problems
# down the road
# Ungrouping is done via the ungroup() function
ungroup( summarise( byam.cyl, mdisp = mean(disp) ) )
# Now the output no longer has any "Groups" listed and
# any further work will be done on the overall dataset,
# not on groups
# Using summarise_all()/summarise_at()/summarise_if() ----
# The dplyr package also has additional *scoped variants*
# of the summarise() function for
# calculating the same summary statistics
# for multiple variables at once
# These scoped functions will still be available
# but will be superseded by `across()` in dplyr 1.0.0,
# which will be released in 2020.
# These functions are listed in the same help page,
# and each has slightly different arguments
?summarise_all
# summarise_all() ----
# The summarise_all() function summarises every variable
# in the dataset the same way
# This will work well primarily when the variables are all
# the same type, like numeric, as in the mtcars dataset
# The second argument is the function we want to use
# Here is the mean of every variable for each cylinder group
# (using the "bycyl" grouped dataset)
summarise_all(bycyl, .funs = mean)
# What would happen if we had a factor variable
# and tried to do this?
# We'll make "vs" a factor so we can see the result
bycyl$vs = factor(bycyl$vs)
summarise_all(bycyl, .funs = mean)
# summarise_at() ----
# If we only want to summarize some of the variables
# in the dataset we can use summarise_at()
# To summarize only some of the columns in our dataset,
# we list the columns we want summarized
# as the second argument within vars()
# prior to giving the summary function we want to use
# Here we'll get the mean of only disp and wt
summarise_at(bycyl, .vars = vars(disp, wt), .funs = mean)
# Some of the variables in the dataset are
# actually categorical
# If we don't want to treat them as if they were numbers,
# we could remove them from the summary by excluding them
# To exclude variables we can just drop them using the minus sign
# We'll see more about choosing variables when
# we talk about the select()
# function in a few minutes
# Let's remove am and vs from the means summary table
summarise_at(bycyl, .vars = vars(-am, -vs), .funs = mean)
# summarise_if() ----
# Another way to summarize only some of the variables
# in a dataset is to use logic
# This is called using a *predicate* function
# For example, we could summarize only the numeric columns
# by testing each column with the is.numeric() predicate function
# The predicate function in summarise_if() comes before
# listing the summary functions we want
summarise_if(bycyl, .predicate = is.numeric, .funs = mean)
# In the results, only the numeric variables are summarized,
# so "vs", which we made a factor, isn't included
# With any of the summarise_* functions,
# we can calculate more than a single summary function at once
# by including all the functions inside a list()
summarise_if( bycyl, .predicate = is.numeric,
.funs = list(mean, max) )
# Notice that the new column names are appended
# with "fn1", "fn2" instead of the function names
# This is a known regression; in the next version of dplyr
# the names of the functions themselves will be appended instead
# We can change what is appended within the list()
summarise_if( bycyl, .predicate = is.numeric,
.funs = list(mn = mean, mx = max) )
# The glimpse() function as an alternative to printing to the console ----
# Notice we are printing the object to the console
# instead of creating a new object, and dplyr doesn't
# show all columns of wide datasets when printing like this
# To see all variable names we could use glimpse()
glimpse( summarise_if( bycyl, .predicate = is.numeric,
.funs = list(mn = mean, mx = max) ) )
# More basic data manipulation ----
# with filter(), select(), mutate(), and arrange()
# Let's now switch our focus to talk about
# some of the other functions we can use
# for other data cleaning/manipulation tasks
# I've seen folks leap to using loops for some of these tasks,
# which are useful but can also be difficult to code
# and difficult to read
# compared to some of the methods available to us
# through add-on packages
# Filtering the dataset ----
# *Filtering* a dataset involves getting rid of unwanted rows
# to create a specific subset of interest
# For example, if we wanted to just focus
# on automatic transmissions,
# we could *filter out* the rows in the mtcars dataset
# from manual transmission cars using filter()
# Like other dplyr functions, the dataset is the first argument to filter()
# In filter(), we define what we want to keep
# in the resulting dataset using logic
# Here we want to keep all rows when am is 0
# (i.e., automatic transmission cars)
# Note the *two* equals signs for testing equality
filter(mtcars, am == 0)
# In filtering you will always be using logical operators
# This means things like == (testing for equality),
# != (testing for inequality), > (greater than),
# !is.na() (keeping all non-missing values), etc.
# If we wanted to filter out all weights greater than 4000 lbs (wt is in units of 1000 lbs),
# we could use less than or equal to (<=)
# to *keep* rows only for cars with weights less than or equal to 4
filter(mtcars, wt <= 4)
# We could do the same thing, using greater than with the
# *not* operator, "!"
# This would say we want to keep cars with weights that are NOT
# greater than 4
filter(mtcars, !wt > 4)
# We can also filter a grouped dataset if we want to
# Maybe we'd like a dataset containing just the rows where
# the values of weight are greater than the mean
# within each cylinder category
filter( bycyl, wt > mean(wt) )
# And, of course, we can filter by multiple things at once
# Here, we filter to only cars that have both automatic transmissions
# AND weigh less than or equal to 4000 lbs
filter(mtcars, am == 0, wt <= 4)
# If you need an OR, you need the vertical line "|", which is
# found on the backslash key
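# For example (a quick illustration, not part of the original exercises),
# keep cars that have 4 cylinders OR weigh more than 4000 lbs
filter(mtcars, cyl == 4 | wt > 4)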
# There are filter_all(), filter_at(), and filter_if() functions;
# See the help page for examples of these
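# As a quick taste (an illustrative example, not taken from the help page),
# filter_at() keeps rows where a condition holds for all (or any)
# of the chosen columns, using the same vars() syntax as summarise_at()
filter_at( mtcars, .vars = vars(mpg, qsec), .vars_predicate = all_vars(. > 17) )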
# Selecting variables ----
# *Selecting* a dataset involves getting rid of unwanted columns
# to create a specific subset of interest
# Sometimes we might want to focus on only some of the columns of a dataset,
# especially if our dataset is very wide
# but not all variables are part of our current analysis
# The select() function lets us keep only some of the columns
# of a dataset
# We can do this by name (but also index number if needed),
# which makes coding easy and easy to read
# Here is a basic example, taking just the "cyl" column from the dataset
select(mtcars, cyl)
# We can take all columns between (and including) two variables
# using the colon, ":"
select(mtcars, cyl:vs)
# We can take just a few columns, separating columns names by commas
select(mtcars, cyl, vs)
# We can also choose columns that match certain rules
# using some of the special helper functions
# that can be used within select(),
# such as starts_with() and contains()
# See the help page for select_helpers() for all of these special
# functions and how to use them
# Everything we can do with the select helpers in select()
# can also be done in, e.g., the vars() argument of summarise_at()
?select_helpers
# While I'm showing you some examples that aren't very meaningful,
# these really come in handy if you have some variables that all share
# a common name or number, etc., that you want to pull out and focus on
# For example, we could choose only columns
# with names that start with "d"
# Remember that R is case sensitive, so "d" isn't the same as "D"
select( mtcars, starts_with("d") )
# Or columns with names that contain "a"
select( mtcars, contains("a") )
# We can also drop columns using the minus sign
# (like we saw in summarise_at() ),
# both with and without the special functions
select(mtcars, -gear) # drop "gear" variable
select(mtcars, -gear, -carb) # drop "gear" and "carb" variables
select( mtcars, -(am:carb) ) # drop all variables between and including "am" and "carb"
select( mtcars, -ends_with("t") ) # drop variables that end with "t" (drat and wt)
# Again, the select helpers can be used in other dplyr functions,
# such as in summarise_at() and mutate_at()
# Like the other verbs, there are scoped variants of select()
# ( select_all(), select_at(), select_if() )
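# For instance (illustrative examples only), select_if() chooses columns with a
# predicate function and select_at() uses the same vars() syntax as summarise_at()
select_if(mtcars, is.numeric)                        # keep only numeric columns
select_at( mtcars, .vars = vars(starts_with("d")) )  # keep columns starting with "d"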
# Adding new columns ----
# In dplyr we use mutate() to add new variables to a dataset
# With mutate(), the new variable we create
# will be the same length as the original dataset
# Adding new variables can be as basic as adding two variables together
# to make a third variable
# We'll make some nonsense variables
# to give you the idea of how this works
# First let's sum engine displacement and horsepower
# Like we did in summarise(), we will name the new
# variable whatever we like,
# so we'll name our new variable "disp.hp"
mutate(mtcars, disp.hp = disp + hp)
# We can create multiple new variables at once using commas to separate
# the new variables
# A handy feature of mutate() is that we can
# build on variables we just created
# earlier within the mutate() call
# We'll make three new variables:
# the sum of "disp" and "hp",
# half of the new "disp.hp" variable,
# and the ratio of "qsec" and "wt"
mutate(mtcars,
disp.hp = disp + hp,
halfdh = disp.hp/2,
qw = qsec/wt)
# With mutate(), we can add summary variables by group to
# the current dataset as a new column
# For example, we can add mean horsepower for each
# cylinder category to the dataset
# if we mutate() a grouped dataset
# Each car within a category will have the same value because,
# unlike summarise(), mutate() always creates variables
# that are the same length as the original dataset
mutate( bycyl, mhp = mean(hp) )
# As you can see, mutate() code looks a lot like summarise() code
# There are mutate_all()/mutate_at()/mutate_if() functions
# that work the same as
# the scoped variants of the summarise() functions
# that we learned above
# There is also a function called transmute(),
# which works like mutate() except
# it only keeps the new columns that you create
# instead of adding columns to the current dataset
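# Here's a brief illustration of both (made up for this script):
transmute(mtcars, disp.hp = disp + hp)    # keeps only the new disp.hp column
mutate_at( mtcars, .vars = vars(disp, hp),
           .funs = list(half = ~ . / 2) ) # adds disp_half and hp_half columns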
# Sorting ----
# The last common task we might want to do is to *sort*
# the dataset a certain way with arrange()
# One place you might use this is when working in a time
# series, where you want to pull the first observation
# within each group and so you sort time within group
# prior to the next data manipulation step
# We can sort by variables
# By default we sort in ascending order
arrange(mtcars, cyl)
# Or sort by the reverse of the variables using the minus sign, "-"
# or the descend function desc()
arrange(mtcars, -cyl)
arrange( mtcars, desc(cyl) )
# To sort variables within groups
# we need to sort by the grouping variable first
# and then any other sorting variables
# The arrange() function ignores group_by()
# Here, we'll sort by wt ascending within each cylinder category
arrange(mtcars, cyl, wt)
# Doing several data manipulation steps in a row ----
# In real life, we will likely want to perform
# several data manipulation tasks on a single dataset,
# combining filtering, selecting, mutating, etc.
# Let's see what this looks like with the mtcars dataset
# Here are the steps we'll take:
# We will filter mtcars to just cars with automatic transmissions
# We will create a new variable, the ratio of hp and disp
# And finally we will calculate the mean of the new variable
# within each cylinder category
# We can do this by making a series of temporary objects,
# one for each step
# I'm including the extra pair of parentheses to print
# the output to the console
# so we can see what this looks like as we go
# Filter to automatic transmission cars
( filtcars = filter(mtcars, am == 0) )
# Create new variable in the filtered dataset
( ratio.cars = mutate( filtcars, hd.ratio = hp/disp) )
# Group by number of cylinders
grp.ratio = group_by(ratio.cars, cyl)
# Calculate mean of the new ratio variable by cylinder category
( sum.ratio = summarise( grp.ratio, mratio = mean(hd.ratio) ) )
# Nesting functions ----
# Instead of making new objects with separate names for each step,
# we can "nest" the functions
# This keeps us from having to make temporary objects, but can
# get complicated if using many functions in a row
# Here's a simple example, where we group before summarizing
# We group the mtcars dataset by cyl and am
# and then calculate mean disp
summarise( group_by(mtcars, cyl, am), mdisp = mean(disp) )
# The downside to nesting functions is that it is
# fairly difficult to make the code readable
# To understand the code, we have to read the nesting "inside-out",
# as the most nested function
# is the first one used
# Here is the series of data manipulation tasks
# we were just doing using function nesting
# instead of temporary object naming
( sum.ratio = summarise( group_by( mutate( filter(mtcars, am == 0),
hd.ratio = hp/disp),
cyl),
mratio = mean(hd.ratio) ) )
# The pipe operator ----
# Now that you've seen some dplyr basics,
# it's time to introduce you to the *pipe operator*
# The pipe operator is an important part of the dplyr package
# but represents a relatively new style of coding in R
# The pipe operator is represented by "%>%",
# and allows us to perform a series of data manipulation steps
# one after another (aka "chain together")
# in a readable way
# The pipe operator allows us to write code
# that is read from left to right instead of inside-out
# The pipe operator "pipes" a dataset into a function
# This is the reason the dataset
# is the first argument in dplyr functions
# because the pipe by default feeds the dataset
# to the first argument of a function
# You can think of the pipe operator as being pronounced "then"
# Let's see an example
# Remember when we made our grouped dataset?
# It looked like this:
bycyl = group_by(mtcars, cyl)
# Even this simple example reads from inside-out,
# as we have to go inside group_by() to see
# which dataset we are grouping
# Using pipes, we could write it like this:
bycyl = mtcars %>% group_by(cyl)
# This reads from left to right
# It says: Take the mtcars dataset and *then* group it by cyl
# Handily, we can simply keep piping through the
# data manipulation steps we want to perform on a dataset
# in a single chain
# Stylistically, the standard is to use line breaks to separate
# each task as it keeps your code easy to read
# (As an aside, much like writing in English,
# spaces make code easier to read.
# Don't forget spaces when writing your code; it
# seems clunky at first but quickly becomes natural.
# White space rationing is not in effect! :-D )
# Let's group mtcars by cyl and then calculate mean engine displacement,
# using pipes
mtcars %>%
group_by(cyl) %>%
summarise( mdisp = mean(disp) )
# Again, we are reading left to right instead of inside-out
# so we take the mtcars dataset and then
# group that dataset by cylinder category and then
# calculate the mean engine displacement for each group
# We can tie in filtering, mutating, selecting, summarizing, etc.
# in our chain without making temporary objects
# or nesting the functions
# To show this, let's go back and redo the filtering, mutating,
# grouping, and summarizing task
# we did a few minutes ago on the mtcars dataset
# using the pipe operator
mtcars %>%
filter(am == 0) %>% # filter out the manual transmission cars
mutate(hd.ratio = hp/disp) %>% # make new ratio variable
group_by(cyl) %>% # group by number of cylinders
summarise( mratio = mean(hd.ratio) ) # calculate mean hd.ratio per cylinder category
# We can assign a name to the final object, as well
sum.ratio = mtcars %>%
filter(am == 0) %>% # filter out the manual transmission cars
mutate(hd.ratio = hp/disp) %>% # make new ratio variable
group_by(cyl) %>% # group by number of cylinders
summarise(mratio = mean(hd.ratio) ) # calculate mean hd.ratio per cylinder category
# Using the pipe in non-dplyr functions ----
# The pipe operator can be used with non-dplyr functions
# If the data argument is the first argument in the function
# this looks just like what we've already been doing
# Here's an example using head(),
# printing the first 10 rows of the dataset
# The data argument is the first one in head()
# (you can go to the help page to see the order of the arguments)
mtcars %>% head(n = 10)
# If the dataset is not the first argument,
# you can refer to it with a dot, "."
# Here's an example with t.test() to show you what that looks like
mtcars %>% t.test(hp ~ am, data = .)
# And a more realistic example where you might be
# filtering the dataset before running a t test
mtcars %>%
filter(wt <= 4) %>%
t.test(hp ~ am, data = .)
# The n() function for counting rows per group ----
# Before we move on I want to talk about
# one more function
# The dplyr function n() counts
# up the number of rows in a group
# This is so useful when making tables of summary statistics
mtcars %>%
group_by(cyl) %>%
summarise( n = n(),
mdis = mean(disp) )
# We can also use it directly in, e.g., filter() or mutate()
# I sometimes use it with filter() to look for mistakes or unusual
# values in a dataset I'm trying to understand
# such as when a group has only a single observation
# Here we'll pull out the rows for
# any cylinder group where the number of observations
# is less than 10
# For best practice I'll ungroup at the end of these pipe chains
mtcars %>%
group_by(cyl) %>%
filter(n() < 10) %>%
ungroup()
# We can use n() in mutate() if we need an index within each group
# in the order of the dataset
# We use select() below only so that we can see the new variable
# when printed to the console
mtcars %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
# If we want to add the index based on some order,
# we could arrange the dataset first
# Here we'll arrange by disp within cyl
mtcars %>%
arrange(cyl, disp) %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
# Practicing data manipulation ----
# To make sure we fit it in, we'll take a few minutes
# to practice some of the data manipulation functions
# we've covered so far
# Install package babynames ----
# We'll be practicing using a dataset from the "babynames" package
# which is not currently installed in this room
# We'll install it now by going to the "Packages" pane,
# hitting the "Install" button and installing "babynames"
# Alternatively we could write install.packages("babynames")
# The current version of this package
# is version 1.0.0
packageVersion("babynames")
# Once babynames is installed, we'll load it and
# look at the help page for the "babynames" dataset
library(babynames)
?babynames
# This dataset has information on the number and proportion of
# given baby names each year from 1880-2017 for each sex (male, female)
# provided by the US Social Security Administration
# Only names with at least 5 uses are included
# There are five variables, shown below
glimpse(babynames)
head(babynames)
# Practice problem 1 ----
# Filtering and sorting
# What name was given to the largest number of babies
# in the year you were born?
# How many babies were given that name in 2017?
# Practice problem 2 ----
# Filtering, grouping, summarizing, n()
# Calculate the total number of baby names
# for each level of the sex variable
# in the year you were born and in 2017.
# Hint: To use filter() with multiple values
# you'll need %in% instead of ==.
# For example, if you wanted to filter
# to years 1980 and 2015 you'd use
# year %in% c(1980, 2015)
# as the condition in `filter()`.
# Part 2: Reshaping datasets ----
# Now we are going to switch gears and talk about
# *reshaping* datasets in R
# This involves changing datasets from a wide format to long format,
# and from long format to wide format.
# We are going to be working with the tidyr package,
# which uses a language involving "pivoting"
# When we pivot datasets "long", we change them from wide format to long format,
# so we take information stored in multiple columns and
# gather it all together into a single long column
# When we pivot datasets "wide", we change them from long format to wide format
# so we take information stored in multiple rows of a single column
# and put it into multiple columns
# This may sound confusing, but it should
# become clearer once we start practicing this
# When I think of reasons to reshape datasets for work in R,
# I think of going wide to long more often
# than going long to wide so we'll begin with that
# We loaded tidyr at the beginning of the workshop
# so we should be ready to go
# I'm going to create a "toy" dataset to work with
# By "toy" I mean a small dataset that doesn't involve any "real" data
# Being able to create toy datasets to try out functions is important,
# especially when your real dataset is very large and it will
# be difficult to troubleshoot any problems you run into
# Alternatively you can practice using functions on the example datasets
# from the package or from base R like the way
# we used mtcars in the first part of this workshop
# I'm going to make a "time series" dataset,
# where "individuals" were measured for some
# continuous response at multiple points in time
# We need a column for individuals,
# a column for some treatment that was applied,
# and columns holding values of some response
# collected at the different times
# This dataset will only have 6 rows
# Small datasets are good for practice!
# I'm skipping going through the steps, but here is the data
# There are 6 individuals but they were given the
# same names within each of the 2 treatments
# Note I didn't set a seed (see the help file for set.seed() ),
# so our datasets will all have slightly different numbers
( toy1 = data.frame(indiv = rep(1:3, times = 2),
trt = rep( c("a", "b"), each = 3),
time1 = rnorm(n = 6),
time2 = rnorm(n = 6),
time3 = rnorm(n = 6) ) )
# Take note that this dataset contains 18 values
# of quantitative info (6 rows, 3 columns of numbers)
# Wide to long ----
# If we did an analysis in R for a dataset like this
# we wouldn't want the three time periods in different columns
# Instead we want a single column that represents time of measurement
# and then a column with the quantitative measurements
# In short, we want to take a "wide" dataset and make it "long"
# We will "lengthen" the dataset
# using the function pivot_longer() on our data.frame
# In pivot_longer(), the first thing we do after defining the data
# is to choose the columns we want to combine
# We can use the select_helpers we used earlier for this
# Then we set the name of the new grouping column based
# on the column names with "names_to"
# Finally we set the name of new column of
# all the values with "values_to"
# Both column names must be done using strings,
# meaning the variable name in quotes
toy1 %>%
pivot_longer(cols = time1:time3,
names_to = "time",
values_to = "measurement")
# This dataset has 18 rows and a single column of values
# and so still contains our original 18 pieces of info
# We'd better name this object so we can use it for
# additional reshaping practice
# I use starts_with() this time in "cols"
toy1long = toy1 %>%
pivot_longer(cols = starts_with("time"),
names_to = "time",
values_to = "measurement")
# Long to wide ----
# Let's take the measurement column of our long format dataset
# back into the original format
# (i.e., "widen" it)
# You might do this if you had a data set in long format
# that needed to be in a wide format
# for use in a different software package
# We will use the pivot_wider() function for this
# After defining the dataset, we will use
# a pair of arguments to choose the
# column(s) that contain the variable that
# will be the new column names ("names_from")
# and the column(s) that contain the values
# we will fill the new columns with ("values_from")
# These are existing columns, so can be written using
# "bare" names (i.e., without quotes)
toy1long %>%
pivot_wider(names_from = time,
values_from = measurement)
# Multiple columns in "names_from" ----
# To take info from multiple columns for making
# a new, wider dataset, we can pass multiple
# names to "names_from"
# The new column names are separated by an underscore
# by default and the names are affected
# by the order we list the variables
toy1long %>%
pivot_wider(names_from = c(trt, time),
values_from = measurement)
# With 3 rows and 6 columns of values,
# we still have our 18 original pieces of information
# We can change the separator with "names_sep"
# We can change the names of the columns by changing
# the order we list them
toy1long %>%
pivot_wider(names_from = c(time, trt),
values_from = measurement,
names_sep = ".")
# Non-unique row identifiers ----
# If the rows in your dataset aren't uniquely identified,
# you will get warning messages in pivot_wider()
# For example, if we were trying to widen a dataset that
# only had the "trt" column but not the "indiv" column,
# our rows wouldn't be uniquely identified
# Let's remove the "indiv" column to try this out
toy1long %>%
select(-indiv)
# Look what happens if we try to widen this dataset
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement)
# In particular take note of the warning messages
# These give useful information about what is going on
# The output dataset looks odd because all three values
# for each trt and time were kept and put in a list
# It is likely we want to summarize over the multiple
# values, which we can do with the "values_fn" argument
# One of the messages was indicating this
# I'll calculate the mean of the values
# while widening into multiple columns
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement,
values_fn = list(measurement = mean) )
# Since we've summarized over some of the data,
# we now only have 6 pieces of info instead
# of the original 18
# Practice problem 3 ----
# Reshaping
# Let's practice reshaping before going on
# to the last topic of the workshop
# This will be based on the result of practice problem 2
# I didn't name the final object I made,
# so I'll remake it here with the name
# "numbaby_76_17"
# You should do the same with your final
# dataset from practice problem 2
numbaby_76_17 = babynames %>%
filter( year %in% c(1976, 2017) ) %>%
group_by(year, sex) %>%
summarise(n = n() ) %>%
ungroup()
numbaby_76_17
# First, reshape the dataset from practice problem 2
# to a wide format.
# Make a dataset with a separate column
# for each sex containing the
# number of baby names in a given year.
# Then reshape the dataset from practice problem 2
# to a different wide format.
# Make a dataset with a separate column
# for each year containing the
# number of baby names for a given sex
# Finally, practice putting the dataset
# back in the original format.
# Take the dataset that has sex as separate columns
# and put this back in the original format.
# Part 3: Joining datasets ----
# We haven't talked about merging yet,
# and I wanted to briefly show you an example
# There is a merge() function in base R, but we will be working with
# the join() functions from dplyr
# It isn't uncommon to have data from the same study in two different datasets,
# because of how/when the data were collected
# As long as there are unique identifiers
# of the unit that was measured in each dataset,
# we can easily join these together
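# Here's a minimal sketch of the idea with tiny made-up datasets
# (the object names "sketch_counts" and "sketch_envir" are just for
# this illustration), before we build slightly richer toy data below
sketch_counts = data.frame(plot = c("p1", "p2"), count = c(10, 4) )
sketch_envir = data.frame(plot = c("p1", "p2"), temp = c(12.1, 14.6) )
left_join(sketch_counts, sketch_envir, by = "plot")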
# Let's make two toy datasets that reflect such a situation
# The first dataset is the counts of some organism in
# each plot within a site
# The second dataset contains information about
# some environmental variable measured at each plot
# Plots are identified by the combination of "site" and "treat"
# (although it would be useful in real life
# to have a "plot" variable)
set.seed(16) # If we set the seed, we will all get the same random numbers generated
# This dataset is slightly unbalanced, as site 3
# doesn't have the "c" treatment count
( tojoin1 = data.frame(site = rep(1:3, each = 3, length.out = 8),