This section covers intermediate R concepts and is meant to be read after completing the beginnerConcepts tutorial and the beginnerTest exercise.
If you are working through this tutorial as part of the Geoscience 541: Paleobiology course, you must do the intermediateTest exercises and hand in your answers at the start of the next lab period.
- Subscripting and subsetting with logicals
- Rewriting elements using logical subscripts
- Direct subsetting with functionals
Perhaps the biggest benefit of [ ] notation is that we can perform complex subscripting operations inside the brackets. The most powerful of these uses the which( ) function, which finds the index (i.e., the position) of TRUE values in a logical array. In other words, which( ) is short for the question: which of the elements in this array are TRUE?
# Create a vector of logical values, where the first element and fifth element are TRUE
> MyLogical<-c(TRUE,FALSE,FALSE,FALSE,TRUE)
> MyLogical
[1] TRUE FALSE FALSE FALSE TRUE
# Use which to find which elements of MyLogical are TRUE
> which(MyLogical)
[1] 1 5
Now, you might ask, how does this help? Well, now that you have the index positions, you can reference those elements of the array directly.
# Display the elements of MyLogical that are TRUE
> MyLogical[which(MyLogical)]
[1] TRUE TRUE
This isn't very impressive since it is circular. We asked which elements had a value of TRUE, so of course the values of those elements are TRUE. But what if we don't start out with logical data?
# What if we want to find which elements of a vector are greater than 5, and what their values are?
> MyVector<-c(2,6,4,5,6,1,3,4,7,9,3)
> MyVector
[1] 2 6 4 5 6 1 3 4 7 9 3
# We merely need to convert our numeric data into logical using the appropriate logical operator.
> MyLogical<-MyVector > 5
> MyLogical
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
# Find which elements of MyLogical are TRUE and display them.
> MyVector[which(MyLogical)]
[1] 6 6 7 9
# We can do all of this in a single step.
> MyVector[which(MyVector > 5)]
[1] 6 6 7 9
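As a side note, R also accepts the logical vector itself inside [ ], so which( ) is not strictly required in the example above. The sketch below illustrates one practical difference between the two forms: which( ) skips NA values, while a bare logical subscript keeps them. The vector WithMissing is made up for illustration and is not part of the tutorial data.

```r
# The vector from the example above
MyVector <- c(2, 6, 4, 5, 6, 1, 3, 4, 7, 9, 3)

# Subscripting with the logical vector directly gives the same values as which( )
Direct <- MyVector[MyVector > 5]
Indexed <- MyVector[which(MyVector > 5)]
identical(Direct, Indexed)   # TRUE

# WithMissing is a made-up vector (an assumption for illustration) containing an NA
WithMissing <- c(2, NA, 7, 9)

# A bare logical subscript keeps the NA, because NA > 5 is itself NA
WithMissing[WithMissing > 5]          # NA 7 9

# which( ) only reports positions that are TRUE, so the NA is dropped
WithMissing[which(WithMissing > 5)]   # 7 9
```

If your data may contain missing values, wrapping the condition in which( ) is often the safer habit.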
You can combine logical statements using the & (and) and | (or) operators.
# Find numbers that are greater than 5 AND less than 9
> MyVector[which(MyVector > 5 & MyVector < 9)]
[1] 6 6 7
# Find numbers that are greater than 5 OR equal to 3
> MyVector[which(MyVector > 5 | MyVector == 3)]
[1] 6 6 3 7 9 3
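Two other tools are often used alongside & and |: the ! (not) operator, which flips TRUE and FALSE, and the %in% operator, which tests whether each element matches any value in a set. Neither is covered in the text above, so treat this as a supplementary sketch. The names NotBig and ThreeOrSix are chosen for this example only.

```r
MyVector <- c(2, 6, 4, 5, 6, 1, 3, 4, 7, 9, 3)

# ! negates a logical vector: find numbers that are NOT greater than 5
NotBig <- MyVector[which(!(MyVector > 5))]
NotBig   # 2 4 5 1 3 4 3

# %in% asks whether each element matches any value in a set:
# find numbers that are equal to 3 or 6
ThreeOrSix <- MyVector[which(MyVector %in% c(3, 6))]
ThreeOrSix   # 6 6 3 3

# The same result written with | for comparison
identical(ThreeOrSix, MyVector[which(MyVector == 3 | MyVector == 6)])   # TRUE
```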
The true power of which( ) doesn't become apparent until you want to start rewriting elements of a data object.
# Let's make a practice array
> MyArray<-array(data=c(5,6,4,5,5,6),dim=6)
> MyArray
[1] 5 6 4 5 5 6
# Let's imagine that we know the third element of the array, the number 4, should actually be a 7.
> MyArray[3]<-7
> MyArray
[1] 5 6 7 5 5 6
# Remember that you cannot mix types!
> MyArray[3]<-"seven"
> MyArray
[1] "5" "6" "seven" "5" "5" "6"
# R automatically coerced the numbers to characters.
That's fairly straightforward, but what if we want to overwrite multiple elements in an array?
# Create an array
> MyArray<-array(data=c(5,6,4,5,5,6),dim=6)
> MyArray
[1] 5 6 4 5 5 6
# Change all values of 5 to 9
> MyArray[which(MyArray==5)]<-9
> MyArray
[1] 9 6 4 9 9 6
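The same pattern works with combined conditions, and it is safe to use even when nothing matches. The following sketch reuses the array from above. The name NoMatch and the replacement values are arbitrary choices for illustration.

```r
MyArray <- array(data = c(5, 6, 4, 5, 5, 6), dim = 6)

# Overwrite every value that is less than 5 OR equal to 6 with 0
MyArray[which(MyArray < 5 | MyArray == 6)] <- 0
MyArray   # 5 0 0 5 5 0

# If nothing matches, which( ) returns an empty index and nothing is changed
NoMatch <- which(MyArray == 100)
length(NoMatch)   # 0
MyArray[NoMatch] <- 9
MyArray   # still 5 0 0 5 5 0

# The array is still numeric because we only assigned numbers
is.numeric(MyArray)   # TRUE
```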
We can also use logical subscripting on two- and three-dimensional arrays, but it can be somewhat more complicated. Let's take a look at the WorldPhones matrix. WorldPhones is an example dataset that comes preloaded with all versions of R.
Each row represents a different year. Each column represents a different world region. The value of each cell represents how many phones were present in that region that year. (The dataset's documentation records these counts in thousands; the examples here use the stored numbers as they are.)
# We can load any of R's example datasets using the data( ) function.
> data(WorldPhones)
# Check if WorldPhones is a matrix
> is(WorldPhones,"matrix")
[1] TRUE
# Take a gander at WorldPhones
> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
Let's say that we recently learned that because of an error in the original study, any value indicating less than 2,000 or greater than 77,000 phones is unreliable. To reflect our uncertainty, we want to change all values below 2,000 or above 77,000 to NA. NA is R's way of marking missing data.
# Let's attempt to do some logical subsetting with which() on the WorldPhones matrix.
> which(WorldPhones < 2000 | WorldPhones > 77000)
[1] 7 22 29 36 37 38 39 40 41 43 44 45 46 47 48 49
# Convert those values to NA
> WorldPhones[which(WorldPhones < 2000 | WorldPhones > 77000)]<-NA
> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 NA NA NA NA
1956 60423 29990 4708 2568 2366 NA NA
1957 64721 32510 5230 2695 2526 NA NA
1958 68484 35218 6662 2845 2691 NA NA
1959 71799 37598 6856 3000 2868 NA NA
1960 76036 40341 8220 3145 3054 NA NA
1961 NA 43173 9053 3338 3224 2005 NA
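The positions printed by which( ) above are single numbers because R stores a matrix column by column and counts cells in that order. If row and column positions would be easier to read, which( ) has an arr.ind argument that reports them instead. A brief sketch using a fresh copy of WorldPhones follows. The name Position is chosen for this example only.

```r
# Reload a clean copy of the dataset
data(WorldPhones)

# Linear position of the one cell greater than 77000 (counted down the columns)
which(WorldPhones > 77000)   # 7

# Ask for row and column positions instead
Position <- which(WorldPhones > 77000, arr.ind = TRUE)
Position
# The cell is in row 7 (the year 1961) and column 1 (N.Amer)

# Use the positions to look up the row and column names
rownames(WorldPhones)[Position[, "row"]]   # "1961"
colnames(WorldPhones)[Position[, "col"]]   # "N.Amer"
```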
What if we are only concerned with years where there were more than 118,000 phones worldwide? Logically, the first step would be to find the total number (sum) of phones for each year (row).
# Reload the data
> data(WorldPhones)
# Find the sum of the WorldPhones matrix.
> sum(WorldPhones)
[1] 805303
The sum( ) function doesn't give us what we want! It sums all elements in an object, not each row. What we need is a way to apply the sum( ) function to each individual row of the matrix. Luckily, there is an aptly named function, apply( ), that we can use.
# Find the sum of each row in WorldPhones with apply( )
> apply(WorldPhones,1,sum)
1951 1956 1957 1958 1959 1960 1961
74494 102199 110001 118399 124801 133709 141700
# The 1 tells apply( ) to work along the first dimension (rows). We could also use 2 for the second dimension (columns).
> apply(WorldPhones,2,sum)
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
467233 240404 43605 19406 18375 10388 5892
Notice that apply returned a vector of sums for each row or column. We can perform a logical operation on that vector, and use the logical vector to define the rows we want from the matrix.
# Isolate which rows have more than 118,000 phones
> which(apply(WorldPhones,1,sum) > 118000)
1958 1959 1960 1961
4 5 6 7
# Reference only those rows.
> WorldPhones[which(apply(WorldPhones,1,sum) > 118000), ]
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
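For the specific case of row or column totals, base R also provides rowSums( ) and colSums( ), which give the same totals as apply( ) with sum( ). The sketch below is an alternative way to write the steps above, not a replacement for apply( ), which works with any function. The name BusyYears is made up for illustration.

```r
data(WorldPhones)

# rowSums( ) gives the same totals as apply(WorldPhones, 1, sum)
all.equal(rowSums(WorldPhones), apply(WorldPhones, 1, sum))   # TRUE

# Keep only the rows (years) whose total is greater than 118000
BusyYears <- WorldPhones[which(rowSums(WorldPhones) > 118000), ]
rownames(BusyYears)   # "1958" "1959" "1960" "1961"
```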
The apply( ) function is a special type of function called a functional. Functionals are an extremely versatile and important tool in your aRsenal.
Each functional is characterized by two features: first, the kinds of objects that it will accept; second, the kind of object it returns. This second feature implicitly restricts the kinds of functions that you can usefully pass to the functional. For example, apply( ) returns a vector when the function you give it returns a single value for each row or column, as sum( ) does, and a matrix when the function returns several values of the same length each time. If the function returns results of differing lengths, apply( ) cannot arrange them into a vector or matrix and returns a list instead.
Functionals | Accepted Object | Returned Object | Example Formula |
---|---|---|---|
apply( ) | Array | Vector or Matrix | apply(object, dimension, function) |
sapply( ) | Vector or List | Vector | sapply(object, function) |
lapply( ) | Vector or List | List | lapply(object, function) |
You will learn more about using functionals in concert with more complex functions in the advancedConcepts.R tutorial. In the meantime, here are some useful functions that go well with these three basic functionals.
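To make the table more concrete, here is a small sketch of all three functionals used with common summary functions such as mean( ), max( ), and length( ). The list named Measurements is invented for this example and is not part of any dataset used in the tutorial.

```r
# A made-up list of numeric vectors of different lengths
Measurements <- list(first = c(1, 2, 3), second = c(10, 20), third = c(5, 5, 5, 5))

# sapply( ) applies a function to each list element and returns a vector
sapply(Measurements, mean)     # first 2, second 15, third 5
sapply(Measurements, length)   # first 3, second 2, third 4

# lapply( ) does the same work but returns a list
MaxList <- lapply(Measurements, max)
MaxList$second   # 20

# apply( ) works on the rows (1) or columns (2) of an array
data(WorldPhones)
apply(WorldPhones, 2, max)   # the largest value in each column
```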
Sometimes you can avoid logical subscripting altogether by subsetting your data directly with a functional. Let's load in the iris dataset, another (quite famous) example dataset that comes with all versions of R.
Iris is a data.frame consisting of sepal and petal measurements of different iris flowers. It is a data.frame because, in addition to the numeric measurements, there is a column of non-numeric data that denotes which of three different species the specimen belonged to: Iris setosa, Iris virginica, and Iris versicolor.
# Let's load in and take a look at the data
> data(iris)
# Take a look at the first five rows.
> iris[1:5,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
What if we wanted to find which species has the largest sepal length? Using what we've already learned, we could use which( ) to subset the dataset by species. We could then find the max( ) sepal length of each species. Let's try it.
# Subset the iris data.frame into three separate data frames by species using which( ).
# Beware of upper and lower case!
> Setosa<-iris[which(iris[,"Species"]=="setosa"),]
> Virginica<-iris[which(iris[,"Species"]=="virginica"),]
> Versicolor<-iris[which(iris[,"Species"]=="versicolor"),]
# Find the maximum of sepal length of each new data frame.
> max(Setosa[,"Sepal.Length"])
[1] 5.8
> max(Virginica[,"Sepal.Length"])
[1] 7.9
> max(Versicolor[,"Sepal.Length"])
[1] 7
Nice! We were able to figure out that a specimen of Iris virginica had the longest sepal length! But we can actually do this even faster using another functional called tapply( ).
The tapply( ) function takes a vector of values, splits it into groups according to a second vector of group labels (here, the Species column), and then applies a function to each group.
# Find the maximum sepal length of each species
> tapply(iris[,"Sepal.Length"],iris[,"Species"],max)
setosa versicolor virginica
5.8 7.0 7.9
Although this might seem like a trivial improvement at the moment, don't forget that you might be working on a data set with hundreds, thousands, or hundreds of thousands of species. Consider that the first approach needs two lines of code per species (one to subset and one to find the maximum), so a data frame with 100,000 species would take two hundred thousand lines of code. It only takes one line with tapply( )!
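Because tapply( ) returns a named result, it can be fed straight into other functions. As a sketch, which.max( ) (a base R function not introduced above) reports the position of the largest value, and names( ) recovers the species label. The names SepalMax and SepalMean are chosen for this example only.

```r
data(iris)

# Maximum sepal length of each species, as in the tapply( ) example
SepalMax <- tapply(iris[, "Sepal.Length"], iris[, "Species"], max)

# Which species has the largest maximum?
names(which.max(SepalMax))   # "virginica"

# Any function that summarizes a vector can be swapped in, for example mean( )
SepalMean <- tapply(iris[, "Sepal.Length"], iris[, "Species"], mean)
round(SepalMean, 3)   # setosa 5.006, versicolor 5.936, virginica 6.588
```

The code stays the same length no matter how many species the data frame contains.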
"The computer is the robot, you are the useR."
Automating boring and repetitive tasks is the whole reason that we invented computers in the first place! If it seems like an operation will take an inordinate amount of code, consider taking a break and formulating a more elegant solution. I assure you, one exists.
Don't worry, we will cover automatic repetition (iterating) in the advancedConcepts tutorial.