forked from kampalr/meetup_2
-
Notifications
You must be signed in to change notification settings - Fork 0
/
meetup_2.Rmd
266 lines (169 loc) · 10 KB
/
meetup_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
---
title: "Meetup 2 - Importing and Cleaning Data"
author: "Kampalr"
date: "February 24, 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = FALSE, warning = FALSE)
```
For this meetup, we shall be importing and cleaning data obtained from data.ug. We shall be looking at the [PLE Results by district 2010 - 2016 dataset](http://catalog.data.ug/dataset/uneb-results/resource/6e5885bd-2952-4086-aaac-0c2aa58ec762).
We shall levearge on the use of ***base R functions*** as these functions tend to be the most popular for new entrants into R. For our next session, we shall introduce members to [tidyverse](https://www.tidyverse.org/) that provides a collection of functions purposely customised for data science tasks.
### Importing the data
The file that we are using is a standard CSV file containing details on PLE results for the period 2010 - 2016.
```{r dataImport}
url <- "http://catalog.data.ug/dataset/1242faeb-b5a0-4ffb-8274-9c9975aad183/resource/6e5885bd-2952-4086-aaac-0c2aa58ec762/download/ple-results-by-district-2010-2016.csv"
# Make sure the "data" directory exists in the current project folder for holding
# the csv data file
if(!dir.exists('data')) dir.create('data')
# note: If on windows machine make sure to use method = "wininet" check out ?download.file() for different options.
download.file(url, destfile = "./data/data.csv", method="libcurl")
data <- read.csv('./data/data.csv', stringsAsFactors = FALSE)
head(data)
```
In the ideal case, we would have a document similar to a data dictionary that would help define what each column (variable) means and additional characteristics of these variables such as their expected types and value ranges. ***Unfortunately***, we were unable to find the corresponding document which may add some complexity when trying to understand the data without making a consultation with the data owner.
## Cleaning the data
We shall start with making the data readable from the column names and other variables and selecting variables we believe we need for our analysis.
```{r dataReadable}
# basis to understand my data.
dat <- data
head(dat)
str(dat)
names(dat)
dim(dat)
head(dat)
# specific columns that i need.
# names(df) or colnames(df)
dat <- dat[, (names(dat) %in% c("YEAR","DISTRICT","NUMBER.OF.SCHOOLS.UNEB.CENTRE","TOTAL.CANDIDATES",
"TOTAL.DIV.1", "TOTAL.DIV.2", "TOTAL.DIV.3", "TOTAL.U", "TOTAL.X",
"FEMALE.CANDIDATES", "MALE.CANDIDATES"))]
# clean column names remove . with regular expressions.
?gsub()
names(dat) <- gsub("\\.", "_", names(dat))
print(head(dat))
```
We won't attempt to completely clean the data set as this may take too much time but instead, we shall look at possible techniques to look at the data and invoke thinking on what approach could be taken to clean the dataset.
Let's first look at the spread of the data.
### 1. How many districts are there?
```{r districts}
districts <- readLines(file('districts.txt'))
districts
```
Looking at the [census data (2017)](http://www.ubos.org/onlinefiles/uploads/ubos/census_projections/District%20Single%20Years%20Final_17.07.2017.xls) from Uganda Bureau of Statistics, we can see that there are currently **`r length(districts)`** districts in Uganda.
What happens when we try to count the number of districts that are present in the PLE dataset?
```{r districtCount}
districts_PLE <- as.data.frame(table(dat$DISTRICT))
districts_PLE
```
We can observe that there are ***`r nrow(districts_PLE)`*** unique districts registered. By scrolling through the few rows, we notice a couple of things:
* There are duplicate districts (e.g. Abim, Adjumani, etc.). The districts have white spaces which prevent them from being recognised as the same.
* There are subcounties/municipalities listed as districts, e.g. Arua M/C, Arua Main, and Arua Mun.
Let's address the first issue and remove the white space and see whether our results will improve.
### Removing white space
```{r removeWhiteSpace}
dat$DISTRICT <- unlist(lapply(X = dat$DISTRICT, FUN = trimws))
# Using regexes or sub. sub("\\s+$", "", x) -- trailing or sub("^\\s+", "", x) --> leading.
# data$DISTRICT <- gsub('\\s$', '', data$DISTRICT)
# And let's test the new results
districts_PLE <- as.data.frame(table(dat$DISTRICT))
districts_PLE
```
As we can see, the duplicate districts have been knocked out bringing the unique number of districts down to ***`r nrow(districts_PLE)`***. Brilliant.
We can now tackle the second observation.
### Standardise districts
We do notice that each district got the REAL name then SPACE then an additional name like M/C or MUN e.g MBARARA M/C,IGANGA MUN. One other strategy we could use is removing anything after the first keyword. We could quickly do this with a for-loop and iterate over all the districts.
There are many type of loops, but today we will focus on the for loop. Here is the basic structure of a for loop:
```{ forloop}
for (variable in vector) {
# do something
}
```
```{r pickDistrictNamesOnly}
# reuse - main dataframe for an example.
# dat <- data
for(i in 1:length(dat$DISTRICT)){
dat$DISTRICT[i] <- strsplit(dat$DISTRICT[i], " ")[[1]][1]
}
print(tail(dat$DISTRICT, n = 100))
```
### Do we have any null (NA) values?
Let's check to see whether there are any null (NA) values for each variable:
```{r nullChecks}
# Let's apply an NA check on each column
na_cols <- apply(dat, 2, function(x) sum(is.na(x)))
# Let's print the result as a dataframe to ease the view.
na_df <- data.frame(Measure = names(na_cols), Value = na_cols, row.names = NULL)
# Let's order it in descending order
na_df <- na_df[order(-na_df$Value), ]
# Print out the summary
na_df
```
We observe that all observations have their "NUMBER_OF_SCHOOLS_UNEB_CENTRE" as blank (`r nrow(data)` observations). We cannot be sure whether this was a column added in error, whether this was an omission during the data import/export process or whether this was acutally not tallied. A further consult would be required with the data owner to find out why this is blank. However, for our example, we shall focus on maintaining these as **NIL/ZERO**.
We also notice an interesting observation. We observed that there are **`r na_df[na_df$Measure == 'FEMALE_TOTAL_DIV1', ]$Value`** female students that did not achieve a DIVISION 1 during this period. Let's see in which districts this phenomenon occurred.
```{r femaleDiv1}
# Obtain a subset of the data containing total candidate and female candidate info
female_attendance <- dat[is.na(dat$FEMALE_TOTAL_DIV1),
c('YEAR', 'DISTRICT', 'TOTAL_CANDIDATES', 'FEMALE_CANDIDATES')]
# Calculat the percentage of female candidates obtaining Division 1
female_attendance$FEMALE.ATTENDANCE.PERCENT <-
female_attendance$FEMALE.CANDIDATES / female_attendance$TOTAL.CANDIDATES * 100
# Print out the data
female_attendance
```
We will notice that these are in remote areas with low female attendance to the exams compared to the total number of candidates. Take note that as discovered during our data cleaning process, no females obtained a Division 1 over this period. In the long run of things, this could form part of a recommendation for follow-up on education with the girl child.
Now, back to the data cleaning process...
For our benefit, we shall mark all the NA values with **NIL/ZERO**. The missing values, as they are small, will not cause any significant variation in the overall trend of the data:
```{r zerofyNA}
# Replace all NA values with NIL/ZERO
dat[is.na(dat)] <- 0
```
Although we have standardized the districts, we have noticed that this has left some districts in the same year having two sets of observations. Ideally, we would want to find a way to merge these observations and in the best case, going back to the original data owner to better understand the approach on how this raw data was collected would provide the best guide on overcoming this challenge.
### Understanding data with simple visualizations
One of the most powerful functions of R is it's ability to produce a wide range of graphics to quickly and easily visualise data. Plots can be replicated, modified and even publishable with just a handful of basic functions.
```{r vizPlot}
# Using the base Plot.
plot(dat$FEMALE_CANDIDATES, dat$MALE_CANDIDATES)
```
Now is time to look at our data in the eyes of some basic visualizations. If we want to see the mean of FEMALE CANDIDATES.
```{r vizHist}
# Using the base Hist.
hist(dat$FEMALE_CANDIDATES, main = "Distribution of FEMALE CANDIDATES", breaks = 30, xlab = "count (n)", col = "red")
```
Understanding basic trends across the data. Are the first grades in Abim increasing by years? We can add a basic linear regression to the plot using abline. NB. we can also use lty, lwd, col here.
```{r vizLine}
dat$YEAR<- as.factor(dat$YEAR)
abim_dat <- dat[which(dat$DISTRICT=='ABIM'),]
# dim(newdata)
# Try "o" "p" "l" "b"
plot(as.numeric(abim_dat$YEAR), abim_dat$TOTAL_DIV_1, type ="l", col = "red" )
abline(lm (TOTAL_DIV_1 ~ YEAR, data = abim_dat), lty = "dashed")
```
Scatter plots with legends and we will specify colour as the years.
```{r vizScatter}
# add pch = 15
# bg = "blue"
# cex = 2
plot(dat$FEMALE_CANDIDATES, dat$MALE_CANDIDATES, col = dat$YEAR)
# levels(dat$YEAR)
legend(x = 500, y = 20000, legend = levels(dat$YEAR), col = c(1:3), pch = 16)
```
Let's examine the distribution of first grades for every year:
If you wish to compare the medians of the boxplot, you can use the function notch. If the notches of two plots do not overlap, this is 'strong evidence' that the two medians differ (see ?boxplot)
```{r vizBox}
#notch = T
#las = 1
boxplot(dat$TOTAL_DIV_1 ~ dat$YEAR)
```
Saving Plots.
```{r savePng}
# png
png("distribution_female.png", width = 500, height = 500, res = 72)
hist(dat$FEMALE_CANDIDATES, main = "Distribution of FEMALE CANDIDATES", breaks = 30, xlab = "count (n)", col = "red")
dev.off()
```
```{r savePdf}
# pdf
pdf("distribution_male.pdf")
hist(dat$MALE_CANDIDATES, main = "Distribution of MALE CANDIDATES", breaks = 30, xlab = "count (n)", col = "red")
dev.off()
```