-
Notifications
You must be signed in to change notification settings - Fork 4
/
Clean Up.Rmd
76 lines (59 loc) · 2.52 KB
/
Clean Up.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
---
title: "Clean Up"
author: "Barry Davis"
date: "June 26, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, results = "hide")
```
## What's in the box?
cdns_sanitized.csv
: A master list of CDNs and their names
skillsets_sanitized.csv
: A master list of skillset IDs and their names
applications_sanitized.csv
: A master list of application IDs and their names
aggint_application.csv
: Aggregated statistics for applications in 15 minute intervals
aggint_skillset.csv
: Aggregated statistics for skillsets in 15 minute intervals
More files may be added later if additional metrics are needed.
## Environment
I load Tidyverse to give access to some common and useful packages. I also loaded my data sources.
```{r}
library(tidyverse)
mapps <- read.csv("applications_sanitized.csv", TRUE, ",")
mskills <- read.csv("skillsets_sanitized.csv", TRUE, ",")
mcdns <- read.csv("cdns_sanitized.csv", TRUE, ",")
apps <- read.csv("aggint_application.csv", TRUE, ",")
skills <- read.csv("aggint_skillset.csv", TRUE, ",")
```
## Clean Up Objective
The data sets to be used are rather clean as raw data. The application that generates these data sets does a very good job of not letting garbage into these particular sets. Most of the information in the aggregated sources are generated by the application and numerical in nature. The master lists contain names that are input by humans, but in this case, they are already lowercase in the data source.
The column names are also already relevant. While some are long, they accurately descsribe what the statistic represents. Shortening them further I feel will detract from their usefulness, especially when doing any type of larger calculations.
## What was done?
I checked for missing values, but none existed.
```{r}
grep("^$", mapps)
sum(is.na(mapps))
grep("^$", mskills)
sum(is.na(mskills))
grep("^$", mcdns)
sum(is.na(mcdns))
grep("^$", apps)
sum(is.na(apps))
sum(is.na(skills))
grep("^$", skills)
```
There was a field called statTimestamp in both the aggint_application.csv and aggint_skillset.csv which needed to be separated into a Date and Timestamp field.
```{r}
apps <- apps %>% separate(statTimestamp, c("Date", "Timestamp"), sep = "T")
skills <- skills %>% separate(statTimestamp, c("Date", "Timestamp"), sep = "T")
```
## Write the new files
Finally, I write the new files that were changed.
```{r}
write.table(apps, "aggint_application_refined.csv", row.names = FALSE)
write.table(skills, "agg_int_skillset_refined.csv", row.names = FALSE)
```