This project develops visualisation and analysis skills by analysing geospatial datasets relating to human life and society. The tech stack includes R (tidyverse, ggplot2, plotly, xml2, rvest), SQLite, JavaScript (D3.js), HTML, and CSS.
The ultimate goal is to analyse multiple datasets from various sources and build a single-page website displaying some beautiful graphics with D3.js. I did this in a few parts:
- Cleaning, organising and merging the datasets.
- Exploratory analysis and visualisations with ggplot2.
- Final plots and website: JavaScript and D3.js.
You can find all preliminary analysis and work done with R in the `R-analysis` directory. In the `website` directory you'll find all work done with JavaScript and D3.js.
The tech stack will include the following components, which work together as follows:
- ExpressJS framework: server creation.
- SQL database: stores and manages the data.
- Nginx as a reverse proxy: avoids exposing the server directly and relays client requests.
- NodeJS server: runs the SQL database and the ExpressJS server.
- JavaScript: used on both the front end and the back end.
- SCSS: used for styling; preferred over CSS for its modularity.
- HTML: front-end markup.
- D3.js: visualisation library.
Preprocessing these datasets was a rather tedious matter; unfortunately they do not always come in a neat, ready-to-use format. Preprocessing includes merging, removing junk data, reformatting, and manipulating the data to fit the tidy structure. Note that the data often arrive as multiple `.csv` files, so reading them all in at once and working from a list is essential:
```r
# Read every .csv in the directory into a list, one data frame per file,
# then name each list element after its source file (without the extension)
fontes <- lapply(list.files("./data/united-states-of-america/per-county-votes-20-fontes/", full.names = TRUE), read.csv)
names(fontes) <- gsub("\\.csv$", "", list.files("./data/united-states-of-america/per-county-votes-20-fontes/"))
```
The trick is to list the files, pass the result to `read.csv` through `lapply`, and then name the objects in the list.
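As an illustration of the kind of clean-up these steps involve, here is a sketch built on that list; the helper name, the junk-column pattern and the dplyr/purrr calls are assumptions of mine, not the project's actual preprocessing code:

```r
library(dplyr)
library(purrr)

# Hypothetical clean-up: standardise names, drop junk columns, remove duplicates
clean_county_votes <- function(df) {
  df %>%
    rename_with(tolower) %>%             # consistent lower-case column names
    select(-starts_with("unnamed")) %>%  # drop junk index columns, if any
    distinct()                           # remove duplicated rows
}

# Apply the clean-up to every element of the named list, then bind the list
# into one tidy data frame; the list names become a "source" column.
tidy_fontes <- fontes %>%
  map(clean_county_votes) %>%
  bind_rows(.id = "source")
```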
Some of the aforementioned preprocessing techniques are demonstrated in the following example. For more information consult the preprocessing scripts.
The data was taken from the Gapminder website and the GitHub repository open-numbers/ddf--gapminder--fasttrack.
Unfortunately the datasets provided are quite messy and it is difficult to obtain the full dataset: most of the data is available in the linked repository, but some is only available by manual download from the website. Moreover, the data comes in wide format.
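To make the wide-versus-long distinction concrete, here is a toy example with made-up countries and values, using tidyr::pivot_longer rather than the custom reshaping function shown further down:

```r
library(tidyr)

# Wide format: one row per country, one column per year (values are made up)
wide <- data.frame(country = c("Sweden", "Norway"),
                   X2019 = c(10.2, 9.8),
                   X2020 = c(10.4, 9.9))

# Long format: one row per country-year pair
long <- pivot_longer(wide,
                     cols = starts_with("X"),
                     names_to = "year",
                     names_prefix = "X",
                     values_to = "value")
```

This long shape is the tidy structure referred to above.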
Reshaping the data into long format was done by the `clean-data.Rmd` script. There I use a list object to read in all the data files, name the list elements, and run a custom function that reshapes each data frame and uses the file name as the name of its value column. First the data is read and put into long format; the result is held in a list object:
```r
# Assumes the tidyverse (tibble, dplyr, magrittr) and reshape2 are loaded.
reshape_manual_data <- function(x) {
  # Reshape one wide data frame (countries as rows, years as columns)
  # into long format: one row per country-year pair.
  shift_long <- function(df) {
    column_to_rownames(df, "country") %>%
      t() %>%
      as.data.frame() %>%
      rownames_to_column("year") %>%
      mutate(year = gsub("X", "", year)) %>%  # strip the "X" prefix read.csv adds to year columns
      reshape2::melt(id.vars = "year")
  }
  shifted <- lapply(x, shift_long)
  # Rename each value column after the file the data came from
  for (i in seq_along(x)) {
    colnames(shifted[[i]])[3] <- names(x)[i]
  }
  return(shifted)
}
```
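A toy usage sketch of this function (the indicator names and values are made up, and the tidyverse and reshape2 are assumed to be loaded):

```r
# Two made-up wide indicator tables, named as the source files would be
toy <- list(
  gdp_per_capita = data.frame(country = c("Sweden", "Norway"),
                              X2019 = c(55000, 81000),
                              X2020 = c(53000, 78000)),
  life_expectancy = data.frame(country = c("Sweden", "Norway"),
                               X2019 = c(83.1, 83.0),
                               X2020 = c(83.2, 83.3))
)

toy_long <- reshape_manual_data(toy)
# Each element is now long: toy_long$gdp_per_capita has the columns
# year, variable (holding the country names) and gdp_per_capita.
```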
The reshaped list is then merged to produce a single dataset; NAs are inserted where necessary to keep all rows.
```r
# Merge all reshaped data frames pairwise, keeping unmatched rows (filled with NA)
data$manual <- Reduce(function(...) {
  merge(..., all = TRUE)
}, reshape_manual_data(data$manual))
```
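In case the Reduce idiom is unfamiliar: it applies the two-argument merge repeatedly across the list, so for the two toy indicators above it is equivalent to a single call (a sketch, not the project's code):

```r
# merge() joins on the shared "year" and "variable" columns;
# all = TRUE keeps rows present in only one table, filling the gaps with NA
merged_toy <- merge(toy_long$gdp_per_capita, toy_long$life_expectancy, all = TRUE)
```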
Finally, all the data is saved to the SQLite database found in the `sql` directory. I then use this database with JavaScript and D3.js to produce the website.
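A minimal sketch of this save step, assuming the DBI and RSQLite packages; the database file name and table name below are placeholders rather than the ones actually used in the `sql` directory:

```r
library(DBI)
library(RSQLite)

# Connect to (or create) the SQLite database file
con <- dbConnect(SQLite(), "sql/gapminder.sqlite")

# Write the merged dataset to a table, replacing any previous version
dbWriteTable(con, "manual", data$manual, overwrite = TRUE)

dbDisconnect(con)
```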
Data sources:
- Tables downloaded from the Gapminder website: https://www.gapminder.org/data/
- https://github.com/open-numbers/ddf--gapminder--fasttrack