Merge branch 'main' into update/packages

UCL-ARC · Aug 6, 2024 · 6f78a86
2 parents aac035d + 53b1ebc
commit 6f78a86
Showing 1 changed file with 19 additions and 19 deletions.
diff --git a/episodes/23-statistics.Rmd b/episodes/23-statistics.Rmd
@@ -36,11 +36,11 @@ library(tidyverse)
 # loading data
 lon_dims_imd_2019 <- read.csv("../data/CDRC/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
 # Commenting out as not used in this version
-#library(lubridate)
-#library(gapminder)
+# library(lubridate)
+library(gapminder)
 # create a binary membership variable for europe (for later examples)
-#gapminder <- gapminder %>%
-#  mutate(european = continent == "Europe")
+gapminder <- gapminder %>%
+  mutate(european = continent == "Europe")
 ```
 
 We are going to use the data from the gapminder package.  We have added a variable *european* indicating whether a country is in Europe.
@@ -55,7 +55,7 @@ We are going to use the data from the gapminder package.  We have added a variab
 
 ## Descriptive and inferential statistics
 
-::: Background
+::: callout
 Just as data in general are of different types - for example numeric vs text data - statistical data are assigned to different *levels of measure*. The level of measure determines how we can describe and model the data.
 :::
 
@@ -73,7 +73,7 @@ How do we convey information on what your data looks like, using numbers or figu
 First establish the distribution of the data. You can visualise this with a histogram.
 
 ```{r}
-ggplot(lon_dims_imd_2019, aes(x=barriers_london_rank)) +
+ggplot(lon_dims_imd_2019, aes(x = barriers_london_rank)) +
   geom_histogram()
 ```
 
@@ -84,7 +84,7 @@ What is the distribution of this data?
 The raw values are difficult to visualise, so we can take the log of the values and plot those instead.  Try this command:
 
 ```{r include=TRUE}
-ggplot(lon_dims_imd_2019, aes(x=log(barriers_london_rank))) +
+ggplot(lon_dims_imd_2019, aes(x = log(barriers_london_rank))) +
   geom_histogram()
 ```
 
@@ -142,7 +142,7 @@ Get them to plot the graphs. Explain that we are generating random data from dif
 ### Calculating mean and standard deviation
 
 ```{r}
-mean(lon_dims_imd_2019$barriers_london_rank,na.rm=TRUE)
+mean(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)
 ```
 
 Calculate the standard deviation and confirm that it is the square root of the variance:
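
A minimal sketch of that check, using a small made-up vector `x` rather than the lesson's `lon_dims_imd_2019` data (the vector and its values are assumptions for illustration only):

```r
# Hypothetical example vector with one missing value (not from the lesson data)
x <- c(12, 15, 9, 20, NA, 17)

# Standard deviation and variance, ignoring the missing value
x_sd <- sd(x, na.rm = TRUE)
x_var <- var(x, na.rm = TRUE)

# The standard deviation should equal the square root of the variance
all.equal(x_sd, sqrt(x_var))
```

The same pattern should carry over to a data frame column such as `lon_dims_imd_2019$barriers_london_rank`, keeping `na.rm = TRUE` in both calls so they use the same observations.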
@@ -189,7 +189,7 @@ Contingency tables of frequencies can also be tabulated with **table()**. For ex
 
 ```{r}
 table(
-   lon_dims_imd_2019$la19nm, 
+  lon_dims_imd_2019$la19nm,
   lon_dims_imd_2019$IDAOP_london_decile
 )
 ```
@@ -275,12 +275,12 @@ Is the difference between heights statistically significant?
 
 ```{r}
 # Example to be changed
-#t.test(pop ~ european, data = gapminder)$statistic
-#t.test(pop ~ european, data = gapminder)$parameter
+# t.test(pop ~ european, data = gapminder)$statistic
+# t.test(pop ~ european, data = gapminder)$parameter
 ```
 
 Notice that the **summary()** of the test contains more data than is output by default.
- 
+
 
 Write a paragraph in markdown format reporting this test result, including the t-statistic, the degrees of freedom, the confidence interval and the p-value to 4 decimal places.  To do this, include your R code **inline** with your text, rather than in an R code chunk.
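
One possible way to pull out the values for such a paragraph is sketched below. It uses R's built-in `sleep` dataset as a stand-in, because the lesson's own t-test example is still marked "to be changed"; the object names (`res`, `t_stat`, `deg_free`, `ci`, `p_val`) are illustrative assumptions, not names from the lesson.

```r
# Welch two-sample t-test on the built-in `sleep` data (a stand-in, not the lesson data)
res <- t.test(extra ~ group, data = sleep)

# Pull out the pieces needed for the write-up, rounded to 4 decimal places
t_stat   <- round(unname(res$statistic), 4)  # t-statistic
deg_free <- round(unname(res$parameter), 4)  # degrees of freedom
ci       <- round(res$conf.int, 4)           # confidence interval (lower, upper)
p_val    <- round(res$p.value, 4)            # p-value

# In R Markdown prose, the same values can be embedded inline, for example:
#   t(`r deg_free`) = `r t_stat`, p = `r p_val`
```

Storing the test result once and reusing it avoids re-running `t.test()` for every reported value.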
 
@@ -296,14 +296,14 @@ Testing supported the rejection of the null hypothesis that there is no differen
 While the t-test is sufficient where there are two levels of the independent variable (IV), for situations where there are more than two, we use the **ANOVA** family of procedures. To show this, we will create a variable that subsets our data by *per capita GDP* levels. If the ANOVA result is statistically significant, we will use a post-hoc test method to do pairwise comparisons (here, Tukey's Honest Significant Differences).
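
The group-then-test workflow described here can be sketched on a small built-in dataset. The example below is an assumption-laden illustration, not the lesson's code: it uses `mtcars` in place of the lesson data, bins `hp` into quartile groups, and compares mean `mpg` between them. The names `cars`, `hp_breaks`, `hp_group` and `anova_model` are hypothetical.

```r
# Work on a copy of the built-in `mtcars` data (a stand-in, not the lesson data)
cars <- mtcars

# Quartile boundaries: minimum, 25%, 50%, 75%, maximum
hp_breaks <- quantile(cars$hp)

# Bin horsepower into four groups; include.lowest = TRUE keeps the minimum value,
# which would otherwise fall outside the first (left-open) interval and become NA
cars$hp_group <- cut(cars$hp, breaks = hp_breaks, include.lowest = TRUE, labels = FALSE)
cars$hp_group <- factor(cars$hp_group)

# One-way ANOVA: does mean mpg differ between the horsepower groups?
anova_model <- aov(mpg ~ hp_group, data = cars)
summary(anova_model)

# Post-hoc pairwise comparisons between groups
TukeyHSD(anova_model)
```

Computing the breaks with `quantile()` rather than typing them in by hand keeps the groups in step with the data if it changes.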
 
 ```{r}
-#quantile(gapminder$gdpPercap)
-#IQR(gapminder$gdpPercap)
+# quantile(gapminder$gdpPercap)
+# IQR(gapminder$gdpPercap)
 
-#gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)
+# gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)
 
-#gapminder$gdpGroup <- factor(gapminder$gdpGroup)
+# gapminder$gdpGroup <- factor(gapminder$gdpGroup)
 
-#anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
+# anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
 anovamodel <- aov(lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$la19nm)
 summary(anovamodel)
 
@@ -315,10 +315,10 @@ TukeyHSD(anovamodel)
 The most common use of regression modelling is to explore the relationship between two continuous variables, for example between `Income_london_rank` and `health_london_rank` in our data. We can first determine whether there is any significant correlation between the values, and if there is, plot the relationship.
 
 ```{r}
-#cor.test(gapminder$gdpPercap, gapminder$lifeExp)
+# cor.test(gapminder$gdpPercap, gapminder$lifeExp)
 cor.test(lon_dims_imd_2019$Income_london_rank, lon_dims_imd_2019$health_london_rank)
 
-#ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
+# ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
 ggplot(lon_dims_imd_2019, aes(Income_london_rank, health_london_rank)) +
   geom_point() +
   geom_smooth()