Merge branch 'main' into update/packages
milanmlft authored Aug 6, 2024
2 parents aac035d + 53b1ebc commit 6f78a86
Showing 1 changed file with 19 additions and 19 deletions.
38 changes: 19 additions & 19 deletions episodes/23-statistics.Rmd
@@ -36,11 +36,11 @@ library(tidyverse)
# loading data
lon_dims_imd_2019 <- read.csv("../data/CDRC/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
# Commenting out as not used in this version
-#library(lubridate)
-#library(gapminder)
+# library(lubridate)
+library(gapminder)
# create a binary membership variable for europe (for later examples)
-#gapminder <- gapminder %>%
-# mutate(european = continent == "Europe")
+gapminder <- gapminder %>%
+mutate(european = continent == "Europe")
```

We are going to use the data from the gapminder package. We have added a variable *European* indicating if a country is in Europe.
@@ -55,7 +55,7 @@ We are going to use the data from the gapminder package. We have added a variab

## Descriptive and inferential statistics

-::: Background
+::: callout
Just as data in general are of different types - for example numeric vs text data - statistical data are assigned to different *levels of measure*. The level of measure determines how we can describe and model the data.
:::

@@ -73,7 +73,7 @@ How do we convey information on what your data looks like, using numbers or figu
First establish the distribution of the data. You can visualise this with a histogram.

```{r}
-ggplot(lon_dims_imd_2019, aes(x=barriers_london_rank)) +
+ggplot(lon_dims_imd_2019, aes(x = barriers_london_rank)) +
geom_histogram()
```

@@ -84,7 +84,7 @@ What is the distribution of this data?
The raw values are difficult to visualise, so we can take the log of the values and log those. Try this command

```{r include=TRUE}
-ggplot(lon_dims_imd_2019, aes(x=log(barriers_london_rank))) +
+ggplot(lon_dims_imd_2019, aes(x = log(barriers_london_rank))) +
geom_histogram()
```

@@ -142,7 +142,7 @@ Get them to plot the graphs. Explain that we are generating random data from dif
### Calculating mean and standard deviation

```{r}
-mean(lon_dims_imd_2019$barriers_london_rank,na.rm=TRUE)
+mean(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)
```

Calculate the standard deviation and confirm that it is the square root of the variance:
@@ -189,7 +189,7 @@ Contingency tables of frequencies can also be tabulated with **table()**. For ex

```{r}
table(
-lon_dims_imd_2019$la19nm,
+lon_dims_imd_2019$la19nm,
lon_dims_imd_2019$IDAOP_london_decile
)
```
@@ -275,12 +275,12 @@ Is the difference between heights statistically significant?

```{r}
# Example to be changed
-#t.test(pop ~ european, data = gapminder)$statistic
-#t.test(pop ~ european, data = gapminder)$parameter
+# t.test(pop ~ european, data = gapminder)$statistic
+# t.test(pop ~ european, data = gapminder)$parameter
```

Notice that the summary()** of the test contains more data than is output by default.
-
+
Write a paragraph in markdown format reporting this test result including the t-statistic, the degrees of freedom, the confidence interval and the p-value to 4 places. To do this include your r code **inline** with your text, rather than in an R code chunk.

@@ -296,14 +296,14 @@ Testing supported the rejection of the null hypothesis that there is no differen
While the t-test is sufficient where there are two levels of the IV, for situations where there are more than two, we use the **ANOVA** family of procedures. To show this, we will create a variable that subsets our data by *per capita GDP* levels. If the ANOVA result is statistically significant, we will use a post-hoc test method to do pairwise comparisons (here Tukey's Honest Significant Differences.)

```{r}
-#quantile(gapminder$gdpPercap)
-#IQR(gapminder$gdpPercap)
+# quantile(gapminder$gdpPercap)
+# IQR(gapminder$gdpPercap)
-#gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)
+# gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)
-#gapminder$gdpGroup <- factor(gapminder$gdpGroup)
+# gapminder$gdpGroup <- factor(gapminder$gdpGroup)
-#anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
+# anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
anovamodel <- aov(lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$la19nm)
summary(anovamodel)
@@ -315,10 +315,10 @@ TukeyHSD(anovamodel)
The most common use of regression modelling is to explore the relationship between two continuous variables, for example between `Income_london_rank` and `health_london_rank` in our data. We can first determine whether there is any significant correlation between the values, and if there is, plot the relationship.

```{r}
-#cor.test(gapminder$gdpPercap, gapminder$lifeExp)
+# cor.test(gapminder$gdpPercap, gapminder$lifeExp)
cor.test(lon_dims_imd_2019$Income_london_rank, lon_dims_imd_2019$health_london_rank)
-#ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
+# ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
ggplot(lon_dims_imd_2019, aes(Income_london_rank, health_london_rank)) +
geom_point() +
geom_smooth()