Merge pull request #18 from UCL-ARC/linear-reg-broom
Linear reg and broom content and exercises
Showing 5 changed files with 1,012 additions and 6 deletions.

---
title: "Linear Regression and Broom"
teaching: 80
exercises: 20
source: Rmd
---

::::::::::::::::::::::::::::::::::::::: objectives

- To be able to explore relationships between variables
- To be able to calculate predicted values and residuals
- To be able to construct linear regression models
- To be able to present model outcomes using Broom

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How can I explore relationships between variables in my data?
- How can I present model outputs in an easier-to-read way?

::::::::::::::::::::::::::::::::::::::::::::::::::

## Content

- Linear Regression Models
- Use of Log transform
- Use of Categorical Variables
- Use of Broom

## Data

```{r libraries, message=FALSE, warning=FALSE}
# We will need these libraries and this data later.
library(ggplot2)
library(tidyverse)
library(lmtest)
library(sandwich)
library(broom)
lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
```

We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased).

Attribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1

The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We will explore the relationships between the different dimensions of the Indices of Multiple Deprivation.

## Linear Regression

Linear regression enables us to explore the linear relationship between the dependent variable Y and the independent variable(s) X.
We are going to explore the linear relationship between the Health Deprivation and Disability Domain and the Living Environment Deprivation Domain.

The Health Deprivation and Disability Domain measures the risk of premature death and the impairment of quality of life through poor physical or mental health. The domain measures morbidity, disability and premature mortality but not aspects of behaviour or environment that may be predictive of future health deprivation.

The Living Environment Deprivation Domain measures the quality of the local environment. The indicators fall into two sub-domains. The ‘indoors’ living environment measures the quality of housing, while the ‘outdoors’ living environment contains measures of air quality and road traffic accidents.

Reference:
McLennan, David et al. The English Indices of Deprivation 2019: Technical Report. Ministry of Housing, Communities and Local Government, 2019. Print.

### Simple Linear Regression

In the simple linear regression example we have only one dependent variable (health_london_rank) and one independent variable (livingEnv_london_rank).

```{r}
reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
# We put the dependent variable to the left of the '~' and the independent variable(s) to the right,
# and we tell R which dataset we are referring to.
summary(reg_LivEnv_health)
```

From the result of this analysis, we can see that the Living Environment Deprivation Domain rank has a significant (small p-value, general rule of thumb < 0.05) and positive relationship (positive coefficient) with the Health Deprivation and Disability Domain rank.

One way of interpreting the result is: a one unit increase in the Living Environment rank is associated with an increase of around 0.343 (3.430e-01) points in the Health Deprivation and Disability rank.

R-squared shows the amount of variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adjusted R-squared (6.205%) shows the same as R-squared but adjusted for the number of cases and the number of variables. When the number of variables is small and the number of cases is very large, adjusted R-squared is closer to R-squared.
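
If you want to extract these quantities from the fitted model rather than reading them off the printed summary, a minimal sketch:

```{r}
# the estimated coefficients of the fitted model
coef(reg_LivEnv_health)
# R-squared and adjusted R-squared, taken from the model summary
summary(reg_LivEnv_health)$r.squared
summary(reg_LivEnv_health)$adj.r.squared
```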

### Log transform

If your data is skewed, it can be useful to transform a variable to its log form when doing the regression.
You can either transform the variable beforehand or do so in the equation.

```{r}
reg_logbarriers_health <- lm(health_london_rank ~ log(barriers_london_rank), data = lon_dims_imd_2019)
summary(reg_logbarriers_health)
```
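
As a sketch of the "transform the variable beforehand" route (using a hypothetical column name, log_barriers_rank), the following gives the same fit:

```{r}
# create the log-transformed column first, then use it directly in the formula
lon_dims_imd_2019 <- lon_dims_imd_2019 %>%
  mutate(log_barriers_rank = log(barriers_london_rank))
reg_logbarriers_health2 <- lm(health_london_rank ~ log_barriers_rank, data = lon_dims_imd_2019)
summary(reg_logbarriers_health2)
```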

The interpretation of a log-transformed variable is a bit different.
In this example only the predictor variable is log-transformed, so to interpret the slope coefficient we divide it by 100 (2917.0 / 100 = 29.17): a 1% increase in the Barriers to Housing and Services rank is associated with an increase of about 29.17 points in the Health Deprivation and Disability rank.

If the dependent/response variable alone is log-transformed, exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198, and exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say, 0.84 is the same as decreasing the number by 1 - 0.84 = 0.16, or 16%.

If both variables are log-transformed, interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: for every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 - 1) * 100 = 3.7 percent.
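
These calculations are easy to check in R; a quick sketch of the arithmetic used in the two examples above:

```{r}
# dependent variable log-transformed: exponentiate the coefficient
exp(0.198)
# both variables log-transformed: effect of a 20% increase in the independent variable
(1.20^0.198 - 1) * 100
```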

### Predicted values and Residuals

We can expand our simple linear regression example to incorporate the Barriers to Housing and Services Domain rank.
The Barriers to Housing and Services Domain measures the physical and financial accessibility of housing and local services. The indicators fall into two sub-domains: ‘geographical barriers’, which relate to the physical proximity of local services, and ‘wider barriers’, which includes issues relating to access to housing, such as affordability.

```{r}
reg_LivEnv_barriers_health <- lm(
  health_london_rank ~ livingEnv_london_rank + barriers_london_rank,
  data = lon_dims_imd_2019
)
summary(reg_LivEnv_barriers_health)
```

After running the regression model, we can access the model's predicted values and the residuals compared to the real observations.

```{r}
# first we get the predicted (fitted) values
health_rank_pred <- fitted(reg_LivEnv_barriers_health)
health_rank_pred <- as.data.frame(health_rank_pred)
# now we add the residual values too
health_rank_resid <- residuals(reg_LivEnv_barriers_health)
health_rank_pred$resid <- health_rank_resid
# We can then view the predictions and residuals
head(health_rank_pred)
```

```{r, eval=FALSE}
# You can view the full data in RStudio with the View() function
View(health_rank_pred)
```
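
A common way to inspect these values is to plot the residuals against the fitted values. Here is a minimal sketch using the data frame we just built (model diagnostics are covered in more depth in the Linear Regression Diagnostics episode):

```{r}
# plot residuals against fitted values from the two-predictor model
ggplot(health_rank_pred, aes(x = health_rank_pred, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted health rank", y = "Residual")
```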

### Robust Regression

We can run regressions with robust standard errors (to control for heteroskedasticity, meaning unequal variances):

```{r}
reg_LivEnv_barriers_health$robse <- vcovHC(reg_LivEnv_barriers_health, type = "HC1")
coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$robse)
```

In addition, we can compute cluster-robust standard errors for the regression (here clustering observations by borough, la19nm):

```{r}
# cluster-robust standard errors, clustering on borough (la19nm) as an illustrative choice
reg_LivEnv_barriers_health$clse <- vcovCL(reg_LivEnv_barriers_health, cluster = lon_dims_imd_2019$la19nm)
coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse)
```

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 1

Use the `gapminder` data to create a linear model between two continuous variables.

Discuss your question and your findings.

::::::::::::::::::::::::::::::::::::::::::::::::::
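
One possible approach is sketched below; it assumes the gapminder data has been loaded (for example via the gapminder package), which is not done in this episode.

```{r, eval=FALSE}
# example question: is life expectancy related to (log) GDP per capita?
reg_lifeExp_gdp <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
summary(reg_lifeExp_gdp)
```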

### Regression with Categorical independent variables

We will explore the use of categorical independent variables in linear regression in this episode.
When the dependent variable is a categorical variable, you may consider alternatives to linear regression such as logit regression and multinomial regression.

```{r}
# As a categorical variable we have added la19nm, which contains the names of the London boroughs
reg_cat_var <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019)
summary(reg_cat_var)
```

R automatically recognizes la19nm as a factor and treats it accordingly.
The borough missing from the coefficient summary (Barking and Dagenham) is treated as the baseline, so its value is 0.
However, we can also modify our model to show coefficients for all boroughs:

```{r}
reg_cat_var_showall <- lm(
  health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm,
  data = lon_dims_imd_2019
)
summary(reg_cat_var_showall)
```
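
If you would rather keep an intercept but change which borough acts as the baseline, you can relevel the factor. A sketch, assuming "Westminster" is one of the la19nm values:

```{r, eval=FALSE}
# choose the reference (baseline) borough explicitly with relevel()
reg_cat_var_westminster <- lm(
  health_london_rank ~ livingEnv_london_rank + barriers_london_rank
    + relevel(factor(la19nm), ref = "Westminster"),
  data = lon_dims_imd_2019
)
summary(reg_cat_var_westminster)
```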

### Categorical variables with interaction terms

Sometimes we are interested in how a variable interacts with another variable.
We can explore any interactions between location (la19nm) and the living environment and barriers ranks.

```{r}
reg_cat_var_int <- lm(health_london_rank ~ la19nm * (livingEnv_london_rank + barriers_london_rank), data = lon_dims_imd_2019)
summary(reg_cat_var_int)
```

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 2

Use the `gapminder` data to create a linear model between a categorical and a continuous variable.

Discuss your question and your findings.

::::::::::::::::::::::::::::::::::::::::::::::::::
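
Again as a sketch only, assuming the gapminder data is loaded:

```{r, eval=FALSE}
# example question: does life expectancy differ by continent (a categorical variable)?
reg_lifeExp_continent <- lm(lifeExp ~ continent, data = gapminder)
summary(reg_lifeExp_continent)
```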

## Broom

The 'broom' package offers an alternative way of presenting the output of statistical analysis.
It centers around three S3 methods, each of which takes common objects produced by R statistical functions (lm, t.test, nls, etc.) and converts them into a tibble.

These are:

* tidy: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
* augment: adds columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.
* glance: constructs a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model.

Let's revisit our initial linear model:

```{r}
reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
summary(reg_LivEnv_health)
```

There is a lot of useful information here, but it is not available in a form that you can easily combine with other models or use for further analysis.
We can convert it to tabular data using the 'tidy' function.

```{r}
tidy(reg_LivEnv_health)
```

The row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $).
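
For example, a quick sketch of pulling individual columns out of the tidy output:

```{r}
# store the tidy output and access its columns like any other tibble
tidy_reg <- tidy(reg_LivEnv_health)
tidy_reg$term
tidy_reg$estimate
```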

Information about the model can be explored with 'augment'. The function augments the original data with information from the model,
such as the fitted values and residuals for each of the original points in the regression.

```{r}
augment(reg_LivEnv_health)
```

Some of the data presented by 'augment' will be discussed in the episode Linear Regression Diagnostics.

Summary statistics computed for the entire regression, such as R^2 and the F-statistic, can be accessed with the 'glance' function:

```{r}
glance(reg_LivEnv_health)
```
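
Because 'glance' always returns exactly one row, the output from several models can be stacked and compared. A minimal sketch using the two models fitted earlier in this episode:

```{r}
# compare the simple and the two-predictor models side by side
bind_rows(
  simple    = glance(reg_LivEnv_health),
  two_preds = glance(reg_LivEnv_barriers_health),
  .id = "model"
)
```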

### Generalised linear models

We can also use the 'broom' functions to present data from generalised linear and non-linear models.
For example, we might want to explore the Income Rank in relation to whether or not an area is within the City of London.

```{r}
# add a variable to indicate whether or not an area is within the City of London
lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London")
# create a Generalised Linear Model
glmlondims <- glm(city ~ Income_london_rank, lon_dims_imd_2019, family = "binomial")
summary(glmlondims)
```

Use of 'tidy':

```{r}
tidy(glmlondims)
```

Use of 'augment':

```{r}
augment(glmlondims)
```

Use of 'glance':

```{r}
glance(glmlondims)
```

You will notice that the statistics computed by 'glance' are different for glm objects than for lm (e.g. deviance rather than R^2).
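
As one sketch of how these columns can be used: for a binary logistic regression like this one, McFadden's pseudo-R-squared can be computed from the deviance and null deviance reported by 'glance':

```{r}
# McFadden's pseudo-R-squared: 1 - (residual deviance / null deviance)
glance(glmlondims) %>%
  mutate(pseudo_r2 = 1 - deviance / null.deviance)
```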

### Hypothesis Testing

The tidy function can also be applied to a range of hypothesis tests, such as the built-in functions t.test, cor.test, and wilcox.test.

### t-test

```{r}
tt <- t.test(Income_london_rank ~ city, lon_dims_imd_2019)
tidy(tt)
```

Some cases might have fewer columns (for example, no confidence interval).

Wilcoxon test:

```{r}
wt <- wilcox.test(Income_london_rank ~ city, lon_dims_imd_2019)
tidy(wt)
```

Since the 'tidy' output is already only one row, glance returns the same output:

```{r}
# t-test
glance(tt)
# Wilcoxon test
glance(wt)
```

The chisq.test enables us to investigate whether changes in one categorical variable are related to changes in another categorical variable.

The 'augment' method is defined only for chi-squared tests, since there is no meaningful sense, for other tests,
in which a hypothesis test produces output about each initial data point.

```{r}
# convert IDAOP_london_decile to a factor so it is not interpreted as continuous data
lon_dims_imd_2019$IDAOP_london_decile <- factor(lon_dims_imd_2019$IDAOP_london_decile)
# xtabs creates a frequency table of IMD deciles within London boroughs
chit <- chisq.test(xtabs(~ la19nm + IDAOP_london_decile, data = lon_dims_imd_2019))
tidy(chit)
```

```{r}
augment(chit)
```

There are a number of underlying assumptions of a Chi-Square test:

* Independence: The Chi-Square test assumes that the observations in the data are independent of each other. This means that the outcome of one observation should not influence the outcome of another.
* Random Sampling: The data should be obtained through random sampling to ensure that it is representative of the population from which it was drawn.
* Expected Frequency: The Chi-Square test assumes that the expected frequency count for each cell in the contingency table should be at least 5. If this assumption is not met, the test results may not be reliable.

As we have received a warning about the reliability of our test, it is likely that one of these assumptions has not been met,
and that this is not a suitable test for this data.
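
We can check the expected-frequency assumption directly, because the object returned by chisq.test() stores the expected counts; a quick sketch:

```{r}
# proportion of cells in the contingency table with expected counts below 5, and the smallest expected count
mean(chit$expected < 5)
min(chit$expected)
```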

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 3

Use broom to amend the display of your model outputs.

Which function(s) did you use and why?

::::::::::::::::::::::::::::::::::::::::::::::::::
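
As a sketch of one possible answer, using a model fitted earlier in this episode: tidy() gives the coefficient table and glance() the overall fit statistics.

```{r, eval=FALSE}
# coefficients as a tibble
tidy(reg_cat_var)
# one-row model-level summary
glance(reg_cat_var)
```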

### Conventions

There are some conventions that enable consistency across the broom functions:

* The output of the tidy, augment and glance functions is always a tibble.
* The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates).
* Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time).