-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-results-6-model.Rmd
86 lines (76 loc) · 7.22 KB
/
03-results-6-model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
## Multiple Regression
To better understand if the collected survey data can be used as predictor variables, a linear model with treatment expenses per colony as target was used. Starting with the base model including only the reported number of colonies wintered (numerical, $\log_{10}$ transformed), which has demonstrated the most notable difference in the two level factor analysis between groups [Section \@ref(operational-size-single)]. Adding the two other operational questions 'Migratory Beekeeper' and 'Certified Organic Beekeeper' (Two levels: 'Yes', 'No') did lead to no noteworthy improvement and resulted in a higher BIC value. The base model plus treatment methods was the best model, as it showed the lowest AIC (`r round(r_model$fitted$AIC[5])`) and BIC (`r round(r_model$fitted$BIC[5])`) value and a $R^2$ value of 0.129. Adding states as additional covariate did improve the $R^2$ value (0.132) but showed higher AIC and BIC, indicating the penalization of the additional factors is higher than the gain and could lead to possible overfitting. The same is true for the addition of 'Survey Year', which did not increase the $R^2$ value (0.129) but also showed higher AIC and BIC values (Table \@ref(tab:model-table)).
```{r model-table, include=T}
c <- "Summary of model fit. The base model only included number of colonies wintered. The two questions migratory beekeeper and certified organic beekeeper were added as two level factor ('Yes', 'No'). The eleven treatment methods are binary encoded, excluded is drone brood removal. The covariates 'States' and 'Survey Year' were dummy encoded. Selection of best model, indicated as bold font, based on AIC and BIC value."
cn <- c("Model", "$R^2$", "adj. $R^2$", "df", "logLik", "AIC", "BIC")
tab <- kbl(
r_model$fitted %>%
mutate(
model = c("Intercept Only", "Base (Operation Size)", "Base $+$ Certified Organic", "Base $+$ Migratory Beekeeper", "Base $+$ Treatment Methods", "Base $+$ Treatment Methods $+$ States", "Base $+$ Treatment Methods $+$ Year"),
df = ifelse(is.na(df), 0, df),
across(logLik:BIC, ~ .x %>%
round() %>%
ft()),
) %>%
select(model, r.squared, adj.r.squared, df:BIC),
booktabs = TRUE,
col.names = cn,
digits = 3,
linesep = "",
align = c("l", rep("r", 9)),
caption = c,
escape = FALSE
) %>%
kableExtra::row_spec(5, bold = TRUE) %>%
kable_styling(latex_options = "scale_down")
tab
rm(cn, c, tab)
```
The residuals of the best model fit ('Base $+$ Treatment Methods') showed a normal distribution (Appendix Fig. \@ref(fig:model-qq-residuals)). The geometric mean intercept of the best model, which would represent a hypothetical beekeeper with one colony ($\log_{10} 1$) and no applied treatments is at EUR `r round(r_model$coeff_intercept$estimate_conv, 2)` (SE: `r round(r_model$coeff_intercept$error_conv, 2)`). All predictors are statistically significant except for the treatment method 'Lactic acid' (*p* = 0.5), which is used by 3% of the survey participants (Table \@ref(tab:model-best)). The coefficient direction is increasing for all except 'Another biotechnical method' decreases the estimated expenses by 3.7% (SE: 1.5%) if applied. The decrease for 'Number of colonies' is different to interpret as in this case both dependent and independent variables are log transformed, therefore the increase of colonies by 10% leads to a mean decrease of expenses by 1.9% (SE: 1.4%) (Fig. \@ref(fig:model-whisker)).
The predictive ability of the selected model ('Base $+$ Treatment Methods') is mediocre with a relative accuracy based on the correlation between observed and predicted in the training set of `r round(r_model$performance_accuracy_best$Accuracy*100, 1)`% (SE: `r round(r_model$performance_accuracy_best$SE*100)`%). The root mean square error (RMSE) for the training dataset (RMSE=`r r_model$prediction_stats %>% filter(type == "Training") %>% pull(.estimate) %>% round(., 2)`) and testing set (RMSE=`r r_model$prediction_stats %>% filter(type == "Testing") %>% pull(.estimate) %>% round(., 2)`) are similar, indicating a low probability of overfitting (Fig. \@ref(fig:model-pred)).
<!--
https://stats.stackexchange.com/questions/18480/interpretation-of-log-transformed-predictor-and-or-response
https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/
# Old one if we use geometric mean:
# I write geometric mean because back transformation of log transformed data is actually the same as the geometric mean
# abstract: https://www.researchgate.net/publication/5402553_The_logarithmic_transformation_and_the_geometric_mean_in_reporting_experimental_IgE_results_What_are_they_and_when_and_why_to_use_them
-->
```{r model-best, include=T}
c <- "Summary of the coefficients of the selected model and statistical results. Terms starting with 'T.' are treatment methods. Stars indicating significance are based on the \\textit{p}-value (*** < 0.001, ** < 0.01, * <= 0.05, n.s. (not significant) > 0.5). SE = Standard Error."
cn <- c("Term", "Estimate", "SE", "Statistic (t)", "\\textit{p}", "")
tab <- kable(
r_model$coeff %>%
mutate(
tname = str_replace(tname, "T.", "T."),
p.star = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value <= 0.5 ~ "*",
TRUE ~ "n.s."
),
p.value = ifelse(
p.value > 0.5,
base::format(round(p.value, 2), digits = 3, scientific = FALSE, nsmall = 1),
base::format(p.value, digits = 3, scientific = TRUE, nsmall = 1)
),
) %>%
select(tname, estimate:p.value, p.star) %>%
arrange(tname),
booktabs = TRUE,
col.names = cn,
digits = 3,
escape = FALSE,
align = c("l", rep("r", 4), "c"),
caption = c
) %>%
kable_styling(latex_options = "scale_down")
tab
rm(cn, c, tab)
```
(ref:model-whisker) Dot-and-whiskers plot showing regression coefficients for the selected model ('Base $+$ Treatment Methods'). The coefficients are ordered by the mean estimate value and direction, with highest positive at the top and lowest negative at the bottom. Dots are colour coded if regression direction is negative (green) or positive (red). Coefficients starting with 'T.' are treatment methods. The values are plotted to the mean increase or decrease of the estimate in percent, which means that if the treatment methods are applied they increase/decrease the value of the target by the shown percentage. In the case of 'Number of colonies' [$\log_{10}$] the effect of percentage change on the expenses is shown if the number of colonies increases by 10%.
```{r model-whisker, include=T, fig.cap="(ref:model-whisker)", out.width="80%"}
include_custom("output/figs/model-whisker.png")
```
```{r model-pred, include=T, fig.cap="Model prediction of expenses on training and testing dataset to the observed data from the survey. Orange and blue lines are the linear fit with standard error of the intercept only model without any predictors and selected best model ('Base $+$ Treatment Methods'). Red line represents the theoretical perfect fit between estimate and observed data.", out.width="80%"}
include_custom("output/figs/model-pred.png")
```