analysis.Rmd

---
title: "Analysis of German publication output"
author: "Anne Hobert"
date: "3/6/2020"
output:
  github_document: default
  html_document:
    fig_caption: yes
    keep_md: true
    urlcolor: blue
  word_document:
    fig_caption: yes
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  echo = FALSE,
  fig.width = 6,
  fig.asp = 0.618,
  out.width = "90%",
  fig.align = "center",
  dpi = 600,
  dev= c("png", "cairo_ps")
)
```

```{r}
library(tidyverse)
library(cowplot)
library(colorblindr)
library(scales)
library(viridis)
```

## Data

In this document we describe the analysis of our sample of publications from German research institutions. We work with the dataset `pubs_cat` generated in [data_gathering.Rmd](data_gathering.Rmd).

```{r}
pubs_cat <- readr::read_csv("data/pubs_cat.csv", col_types = "dccdcclc")
pubs_cat <- pubs_cat %>% 
  mutate(sec_abbr = case_when(
            sector == "Hochschulen" ~ "UNI",
            sector == "Helmholtz-Gemeinschaft" ~ "HGF",
            sector == "Max-Planck-Gesellschaft" ~ "MPS",
            sector == "Leibniz-Gemeinschaft" ~ "WGL",
            sector == "Fraunhofer-Gesellschaft" ~ "FhS",
            sector == "Ressortforschung" ~ "GRA"
          ),
        sector = case_when(
            sector == "Hochschulen" ~ "Universities (UNI)",
            sector == "Helmholtz-Gemeinschaft" ~ "Helmholtz Association (HGF)",
            sector == "Max-Planck-Gesellschaft" ~ "Max Planck Society (MPS)",
            sector == "Leibniz-Gemeinschaft" ~ "Leibniz Association (WGL)",
            sector == "Fraunhofer-Gesellschaft" ~ "Fraunhofer Society (FhS)",
            sector == "Ressortforschung" ~ "Government Research\n Agencies (GRA)"
        ),
        oa_category = factor(
           oa_category,
           levels = c(
              "full_oa_journal",
              "other_oa_journal",
              "opendoar_inst",
              "opendoar_subject",
              "opendoar_other",
              "other_repo",
              "not_oa"
          )
        )
    )
```

## Exclude institutions from the university sector that are not listed in official statistics

```{r}
excl_from_analysis <- readr::read_csv("data/exclude_from_analysis.csv")
pubs_cat <- pubs_cat %>% 
  anti_join(excl_from_analysis, by = c("INST_NAME" = "Universitäten"))
rm(excl_from_analysis)
```


## Investigation of research questions

The goal is to answer the following three research questions:

1) Has the OA fraction of the publication output of German universities and research institutions increased constantly over time?
2) Can we observe differences between the research sectors of the German science system? Are there obvious explanations for this (like different missions or subject profiles)?
3) Which OA type is the most prevalent OA approach, and can we identify different patterns of OA adoption?

### OA fraction of the German publication output
 
```{r}
pubs_oa_year <- pubs_cat %>%
  mutate(oa_category = fct_collapse(
    oa_category,
    is_oa = c(
      "full_oa_journal",
      "other_oa_journal",
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    ),
    not_oa = "not_oa"
  ))  %>%
  mutate(PUBYEAR = lubridate::ymd(paste0(PUBYEAR, "-01-01"))) %>% 
  group_by(PUBYEAR) %>% 
  mutate(n_total_year = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  group_by(PUBYEAR, oa_category, n_total_year) %>%
  summarise(number_of_articles = n_distinct(PK_ITEMS)) %>% 
  ungroup()
```
 
 
First, we look at how the overall OA share developed over time. The following figure displays the number of publications associated with one of the German research institutions we considered and highlights the part that is freely accessible online according to Unpaywall over the period from 2010 to 2018. The total number of articles over the whole period is `r sum(pubs_oa_year$number_of_articles)`, with an overall OA share of `r round(sum(pubs_oa_year %>% filter(oa_category == "is_oa") %>% .$number_of_articles)/sum(pubs_oa_year$number_of_articles)*100)` %.

 
```{r fig2, fig.cap="Open access to journal articles from German research institutions according to Unpaywall. Blue area represents journal articles with at least one freely available full-text, grey area represents toll-access articles."}
ggplot(pubs_oa_year, aes(x = PUBYEAR, y = number_of_articles)) +
    geom_area(aes(fill = fct_rev(oa_category), group = fct_rev(oa_category)),  alpha = 0.8, colour = "white") +
    scale_fill_manual(
      values = c("#cccccca0", "#56b4e9"),
      name = NULL,
      labels = c("Closed", "Open Access")
    ) +
    scale_y_continuous(
      labels = scales::number_format(big.mark = ","),
      expand = expansion(mult = c(0, 0.05)),
      breaks =  scales::extended_breaks()(0:110000)
      ) + 
    labs(x = "Publication Year", y = "Total Articles") +
    theme_minimal_hgrid() +
    theme(legend.position = "top",
          legend.justification = "right")
```

As can be seen, both the total number of articles and the part that is OA increase steadily over time. The number of articles that are not openly available is quite stable: it increases slowly from `r pubs_cat %>% filter(oa_category == "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2010) %>% .$n` in 2010 to `r pubs_cat %>% filter(oa_category == "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2013) %>% .$n` in 2013 and then decreases continuously to `r pubs_cat %>% filter(oa_category == "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2018) %>% .$n` publications in 2018. Since the number of OA articles increases continuously from `r pubs_cat %>% filter(oa_category != "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2010) %>% .$n` publications in 2010 to `r pubs_cat %>% filter(oa_category != "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2018) %>% .$n` in 2018, the proportion of OA articles rises considerably from `r round(pubs_cat %>% filter(oa_category != "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2010) %>% .$n/pubs_cat %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2010) %>% .$n, 4)*100` % in 2010 to `r round(pubs_cat %>% filter(oa_category != "not_oa") %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2018) %>% .$n/pubs_cat %>% group_by(PUBYEAR) %>% summarise(n = n_distinct(PK_ITEMS)) %>% filter(PUBYEAR == 2018) %>% .$n, 4)*100` % in 2018.
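
The yearly counts and shares quoted in the paragraph above could also be derived from a single summary table instead of repeating the pipeline for each number. The following is a minimal sketch under stated assumptions: `toy_pubs` is made-up stand-in data with the same column names as `pubs_cat`, and `oa_by_year` and `oa_stat()` are hypothetical names that do not exist elsewhere in this analysis.

```r
library(dplyr)

# Toy stand-in for pubs_cat (hypothetical values): one row per article/category pair
toy_pubs <- tibble::tribble(
  ~PK_ITEMS, ~PUBYEAR, ~oa_category,
  1, 2010, "not_oa",
  2, 2010, "full_oa_journal",
  2, 2010, "opendoar_inst",    # article 2 is OA via two routes but counted once
  3, 2018, "other_repo",
  4, 2018, "opendoar_subject",
  5, 2018, "not_oa"
)

# Count distinct articles per year once, split by OA status
oa_by_year <- toy_pubs %>%
  group_by(PUBYEAR) %>%
  summarise(
    n_total  = n_distinct(PK_ITEMS),
    n_oa     = n_distinct(PK_ITEMS[oa_category != "not_oa"]),
    n_not_oa = n_distinct(PK_ITEMS[oa_category == "not_oa"])
  ) %>%
  mutate(oa_share_pct = round(n_oa / n_total, 4) * 100)

# Hypothetical lookup helper: returns one value for one year
oa_stat <- function(year, column) {
  oa_by_year[[column]][oa_by_year$PUBYEAR == year]
}
```

With such a table, an inline expression would only need a call like `oa_stat(2018, "oa_share_pct")`, which keeps the prose readable and ensures that all reported numbers come from the same computation.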

### Differences between sectors

To investigate the role the different sectors play in OA publishing in Germany and how they contribute to the overall OA share, we display the development over time of the number of OA articles for each sector in the following figure. Note that the scales of the y-axes differ, since the total publication output varies considerably among sectors.

```{r fig3, fig.asp=1, fig.cap="Open access to journal articles per sector according to Unpaywall. Blue area represents journal articles with at least one freely available full-text, grey area represents toll-access articles. Sectors are ordered by publication output with the highest output top left and lowest at the bottom. Note that scales for the `y-axes` are not the same, since the total publication output varies significantly among sectors."}
pubs_cat %>%
  mutate(oa_category = fct_collapse(
    oa_category,
    is_oa = c(
      "full_oa_journal",
      "other_oa_journal",
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    ),
    not_oa = "not_oa"
  ))  %>%
  group_by(sector) %>%
  mutate(n_total = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  group_by(sector, PUBYEAR, oa_category, n_total) %>%
  summarise(number_of_articles = n_distinct(PK_ITEMS)) %>%
  ggplot(aes(x = PUBYEAR, y = number_of_articles)) +
    geom_area(aes(fill = fct_rev(oa_category),
                  group = fct_rev(oa_category)),
              alpha = 0.8,
              colour = "white") +
    scale_fill_manual(
      values = c("#cccccca0", "#56b4e9"),
      name = NULL,
      labels = c("Closed", "Open Access")
    ) +
    facet_wrap( ~ fct_rev(fct_reorder(sector, n_total)),
                scales = "free",
                ncol = 2) +
    scale_y_continuous(
      labels = scales::number_format(big.mark = ","),
      expand = expansion(mult = c(0, 0.05)),
      breaks =  scales::extended_breaks()
      ) + 
    labs(x = "Publication Year", y = "Total Articles") +
    theme_minimal_hgrid() +
    # bold facet names
   # theme(strip.text = element_text(face="bold")) +
     theme(legend.position = "top",
           legend.justification = "right")
```

```{r}
## exclusion for inst_level analysis
exclude_from_inst_analysis <- readr::read_csv("data/exclude_from_inst_level_analysis.csv")
```


To investigate the variability of OA publishing within the sectors, we now go one level deeper and examine the OA shares of individual institutions, grouped by the sector they belong to. We only include institutions with a publication output of at least 100 publications in the observed period of 9 years. Of the `r pubs_cat %>% summarise(n = n_distinct(INST_NAME)) %>% .$n` institutions in total, `r pubs_cat %>% anti_join(exclude_from_inst_analysis, by = c("INST_NAME" = "NAME")) %>% group_by(INST_NAME) %>% mutate(n_total = n_distinct(PK_ITEMS)) %>% ungroup() %>% filter(n_total >= 100) %>% summarise(n = n_distinct(INST_NAME)) %>% .$n` fulfill this condition. This means that in the following institution-specific analyses, `r pubs_cat %>% left_join(exclude_from_inst_analysis, by = c("INST_NAME" = "NAME")) %>% group_by(INST_NAME) %>% mutate(n_total = n_distinct(PK_ITEMS)) %>% ungroup() %>% filter(n_total < 100 | !is.na(PK_KB_INST)) %>% summarise(n = n_distinct(INST_NAME)) %>% .$n` institutions, or `r pubs_cat %>% left_join(exclude_from_inst_analysis, by = c("INST_NAME" = "NAME")) %>% group_by(INST_NAME) %>% mutate(n_total = n_distinct(PK_ITEMS)) %>% ungroup() %>% filter(n_total < 100 | !is.na(PK_KB_INST)) %>% summarise(n = n_distinct(PK_ITEMS)) %>% .$n` articles, are not considered. For the remaining institutions, we first calculate the individual OA shares.

```{r}
oa_shares_inst <- pubs_cat %>%
  anti_join(exclude_from_inst_analysis, by = c("INST_NAME" = "NAME")) %>% 
  mutate(oa_category = fct_collapse(
    oa_category,
    is_oa = c(
      "full_oa_journal",
      "other_oa_journal",
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    ),
    not_oa = "not_oa"
  )) %>%
  group_by(INST_NAME) %>%
  mutate(n_total = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  group_by(INST_NAME, oa_category, n_total) %>%
  summarise(n_cat = n_distinct(PK_ITEMS)) %>%
  pivot_wider(
    names_from = oa_category,
    values_from = n_cat,
    values_fill = list(n_cat = 0)
  ) %>%
  mutate(oa_share = is_oa / n_total) %>%
  rename(n_oa = is_oa, n_not_oa = not_oa) %>%
  ungroup()

oa_shares_inst_sector <- pubs_cat %>%
  anti_join(exclude_from_inst_analysis, by = c("INST_NAME" = "NAME")) %>% 
  mutate(oa_category = fct_collapse(
    oa_category,
    journal_oa = c("full_oa_journal", "other_oa_journal"),
    repo_oa = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    ),
    not_oa = "not_oa"
  )) %>%
  group_by(sector) %>%
  mutate(n_total_sec = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  left_join(oa_shares_inst, by = 'INST_NAME') %>%
  group_by(sector, INST_NAME, n_total, oa_share, n_total_sec) %>%
  summarise() %>% 
  mutate(inst_label = case_when(
    INST_NAME == "Universität Konstanz" ~ "Konstanz",
    INST_NAME == "Deutsches Elektronen-Synchrotron" ~ "DESY",
    INST_NAME == "Forschungszentrum Jülich GmbH (FZJ)" ~ "FZJ",
    INST_NAME == "Helmholtz-Zentrum Dresden-Rossendorf (HZDR)" ~ "HZDR",
    INST_NAME == "Leibniz-Institut für Astrophysik Potsdam" ~ "AIP",
    INST_NAME == "Kiepenheuer-Institut für Sonnenphysik (KIS)" ~ "KIS",
    INST_NAME == "Robert Koch-Institut" ~ "RKI",
    INST_NAME == "Deutscher Wetterdienst" ~ "DWD",
    INST_NAME == "Fraunhofer-Institut für Zelltherapie und Immunologie" ~ "IZI",
    TRUE ~ ""
  ))

oa_shares_inst_sector_stats <- oa_shares_inst_sector %>% 
  ungroup() %>% 
  filter(n_total >= 100) %>% 
  group_by(sector) %>% 
  summarise(mean_oa_share = mean(oa_share), 
            median_oa_share = median(oa_share),
            mean_pub_volume = mean(n_total),
            median_pub_volume = median(n_total),
            sd_oa_share = sd(oa_share),
            sd_pub_volume = sd(n_total)
            )
```
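
The institution-level OA share computed above can be illustrated on a small example. This is only a sketch on made-up data: `toy_inst`, `toy_shares`, `toy_kept` and `min_output` are hypothetical names, and the example assumes that the categories have already been collapsed to `is_oa` and `not_oa`, so that every article falls into exactly one of the two.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: institution "A" has three articles, "B" has two
toy_inst <- tibble::tribble(
  ~INST_NAME, ~PK_ITEMS, ~oa_status,
  "A", 1, "is_oa",
  "A", 2, "is_oa",
  "A", 3, "not_oa",
  "B", 4, "not_oa",
  "B", 5, "not_oa"
)

toy_shares <- toy_inst %>%
  group_by(INST_NAME, oa_status) %>%
  summarise(n_cat = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  # institutions without any OA article get an explicit zero instead of NA
  pivot_wider(names_from = oa_status, values_from = n_cat,
              values_fill = list(n_cat = 0)) %>%
  mutate(n_total = is_oa + not_oa,
         oa_share = is_oa / n_total)

# The analysis uses a threshold of 100; it is lowered here to fit the toy data
min_output <- 3
toy_kept <- filter(toy_shares, n_total >= min_output)
```

The `values_fill` argument matters for institutions such as "B": without it, the missing `is_oa` count would be `NA` and the OA share could not be computed.
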
The following figure shows, for each sector, a scatterplot of the institutions' OA shares over the whole time period against their publication output.

```{r fig4, fig.asp=1, fig.cap="Open Access shares of research institutions in Germany with respect to their total publication output grouped by the sector they belong to. Only institutions with at least 100 publications are shown. Blue points correspond to single institutions, gray lines are obtained by linear regression within the sector, gray areas are pointwise symmetric 95% t-distribution confidence bands. Scales of the x-axes vary across subplots in order to adapt to the different publication volumes. Dashed lines show the median value per sector for the OA share (red) and the total number of publications (orange)."}
point_shapes <- oa_shares_inst_sector %>% 
  filter(n_total >= 100) %>% 
  mutate(point_shape = ifelse(inst_label =="", 19, 15)) %>% 
  .$point_shape
point_sizes <- oa_shares_inst_sector %>% 
  filter(n_total >= 100) %>% 
  mutate(point_size = ifelse(inst_label =="", 1.5, 2)) %>% 
  .$point_size
point_color <- oa_shares_inst_sector %>% 
  filter(n_total >= 100) %>% 
  mutate(point_color = ifelse(inst_label =="", "#56b4e9", "#490206")) %>% 
  .$point_color
oa_shares_inst_sector %>%
  filter(n_total >= 100) %>%
  left_join(oa_shares_inst_sector_stats, by = "sector") %>%
  ggplot(aes(x = n_total, y = oa_share, label = INST_NAME)) +
    geom_point(color = point_color, alpha = .7, shape = point_shapes, size = point_sizes)  +
    scale_x_log10(labels = scales::number_format(big.mark = ","),
                  expand = expansion(mult = c(0.05, 0.1))) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 5L),
                       expand = expansion(mult = c(0, 0.05)),
                       limits = c(0,1)) +
    geom_smooth(color = "#999999a0", method = "lm") +
    facet_wrap(~ fct_rev(fct_reorder(sector, n_total_sec)),
               ncol = 2,
               scales = "free_x") +
    geom_hline(aes(yintercept = median_oa_share),
               colour = "#d55e00", linetype ="dashed", size = 1) +
    geom_vline(aes(xintercept = median_pub_volume),
               colour = "#E69F00", linetype ="dashed", size = 1) +
  ggrepel::geom_text_repel(aes(label = inst_label),
                              size = 3,
                              box.padding = 0.15,
                              point.padding = 0.15,
                              segment.color = "transparent",
                              color = "#666666",
                              force = 3
                              )+
    labs(x = "Total Articles (logarithmic scale)", y = "OA percentage") +
    theme_minimal_grid() +
    theme(legend.position = "none") +
    # bold facet labels
 #   theme(strip.text = element_text(face = "bold"))+
  theme(axis.text=element_text(size=10))
```
The most striking observations from this figure are the high OA shares of most of the Max Planck and Helmholtz institutes and the very low OA fractions of almost all of the government research agencies as well as the institutes of the Fraunhofer Society. The universities and the Leibniz Association have many institutions with OA shares close to one half. We can also clearly see that the universities have by far the largest publication volumes, followed by the Helmholtz Association. The linear trend of higher publication volume going along with higher OA shares is most distinct for the university sector (narrowest confidence bands).
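
Because `geom_smooth(method = "lm")` is combined with `scale_x_log10()`, the regression lines are fitted on the transformed axis, that is, the model is linear in the logarithm of the publication volume. A minimal sketch of the equivalent model for a single sector, using made-up values (`toy_sector`, `fit` and `band` are hypothetical names):

```r
# Hypothetical institution-level values for one sector
toy_sector <- data.frame(
  n_total  = c(120, 450, 1300, 5200, 21000),
  oa_share = c(0.31, 0.36, 0.42, 0.45, 0.52)
)

# Linear model on the log10-transformed publication volume
fit <- lm(oa_share ~ log10(n_total), data = toy_sector)

# Pointwise 95 % confidence band at the observed volumes
band <- predict(fit, newdata = toy_sector,
                interval = "confidence", level = 0.95)
```

The slope of such a model describes the change in OA share per tenfold increase in publication volume, and the width of `band` corresponds to the gray areas in the figure.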


The following box plot quantifies the variability of OA shares within sectors observed above.

```{r, echo=FALSE, eval=FALSE}
## some functions for inferring significance
get_ci_lims <- function(sec_data, grouped = FALSE){
  if(grouped){
    seccats <- unique(sec_data$sector_cat)
    ci_lims_groups <- tibble(sector_cat = seccats, ci_lower = rep(0, 4), ci_upper = rep(1, 4)) %>% 
      pivot_longer(cols = starts_with("ci_"), names_to = "lim", values_to = "val") %>%
      pivot_wider(names_from = sector_cat, values_from = val)
      for(i in 1:4){
        bstats <- sec_data %>% 
          filter(sector_cat == seccats[i]) %>% 
          .$oa_share %>% 
          boxplot.stats()
        ci_lims_groups[, i+1] <- bstats$conf
      }
    ci_lims_groups <- ci_lims_groups %>% 
      pivot_longer(cols = c(2:5), names_to = "sector_cat", values_to = "val") %>% 
      pivot_wider(names_from = lim, values_from = val) %>% 
      mutate(sector_cat = factor(
        sector_cat,
        levels = c("Universities",
                   "Research-oriented",
                   "Diverse missions",
                   "Practise-oriented"
                   )
      ))
    return(ci_lims_groups)
  } else {
    secnames <- unique(sec_data$sec_abbr)
    ci_lims <- tibble(sec_abbr = secnames, ci_lower = rep(0, 6), ci_upper = rep(1, 6)) %>% 
      pivot_longer(cols = starts_with("ci_"), names_to = "lim", values_to = "val") %>%
      pivot_wider(names_from = sec_abbr, values_from = val)
    for(i in 1:6){
      bstats <- sec_data %>% 
        filter(sec_abbr == secnames[i]) %>% 
        .$oa_share %>% 
        boxplot.stats()
      ci_lims[, i+1] <- bstats$conf
    }
    ci_lims <- ci_lims %>%
      pivot_longer(cols = c(2:7), names_to = "sec_abbr", values_to = "val") %>% 
      pivot_wider(names_from = lim, values_from = val) %>% 
      mutate(
        sector_cat = case_when(
          sec_abbr == "UNI" ~ "Universities",
          sec_abbr %in% c("MPS", "HGF") ~ "Research-oriented",
          sec_abbr == "WGL" ~ "Diverse missions",
          sec_abbr %in% c("FhS", "GRA") ~ "Practise-oriented"
        )
      ) %>%
      mutate(sector_cat = factor(
        sector_cat,
        levels = c("Universities",
                   "Research-oriented",
                   "Diverse missions",
                   "Practise-oriented"
                   )
      ))
    return(ci_lims)
  }
}

which_significant <- function(ci_lims, grouped = FALSE){
  if(grouped){
    ci_lowers <- matrix(ci_lims$ci_lower, nrow = 4, ncol = 4, dimnames = list(ci_lims$sector_cat, c(1:4)))
    ci_uppers <- matrix(ci_lims$ci_upper, nrow = 4, ncol = 4, byrow = TRUE, dimnames = list(c(1:4), ci_lims$sector_cat))
    ci_diff_signs <- sign(ci_lowers - ci_uppers)
    colnames(ci_diff_signs) <- rownames(ci_diff_signs)
    ci_signf_mat <- ci_diff_signs * t(ci_diff_signs) < 0
    list_of_signf <- which(ci_signf_mat * upper.tri(ci_signf_mat) > 0, arr.ind = TRUE, useNames = FALSE)
    list_of_signf[, 1] <- rownames(ci_signf_mat)[list_of_signf[, 1 ]]
    list_of_signf[, 2] <- colnames(ci_signf_mat)[as.numeric(list_of_signf[, 2 ])]
    list_of_signf[, 1] <- paste0(list_of_signf[, 1], ", ", list_of_signf[, 2])
    return(list(res_matrix = ci_signf_mat, res_list = list_of_signf[, 1]))
    
  } else{
    ci_lowers <- matrix(ci_lims$ci_lower, nrow = 6, ncol = 6, dimnames = list(ci_lims$sec_abbr, c(1:6)))
    ci_uppers <- matrix(ci_lims$ci_upper, nrow = 6, ncol = 6, byrow = TRUE, dimnames = list(c(1:6), ci_lims$sec_abbr))
    ci_diff_signs <- sign(ci_lowers - ci_uppers)
    colnames(ci_diff_signs) <- rownames(ci_diff_signs)
    ci_signf_mat <- ci_diff_signs * t(ci_diff_signs) < 0
    list_of_signf <- which(ci_signf_mat * upper.tri(ci_signf_mat) > 0, arr.ind = TRUE, useNames = FALSE)
    list_of_signf[, 1] <- rownames(ci_signf_mat)[list_of_signf[, 1 ]]
    list_of_signf[, 2] <- colnames(ci_signf_mat)[as.numeric(list_of_signf[, 2 ])]
    list_of_signf[, 1] <- paste0(list_of_signf[, 1], ", ", list_of_signf[, 2])
    return(list(res_matrix = ci_signf_mat, res_list = list_of_signf[, 1]))
  }
}
```
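
The helper functions above compare the notch limits returned by `boxplot.stats()` between sectors. The underlying idea can be sketched for two groups as follows. The vectors and the function name `notches_differ()` are hypothetical and only serve as illustration.

```r
# Hypothetical OA shares of the institutions in two sectors
shares_high <- c(0.55, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75)
shares_low  <- c(0.10, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30)

# TRUE if the notch intervals around the two medians do not overlap
notches_differ <- function(x, y) {
  cx <- boxplot.stats(x)$conf  # lower and upper notch limit for x
  cy <- boxplot.stats(y)$conf  # lower and upper notch limit for y
  cx[1] > cy[2] || cy[1] > cx[2]
}

notches_differ(shares_high, shares_low)
```

The notch limits are the values that `geom_boxplot(notch = TRUE)` draws, so this check mirrors what can be read off the box plot visually.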


```{r fig5, fig.cap="OA shares of German research institutions per sector. The color of the boxes groups sectors into universities with a typically high total journal publication output, research-oriented institutes with a medium journal publication output and practise-oriented institutions with a comparatively low journal publication output. Gray points display the OA shares for individual institutions. Notches indicate approximate 95 % confidence intervals for the median values. Non-overlapping notches are a strong indication that the median values differ significantly."}
oa_shares_inst_sec_boxplot <- pubs_cat %>%
  anti_join(exclude_from_inst_analysis, by = c("INST_NAME" = "NAME")) %>% 
  group_by(sector) %>%
  mutate(
    n_sector = n_distinct(PK_ITEMS),
    sector_cat = case_when(
      sec_abbr == "UNI" ~ "Universities",
      sec_abbr %in% c("MPS", "HGF") ~ "Research-oriented",
      sec_abbr == "WGL" ~ "Diverse missions",
      sec_abbr %in% c("FhS", "GRA") ~ "Practise-oriented"
    )
  ) %>%
  mutate(sector_cat = factor(
    sector_cat,
    levels = c("Universities",
               "Research-oriented",
               "Diverse missions",
               "Practise-oriented"
               )
  )) %>%
  ungroup() %>%
  left_join(oa_shares_inst, by = 'INST_NAME') %>%
  filter(n_total >= 100) %>% 
  group_by(INST_NAME, sector, oa_share, sector_cat, n_sector, sec_abbr) %>%
  summarise()

# ci_lims <- get_ci_lims(oa_shares_inst_sec_boxplot)
# ci_lims_groups <- get_ci_lims(oa_shares_inst_sec_boxplot, grouped = TRUE)
# 
# sign_diffs <- which_significant(ci_lims)$res_list
# sign_diffs_groups <- which_significant(ci_lims_groups, grouped = TRUE)$res_list

ggplot(data = oa_shares_inst_sec_boxplot, aes(x = fct_rev(fct_reorder(sec_abbr, n_sector)), y = oa_share)) +
  geom_boxplot(aes(color = sector_cat), varwidth = FALSE, size = 1, notch = TRUE) +
  geom_jitter(data = filter(oa_shares_inst_sec_boxplot, !INST_NAME %in% c(
    "Universität Konstanz",
    "Fraunhofer-Institut für Zelltherapie und Immunologie",
    "Max-Planck-Institut für ethnologische Forschung",
    "Max-Planck-Institut zur Erforschung multireligiöser und multiethnischer Gesellschaften"
  )),
  alpha = 0.1) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 5L),
                     expand = expansion(mult = c(0, 0.05)),
                     limits = c(0, 1)) +
  # geom_hline(data = ci_lims, aes(yintercept = ci_lower, color = sector_cat), linetype = "dashed", size = 0.5)+
  # geom_hline(data = ci_lims, aes(yintercept = ci_upper, color = sector_cat), linetype = "dotted", size = 0.5)+
  scale_color_manual(
    values = c("#684747", "#f68f46ff", "#a65c85ff", "#051461")
  ) +
  labs(
    x = "",
    y = "OA percentage",
    color = ""
  )  +
  theme_minimal_hgrid()+
  guides(color=guide_legend(nrow=2)) +
 theme(legend.position = "top",
     legend.justification = "right") 
```

Significance: HGF against all non-research-oriented sectors, MPS against all non-research-oriented sectors, FhS against all others.

Significance groups: research-oriented against all others, practise-oriented against all others; diverse missions not well separated from universities (maybe the `sector_cat` classification should be an `inst_cat`, i.e. on the level of institutions; this would have to be done manually, at least for some sectors).

### Prevalences of OA categories

As mentioned in the previous chapters, there are several ways of providing open access to scientific journal articles. In this section, we investigate the prevalence of the most widespread OA routes: Green OA and Gold OA. As described in the methodology section (see Table 1), we further subdivide these two main categories according to whether the journal is fully OA (Gold OA) and according to deposition in disciplinary, institutional, or OpenDOAR-listed repositories (Green OA). Note that, as mentioned before, the OA categories are not exclusive: an article may be counted in several categories, and the numbers do not necessarily sum up to the total number of articles published. As a preliminary step, we therefore illustrate the most common combinations of OA categories in our dataset.
```{r}
cat_overlap <- pubs_cat %>%
  filter(oa_category != "not_oa") %>%
  select(PK_ITEMS, oa_category) %>%
  distinct() %>%
  arrange(PK_ITEMS, oa_category) %>%
  aggregate(oa_category ~ PK_ITEMS, data = ., paste, collapse = "&") %>%
  group_by(oa_category) %>%
  summarise(n = n_distinct(PK_ITEMS))
readr::write_csv(cat_overlap, "data/overlap_oa_categories.csv")
```
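
The chunk above collapses all OA categories of an article into a single string such as `a&b` and then counts how often each combination occurs. A minimal sketch of the same idea using only dplyr verbs, on made-up data (`toy_cat` and `toy_overlap` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: articles 1 and 3 are available via two routes
toy_cat <- tibble::tribble(
  ~PK_ITEMS, ~oa_category,
  1, "full_oa_journal",
  1, "opendoar_inst",
  2, "opendoar_inst",
  3, "opendoar_inst",
  3, "full_oa_journal"
)

toy_overlap <- toy_cat %>%
  distinct() %>%
  # sort within each article so that "a&b" and "b&a" become the same string
  arrange(PK_ITEMS, oa_category) %>%
  group_by(PK_ITEMS) %>%
  summarise(combination = paste(oa_category, collapse = "&")) %>%
  count(combination, name = "n")
```

Sorting before pasting is the essential step: without it, the same set of categories could appear under several different labels and the combination counts would be split.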

```{r}
library(UpSetR)
# list with counts
cat_overlap_list <- as.list(cat_overlap$n)
# categories as list names
names(cat_overlap_list) <- cat_overlap$oa_category
# convert to vector
cat_overlap_upset_expr <- unlist(cat_overlap_list)
upset(fromExpression(cat_overlap_upset_expr), sets = rev(c("full_oa_journal", "other_oa_journal", "opendoar_inst", "opendoar_subject", "opendoar_other", "other_repo")), keep.order = TRUE, nintersects = 20, order.by = "freq", show.numbers = FALSE, set_size.angles = 25)
```


Keeping in mind that our categories are non-exclusive, as just shown, we now visualise the number of articles per category on the national level, that is, without differentiation by sector. As a first step, we investigate the two main OA routes via a journal or via a repository.

```{r fig6, fig.asp = 0.4, fig.cap="Development of the number of articles per OA host type and their overlap. Highlighted in blue is the number of articles per OA host type, with articles made available only via a journal on the left, articles available only in repositories on the right and the overlap, that is, articles openly accessible via both a journal and a repository, in the middle. The grey area shows the remaining OA articles."}
host_overlap <- pubs_cat %>%
  filter(oa_category != "not_oa") %>%
  mutate(oa_host = fct_collapse(
    oa_category,
    journal = c("full_oa_journal", "other_oa_journal"),
    repository = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    )
  )) %>%
  mutate(oa_host = as.character(oa_host)) %>%
  select(PK_ITEMS, oa_host) %>%
  distinct() %>%
  arrange(PK_ITEMS, oa_host) %>% 
  aggregate(oa_host ~ PK_ITEMS, data = ., paste, collapse = "&")

  pubs_cat %>%
  filter(oa_category != "not_oa") %>% 
  mutate(PUBYEAR = lubridate::ymd(paste0(PUBYEAR, "-01-01"))) %>% 
  left_join(host_overlap, by = "PK_ITEMS") %>% 
  mutate(oa_host = case_when(
    oa_host == "journal" ~ "Journal only",
    oa_host == "repository" ~ "Repository only",
    oa_host == "journal&repository" ~ "Journal &\n Repository"
  )) %>% 
  mutate(oa_host = factor(oa_host, levels = c("Journal only", "Journal &\n Repository", "Repository only"))) %>% 
  group_by(PUBYEAR, oa_host) %>% 
  summarise(number_of_articles = n_distinct(PK_ITEMS)) %>% 
    ungroup() %>% 
  ggplot(aes(x = PUBYEAR, y = number_of_articles)) +
    geom_area(data = filter(pubs_oa_year, oa_category == "is_oa"), aes(fill = "All OA Articles"), alpha = 0.8, colour = "white") +
    geom_area(aes(fill = "by Host"), alpha = 0.8, colour = "white") +
    scale_fill_manual(
      values = c("#cccccca0", "#56b4e9"),
      name = NULL
    ) +
    facet_wrap( ~ oa_host) +
    scale_y_continuous(
      labels = scales::number_format(big.mark = ","),
      expand = expansion(mult = c(0, 0.05)),
      breaks =  scales::extended_breaks()
      ) + 
    scale_x_date(date_labels = "%y") +
    labs(x = "Publication Year", y = "Total Articles") +
    theme_minimal_hgrid() +
    # bold facet names
   # theme(strip.text = element_text(face="bold")) +
     theme(legend.position = "top",
           legend.justification = "right")
```


```{r fig7, fig.cap="Development of the percentage of journal articles per OA category (as per schema in Table 1) over time. Categories are non-exclusive, that is, some articles may be counted for more than one category. Colors correspond to the OA category. On the left, access provided via a journal is displayed, on the right via repositories. The grey area shows the total percentage of OA via the corresponding route (journal or repository)."}
oa_shares_host <- pubs_cat %>%
  mutate(PUBYEAR = lubridate::ymd(paste0(PUBYEAR, "-01-01"))) %>% 
  group_by(PUBYEAR) %>% 
  mutate(n_total = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  filter(oa_category != "not_oa") %>% 
  mutate(oa_host = fct_collapse(
    oa_category,
    `Journal OA` = c("full_oa_journal", "other_oa_journal"),
    `Repository OA` = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    )
  )) %>%
  group_by(PUBYEAR, oa_host, n_total) %>% 
  summarise(n_host = n_distinct(PK_ITEMS)) %>% 
  mutate(share = n_host/n_total) %>% 
  ungroup()
  
pubs_cat %>%
  group_by(PUBYEAR) %>% 
  mutate(n_total = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  group_by(PUBYEAR, oa_category, n_total) %>%
  summarise(number_of_articles = n_distinct(PK_ITEMS)) %>%
  ungroup() %>% 
  filter(oa_category != "not_oa") %>% 
  mutate(oa_host = fct_collapse(
    oa_category,
    `Journal OA` = c("full_oa_journal", "other_oa_journal"),
    `Repository OA` = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    )
  )) %>%
  ungroup() %>%
  mutate(PUBYEAR = lubridate::ymd(paste0(PUBYEAR, "-01-01")),
         prop = number_of_articles/n_total) %>%
  ggplot(aes(x = PUBYEAR, y = prop, group = oa_category)) +
    geom_line(aes(color = oa_category), size = 0.9) +
    scale_color_OkabeIto(name = "OA Category", order = c(4, 1, 2, 3, 5, 7, 6), darken = 0.11) +
    gghighlight::gghighlight(label_params = list(alpha = 0.8,
                                                 force = 15,
                                                 direction = "y",
                                                 hjust = 1,
                                                 segment.color = NA,
                                                 fontface = 'bold'),
                             unhighlighted_params = list(colour = "transparent")) +
     geom_area(data = oa_shares_host,
              aes(PUBYEAR, share, group = "none"),
              fill = "#cccccca0", color = "white", alpha = 0.2) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 5L),
                       expand = expansion(mult = c(0, 0.05)),
                       breaks =  scales::extended_breaks(n = 6)) +
    facet_wrap(~ oa_host) +
    labs(x = "Publication Year", y = "OA percentage") +
    theme_minimal_hgrid() +
    scale_x_date(date_labels = "%y") +
    theme(legend.position = "none") # +
    # bold facet labels
   # theme(strip.text = element_text(face = "bold"))   
```
Observations:

- Drop in `other_oa_journal`: attributed to delayed OA.
- Slight drop in `other_repo`: more sources are registered, and more is published in registered sources.
- Apart from this, all OA categories increase and `not_oa` decreases.
- Most prevalent category: subject-specific repositories registered with OpenDOAR.

Again, we go one step further and look at sector-specific OA proportions.
```{r fig8, fig.asp=1, fig.cap="OA shares per category and sector. Coloring and size of the points display the percentage in the respective category. Grey numbers display the percentage value explicitly. The bottom row shows the overall OA share of the sectors, the rightmost column the percentage of articles in the corresponding category regardless of the sector (on the national level). Sectors are ordered by the total publication output of the entire sector (highest: universities, lowest: Fraunhofer Society)."}
oa_shares_sector <- pubs_cat %>% 
  mutate(n_total = n_distinct(PK_ITEMS)) %>% 
  group_by(sec_abbr) %>%
  mutate(n_sec = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  group_by(oa_category) %>% 
  mutate(n_cat = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  group_by(sec_abbr, oa_category, n_sec, n_cat, n_total) %>%
  mutate(n_sec_cat = n_distinct(PK_ITEMS),
         share_sec_cat = n_sec_cat / n_sec,
         share_cat = n_cat / n_total) %>%
  ungroup() %>% 
  group_by(sec_abbr, oa_category, n_sec, share_sec_cat, share_cat, n_total) %>%
  summarise() %>%
  ungroup() %>% 
  pivot_longer(cols = c("share_sec_cat", "n_sec"),
               names_to = "fig_type",
               values_to = "fig_val") %>% 
  pivot_wider(names_from = sec_abbr,
              values_from = fig_val) %>% 
  mutate(all_sec = case_when(
    fig_type == "share_sec_cat" ~ share_cat,
    TRUE ~  as.double(n_total)
  )) %>% 
  select(-share_cat, -n_total) %>% 
  pivot_longer(cols = c("FhS", "HGF", "MPS", "GRA", "UNI", "WGL", "all_sec"),
               names_to = "sec_abbr",
               values_to = "fig_val") %>% 
  pivot_wider(names_from = fig_type,
              values_from = fig_val) %>% 
  mutate(share_sec_cat = ifelse(oa_category == "not_oa", 1-share_sec_cat, share_sec_cat),
         oa_category = fct_recode(oa_category, all_oa = "not_oa")) %>% 
  rename(share = share_sec_cat) %>%
  mutate(text_color = ifelse(share > .3, "black", "#cccccc"))

ggplot(oa_shares_sector, aes(x = fct_rev(fct_recode(fct_relevel(fct_reorder(sec_abbr, n_sec), "all_sec"), all = "all_sec")),
                             y = fct_rev(oa_category),
                             size = share, fill = share))+
  geom_point(shape = 21, color = "#666666") +
  geom_text(aes(label = round(share*100), colour = text_color), size = 4, show.legend = FALSE) +
  geom_vline(xintercept = 6.5, color = "black") +
  geom_hline(yintercept = 1.5, color = "black") +
  scale_size(name = "OA percentage",
             range = c(6, 12),
             guide = "none") +
  scale_fill_viridis(labels = scales::percent_format(accuracy = 5L),
                      option = "plasma",
                      name = "OA percentage\n",
                      breaks =  scales::extended_breaks(),
                      guide = guide_colorbar(frame.colour = "#666666", barwidth = 17)) +
  labs(x = "Sector", y = "OA Category") +
#  coord_flip() +
  scale_color_manual(values = c("black" = "black", "#cccccc" = "white")) +
  theme_minimal_grid() + 
  theme(legend.position = "top", legend.justification = "right") +
  theme(legend.direction = "horizontal", legend.box = "vertical") +
  theme(strip.text = element_text(face="bold"))
```
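
The repeated grouping steps in the chunk above boil down to two counts of distinct articles and one division. The following sketch reproduces that core on invented toy data, in base R so that it runs without the tidyverse. `toy_pubs`, `count_distinct` and all values are hypothetical. It assumes, as the use of `n_distinct(PK_ITEMS)` suggests, that an article co-authored across sectors occurs in more than one row.

```r
# Toy data (invented): item 4 is shared by UNI and MPS and therefore
# appears once per sector
toy_pubs <- data.frame(
  PK_ITEMS    = c(1, 2, 3, 4, 4, 5),
  sec_abbr    = c("UNI", "UNI", "UNI", "UNI", "MPS", "MPS"),
  oa_category = c("full_oa_journal", "not_oa", "opendoar_inst",
                  "opendoar_subject", "opendoar_subject", "not_oa"),
  stringsAsFactors = FALSE
)

count_distinct <- function(x) length(unique(x))

# Denominator: distinct articles per sector
n_sec <- tapply(toy_pubs$PK_ITEMS, toy_pubs$sec_abbr, count_distinct)

# Numerator: distinct articles per sector and OA category
toy_shares <- aggregate(PK_ITEMS ~ sec_abbr + oa_category,
                        data = toy_pubs, FUN = count_distinct)
names(toy_shares)[names(toy_shares) == "PK_ITEMS"] <- "n_sec_cat"

# Share of each category within its sector
toy_shares$share <- toy_shares$n_sec_cat /
  as.numeric(n_sec[as.character(toy_shares$sec_abbr)])

# Overall OA share per sector: complement of the not_oa share
not_oa <- toy_shares[toy_shares$oa_category == "not_oa", ]
all_oa <- setNames(1 - not_oa$share, as.character(not_oa$sec_abbr))

# National total counts the shared item only once
n_total <- count_distinct(toy_pubs$PK_ITEMS)
```

Because articles are counted as distinct items, the shared article contributes to both sector denominators but only once to the national total, so sector counts need not add up to `n_total`.
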
Generate tables for export:
```{r, eval=FALSE}
## Total articles per year
pubs_oa_year %>% 
  left_join(oa_shares_host, by = c("PUBYEAR")) %>% 
  select(-n_total, -share) %>% 
  pivot_wider(names_from = oa_category, values_from = number_of_articles) %>% 
  mutate(oa_share = is_oa/n_total_year) %>% 
  select(-not_oa) %>% 
    pivot_wider(names_from = oa_host, values_from = n_host) %>% 
    mutate(`Proportion of Journal OA` = `Journal OA` / n_total_year,
           `Proportion of Repository OA` = `Repository OA` / n_total_year) %>% 
  ungroup() %>% 
  mutate(PUBYEAR = lubridate::year(PUBYEAR)) %>% 
  readr::write_csv("data/export_tables/table_fig_is_oa.csv")

## Total articles per year and sector
pubs_cat %>%
  mutate(oa_host = fct_collapse(
    oa_category,
    journal = c("full_oa_journal", "other_oa_journal"),
    repository = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    )
  )) %>%
  mutate(oa_category = fct_collapse(
    oa_category,
    is_oa = c(
      "full_oa_journal",
      "other_oa_journal",
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    ),
    not_oa = "not_oa"
  ))  %>%
  group_by(sector, PUBYEAR) %>%
  mutate(n_total = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  group_by(sector, PUBYEAR, oa_category) %>%
  mutate(number_of_articles = n_distinct(PK_ITEMS)) %>%
  ungroup() %>% 
  group_by(sector, PUBYEAR, oa_host, n_total, oa_category, number_of_articles) %>% 
  summarise(n_host = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  filter(oa_category != "not_oa") %>% 
    pivot_wider(names_from = oa_category, values_from = number_of_articles) %>% 
  mutate(oa_share = is_oa/n_total) %>% 
  pivot_wider(names_from = oa_host, values_from = n_host) %>% 
  mutate(journal_share = journal / n_total,
         repo_share = repository / n_total) %>% 
  readr::write_csv("data/export_tables/table_fig_is_oa_sec.csv")

## sector per category
oa_shares_host_sec <- pubs_cat %>% 
  mutate(oa_host = fct_collapse(
    oa_category,
    journal = c("full_oa_journal", "other_oa_journal"),
    repository = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    )
  )) %>%
  mutate(n_total = n_distinct(PK_ITEMS)) %>% 
  group_by(sec_abbr) %>%
  mutate(n_sec = n_distinct(PK_ITEMS)) %>%
  ungroup() %>%
  filter(oa_host !="not_oa") %>% 
  group_by(oa_host) %>% 
  mutate(n_host = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  group_by(sec_abbr, oa_host, n_sec, n_host, n_total) %>%
  mutate(n_sec_host = n_distinct(PK_ITEMS),
         share_sec_host = n_sec_host / n_sec,
         share_host = n_host / n_total) %>%
  ungroup() %>% 
  group_by(sec_abbr, oa_host, n_sec, share_sec_host, share_host, n_total) %>%
  summarise() %>%
  ungroup() %>% 
  pivot_longer(cols = c("share_sec_host", "n_sec"),
               names_to = "fig_type",
               values_to = "fig_val") %>% 
  pivot_wider(names_from = sec_abbr,
              values_from = fig_val) %>% 
  mutate(all_sec = case_when(
    fig_type == "share_sec_host" ~ share_host,
    TRUE ~  as.double(n_total)
  )) %>% 
  select(-share_host, -n_total) %>% 
  pivot_longer(cols = c("FhS", "HGF", "MPS", "GRA", "UNI", "WGL", "all_sec"),
               names_to = "sec_abbr",
               values_to = "fig_val") %>% 
  pivot_wider(names_from = fig_type,
              values_from = fig_val) %>% 
  select(-n_sec)

oa_shares_sector %>% 
  left_join(oa_shares_host_sec, by = "sec_abbr") %>% 
  pivot_wider(names_from = oa_host, values_from = share_sec_host) %>%
  pivot_wider(names_from = oa_category, values_from = share) %>%
  left_join(distinct(select(pubs_cat, sector, sec_abbr)), by = "sec_abbr") %>% 
  select(-sec_abbr) %>% 
  mutate( sector = replace_na(sector, "All sectors")) %>%
  select(sector, n_sec, all_oa, journal, repository, everything()) %>% 
  readr::write_csv("data/export_tables/table_fig_oa_cat_sec.csv")

## institutions per sector and category
oa_share_host_inst <- pubs_cat %>% 
  group_by(INST_NAME) %>% 
  mutate(n_total = n_distinct(PK_ITEMS)) %>% 
  ungroup() %>% 
  mutate(oa_host = fct_collapse(
    oa_category,
    journal = c("full_oa_journal", "other_oa_journal"),
    repository = c(
      "opendoar_inst",
      "opendoar_subject",
      "opendoar_other",
      "other_repo"
    )
  )) %>%
  filter(oa_host != "not_oa") %>% 
  group_by(INST_NAME, oa_host, n_total) %>% 
  summarise(n_host = n_distinct(PK_ITEMS)) %>% 
  mutate(share = n_host /n_total) %>% 
  ungroup() %>% 
  select(-n_host, -n_total) %>% 
  pivot_wider(names_from = oa_host, values_from = share, values_fill = list(share = 0))

for(secname in unique(pubs_cat$sec_abbr)){
  pubs_cat %>% 
    filter(sec_abbr == secname) %>%
    group_by(INST_NAME) %>% 
    mutate(n_total = n_distinct(PK_ITEMS)) %>% 
    ungroup() %>% 
    group_by(INST_NAME, sector, sec_abbr, n_total, oa_category) %>% 
    summarise(n_cat = n_distinct(PK_ITEMS)) %>%
    ungroup() %>% 
    mutate(share = n_cat / n_total) %>%
    select(- c(n_cat, sector, sec_abbr)) %>% 
    pivot_wider(names_from = oa_category, values_from = share, values_fill = list(share = 0)) %>% 
    mutate(is_oa = 1 - not_oa) %>% 
    select(-not_oa) %>% 
    left_join(oa_share_host_inst, by = "INST_NAME") %>% 
    mutate(journal = replace_na(journal, 0),
           repository = replace_na(repository, 0)) %>% 
    select(INST_NAME, n_total, is_oa, journal, repository, everything()) %>% 
    readr::write_csv(paste0("data/export_tables/table_oa_cat_inst_", secname,".csv"))
}

## numbers overlap hosttype
 pubs_cat %>%
  filter(oa_category != "not_oa") %>% 
  mutate(PUBYEAR = lubridate::ymd(paste0(PUBYEAR, "-01-01"))) %>% 
  left_join(host_overlap, by = "PK_ITEMS") %>% 
  mutate(oa_host = case_when(
    oa_host == "journal" ~ "Journal only",
    oa_host == "repository" ~ "Repository only",
    oa_host == "journal&repository" ~ "Journal &\n Repository"
  )) %>% 
  mutate(oa_host = factor(oa_host, levels = c("Journal only", "Journal &\n Repository", "Repository only"))) %>% 
   group_by(oa_host) %>% 
   mutate(n_host = n_distinct(PK_ITEMS)) %>% 
   ungroup() %>% 
   mutate(n_total = n_distinct(PK_ITEMS)) %>% 
   group_by(oa_host, PUBYEAR, n_total, n_host) %>% 
   summarise(n = n_distinct(PK_ITEMS)) %>%
   ungroup() %>% 
   pivot_wider(names_from = PUBYEAR, values_from = n) %>% 
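   # cols 3:12 cover n_host plus the year columns; "n_host" does not parse as a
   # date below and thereby becomes the "All years" row
   pivot_longer(cols = c(3:12), names_to = "PUBYEAR", values_to = "n") %>% 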
   mutate(PUBYEAR = lubridate::ymd(PUBYEAR)) %>% 
   left_join(select(filter(pubs_oa_year, oa_category == "is_oa"), PUBYEAR, number_of_articles), by = "PUBYEAR") %>% 
   mutate(number_of_articles = ifelse(is.na(PUBYEAR), n_total, number_of_articles)) %>% 
   pivot_wider(names_from = oa_host, values_from = n) %>% 
   mutate(PUBYEAR = ifelse(is.na(PUBYEAR), "All years", as.character(lubridate::year(PUBYEAR)))) %>% 
   rename(`Total OA articles` = number_of_articles) %>% 
   select(-n_total) %>% 
   mutate(`Proportion\n Journal only` = `Journal only` / `Total OA articles`,
          `Proportion\n Journal &\n Repository` = `Journal &\n Repository` / `Total OA articles`,
          `Proportion\n Repository only` = `Repository only` / `Total OA articles`) %>% 
  readr::write_csv("data/export_tables/overlap_hosttype_per_year.csv")

## which categories are dominant in the overlapping host type plot?
 pubs_cat %>%
  filter(oa_category != "not_oa") %>% 
  mutate(PUBYEAR = lubridate::ymd(paste0(PUBYEAR, "-01-01"))) %>% 
  left_join(host_overlap, by = "PK_ITEMS") %>% 
  mutate(oa_host = case_when(
    oa_host == "journal" ~ "Journal only",
    oa_host == "repository" ~ "Repository only",
    oa_host == "journal&repository" ~ "Journal &\n Repository"
  )) %>% 
  mutate(oa_host = factor(oa_host, levels = c("Journal only", "Journal &\n Repository", "Repository only"))) %>% 
    # mutate(oa_host_non_overlap = fct_collapse(oa_category,
    #   journal = c("full_oa_journal", "other_oa_journal"),
    #   repository = c(
    #     "opendoar_inst",
    #     "opendoar_subject",
    #     "opendoar_other",
    #     "other_repo"
    #   ))) %>% 
    # group_by(oa_host, oa_category, oa_host_non_overlap) %>% 
   group_by(oa_host) %>% 
   mutate(n_total = n_distinct(PK_ITEMS)) %>% 
   ungroup() %>% 
   group_by(oa_host, oa_category, n_total) %>% 
    summarise(n = n_distinct(PK_ITEMS)) %>% 
    # group_by(oa_host, oa_host_non_overlap) %>% 
    # mutate(prop_per_host = n/sum(n)) %>% 
  mutate(prop = n/n_total) %>% 
  select(-n, -n_total) %>% 
  pivot_wider(names_from = oa_host, values_from = prop, values_fill = list(prop = 0)) %>%
  readr::write_csv("data/export_tables/overlap_hosttype_per_category.csv")

```

```{r, eval=FALSE}
rm(oa_shares_inst, oa_shares_inst_sector, oa_shares_inst_sector_stats, oa_shares_sector, pubs_oa_year, host_overlap, oa_share_host_inst, oa_shares_host, oa_shares_host_sec)
```

### Discussion

 - UpSet plot of overlapping evidence categories to show the influence of Semantic Scholar and web scraping.

In order to demonstrate the prevalence of evidence categories in Unpaywall, we load the original, non-categorized Unpaywall data:

```{r, eval=FALSE}
upw_evidence <- readr::read_csv("data/upw_evidence.csv")
upw_ev_cat <- upw_evidence %>%
  mutate(upw_matched = !is.na(upw_doi)) %>%
  filter(is_paratext == FALSE | is.na(is_paratext))
# rm(upw_evidence)
```

We now determine the evidence combinations for all matched DOIs and then calculate the frequency of each combination found.

```{r, eval=FALSE}
upw_ev_cat <- upw_ev_cat %>%
  filter(upw_matched == TRUE) %>%
  select(upw_doi, evidence) %>%
  distinct() %>%
  arrange(upw_doi, evidence) %>%
  aggregate(evidence ~ upw_doi, data = ., paste, collapse = "&")
upw_ev_cat_n <- upw_ev_cat %>%
  group_by(evidence) %>%
  summarise(n = n_distinct(upw_doi))
upw_ev_cat_n %>%
  arrange(desc(n))
upw_ev_cat_n %>%
  readr::write_csv("data/upw_ev_cat.csv")
```
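
To illustrate the combination step, here is a minimal sketch on invented data. The evidence labels are simplified placeholders, not actual Unpaywall evidence strings, and `toy_evidence`, `toy_combos` and `toy_counts` are hypothetical names. Sorting before collapsing matters: without it, the same set of evidence types could produce differently ordered strings and be counted as separate combinations.

```r
# Toy stand-in for the matched Unpaywall records (values are invented)
toy_evidence <- data.frame(
  upw_doi  = c("10.1/a", "10.1/a", "10.1/b", "10.1/c", "10.1/c", "10.1/c"),
  evidence = c("oa repository", "oa journal", "oa journal",
               "oa journal", "oa repository", "oa repository"),
  stringsAsFactors = FALSE
)

# Drop duplicate (DOI, evidence) pairs and sort, so that the same set of
# evidence types always yields the same combination string
toy_evidence <- unique(toy_evidence)
toy_evidence <- toy_evidence[order(toy_evidence$upw_doi, toy_evidence$evidence), ]

# One row per DOI, evidence types joined by "&"
toy_combos <- aggregate(evidence ~ upw_doi, data = toy_evidence,
                        FUN = paste, collapse = "&")

# Number of DOIs per combination
toy_counts <- as.data.frame(table(evidence = toy_combos$evidence),
                            stringsAsFactors = FALSE)
names(toy_counts)[2] <- "n"
toy_counts
```

In this toy case the DOIs `10.1/a` and `10.1/c` end up in the same combination even though their rows arrive in a different order and one pair is duplicated.
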
We now prepare the data for plotting with the UpSetR package and visualise the overlapping evidence categories.

```{r, eval=FALSE}
library(UpSetR)
# list with counts
upw_ev_upset_list <- as.list(upw_ev_cat_n$n)
# categories as list names
names(upw_ev_upset_list) <- upw_ev_cat_n$evidence
# convert to vector
evidence_categories_upset_expr <- unlist(upw_ev_upset_list)
upset(fromExpression(evidence_categories_upset_expr), nsets = 7, nintersects = 15, order.by = "freq", show.numbers = FALSE, set_size.angles = 25)

```
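
As a small illustration of the input that `fromExpression()` expects, the sketch below builds a named numeric vector by hand. The names are set labels joined by `&`, and each value is the number of elements found in exactly that combination, which matches the per-DOI combinations counted above. The labels and counts are invented, and `toy_expr` is a hypothetical name. The UpSetR call is guarded so that the snippet also runs where the package is not installed.

```r
# Invented counts: each DOI belongs to exactly one combination
toy_expr <- c(
  "oa journal"               = 5,
  "oa repository"            = 3,
  "oa journal&oa repository" = 2
)

# Total number of elements across all exclusive combinations
n_elements <- sum(toy_expr)

if (requireNamespace("UpSetR", quietly = TRUE)) {
  # Expands the counts into one row per element with a 0/1 column per set
  toy_sets <- UpSetR::fromExpression(toy_expr)
  UpSetR::upset(toy_sets, order.by = "freq")
}
```
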