-
Notifications
You must be signed in to change notification settings - Fork 11
/
33-pathway_overlap_biological_contexts.Rmd
188 lines (147 loc) · 5.81 KB
/
33-pathway_overlap_biological_contexts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
title: "What pathways are captured in models trained on different biological
contexts?"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
A pathway will be considered captured or represented when training on a
particular biological context if the majority of models have at least one
latent variable significantly associated (we'll use `FDR < 0.05`) with it.
We have trained 5 models initialized with different random seeds for each
biological context (identical training set).
Thus, if 3 or more models have a latent variable associated with a pathway,
that pathway is to be considered captured in that biological context.
## Set up
```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`
```
```{r}
GetRepresentedPathways <- function(list.of.plier.models, fdr.cutoff = 0.05) {
# For a list of plier models (e.g., series of repeats; output of
# scripts/subsampling_PLIER.R), find what pathways are represented in a
# majority of those models.
#
# Args:
# list.of.plier.models: the list of results
# fdr.cutoff: what FDR threshold should be used to determine if a pathway
# is captured? default is 0.05
#
# Returns:
# a vector of pathway names
# what pathways have at least one LV with FDR < fdr.cutoff
IdentifyCapturedPathways <- function(plier.results) {
summary.df <- plier.results$summary
captured.pathways <-
unique(summary.df$pathway[which(summary.df$FDR < fdr.cutoff)])
}
# for each model (repeat) in the list, what pathways are captured?
captured.pathways.list <-
lapply(list.of.plier.models, function(x) IdentifyCapturedPathways(x$PLIER))
# get counts
count.table <- table(unlist(captured.pathways.list))
# which pathways are represented in the majority of models?
number.of.models <- length(list.of.plier.models)
majority.number <- ceiling(number.of.models / 2)
majority.pathways <- names(count.table[which(count.table >= majority.number)])
# return the list of pathways represented in the majority of models
return(majority.pathways)
}
```
#### Directories
Plot and results directories specifically for this notebook.
```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "33")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "33")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
We're interested in the models exploring the different biological contexts,
not the various sample sizes.
```{r}
models.dir <- "models"
# can be distinguished by the file names
model.files <- list.files(models.dir, pattern = "accessions", full.names = TRUE)
```
## Identify the represented pathways
```{r}
pathway.lists <- lapply(model.files, function(x) {
model.list <- readRDS(x)
GetRepresentedPathways(model.list)
})
names(pathway.lists) <- stringr::str_match(model.files,
"recount2_(.*?)_accessions")[, 2]
```
Let's make an [UpSet plot](https://caleydo.org/tools/upset/) examining the
overlap between the five biological conditions.
```{r}
UpSetR::upset(UpSetR::fromList(pathway.lists),
order.by = "freq",
point.size = 2, line.size = 1,
mainbar.y.label = "Intersection of Pathways",
text.scale = 1.25)
```
Save the plot to file
```{r}
pdf(file.path(plot.dir, "biological_context_upset.pdf"))
UpSetR::upset(UpSetR::fromList(pathway.lists),
order.by = "freq",
point.size = 2, line.size = 1,
mainbar.y.label = "Intersection of Pathways",
text.scale = 1.25)
dev.off()
```
We can use `VennDiagram::calculate.overlap` to find which pathways are
overlapping and use the intersection size above to help us figure things out
from there.
```{r}
overlaps.list <- VennDiagram::calculate.overlap(pathway.lists)
```
### Blood
Let's take a look at what is captured exclusively in the models trained on
blood.
```{r}
data.frame(overlaps.list$a1)
```
We can see that half of these are immune cell-related genesets.
Specifically, the `DMAP` gene sets are further differentiated cell types
(see [Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3049864/figure/F1/)
of [Novershtern, et al. _Cell._ 2011.](https://dx.doi.org/10.1016%2Fj.cell.2011.01.004))
### Everything but cancer
If we look at the UpSet plot above, we can see that the largest set of
in-all-but-one is excluding cancer with 22 pathways.
What are the pathways that are left out of the cancer models?
```{r}
data.frame(overlaps.list$a27)
```
We can see that natural killer (NK) cells are represented in other models,
but less so in cancer (e.g., `DMAP_NKA1` and `SVM NK cells activated`).
We'd likely expect models that are trained on blood to be particularly good
at capturing NK cell signals despite their relatively small sample size
(`n = 3862`).
It could be that the cancer sample size (`n = 8807`) is just too small to
adequately capture this signal.
We trained models on 8000 randomly selected samples, so we can use those models
to look into this.
```{r}
eight.thousand.file <- file.path(models.dir,
"subsampled_recount2_PLIER_model_8000.RDS")
eight.thousand.list <- readRDS(eight.thousand.file)
```
What pathways are in the majority of models (`n = 8000`, randomly selected)?
```{r}
eight.thousand.pathways <- GetRepresentedPathways(eight.thousand.list)
```
Are NK cells covered?
```{r}
data.frame(eight.thousand.pathways[grep("NK", eight.thousand.pathways)])
```
This underrepresentation in cancer models may be due to the sometimes
immunosuppressive nature of the tumor microenvironment.
However, this analysis doesn't specifically examine the _quality or usefulness_
of the NK cell gene set-latent variable association (i.e., is the latent
variable associated with many, unrelated pathways?).