# Containers and Reproducibility {#sec-containers}
:::{.callout-note}
## Prep for Exercises
Make sure you are logged into the platform using `dx login` and that your course project is selected with `dx select`.
In your shell (either on your machine or in binder), make sure you're in the `bash_bioinfo_scripts/containers/` folder:
```
cd containers/
```
:::
## Learning Objectives
1. **Explain** the benefits of using containers on DNAnexus for reproducibility and for batch processing
1. **Define** the terms *image*, *container*, and *snapshot* in the context of Docker
1. **Create** snapshots on RAP using `docker pull` and `docker save` with the ttyd app
1. **Utilize** containers to batch process files on RAP
1. **Extend** a docker image by installing software in interactive mode
1. **Build** a docker image using Dockerfiles
## Why Containers?
There is a replication crisis out there. Even given a script and the raw data, it is often difficult to replicate the results generated by a study.
Why is this difficult? Many others have talked about this, but one simple reason is that the results are tied to software and database versions.
This is the motivation for using *containers* - they are a way of packaging software that 'freezes' the software versions. If you provide the container that you used to generate the results, other people should be able to replicate your results even if they're on a different operating system.
## Terminology
In order to be unambiguous with our language, we'll use the following definitions:
![Docker Terms 1](images/docker_terminology.png){#fig-docker1}
- **Registry** - a collection of repositories that you pull docker images from. Example registries include DockerHub and Quay.io.
- **Docker Image** - what you download from a registry: the "recipe" for building the software environment. Use `docker pull` to download an image from a registry and `docker push` to upload one; `docker commit` creates a new image from a modified container. You can also build an image from a Dockerfile.
- **Docker Container** - the executable software environment running on a machine. Create one from an image with `docker run`.
- **Snapshot File** - a single archive file (`.tar.gz`) that contains the Docker image. Generate it by running `docker save` on an image. Also known as an *image file* on the platform.
## Building Docker Snapshot Files on the DNAnexus platform
### The Golden Rule of Docker and Batch Analysis
DockerHub has a pull limit of 200 pulls/day/user. You will hit this limit quickly if every job pulls the image by its URL.
So, if you are processing more than 200 files (or jobs), you should save the docker image into platform storage as a snapshot file.
Let's talk about the basic snapshot building process.
### Be Secure
Before we get started, security is always a concern when running Docker images. The `docker` group has elevated privileges on a system, so we need to be careful that the containers we run don't introduce any system vulnerabilities.
This matters most when running containers that are web servers or part of a web stack, but it is also important to think about when running jobs on the cloud.
Here are some guidelines to think about when you are working with a container.
- **Use vendor-specific Docker Images when possible**.
- **Use container scanners to spot potential vulnerabilities**. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities.
- **Avoid kitchen-sink images**. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations.
### The Basic Snapshot Building Process
::: {#fig-snapshot-building}
```{mermaid}
flowchart TD
A[start ttyd] --> B[docker pull <br> from registry]
B --> C[docker save to <br> snapshot file]
C --> D[dx upload <br> snapshot to <br> project storage]
D --> E[terminate ttyd]
```
Building a docker snapshot on the DNAnexus platform.
:::
### Building Snapshot Files in `ttyd` {#sec-ttyd}
Up until now, we have been using our own machine or the binder shell for doing our work.
We're going to pull up a web-enabled shell on a DNAnexus worker with the `ttyd` app. `ttyd` is useful because:
1. `docker` is already installed, so we can `docker pull` our container and `docker save` our snapshot to the ttyd instance.
1. It's much faster to transfer our snapshot file back into project storage with `dx upload`.
To open ttyd, open the **Tool Library** under **Tools** and select your project.
![Opening ttyd]()
### Pull your image from a registry
```{bash}
#| eval: false
docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0
```
On your `ttyd` instance, do a `docker pull` to pull your image from the registry. Note that we're pulling `samtools` from `quay.io` here, from the `biocontainers` user.
We're also specifying a *version tag*, `1.15.1--h1170115_0`, to tie our `samtools` to a specific version. This is important: if you leave the tag off, `docker pull` falls back to the `latest` tag, which is not tied to a specific version. So make sure to tie your image to a specific version.
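As a quick guard against unpinned images, here is a minimal sketch of a check you could run before pulling. The function name `is_pinned` is hypothetical (it is not part of `docker` or `dx`), and it only inspects the text of the image reference; it does not contact a registry.

```bash
# Hypothetical helper: succeeds only if the image reference names an explicit,
# non-"latest" tag (or a digest). It looks at the text of the reference only.
is_pinned() {
  local ref="$1"
  local name="${ref##*/}"        # drop the registry/user path, keep "image[:tag]"
  case "$name" in
    *@sha256:*) return 0 ;;      # pinned by digest
    *:latest)   return 1 ;;      # explicit tag, but it can move over time
    *:*)        return 0 ;;      # explicit version tag
    *)          return 1 ;;      # no tag: docker would fall back to "latest"
  esac
}

for ref in quay.io/biocontainers/samtools:1.15.1--h1170115_0 ubuntu:latest ubuntu; do
  if is_pinned "$ref"; then
    echo "pinned:   $ref"
  else
    echo "floating: $ref"
  fi
done
```

Only the first reference in the loop is reported as pinned; the other two would resolve to whatever `latest` points at on the day you pull.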
When you're done pulling the docker image, try out the `docker images` command.
```
docker images
```
### Try your docker image out
Now that we have our docker image downloaded, we can test it out by running `samtools --help`. This should give us the help message.
```{bash}
#| eval: false
docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --help
```
### Save your docker image as a snapshot
Now that we've pulled the image, we are going to save it as a snapshot file using `docker save`. We pipe the output of `docker save` into `gzip` to save it as `samtools_image.tar.gz`.
```{bash}
#| eval: false
docker save quay.io/biocontainers/samtools:1.15.1--h1170115_0 | gzip > samtools_image.tar.gz
```
### Upload your snapshot
Now we can get our image back into project storage. We'll create a folder called `images/` with `dx mkdir` and then use `dx upload` to get our snapshot file into the `images/` folder.
```{bash}
#| eval: false
dx mkdir images/
dx upload samtools_image.tar.gz --destination images/
```
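The pull, save, and upload steps can be collected into one reusable helper. The sketch below is a dry run: the hypothetical function `snapshot_commands` only prints the commands for a given image, so you can review them before running anything in `ttyd`. The function name and its three arguments are assumptions made for illustration.

```bash
# Hypothetical helper: print the three commands that turn an image reference
# into a snapshot file in project storage. Nothing is executed here.
snapshot_commands() {
  local image="$1"      # full image reference, including the version tag
  local snapshot="$2"   # name of the .tar.gz snapshot file to create
  local dest="$3"       # destination folder in the project, e.g. images/
  printf 'docker pull %s\n' "$image"
  printf 'docker save %s | gzip > %s\n' "$image" "$snapshot"
  printf 'dx upload %s --destination %s\n' "$snapshot" "$dest"
}

snapshot_commands quay.io/biocontainers/samtools:1.15.1--h1170115_0 \
  samtools_image.tar.gz images/
```

Once the printed commands look right, you could pipe the output into `bash` on your `ttyd` instance to run them in order.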
### Important: make sure to terminate your ttyd instance!
One thing to remember is that there is no timeout associated with `ttyd`. You will get a reminder email after it's been open for 24 hours, but you will get no warning after that.
So make sure to use `dx terminate` or terminate the ttyd job under the `Manage` tab.
## Using Docker with Swiss Army Knife {#sec-docker-sak}
Now that we've built our Docker snapshot, let's use it in Swiss Army Knife.
Swiss Army Knife has two separate inputs associated with Docker:
- `-iimage_file` - This is where you put the snapshot file (such as the `samtools_image.tar.gz` we uploaded)
- `-iimage` - This is where you'd put the Docker URL (such as `quay.io/ucsc_cgl/samtools`)
So, let's run a `samtools` job using our Docker snapshot.
```{bash}
#| eval: false
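# Single quotes keep ${in_prefix} and * from being expanded by your local shell.
# The command runs inside the container, so it does not need `docker run`.
dx run app-swiss-army-knife \
  -iimage_file="images/samtools_image.tar.gz" \
  -iin="data/NA12878.bam" \
  -icmd='samtools stats * > ${in_prefix}.stats.txt'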
```
The main thing that has been changed here is that we've added the `-iimage_file` input to our `dx run` statement.
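For batch processing, the same snapshot file can be reused across many jobs, which is what keeps you clear of the registry pull limit. The sketch below is a dry run that prints one `dx run` command per BAM file. The function `batch_commands` and the second file name (`NA12891.bam`) are hypothetical, the output name is derived locally from each file name, and the sketch assumes that Swiss Army Knife runs the command inside the container loaded from the snapshot. The `-y` flag skips the confirmation prompt for each job.

```bash
# Hypothetical dry-run helper: print one Swiss Army Knife command per BAM file,
# all reusing the same snapshot file instead of pulling from a registry.
batch_commands() {
  local snapshot="$1"   # path of the snapshot file in project storage
  shift                 # the remaining arguments are the BAM files
  local bam prefix
  for bam in "$@"; do
    prefix="$(basename "$bam" .bam)"   # e.g. data/NA12878.bam -> NA12878
    printf 'dx run app-swiss-army-knife -iimage_file="%s" -iin="%s" -icmd="samtools stats * > %s.stats.txt" -y\n' \
      "$snapshot" "$bam" "$prefix"
  done
}

batch_commands images/samtools_image.tar.gz data/NA12878.bam data/NA12891.bam
```

Printing the commands first makes it easy to check the file list and output names before submitting any jobs.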
## Extending a Docker Image
One thing that you might do is extend a Docker image by adding additional software. You can do this by opening an interactive session and installing within the container.
What is interactive mode? When you pull a docker image in your `ttyd` session (@sec-ttyd), you can issue a `docker run` command with these options:
```
docker run -it ubuntu:18.04 /bin/bash
```
The `-i` flag keeps standard input open and `-t` allocates a terminal, so this command opens a bash shell in the container.
### Pulling a Base Image
We'll start out with the official ubuntu 18.04 container in our ttyd session:
```{bash}
#| eval: false
docker pull ubuntu:18.04
docker images
```
### Open up interactive mode
In ttyd, now enter an interactive session:
```
docker run -it ubuntu:18.04 /bin/bash
```
If it works, you will open up a `bash` prompt in the container.
You'll know you're in the container if you do an `ls` and your filesystem looks different.
### Install Software
Now, let's install [EMBOSS](https://emboss.sourceforge.net/) (European Molecular Biology Open Software Suite), which is a suite of string utilities for working with genomic data. If you look at the EMBOSS link, you will see that you can install it via `apt install`, which is available by default in the `ubuntu` container.
```{bash}
#| eval: false
apt update && apt upgrade
apt install emboss gzip -y
```
### Exit Container
Now exit from your container's interactive mode:
```{bash}
#| eval: false
exit
```
You'll be back at the normal ttyd prompt.
### `docker commit`/`docker save` your new snapshot file
We created a new container when we installed everything. We'll need to find its ID in ttyd.
```{bash}
#| eval: false
docker ps -a
```
The `CONTAINER ID` column shows the ID of our new container. We can use this ID to save the container as a new image with `docker commit`, then write that image to a snapshot file with `docker save`:
```{bash}
#| eval: false
docker commit <container_id> emboss:6.6.0
docker save emboss:6.6.0 | gzip > emboss.tar.gz
dx upload emboss.tar.gz --destination images/
```
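If several containers are listed, picking out the right ID by eye can be error-prone. Here is a minimal sketch of a filter that reads `docker ps -a` output and prints the ID of the most recent container started from a given image. The function name `container_id_for` is hypothetical, and the sample table is made up to show the column layout; your IDs and names will differ.

```bash
# Hypothetical helper: read `docker ps -a` output on stdin and print the ID of
# the first listed (most recently created) container for the given image.
container_id_for() {
  awk -v image="$1" 'NR > 1 && $2 == image { print $1; exit }'
}

# Made-up sample of the `docker ps -a` table; real IDs and names will differ.
sample_ps='CONTAINER ID   IMAGE          COMMAND       CREATED         STATUS                     PORTS   NAMES
0a1b2c3d4e5f   ubuntu:18.04   "/bin/bash"   5 minutes ago   Exited (0) 1 minute ago            demo_one'

printf '%s\n' "$sample_ps" | container_id_for ubuntu:18.04
```

In `ttyd`, you would feed it the real listing instead: `docker ps -a | container_id_for ubuntu:18.04`.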
### Other uses of Interactive Mode
Docker's interactive mode is really helpful for testing out scripts and making sure they are reproducible.
If I have a one-off analysis, it may be faster for me to just open up `ttyd` and use `docker run` to open up interactive mode, and do work with a container.
## Making Dockerfiles
The other way to build image files is to use a Dockerfile. A Dockerfile is a recipe for installing software and its dependencies.
Let's take a look at a Dockerfile. By default, it is contained within a folder and is called `Dockerfile`:
```
FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y build-essential && \
    apt-get install -y wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH
# install plink and samtools with conda (-y skips the confirmation prompt)
RUN conda install -y -c "bioconda/label/cf201901" plink
RUN conda install -y -c "bioconda/label/cf201901" samtools
```
We can build the Docker image in our directory using:
```
docker build . -t gatk_sam_plink:0.0.1
```
When it's done, we can confirm that the image was built by running:
```
docker images
```
And we can use it like any other image.
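To connect the two approaches, the interactive EMBOSS install from earlier in this chapter could also be written as a Dockerfile. The sketch below writes such a file into a throwaway directory. The directory and the `emboss:6.6.0` tag in the comments are illustrative choices, and the `docker` commands are left as comments so the snippet also runs on a machine without Docker.

```bash
# Write the Dockerfile into a fresh, throwaway build directory.
build_dir="$(mktemp -d)"
cat > "$build_dir/Dockerfile" <<'EOF'
FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y emboss gzip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
EOF

echo "Dockerfile written to $build_dir"
# On a machine with docker (such as ttyd), the next steps would be:
#   docker build "$build_dir" -t emboss:6.6.0
#   docker save emboss:6.6.0 | gzip > emboss.tar.gz
```

Compared with `docker commit`, the Dockerfile records every installation step, so someone else can rebuild the image from the recipe rather than relying on the saved container.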
## Going Further with Docker
Now that you know how to build a snapshot file, you've also learned another step in building apps: specifying software dependencies. You can use these snapshot files to specify executables in your app.
You can also use these snapshot files in your WDL workflow.
## What you learned in this chapter
- How containers enable reproducibility
- Specific container terminology
- How to create snapshot files using `ttyd`
- How to use these snapshot files with Swiss Army Knife
- How to extend a docker image by installing new software