-
Notifications
You must be signed in to change notification settings - Fork 5
/
02-scripting-basics.qmd
518 lines (345 loc) · 16.7 KB
/
02-scripting-basics.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
# Shell Scripting Basics {#sec-script}
:::{.callout-note}
## Prep for Exercises
In your shell (either on your machine or in binder), make sure you're in the `bash_bioinfo_scripts/scripting-basics/` folder:
```
cd scripting-basics/
```
:::
## Learning Objectives
1. **Utilize** positional *arguments* to generalize our scripts
1. **Articulate** the *three streams* of a command line utility
1. **Define** variables for use in a bash script
1. **Iterate** a script over a set of files using `xargs` or `for` loops
1. **Wrap** executables and scripts in R/Python into a Bash script
## Review of Bash scripting
Bash scripting is often referred to as a useful "glue language" on the internet. Although a lot of functionality can be covered by both JavaScript and Python, bash scripting is still very helpful to know.
We are going to cover Bash scripting because it is the main shell that is available to us on DNAnexus machines, which are Ubuntu-based.
We will be using Bash scripts as "glue" for multiple applications in cloud computing, including:
1. **Wrapping scripts** from other languages such as R or Python so we can run them using `dx run` on a app such as Swiss Army Knife
2. **Specifying inputs and outputs** to executables in Applets/Workflows
3. **Specifying inputs and outputs** in a workflow built by Workflow Description Language (WDL).
As you can see, knowing Bash is extremely helpful when running jobs on the cloud.
## Our first script with positional arguments {#sec-positional}
Say we have [`samtools`](http://www.htslib.org/doc/samtools-stats.html) installed on our own machine. Let's start with a basic script and build from there. We'll call it `sam_run.sh`. With `nano`, a text editor, we'll start a very basic bash script and build its capabilities out.
```{bash}
#| eval: false
#| filename: scripting-basics/sam_run.sh
#!/bin/bash/
samtools stats $1 > $2
```
Let's take a look at the command that we're running first. We're going to run `samtools stats`, which will give us statistics on an incoming `bam` or `sam` file and save it in a file. We want to be able to run our script like this:
```{bash}
#| eval: false
bash sam_run my_file.bam out_stats.txt
```
When we run it like that, `sam_run.sh` will run `samtools stat` like this:
```{bash}
#| eval: false
samtools stats my_file.bam > out_stats.txt
```
So what's going on here is that there is some substitution using common arguments. Let's look at these.
### Positional Arguments such as `$1`
How did the script know where to substitute each of our arguments? It has to do with the argument variables. Arguments (terms that follow our command) are indexed starting with the number 1. We can access the value at the first position using the special variable `$1`.
Note that this works even in quotes.
So, to unpack our script, we are substituting our first argument for the `$1`, and our second argument for the `$2` in our script.
:::{.callout-note}
## Test yourself
How would we rewrite `sam_run.sh` if we wanted to specify the output file as the first argument and the bam file as the second argument?
```{bash}
#| eval: false
#| filename: scripting-basics/sam_run.sh
#!/bin/bash/
samtools stats $1 > $2
```
:::
:::{.callout-note collapse="true"}
## Answer
For this script, we would switch the positions of `$1` and `$2`.
```{bash}
#| eval: false
#!/bin/bash/
samtools stats $2 > $1
```
And we would run `sam_run.sh` like this:
```{bash}
#| eval: false
bash sam_run.sh my_file.bam out_stats.txt
```
:::
### What about named arguments in my script?
See @sec-named for more info.
## Using pipes: STDIN, STDOUT, STDERR
We will need to use pipes to chain our commands together. Specifically, we need to take a command that generates a list of files on the platform, and then spawns individual jobs to process each file. For this reason, understanding a little bit more about how pipes (`|`) work in Bash is helpful.
If we want to understand how to chain our scripts together into a pipeline, it is helpful to know about the different streams that are available to the utilities.
:::{#fig-std}
```{mermaid}
graph LR
A(STDIN) --> E[run_samtools.sh]
E --> B(STDOUT)
E --> C(STDERR)
```
Inputs/outputs to a script
:::
Every script has three streams available to it: Standard In (STDIN), Standard Out (STDOUT), and Standard Error (STDERR) (@fig-std).
STDIN contains information that is directed to the input of a script (usually text output via STDOUT from another script).
Why do these matter? To work in a Unix pipeline, a script must be able to utilize STDIN, and generate STDOUT, and STDERR.
Specifically, in pipelines, STDOUT of a script (here it's `run_samtools`) is directed into STDIN of another command (here `wc`, or word count)
:::{#fig-pipe}
```{mermaid}
graph LR
E[run_samtools.sh] --> B(STDOUT)
B --> F{"|"}
E --> C(STDERR)
F --> D("STDIN (wc)")
D --> G[wc]
```
Piping a script `run_samtools.sh` into another command (`wc`)
:::
We will mostly use STDOUT in our bash scripts, but STDERR can be really helpful in debugging what's going wrong.
:::{.callout-note}
## Why this is important on the platform
We'll use pipes and pipelines not only in starting a bunch of jobs using batch scripting on our home computer, but also when we are processing files within a job.
:::
### For more info about pipes and pipelines
<https://swcarpentry.github.io/shell-novice/04-pipefilter/index.html>
<https://datascienceatthecommandline.com/2e/chapter-2-getting-started.html?q=stdin#combining-command-line-tools>
## Batch Processing Basics: Iterating using `xargs` {#sec-xargs}
A really common pattern is taking a delimited list of files and doing something with them. We can do some useful things such as seeing the first few lines of a set of files, or doing some sort of processing with the set of jobs.
Let's start out with a list of files:
```{bash}
#| eval: false
source ~/.bashrc #| hide_line
ls data/*.sh
```
```
data/batch-on-worker.sh
data/dx-find-data-class.sh
data/dx-find-data-field.sh
data/dx-find-data-name.sh
data/dx-find-path.sh
data/dx-find-xargs.sh
```
Now we have a list of files, let's look at the first few lines of each of them, and print a separator `---` for each.
```{bash}
#| eval: false
#| filename: scripting-basics/xargs_example.sh
source ~/.bashrc #| hide_line
ls data/*.sh | xargs -I% sh -c 'head %; echo "\n---\n"'
```
```
#!/bash/bin
cmd_to_run="ls *.vcf.gz | xargs -I% sh -c "bcftools stats % > %.stats.txt"
dx run swiss-army-knife \
-iin="data/chr1.vcf.gz" \
-iin="data/chr2.vcf.gz" \
-iin="data/chr3.vcf.gz" \
-icmd=${cmd_to_run}
---
#!/bin/bash
dx find data --class file --brief
---
dx find data --property field_id=23148 --brief
---
dx find data --name "*.bam" --brief
---
```
Let's take this apart piece by piece.
`xargs` takes an `-I` argument that specifies a placeholder. In our case, we are using `%` as our placeholder in this statement.
We're passing on each filename from `ls` into the following code:
```{bash}
#| eval: false
sh -c 'head %; echo "---\n"'
```
The `sh -c` opens a subshell so that we can execute our command for each of the files in our list. We're using `sh -c` to run:
```{bash}
#| eval: false
'head %; echo "---\n"'
```
So for our first file, `01-scripting-basics.qmd`, we are substituting that for `%` in our command:
```{bash}
#| eval: false
'head 01-scripting-basics.qmd; echo "---\n"'
```
For our second file, `cloud-computing-basics.qmd`, we would substitute that for the `%`:
```{bash}
#| eval: false
'head cloud-computing-basics.qmd; echo "---\n"'
```
Until we cycle through all of the files in our list.
### The Basic `xargs` pattern
:::{#fig-xargs}
```{mermaid}
graph LR
A["ls *.bam"] --> B{"|"}
B --> C["xargs -I% sh -c"]
C --> D["command_to_run %"]
```
Basics of using `xargs` to iterate on a list of files
:::
As you cycle through lists of files, keep in mind this basic pattern (@fig-xargs):
```{bash}
#| eval: false
ls <wildcard> | xargs -I% sh -c "<command to run> %"
```
We will leverage this pattern when we get to batch processing files (@sec-batch).
:::{.callout-note}
## Test Yourself
How would we modify the below code to do the following?
1. List only `.json` files in our `data/` folder using `ls`
1. Use `tail` instead of `head`
```{bash}
#| eval: false
ls *.txt | xargs -I% sh -c "head %; echo '---\n'"
```
:::
:::{.callout-note collapse="true"}
## Answer
```{bash}
#| eval: false
ls data/*.json | xargs -I% sh -c "tail %; echo '---\n'"
```
:::
:::{.callout-note}
## Why this is important on the platform
We'll use this to execute batch jobs using `dx run`. This especially becomes powerful on the platform when we use `dx find files` to list files in our DNAnexus project.
:::
### For more information
<https://www.baeldung.com/linux/xargs-multiple-arguments>
## Variables in Bash Scripts {#sec-bash-variables}
We've already encountered a placeholder variable, `%`, that we used in running `xargs`. Let's talk about declaring variables in bash scripts and using them using variable expansion.
In Bash, we can declare a variable by using `<variable_name>=<value>`. Note there are no spaces between the variable (`my_variable`), equals sign, and the value (`"ggplot2"`).
```{bash}
my_variable="ggplot2"
echo "My favorite R package is ${my_variable}"
```
Take a look at line 3 above. We expand the variable (that is, we substitute the actual variable) by using `${my_variable}` in our `echo` statement.
In general, when expanding a variable in a quoted string, it is better to use `${my_variable}` (the variable name in curly brackets). This is especially important when using the variable name as part of a string:
```{bash}
my_var="chr1"
echo "${my_var}_1.vcf.gz"
```
If we didn't use the braces here, like this:
```
echo "$my_var_1.vcf.gz"
```
Bash would look for the variable `$my_var_1`, which doesn't exist. So use the curly braces `{}` when you expand variables. It's safer overall.
There is an alternate method for variable expansion which we will use when we call a *sub-shell* - a shell within a shell, much like in our `xargs` command above. We need to use parentheses `()` to expand them within the sub-shell, but not the top-shell. We'll use this when we process multiple files within a single worker.
::: {.callout-note}
### Why this is important on the platform
On the DNAnexus platform, there are special helper variables that are available to us, such as `$in_prefix`, which will give us the prefix of a file name input. This is essential when running commands within an app, as it allows us to generalize inputs and outputs in our app.
Here's a practical example:
```{bash}
#| eval: false
my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="${my_cmd}" -iin="notebook.ipynb"
```
We're storing the command we want to run in the `${my_cmd}` variable. Here we're mostly using it to break up the `dx run` statement on the next line.
:::
### `basename` can be very handy when on workers
If we are processing a bunch of files on a worker that we have specified using `dxFUSE` (@sec-dxfuse), we need a way to get the bare filename from a `dxfuse` path. We will take advantage of this when we run process multiple files on the worker.
For example
```
basename /mnt/project/worker_scripts/dx-run-script.sh
```
This will return:
```
dx-run-script.sh
```
Which can be really handy when we name our outputs.
## Quoting and Escaping Filenames in Bash
One point of confusion is when do you quote things in Bash? When do you use single quotes (`'`) versus double-quotes (`"`)? When do you use `\` to escape characters?
Let's talk about some quoting rules in Bash. I've tried to make things as simplified and generalized as possible, rather than stating all of the rules for each quote.
1. If you have spaces in a filename, use double quotes (`"chr 1.bam"`)
1. If you have a single quote in the filename, use double quotes to wrap it (`"ted's file.bam"`)
1. Only escape characters when necessary - if you can solve a problem with quotes, use them
1. If you need to preserve an escaped character, use single quotes
Let's go over each of these with an example.
### If you have spaces in a filename, use double quotes (Most common)
For example, if your filename is `chr 1 file.bam`, then use double quotes in your argument
```
samtools view -c "chr 1 file.bam"
```
### If you have a single quote in the name, use double quotes to wrap it (less common)
Say you have a file called `ted's new file.bam`. This can be a problem when you are calling it, especially because of the single quote.
In this case, you can do this:
```
samtools view -c "ted's new file.bam"
```
### Only escape characters when necessary (less common)
There are a number of special characters (such as Tab, and Newline) that can be specified as escape characters. In double quotes, characters such as `$` are signals to Bash to expand or evaluate code.
Say that someone had a `$` in their file name such as `Thi$file is money.bam`
How do we refer to it? We can escape the character with a backslash `\`:
```
samtools view -c "Thi\$file is money.bam"
```
The backslash is a clue to Bash that we don't want variable expansion in this case. Without it, bash would look for a variable called `$file`.
### If you need to preserve an escaped character, use single quotes (least common)
This is rarely used, but if you need to keep an escaped character in your filename, you can use single quotes. Say we have a filename called `Thi\$file.bam` and you need that backslash in the file name (btw, please don't do this), you can use single quotes to preserve that backslash:
```
samtools view -c 'Thi\$file.bam'
```
Again, hopefully you won't need this.
### For More Info
<https://www.grymoire.com/Unix/Quote.html#uh-3>
:::{.callout-note}
## What about backticks?
Backticks (`` ` ``) are an old way to do command evaluation in Bash. For example, if we run the following on the command-line:
```
echo "there are `ls -l | wc -l` files in this directory"
```
Will produce:
```
there are 36 files in this directory
```
Their use is deprecated, so you should be using `$()` in your command evaluations instead:
```
echo "there are $(ls -l | wc -l) files in this directory"
```
:::
:::{.callout-note}
## What about X use case?
There are a lot of rules for Bash variable expansion and quoting that I don't cover here. I try to show you a way to do things that work in multiple situations on the platform.
That's why I focus on double quotes for filenames and `${}` for variable expansion in general. They will work whether your Bash script is on the command line or in an App, or in WDL.
:::
## Running a R script on the command line
Let's end this chapter with wrapping R scripts in a Bash script. Say you have an R Script you need to run on the command line. In our bash script, we can do the following:
```{bash}
#| filename: "scripting-basics/wrap_r_script.sh"
#| eval: false
#!/bin/bash
Rscript my_script.R CSVFILE="${1}"
```
This calls `Rscript`, which is the command line executable, to run our R script. Note that we have a named argument called `CSVFILE` and it is done differently than in Bash - how do we use this in our R Script?
### Using Named Arguments in an R script
We can pass arguments from our bash script to our R script by using `commandArgs()` - this will populate a list of named arguments (such as `CSVFILE`) that are passed into the R Script. We assign the output of `commandArgs()` into the `args` object.
We refer to our `CSVFILE` argument as `args$CSVFILE` in our script.
```{r}
#| eval: false
#| filename: "scripting-basics/r_script.R"
library(tidyverse)
args <- commandArgs()
# Use arg$CSVFILE in read.csv
csv_file <- read.csv(file=args$CSVFILE)
# Do some work with csv_file
csv_filtered <- csv_file |> dplyr::filter()
# Write output
write.csv(csv_filtered, file = paste0(args$CSVFILE, "_filtered.csv"))
```
### Running our R Script
Now that we've set it up, we can run the R script from the command line as follows:
```{bash}
#| eval: false
bash my_bash_script.sh my_csvfile.csv
```
In our bash script, `my_bash_script.sh`, we're using positional argument (for simplicity) to specify our csvfile, and then passing the positional argument to named ones (`CSVFILE`) for `my_r_script.R`.
:::{.callout-note}
## Why this is important on the platform
We'll see when we build apps that our executable scripts need to be written as bash scripts. This means that if we want to run R code, we need to wrap it in a bash script.
:::
## What you learned in this chapter
Whew, this was a whirlwind tour. Keep this chapter in mind when you're working with the platform - the bash programming patterns will serve you well. We'll refer to these patterns a lot when we get to doing more bioinformatics tasks on the platform.
- Setting up bash scripts with positional arguments
- Iterating over a list of files using `xargs`
- How to use bash variables and variable expansions
- Wrapping an R Script in a bash script