# Working with files using `dx run` {#sec-dxrun}
:::{.callout-note}
## Prep for Exercises
Make sure you are logged into the platform using `dx login` and that your course project is selected with `dx select`.
In your shell, make sure you're in the `bash_bioinfo_scripts/dx-run/` folder:
```
cd dx-run/
```
:::
## Learning Objectives
1. **Utilize** an all-purpose app (`swiss-army-knife`) to run bioinformatics jobs with `dx run` on files in a DNAnexus project
1. **Utilize** multiple inputs in a Swiss Army Knife job.
1. **Explain** and **utilize** multiple options in a Swiss Army Knife job.
1. **Wrap** R and Python scripts in a bash script so they can be run in the cloud using `dx run`
1. **Tag** output files on the platform for downstream use.
## Doing Bioinformatics with the Swiss Army Knife (SAK) App
Let's focus on key bioinformatics tasks we need to accomplish. To make things easier, we'll be using [Swiss Army Knife](https://platform.dnanexus.com/app/swiss-army-knife), which is an app on the platform that contains a number of commonly used bioinformatics utilities, including:
:::{.column-body layout-ncol="2"}
- `bcftools`
- `bedtools`
- `BGEN`
- `bgzip`
- `BOLT-LMM`
- `Picard`
- `Plato`
- `plink`
- `plink2`
- `QCTool`
- `REGENIE`
- `sambamba`
- `samtools`
- `seqtk`
- `tabix`
- `vcflib`
- `vcftools`
:::
Swiss Army Knife also contains R and Python 3.
Everything you'll learn in this section is also applicable to building apps on the platform. You'll learn the foundations of running a script on the platform, which is halfway to building your own apps.
## Running Jobs on the DNAnexus platform using `dx run`
Our main command for running jobs is `dx run`. `dx run` lets us submit jobs to be run on the platform. These jobs can be to process files (such as aligning FASTQ files to a genome), or they can be for web apps, such as LocusZoom (for visualizing).
If you have used SLURM on an on-premises HPC system, the equivalent commands would be `srun` and `sbatch`, and if you have used PBS, the equivalent command would be `qsub`.
## Try out your first job
In the sample project, you will see a `.bam` file in `data/` called `NA12878.bam` with no index. Let's create an index for this file with `sambamba` by running it with `dx run app-swiss-army-knife`.
```{bash}
#| eval: false
#| filename: dx-run/run-sambamba.sh
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index *"
```
Let's take this code apart. We start our command with `dx run` and the name of the app on the platform we want to use: `app-swiss-army-knife`.
The second line contains the file input we want to process, which is a file in our current project. Note that the input is specified as `-iin`, not `--in` or `--iin`: the `-i` prefix is followed immediately by the name of the app's input field (here, `in`). Using a single hyphen `-` for inputs, rather than the double hyphen `--` we might expect from other Linux/UNIX tools, is a `dx run` convention.
The third line is the actual command we want to run in Swiss Army Knife. We want to run `sambamba index` on our input file.
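If you're ever unsure what an app's input fields are called, you can ask `dx run` itself to print the app's help, which lists its inputs:
```{bash}
#| eval: false
# Print Swiss Army Knife's help, including its named inputs ("in", "cmd", ...)
dx run app-swiss-army-knife --help
```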
When you run the above code, either by pasting it into your shell or by running `bash run-sambamba.sh`, you'll see the following response:
```
Using input JSON:
{
"cmd": "sambamba index *",
"in": [
{
"$dnanexus_link": {
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"id": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"
}
}
]
}
Confirm running the executable with this input [Y/n]: Y
Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination
project-GGyyqvj0yp6B82ZZ9y23Zf6q:/
Job ID: job-GJ0G58Q0yp65gzJzKjYv04bZ
Watch launched job now? [Y/n] Y
```
Respond with `Y` when it asks
`Confirm running the executable with this input [Y/n]:`
and again with `Y` when it asks
`Watch launched job now? [Y/n]`
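If you answer `n` to the watch prompt (or come back to the job later), you can still follow it using the job ID that `dx run` printed. For example, with the job ID from the output above:
```{bash}
#| eval: false
# Stream the job's log (the same view you get by answering Y to the watch prompt)
dx watch job-GJ0G58Q0yp65gzJzKjYv04bZ

# Show the job's state, inputs, and outputs
dx describe job-GJ0G58Q0yp65gzJzKjYv04bZ
```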
:::{.callout-note}
## Try it out!
Run the above code (also saved in `dx-run/run-sambamba.sh`) and read the log that it generates carefully, especially the output section.
What file does it create? Where does it create that file?
:::
:::{.callout-note collapse="true"}
## Answer
If you use `dx tree` to take a look at the project, you'll see that there was a file `NA12878.bam.bai` that was created in the root of our project.
```
dx tree
├── data
│   ├── NA12878.bai
│   ├── NA12878.bam
│   ├── NC_000868.fasta
│   ├── NC_001422.fasta
│   ├── small-celegans-sample.fastq
│   ├── SRR100022_chrom20_mapped_to_b37.bam
│   ├── SRR100022_chrom21_mapped_to_b37.bam
│   └── SRR100022_chrom22_mapped_to_b37.bam
└── NA12878.bam.bai
```
If we want to control where our index file ends up, we can use the `--destination` option to specify a destination folder, such as
`--destination data/`
:::
### What about the outputs?
One of the nice things about Swiss Army Knife is that it automatically transfers files that we generate on the worker. In our above job, we generated the corresponding index file `NA12878.bam.bai` from our `-icmd` input. It was automatically transferred over from the worker to the root of our project.
If we wanted to change the output folder, we can use the `--destination` flag, like below:
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index *" \
--destination "data/"
```
Another nice thing about `dx run` is that it will automatically create the folder we specify with `--destination` if it doesn't exist in project storage.
:::{.callout-note}
## What if we run our statement multiple times?
We can actually run our `dx run` statement multiple times. The output will be multiple files with the same name in the `destination` directory.
What's going on here? Files are considered *objects* on the platform. Each copy generated by our `dx run` is considered a unique object on the platform, with its own file identifier.
:::
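Each copy gets its own file ID even though the names match. One way to see this is to list the matching objects by ID (a quick check; adjust the name to whatever your job produced):
```{bash}
#| eval: false
# List every data object named NA12878.bam.bai in the current project,
# printing one project-xxxx:file-xxxx ID per copy
dx find data --name "NA12878.bam.bai" --brief
```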
## Building on our first `dx run` job
We've seen how to specify file inputs to Swiss Army Knife. Let's dive into some of the intricacies of using the `-icmd` input.
### Submitting a script to the worker {#sec-worker-script}
The `-icmd` input is helpful for small scripting jobs, but you often want to do a bunch of things in a Swiss Army Knife job.
Bash scripting to the rescue! Say we have a script called `run-on-worker.sh` that takes one positional argument, a file path. We'll run this script on the worker:
```{bash}
#| eval: false
#| filename: worker-scripts/run-on-worker.sh
sambamba index "$1"
samtools stats "$1" > "$1".stats.txt
```
When we set up a `dx run app-swiss-army-knife`, we'll use this script as one of the inputs to `dx run`:
```{bash}
#| eval: false
#| filename: dx-run/dx-run-script.sh
dx run app-swiss-army-knife \
-iin="worker_scripts/run-on-worker.sh" \
-iin="data/NA12878.bam" \
-icmd="bash run-on-worker.sh $in_path"
```
The magic here is that we're using `bash` in our `cmd` input to run the script `run-on-worker.sh`. Notice this script is already in our project in the `worker_scripts/` folder, so we can submit it as a file input:
```
-iin="worker_scripts/run-on-worker.sh"
```
:::{.callout-note}
# Why this is important on the platform
This is a common pattern that we'll use when we need to do more complicated things with Swiss Army Knife. It doesn't seem that helpful right now, but it will once we get to batch scripting.
:::
You'll notice that we are using a special helper variable here called `${in_path}` - this gives the current path of the file on the *worker*. It is tremendously helpful in scripting. (Note the single quotes around the `-icmd` value: they stop your local shell from expanding `${in_path}` before the command reaches the worker.) Let's learn about these built-in helper variables next.
### Bash Helper Variables
There are some helper variables based on the file inputs specified in `-iin`.
These variables are really helpful when scripting in our `-icmd` input. Say we ran our command:
```
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index ${in_path}"
```
You'll see that we are using a special variable called `$in_path` here to specify the file name. This is called a *helper variable* - they are available based on the different file inputs we submit to Swiss Army Knife.
For our one input `-iin="data/NA12878.bam"`, this is what these helper variables contain:
| Bash Variable | Value | Notes |
|---------------|-------|-------|
| `$in` | `data/NA12878.bam` | Location in project storage |
| `$in_path` | `~/NA12878.bam` | Location in worker storage |
| `$in_name` | `NA12878.bam` | File name (without folder path) |
| `$in_prefix` | `NA12878` | File name without the suffix |
`$in_prefix` is really helpful when you want to create files that share a prefix with the input file. Say we wanted to run `samtools view -c` on our file and write the read count to a file called `NA12878.counts.txt`.
```
dx run app-swiss-army-knife \
-iin="data/NA12878.bam" \
-icmd="samtools view -c ${in_path} > ${in_prefix}.counts.txt"
```
When the command is run on the worker, the variables are expanded like this:
```
samtools view -c ~/NA12878.bam > NA12878.counts.txt
```
:::{.callout-note}
## What is the difference between `$in` and `$in_path`?
You may have looked at the above table and been slightly confused about these two variables.
`$in` is the file location in **project storage**, while `$in_path` is the file location in the temporary **worker storage**.
When writing scripts that run on the worker with the `-icmd` input, `$in_path` is much more useful.
:::
:::{.callout-note}
## Why are we learning about helper variables?
Writing scripts that can be run on the worker is part of the app building process, which lets us specify our own executables.
If you can get a handle on working with helper variables in `-icmd`, then you are most of the way to building an app.
:::
## More Swiss Army Knife
There is a table in the [Swiss Army Knife documentation](https://platform.dnanexus.com/app/swiss-army-knife) that specifies the inputs and how to configure the `-icmd` parameter for each operation.
One thing you'll notice is that some of them require multiple file inputs. How do you specify them?
### Multiple file inputs to Swiss Army Knife
One thing that may not be apparent to you is that you can specify multiple file inputs with multiple `-iin` options.
The Swiss Army Knife Documentation notes that `-iin` is actually an *array* file input. That means that we can submit multiple files like this:
```{bash}
#| eval: false
#| filename: dx-run/dx-run-multiple-inputs.sh
dx run app-swiss-army-knife \
-iin="data/SRR100022_chrom20_mapped_to_b37.bam" \
-iin="data/SRR100022_chrom21_mapped_to_b37.bam" \
-iin="data/SRR100022_chrom22_mapped_to_b37.bam" \
-icmd="ls *.bam | xargs -I% sh -c 'samtools view -c % > %.count.txt'"
```
If we want to process each of these files, one way is to use `xargs` in our `icmd` statement:
```
-icmd="ls *.bam | xargs -I% sh -c 'samtools view -c % > %.count.txt'"
```
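If the `xargs` one-liner feels opaque, an equivalent `-icmd` using a plain `for` loop (a sketch of the same job, not a separate Swiss Army Knife feature) would be:
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-iin="data/SRR100022_chrom20_mapped_to_b37.bam" \
-iin="data/SRR100022_chrom21_mapped_to_b37.bam" \
-iin="data/SRR100022_chrom22_mapped_to_b37.bam" \
-icmd='for f in *.bam; do samtools view -c "$f" > "$f".count.txt; done'
```
The single quotes around the `-icmd` value keep `$f` from being expanded by your local shell.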
We'll cover more of this in @sec-batch.
### Using PLINK triplets {#sec-plink}
If we're working with a PLINK triplet of files (`.bed`, `.bim`, and `.fam`), we can submit the three files as separate inputs and use the `--bfile` option in `plink` to refer to all of them by their shared prefix.
```{bash}
#| eval: false
#| filename: dx-run/run-plink.sh
dx run app-swiss-army-knife \
-iin="data/plink/chr1.bed" \
-iin="data/plink/chr1.bim" \
-iin="data/plink/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc"
```
We will output the QC'ed and filtered files as `chr1_qc.bed`, `chr1_qc.bim`, and `chr1_qc.fam`. We do that by specifying the `--out` option for `plink`.
## `dx run` runs deep
If you've looked at the `dx run` help page, you'll notice it is quite long. There are a lot of options. Let's look at a few of them that are really helpful when you're getting started.
- `--tag` - please use tags. Your jobs will be much easier to find with a tag.
- `--destination` - destination folder in project. If the path doesn't exist, it will be created.
- `--instance-type` - the instance type to use for the job.
- `--watch` - run `dx watch` for the job id that is generated
- `-y` - non-interactive mode. Replies yes to the questions and starts up `dx watch` for that job id.
- `--allow-ssh` - very helpful in debugging jobs. Allows you to `dx ssh` into that worker and check out the status.
- `--batch-tsv` - works with the tab-delimited files that you generate using `dx generate_batch_inputs`.
- `--detach` - when `dx run` is called from within a running job, the new job is detached from the parent and runs as its own top-level execution. This is really important in batch operations where a top-level job launches other jobs: running those jobs detached means the top-level job will not fail if one of them fails.
Below are some examples that leverage these options.
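A few of these options don't get their own example below, so here's a quick sketch that combines them: `-y` skips the confirmation prompt, `--watch` streams the job log, and `--allow-ssh` lets you `dx ssh` into the worker later if you need to debug.
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index *" \
--allow-ssh \
-y --watch
```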
### Save our output files into a different folder
As we mentioned above, by default our output files are generated in the root folder of the project. If we want to control this, we can use the `--destination` option.
Remember, if the folder given to `--destination` doesn't exist, `dx run` will create that folder and send the outputs into it.
```{bash}
#| eval: false
#| filename: dx-run/run-destination.sh
dx run app-swiss-army-knife \
-iin="data/chr1.bed" \
-iin="data/chr1.bim" \
-iin="data/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
--destination "results/"
```
### Run our job on a different instance type
We can specify a different instance type (@sec-instance) with the `--instance-type` parameter.
```{bash}
#| eval: false
#| filename: dx-run/run-instance.sh
dx run app-swiss-army-knife \
-iin="data/chr1.bed" \
-iin="data/chr1.bim" \
-iin="data/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
--instance-type "mem1_ssd1_v2_x8"
```
### Tag and rename our job
We can change the name of our job using `--name` - this can be very helpful when running a bunch of jobs. Here we're adding `$in_prefix` to the name so we can see which chromosome we're running on, which is very helpful in batch submissions.
Adding a tag to our job, such as the run number, can be very helpful in terminating a large set of jobs (@sec-terminate).
```{bash}
#| eval: false
#| filename: dx-run/run-job-tag-name.sh
dx run app-swiss-army-knife \
-iin="data/chr1.bed" \
-iin="data/chr1.bim" \
-iin="data/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
--tag "plink_job" --name "Run PLINK on fileset: $in_prefix"
```
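Once the job is tagged, finding it later among all of your jobs is a one-liner:
```{bash}
#| eval: false
# List all of your jobs carrying the plink_job tag
dx find jobs --tag "plink_job"
```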
### Clone a failed job
If you run a job at a lower priority, it has a chance of getting bumped (stopped) by higher-priority work. If that happens, your job will fail.
That's when using the `--clone` parameter can come in handy.
```{bash}
#| eval: false
dx run app-swiss-army-knife --clone <job-id>
```
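To find a job worth cloning in the first place, `dx find jobs` can filter by state (a quick sketch; adjust the number of results to taste):
```{bash}
#| eval: false
# List your most recent failed jobs so you can pick one to clone
dx find jobs --state failed --num-results 10
```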
When we learn about JSON (@sec-json), we'll learn a recipe for restarting a list of these failed jobs.
:::{.callout-note}
## Keep in Mind: `dx run --tag`
Keep in mind when you are using the `--tag` option with `dx run`, it does not tag the *outputs* of your jobs with that tag. It only tags the jobs.
:::
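If you want the output *files* to carry a tag as well (for example, a run number that downstream steps can look for), you can tag them after the job finishes with `dx tag`. A minimal sketch, assuming the PLINK outputs above landed in `results/`:
```{bash}
#| eval: false
# Tag each output file with a run number so downstream steps can find this set
dx tag "results/chr1_qc.bed" "run001"
dx tag "results/chr1_qc.bim" "run001"
dx tag "results/chr1_qc.fam" "run001"
```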
## dxFUSE: Simplify your scripts with multiple inputs {#sec-dxfuse}
Ok, we learned about submitting files as inputs. Is there a way to run scripts with less typing?
There is a general method for working in Swiss Army Knife and other apps that lets you bypass specifying file inputs explicitly in your `dx run` statement: using the [dxFUSE file system](https://github.com/dnanexus/dxfuse).
The main thing you need to know as a developer is that you can prepend a `/mnt/project/` to your file path to use in your `-icmd` input directly. For the `dx run` statement in @sec-plink, we can rewrite it as the following:
```{bash}
#| eval: false
#| filename: dx-run/run-job_dxfuse.sh
dx run app-swiss-army-knife \
-icmd="plink --bfile /mnt/project/data/plink/chr1 --geno 0.01 --make-bed --out chr1_qc"
```
In our `-icmd` input, we refer to the file location of the triplet by prepending `/mnt/project/` to the triplet's project path (`data/plink/chr1`) to make a new path:
```
/mnt/project/data/plink/chr1
```
This new path lets us specify files from project storage without directly specifying them as an input.
We could rewrite this command
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-iin="data/NA12878.bam" \
-icmd="samtools view -c NA12878.bam > NA12878.counts.txt"
```
As this:
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-icmd="samtools view -c /mnt/project/data/NA12878.bam > NA12878.counts.txt"
```
We can also use `ls` on a dxFUSE path in our `-icmd`. This is a pattern that will become especially helpful when we process multiple files on a single worker (@sec-mult-worker).
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-icmd="ls /mnt/project/data/*.bam | xargs -I% sh -c 'samtools stats % > \$(basename %).stats.txt'" \
--destination "results/"
```
Our `cmd` input is kind of complicated here. Within the subshell, we're taking advantage of the `basename` command, which returns the bare filename of our object (without the `/mnt/project/...` folder path). We need this because dxFUSE is effectively read-only, so the output has to be written to the worker's local working directory rather than back through the dxFUSE mount.
```
sh -c 'samtools stats % > \$(basename %).stats.txt'
```
What's going on with the `\$(basename %)`? We need to use the parentheses (`()`) because it's in a subshell command, and we escape the `$` in front of it, so that our base shell doesn't expand it, since we want the shell on the worker to expand it instead.
So if our first file is `/mnt/project/data/NA12878.bam`, then the substitution on the worker goes like this:
```
samtools stats /mnt/project/data/NA12878.bam > NA12878.bam.stats.txt
```
Most of these issues arise because we need to specify subshells in our `cmd` and use dxFUSE. Honestly, I would probably put the `xargs` code into a script and submit that script to the worker, rather than specifying it as a single line in our `cmd` input (@sec-worker-script).
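As a sketch of that approach (the script name and folder here are hypothetical), the worker script could look like this:
```{bash}
#| eval: false
#| filename: worker-scripts/stats-all.sh
# Hypothetical worker script: read each BAM from project storage via dxFUSE
# and write each stats file to the worker's working directory, which
# Swiss Army Knife then uploads back to the project.
for f in /mnt/project/data/*.bam; do
  samtools stats "$f" > "$(basename "$f" .bam).stats.txt"
done
```
We would then submit it the same way as in @sec-worker-script, with `-iin="worker_scripts/stats-all.sh"` and `-icmd="bash stats-all.sh"`.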
### Advantages of dxFUSE
Most of my scripts leverage dxFUSE. Let's talk about the advantages of using `dxFUSE`.
- **You don't have to specify files as inputs**. This avoids a lot of typing in your scripts. This can be very helpful if there are a number of files you need as inputs per job.
- **Allows you to stream files**. dxFUSE is a file streaming method. That is, it will stream files bit by bit to the worker. This can make the overall job run somewhat faster, as it can stream files as needed, and not have to explicitly transfer each file to the worker.
### Disadvantages of dxFUSE
There are some disadvantages to using dxFUSE you should be aware of:
- **It obscures your audit trail.** Because we are not specifying the files as inputs, it is less obvious which files were processed by each job.
- **It works best as read-only.** dxFUSE has limited write capabilities at this point, so it is best to treat it as a read-only file system.
- **You have to be aware of how it resolves duplicate file names in a folder**. dxFUSE works differently than you might expect when it encounters duplicate file names: it separates the duplicates into numbered subfolders. For example, if you generated three files called `chr1_results.txt` in a folder called `data`, this is how you would refer to each of them:
```
data/chr1_results.txt
data/1/chr1_results.txt
data/2/chr1_results.txt
```
This name resolution can be tricky to work with.
- **It works best with good file hygiene practices.** One of the confusing things is that you can create multiple files with the same filename in the same folder (see the previous point). This can easily happen if you `--clone` a job. Because each file has its own unique identifier, files with identical names are allowed on the platform. This is one reason to tag each set of jobs with a run number (such as `run001`), so you can make sure that all of the updated files are the correct version.
## Using a Docker Image with `dx run`/Swiss Army Knife
See the containers chapter (@sec-containers) for more information.
## What you learned in this chapter
Kudos and congrats for making it this far. We built on our shell-scripting skills (@sec-script) and our knowledge of cloud computing (@sec-cloud) to start running jobs effectively in the cloud.