Tips and tricks

Arjun Prasad edited this page Sep 11, 2020 · 32 revisions
     * [Combine multiple AMRFinderPlus output files prepending each line with the filename](#combine-mutliple-amrfinderplus-output-files-prepending-each-line-with-the-filename)
     * [Filter AMRFinderPlus output for a specific type/subtype/scope/class, etc. using awk.](#filter-amrfinderplus-output-for-a-specfic-typesubtypescopeclass-etc-using-awk)
     * [Use a bash loop to run on a number of files serially](#use-a-bash-loop-to-run-on-a-number-of-files-serially)
     * [Combine results of AMRFinderPlus on many assemblies into one file for easier analysis](#combine-results-of-amrfinderplus-on-many-assemblies-into-one-file-for-easier-analysis)
     * [Considerations for HPC and maximizing throughput](#considerations-for-hpc-and-maximizing-throughput)

Combine multiple AMRFinderPlus output files prepending each line with the filename

When running AMRFinderPlus on many assemblies it is often useful to combine the output from many runs into one file with an additional column for an assembly identifier. A few ways to do this are outlined in issue 25.

The following assumes the files you want to combine are named *.amrfinder. It will create a file named combined.tsv that contains all of the AMRFinderPlus files combined.

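# Take the header line from the first file and add a leading "filename" column
h=$(head -n 1 "$(ls *.amrfinder | head -n 1)")
printf 'filename\t%s\n' "$h" > combined.tsv
# -H forces the "filename:" prefix even for a single file; \t in sed requires GNU sed
grep -H -v '^Protein identifier' *.amrfinder | sed 's/:/\t/' >> combined.tsv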
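
If a filename could contain a colon, or GNU sed is not available, the same combination can be done in a single awk pass. The following is a minimal sketch rather than the method from issue 25; it assumes every input file starts with the same header line, and combine_amrfinder is a hypothetical helper name.

```shell
# Combine the AMRFinderPlus output files given as arguments and write TSV to stdout.
# combine_amrfinder is a hypothetical helper name, not part of AMRFinderPlus.
combine_amrfinder() {
    awk '
        # First line of each file is its header: print it once, skip it afterwards
        FNR == 1 { if (NR == 1) print "filename\t" $0; next }
        # Every other line is a hit: prefix it with the file it came from
        { print FILENAME "\t" $0 }
    ' "$@"
}
```

It could then be used as `combine_amrfinder *.amrfinder > combined.tsv`.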

Filter AMRFinderPlus output for a specific type/subtype/scope/class, etc. using awk.

There are, of course, many ways to do this. Ex:

awk -F'\t' 'NR == 1 || $<column number> == "<value>" { print }' <amrfinder_output>

To use this technique you need to know the number of the field you want to filter on. For example, to keep only core hits you filter on Scope, which is column 8 in the output of a combined (protein + nucleotide) AMRFinderPlus run.

awk -F'\t' 'NR == 1 || $8 == "core" { print }' <amrfinder_output>
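
Because the column number differs between protein-only, nucleotide-only and combined runs, it can be safer to look the column up by its header text. The sketch below is one way to do that; filter_by_column is a hypothetical helper name, and it assumes the header contains a column named exactly as given (for example "Scope").

```shell
# Keep the header plus the rows where the named column equals the given value.
# filter_by_column is a hypothetical helper name, not part of AMRFinderPlus.
# usage: filter_by_column <column name> <value> <amrfinder_output>
filter_by_column() {
    awk -F'\t' -v name="$1" -v value="$2" '
        NR == 1 {
            # Locate the column whose header text matches the requested name
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            print; next
        }
        $col == value
    ' "$3"
}
```

For example, `filter_by_column Scope core <amrfinder_output>` would keep only core hits without needing to know that Scope is column 8.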

Use a bash loop to run on a number of files serially

This assumes that you're using a consistent filename format with .assembly.fa as the extension on your assembly nucleotide FASTA files, and that you want to run AMRFinderPlus serially (one job after the other). See issue 32 for another example.

for assembly in *.assembly.fa
do
    # Strip the .assembly.fa extension to build the output filename
    base=$(basename "$assembly" .assembly.fa)
    amrfinder -n "$assembly" --threads 8 --plus > "$base.amrfinder"
done
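
When a long serial run is interrupted, it can help to skip assemblies that already have results. The following is a sketch of that idea under the same filename assumptions as the loop above; amrfinder_output_name and run_serially are hypothetical helper names, and the `.tmp` suffix is only an illustration.

```shell
# Derive the output name for an assembly using parameter expansion.
# amrfinder_output_name is a hypothetical helper name.
amrfinder_output_name() {
    name="${1##*/}"                              # strip any leading directory
    printf '%s\n' "${name%.assembly.fa}.amrfinder"
}

# Run AMRFinderPlus serially, skipping assemblies whose output already exists.
# run_serially is a hypothetical helper name.
run_serially() {
    for assembly in "$@"; do
        out=$(amrfinder_output_name "$assembly")
        [ -s "$out" ] && continue                # non-empty output from an earlier run
        # Write to a temporary name first so a failed run never looks complete
        amrfinder -n "$assembly" --threads 8 --plus > "$out.tmp" \
            && mv "$out.tmp" "$out"
    done
}
```

It could be called as `run_serially *.assembly.fa`.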

Combine results of AMRFinderPlus on many assemblies into one file for easier analysis

Sometimes you want to collate the above output into a single file, prefixing each line with the filename. Here's one way to do that. This assumes that you want to combine all the AMRFinderPlus output files named with a .amrfinder extension. (Inspired by issue 25.)

header=$(head -n 1 "$(ls *.amrfinder | head -n 1)")
printf 'filename\t%s\n' "$header" > together.amrfinder.tab
# $'...' makes bash pass a real tab to grep, which does not interpret \t itself
grep -H '' *.amrfinder | grep -v $'Protein identifier\tContig id' \
    | sed 's/:/\t/' >> together.amrfinder.tab
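
As a small illustration of the kind of analysis the combined file makes easier, the sketch below counts hits per input file. count_hits_per_file is a hypothetical helper name, and it assumes the first column is the filename and the first line is the header, as produced above.

```shell
# Print "<filename><TAB><number of hits>" for each file in a combined table.
# count_hits_per_file is a hypothetical helper name, not part of AMRFinderPlus.
count_hits_per_file() {
    awk -F'\t' '
        NR > 1 { n[$1]++ }                        # skip the header, tally by first column
        END { for (f in n) print f "\t" n[f] }
    ' "$1" | sort
}
```

Running `count_hits_per_file together.amrfinder.tab` lists only files with at least one hit, since an assembly with no hits contributes no rows to the combined file.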

Considerations for HPC and maximizing throughput

Groups wanting to do very large numbers of AMRFinderPlus analyses may want to run it on a cluster or to run many jobs in parallel.

In our experience CPU is the bottleneck when running many AMRFinderPlus analyses at once, so to maximize efficiency when running many jobs in parallel we suggest using --threads 1 on each of the jobs.
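
On a single multi-core machine, one way to follow this suggestion is to let xargs keep a fixed number of single-threaded jobs running. This is a sketch, not a tested production setup: run_parallel is a hypothetical helper name, the .assembly.fa naming is carried over from the serial loop above, and it assumes an xargs that supports -0 and -P (GNU and BSD xargs do).

```shell
# Run one single-threaded AMRFinderPlus job per assembly, njobs at a time.
# run_parallel is a hypothetical helper name, not part of AMRFinderPlus.
# usage: run_parallel <njobs> <assembly.fa>...
run_parallel() {
    njobs=$1; shift
    # NUL-separate the filenames so names with spaces survive the pipe
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$njobs" sh -c '
        out="${1##*/}"                            # strip any leading directory
        out="${out%.assembly.fa}.amrfinder"       # swap the extension
        amrfinder -n "$1" --threads 1 --plus > "$out"
    ' _
}
```

For example, `run_parallel 16 *.assembly.fa`, with 16 standing in for the number of available cores.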

This assumes you have sufficient RAM to run all of the jobs in parallel. RAM requirements depend on the sequence input because BLAST and HMMER RAM requirements will depend on how many hits they have to keep track of while they're running. In our experience RAM is usually not an issue, but tests on your data may be required to get a good idea of maximum RAM use if your environment is memory limited.

AMRFinderPlus reads and writes some fairly large temporary files in /tmp, so it is possible disk throughput may be a limiting factor in some cases.