Tips and tricks
* [Combine multiple AMRFinderPlus output files prepending each line with the filename](#combine-mutliple-amrfinderplus-output-files-prepending-each-line-with-the-filename)
* [Filter AMRFinderPlus output for a specific type/subtype/scope/class, etc. using awk.](#filter-amrfinderplus-output-for-a-specfic-typesubtypescopeclass-etc-using-awk)
* [Use a bash loop to run on a number of files serially](#use-a-bash-loop-to-run-on-a-number-of-files-serially)
* [Combine results of AMRFinderPlus on many assemblies into one file for easier analysis](#combine-results-of-amrfinderplus-on-many-assemblies-into-one-file-for-easier-analysis)
* [Considerations for HPC and maximizing throughput](#considerations-for-hpc-and-maximizing-throughput)
When running AMRFinderPlus on many assemblies, it is often useful to combine the output from many runs into one file with an additional column for an assembly identifier. Several ways to do this are outlined in issue 25.
The following assumes the files you want to combine are named *.amrfinder. It creates a file named combined.tsv that contains the contents of all of the AMRFinderPlus output files, with the source filename prepended to each line.
h=$(head -n 1 "$(ls *.amrfinder | head -n 1)")
printf 'filename\t%s\n' "$h" > combined.tsv
grep -H -v '^Protein identifier' *.amrfinder | sed 's/:/\t/' >> combined.tsv
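The same table can also be built in a single awk pass, which does not rely on grep's `filename:` prefix and so also copes with filenames that contain a colon. This is only a sketch, not one of the approaches from issue 25: the function name `combine_amrfinder` is made up for illustration, and it assumes every input file starts with the same header line.

```shell
# combine_amrfinder (hypothetical helper): print the given AMRFinderPlus
# output files as one table, with the source filename as a new first column.
# Assumes every file begins with the same header line.
combine_amrfinder() {
    awk 'BEGIN { OFS = "\t" }
         FNR == 1 {
             # First line of each file is its header; emit it only once.
             if (!printed++) print "filename", $0
             next
         }
         { print FILENAME, $0 }' "$@"
}
```

Called as `combine_amrfinder *.amrfinder > combined.tsv`, it should produce the same layout as the commands above.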
There are, of course, many ways to filter AMRFinderPlus output for a specific type, subtype, scope, or class. One of them uses awk, for example:
awk -F'\t' 'NR == 1 || $<column number> == "<value>" { print }' <amrfinder_output>
To use this technique you need to know the number of the column you want to filter on. For example, say you want only core hits. In the output of a combined (protein + nucleotide) AMRFinderPlus run, Scope is column 8. The output is tab-delimited and some fields contain spaces, so awk must be told to split on tabs with -F'\t'.
awk -F'\t' 'NR == 1 || $8 == "core" { print }' <amrfinder_output>
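If you would rather not count columns by hand, awk can look the column up by its header name instead. The following is a minimal sketch under the assumption that the file is tab-delimited with a single header line; the helper name `filter_by_column` is hypothetical and not part of AMRFinderPlus.

```shell
# filter_by_column (hypothetical helper): keep the header plus every row
# whose named column equals the given value.
# Usage: filter_by_column <column name> <value> <amrfinder_output>
filter_by_column() {
    awk -F'\t' -v name="$1" -v value="$2" '
        NR == 1 {
            # Find the index of the requested column in the header line.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "no column named " name > "/dev/stderr"; exit 1 }
            print
            next
        }
        $col == value' "$3"
}
```

For instance, `filter_by_column Scope core <amrfinder_output>` would keep only core hits regardless of which column Scope falls in.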
The loop below assumes that you're using a consistent filename format with .assembly.fa as the extension on your assembly nucleotide FASTA files, and that you want to run AMRFinderPlus serially (one job after the other). See issue 32 for another example.
for assembly in *.assembly.fa
do
    base=$(basename "$assembly" .assembly.fa)
    amrfinder -n "$assembly" --threads 8 --plus > "$base.amrfinder"
done
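For long batches it can help to make the loop resumable. The sketch below wraps the same command in a function (the name `run_amrfinder_serially` is made up here), skips assemblies that already have a non-empty output file, and writes to a temporary name first so that a failed run does not leave a partial file behind. It assumes the same .assembly.fa naming as above.

```shell
# run_amrfinder_serially (hypothetical helper): run AMRFinderPlus on each
# *.assembly.fa file in the current directory, one after the other.
# Existing non-empty outputs are skipped so an interrupted batch can resume.
run_amrfinder_serially() {
    for assembly in *.assembly.fa
    do
        [ -e "$assembly" ] || continue          # glob matched nothing
        base=$(basename "$assembly" .assembly.fa)
        [ -s "$base.amrfinder" ] && continue    # already done
        # Write to a temporary name so a failed run never looks finished.
        if amrfinder -n "$assembly" --threads 8 --plus > "$base.amrfinder.tmp"
        then
            mv "$base.amrfinder.tmp" "$base.amrfinder"
        else
            echo "amrfinder failed on $assembly" >&2
            rm -f "$base.amrfinder.tmp"
        fi
    done
}
```

Running `run_amrfinder_serially` a second time after an interruption should then only process the assemblies that are still missing output.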
Sometimes you want to collate the above output into a single file, prefixing each line by the filename. Here's one way to do that. This assumes that you want to combine all the AMRFinderPlus output files named with a .amrfinder extension (inspired by issue 25).
header=$(head -n 1 "$(ls *.amrfinder | head -n 1)")
printf 'filename\t%s\n' "$header" > together.amrfinder.tab
grep -H '' *.amrfinder | grep -v $'Protein identifier\tContig id' \
    | sed 's/:/\t/' >> together.amrfinder.tab
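Once the results are in one table, standard tools can summarize them. As an illustration only, the sketch below counts the rows contributed by each input file. The helper name `hits_per_file` is hypothetical, and it assumes the filename is the first tab-delimited column and the file has one header line.

```shell
# hits_per_file (hypothetical helper): print "<filename><TAB><row count>"
# for each input file represented in a combined table, sorted by filename.
hits_per_file() {
    awk -F'\t' 'NR > 1 { n[$1]++ }
                END { for (f in n) print f "\t" n[f] }' "$1" | sort
}
```

For example, `hits_per_file together.amrfinder.tab`. Note that an assembly with no hits contributes no rows to the combined file, so it will not appear in this summary.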
Groups wanting to do very large numbers of AMRFinderPlus analyses may want to run it on a cluster or to run many jobs in parallel.
In our experience, CPU is the bottleneck when running many AMRFinderPlus jobs, so to maximize efficiency when running many jobs in parallel we suggest using --threads 1 on each of the jobs.
This assumes you have sufficient RAM to run all of the jobs in parallel. RAM requirements depend on the sequence input because BLAST and HMMER RAM requirements will depend on how many hits they have to keep track of while they're running. In our experience RAM is usually not an issue, but tests on your data may be required to get a good idea of maximum RAM use if your environment is memory limited.
AMRFinderPlus reads and writes some fairly large temporary files in /tmp, so it is possible disk throughput may be a limiting factor in some cases.
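On a single multi-core machine, one way to apply the --threads 1 suggestion is to let xargs keep a fixed number of single-threaded jobs running. This is a rough sketch rather than a recommended setup: the function name `run_amrfinder_parallel` and the default of 4 jobs are arbitrary choices for illustration, it assumes the .assembly.fa naming used in the loop example, and it relies on the -0 and -P options that GNU and BSD xargs provide. On a cluster you would normally submit one job per assembly through your scheduler instead.

```shell
# run_amrfinder_parallel (hypothetical helper): run one single-threaded
# AMRFinderPlus job per *.assembly.fa file, at most $1 jobs at a time.
run_amrfinder_parallel() {
    njobs=${1:-4}    # simultaneous jobs; 4 is an arbitrary default
    printf '%s\0' *.assembly.fa |
        xargs -0 -n 1 -P "$njobs" sh -c '
            base=$(basename "$1" .assembly.fa)
            amrfinder -n "$1" --threads 1 --plus > "$base.amrfinder"
        ' sh
}
```

For example, `run_amrfinder_parallel 16` keeps up to 16 jobs running at once. The number of available cores is a reasonable starting point for the job count, provided RAM and temporary disk throughput keep up as described above.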