docs/mode_merge.txt

SYNOPSIS

    metacache merge <result file/directory>... -taxonomy <path> [OPTION]...

    metacache merge -taxonomy <path> [OPTION]... <result file/directory>...


DESCRIPTION

    This mode classifies reads by merging the results of multiple, independent
    queries. These might have been obtained by querying one database with
    different parameters or by querying different databases with different
    reference sequences or build options.

    IMPORTANT: In order to be mergable, independent queries
    need to be run with options:
        -tophits -queryids -lowest species
    and must NOT be run with options that suppress or alter default output
    like, e.g.: -no-map, -no-summary, -separator, etc.

    Possible Use Case:
    If your system has not enough memory for one large database, you can
    split up the set of reference genomes into several databases and query these
    in succession. The results of these independent query runs can then be
    merged to obtain a classification based on the whole set of genomes.


REQUIRED PARAMETERS

    <result file/directory>...
                      MetaCache result files.
                      If directory names are given, they will be searched for
                      sequence files (at most 10 levels deep).
                      IMPORTANT: Result files must have been produced with:
                      -tophits -queryids -lowest species
                      and must NOT be run with options that suppress or alter
                      the default output like, e.g.: -no-map, -no-summary,
                      -separator, etc.

    -taxonomy <path>  directory with taxonomic hierarchy data (see NCBI's
                      taxonomic data files)


MERGING RESULTS OUTPUT

    <file>            Redirect output to file <file>.
                      If not specified, output will be written to stdout.


CLASSIFICATION

    -lowest <rank>    Do not classify on ranks below <rank>
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      default: sequence

    -highest <rank>   Do not classify on ranks above <rank>
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      default: domain

    -hitmin <t>       Sets classification threshhold to <t>.
                      A read will not be classified if less than t features from
                      the database match. Higher values will increase precision
                      at the expense of sensitivity.
                      default: 0

    -hitdiff <d>      Sets candidate LCA threshhold to <d> percent.
                      Influences if only candidate with the most hits will be
                      used as classification result or if taxa of other
                      candidates will be considered.
                      All candidate (taxa) will be included that have at least
                      d% as many hits above the hit-min threshold as the
                      candidate with the most hits.
                      default: 100

    -maxcand <#>      maximum number of reference taxon candidates to consider
                      for each query;
                      A large value can significantly decrease the querying
                      speed!.
                      default: 2

    -cov-percentile <p>
                      Remove the p-th percentile of hit reference sequences with
                      the lowest coverage. Classification is done using only the
                      remaining reference sequences. This can help to reduce
                      false positives, especially whenyour input data has a high
                      sequencing coverage.
                      This feature decreases the querying speed!
                      default: off


GENERAL OUTPUT FORMATTING

    -silent|-verbose  information level during build:
                      silent => none / verbose => most detailed
                      default: neither => only errors/important info

    -no-summary       Dont't show result summary & mapping statistics at the end
                      of the mapping output
                      default: off

    -no-query-params  Don't show query settings at the beginning of the mapping
                      output
                      default: off

    -no-err           Suppress all error messages.
                      default: off


CLASSIFICATION RESULT FORMATTING

    -no-map           Don't report classification for each individual query
                      sequence; show summaries only (useful for quick tests).
                      default: off

    -mapped-only      Don't list unclassified reads/read pairs.
                      default: off

    -taxids           Print taxon ids in addition to taxon names.
                      default: off

    -taxids-only      Print taxon ids instead of taxon names.
                      default: off

    -omit-ranks       Do not print taxon rank names.
                      default: off

    -separate-cols    Prints *all* mapping information (rank, taxon name, taxon
                      ids) in separate columns (see option '-separator').
                      default: off

    -separator <text> Sets string that separates output columns.
                      default: '\t|\t'

    -comment <text>   Sets string that precedes comment (non-mapping) lines.
                      default: '# '

    -queryids         Show a unique id for each query.
                      Note that in paired-end mode a query is a pair of two read
                      sequences. This option will always be activated if option
                      '-hits-per-ref' is given.
                      default: off

    -lineage          Report complete lineage for per-read classification
                      starting with the lowest rank found/allowed and ending
                      with the highest rank allowed. See also options '-lowest'
                      and '-highest'.
                      default: off


ANALYSIS


    ANALYSIS: ABUNDANCES

        -abundances <file>
                      Show absolute and relative abundance of each taxon.
                      If a valid filename is given, the list will be written to
                      this file.
                      default: off

        -abundance-per <rank>
                      Show absolute and relative abundances for each taxon on
                      one specific rank.
                      Classifications on higher ranks will be estimated by
                      distributing them down according to the relative
                      abundances of classifications on or below the given rank.
                      (Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain)
                      If '-abundances <file>' was given, this list will be
                      printed to the same file.
                      default: off


    ANALYSIS: RAW DATABASE HITS

        -tophits      For each query, print top feature hits in database.
                      default: off

        -allhits      For each query, print all feature hits in database.
                      default: off

        -locations    Show locations in candidate reference sequences.
                      Activates option '-tophits'.
                      default: off

        -hits-per-ref <file>
                      Shows a list of all hits for each reference sequence.
                      If this condensed list is all you need, you should
                      deactive the per-read mapping output with '-no-map'.
                      If a valid filename is given after '-hits-per-ref', the
                      list will be written to a separate file.
                      Option '-queryids' will be activated and the lowest
                      classification rank will be set to 'sequence'.
                      default: off


    ANALYSIS: ALIGNMENTS

        -align        Show semi-global alignment to best candidate reference
                      sequence.
                      Original files of reference sequences must be available.
                      This feature decreases the querying speed!
                      default: off


ADVANCED: GROUND TRUTH BASED EVALUATION

    -ground-truth     Report correct query taxa if known.
                      Queries need to have either a 'taxid|<number>' entry in
                      their header or a sequence id that is also present in the
                      database.
                      This feature decreases the querying speed!
                      default: off

    -precision        Report precision & sensitivity by comparing query taxa
                      (ground truth) and mapped taxa.
                      Queries need to have either a 'taxid|<number>' entry in
                      their header or a sequence id that is also found in the
                      database.
                      This feature decreases the querying speed!
                      default: off

    -taxon-coverage   Report true/false positives and true/false negatives.This
                      option turns on '-precision', so ground truth data needs
                      to be available.
                      This feature decreases the querying speed!
                      default: off


ADVANCED: CUSTOM QUERY SKETCHING (SUBSAMPLING)

    -kmerlen <k>      number of nucleotides/characters in a k-mer
                      default: determined by database

    -sketchlen <s>    number of features (k-mer hashes) per sampling window
                      default: determined by database

    -winlen <w>       number of letters in each sampling window
                      default: determined by database

    -winstride <l>    distance between window starting positions
                      default: determined by database


ADVANCED: PERFORMANCE TUNING / TESTING

    -threads <#>      Sets the maximum number of parallel threads to use.default
                      (on this machine): 8


    -batch-size <#>   Process <#> many queries (reads or read pairs) per thread
                      at once.
                      default (on this machine): 4096

    -query-limit <#>  Classify at max. <#> queries (reads or read pairs) per
                      input file.
                      default: 9223372036854775807