SYNOPSIS
metacache merge <result file/directory>... -taxonomy <path> [OPTION]...
metacache merge -taxonomy <path> [OPTION]... <result file/directory>...
DESCRIPTION
This mode classifies reads by merging the results of multiple, independent
queries. These might have been obtained by querying one database with
different parameters or by querying different databases with different
reference sequences or build options.
IMPORTANT: In order to be mergeable, independent queries
need to be run with options:
-tophits -queryids -lowest species
and must NOT be run with options that suppress or alter the default output,
such as -no-map, -no-summary, or -separator.
Possible Use Case:
If your system does not have enough memory for one large database, you can
split up the set of reference genomes into several databases and query these
in succession. The results of these independent query runs can then be
merged to obtain a classification based on the whole set of genomes.
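
The merge call for this use case can be assembled as in the following
minimal sketch. It only follows the first SYNOPSIS form; the helper name
build_merge_command, the result file names, and the taxonomy directory name
are hypothetical and not part of MetaCache.

```python
import shlex

def build_merge_command(result_paths, taxonomy_dir):
    """Assemble 'metacache merge' arguments following the SYNOPSIS."""
    if not result_paths:
        raise ValueError("at least one result file or directory is required")
    # Form: metacache merge <result file/directory>... -taxonomy <path>
    return ["metacache", "merge", *result_paths, "-taxonomy", taxonomy_dir]

# Hypothetical names: one result file per partial database.
cmd = build_merge_command(["results_db1.txt", "results_db2.txt"], "ncbi_taxonomy")
print(shlex.join(cmd))
```

The argument list can be passed to subprocess.run as is. Each result file
must have been produced by a query run with -tophits -queryids -lowest species.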
REQUIRED PARAMETERS
<result file/directory>...
MetaCache result files.
If directory names are given, they will be searched for
result files (at most 10 levels deep).
IMPORTANT: Result files must have been produced with:
-tophits -queryids -lowest species
and must NOT have been produced with options that suppress
or alter the default output, such as -no-map, -no-summary,
or -separator.
-taxonomy <path> directory with taxonomic hierarchy data (see NCBI's
taxonomic data files)
MERGING RESULTS OUTPUT
<file> Redirect output to file <file>.
If not specified, output will be written to stdout.
CLASSIFICATION
-lowest <rank> Do not classify on ranks below <rank>
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
default: sequence
-highest <rank> Do not classify on ranks above <rank>
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
default: domain
-hitmin <t> Sets classification threshold to <t>.
A read will not be classified if fewer than <t> features from
the database match. Higher values will increase precision
at the expense of sensitivity.
default: 0
-hitdiff <d> Sets candidate LCA threshold to <d> percent.
Determines whether only the candidate with the most hits
will be used as classification result or whether the taxa of
other candidates will also be considered.
All candidates (taxa) will be included that have at least
<d>% as many hits above the hit-min threshold as the
candidate with the most hits.
default: 100
-maxcand <#> Maximum number of reference taxon candidates to consider
for each query.
A large value can significantly decrease the querying
speed!
default: 2
-cov-percentile <p>
Remove the p-th percentile of hit reference sequences with
the lowest coverage. Classification is done using only the
remaining reference sequences. This can help to reduce
false positives, especially when your input data has a high
sequencing coverage.
This feature decreases the querying speed!
default: off
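
The interplay of -hitmin and -hitdiff can be illustrated with the following
sketch. It implements one literal reading of the descriptions above and is
not taken from the MetaCache source; the function name select_candidates and
the example hit counts are assumptions made for illustration.

```python
def select_candidates(hits_by_taxon, hitmin=0, hitdiff=100.0):
    """Return the taxa kept as candidates under one reading of the options.

    hits_by_taxon: dict mapping a taxon name to its number of feature hits.
    """
    if not hits_by_taxon:
        return []
    top = max(hits_by_taxon.values())
    # -hitmin: no classification if fewer than hitmin features match.
    if top < hitmin:
        return []
    # -hitdiff: keep taxa with at least hitdiff percent as many hits
    # above the hit-min threshold as the top candidate.
    needed = (top - hitmin) * hitdiff / 100.0
    return sorted(t for t, h in hits_by_taxon.items() if h - hitmin >= needed)

hits = {"A": 40, "B": 30, "C": 10}       # hypothetical hit counts
only_top = select_candidates(hits)       # defaults: hitmin 0, hitdiff 100
relaxed = select_candidates(hits, hitdiff=70)
strict = select_candidates(hits, hitmin=50)
```

With the documented defaults only the top candidate remains. Lowering
-hitdiff admits further candidates, and according to the description their
taxa are then considered for the classification result.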
GENERAL OUTPUT FORMATTING
-silent|-verbose information level during build:
silent => none / verbose => most detailed
default: neither => only errors/important info
-no-summary Don't show result summary & mapping statistics at the end
of the mapping output
default: off
-no-query-params Don't show query settings at the beginning of the mapping
output
default: off
-no-err Suppress all error messages.
default: off
CLASSIFICATION RESULT FORMATTING
-no-map Don't report classification for each individual query
sequence; show summaries only (useful for quick tests).
default: off
-mapped-only Don't list unclassified reads/read pairs.
default: off
-taxids Print taxon ids in addition to taxon names.
default: off
-taxids-only Print taxon ids instead of taxon names.
default: off
-omit-ranks Do not print taxon rank names.
default: off
-separate-cols Prints *all* mapping information (rank, taxon name, taxon
ids) in separate columns (see option '-separator').
default: off
-separator <text> Sets string that separates output columns.
default: '\t|\t'
-comment <text> Sets string that precedes comment (non-mapping) lines.
default: '# '
-queryids Show a unique id for each query.
Note that in paired-end mode a query is a pair of two read
sequences. This option will always be activated if option
'-hits-per-ref' is given.
default: off
-lineage Report complete lineage for per-read classification
starting with the lowest rank found/allowed and ending
with the highest rank allowed. See also options '-lowest'
and '-highest'.
default: off
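
Downstream scripts can split the mapping output on the column separator and
skip comment lines. The sketch below assumes the documented defaults for
'-separator' and '-comment'; the function name split_mapping_lines and the
sample lines are made up for illustration and do not reflect the exact
column layout of MetaCache output.

```python
def split_mapping_lines(lines, separator="\t|\t", comment="# "):
    """Split mapping lines into columns, skipping comment and empty lines."""
    rows = []
    marker = comment.rstrip()  # tolerate comment lines without trailing text
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(marker):
            continue
        rows.append(line.split(separator))
    return rows

# Hypothetical sample; real output columns depend on the chosen options.
sample = [
    "# some header comment\n",
    "read_1\t|\tspecies:Example species\n",
    "read_2\t|\t--\n",
]
rows = split_mapping_lines(sample)
```

If '-separator' or '-comment' were changed for the merge run, the same
values have to be passed to the parser.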
ANALYSIS
ANALYSIS: ABUNDANCES
-abundances <file>
Show absolute and relative abundance of each taxon.
If a valid filename is given, the list will be written to
this file.
default: off
-abundance-per <rank>
Show absolute and relative abundances for each taxon on
one specific rank.
Classifications on higher ranks will be estimated by
distributing them down according to the relative
abundances of classifications on or below the given rank.
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
If '-abundances <file>' was given, this list will be
printed to the same file.
default: off
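
The estimation described for '-abundance-per' can be pictured as a
proportional redistribution. The following sketch handles only a single
parent taxon whose reads are spread over its children on the requested
rank; the function name distribute_down and the counts are hypothetical,
and the way MetaCache handles the full taxonomic hierarchy is not shown here.

```python
def distribute_down(rank_counts, parent_count):
    """Spread reads classified on a higher rank over taxa on the target rank.

    rank_counts:  dict taxon -> reads classified on or below the target rank
    parent_count: reads classified only on the higher (parent) rank
    """
    total = sum(rank_counts.values())
    if total == 0:
        # Nothing to base the proportions on; leave the counts unchanged.
        return dict(rank_counts)
    return {
        taxon: count + parent_count * count / total
        for taxon, count in rank_counts.items()
    }

# Hypothetical example: 20 genus-level reads, two species below that genus.
estimated = distribute_down({"species_1": 60, "species_2": 20}, 20)
```

Here species_1 holds three quarters of the species-level reads and therefore
receives three quarters of the genus-level reads.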
ANALYSIS: RAW DATABASE HITS
-tophits For each query, print top feature hits in database.
default: off
-allhits For each query, print all feature hits in database.
default: off
-locations Show locations in candidate reference sequences.
Activates option '-tophits'.
default: off
-hits-per-ref <file>
Shows a list of all hits for each reference sequence.
If this condensed list is all you need, you should
deactivate the per-read mapping output with '-no-map'.
If a valid filename is given after '-hits-per-ref', the
list will be written to a separate file.
Option '-queryids' will be activated and the lowest
classification rank will be set to 'sequence'.
default: off
ANALYSIS: ALIGNMENTS
-align Show semi-global alignment to best candidate reference
sequence.
Original files of reference sequences must be available.
This feature decreases the querying speed!
default: off
ADVANCED: GROUND TRUTH BASED EVALUATION
-ground-truth Report correct query taxa if known.
Queries need to have either a 'taxid|<number>' entry in
their header or a sequence id that is also present in the
database.
This feature decreases the querying speed!
default: off
-precision Report precision & sensitivity by comparing query taxa
(ground truth) and mapped taxa.
Queries need to have either a 'taxid|<number>' entry in
their header or a sequence id that is also found in the
database.
This feature decreases the querying speed!
default: off
-taxon-coverage Report true/false positives and true/false negatives.
This option turns on '-precision', so ground truth data needs
to be available.
This feature decreases the querying speed!
default: off
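
Precision and sensitivity relate to the true/false positive and negative
counts as in the following sketch. It uses the standard definitions; the
function name and the example counts are assumptions, and how MetaCache
tallies these counts per rank is not described in this text.

```python
def precision_and_sensitivity(true_pos, false_pos, false_neg):
    """Standard definitions based on ground truth comparison."""
    # Precision: fraction of reported classifications that are correct.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Sensitivity: fraction of classifiable queries that were found correctly.
    sensitivity = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, sensitivity

# Hypothetical counts for illustration.
precision, sensitivity = precision_and_sensitivity(90, 10, 30)
```

A higher '-hitmin' value typically moves queries from the false positive
count to the unclassified side, which raises precision and lowers sensitivity.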
ADVANCED: CUSTOM QUERY SKETCHING (SUBSAMPLING)
-kmerlen <k> number of nucleotides/characters in a k-mer
default: determined by database
-sketchlen <s> number of features (k-mer hashes) per sampling window
default: determined by database
-winlen <w> number of letters in each sampling window
default: determined by database
-winstride <l> distance between window starting positions
default: determined by database
ADVANCED: PERFORMANCE TUNING / TESTING
-threads <#> Sets the maximum number of parallel threads to use.
default (on this machine): 8
-batch-size <#> Process <#> many queries (reads or read pairs) per thread
at once.
default (on this machine): 4096
-query-limit <#> Classify at most <#> queries (reads or read pairs) per
input file.
default: 9223372036854775807