Commit

index: remove --ref-name-info
shenwei356 committed Aug 30, 2024
1 parent 4f4794a commit 020761c
Showing 5 changed files with 7 additions and 24 deletions.
2 changes: 1 addition & 1 deletion search/en.data.min.json

Large diffs are not rendered by default.

4 changes: 1 addition & 3 deletions tutorials/index/index.html
@@ -62,7 +62,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/index/",
"headline": "Step 1. Building a database",
"description": "Terminology differences:\nOn this page and in the LexicMap command line options, the term “mask” is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA\/Q files, with identifiers in the file names.",
- "wordCount" : "2973",
+ "wordCount" : "2913",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1840,8 +1840,6 @@ <h1>Step 1. Building a database</h1>
<li><strong>If the RAM is not sufficient</strong>. Please:
<ul>
<li><strong>Use a smaller genome batch size</strong>. It decreases indexing memory occupation and has little affection on searching performance.</li>
- <li><strong>Sorting the input file list by species</strong>. So genomes within a batch would be more similar and the memory would be lower.
- For LexicMap v0.4.1 or later versions, a flag <code>--ref-name-info</code> can specify a two-column tab-delimted file for mapping reference names to taxonomic information such as species names, and the input files will be sorted according to the taxonomic information.</li>
<li>Use a smaller number of masks, e.g., 20,000 performs well for small genomes (&lt;=5 Mb). And if the queries are long (&gt;= 2kb), there&rsquo;s little affection for the alignment results.</li>
</ul>
</li>
10 changes: 2 additions & 8 deletions tutorials/misc/index-genbank/index.html
@@ -65,7 +65,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/",
"headline": "Indexing GenBank+RefSeq",
"description": "Tools:\nhttps:\/\/github.com\/pirovc\/genome_updater, for downloading genomes https:\/\/github.com\/shenwei356\/seqkit, for checking sequence files https:\/\/github.com\/shenwei356\/rush, for running jobs Data:\ntime genome_updater.sh -d \u0022refseq,genbank\u0022 -g \u0022archaea,bacteria\u0022 \\ -f \u0022genomic.fna.gz\u0022 -o \u0022genbank\u0022 -M \u0022ncbi\u0022 -t 12 -m -L curl cd genbank\/2024-02-15_11-00-51\/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0022*.gz\u0022 \\ fd \u0022.gz$\u0022 $genomes \\ | rush --eta \u0027seqkit seq -w 0 {} \u003e \/dev\/null; if [ $? -ne 0 ]; then echo {}; fi\u0027 \\ \u003e failed.",
- "wordCount" : "181",
+ "wordCount" : "156",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1705,17 +1705,11 @@ <h1>Indexing GenBank&#43;RefSeq</h1>
  # redownload them:
  # run the genome_updater command again, with the flag -i
  </code></pre>
- <p>Taxonomic information (optional), for reducing index memory.</p>
- <pre><code>cut -f 1,8 assembly_summary.txt &gt; ref2species.tsv
- </code></pre>
  <p>Indexing. On a 48-CPU machine, time: 54 h, ram: 178 GB, index size: 4.94 TB.
  If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
- <pre><code># --ref-name-info is available for v0.4.1 or later versions.
-
- lexicmap index \
+ <pre><code>lexicmap index \
  -I files/ \
  --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
- --ref-name-info ref2species.tsv \
  -O genbank_refseq.lmi --log genbank_refseq.lmi.log \
  -b 25000
  </code></pre>
10 changes: 2 additions & 8 deletions tutorials/misc/index-gtdb/index.html
@@ -65,7 +65,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/",
"headline": "Indexing GTDB",
"description": "Tools:\nhttps:\/\/github.com\/pirovc\/genome_updater, for downloading genomes https:\/\/github.com\/shenwei356\/seqkit, for checking sequence files https:\/\/github.com\/shenwei356\/rush, for running jobs Data:\ntime genome_updater.sh -d \u0022refseq,genbank\u0022 -g \u0022archaea,bacteria\u0022 \\ -f \u0022genomic.fna.gz\u0022 -o \u0022GTDB_complete\u0022 -M \u0022gtdb\u0022 -t 12 -m -L curl cd GTDB_complete\/2024-01-30_19-34-40\/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0022*.gz\u0022 \\ fd \u0022.gz$\u0022 $genomes \\ | rush --eta \u0027seqkit seq -w 0 {} \u003e \/dev\/null; if [ $? -ne 0 ]; then echo {}; fi\u0027 \\ \u003e failed.",
- "wordCount" : "181",
+ "wordCount" : "156",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1705,17 +1705,11 @@ <h1>Indexing GTDB</h1>
  # redownload them:
  # run the genome_updater command again, with the flag -i
  </code></pre>
- <p>Taxonomic information (optional), for reducing index memory.</p>
- <pre><code>cut -f 1,8 assembly_summary.txt &gt; ref2species.tsv
- </code></pre>
  <p>Indexing. On a 48-CPU machine, time: 11 h, ram: 64 GB, index size: 906 GB.
  If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
- <pre><code># --ref-name-info is available for v0.4.1 or later versions.
-
- lexicmap index \
+ <pre><code>lexicmap index \
  -I files/ \
  --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
- --ref-name-info ref2species.tsv \
  -O gtdb_complete.lmi --log gtdb_complete.lmi.log \
  -b 5000
  </code></pre>
5 changes: 1 addition & 4 deletions usage/index/index.html
@@ -59,7 +59,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/usage/index/",
"headline": "index",
"description": "Terminology differences In the LexicMap source code and command line options, the term “mask” is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA\/Q sequences Input: *1.",
- "wordCount" : "1368",
+ "wordCount" : "1341",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1801,9 +1801,6 @@ <h1>index</h1>
</span></span><span class="line"><span class="cl"> --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files.
</span></span><span class="line"><span class="cl"> The value needs to be the power of 4. (default 1024)
</span></span><span class="line"><span class="cl"> -s, --rand-seed int ► Rand seed for generating random masks. (default 1)
- </span></span><span class="line"><span class="cl"> --ref-name-info string ► A two-column tab-delimted file for mapping reference names
- </span></span><span class="line"><span class="cl"> (extracted by --ref-name-regexp) to taxonomic information such as
- </span></span><span class="line"><span class="cl"> species names. It helps to reduce memory usage.
</span></span><span class="line"><span class="cl"> -N, --ref-name-regexp string ► Regular expression (must contains &#34;(&#34; and &#34;)&#34;) for extracting the
</span></span><span class="line"><span class="cl"> reference name from the filename. Attention: use double quotation
</span></span><span class="line"><span class="cl"> marks for patterns containing commas, e.g., -p &#39;&#34;A{2,}&#34;&#39; (default
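
Note on the option that remains after this commit: `--ref-name-regexp` extracts the reference name from each input file name, and the tutorials above pass the pattern `^(\w{3}_\d{9}\.\d+)`. The sketch below only illustrates what that pattern captures. It is written in Python rather than Go, on the assumption that Python's `re` module treats this particular pattern the same way LexicMap's own matching does. The helper `extract_ref_name` and the sample file name are hypothetical and are not part of LexicMap.

```python
import re
from typing import Optional

# Pattern copied from the `lexicmap index` commands shown in this commit.
REF_NAME_REGEXP = r'^(\w{3}_\d{9}\.\d+)'


def extract_ref_name(filename: str) -> Optional[str]:
    """Return the first capture group, or None when the pattern does not match.

    Hypothetical helper for illustration only; LexicMap does this internally.
    """
    match = re.search(REF_NAME_REGEXP, filename)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical file name in the NCBI assembly naming style.
    print(extract_ref_name("GCF_000005845.2_ASM584v2_genomic.fna.gz"))
    # A name that does not start with an accession yields None.
    print(extract_ref_name("README.txt"))
```

The capture group keeps only the accession and version at the start of the file name, so everything after it (assembly name, suffixes, extensions) is dropped from the reference name.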
