index: remove --ref-name-info

shenwei356 · Aug 30, 2024 · 020761c
1 parent 4f4794a
commit 020761c
Showing 5 changed files with 7 additions and 24 deletions.
diff --git a/search/en.data.min.json b/search/en.data.min.json
diff --git a/tutorials/index/index.html b/tutorials/index/index.html
@@ -62,7 +62,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/index/",
       "headline": "Step 1. Building a database",
       "description": "Terminology differences:\nOn this page and in the LexicMap command line options, the term “mask” is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA\/Q files, with identifiers in the file names.",
-      "wordCount" : "2973",
+      "wordCount" : "2913",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1840,8 +1840,6 @@ <h1>Step 1. Building a database</h1>
 <li><strong>If the RAM is not sufficient</strong>. Please:
 <ul>
 <li><strong>Use a smaller genome batch size</strong>. It decreases indexing memory occupation and has little affection on searching performance.</li>
-<li><strong>Sorting the input file list by species</strong>. So genomes within a batch would be more similar and the memory would be lower.
-For LexicMap v0.4.1 or later versions, a flag <code>--ref-name-info</code> can specify a two-column tab-delimted file for mapping reference names to taxonomic information such as species names, and the input files will be sorted according to the taxonomic information.</li>
 <li>Use a smaller number of masks, e.g., 20,000 performs well for small genomes (&lt;=5 Mb). And if the queries are long (&gt;= 2kb), there&rsquo;s little affection for the alignment results.</li>
 </ul>
 </li>

diff --git a/tutorials/misc/index-genbank/index.html b/tutorials/misc/index-genbank/index.html
@@ -65,7 +65,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/",
       "headline": "Indexing GenBank+RefSeq",
       "description": "Tools:\nhttps:\/\/github.com\/pirovc\/genome_updater, for downloading genomes https:\/\/github.com\/shenwei356\/seqkit, for checking sequence files https:\/\/github.com\/shenwei356\/rush, for running jobs Data:\ntime genome_updater.sh -d \u0022refseq,genbank\u0022 -g \u0022archaea,bacteria\u0022 \\ -f \u0022genomic.fna.gz\u0022 -o \u0022genbank\u0022 -M \u0022ncbi\u0022 -t 12 -m -L curl cd genbank\/2024-02-15_11-00-51\/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0022*.gz\u0022 \\ fd \u0022.gz$\u0022 $genomes \\ | rush --eta \u0027seqkit seq -w 0 {} \u003e \/dev\/null; if [ $? -ne 0 ]; then echo {}; fi\u0027 \\ \u003e failed.",
-      "wordCount" : "181",
+      "wordCount" : "156",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1705,17 +1705,11 @@ <h1>Indexing GenBank&#43;RefSeq</h1>
 # redownload them:
 # run the genome_updater command again, with the flag -i
 </code></pre>
-<p>Taxonomic information (optional), for reducing index memory.</p>
-<pre><code>cut -f 1,8 assembly_summary.txt &gt; ref2species.tsv
-</code></pre>
 <p>Indexing. On a 48-CPU machine, time: 54 h, ram: 178 GB, index size: 4.94 TB.
 If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
-<pre><code># --ref-name-info is available for v0.4.1 or later versions.
-
-lexicmap index \
+<pre><code>lexicmap index \
     -I files/ \
     --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
-    --ref-name-info ref2species.tsv \
     -O genbank_refseq.lmi --log genbank_refseq.lmi.log \
     -b 25000
 </code></pre>

diff --git a/tutorials/misc/index-gtdb/index.html b/tutorials/misc/index-gtdb/index.html
@@ -65,7 +65,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/",
       "headline": "Indexing GTDB",
       "description": "Tools:\nhttps:\/\/github.com\/pirovc\/genome_updater, for downloading genomes https:\/\/github.com\/shenwei356\/seqkit, for checking sequence files https:\/\/github.com\/shenwei356\/rush, for running jobs Data:\ntime genome_updater.sh -d \u0022refseq,genbank\u0022 -g \u0022archaea,bacteria\u0022 \\ -f \u0022genomic.fna.gz\u0022 -o \u0022GTDB_complete\u0022 -M \u0022gtdb\u0022 -t 12 -m -L curl cd GTDB_complete\/2024-01-30_19-34-40\/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0022*.gz\u0022 \\ fd \u0022.gz$\u0022 $genomes \\ | rush --eta \u0027seqkit seq -w 0 {} \u003e \/dev\/null; if [ $? -ne 0 ]; then echo {}; fi\u0027 \\ \u003e failed.",
-      "wordCount" : "181",
+      "wordCount" : "156",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1705,17 +1705,11 @@ <h1>Indexing GTDB</h1>
 # redownload them:
 # run the genome_updater command again, with the flag -i
 </code></pre>
-<p>Taxonomic information (optional), for reducing index memory.</p>
-<pre><code>cut -f 1,8 assembly_summary.txt &gt; ref2species.tsv
-</code></pre>
 <p>Indexing. On a 48-CPU machine, time: 11 h, ram: 64 GB, index size: 906 GB.
 If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
-<pre><code># --ref-name-info is available for v0.4.1 or later versions.
-
-lexicmap index \
+<pre><code>lexicmap index \
     -I files/ \
     --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
-    --ref-name-info ref2species.tsv \
     -O gtdb_complete.lmi --log gtdb_complete.lmi.log \
     -b 5000
 </code></pre>

diff --git a/usage/index/index.html b/usage/index/index.html
@@ -59,7 +59,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/usage/index/",
       "headline": "index",
       "description": "Terminology differences In the LexicMap source code and command line options, the term “mask” is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA\/Q sequences Input: *1.",
-      "wordCount" : "1368",
+      "wordCount" : "1341",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1801,9 +1801,6 @@ <h1>index</h1>
 </span></span><span class="line"><span class="cl">      --partitions int            ► Number of partitions for indexing seeds (k-mer-value data) files.
 </span></span><span class="line"><span class="cl">                                  The value needs to be the power of 4. (default 1024)
 </span></span><span class="line"><span class="cl">  -s, --rand-seed int             ► Rand seed for generating random masks. (default 1)
-</span></span><span class="line"><span class="cl">      --ref-name-info string      ► A two-column tab-delimted file for mapping reference names
-</span></span><span class="line"><span class="cl">                                  (extracted by --ref-name-regexp) to taxonomic information such as
-</span></span><span class="line"><span class="cl">                                  species names. It helps to reduce memory usage.
 </span></span><span class="line"><span class="cl">  -N, --ref-name-regexp string    ► Regular expression (must contains &#34;(&#34; and &#34;)&#34;) for extracting the
 </span></span><span class="line"><span class="cl">                                  reference name from the filename. Attention: use double quotation
 </span></span><span class="line"><span class="cl">                                  marks for patterns containing commas, e.g., -p &#39;&#34;A{2,}&#34;&#39; (default