update docs

shenwei356 · Aug 30, 2024 · 4f4794a
1 parent 5c37b3d
Showing 32 changed files with 4,903 additions and 156 deletions.
diff --git a/faqs/index.html b/faqs/index.html
@@ -59,7 +59,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/faqs/",
       "headline": "FAQs",
       "description": "Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How’s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn’t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene\/plasmid\/virus\/phage sequences) longer than 200 bp by default.",
-      "wordCount" : "612",
+      "wordCount" : "725",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -766,6 +766,50 @@ <h2>Navigation</h2>
 
 
 
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+          <li>
+            <input
+              type="checkbox"
+
+                class="hidden"
+
+            />
+            <label
+
+            >
+
+                <span class="flex">
+                  <a
+                    href="/LexicMap/tutorials/misc/index-globdb/"
+                    class="gdoc-nav__entry"
+                  >
+                    Indexing GlobDB
+                  </a>
+                </span>
+
+
+            </label>
+
+
+          </li>
+
+
+
   </ul>
 
 
@@ -1692,7 +1736,23 @@ <h1>FAQs</h1>
 19. qseq,     Aligned part of query sequence.                     (optional with -a/--all)
 20. sseq,     Aligned part of subject sequence.                   (optional with -a/--all)
 21. align,    Alignment text (&#34;|&#34; and &#34; &#34;) between qseq and sseq. (optional with -a/--all)
-</code></pre><p>And <code>lexicmap util 2blast</code> can help to convert the tabular format to Blast-style format,
+</code></pre><p>An example:</p>
+<pre><code># Extracting similar sequences for a query gene.
+
+# search matches with query coverage &gt;= 90%
+lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \
+    --min-qcov-per-hsp 90 --all
+
+# extract matched sequences as FASTA format
+sed 1d results.tsv | awk -F'\t' '{print &quot;&gt;&quot;$5&quot;:&quot;$14&quot;-&quot;$15&quot;:&quot;$16&quot;\n&quot;$20;}' \
+    | seqkit seq -g &gt; results.fasta
+
+seqkit head -n 1 results.fasta | head -n 3
+&gt;NZ_JALSCK010000007.1:39224-40522:-
+TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT
+ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT
+</code></pre>
+<p>And <code>lexicmap util 2blast</code> can help to convert the tabular format to Blast-style format,
 see <a
   class="gdoc-markdown__link"
   href="https://bioinf.shenwei.me/LexicMap/usage/utils/2blast/#examples"
@@ -1733,21 +1793,18 @@ <h1>FAQs</h1>
         <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
     </a>
 </div>
+<p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes.
+There are some ways to improve the search speed.</p>
 <ul>
-<li>
-<p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 16 million) of genomes.</p>
-</li>
-<li>
-<p><code>lexicmap search</code> has a flag <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for
-faster search.</p>
+<li><code>lexicmap search</code> has a flag <code>-n/--top-n-genomes</code> to keep the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
+<li><code>lexicmap search</code> has a flag <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for
+faster search.
 <ul>
 <li>For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
 </ul>
 </li>
-<li>
-<p><code>lexicmap search</code> also has a flag <code>--pseudo-align</code> to only perform pseudo alignment, which is slightly faster and uses less memory.
-It can be used in searching with long and divergent query sequences like nanopore long-reads.</p>
-</li>
+<li><code>lexicmap search</code> also has a flag <code>--pseudo-align</code> to only perform pseudo alignment, which is slightly faster and uses less memory.
+It can be used in searching with long and divergent query sequences like nanopore long-reads.</li>
 </ul>
 <p>
 
@@ -2113,6 +2170,39 @@ <h1>FAQs</h1>
 
 
 
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 
 
 

diff --git a/index.html b/index.html
@@ -833,6 +833,39 @@ <h1></h1>
 
 
 
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 
 
 

diff --git a/index.xml b/index.xml
@@ -12,7 +12,7 @@
       <link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/</link>
       <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
       <guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/</guid>
-      <description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs https://github.com/shenwei356/brename, for batch file renaming Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;GTDB_complete&amp;quot; -M &amp;quot;gtdb&amp;quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- just in case, check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $?</description>
+      <description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;GTDB_complete&amp;quot; -M &amp;quot;gtdb&amp;quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi&#39; \ &amp;gt; failed.</description>
     </item>
     <item>
       <title>masks</title>
@@ -26,7 +26,7 @@
       <link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/</link>
       <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
       <guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/</guid>
-      <description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs https://github.com/shenwei356/brename, for batch file renaming Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;genbank&amp;quot; -M &amp;quot;ncbi&amp;quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- just in case, check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $?</description>
+      <description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;genbank&amp;quot; -M &amp;quot;ncbi&amp;quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi&#39; \ &amp;gt; failed.</description>
     </item>
     <item>
       <title>kmers</title>
@@ -56,6 +56,13 @@
       <guid>https://bioinf.shenwei.me/LexicMap/usage/utils/genomes/</guid>
       <description>Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by &amp;#34;lexicmap index&amp;#34;. -o, --out-file string ► Out file, supports the &amp;#34;.gz&amp;#34; suffix (&amp;#34;-&amp;#34; for stdout). (default &amp;#34;-&amp;#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments.</description>
     </item>
+    <item>
+      <title>Indexing GlobDB</title>
+      <link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-globdb/</link>
+      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
+      <guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-globdb/</guid>
+      <description># download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name &amp;quot;*.fa.gz&amp;quot; &amp;gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi -g 50000000 </description>
+    </item>
     <item>
       <title>search</title>
       <link>https://bioinf.shenwei.me/LexicMap/usage/search/</link>