update docs
shenwei356 committed Aug 30, 2024
1 parent 5c37b3d commit 4f4794a
Showing 32 changed files with 4,903 additions and 156 deletions.
114 changes: 102 additions & 12 deletions faqs/index.html
@@ -59,7 +59,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/faqs/",
"headline": "FAQs",
"description": "Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How’s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn’t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene\/plasmid\/virus\/phage sequences) longer than 200 bp by default.",
"wordCount" : "612",
"wordCount" : "725",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -766,6 +766,50 @@ <h2>Navigation</h2>
<li>
<input
type="checkbox"

class="hidden"

/>
<label

>

<span class="flex">
<a
href="/LexicMap/tutorials/misc/index-globdb/"
class="gdoc-nav__entry"
>
Indexing GlobDB
</a>
</span>


</label>


</li>



</ul>


@@ -1692,7 +1736,23 @@ <h1>FAQs</h1>
19. qseq, Aligned part of query sequence. (optional with -a/--all)
20. sseq, Aligned part of subject sequence. (optional with -a/--all)
21. align, Alignment text (&#34;|&#34; and &#34; &#34;) between qseq and sseq. (optional with -a/--all)
</code></pre><p>And <code>lexicmap util 2blast</code> can help to convert the tabular format to Blast-style format,
</code></pre><p>An example:</p>
<pre><code># Extracting similar sequences for a query gene.

# search matches with query coverage &gt;= 90%
lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \
--min-qcov-per-hsp 90 --all

# extract matched sequences in FASTA format
sed 1d results.tsv | awk -F'\t' '{print &quot;&gt;&quot;$5&quot;:&quot;$14&quot;-&quot;$15&quot;:&quot;$16&quot;\n&quot;$20;}' \
| seqkit seq -g &gt; results.fasta

seqkit head -n 1 results.fasta | head -n 3
&gt;NZ_JALSCK010000007.1:39224-40522:-
TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT
ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT
</code></pre>
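<p>The <code>awk</code> step above can be tried without an index or SeqKit. The following is a minimal sketch, not part of the LexicMap documentation: it builds a made-up one-row table and applies the same column selection. It assumes, judging from the example header above, that columns 5, 14, 15 and 16 hold the subject sequence ID, start, end and strand, and that column 20 is <code>sseq</code>. The function names and the sample row are hypothetical, and <code>gsub</code> stands in for <code>seqkit seq -g</code> to drop gap characters.</p>

```shell
# Build a tiny tab-separated table: one header line plus one made-up hit.
# Assumption: columns 5, 14, 15, 16 and 20 are sequence ID, start, end,
# strand and the aligned subject sequence; all other columns are "." placeholders.
make_demo_tsv() {
    awk 'BEGIN {
        OFS = "\t"
        for (i = 1; i <= 21; i++) h = h (i > 1 ? OFS : "") "col" i
        print h
        for (i = 1; i <= 21; i++) f[i] = "."
        f[5] = "contig_1"; f[14] = 101; f[15] = 110; f[16] = "-"
        f[20] = "ACGT-ACGTAC"   # aligned subject sequence containing one gap
        r = f[1]
        for (i = 2; i <= 21; i++) r = r OFS f[i]
        print r
    }'
}

# Skip the header, then print ">id:start-end:strand" followed by the
# sequence with gap characters removed (stand-in for `seqkit seq -g`).
tsv_to_fasta() {
    sed 1d | awk -F'\t' '{
        seq = $20
        gsub(/-/, "", seq)
        print ">" $5 ":" $14 "-" $15 ":" $16 "\n" seq
    }'
}

make_demo_tsv | tsv_to_fasta
```

<p>On real search output, SeqKit remains the more convenient choice because it also handles line wrapping and validation of the resulting FASTA file.</p>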
<p>And <code>lexicmap util 2blast</code> can help to convert the tabular format to Blast-style format,
see <a
class="gdoc-markdown__link"
href="https://bioinf.shenwei.me/LexicMap/usage/utils/2blast/#examples"
@@ -1733,21 +1793,18 @@ <h1>FAQs</h1>
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
<p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes.
There are some ways to improve the search speed.</p>
<ul>
<li>
<p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 16 million) of genomes.</p>
</li>
<li>
<p><code>lexicmap search</code> has a flag <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for
faster search.</p>
<li><code>lexicmap search</code> has a flag <code>-n/--top-n-genomes</code> to keep the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
<li><code>lexicmap search</code> has a flag <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for
faster search.
<ul>
<li>For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
</ul>
</li>
<li>
<p><code>lexicmap search</code> also has a flag <code>--pseudo-align</code> to only perform pseudo alignment, which is slightly faster and uses less memory.
It can be used in searching with long and divergent query sequences like nanopore long-reads.</p>
</li>
<li><code>lexicmap search</code> also has a flag <code>--pseudo-align</code> to only perform pseudo alignment, which is slightly faster and uses less memory.
It can be used in searching with long and divergent query sequences like nanopore long-reads.</li>
</ul>
<p>

@@ -2113,6 +2170,39 @@ <h1>FAQs</h1>
33 changes: 33 additions & 0 deletions index.html
@@ -833,6 +833,39 @@ <h1></h1>
11 changes: 9 additions & 2 deletions index.xml
@@ -12,7 +12,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/</guid>
<description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs https://github.com/shenwei356/brename, for batch file renaming Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;GTDB_complete&amp;quot; -M &amp;quot;gtdb&amp;quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- just in case, check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $?</description>
<description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;GTDB_complete&amp;quot; -M &amp;quot;gtdb&amp;quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi&#39; \ &amp;gt; failed.</description>
</item>
<item>
<title>masks</title>
@@ -26,7 +26,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/</guid>
<description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs https://github.com/shenwei356/brename, for batch file renaming Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;genbank&amp;quot; -M &amp;quot;ncbi&amp;quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- just in case, check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $?</description>
<description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;genbank&amp;quot; -M &amp;quot;ncbi&amp;quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi&#39; \ &amp;gt; failed.</description>
</item>
<item>
<title>kmers</title>
@@ -56,6 +56,13 @@
<guid>https://bioinf.shenwei.me/LexicMap/usage/utils/genomes/</guid>
<description>Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by &amp;#34;lexicmap index&amp;#34;. -o, --out-file string ► Out file, supports the &amp;#34;.gz&amp;#34; suffix (&amp;#34;-&amp;#34; for stdout). (default &amp;#34;-&amp;#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments.</description>
</item>
<item>
<title>Indexing GlobDB</title>
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-globdb/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-globdb/</guid>
<description># download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name &amp;quot;*.fa.gz&amp;quot; &amp;gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi -g 50000000 </description>
</item>
<item>
<title>search</title>
<link>https://bioinf.shenwei.me/LexicMap/usage/search/</link>