add more docs
shenwei356 committed Sep 3, 2024
1 parent 9a8a813 commit 75e0f4a
Showing 5 changed files with 119 additions and 14 deletions.
21 changes: 19 additions & 2 deletions faqs/index.html
@@ -59,7 +59,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/faqs/",
"headline": "FAQs",
"description": "Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How’s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn’t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene\/plasmid\/virus\/phage sequences) longer than 200 bp by default.",
"wordCount" : "731",
"wordCount" : "773",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1796,15 +1796,32 @@ <h1>FAQs</h1>
<p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes.
There are some ways to improve the search speed of <code>lexicmap search</code>.</p>
<ul>
<li>Increasing the concurrency number
<ul>
<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a
class="gdoc-markdown__link"
href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
>change the open files limit</a>.</li>
<li>Setting <code>-n/--top-n-genomes</code> to keep only the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 reduces the computation time.</li>
<li>(If you have many queries) Increasing the value of <code>-J/--max-query-conc</code> (default 12). This also increases memory usage.</li>
</ul>
</li>
<li>Loading the entire seed data into memory (unnecessary if the index is stored on an SSD)
<ul>
<li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, memory usage would be ~260 GB with default parameters.</li>
</ul>
</li>
<li>Returning fewer results
<ul>
<li>Setting <code>-n/--top-n-genomes</code> to keep only the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 reduces the computation time.</li>
</ul>
</li>
<li>Sacrificing accuracy
<ul>
<li>Setting <code>--pseudo-align</code> to only perform pseudo alignment, which is slightly faster and uses less memory.
It can be used when searching with long and divergent query sequences such as nanopore long reads.</li>
</ul>
</li>
</ul>
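<p>Before raising <code>--max-open-files</code>, it can help to check the open-files limit of the current shell. The following is a minimal sketch, not part of LexicMap: the value 1024 is purely illustrative, and the idea that the soft limit should be at least as large as the value passed to <code>--max-open-files</code> is an assumption made for this example.</p>

```shell
#!/bin/sh
# Sketch only: compare the shell's open-files limit with a desired value.
# Assumption: the limit should be at least the value given to --max-open-files.
set -eu

wanted=1024              # hypothetical value intended for --max-open-files
current=$(ulimit -n)     # soft limit of this shell: a number or "unlimited"

if [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]; then
    status="ok"
elif ulimit -n "$wanted" 2>/dev/null; then
    # Raising the soft limit only works up to the hard limit.
    status="raised"
else
    status="too-low"
fi

echo "open-files limit: $current (wanted $wanted) -> $status"
```

<p>If the status is <code>too-low</code>, the hard limit has to be changed at the system level, as described in the link above.</p>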
<p>


2 changes: 1 addition & 1 deletion index.xml
@@ -40,7 +40,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/search/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/search/</guid>
<description>Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.&#xA;Run:&#xA;For short queries like genes or long reads, returning top N hits.&#xA;lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.&#xA;lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length</description>
<description>Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.&#xA;Run:&#xA;For short queries like genes or long reads, returning top N hits.&#xA;lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.&#xA;lexicmap search -d db.lmi query.</description>
</item>
<item>
<title>Indexing AllTheBacteria</title>
2 changes: 1 addition & 1 deletion search/en.data.min.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tutorials/index.xml
@@ -12,7 +12,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/search/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/search/</guid>
<description>Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.&#xA;Run:&#xA;For short queries like genes or long reads, returning top N hits.&#xA;lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.&#xA;lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length</description>
<description>Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.&#xA;Run:&#xA;For short queries like genes or long reads, returning top N hits.&#xA;lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.&#xA;lexicmap search -d db.lmi query.</description>
</item>
</channel>
</rss>
106 changes: 97 additions & 9 deletions tutorials/search/index.html
@@ -12,11 +12,11 @@
<meta name="generator" content="Hugo 0.133.0">


<meta name="description" content="Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.
<meta name="description" content="Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.
Run:
For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length" />
lexicmap search -d db.lmi query." />

<title>Step 2. Searching | LexicMap: efficient sequence alignment against millions of prokaryotic genomes​</title>

@@ -42,11 +42,11 @@
content="Step 2. Searching"
/>
<meta property="og:site_name" content="LexicMap: efficient sequence alignment against millions of prokaryotic genomes\u200b" />
<meta property="og:description" content="Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.
<meta property="og:description" content="Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.
Run:
For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length" />
lexicmap search -d db.lmi query." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://bioinf.shenwei.me/LexicMap/tutorials/search/" />

@@ -55,11 +55,11 @@

<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Step 2. Searching" />
<meta name="twitter:description" content="Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.
<meta name="twitter:description" content="Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.
Run:
For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length" />
lexicmap search -d db.lmi query." />


<script type="application/ld+json">
@@ -70,8 +70,8 @@
"name": "Step 2. Searching",
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
"headline": "Step 2. Searching",
"description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length",
"wordCount" : "2602",
"description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
"wordCount" : "2918",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1678,13 +1678,18 @@ <h1>Step 2. Searching</h1>
<li><a href="#input">Input</a></li>
<li><a href="#hardware-requirements">Hardware requirements</a></li>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#parameters">Parameters</a></li>
<li><a href="#parameters">Parameters</a>
<ul>
<li><a href="#improving-searching-speed">Improving searching speed</a></li>
</ul>
</li>
<li><a href="#steps">Steps</a></li>
<li><a href="#output">Output</a>
<ul>
<li><a href="#alignment-result-relationship">Alignment result relationship</a></li>
<li><a href="#output-format">Output format</a></li>
<li><a href="#examples">Examples</a></li>
<li><a href="#summarizing-results">Summarizing results</a></li>
</ul>
</li>
</ul></nav>
@@ -2041,6 +2046,37 @@ <h1>Step 2. Searching</h1>
</div>
</div>

<div class="flex align-center gdoc-page__anchorwrap">
<h3 id="improving-searching-speed"
>
Improving searching speed
</h3>
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/search/#improving-searching-speed" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Improving searching speed" aria-label="Anchor to: Improving searching speed" href="#improving-searching-speed">
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
<p>Here are some tips to improve the search speed.</p>
<ul>
<li>Increasing the concurrency number
<ul>
<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a
class="gdoc-markdown__link"
href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
>change the open files limit</a>.</li>
<li>(If you have many queries) Increasing the value of <code>-J/--max-query-conc</code> (default 12). This also increases memory usage.</li>
</ul>
</li>
<li>Loading the entire seed data into memory (unnecessary if the index is stored on an SSD)
<ul>
<li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, memory usage would be ~260 GB with default parameters.</li>
</ul>
</li>
<li>Returning fewer results
<ul>
<li>Setting <code>-n/--top-n-genomes</code> to keep only the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 reduces the computation time.</li>
</ul>
</li>
</ul>
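<p>The tips above could be combined into a single invocation roughly as follows. This is a sketch under stated assumptions: the file names follow the TL;DR example, and the numeric values (1024, 24, 1000) are illustrative rather than recommended settings. The command is only executed when <code>lexicmap</code> and the input files are present; otherwise it is printed.</p>

```shell
#!/bin/sh
# Sketch only: file names and numeric values below are illustrative assumptions.
set -eu

db="db.lmi"
query="query.fasta"
out="${query}.lexicmap.tsv"

# Build the command as positional parameters so it can be inspected before running.
#   --max-open-files  raised above the default of 512
#   --max-query-conc  raised above the default of 12 (uses more memory)
#   --top-n-genomes   keep only the top 1000 genome matches per query
set -- lexicmap search -d "$db" "$query" -o "$out" \
    --max-open-files 1024 \
    --max-query-conc 24 \
    --top-n-genomes 1000

if command -v lexicmap >/dev/null 2>&1 && [ -e "$db" ] && [ -f "$query" ]; then
    "$@"
else
    echo "would run: $*"
fi
```

<p>Adding <code>--load-whole-seeds</code> to the same command is an option when enough memory is available and the index is not stored on an SSD.</p>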
<div class="flex align-center gdoc-page__anchorwrap">
<h2 id="steps"
>
@@ -2373,6 +2409,58 @@ <h1>Step 2. Searching</h1>
class="gdoc-markdown__link"
href="https://github.com/shenwei356/csvtk"
>csvtk pretty</a>.</p>
<div class="flex align-center gdoc-page__anchorwrap">
<h3 id="summarizing-results"
>
Summarizing results
</h3>
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/search/#summarizing-results" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Summarizing results" aria-label="Anchor to: Summarizing results" href="#summarizing-results">
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
<p>To summarize alignment results, for example to count the number of matched genomes per species, follow these steps.</p>
<ol>
<li>
<p>Prepare a two-column tab-delimited file for mapping reference (genome) or sequence IDs to any information (such as species name).</p>
<pre><code> # for GTDB/GenBank/RefSeq genomes downloaded with genome_updater
cut -f 1,8 assembly_summary.txt &gt; ass2species.tsv

head -n 3 ass2species.tsv
GCF_002287175.1 Methanobacterium bryantii
GCF_000762265.1 Methanobacterium formicicum
GCF_029601605.1 Methanobacterium formicicum
</code></pre>
</li>
<li>
<p>Add information to the alignment result with <a
class="gdoc-markdown__link"
href="https://github.com/shenwei356/csvtk"
>csvtk</a> or other tools.</p>
<pre><code> # add species
cat b.gene_E_coli_16S.fasta.lexicmap.tsv \
| csvtk mutate -t --after slen -n species -f sgenome \
| csvtk replace -t -f species -p &quot;(.+)&quot; -r &quot;{kv}&quot; -k ass2species.tsv \
&gt; result.with_species.tsv

# filter result with query coverage &gt;= 80 and count the species
cat result.with_species.tsv \
| csvtk uniq -t -f sgenome \
| csvtk filter2 -t -f &quot;\$qcovHSP &gt;= 80&quot; \
| csvtk freq -t -f species -nr \
&gt; result.with_species.tsv.stats.tsv

csvtk head -t -n 5 result.with_species.tsv.stats.tsv \
| csvtk pretty -t

species frequency
------------------------ ---------
Salmonella enterica 135065
Escherichia coli 128071
Streptococcus pneumoniae 51971
Staphylococcus aureus 44215
Pseudomonas aeruginosa 34254</code></pre>
</li>
</ol>
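<p>When csvtk is not available, the same idea (map genome IDs to species, keep the first record per genome, filter by query coverage, count per species) can be sketched with <code>awk</code>. This is an illustration on toy data, not part of LexicMap: the temporary files are made up, and only the column names <code>sgenome</code> and <code>qcovHSP</code> are taken from the commands above.</p>

```shell
#!/bin/sh
# Sketch only: toy data stand in for the mapping file and the search result.
set -eu

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two-column mapping file: genome ID <TAB> species, no header.
printf 'GCF_002287175.1\tMethanobacterium bryantii\n'    > "$tmp/map.tsv"
printf 'GCF_000762265.1\tMethanobacterium formicicum\n' >> "$tmp/map.tsv"
printf 'GCF_029601605.1\tMethanobacterium formicicum\n' >> "$tmp/map.tsv"

# Result table reduced to the columns used here (hypothetical values).
printf 'query\tsgenome\tqcovHSP\n'        > "$tmp/result.tsv"
printf 'q1\tGCF_002287175.1\t95.0\n'     >> "$tmp/result.tsv"
printf 'q1\tGCF_000762265.1\t88.5\n'     >> "$tmp/result.tsv"
printf 'q1\tGCF_000762265.1\t40.0\n'     >> "$tmp/result.tsv"
printf 'q1\tGCF_029601605.1\t99.0\n'     >> "$tmp/result.tsv"

summary=$(awk -F '\t' '
    # First file: load the genome-to-species mapping.
    NR == FNR { species[$1] = $2; next }
    # Header of the second file: remember the column positions by name.
    FNR == 1  { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
        genome = $(col["sgenome"])
        if (seen[genome]++) next                      # first record per genome only
        if ($(col["qcovHSP"]) >= 80) count[species[genome]]++
    }
    END { for (s in count) printf "%s\t%d\n", s, count[s] }
' "$tmp/map.tsv" "$tmp/result.tsv" | sort -t "$(printf '\t')" -k2,2nr -k1,1)

printf '%s\n' "$summary"
```

<p>Looking up columns by header name rather than by position keeps the sketch independent of the exact column order in the result file.</p>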

</article>
