add a new tutorial of indexing ATB dataset hosted on OSF
shenwei356 committed Sep 18, 2024
1 parent a4f727c commit 883b7e9
Showing 9 changed files with 140 additions and 23 deletions.
8 changes: 8 additions & 0 deletions index.html
Original file line number Diff line number Diff line change
@@ -329,6 +329,14 @@ <h1></h1>
src="https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat"
alt="Cross-platform"

/></a>
<a
class="gdoc-markdown__link--raw"
href="https://github.com/shenwei356/taxonkit/blob/master/LICENSE"
><img
src="https://img.shields.io/github/license/shenwei356/taxonkit.svg?maxAge=2592000"
alt="license"

/></a></p>
<p><font size=5rem>LexicMap is a <strong>nucleotide sequence alignment</strong> tool for efficiently querying <strong>gene, plasmid, virus, or long-read sequences</strong> against up to <strong>millions</strong> of <strong>prokaryotic genomes</strong>.</font></p>

2 changes: 1 addition & 1 deletion index.xml
@@ -47,7 +47,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.</description>
</item>
<item>
<title>genomes</title>
8 changes: 8 additions & 0 deletions introduction/index.html
@@ -1708,6 +1708,14 @@ <h1>Introduction</h1>
src="https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat"
alt="Cross-platform"

/></a>
<a
class="gdoc-markdown__link--raw"
href="https://github.com/shenwei356/taxonkit/blob/master/LICENSE"
><img
src="https://img.shields.io/github/license/shenwei356/taxonkit.svg?maxAge=2592000"
alt="license"

/></a></p>
<p>LexicMap is a <strong>nucleotide sequence alignment</strong> tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to <strong>millions of prokaryotic genomes</strong>.</p>
<p>Preprint:</p>
2 changes: 1 addition & 1 deletion search/en.data.min.json

Large diffs are not rendered by default.

125 changes: 112 additions & 13 deletions tutorials/misc/index-allthebacteria/index.html
@@ -15,8 +15,7 @@
<meta name="description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
Tools:
https://github.com/shenwei356/rush, for running jobs Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />

<title>Indexing AllTheBacteria | LexicMap: efficient sequence alignment against millions of prokaryotic genomes</title>

@@ -45,8 +44,7 @@
<meta property="og:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
Tools:
https://github.com/shenwei356/rush, for running jobs Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/" />

@@ -58,8 +56,7 @@
<meta name="twitter:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
Tools:
https://github.com/shenwei356/rush, for running jobs Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />


<script type="application/ld+json">
@@ -70,8 +67,8 @@
"name": "Indexing AllTheBacteria",
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/",
"headline": "Indexing AllTheBacteria",
"description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https:\/\/ftp.ebi.ac.uk\/pub\/databases\/AllTheBacteria\/Releases\/0.2\/assembly\/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https:\/\/bioinf.",
"wordCount" : "416",
"description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https:\/\/osf.io\/xv7q9\/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.",
"wordCount" : "744",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1722,13 +1719,114 @@ <h1>Indexing AllTheBacteria</h1>
class="gdoc-markdown__link"
href="https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1"
>AllTheBacteria - all bacterial genomes assembled, available and searchable</a></li>
<li>Data on OSF: <a
class="gdoc-markdown__link"
href="https://osf.io/xv7q9/"
>https://osf.io/xv7q9/</a></li>
</ul>
<div class="flex align-center gdoc-page__anchorwrap">
<h2 id="steps-for-v02"
<h2 id="steps-for-v02-and-later-versions-hosted-at-osf"
>
Steps for v0.2
Steps for v0.2 and later versions hosted at OSF
</h2>
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2" aria-label="Anchor to: Steps for v0.2" href="#steps-for-v02">
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-and-later-versions-hosted-at-osf" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 and later versions hosted at OSF" aria-label="Anchor to: Steps for v0.2 and later versions hosted at OSF" href="#steps-for-v02-and-later-versions-hosted-at-osf">
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
<p>After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at <a
class="gdoc-markdown__link"
href="https://osf.io/xv7q9/"
>OSF</a>.</p>
<ol>
<li>
<p>Downloading the file list of all assemblies in the latest version (v0.2 plus incremental versions), which is available under <a
class="gdoc-markdown__link"
href="https://osf.io/zxfmy/"
>assemblies</a>.</p>
<pre><code> mkdir -p atb;
cd atb;

# Note: the URL might change; please check it in the browser.
wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz
</code></pre>
<p>If you only need to add assemblies from an incremental version,
please manually download the file list from the path <code>AllTheBacteria/Assembly/OSF Storage/File_lists</code>.</p>
</li>
<li>
<p>Downloading assembly tarball files.</p>
<pre><code> # tarball file names and their URLs
zcat file_list.all.latest.tsv.gz | awk 'NR&gt;1 {print $3&quot;\t&quot;$4}' | uniq &gt; tar2url.tsv

# download
cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}'
</code></pre>
</li>
<li>
<p>Decompressing all tarballs. The decompressed genomes are stored in plain text,
so we use <code>gzip</code> (which can be replaced with the faster <code>pigz</code>) to compress them and save disk space.</p>
<pre><code> # {^.tar.xz} removes the suffix &quot;.tar.xz&quot;
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa'

cd ..
</code></pre>
<p>After that, the assemblies directory will contain multiple subdirectories.
When you pass the directory to <code>lexicmap index -I</code>, it recursively scans for (plain or gz/xz/zstd-compressed) genome files.
Alternatively, you can pass a file list of selected assemblies.</p>
<pre><code> $ tree atb | more
atb
├── atb.assembly.r0.2.batch.1
│   ├── SAMD00013333.fa.gz
│   ├── SAMD00049594.fa.gz
│   ├── SAMD00195911.fa.gz
│   ├── SAMD00195914.fa.gz
</code></pre>
</li>
<li>
<p>Preparing a file list of assemblies.</p>
<ul>
<li>
<p>Just use <code>find</code> or <a
class="gdoc-markdown__link"
href="https://github.com/sharkdp/fd"
>fd</a> (much faster).</p>
<pre><code> # find
find atb/ -name &quot;*.fa.gz&quot; &gt; files.txt

# fd
fd .fa.gz$ atb/ &gt; files.txt
</code></pre>
<p>What it looks like:</p>
<pre><code> $ head -n 2 files.txt
atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz
atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz
</code></pre>
</li>
<li>
<p>(Optional) Keep only high-quality assemblies.
Please manually download the <code>hq_set.sample_list.txt.gz</code> file from <a
class="gdoc-markdown__link"
href="https://osf.io/xv7q9/"
>this path</a>, e.g., <code>AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/</code> (choose the latest date).</p>
<pre><code> find atb/ -name &quot;*.fa.gz&quot; | grep -w -f &lt;(zcat hq_set.sample_list.txt.gz) &gt; files.txt
</code></pre>
</li>
</ul>
</li>
<li>
<p>Creating a LexicMap index. (more details: <a
class="gdoc-markdown__link"
href="https://bioinf.shenwei.me/LexicMap/tutorials/index/"
>https://bioinf.shenwei.me/LexicMap/tutorials/index/</a>)</p>
<pre><code> lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log
</code></pre>
</li>
</ol>
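<p>The file-list handling in step 2 can be tried on a small made-up table before running it on the real list. The sketch below is only an illustration: the column names, sample names and <code>example.org</code> URLs are hypothetical, and only the column positions (3 for the tarball name, 4 for the URL) follow the <code>awk</code> command above. It also assumes the list is tab-separated and uses <code>sort -u</code> rather than <code>uniq</code>, so the result does not depend on rows of the same tarball being adjacent.</p>

```shell
# Toy stand-in for file_list.all.latest.tsv.gz. Column names and values are
# hypothetical; only the column positions (3 = tarball, 4 = URL) follow step 2.
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'sample\tfilename_in_tar\ttar\turl\n' > file_list.tsv
printf 'S1\tbatch.1/S1.fa\tbatch.1.tar.xz\thttps://example.org/a\n' >> file_list.tsv
printf 'S2\tbatch.1/S2.fa\tbatch.1.tar.xz\thttps://example.org/a\n' >> file_list.tsv
printf 'S3\tbatch.2/S3.fa\tbatch.2.tar.xz\thttps://example.org/b\n' >> file_list.tsv
gzip -f file_list.tsv

# Skip the header line, keep columns 3 and 4, and collapse repeated tarballs.
gzip -dc file_list.tsv.gz \
  | awk -F'\t' 'NR>1 {print $3"\t"$4}' \
  | sort -u > tar2url.tsv

# Count the distinct tarballs that would be downloaded.
n_tarballs=$(wc -l < tar2url.tsv | tr -d ' ')
echo "tarballs to download: $n_tarballs"
```

<p>Checking the number of lines in <code>tar2url.tsv</code> before starting the download is a cheap way to confirm that the column positions still match the current file list.</p>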
<div class="flex align-center gdoc-page__anchorwrap">
<h2 id="steps-for-v02-hosted-at-ebi-ftp"
>
Steps for v0.2 hosted at EBI ftp
</h2>
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-hosted-at-ebi-ftp" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 hosted at EBI ftp" aria-label="Anchor to: Steps for v0.2 hosted at EBI ftp" href="#steps-for-v02-hosted-at-ebi-ftp">
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
@@ -1760,6 +1858,7 @@ <h1>Indexing AllTheBacteria</h1>
so we use <code>gzip</code> (which can be replaced with the faster <code>pigz</code>) to compress them and save disk space.</p>
<pre><code> # {^asm.tar.xz} is for removing the suffix &quot;asm.tar.xz&quot;
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa'

cd ..
</code></pre>
<p>After that, the assemblies directory will contain multiple subdirectories.
@@ -1802,8 +1901,8 @@ <h1>Indexing AllTheBacteria</h1>



# index
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
# index
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
</code></pre>
<p>For 1,858,610 HQ genomes, on a 48-CPU machine, time: 48 h, RAM: 85 GB, index size: 3.88 TB.
If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
2 changes: 1 addition & 1 deletion tutorials/misc/index.xml
@@ -26,7 +26,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.</description>
</item>
<item>
<title>Indexing GlobDB</title>
2 changes: 1 addition & 1 deletion tutorials/parameters-general.tsv
@@ -1,5 +1,5 @@
Flag Value Function Comment
**`-w/--load-whole-seeds`** Load the whole seed data into memory for faster search Use this if the index is not big and many queries are needed to search.
**`-n/--top-n-genomes`** Default 0, 0 for all Keep top N genome matches for a query in the chaining phase The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
**`-n/--top-n-genomes`** Default 0, 0 for all Keep top N genome matches for a query in the chaining phase A value of 1 is not recommended, as the best chaining result does not always yield the best alignment; a value >= 5 is preferable. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
**`-a/--all`** Output more columns, e.g., matched sequences. "Use this if you want to output blast-style format with ""lexicmap utils 2blast"""
-J/--max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory.
9 changes: 5 additions & 4 deletions tutorials/search/index.html
@@ -71,7 +71,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
"headline": "Step 2. Searching",
"description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
"wordCount" : "2918",
"wordCount" : "2941",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1938,7 +1938,7 @@ <h1>Step 2. Searching</h1>
<td style="text-align:left"><strong><code>-n/--top-n-genomes</code></strong></td>
<td style="text-align:left">Default 0, 0 for all</td>
<td style="text-align:left">Keep top N genome matches for a query in the chaining phase</td>
<td style="text-align:left">The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.</td>
<td style="text-align:left">A value of 1 is not recommended, as the best chaining result does not always yield the best alignment; a value &gt;= 5 is preferable. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>-a/--all</code></strong></td>
@@ -1950,7 +1950,7 @@ <h1>Step 2. Searching</h1>
<td style="text-align:left">-J/&ndash;max-query-conc</td>
<td style="text-align:left">Default 12, 0 for all</td>
<td style="text-align:left">Maximum number of concurrent queries</td>
<td style="text-align:left">Bigger values do not improve the batch searching speed and consume much memory</td>
<td style="text-align:left">Bigger values do not improve the batch searching speed and consume much memory.</td>
</tr>
</tbody>
</table> </div>
@@ -2492,7 +2492,8 @@ <h1>Step 2. Searching</h1>
Escherichia coli 128071
Streptococcus pneumoniae 51971
Staphylococcus aureus 44215
Pseudomonas aeruginosa 34254</code></pre>
Pseudomonas aeruginosa 34254
</code></pre>
</li>
</ol>
