Commit a4f727c: update tutorials
shenwei356 committed Sep 11, 2024 · 1 parent 78cab8b
Showing 8 changed files with 796 additions and 42 deletions.
3 changes: 2 additions & 1 deletion .directory
@@ -1,5 +1,6 @@
 [Dolphin]
-Timestamp=2024,4,17,11,43,25.321
+SortOrder=1
+Timestamp=2024,9,10,11,5,49.66
 Version=4
 ViewMode=1

650 changes: 650 additions & 0 deletions AllTheBacteria-v0.2.url.txt

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions index.xml
@@ -26,7 +26,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/</guid>
<description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;genbank&amp;quot; -M &amp;quot;ncbi&amp;quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi&#39; \ &amp;gt; failed.</description>
<description>Make sure you have enough disk space, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;genbank&amp;quot; -M &amp;quot;ncbi&amp;quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $?</description>
</item>
<item>
<title>kmers</title>
@@ -47,7 +47,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
<description>Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps:&#xA;Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;Decompressing all tarballs.&#xA;cd assemblies; ls *.tar.xz | parallel --eta &#39;tar -Jxf {}; gzip {}/*.fa&#39; cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.</description>
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
</item>
<item>
<title>genomes</title>
2 changes: 1 addition & 1 deletion search/en.data.min.json

Large diffs are not rendered by default.

137 changes: 111 additions & 26 deletions tutorials/misc/index-allthebacteria/index.html
@@ -12,11 +12,11 @@
<meta name="generator" content="Hugo 0.133.0">


<meta name="description" content="Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps:
Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
Decompressing all tarballs.
cd assemblies; ls *.tar.xz | parallel --eta &#39;tar -Jxf {}; gzip {}/*.fa&#39; cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files." />
<meta name="description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
Tools:
https://github.com/shenwei356/rush, for running jobs Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />

<title>Indexing AllTheBacteria | LexicMap: efficient sequence alignment against millions of prokaryotic genomes​</title>

@@ -42,11 +42,11 @@
content="Indexing AllTheBacteria"
/>
<meta property="og:site_name" content="LexicMap: efficient sequence alignment against millions of prokaryotic genomes\u200b" />
<meta property="og:description" content="Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps:
Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
Decompressing all tarballs.
cd assemblies; ls *.tar.xz | parallel --eta &#39;tar -Jxf {}; gzip {}/*.fa&#39; cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files." />
<meta property="og:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
Tools:
https://github.com/shenwei356/rush, for running jobs Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/" />

@@ -55,11 +55,11 @@

<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Indexing AllTheBacteria" />
<meta name="twitter:description" content="Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps:
Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
Decompressing all tarballs.
cd assemblies; ls *.tar.xz | parallel --eta &#39;tar -Jxf {}; gzip {}/*.fa&#39; cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files." />
<meta name="twitter:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
Tools:
https://github.com/shenwei356/rush, for running jobs Info:
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />


<script type="application/ld+json">
@@ -70,8 +70,8 @@
"name": "Indexing AllTheBacteria",
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/",
"headline": "Indexing AllTheBacteria",
"description": "Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps:\nDownloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https:\/\/ftp.ebi.ac.uk\/pub\/databases\/AllTheBacteria\/Releases\/0.2\/assembly\/\nDecompressing all tarballs.\ncd assemblies; ls *.tar.xz | parallel --eta \u0027tar -Jxf {}; gzip {}\/*.fa\u0027 cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz\/xz\/zstd-compressed) genome files.",
"wordCount" : "132",
"description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https:\/\/ftp.ebi.ac.uk\/pub\/databases\/AllTheBacteria\/Releases\/0.2\/assembly\/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https:\/\/bioinf.",
"wordCount" : "416",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1704,44 +1704,129 @@ <h2>More</h2>
class="gdoc-markdown gdoc-markdown__align--left"
>
<h1>Indexing AllTheBacteria</h1>
-<p>Info:</p>
<p><strong>Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.</strong></p>
<p>Tools:</p>
<ul>
<li><a
class="gdoc-markdown__link"
href="https://github.com/shenwei356/rush"
>https://github.com/shenwei356/rush</a>, for running jobs</li>
</ul>
<p>Info:</p>
<ul>
<li><a
class="gdoc-markdown__link"
href="https://github.com/AllTheBacteria/AllTheBacteria"
->AllTheBacteria</a>, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable.</li>
>AllTheBacteria</a>, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable.</li>
<li>Preprint: <a
class="gdoc-markdown__link"
href="https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1"
>AllTheBacteria - all bacterial genomes assembled, available and searchable</a></li>
</ul>
-<p>Steps:</p>
<div class="flex align-center gdoc-page__anchorwrap">
<h2 id="steps-for-v02"
>
Steps for v0.2
</h2>
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2" aria-label="Anchor to: Steps for v0.2" href="#steps-for-v02">
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
<ol>
<li>
-<p>Downloading assemblies tarballs here (except these starting with <code>unknown__</code>) to a directory (like assemblies):
<p>Downloading assemblies tarballs here (except those starting with <code>unknown__</code>) to a directory (like <code>atb</code>):
<a
class="gdoc-markdown__link"
href="https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/"
>https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/</a></p>
<pre><code> mkdir -p atb;
cd atb;

# assembly file list, 650 files in total
wget https://bioinf.shenwei.me/LexicMap/AllTheBacteria-v0.2.url.txt

# download
# rush is used: https://github.com/shenwei356/rush
# The download.rush file stores finished jobs, which will be skipped in a second run for resuming jobs.
cat AllTheBacteria-v0.2.url.txt | rush --eta -j 2 -c -C download.rush 'wget {}'


# list of high-quality samples
wget https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/metadata/hq_set.sample_list.txt.gz
</code></pre>
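The `-c -C download.rush` options above make the download resumable: rush logs finished commands and skips them on the next run. The same idea can be sketched in plain shell (illustrative only; the tutorial uses rush, and the URLs and log name below are placeholders):

```shell
# Plain-shell sketch of the resume pattern that rush's -c/-C flags provide:
# append each finished URL to a log file, and skip logged URLs on re-runs.
log=download.done
touch "$log"
for url in https://example.com/a.asm.tar.xz https://example.com/b.asm.tar.xz; do
    grep -qxF "$url" "$log" && continue   # already finished: skip
    echo "downloading $url"               # stand-in for: wget "$url"
    echo "$url" >> "$log"                 # record success
done
```

Running the loop a second time downloads nothing, because every URL is already in the log.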
</li>
<li>
-<p>Decompressing all tarballs.</p>
-<pre><code> cd assemblies;
-ls *.tar.xz | parallel --eta 'tar -Jxf {}; gzip {}/*.fa'
<p>Decompressing all tarballs. The decompressed genomes are stored in plain text,
so we compress them with <code>gzip</code> (which can be replaced with the faster <code>pigz</code>) to save disk space.</p>
<pre><code> # {^asm.tar.xz} is for removing the suffix &quot;asm.tar.xz&quot;
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa'
cd ..
</code></pre>
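The `{^asm.tar.xz}` placeholder in the rush command removes the given suffix from the input. Its plain-shell analogue is `${var%suffix}` parameter expansion (the tarball name below is a hypothetical example):

```shell
# Shell analogue of rush's {^...} suffix removal, using POSIX
# parameter expansion. The tarball name is a hypothetical example.
f=achromobacter_xylosoxidans__01.asm.tar.xz
dir=${f%.asm.tar.xz}   # strip the trailing ".asm.tar.xz"
echo "$dir"            # prints: achromobacter_xylosoxidans__01
```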
<p>After that, the assemblies directory would have multiple subdirectories.
-When you give the directory to <code>lexicmap index -I</code>, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.</p>
When you give the directory to <code>lexicmap index -I</code>, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.
You can also give a file list with selected assemblies.</p>
<pre><code> $ tree atb | more
atb
├── achromobacter_xylosoxidans__01
│   ├── SAMD00013333.fa.gz
│   ├── SAMD00049594.fa.gz
│   ├── SAMD00195911.fa.gz
│   ├── SAMD00195914.fa.gz


# disk usage

$ du -sh atb
2.9T atb

$ du -sh atb --apparent-size
2.1T atb
</code></pre>
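The gap between the two `du` numbers above comes from how `du` counts: by default it reports allocated disk blocks, while `--apparent-size` reports logical file sizes. A sparse file makes the difference easy to see (a minimal sketch, not specific to the ATB data):

```shell
# du (default) counts allocated blocks; --apparent-size counts logical bytes.
# A sparse file allocates no data blocks, so the two reports diverge.
truncate -s 1M sparse.bin            # 1 MiB logical size, nothing written
du -k --apparent-size sparse.bin     # logical size: 1024 KiB
du -k sparse.bin                     # allocated: typically 0 KiB
rm sparse.bin
```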
</li>
<li>
<p>Creating a LexicMap index. (more details: <a
class="gdoc-markdown__link"
href="https://bioinf.shenwei.me/LexicMap/tutorials/index/"
>https://bioinf.shenwei.me/LexicMap/tutorials/index/</a>)</p>
-<pre><code>lexicmap index -I assemblies/ -O atb.lmi -b 25000 --log atb.lmi.log
<pre><code> # file paths of all samples
find atb/ -name &quot;*.fa.gz&quot; &gt; atb_all.txt

# wc -l atb_all.txt
# 1876015 atb_all.txt

# file paths of high-quality samples
grep -w -f &lt;(zcat atb/hq_set.sample_list.txt.gz) atb_all.txt &gt; atb_hq.txt

# wc -l atb_hq.txt
# 1858610 atb_hq.txt



# index
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
</code></pre>
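The `grep -w -f` step above selects every path line containing a high-quality sample ID as a whole word; the `/` and `.` surrounding each ID in the paths act as word boundaries. A self-contained toy run (the sample IDs and paths here are hypothetical):

```shell
# Toy version of the high-quality filter: -f reads one pattern per line,
# -w requires whole-word matches, so IDs bounded by "/" and "." match.
printf 'SAMD00013333\nSAMD00049594\n' > hq_ids.txt
printf '%s\n' \
    'atb/achromobacter__01/SAMD00013333.fa.gz' \
    'atb/achromobacter__01/SAMD00195911.fa.gz' \
    'atb/bordetella__02/SAMD00049594.fa.gz' > all_paths.txt
grep -w -f hq_ids.txt all_paths.txt   # prints the 1st and 3rd paths
rm hq_ids.txt all_paths.txt
```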
<p>For 1,858,610 high-quality genomes on a 48-CPU machine: 48 h of wall-clock time, 85 GB of RAM, and a 3.88 TB index.
If you don&rsquo;t have enough memory, decrease the value of <code>-b</code>.</p>
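Assuming `-b` sets the maximum number of genomes per indexing batch (an interpretation not spelled out above, so the batch count is a ceiling division), `-b 25000` over the 1,858,610 genomes works out to:

```shell
# Ceiling-division sketch: batches needed for 1,858,610 genomes
# at up to 25,000 genomes per batch (assumed meaning of -b).
genomes=1858610
batch=25000
echo $(( (genomes + batch - 1) / batch ))   # prints: 75
```

Lowering `-b` raises the batch count but shrinks how much seed data must be held in memory at once.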
<pre><code> # disk usage

$ du -sh atb_hq.lmi
4.6T atb_hq.lmi

$ du -sh atb_hq.lmi --apparent-size
3.9T atb_hq.lmi

$ dirsize atb_hq.lmi

atb_hq.lmi: 3.88 TiB (4,261,437,129,065)
2.11 TiB seeds
1.77 TiB genomes
39.22 MiB genomes.map.bin
312.53 KiB masks.bin
332 B info.toml
</code></pre>
<p>Note that during indexing, <code>atb_hq.lmi</code> first exists as a temporary directory.
There, the seed data is larger than the final size of the <code>seeds</code> directory,
while the genome files are simply moved into the final index.</p>
</li>
</ol>

