Commit

add the tutorial for indexing UHGG
shenwei356 committed Sep 5, 2024
1 parent 75e0f4a commit 78cab8b
Showing 33 changed files with 5,044 additions and 352 deletions.
89 changes: 78 additions & 11 deletions faqs/index.html
@@ -604,26 +604,16 @@ <h2>Navigation</h2>
<input
type="checkbox"

class="gdoc-nav__toggle" id="navtree-5198871c"

class="hidden"

/>
<label

for="navtree-5198871c" class="flex justify-between align-center"

>

<span class="flex">More</span>


<svg class="gdoc-icon toggle gdoc_keyboard_arrow_left">
<use xlink:href="#gdoc_keyboard_arrow_left"></use>
</svg>
<svg class="gdoc-icon toggle gdoc_keyboard_arrow_down">
<use xlink:href="#gdoc_keyboard_arrow_down"></use>
</svg>

</label>


@@ -810,6 +800,50 @@ <h2>Navigation</h2>



















<li>
<input
type="checkbox"

class="hidden"

/>
<label

>

<span class="flex">
<a
href="/LexicMap/tutorials/misc/index-uhgg/"
class="gdoc-nav__entry"
>
Indexing UHGG
</a>
</span>


</label>


</li>



</ul>


@@ -2219,6 +2253,39 @@ <h1>FAQs</h1>







































33 changes: 33 additions & 0 deletions index.html
@@ -866,6 +866,39 @@ <h1></h1>







































13 changes: 10 additions & 3 deletions index.xml
@@ -12,7 +12,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/</guid>
<description>Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;GTDB_complete&amp;quot; -M &amp;quot;gtdb&amp;quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi&#39; \ &amp;gt; failed.</description>
<description>Info:&#xA;https://gtdb.ecogenomic.org/ Tools:&#xA;https://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;time genome_updater.sh -d &amp;quot;refseq,genbank&amp;quot; -g &amp;quot;archaea,bacteria&amp;quot; \ -f &amp;quot;genomic.fna.gz&amp;quot; -o &amp;quot;GTDB_complete&amp;quot; -M &amp;quot;gtdb&amp;quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name &amp;quot;*.gz&amp;quot; \ fd &amp;quot;.gz$&amp;quot; $genomes \ | rush --eta &#39;seqkit seq -w 0 {} &amp;gt; /dev/null; if [ $?</description>
</item>
<item>
<title>masks</title>
@@ -47,7 +47,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
<description>Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;Decompressing all tarballs.&#xA;cd assemblies; ls *.tar.xz | parallel --eta &#39;tar -Jxf {}; gzip {}/*.fa&#39; cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.&#xA;Creating a LexicMap index. (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)&#xA;lexicmap index -I assemblies/ -O atb.</description>
<description>Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps:&#xA;Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like assemblies): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;Decompressing all tarballs.&#xA;cd assemblies; ls *.tar.xz | parallel --eta &#39;tar -Jxf {}; gzip {}/*.fa&#39; cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.</description>
</item>
<item>
<title>genomes</title>
@@ -61,7 +61,7 @@
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-globdb/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-globdb/</guid>
<description># download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name &amp;quot;*.fa.gz&amp;quot; &amp;gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi -g 50000000 </description>
<description>Info:&#xA;GlobDB , a dereplicated dataset of the species reps of the GTDB, GEM, SPIRE and SMAG datasets a lot. https://x.com/daanspeth/status/1822964436950192218 Steps:&#xA;# download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name &amp;quot;*.fa.gz&amp;quot; &amp;gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi -g 50000000 </description>
</item>
<item>
<title>search</title>
@@ -77,6 +77,13 @@
<guid>https://bioinf.shenwei.me/LexicMap/usage/utils/subseq/</guid>
<description>Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A&amp;#39;s in output might be N&amp;#39;s in the genomes.</description>
</item>
<item>
<title>Indexing UHGG</title>
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-uhgg/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-uhgg/</guid>
<description>Info:&#xA;Unified Human Gastrointestinal Genome (UHGG) v2.0.2 A unified catalog of 204,938 reference genomes from the human gut microbiome Number of Genomes: 289,232 Tools:&#xA;https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:&#xA;# meta data wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.2/genomes-all_metadata.tsv # gff url sed 1d genomes-all_metadata.tsv | cut -f 20 | sed &#39;s/v2.0/v2.0.2/&#39; | sed -E &#39;s/^ftp/https/&#39; &amp;gt; url.txt # download gff files mkdir -p files; cd files time cat ../url.txt \ | rush --eta -v &#39;dir={///%}/{//%}&#39; \ &#39;mkdir -p {dir}; curl -s -o {dir}/{%} {}&#39; \ -c -C download.</description>
</item>
<item>
<title>seed-pos</title>
<link>https://bioinf.shenwei.me/LexicMap/usage/utils/seed-pos/</link>
89 changes: 78 additions & 11 deletions installation/index.html
@@ -622,26 +622,16 @@ <h2>Navigation</h2>
<input
type="checkbox"

class="gdoc-nav__toggle" id="navtree-5198871c"

class="hidden"

/>
<label

for="navtree-5198871c" class="flex justify-between align-center"

>

<span class="flex">More</span>


<svg class="gdoc-icon toggle gdoc_keyboard_arrow_left">
<use xlink:href="#gdoc_keyboard_arrow_left"></use>
</svg>
<svg class="gdoc-icon toggle gdoc_keyboard_arrow_down">
<use xlink:href="#gdoc_keyboard_arrow_down"></use>
</svg>

</label>


@@ -828,6 +818,50 @@ <h2>Navigation</h2>



















<li>
<input
type="checkbox"

class="hidden"

/>
<label

>

<span class="flex">
<a
href="/LexicMap/tutorials/misc/index-uhgg/"
class="gdoc-nav__entry"
>
Indexing UHGG
</a>
</span>


</label>


</li>



</ul>


@@ -2398,6 +2432,39 @@ <h1>Installation</h1>







































