diff --git a/search/en.data.min.json b/search/en.data.min.json index 9590ffa..47386e7 100644 --- a/search/en.data.min.json +++ b/search/en.data.min.json @@ -1 +1 @@ -[{"id":0,"href":"/LexicMap/usage/utils/masks/","title":"masks","parent":"utils","content":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks. (default 40000) -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -p, --prefix int ► Length of mask k-mer prefix for checking low-complexity (0 for no checking). (default 15) -s, --seed int ► The seed for generating random masks. (default 1) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) Examples $ lexicmap utils masks --quiet -d demo.lmi/ | head -n 10 1 AAAACACCATGGAGCCTTGTGGAACCTTGGC 2 AAAACACGCGATCAGGTCGTCCGTCCCAGTG 3 AAAACACTATGGCCTGATTACCCCATCCCGA 4 AAAACAGGACCGTCCTAGGGTCAATGGTTCG 5 AAAACAGTCTTGTATTATGTACTTCACATTC 6 AAAACATGTTACTACGGTTTTCCGCAATTGG 7 AAAACATTGGTCCTATTGGCGTCACTCGATA 8 AAAACCACTGTGCATATCTCGAATCCCGCTC 9 AAAACCAGCTCTGTAAGCACTAACAACGCTA 10 AAAACCATGGTGCCGTGCATTTGCGCACCTA $ lexicmap utils masks --quiet -d demo.lmi/ | tail -n 10 39991 TTTTGGTCTACAGAAAGTGCGTTATAGATTT 39992 TTTTGGTGTGGAGAAGGACCTCACTGTTAAT 39993 TTTTGTAGACCGAGGTTTTAAGTCCAGGGGG 39994 TTTTGTATGGAATACTTTACAGTCATCAGTT 39995 TTTTGTCATCAGTCGGCACTTAGGGGAACCG 39996 TTTTGTCCCAGTGACCAATCACAGTTCGGGA 39997 TTTTGTCGATAATCCTGCCTCGATTTCTCTT 39998 TTTTGTGAATAAGAGATCCTGTCGCAGGAAA 39999 TTTTGTGCACGACGCTCCTGGTGTATCGCCT 40000 TTTTGTGGCGACGGCGTACCCCGTCTAGGAG # check a specific mask $ lexicmap utils masks --quiet -d demo.lmi/ -m 12345 12345 CATGTTATAGCACTGGCGGCTAACGCCTTTG ","description":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks."},{"id":1,"href":"/LexicMap/usage/index/","title":"index","parent":"Usage","content":"$ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers in the file names. 2. Input plain or gzip/xz/zstd/bzip2 compressed FASTA/Q files can be given via positional arguments or the flag -X/--infile-list with a list of input files. 
Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list. 3. Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed. A regular expression for matching sequencing files is available via the flag -r/--file-regexp. 4. Some non-isolate assemblies might have extremely large genomes (e.g., GCA_000765055.1, \u0026gt;150 mb). The flag -g/--max-genome is used to skip these input files, and the file list would be written to a file (-G/--big-genomes). You need to increase the value for indexing fungi genomes. 5. Maximum genome size: 268,435,456. More precisely: $total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456, as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index. Attention: *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome identifiers in the index and search result would be: the basenames of files with common FASTA/Q file extensions removed, which are extracted via the flag -N/--ref-name-regexp. ► The extracted genome identifiers better be distinct, which will be shown in search results and are used to extract subsequences in the command \u0026#34;lexicmap utils subseq\u0026#34;. 2) ► Unwanted sequences like plasmids can be filtered out by content in FASTA/Q header via regular expressions (-B/--seq-name-filter). 3) All degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. code bases saved A A A C C C G G G T/U T T M A/C A R A/G A W A/T A S C/G C Y C/T C K G/T G V A/C/G A H A/C/T A D A/G/T A B C/G/T C N A/C/G/T A Important parameters: --- Genome data --- *1. -b/--batch-size, ► Maximum number of genomes in each batch (maximum: 131072, default: 5000). ► If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. 
In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. --- LexicHash mask generation --- 0. -M/--mask-file, ► File with custom masks, which could be exported from an existing index or newly generated by \u0026#34;lexicmap utils masks\u0026#34;. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, etc. *1. -k/--kmer, ► K-mer size (maximum: 32, default: 31). ■ Bigger values improve the search specificity and do not increase the index size. *2. -m/--masks, ► Number of LexicHash masks (default: 40000). ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. 3. -p/--seed-min-prefix, ► Minimum length of shared substrings (anchors) in searching (maximum: 32, default: 15). ► This value is used to remove masks with a prefix of low-complexity. --- Seeds data (k-mer-value data) --- *1. --seed-max-desert ► Maximum length of distances between seeds (default: 200). The default value of 200 guarantees queries \u0026gt;=200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist (50 by default) bases. ■ Big values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. 2. -c/--chunks, ► Number of seed file chunks (maximum: 128, default: #CPUs). ► Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. *3. 
-J/--seed-data-threads ► Number of threads for writing seed data and merging seed chunks from all batches (maximum: -c/--chunks, default: 8). ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. 4. --partitions, ► Number of partitions for indexing each seed file (default: 512). ► Bigger values bring a little higher memory occupation. 512 is a good value with high searching speed, Larger or smaller values would decrease the speed in \u0026#34;lexicmap search\u0026#34;. ► After indexing, \u0026#34;lexicmap utils reindex-seeds\u0026#34; can be used to reindex the seeds data with another value of this flag. 5. --max-open-files, ► Maximum number of open files (default: 512). ► It\u0026#39;s only used in merging indexes of multiple genome batches. Usage: lexicmap index [flags] [-k \u0026lt;k\u0026gt;] [-m \u0026lt;masks\u0026gt;] { -I \u0026lt;seqs dir\u0026gt; | -X \u0026lt;file list\u0026gt;} -O \u0026lt;out dir\u0026gt; Flags: -b, --batch-size int ► Maximum number of genomes in each batch (maximum value: 131072) (default 5000) -G, --big-genomes string ► Out file of skipped files with $total_bases + ($num_contigs - 1) * $contig_interval \u0026gt;= -g/--max-genome. The second column is one of the skip types: no_valid_seqs, too_large_genome, too_many_seqs. -c, --chunks int ► Number of chunks for storing seeds (k-mer-value data) files. (default 16) --contig-interval int ► Length of interval (N\u0026#39;s) between contigs in a genome. (default 1000) -r, --file-regexp string ► Regular expression for matching sequence files in -I/--in-dir, case ignored. (default \u0026#34;\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --force ► Overwrite existing output directory. -h, --help help for index -I, --in-dir string ► Input directory containing FASTA/Q files. Directory and file symlinks are followed. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. 
(default 31) -M, --mask-file string ► File of custom masks. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, -p/--seed-min-prefix, etc. -m, --masks int ► Number of LexicHash masks. (default 40000) -g, --max-genome int ► Maximum genome size. Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. Need to be smaller than the maximum supported genome size: 268435456 (default 15000000) --max-open-files int ► Maximum opened files, used in merging indexes. (default 512) -O, --out-dir string ► Output LexicMap index directory. --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files. (default 512) -s, --rand-seed int ► Rand seed for generating random masks. (default 1) -N, --ref-name-regexp string ► Regular expression (must contains \u0026#34;(\u0026#34; and \u0026#34;)\u0026#34;) for extracting the reference name from the filename. (default \u0026#34;(?i)(.+)\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --save-seed-pos ► Save seed positions, which can be inspected with \u0026#34;lexicmap utils seed-pos\u0026#34;. -J, --seed-data-threads int ► Number of threads for writing seed data and merging seed chunks from all batches, the value should be in range of [1, -c/--chunks] (default 8) -d, --seed-in-desert-dist int ► Distance of k-mers to fill deserts. (default 50) -D, --seed-max-desert int ► Maximum length of sketching deserts, or maximum seed distance. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist bases. (default 200) -p, --seed-min-prefix int ► Minimum length of shared substrings (anchors) in searching. Here, this value is used to remove low-complexity masks and choose k-mers to fill sketching deserts. (default 15) -B, --seq-name-filter strings ► List of regular expressions for filtering out sequences by contents in FASTA/Q header/name, case ignored. 
-S, --skip-file-check ► Skip input file checking when given files or a file list. Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples See Building an index ","description":"$ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers in the file names. 2. Input plain or gzip/xz/zstd/bzip2 compressed FASTA/Q files can be given via positional arguments or the flag -X/--infile-list with a list of input files. Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list. 3. Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed."},{"id":2,"href":"/LexicMap/introduction/","title":"Introduction","parent":"","content":" LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Related projects Support License Features LexicMap is scalable to up to millions of prokaryotic genomes. The sensitivity of LexicMap is comparable with Blastn. The alignment is fast and memory-efficient. LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs). LexicMap is easy to use (tutorials and usages). Both tabular and Blast-style output formats are available. Besides, we provide several commands to explore the index data and extract indexed subsequences. 
Introduction Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.\nExisting full alignment tools face challenges of high memory consumption and slow speeds. Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis. Prefilter+Align strategies have the sensitivity issue in the prefiltering step. Methods: (algorithm overview)\nAn improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently. We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee. We added the support of suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched, which means seeds are immune to any single SNP. A multi-level index enables fast and low-memory variable-length seed matching and chaining. A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment. A reimplemented Wavefront alignment algorithm is used for base-level alignment. Results:\nLexicMap enables efficient indexing and searching of both RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million genomes respectively). Running at this scale has previously only been achieved by Phylign (previously called mof-search).\nFor searching in all 2,340,672 Genbank+Refseq prokaryotic genomes, Bastn is unable to run with this dataset on common servers as it requires \u0026gt;2000 GB RAM. 
(see performance).\nWith LexicMap (48 CPUs),\nQuery Genome hits Time RAM A 1.3-kb marker gene 36,633 21s 3.4 GB A 1.5-kb 16S rRNA 1,928,372 6m40s 16.7 GB A 52.8-kb plasmid 551,264 8m54s 20.1 GB 1003 AMR genes 27,577,060 5h18m 41.3 GB Quick start Building an index (see the tutorial of building an index).\n# From a directory with multiple genome files lexicmap index -I genomes/ -O db.lmi # From a file list with one file per line lexicmap index -X files.txt -O db.lmi Querying (see the tutorial of searching).\n# For short queries like genes or long reads, returning top N hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 # For longer queries like plasmids, returning all hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Sample output (queries are a few Nanopore Q20 reads). See output format details.\nquery qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 
99.729 1 32 1137 1400097 1401203 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 CIGAR string, aligned query and subject sequences can be outputted as extra columns via the flag -a/--all.\n# Extracting similar sequences for a query gene. 
# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \\ --min-qcov-per-hsp 90 --all # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\\t' '{print \u0026quot;\u0026gt;\u0026quot;$5\u0026quot;:\u0026quot;$14\u0026quot;-\u0026quot;$15\u0026quot;:\u0026quot;$16\u0026quot;\\n\u0026quot;$20;}' \\ | seqkit seq -g \u0026gt; results.fasta Export blast-style format:\nseqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 2 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast Query = GCF_000017205.1_r160 Length = 478 [Subject genome #1/1] = GCF_000017205.1 Query coverage per genome = 95.188% \u0026gt;NC_009656.1 Length = 6588339 HSP #1 Query coverage per seq = 95.188%, Aligned length = 463, Identities = 95.680%, Gaps = 12 Query range = 13-467, Subject range = 4866862-4867320, Strand = Plus/Plus Query 13 CCTCAAACGAGTCC-AACAGGCCAACGCCTAGCAATCCCTCCCCTGTGGGGCAGGGAAAA 71 |||||||||||||| |||||||| |||||| | ||||||||||||| |||||||||||| Sbjct 4866862 CCTCAAACGAGTCCGAACAGGCCCACGCCTCACGATCCCTCCCCTGTCGGGCAGGGAAAA 4866921 Query 72 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCATCTTCCACGGTGCCC 131 |||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| Sbjct 4866922 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCAT-TTCCACGGTGCCC 4866980 Query 132 GCCCACGGCGGACCCGCGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTA-CA 190 ||| ||||||||||| ||||||||||||||||||||||||||||||||||||||||| || Sbjct 4866981 GCC-ACGGCGGACCC-CGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTATCA 4867038 Query 191 CTCGGCGTTCGGCCAGCGACAGC---GACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 247 |||||||| |||||||||||||| |||||||||||||||||||||||||||||||||| Sbjct 4867039 CTCGGCGT-CGGCCAGCGACAGCAGCGACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 4867097 Query 248 AGGTGGTGCGCTCGCTGAC-AAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCG- 305 ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 4867098 AGGTGGTGCGCTCGCTGACGAAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCGG 
4867157 Query 306 TGCCGGACAGCCCGTGGCCGCCGAACAGTTGCACGCCCACCACCGCGCCGAT-TGGTTTC 364 |||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| | Sbjct 4867158 TGCCGGACAGCCCGTGGCCGCCGAACGGTTGCACGCCCACCACCGCGCCGATCTGGTTGC 4867217 Query 365 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTGGATGCGGCGGGCGGTTTCCTCGT 424 |||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| Sbjct 4867218 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTCGATGCGGCGGGCGGTTTCCTCGT 4867277 Query 425 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 467 ||||||||||||||||||||||||||||||||||||||||||| Sbjct 4867278 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 4867320 Learn more tutorials and usages.\nPerformance Indexing dataset genomes gzip_size tool db_size time RAM GTDB complete 402,538 578 GB LexicMap 906 GB 8 h 21 m 71.1 GB Blastn 360 GB 3 h 11 m 718 MB AllTheBacteria HQ 1,858,610 3.1 TB LexicMap 3.88 TB 48 h 08 m 82.7 GB Blastn 1.76 TB 14 h 03 m 2.9 GB Phylign 248 GB / / Genbank+RefSeq 2,340,672 3.5 TB LexicMap 4.94 TB 52 h 03 m 188.6 GB Blastn 2.15 TB 14 h 04 m 4.3 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). LexicMap index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Searching Blastn failed to run as it requires \u0026gt;2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. 
Phylign only has the index for AllTheBacteria HQ dataset.\nGTDB complete (402,538 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 5,249 5,234 2.2 s 1.0 GB Blastn 7,121 6,177 2,171 s 351.2 GB a 16S rRNA gene 1,542 bp LexicMap 302,096 278,023 73 s 4.1 GB Blastn 301,197 277,042 2,353 s 378.4 GB a plasmid 52,830 bp LexicMap 63,820 1,188 58 s 4.7 GB Blastn 69,311 2,308 2,262 s 364.7 GB 1033 AMR genes 1 kb (median) LexicMap 4,132,990 2,255,347 1,165 s 20.2 GB Blastn 5,357,772 2,240,766 4,686 s 442.1 GB AllTheBacteria HQ (1,858,610 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 33,795 33,786 19 s 2.5 GB Phylign_local 7,936 30 m 48 s 77.6 GB Phylign_cluster 7,936 28 m 33 s a 16S rRNA gene 1,542 bp LexicMap 1,857,641 1,739,767 7 m 50 s 18.2 GB Phylign_local 1,017,765 130 m 33 s 77.0 GB Phylign_cluster 1,017,765 86 m 41 s a plasmid 52,830 bp LexicMap 480,008 3,620 8 m 16 s 15.7 GB Phylign_local 46,822 47 m 33 s 82.6 GB Phylign_cluster 46,822 39 m 34 s 1033 AMR genes 1 kb (median) LexicMap 22,995,817 12,347,425 185 m 25 s 45.1 GB Phylign_local 1,135,215 156 m 08 s 85.9 GB Phylign_cluster 1,135,215 133 m 49 s Genbank+RefSeq (2,340,672 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 36,633 36,578 21 s 3.4 GB a 16S rRNA gene 1,542 bp LexicMap 1,928,372 1,381,723 6 m 40 s 16.7 GB a plasmid 52,830 bp LexicMap 551,264 6,559 8 m 54 s 20.1 GB 1033 AMR genes 1 kb (median) LexicMap 27,577,060 14,798,129 318 m 28 s 41.3 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). Main searching parameters: LexicMap v0.4.0: --threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 70. 
Blastn v2.15.0+: -num_threads 48 -max_target_seqs 10000000. Phylign (AllTheBacteria fork 9fc65e6): threads: 48, cobs_kmer_thres: 0.33, minimap_preset: \u0026quot;asm20\u0026quot;, nb_best_hits: 5000000, max_ram_gb: 100; For cluster, maximum number of slurm jobs is 100. Installation LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.\nOr install with conda:\nconda install -c bioconda lexicmap Algorithm overview Related projects High-performance LexicHash computation in Go. Wavefront alignment algorithm (WFA) in Golang. Support Please open an issue to report bugs, propose new functions or ask for help.\nLicense MIT License\n","description":"LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Related projects Support License Features LexicMap is scalable to up to millions of prokaryotic genomes. The sensitivity of LexicMap is comparable with Blastn. The alignment is fast and memory-efficient. LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs)."},{"id":3,"href":"/LexicMap/usage/utils/kmers/","title":"kmers","parent":"utils","content":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. Reversed means if the k-mer is reversed for suffix matching. 
Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out.tsv.gz] Flags: -h, --help help for kmers -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -m, --mask int ► View k-mers captured by Xth mask. (0 for all) (default 1) -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples The default output is captured k-mers of the first mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAAAAAAAATACTAAGAGGTGACAAAAGAG 4 1 GCF_001544255.1 2142679 + yes 1 AAAAAAAAACAAAGCGGACTTGGACATTGTC 4 1 GCF_000006945.2 3170307 + yes 1 AAAAAAAAACCAGTAAAAAAAGGGGAGTAGA 4 1 GCF_000392875.1 771896 + yes 1 AAAAAAAAACGACTTACCATTAACGTTCAAG 4 1 GCF_003697165.2 803728 + yes 1 AAAAAAAAACTAGGGTTAAATGCCTTATGTT 4 1 GCF_009759685.1 442423 + yes 1 AAAAAAAAAGAGATGAAAAAGGGTGTATTCG 4 1 GCF_001544255.1 1493451 - yes 1 AAAAAAAAATAAAATATCTAACGAGCAAATT 4 1 GCF_001096185.1 2065540 + yes 1 AAAAAAAAATACCATAGACTATGCTCTTAGT 4 1 GCF_000392875.1 134079 - yes 1 AAAAAAAAATAGAGTTTTTTTTCTGGATAAG 4 1 GCF_000392875.1 795189 + yes 1 AAAAAAAAATGTTAACAGAAGGTCCCTACCT 4 1 GCF_002950215.1 2765957 + yes 1 AAAAAAAACAAAAGCTATACTGGTCATGTTC 4 1 GCF_000006945.2 3635995 + yes 1 AAAAAAAACAAAGATACATTTAGGACGGTTA 4 1 GCF_000006945.2 616481 - yes 1 AAAAAAAACAGCCCACCGCCGATTGCGGAAT 4 1 GCF_000742135.1 1208620 + yes 1 
AAAAAAAACAGGGTGTCGTGCCCTTGTCAGT 4 1 GCF_003697165.2 627153 - yes 1 AAAAAAAACAGGGTGTTCTTAGATAAAAGGG 4 1 GCF_000742135.1 1723387 - yes 1 AAAAAAAACATATAGTTGTGAAGGCATTGGA 4 1 GCF_001027105.1 2508079 - yes 1 AAAAAAAACCAGTAAAAAAAGGGGAGTAGAA 4 1 GCF_000392875.1 771895 + yes 1 AAAAAAAACCATATTATGTCCGATCCTCACA 4 1 GCF_000392875.1 1060650 + yes 1 AAAAAAAACCCTTCGTCAAGCATTATGGAAT 4 1 GCF_000392875.1 1139573 - yes Specify the mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ --mask 12345 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ----- ------------------------------- ------ ------ --------------- ------- ------ -------- 12345 CATGTTACAAAAGGTGGGTCAGGCAACGTAT 7 1 GCF_001457655.1 335112 - yes 12345 CATGTTACCAAGGTTAGTCGTATGGCGCTAC 7 1 GCF_001457655.1 23755 - yes 12345 CATGTTACGCGTATTTTAGCGGCTCGCGGAC 7 1 GCF_000006945.2 702224 + yes 12345 CATGTTATAACGGCCTATGAATCGGCATTAC 9 1 GCF_009759685.1 2591866 + no 12345 CATGTTATACGTTGAAACTGTCTTGTTAATA 9 1 GCF_001096185.1 1142460 + yes 12345 CATGTTATACTTTAGATACTTATTTTTAGGA 9 1 GCF_000392875.1 1524553 + no 12345 CATGTTATAGAAGGACGTCGACATCTTGTGG 10 1 GCF_000017205.1 3140677 + no 12345 CATGTTATAGAATTACATACATTGTAACATG 10 1 GCF_006742205.1 704431 - no 12345 CATGTTATAGCACGCTTAATCGCTTGATCCC 13 1 GCF_001027105.1 2655846 + no 12345 CATGTTATAGCATCCTTTTACGTGAAAAGGT 12 1 GCF_000742135.1 4136093 + no 12345 CATGTTATAGCCAGCAAATGGAAGCATCGCG 11 1 GCF_009759685.1 492828 - no 12345 CATGTTATAGCCATTGATGGTAACTTTGATG 11 1 GCF_001096185.1 536843 + no 12345 CATGTTATAGCCTGAAAGGTGCTAAACAACT 11 1 GCF_000006945.2 4876155 + no 12345 CATGTTATAGCCTTCTCCAAGACCAATCAAA 11 1 GCF_000148585.2 1667015 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002949675.1 1871326 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002950215.1 2326544 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_003697165.2 3996124 + no 12345 CATGTTATAGCTAACTGCGACTTGTGGCACA 11 1 GCF_900638025.1 991007 - no 12345 CATGTTATAGTCGTGAGGTTCTAAAAAAACT 10 1 GCF_001544255.1 1091256 - no 
12345 CATGTTATAGTTTGTCTTACCGCTACTGAAA 10 1 GCF_002950215.1 1457055 + yes 12345 CATGTTATATCCTTCTTGAATACGAGCAATA 9 1 GCF_000392875.1 1963573 + no 12345 CATGTTATATGAACCTTCAACCTTATTTGAC 9 1 GCF_001457655.1 1510084 + no 12345 CATGTTATCCAGGTATTTCACCAGCGCACGC 8 1 GCF_000006945.2 836525 + no 12345 CATGTTATCGAATATTATAACATCGGCTCCC 8 1 GCF_000148585.2 1372855 + yes 12345 CATGTTATCGATAAGGCTATATATGACCTTA 8 1 GCF_002950215.1 878140 - no 12345 CATGTTATCGCTCAGGGTCTGCGGGTATATC 8 1 GCF_002950215.1 1880029 + yes 12345 CATGTTATGCGTATAAAGACGAGTAAAGGTT 8 1 GCF_009759685.1 3827118 + no 12345 CATGTTATGCTGGGACATTTAGCACCGCTAC 8 1 GCF_000006945.2 1988134 + yes \u0026ldquo;reversed\u0026rdquo; means means if the k-mer is reversed for suffix matching. E.g., CATGTTACAAAAGGTGGGTCAGGCAACGTAT is reversed, so you need to reverse it before searching in the genome.\n$ seqkit locate -p $(echo CATGTTACAAAAGGTGGGTCAGGCAACGTAT | rev) refs/GCF_001457655.1.fa.gz -M | csvtk pretty -t seqID patternName pattern strand start end ------------- ------------------------------- ------------------------------- ------ ------ ------ NZ_LN831035.1 TATGCAACGGACTGGGTGGAAAACATTGTAC TATGCAACGGACTGGGTGGAAAACATTGTAC - 335112 335142 For all masks. 
The result might be very big, therefore, writing to gzip format is recommended.\n$ lexicmap utils kmers -d demo.lmi/ --mask 0 -o kmers.tsv.gz $ zcat kmers.tsv.gz | csvtk freq -t -f mask -nr | head -n 10 mask frequency 1 610 40000 568 31 435 20 432 39997 423 28 419 30018 415 30027 403 79 396 K-mers of a specific mask\n$ lexicmap utils kmers -d demo.lmi/ -m 12345 | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ----- ------------------------------- ------ ------ --------------- ------- ------ -------- 12345 CATGTTACAAAAGGTGGGTCAGGCAACGTAT 7 1 GCF_001457655.1 335112 - yes 12345 CATGTTACCAAGGTTAGTCGTATGGCGCTAC 7 1 GCF_001457655.1 23755 - yes 12345 CATGTTACGCGTATTTTAGCGGCTCGCGGAC 7 1 GCF_000006945.2 702224 + yes 12345 CATGTTATAACGGCCTATGAATCGGCATTAC 9 1 GCF_009759685.1 2591866 + no 12345 CATGTTATACGTTGAAACTGTCTTGTTAATA 9 1 GCF_001096185.1 1142460 + yes 12345 CATGTTATACTTTAGATACTTATTTTTAGGA 9 1 GCF_000392875.1 1524553 + no 12345 CATGTTATAGAAGGACGTCGACATCTTGTGG 10 1 GCF_000017205.1 3140677 + no 12345 CATGTTATAGAATTACATACATTGTAACATG 10 1 GCF_006742205.1 704431 - no 12345 CATGTTATAGCACGCTTAATCGCTTGATCCC 13 1 GCF_001027105.1 2655846 + no 12345 CATGTTATAGCATCCTTTTACGTGAAAAGGT 12 1 GCF_000742135.1 4136093 + no 12345 CATGTTATAGCCAGCAAATGGAAGCATCGCG 11 1 GCF_009759685.1 492828 - no 12345 CATGTTATAGCCATTGATGGTAACTTTGATG 11 1 GCF_001096185.1 536843 + no 12345 CATGTTATAGCCTGAAAGGTGCTAAACAACT 11 1 GCF_000006945.2 4876155 + no 12345 CATGTTATAGCCTTCTCCAAGACCAATCAAA 11 1 GCF_000148585.2 1667015 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002949675.1 1871326 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002950215.1 2326544 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_003697165.2 3996124 + no 12345 CATGTTATAGCTAACTGCGACTTGTGGCACA 11 1 GCF_900638025.1 991007 - no 12345 CATGTTATAGTCGTGAGGTTCTAAAAAAACT 10 1 GCF_001544255.1 1091256 - no Lengths of shared prefixes between probes and captured k-mers.\nzcat kmers.tsv.gz \\ | csvtk filter2 -t -f 
'$reversed == \u0026quot;no\u0026quot;'\\ | csvtk plot hist -t -f prefix -o prefix.hist.png \\ --xlab \u0026quot;length of common prefixes between captured k-mers and masks\u0026quot; The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. Reversed means if the k-mer is reversed for suffix matching. Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out."},{"id":4,"href":"/LexicMap/tutorials/search/","title":"Searching","parent":"Tutorials","content":" Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned. Input should be (gzipped) FASTA or FASTQ records from files or STDIN.\nHardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. 
Both x86 and ARM chips are supported. More cores are better, as LexicMap is CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 16 GB) is preferred. The memory usage in searching is mainly related to: The number of matched genomes and sequences. The length of query sequences. Similarities between query and target sequences. The number of threads. It uses all CPUs by default (-j/--threads). Disk Sufficient space is required to store the index. No temporary files are generated during searching. Algorithm Masking: The query sequence is masked by the masks of the index. In other words, each mask captures the most similar k-mer, the one that shares the longest prefix with the mask, and stores its position and strand information. Seeding: For each mask, the captured k-mer is used to search for seeds (captured k-mers in reference genomes) sharing prefixes or suffixes of at least p bases. Prefix matching Setting the search range: Since the seeded k-mers are stored in lexicographic order, the k-mer matching turns into a range query. For example, matching a query CATGCT with a prefix of at least 4 bp is equivalent to extracting k-mers ranging from CATGAA, CATGAC, CATGAG, \u0026hellip;, to CATGTT. Finding the nearest smaller k-mer: The index file of each seed data file stores a list (default 512) of k-mers and offsets in the data file, and the index is loaded in RAM. The nearest k-mer smaller than the range start k-mer (CATGAA) is found by binary search, i.e., CATCAC (blue text in the figure), and its offset is returned as the start position for traversing the seed data file. Retrieving seed data: Seed k-mers are read from the file and checked one by one, and k-mers in the search range are returned, along with the k-mer information (genome batch, genome number, location, and strand). Suffix matching Reversing the query k-mer and performing prefix matching, returning seeds of reversed k-mers (see indexing algorithm). 
Chaining: Seeding results, i.e., anchors (matched k-mers from the query and subject sequence), are summarized by genome and deduplicated. Performing chaining (see the paper). Alignment for each chain. Extending the anchor region for extracting sequences from the query and reference genome. For example, extending 2 kb upstream and downstream of the anchor region. Performing pseudo-alignment with the extended query and subject sequences to find similar regions. Similar regions that span more than one reference sequence are split into multiple ones. Fast alignment of query and subject sequence regions with our implementation of the Wavefront alignment algorithm. Filtering alignments based on user options. Parameters Flags in bold text are important and frequently used.\nGeneral Flag Value Function Comment -w/--load-whole-seeds Load the whole seed data into memory for faster search Use this if the index is not big and many queries need to be searched. -n/--top-n-genomes Default 0, 0 for all Keep top N genome matches for a query in the chaining phase The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step. -a/--all Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026ldquo;lexicmap utils 2blast\u0026rdquo; -J/--max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory Chaining Flag Value Function Comment -p, --seed-min-prefix Default 15 Minimum (prefix) length of matched seeds. Smaller values produce more results at the cost of slower speed. -P, --seed-min-single-prefix Default 17 Minimum (prefix) length of matched seeds if there\u0026rsquo;s only one pair of seeds matched. Smaller values produce more results at the cost of slower speed. --seed-max-dist Default 10000 Max distance between seeds in seed chaining. 
--seed-max-gap Default 500 Max gap in seed chaining. Alignment Flag Value Function Comment -Q/--min-qcov-per-genome Default 0 Minimum query coverage (percentage) per genome. -q/--min-qcov-per-hsp Default 0 Minimum query coverage (percentage) per HSP. -l/--align-min-match-len Default 50 Minimum aligned length in a HSP segment. -i/--align-min-match-pident Default 70 Minimum base identity (percentage) in a HSP segment. --align-band Default 50 Band size in backtracking the score matrix. --align-ext-len Default 2000 Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. --align-max-gap Default 20 Maximum gap in a HSP segment. Steps For short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --align-min-match-pident 70 \\ --min-qcov-per-hsp 70 \\ --min-qcov-per-genome 70 \\ --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --align-min-match-pident 70 \\ --min-qcov-per-hsp 0 \\ --min-qcov-per-genome 0 \\ --top-n-genomes 0 Log of a demo run. ... $ lexicmap search -d demo.lmi/ q.gene.fasta -o q.gene.fasta.lexicmap.tsv 09:32:55.551 [INFO] LexicMap v0.4.0 09:32:55.551 [INFO] https://github.com/shenwei356/LexicMap 09:32:55.551 [INFO] 09:32:55.551 [INFO] checking input files ... 09:32:55.551 [INFO] 1 input file(s) given 09:32:55.551 [INFO] 09:32:55.551 [INFO] loading index: demo.lmi/ 09:32:55.551 [INFO] reading masks... 09:32:55.552 [INFO] reading indexes of seeds (k-mer-value) data... 09:32:55.555 [INFO] creating genome reader pools, each batch with 16 readers... 09:32:55.555 [INFO] index loaded in 4.192051ms 09:32:55.555 [INFO] 09:32:55.555 [INFO] searching ... 
09:32:55.596 [INFO] 09:32:55.596 [INFO] processed queries: 1, speed: 1467.452 queries per minute 09:32:55.596 [INFO] 100.0000% (1/1) queries matched 09:32:55.596 [INFO] done searching 09:32:55.596 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv 09:32:55.596 [INFO] 09:32:55.596 [INFO] elapsed time: 45.230604ms 09:32:55.596 [INFO] Extracting similar sequences for a query gene.\n# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --min-qcov-per-hsp 90 --all -o results.tsv # extract matched sequences as FASTA format sed 1d results.tsv | awk -F\u0026#39;\\t\u0026#39; \u0026#39;{print \u0026#34;\u0026gt;\u0026#34;$5\u0026#34;:\u0026#34;$14\u0026#34;-\u0026#34;$15\u0026#34;:\u0026#34;$16\u0026#34;\\n\u0026#34;$20;}\u0026#39; | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT Exporting blast-like alignment text.\nFrom file:\nlexicmap utils 2blast results.tsv -o results.txt From stdin:\n# align only one long-read \u0026lt;= 500 bp $ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 1 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 
|||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 Output Alignment result relationship Query ├── Subject genome # A query might have one or more genome hits, ├── Subject sequence # in different sequences. ├── High-Scoring segment Pair (HSP) # HSP is an alignment segment. Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs.\nA High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/\nOutput format Tab-delimited format with 17+ columns, with 1-based positions.\n1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. 
qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026quot;|\u0026quot; and \u0026quot; \u0026quot;) between qseq and sseq. (optional with -a/--all) Examples A single-copy gene (SecY) query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------------------------------------- ---- ---- --------------- -------------------- ------- --- ------- ------- ------- ---- ------ ---- ------ ------ ---- ------- lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_000395405.1 NZ_KB947497.1 100.000 1 100.000 1299 100.000 0 1 1299 232279 233577 + 274511 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_019731615.1 NZ_JAASJA010000010.1 100.000 1 100.000 1299 100.000 0 1 1299 2798 4096 + 42998 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCA_004103085.1 RPCL01000012.1 100.000 1 100.000 1299 100.000 0 1 1299 44095 45393 + 84242 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_023571745.1 NZ_JAMKBS010000014.1 100.000 1 100.000 1299 100.000 0 1 1299 44077 45375 + 84206 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_013248625.1 NZ_JABTDK010000002.1 100.000 1 100.000 1299 100.000 0 1 1299 9609 10907 + 
49787 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900092155.1 NZ_FLUS01000006.1 100.000 1 100.000 1299 100.000 0 1 1299 63161 64459 + 77366 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902165815.1 NZ_CABHHZ010000005.1 100.000 1 100.000 1299 100.000 0 1 1299 39386 40684 - 200163 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_014243495.1 NZ_SJAV01000002.1 100.000 1 100.000 1299 100.000 0 1 1299 39085 40383 - 256772 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900148695.1 NZ_FRXS01000009.1 100.000 1 100.000 1299 100.000 0 1 1299 39230 40528 - 96692 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902164645.1 NZ_LR607334.1 100.000 1 100.000 1299 100.000 0 1 1299 236677 237975 + 3380663 A 16S rRNA gene query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen --------------------------- ---- ------ --------------- ----------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- NC_000913.3:4166659-4168200 1542 293398 GCF_002248685.1 NZ_NQBE01000079.1 100.000 1 100.000 1542 100.000 0 1 1542 40 1581 - 99259 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 1 100.000 1542 100.000 0 1 1542 1270211 1271752 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 2 100.000 1542 100.000 0 1 1542 5466287 5467828 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 3 100.000 1543 99.546 2 1 1542 557008 558549 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 4 100.000 1543 99.482 2 1 1542 4473658 4475199 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 5 100.000 1543 99.482 2 1 1542 5154150 5155691 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 6 100.000 1543 99.482 2 1 1542 5195176 5196717 - 5483624 NC_000913.3:4166659-4168200 1542 293398 
GCF_017164795.1 NZ_CP062702.1 100.000 7 100.000 1543 99.482 2 1 1542 5369865 5371406 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701684.1 100.000 1 100.000 1542 100.000 0 1 1542 1108651 1110192 - 1914390 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701686.1 100.000 2 100.000 1542 99.741 0 1 1542 100680 102221 + 102235 A plasmid query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------- ----- ----- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ----- ------- ------- ---- ------- CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 4 0.916 484 91.116 0 51686 52169 27192 27675 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 5 0.829 438 90.868 1 52342 52779 26583 27019 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086534.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 4 0.916 484 91.116 0 51686 52169 3843 4326 + 34058 CP115019.1 52830 58744 GCF_022759905.1 
NZ_CP086547.1 97.473 5 0.829 438 90.868 1 52342 52779 4499 4935 + 34058 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086546.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 1 77.157 40762 99.993 0 12069 52830 9513 50274 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 2 18.033 9528 99.990 1 1207 10733 1 9528 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 3 2.283 1206 100.000 0 1 1206 50275 51480 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058618.1 97.473 4 2.497 1319 100.000 0 25153 26471 3019498 3020816 - 4718403 Long reads Queries are a few Nanopore Q20 reads from a mock metagenomic community.\nquery qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 Search results (TSV format) above are formatted with csvtk pretty.\n","description":"Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length"},{"id":5,"href":"/LexicMap/usage/utils/genomes/","title":"genomes","parent":"utils","content":" Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). 
(default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 8) Examples $ lexicmap utils genomes -d demo.lmi/ GCF_000006945.2 GCF_000017205.1 GCF_000148585.2 GCF_000392875.1 GCF_000742135.1 GCF_001027105.1 GCF_001096185.1 GCF_001457655.1 GCF_001544255.1 GCF_002949675.1 GCF_002950215.1 GCF_003697165.2 GCF_006742205.1 GCF_009759685.1 GCF_900638025.1 ","description":"Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). 
If given, they are appended to files from CLI arguments."},{"id":6,"href":"/LexicMap/installation/","title":"Installation","parent":"","content":"LexicMap can be installed via conda, executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, 中国镜像 Linux 64-bit lexicmap_linux_amd64.tar.gz, 中国镜像 Linux arm64 lexicmap_linux_arm64.tar.gz, 中国镜像 Decompress it:\ntar -zxvf lexicmap_linux_amd64.tar.gz If you have the root privilege, simply copy it to /usr/local/bin:\nsudo cp lexicmap /usr/local/bin/ If you don\u0026rsquo;t have the root privilege, copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp lexicmap $HOME/bin/ And optionally add the directory to the environment variable PATH if it\u0026rsquo;s not already there.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration MacOS Download the binary file.\nOS Arch File, 中国镜像 macOS 64-bit lexicmap_darwin_amd64.tar.gz, 中国镜像 macOS arm64 lexicmap_darwin_arm64.tar.gz, 中国镜像 Copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp lexicmap $HOME/bin/ And optionally add the directory to the environment variable PATH if it\u0026rsquo;s not already there.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration Windows Download the binary file.\nOS Arch File, 中国镜像 Windows 64-bit lexicmap_windows_amd64.exe.tar.gz, 中国镜像 Decompress it.\nCopy lexicmap.exe to 
C:\\WINDOWS\\system32.\nOthers Please open an issue to request binaries for other platforms. Or compile from the source. Compile from the source Install go.\nwget https://go.dev/dl/go1.22.4.linux-amd64.tar.gz tar -zxf go1.22.4.linux-amd64.tar.gz -C $HOME/ # or # echo \u0026quot;export PATH=$PATH:$HOME/go/bin\u0026quot; \u0026gt;\u0026gt; ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile LexicMap.\n# ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/LexicMap/lexicmap # The executable binary file is located in: # ~/go/bin/lexicmap # You can also move it anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/lexicmap $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/LexicMap cd LexicMap/lexicmap/ go build # The executable binary file is located in: # ./lexicmap # You can also move it anywhere in the $PATH mkdir -p $HOME/bin cp ./lexicmap $HOME/bin/ Shell-completion Supported shells: bash|zsh|fish|powershell\nBash:\n# generate the completion script lexicmap autocompletion --shell bash # configure if you have never done so. # install bash-completion if the \u0026quot;complete\u0026quot; command is not found. 
echo \u0026quot;for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\u0026quot; \u0026gt;\u0026gt; ~/.bash_completion echo \u0026quot;source ~/.bash_completion\u0026quot; \u0026gt;\u0026gt; ~/.bashrc Zsh:\n# generate completion shell lexicmap autocompletion --shell zsh --file ~/.zfunc/_kmcp # configure if never did echo 'fpath=( ~/.zfunc \u0026quot;${fpath[@]}\u0026quot; )' \u0026gt;\u0026gt; ~/.zshrc echo \u0026quot;autoload -U compinit; compinit\u0026quot; \u0026gt;\u0026gt; ~/.zshrc fish:\nlexicmap autocompletion --shell fish --file ~/.config/fish/completions/lexicmap.fish ","description":"LexicMap can be installed via conda, executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, 中国镜像 Linux 64-bit lexicmap_linux_amd64.tar.gz, 中国镜像 Linux arm64 lexicmap_linux_arm64.tar.gz, 中国镜像 Decompress it:\ntar -zxvf lexicmap_linux_amd64.tar.gz If you have the root privilege, simply copy it to /usr/local/bin:"},{"id":7,"href":"/LexicMap/usage/search/","title":"search","parent":"Usage","content":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed. 
Hence, you can search with default parameters and then filter the result with tools like awk or csvtk. Alignment result relationship: Query ├── Subject genome ├── Subject sequence ├── High-Scoring segment Pair (HSP) Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs. \u0026gt; A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the \u0026gt; highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/ Output format: Tab-delimited format with 17+ columns, with 1-based positions. 1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) Usage: lexicmap search [flags] -d \u0026lt;index path\u0026gt; [query.fasta.gz ...] [-o query.tsv.gz] Flags: --align-band int ► Band size in backtracking the score matrix (pseduo alignment phase). 
(default 50) --align-ext-len int ► Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. (default 2000) --align-max-gap int ► Maximum gap in a HSP segment. (default 20) -l, --align-min-match-len int ► Minimum aligned length in a HSP segment. (default 50) -i, --align-min-match-pident float ► Minimum base identity (percentage) in a HSP segment. (default 70) -a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. -h, --help help for search -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --load-whole-seeds ► Load the whole seed data into memory for faster search. --max-open-files int ► Maximum opened files. (default 512) -J, --max-query-conc int ► Maximum number of concurrent queries. Bigger values do not improve the batch searching speed and consume much memory. (default 12) -Q, --min-qcov-per-genome float ► Minimum query coverage (percentage) per genome. -q, --min-qcov-per-hsp float ► Minimum query coverage (percentage) per HSP. -o, --out-file string ► Out file, supports a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) --pseudo-align ► Only perform pseudo alignment, alignment metrics, including qcovGnm, qcovSHP and pident, will be less accurate. --seed-max-dist int ► Max distance between seeds in seed chaining. (default 10000) --seed-max-gap int ► Max gap in seed chaining. (default 500) -p, --seed-min-prefix int ► Minimum (prefix) length of matched seeds. (default 15) -P, --seed-min-single-prefix int ► Minimum (prefix) length of matched seeds if there\u0026#39;s only one pair of seeds matched. (default 17) -n, --top-n-genomes int ► Keep top N genome matches for a query (0 for all) in chaining phase. Global Flags: -X, --infile-list string ► File of input file list (one file per line). 
If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples See Searching ","description":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed."},{"id":8,"href":"/LexicMap/usage/utils/subseq/","title":"subseq","parent":"utils","content":" Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Usage: lexicmap utils subseq [flags] Flags: -h, --help help for subseq -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --line-width int ► Line width of sequence (0 for no wrap). (default 60) -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -n, --ref-name string ► Reference name. 
-r, --region string ► Region of the subsequence (1-based). -R, --revcom ► Extract subsequence on the negative strand. -s, --seq-id string ► Sequence ID. If the value is empty, the positions in the region are treated as that in the concatenated sequence. Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples Extracting subsequence with genome ID, sequence ID, position range and strand information.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R \u0026gt;NZ_CP033092.2:4591684-4593225:- AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA 
AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA If the sequence ID (-s/--seq-id) is not given, the positions are these in the concatenated sequence.\nChecking sequence lengths of a genome with seqkit.\n$ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz NZ_CP033092.2 4903501 NZ_CP033091.2 131333 Extracting the 1000-bp interval sequence inserted by lexicmap index.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 \u0026gt;GCF_003697165.2:4903502-4904501:+ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA It detects if the end position is larger than the sequence length.\n# the 
length of NZ_CP033092.2 is 4903501 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000 \u0026gt;NZ_CP033092.2:4903501-4903501:+ C $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000 \u0026gt;NZ_CP033092.2:4903502-4903501:+ ","description":"Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes."},{"id":9,"href":"/LexicMap/releases/","title":"Releases","parent":"","content":" Latest version v0.4.0 v0.4.0 - 2024-07-xx New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). Fix skipping interval regions by further including the last k-1 bases of contigs. Fix a bug in indexing small genomes. Change the default value of -b, --batch-size from 10,000 to 5,000. Improve lexichash data structure. Write and merge seed data in parallel, new flag -J/--seed-data-threads. Improve the log. lexicmap search: Fix chaining for highly-repetitive regions. Perform more accurate alignment with WFA. Fix object recycling and reduce memory usage. Fix alignment against genomes with many short contigs. Fix early quit when meeting a sequence shorter than k. 
Add a new option -J/--max-query-conc to limit the maximum number of concurrent queries, with a default value of 12 instead of the number of CPUs, which reduces the memory usage in batch searching. Result format: Cluster alignments of each target sequence. Remove the column seeds. Add columns gaps, cigar, align, which can be reformatted with lexicmap utils 2blast. lexicmap utils kmers: Fix the progress bar. Fix a bug where some masks do not have any k-mer. Add a new column prefix to show the length of common prefix between the seed and the probe. Add a new column reversed to indicate if the k-mer is reversed for suffix matching. lexicmap utils masks: Add the support of only outputting a specific mask. lexicmap utils seed-pos: New columns: sseqid and pos_seq. More accurate seed distance. Add histograms of numbers of seeds in sliding windows. lexicmap utils subseq: Fix a bug when the given end position is larger than the sequence length. Add the strand (\u0026quot;+\u0026quot; or \u0026ldquo;-\u0026rdquo;) in the sequence header. Please run lexicmap version to check update !!! Please run lexicmap autocompletion to update shell autocompletion script !!! Previous versions v0.3.0 v0.3.0 - 2024-05-14 lexicmap index: Better seed coverage by filling sketching deserts. Use longer (1000bp N\u0026rsquo;s, previous: k-1) intervals between contigs. Fix a concurrency bug between genome data writing and k-mer-value data collecting. Change the format of k-mer-value index file, and fix the computation of index partitions. Optionally save seed positions which can be outputted by lexicmap utils seed-pos. lexicmap search: Improved seed-chaining algorithm. Better support of long queries. Add a new flag -w/--load-whole-seeds for loading the whole seed data into memory for faster search. Parallelize alignment in each query, so it\u0026rsquo;s faster for a single query. Optionally output matched query and subject sequences. 2-5X searching speed with a faster masking method. 
Change output format. Add output of query start and end positions. Fix a target sequence extracting bug. Keep indexes of genome data in memory. lexicmap utils kmers: Fix a minor bug: wrong number of k-mers for the second k-mer in each k-mer pair. New commands: lexicmap utils gen-masks for generating masks from the top N largest genomes. lexicmap utils seed-pos for extracting seed positions via reference names. lexicmap utils reindex-seeds for recreating indexes of k-mer-value (seeds) data. lexicmap utils genomes for listing genome IDs in the index. v0.2.0 v0.2.0 - 2024-02-02 Software architecture and index formats are redesigned to reduce memory usage in searching. Indexing: genomes are processed in batches to reduce RAM usage, then indexes of all batches are merged. Searching: seed matching is performed on disk yet it\u0026rsquo;s ultra-fast. v0.1.0 v0.1.0 - 2024-01-15 The first release. Seed indexing and querying are performed in RAM. GTDB r214 with 10k masks: index size 75GB, RAM: 130GB. ","description":"Latest version v0.4.0 v0.4.0 - 2024-07-xx New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). Fix skipping interval regions by further including the last k-1 bases of contigs."},{"id":10,"href":"/LexicMap/usage/utils/seed-pos/","title":"seed-pos","parent":"utils","content":" Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. 
For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Extra columns: Using -v/--verbose will output more columns: len_aaa, length of consecutive A\u0026#39;s. seq, sequence between the previous and current seed. Figures: Using -O/--plot-dir will write plots into given directory: - Histograms of seed distances. - Histograms of numbers of seeds in sliding windows. Usage: lexicmap utils seed-pos [flags] Flags: -a, --all-refs ► Output for all reference genomes. This would take a long time for an index with a lot of genomes. -b, --bins int ► Number of bins in histograms. (default 100) --color-index int ► Color index (1-7). (default 1) --force ► Overwrite existing output directory. --height float ► Histogram height (unit: inch). (default 4) -h, --help help for seed-pos -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --max-open-files int ► Maximum opened files, used for extracting sequences. (default 512) -D, --min-dist int ► Only output records with seed distance \u0026gt;= this value. -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -O, --plot-dir string ► Output directory for 1) histograms of seed distances, 2) histograms of numbers of seeds in sliding windows. --plot-ext string ► Histogram plot file extention. (default \u0026#34;.png\u0026#34;) -n, --ref-name strings ► Reference name(s). 
-s, --slid-step int ► The step size of sliding windows for counting the number of seeds (default 200) -w, --slid-window int ► The window size of sliding windows for counting the number of seeds (default 500) -v, --verbose ► Show more columns including position of the previous seed and sequence between the two seeds. Warning: it\u0026#39;s slow to extract the sequences, recommend set -D 1000 or higher values to filter results --width float ► Histogram width (unit: inch). (default 6) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples Adding the flag --save-seed-pos in index building.\n$ lexicmap index -I refs/ -O demo.lmi --save-seed-pos --force Listing seed position of one genome.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv $ head -n 10 seed_distance.tsv | csvtk pretty -t ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 16 16 + 15 GCF_000017205.1 NC_009656.1 18 18 + 2 GCF_000017205.1 NC_009656.1 71 71 + 53 GCF_000017205.1 NC_009656.1 74 74 - 3 GCF_000017205.1 NC_009656.1 119 119 - 45 GCF_000017205.1 NC_009656.1 123 123 + 4 GCF_000017205.1 NC_009656.1 154 154 + 31 GCF_000017205.1 NC_009656.1 185 185 + 31 GCF_000017205.1 NC_009656.1 269 269 - 84 Check the biggest seed distances.\n$ csvtk freq -t -f distance seed_distance.tsv \\ | csvtk sort -t -k distance:nr \\ | head -n 10 \\ | csvtk pretty -t distance frequency -------- --------- 199 49 198 47 197 40 196 38 195 54 194 36 193 38 192 55 191 40 Or only list records with seed distances longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -D 190 \\ | csvtk 
pretty -t | head -n 5 ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 13549 13549 + 196 GCF_000017205.1 NC_009656.1 27667 27667 - 190 GCF_000017205.1 NC_009656.1 65318 65318 + 197 Plot histogram of distances between seeds and histogram of number of seeds in sliding windows.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv --plot-dir seed_distance In the plot below, there\u0026rsquo;s a peak at 50 bp, because LexicMap fills sketching deserts with extra k-mers (seeds) of which their distance is 50 bp by default.\nMore columns including sequences between two seeds.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v \\ | head -n4 | csvtk pretty -t -W 40 --clip ref seqid pos_gnm pos_seq strand distance len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 16 16 + 15 2 TTAAAGAGACCGGCG GCF_000017205.1 NC_009656.1 18 18 + 2 0 AT GCF_000017205.1 NC_009656.1 71 71 + 53 6 TCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCA... Or only list records with seed distance longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v -D 190 \\ | head -n 2 \\ | csvtk pretty -t -W 40 ref seqid pos_gnm pos_seq strand distance len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 13549 13549 + 196 15 CGAAGCGGCGCCGGCGGACATGTACGACAAGGACCTGGAT GTCTCGGTGGCCGCCATGAGCCGCGAACTGGCCAAGTATG TACGGGCCTATCCGAGCCAGTACATGTGGAGCATGAAGCG CTTCAAGAACCGCCCGGACGGCGAGAAGAAGTGGTACTGA AAAAAGGCGTCGGAAGACGCCTTTTTCATATCCGGG Listing seed position of all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz Show the number of seed positions in each genome. 
Frequencies larger than 40000 (the number of masks) mean that some k-mers can be found in more than one position in a genome.\n$ csvtk freq -t -f ref -nr seed-pos.tsv.gz | csvtk pretty -t ref frequency --------------- --------- GCF_000017205.1 134541 GCF_000742135.1 103771 GCF_003697165.2 92087 GCF_000006945.2 90683 GCF_002950215.1 89638 GCF_002949675.1 84337 GCF_009759685.1 72711 GCF_001027105.1 56737 GCF_000392875.1 55772 GCF_006742205.1 52699 GCF_001544255.1 50000 GCF_900638025.1 46638 GCF_001096185.1 46195 GCF_001457655.1 45822 GCF_000148585.2 44982 Plot the histograms of distances between seeds for all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz \\ --plot-dir seed_distance --force 09:56:34.059 [INFO] creating genome reader pools, each batch with 1 readers... processed files: 15 / 15 [======================================] ETA: 0s. done 09:56:34.656 [INFO] seed positions of 15 genomes(s) saved to seed-pos.tsv.gz 09:56:34.656 [INFO] histograms of 15 genomes(s) saved to seed_distance 09:56:34.656 [INFO] 09:56:34.656 [INFO] elapsed time: 598.080462ms 09:56:34.656 [INFO] $ ls seed_distance/ GCF_000006945.2.png GCF_000742135.1.png GCF_001544255.1.png GCF_006742205.1.png GCF_000006945.2.seed_number.png GCF_000742135.1.seed_number.png GCF_001544255.1.seed_number.png GCF_006742205.1.seed_number.png GCF_000017205.1.png GCF_001027105.1.png GCF_002949675.1.png GCF_009759685.1.png GCF_000017205.1.seed_number.png GCF_001027105.1.seed_number.png GCF_002949675.1.seed_number.png GCF_009759685.1.seed_number.png GCF_000148585.2.png GCF_001096185.1.png GCF_002950215.1.png GCF_900638025.1.png GCF_000148585.2.seed_number.png GCF_001096185.1.seed_number.png GCF_002950215.1.seed_number.png GCF_900638025.1.seed_number.png GCF_000392875.1.png GCF_001457655.1.png GCF_003697165.2.png GCF_000392875.1.seed_number.png GCF_001457655.1.seed_number.png GCF_003697165.2.seed_number.png In the plots below, there\u0026rsquo;s a peak at 150 bp, because LexicMap fills 
sketching deserts with extra k-mers (seeds) of which their distance is 150 bp by default. And they show that the seed number, seed distance and seed density are related to genome sizes.\nGCF_000392875.1 (genome size: 2.9 Mb)\n","description":"Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;."},{"id":11,"href":"/LexicMap/tutorials/","title":"Tutorials","parent":"","content":"","description":""},{"id":12,"href":"/LexicMap/usage/utils/","title":"utils","parent":"Usage","content":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line)."},{"id":13,"href":"/LexicMap/usage/utils/reindex-seeds/","title":"reindex-seeds","parent":"utils","content":" Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. (default 512) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples $ lexicmap utils reindex-seeds -d demo.lmi/ --partitions 1024 10:20:29.150 [INFO] recreating seed indexes with 1024 partitions for: demo.lmi/ processed files: 16 / 16 [======================================] ETA: 0s. 
done 10:20:29.166 [INFO] update index information file: demo.lmi/info.toml 10:20:29.166 [INFO] finished updating the index information file: demo.lmi/info.toml 10:20:29.166 [INFO] 10:20:29.166 [INFO] elapsed time: 15.981266ms 10:20:29.166 [INFO] ","description":"Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. (default 512) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments."},{"id":14,"href":"/LexicMap/usage/","title":"Usage","parent":"","content":"","description":""},{"id":15,"href":"/LexicMap/faqs/","title":"FAQs","parent":"","content":" Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. The alignment speed, however, is almost unaffected.\nDoes LexicMap support fungi genomes? Yes. LexicMap mainly supports small genomes including prokaryotic, viral, and plasmid genomes. Fungi can also be supported; just remember to increase the value of -g/--max-genome when running lexicmap index, which is used to skip genomes larger than 15Mb by default.\n-g, --max-genome int ► Maximum genome size. 
Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. (default 15000000) Maximum genome size is about 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nFor big and complex genomes, like the human genome (chr1 is ~248 Mb) which has many repetitive sequences, LexicMap would be slow to align.\nHow\u0026rsquo;s the hardware requirement? For index building. See details hardware requirement. For searching. See details hardware requirement. Can I extract the matched sequences? Yes, lexicmap search has a flag\n-a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. to output the CIGAR string and the aligned query and subject sequences.\n18. cigar, CIGAR string of the alignment (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) And lexicmap utils 2blast can help to convert the tabular format to Blast-style format, see examples.\nHow can I extract the upstream and downstream flanking sequences of matched regions? lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions. So you can use this information from the search result and expand the region positions to extract flanking sequences.\nWhy isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? It happens if there are some degenerate bases (e.g., N) in the query sequence. In the indexing step, all degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. 
The query sequences, however, are not converted.\nWhy is LexicMap slow for batch searching? LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 16 million) of genomes.\nlexicmap search has a flag -w/--load-whole-seeds to load the whole seed data into memory for faster search.\nFor example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters. lexicmap search also has a flag --pseudo-align to only perform pseudo alignment, which is slightly faster and uses less memory. It can be used in searching with long and divergent query sequences like nanopore long-reads.\nClick to read more details of the usage.\n","description":"Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size."},{"id":16,"href":"/LexicMap/notes/","title":"Notes","parent":"","content":"","description":""},{"id":17,"href":"/LexicMap/","title":"","parent":"","content":" LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: 
indexing\nlexicmap index -I genomes/ -O db.lmi Step 2: searching\nlexicmap search -d db.lmi q.fasta -o r.tsv Tutorials Usages FAQs Notes Accurate and efficient alignment Using LexicMap to search in the whole 2,340,672 Genbank+Refseq prokaryotic genomes with 48 CPUs.\nQuery Genome hits Time RAM A 1.3-kb marker gene 36,633 21s 3.4 GB A 1.5-kb 16S rRNA 1,928,372 6m40s 16.7 GB A 52.8-kb plasmid 551,264 8m54s 20.1 GB 1003 AMR genes 27,577,060 5h18m 41.3 GB Blastn is unable to run with the same dataset on common servers as it requires \u0026gt;2000 GB RAM.\nPerformance ","description":"LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: indexing"},{"id":18,"href":"/LexicMap/usage/utils/2blast/","title":"2blast","parent":"utils","content":" Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G. 
You need increase the value when \u0026#34;bufio.Scanner: token too long\u0026#34; error reported (default \u0026#34;20M\u0026#34;) -h, --help help for 2blast -i, --ignore-case ► Ignore cases of sgenome and sseqid -g, --kv-file-genome string ► Two-column tabular file for mapping the target genome ID (sgenome) to the corresponding value -s, --kv-file-seq string ► Two-column tabular file for mapping the target sequence ID (sseqid) to the corresponding value -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples From stdin.\n$ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 2 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_000017205.1_r160 Length = 478 [Subject genome #1/1] = GCF_000017205.1 Pseudomonas aeruginosa Query coverage per genome = 95.188% \u0026gt;NC_009656.1 Length = 6588339 HSP #1 Query coverage per seq = 95.188%, Aligned length = 463, Identities = 95.680%, Gaps = 12 Query range = 13-467, Subject range = 4866862-4867320, Strand = Plus/Plus Query 13 CCTCAAACGAGTCC-AACAGGCCAACGCCTAGCAATCCCTCCCCTGTGGGGCAGGGAAAA 71 |||||||||||||| |||||||| |||||| | ||||||||||||| |||||||||||| Sbjct 4866862 CCTCAAACGAGTCCGAACAGGCCCACGCCTCACGATCCCTCCCCTGTCGGGCAGGGAAAA 4866921 Query 72 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCATCTTCCACGGTGCCC 131 |||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| Sbjct 4866922 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCAT-TTCCACGGTGCCC 4866980 Query 132 
GCCCACGGCGGACCCGCGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTA-CA 190 ||| ||||||||||| ||||||||||||||||||||||||||||||||||||||||| || Sbjct 4866981 GCC-ACGGCGGACCC-CGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTATCA 4867038 Query 191 CTCGGCGTTCGGCCAGCGACAGC---GACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 247 |||||||| |||||||||||||| |||||||||||||||||||||||||||||||||| Sbjct 4867039 CTCGGCGT-CGGCCAGCGACAGCAGCGACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 4867097 Query 248 AGGTGGTGCGCTCGCTGAC-AAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCG- 305 ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 4867098 AGGTGGTGCGCTCGCTGACGAAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCGG 4867157 Query 306 TGCCGGACAGCCCGTGGCCGCCGAACAGTTGCACGCCCACCACCGCGCCGAT-TGGTTTC 364 |||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| | Sbjct 4867158 TGCCGGACAGCCCGTGGCCGCCGAACGGTTGCACGCCCACCACCGCGCCGATCTGGTTGC 4867217 Query 365 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTGGATGCGGCGGGCGGTTTCCTCGT 424 |||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| Sbjct 4867218 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTCGATGCGGCGGGCGGTTTCCTCGT 4867277 Query 425 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 467 ||||||||||||||||||||||||||||||||||||||||||| Sbjct 4867278 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 4867320 Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 
GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 From file.\n$ lexicmap utils 2blast r.lexicmap.tsv -o r.lexicmap.txt ","description":"Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. 
Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G."},{"id":19,"href":"/LexicMap/tutorials/index/","title":"Building an index","parent":"Tutorials","content":" Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz Run: From a directory with multiple genome files:\nlexicmap index -I genomes/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files.txt -O db.lmi Input Genome size\nLexicMap is mainly suitable for small genomes like Archaea, Bacteria, Viruses and plasmids.\nMaximum genome size: 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nSequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names.Click to show\nFile type: FASTA/Q files, in plain text or gzip/xz/zstd/bzip2 compressed formats. File name: \u0026ldquo;Genome ID\u0026rdquo; + \u0026ldquo;File extention\u0026rdquo;. E.g., GCF_000006945.2.fna.gz. Genome ID: they should be distinct for accurate result interpretation, which will be shown in the search result. File extention: a regular expression set by the flag -N/--ref-name-regexp is used to extract genome IDs from the file name. The default value supports common sequence file extentions, e.g., .fa, .fasta, .fna, .fa.gz, .fasta.gz, .fna.gz, fasta.xz, fasta.zst, and fasta.bz2. brename can help to batch rename files safely. If you don\u0026rsquo;t want to change the original file names, you can Create and change to a new directory. Create symbolic links (ln -s) for all genome files. 
Batch rename all the symbolic links with brename. Use this directory as input via the flag -I/--in-dir. Sequences: Only DNA or RNA sequences are supported. Sequence IDs should be distinct for accurate result interpretation, which will be shown in the search result. One or more sequences in each file are allowed. Unwanted sequences can be filtered out by regular expressions from the flag -B/--seq-name-filter. Genome size limit. Some none-isolate assemblies might have extremely large genomes, e.g., GCA_000765055.1 has \u0026gt;150 Mb. The flag -g/--max-genome (default 15 Mb) is used to skip these input files, and the file list would be written to a file via the flag -G/--big-genomes. At most 17,179,869,184 (234) genomes are supported. For more genomes, just build multiple indexes. Input files can be given via one of the following ways:\nPositional arguments. For a few input files. A file list via the flag -X/--infile-list with one file per line. It can be STDIN (-), e.g., you can filter a file list and pass it to lexicmap index. The flag -S/--skip-file-check is optional for skiping input file checking if you believe these files do exist. A directory containing input files via the flag -I/--in-dir. Multiple-level directories are supported. Directory and file symlinks are followed. Hardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 100 GB) is preferred. The memory usage in index building is mainly related to: The number of masks (-m/--masks, default 40,000). The number of genomes. The divergence between genome sequences. Diverse genomes consume more memory. The genome batch size (-b/--batch-size, default 5,000). 
This is the main parameter to adjust memory usage. The maximum seed distance or the maximum sketching desert size (-D/--seed-max-desert, default 200), and the distance of k-mers to fill deserts (-d/--seed-in-desert-dist, default 50). Bigger -D/--seed-max-desert values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. If the RAM is not sufficient. Please: Use a smaller genome batch size. It decreases indexing memory occupation and has little affection on searching performance. Use a smaller number of masks, e.g., 20,000 performs well for small genomes (\u0026lt;=5 Mb). And if the queries are long (\u0026gt;= 2kb), there\u0026rsquo;s little affection for the alignment results. Disk More (\u0026gt;2 TB) is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples. Note that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genome bring higher compression rates. SSD disks are preferred, while HDD disks are also fast enough. Algorithm Generating m LexicHash masks.\nGenerate m prefixes. Generating all permutations of p-bp prefixes that can cover all possible k-mers, p is the biggest value for 4p \u0026lt;= m (desired number of masks), e.g., p=7 for 40,000 masks. Removing low-complexity prefixes. E.g., 16176 out of 16384 (4^7) prefixes are left. Duplicating these prefixes to m prefixes. For each prefix, Randomly generating left k-p bases. If the P-prefix (-p/--seed-min-prefix) is of low-complexity, re-generating. P is the minimum length of substring matches, default 15. If the mask is duplicated, re-generating. 
Building an index for each genome batch (-b/--batch-size, default 10,000, max 131,072).\nFor each genome file in a genome batch. Optionally discarding sequences via regular expression (-B/--seq-name-filter). Skipping genomes bigger than the value of -g/--max-genome. Concatenating all sequences, with intervals of 1000-bp N\u0026rsquo;s. Capturing the most similar k-mer (in non-gap and non-interval regions) for each mask and recording the k-mer and its location(s) and strand information. Base N is treated as A. Filling sketching deserts (genome regions longer than --seed-max-desert without any captured k-mers/seeds). In a sketching desert, not a single k-mer is captured because there\u0026rsquo;s another k-mer in another place which shares a longer prefix with the mask. As a result, for a query similar to seqs in this region, all captured k-mers can’t match the correct seeds. For a desert region (start, end), masking the extended region (start-1000, end+1000) with the masks. Starting from start, every around --seed-in-desert-dist (default 150) bp, finding a k-mer which is captured by some mask, and add the k-mer and its position information into the index of that mask. Saving the concatenated genome sequence (bit-packed, 2 bits for one base, N is treated as A) and genome information (genome ID, size, and lengths of all sequences) into the genome data file, and creating an index file for the genome data file for fast random subsequence extraction. Duplicate and reverse all k-mers, and save each reversed k-mer along with the duplicated position information in the seed data of the closest (sharing the longgest prefix) mask. This is for suffix matching of seeds. Compressing k-mers and the corresponding data (k-mer-data, or seeds data, including genome batch, genome number, location, and strand) into chunks of files, and creating an index file for each k-mer-data file for fast seeding. Writing summary information into info.toml file. 
Merging indexes of multiple batches.\nFor each k-mer-data chunk file (belonging to a list of masks), serially reading data of each mask from all batches, merging them and writting to a new file. For genome data files, just moving them. Concatenating genomes.map.bin, which maps each genome ID to its batch ID and index in the batch. Update the index summary file. Parameters Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highy similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.\nFlags in bold text are important and frequently used.\nGenome batches Flag Value Function Comment -b/--batch-size Max: 131072, default: 5000 Maximum number of genomes in each batch If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. LexicHash mask generation Flag Value Function Comment -M/--mask-file A file File with custom masks File with custom masks, which could be exported from an existing index or newly generated by \u0026ldquo;lexicmap utils masks\u0026rdquo;. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, etc. 
-k/--kmer Max: 32, default: 31 K-mer size ■ Bigger values improve the search specificity and do not increase the index size. -m/--masks Default: 40,000 Number of masks ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=10,000 is high enough. -p/--seed-min-prefix Max: 32, Default: 15 Minimum length of shared substrings (anchors) in searching This value is used to remove masks with a prefix of low-complexity. Seeds (k-mer-value) data Flag Value Function Comment --seed-max-desert Default: 200 Maximum length of distances between seeds The default value of 200 guarantees queries \u0026gt;200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every \u0026ndash;seed-in-desert-dist (50 by default) bases. ■ Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. -c/--chunks Maximum: 128, default: #CPUs Number of seed file chunks Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. -J/--seed-data-threads Maximum: -c/\u0026ndash;chunks, default: 8 Number of threads for writing seed data and merging seed chunks from all batches ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. -p/--partitions Default: 512 Number of partitions for indexing each seed file Bigger values bring a little higher memory occupation. 512 is a good value with high searching speed, larger or smaller values would decrease the speed in lexicmap search. 
► After indexing, lexicmap utils reindex-seeds can be used to reindex the seeds data with another value of this flag. --max-open-files Default: 512 Maximum number of open files It\u0026rsquo;s only used in merging indexes of multiple genome batches. Also see the usage of lexicmap index.\nSteps We use a small dataset for demonstration.\nPreparing the test genomes (15 bacterial genomes) in the refs directory.\nNote that the genome files contain the assembly accessions (ID) in the file names.\ngit clone https://github.com/shenwei356/LexicMap cd LexicMap/demo/ ls refs/ GCF_000006945.2.fa.gz GCF_000392875.1.fa.gz GCF_001096185.1.fa.gz GCF_002949675.1.fa.gz GCF_006742205.1.fa.gz GCF_000017205.1.fa.gz GCF_000742135.1.fa.gz GCF_001457655.1.fa.gz GCF_002950215.1.fa.gz GCF_009759685.1.fa.gz GCF_000148585.2.fa.gz GCF_001027105.1.fa.gz GCF_001544255.1.fa.gz GCF_003697165.2.fa.gz GCF_900638025.1.fa.gz Building an index with genomes from a directory.\nlexicmap index -I refs/ -O demo.lmi It would take about 3 seconds and 2 GB RAM in a 16-CPU PC.\nOptionally, we can also use a file list as the input.\n$ head -n 3 files.txt refs/GCF_000006945.2.fa.gz refs/GCF_000017205.1.fa.gz refs/GCF_000148585.2.fa.gz lexicmap index -X files.txt -O demo.lmi Click to show the log of a demo run. ... # here we set a small --batch-size 5 $ lexicmap index -I refs/ -O demo.lmi --batch-size 5 16:22:49.745 [INFO] LexicMap v0.4.0 (14c2606) 16:22:49.745 [INFO] https://github.com/shenwei356/LexicMap 16:22:49.745 [INFO] 16:22:49.745 [INFO] checking input files ... 
16:22:49.745 [INFO] 15 input file(s) given 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ main parameters ] --------------------- 16:22:49.745 [INFO] 16:22:49.745 [INFO] input and output: 16:22:49.745 [INFO] input directory: refs/ 16:22:49.745 [INFO] regular expression of input files: (?i)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expressions for filtering out sequences: [] 16:22:49.745 [INFO] max genome size: 15000000 16:22:49.745 [INFO] output directory: demo.lmi 16:22:49.745 [INFO] 16:22:49.745 [INFO] mask generation: 16:22:49.745 [INFO] k-mer size: 31 16:22:49.745 [INFO] number of masks: 40000 16:22:49.745 [INFO] rand seed: 1 16:22:49.745 [INFO] prefix length for checking low-complexity in mask generation: 15 16:22:49.745 [INFO] 16:22:49.745 [INFO] seed data: 16:22:49.745 [INFO] maximum sketching desert length: 450 16:22:49.745 [INFO] distance of k-mers to fill deserts: 150 16:22:49.745 [INFO] seeds data chunks: 16 16:22:49.745 [INFO] seeds data indexing partitions: 512 16:22:49.745 [INFO] 16:22:49.745 [INFO] general: 16:22:49.745 [INFO] genome batch size: 5 16:22:49.745 [INFO] batch merge threads: 8 16:22:49.745 [INFO] 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ generating masks ] --------------------- 16:22:50.180 [INFO] 16:22:50.180 [INFO] --------------------- [ building index ] --------------------- 16:22:50.328 [INFO] 16:22:50.328 [INFO] ------------------------[ batch 1/3 ]------------------------ 16:22:50.328 [INFO] building index for batch 1 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:51.192 [INFO] writing seeds... 
16:22:51.264 [INFO] finished writing seeds in 71.756662ms 16:22:51.264 [INFO] finished building index for batch 1 in: 935.464336ms 16:22:51.264 [INFO] 16:22:51.264 [INFO] ------------------------[ batch 2/3 ]------------------------ 16:22:51.264 [INFO] building index for batch 2 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:53.126 [INFO] writing seeds... 16:22:53.212 [INFO] finished writing seeds in 86.823785ms 16:22:53.212 [INFO] finished building index for batch 2 in: 1.948770015s 16:22:53.212 [INFO] 16:22:53.212 [INFO] ------------------------[ batch 3/3 ]------------------------ 16:22:53.212 [INFO] building index for batch 3 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:54.350 [INFO] writing seeds... 16:22:54.437 [INFO] finished writing seeds in 87.058101ms 16:22:54.437 [INFO] finished building index for batch 3 in: 1.224414126s 16:22:54.437 [INFO] 16:22:54.437 [INFO] merging 3 indexes... 16:22:54.437 [INFO] [round 1] 16:22:54.437 [INFO] batch 1/1, merging 3 indexes to demo.lmi.tmp/r1_b1 with 8 threads... 
16:22:54.613 [INFO] [round 1] finished in 175.640164ms 16:22:54.613 [INFO] rename demo.lmi.tmp/r1_b1 to demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] finished building LexicMap index from 15 files with 40000 masks in 4.875616203s 16:22:54.620 [INFO] LexicMap index saved: demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] elapsed time: 4.875654824s 16:22:54.620 [INFO] Output The LexicMap index is a directory with multiple files.\nFile structure $ tree demo.lmi/ demo.lmi/ # the index directory ├── genomes # directory of genome data │ └── batch_0000 # genome data of one batch │ ├── genomes.bin # genome data file, containing genome ID, size, sequence lengths, bit-packed sequences │ └── genomes.bin.idx # index of genome data file, for fast subsequence extraction ├── seeds # seed data: pairs of k-mer and its location information (genome batch, genome number, location, strand) │ ├── chunk_000.bin # seed data file │ ├── chunk_000.bin.idx # index of seed data file, for fast seed searching and data extraction ... ... ... │ ├── chunk_015.bin # the number of chunks is set by flag `-c/--chunks`, default: #cpus │ └── chunk_015.bin.idx ├── genomes.map.bin # mapping genome ID to batch number of genome number in the batch ├── info.toml # summary of the index └── masks.bin # mask data Index size LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance.\nNote that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. 
Because the seed data are compressed with VARINT-GB algorithm, more genome bring higher compression rates.\nDemo data # 15 genomes demo.lmi/: 59.55 MB 46.31 MB seeds 12.93 MB genomes 312.53 KB masks.bin 375.00 B genomes.map.bin 322.00 B info.toml GTDB repr # 85,205 genomes/ gtdb_repr.lmi: 212.58 GB 145.79 GB seeds 66.78 GB genomes 2.03 MB genomes.map.bin 312.53 KB masks.bin 328.00 B info.toml GTDB complete # 402,538 genomes gtdb_complete.lmi: 905.95 GB 542.97 GB seeds 362.98 GB genomes 9.60 MB genomes.map.bin 312.53 KB masks.bin 329.00 B info.toml Genbank\u0026#43;RefSeq # 2,340,672 genomes genbank_refseq.lmi: 4.94 TB 2.77 TB seeds 2.17 TB genomes 55.81 MB genomes.map.bin 312.53 KB masks.bin 331.00 B info.toml AllTheBacteria HQ # 1,858,610 genomes atb_hq.lmi: 3.88 TB 2.11 TB seeds 1.77 TB genomes 39.22 MB genomes.map.bin 312.53 KB masks.bin 331.00 B info.toml Directory/file sizes are counted with https://github.com/shenwei356/dirsize. Index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Explore the index lexicmap utils genomes can list genome IDs of indexed genomes, see the usage and example. lexicmap utils masks can list masks of the index, see the usage and example. lexicmap utils kmers can list details of all seeds (k-mers), including reference, location(s) and the strand. see the usage and example. lexicmap utils seed-pos can help to explore the seed positions, see the usage and example. Before that, the flag --save-seed-pos needs to be added to lexicmap index. lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions, see the usage and example. 
What\u0026rsquo;s next: Searching ","description":"Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz Run: From a directory with multiple genome files:\nlexicmap index -I genomes/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files."},{"id":20,"href":"/LexicMap/usage/lexicmap/","title":"lexicmap","parent":"Usage","content":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Use \u0026#34;lexicmap [command] --help\u0026#34; for more information about a command. 
","description":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line)."},{"id":21,"href":"/LexicMap/notes/motivation/","title":"Motivation","parent":"Notes","content":" BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information.\n","description":"BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. 
For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information."},{"id":22,"href":"/LexicMap/tags/","title":"Tags","parent":"","content":"","description":""}] \ No newline at end of file +[{"id":0,"href":"/LexicMap/usage/utils/masks/","title":"masks","parent":"utils","content":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks. (default 40000) -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -p, --prefix int ► Length of mask k-mer prefix for checking low-complexity (0 for no checking). (default 15) -s, --seed int ► The seed for generating random masks. (default 1) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) Examples $ lexicmap utils masks --quiet -d demo.lmi/ | head -n 10 1 AAAACACCATGGAGCCTTGTGGAACCTTGGC 2 AAAACACGCGATCAGGTCGTCCGTCCCAGTG 3 AAAACACTATGGCCTGATTACCCCATCCCGA 4 AAAACAGGACCGTCCTAGGGTCAATGGTTCG 5 AAAACAGTCTTGTATTATGTACTTCACATTC 6 AAAACATGTTACTACGGTTTTCCGCAATTGG 7 AAAACATTGGTCCTATTGGCGTCACTCGATA 8 AAAACCACTGTGCATATCTCGAATCCCGCTC 9 AAAACCAGCTCTGTAAGCACTAACAACGCTA 10 AAAACCATGGTGCCGTGCATTTGCGCACCTA $ lexicmap utils masks --quiet -d demo.lmi/ | tail -n 10 39991 TTTTGGTCTACAGAAAGTGCGTTATAGATTT 39992 TTTTGGTGTGGAGAAGGACCTCACTGTTAAT 39993 TTTTGTAGACCGAGGTTTTAAGTCCAGGGGG 39994 TTTTGTATGGAATACTTTACAGTCATCAGTT 39995 TTTTGTCATCAGTCGGCACTTAGGGGAACCG 39996 TTTTGTCCCAGTGACCAATCACAGTTCGGGA 39997 TTTTGTCGATAATCCTGCCTCGATTTCTCTT 39998 TTTTGTGAATAAGAGATCCTGTCGCAGGAAA 39999 TTTTGTGCACGACGCTCCTGGTGTATCGCCT 40000 TTTTGTGGCGACGGCGTACCCCGTCTAGGAG # check a specific mask $ lexicmap utils masks --quiet -d demo.lmi/ -m 12345 12345 CATGTTATAGCACTGGCGGCTAACGCCTTTG ","description":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks."},{"id":1,"href":"/LexicMap/usage/index/","title":"index","parent":"Usage","content":"$ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers in the file names. 2. Input plain or gzip/xz/zstd/bzip2 compressed FASTA/Q files can be given via positional arguments or the flag -X/--infile-list with a list of input files. 
Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list. 3. Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed. A regular expression for matching sequencing files is available via the flag -r/--file-regexp. 4. Some non-isolate assemblies might have extremely large genomes (e.g., GCA_000765055.1, \u0026gt;150 mb). The flag -g/--max-genome is used to skip these input files, and the file list would be written to a file (-G/--big-genomes). You need to increase the value for indexing fungi genomes. 5. Maximum genome size: 268,435,456. More precisely: $total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456, as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index. 6. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). Attention: *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome identifiers in the index and search result would be: the basenames of files with common FASTA/Q file extensions removed, which are extracted via the flag -N/--ref-name-regexp. ► The extracted genome identifiers better be distinct, which will be shown in search results and are used to extract subsequences in the command \u0026#34;lexicmap utils subseq\u0026#34;. 2) ► Unwanted sequences like plasmids can be filtered out by content in FASTA/Q header via regular expressions (-B/--seq-name-filter). 3) All degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. code bases saved A A A C C C G G G T/U T T M A/C A R A/G A W A/T A S C/G C Y C/T C K G/T G V A/C/G A H A/C/T A D A/G/T A B C/G/T C N A/C/G/T A Important parameters: --- Genome data --- *1. -b/--batch-size, ► Maximum number of genomes in each batch (maximum: 131072, default: 5000). 
► If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. --- LexicHash mask generation --- 0. -M/--mask-file, ► File with custom masks, which could be exported from an existing index or newly generated by \u0026#34;lexicmap utils masks\u0026#34;. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, etc. *1. -k/--kmer, ► K-mer size (maximum: 32, default: 31). ■ Bigger values improve the search specificity and do not increase the index size. *2. -m/--masks, ► Number of LexicHash masks (default: 40000). ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. 3. -p/--seed-min-prefix, ► Minimum length of shared substrings (anchors) in searching (maximum: 32, default: 15). ► This value is used to remove masks with a prefix of low-complexity. --- Seeds data (k-mer-value data) --- *1. --seed-max-desert ► Maximum length of distances between seeds (default: 200). The default value of 200 guarantees queries \u0026gt;=200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist (50 by default) bases. ■ Big values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. 2. -c/--chunks, ► Number of seed file chunks (maximum: 128, default: #CPUs). ► Bigger values accelerate the search speed at the cost of a high disk reading load. 
The maximum number should not exceed the maximum number of open files set by the operating systems. *3. -J/--seed-data-threads ► Number of threads for writing seed data and merging seed chunks from all batches (maximum: -c/--chunks, default: 8). ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. 4. --partitions, ► Number of partitions for indexing each seed file (default: 512). ► Bigger values bring a little higher memory occupation. 512 is a good value with high searching speed, Larger or smaller values would decrease the speed in \u0026#34;lexicmap search\u0026#34;. ► After indexing, \u0026#34;lexicmap utils reindex-seeds\u0026#34; can be used to reindex the seeds data with another value of this flag. 5. --max-open-files, ► Maximum number of open files (default: 512). ► It\u0026#39;s only used in merging indexes of multiple genome batches. Usage: lexicmap index [flags] [-k \u0026lt;k\u0026gt;] [-m \u0026lt;masks\u0026gt;] { -I \u0026lt;seqs dir\u0026gt; | -X \u0026lt;file list\u0026gt;} -O \u0026lt;out dir\u0026gt; Flags: -b, --batch-size int ► Maximum number of genomes in each batch (maximum value: 131072) (default 5000) -G, --big-genomes string ► Out file of skipped files with $total_bases + ($num_contigs - 1) * $contig_interval \u0026gt;= -g/--max-genome. The second column is one of the skip types: no_valid_seqs, too_large_genome, too_many_seqs. -c, --chunks int ► Number of chunks for storing seeds (k-mer-value data) files. (default 16) --contig-interval int ► Length of interval (N\u0026#39;s) between contigs in a genome. (default 1000) -r, --file-regexp string ► Regular expression for matching sequence files in -I/--in-dir, case ignored. (default \u0026#34;\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --force ► Overwrite existing output directory. -h, --help help for index -I, --in-dir string ► Input directory containing FASTA/Q files. Directory and file symlinks are followed. 
-k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -M, --mask-file string ► File of custom masks. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, -p/--seed-min-prefix, etc. -m, --masks int ► Number of LexicHash masks. (default 40000) -g, --max-genome int ► Maximum genome size. Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. Need to be smaller than the maximum supported genome size: 268435456 (default 15000000) --max-open-files int ► Maximum opened files, used in merging indexes. (default 512) -l, --min-seq-len int ► Maximum sequence length to index. The value would be k for values \u0026lt;= 0 (default -1) --no-desert-filling ► Disable sketching desert filling (only for debug). -O, --out-dir string ► Output LexicMap index directory. --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files. (default 512) -s, --rand-seed int ► Rand seed for generating random masks. (default 1) -N, --ref-name-regexp string ► Regular expression (must contains \u0026#34;(\u0026#34; and \u0026#34;)\u0026#34;) for extracting the reference name from the filename. (default \u0026#34;(?i)(.+)\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --save-seed-pos ► Save seed positions, which can be inspected with \u0026#34;lexicmap utils seed-pos\u0026#34;. -J, --seed-data-threads int ► Number of threads for writing seed data and merging seed chunks from all batches, the value should be in range of [1, -c/--chunks] (default 8) -d, --seed-in-desert-dist int ► Distance of k-mers to fill deserts. (default 50) -D, --seed-max-desert int ► Maximum length of sketching deserts, or maximum seed distance. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist bases. (default 200) -p, --seed-min-prefix int ► Minimum length of shared substrings (anchors) in searching. 
Here, this value is used to remove low-complexity masks and choose k-mers to fill sketching deserts. (default 15) -B, --seq-name-filter strings ► List of regular expressions for filtering out sequences by contents in FASTA/Q header/name, case ignored. -S, --skip-file-check ► Skip input file checking when given files or a file list. Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples See Building an index ","description":"$ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers in the file names. 2. Input plain or gzip/xz/zstd/bzip2 compressed FASTA/Q files can be given via positional arguments or the flag -X/--infile-list with a list of input files. Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list. 3. Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed."},{"id":2,"href":"/LexicMap/introduction/","title":"Introduction","parent":"","content":" LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Related projects Support License Features LexicMap is scalable to up to millions of prokaryotic genomes. The sensitivity of LexicMap is comparable with Blastn. The alignment is fast and memory-efficient. 
LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs). LexicMap is easy to use (tutorials and usages). Both tabular and Blast-style output formats are available. Besides, we provide several commands to explore the index data and extract indexed subsequences. Introduction Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.\nExisting full alignment tools face challenges of high memory consumption and slow speeds. Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis. Prefilter+Align strategies have the sensitivity issue in the prefiltering step. Methods: (algorithm overview)\nAn improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently. We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee. We added the support of suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched, which means seeds are immune to any single SNP. A multi-level index enables fast and low-memory variable-length seed matching and chaining. A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment. A reimplemented Wavefront alignment algorithm is used for base-level alignment. Results:\nLexicMap enables efficient indexing and searching of both RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million genomes respectively). 
Running at this scale has previously only been achieved by Phylign (previously called mof-search).\nFor searching in all 2,340,672 Genbank+Refseq prokaryotic genomes, Blastn is unable to run with this dataset on common servers as it requires \u0026gt;2000 GB RAM. (see performance).\nWith LexicMap (48 CPUs),\nQuery Genome hits Time RAM A 1.3-kb marker gene 36,633 21s 3.4 GB A 1.5-kb 16S rRNA 1,928,372 6m40s 16.7 GB A 52.8-kb plasmid 551,264 8m54s 20.1 GB 1003 AMR genes 27,577,060 5h18m 41.3 GB Quick start Building an index (see the tutorial of building an index).\n# From a directory with multiple genome files lexicmap index -I genomes/ -O db.lmi # From a file list with one file per line lexicmap index -X files.txt -O db.lmi Querying (see the tutorial of searching).\n# For short queries like genes or long reads, returning top N hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 # For longer queries like plasmids, returning all hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Sample output (queries are a few Nanopore Q20 reads). 
See output format details.\nquery qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 CIGAR string, aligned query and subject sequences can be outputted as extra columns via the flag 
-a/--all.\n# Extracting similar sequences for a query gene. # search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \\ --min-qcov-per-hsp 90 --all # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\\t' '{print \u0026quot;\u0026gt;\u0026quot;$5\u0026quot;:\u0026quot;$14\u0026quot;-\u0026quot;$15\u0026quot;:\u0026quot;$16\u0026quot;\\n\u0026quot;$20;}' \\ | seqkit seq -g \u0026gt; results.fasta Export blast-style format:\nseqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 2 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast Query = GCF_000017205.1_r160 Length = 478 [Subject genome #1/1] = GCF_000017205.1 Query coverage per genome = 95.188% \u0026gt;NC_009656.1 Length = 6588339 HSP #1 Query coverage per seq = 95.188%, Aligned length = 463, Identities = 95.680%, Gaps = 12 Query range = 13-467, Subject range = 4866862-4867320, Strand = Plus/Plus Query 13 CCTCAAACGAGTCC-AACAGGCCAACGCCTAGCAATCCCTCCCCTGTGGGGCAGGGAAAA 71 |||||||||||||| |||||||| |||||| | ||||||||||||| |||||||||||| Sbjct 4866862 CCTCAAACGAGTCCGAACAGGCCCACGCCTCACGATCCCTCCCCTGTCGGGCAGGGAAAA 4866921 Query 72 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCATCTTCCACGGTGCCC 131 |||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| Sbjct 4866922 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCAT-TTCCACGGTGCCC 4866980 Query 132 GCCCACGGCGGACCCGCGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTA-CA 190 ||| ||||||||||| ||||||||||||||||||||||||||||||||||||||||| || Sbjct 4866981 GCC-ACGGCGGACCC-CGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTATCA 4867038 Query 191 CTCGGCGTTCGGCCAGCGACAGC---GACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 247 |||||||| |||||||||||||| |||||||||||||||||||||||||||||||||| Sbjct 4867039 CTCGGCGT-CGGCCAGCGACAGCAGCGACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 4867097 Query 248 AGGTGGTGCGCTCGCTGAC-AAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCG- 305 ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 4867098 
AGGTGGTGCGCTCGCTGACGAAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCGG 4867157 Query 306 TGCCGGACAGCCCGTGGCCGCCGAACAGTTGCACGCCCACCACCGCGCCGAT-TGGTTTC 364 |||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| | Sbjct 4867158 TGCCGGACAGCCCGTGGCCGCCGAACGGTTGCACGCCCACCACCGCGCCGATCTGGTTGC 4867217 Query 365 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTGGATGCGGCGGGCGGTTTCCTCGT 424 |||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| Sbjct 4867218 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTCGATGCGGCGGGCGGTTTCCTCGT 4867277 Query 425 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 467 ||||||||||||||||||||||||||||||||||||||||||| Sbjct 4867278 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 4867320 Learn more tutorials and usages.\nPerformance Indexing dataset genomes gzip_size tool db_size time RAM GTDB complete 402,538 578 GB LexicMap 906 GB 8 h 21 m 71.1 GB Blastn 360 GB 3 h 11 m 718 MB AllTheBacteria HQ 1,858,610 3.1 TB LexicMap 3.88 TB 48 h 08 m 82.7 GB Blastn 1.76 TB 14 h 03 m 2.9 GB Phylign 248 GB / / Genbank+RefSeq 2,340,672 3.5 TB LexicMap 4.94 TB 52 h 03 m 188.6 GB Blastn 2.15 TB 14 h 04 m 4.3 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). LexicMap index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Searching Blastn failed to run as it requires \u0026gt;2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. 
Phylign only has the index for AllTheBacteria HQ dataset.\nGTDB complete (402,538 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 5,249 5,234 2.2 s 1.0 GB Blastn 7,121 6,177 2,171 s 351.2 GB a 16S rRNA gene 1,542 bp LexicMap 302,096 278,023 73 s 4.1 GB Blastn 301,197 277,042 2,353 s 378.4 GB a plasmid 52,830 bp LexicMap 63,820 1,188 58 s 4.7 GB Blastn 69,311 2,308 2,262 s 364.7 GB 1033 AMR genes 1 kb (median) LexicMap 4,132,990 2,255,347 1,165 s 20.2 GB Blastn 5,357,772 2,240,766 4,686 s 442.1 GB AllTheBacteria HQ (1,858,610 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 33,795 33,786 19 s 2.5 GB Phylign_local 7,936 30 m 48 s 77.6 GB Phylign_cluster 7,936 28 m 33 s a 16S rRNA gene 1,542 bp LexicMap 1,857,641 1,739,767 7 m 50 s 18.2 GB Phylign_local 1,017,765 130 m 33 s 77.0 GB Phylign_cluster 1,017,765 86 m 41 s a plasmid 52,830 bp LexicMap 480,008 3,620 8 m 16 s 15.7 GB Phylign_local 46,822 47 m 33 s 82.6 GB Phylign_cluster 46,822 39 m 34 s 1033 AMR genes 1 kb (median) LexicMap 22,995,817 12,347,425 185 m 25 s 45.1 GB Phylign_local 1,135,215 156 m 08 s 85.9 GB Phylign_cluster 1,135,215 133 m 49 s Genbank+RefSeq (2,340,672 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 36,633 36,578 21 s 3.4 GB a 16S rRNA gene 1,542 bp LexicMap 1,928,372 1,381,723 6 m 40 s 16.7 GB a plasmid 52,830 bp LexicMap 551,264 6,559 8 m 54 s 20.1 GB 1033 AMR genes 1 kb (median) LexicMap 27,577,060 14,798,129 318 m 28 s 41.3 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). Main searching parameters: LexicMap v0.4.0: --threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 70. 
Blastn v2.15.0+: -num_threads 48 -max_target_seqs 10000000. Phylign (AllTheBacteria fork 9fc65e6): threads: 48, cobs_kmer_thres: 0.33, minimap_preset: \u0026quot;asm20\u0026quot;, nb_best_hits: 5000000, max_ram_gb: 100; For cluster, maximum number of slurm jobs is 100. Installation LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.\nOr install with conda:\nconda install -c bioconda lexicmap Algorithm overview Related projects High-performance LexicHash computation in Go. Wavefront alignment algorithm (WFA) in Golang. Support Please open an issue to report bugs, propose new functions or ask for help.\nLicense MIT License\n","description":"LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Related projects Support License Features LexicMap is scalable to up to millions of prokaryotic genomes. The sensitivity of LexicMap is comparable with Blastn. The alignment is fast and memory-efficient. LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs)."},{"id":3,"href":"/LexicMap/usage/utils/kmers/","title":"kmers","parent":"utils","content":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. Reversed means if the k-mer is reversed for suffix matching. 
Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out.tsv.gz] Flags: -h, --help help for kmers -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -m, --mask int ► View k-mers captured by Xth mask. (0 for all) (default 1) -f, --only-forward ► Only output forward k-mers. -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples The default output is captured k-mers of the first mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAAAAAAAATACTAAGAGGTGACAAAAGAG 4 1 GCF_001544255.1 2142679 + yes 1 AAAAAAAAACAAAGCGGACTTGGACATTGTC 4 1 GCF_000006945.2 3170307 + yes 1 AAAAAAAAACCAGTAAAAAAAGGGGAGTAGA 4 1 GCF_000392875.1 771896 + yes 1 AAAAAAAAACGACTTACCATTAACGTTCAAG 4 1 GCF_003697165.2 803728 + yes 1 AAAAAAAAACTAGGGTTAAATGCCTTATGTT 4 1 GCF_009759685.1 442423 + yes 1 AAAAAAAAAGAGATGAAAAAGGGTGTATTCG 4 1 GCF_001544255.1 1493451 - yes 1 AAAAAAAAATAAAATATCTAACGAGCAAATT 4 1 GCF_001096185.1 2065540 + yes 1 AAAAAAAAATACCATAGACTATGCTCTTAGT 4 1 GCF_000392875.1 134079 - yes 1 AAAAAAAAATAGAGTTTTTTTTCTGGATAAG 4 1 GCF_000392875.1 795189 + yes 1 AAAAAAAAATGTTAACAGAAGGTCCCTACCT 4 1 GCF_002950215.1 2765957 + yes 1 AAAAAAAACAAAAGCTATACTGGTCATGTTC 4 1 GCF_000006945.2 3635995 + yes 1 AAAAAAAACAAAGATACATTTAGGACGGTTA 4 1 GCF_000006945.2 616481 - yes 1 
AAAAAAAACAGCCCACCGCCGATTGCGGAAT 4 1 GCF_000742135.1 1208620 + yes 1 AAAAAAAACAGGGTGTCGTGCCCTTGTCAGT 4 1 GCF_003697165.2 627153 - yes 1 AAAAAAAACAGGGTGTTCTTAGATAAAAGGG 4 1 GCF_000742135.1 1723387 - yes 1 AAAAAAAACATATAGTTGTGAAGGCATTGGA 4 1 GCF_001027105.1 2508079 - yes 1 AAAAAAAACCAGTAAAAAAAGGGGAGTAGAA 4 1 GCF_000392875.1 771895 + yes 1 AAAAAAAACCATATTATGTCCGATCCTCACA 4 1 GCF_000392875.1 1060650 + yes 1 AAAAAAAACCCTTCGTCAAGCATTATGGAAT 4 1 GCF_000392875.1 1139573 - yes Only forward k-mers.\n$ lexicmap utils kmers --quiet -d demo.lmi/ -f | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAACACCAAAAGCCTCTCCGATAACACCAG 9 1 GCF_002949675.1 2046311 + no 1 AAAACACCAAAGTTAAAGTGCCGTTTAGCGT 9 1 GCF_003697165.2 1085073 + no 1 AAAACACCAATTAGTGATTGTGTTTCCTCAA 9 1 GCF_000392875.1 2785764 - no 1 AAAACACCACAGTGAAAGACAACATTTAATA 9 1 GCF_000392875.1 1132052 - no 1 AAAACACCACCACAAATGCATAAGAAAACTT 9 1 GCF_003697165.2 2862670 + no 1 AAAACACCACTCAATCCTTTAAATAAAAACA 9 1 GCF_002949675.1 2467828 - no 1 AAAACACCACTTTACGGGCGTTTTGTGCAAT 9 1 GCF_003697165.2 4241904 - no 1 AAAACACCAGCACGTTCAGCACCGCCACCAG 9 1 GCF_000017205.1 4399207 - no 1 AAAACACCAGCGAACGGAAGAACATCGCGAT 9 1 GCF_003697165.2 248663 + no 1 AAAACACCAGGCCGGAGCAGAAGGTTATTCT 9 1 GCF_003697165.2 4139632 + no 1 AAAACACCATAAACGATTGTTGGAATACCCG 10 1 GCF_009759685.1 268158 + no 1 AAAACACCATCATACACTAAATCAGTAAGTT 10 4 GCF_002949675.1 496925 + no 1 AAAACACCATCATACACTAAATCAGTAAGTT 10 4 GCF_002949675.1 2254974 + no 1 AAAACACCATCATACACTAAATCAGTAAGTT 10 4 GCF_002949675.1 2495183 + no 1 AAAACACCATCATACACTAAATCAGTAAGTT 10 4 GCF_002949675.1 4009312 + no 1 AAAACACCATGAACGCCAACGCCGCCGAGCT 11 1 GCF_000742135.1 2707622 + no 1 AAAACACCATGAGCAAACTCCAGCATATCGG 11 1 GCF_000017205.1 2490011 - no 1 AAAACACCATGCAAAAAACTTCTTTTAGAAA 11 1 GCF_000006945.2 1324151 - no 1 AAAACACCATGCAGCATGTCATAGCGCTGGA 11 1 GCF_003697165.2 422685 + no Specify 
the mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ --mask 12345 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ----- ------------------------------- ------ ------ --------------- ------- ------ -------- 12345 CATGTTACAAAAGGTGGGTCAGGCAACGTAT 7 1 GCF_001457655.1 335112 - yes 12345 CATGTTACCAAGGTTAGTCGTATGGCGCTAC 7 1 GCF_001457655.1 23755 - yes 12345 CATGTTACGCGTATTTTAGCGGCTCGCGGAC 7 1 GCF_000006945.2 702224 + yes 12345 CATGTTATAACGGCCTATGAATCGGCATTAC 9 1 GCF_009759685.1 2591866 + no 12345 CATGTTATACGTTGAAACTGTCTTGTTAATA 9 1 GCF_001096185.1 1142460 + yes 12345 CATGTTATACTTTAGATACTTATTTTTAGGA 9 1 GCF_000392875.1 1524553 + no 12345 CATGTTATAGAAGGACGTCGACATCTTGTGG 10 1 GCF_000017205.1 3140677 + no 12345 CATGTTATAGAATTACATACATTGTAACATG 10 1 GCF_006742205.1 704431 - no 12345 CATGTTATAGCACGCTTAATCGCTTGATCCC 13 1 GCF_001027105.1 2655846 + no 12345 CATGTTATAGCATCCTTTTACGTGAAAAGGT 12 1 GCF_000742135.1 4136093 + no 12345 CATGTTATAGCCAGCAAATGGAAGCATCGCG 11 1 GCF_009759685.1 492828 - no 12345 CATGTTATAGCCATTGATGGTAACTTTGATG 11 1 GCF_001096185.1 536843 + no 12345 CATGTTATAGCCTGAAAGGTGCTAAACAACT 11 1 GCF_000006945.2 4876155 + no 12345 CATGTTATAGCCTTCTCCAAGACCAATCAAA 11 1 GCF_000148585.2 1667015 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002949675.1 1871326 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002950215.1 2326544 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_003697165.2 3996124 + no 12345 CATGTTATAGCTAACTGCGACTTGTGGCACA 11 1 GCF_900638025.1 991007 - no 12345 CATGTTATAGTCGTGAGGTTCTAAAAAAACT 10 1 GCF_001544255.1 1091256 - no 12345 CATGTTATAGTTTGTCTTACCGCTACTGAAA 10 1 GCF_002950215.1 1457055 + yes 12345 CATGTTATATCCTTCTTGAATACGAGCAATA 9 1 GCF_000392875.1 1963573 + no 12345 CATGTTATATGAACCTTCAACCTTATTTGAC 9 1 GCF_001457655.1 1510084 + no 12345 CATGTTATCCAGGTATTTCACCAGCGCACGC 8 1 GCF_000006945.2 836525 + no 12345 CATGTTATCGAATATTATAACATCGGCTCCC 8 1 GCF_000148585.2 1372855 + yes 12345 CATGTTATCGATAAGGCTATATATGACCTTA 8 1 GCF_002950215.1 
878140 - no 12345 CATGTTATCGCTCAGGGTCTGCGGGTATATC 8 1 GCF_002950215.1 1880029 + yes 12345 CATGTTATGCGTATAAAGACGAGTAAAGGTT 8 1 GCF_009759685.1 3827118 + no 12345 CATGTTATGCTGGGACATTTAGCACCGCTAC 8 1 GCF_000006945.2 1988134 + yes \u0026ldquo;reversed\u0026rdquo; indicates whether the k-mer is reversed for suffix matching. E.g., CATGTTACAAAAGGTGGGTCAGGCAACGTAT is reversed, so you need to reverse it before searching in the genome.\n$ seqkit locate -p $(echo CATGTTACAAAAGGTGGGTCAGGCAACGTAT | rev) refs/GCF_001457655.1.fa.gz -M | csvtk pretty -t seqID patternName pattern strand start end ------------- ------------------------------- ------------------------------- ------ ------ ------ NZ_LN831035.1 TATGCAACGGACTGGGTGGAAAACATTGTAC TATGCAACGGACTGGGTGGAAAACATTGTAC - 335112 335142 For all masks. The result might be very large, so writing to gzip format is recommended.\n$ lexicmap utils kmers -d demo.lmi/ --mask 0 -o kmers.tsv.gz $ zcat kmers.tsv.gz | csvtk freq -t -f mask -nr | head -n 10 mask frequency 1 610 40000 568 31 435 20 432 39997 423 28 419 30018 415 30027 403 79 396 K-mers of a specific mask\n$ lexicmap utils kmers -d demo.lmi/ -m 12345 | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ----- ------------------------------- ------ ------ --------------- ------- ------ -------- 12345 CATGTTACAAAAGGTGGGTCAGGCAACGTAT 7 1 GCF_001457655.1 335112 - yes 12345 CATGTTACCAAGGTTAGTCGTATGGCGCTAC 7 1 GCF_001457655.1 23755 - yes 12345 CATGTTACGCGTATTTTAGCGGCTCGCGGAC 7 1 GCF_000006945.2 702224 + yes 12345 CATGTTATAACGGCCTATGAATCGGCATTAC 9 1 GCF_009759685.1 2591866 + no 12345 CATGTTATACGTTGAAACTGTCTTGTTAATA 9 1 GCF_001096185.1 1142460 + yes 12345 CATGTTATACTTTAGATACTTATTTTTAGGA 9 1 GCF_000392875.1 1524553 + no 12345 CATGTTATAGAAGGACGTCGACATCTTGTGG 10 1 GCF_000017205.1 3140677 + no 12345 CATGTTATAGAATTACATACATTGTAACATG 10 1 GCF_006742205.1 704431 - no 12345 CATGTTATAGCACGCTTAATCGCTTGATCCC 13 1 GCF_001027105.1 2655846 + no 12345 
CATGTTATAGCATCCTTTTACGTGAAAAGGT 12 1 GCF_000742135.1 4136093 + no 12345 CATGTTATAGCCAGCAAATGGAAGCATCGCG 11 1 GCF_009759685.1 492828 - no 12345 CATGTTATAGCCATTGATGGTAACTTTGATG 11 1 GCF_001096185.1 536843 + no 12345 CATGTTATAGCCTGAAAGGTGCTAAACAACT 11 1 GCF_000006945.2 4876155 + no 12345 CATGTTATAGCCTTCTCCAAGACCAATCAAA 11 1 GCF_000148585.2 1667015 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002949675.1 1871326 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_002950215.1 2326544 + no 12345 CATGTTATAGCGTAAATCAGCACCGCGCGCC 11 3 GCF_003697165.2 3996124 + no 12345 CATGTTATAGCTAACTGCGACTTGTGGCACA 11 1 GCF_900638025.1 991007 - no 12345 CATGTTATAGTCGTGAGGTTCTAAAAAAACT 10 1 GCF_001544255.1 1091256 - no Lengths of shared prefixes between probes and captured k-mers.\nzcat kmers.tsv.gz \\ | csvtk filter2 -t -f '$reversed == \u0026quot;no\u0026quot;'\\ | csvtk plot hist -t -f prefix -o prefix.hist.png \\ --xlab \u0026quot;length of common prefixes between captured k-mers and masks\u0026quot; The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. Reversed means if the k-mer is reversed for suffix matching. 
Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out."},{"id":4,"href":"/LexicMap/tutorials/search/","title":"Searching","parent":"Tutorials","content":" Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned. Input should be (gzipped) FASTA or FASTQ records from files or STDIN.\nHardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 16 GB) is preferred. The memory usage in searching is mainly related to: The number of matched genomes and sequences. The length of query sequences. Similarities between query and target sequences. The number of threads. It uses all CPUs by default (-j/--threads). Disk Sufficient space is required to store the index size. No temporary files are generated during searching. Algorithm Masking: Query sequence is masked by the masks of the index. 
In other words, each mask captures the most similar k-mer, the one sharing the longest prefix with the mask, and stores its position and strand information. Seeding: For each mask, the captured k-mer is used to search seeds (captured k-mers in reference genomes) sharing prefixes or suffixes of at least p bases. Prefix matching Setting the search range: Since the seeded k-mers are stored in lexicographic order, the k-mer matching turns into a range query. For example, for a query CATGCT, requiring a match of at least a 4-bp prefix is equivalent to extracting k-mers ranging from CATGAA, CATGAC, CATGAG, \u0026hellip;, to CATGTT. Finding the nearest smaller k-mer: The index file of each seed data file stores a list (default 512) of k-mers and offsets in the data file, and the index is loaded in RAM. The nearest k-mer smaller than the range start k-mer (CATGAA) is found by binary search, i.e., CATCAC (blue text in the figure), and the offset is returned as the start position in traversing the seed data file. Retrieving seed data: Seed k-mers are read from the file and checked one by one, and k-mers in the search range are returned, along with the k-mer information (genome batch, genome number, location, and strand). Suffix matching Reversing the query k-mer and performing prefix matching, returning seeds of reversed k-mers (see indexing algorithm). Chaining: Seeding results, i.e., anchors (matched k-mers from the query and subject sequence), are summarized by genome, and deduplicated. Performing chaining (see the paper). Alignment for each chain. Extending the anchor region to extract sequences from the query and reference genome. For example, extending 2 kb upstream and downstream of the anchor region. Performing pseudo-alignment with extended query and subject sequences to find similar regions. Similar regions that span more than one reference sequence are split into multiple ones. 
Fast alignment of query and subject sequence regions with our implementation of Wavefront alignment algorithm. Filtering alignments based on user options. Parameters Flags in bold text are important and frequently used.\nGeneral Flag Value Function Comment -w/--load-whole-seeds Load the whole seed data into memory for faster search Use this if the index is not big and many queries are needed to search. -n/--top-n-genomes Default 0, 0 for all Keep top N genome matches for a query in the chaining phase The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step. -a/--all Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026ldquo;lexicmap utils 2blast\u0026rdquo; -J/\u0026ndash;max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory Chaining Flag Value Function Comment -p, --seed-min-prefix Default 15 Minimum (prefix) length of matched seeds. Smaller values produce more results at the cost of slow speed. -P, --seed-min-single-prefix Default 17 Minimum (prefix) length of matched seeds if there\u0026rsquo;s only one pair of seeds matched. Smaller values produce more results at the cost of slow speed. --seed-max-dist Default 10000 Max distance between seeds in seed chaining. --seed-max-gap Default 500 Max gap in seed chaining. Alignment Flag Value Function Comment -Q/--min-qcov-per-genome Default 0 Minimum query coverage (percentage) per genome. -q/--min-qcov-per-hsp Default 0 Minimum query coverage (percentage) per HSP. -l/--align-min-match-len Default 50 Minimum aligned length in a HSP segment. -i/--align-min-match-pident Default 70 Minimum base identity (percentage) in a HSP segment. --align-band Default 50 Band size in backtracking the score matrix. 
--align-ext-len Default 2000 Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. --align-max-gap Default 20 Maximum gap in a HSP segment. Steps For short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-match-pident 70 \\ --min-qcov-per-hsp 70 \\ --min-qcov-per-genome 70 \\ --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-match-pident 70 \\ --min-qcov-per-hsp 0 \\ --min-qcov-per-genome 0 \\ --top-n-genomes 0 Click to show the log of a demo run. ... $ lexicmap search -d demo.lmi/ q.gene.fasta -o q.gene.fasta.lexicmap.tsv 09:32:55.551 [INFO] LexicMap v0.4.0 09:32:55.551 [INFO] https://github.com/shenwei356/LexicMap 09:32:55.551 [INFO] 09:32:55.551 [INFO] checking input files ... 09:32:55.551 [INFO] 1 input file(s) given 09:32:55.551 [INFO] 09:32:55.551 [INFO] loading index: demo.lmi/ 09:32:55.551 [INFO] reading masks... 09:32:55.552 [INFO] reading indexes of seeds (k-mer-value) data... 09:32:55.555 [INFO] creating genome reader pools, each batch with 16 readers... 09:32:55.555 [INFO] index loaded in 4.192051ms 09:32:55.555 [INFO] 09:32:55.555 [INFO] searching ... 
09:32:55.596 [INFO] 09:32:55.596 [INFO] processed queries: 1, speed: 1467.452 queries per minute 09:32:55.596 [INFO] 100.0000% (1/1) queries matched 09:32:55.596 [INFO] done searching 09:32:55.596 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv 09:32:55.596 [INFO] 09:32:55.596 [INFO] elapsed time: 45.230604ms 09:32:55.596 [INFO] Extracting similar sequences for a query gene.\n# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --min-qcov-per-hsp 90 --all -o results.tsv # extract matched sequences as FASTA format sed 1d results.tsv | awk -F\u0026#39;\\t\u0026#39; \u0026#39;{print \u0026#34;\u0026gt;\u0026#34;$5\u0026#34;:\u0026#34;$14\u0026#34;-\u0026#34;$15\u0026#34;:\u0026#34;$16\u0026#34;\\n\u0026#34;$20;}\u0026#39; | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT Exporting blast-like alignment text.\nFrom file:\nlexicmap utils 2blast results.tsv -o results.txt From stdin:\n# align only one long-read \u0026lt;= 500 bp $ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 1 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 
|||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 Output Alignment result relationship Query ├── Subject genome # A query might have one or more genome hits, ├── Subject sequence # in different sequences. ├── High-Scoring segment Pair (HSP) # HSP is an alignment segment. Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs.\nA High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/\nOutput format Tab-delimited format with 17+ columns, with 1-based positions.\n1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. 
qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026quot;|\u0026quot; and \u0026quot; \u0026quot;) between qseq and sseq. (optional with -a/--all) Examples A single-copy gene (SecY) query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------------------------------------- ---- ---- --------------- -------------------- ------- --- ------- ------- ------- ---- ------ ---- ------ ------ ---- ------- lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_000395405.1 NZ_KB947497.1 100.000 1 100.000 1299 100.000 0 1 1299 232279 233577 + 274511 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_019731615.1 NZ_JAASJA010000010.1 100.000 1 100.000 1299 100.000 0 1 1299 2798 4096 + 42998 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCA_004103085.1 RPCL01000012.1 100.000 1 100.000 1299 100.000 0 1 1299 44095 45393 + 84242 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_023571745.1 NZ_JAMKBS010000014.1 100.000 1 100.000 1299 100.000 0 1 1299 44077 45375 + 84206 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_013248625.1 NZ_JABTDK010000002.1 100.000 1 100.000 1299 100.000 0 1 1299 9609 10907 + 
49787 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900092155.1 NZ_FLUS01000006.1 100.000 1 100.000 1299 100.000 0 1 1299 63161 64459 + 77366 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902165815.1 NZ_CABHHZ010000005.1 100.000 1 100.000 1299 100.000 0 1 1299 39386 40684 - 200163 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_014243495.1 NZ_SJAV01000002.1 100.000 1 100.000 1299 100.000 0 1 1299 39085 40383 - 256772 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900148695.1 NZ_FRXS01000009.1 100.000 1 100.000 1299 100.000 0 1 1299 39230 40528 - 96692 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902164645.1 NZ_LR607334.1 100.000 1 100.000 1299 100.000 0 1 1299 236677 237975 + 3380663 A 16S rRNA gene query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen --------------------------- ---- ------ --------------- ----------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- NC_000913.3:4166659-4168200 1542 293398 GCF_002248685.1 NZ_NQBE01000079.1 100.000 1 100.000 1542 100.000 0 1 1542 40 1581 - 99259 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 1 100.000 1542 100.000 0 1 1542 1270211 1271752 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 2 100.000 1542 100.000 0 1 1542 5466287 5467828 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 3 100.000 1543 99.546 2 1 1542 557008 558549 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 4 100.000 1543 99.482 2 1 1542 4473658 4475199 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 5 100.000 1543 99.482 2 1 1542 5154150 5155691 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 6 100.000 1543 99.482 2 1 1542 5195176 5196717 - 5483624 NC_000913.3:4166659-4168200 1542 293398 
GCF_017164795.1 NZ_CP062702.1 100.000 7 100.000 1543 99.482 2 1 1542 5369865 5371406 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701684.1 100.000 1 100.000 1542 100.000 0 1 1542 1108651 1110192 - 1914390 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701686.1 100.000 2 100.000 1542 99.741 0 1 1542 100680 102221 + 102235 A plasmid query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------- ----- ----- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ----- ------- ------- ---- ------- CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 4 0.916 484 91.116 0 51686 52169 27192 27675 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 5 0.829 438 90.868 1 52342 52779 26583 27019 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086534.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 4 0.916 484 91.116 0 51686 52169 3843 4326 + 34058 CP115019.1 52830 58744 GCF_022759905.1 
NZ_CP086547.1 97.473 5 0.829 438 90.868 1 52342 52779 4499 4935 + 34058 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086546.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 1 77.157 40762 99.993 0 12069 52830 9513 50274 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 2 18.033 9528 99.990 1 1207 10733 1 9528 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 3 2.283 1206 100.000 0 1 1206 50275 51480 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058618.1 97.473 4 2.497 1319 100.000 0 25153 26471 3019498 3020816 - 4718403 Long reads Queries are a few Nanopore Q20 reads from a mock metagenomic community.\nquery qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 Search results (TSV format) above are formatted with csvtk pretty.\n","description":"Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output Alignment result relationship Output format Examples TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length"},{"id":5,"href":"/LexicMap/usage/utils/genomes/","title":"genomes","parent":"utils","content":" Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). 
(default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 8) Examples $ lexicmap utils genomes -d demo.lmi/ GCF_000006945.2 GCF_000017205.1 GCF_000148585.2 GCF_000392875.1 GCF_000742135.1 GCF_001027105.1 GCF_001096185.1 GCF_001457655.1 GCF_001544255.1 GCF_002949675.1 GCF_002950215.1 GCF_003697165.2 GCF_006742205.1 GCF_009759685.1 GCF_900638025.1 ","description":"Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). 
If given, they are appended to files from CLI arguments."},{"id":6,"href":"/LexicMap/installation/","title":"Installation","parent":"","content":"LexicMap can be installed via conda, executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, 中国镜像 Linux 64-bit lexicmap_linux_amd64.tar.gz, 中国镜像 Linux arm64 lexicmap_linux_arm64.tar.gz, 中国镜像 Decompress it:\ntar -zxvf lexicmap_linux_amd64.tar.gz If you have the root privilege, simply copy it to /usr/local/bin:\nsudo cp lexicmap /usr/local/bin/ If you don\u0026rsquo;t have the root privilege, copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp lexicmap $HOME/bin/ And optionally add the directory into the environment variable PATH if it\u0026rsquo;s not in.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration MacOS Download the binary file.\nOS Arch File, 中国镜像 macOS 64-bit lexicmap_darwin_amd64.tar.gz, 中国镜像 macOS arm64 lexicmap_darwin_arm64.tar.gz, 中国镜像 Copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp lexicmap $HOME/bin/ And optionally add the directory into the environment variable PATH if it\u0026rsquo;s not in.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration Windows Download the binary file.\nOS Arch File, 中国镜像 Windows 64-bit lexicmap_windows_amd64.exe.tar.gz, 中国镜像 Decompress it.\nCopy lexicmap.exe to 
C:\\WINDOWS\\system32.\nOthers Please open an issue to request binaries for other platforms. Or compiling from the source. Compile from the source Install go.\nwget https://go.dev/dl/go1.22.4.linux-amd64.tar.gz tar -zxf go1.22.4.linux-amd64.tar.gz -C $HOME/ # or # echo \u0026quot;export PATH=$PATH:$HOME/go/bin\u0026quot; \u0026gt;\u0026gt; ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile LexicMap.\n# ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/LexicMap/lexicmap # The executable binary file is located in: # ~/go/bin/lexicmap # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/lexicmap $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/LexicMap cd LexicMap/lexicmap/ go build # The executable binary file is located in: # ./lexicmap # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./lexicmap $HOME/bin/ Shell-completion Supported shell: bash|zsh|fish|powershell\nBash:\n# generate completion shell lexicmap autocompletion --shell bash # configure if never did. # install bash-completion if the \u0026quot;complete\u0026quot; command is not found. 
echo \u0026quot;for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\u0026quot; \u0026gt;\u0026gt; ~/.bash_completion echo \u0026quot;source ~/.bash_completion\u0026quot; \u0026gt;\u0026gt; ~/.bashrc Zsh:\n# generate completion shell lexicmap autocompletion --shell zsh --file ~/.zfunc/_kmcp # configure if never did echo 'fpath=( ~/.zfunc \u0026quot;${fpath[@]}\u0026quot; )' \u0026gt;\u0026gt; ~/.zshrc echo \u0026quot;autoload -U compinit; compinit\u0026quot; \u0026gt;\u0026gt; ~/.zshrc fish:\nlexicmap autocompletion --shell fish --file ~/.config/fish/completions/lexicmap.fish ","description":"LexicMap can be installed via conda, executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, 中国镜像 Linux 64-bit lexicmap_linux_amd64.tar.gz, 中国镜像 Linux arm64 lexicmap_linux_arm64.tar.gz, 中国镜像 Decompress it:\ntar -zxvf lexicmap_linux_amd64.tar.gz If you have the root privilege, simply copy it to /usr/local/bin:"},{"id":7,"href":"/LexicMap/usage/search/","title":"search","parent":"Usage","content":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed. 
Hence, you can search with default parameters and then filter the result with tools like awk or csvtk. Alignment result relationship: Query ├── Subject genome ├── Subject sequence ├── High-Scoring segment Pair (HSP) Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs. \u0026gt; A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the \u0026gt; highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/ Output format: Tab-delimited format with 17+ columns, with 1-based positions. 1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) Usage: lexicmap search [flags] -d \u0026lt;index path\u0026gt; [query.fasta.gz ...] [-o query.tsv.gz] Flags: --align-band int ► Band size in backtracking the score matrix (pseduo alignment phase). 
(default 50) --align-ext-len int ► Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. (default 2000) --align-max-gap int ► Maximum gap in a HSP segment. (default 20) -l, --align-min-match-len int ► Minimum aligned length in a HSP segment. (default 50) -i, --align-min-match-pident float ► Minimum base identity (percentage) in a HSP segment. (default 70) -a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. -h, --help help for search -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --load-whole-seeds ► Load the whole seed data into memory for faster search. --max-open-files int ► Maximum opened files. (default 512) -J, --max-query-conc int ► Maximum number of concurrent queries. Bigger values do not improve the batch searching speed and consume much memory. (default 12) -Q, --min-qcov-per-genome float ► Minimum query coverage (percentage) per genome. -q, --min-qcov-per-hsp float ► Minimum query coverage (percentage) per HSP. -o, --out-file string ► Out file, supports a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) --pseudo-align ► Only perform pseudo alignment, alignment metrics, including qcovGnm, qcovSHP and pident, will be less accurate. --seed-max-dist int ► Max distance between seeds in seed chaining. (default 10000) --seed-max-gap int ► Max gap in seed chaining. (default 500) -p, --seed-min-prefix int ► Minimum (prefix) length of matched seeds. (default 15) -P, --seed-min-single-prefix int ► Minimum (prefix) length of matched seeds if there\u0026#39;s only one pair of seeds matched. (default 17) -n, --top-n-genomes int ► Keep top N genome matches for a query (0 for all) in chaining phase. Global Flags: -X, --infile-list string ► File of input file list (one file per line). 
If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples See Searching ","description":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed."},{"id":8,"href":"/LexicMap/usage/utils/subseq/","title":"subseq","parent":"utils","content":" Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Usage: lexicmap utils subseq [flags] Flags: -h, --help help for subseq -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --line-width int ► Line width of sequence (0 for no wrap). (default 60) -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -n, --ref-name string ► Reference name. 
-r, --region string ► Region of the subsequence (1-based). -R, --revcom ► Extract subsequence on the negative strand. -s, --seq-id string ► Sequence ID. If the value is empty, the positions in the region are treated as that in the concatenated sequence. Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples Extracting subsequence with genome ID, sequence ID, position range and strand information.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R \u0026gt;NZ_CP033092.2:4591684-4593225:- AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA 
AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA If the sequence ID (-s/--seq-id) is not given, the positions are these in the concatenated sequence.\nChecking sequence lengths of a genome with seqkit.\n$ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz NZ_CP033092.2 4903501 NZ_CP033091.2 131333 Extracting the 1000-bp interval sequence inserted by lexicmap index.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 \u0026gt;GCF_003697165.2:4903502-4904501:+ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA It detects if the end position is larger than the sequence length.\n# the 
length of NZ_CP033092.2 is 4903501 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000 \u0026gt;NZ_CP033092.2:4903501-4903501:+ C $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000 \u0026gt;NZ_CP033092.2:4903502-4903501:+ ","description":"Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes."},{"id":9,"href":"/LexicMap/releases/","title":"Releases","parent":"","content":" Latest version v0.4.0 v0.4.0 - 2024-07-xx New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). Fix skipping interval regions by further including the last k-1 bases of contigs. Fix a bug in indexing small genomes. Change the default value of -b, --batch-size from 10,000 to 5,000. Improve lexichash data structure. Write and merge seed data in parallel, new flag -J/--seed-data-threads. Improve the log. lexicmap search: Fix chaining for highly-repetitive regions. Perform more accurate alignment with WFA. Fix object recycling and reduce memory usage. Fix alignment against genomes with many short contigs. Fix early quit when meeting a sequence shorter than k. 
Add a new option -J/--max-query-conc to limit the maximum number of concurrent queries, with a default value of 12 instead of the number of CPUs, which reduces the memory usage in batch searching. Result format: Cluster alignments of each target sequence. Remove the column seeds. Add columns gaps, cigar, align, which can be reformatted with lexicmap utils 2blast. lexicmap utils kmers: Fix the progress bar. Fix a bug where some masks do not have any k-mer. Add a new column prefix to show the length of common prefix between the seed and the probe. Add a new column reversed to indicate if the k-mer is reversed for suffix matching. lexicmap utils masks: Add the support of only outputting a specific mask. lexicmap utils seed-pos: New columns: sseqid and pos_seq. More accurate seed distance. Add histograms of numbers of seeds in sliding windows. lexicmap utils subseq: Fix a bug when the given end position is larger than the sequence length. Add the strand (\u0026quot;+\u0026quot; or \u0026ldquo;-\u0026rdquo;) in the sequence header. Please run lexicmap version to check for updates !!! Please run lexicmap autocompletion to update shell autocompletion script !!! Previous versions v0.3.0 v0.3.0 - 2024-05-14 lexicmap index: Better seed coverage by filling sketching deserts. Use longer (1000bp N\u0026rsquo;s, previous: k-1) intervals between contigs. Fix a concurrency bug between genome data writing and k-mer-value data collecting. Change the format of k-mer-value index file, and fix the computation of index partitions. Optionally save seed positions which can be outputted by lexicmap utils seed-pos. lexicmap search: Improved seed-chaining algorithm. Better support of long queries. Add a new flag -w/--load-whole-seeds for loading the whole seed data into memory for faster search. Parallelize alignment in each query, so it\u0026rsquo;s faster for a single query. Optionally output matched query and subject sequences. 2-5X searching speed with a faster masking method. 
Change output format. Add output of query start and end positions. Fix a target sequence extracting bug. Keep indexes of genome data in memory. lexicmap utils kmers: Fix a little bug, wrong number of k-mers for the second k-mer in each k-mer pair. New commands: lexicmap utils gen-masks for generating masks from the top N largest genomes. lexicmap utils seed-pos for extracting seed positions via reference names. lexicmap utils reindex-seeds for recreating indexes of k-mer-value (seeds) data. lexicmap utils genomes for list genomes IDs in the index. v0.2.0 v0.2.0 - 2024-02-02 Software architecture and index formats are redesigned to reduce searching memory occupation. Indexing: genomes are processed in batches to reduce RAM usage, then indexes of all batches are merged. Searching: seeds matching is performed on disk yet it\u0026rsquo;s ultra-fast. v0.1.0 v0.1.0 - 2024-01-15 The first release. Seed indexing and querying are performed in RAM. GTDB r214 with 10k masks: index size 75GB, RAM: 130GB. ","description":"Latest version v0.4.0 v0.4.0 - 2024-07-xx New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). Fix skipping interval regions by further including the last k-1 bases of contigs."},{"id":10,"href":"/LexicMap/usage/utils/seed-pos/","title":"seed-pos","parent":"utils","content":" Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. 
For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Extra columns: Using -v/--verbose will output more columns: len_aaa, length of consecutive A\u0026#39;s. seq, sequence between the previous and current seed. Figures: Using -O/--plot-dir will write plots into given directory: - Histograms of seed distances. - Histograms of numbers of seeds in sliding windows. Usage: lexicmap utils seed-pos [flags] Flags: -a, --all-refs ► Output for all reference genomes. This would take a long time for an index with a lot of genomes. -b, --bins int ► Number of bins in histograms. (default 100) --color-index int ► Color index (1-7). (default 1) --force ► Overwrite existing output directory. --height float ► Histogram height (unit: inch). (default 4) -h, --help help for seed-pos -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --max-open-files int ► Maximum opened files, used for extracting sequences. (default 512) -D, --min-dist int ► Only output records with seed distance \u0026gt;= this value. -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -O, --plot-dir string ► Output directory for 1) histograms of seed distances, 2) histograms of numbers of seeds in sliding windows. --plot-ext string ► Histogram plot file extention. (default \u0026#34;.png\u0026#34;) -n, --ref-name strings ► Reference name(s). 
-s, --slid-step int ► The step size of sliding windows for counting the number of seeds (default 200) -w, --slid-window int ► The window size of sliding windows for counting the number of seeds (default 500) -v, --verbose ► Show more columns including position of the previous seed and sequence between the two seeds. Warning: it\u0026#39;s slow to extract the sequences, recommend set -D 1000 or higher values to filter results --width float ► Histogram width (unit: inch). (default 6) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples Adding the flag --save-seed-pos in index building.\n$ lexicmap index -I refs/ -O demo.lmi --save-seed-pos --force Listing seed position of one genome.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv $ head -n 10 seed_distance.tsv | csvtk pretty -t ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 16 16 + 15 GCF_000017205.1 NC_009656.1 18 18 + 2 GCF_000017205.1 NC_009656.1 71 71 + 53 GCF_000017205.1 NC_009656.1 74 74 - 3 GCF_000017205.1 NC_009656.1 119 119 - 45 GCF_000017205.1 NC_009656.1 123 123 + 4 GCF_000017205.1 NC_009656.1 154 154 + 31 GCF_000017205.1 NC_009656.1 185 185 + 31 GCF_000017205.1 NC_009656.1 269 269 - 84 Check the biggest seed distances.\n$ csvtk freq -t -f distance seed_distance.tsv \\ | csvtk sort -t -k distance:nr \\ | head -n 10 \\ | csvtk pretty -t distance frequency -------- --------- 199 49 198 47 197 40 196 38 195 54 194 36 193 38 192 55 191 40 Or only list records with seed distances longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -D 190 \\ | csvtk 
pretty -t | head -n 5 ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 13549 13549 + 196 GCF_000017205.1 NC_009656.1 27667 27667 - 190 GCF_000017205.1 NC_009656.1 65318 65318 + 197 Plot histogram of distances between seeds and histogram of number of seeds in sliding windows.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv --plot-dir seed_distance In the plot below, there\u0026rsquo;s a peak at 50 bp, because LexicMap fills sketching deserts with extra k-mers (seeds) of which their distance is 50 bp by default.\nMore columns including sequences between two seeds.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v \\ | head -n4 | csvtk pretty -t -W 40 --clip ref seqid pos_gnm pos_seq strand distance len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 16 16 + 15 2 TTAAAGAGACCGGCG GCF_000017205.1 NC_009656.1 18 18 + 2 0 AT GCF_000017205.1 NC_009656.1 71 71 + 53 6 TCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCA... Or only list records with seed distance longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v -D 190 \\ | head -n 2 \\ | csvtk pretty -t -W 40 ref seqid pos_gnm pos_seq strand distance len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 13549 13549 + 196 15 CGAAGCGGCGCCGGCGGACATGTACGACAAGGACCTGGAT GTCTCGGTGGCCGCCATGAGCCGCGAACTGGCCAAGTATG TACGGGCCTATCCGAGCCAGTACATGTGGAGCATGAAGCG CTTCAAGAACCGCCCGGACGGCGAGAAGAAGTGGTACTGA AAAAAGGCGTCGGAAGACGCCTTTTTCATATCCGGG Listing seed position of all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz Show the number of seed positions in each genome. 
Frequencies larger than 40000 (the number of masks) mean some k-mers can be found in more than one position in a genome.\n$ csvtk freq -t -f ref -nr seed-pos.tsv.gz | csvtk pretty -t ref frequency --------------- --------- GCF_000017205.1 134541 GCF_000742135.1 103771 GCF_003697165.2 92087 GCF_000006945.2 90683 GCF_002950215.1 89638 GCF_002949675.1 84337 GCF_009759685.1 72711 GCF_001027105.1 56737 GCF_000392875.1 55772 GCF_006742205.1 52699 GCF_001544255.1 50000 GCF_900638025.1 46638 GCF_001096185.1 46195 GCF_001457655.1 45822 GCF_000148585.2 44982 Plot the histograms of distances between seeds for all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz \\ --plot-dir seed_distance --force 09:56:34.059 [INFO] creating genome reader pools, each batch with 1 readers... processed files: 15 / 15 [======================================] ETA: 0s. done 09:56:34.656 [INFO] seed positions of 15 genomes(s) saved to seed-pos.tsv.gz 09:56:34.656 [INFO] histograms of 15 genomes(s) saved to seed_distance 09:56:34.656 [INFO] 09:56:34.656 [INFO] elapsed time: 598.080462ms 09:56:34.656 [INFO] $ ls seed_distance/ GCF_000006945.2.png GCF_000742135.1.png GCF_001544255.1.png GCF_006742205.1.png GCF_000006945.2.seed_number.png GCF_000742135.1.seed_number.png GCF_001544255.1.seed_number.png GCF_006742205.1.seed_number.png GCF_000017205.1.png GCF_001027105.1.png GCF_002949675.1.png GCF_009759685.1.png GCF_000017205.1.seed_number.png GCF_001027105.1.seed_number.png GCF_002949675.1.seed_number.png GCF_009759685.1.seed_number.png GCF_000148585.2.png GCF_001096185.1.png GCF_002950215.1.png GCF_900638025.1.png GCF_000148585.2.seed_number.png GCF_001096185.1.seed_number.png GCF_002950215.1.seed_number.png GCF_900638025.1.seed_number.png GCF_000392875.1.png GCF_001457655.1.png GCF_003697165.2.png GCF_000392875.1.seed_number.png GCF_001457655.1.seed_number.png GCF_003697165.2.seed_number.png In the plots below, there\u0026rsquo;s a peak at 150 bp, because LexicMap fills 
sketching deserts with extra k-mers (seeds) of which their distance is 150 bp by default. And they show that the seed number, seed distance and seed density are related to genome sizes.\nGCF_000392875.1 (genome size: 2.9 Mb)\n","description":"Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;."},{"id":11,"href":"/LexicMap/tutorials/","title":"Tutorials","parent":"","content":"","description":""},{"id":12,"href":"/LexicMap/usage/utils/","title":"utils","parent":"Usage","content":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line)."},{"id":13,"href":"/LexicMap/usage/utils/reindex-seeds/","title":"reindex-seeds","parent":"utils","content":" Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. (default 512) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples $ lexicmap utils reindex-seeds -d demo.lmi/ --partitions 1024 10:20:29.150 [INFO] recreating seed indexes with 1024 partitions for: demo.lmi/ processed files: 16 / 16 [======================================] ETA: 0s. 
done 10:20:29.166 [INFO] update index information file: demo.lmi/info.toml 10:20:29.166 [INFO] finished updating the index information file: demo.lmi/info.toml 10:20:29.166 [INFO] 10:20:29.166 [INFO] elapsed time: 15.981266ms 10:20:29.166 [INFO] ","description":"Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. (default 512) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments."},{"id":14,"href":"/LexicMap/usage/","title":"Usage","parent":"","content":"","description":""},{"id":15,"href":"/LexicMap/faqs/","title":"FAQs","parent":"","content":" Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.\nDoes LexicMap support fungi genomes? Yes. LexicMap mainly supports small genomes including prokaryotic, viral, and plasmid genomes. Fungi can also be supported, just remember to increase the value of -g/--max-genome when running lexicmap index, which is used to skip genomes larger than 15Mb by default.\n-g, --max-genome int ► Maximum genome size. 
Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. (default 15000000) Maximum genome size is about 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nFor big and complex genomes, like the human genome (chr1 is ~248 Mb) which has many repetitive sequences, LexicMap would be slow to align.\nHow\u0026rsquo;s the hardware requirement? For index building. See details hardware requirement. For searching. See details hardware requirement. Can I extract the matched sequences? Yes, lexicmap search has a flag\n-a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. to output CIGAR string, aligned query and subject sequences.\n18. cigar, CIGAR string of the alignment (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) And lexicmap utils 2blast can help to convert the tabular format to Blast-style format, see examples.\nHow can I extract the upstream and downstream flanking sequences of matched regions? lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions. So you can use this information from the search result and expand the region positions to extract flanking sequences.\nWhy isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? It happens if there are some degenerate bases (e.g., N) in the query sequence. In the indexing step, all degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. 
Query sequences, however, are not converted.\nWhy is LexicMap slow for batch searching? LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 16 million) of genomes.\nlexicmap search has a flag -w/--load-whole-seeds to load the whole seed data into memory for faster search.\nFor example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters. lexicmap search also has a flag --pseudo-align to only perform pseudo alignment, which is slightly faster and uses less memory. It can be used in searching with long and divergent query sequences like nanopore long-reads.\nClick to read more detail of the usage.\n","description":"Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size."},{"id":16,"href":"/LexicMap/notes/","title":"Notes","parent":"","content":"","description":""},{"id":17,"href":"/LexicMap/","title":"","parent":"","content":" LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: 
indexing\nlexicmap index -I genomes/ -O db.lmi Step 2: searching\nlexicmap search -d db.lmi q.fasta -o r.tsv Tutorials Usages FAQs Notes Accurate and efficient alignment Using LexicMap to search in the whole 2,340,672 Genbank+Refseq prokaryotic genomes with 48 CPUs.\nQuery Genome hits Time RAM A 1.3-kb marker gene 36,633 21s 3.4 GB A 1.5-kb 16S rRNA 1,928,372 6m40s 16.7 GB A 52.8-kb plasmid 551,264 8m54s 20.1 GB 1003 AMR genes 27,577,060 5h18m 41.3 GB Blastn is unable to run with the same dataset on common servers as it requires \u0026gt;2000 GB RAM.\nPerformance ","description":"LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: indexing"},{"id":18,"href":"/LexicMap/usage/utils/2blast/","title":"2blast","parent":"utils","content":" Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G. 
You need increase the value when \u0026#34;bufio.Scanner: token too long\u0026#34; error reported (default \u0026#34;20M\u0026#34;) -h, --help help for 2blast -i, --ignore-case ► Ignore cases of sgenome and sseqid -g, --kv-file-genome string ► Two-column tabular file for mapping the target genome ID (sgenome) to the corresponding value -s, --kv-file-seq string ► Two-column tabular file for mapping the target sequence ID (sseqid) to the corresponding value -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples From stdin.\n$ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 2 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_000017205.1_r160 Length = 478 [Subject genome #1/1] = GCF_000017205.1 Pseudomonas aeruginosa Query coverage per genome = 95.188% \u0026gt;NC_009656.1 Length = 6588339 HSP #1 Query coverage per seq = 95.188%, Aligned length = 463, Identities = 95.680%, Gaps = 12 Query range = 13-467, Subject range = 4866862-4867320, Strand = Plus/Plus Query 13 CCTCAAACGAGTCC-AACAGGCCAACGCCTAGCAATCCCTCCCCTGTGGGGCAGGGAAAA 71 |||||||||||||| |||||||| |||||| | ||||||||||||| |||||||||||| Sbjct 4866862 CCTCAAACGAGTCCGAACAGGCCCACGCCTCACGATCCCTCCCCTGTCGGGCAGGGAAAA 4866921 Query 72 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCATCTTCCACGGTGCCC 131 |||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| Sbjct 4866922 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCAT-TTCCACGGTGCCC 4866980 Query 132 
GCCCACGGCGGACCCGCGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTA-CA 190 ||| ||||||||||| ||||||||||||||||||||||||||||||||||||||||| || Sbjct 4866981 GCC-ACGGCGGACCC-CGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTATCA 4867038 Query 191 CTCGGCGTTCGGCCAGCGACAGC---GACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 247 |||||||| |||||||||||||| |||||||||||||||||||||||||||||||||| Sbjct 4867039 CTCGGCGT-CGGCCAGCGACAGCAGCGACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 4867097 Query 248 AGGTGGTGCGCTCGCTGAC-AAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCG- 305 ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 4867098 AGGTGGTGCGCTCGCTGACGAAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCGG 4867157 Query 306 TGCCGGACAGCCCGTGGCCGCCGAACAGTTGCACGCCCACCACCGCGCCGAT-TGGTTTC 364 |||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| | Sbjct 4867158 TGCCGGACAGCCCGTGGCCGCCGAACGGTTGCACGCCCACCACCGCGCCGATCTGGTTGC 4867217 Query 365 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTGGATGCGGCGGGCGGTTTCCTCGT 424 |||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| Sbjct 4867218 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTCGATGCGGCGGGCGGTTTCCTCGT 4867277 Query 425 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 467 ||||||||||||||||||||||||||||||||||||||||||| Sbjct 4867278 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 4867320 Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 
GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 From file.\n$ lexicmap utils 2blast r.lexicmap.tsv -o r.lexicmap.txt ","description":"Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. 
Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G."},{"id":19,"href":"/LexicMap/tutorials/index/","title":"Building an index","parent":"Tutorials","content":" Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz Run: From a directory with multiple genome files:\nlexicmap index -I genomes/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files.txt -O db.lmi Input Genome size\nLexicMap is mainly suitable for small genomes like Archaea, Bacteria, Viruses and plasmids.\nMaximum genome size: 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nSequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names.\nFile type: FASTA/Q files, in plain text or gzip/xz/zstd/bzip2 compressed formats. File name: \u0026ldquo;Genome ID\u0026rdquo; + \u0026ldquo;File extension\u0026rdquo;. E.g., GCF_000006945.2.fna.gz. Genome ID: they should be distinct for accurate result interpretation, which will be shown in the search result. File extension: a regular expression set by the flag -N/--ref-name-regexp is used to extract genome IDs from the file name. The default value supports common sequence file extensions, e.g., .fa, .fasta, .fna, .fa.gz, .fasta.gz, .fna.gz, fasta.xz, fasta.zst, and fasta.bz2. brename can help to batch rename files safely. If you don\u0026rsquo;t want to change the original file names, you can Create and change to a new directory. Create symbolic links (ln -s) for all genome files. 
Batch rename all the symbolic links with brename. Use this directory as input via the flag -I/--in-dir. Sequences: Only DNA or RNA sequences are supported. Sequence IDs should be distinct for accurate result interpretation, which will be shown in the search result. One or more sequences in each file are allowed. Unwanted sequences can be filtered out by regular expressions from the flag -B/--seq-name-filter. Genome size limit. Some non-isolate assemblies might have extremely large genomes, e.g., GCA_000765055.1 has \u0026gt;150 Mb. The flag -g/--max-genome (default 15 Mb) is used to skip these input files, and the file list would be written to a file via the flag -G/--big-genomes. Minimum sequence length. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). At most 17,179,869,184 (2^34) genomes are supported. For more genomes, just build multiple indexes. Input files can be given via one of the following ways:\nPositional arguments. For a few input files. A file list via the flag -X/--infile-list with one file per line. It can be STDIN (-), e.g., you can filter a file list and pass it to lexicmap index. The flag -S/--skip-file-check is optional for skipping input file checking if you believe these files do exist. A directory containing input files via the flag -I/--in-dir. Multiple-level directories are supported. Directory and file symlinks are followed. Hardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 100 GB) is preferred. The memory usage in index building is mainly related to: The number of masks (-m/--masks, default 40,000). The number of genomes. 
The divergence between genome sequences. Diverse genomes consume more memory. The genome batch size (-b/--batch-size, default 5,000). This is the main parameter to adjust memory usage. The maximum seed distance or the maximum sketching desert size (-D/--seed-max-desert, default 200), and the distance of k-mers to fill deserts (-d/--seed-in-desert-dist, default 50). Bigger -D/--seed-max-desert values decrease the search sensitivity for distant targets, speed up indexing, decrease the indexing memory occupation and decrease the index size. The alignment speed is almost unaffected. If the RAM is not sufficient, please: Use a smaller genome batch size. It decreases indexing memory occupation and has little effect on searching performance. Use a smaller number of masks, e.g., 20,000 performs well for small genomes (\u0026lt;=5 Mb). And if the queries are long (\u0026gt;= 2kb), there\u0026rsquo;s little effect on the alignment results. Disk More (\u0026gt;2 TB) is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples. Note that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genomes bring higher compression rates. SSD disks are preferred, while HDD disks are also fast enough. Algorithm Generating m LexicHash masks.\nGenerate m prefixes. Generating all permutations of p-bp prefixes that can cover all possible k-mers, p is the biggest value for 4^p \u0026lt;= m (desired number of masks), e.g., p=7 for 40,000 masks. Removing low-complexity prefixes. E.g., 16176 out of 16384 (4^7) prefixes are left. Duplicating these prefixes to m prefixes. For each prefix, Randomly generating the remaining k-p bases. If the P-prefix (-p/--seed-min-prefix) is of low-complexity, re-generating. P is the minimum length of substring matches, default 15. 
If the mask is duplicated, re-generating. Building an index for each genome batch (-b/--batch-size, default 5,000, max 131,072).\nFor each genome file in a genome batch. Optionally discarding sequences via regular expression (-B/--seq-name-filter). Skipping genomes bigger than the value of -g/--max-genome. Concatenating all sequences, with intervals of 1000-bp N\u0026rsquo;s. Capturing the most similar k-mer (in non-gap and non-interval regions) for each mask and recording the k-mer and its location(s) and strand information. Base N is treated as A. Filling sketching deserts (genome regions longer than --seed-max-desert without any captured k-mers/seeds). In a sketching desert, not a single k-mer is captured because there\u0026rsquo;s another k-mer in another place which shares a longer prefix with the mask. As a result, for a query similar to sequences in this region, all captured k-mers can’t match the correct seeds. For a desert region (start, end), masking the extended region (start-1000, end+1000) with the masks. Starting from start, roughly every --seed-in-desert-dist (default 50) bp, finding a k-mer which is captured by some mask, and adding the k-mer and its position information into the index of that mask. Saving the concatenated genome sequence (bit-packed, 2 bits for one base, N is treated as A) and genome information (genome ID, size, and lengths of all sequences) into the genome data file, and creating an index file for the genome data file for fast random subsequence extraction. Duplicating and reversing all k-mers, and saving each reversed k-mer along with the duplicated position information in the seed data of the closest (sharing the longest prefix) mask. This is for suffix matching of seeds. Compressing k-mers and the corresponding data (k-mer-data, or seeds data, including genome batch, genome number, location, and strand) into chunks of files, and creating an index file for each k-mer-data file for fast seeding. 
Writing summary information into info.toml file. Merging indexes of multiple batches.\nFor each k-mer-data chunk file (belonging to a list of masks), serially reading data of each mask from all batches, merging them and writing to a new file. For genome data files, just moving them. Concatenating genomes.map.bin, which maps each genome ID to its batch ID and index in the batch. Updating the index summary file. Parameters Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up indexing, reduce the indexing memory occupation and reduce the index size, while the alignment speed is almost unaffected.\nFlags in bold text are important and frequently used.\nGenome batches Flag Value Function Comment -b/--batch-size Max: 131072, default: 5000 Maximum number of genomes in each batch If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. LexicHash mask generation Flag Value Function Comment -M/--mask-file A file File with custom masks File with custom masks, which could be exported from an existing index or newly generated by \u0026ldquo;lexicmap utils masks\u0026rdquo;. This flag overrides -k/--kmer, -m/--masks, -s/--rand-seed, etc. 
-k/--kmer Max: 32, default: 31 K-mer size ■ Bigger values improve the search specificity and do not increase the index size. -m/--masks Default: 40,000 Number of masks ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=10,000 is high enough. -p/--seed-min-prefix Max: 32, Default: 15 Minimum length of shared substrings (anchors) in searching This value is used to remove masks with a prefix of low-complexity. Seeds (k-mer-value) data Flag Value Function Comment --seed-max-desert Default: 200 Maximum length of distances between seeds The default value of 200 guarantees queries \u0026gt;200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist (50 by default) bases. ■ Bigger values decrease the search sensitivity for distant targets, speed up indexing, reduce the indexing memory occupation and reduce the index size, while the alignment speed is almost unaffected. -c/--chunks Maximum: 128, default: #CPUs Number of seed file chunks Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. -J/--seed-data-threads Maximum: -c/--chunks, default: 8 Number of threads for writing seed data and merging seed chunks from all batches ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. --partitions Default: 512 Number of partitions for indexing each seed file Bigger values bring slightly higher memory occupation. 512 is a good value with high searching speed; larger or smaller values would decrease the speed in lexicmap search. 
► After indexing, lexicmap utils reindex-seeds can be used to reindex the seeds data with another value of this flag. --max-open-files Default: 512 Maximum number of open files It\u0026rsquo;s only used in merging indexes of multiple genome batches. Also see the usage of lexicmap index.\nSteps We use a small dataset for demonstration.\nPreparing the test genomes (15 bacterial genomes) in the refs directory.\nNote that the genome files contain the assembly accessions (ID) in the file names.\ngit clone https://github.com/shenwei356/LexicMap cd LexicMap/demo/ ls refs/ GCF_000006945.2.fa.gz GCF_000392875.1.fa.gz GCF_001096185.1.fa.gz GCF_002949675.1.fa.gz GCF_006742205.1.fa.gz GCF_000017205.1.fa.gz GCF_000742135.1.fa.gz GCF_001457655.1.fa.gz GCF_002950215.1.fa.gz GCF_009759685.1.fa.gz GCF_000148585.2.fa.gz GCF_001027105.1.fa.gz GCF_001544255.1.fa.gz GCF_003697165.2.fa.gz GCF_900638025.1.fa.gz Building an index with genomes from a directory.\nlexicmap index -I refs/ -O demo.lmi It would take about 3 seconds and 2 GB RAM in a 16-CPU PC.\nOptionally, we can also use a file list as the input.\n$ head -n 3 files.txt refs/GCF_000006945.2.fa.gz refs/GCF_000017205.1.fa.gz refs/GCF_000148585.2.fa.gz lexicmap index -X files.txt -O demo.lmi Click to show the log of a demo run. ... # here we set a small --batch-size 5 $ lexicmap index -I refs/ -O demo.lmi --batch-size 5 16:22:49.745 [INFO] LexicMap v0.4.0 (14c2606) 16:22:49.745 [INFO] https://github.com/shenwei356/LexicMap 16:22:49.745 [INFO] 16:22:49.745 [INFO] checking input files ... 
16:22:49.745 [INFO] 15 input file(s) given 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ main parameters ] --------------------- 16:22:49.745 [INFO] 16:22:49.745 [INFO] input and output: 16:22:49.745 [INFO] input directory: refs/ 16:22:49.745 [INFO] regular expression of input files: (?i)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expressions for filtering out sequences: [] 16:22:49.745 [INFO] max genome size: 15000000 16:22:49.745 [INFO] output directory: demo.lmi 16:22:49.745 [INFO] 16:22:49.745 [INFO] mask generation: 16:22:49.745 [INFO] k-mer size: 31 16:22:49.745 [INFO] number of masks: 40000 16:22:49.745 [INFO] rand seed: 1 16:22:49.745 [INFO] prefix length for checking low-complexity in mask generation: 15 16:22:49.745 [INFO] 16:22:49.745 [INFO] seed data: 16:22:49.745 [INFO] maximum sketching desert length: 450 16:22:49.745 [INFO] distance of k-mers to fill deserts: 150 16:22:49.745 [INFO] seeds data chunks: 16 16:22:49.745 [INFO] seeds data indexing partitions: 512 16:22:49.745 [INFO] 16:22:49.745 [INFO] general: 16:22:49.745 [INFO] genome batch size: 5 16:22:49.745 [INFO] batch merge threads: 8 16:22:49.745 [INFO] 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ generating masks ] --------------------- 16:22:50.180 [INFO] 16:22:50.180 [INFO] --------------------- [ building index ] --------------------- 16:22:50.328 [INFO] 16:22:50.328 [INFO] ------------------------[ batch 1/3 ]------------------------ 16:22:50.328 [INFO] building index for batch 1 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:51.192 [INFO] writing seeds... 
16:22:51.264 [INFO] finished writing seeds in 71.756662ms 16:22:51.264 [INFO] finished building index for batch 1 in: 935.464336ms 16:22:51.264 [INFO] 16:22:51.264 [INFO] ------------------------[ batch 2/3 ]------------------------ 16:22:51.264 [INFO] building index for batch 2 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:53.126 [INFO] writing seeds... 16:22:53.212 [INFO] finished writing seeds in 86.823785ms 16:22:53.212 [INFO] finished building index for batch 2 in: 1.948770015s 16:22:53.212 [INFO] 16:22:53.212 [INFO] ------------------------[ batch 3/3 ]------------------------ 16:22:53.212 [INFO] building index for batch 3 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:54.350 [INFO] writing seeds... 16:22:54.437 [INFO] finished writing seeds in 87.058101ms 16:22:54.437 [INFO] finished building index for batch 3 in: 1.224414126s 16:22:54.437 [INFO] 16:22:54.437 [INFO] merging 3 indexes... 16:22:54.437 [INFO] [round 1] 16:22:54.437 [INFO] batch 1/1, merging 3 indexes to demo.lmi.tmp/r1_b1 with 8 threads... 
16:22:54.613 [INFO] [round 1] finished in 175.640164ms 16:22:54.613 [INFO] rename demo.lmi.tmp/r1_b1 to demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] finished building LexicMap index from 15 files with 40000 masks in 4.875616203s 16:22:54.620 [INFO] LexicMap index saved: demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] elapsed time: 4.875654824s 16:22:54.620 [INFO] Output The LexicMap index is a directory with multiple files.\nFile structure $ tree demo.lmi/ demo.lmi/ # the index directory ├── genomes # directory of genome data │ └── batch_0000 # genome data of one batch │ ├── genomes.bin # genome data file, containing genome ID, size, sequence lengths, bit-packed sequences │ └── genomes.bin.idx # index of genome data file, for fast subsequence extraction ├── seeds # seed data: pairs of k-mer and its location information (genome batch, genome number, location, strand) │ ├── chunk_000.bin # seed data file │ ├── chunk_000.bin.idx # index of seed data file, for fast seed searching and data extraction ... ... ... │ ├── chunk_015.bin # the number of chunks is set by flag `-c/--chunks`, default: #cpus │ └── chunk_015.bin.idx ├── genomes.map.bin # mapping genome ID to batch number of genome number in the batch ├── info.toml # summary of the index └── masks.bin # mask data Index size LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance.\nNote that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. 
Because the seed data are compressed with the VARINT-GB algorithm, more genomes bring higher compression rates.\nDemo data # 15 genomes demo.lmi/: 59.55 MB 46.31 MB seeds 12.93 MB genomes 312.53 KB masks.bin 375.00 B genomes.map.bin 322.00 B info.toml GTDB repr # 85,205 genomes/ gtdb_repr.lmi: 212.58 GB 145.79 GB seeds 66.78 GB genomes 2.03 MB genomes.map.bin 312.53 KB masks.bin 328.00 B info.toml GTDB complete # 402,538 genomes gtdb_complete.lmi: 905.95 GB 542.97 GB seeds 362.98 GB genomes 9.60 MB genomes.map.bin 312.53 KB masks.bin 329.00 B info.toml Genbank\u0026#43;RefSeq # 2,340,672 genomes genbank_refseq.lmi: 4.94 TB 2.77 TB seeds 2.17 TB genomes 55.81 MB genomes.map.bin 312.53 KB masks.bin 331.00 B info.toml AllTheBacteria HQ # 1,858,610 genomes atb_hq.lmi: 3.88 TB 2.11 TB seeds 1.77 TB genomes 39.22 MB genomes.map.bin 312.53 KB masks.bin 331.00 B info.toml Directory/file sizes are counted with https://github.com/shenwei356/dirsize. Index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Explore the index lexicmap utils genomes can list genome IDs of indexed genomes, see the usage and example. lexicmap utils masks can list masks of the index, see the usage and example. lexicmap utils kmers can list details of all seeds (k-mers), including reference, location(s) and the strand; see the usage and example. lexicmap utils seed-pos can help to explore the seed positions, see the usage and example. Before that, the flag --save-seed-pos needs to be added to lexicmap index. lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions, see the usage and example. 
What\u0026rsquo;s next: Searching ","description":"Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz Run: From a directory with multiple genome files:\nlexicmap index -I genomes/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files."},{"id":20,"href":"/LexicMap/usage/lexicmap/","title":"lexicmap","parent":"Usage","content":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Use \u0026#34;lexicmap [command] --help\u0026#34; for more information about a command. 
","description":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line)."},{"id":21,"href":"/LexicMap/notes/motivation/","title":"Motivation","parent":"Notes","content":" BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information.\n","description":"BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. 
For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information."},{"id":22,"href":"/LexicMap/tags/","title":"Tags","parent":"","content":"","description":""}] \ No newline at end of file diff --git a/tutorials/index/index.html b/tutorials/index/index.html index 36a8f25..9ae2fd6 100644 --- a/tutorials/index/index.html +++ b/tutorials/index/index.html @@ -65,7 +65,7 @@ "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/index/", "headline": "Building an index", "description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA\/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz Run: From a directory with multiple genome files:\nlexicmap index -I genomes\/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files.", - "wordCount" : "2744", + "wordCount" : "2763", "inLanguage": "en", "isFamilyFriendly": "true", "mainEntityOfPage": { @@ -1545,6 +1545,7 @@

Building an index

>GCA_000765055.1 has >150 Mb. The flag -g/--max-genome (default 15 Mb) is used to skip these input files, and the file list would be written to a file via the flag -G/--big-genomes. +
  • Minimum sequence length. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value).
  • At most 17,179,869,184 (2^34) genomes are supported. For more genomes, just build multiple indexes.
  • diff --git a/usage/index/index.html b/usage/index/index.html index e97575b..7a40c44 100644 --- a/usage/index/index.html +++ b/usage/index/index.html @@ -59,7 +59,7 @@ "url" : "https://bioinf.shenwei.me/LexicMap/usage/index/", "headline": "index", "description": "$ lexicmap index -h Generate an index from FASTA\/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA\/Q files, with reference identifiers in the file names. 2. Input plain or gzip\/xz\/zstd\/bzip2 compressed FASTA\/Q files can be given via positional arguments or the flag -X\/--infile-list with a list of input files. Flag -S\/--skip-file-check is optional for skipping file checking if you trust the file list. 3. Input can also be a directory containing sequence files via the flag -I\/--in-dir, with multiple-level sub-directories allowed.", - "wordCount" : "1278", + "wordCount" : "1324", "inLanguage": "en", "isFamilyFriendly": "true", "mainEntityOfPage": { @@ -1436,6 +1436,7 @@

    index

    5. Maximum genome size: 268,435,456. More precisely: $total_bases + ($num_contigs - 1) * 1000 <= 268,435,456, as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index. + 6. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). Attention: *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome @@ -1539,6 +1540,9 @@

    index

    assemblies from Genbank) will be skipped. Need to be smaller than the maximum supported genome size: 268435456 (default 15000000) --max-open-files int ► Maximum opened files, used in merging indexes. (default 512) + -l, --min-seq-len int ► Minimum sequence length to index. The value would be k for values + <= 0 (default -1) + --no-desert-filling ► Disable sketching desert filling (only for debug). -O, --out-dir string ► Output LexicMap index directory. --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files. (default 512) diff --git a/usage/utils/kmers/index.html b/usage/utils/kmers/index.html index 4719f65..af31b54 100644 --- a/usage/utils/kmers/index.html +++ b/usage/utils/kmers/index.html @@ -59,7 +59,7 @@ "url" : "https://bioinf.shenwei.me/LexicMap/usage/utils/kmers/", "headline": "kmers", "description": "$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0027s. 4. Reversed means if the k-mer is reversed for suffix matching. Usage: lexicmap utils kmers [flags] -d \u003cindex path\u003e [-m \u003cmask index\u003e] [-o out.", - "wordCount" : "1003", + "wordCount" : "1197", "inLanguage": "en", "isFamilyFriendly": "true", "mainEntityOfPage": { @@ -1443,6 +1443,7 @@

    kmers

    -h, --help help for kmers -d, --index string ► Index directory created by "lexicmap index". -m, --mask int ► View k-mers captured by Xth mask. (0 for all) (default 1) + -f, --only-forward ► Only output forward k-mers. -o, --out-file string ► Out file, supports and recommends a ".gz" suffix ("-" for stdout). (default "-") @@ -1489,6 +1490,30 @@

    kmers

    1 AAAAAAAACCATATTATGTCCGATCCTCACA 4 1 GCF_000392875.1 1060650 + yes 1 AAAAAAAACCCTTCGTCAAGCATTATGGAAT 4 1 GCF_000392875.1 1139573 - yes +

    Only forward k-mers.

    +
     $ lexicmap utils kmers --quiet -d demo.lmi/ -f | head -n 20 | csvtk pretty -t
    + mask   kmer                              prefix   number   ref               pos       strand   reversed
    + ----   -------------------------------   ------   ------   ---------------   -------   ------   --------
    + 1      AAAACACCAAAAGCCTCTCCGATAACACCAG   9        1        GCF_002949675.1   2046311   +        no
    + 1      AAAACACCAAAGTTAAAGTGCCGTTTAGCGT   9        1        GCF_003697165.2   1085073   +        no
    + 1      AAAACACCAATTAGTGATTGTGTTTCCTCAA   9        1        GCF_000392875.1   2785764   -        no
    + 1      AAAACACCACAGTGAAAGACAACATTTAATA   9        1        GCF_000392875.1   1132052   -        no
    + 1      AAAACACCACCACAAATGCATAAGAAAACTT   9        1        GCF_003697165.2   2862670   +        no
    + 1      AAAACACCACTCAATCCTTTAAATAAAAACA   9        1        GCF_002949675.1   2467828   -        no
    + 1      AAAACACCACTTTACGGGCGTTTTGTGCAAT   9        1        GCF_003697165.2   4241904   -        no
    + 1      AAAACACCAGCACGTTCAGCACCGCCACCAG   9        1        GCF_000017205.1   4399207   -        no
    + 1      AAAACACCAGCGAACGGAAGAACATCGCGAT   9        1        GCF_003697165.2   248663    +        no
    + 1      AAAACACCAGGCCGGAGCAGAAGGTTATTCT   9        1        GCF_003697165.2   4139632   +        no
    + 1      AAAACACCATAAACGATTGTTGGAATACCCG   10       1        GCF_009759685.1   268158    +        no
    + 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   496925    +        no
    + 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   2254974   +        no
    + 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   2495183   +        no
    + 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   4009312   +        no
    + 1      AAAACACCATGAACGCCAACGCCGCCGAGCT   11       1        GCF_000742135.1   2707622   +        no
    + 1      AAAACACCATGAGCAAACTCCAGCATATCGG   11       1        GCF_000017205.1   2490011   -        no
    + 1      AAAACACCATGCAAAAAACTTCTTTTAGAAA   11       1        GCF_000006945.2   1324151   -        no
    + 1      AAAACACCATGCAGCATGTCATAGCGCTGGA   11       1        GCF_003697165.2   422685    +        no
    +
  • Specify the mask.