From d920bdc33963e3bf800aa291eebee3283ba5d8a2 Mon Sep 17 00:00:00 2001
From: MatthewRalston
Date: Sat, 3 Feb 2024 21:39:47 -0500
Subject: [PATCH] Updates to README, RELEASE_NOTES officially added. '.release.sh' included. minor update to .gitignore.

---
 .gitignore        |  3 ++-
 .release.sh       | 22 +++++++++++++++++++++
 README.md         | 49 ++++++++++++++++++++++++++++++-----------------
 RELEASE_NOTES.txt | 27 ++++++++++++++++++++++++++
 4 files changed, 82 insertions(+), 19 deletions(-)
 create mode 100644 .release.sh
 create mode 100644 RELEASE_NOTES.txt

diff --git a/.gitignore b/.gitignore
index 5a4bcf1..d1e3080 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,4 +8,5 @@ kdb/__pycache__
 **.kdb
 examples/example_report/*.png
 examples/example_report
-test/data
\ No newline at end of file
+test/data
+pypi_token.foo
diff --git a/.release.sh b/.release.sh
new file mode 100644
index 0000000..b662a67
--- /dev/null
+++ b/.release.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+cat < A Python CLI and module for k-mer profiles, similarities, and graph databases
-NOTE: This project is in alpha stage. Development is ongoing. But feel free to clone the repository and play with the code for yourself.
+NOTE: This project is in beta stage. Development is ongoing. But feel free to clone the repository and play with the code for yourself.
 ## Development Status
 [![Downloads](https://static.pepy.tech/personalized-badge/kmerdb?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=Downloads)](https://pypi.org/project/kmerdb)
@@ -20,12 +20,12 @@ NOTE: This project is in alpha stage. Development is ongoing. But feel free to c
 ## Summary
-KDB is a Python library designed for bioinformatics applications. It addresses the ['k-mer' problem](https://en.wikipedia.org/wiki/K-mer) (substrings of length k) in a simple and performant manner. It stores the k-mer counts/abundances and total counts. You can think of the current form as a "pre-index", as it includes all the essential information for indexing on any field in the landscape of k-mer to sequence relationships. One restriction is that k-mers with unspecified sequence residues 'N' create gaps in the k-mer to sequence relationship space, and are excluded. That said, non-standard IUPAC residues are supported.
+KDB is a Python library designed for bioinformatics applications. It addresses the ['k-mer' problem](https://en.wikipedia.org/wiki/K-mer) (substrings of length k) in a simple and performant manner. It stores the k-mer counts/abundances in a columnar format, with input file checksums, total counts, nullomers, and mononucleotide counts in a YAML formatted header in the first block of the `bgzf` formatted `.kdb` file. One restriction is that k-mers with unspecified sequence residues 'N' create gaps in the k-mer to sequence relationship space, and are excluded. That said, non-standard IUPAC residues are supported.
 Please see the [Quickstart guide](https://matthewralston.github.io/kmerdb/quickstart) for more information about the format, the library, and the project.
-The k-mer spectrum of the fasta or fastq sequencing data is stored in the `.kdb` format spec, a bgzf file similar to `.bam`. For those familiar with `.bam`, a `view` and `header` functions are provided to decompress a `.kdb` file into a standard output stream.
+The k-mer spectrum of the fasta or fastq sequencing data is stored in the `.kdb` format spec, a bgzf file similar to `.bam`. For those familiar with `.bam`, a `view` and `header` functions are provided to decompress a `.kdb` file into a standard output stream. The output file is compatible with `zlib`.
@@ -79,23 +79,34 @@ See `python -m kmerdb -h` for details.
 ```bash
 kmerdb --help
-# Build a [composite] profile to a new or existing .kdb file
+# Build a [composite] profile to a new .kdb file
 kmerdb profile -k 8 example1.fq.gz example2.fq.gz profile.8.kdb
+# Note: zlib compatibility
+zcat profile.8.kdb
+
 # View the raw data
 kmerdb view profile.8.kdb # -H for full header
 # View the header
 kmerdb header profile.8.kdb
-# Collate the files
-kmerdb matrix -p $cores pass *.8.kdb
+# Collate the files. See 'kmerdb matrix -h' for more information.
+# Note: the 'pass' subcommand passes the int counts through collation, without normalization.
+# In this case the shell interprets '*.8.kdb' as all 8-mer profiles in the current working directory.
+# The k-mer profiles are read in parallel (-p $cores), and collated into one Pandas dataframe, which is printed to STDOUT.
+# Other options include DESeq2 normalization, frequency matrix, or PCA|tSNE based dimensionality reduction techniques.
+kmerdb matrix -p $cores pass *.8.kdb > kmer_count_dataframe.tsv
 # Calculate similarity between two (or more) profiles
-kmerdb distance correlation profile1.kdb profile2.kdb (...)
+# The correlation distance from Numpy is used on one or more profiles, or piped output from 'kmerdb matrix'.
+kmerdb distance correlation profile1.kdb profile2.kdb (...) > distance.tsv
+
+# A condensed, one-line invocation of the matrix and distance command using the bash shell's pipe mechanism is as follows.
+kmerdb matrix pass *.8.kdb | kmerdb distance correlation STDIN > distance.tsv
 ```
-## Usage note:
+## IUPAC support:
 ```bash
 kmerdb profile -k $k input.fa output.kdb # This may discard non-IUPAC characters, this feature lacks documentation!
@@ -132,9 +143,9 @@ python setup.py test
 ## License
-Created by Matthew Ralston - [Scientist, Programmer, Musician](http://matthewralston.github.io) - [Email](mailto:mrals89@gmail.com)
+Created by Matthew Ralston - [Scientist, Programmer, Musician](http://matthewralston.github.io) - [Email](mailto:mralston.development@gmail.com)
-Distributed under the Apache license. See `LICENSE.txt` for the copy distributed with this project. Open source software is not for everyone, but for those of us starting out and trying to put the ecosystem ahead of ego, we march into the information age with this ethos.
+Distributed under the Apache license. See `LICENSE.txt` for the copy distributed with this project. Open source software is not for everyone, but we march into the information age with this ethos. I have the patent rights to this software. You may use and distribute this software, gratis, so long as the original LICENSE.txt is distributed along with the software. This software is distributed AS IS and provides no warranties of any kind.
 ```
 Copyright 2020 Matthew Ralston
@@ -154,7 +165,7 @@ Distributed under the Apache license. See `LICENSE.txt` for the copy distributed
 ## Contributing
-1. Fork it ()
+1. Fork it ()
 2. Create your feature branch (`git checkout -b feature/fooBar`)
 3. Commit your changes (`git commit -am 'Add some fooBar'`)
 4. Push to the branch (`git push origin feature/fooBar`)
@@ -164,22 +175,24 @@ Distributed under the Apache license. See `LICENSE.txt` for the copy distributed
 Thank you to the authors of kPAL and Jellyfish for the early inspiration. And thank you to others for the encouragement along the way, who shall remain nameless.
 I wanted this library to be a good strategy for assessing these k-mer profiles, in a way that is both cost aware of the analytical tasks at play, capable of storing the exact profiles in sync with the current assemblies, and then updating the kmer databases only when needed to generate enough spectral signature information.
-The intention is that more developers would want to add functionality to the codebase or even just utilize things downstream, but to build out directly with numpy and scipy/scikit as needed to suggest the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project has begun under GPL v3.0 and hopefully could gain some interest.
+The intention is that more developers would want to add functionality to the codebase or even just utilize things downstream, but to build out directly with numpy and scipy/scikit as needed to suggest the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project began under GPL v3.0 and was relicensed with Apache v2. Hopefully this project could gain some interest.
 I have so much fun working on just this one project. There's more to it than meets the eye. I'm working on a preprint, and the draft is included in some of the latest versions of the codebase, specifically .Rmd files. More on the flip-side of this file. Literally. And figuratively. It's so complex with technology these days.
diff --git a/RELEASE_NOTES.txt b/RELEASE_NOTES.txt
new file mode 100644
index 0000000..896da25
--- /dev/null
+++ b/RELEASE_NOTES.txt
@@ -0,0 +1,27 @@
+=============
+| v0.7.6 |
+=============
+The tabular format specification has boiled down to a 4 or 5 column design, and the metadata header has been stabilized since 0.7.4, in Jan/Feb of 2023. The header now consists of explicit Numpy dtypes, int64 most of the time. Frequency columns are included for the sake of it, but int count profiles have taken the front seat in the project.
+
+The columnar format is now: rownum, kmerid, count, frequency. 'Metadata' i.e. the 6 neighboring k-mer ids, is completely optional, and very much still in alpha. The scipy and biopython kmeans and hierarchical clustering features have been briefly tested, and the numpy distances now form the core of the distance command.
+
+I'm most proud of the profile and matrix commands, the latter may read profiles into memory in parallel, collating the count column as it goes. I'm not sure how this would perform on the sorted .kdb files.
+
+Minor bug fixes and regressions on the fileutil and __init__.py files round out 0.7.6 from 0.7.4. Basically reduces smell and tests the --sorted feature. The --re-sort and --un-sort features on the view command remain a little too untested...
+
+=============
+| v0.6.5 |
+-------------
+The numerical backbone of the project has been solidified, more sanity checks and assertions throughout runtime. The memory tends to be an issue even for mild choices of k. We are now using 'uint64' and 'float64' for indexes, counts, and frequencies. Parallelization has been improved in the matrix command for 'quick' loading of count profiles into memory. Currently KDBReader is lazy load, only reading the header metadata when file is first opened. Behavior other than the 'slurp()' and '_slurp' methods are decided only by the source and Bio.bgzf module. In principle, you could read the file line-by-line if you wanted to, but the behavior is sufficient at the moment for acceptance testing.
+
+In addition to these 'features' my focus has been mostly focused on getting the ideal Pearson and Spearman correlation coefficients to understand profile fidelity behavior.
+
+
+=============
+| v0.0.7 |
+-------------
+There have been 3 pre-releases in the codebase thus far, and we are on version number 0.0.7. The codebase has changed into a sophisticated on-disk k-mer counting strategy, with multiple parallelization options. The first of which is native OS parallelism using something like GNU parallels to run the program on many genomes or metagenomes, simultaneously. The second parallelization option use the Python3 multiprocessing library, particularly for processing fastq datasets.
+
+When I say on-disk, I mean the file format I've created is essentially an indexed database. This allows rapid access to sequential transition probabilities, on-disk, during a random walk representing a Markov process. This is a key feature for anyone who wants to seriously traverse a 4k dimensional space.
+
+The codebase also currently contains a randomization feature, distance matrices, arbitrary interoperability with Pandas dataframes, clustering features, normalization, standardization, and dimensionality reduction. The suite is currently ready for a regression feature that I've promised, and I'd like to implement this early this Spring. Next, I'd be interested in working on the following features that would make the suite ready for another beta release.