Merge pull request #124 from MatthewRalston/graph_algo

3-tuple i think
MatthewRalston · Mar 28, 2024 · 84a33ea
2 parents 52b19f1 + fce3a3d
commit 84a33ea
Showing 19 changed files with 2,808 additions and 573 deletions.
diff --git a/.build.sh b/.build.sh
@@ -35,8 +35,11 @@ EOF
 
 
 # cd
+rm -rf kmerdb-0.7*dist-info/ kmerdb.egg-info/ build/
+
 
 python -m build
 auditwheel repair --plat manylinux2014_x86_64 dist/kmerdb-*linux_x86_64.whl
 mv wheelhouse/* dist
 rm dist/kmerdb-*linux_x86_64.whl
+
diff --git a/.clean_local_install.sh b/.clean_local_install.sh
@@ -18,9 +18,9 @@ EOF
 
 
 
-rm -rf /home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb /home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-*.egg-info /ffast2/kdb/kmerdb.egg-info /ffast2/kdb/build /ffast2/kdb/dist
-rm -rf /home/matt/.pyenv/versions/3.10.1/envs/kdb/lib/python3.10/site-packages/kmerdb-*
-cd /ffast2/kdb/
+rm -rf /home/matt/.pyenv/versions/kdb/lib/python3.11/site-packages/kmerdb /home/matt/.pyenv/versions/kdb/lib/python3.11/site-packages/kmerdb-*.egg-info /home/matt/Projects/kdb/kmerdb.egg-info /home/matt/Projects/kdb/build /home/matt/Projects/kdb/dist
+rm -rf /home/matt/.pyenv/versions/3.11.7/envs/kdb/lib/python3.11/site-packages/kmerdb-*
+cd /home/matt/Projects/kdb/
 rm -rf dist build kmerdb.egg-info wheelhouse
 cd
 
diff --git a/.gitignore b/.gitignore
@@ -8,4 +8,5 @@ kdb/__pycache__
 **.kdb
 examples/example_report/*.png
 examples/example_report
-test/data
+test/data
+pypi_token.foo
diff --git a/.release.sh b/.release.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+cat <<EOF
+   Copyright 2020 Matthew Ralston
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+EOF
+
+
+rm -rf dist/*
+python setup.py sdist
+twine upload dist/*
+
diff --git a/.upgrade_python_version.sh b/.upgrade_python_version.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+#pip install -U --python 3.11.7 --require-virtualenv --isolated --debug -I
+
+
+echo "pip  --python 3.11.7 --require-virtualenv --isolated --debug install -U"
+echo "..."
+pip  --python 3.11.7 --user --require-virtualenv --isolated --debug install -U
+
+
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # README - kmerdb
 > A Python CLI and module for k-mer profiles, similarities, and graph databases
 
-NOTE: This project is in alpha stage. Development is ongoing. But feel free to clone the repository and play with the code for yourself.
+NOTE: This project is in beta stage. Development is ongoing. But feel free to clone the repository and play with the code for yourself.
 
 ## Development Status
 [![Downloads](https://static.pepy.tech/personalized-badge/kmerdb?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=Downloads)](https://pypi.org/project/kmerdb)
@@ -20,12 +20,12 @@ NOTE: This project is in alpha stage. Development is ongoing. But feel free to c
 
 ## Summary 
 
-KDB is a Python library designed for bioinformatics applications. It addresses the ['k-mer' problem](https://en.wikipedia.org/wiki/K-mer) (substrings of length k) in a simple and performant manner. It stores the k-mer counts/abundances and total counts. You can think of the current form as a "pre-index", as it includes all the essential information for indexing on any field in the landscape of k-mer to sequence relationships. One restriction is that k-mers with unspecified sequence residues 'N' create gaps in the k-mer to sequence relationship space, and are excluded. That said, non-standard IUPAC residues are supported.
+KDB is a Python library designed for bioinformatics applications. It addresses the ['k-mer' problem](https://en.wikipedia.org/wiki/K-mer) (substrings of length k) in a simple and performant manner. It stores the k-mer counts/abundances in a columnar format, with input file checksums, total counts, nullomers, and mononucleotide counts in a YAML formatted header in the first block of the `bgzf` formatted `.kdb` file. One restriction is that k-mers with unspecified sequence residues 'N' create gaps in the k-mer to sequence relationship space, and are excluded. That said, non-standard IUPAC residues are supported.
 
 
 Please see the [Quickstart guide](https://matthewralston.github.io/kmerdb/quickstart) for more information about the format, the library, and the project.
 
-The k-mer spectrum of the fasta or fastq sequencing data is stored in the `.kdb` format spec, a bgzf file similar to `.bam`. For those familiar with `.bam`, a `view` and `header` functions are provided to decompress a `.kdb` file into a standard output stream.
+The k-mer spectrum of the fasta or fastq sequencing data is stored in the `.kdb` format spec, a bgzf file similar to `.bam`. For those familiar with `.bam`, `view` and `header` functions are provided to decompress a `.kdb` file into a standard output stream. The output file is compatible with `zlib`.
 
 
 
@@ -79,23 +79,37 @@ See `python -m kmerdb -h` for details.
 
 ```bash
 kmerdb --help
-# Build a [composite] profile to a new or existing .kdb file
+# Build a [composite] profile to a new .kdb file
 kmerdb profile -k 8 example1.fq.gz example2.fq.gz profile.8.kdb
 
+# Note: zlib compatibility
+zcat profile.8.kdb
+
+# Build a weighted edge list
+kmerdb graph -k 12 example1.fq.gz example2.fq.gz edges.kdbg
+
 # View the raw data
 kmerdb view profile.8.kdb # -H for full header
 
 # View the header
 kmerdb header profile.8.kdb
 
-# Collate the files
-kmerdb matrix -p $cores pass *.8.kdb
+# Collate the files. See 'kmerdb matrix -h' for more information.
+# Note: the 'pass' subcommand passes the int counts through collation, without normalization.
+# In this case the shell interprets '*.8.kdb' as all 8-mer profiles in the current working directory.
+# The k-mer profiles are read in parallel (-p $cores), and collated into one Pandas dataframe, which is printed to STDOUT.
+# Other options include DESeq2 normalization, frequency matrix, or PCA|tSNE based dimensionality reduction techniques.
+kmerdb matrix -p $cores pass *.8.kdb > kmer_count_dataframe.tsv
 
 # Calculate similarity between two (or more) profiles
-kmerdb distance correlation profile1.kdb profile2.kdb (...)
+# The correlation distance from Numpy is used on one or more profiles, or piped output from 'kmerdb matrix'.
+kmerdb distance correlation profile1.kdb profile2.kdb (...) > distance.tsv
+
+# A condensed, one-line invocation of the matrix and distance command using the bash shell's pipe mechanism is as follows.
+kmerdb matrix pass *.8.kdb | kmerdb distance correlation STDIN > distance.tsv
 ```
 
-## Usage note:
+## IUPAC support:
 
 ```bash
 kmerdb profile -k $k input.fa output.kdb # This may discard non-IUPAC characters, this feature lacks documentation!
@@ -132,9 +146,9 @@ python setup.py test
 
 ## License
 
-Created by Matthew Ralston - [Scientist, Programmer, Musician](http://matthewralston.github.io) - [Email](mailto:mrals89@gmail.com)
+Created by Matthew Ralston - [Scientist, Programmer, Musician](http://matthewralston.github.io) - [Email](mailto:mralston.development@gmail.com)
 
-Distributed under the Apache license. See `LICENSE.txt` for the copy distributed with this project. Open source software is not for everyone, but for those of us starting out and trying to put the ecosystem ahead of ego, we march into the information age with this ethos.
+Distributed under the Apache license. See `LICENSE.txt` for the copy distributed with this project. Open source software is not for everyone, but we march into the information age with this ethos. I have the patent rights to this software. You may use and distribute this software, gratis, so long as the original LICENSE.txt is distributed along with the software. This software is distributed AS IS and provides no warranties of any kind.
 
 ```
    Copyright 2020 Matthew Ralston
@@ -154,7 +168,7 @@ Distributed under the Apache license. See `LICENSE.txt` for the copy distributed
 
 ## Contributing
 
-1. Fork it (<https://github.com/MatthewRalston/kdb/fork>)
+1. Fork it (<https://github.com/MatthewRalston/kmerdb/fork>)
 2. Create your feature branch (`git checkout -b feature/fooBar`)
 3. Commit your changes (`git commit -am 'Add some fooBar'`)
 4. Push to the branch (`git push origin feature/fooBar`)
@@ -164,22 +178,24 @@ Distributed under the Apache license. See `LICENSE.txt` for the copy distributed
 
 Thank you to the authors of kPAL and Jellyfish for the early inspiration. And thank you to others for the encouragement along the way, who shall remain nameless. I wanted this library to be a good strategy for assessing these k-mer profiles, in a way that is both cost aware of the analytical tasks at play, capable of storing the exact profiles in sync with the current assemblies, and then updating the kmer databases only when needed to generate enough spectral signature information.
 
-The intention is that more developers would want to add functionality to the codebase or even just utilize things downstream, but to build out directly with numpy and scipy/scikit as needed to suggest the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project has begun under GPL v3.0 and hopefully could gain some interest.
+The intention is that more developers would want to add functionality to the codebase or even just utilize things downstream, but to build out directly with numpy and scipy/scikit as needed to suggest the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project began under GPL v3.0 and was relicensed with Apache v2. Hopefully this project could gain some interest. I have so much fun working on just this one project. There's more to it than meets the eye. I'm working on a preprint, and the draft is included in some of the latest versions of the codebase, specifically .Rmd files.
 
 More on the flip-side of this file. Literally. And figuratively. It's so complex with technology these days.
 
 <!--
-Thanks of course to that French girl from the children's series.
+Thanks of course to that French girl from the children's series. 
 Thanks to my former mentors BC, MR, IN, CR, and my newer bosses PJ and KL.
 Thanks to the Pap lab for the first dataset that I continue to use.
-Thank you to Ryan for the food and stuff.
+Thank you to Ryan for the food and stuff. I actually made this project specifically so you and I could converse...
 Thanks to Blahah for tolerating someone snooping and imitating his Ruby style.
-Thanks to Erin for getting my feet wet in this new field.
-Thanks to Rachel for the good memories and friendship.
+Thanks to Erin for getting my feet wet in this new field. You are my mvp.
+Thanks to Rachel for the good memories and friendship. And Sophie too. veggies n' R love.
 Thanks to Yasmeen for the usual banter.
-Thanks to Max, Robin, and Robert for the halfway decent memories in St. Louis.
+Thanks to Max, Robin, and Robert for the good memories in St. Louis.
 Thanks to Freddy Miller for the good memories.
-Thanks to Nichole for the cookies and good memories.
+Thanks to Nichole for the cookies and good memories. And your cute furballs too!
+Thanks to Stace for the lessons, convos, and even embarrassing moments. You're kind of awesome to me.
+Thanks to a few friends I met in 2023 that reminded me I have a lot to learn about friendship, dating, and street smarts.
 And thanks to my family and friends.
 Go Blue Hens
 -->
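
The README's zlib-compatibility note follows from bgzf being a series of concatenated gzip members. A minimal sketch using only Python's standard library, with in-memory stand-in data rather than a real `.kdb` file (the header fields and rows below are invented for illustration and are not the actual `.kdb` layout):

```python
import gzip

# Stand-in blocks: a YAML-like header block followed by a tab-separated data block.
# These contents are hypothetical; a real .kdb header carries more fields.
header_block = b"version: 0.7.8\nk: 8\n"
data_block = b"0\t0\t12\t0.0001\n1\t1\t7\t0.00006\n"

# Like bgzf, each block is compressed as its own gzip member,
# and the members are simply concatenated.
stream = gzip.compress(header_block) + gzip.compress(data_block)

# A gzip-aware reader (zcat, or Python's gzip module) decompresses
# every member in sequence, so the blocks come back as one text stream.
text = gzip.decompress(stream).decode()
print(text)
```

This is why `zcat profile.8.kdb` can print the header and the table together without any kmerdb-specific tooling.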
diff --git a/RELEASE_NOTES.txt b/RELEASE_NOTES.txt
@@ -0,0 +1,47 @@
+=============
+| v0.7.8    |
+=============
+Still fixing a lot of issues w.r.t. the interface (logging info) and stepping through neighbor list creation/validations, adjacencies format specification, and other issues w.r.t. the main __init__.py profile and graph subcommands (profile, make_kdbg in __init__.py, [ALL] fileutil.py).
+
+Major issues. ick.
+
+
+
+
+=============
+| v0.7.7    |
+=============
+New basic format spec (.kdbg) released for weighted edge list. IUPAC warning setting abstracted for two base methods for k-mer counter: validate_seqRecoard_and_detect_IUPAC. _shred method needed for edge list, performance assessment needed vs vanilla k counter.
+
+The graph format has 3 numpy columns (2, 3, 4), which are the n1, n2, and edge_weight variables. These were added to the metadata config. The solver solution is still unformed.
+
+Needs README.md and website description.
+
+
+=============
+| v0.7.6    |
+=============
+The tabular format specification has boiled down to a 4 or 5 column design, and the metadata header has been stabilized since 0.7.4, in Jan/Feb of 2023. The header now consists of explicit Numpy dtypes, int64 most of the time. Frequency columns are included for the sake of it, but int count profiles have taken the front seat in the project.
+
+The columnar format is now: rownum, kmerid, count, frequency. 'Metadata' i.e. the 6 neighboring k-mer ids, is completely optional, and very much still in alpha. The scipy and biopython kmeans and hierarchical clustering features have been briefly tested, and the numpy distances now form the core of the distance command.
+
+I'm most proud of the profile and matrix commands; the latter may read profiles into memory in parallel, collating the count column as it goes. I'm not sure how this would perform on the sorted .kdb files.
+
+Minor bug fixes and regressions on the fileutil and __init__.py files round out 0.7.6 from 0.7.4. Basically reduces smell and tests the --sorted feature. The --re-sort and --un-sort features on the view command remain a little too untested...
+
+=============
+| v0.6.5    |
+-------------
+The numerical backbone of the project has been solidified, with more sanity checks and assertions throughout runtime. Memory tends to be an issue even for mild choices of k. We are now using 'uint64' and 'float64' for indexes, counts, and frequencies. Parallelization has been improved in the matrix command for 'quick' loading of count profiles into memory. Currently KDBReader is lazy-loading, only reading the header metadata when the file is first opened. Behavior other than the 'slurp()' and '_slurp' methods is decided only by the source and the Bio.bgzf module. In principle, you could read the file line-by-line if you wanted to, but the behavior is sufficient at the moment for acceptance testing.
+
+In addition to these 'features', my focus has mostly been on getting the ideal Pearson and Spearman correlation coefficients to understand profile fidelity behavior.
+
+
+=============
+| v0.0.7    |
+-------------
+There have been 3 pre-releases in the codebase thus far, and we are on version number 0.0.7. The codebase has changed into a sophisticated on-disk k-mer counting strategy, with multiple parallelization options. The first is native OS parallelism using something like GNU parallel to run the program on many genomes or metagenomes simultaneously. The second parallelization option uses the Python3 multiprocessing library, particularly for processing fastq datasets.
+
+When I say on-disk, I mean the file format I've created is essentially an indexed database. This allows rapid access to sequential transition probabilities, on-disk, during a random walk representing a Markov process. This is a key feature for anyone who wants to seriously traverse a 4<sup>k</sup> dimensional space.
+
+The codebase also currently contains a randomization feature, distance matrices, arbitrary interoperability with Pandas dataframes, clustering features, normalization, standardization, and dimensionality reduction. The suite is currently ready for a regression feature that I've promised, and I'd like to implement this early this Spring. Next, I'd be interested in working on the following features that would make the suite ready for another beta release. 
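
The v0.7.7 note above describes `.kdbg` rows as `n1`, `n2`, `edge_weight`. A minimal sketch of how such a weighted edge list could be derived with a sliding window, assuming a 2-bit encoding (A=0, C=1, G=2, T=3) for k-mer ids. The function names are hypothetical and are not kmerdb's API:

```python
from collections import Counter

# Assumed 2-bit encoding of residues; the real id scheme may differ.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_id(kmer):
    """Encode a k-mer as an integer, two bits per base."""
    idx = 0
    for base in kmer:
        idx = (idx << 2) | BASE_TO_BITS[base]
    return idx


def weighted_edge_list(seq, k):
    """Count each (n1, n2) pair of consecutive k-mer ids in one sequence."""
    ids = [kmer_id(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    # Each pair of neighbors in the id array is one edge; repeats add weight.
    return Counter(zip(ids, ids[1:]))


edges = weighted_edge_list("ACGTACGT", 3)
for (n1, n2), weight in sorted(edges.items()):
    print(n1, n2, weight, sep="\t")
```

For `ACGTACGT` with k=3, the pair ACG to CGT occurs twice, so that edge carries weight 2 while the other three edges carry weight 1.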
diff --git a/TODO.org b/TODO.org
@@ -7,6 +7,63 @@
 
 
 
+* 3/25/24 - finished weighted edge list, planning assembler
+** Personal Remarks
+*** Today marks the beginning of the end... of the DeBruijn graph format pull-request from branch 'graph_algo'
+*** I'm doing a little bit better mentally. Learned today about non-stimulant ADHD meds
+*** In hindsight, I've never been diagnosed with ADHD. I have reasonable hyper-focus, but I get derailed with alternate versions of ... oops I literally forgot what the psychiatrist calls it when you change tasks and get unfocused. Wow.
+*** I like my new therapist/counselor and her level of care seems nice. Let's see how the next 3 months go.
+*** Okay, that's enough about meTM. 
+** Project remarks
+*** I'm very happy with the recent additions to the graph_algo branch. The feature 'seems' to be working quite well regarding neighboring/subsequent k-mers appended to the id array.
+*** Specifically, I have a --quiet option that will silence most output delivered to the screen in addition to the verbosity setting.
+*** By DEFAULT I print an obnoxious amount of output to the STDERR stream, without the verbosity settings changed from the default of warning level (-v, -vv).
+*** I believe this demonstrates to the user how adjacencies in the id array are considered aka that they have the k-1 subsequence in common.
+*** These assertions introduced in kmerdb.graph are essential to verify that subsequent read counts, propagate an error, which is displayed to the user as a warning
+*** describing the nature of the assertion failures and suggesting the reason why.
+*** More specifically: it should be added to the README.md that the number of assertion failures should roughly equal the number of reads in a .fq file, triggering the issue via k-mer ids from the end of one read and the beginning of the next.
+
+NOTE: ADJACENCY ERRORS DETECTED: Found 24999 'improper' k-mer pairs/adjacencies from the input file(s),
+ where subsequent k-mers in the k-mer id array (produced from the sliding window method over input seqs/reads) did not share k-1 residues in common.
+ These *may* be introduced in the array from the last k-mer of one seq/read (in the .fa/.fq) and the first k-mer of the next seq/read.
+*** Okay, with this settled, I can now describe any plans for revision to the .kdbg format, as well as a description of a first-pass networkx based solution to graph traversal and stop criterion during contig generation.
+*** With that said, I absolutely need a visualizer at this point to check my work.
+** TODO Code cleanup
+*** Documentation
+**** Deprecations
+***** strand_specific
+***** all_metadata
+**** IUPAC
+**** README
+*** kmerdb module
+   - [X] kmer.py
+     - [ ] verbose => quiet
+   - [X] graph.py
+   - [X] parse.py
+   - [ ] __init__.py
+*** README.md
+   - [ ] README.md
+     - [ ] Document the *new* IUPAC strategy for 'kmerdb.kmer._shred_for_graph'
+     - [ ] Provide
+*** website -  matthewralston.github.io/kmerdb
+    - [/] Expanded documentation on subcommands.
+      - [ ] profile
+      - [ ] view
+      - [ ] distance (SWAP ORDER)
+      - [ ] matrix (SWAP ORDER)
+      - [ ] NEW! graph
+      - [ ] kmeans
+      - [ ] hierarchical
+      - [ ] probability
+    - [ ] DONT DO YET graph/assembly page
+    - [/] API
+      - [ ] reading .kdbg or .kdb files
+      - [ ] writing .kdbg or .kdb files
+** TODO Assembly algorithm planning
+** TODO CPU (NetworkX) implementation (overview)
+** TODO Stop criterion
+  - [ ] When are the *necessary* traversals completed?
+  - [ ] How do we rank these?
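
The adjacency note in the project remarks above can be illustrated with a short sketch: when the k-mers of several reads are appended to one array, the pair spanning each read boundary usually does not share k-1 residues. The function names and reads here are hypothetical, and this is not the `kmerdb.graph` implementation:

```python
def kmers(seq, k):
    """All k-mers of one read, in sliding-window order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_improper_adjacencies(reads, k):
    """Count neighboring k-mer pairs that do not overlap by k-1 residues."""
    # Append the k-mers of every read to a single array, as in the remarks above.
    all_kmers = [km for read in reads for km in kmers(read, k)]
    # A proper adjacency means the suffix of one k-mer equals the prefix of the next.
    return sum(1 for a, b in zip(all_kmers, all_kmers[1:]) if a[1:] != b[:-1])


reads = ["ACGTAC", "GGGTTT", "CATCAT"]
improper = count_improper_adjacencies(reads, 3)
print(improper)
```

With these three reads the count is 2, one per read boundary. That is consistent with the remark that the number of assertion failures should roughly track the number of reads in a `.fq` file.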
 
 * Lost comments
 
@@ -22,9 +79,8 @@
 ** Remembering that it's only supposed to be a k-mer count vector storage medium right now
 ** Scoping scoping where does it end
 ** Is my life's work pointless?
-** Losing my best friend because of relapse
-*** Sent 1 basic sorry, got an acknowledgement.
-*** Felt like shit, lost my fucking wallet and chillum
+** Losing my best friend because of argument
+*** Sent 1 basic sorry, got a minor acknowledgement.
 *** Smoking habit down to 1 cig a day (just bored, less and less dynamism of focus.
 *** Recalling the CortizoneTM
 *** Apply gently
@@ -34,7 +90,7 @@
 ** Time/money management issues mounting
 
 * Code maintenance
-** TODO COMMENTS [7/7]
+** FEEDBACK COMMENTS [7/7]
 DEADLINE: <2022-01-29 Sat> SCHEDULED: <2022-01-27 Thu>
   - [X] util
     - [X] merge_metadata_lists [3/3]