Assert subsequent shredded kmers are neighborly, skip if fails (fastq centered) #123

MatthewRalston · 2024-03-05T03:39:40Z

The _shred method returns an ordered list of kmer IDs. Assert that two subsequent kmers in the list are neighborly (share a k-1 mer in common); otherwise do not increment that edge (it mustn't exist).
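A minimal sketch of that check, assuming kmer IDs are the 2-bit encoding (A=0, C=1, G=2, T=3) of the kmer read left to right. The names `kmer_to_id`, `is_neighborly`, and `count_edges` are hypothetical and not part of kmerdb.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed encoding order: A=0, C=1, G=2, T=3


def kmer_to_id(kmer):
    """Hypothetical helper: encode a k-mer as a base-4 integer ID."""
    idx = 0
    for c in kmer:
        idx = idx * 4 + ALPHABET.index(c)
    return idx


def is_neighborly(id1, id2, k):
    """True if the (k-1)-suffix of id1 equals the (k-1)-prefix of id2."""
    suffix_of_first = id1 % (4 ** (k - 1))  # drop the leading base
    prefix_of_second = id2 // 4             # drop the trailing base
    return suffix_of_first == prefix_of_second


def count_edges(kmer_ids, k):
    """Increment an edge only when two subsequent k-mers are neighborly."""
    edges = Counter()
    for a, b in zip(kmer_ids, kmer_ids[1:]):
        if is_neighborly(a, b, k):
            edges[(a, b)] += 1
        # otherwise skip: the pair spans a break (e.g. a read boundary)
    return edges
```

With fastq input, the last kmer of one read and the first kmer of the next will usually fail this test, so no edge is counted across the read boundary.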

MatthewRalston · 2024-03-05T16:48:05Z

g++???? Why not a++

…ecord_and_detect_IUPAC. Default for base shred method set to true, so warnings are enabled across .fa/.fq files. Default set to True throughout, so no question there. Default invocation (therefore suppressed) produces a single warning for the iupac module. 'Standard sequence' warning deprecated, it's lousy information. 789ish lines added. Adds/changes graph.py, __init__.py for edge graph creation 'graph' command. Bump version. Closes #123. On another note, version bump for first inclusion of the graph structure into disk and memory. Beginning alternate pipeline of commands for assembly. If I'm honest, the whole codebase needs a once-over. Issue #124.

MatthewRalston · 2024-03-18T16:07:40Z

Okay so e3eb1d8 addresses the 3-tuple again (#124) during edge list and neighbor construction. A working data structure init process will create fundamental faculties for graph construction and traversal, regardless of implementation.

MatthewRalston · 2024-03-18T18:08:51Z

This is getting fleshed out in kmerdb/graph.py

MatthewRalston · 2024-03-20T15:26:46Z

Key Question

What is needed for working data structure initialization? Why isn't it working?

Related issues

#126 #122

Key features

The latest commit closes the issue of working neighbor structure implementation as a subtask. (also, it doesn't... ongoing.)

More on that later. The neighbor structure no longer looks broken; it appears fixed. For example, the neighbor structure pairings (the edge list from prior commits) have a k-1 overlap as expected. So simply put, the 'neighbors' functionality in this and prior commits appears to have a working 8-tuple that is added to the source by virtue of:

neighbors[k_id] = kmer.neighbors(cur, k_id, k)

I am using a simple method to initialize the edge list:

- create the pair
- pass around the neighbor structure AND the edge list
- use the pair to create an "ordered" list of the key_id duple (path) (so .... etc.)

The hidden tasks include:

- distinguish in the data structure spec whether the nature of the pair is kmer to neighbor, or a "prospective" neighbor to k-mer, given by some prioritization parameter set... etc., and thus a traversal approach

MatthewRalston · 2024-03-29T00:52:42Z

Key Question

What is needed for working data structure initialization? Why isn't it working?

The following are not yet finalized: the node and edge lists; the prioritization or sort strategy for edge representation; weights; multigraph and combination representation; orientation of edges; dual strandedness; and the .kdbg row metadata fields (Boolean rather than int, i.e. for fast lookup).

Node files
Edge files
Walk files
Path files
Tree files
Forward walk
Reverse walk
[[ Node ]]: node_id, pos_walk (id in walk file or path file), pos_path, next_edge id (aka edge 2-tuple), next_path id
[[ Edge ]]: node1_id, node2_id, pos_path, pos_walk, prospective bool (aka most edges in a walk/path/climb should be retrospective in the destination context...), preceding walk id, next walk id
Walk schema
path schema
Forward schema
Reverse schema
Solution schema
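The node and edge fields listed above could be sketched as plain records. This is only an illustration, since the schema is not finalized: the field types, the defaults, and the class names `Node` and `Edge` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    node_id: int
    pos_walk: Optional[int] = None               # id in a walk file or path file
    pos_path: Optional[int] = None
    next_edge: Optional[Tuple[int, int]] = None  # edge 2-tuple (node1_id, node2_id)
    next_path_id: Optional[int] = None


@dataclass
class Edge:
    node1_id: int
    node2_id: int
    pos_path: Optional[int] = None
    pos_walk: Optional[int] = None
    prospective: bool = False                    # False means retrospective
    preceding_walk_id: Optional[int] = None
    next_walk_id: Optional[int] = None
```

Defaulting `prospective` to False reflects the note that most edges in a walk or path should be retrospective; that default is a guess, not a decision recorded in this issue.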

[[ The walk file ]]

Walk files are just like path files and primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, so all walks and their originating context (aka a .kdbg file) must be either minimal or expanded. Minimal: all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.0 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes). Expanded: retrospective, prospective, previously investigated forks and their node IDs.

minimal walks

A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), marked with a retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (not a minimal path, a single linear representation of edges, a sort order with no presumed origin id).

solutional path

A walk, along with all previous walks (in chronological, aka integer id, order, by reference), along with the sha256sum of the git release that produced the walk.

[[ solutional path file ]]

Header metadata will include the source and the parameters, plus a walk id (a sha256 of the walk) for an associated walk file, and a walk name (given at "runtime" via the CLI). The walk id may be 0 to represent an unspecific or unqualified walk (origin unclear).
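One way to derive a sha256 walk id, sketched under assumptions: the walk is taken to be an ordered list of (node1_id, node2_id) edge pairs, and the tab/newline serialization is made up here since no format is specified. `walk_id` is a hypothetical name.

```python
import hashlib


def walk_id(edges):
    """Hash an ordered edge list into a hex sha256 digest.

    The serialization (one "node1<TAB>node2" line per edge) is an
    assumption for illustration; any stable byte encoding would work.
    """
    payload = "\n".join(f"{a}\t{b}" for a, b in edges).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the digest covers the edges in order, the same edges walked in a different order get a different id.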

Related issues

#126 #122

Key features

The neighbor structure 🌪️is manifested by particular kmer IDs🌬️, which may be accessed from kmer arrays loaded alongside the edge list during assembly.

A working pipeline would carry all components of the workflow on to the next step, but all commands are currently partial.

MatthewRalston · 2024-04-04T19:15:47Z

Key Comment... I guess i didn't know how valuable this is...

but there is no particular sort order. What does that mean for your data and how it's output? During printing, I tend to use a numerical index (1 : 4^k) or a lexical, provided order, like from a graph or edge list generated by virtue of a sequence from data input, where the order should be known to the user. The provided order of the sequences in a fasta or fastq file determines the order of k-mers. However, the "neighbor" structure is generated sparsely, or incompletely, i.e. not all edges and orientations of nodes in the pair are represented in the posited neighbor structure, which is generated by k-mers from the dataset plus their 8 nearest neighbor k-mers. They are produced as such:

e.g. k-mer "X" ACTGACTG

k-1 mers (if k is 12, then the k-1-mer is 11 bp long):
- k-1-mer X = CTGACTG (k-1 mer via first char removed; the k-1 mer that receives appends)
- k-1-mer Y = ACTGACT (k-1 mer via last char removed; the k-1 mer that receives prepends)
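The 8 neighbors of the example k-mer can be enumerated with a short string-based sketch. The function name `neighbors` here is a standalone illustration and does not reproduce the signature of `kmer.neighbors(cur, k_id, k)`.

```python
def neighbors(kmer, alphabet="ACGT"):
    """Return the 8 neighbor k-mers: 4 by appending, 4 by prepending."""
    suffix = kmer[1:]    # first char removed; receives appends
    prefix = kmer[:-1]   # last char removed; receives prepends
    appended = [suffix + c for c in alphabet]
    prepended = [c + prefix for c in alphabet]
    return appended + prepended


for n in neighbors("ACTGACTG"):
    print(n)
```

Each neighbor shares a k-1 mer with the original k-mer: the first four share CTGACTG and the last four share ACTGACT.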

MatthewRalston · 2024-04-04T19:30:36Z

So this is the kdbg and neighbor structure of the input sequence data, and not the "total" or "theoretical" edge list from all 16,777,216 theoretically-derived k-mers. Nor is it the sequence data alone: it also includes all possible neighbor pairs from the sequence data, and therefore one or more paths through the neighbor structure to all relevant edges in an undirected graph producing weights.
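To make the sparse versus theoretical distinction concrete: 16,777,216 is 4^12, the number of possible k-mers at k=12. The sketch below compares that bound with the node set implied by a dataset (observed k-mers plus their 8 neighbors each). Both function names are hypothetical, and the string-based approach is for illustration only.

```python
def theoretical_kmer_count(k):
    """Number of distinct k-mers over the 4-letter DNA alphabet."""
    return 4 ** k


def observed_with_neighbors(sequence, k, alphabet="ACGT"):
    """Return (observed k-mers, observed k-mers plus their neighbors)."""
    observed = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    nodes = set(observed)
    for kmer in observed:
        for c in alphabet:
            nodes.add(kmer[1:] + c)   # append onto the (k-1)-suffix
            nodes.add(c + kmer[:-1])  # prepend onto the (k-1)-prefix
    return observed, nodes
```

The node set can hold at most 9 entries per observed k-mer (the k-mer plus 8 neighbors), which for real input is typically far below 4^k.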

The sort order is there to play with; it's not task uno right now.

MatthewRalston mentioned this issue Mar 5, 2024

3-tuple i think #124

Merged

MatthewRalston pinned this issue Mar 18, 2024

MatthewRalston self-assigned this Mar 18, 2024

MatthewRalston changed the title ~~Assert subsequent shredded kmers are neighborly, skip if fails (fast centered)~~ Assert subsequent shredded kmers are neighborly, skip if fails (fastq centered) Mar 18, 2024

MatthewRalston mentioned this issue Mar 18, 2024

Maybe take a walk? #122

Open

MatthewRalston closed this as completed in #124 Mar 28, 2024

MatthewRalston unpinned this issue Apr 4, 2024

MatthewRalston reopened this Apr 4, 2024
