-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is it still a 'graph'? #1
Comments
I agree that the graph nature of GFA is very important. That's why GFA 1.0 separated link |
I guess one reason for storing non-dovetail overlaps with "E" record to encompass these cases (1) the segment record are contigs and (2) to facilitate general cases for sequence alignments between the reads. If I understand @thegenemyers right, in case (1), the "edges" are necessary if one have constructed contigs, and the edges will be able to encode the case of the figure 2: https://github.com/thegenemyers/GFA-spec/blob/master/GFA2.Fig2.png For (2), I guess the fundamental question is what is the exact scope of the information stored in GFA. Even in the dovetail alignment cases, if we store transitive-reduce edges in GFA but they are not in the final graph, some filtering is needed for generating the final graph. In such case, those non dovetail alignments are just other special "edges" that can be filtered out easily. (BTW, I personally like simple "directed" graph representation and edges and paths from both strands are represented. However, in practice, as long as the format can encode the equivalent information, it is good for me as external data exchange format.) |
I guess I'm still not convinced that case 1 (segments == contigs) is the way to go. To use the same example again (now with segment names): In the original example, we make a contig from But what if we want to have two contigs, one for each phased haplotype: I don't see any way of doing that if contigs have to be graph segments. But if contigs can be paths through the graph, then we can keep the full graph structure and have both contigs. I'm used to SPAdes assembly graphs where this sort of thing is common - in SPAdes a single graph segment can occur in multiple contigs (or more than once in a single contig). But again, I do totally understand the appeal of having contig sequences in a GFA file in their entirety. Perhaps contigs could be GFA paths containing a segment order and a sequence? And then maybe the segment lines leave the sequence out (to reduce redundancy), instead referencing their slice of the path? I'm not sure what would be best... |
A given path |
Is this the kind of thing you're proposing?
Or else we could just store the path sequences in the path lines:
But either way is a bit redundant, no? E.g. the sequence for segment A is in there three times: once in its own segment line and once in each path. |
Hi Ryan, Jack, GFA2 still encodes a graph, its just an undirected multigraph with E-type edges and G-type edges. Your issue arises when you want to interpret a segment as a line segment. If you think of a segment as a dot as in a traditional graph, then E-edges are just edges between the dots, and 2 dots may have several edges between them e.g. the After picture is the graph: So the question isn't "is it a graph", as it is, but rather, "what do I draw" when I want to represent each segment as a line that has length (versus a zero-width dot). A simple proposal would be that one could simply ignore all E edges that are not dovetails. I have a suggestion to the format for E edges that will make it trivial to identify which represent dovetails, containments, or general edges. I will propose this in issue GFA-spec#33 shortly. The collection of dovetails, when considering segments to have length, can be given direction as your prefer. Jack wants a different specification for dovetail, containment, or general edges. I prefer one consolidated, general concept in which the cases that would be distinguished by a different letter in the first column are just as easily distinguished from looking at the E-line. Please see my forthcoming proposal in GFA-spec#33 (comment) |
I think the ideal answer to "what do I draw" would be to draw something very much like your original examples! Lines which overlap each other in ways that mimic the sequences. Bandage can't currently do anything like that, but it's not too difficult to imagine a future version (or some other tool) that could... |
@thegenemyers Just to clarify, my first name is Shaun and last Jackman. |
Shaun, On 9/1/16, 6:16 PM, Shaun Jackman wrote:
|
No offence taken, Gene! |
@richarddurbin @thegenemyers @sjackman @pb-jchin @lh3 @pmelsted
I like the proposal, and I like the simplicity of it - just two core components, segments and edges, where edges have been defined loosely enough to cover most all cases.
However, I have always considered sequence graphs (be they assembly graphs, MSA graphs, etc) as a directed graph in the mathematical sense, and I'm struggling to see how these new, flexible edges fit with that.
With your example in the whitepaper, the first case (Before) still seems like a real graph to me (I'll assume the directionality is left to right). If you asked me to do a depth first search starting from any segment, I'd know what to do. There are no cycles so I could do a topological sort. I.e. anything that applies to directed graphs in general could be applied here.
But the second case (After) is tougher. I'm not sure how I'd do a depth first search there. When you allow an edge to connect any part of any two segments, I feel like we've lost some of the 'graphiness' of the format. And I think it's important to keep that graphiness. If we allow GFA to define something that's not quite a real graph, then we lose access to decades of graph algorithms developed by people much smarter than me!
Here's an example I quickly sketched up:
When the edge connects the end of one segment to the start of another, the graph structure is clear. But for more general edges, I'm thrown. For example, in the first case,
A+,B+
is a valid path andB+,A+
is invalid. Can you define paths in the third case?The only solution I can think of is to have separate GFA line types for 'real' edges (what you call dovetails) and more general edges (any other kind of overlap). Then the graph structure will be defined solely by the segments and 'real' edges, with other edges not contributing to graph structure. Though that unfortunately reduces the simplicity of the format, which is what I liked about your proposal...
Thoughts? Am I too hung up on 'graphiness'? Or does this new GFA still rigidly define a digraph in a way I'm missing?
The text was updated successfully, but these errors were encountered: