Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it still a 'graph'? #1

Open
rrwick opened this issue Sep 1, 2016 · 10 comments
Open

Is it still a 'graph'? #1

rrwick opened this issue Sep 1, 2016 · 10 comments

Comments

@rrwick
Copy link

rrwick commented Sep 1, 2016

@richarddurbin @thegenemyers @sjackman @pb-jchin @lh3 @pmelsted

I like the proposal, and I like the simplicity of it - just two core components, segments and edges, where edges have been defined loosely enough to cover most all cases.

However, I have always considered sequence graphs (be they assembly graphs, MSA graphs, etc) as a directed graph in the mathematical sense, and I'm struggling to see how these new, flexible edges fit with that.

With your example in the whitepaper, the first case (Before) still seems like a real graph to me (I'll assume the directionality is left to right). If you asked me to do a depth first search starting from any segment, I'd know what to do. There are no cycles so I could do a topological sort. I.e. anything that applies to directed graphs in general could be applied here.

But the second case (After) is tougher. I'm not sure how I'd do a depth first search there. When you allow an edge to connect any part of any two segments, I feel like we've lost some of the 'graphiness' of the format. And I think it's important to keep that graphiness. If we allow GFA to define something that's not quite a real graph, then we lose access to decades of graph algorithms developed by people much smarter than me!

Here's an example I quickly sketched up:
screen shot 2016-09-01 at 10 18 13 am
When the edge connects the end of one segment to the start of another, the graph structure is clear. But for more general edges, I'm thrown. For example, in the first case, A+,B+ is a valid path and B+,A+ is invalid. Can you define paths in the third case?

The only solution I can think of is to have separate GFA line types for 'real' edges (what you call dovetails) and more general edges (any other kind of overlap). Then the graph structure will be defined solely by the segments and 'real' edges, with other edges not contributing to graph structure. Though that unfortunately reduces the simplicity of the format, which is what I liked about your proposal...

Thoughts? Am I too hung up on 'graphiness'? Or does this new GFA still rigidly define a digraph in a way I'm missing?

@sjackman
Copy link

sjackman commented Sep 1, 2016

The only solution I can think of is to have separate GFA line types for 'real' edges (what you call dovetails) and more general edges (any other kind of overlap).

I agree that the graph nature of GFA is very important. That's why GFA 1.0 separated link L records from containment C records. It makes sense I think to have two record types, edge E records that specify a directed edge between two vertices like a link L record does, and alignment A records that do nothing more than specify sequence similarity between two sequences (segment S records) and do not specify an edge in the graph.

@pb-jchin
Copy link

pb-jchin commented Sep 1, 2016

I guess one reason for storing non-dovetail overlaps with "E" record to encompass these cases (1) the segment record are contigs and (2) to facilitate general cases for sequence alignments between the reads.

If I understand @thegenemyers right, in case (1), the "edges" are necessary if one have constructed contigs, and the edges will be able to encode the case of the figure 2: https://github.com/thegenemyers/GFA-spec/blob/master/GFA2.Fig2.png
This could be quite useful if we want to represent genome assembly contigs as GFA records and NCBI or EBI takes GFA as input for assembly submission.

For (2), I guess the fundamental question is what is the exact scope of the information stored in GFA. Even in the dovetail alignment cases, if we store transitive-reduce edges in GFA but they are not in the final graph, some filtering is needed for generating the final graph. In such case, those non dovetail alignments are just other special "edges" that can be filtered out easily.

(BTW, I personally like simple "directed" graph representation and edges and paths from both strands are represented. However, in practice, as long as the format can encode the equivalent information, it is good for me as external data exchange format.)

@rrwick
Copy link
Author

rrwick commented Sep 1, 2016

I guess I'm still not convinced that case 1 (segments == contigs) is the way to go.

To use the same example again (now with segment names):
screen shot 2016-09-01 at 4 02 38 pm

In the original example, we make a contig from A,C,D,E,G, leaving segments B and F as the alternate sequences.

But what if we want to have two contigs, one for each phased haplotype: A,C,D,E,G and A,B,D,F,G? Like this:
screen shot 2016-09-01 at 4 02 38 pm_2

I don't see any way of doing that if contigs have to be graph segments. But if contigs can be paths through the graph, then we can keep the full graph structure and have both contigs. I'm used to SPAdes assembly graphs where this sort of thing is common - in SPAdes a single graph segment can occur in multiple contigs (or more than once in a single contig).

But again, I do totally understand the appeal of having contig sequences in a GFA file in their entirety. Perhaps contigs could be GFA paths containing a segment order and a sequence? And then maybe the segment lines leave the sequence out (to reduce redundancy), instead referencing their slice of the path? I'm not sure what would be best...

@sjackman
Copy link

sjackman commented Sep 1, 2016

A given path P record could have an associated segment S record specifying its sequence.

@rrwick
Copy link
Author

rrwick commented Sep 1, 2016

Is this the kind of thing you're proposing?

S   A   TGAC
S   B   A
S   C   G
S   D   CAAC
S   E   C
S   F   G
S   G   AGGC
L   A   +   B   +
L   A   +   C   +
L   B   +   D   +
L   C   +   D   +
L   D   +   E   +
L   D   +   F   +
L   E   +   G   +
L   F   +   G   +
P   contig_1    A,C,D,E,G   # associate with contig_1_seg
P   contig_2    A,B,D,F,G   # associate with contig_2_seg
S   contig_1_seg    TGACGCAACCAGGC
S   contig_2_seg    TGACACAACGAGGC

Or else we could just store the path sequences in the path lines:

S   A   TGAC
S   B   A
S   C   G
S   D   CAAC
S   E   C
S   F   G
S   G   AGGC
L   A   +   B   +
L   A   +   C   +
L   B   +   D   +
L   C   +   D   +
L   D   +   E   +
L   D   +   F   +
L   E   +   G   +
L   F   +   G   +
P   contig_1    A,C,D,E,G       TGACGCAACCAGGC
P   contig_2    A,B,D,F,G       TGACACAACGAGGC

But either way is a bit redundant, no? E.g. the sequence for segment A is in there three times: once in its own segment line and once in each path.

@thegenemyers
Copy link
Owner

Hi Ryan, Jack,

GFA2 still encodes a graph, its just an undirected multigraph with E-type edges and G-type edges. Your issue arises when you want to interpret a segment as a line segment. If you think of a segment as a dot as in a traditional graph, then E-edges are just edges between the dots, and 2 dots may have several edges between them e.g. the After picture is the graph:

screen shot 2016-09-01 at 11 03 56 am

So the question isn't "is it a graph", as it is, but rather, "what do I draw" when I want to represent each segment as a line that has length (versus a zero-width dot). A simple proposal would be that one could simply ignore all E edges that are not dovetails. I have a suggestion to the format for E edges that will make it trivial to identify which represent dovetails, containments, or general edges. I will propose this in issue GFA-spec#33 shortly. The collection of dovetails, when considering segments to have length, can be given direction as your prefer.

Jack wants a different specification for dovetail, containment, or general edges. I prefer one consolidated, general concept in which the cases that would be distinguished by a different letter in the first column are just as easily distinguished from looking at the E-line. Please see my forthcoming proposal in GFA-spec#33 (comment)

@rrwick
Copy link
Author

rrwick commented Sep 1, 2016

I think the ideal answer to "what do I draw" would be to draw something very much like your original examples! Lines which overlap each other in ways that mimic the sequences. Bandage can't currently do anything like that, but it's not too difficult to imagine a future version (or some other tool) that could...

@sjackman
Copy link

sjackman commented Sep 1, 2016

@thegenemyers Just to clarify, my first name is Shaun and last Jackman.

@thegenemyers
Copy link
Owner

Shaun,
My apologies, I was tired today. I also called Heng Li, 'Henry' in
another comment :-)
I respect you guys, no offense intended.
-- Gene

On 9/1/16, 6:16 PM, Shaun Jackman wrote:

@thegenemyers https://github.com/thegenemyers Just to clarify, my
first name is Shaun and last Jackman.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGkkNgc0HLY_7wxbtECVzqrRY_0gqqBsks5qlvpigaJpZM4JyNBh.

@sjackman
Copy link

sjackman commented Sep 1, 2016

No offence taken, Gene!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants