
Mechanism to support different offset (uoffset_t) size per table #179

Closed
TYoungSL opened this issue Jun 26, 2021 · 12 comments
Labels
wontfix This will not be worked on

Comments

@TYoungSL
Contributor

TYoungSL commented Jun 26, 2021

Per my #158 (comment), which led to @aardappel's plans to extend FlatBuffers to be 64-bit capable and/or create FlatBuffers2, a per-table attribute that changes the encoding to a non-standard offset size may be a useful stopgap until official support arrives.

table LargeTable (fs_offsetSize:8) { // long sized uoffset_t
  someLargeVector:[ulong]; // can be > 1GB, > 2GB, etc.
  anotherLargeVector:[ulong];
}

table SlightlyLargerTable (fs_offsetSize:5) { // should throw "not supported", or use a truncated long
  someLargeVector:[ulong];
  anotherLargeVector:[ulong];
}

table SmallTable (fs_offsetSize:2) { // short sized uoffset_t
  someTinyTable:[byte]; // should fail to encode if length > 32767ish?
}

table TinyTable (fs_offsetSize:1) { // byte sized uoffset_t
  someTinyTable:[byte]; // should fail to encode if length > 127ish?
}
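To illustrate what a per-table fs_offsetSize implies for encoding, here is a minimal Python sketch (hypothetical, not FlatSharp's actual implementation): an offset writer that packs a uoffset into a configurable number of little-endian bytes and fails loudly on overflow, matching the "should fail to encode if length > ..." comments above.

```python
import struct

# Hypothetical sketch: write a uoffset with a configurable byte width,
# raising when the value does not fit (the "fail to encode" case above).
_FORMATS = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}

def write_uoffset(value: int, size: int) -> bytes:
    """Encode an unsigned offset into `size` little-endian bytes."""
    if value < 0 or value >= 1 << (8 * size):
        raise OverflowError(f"offset {value} does not fit in {size} byte(s)")
    if size in _FORMATS:
        return struct.pack(_FORMATS[size], value)
    # Odd widths like fs_offsetSize:5 have no struct format; fall back.
    return value.to_bytes(size, "little")
```

With size=4 this matches the standard uoffset_t; size=8 would allow vectors past the 2GB boundary at the cost of binary compatibility.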

Of course such an implementation would not be expected to be binary compatible with other implementations, but keeping it simple might allow it to be compatible with a future spec.

Variable-sized ints are not worth the additional code they would require at this time, but they could also be done in a similar fashion.

table VarIntOffsetTable (fs_offsetVarSize) { // use LEBs instead of fixed-size uoffset_t
  someTinyTable:[ulong]; // can be arbitrarily large
}

They were discussed as a proposal for FlatBuffers2. They are a flexible alternative for achieving different offset sizing, but the additional code may not be necessary to achieve the same goals and is unlikely to provide any real-world benefit over explicitly specified per-table offset sizing.
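For reference, the variable-sized offsets above would typically use LEB128, the standard little-endian base-128 varint encoding (7 payload bits per byte, high bit as a continuation flag). A small sketch of the encoding and decoding:

```python
def encode_uleb128(value: int) -> bytes:
    # Unsigned LEB128: 7 bits per byte, high bit set on all but the last.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uleb128(data: bytes) -> int:
    # Accumulate 7-bit groups until a byte without the continuation bit.
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return result
```

Small offsets cost one byte and arbitrarily large ones remain representable, but every offset read becomes a loop rather than a fixed-width load, which is part of why fixed per-table sizing may be the simpler trade.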

@jamescourtney
Owner

Thanks for the thoughtful comments. This is something I'd love to see happen, but the line for FlatSharp right now is to not do anything that causes binary incompatibility with the canonical library.

So the starting point for this needs to be a proposal in the official FlatBuffers repo. If you don't get traction there, then I think it's worth considering whether there are sane ways for FlatSharp to do extensions such that people can't mix standard and extension formats.

One idea for this is a set of directive statements at the top of the file:

FlatSharp.Directive OffsetSize = 8;

Ignoring the issue of imports, the thing I like about this is that it will cause a hard break with flatc, so it won't be possible to compile a schema there if this directive is also specified.

@TYoungSL
Contributor Author

TYoungSL commented Jun 27, 2021

@aardappel wants to do it for FlatBuffers2;

From google/flatbuffers#5875 (comment)

  1. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298

There's plenty of support, but not immediate concern. I'll create an issue for it if I can't find relevant discussion in the issue regarding specifying offset size (google/flatbuffers#5471).

Yes, a parser-breaking extension to the FBS grammar would be a good way to prevent unintentional apparent compatibility.

I'd propose something that other implementations could more easily support generically.
E.g. something along the lines of directive FlatSharp OffsetSize = 8; so that the grammar has a definite start (directive), a namespace token (FlatSharp here), and a termination token (;), in the vein of #pragma.
Anything unsupported by an implementation attempting to honor directives would then fail with a better explanation than just a line number.

table LargeTable {
  directive FlatSharp OffsetSize = 8; // long sized uoffset_t
  someLargeVector:[ulong]; // can be > 1GB, > 2GB, etc.
  anotherLargeVector:[ulong];
}

In a compiler that is attempting to support directives generically;

Unsupported table directive in LargeTable: FlatSharp OffsetSize = 8

In FlatSharp if OffsetSize is unsupported;

Unsupported FlatSharp table directive in LargeTable: OffsetSize = 8

In a compiler that is not attempting support;

Unknown member directive in LargeTable, line 2

A word other than directive may be appropriate; maybe something implying a requirement rather than an option.
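The proposed statement shape above (a definite start token, a namespace, a key/value pair, and a terminator) is simple enough to recognize with a single pattern. A hypothetical sketch of such a generic directive parser, producing the kind of structured error context described above:

```python
import re

# Hypothetical grammar sketch for the proposed statement:
#   directive <Namespace> <Key> = <Value> ;
_DIRECTIVE = re.compile(
    r"^\s*directive\s+(?P<ns>\w+)\s+(?P<key>\w+)\s*=\s*(?P<value>[^;]+)\s*;\s*$"
)

def parse_directive(line: str):
    """Return (namespace, key, value) for a directive line, else None."""
    m = _DIRECTIVE.match(line)
    if not m:
        return None
    return m.group("ns"), m.group("key"), m.group("value").strip()
```

A compiler that parses the namespace and key even for directives it cannot honor can report "Unsupported FlatSharp table directive in LargeTable: OffsetSize = 8" instead of a bare line number.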

@TYoungSL TYoungSL changed the title Attribute to support different offset (uoffset_t) size per table. Mechanism to support different offset (uoffset_t) size per table Jun 27, 2021
@jamescourtney
Owner

I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list. Thanks for asking. I read it over a year ago and thought "that sounds neat!", and I still think that! It really would be a meaningful improvement to the format in a lot of ways.

With respect to FlatSharp, implementing "mixed-mode" offset sizes is going to be hard. FlatSharp's implementation wants to write a single serialize method per CLR type, so having one method that serializes TableA with 4-byte offsets and another with 8-byte offsets would invalidate a lot of assumptions. I'm not sure if that's better or worse than a global offset directive.

@aardappel

I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list.

A wish list.

@jamescourtney
Owner

I'm interested to see if the FlatBuffers2 proposal has any real traction or is just a wish list.

A wish list.

Appreciate the response, @aardappel!

@jamescourtney
Owner

@TYoungSL, I'm really of two minds about this.

The first half says that large buffers would be useful, and adding them in a non-dangerous (i.e., clearly incompatible with flatc) way would be a fun project.

The other half says that at the end of the day, FlatSharp is an implementation of FlatBuffers, 2GB limit and all. So while I can toe the line of the FlatBuffers format with various features (type facades, indexed vectors, etc), binary compatibility and correctness of the format are king, because much of the usefulness of the project derives from those two things.

The parts that concern me most are:

  • The additional test/maintenance overhead. If I commit to supporting "big buffer" mode in perpetuity, do I need to double my test cases?
  • The fact that once this is added, FlatSharp is really doing two things: FlatBuffers and something that looks a bit like FlatBuffers.
  • People (including myself at Microsoft) actually use this library to run real systems, and one of the promises I do my best to fulfill is that I will not write malformed data.

@TYoungSL
Contributor Author

If ultimately we have data structures that exceed 1-2GB, perhaps FlatBuffers was a poor choice.

@jamescourtney
Owner

If ultimately we have data structures that exceed 1-2GB, perhaps FlatBuffers was a poor choice.

Most serialization formats are going to be poor choices if your goal is 100GB! Protobuf has a hard limit of 2GB. The only one I know of that supports 2^64 is Cap'n Proto, but I've not personally used it or the C# implementation.

Based on my admittedly limited knowledge of your problem space, I think you have a few viable paths forward, but they all involve you writing some code.

  • Write an indirection wrapper on top of your root FlatBuffer type. I don't think you ever said that your arrays of structs had to be contiguous, so have 100 1GB files and use some modulo arithmetic to figure out the mapping of index -> file. Matrices might be slightly trickier, but it seems like you could divide the larger matrix into many smaller ones. You can then wrap this in an object that exposes an indexer accepting a long and presents a unified view to the world. Your logic can then operate mostly independently of the actual file structure of the data. I think you maintain your ability to use SIMD here as well.

  • I don't think you ever mentioned having interop be a concern, so fork FlatSharp and use it as the basis for a FlatBuffer-inspired data format of your own design. This would allow you to support huge buffers natively, add first-class support for System.Numerics types, and other things that your use cases require. I'd be happy to consult with you on some of these changes. The changes wouldn't be huge -- the troubling part for me is trying to run some flavor of mixed mode where FlatSharp supports both bigbuffer and standard modes. Either by themselves is a much more straightforward configuration.

  • Consider binary encoding or Cap'n Proto.
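The indirection-wrapper idea in the first bullet can be sketched in a few lines. This is a hypothetical illustration (names and shard layout are assumptions, not FlatSharp API): many fixed-size shards presented as one long-indexed vector via divmod.

```python
class ChunkedVector:
    """Hypothetical indirection wrapper: presents many fixed-size buffers
    (e.g. one FlatBuffer file per shard) as one long-indexed vector."""

    def __init__(self, chunks, chunk_len):
        self.chunks = chunks        # sequence-like shards, in order
        self.chunk_len = chunk_len  # elements per shard (all but the last)

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def __getitem__(self, index: int):
        # Modulo arithmetic maps a 64-bit logical index to (shard, offset).
        chunk, offset = divmod(index, self.chunk_len)
        return self.chunks[chunk][offset]
```

Each shard stays under the 2GB uoffset_t limit, while callers index with a single 64-bit value and never see the file boundaries.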

@TYoungSL
Contributor Author

Cap'n Proto's C# implementations leave something to be desired.

@jamescourtney
Owner

Based on the comments in google/flatbuffers#5471, it seems like 64 bit mode might have a future sooner than the FlatBuffers2 proposal. If that comes to pass, then I'll definitely support it in FlatSharp.

@jamescourtney
Owner

jamescourtney commented Jun 29, 2021

To clarify, I will do my best to support it. There are still some limitations in .NET around int32 that need to be thought through. For example, Span<T> and IList<T> accept only int32 indexes.

@TYoungSL
Contributor Author

Yeah, interested to see what shakes out.

@jamescourtney jamescourtney added the wontfix This will not be worked on label Jul 8, 2021