[RNTuple] Roadmap of writing RNTuple to disk #336

Open
Moelf opened this issue May 4, 2024 · 6 comments

Moelf commented May 4, 2024

My current plan is to break this huge task into four large stages:

Stage 1 (Before July 2024)

In Stage 1 we aggressively use dummy data (raw bytes copied from real ROOT files) and try to figure out the broad steps needed to write an RNTuple to disk. Roughly, they should be:

  • TFile header, RNTuple Anchor, RNTuple header
  • Update RNTuple anchor location
  • Writing out pages
  • Then RNTuple footer, TDictionary?
  • Update locations in TFile header and TDictionary (RNTuple Anchor)
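The ordering above (write structures with unknown locations first, then go back and fill in the locations) can be sketched as a placeholder-and-back-patch pattern. This is only a toy illustration: the `TOY1` magic, the field widths, and the function `write_toy_file` are all hypothetical and do not reflect the real TFile or RNTuple byte layout.

```python
import io
import struct

# Hypothetical toy layout, NOT the real TFile/RNTuple format:
#   file header : magic (4 bytes) + anchor offset (u64)
#   anchor      : header offset (u64) + footer offset (u64)
#   then the ntuple header, the pages, and the footer
MAGIC = b"TOY1"


def write_toy_file(header_blob, pages, footer_blob):
    buf = io.BytesIO()

    # 1. File header, with a placeholder for the anchor location.
    buf.write(MAGIC)
    anchor_ptr_pos = buf.tell()
    buf.write(struct.pack(">Q", 0))

    # 2. Anchor, with placeholders for the header and footer locations.
    anchor_pos = buf.tell()
    buf.write(struct.pack(">QQ", 0, 0))

    # 3. Ntuple header, then the pages, then the footer.
    header_pos = buf.tell()
    buf.write(header_blob)
    for page in pages:
        buf.write(page)
    footer_pos = buf.tell()
    buf.write(footer_blob)

    # 4. Back-patch the locations that were unknown when first written.
    buf.seek(anchor_pos)
    buf.write(struct.pack(">QQ", header_pos, footer_pos))
    buf.seek(anchor_ptr_pos)
    buf.write(struct.pack(">Q", anchor_pos))

    return buf.getvalue()
```

Writing placeholders first keeps the write strictly sequential until the very end, at which point only a few fixed-width fields need to be revisited.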

Usability after Stage 1:

Bare minimum: users can write out primitive fields that are not too long.

Stage 2 (Before Oct 2024)

In Stage 2 we try to peel off dummy stuff in two ways:

  • Reduce the amount of "hard-code a dummy and modify it" code smell
  • Allow larger files by handling cluster groups correctly

At this stage we will still be very rigid when it comes to schema complexity, but closer to production quality in the "file size" aspect.

Usability after Stage 2:

Basically functional for very simple data schemas of any size, as long as the data fits in RAM (one-shot writing).

Stage 3 (Before June 2025)

In Stage 3 we will expand the level of completeness in two critical ways:

  • Improve schema support, in particular offset vector fields; consider switching to AwkwardArray.jl at this stage
  • Allow appending / streaming of data onto disk
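For readers unfamiliar with offset vector fields, here is a minimal sketch of the general idea: a jagged column is split into a flat content array plus an array of offsets. It assumes cumulative end offsets with no leading zero; whether that matches the exact on-disk convention is an assumption, and the helper names `to_offsets` and `from_offsets` are hypothetical.

```python
def to_offsets(rows):
    """Flatten a list of variable-length rows into (offsets, content).

    offsets[i] is the cumulative end position of row i in content
    (an assumed convention for this sketch).
    """
    offsets = []
    content = []
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


def from_offsets(offsets, content):
    """Rebuild the variable-length rows from (offsets, content)."""
    rows = []
    start = 0
    for end in offsets:
        rows.append(content[start:end])
        start = end
    return rows
```

For example, the rows `[[1, 2], [], [3]]` become offsets `[2, 2, 3]` and content `[1, 2, 3]`, and `from_offsets` recovers the original rows.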

Usability after Stage 3:

This will be analysis-production ready: users can write nanoAOD-like files of any size and append to an existing dataset. This is the intermediate milestone for a production-level, useful RNTuple writer.

Stage 4 (unknown time)

In Stage 4 we try to complete whatever is still missing, possible items:

  • Vastly improve schema support; we need a good design for this to be maintainable in the long term
  • Introduce other RNTuple features, such as alias columns, and provide APIs for them
Moelf changed the title from "[RNTuple] Roadmap to writing RNTuple to disk" to "[RNTuple] Roadmap of writing RNTuple to disk" on May 4, 2024

Moelf commented Jun 18, 2024

image

I have fully mapped every byte in a file now. @tamasgal might be interested in this:

hexproj.zip


Moelf commented Jul 8, 2024

image


tamasgal commented Jul 8, 2024

Haha, awesome work Jerry! I still have not seen any RNTuple in our experiment, but it's a good feeling that you shed so much light on this area :)


Moelf commented Jul 8, 2024

I expect them to show up after the ROOT team freezes the format (1.0.0) this year or early next year.


tamasgal commented Jul 9, 2024

We should add this to our documentation as the golden reference ;)


Moelf commented Sep 27, 2024
