Right now, there is no API for restoring the state of a log from disk… The process can’t be volatile regarding data.
Imagine you have to restart a process: the commit log should be quickly readable from a folder on disk, since you don't want to lose all the log data that you had already ingested...
Since the end-to-end functionality is complex, we can start by breaking up the tasks top-down:
CommitLog
When opening a commit_log, a folder should be given. The files within the folder should be organised into tuples of same-named files (index/log); sub-directories and anything else should be ignored. It would be good to read the files in order of creation or, even better, by their offset (name).
let mut c = CommitLog::open("~/foo"); // everything should work the same down here
c.write(…)
c.read(…)
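A minimal sketch of the directory scan described above, using only the standard library. The `.index` / `.log` extensions, the decimal base offset as file name (e.g. `00000000000000000042.log`), and the names `SegmentFiles` and `scan_segments` are all assumptions made for illustration, not the project's actual layout or API. Incomplete pairs are silently skipped here; a real implementation may prefer to report them.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Paths of one segment: an index file and a log file sharing a base offset.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct SegmentFiles {
    pub base_offset: u64,
    pub index: PathBuf,
    pub log: PathBuf,
}

/// Scans `dir` and returns the complete index/log pairs, sorted by base offset.
pub fn scan_segments(dir: &Path) -> io::Result<Vec<SegmentFiles>> {
    // base offset -> (index path, log path); a BTreeMap iterates in key order.
    let mut found: BTreeMap<u64, (Option<PathBuf>, Option<PathBuf>)> = BTreeMap::new();

    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if !path.is_file() {
            continue; // sub-directories and other non-files are ignored
        }
        // Assumption: the file stem is the base offset written in decimal.
        let offset = match path
            .file_stem()
            .and_then(|s| s.to_str())
            .and_then(|s| s.parse::<u64>().ok())
        {
            Some(o) => o,
            None => continue, // not named after an offset
        };
        let ext = path.extension().and_then(|e| e.to_str()).map(str::to_owned);
        let slot = found.entry(offset).or_default();
        match ext.as_deref() {
            Some("index") => slot.0 = Some(path),
            Some("log") => slot.1 = Some(path),
            _ => {} // unrelated file type
        }
    }

    Ok(found
        .into_iter()
        .filter_map(|(base_offset, pair)| match pair {
            (Some(index), Some(log)) => Some(SegmentFiles { base_offset, index, log }),
            _ => None, // only one half of the pair exists
        })
        .collect())
}
```

Sorting by the parsed offset rather than by creation time keeps the order stable even if files are copied or restored from a backup. Note that Rust does not expand `~` in paths, so a path like `"~/foo"` would have to be resolved by the caller.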
Segment
When opening a segment, both the log and index file paths should be given for a full check. Each file check is performed by the index/log structs, but the segment should ensure that the returned struct is either open for writing or closed …
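One way to picture the open/closed decision is the sketch below. It assumes the index and log checks each report how many bytes are in use and what the maximum is; `FileCheck`, `SegmentState` and `segment_state` are hypothetical names, and the rule "open only if both files still have room" is an assumption rather than something decided in this issue.

```rust
/// Size of one index entry in bytes (the default mentioned in this issue).
pub const INDEX_ENTRY_SIZE: u64 = 20;

/// Outcome of a per-file check (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct FileCheck {
    pub used_bytes: u64,
    pub max_bytes: u64,
}

impl FileCheck {
    /// True if `bytes` more bytes still fit below the maximum size.
    pub fn has_room_for(&self, bytes: u64) -> bool {
        self.used_bytes.saturating_add(bytes) <= self.max_bytes
    }
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SegmentState {
    Open,
    Closed,
}

/// Assumption: a segment is writable only if the index can take one more
/// entry and the log can take at least one more byte.
pub fn segment_state(index: FileCheck, log: FileCheck) -> SegmentState {
    if index.has_room_for(INDEX_ENTRY_SIZE) && log.has_room_for(1) {
        SegmentState::Open
    } else {
        SegmentState::Closed
    }
}
```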
File level
Index File
This is the trickiest part, since the index is the reference for where data is stored in the files themselves. I would say this is the first part to be implemented.
The procedure must reopen a given file and check its content and the space left.
The index is truncated to its maximum size on creation (filled with empty bytes). That’s good because it spares space on disk and in memory, but a bit bad because when reopening we have to figure out where we stopped writing to it. If we just look at the file size, it will always report the max_size defined beforehand, so we need to find where the first empty byte is to actually make sense of it.
There are several ways of doing so; what I’ve mostly seen implemented is a binary search within the file to look up entries.
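A rough sketch of the binary-search approach, working on the index contents as a byte slice (for example a memory map or a buffer read from the file). It assumes that a written entry is never all zeros and that written entries always come before empty ones; `ENTRY_SIZE`, `is_empty_entry` and `count_entries` are names made up for this example. An entry that is legitimately all zeros (say, relative offset 0 at position 0) would break the first assumption and would need special handling.

```rust
/// Size of one index entry in bytes (the default mentioned in this issue).
pub const ENTRY_SIZE: usize = 20;

/// True if entry number `i` consists only of zero bytes.
fn is_empty_entry(buf: &[u8], i: usize) -> bool {
    buf[i * ENTRY_SIZE..(i + 1) * ENTRY_SIZE].iter().all(|&b| b == 0)
}

/// Returns how many entries have been written, using O(log n) entry probes.
pub fn count_entries(buf: &[u8]) -> usize {
    let mut lo = 0; // every entry before `lo` is known to be written
    let mut hi = buf.len() / ENTRY_SIZE; // every entry from `hi` on is empty
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if is_empty_entry(buf, mid) {
            hi = mid; // the first empty entry is at `mid` or earlier
        } else {
            lo = mid + 1; // the first empty entry is after `mid`
        }
    }
    lo
}
```

The next write position in the index would then be `count_entries(buf) * ENTRY_SIZE`.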
One idea is to read the file in reverse until you find the first “existing” (non-empty) byte, set that as the end of the file, and then do a quick sanity check on the entries by verifying that the used length divides evenly by the default entry size (20).
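The reverse scan could look roughly like the sketch below, again over a byte slice. `Recovered` and `recover` are hypothetical names. Because the last bytes of a valid entry could themselves be zero, this sketch rounds the used length up to the next entry boundary and reports separately whether the length was already a multiple of the entry size, so the caller can decide how strict to be; that rounding is an assumption on my side, not part of the original idea.

```rust
/// Size of one index entry in bytes (the default mentioned in this issue).
pub const ENTRY_SIZE: usize = 20;

#[derive(Debug, PartialEq, Eq)]
pub struct Recovered {
    /// Number of entries considered written (used length rounded up).
    pub entries: usize,
    /// True if the last non-zero byte ended exactly on an entry boundary.
    pub aligned: bool,
}

/// Scans backwards for the last non-zero byte and derives the entry count.
pub fn recover(buf: &[u8]) -> Recovered {
    // One past the last non-zero byte, or 0 if the whole buffer is empty.
    let used = buf.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    Recovered {
        entries: (used + ENTRY_SIZE - 1) / ENTRY_SIZE,
        aligned: used % ENTRY_SIZE == 0,
    }
}
```

This scan is linear in the amount of trailing empty space, so for a mostly empty index it touches far more bytes than a binary search would.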
Log
There isn’t too much to do here other than open the file and check whether it is still "open" (meaning that it has space left for writes). That's done by properly checking the size: empty bytes shouldn't count.
Important: for the file implementations, check the reference links.
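A sketch of that size check, assuming the log file is also pre-filled with zero bytes up to its maximum size. `used_len`, `is_open` and the 4096-byte chunk size are choices made for this example only. Since a record could legitimately end in zero bytes, a more robust variant might derive the end of the log from the last index entry instead; that is a suggestion, not something settled here.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// How many bytes are read per step while scanning backwards.
const CHUNK: u64 = 4096;

/// Returns the length of the file once trailing zero bytes are ignored.
pub fn used_len(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut end = file.metadata()?.len();
    let mut buf = vec![0u8; CHUNK as usize];
    while end > 0 {
        let start = end.saturating_sub(CHUNK);
        let chunk = &mut buf[..(end - start) as usize];
        file.seek(SeekFrom::Start(start))?;
        file.read_exact(chunk)?;
        // Position of the last non-zero byte inside this chunk, if any.
        if let Some(i) = chunk.iter().rposition(|&b| b != 0) {
            return Ok(start + i as u64 + 1);
        }
        end = start; // chunk was all zeros, keep scanning towards the start
    }
    Ok(0)
}

/// The log is still "open" if the bytes in use are below `max_size`.
pub fn is_open(path: &Path, max_size: u64) -> io::Result<bool> {
    Ok(used_len(path)? < max_size)
}
```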
Questions still open here:
Should we keep storing segments as Vec<Segment>?
How memory-efficient is it?
Should we look into BTrees for efficiency?
Should we store index offsets as global offsets (base offset + index offset), or should we keep the two separate?
I’ve seen other implementations where all the offsets were global: the base (filename) plus the offset within the file. I could never dive into figuring out the tradeoffs. It seems that if you don’t do it, you have a bit more trouble searching later when reading. (I’ll probably move this to another issue.)
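To make the questions above a bit more concrete, here is a small sketch of both lookup styles and of the global/relative offset conversion. All function names are hypothetical, and the 32-bit relative offset is an assumption. Both lookups take O(log n) comparisons, so the choice is mostly about memory layout and how segments get inserted or removed; the sketch does not settle the question.

```rust
use std::collections::BTreeMap;

/// Sorted Vec variant: returns the base offset of the segment that would
/// contain `offset`, or None if `offset` is below the first segment.
pub fn find_in_vec(bases: &[u64], offset: u64) -> Option<u64> {
    // Number of base offsets that are <= offset (the slice must be sorted).
    let n = bases.partition_point(|&b| b <= offset);
    n.checked_sub(1).map(|i| bases[i])
}

/// BTreeMap variant: the greatest key that is <= offset, with its value.
pub fn find_in_btree<V>(segments: &BTreeMap<u64, V>, offset: u64) -> Option<(&u64, &V)> {
    segments.range(..=offset).next_back()
}

/// Global offset -> offset relative to the segment's base offset.
/// Returns None if the offset is below the base or does not fit in 32 bits.
pub fn to_relative(base: u64, global: u64) -> Option<u32> {
    let rel = global.checked_sub(base)?;
    if rel <= u32::MAX as u64 {
        Some(rel as u32)
    } else {
        None
    }
}

/// Relative offset -> global offset.
pub fn to_global(base: u64, relative: u32) -> u64 {
    base + relative as u64
}
```

Since segments are only ever appended in increasing offset order, a Vec stays sorted without extra work, which is one argument for keeping it.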
Acceptance criteria
At the end of this task, we should be able to reopen a log from disk following the above instructions/considerations.
References:
Kafka Log
Kafka Index
https://github.com/travisjeffery/jocko/blob/master/commitlog/commitlog.go#L95-L133
https://github.com/zowens/commitlog/blob/master/src/index.rs#L203-L226