-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compressed output format (jsongz) for rethinkdb export/import #251
base: master
Are you sure you want to change the base?
Conversation
23176e9
to
f6e3e40
Compare
Hello @iantocristian 👋 Could you please add some unit/integration tests for this functionality? |
I would but it looks like no unit/integration tests exist for the import/export scripts in general 😅 . Seems like a big job. |
@lsabi could you please double check this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All in all it seems good. The style has been maintained but there are missing tests, which could become a problem.
We could write them in a second moment.
Apart from the comments I've added it seems ok. @gabor-boros do you have anything to add? Especially for the comment about the new line
rethinkdb/_export.py
Outdated
if options.format == "jsongz": | ||
if options.compression_level is None: | ||
options.compression_level = -1 | ||
elif options.compression_level < 0 or options.compression_level > 9: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone passes -1 as option?
In my opinion it should be changed into elif options.compression_level < -1 or options.compression_level > 9:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning was: passing -1
is the same as not specifying the compression level, it's not really setting the compression_level
.
Happy to make the change as suggested though, it might make it easier to switch between setting and not setting the compression level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your reasoning, but then if someone passes -1
, if options.compression_level is None
is evaluated False
, if options.compression_level < 0 or options.compression_level > 9
is evaluated to True
and raises an exception. Which is incorrect, since -1
is an acceptable value.
If you have another suggestion on how to handle such situation, feel free to add it. Mine was just a potential suggestion on how to handle it.
All in all, It's no big deal to support -1
, but could prevent some errors/exceptions, since I assume that the default value is an acceptable value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested.
for item in list(row.keys()): | ||
if item not in fields: | ||
del row[item] | ||
if first: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies that the objects in the JSON
array are each on a separate line.
I'm no compression expert, but since it'll be binary and unreadable from a high level perspective, why not skip the new line? Object would be written as follows
[{...},{...},{...}...{...}]
which would save n + 1 + 1 new lines (n-1 between each object, plus 2 for the first and the last, + 1 for the last row which EOF
and is good. On huge table this could imply saving a considerable amount of bits. @gabor-boros do you have any knowledge about the topic?
Or maybe I'm wrong and the \n are used for compression. Let me know, I'm curious now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is replicating what the json export does. \n
are not used for compression.
One can still unpack the jsongz file get the json file from inside (it's a standard gzip file), in which case the formatting might help. I considered the gains from removing the \n
-s marginal, but you might be right. If you have really small documents, the extra end lines might make a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should say that import
script includes a custom json paarser (which I found odd, not sure what the reason for using a custom parser was, performance perhaps?) which might be affected by the lack of new lines (I expect it to be happy though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you have really small documents, the extra end lines might make a difference
Did you mean big documents? Hehe
Regarding the custom json parser, I have no clue why there's a custom one. Probably, when the library has been written, there were parsers that did not fit/match the requirements. Nowadays there are tons of high performance parsers. In order to not break anything, I would keep the custom one for now.
I am on the same page here. We need some tests for the backup and restore functionality. But it feels like a different story / PR is in order. |
One other comment I got was related to the Downside of using |
I haven't worked on the import nor export scripts, but an option is to have a list of supported extensions and try to match the last part of the filename against the list. Another alternative, could be to have a hash list with the supported extensions pointing towards other extensions like
This way, files ending with |
To me it looks good. Only the the new line feed which I don't know if it's worth removing it or not. We can, in a second moment, write tests and check how much it influences the size of the generated file. @gabor-boros what do you think? For me it can be passed |
It's one new line per document right? + another one at the start and another one end. For a table with 1000 documents with average size of 1kb, you gain 1002 bytes, uncompressed, less than 0.1% gain. Not worth it imo. Another point is that jsongz = gzipped json. So it should be the same output as json, but compressed. |
Percentages vary based on the size of the document. But, if you have a table with millions of records, it implies saving MBs of space. Sure it'll not be much in comparison to the total size, but it may be better and easier to fit it into memory. I'm not sure there'll be such a big table, though. Nevertheless, as I said, this can be done in a second moment. What's the point about the jsongz? I don't understand it |
👍
That it wasn't my intention to change the content that's being dumped, just to compress it. |
Don't worry, we can keep them separate and merge one day. @gabor-boros do you have anything to add/to complain about this PR? |
@lsabi / @gabor-boros just wondering if there was an update on getting this merged in? It would be super useful for us. Thanks |
Reason for the change
#249
Description
Implemented feature in rethindkb-export and rethinkdb-import scripts: a new export format jsongz - gzipped json.
On export side, there is a new jsongz writer (based on the json writer implementation), passing the output through a zilb compressor.
On the import side, JsonGzSourceFile extends JsonSourceFile (slightly modified) and can read from the gzipped json data files directly. And an addition to SourceFile constructor to read the uncompressed size from the gzip trailer.
Checklist
References
Usage:
rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export
Tested with python 2.7.16 and python 3.8.5