-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Parquet Byte Stream Split Encoding #5293
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not part of the apache foundation but I will try to use your code ASAP in my application. I reviewed it and it seems good.
@@ -0,0 +1,76 @@ | |||
use crate::basic::{Encoding, Type}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add the apache license header to get it approved.
@@ -0,0 +1,104 @@ | |||
use std::marker::PhantomData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add the apache license header to get it approved.
FTR, what are the actual numbers? |
.downcast_ref::<Float64Array>() | ||
.unwrap(); | ||
|
||
// This file contains floats from a standard normal distribution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably do an exact comparison on a few values as well, given that the file isn't going to change :-)
], | ||
vec![f32::from_le_bytes([0xA3, 0xB4, 0xC5, 0xD6])], | ||
]; | ||
test_byte_stream_split_decode::<FloatType>(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a test for DoubleType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 👍
On my machine, it was about 17us for floats and 36us for doubles, pretty much the same speed for encoding and decoding. These are all just short of 4GB/s. The implementation I built from was around 1.2GB/s. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code makes sense to me, and looks well tested, thank you
for batch in record_reader { | ||
let batch = batch.unwrap(); | ||
row_count += batch.num_rows(); | ||
let f32_col = batch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW you can use https://docs.rs/arrow-array/latest/arrow_array/cast/trait.AsArray.html to make this less verbose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 👍
Perhaps we could update the README to no longer say that we don't support this encoding |
Updated the readme. Is this ready to merge now? |
Thanks again |
@tustvold I see the V50 release is from 4 days ago, and I supposed it should contain this change, but I tried to update to v50 and enable BYTE_STREAM_SPLIT and I get this error:
|
The release was cut prior to this being merged, as it goes through an ASF mandated RC process that takes at least 3 days. You can view the changelog and/or the git history to see what made a particular release. |
Tracking next release in #5453 |
Which issue does this PR close?
Closes #4102.
This builds on the previous attempt at a PR: #4183
Rationale for this change
This brings us up to speed with the full set of Parquet encodings, I believe. It will also be important for the likely addition of f16 and fixed len byte arrays to the byte stream split encoding.
What changes are included in this PR?
Are there any user-facing changes?
No API additions