-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel Arrow file format reading #8503
Comments
See also #8504 |
I'd like to have a try. |
This sounds like a good idea to me in theory -- I am not sure how easy/hard it would be to do with the existing arrow IPC reader In general, the strategy for paralleizing Paruqet and CSV is to be to split up the file by ranges, and then have each of the Perhaps we could do the same for arrow files which could use the first byte of the RecordBatches 🤔 This code explains it a bit more: https://github.com/apache/arrow-datafusion/blob/6b433a839948c406a41128186e81572ec1fff689/datafusion/core/src/datasource/physical_plan/file_groups.rs#L35-L79 |
There maybe several RecordBatches(blocks in arrow-rs) in a Arrow file(I didn't notice it before). We can handle it like rowgroups in parquet. I will check whether DICTIONARY can be handled correctly since there maybe Delta DICTIONARY. Thanks. |
It seems that delta dictionary batches not supported yet. And I think a pub function to provide offsets is needed in upstream. Like impl<R: Read + Seek> FileReader<R> {
pub fn blocks(&self) -> Vec<Block> {
&self.blocks
}
//OR
pub fn offsets(&self) -> Vec<i64> {
&self.blocks.iter().map(Block::offset).collect()
}
} |
apache/arrow-rs#5249 adds a lower-level reader that should enable this and other use-cases
Delta and replacement dictionaries are only supported by IPC streams, not files |
get it! Thanks |
I will complete this after next release of arrow-rs. |
The next release is tracked by apache/arrow-rs#5234 |
Is your feature request related to a problem or challenge?
DataFusion can now automatically read CSV and parquet files in parallel (see #6325 for CSV)
It would be great to do the same for "Arrow" files
Describe the solution you'd like
Basically implement what is described in #6325 for Arrow -- and read a single large arrow file in parallel
Describe alternatives you've considered
Some research may be required -- I am not sure if finding record boundaries is feasible
Additional context
I found this while writing tests for #8451
The text was updated successfully, but these errors were encountered: