feature request: add ability to ingest from S3 bucket #912

We need to ingest many objects (billions) from an S3 bucket. Due to the large number of objects, just getting the object listing will take a very long time. The ingestion should be similar to FilesGlob.

Comments
The current logic will attempt to list all entries on every ingest iteration, so if you really have billions of objects, I'm afraid listing will be very slow no matter what. I assume your objects aren't just sitting under one key prefix but are partitioned somehow, so you can likely express the listing logic far more efficiently than a generic mechanism could. I therefore see three options:
One example of a very large S3 bucket is processing AWS CloudTrail logs, which is the use case I'm looking at.
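Since CloudTrail already partitions objects by account, region, and date, a listing scoped to a key prefix touches only one partition per iteration. Below is a minimal sketch of that idea, assuming the `aws-sdk-s3` and `aws-config` crates (1.x); the bucket name, account ID, and date prefix are placeholders, not values from this project:

```rust
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() -> Result<(), aws_sdk_s3::Error> {
    let config = aws_config::load_defaults(BehaviorVersion::latest()).await;
    let client = aws_sdk_s3::Client::new(&config);

    // CloudTrail lays out keys as AWSLogs/{account}/CloudTrail/{region}/{yyyy}/{mm}/{dd}/...,
    // so one day's partition can be listed without scanning the whole bucket.
    let prefix = "AWSLogs/123456789012/CloudTrail/us-east-1/2023/01/15/";

    let mut pages = client
        .list_objects_v2()
        .bucket("my-cloudtrail-bucket") // placeholder bucket name
        .prefix(prefix)
        .into_paginator()
        .send();

    while let Some(page) = pages.next().await {
        for object in page?.contents.unwrap_or_default() {
            println!("{}", object.key.unwrap_or_default());
        }
    }
    Ok(())
}
```

If the ingest source lets the user template the prefix (e.g. by date), each iteration lists thousands of keys instead of billions.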
Using ETAG similarly allows you to store any kind of state data - it's a fully opaque object. For example, this image uses ETAG to store the sequence number of the last processed blockchain block.
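As a toy illustration of that opacity, here is a sketch in which the cached "ETag" is really a resumption token encoding the last processed block number; `fetch_next` and `FetchResult` are hypothetical names, not part of any real API:

```rust
/// Result of one fetch iteration: the fetched data plus the new opaque
/// state that the caching layer will store as the "ETag".
struct FetchResult {
    data: Vec<u8>,
    new_etag: String,
}

/// The caching layer replays whatever string was stored last time; only
/// the fetcher knows it encodes a blockchain block number.
fn fetch_next(prev_etag: Option<&str>) -> FetchResult {
    let last_block: u64 = prev_etag.and_then(|e| e.parse().ok()).unwrap_or(0);
    let next_block = last_block + 1;
    FetchResult {
        data: format!("payload of block {next_block}").into_bytes(),
        new_etag: next_block.to_string(),
    }
}

fn main() {
    let first = fetch_next(None); // no cached state yet -> block 1
    let second = fetch_next(Some(&first.new_etag)); // resumes at block 2
    println!("processed blocks {} and {}", first.new_etag, second.new_etag);
}
```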
This crate looks applicable: https://github.com/datafusion-contrib/datafusion-objectstore-s3
Yea, we are already using the S3 object store crate under the hood to have DataFusion query data directly from Parquet on S3. I think the key problem here is not how to access the data, but what listing and state management mechanism to use. I've re-read all of the above and extracted two work items from this feature request:
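For reference, the DataFusion-over-S3 path looks roughly like the sketch below, assuming recent `datafusion`, `object_store` (with its `aws` feature), `url`, and `tokio` crates; the bucket name and table path are placeholders, and the `object_store` version must match the one DataFusion was built against:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use object_store::aws::AmazonS3Builder;
use url::Url;

#[tokio::main]
async fn main() -> Result<()> {
    let bucket = "my-bucket"; // placeholder

    // Credentials and region are picked up from the environment.
    let s3 = AmazonS3Builder::from_env().with_bucket_name(bucket).build()?;

    // Make the s3:// scheme resolvable inside this session.
    let ctx = SessionContext::new();
    let url = Url::parse(&format!("s3://{bucket}")).expect("valid URL");
    ctx.runtime_env().register_object_store(&url, Arc::new(s3));

    // Query Parquet files directly from the bucket.
    ctx.register_parquet(
        "events",
        &format!("s3://{bucket}/data/"),
        ParquetReadOptions::default(),
    )
    .await?;

    ctx.sql("SELECT COUNT(*) FROM events").await?.show().await?;
    Ok(())
}
```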
@mingfang, could you please take a look and confirm that those would address your need?
Now that all requirements are captured in the feature tickets, I will close this one. I added both of the linked tickets as stretch goals to the "Usability" objective of our current milestone - hope we'll get to them soon. Thanks again, @mingfang!