
Pulls from S3 bucket rather than from AWS URL #35

Open
Dadoof opened this issue Feb 29, 2024 · 3 comments

Dadoof commented Feb 29, 2024

Hello there folks,

I was making use of this open data to get the new 0.25 degree data, and I noticed something that I would like to investigate.

As it stands now, the tooling I see, namely 'client.py', pulls data from a URL. For example, something like:
wget https://ecmwf-forecasts.s3.eu-central-1.amazonaws.com/20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index

I believe that, from an Amazon (AWS) EC2 instance, this would be a faster pull mechanism:
aws s3 cp --no-sign-request s3://ecmwf-forecasts/20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index .

Those are command-line steps, of course; inside client.py and such, it would be different tools. My description above was simply to show the difference between a pull via HTTPS and one via the AWS S3 API.
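
In Python, I imagine the boto3 counterpart of that CLI command would look roughly like the sketch below (just an illustration on my part, not how client.py works today); Config(signature_version=UNSIGNED) plays the role of --no-sign-request:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) S3 client, the boto3 counterpart of --no-sign-request
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="eu-central-1")
s3.download_file(
    Bucket="ecmwf-forecasts",
    Key="20240227/12z/0p25/enfo/20240227120000-0h-enfo-ef.index",
    Filename="20240227120000-0h-enfo-ef.index",
)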

Any chance of adding the capability to pull from the S3 bucket (and thus over the AWS 'backbone') into an AWS EC2 instance, rather than over HTTP?

Regards,
Brian E.

@floriankrb (Member)

I would assume that a nice pull request to improve this would be welcome.

Two things to check though:

  • "I believe ... would be faster" : Some benchmarks could be needed, is it really faster ? When ? Where (AWS has different regions) ?
  • "from an amazon EC2 instance": Checking if the code is running in an EC2 instance should be robust. We do not want to break things elsewhere to support this use case.

Dadoof (Author) commented Mar 1, 2024

Hi there,

As for proper benchmarking, you are correct. For me, anecdotally, it is a good deal quicker; I did this today:

time aws s3 cp --no-sign-request s3://ecmwf-forecasts/20240202/12z/0p25/enfo/20240202120000-0h-enfo-ef.grib2 .
real    0m13.074s
user    0m9.010s
sys     0m10.453s

time wget https://ecmwf-forecasts.s3.eu-central-1.amazonaws.com/20240202/12z/0p25/enfo/20240202120000-0h-enfo-ef.grib2
real    1m2.003s
user    0m2.418s
sys     0m5.887s

This indicates that, for this one simple case, the pull through the S3 API is considerably quicker (13 seconds vs. 1 minute).
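
For a more repeatable comparison than my one-off timings above, something along these lines could be run from an EC2 instance (a rough sketch, assuming boto3 and requests are installed; the output file names are arbitrary):

import time
import boto3
import requests
from botocore import UNSIGNED
from botocore.config import Config

KEY = "20240202/12z/0p25/enfo/20240202120000-0h-enfo-ef.grib2"

# S3 API path (boto3, unsigned)
start = time.perf_counter()
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="eu-central-1")
s3.download_file("ecmwf-forecasts", KEY, "via_s3.grib2")
print(f"boto3 (S3 API): {time.perf_counter() - start:.1f}s")

# HTTPS path (requests, streamed to disk)
start = time.perf_counter()
url = f"https://ecmwf-forecasts.s3.eu-central-1.amazonaws.com/{KEY}"
with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open("via_https.grib2", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
print(f"requests (HTTPS): {time.perf_counter() - start:.1f}s")

Part of the gap in my CLI timings likely also comes from aws s3 cp doing parallel multipart transfers, so a careful benchmark should separate the transport from the concurrency.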

As for the EC2 instance: I was merely hoping this would be an option, not that it would replace any existing capabilities. If one wanted to use the S3 bucket rather than the AWS HTTPS endpoint as the location to get data from, that option would simply exist.

Regards,
Brian E

jvahl commented Mar 2, 2024

To pull files directly from an S3 URI (s3://...), the backend would need to use boto3 instead of requests. I think it would be best to start building this capability in the multiurl dependency, which executes the downloads.
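
For what it's worth, a scheme-based dispatch is one way this could look; this is only an illustrative sketch, not the multiurl API:

from urllib.parse import urlparse

def download(url: str, target: str) -> None:
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        # s3://bucket/key -> anonymous download through the S3 API
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
        s3.download_file(parsed.netloc, parsed.path.lstrip("/"), target)
    else:
        # http(s) URLs keep using the existing requests-based path
        import requests

        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(target, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)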
