
Load ICON data from HF #66

Open
peterdudfield opened this issue Sep 4, 2024 · 17 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Sep 4, 2024

Detailed Description

It would be great to be able to load the ICON data from HF in our nwp open dataset

Context

Possible Implementation

  • Try to find a method that can lazily load the ICON data from HF in load nwp
@peterdudfield peterdudfield changed the title Load ICon data from HF Load ICON data from HF Sep 4, 2024
@gabrielelibardi

@peterdudfield I'd like to have a look at this, could you give me permission to create a branch?

@peterdudfield
Contributor Author

Hi @gabrielelibardi , we actually paused development on this in ocf_datapipes. But this could be done in ocf-data-sampler. Would you like to try it there?

@gabrielelibardi

I managed to run the pvnet_datapipe on some of the icon-eu Hugging Face data that I downloaded locally (one day's worth of data). This needs some changes in the code, but I think the problem you refer to in this issue is different: it seems to be independent of the postprocessing done in ocf_datapipes and is just about this line: ds = xr.open_mfdataset(zarr_paths, engine="zarr", combine="nested", concat_dim="time"). If zarr_paths contains the paths to every .zarr.zip file on Hugging Face, then ds is never initialized because it just takes too long. I presume this is because of all the metadata that needs to be downloaded from each .zarr.zip file. Once ds is initialized, the data will be loaded lazily as you create the batches. Maybe caching the metadata locally could speed things up. Do I understand this correctly? @peterdudfield
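For reference, a minimal sketch of the pattern I mean (the chained fsspec URLs are illustrative; the real directory layout under openclimatefix/dwd-icon-eu may differ):

```python
import xarray as xr

# Illustrative chained fsspec URLs into the HF dataset; the real layout
# of the openclimatefix/dwd-icon-eu repo may differ.
zarr_paths = [
    "zip:///::https://huggingface.co/datasets/openclimatefix/dwd-icon-eu"
    f"/resolve/main/data/2024/1/{day}/2024010{day}_00.zarr.zip"
    for day in (1, 2)
]

# Each URL is read right to left: fetch the zip over HTTPS, then expose its
# contents as a zarr store. Every store still triggers its own metadata
# requests, which is where the time goes once there are hundreds of files.
ds = xr.open_mfdataset(
    zarr_paths,
    engine="zarr",
    combine="nested",
    concat_dim="time",
    # parallel=True (with dask installed) may help by opening files concurrently.
)
print(ds)
```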

@peterdudfield
Contributor Author

Thanks, great that you managed it with one day.

Hmmm, interesting. The metadata is normally quite small.

Does it scale linearly with the number of files you provide to xr.open_mfdataset? For example, can you load 2 data files from HF quickly?

I have seen before that if the files have different shapes, xr.open_mfdataset can take a long time, as it tries to sort these out.

@gabrielelibardi

It definitely gets slower with more files. Here is a flamegraph SVG of the profiling: profile. I am getting a year's worth of Hugging Face .zarr.zip files and trying to make an xarray Dataset from the first 20. Most of the time is spent in requests to the HF server. If I try to make an xarray Dataset with too many paths, I eventually hit the HF server's rate limit: huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/openclimatefix/dwd-icon-eu.

@peterdudfield peterdudfield transferred this issue from openclimatefix/ocf_datapipes Oct 14, 2024
@peterdudfield
Contributor Author

Thanks for doing this, and nice to see the profile. What are the current timings for making an xarray Dataset from the first 20?
Are you totally sure it's only getting the metadata, not downloading all the data?

Do you know roughly how many paths it takes until you get a 429?

@gabrielelibardi

It takes 58 secs to create an xarray Dataset from 20 files. Pretty sure it is not downloading the data (that would be more than 60 GB in 60 secs, a bit too crazy for my laptop).
I think I was trying 50 or so and was already getting a 429. Maybe it is because the .zmetadata files are so small that there are a lot of separate requests in quick succession; if I download a whole .zarr.zip file it does not complain.

@gabrielelibardi

For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?

@peterdudfield
Contributor Author

For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?

For training on ECMWF, we tend to join our daily ECMWF files into yearly (or monthly) files and then open the multiple zarrs together. We also try to keep the data close to where the model is running, so either locally if training locally, or in the cloud if we are training in the cloud. It would be nice to stream from HF though, so we don't have to reorganise the files, and the live data is available.
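As a rough illustration (the paths and the concat dimension name are made up, not our exact setup), joining daily files into a monthly file looks something like:

```python
import glob

import xarray as xr

# Hypothetical local layout of daily ECMWF zarrs for one month.
daily_paths = sorted(glob.glob("/data/ecmwf/daily/2024-01-*.zarr"))

# Open all the daily files lazily and concatenate them along the init-time
# dimension (called "init_time" here; use whatever your files actually have).
ds = xr.open_mfdataset(
    daily_paths, engine="zarr", combine="nested", concat_dim="init_time"
)

# Write a single monthly zarr, so training only has to open one store.
ds.to_zarr("/data/ecmwf/monthly/2024-01.zarr", mode="w")
```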

Thanks for these benchmark figures. Yeah, I agree it's just downloading metadata, not the whole thing.

Feels like there should be some caching we can do, so that, for example, we load the metadata for each month once and can then quickly load any new file in the future. I really don't know the solution for this, sorry.

@devsjc @Sukh-P @AUdaltsova might have some ideas?

@jacobbieker
Member

You should be able to do something like kerchunk or virtualizarr to save the metadata into one file and open that. That way you wouldn't need all the requests for the metadata. Getting the data is still then limited by HF request limits, but is at least then less of an issue.
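A rough, untested sketch of the kerchunk route, assuming kerchunk can scan through the zip-over-HTTPS chaining used above (virtualizarr would be the more modern alternative; the URLs are still illustrative):

```python
import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.zarr import single_zarr

# Same illustrative chained URLs as in the earlier sketch.
zarr_paths = [
    "zip:///::https://huggingface.co/datasets/openclimatefix/dwd-icon-eu"
    f"/resolve/main/data/2024/1/{day}/2024010{day}_00.zarr.zip"
    for day in (1, 2)
]

# One reference dict per store -- this is the slow, one-off metadata scan,
# which you would run once and save (e.g. as JSON) rather than on every open.
refs = [single_zarr(path) for path in zarr_paths]

# Combine the references along time into a single virtual dataset.
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# Open the combined references; only the chunks you touch are fetched from HF.
fs = fsspec.filesystem("reference", fo=combined)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
print(ds)
```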

@gabrielelibardi

Thanks a lot @jacobbieker @peterdudfield! I haven't tried it yet, but it looks promising. For now I have put 1 month of data on cloud storage (something like S3). This solves the problem with the 429 error. I tried to run script/save_batches.py though and it is impractically slow (1 sample every 3 secs or so). It is not a problem for me to self-host the ICON dataset (or parts of it), but I would like to keep large amounts of data off the training instance, as we would spin it up and down. I would be curious to know, in your case for training PVNet, how long you needed to create the batches for the training data, and which solutions worked best for you when training in the cloud.

@peterdudfield
Contributor Author

Thanks @gabrielelibardi

Is this slow running speed when you load from S3? Where are you running the code from, an EC2 instance?

We tend to get a speed of more like 1 batch every second or so (I would have to look up the batch size), so it can still take a day or so to make batches. We get the fastest results by having the data very near, i.e. locally or on a disk attached to a VM. This is more setup, but faster. Using multiprocessing helps too; I think that's in the batch-making script already.

@gabrielelibardi

Thank you @peterdudfield. It is a colocated S3 bucket and I am running from a VPS, but these are not AWS services, so there might be some differences in terms of bandwidth. In general, though, it is faster than streaming from HF, which makes the HF streaming quite useless in my opinion. Maybe you have tried this before with a more powerful instance and it was faster? Still, I think implementing this is beneficial, as it should be the same code whether you stream from local or remote .zarr.zip files. I can implement this in the ocf-data-sampler repo; this one worked pretty well for me though, what is the major difference?

@peterdudfield
Contributor Author

peterdudfield commented Oct 21, 2024

Thanks.

Yeah, we first used HF to save the ICON data, as there is no rolling archive at the moment. You're totally right though, we should consider a further setup that makes it useful for other people to use.

We are migrating from ocf_datapipes to ocf-data-sampler, essentially to simplify the code and remove torch.datapipes, which doesn't fit our use case anymore.

It would be greatly appreciated if you could put the code for ICON into this repo.

@gabrielelibardi

@peterdudfield, can you give me permission to push a feature branch to ocf-data-sampler, please?

@peterdudfield
Contributor Author

Yeah, I can definitely open access. What we generally do is let OS contributors:

  1. clone the repo
  2. make their changes
  3. open a PR back to OCF's repo

Does that work for you?

@gabrielelibardi

Sure!
