How to fastly extract the dataset #20

Open
Raion-Shin opened this issue Sep 2, 2024 · 2 comments
Comments

@Raion-Shin

I downloaded the .tar.gz file from https://huggingface.co/datasets/TIGER-Lab/M-BEIR, but it is very large, and pv estimates that extracting it will take 2.5 days!
Could you provide smaller archives, with each dataset packaged in its own zip file? Thanks very much!

@nrdyava

nrdyava commented Sep 30, 2024

After downloading the .tar.gz part files, use the following command to combine them into a single file:
sh -c 'cat mbeir_images.tar.gz.part-00 mbeir_images.tar.gz.part-01 mbeir_images.tar.gz.part-02 mbeir_images.tar.gz.part-03 > mbeir_images.tar.gz'

Next, extract the images from the combined file:
tar -xzf mbeir_images.tar.gz

It will not take 2.5 days. I was able to complete the whole process in just 10 hours.
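
A possible variation on the two steps above, offered only as a sketch: the parts can be piped straight into tar, which avoids writing and then re-reading the combined mbeir_images.tar.gz on disk. The helper name extract_parts and the destination directory are assumptions for illustration, not part of the M-BEIR instructions, and how much time this saves will depend on the disk and CPU.

```shell
#!/bin/sh
# Sketch: stream split archive parts directly into tar, skipping the
# intermediate combined .tar.gz file.
# extract_parts is a hypothetical helper name, not something M-BEIR ships.
extract_parts() {
    prefix=$1   # common prefix of the parts, e.g. mbeir_images.tar.gz.part-
    dest=$2     # existing directory to extract into
    # The shell expands the glob in sorted order, so part-00, part-01, ...
    # are concatenated in the correct sequence before tar reads them.
    cat "$prefix"* | tar -xzf - -C "$dest"
}

# Example call (assumes the four parts sit in the current directory):
# extract_parts mbeir_images.tar.gz.part- .
```

The stream is still decompressed by a single gzip pass, so this mainly saves the extra disk I/O and the temporary space of the combined file.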

@Raion-Shin
Author

Thanks, but I'm extracting it on a 2-core CPU, so it takes a long time. It would be better if the dataset were split into many smaller zip files.
