# [RFC] Design proposal for model_download custom download path #179

## Comments
Makes sense to me and saves us from passing the output_dir to each method called.
I would follow the established pattern: torchtune integrates with HF, so I would follow what they do.
We should bubble up permission errors (read-only directory, non-existent folder) to the caller so they can troubleshoot (create the missing dir, chmod, etc.).
If the user passes output_dir, it should not read from the default cache. We should perform a fresh download.
I like output_dir. If we ever decide to add support for the Kaggle / Colab cache (we would mount the cached file at the output_dir location), then "output_dir" makes more sense than "download_path".
Please merge this, it's so frustrating to move datasets again and again.
Can you share more about the use case that requires you to move the datasets? Why doesn't the default cache location work for you? Thank you.
@DivyanshuSinghania refer to #175 (comment) for a temporary solution 😄
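For anyone blocked on this today, a common interim workaround (my assumption; the linked #175 comment may describe a different approach, and the handle and target path below are placeholders) is to download into the default cache and then copy the files where you need them:

```python
import shutil

import kagglehub

# Download (or reuse) the model in kagglehub's default cache and get its local path.
cached_path = kagglehub.model_download("google/gemma/pyTorch/2b")  # placeholder handle

# Copy the cached files to the directory you actually want to work from.
shutil.copytree(cached_path, "/data/models/gemma-2b", dirs_exist_ok=True)  # placeholder target
```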
## Proposal: Adding `output_dir` to `ModelHttpResolver` for Direct Downloading

This proposal outlines a plan to add an `output_dir` argument to the `__call__` method of the `ModelHttpResolver` class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.
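For illustration, the user-facing call could look like the sketch below; the `output_dir` argument is what this proposal adds, not part of the current API, and the handle and directory are placeholders:

```python
import kagglehub

# Proposed: download straight into a caller-chosen directory with a flat layout,
# bypassing the versioned default cache.
model_dir = kagglehub.model_download(
    "google/gemma/pyTorch/2b",        # placeholder model handle
    output_dir="/data/models/gemma",  # proposed new argument
)
```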
**Motivation:**

This feature is primarily driven by user feedback (#175) and the need for better integration between `kagglehub` and other libraries, specifically `torchtune` (see pytorch/torchtune#1852).
**Proposed Changes:**

1. Introduce a `Cache` class: this class will encapsulate the caching logic from `cache.py`, providing a cleaner interface and enabling the flexible directory structure based on the `override_dir` argument (see the sketch after this list).
2. Modify `ModelHttpResolver.__call__`: add the `output_dir` argument and instantiate the `Cache` class based on its value.
3. Pass `output_dir` through `model_download`.
4. Handle `output_dir` in other resolvers: add warning logs to the other resolvers to inform users that the `output_dir` argument is not supported and will be ignored.
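A minimal sketch of how items 1 and 2 could fit together; the class shape, method names, and signatures are illustrative assumptions rather than the existing `kagglehub` internals:

```python
import os
from typing import Optional


class Cache:
    """Encapsulates where downloaded files live.

    With no override this would defer to the existing logic in cache.py
    (versioned paths under the default cache root); with an override it
    targets a flat layout inside the caller-supplied directory.
    """

    def __init__(self, override_dir: Optional[str] = None) -> None:
        self.override_dir = override_dir

    def path_for(self, handle: str) -> str:
        if self.override_dir is not None:
            # Flat structure: files go straight into the override directory.
            return self.override_dir
        # Placeholder for the current cache.py behaviour (cache root + handle).
        return os.path.join(os.path.expanduser("~/.cache/kagglehub"), "models", handle)


class ModelHttpResolver:
    def __call__(
        self,
        handle: str,
        path: Optional[str] = None,
        *,
        force_download: bool = False,
        output_dir: Optional[str] = None,
    ) -> str:
        cache = Cache(override_dir=output_dir)
        out_path = cache.path_for(handle)
        # ... perform the HTTP download into out_path, skipping the cache
        # lookup entirely when output_dir was provided ...
        return out_path
```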
**Advantages:**

- The `Cache` class discourages use of global variables, promoting better maintainability and extensibility.
**Request for Feedback:**

This proposal outlines one possible implementation. Any feedback on alternative implementations is appreciated. Some other passing thoughts include:

- `Cache` class design: is this the best way to encapsulate the caching logic? Are there alternative approaches?
- How should errors related to `output_dir` (e.g., permission issues) be handled? (One option is sketched after this list.)
- When `output_dir` is specified, should we completely bypass the cache, or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?
- `output_dir` aligns with `torchtune`, but `download_path` was suggested in Feature Request: Add Support for Custom Path in kagglehub.model_download #175. This may or may not make the intention clearer.
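On the error-handling question above, one option (an assumption, not a settled design, using a hypothetical helper) is to validate `output_dir` up front and raise standard OS errors that the caller can act on:

```python
import os


def validate_output_dir(output_dir: str) -> None:
    """Hypothetical helper: fail fast with actionable errors before any download starts."""
    if not os.path.isdir(output_dir):
        raise FileNotFoundError(
            f"output_dir '{output_dir}' does not exist; create it before downloading."
        )
    if not os.access(output_dir, os.W_OK):
        raise PermissionError(
            f"output_dir '{output_dir}' is not writable; adjust its permissions (e.g. chmod)."
        )
```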