
[RFC] Design proposal for model_download custom download path #179

Open
KeijiBranshi opened this issue Nov 6, 2024 · 4 comments
@KeijiBranshi (Member)

Proposal: Adding output_dir to ModelHttpResolver for Direct Downloading

This proposal outlines a plan to add an output_dir argument to the __call__ method of the ModelHttpResolver class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.

Motivation:

This feature is primarily driven by user feedback (#175) and by the need for better integration between kagglehub and other libraries, specifically torchtune (see pytorch/torchtune#1852).

NOTE: torchtune currently uses huggingface_hub for model downloading, leveraging its snapshot_download function and its local_dir argument for specifying custom download locations.

Proposed Changes:

  1. Introduce a Cache class:

    This class will encapsulate the caching logic from cache.py, providing a cleaner interface and enabling a flexible directory structure based on the override_dir argument.

    import os
    from typing import Optional

    # get_cached_path, get_cached_archive_path, and _get_completion_marker_filepath
    # are the existing helpers in cache.py.
    class Cache:
        def __init__(self, override_dir: Optional[str] = None):
            self.override_dir = override_dir
    
        def get_path(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
            if self.override_dir:
                return os.path.join(self.override_dir, path or "")  # Flat structure
            else:
                return get_cached_path(handle, path)  # Existing nested structure
    
        def get_archive_path(self, handle: ResourceHandle) -> str:
            if self.override_dir:
                return os.path.join(self.override_dir, f"{handle.version!s}.archive")
            else:
                return get_cached_archive_path(handle)
    
        def get_completion_marker_filepath(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
            if self.override_dir:
                return os.path.join(self.override_dir, f"{path or handle.version!s}.complete")
            else:
                return _get_completion_marker_filepath(handle, path)
    
        def load_from_cache(self, handle: ResourceHandle, path: Optional[str] = None) -> Optional[str]:
            """Return path for the requested resource from the cache or output_dir."""
            marker_path = self.get_completion_marker_filepath(handle, path)
            full_path = self.get_path(handle, path)
            return full_path if os.path.exists(marker_path) and os.path.exists(full_path) else None
    
        # Migrate other relevant methods from cache.py (e.g., mark_as_complete,
        # delete_from_cache) to the Cache class, adapting them similarly.
  2. Modify ModelHttpResolver.__call__:

    Add the output_dir argument and instantiate the Cache class based on its value.

    def __call__(
        self, h: ModelHandle, path: Optional[str] = None, *, force_download: Optional[bool] = False, output_dir: Optional[str] = None
    ) -> str:
        cache = Cache(override_dir=output_dir)
        # ...use cache.get_path(), cache.get_archive_path(), cache.load_from_cache(), and cache.get_completion_marker_filepath()...
  3. Pass output_dir through model_download:

    def model_download(
        # ...existing arguments
        output_dir: Optional[str] = None,
    ):
        # Existing code...
        return registry.model_resolver(h, path, force_download=force_download, output_dir=output_dir)
  4. Handle output_dir in other Resolvers:

    Add warning logs to other resolvers to inform users that the output_dir argument is not supported and will be ignored.
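Taken together, the pieces above could be exercised as follows. This is a minimal, self-contained sketch: the `ResourceHandle` dataclass, the nested-path layout inside `get_cached_path`, and the example handle values are simplified stand-ins for the real cache.py implementations, not the actual kagglehub API.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceHandle:
    # Simplified stand-in for kagglehub's handle types
    owner: str
    model: str
    version: int

def get_cached_path(handle: ResourceHandle, path: Optional[str] = None) -> str:
    # Stand-in for the existing nested cache layout in cache.py
    base = os.path.join(
        os.path.expanduser("~"), ".cache", "kagglehub",
        handle.owner, handle.model, str(handle.version),
    )
    return os.path.join(base, path) if path else base

class Cache:
    def __init__(self, override_dir: Optional[str] = None):
        self.override_dir = override_dir

    def get_path(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
        if self.override_dir:
            return os.path.join(self.override_dir, path or "")  # flat structure
        return get_cached_path(handle, path)  # existing nested structure

handle = ResourceHandle("keras", "bert", 2)

# Default behaviour: deeply nested cache path
nested = Cache().get_path(handle, "model.weights")

# With an override: files land directly in the chosen directory
flat = Cache(override_dir="/tmp/models").get_path(handle, "model.weights")
print(nested)
print(flat)
```

The resolver would then only ever talk to the `Cache` instance, so the flat-vs-nested decision lives in exactly one place.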

Advantages:

  • Flexibility: Users can choose between using the cache or downloading directly to a specific directory. This also allows users to bypass the deeply nested subdirectory structure in the default cache.
  • Improved Code Organization: Encapsulating caching logic within the Cache class discourages use of global variables, promoting better maintainability and extensibility.

Request for Feedback:

This proposal outlines one possible implementation; feedback on alternatives is appreciated. Some other open questions include:

  • Cache class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?
  • Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?
  • Error handling: How should errors related to output_dir (e.g., permissions issues) be handled?
  • Interaction with existing cache: If output_dir is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?
  • Naming: I chose output_dir to align with torchtune. But download_path was suggested in Feature Request: Add Support for Custom Path in kagglehub.model_download #175. This may or may not make the intention clearer.
@rosbo (Contributor) commented Nov 7, 2024

> Cache class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?

Makes sense to me and saves us from passing the output_dir to each method called.

> Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?

I would follow the established pattern: torchtune integrates with HF, so I would follow what they do.

> Error handling: How should errors related to output_dir (e.g., permissions issues) be handled?

We should bubble up permission errors (readonly directory, non-existing folder) to the caller so they can troubleshoot (create missing dir, chmod, etc.).
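One way to make that concrete is a small pre-flight check before any download work starts. The `ensure_output_dir` helper below is hypothetical (not part of kagglehub); it simply validates the directory and raises standard OS errors, and any further `PermissionError` raised during the actual file writes would propagate to the caller unchanged.

```python
import os

def ensure_output_dir(output_dir: str) -> str:
    # Hypothetical pre-flight check: fail fast with standard OS errors
    # instead of wrapping them, so the caller can troubleshoot directly.
    if not os.path.isdir(output_dir):
        raise NotADirectoryError(f"output_dir does not exist: {output_dir}")
    if not os.access(output_dir, os.W_OK):
        raise PermissionError(f"output_dir is not writable: {output_dir}")
    return output_dir
```

The caller then sees a plain `NotADirectoryError` or `PermissionError` and can create the missing directory or chmod it before retrying.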

> Interaction with existing cache: If output_dir is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?

If the user passes output_dir, it should not read from the default cache. We should perform a fresh download.

> Naming: I chose output_dir to align with torchtune. But download_path was suggested in #175. This may or may not make the intention clearer.

I like output_dir: if we ever decide to support it for the Kaggle / Colab cache (we would mount the cached file at the output_dir location), then output_dir makes more sense than download_path.

@DivyanshuSinghania

Please merge this; it's so frustrating to move datasets again and again.

@rosbo (Contributor) commented Nov 19, 2024

Hi @DivyanshuSinghania,

Can you share more about the use case that requires you to move the datasets? Why doesn't the default cache location work for you?

Thank you

@toiletbowlkuku

@DivyanshuSinghania refer to #175 (comment) for a temporary solution 😄
