
[RFC] Design proposal for model_download custom download path #179

Open
KeijiBranshi opened this issue Nov 6, 2024 · 4 comments
@KeijiBranshi (Member)

Proposal: Adding output_dir to ModelHttpResolver for Direct Downloading

This proposal outlines a plan to add an output_dir argument to the __call__ method of the ModelHttpResolver class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.

Motivation:

This feature is primarily driven by user feedback (#175) and by the need for better integration between kagglehub and other libraries, specifically torchtune (see pytorch/torchtune#1852).

NOTE: torchtune currently uses huggingface_hub for model downloading, leveraging its snapshot_download function and its local_dir argument for specifying custom download locations.

Proposed Changes:

  1. Introduce a Cache class:

    This class will encapsulate the caching logic from cache.py, providing a cleaner interface and enabling a flexible directory structure based on the override_dir argument.

    import os
    from typing import Optional

    # get_cached_path, get_cached_archive_path, and _get_completion_marker_filepath
    # are the existing helpers in cache.py.
    class Cache:
        def __init__(self, override_dir: Optional[str] = None):
            self.override_dir = override_dir
    
        def get_path(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
            if self.override_dir:
                return os.path.join(self.override_dir, path or "")  # Flat structure
            else:
                return get_cached_path(handle, path)  # Existing nested structure
    
        def get_archive_path(self, handle: ResourceHandle) -> str:
            if self.override_dir:
                return os.path.join(self.override_dir, f"{handle.version!s}.archive")
            else:
                return get_cached_archive_path(handle)
    
        def get_completion_marker_filepath(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
            if self.override_dir:
                return os.path.join(self.override_dir, f"{path or handle.version!s}.complete")
            else:
                return _get_completion_marker_filepath(handle, path)
    
        def load_from_cache(self, handle: ResourceHandle, path: Optional[str] = None) -> Optional[str]:
            """Return path for the requested resource from the cache or output_dir."""
            marker_path = self.get_completion_marker_filepath(handle, path)
            full_path = self.get_path(handle, path)
            return full_path if os.path.exists(marker_path) and os.path.exists(full_path) else None
    
        # Migrate other relevant methods from cache.py (e.g., mark_as_complete,
        # delete_from_cache) to the Cache class, adapting them similarly.
  2. Modify ModelHttpResolver.__call__:

    Add the output_dir argument and instantiate the Cache class based on its value.

    def __call__(
        self, h: ModelHandle, path: Optional[str] = None, *, force_download: Optional[bool] = False, output_dir: Optional[str] = None
    ) -> str:
        cache = Cache(override_dir=output_dir)
        # ...use cache.get_path(), cache.get_archive_path(), cache.load_from_cache(), and cache.get_completion_marker_filepath()...
  3. Pass output_dir through model_download:

    def model_download(
        # ...existing arguments
        output_dir: Optional[str] = None,
    ):
        # Existing code...
        return registry.model_resolver(h, path, force_download=force_download, output_dir=output_dir)
  4. Handle output_dir in other Resolvers:

    Add warning logs to other resolvers to inform users that the output_dir argument is not supported and will be ignored.
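Taken together, the pieces above could be exercised as follows. This is a minimal, self-contained sketch: the `ResourceHandle` dataclass, the nested-path layout inside `get_cached_path`, and the example handle values are simplified stand-ins for the real cache.py implementations, not the actual kagglehub API.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceHandle:
    # Simplified stand-in for kagglehub's handle types
    owner: str
    model: str
    version: int

def get_cached_path(handle: ResourceHandle, path: Optional[str] = None) -> str:
    # Stand-in for the existing nested cache layout in cache.py
    base = os.path.join(
        os.path.expanduser("~"), ".cache", "kagglehub",
        handle.owner, handle.model, str(handle.version),
    )
    return os.path.join(base, path) if path else base

class Cache:
    def __init__(self, override_dir: Optional[str] = None):
        self.override_dir = override_dir

    def get_path(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
        if self.override_dir:
            return os.path.join(self.override_dir, path or "")  # flat structure
        return get_cached_path(handle, path)  # existing nested structure

handle = ResourceHandle("keras", "bert", 2)

# Default behaviour: deeply nested cache path
nested = Cache().get_path(handle, "model.weights")

# With an override: files land directly in the chosen directory
flat = Cache(override_dir="/tmp/models").get_path(handle, "model.weights")
print(nested)
print(flat)
```

The resolver would then only ever talk to the `Cache` instance, so the flat-vs-nested decision lives in exactly one place.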

Advantages:

  • Flexibility: Users can choose between using the cache or downloading directly to a specific directory. This also allows users to bypass the deeply nested subdirectory structure in the default cache.
  • Improved Code Organization: Encapsulating caching logic within the Cache class discourages use of global variables, promoting better maintainability and extensibility.

Request for Feedback:

This proposal outlines one possible implementation; feedback on alternatives is appreciated. Some other open questions include:

  • Cache class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?
  • Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?
  • Error handling: How should errors related to output_dir (e.g., permissions issues) be handled?
  • Interaction with existing cache: If output_dir is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?
  • Naming: I chose output_dir to align with torchtune. But download_path was suggested in Feature Request: Add Support for Custom Path in kagglehub.model_download #175. This may or may not make the intention clearer.
@rosbo (Contributor) commented Nov 7, 2024

> Cache class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?

Makes sense to me and saves us from passing the output_dir to each method called.

> Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?

I would follow the established pattern: torchtune integrates with HF, so I would follow what they do.

> Error handling: How should errors related to output_dir (e.g., permissions issues) be handled?

We should bubble up permission errors (readonly directory, non-existing folder) to the caller so they can troubleshoot (create missing dir, chmod, etc.).
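One way to make that concrete is a small pre-flight check before any download work starts. The `ensure_output_dir` helper below is hypothetical (not part of kagglehub); it simply validates the directory and raises standard OS errors, and any further `PermissionError` raised during the actual file writes would propagate to the caller unchanged.

```python
import os

def ensure_output_dir(output_dir: str) -> str:
    # Hypothetical pre-flight check: fail fast with standard OS errors
    # instead of wrapping them, so the caller can troubleshoot directly.
    if not os.path.isdir(output_dir):
        raise NotADirectoryError(f"output_dir does not exist: {output_dir}")
    if not os.access(output_dir, os.W_OK):
        raise PermissionError(f"output_dir is not writable: {output_dir}")
    return output_dir
```

The caller then sees a plain `NotADirectoryError` or `PermissionError` and can create the missing directory or chmod it before retrying.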

> Interaction with existing cache: If output_dir is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?

If the user passes output_dir, it should not read from the default cache. We should perform a fresh download.

> Naming: I chose output_dir to align with torchtune. But download_path was suggested in #175. This may or may not make the intention clearer.

I like output_dir: if we ever decide to support it for the Kaggle / Colab cache (we would mount the cached file at the output_dir location), then output_dir makes more sense than download_path.

@DivyanshuSinghania

Please merge this; it's so frustrating to move datasets again and again.

@rosbo (Contributor) commented Nov 19, 2024

Hi @DivyanshuSinghania,

Can you share more about the use case that requires you to move the datasets? Why doesn't the default cache location work for you?

Thank you

@toiletbowlkuku

@DivyanshuSinghania refer to #175 (comment) for a temporary solution 😄
