Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cli): allow users to download models from Kaggle #2002

Merged

Conversation

KeijiBranshi
Copy link
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

#1852

Changelog

What are the changes made in this PR?

  • 5527fd8
    • Added 3 new command line arguments: --source, --kaggle-username, and --kaggle-api-key
    • Splits download logic based on --source value (only valid options are huggingface and kaggle). When --source is not provided, defaults to huggingface
    • Kaggle download logic placed in helper function _download_from_kaggle()
    • The new logic mostly mirrors _download_from_huggingface() with a few notable exceptions:
      • Authentication with kagglehub requires the caller's username and api token as separate arguments to kagglehub.set_kaggle_credentials() (these can also be set as environment variables). Since both are required, appropriate error messages are raised when one is omitted. However, since Kaggle supports anonymous downloads on public non-gated models, the option to omit both is also supported.
      • The exceptions thrown by kagglehub.model_download are different than those raised by huggingface_hub.snapshot_download.
      • kagglehub.model_download doesn't support direct download to a caller-specified directory via --output-dir. This feature is being worked on in kagglehub (see [RFC] Design proposal for model_download custom download path Kaggle/kagglehub#179), but in the meantime a warning was just placed here instead.
      • some extra validation is done on Kaggle model handles (see _validate_kaggle_model_handle) as per the requests made here. This validation is primarily used to raise warnings when downloads may not be guaranteed to work with torchtune.
  • a3776d6
    • tests were added to validate the CLI behavior when --source kaggle is provided
  • b30f741
    • updated documentation

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

Copy link

pytorch-bot bot commented Nov 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2002

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit 00493ef with merge base d9b6c79 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 13, 2024
@KeijiBranshi KeijiBranshi changed the title Keijibranshi/feat/download from kaggle feat(cli): allow users to download models from Kaggle Nov 13, 2024
@joecummings
Copy link
Contributor

Awesome stuff!! Will be reviewing this today or tomorrow :)

Copy link
Contributor

@joecummings joecummings left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments/questions mostly about authentication, but overall this looks great!

docs/source/tune_cli.rst Show resolved Hide resolved
torchtune/_cli/download.py Outdated Show resolved Hide resolved
torchtune/_cli/download.py Show resolved Hide resolved
docs/source/tune_cli.rst Show resolved Hide resolved
Copy link
Contributor

@joecummings joecummings left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will wait for CI, but LGTM!

Thanks @KeijiBranshi for your hard work on this and making it easier for torchtune users to download models from the Kaggle Hub!

@KeijiBranshi
Copy link
Contributor Author

Will wait for CI, but LGTM!

Thanks @KeijiBranshi for your hard work on this and making it easier for torchtune users to download models from the Kaggle Hub!

Thanks so much for your help and review! Really appreciated your guidance on everything.

Looks like some of my test are failing. Could've sworn I ran before pushing, but i guess it slipped through. Resolving those now 🙂

@KeijiBranshi
Copy link
Contributor Author

@joecummings success!

Thanks again for all your help. Let me know if anything else needs to be done for the release process 🚀

@KeijiBranshi KeijiBranshi force-pushed the keijibranshi/feat/download-from-kaggle branch from 87d34bb to 00493ef Compare November 20, 2024 22:45
@KeijiBranshi
Copy link
Contributor Author

@joecummings quick fyi, I just rebased to the most recent main branch commit.

Let us know if there's anything else we should do to get this merged 🙂

@joecummings
Copy link
Contributor

joecummings commented Nov 21, 2024

@joecummings quick fyi, I just rebased to the most recent main branch commit.

Let us know if there's anything else we should do to get this merged 🙂

Thanks for the rebase. Will merge this today!! Afterwards, I'll kick off a "nightly" build so users can have immediate access to this feature even before tomorrow's build of torchtune.

@joecummings joecummings linked an issue Nov 21, 2024 that may be closed by this pull request
@joecummings joecummings merged commit e9fd56a into pytorch:main Nov 21, 2024
17 checks passed
@KeijiBranshi
Copy link
Contributor Author

@joecummings quick fyi, I just rebased to the most recent main branch commit.
Let us know if there's anything else we should do to get this merged 🙂

Thanks for the rebase. Will merge this today!! Afterwards, I'll kick off a "nightly" build so users can have immediate access to this feature even before tomorrow's build of torchtune.

Thank you so much!! Looking forward to it.

Please ping me if things break or users see issues 🙂. If I'm unresponsive for any reason, you can also email [email protected].

@joecummings joecummings mentioned this pull request Nov 26, 2024
44 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Kaggle Model Hub Integration for Torchtune
4 participants