-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Do not merge) MVP support for Kaggle Packages #196
base: main
Are you sure you want to change the base?
Conversation
src/kagglehub/packages.py
Outdated
apply(module) | ||
|
||
|
||
logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you move this up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks
src/kagglehub/packages.py
Outdated
if getattr(member, "__module__", None) != module.__name__: | ||
continue | ||
# These denylisted entries cause infinite loops | ||
elif inspect.isclass(member) and name not in ["__base__", "__class__"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the worst case depth of this? Can we use an explicit stack here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only recurses on classes defined within the user's module, so you'd have to have a ton of nested classes to hit the limit, but fair point to be defensive and it removes one of the helper functions here, fixed.
src/kagglehub/packages.py
Outdated
self.path: pathlib.Path = package_path | ||
self.datasources: VersionedDatasources = read_requirements(package_path / "kagglehub_requirements.yaml") | ||
|
||
self._token_stack = deque() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks!
Otherwise, assumes we're in an interactive Kaggle Notebook where package | ||
data should be written to the /kaggle/working dir.""" | ||
scope = PackageScope.get() | ||
package_path = scope.path if scope else pathlib.Path("/kaggle/working/package") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we are on colab or offline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any time a Package is imported -- whether via kagglehub.package_import
or directly, whether on Colab or offline -- the package's code should be applying its own PackageScope
(through logic in the generated __init__.py
file, with the helpers above in this file) so we assume it's only unset when you're defining the Package, currently intended only within a Kaggle Notebook, though we'll probably want to add more flexibility in the future.
Added a bit more docstring on this point, thanks.
@dster2 - When ready, please add unit test for all the source files as well. |
31b4c18
to
0d1a162
Compare
Thanks Nesh! Yes will work on unit tests, just trying to jump-start review on the core patterns + impl |
13c0887
to
ca9cfb8
Compare
Support tracking accessed datasources and writing / reading them to a requirements.yaml file. Support a "Package Scope" using that file which applies those datasource versions at runtime if the user requests a datasource without an explicit version. Support importing a Package. This uses a handle equivalent to a Notebook handle, and like Notebooks is currently limited to the latest version, with version support coming soon. Support Package Asset files whose path honors the current Package Scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general direction makes sense. I added a few suggestion to make the code cleaner and less error-prone.
@@ -34,6 +34,11 @@ class ModelHandle(ResourceHandle): | |||
def is_versioned(self) -> bool: | |||
return self.version is not None and self.version > 0 | |||
|
|||
def with_version(self, version: int): # noqa: ANN201 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of using noqa
to ignore the warning, can you simply define the return type?
def with_version(self, version: int) -> ModelHandle:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not, you can't refer to your defining class inside yourself :\ there is typing.Self
for this purpose, but it didn't work with Python 3.9 I think. There was another workaround that was also a bit messy so I thought to just denylist it... can take another pass though.
@@ -57,6 +62,9 @@ class DatasetHandle(ResourceHandle): | |||
def is_versioned(self) -> bool: | |||
return self.version is not None and self.version > 0 | |||
|
|||
def with_version(self, version: int): # noqa: ANN201 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, please specify the DatasetHandle
return type.
@@ -15,15 +15,15 @@ | |||
NUM_UNVERSIONED_NOTEBOOK_PARTS = 2 # e.g.: <owner>/<notebook> | |||
|
|||
|
|||
@dataclass | |||
@dataclass(frozen=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was frozen=True
suggested automatically by the linter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is required to make it a hashable type to use as key in the VersionedDatasources
dictionary.
@@ -45,6 +47,7 @@ def __call__( | |||
|
|||
if not api_client.has_credentials(): | |||
if cached_path: | |||
register_accessed_datasource(h, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cleaner to use a Python decorator: https://realpython.com/primer-on-python-decorators/#adding-syntactic-sugar
This way, you could simply annotate the call method of each resolver.
For the unversioned handle case, we mutate the handle to include the version so you could just pick that up when running logic post running the decorator wrapped method.
I wrote a quick POC: https://admin.kaggle.com/code/rosebv/register-access-decorator?scriptVersionId=213907476 (shared with Kaggle admins :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is great, thanks! Definitely cleans up the cruft where I was calling this before each return
statement. However it's not quite so easy, at least with the current setup... since I made the handles immutable (frozen=True
) I think the decorator cannot see a modified handle from its perspective of just seeing the args. Will play around more to see if we can make it work though.
@@ -278,6 +294,11 @@ def _extract_archive(archive_path: str, out_path: str) -> None: | |||
|
|||
|
|||
def _get_current_version(api_client: KaggleApiV1Client, h: ResourceHandle) -> int: | |||
# Check if there's a Package in scope which has stored a version number used when it was created. | |||
version_from_package = get_package_datasource_version_number(h) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think version_from_package_scope
would be clearer.
@@ -83,7 +84,9 @@ def __call__( | |||
f"You can acces the other files othe attached competition at '{cached_path}'" | |||
) | |||
raise ValueError(msg) | |||
register_accessed_datasource(h, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use a similar decorator pattern.
requirements.yaml
file.__init__.py
uses these helpers to automatically apply the package scope to all methods defined in the package.Not sure who should review this, so casting a wide net of potentially interested parties 😅 Please include others who might be interested
Current plan is to not merge into
main
quite yet, we'll retain a workaround to use this newkagglehub
version for beta Package testing, and rollout tomain
and our Docker image in January.http://b/377958580