Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Do not merge) MVP support for Kaggle Packages #196

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

dster2
Copy link

@dster2 dster2 commented Dec 18, 2024

  1. Support tracking accessed datasources and writing / reading them to a requirements.yaml file.
  2. Support a "Package Scope" using that file which applies those datasource versions at runtime if the user requests a datasource without an explicit version. Kaggle Packages' __init__.py uses these helpers to automatically apply the package scope to all methods defined in the package.
  3. Support importing a Package. This uses a handle equivalent to a Notebook handle, and like Notebooks is currently limited to the latest version, with version support coming soon.
  4. Support Package Asset files whose path honors the current Package Scope.

Not sure who should review this, so casting a wide net of potentially interested parties 😅 Please include others who might be interested

Current plan is to not merge into main quite yet, we'll retain a workaround to use this new kagglehub version for beta Package testing, and rollout to main and our Docker image in January.

http://b/377958580

apply(module)


logger = logging.getLogger(__name__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you move this up?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

if getattr(member, "__module__", None) != module.__name__:
continue
# These denylisted entries cause infinite loops
elif inspect.isclass(member) and name not in ["__base__", "__class__"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the worst case depth of this? Can we use an explicit stack here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only recurses on classes defined within the user's module, so you'd have to have a ton of nested classes to hit the limit, but fair point to be defensive and it removes one of the helper functions here, fixed.

self.path: pathlib.Path = package_path
self.datasources: VersionedDatasources = read_requirements(package_path / "kagglehub_requirements.yaml")

self._token_stack = deque()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be a list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

Otherwise, assumes we're in an interactive Kaggle Notebook where package
data should be written to the /kaggle/working dir."""
scope = PackageScope.get()
package_path = scope.path if scope else pathlib.Path("/kaggle/working/package")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we are on colab or offline?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time a Package is imported -- whether via kagglehub.package_import or directly, whether on Colab or offline -- the package's code should be applying its own PackageScope (through logic in the generated __init__.py file, with the helpers above in this file) so we assume it's only unset when you're defining the Package, currently intended only within a Kaggle Notebook, though we'll probably want to add more flexibility in the future.

Added a bit more docstring on this point, thanks.

@neshdev
Copy link
Member

neshdev commented Dec 18, 2024

@dster2 - When ready, please add unit test for all the source files as well.

@dster2 dster2 force-pushed the packages branch 3 times, most recently from 31b4c18 to 0d1a162 Compare December 18, 2024 04:42
@dster2
Copy link
Author

dster2 commented Dec 18, 2024

Thanks Nesh! Yes will work on unit tests, just trying to jump-start review on the core patterns + impl

@dster2 dster2 force-pushed the packages branch 2 times, most recently from 13c0887 to ca9cfb8 Compare December 18, 2024 05:11
Support tracking accessed datasources and writing / reading them to a
requirements.yaml file.
Support a "Package Scope" using that file which applies those datasource
versions at runtime if the user requests a datasource without an
explicit version.
Support importing a Package.  This uses a handle equivalent to a
Notebook handle, and like Notebooks is currently limited to the latest
version, with version support coming soon.
Support Package Asset files whose path honors the current Package Scope.
Copy link
Contributor

@rosbo rosbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general direction makes sense. I added a few suggestion to make the code cleaner and less error-prone.

@@ -34,6 +34,11 @@ class ModelHandle(ResourceHandle):
def is_versioned(self) -> bool:
return self.version is not None and self.version > 0

def with_version(self, version: int): # noqa: ANN201
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of using noqa to ignore the warning, can you simply define the return type?

def with_version(self, version: int) -> ModelHandle:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not, you can't refer to your defining class inside yourself :\ there is typing.Self for this purpose, but it didn't work with Python 3.9 I think. There was another workaround that was also a bit messy so I thought to just denylist it... can take another pass though.

@@ -57,6 +62,9 @@ class DatasetHandle(ResourceHandle):
def is_versioned(self) -> bool:
return self.version is not None and self.version > 0

def with_version(self, version: int): # noqa: ANN201
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, please specify the DatasetHandle return type.

@@ -15,15 +15,15 @@
NUM_UNVERSIONED_NOTEBOOK_PARTS = 2 # e.g.: <owner>/<notebook>


@dataclass
@dataclass(frozen=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was frozen=True suggested automatically by the linter?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is required to make it a hashable type to use as key in the VersionedDatasources dictionary.

@@ -45,6 +47,7 @@ def __call__(

if not api_client.has_credentials():
if cached_path:
register_accessed_datasource(h, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cleaner to use a Python decorator: https://realpython.com/primer-on-python-decorators/#adding-syntactic-sugar

This way, you could simply annotate the call method of each resolver.

For the unversioned handle case, we mutate the handle to include the version so you could just pick that up when running logic post running the decorator wrapped method.

I wrote a quick POC: https://admin.kaggle.com/code/rosebv/register-access-decorator?scriptVersionId=213907476 (shared with Kaggle admins :))

https://screenshot.googleplex.com/vWyJ3M7AwC2utMU

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is great, thanks! Definitely cleans up the cruft where I was calling this before each return statement. However it's not quite so easy, at least with the current setup... since I made the handles immutable (frozen=True) I think the decorator cannot see a modified handle from its perspective of just seeing the args. Will play around more to see if we can make it work though.

@@ -278,6 +294,11 @@ def _extract_archive(archive_path: str, out_path: str) -> None:


def _get_current_version(api_client: KaggleApiV1Client, h: ResourceHandle) -> int:
# Check if there's a Package in scope which has stored a version number used when it was created.
version_from_package = get_package_datasource_version_number(h)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think version_from_package_scope would be clearer.

@@ -83,7 +84,9 @@ def __call__(
f"You can acces the other files othe attached competition at '{cached_path}'"
)
raise ValueError(msg)
register_accessed_datasource(h, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use a similar decorator pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants