Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for predefined hooks #203

Open
deitch opened this issue May 3, 2024 · 14 comments
Open

Add support for predefined hooks #203

deitch opened this issue May 3, 2024 · 14 comments
Assignees

Comments

@deitch
Copy link

deitch commented May 3, 2024

I have been working with the NVIDIA container toolkit which generates a CDI yaml file, including hooks. The hooks all call nvidia-ctk to do 3 things:

  • create symlinks (relative to the root of a container)
  • chmod (container root prepended to target paths)
  • update ldcache (run ldconfig within the chroot of the container)

These are pretty basic standard activities. It seems like it would make sense to have containerEdits within the spec capable of handling these. The spec includes container edit things like mounts, env, deviceNodes, additionalGIDs, and of course hooks. I think that creating symlinks, changing permissions on dir/file, and maybe updating ldcache (I am mixed on that one) are the kinds of capabilities that would fit as first-class citizens within containerEdits.

Thoughts?

@elezar elezar self-assigned this May 3, 2024
@elezar elezar changed the title support for additional container edit capabilities? Add support for predefined hooks May 13, 2024
@reidpr
Copy link

reidpr commented May 16, 2024

I’m working on CDI for Charliecloud, a lightweight, fully-unprivileged container implementation for HPC applications (hpc/charliecloud#1902). (In fact, it looks like @elezar was already helpful in some offline correspondence.) This would also be useful for us.

In fact, regardless of what the standard says, I anticipate not running nvidia-ctk hook to do these things, but rather interpreting such clauses as @deitch suggests and just do the things internally. Such a kludge seems better to me than calling external programs for these “basic standard” tasks, but I’d much rather follow documented behavior.

I would be happy to provide a draft PR for folks to look at.

@elezar
Copy link
Contributor

elezar commented May 17, 2024

@reidpr we did discuss this in our latest working group meeting (notes here) and will most likely not be implementing this as part of the spec for the time being.

One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from NVIDIA/nvidia-container-toolkit#474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders. Here we need to make provision for different paths to ldconfig on the host (e.g. on ubuntu the correct path is ldconfig.real since ldconfig is a shell wrapper) as well as containers that don't have an ldcache (e.g. busybox).

If we need to capture this logic in some kind of binary, this still needs to be available on the host and would complicate CDI spec consumption since we would have to distribute this binary somehow.

Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do so that their behaviour can be replicated across various environments. This does place the burden on the hook vendors to provide this documentation though and cannot necessarily be enforced by the spec.

One question I have is whether there is some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour? That is to say that as a vendor, NVIDIA would still distribute the nvidia-cdi-hook binary that would include an update-ldcache subcommand, but that the pre-and post conditions for this command are defined by a public spec. As the spec owners, we could even provide a set of test cases to validate this. This should in theory allow hooks to be more interchangeable and allow users to determine whether hooks are applicable given their environments.

@deitch
Copy link
Author

deitch commented May 17, 2024

One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from NVIDIA/nvidia-container-toolkit#474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders.

I can see that as a concern. At the same time, I do think it is a narrowly prescribed set of activities. I originally thought 3, now there is a variant on 1, so total of 4.

Would we want to implement the truly standard ones, i.e. chmod and symlinks?

@elezar
Copy link
Contributor

elezar commented May 17, 2024

For clarification, the chmod hook that was added to nvidia-ctk was to workaround a crun bug and should be removed from this discussion.

@reidpr
Copy link

reidpr commented May 17, 2024

Thank you @elezar.

Take a hook that "updates the ldcache" as an example ... we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER .... Here we need to make provision for different paths to ldconfig on the host ...

I see your point. I wasn’t in that discussion obviously, but in my view that type of logic belongs more in the container implementation rather than vendor hooks. However you’ve been thinking about this longer than me.

I have a couple follow-ups.

Can you clarify the reasoning on why NVIDIA (or HPE, or ...) should be figuring out host ldconfig stuff rather than Charliecloud (or Podman, or ...)? What am I missing?

What if the host does not have ldconfig? (E.g., Alpine is based on musl, which doesn’t.)

as well as containers that don't have an ldcache...

Good point.

Do these containers need ldconfig run at all, though?

Is it reasonably common to have containers without an ldconfig executable within the container but still needing an ldcache update?

Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do .... [I]s some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour?

I could work with this.

I really like the declarative aspect of CDI. Strictly prescribed hooks would be consistent with that, though (in my view) in kind of a roundabout way.

@reidpr
Copy link

reidpr commented May 17, 2024

latest working group meeting (notes here)

Are these meetings open to the public? It sounds like something I’d be interested in helping with.

@elezar
Copy link
Contributor

elezar commented May 21, 2024

Are these meetings open to the public? It sounds like something I’d be interested in helping with.

Yes, the meetings are public. All the required details should be in the linked document. Ideally they are recorded too, but there is sometimes some lag between when these are available on YouTube.

Copy link

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@deitch
Copy link
Author

deitch commented Aug 20, 2024

I will comment to keep it open.

@elezar
Copy link
Contributor

elezar commented Aug 20, 2024

I have created #225 as an initial attempt to capture the intent of the "predefined" hooks.

@deitch
Copy link
Author

deitch commented Aug 20, 2024

Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?

Copy link

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

@deitch
Copy link
Author

deitch commented Nov 19, 2024

Not stale.

@elezar
Copy link
Contributor

elezar commented Nov 21, 2024

Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?

The chmod hook was added to our tooling as a workaround for a particular runc bug. I don't think we should formalize this and rather focus on the remaining hooks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants