
Fix creation of AWs/Jobs with same name in different namespaces #652

Open · wants to merge 1 commit into main from aw-different-namespaces
Conversation

ChristianZaccaria (Contributor) commented Sep 29, 2023

Issue link

Resolves #433
Resolves #383
Closes #671

In the genericresource.go file there is a SyncQueueJob function which is used to create resources inside etcd. In this function, AppWrappers/Jobs were identified primarily by a label (resourceName) whose value was the AW name, without taking the namespace into account. As a result, when two AWs/Jobs with the same name were created in different namespaces, the second AW/Job was identified as a duplicate.
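
To make the failure mode concrete, here is a minimal sketch (the label key and names are simplified placeholders, not the actual MCAD constants):

    package main

    import "fmt"

    func main() {
    	// Two AppWrappers named "myjob" live in different namespaces, but the
    	// pre-fix selector is keyed only on the name:
    	selectorA := fmt.Sprintf("resourceName=%s", "myjob") // AW in namespace "team-a"
    	selectorB := fmt.Sprintf("resourceName=%s", "myjob") // AW in namespace "team-b"

    	// The selectors are identical, so a cluster-wide List with either one
    	// matches both resources and the second create looks like a duplicate.
    	fmt.Println(selectorA == selectorB) // true
    }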

What changes have been made

  • Updated labels from appwrapper.mcad.ibm.com to workload.codeflare.dev/appwrapper, and from resourceName to workload.codeflare.dev/resourceName.
  • Created a helper function GetRandomString() and truncated the names of objects in etcd at 60 characters instead of 63, appending 3 random lowercase alphanumeric characters to avoid unexpected clashes between long names that differ only near the end (a sketch follows this list).
  • Label selectors now take the AppWrapper namespace into account, not just the name.
  • Updated quotaManagement.rbac.apiGroup from ibm.com to quota.codeflare.dev.
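
A minimal sketch of the truncate-and-suffix behavior described in the second bullet; GetRandomString is the helper introduced by this PR, but the body shown here is an assumption:

    package main

    import (
    	"fmt"
    	"math/rand"
    	"strings"
    )

    const lowerAlnum = "abcdefghijklmnopqrstuvwxyz0123456789"

    // GetRandomString returns n random lowercase alphanumeric characters.
    // (The helper name comes from the PR; this body is only a guess at it.)
    func GetRandomString(n int) string {
    	b := make([]byte, n)
    	for i := range b {
    		b[i] = lowerAlnum[rand.Intn(len(lowerAlnum))]
    	}
    	return string(b)
    }

    func main() {
    	newName := strings.Repeat("x", 70) // longer than the 63-char DNS limit
    	if len(newName) > 60 {
    		// Keep 60 characters and append 3 random ones, staying within 63.
    		newName = newName[:60] + GetRandomString(3)
    	}
    	fmt.Println(newName, len(newName)) // prints a 63-char name
    }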

Verification steps

  1. Have MCAD running.
  2. Create two AWs/Jobs with the same name but in different namespaces.
  3. Verify that each one's pods are created and running.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • e2e test
    • Manual tests

@openshift-ci openshift-ci bot requested review from dmatch01 and tardieu September 29, 2023 12:56
openshift-ci bot commented Sep 29, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign asm582 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChristianZaccaria ChristianZaccaria requested review from kpouget, astefanutti, metalcycling and z103cb and removed request for dmatch01 and tardieu September 29, 2023 12:57
Comment on lines 314 to 315
if len(newName+namespace) > 63 {
	newName = newName[:len(newName)-(len(newName)+len(namespace)-63)]
kpouget commented:

at quick glance, I'm not 100% sure about the correctness of these lines.
But actually, even the original code:

	if len(newName) > 63 {
		newName = newName[:63]

makes me doubtful. Do we really want to truncate the name at 63 chars? Wouldn't that lead to unexpected clashes when using similar-but-ending-differently long names?

Raising an error could be saner (and thus making the AppWrapper impossible to process); it's a common K8s thing to abort on long names.
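
For context, Kubernetes ships a validator for this in apimachinery, so the abort-on-long-names approach could look like this minimal sketch (the surrounding flow is hypothetical):

    package main

    import (
    	"fmt"

    	"k8s.io/apimachinery/pkg/util/validation"
    )

    func main() {
    	name := "some-generated-object-name"
    	// IsDNS1123Label returns a list of problems; an empty list means the
    	// value is a valid DNS-1123 label (<= 63 lowercase alphanumerics or '-').
    	if errs := validation.IsDNS1123Label(name); len(errs) > 0 {
    		fmt.Printf("refusing to create %q: %v\n", name, errs)
    		return
    	}
    	fmt.Println("name is valid:", name)
    }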

tardieu (Member) commented:

I have tried in the past to concatenate the namespace and name of a job and observed that this would indeed run long for many of our jobs. The use of the hyphen is also problematic because it is ambiguous. We should have separate labels for the name and namespace.
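
A hypothetical sketch of that suggestion, with two qualified label keys instead of one hyphen-joined value (the namespace label key is an assumption, not something this PR adds):

    package main

    import "fmt"

    const (
    	resourceNameLabel      = "workload.codeflare.dev/resourceName"
    	resourceNamespaceLabel = "workload.codeflare.dev/resourceNamespace" // hypothetical key
    )

    func main() {
    	namespace, name := "team-a", "myjob"
    	// Two labels cannot be confused the way a joined "team-a-myjob" can be
    	// with, say, namespace "team" and name "a-myjob".
    	labelSelector := fmt.Sprintf("%s=%s,%s=%s",
    		resourceNamespaceLabel, namespace,
    		resourceNameLabel, name)
    	fmt.Println(labelSelector)
    }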

ChristianZaccaria (author) commented:

Thank you @kpouget, you raised a very good point. From what I could read, it seems this is done to meet the DNS standards RFC 1035/1123. Since Kubernetes namespaces and names are often used to form a DNS subdomain, K8s enforces the limit here.

After looking at this code again, I can confirm that we can keep the original code here, as the namespace is used in the creation of the object anyway.

Now, to deal with very long names, we could raise an error and stop the operation as suggested. But what if we handled the error instead? Here are some ideas:

  • Hashing the name for uniqueness?
  • Or truncating to 60 characters and appending, e.g., 3 random characters?

What do you think?

ChristianZaccaria (author) commented:

Thanks @tardieu, having separate labels for the name and namespace does look like the best option instead of using a hyphen.


> Now, to deal with very long names, we could raise an error and stop the operation as suggested. But what if we handled the error instead?

Do you know how this DNS name is used? Is it something that the AppWrapper owner will ever want to use? If it's purely internal, then using a hash, or truncating the name plus random chars, seems like a good idea.

@@ -163,7 +163,7 @@ func (gr *GenericResources) Cleanup(aw *arbv1.AppWrapper, awr *arbv1.AppWrapperG
 	}

 	// Get the resource to see if it exists
-	labelSelector := fmt.Sprintf("%s=%s, %s=%s", appwrapperJobName, aw.Name, resourceName, unstruct.GetName())
+	labelSelector := fmt.Sprintf("%s=%s, %s=%s", appwrapperJobLabelName, aw.Name, resourceName, unstruct.GetNamespace()+"-"+unstruct.GetName())
astefanutti (Contributor) commented:
Suggested change:
-	labelSelector := fmt.Sprintf("%s=%s, %s=%s", appwrapperJobLabelName, aw.Name, resourceName, unstruct.GetNamespace()+"-"+unstruct.GetName())
+	labelSelector := fmt.Sprintf("%s=%s, %s=%s-%s", appwrapperJobLabelName, aw.Name, resourceName, unstruct.GetNamespace(), unstruct.GetName())

ChristianZaccaria (author) commented:

I don't know why I didn't think of that, thanks @astefanutti :)

@@ -163,7 +163,7 @@ func (gr *GenericResources) Cleanup(aw *arbv1.AppWrapper, awr *arbv1.AppWrapperG
 	}

 	// Get the resource to see if it exists
-	labelSelector := fmt.Sprintf("%s=%s, %s=%s", appwrapperJobName, aw.Name, resourceName, unstruct.GetName())
+	labelSelector := fmt.Sprintf("%s=%s, %s=%s", appwrapperJobLabelName, aw.Name, resourceName, unstruct.GetNamespace()+"-"+unstruct.GetName())
Contributor commented:
Can that label be removed altogether? It would avoid those cluster-wide list queries and be fully deterministic. Even with this fix, you could imagine cases where labels clash.

Member commented:
For normal cleanup of, e.g., an appwrapper containing a PyTorchJob, we should only delete the PyTorchJob and let the training operator do the rest. But we are also exploring adding "forceful" cleanup (kill -9) as a last resort, where we delete (forcefully) any resource that can be traced back to an appwrapper. For the latter at least, it makes sense to select on labels.
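
As a sketch of what such label-scoped forceful deletion could look like with client-go (the function, label key, and zero grace period are illustrative, not code from this PR):

    package cleanup

    import (
    	"context"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    )

    // forceDeletePods forcefully deletes every pod carrying the appwrapper label
    // in the given namespace, as a last resort when operators fail to clean up.
    func forceDeletePods(ctx context.Context, cs kubernetes.Interface, ns, awName string) error {
    	grace := int64(0) // "kill -9": skip the graceful termination period
    	return cs.CoreV1().Pods(ns).DeleteCollection(ctx,
    		metav1.DeleteOptions{GracePeriodSeconds: &grace},
    		metav1.ListOptions{LabelSelector: "workload.codeflare.dev/appwrapper=" + awName},
    	)
    }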

Contributor commented:
That would assume the underlying operators propagate the label, right? It seems owner references and cascading deletion would be a more robust and generally supported mechanism. Granted, it would restrict the generic items to the same namespace as the AppWrapper (cross-namespace ownership is not allowed), but even that might be seen as a good restriction.
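
A sketch of the owner-reference mechanism being described (the group/version string and helper are assumptions):

    package cleanup

    import (
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    	"k8s.io/apimachinery/pkg/types"
    )

    // setAppWrapperOwner marks a wrapped resource as owned by its AppWrapper so
    // that cascading deletion cleans it up. This only works when the resource
    // lives in the same namespace as the AppWrapper (cross-namespace ownership
    // is not allowed).
    func setAppWrapperOwner(obj *unstructured.Unstructured, awName string, awUID types.UID) {
    	controller := true
    	ref := metav1.OwnerReference{
    		APIVersion: "workload.codeflare.dev/v1beta1", // assumed group/version
    		Kind:       "AppWrapper",
    		Name:       awName,
    		UID:        awUID,
    		Controller: &controller,
    	}
    	obj.SetOwnerReferences(append(obj.GetOwnerReferences(), ref))
    }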

Member commented:
This indeed assumes that wrapped resources in the appwrapper are properly labelled and that labels are propagated by operators for resources created indirectly. Again, this is only a last resort, but an important last resort. This definitely does not preclude trying to do the right thing first. Ownership references are not an option in general due to the cross-namespace restriction. Transitive ownership and cascading deletion are good practice, but there is no guarantee operators follow these practices.

Member commented:
For more context, we do experience node failure scenarios where the kube scheduler, and hence typically operators, fail to terminate pods. The pods remain in a terminating state until forcefully deleted (irrespective of grace periods), hence the need to orchestrate forceful deletion in MCAD.

Member commented:
To be clear, even for forceful deletion, I am not advocating for relying solely on labels, but labels are one of the tools we want to use to scope the forceful deletion.

tardieu (Member) commented Sep 29, 2023

Do we have any plans to change the label name from appwrapper.mcad.ibm.com to something consistent with the group name, maybe appwrapper.workload.codeflare.dev?

astefanutti (Contributor) commented:

> Do we have any plans to change the label name from appwrapper.mcad.ibm.com to something consistent with the group name, maybe appwrapper.workload.codeflare.dev?

Yes, I agree it'll have to be changed, e.g., to workload.codeflare.dev/appwrapper. @tardieu, is this the only purpose of that label? If yes, @ChristianZaccaria, maybe you could tackle that change in the scope of this PR, WDYT?

ChristianZaccaria (author) commented Sep 29, 2023

> Yes, I agree it'll have to be changed, e.g., to workload.codeflare.dev/appwrapper. @tardieu, is this the only purpose of that label? If yes, @ChristianZaccaria, maybe you could tackle that change in the scope of this PR, WDYT?

@astefanutti Yes, sounds good to me. It's a good idea to include the change here.

tardieu (Member) commented Sep 29, 2023

> @tardieu is this the only purpose of that label?

The label is used to select resources directly created by MCAD (in genericresource.go) during creation, deletion, and to decide if the job is complete. As discussed, this is probably redundant with just querying for the generic items one by one.

The same label is used to select pods directly or indirectly caused by an appwrapper (in queuejob_controller_ex.go) to look at pod statuses and decide if the job is healthy or not.

The change needs to be made consistently throughout. The introduction of a separate label for the namespace should also be done consistently throughout. Otherwise, we will end up confusing pods resulting from appwrapper mynamespace/myname with the pods of appwrapper myothernamespace/myname.

This does mean that all tests and all appwrapper yamls in circulation have to be adjusted. I would rather sync this change with the apigroup change than have users go back and forth between 3 CRD revisions (old group + old label, new group + old label, new group + new label).

@ChristianZaccaria ChristianZaccaria force-pushed the aw-different-namespaces branch 2 times, most recently from 8d5acc1 to 719dea1 Compare October 4, 2023 15:39
@tardieu tardieu self-requested a review October 4, 2023 16:35
tardieu (Member) commented Oct 4, 2023

If I read this correctly, the latest push adds a label with the namespace of the wrapped resource itself rather than the appwrapper namespace. I am not sure I understand the logic.

tardieu (Member) commented Oct 4, 2023

We should probably avoid unqualified label names like resourceName and replace them with names such as workload.codeflare.dev/resourceName.

ChristianZaccaria (author) commented:

> If I read this correctly, the latest push adds a label with the namespace of the wrapped resource itself rather than the appwrapper namespace. I am not sure I understand the logic.

I think I understand your point. Would it make more sense to include labels for both the wrapped resources and the AppWrapper, or to just use the AppWrapper name and namespace? The issue is that if, for example, two Jobs are created with the same name in different namespaces, the second Job's pods are not created, so the problem does not originate with the AppWrapper itself. Let me know if I've misunderstood, thank you Olivier.

@ChristianZaccaria ChristianZaccaria force-pushed the aw-different-namespaces branch 7 times, most recently from 64d82ba to a24f2f9 Compare November 10, 2023 17:22
@ChristianZaccaria ChristianZaccaria force-pushed the aw-different-namespaces branch 2 times, most recently from 0693e5d to 85d8617 Compare November 14, 2023 12:24
ChristianZaccaria (author) commented:

@astefanutti Hi Antonin, do you think updating quotaManagement.rbac.apiGroup from ibm.com to quota.codeflare.dev like this is correct? Thanks!

- script: helm upgrade --install mcad-controller deployment/mcad-controller --namespace kube-system --wait --set loglevel=${LOG_LEVEL} --set resources.requests.cpu=1000m --set resources.requests.memory=1024Mi --set resources.limits.cpu=4000m --set resources.limits.memory=4096Mi --set image.repository=$IMAGE_REPOSITORY_MCAD --set image.tag=$IMAGE_TAG_MCAD --set image.pullPolicy=$MCAD_IMAGE_PULL_POLICY --set configMap.quotaEnabled='"true"' --set quotaManagement.rbac.apiGroup=ibm.com --set quotaManagement.rbac.resource=quotasubtrees --set configMap.name=mcad-controller-configmap --set configMap.preemptionEnabled='"true"'

ChristianZaccaria (author) commented:

I attempted to drop the resourceName label entirely, but that caused the e2e tests to fail. I believe it has to do with certain scenarios where the resourceName differs from the AppWrapper name, such as this example:

labels:
  appwrapper.mcad.ibm.com: hold-completion-job-03
  job-name: hold-completion-job-03-01
  resourceName: hold-completion-job-03-01

kpouget commented Nov 24, 2023

Hello @ChristianZaccaria, a lot of the changes seem to be unrelated, or only indirectly related, to the original issue. Shouldn't these changes be merged in another PR? 🤔

ChristianZaccaria (author) commented Nov 27, 2023

> Hello @ChristianZaccaria, a lot of the changes seem to be unrelated, or only indirectly related, to the original issue. Shouldn't these changes be merged in another PR? 🤔

Hi @kpouget, it was suggested to include these changes in the scope of this PR: #652 (comment)

@ChristianZaccaria ChristianZaccaria force-pushed the aw-different-namespaces branch 2 times, most recently from 993f665 to 10b2c49 Compare November 28, 2023 12:19
-	newName = newName[:63]
+	if len(newName) > 60 {
+		newName = newName[:60]
+		newName += GetRandomString(3)
+	}

 	err = deleteObject(namespaced, namespace, newName, rsrc, dclient)
Contributor commented:
I think this will almost always fail for lengths >60, because the string you create is not deterministic. You should use a hash to derive the last 3 characters.
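
A minimal sketch of the deterministic alternative being suggested, deriving the suffix from a hash of the full name so the same input always truncates to the same result (the function name is illustrative):

    package cleanup

    import (
    	"crypto/sha256"
    	"encoding/hex"
    )

    // shortenName truncates names over 63 characters to 60 characters plus a
    // 3-character suffix derived from a hash of the full name, so the create
    // and delete paths always compute the same shortened name.
    func shortenName(name string) string {
    	if len(name) <= 63 {
    		return name
    	}
    	sum := sha256.Sum256([]byte(name))
    	return name[:60] + hex.EncodeToString(sum[:])[:3]
    }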

-	newName = newName[:63]
+	if len(newName) > 60 {
+		newName = newName[:60]
+		newName += GetRandomString(3)
Contributor commented:

Here too
