Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should not attempt to modify resources in normal namespace #87

Closed
erikgb opened this issue Sep 25, 2023 · 27 comments · Fixed by #99
Closed

Should not attempt to modify resources in normal namespace #87

erikgb opened this issue Sep 25, 2023 · 27 comments · Fixed by #99
Labels
bug Something isn't working

Comments

@erikgb
Copy link
Contributor

erikgb commented Sep 25, 2023

Describe the bug

We are running a proof-of-concept in one of our OpenShift clusters. And while the controller runs find without any watches, it fails badly when we enable watches of roles and role bindings. This issue might be relevant for other resource types, but namespace RBAC is the obvious start for propagating namespace resources for us.

It seems like the controller attempts to annotate resources in namespaces that are not annotated with any accurate.cybozu.com annotations, and this must be wrong! Errors are emitted constantly, and here is an example:

{"level":"error","ts":"2023-09-25T18:21:37Z","msg":"failed to check the controller reference","controller":"role","controllerGroup":"rbac.authorization.k8s.io","controllerKind":"Role","Role":{"name":"cluster-samples-operator","namespace":"openshift-cluster-samples-operator"},"namespace":"openshift-cluster-samples-operator","name":"cluster-samples-operator","reconcileID":"957d956a-b15e-4463-a458-ca5b1a32582b","error":"failed to add accurate.cybozu.com/propagate-generated annotation: roles.rbac.authorization.k8s.io \"cluster-samples-operator\" is forbidden: user \"system:serviceaccount:accurate:accurate-controller-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:accurate\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held"

This is a total blocker for us. The controller must NOT touch resources in namespaces that are irrelevant for Accurate.

Environments

  • Version: 1.1.0
  • OS: N/A

To Reproduce

  1. Install Accurate
  2. Configure Accurate to watch roles and role bindings
  3. Observe that Accurate adds (or attempts to add) annotations to resources in namespaces that are not observed/managed by Accurate. On Openshift this update is unsuccessful and the controller logs become extremely chatty.

Expected behavior

No attempt to update irrelevant namespace resources and no errors in controller logs.

Additional context

The controller errors might only occur on Openshift, which is a secure by default Kubernetes distribution. But the root cause here is the wrong attempt to change irrelevant resources.

@erikgb erikgb added the bug Something isn't working label Sep 25, 2023
@zeroalphat
Copy link
Contributor

@erikgb
Thanks for your feedback.
We will investigate it.

@ymmt2005
Copy link
Member

ymmt2005 commented Sep 27, 2023

This is fixable but not easy.
The accurate controller tries to add that annotation to record if the resource is generated from another resource, mainly for performance reasons.

If we skip this check, we have to check all watched resources when an Accurate label is added to a namespace.

With Accurate, namespaces are created dynamically by users, so statically granting permission only for a set of namespaces to the accurate controller wouldn't make sense, I guess. If we grant permissions for all namespaces, this issue is no longer a problem.

I want to mark this as by design.

@erikgb
Copy link
Contributor Author

erikgb commented Sep 28, 2023

I want to mark this as by design.

That would be very unfortunate and imply that Accurate is not something we can use in our clusters. And I would expect this to be problematic for a significant share of potential users of Accurate. Is this design mentioned somewhere in the docs?

By addressing, not to speak of modifying, resources in namespaces not marked as an Accurate-namespace, ref. https://cybozu-go.github.io/accurate/concepts.html#namespace-types, I think Accurate should be considered a holistic software product. Meaning that you cannot by any means migrate to/from Accurate. Your cluster must be bootstrapped with Accurate and deleted if you want to remove Accurate.

A temporary workaround could be introducing an operator config (regexp) for namespaces that Accurate should stay away from. But this would just be a minimum viable workaround and not a solution to the problem IMO.

I could be interested in participating in designing a fix for this, and performance is important. But not more important than doing the right thing. 😉

@ymmt2005
Copy link
Member

ymmt2005 commented Sep 28, 2023

@erikgb
Let me ask a few questions.

  1. The accurate controller failed to modify some resources in your cluster, and errors were recorded. What was your problem with this behavior? Noisy error logs?
  2. When you give Accurate permissions only for limited namespaces, how will you give it permissions for new namespaces that your user creates via SubNamespace?

@erikgb
Copy link
Contributor Author

erikgb commented Sep 28, 2023

  1. The accurate controller failed to modify some resources in your cluster, and errors were recorded. What was your problem with this behavior? Noisy error logs?

Noisy error logs for sure. And since Accurate has RBAC in all namespaces I would have to check if the controller managed to add annotations to any resources of the watched kind. I can do that.

  1. When you give Accurate permissions only for limited namespaces, how will you give more permissions when a user creates a new SubNamespace?

I don't think we can/should limit the Accurate permissions to namespaces using RBAC. But IMO Accurate should never modify (or attempt to modify) resources in namespaces that are not annotated with an Accurate annotation. Could you please explain why this requirement is not possible to implement?

@ymmt2005
Copy link
Member

ymmt2005 commented Sep 28, 2023

Thank you.

This problem can be fixed. It's just not very easy.

Interestingly, we find the same error in our environments even though our Role resources are in Accurate managed namespaces and the accurate controller has the admin privilege for Roles.

This is caused because that Role has more privileges than ClusterRole admin.
Kubernetes API server has an additional check when editing Roles to prevent privilege escalation, so Roles are very special.
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#restrictions-on-role-creation-or-update

Nevertheless, our Accurate has been working fine for us because the Roles that caused error logs were actually not intended to be propagated (if so, we'd have given necessary privileges).

Based on these, we are open to accepting a fix for this, but for us, this is not a high priority.

@erikgb
Copy link
Contributor Author

erikgb commented Sep 28, 2023

Based on these, we are open to accepting a fix for this, but for us, this is not a high priority.

@ymmt2005 Thanks for your feedback on this. I will discuss with my team to eventually make a contribution in this area. I would expect you to prefer a design proposal before this change is implemented in a PR. Do you have any guidelines on the structure of such designs in this project?

@ymmt2005
Copy link
Member

Not in this project, but we usually write a design doc when we work on something new.
Examples are in https://github.com/cybozu-go/moco/tree/main/docs/designdoc

You may do the same, or since this is just fixing a bug, you can also use this issue to
explain your idea for the fix.

@tenstad
Copy link

tenstad commented Oct 2, 2023

To make the issue a bit more precise, I suggest the following as a starting point for restricting the scope in which Accurate acts:

If neither a namespace nor any resources within it has an Accurate annotation, and no other Accurate annotations reference the namespace, Accurate should not act within the namespace.

Then I wonder:

  1. Is propagate-generated currently the onle feature causing Accurate to act outside of the scope stated above?

A very simple and temporary solution to the problem would then be allowing the feature to be disabed entirely.

  1. Would only annotating dependents with generate if their owner has the propagate-generated annotation restrict Accurate to the scope stated above?

To not have to check owners for all watched resources when Accurate is added to a namespace, cound the owners instead be watched, and their dependants be annotated generate whenever they are annotated propagate-generated? Similarly to configuring watched resource types, one would also configure which resource types can have propagate-generated. The
already existing watched list could be used, or a separate list could be added to the config.

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 3, 2023

Is propagate-generated currently the onle feature causing Accurate to act outside of the scope stated above?

Yes, as far as I remember right now.

Would only annotating dependents with generate if their owner has the propagate-generated annotation restrict Accurate to the scope stated above?

I don't understand this question very well.

@tenstad
Copy link

tenstad commented Oct 3, 2023

Could this line be removed:

ann[constants.AnnGenerated] = notGenerated

So that watched resources are only annotated with generate if they have an owner with the propagate-generated annotation.

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 3, 2023

That annotation is added for better performance.
Since a resource is generated only once, we want to check the owner only once.
Without this annotation, Accurate would have to check the same resource many times.

@tenstad
Copy link

tenstad commented Oct 3, 2023

I agree that the performance benefit is substantial! Could we still get around it somehow?

What if it was required to specify which resource types may be an owner (and have the propagate-generated annotaion)? So that owners are only looked up if they are of correct resource type? Could even configure owner resource types per resource type, to further limit lookups. E.g. {owners: {Secret: [Certificate, ExternalSecret]}}, meaning both Certificate and ExternalSecret can generate Secret.

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 3, 2023

What if it was required to specify which resource types may be an owner (and have the propagate-generated annotaion)?

It's doable, but I don't know if it helps mitigate this reported issue.

@tenstad
Copy link

tenstad commented Oct 3, 2023

In combination with removing the mentioned line, it could be a tradeoff between performance and resolving the issue.

Otherwise, I think we need some outside the box thinking to get this one resolved.

@tenstad
Copy link

tenstad commented Oct 3, 2023

What if that line is removed by default. And a feature gate (or flag) enables better performance, with the tradeoff of annotating additional resources in other namespaces?

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 3, 2023

What if that line is removed by default. And a feature gate (or flag) enables better performance, with the tradeoff of annotating additional resources in other namespaces?

I prefer a fix to skip that line if the namespace does not have labels for Accurate.
In that case, we need to reconcile all watched resources and check accurate.cybozu.com/propagate-generated annotation of their parents when an accurate label is added to a namespace.

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 5, 2023

@erikgb @tenstad
Hello, I have some news.
It turned out that this feature, accurate.cybozu.com/propagate-generated annotation, is no longer used in our company.

So @zoetrope and I agreed that we could simply disable this feature by default or even remove it.
What do you think?

@erikgb
Copy link
Contributor Author

erikgb commented Oct 5, 2023

So @zoetrope and I agreed that we could simply disable this feature by default or even remove it.
What do you think?

@ymmt2005, this is really good news! 🎉 I think the end goal should be to remove the feature completely, as it will remove some critical code and make the Accurate operator even simpler and more understandable. It will definitely also simplify/improve RBAC and controller machinery. If I am not mistaken, controller-runtime will by default spin up an informer/cache (if it not already exist) for any client Get/List call. So the lookup of any owner resource can be more costly than anticipated.

How to move forward is really up to you. Our short-term goal would be to have a release of Accurate that allows us to run without this really problematic issue present. We are not prepared to promote Accurate beyond our POC/sandbox cluster without it. So as a start, I would suggest merging #92 and then cutting a release. How does that sound to you?

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 5, 2023

I would suggest merging #92 and then cutting a release. How does that sound to you?

I'm fine. @zoetrope what's your count?

@erikgb
Copy link
Contributor Author

erikgb commented Oct 9, 2023

@ymmt2005 Are you able to cut a new release soon? We would like to continue our testing of Accurate, and cannot proceed without the new feature gate introduced in #92.

@ymmt2005
Copy link
Member

ymmt2005 commented Oct 9, 2023

I'm not doing that for a while. @zoetrope could you?

@zoetrope
Copy link
Member

@erikgb
Sorry for the wait. I will release it today.

@zoetrope
Copy link
Member

released: https://github.com/cybozu-go/accurate/releases/tag/v1.2.0

@erikgb
Copy link
Contributor Author

erikgb commented Oct 10, 2023

Thanks @zoetrope! 👍

@erikgb
Copy link
Contributor Author

erikgb commented Oct 12, 2023

@ymmt2005 @zoetrope The latest release (with the new feature gate enabled) seems to work very well in our POC cluster, which means we are ready to move forward. 🎉 To resolve this issue, I suggest to:

  • Deprecate the propagate-generate feature (mainly doc update)
  • Promote the DisablePropagateGenerated to Beta and enable it by default.

WDYT? If you are willing to accept PRs with these suggested changes, we should be able to remove the feature and clean up the code after a couple of releases.

@ymmt2005
Copy link
Member

I'm fine with your suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants