wip/object endpoint #1
base: object-endpoint
Conversation
I can see there's a DCO missing somewhere -- the bot says so. Probably better to check that out. I'll try to review this today. :)
…ing as boolean ref: 6410 Signed-off-by: Chin-Ya Huang <[email protected]>
…gine identity flags and fields Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
The recurring job pod status should be failed (error) if creating or starting a backup or snapshot fails. The recurring job will start a new pod to start a new backup because RestartPolicy is set to `onFailure`. Ref: 4255 Signed-off-by: James Lu <[email protected]>
ref: longhorn/longhorn 4826 Signed-off-by: Jack Lin <[email protected]>
Generally looks good. Some questions, mostly.
api/objectendpoint.go
func (s *Server) ObjectEndpointUpdate(rw http.ResponseWriter, req *http.Request) (err error) {
	apiContext := api.GetApiContext(req)
	resp := &client.GenericCollection{}
	apiContext.Write(resp)
	return nil
}
Am I misunderstanding this, or is this function just a stub?
This is just a stub. I should have put a // TODO here.
This function is executed, for example, when you do a kubectl edit objectendpoint foobar-endpoint, or through the "Edit" function in the UI. But I'm starting to doubt that this is a good idea at all. Maybe we should just forbid editing object endpoints and make the user delete/recreate them? The problem is that there isn't much to edit, and what's there probably shouldn't be editable, since I have no idea how to sanely respond to something like the storage class changing. I could make the controller re-deploy on a new volume, but do I have to migrate the data? What if the size is changed and the new size is insufficient for the data?
There are lots of corner cases in this discussion that I don't yet know how to handle, so at the moment I'm inclined to say that you just can't edit an object endpoint.
Can we choose what we allow to be edited? E.g., how will we enable debug on the service if we need to? Can't it be through an update to the endpoint?
There are admission webhooks that can validate manifests submitted to the API based on pretty arbitrary rules, and we could probably implement one to make sure nothing nonsensical is submitted.
Failing such a webhook will result in an error on the client side. For the UI we can (and should) design the dialogues such that it's impossible to do the wrong thing.
But that presupposes that there actually is a sensible way to edit an object endpoint, which I'm less and less sure about the longer I think about it.
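To make the idea concrete, here is a minimal sketch of the kind of check such a webhook could run on updates. The spec fields (StorageClass, Size) are assumptions for illustration, not the actual CRD schema from this PR:

package webhook

import "fmt"

// ObjectEndpointSpec is a stub of the CRD spec, for illustration only;
// the real spec in this PR may have different fields.
type ObjectEndpointSpec struct {
	StorageClass string
	Size         int64
}

// validateUpdate is what the admission webhook could call with the old and
// the proposed new spec: deny any change to fields the controller cannot
// sensibly reconcile, allow everything else.
func validateUpdate(oldSpec, newSpec ObjectEndpointSpec) error {
	if oldSpec.StorageClass != newSpec.StorageClass {
		return fmt.Errorf("spec.storageClass is immutable")
	}
	if oldSpec.Size != newSpec.Size {
		return fmt.Errorf("spec.size is immutable; delete and recreate the object endpoint to resize")
	}
	return nil
}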
Okay, so my follow-up question, albeit tangential to this bit, is: how would we change something like debug levels on the endpoint?
You can always do a kubectl edit on the respective deployment, assuming that the controller doesn't have logic built in to revert that change (which it currently doesn't).
But offering that capability through the CRD would of course require this kind of option to be modeled in the CRD. Then the controller would need logic to adapt the deployment etc. when the generation of the object endpoint K8s object changes.
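For reference, a minimal sketch of the standard generation/observedGeneration pattern that such controller logic usually builds on; the ObservedGeneration status field is an assumption here, not something this PR already has:

package controller

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ObjectEndpoint is a stub for illustration; only the fields needed for
// the pattern are shown.
type ObjectEndpoint struct {
	metav1.ObjectMeta
	Status struct {
		// ObservedGeneration records the last spec generation the
		// controller finished reconciling (assumed field).
		ObservedGeneration int64
	}
}

// specChanged reports whether the spec was edited since the last
// reconciliation: the API server bumps metadata.generation on every
// spec change, so a mismatch means the controller has work to do.
func specChanged(oe *ObjectEndpoint) bool {
	return oe.Generation != oe.Status.ObservedGeneration
}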
If this doesn't actually edit, is there a return type like ENOTSUP?
Unfortunately there isn't. This API is based on the assumption that only supported and implemented functionality is exposed as an API endpoint.
controller/controller_manager.go
func StartControllers(logger logrus.FieldLogger, stopCh <-chan struct{},
	controllerID, serviceAccount, managerImage, backingImageManagerImage, shareManagerImage,
	kubeconfigPath, version string, proxyConnCounter util.Counter) (*datastore.DataStore, *WebsocketController, error) {

func StartControllers(
	logger logrus.FieldLogger,
	stopCh <-chan struct{},
	controllerID,
	serviceAccount,
	managerImage,
	backingImageManagerImage,
	shareManagerImage,
	objectEndpointImage,
	kubeconfigPath,
	version string,
	proxyConnCounter util.Counter) (*datastore.DataStore, *WebsocketController, error) {
This formatting seems unnecessary. Was this a result of gofmt or something?
I broke the long line so I could better see where I should put the additional argument.
I get that it helps, but it also unnecessarily changes the formatting of the code around you. It doesn't look like one argument per line is the format being used. Do you know what happens if this code is run through gofmt or the like? I'd imagine the code base to be automatically formatted.
I run gofmt every time I save in my editor, so I guess it's fine by that, but I'll revert this if it creeps you out.
I don't particularly care. It's just one of those things that I imagine could irk someone reviewing it against the original code base.
I know that formatting-only changes are bad practice, but I figured since there is an actual code change here, I may as well. But now I've put back the long lines, so it's kinda moot anyways.
	if quit {
		return false
	}
	defer oec.queue.Done(key)
What's the semantic of Done() here? Should it be different depending on whether it's being called after we have actually run through oec.syncObjectEndpoint() below, or whether it's being called due to quit being true?
I think you're missing some context here:
https://pkg.go.dev/k8s.io/client-go/util/workqueue
https://go.dev/ref/spec#Defer_statements
This is the only place where an item is removed from the queue. Done() does this if the item is not marked as dirty again during processing, or if the item may be retried:
Done marks item as done processing, and if it has been marked as dirty again while it was being processed, it will be re-added to the queue for re-processing.
Forget() works differently, as it marks an item as not-to-be-retried, but doesn't remove it from the queue just yet.
Forget indicates that an item is finished being retried. Doesn't matter whether it's for perm failing or for success, we'll stop the rate limiter from tracking it. This only clears the rateLimiter, you still have to call Done on the queue.
So the logic here is:
if quit:
- leave the queue as-is and return
else:
- process item (this possibly marks the item as non-retry-able or as dirty)
- if the item is dirty: add it back to the queue
- else if the item failed but can be retried: add it back to the queue
- else: mark the item as finished processing
I think you're missing some context here: https://pkg.go.dev/k8s.io/client-go/util/workqueue https://go.dev/ref/spec#Defer_statements

I was missing the workqueue context, for sure. In the case of defer, it was just a misunderstanding of how the compiler handles a sequence of defer statements, which actually took me writing a small program to validate that my understanding was wrong ;)
So the logic here is: [...]

Gotcha. Thanks :)
	if err == nil {
		oec.queue.Forget(key)
		return
	}
This is definitely not obvious. If we're calling a function named handleError(), one expects this function to be dealing with an error. It seems however that what we're doing is handling state from processNextWorkItem() that should be dealt with there instead.
I mean, err == nil should be validated before calling this function. At most this function should assert to ensure err != nil.
I found this confusing too, but stuck with it as it is consistent with the other controllers.
I guess the reasoning here is that handleError handles the content of the error return value, regardless of what that is.
There is good reason to do it this way, as we're relying on the error return value to tell us whether we should retry a work item or whether we can finalize it and remove it from the queue, and whether we should finalize it as successful or as "all retries failed, this error is final".
But we're not doing any other logic in this function if err == nil, so there's no point in doing that bit in this function. Regardless of whether other controllers do that, this is confusing and adds nothing of real, tangible value.
Handling the error, if err != nil, would be handled the same way, to tell us whether we should retry or not.
So what you're suggesting is that we split handling the error value into two parts across two functions:
func (oec *ObjectEndpointController) processNextWorkItem() bool {
	key, quit := oec.queue.Get()
	if quit {
		return false
	}
	defer oec.queue.Done(key)
	err := oec.syncObjectEndpoint(key.(string))
	if err == nil {
		oec.queue.Forget(key)
	} else {
		oec.handleError(err, key)
	}
	return true
}

[...]

func (oec *ObjectEndpointController) handleError(err error, key interface{}) {
	if oec.queue.NumRequeues(key) < maxRetries {
		oec.logger.WithError(err).Errorf("")
		oec.queue.AddRateLimited(key)
		return
	}
	utilruntime.HandleError(err)
	oec.logger.WithError(err).Errorf("")
	oec.queue.Forget(key)
}
This would make it even more confusing IMO, since now you have two places where we handle error values, and more importantly two places where we decide what to do with the item on the queue next. The handleError() function makes sure that, based on the error value (which may be nil, indicating success), the item is retried or removed from the queue. I don't see why we should handle the success case in a different place, and more importantly why we should spread the code for making the re-queue decision over several places.
The point of this function isn't to handle only failures, it's to handle the re-queue decision based off of the value of err, which happens to be of type error but may also be nil.
So what you're suggesting is that we split handling the error value into two parts across two functions: [...] This would make it even more confusing IMO, since now you have two places where we handle error values.
No, we don't. We have a place where we handle the case where it's not an error, and a place where we handle the case where it is an error. As far as I know, err == nil means "no error", so there is no error to be handled.
And more importantly two places where we decide what to do with the item on the queue next. [...] The point of this function isn't to handle only failures, it's to handle the re-queue decision based off of the value of err, which happens to be of type error but may also be nil.
Then rename the function. That is no longer handling an error. It's doing something else.
Does handleErr add value to the reader? Suggest pulling the logic into processNextWorkItem. Having the queue logic in one place makes it easier for the reader to grok the whole thing.
Doing the queue item handling all in one place sounds reasonable. I've pulled the result handling logic into the processNextWorkItem function.
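A sketch of what the merged version could look like, combining the standard client-go workqueue pattern with the pieces discussed above (the log messages are placeholders, not the actual ones):

func (oec *ObjectEndpointController) processNextWorkItem() bool {
	key, quit := oec.queue.Get()
	if quit {
		// Queue was shut down; end the worker.
		return false
	}
	// Done re-adds the item if it was marked dirty while being processed,
	// otherwise it removes it from the queue.
	defer oec.queue.Done(key)

	err := oec.syncObjectEndpoint(key.(string))
	switch {
	case err == nil:
		// Success: stop the rate limiter from tracking the item.
		oec.queue.Forget(key)
	case oec.queue.NumRequeues(key) < maxRetries:
		// Transient failure: retry with rate limiting.
		oec.logger.WithError(err).Warn("retrying object endpoint sync")
		oec.queue.AddRateLimited(key)
	default:
		// Retries exhausted: give up on the item.
		oec.logger.WithError(err).Error("dropping object endpoint out of queue")
		oec.queue.Forget(key)
		utilruntime.HandleError(err)
	}
	return true
}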
controller/uninstall_controller.go
@@ -161,7 +162,10 @@ func NewUninstallController(
		ds.SnapshotInformer.AddEventHandler(c.controlleeHandler())
		cacheSyncs = append(cacheSyncs, ds.SnapshotInformer.HasSynced)
	}

	if _, err := extensionsClient.ApiextensionsV1().CustomResourceDefinitions().Get(context.TODO(), CRDObjectEndpointName, metav1.GetOptions{}); err == nil {
		ds.ObjectEndpointInformer.AddEventHandler(c.controlleeHandler())
is this a typo? controlleeHandler()?
Nope, see controller/uninstall_controller.go:174
Then yikes, that's a really weird name 🤷
	defer func() {
		err = errors.Wrapf(err, "failed to delete object endpoints")
	}()
isn't this being called regardless of whether we hit an error condition or not?
It is being called regardless, but depending on whether err is set or nil, we will return the set error or retain nil:
Wrapf returns an error annotating err with a stack trace at the point Wrapf is called, and the format specifier. If err is nil, Wrapf returns nil.
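A self-contained example of that behavior, for illustration:

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// deleteObjectEndpoints mimics the pattern from the diff: the named return
// value err is rewrapped on the way out, and since errors.Wrapf returns nil
// for a nil err, the success path stays nil.
func deleteObjectEndpoints(fail bool) (err error) {
	defer func() {
		err = errors.Wrapf(err, "failed to delete object endpoints")
	}()
	if fail {
		return errors.New("connection refused")
	}
	return nil
}

func main() {
	fmt.Println(deleteObjectEndpoints(false)) // <nil>
	fmt.Println(deleteObjectEndpoints(true))  // failed to delete object endpoints: connection refused
}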
ack
datastore/kubernetes.go
@@ -73,13 +73,28 @@ func (s *DataStore) getSupportBundleManagerLabel(supportBundle *longhorn.Support
		types.SupportBundleManagerLabelKey: supportBundle.Name,
	}
}

I suspect this line should not be removed.
Good spot
@@ -650,6 +665,16 @@ func (s *DataStore) DeleteConfigMap(namespace, name string) error {
	return nil
}

// CreateSecret creates Secret with the given object
func (s *DataStore) CreateSecret(namespace string, obj *corev1.Secret) (*corev1.Secret, error) {
	return s.kubeClient.CoreV1().Secrets(namespace).Create(context.TODO(), obj, metav1.CreateOptions{})
context.TODO()? Does this mean we need to work on this at some point?
TODO returns a non-nil, empty Context. Code should use context.TODO when it's unclear which Context to use or it is not yet available (because the surrounding function has not yet been extended to accept a Context parameter).
https://pkg.go.dev/context#TODO
In this case, as in many other places in the code, the kubernetes client expects a context of some sort: https://pkg.go.dev/k8s.io/client-go/kubernetes/typed/core/v1#SecretInterface
This context is used to communicate deadlines, cancellation signals, etc. We're already relying on the controller's queue to handle retries and errors correctly, so we don't need to handle those cases here, hence we just pass an empty non-nil context.
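Should a real deadline ever be needed, swapping it in would look roughly like this; this is a hypothetical variant, and the 30-second timeout is an arbitrary example value:

package datastore

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createSecretWithTimeout bounds the API call with a deadline instead of
// passing context.TODO().
func createSecretWithTimeout(kube kubernetes.Interface, namespace string, obj *corev1.Secret) (*corev1.Secret, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel() // release the timer even on the success path
	return kube.CoreV1().Secrets(namespace).Create(ctx, obj, metav1.CreateOptions{})
}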
ack
is this file automatically generated? If not, why are we removing stuff from things that don't seem related to our stuff at all?
This is automatically generated. I'd have to try to track this down using the git history, but I trust that this is accurately generated. There is a slight possibility that things like version differences between my environment and the one used for generating the crds.yaml on the master branch caused the unexpected diff. Or that the file on master was edited by hand.
gotcha
Thanks for going through this and prompting me to go back and look up the exact details on several of the mechanisms at play here.
I think it's reasonable to merge this into this repo and then open the PR from this branch. I suspect it will allow us to better track that. But whatever you prefer should work.
How is this thing tested?
There are a couple of functions without docstrings. Suggest adding them. Many weren't obvious in what they do.
There were unused returns. Suggest running a suite of linters. A good starting point would be golangci-lint --config=scripts/golangci.yml with the config from ceph-csi: https://github.com/ceph/ceph-csi/blob/devel/scripts/golangci.yml.in
api/storageclass.go
	return nil
}

func (s *Server) storageClassList(apiContext *api.ApiContext) (*client.GenericCollection, error) {
Is there a better name for this? Why a separate function with almost the same name and so little code?
One function is the API handler that creates an HTTP response for the REST API. The other is a private function that actually compiles the list, which is used by the websocket server directly, as it doesn't need to wrap the response with HTTP headers. Both do about the same thing, which in this case is returning a list of storage classes.
The naming is consistent with the implementation of other API endpoints.
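Schematically, the split looks like this; the bodies are illustrative, not the actual implementation:

// StorageClassList is the REST handler: it wraps the compiled list in an
// HTTP response.
func (s *Server) StorageClassList(rw http.ResponseWriter, req *http.Request) error {
	apiContext := api.GetApiContext(req)
	list, err := s.storageClassList(apiContext)
	if err != nil {
		return err
	}
	apiContext.Write(list)
	return nil
}

// storageClassList compiles the raw collection; the websocket controller
// calls it directly, without the HTTP wrapping.
func (s *Server) storageClassList(apiContext *api.ApiContext) (*client.GenericCollection, error) {
	// Illustrative body: the real function converts the storage classes
	// from the datastore into API resources.
	return &client.GenericCollection{}, nil
}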
}

func (oec *ObjectEndpointController) worker() {
	for oec.processNextWorkItem() {
Check return code?
for function() {} just loops function so long as function returns something truthy.
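Spelled out, the loop is equivalent to:

func (oec *ObjectEndpointController) worker() {
	for {
		if !oec.processNextWorkItem() {
			return // false means the queue was shut down; end the worker
		}
	}
}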
Yes, and it does look simple.
How is error handling modeled here?
What does the bool return mean? End worker execution? That an error occurred which processNextWorkItem could not handle? If so, why let worker swallow that error and not bubble it up?
	switch endpoint.Status.State {
	case longhorn.ObjectEndpointStateStarting, longhorn.ObjectEndpointStateError:
		dpl, err := oec.ds.GetDeployment(endpoint.Name)
		if err == nil {
Nested logic tends to be harder to understand than necessary. On each nest the reader has to keep state.
Suggest denesting once (here and in the following):
if err != nil {
	return err
}
if dpl.Status... > 0 {
	return nil
}
datastore/longhorn.go
@@ -4540,3 +4540,124 @@ func (s *DataStore) RemoveFinalizerForLHVolumeAttachment(va *longhorn.VolumeAtta
func (s *DataStore) DeleteLHVolumeAttachment(vaName string) error {
	return s.lhClient.LonghornV1beta2().VolumeAttachments(s.namespace).Delete(context.TODO(), vaName, metav1.DeleteOptions{})
}

// GetObjectEndpointRO is the read-only version of GetObjectEndpoint.
This name is misleading. The return isn't RO, it "suggests" a behavior to the caller. Suggest renaming it mutable, or making it the default and letting the caller make the copy/no-copy choice.
I'm sticking to the naming convention of the rest of the datastore here, where the RO suffix implies that the returned object should be used read-only, and no suffix implies a deep copy taking place.
k8s/crds.yaml
description: An address where the S3 endpoint is exposed.
type: string
state:
  description: |
    The state of the object endpoint as observed by the object endpoint
    controller.

    The object endpoint implements a state machine with this, beginning at
    "unknown". Once the object endpoint controller detects the new object
    endpoint and begins resource creation the state is transformed to
    "starting". The object endpoint controller observes the resources and
    once all resources are up and ready the state transitions to "running".
    The state remains in "running" until the object endpoint is deleted, at
    which point the controller will clean up the associated resources.

        │ object endpoint created
        │
        └──►┌───────────┐
            │ unknown   │
        ┌── └───────────┘
        │
        │ controller creates resources
        │
        └──►┌───────────┐
            │ starting  │
        ┌── └───────────┘
        │
        │ controller detects all resources ready
        │
        └──►┌───────────┐
            │ running   │
        ┌── └───────────┘
        │
        │ object endpoint is marked for deletion
        │
        └──►┌───────────┐
            │ stopping  │
        ┌── └───────────┘
        │
        │ controller waits for dependent resources to disappear before removing
        │ the finalizer, thereby letting the object endpoint be deleted
Suggest adding line breaks
This file is generated.
// endpoint and begins resource creation the state is transformed to
// "starting". The object endpoint controller observes the resources and once
// all resources are up and ready the state transitions to "running".
// The state remains in "running" until a the object endpoint is deleted, at
-a
// The state of the object endpoint as observed by the object endpoint
// controller.
//
// The object endpoint implements a state machine with this, beginning at
Failure states? Starting -> Stopping possible?
Starting -> unknown possible?
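One way to make the answer explicit in code would be a transition table the controller consults before changing state. This hypothetical table only encodes the happy path named in the comment; the constant names besides Starting and Error are assumed from the naming scheme, and the failure transitions are exactly the open question:

var allowedTransitions = map[longhorn.ObjectEndpointState][]longhorn.ObjectEndpointState{
	longhorn.ObjectEndpointStateUnknown:  {longhorn.ObjectEndpointStateStarting},
	longhorn.ObjectEndpointStateStarting: {longhorn.ObjectEndpointStateRunning},
	longhorn.ObjectEndpointStateRunning:  {longhorn.ObjectEndpointStateStopping},
	// stopping -> (deleted); error transitions intentionally left open here
}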
manager/objectendpoint.go
"github.com/sirupsen/logrus" | ||
) | ||
|
||
func (m *VolumeManager) ListObjectEndpointsSorted() ([]*longhorn.ObjectEndpoint, error) { |
Why does sorted deserve its own function? What is special about the sort?
Suggest at least adding to the name what this is sorted by.
The datastore (i.e. the cache of K8s objects) keeps the data as a map of keys to objects. Keys are normally just the names of the K8s objects, but can be something else as well, as they must of course be unique. They are always string-like data, though. This means that the datastore has a canonical notion of order between two objects, which is just the order of the keys.
To preserve the order when querying the datastore for a list of all object endpoints, the endpoints are put into the return value sorted by their keys.
This is different from just generally requesting a list of object endpoints, as an "unsorted" list function would not guarantee the order of object endpoints in the list to be stable between two invocations.
The function name indicates that the returned list is sorted by "something", implying that the order of object endpoints will be stable between invocations at the cost of the sorting operation.
The overarching goal here is to let the API return consistent results. Keep in mind that there may be multiple instances of the longhorn-manager. Each instance may keep the contents of the datastore laid out differently in memory, and K8s can decide to route different API requests to different instances. So not only does this make the result of the listing API consistent between two consecutive calls to the same longhorn-manager instance, it also makes it consistent between two different instances.
While it's not impossible to remove this guarantee from the API and just let the frontend bring the data into a consistent format, it reduces the probability of 'glitchy' corner cases in the frontend.
I'd be fine with adding a comment, but the name of the function itself is fine as is. It returns a somehow-sorted list of object endpoints.
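A sketch of what that looks like in code; the ListObjectEndpoints map accessor is assumed from the datastore's usual pattern:

func (m *VolumeManager) ListObjectEndpointsSorted() ([]*longhorn.ObjectEndpoint, error) {
	endpointMap, err := m.ds.ListObjectEndpoints() // map from cache key to object
	if err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(endpointMap))
	for k := range endpointMap {
		keys = append(keys, k)
	}
	// Sorting the string keys gives every manager instance the same stable
	// order, independent of Go's randomized map iteration order.
	sort.Strings(keys)
	endpoints := make([]*longhorn.ObjectEndpoint, 0, len(keys))
	for _, k := range keys {
		endpoints = append(endpoints, endpointMap[k])
	}
	return endpoints, nil
}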
The second paragraph raises some questions.
What is a glitchy corner case and how are they handled?
Can duplicates occur in the result?
What if two manager instances disagree? Do they disagree on order or also on entries?
Is this function exported as RPC?
I'm thinking about the likes of the following 'glitchiness': reloading the same page of the UI repeatedly and the order changing at random. This would e.g. be annoying if you have so many object endpoints that you need pagination to display the list. Then adding or deleting one (implicitly reloading the list) could totally change which endpoints end up on the page you're viewing. With sorting, that shifts at most by one, which is expected as well. The point of making sure this function has as much consistency in its behavior as possible is that the UI then doesn't need to handle inconsistencies, i.e. you don't need to make sure the UI sorts the data before displaying it.
Duplicates cannot occur, as a single request is routed to and handled by one instance only.
Sorting will ensure that the instances don't disagree on order.
The source of truth in all cases is the K8s API. Each longhorn-manager instance keeps an in-memory cache of the objects in the K8s API and updates that itself. This of course means that there may be a brief time-window when the managers disagree on entries.
This function is indirectly exported as RPC. It is what backs the "listing" functionality, i.e. when you send a GET request to the /v1/objectendpoints API endpoint, this function will compile the list of object endpoints that will then be marshalled into JSON and put into the response's body.
Signed-off-by: Chin-Ya Huang <[email protected]>
Longhorn 6522 Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
ref: 5297 Signed-off-by: Chin-Ya Huang <[email protected]>
ref: 3786 Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
Longhorn 4979 Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
Signed-off-by: Eric Weber <[email protected]>
Fix spelling mistakes found by codespell Signed-off-by: Moritz Röhrich <[email protected]>
Fix unit tests of object store controller by adding the new states and testing them. Signed-off-by: Moritz Röhrich <[email protected]>
- Only initialize the .spec.Image and .spec.uiImage fields of the ObjectStore if they don't already contain values - Test that .spec.Image, .spec.UiImage and .spec.TargetState are correctly initialized for new object stores - Test that new object stores successfully transition into "starting" state when first being picked up by the controller Signed-off-by: Moritz Röhrich <[email protected]>
Deploy ingress for accessing the s3gw UI Signed-off-by: Moritz Röhrich <[email protected]>
- Fix calling multiple updates within the same reconciliation - Move the controller to an event driven model, monitoring events on resources rather than periodically re-syncing. - Make only the Longhorn manager which is responsible for the Volume responsible for the ObjectStore. This avoids multiple controllers posting updates to the ObjectStore resources with conflicting contents Signed-off-by: Moritz Röhrich <[email protected]>
- Add mutating webhook for initializing the object store - Add ownership to ensure only one controller modifies an object store CR - Add ingress configurations for S3 Signed-off-by: Moritz Röhrich <[email protected]>
Sync list of tls secrets to UI to enable selecting a secret when creating an endpoint for an object store Signed-off-by: Moritz Röhrich <[email protected]>
Fix some problems in the object store controller: - possible nil pointer dereference in handleStopping function - more usage of datastore.GetResourceRO functions, in cases where the returned object is not modified - less lookups in the enqueue*() functions Frontend API: - Add endpoints for listing TLS secrets - Fix object store creation never having any endpoints Signed-off-by: Moritz Röhrich <[email protected]>
Rebase on master and fixup broken things Signed-off-by: Moritz Röhrich <[email protected]>
Fix review remarks Signed-off-by: Moritz Röhrich <[email protected]>
Remove ingress for the object store UI. This is no longer needed when using the Longhorn UI's nginx as reverse proxy. Fix deleting an object store by removing the finalizer without waiting on resources to be deleted. Since OwnerReferences are used for automatic cleanup, the finalizer can be removed immediately. Signed-off-by: Moritz Röhrich <[email protected]>
Fix websocket controller: Relay secret, object store and settings information as expected by the UI frontend and as used by the object store tab in the UI Error handling: Remove custom error variables from the object store controller, since in Go, errors created with the errors.New function cannot be compared with the errors.Is function as expected This mechanism is replaced by just wrapping normal errors with errors.Wrapf Error handling: Fix some conditions in checkDeployment. When a deployment is scaled up, there is a brief period where it has 0 replicas and 0 unavailable replicas. Therefore it is insufficient to just check the count of unavailable replicas to make sure the deployment is ready; the total count of replicas has to be checked too. Signed-off-by: Moritz Röhrich <[email protected]>
Expose size and usage information of the longhorn volume associated with an object store via the internal API to the UI. This allows the Longhorn UI to receive information about the actual size and free space in an object store. Signed-off-by: Moritz Röhrich <[email protected]>
Fix event trigger for PVC events. This had mistakenly been wired up to call the event handler for Service events, but now it's fixed to call the right event handler. Signed-off-by: Moritz Röhrich <[email protected]>
Fix creating a secret when given credentials from the UI. Instead of over-writing an existing secret or similar antics, generate a name without a conflict for a new secret to use. Signed-off-by: Moritz Röhrich <[email protected]>
Fix volume expansion: - ensure a volume expansion is only tried when the new size is larger than the old size - propagate the new size to the Longhorn volume, thereby expanding it, when an object store is updated Fix update: - Propagate the container image from the UI to the deployment, thereby allowing users to update the s3gw container images from the Longhorn UI even for existing object stores and even to newer versions than those that are originally shipped Signed-off-by: Moritz Röhrich <[email protected]>
- Fix s3gw status page container port - Force setting the `.spec.image` and `.spec.uiImage` properties of the object store via webhook on creation - Don't transition the object store to error state when creating resources fails, instead retry - Check PV and Longhorn Volume health - Don't check service health as a K8s Service resource doesn't have any status indicating healthiness or not. It either exists or not; readiness is dealt with on the deployment/pod level Signed-off-by: Moritz Röhrich <[email protected]>
- Update CRDs, move size property under .spec, rename .spec.storage to .spec.volumeParameters - Improve webhook to make storage attributes immutable if they don't also change the backing volume - Improve error handling in the controller Signed-off-by: Moritz Röhrich <[email protected]>
Update CRDs Safe image modifications when updating images Signed-off-by: Moritz Röhrich <[email protected]>
Fix secret creation with logic that supports reclaiming managed secrets and errors out when user-created secrets are conflicting with to-be-created ones Adapt the kubernetes config map controller to be able to create and manage multiple storage classes. Add default settings to allow the kubernetes config map controller to manage the stub storage class for the volumes of object stores Signed-off-by: Moritz Röhrich <[email protected]>
Stop propagating image settings to the object store controller. Applying default image settings to object stores is handled in the webhooks. Therefore the object store controller no longer needs access to this information, besides it would have to be fetched from the appropriate setting anyways. Signed-off-by: Moritz Röhrich <[email protected]>
Fix the race condition on start of a new object store. As long as there is just the object store and no longhorn volume yet, some controller needs to be responsible for the object store. Usually it would be the one responsible for the volume, but this doesn't exist yet and there are no provisions to track the owner another way. To fix this, just let the controller on the first node handle it. Fix error handling by setting the object store state to error only once and otherwise propagating errors up to the reconcile function. Signed-off-by: Moritz Röhrich <[email protected]>
Signed-off-by: Moritz Röhrich <[email protected]>
- Adjust container launch options for the s3gw to match the new launch options for s3gw v0.23.0 and up - Cleanup access credential secrets when they have been created through the longhorn manager and the object store is deleted Signed-off-by: Moritz Röhrich <[email protected]>
- Fix potential nil pointer dereference when receiving API errors other than ErrNotFound - Export Object Store size specs in the prometheus exporter - Send the number of deployed object stores to the telemetry endpoint (if sending telemetry is enabled) longhorn/longhorn#6720 Signed-off-by: Moritz Röhrich <[email protected]>
- Add admission webhook for validating object store resources - Force foreground cascading deletion of object stores. This ensures that the volume and PV are cleaned up by the object store controller and resources aren't leaked before removing the finalizer on the object store. - Consolidate starting state. The object store's state is only set to starting in the `initializeObjectStore` function, not whenever resources are created - Error conditions: All errors except server-side API errors will cause the object store state to become `error`. There is only one place in the code where the state can be set to `error`, which is at the end of the reconcile function. - Telemetry setting: Depending on the Longhorn setting allowing for sending telemetry, the telemetry for the s3gw is switched on or off. Signed-off-by: Moritz Röhrich <[email protected]>
- Reorder the health checks during the creation of an object store When creating an object store, the responsibility for managing the resources making up that object store rests with the controller that's also responsible for the longhorn volume. There is a brief moment while the object store is starting when the information about which controller is responsible for the longhorn volume has not yet been propagated to all controllers. Therefore the controller must first wait until the longhorn volume is healthy and has an owner before it can determine for sure whether it's responsible or not. Only then can the other resources be created without running into race conditions with other controllers that think they are responsible. - Clean up utility functions and constants Signed-off-by: Moritz Röhrich <[email protected]>
Rebase on master and update vendored modules Fix logic in test script. To ensure only the necessary unit tests are executed (or, if nothing has been changed, all unit tests are executed), the logic needs to make sure not to interpret vendored modules as changed source, since it will fail to find unit tests for them. Signed-off-by: Moritz Röhrich <[email protected]>
Simplify deletes by removing preceding gets and dealing with errors. Signed-off-by: Moritz Röhrich <[email protected]>
Longhorn Manager: Object Endpoint Controller
Implement an object endpoint controller for managing s3gw instances through Longhorn CRDs and the Longhorn UI.
Related: https://github.com/aquarist-labs/s3gw/issues/470