This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

wip/object endpoint #1

Open
wants to merge 198 commits into base: object-endpoint

Conversation

m-ildefons

Longhorn Manager: Object Endpoint Controller

Implement an object endpoint controller for managing s3gw instances through Longhorn CRDs and the Longhorn UI.

Related: https://github.com/aquarist-labs/s3gw/issues/470

@m-ildefons m-ildefons requested a review from jecluis July 31, 2023 09:34
@m-ildefons m-ildefons self-assigned this Jul 31, 2023
@jecluis
Member

jecluis commented Aug 1, 2023

I can see there's a DCO missing somewhere -- the bot says so. Probably better to check that out.

I'll try to review this today. :)

c3y1huang and others added 9 commits August 1, 2023 12:35
…gine identity flags and fields

Signed-off-by: Eric Weber <[email protected]>
Recurring job pod status should be failed (error) if creating
and starting a backup or snapshot failed.

The recurring job will start a new pod to start a new backup because
RestartPolicy is set to `OnFailure`.

Ref: 4255

Signed-off-by: James Lu <[email protected]>
Member

@jecluis jecluis left a comment

Generally looks good. Some questions, mostly.

Comment on lines 70 to 75
func (s *Server) ObjectEndpointUpdate(rw http.ResponseWriter, req *http.Request) (err error) {
apiContext := api.GetApiContext(req)
resp := &client.GenericCollection{}
apiContext.Write(resp)
return nil
}
Member

Am I misunderstanding this, or is this function just a stub?

Author

This is just a stub. I should have put a // TODO here.

This function is executed, for example, when you do a `kubectl edit objectendpoint foobar-endpoint`, or through the "Edit" function in the UI. But I'm starting to doubt that this is a good idea at all. Maybe we should just forbid editing object endpoints and make the user delete/recreate them? The problem is that there isn't much to edit, and what's there probably shouldn't be editable, since I have no idea how to sanely respond to something like the storage class changing. I could make the controller re-deploy on a new volume, but do I have to migrate the data? What if the size is changed and the new size is insufficient for the data?
There are lots of corner cases in this discussion that I don't yet know how to handle, so at the moment I'm inclined to say that you just can't edit an object endpoint.

Member

Can we choose what we allow to be edited? E.g., how will we enable debug on the service if we need to? Can't it be through an update to the endpoint?

Author

There are admission webhooks that can validate manifests submitted to the API based on fairly arbitrary rules, and we could probably implement one to make sure nothing nonsensical is submitted.
Failing such a webhook will result in an error on the client side. For the UI we can (and should) design the dialogues such that it's impossible to do the wrong thing.
But that presupposes that there actually is a sensible way to edit an objectendpoint, which I'm less and less sure about the longer I think about it.
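
For illustration, a rule of that sort could look roughly like this (a sketch only; the type and field names are placeholders, not the actual Longhorn CRD types):

// objectEndpointSpec stands in for the CRD spec discussed here.
type objectEndpointSpec struct {
	StorageClass string
	SizeBytes    int64
}

// validateUpdate rejects edits that the controller has no sane way to
// reconcile, e.g. switching the storage class or shrinking the volume.
func validateUpdate(oldSpec, newSpec objectEndpointSpec) error {
	if newSpec.StorageClass != oldSpec.StorageClass {
		return fmt.Errorf("spec.storageClass is immutable")
	}
	if newSpec.SizeBytes < oldSpec.SizeBytes {
		return fmt.Errorf("spec.size cannot be reduced below the current size")
	}
	return nil
}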

Member

Okay, so my follow up question, albeit tangential to this bit, is: how would we change something like debug levels on the endpoint?

Author

You can always do a kubectl edit on the respective deployment, assuming that the controller doesn't have logic built in to revert that change (which it currently doesn't have).

But offering that capability through the CRD would require this kind of option to be modeled in the CRD of course. Then the controller would need logic to adapt the deployment/etc, when the generation of the object endpoint K8s object changes.

Member

If this doesn't actually edit, is there a return type like ENOTSUP?

Author

Unfortunately there isn't. This API is based on the assumption that only supported and implemented functionality is exposed as API endpoint.

Comment on lines 40 to 51
func StartControllers(logger logrus.FieldLogger, stopCh <-chan struct{},
controllerID, serviceAccount, managerImage, backingImageManagerImage, shareManagerImage,
kubeconfigPath, version string, proxyConnCounter util.Counter) (*datastore.DataStore, *WebsocketController, error) {
func StartControllers(
logger logrus.FieldLogger,
stopCh <-chan struct{},
controllerID,
serviceAccount,
managerImage,
backingImageManagerImage,
shareManagerImage,
objectEndpointImage,
kubeconfigPath,
version string,
proxyConnCounter util.Counter) (*datastore.DataStore, *WebsocketController, error) {
Member

This formatting seems unnecessary. Was this a result of gofmt or something?

Author

I broke the long line, so I could better see where I should put the additional argument

Member

I get that it helps, but it also unnecessarily changes the formatting of the code around you. It doesn't look like one argument per line is the format being used. Do you know what happens if this code is run through gofmt or the like? I'd imagine the code base is automatically formatted.

Author

I run gofmt every time I save in my editor, so I guess it's fine by that, but I'll revert this if it creeps you out.

Member

I don't particularly care. It's just one of those things that I imagine could irk someone reviewing it against the original code base.

Author

I know that formatting-only changes are bad practice, but I figured since there is an actual code change here, I may as well. But now I've put back the long lines, so it's kinda moot anyways.

if quit {
return false
}
defer oec.queue.Done(key)
Member

What's the semantic of Done() here? Should it be different depending on whether it's being called after we have actually run through oec.syncObjectEndpoint() below, or whether it's being called due to quit being true?

Author

I think you're missing some context here:
https://pkg.go.dev/k8s.io/client-go/util/workqueue
https://go.dev/ref/spec#Defer_statements

This is the only place where an item is removed from the queue. Done() does this if the item is not marked as dirty again during processing, or if the item may be retried:

Done marks item as done processing, and if it has been marked as dirty again while it was being processed, it will be re-added to the queue for re-processing.

Forget() works differently, as it marks an item as not-to-be-retried, but doesn't remove it from the queue just yet.

Forget indicates that an item is finished being retried. Doesn't matter whether it's for perm failing or for success, we'll stop the rate limiter from tracking it. This only clears the rateLimiter, you still have to call Done on the queue.

So the logic here is:

if quit:
  • leave the queue as-is and return
else:
  • process the item (this possibly marks the item as non-retry-able or as dirty)
  • if the item is dirty: add it back to the queue
  • else if the item failed but can be retried: add it back to the queue
  • else: mark the item as finished processing
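
That flow, expressed compactly against the client-go workqueue API (a sketch only; the sync callback stands in for the controller's actual sync function):

func processNextItem(queue workqueue.RateLimitingInterface, sync func(string) error) bool {
	key, quit := queue.Get()
	if quit {
		return false // shutting down: leave the queue alone and stop the worker
	}
	// Done runs after processing; if the item was marked dirty in the
	// meantime, the queue re-adds it for re-processing.
	defer queue.Done(key)

	if err := sync(key.(string)); err != nil {
		// failed but retryable: put it back, rate limited
		queue.AddRateLimited(key)
		return true
	}
	// success: stop the rate limiter from tracking this item
	queue.Forget(key)
	return true
}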

Member

I think you're missing some context here: https://pkg.go.dev/k8s.io/client-go/util/workqueue https://go.dev/ref/spec#Defer_statements

I was missing the workqueue context, for sure. In case of defer, that was just a misunderstanding of how the compiler understands sequence of defer statements, which actually took me writing a small program to validate that my understanding was wrong ;)

So the logic here is:

if quit:
* leave the queue as-is and return
else:
* process the item (this possibly marks the item as non-retry-able or as dirty)
* if the item is dirty: add it back to the queue
* else if the item failed but can be retried: add it back to the queue
* else: mark the item as finished processing

Gotcha. Thanks :)

Comment on lines 101 to 104
if err == nil {
oec.queue.Forget(key)
return
}
Member

This is definitely not obvious. If we're calling a function named handleError(), one expects this function to be dealing with an error. It seems however that what we're doing is handling state from processNextWorkItem() that should be dealt with there instead.

I mean, err == nil should be validated before calling this function. At most this function should assert to ensure err != nil.

Author

I found this confusing too, but stuck with it as this is consistent with the other controllers.
I guess the reasoning here is that handleError handles the content of the error return value regardless of what that is.

There is good reason to do it this way, as we're relying on the error return value to tell us if we should re-try a work item or if we can finalize it and remove from the queue. And if we should finalize it as successful or as "all retries failed this error is final".

Member

But we're not doing any other logic in this function IF err == nil, so there's no point in doing that bit in this function. Regardless of whether other controllers do that, this is confusing and adds nothing of real, tangible value.

Handling the error, IF err != nil, would work the same way: it tells us whether we should retry or not.

Author

So what you're suggesting is that we split handling the error value into two parts across two functions:

func (oec *ObjectEndpointController) processNextWorkItem() bool {
	key, quit := oec.queue.Get()
	if quit {
		return false
	}
	defer oec.queue.Done(key)
	
	err := oec.syncObjectEndpoint(key.(string))
	if err == nil {
		oec.queue.Forget(key)
	} else {
		oec.handleError(err, key)
	}

	return true
}

[...]

func (oec *ObjectEndpointController) handleError(err error, key interface{}) {
	if oec.queue.NumRequeues(key) < maxRetries {
		oec.logger.WithError(err).Errorf("")
		oec.queue.AddRateLimited(key)
		return
	}

	utilruntime.HandleError(err)
	oec.logger.WithError(err).Errorf("")
	oec.queue.Forget(key)
}

This would make it even more confusing IMO, since now you have two places where we handle error values. And more importantly, two places where we decide what to do with the item on the queue next. The handleError() function makes sure that, based on the error value (which may be nil, indicating success), the item is retried or removed from the queue; I don't see why we should handle the success case in a different place, and more importantly why we should spread the code for making the decision to re-queue over several places.
The point of this function isn't to handle only failures, it's to handle the re-queue decision based on the value of err, which happens to be of type error but may also be nil.

Member

So what you're suggesting is that we split handling the error value into two parts across two functions:

func (oec *ObjectEndpointController) processNextWorkItem() bool {
	key, quit := oec.queue.Get()
	if quit {
		return false
	}
	defer oec.queue.Done(key)
	
	err := oec.syncObjectEndpoint(key.(string))
	if err == nil {
		oec.queue.Forget(key)
	} else {
		oec.handleError(err, key)
	}

	return true
}

[...]

func (oec *ObjectEndpointController) handleError(err error, key interface{}) {
	if oec.queue.NumRequeues(key) < maxRetries {
		oec.logger.WithError(err).Errorf("")
		oec.queue.AddRateLimited(key)
		return
	}

	utilruntime.HandleError(err)
	oec.logger.WithError(err).Errorf("")
	oec.queue.Forget(key)
}

This would make it even more confusing IMO, since now you have two places where we handle error values.

No we don't. We have a place where we handle the case where it's not an error, and a place where we handle the case where it is an error. As far as I know, err == nil means "no error", so there is no error to be handled.

And more importantly, two places where we decide what to do with the item on the queue next. The handleError() function makes sure that, based on the error value (which may be nil, indicating success), the item is retried or removed from the queue; I don't see why we should handle the success case in a different place, and more importantly why we should spread the code for making the decision to re-queue over several places. The point of this function isn't to handle only failures, it's to handle the re-queue decision based on the value of err, which happens to be of type error but may also be nil.

Then rename the function. That is no longer handling an Error. It's doing something else.

Member

Does handleErr add value to the reader? Suggest pulling the logic into processNextWorkItem. Having the queue logic in one place makes it easier for the reader to grok the whole thing.

Author

Doing the queue item handling all in one place sounds reasonable. I've pulled the result handling logic into the processNextWorkItem function.
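
For reference, a merged version could look roughly like this (a sketch based on the snippets quoted above, not necessarily the final code):

func (oec *ObjectEndpointController) processNextWorkItem() bool {
	key, quit := oec.queue.Get()
	if quit {
		return false
	}
	defer oec.queue.Done(key)

	if err := oec.syncObjectEndpoint(key.(string)); err != nil {
		if oec.queue.NumRequeues(key) < maxRetries {
			// transient failure: retry with rate limiting
			oec.logger.WithError(err).Warn("Retrying object endpoint sync")
			oec.queue.AddRateLimited(key)
			return true
		}
		// retries exhausted: report the error and drop the item
		utilruntime.HandleError(err)
		oec.logger.WithError(err).Error("Giving up syncing object endpoint")
	}
	oec.queue.Forget(key)
	return true
}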

controller/object_endpoint_controller.go Outdated (resolved)
@@ -161,7 +162,10 @@ func NewUninstallController(
ds.SnapshotInformer.AddEventHandler(c.controlleeHandler())
cacheSyncs = append(cacheSyncs, ds.SnapshotInformer.HasSynced)
}

if _, err := extensionsClient.ApiextensionsV1().CustomResourceDefinitions().Get(context.TODO(), CRDObjectEndpointName, metav1.GetOptions{}); err == nil {
ds.ObjectEndpointInformer.AddEventHandler(c.controlleeHandler())
Member

is this a typo? controlleeHandler()?

Author

Nope, see controller/uninstall_controller.go:174

Member

Then yikes, that's a really weird name 🤷

Comment on lines +544 to +547
defer func() {
err = errors.Wrapf(err, "failed to delete object endpoints")
}()
Member

isn't this being called regardless of whether we hit an error condition or not?

Author

@m-ildefons m-ildefons Aug 11, 2023

It is being called regardless, but depending on whether err is set or nil, we will return the set error or retain nil:

Wrapf returns an error annotating err with a stack trace at the point Wrapf is called, and the format specifier. If err is nil, Wrapf returns nil.

https://pkg.go.dev/github.com/pkg/errors#Wrapf
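
A small, self-contained illustration of that nil-preserving behavior (hypothetical function, not the actual uninstall code):

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func deleteObjectEndpoints(fail bool) (err error) {
	// Runs on every return path, but Wrapf keeps nil as nil,
	// so the success case still returns nil.
	defer func() {
		err = errors.Wrapf(err, "failed to delete object endpoints")
	}()

	if fail {
		return fmt.Errorf("boom")
	}
	return nil
}

func main() {
	fmt.Println(deleteObjectEndpoints(false)) // <nil>
	fmt.Println(deleteObjectEndpoints(true))  // failed to delete object endpoints: boom
}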

Member

ack

@@ -73,13 +73,28 @@ func (s *DataStore) getSupportBundleManagerLabel(supportBundle *longhorn.Support
types.SupportBundleManagerLabelKey: supportBundle.Name,
}
}

Member

I suspect this line should not be removed.

Author

Good spot

@@ -650,6 +665,16 @@ func (s *DataStore) DeleteConfigMap(namespace, name string) error {
return nil
}

// CreateSecret creates Secret with the given object
func (s *DataStore) CreateSecret(namespace string, obj *corev1.Secret) (*corev1.Secret, error) {
return s.kubeClient.CoreV1().Secrets(namespace).Create(context.TODO(), obj, metav1.CreateOptions{})
Member

context.TODO() ? Does this mean we need to work on this at some point?

Author

TODO returns a non-nil, empty Context. Code should use context.TODO when it's unclear which Context to use or it is not yet available (because the surrounding function has not yet been extended to accept a Context parameter).

https://pkg.go.dev/context#TODO

In this case, as in many other places in the code, the kubernetes client expects a context of some sort: https://pkg.go.dev/k8s.io/client-go/kubernetes/typed/core/v1#SecretInterface
This context is used to communicate deadlines, cancellation signals, etc. We're already relying on the controller's queue to handle retries and errors correctly, so we don't need to handle those cases here; hence we just give an empty, non-nil context.
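
Purely for illustration, if a deadline or cancellation signal ever did become relevant here, the call could be given a derived context instead of context.TODO() (a sketch, not a suggested change):

// Bound the API call to 30 seconds instead of passing an empty context.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
secret, err := s.kubeClient.CoreV1().Secrets(namespace).Create(ctx, obj, metav1.CreateOptions{})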

Member

ack

Member

is this file automatically generated? If not, why are we removing stuff from things that don't seem related to our stuff at all?

Author

This is automatically generated. I'd have to try to track this down using the git history, but I trust that this is accurately generated. There is a slight possibility that things like version differences between my environment and the one used for generating the crds.yaml on the master branch caused the unexpected diff. Or that the file on master was edited by hand.

Member

gotcha

@m-ildefons
Author

Thanks for going through this and prompting me to go back and look up the exact details on several of the mechanisms at play here.
I'll fix the DCO signoff problem before creating a PR against the Longhorn repo as I also want to squash away some of the more minor commits

@jecluis
Member

jecluis commented Aug 11, 2023

Thanks for going through this and prompting me to go back and look up the exact details on several of the mechanisms at play here. I'll fix the DCO signoff problem before creating a PR against the Longhorn repo as I also want to squash away some of the more minor commits

I think it's reasonable to merge this into this repo and then open the PR from this branch. I suspect it will allow us to better track that. But whatever you prefer should work.

@m-ildefons m-ildefons force-pushed the wip/object-endpoint branch from 3a8034c to 354b750 on August 11, 2023 10:17
Member

@irq0 irq0 left a comment

How is this thing tested?

There are a couple of functions without docstrings. Suggest adding them; many aren't obvious about what they do.

There were unused returns. Suggest running a suite of linters:

A good starting point would be `golangci-lint --config=scripts/golangci.yml`
with the config from ceph-csi:
https://github.com/ceph/ceph-csi/blob/devel/scripts/golangci.yml.in

Comment on lines 101 to 104
if err == nil {
oec.queue.Forget(key)
return
}
Member

Does handleErr add value to the reader? Suggest pulling the logic into processNextWorkItem. Having the queue logic in one place makes it easier for the reader to grok the whole thing.

Comment on lines 70 to 75
func (s *Server) ObjectEndpointUpdate(rw http.ResponseWriter, req *http.Request) (err error) {
apiContext := api.GetApiContext(req)
resp := &client.GenericCollection{}
apiContext.Write(resp)
return nil
}
Member

If this doesn't actually edit, is there a return type like ENOTSUP?

return nil
}

func (s *Server) storageClassList(apiContext *api.ApiContext) (*client.GenericCollection, error) {
Member

Is there a better name for this? Why a separate function with almost the same name and so little code?

Author

One function is the API handler that creates an HTTP response for the REST API. The other is a private function that actually compiles the list; it is used by the websocket server directly, as that doesn't need to wrap the response with HTTP headers. Both do roughly the same thing, which in this case is returning a list of storage classes.
The naming is consistent with the implementation of the other API endpoints.
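
Roughly, the pattern looks like this (a sketch; the exported handler name here is assumed, following the style of the PR's other API handlers):

// Exported REST handler: wraps the private list builder in an HTTP response.
func (s *Server) StorageClassList(rw http.ResponseWriter, req *http.Request) error {
	apiContext := api.GetApiContext(req)
	list, err := s.storageClassList(apiContext)
	if err != nil {
		return err
	}
	apiContext.Write(list)
	return nil
}

// The websocket server calls s.storageClassList(...) directly instead,
// since it doesn't need the HTTP wrapping.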

}

func (oec *ObjectEndpointController) worker() {
for oec.processNextWorkItem() {
Member

Check return code?

Author

for function() {} just keeps calling function in a loop for as long as it returns true.

Member

Yes and it does look simple.

How is error handling modeled here?
What does the bool return mean? End worker execution? An error that processNext.. could not handle occurred? If so, why let worker swallow that error and not bubble that up?

switch endpoint.Status.State {
case longhorn.ObjectEndpointStateStarting, longhorn.ObjectEndpointStateError:
dpl, err := oec.ds.GetDeployment(endpoint.Name)
if err == nil {
Member

Nested logic tends to be harder to understand than necessary; with each level of nesting the reader has to keep more state in mind.
Suggest denesting once (here and in the following):

if err != nil {
   return err
}
if dpl.Status... > 0 {
  return nil
}

@@ -4540,3 +4540,124 @@ func (s *DataStore) RemoveFinalizerForLHVolumeAttachment(va *longhorn.VolumeAtta
func (s *DataStore) DeleteLHVolumeAttachment(vaName string) error {
return s.lhClient.LonghornV1beta2().VolumeAttachments(s.namespace).Delete(context.TODO(), vaName, metav1.DeleteOptions{})
}

// GetObjectEndpointRO is the read-only version of GetObjectEndpoint.
Member

This name is misleading. The return isn't RO; it "suggests" a behavior to the caller. Suggest renaming it mutable, or making it the default and letting the caller make a choice of copy/no copy.

Author

I'm sticking to the naming convention with the rest of the datastore here, where the RO suffix implies that the returned object should be used read-only and no suffix implies a deep copy taking place.

k8s/crds.yaml Outdated
description: An address where the S3 endpoint is exposed.
type: string
state:
description: "The state of the object endpoint as observed by the object endpoint controller. \n The object endpoint implemets a state machine with this, beginning at \"unknown\". Once the object endpoint controller detects the new object endpoint and begins resource creation the state is transformed to \"starting\". The object endpoint controller observes the resources and once all resources are up and ready the state transitions to \"running\". The state remains in \"running\" until a the object endpoint is deleted, at which point the controller will clean up the associated resources. \n │ object endpoint created │ └──►┌───────────┐ │ unknown │ ┌── └───────────┘ │ │ controller creates resources │ └──►┌───────────┐ │ starting │ ┌── └───────────┘ │ │ controller detects all resources ready │ └──►┌───────────┐ │ running │ ┌── └───────────┘ │ │ object endpoint is marked for deletion │ └──►┌───────────┐ │ stopping │ ┌── └───────────┘ │ │ controller waits for dependent resources to disappear before removing │ the finalizer, thereby letting the object endpoint be deleted"
Member

Suggest adding line breaks

Author

This file is generated.

// endpoint and begins resource creation the state is transformed to
// "starting". The object endpoint controller observes the resources and once
// all resources are up and ready the state transitions to "running".
// The state remains in "running" until a the object endpoint is deleted, at
Member

-a

// The state of the object endpoint as observed by the object endpoint
// controller.
//
// The object endpoint implemets a state machine with this, beginning at
Member

Failure states? Starting -> Stopping possible?
Starting -> unknown possible?

"github.com/sirupsen/logrus"
)

func (m *VolumeManager) ListObjectEndpointsSorted() ([]*longhorn.ObjectEndpoint, error) {
Member

Why does sorted deserve its own function? What is special about the sort?
Suggest at least adding to the name what this is sorted by.

Author

The datastore (i.e. the cache of K8s objects) keeps the data as a map of keys to objects. Keys are normally just the names of the K8s objects, but they can be something else as well; they must of course be unique, and they are always string-like data. This means the datastore has a canonical notion of order between two objects, which is just the order of the keys.
To preserve that order when querying the datastore for a list of all object endpoints, the endpoints are put into the return value sorted by their keys.
This is different from just generally requesting a list of object endpoints, as an "unsorted" list function would not guarantee the order of object endpoints in the list to be stable between two invocations.
The function name indicates that the returned list is sorted by "something", implying that the order of object endpoints will be stable between invocations at the cost of the sorting operation.

The overarching goal here is to let the API return consistent results. Keep in mind that there may be multiple instances of the longhorn-manager. Each instance may keep the contents of the datastore laid out differently in memory, and K8s can decide to route different API requests to different instances. So not only does this make the result of the listing API consistent between two consecutive calls to the same longhorn-manager instance, it also makes it consistent between two different instances.
While it's not impossible to remove this guarantee from the API and just let the frontend bring the data into a consistent format, keeping it reduces the probability of 'glitchy' corner cases in the frontend.

I'd be fine with adding a comment, but the name of the function itself is just fine as is. It returns a somehow-sorted list of object endpoints.
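
As a rough sketch of the idea (not the actual datastore code): sorting by cache key before building the result is what keeps the order stable across calls and across manager instances.

func listObjectEndpointsSorted(cache map[string]*longhorn.ObjectEndpoint) []*longhorn.ObjectEndpoint {
	// Collect and sort the cache keys so the result order doesn't depend
	// on Go's randomized map iteration order.
	keys := make([]string, 0, len(cache))
	for k := range cache {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	result := make([]*longhorn.ObjectEndpoint, 0, len(keys))
	for _, k := range keys {
		result = append(result, cache[k])
	}
	return result
}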

Member

The second paragraph raises some questions.
What is a glitchy corner case and how are they handled?
Can duplicates occur in the result?
What if two manager instances disagree? Do they disagree on order or also on entries?

Is this function exported as RPC?

Author

I'm thinking about the likes of the following "glitchiness":
Reloading the same page of the UI repeatedly and the order changing at random. This would e.g. be annoying if you have so many object endpoints that you need pagination to display the list. Then adding or deleting one (implicitly reloading the list) could totally change which endpoints end up on the page you're viewing. With sorting, that shifts at most by one, which is expected as well. The point of making sure this function has as much consistency in its behavior as possible is that the UI then doesn't need to handle inconsistencies, i.e. you don't need to make sure the UI sorts the data before displaying it.

Duplicates cannot occur, as a single request is routed to and handled by one instance only.
Sorting will ensure that the instances don't disagree on order.
The source of truth in all cases is the K8s API. Each longhorn-manager instance keeps an in-memory cache of the objects in the K8s API and updates it itself. This of course means that there may be a brief time window when the managers disagree on entries.

This function is indirectly exported as RPC. It is what backs the "listing" functionality, i.e. when you send a GET request to the /v1/objectendpoints API endpoint, this function will compile the list of object endpoints that will then be marshalled into JSON and put into the response body.

Fix spelling mistakes found by codespell

Signed-off-by: Moritz Röhrich <[email protected]>
Fix unit tests of the object store controller by adding the new states
and testing them.

Signed-off-by: Moritz Röhrich <[email protected]>
- Only initialize the .spec.Image and .spec.uiImage fields of the
  ObjectStore if they don't already contain values
- Test that .spec.Image, .spec.UiImage and .spec.TargetState are
  correctly initialized for new object stores
- Test that new object stores successfully transition into "starting"
  state when first being picked up by the controller

Signed-off-by: Moritz Röhrich <[email protected]>
Deploy ingress for accessing the s3gw UI

Signed-off-by: Moritz Röhrich <[email protected]>
- Fix calling multiple updates within the same reconciliation
- Move the controller to an event driven model, monitoring events on
  resources rather than periodically re-syncing.
- Make only the Longhorn manager which is responsible for the Volume
  responsible for the ObjectStore. This avoids multiple controllers
  posting updates to the ObjectStore resources with conflicting contents

Signed-off-by: Moritz Röhrich <[email protected]>
- Add mutating webhook for initializing the object store
- Add ownership to ensure only one controller modifies an object store
  CR
- Add ingress configurations for S3

Signed-off-by: Moritz Röhrich <[email protected]>
Sync list of tls secrets to UI to enable selecting a secret when
creating an endpoint for an object store

Signed-off-by: Moritz Röhrich <[email protected]>
Fix some problems in the object store controller:
- possible nil pointer dereference in handleStopping function
- more usage of datastore.GetResourceRO functions, in cases where the
  returned object is not modified
- fewer lookups in the enqueue*() functions

Frontend API:
- Add endpoints for listing TLS secrets
- Fix object store creation never having any endpoints

Signed-off-by: Moritz Röhrich <[email protected]>
Rebase on master and fixup broken things

Signed-off-by: Moritz Röhrich <[email protected]>
Fix review remarks

Signed-off-by: Moritz Röhrich <[email protected]>
Remove ingress for the object store UI. This is no longer needed when
using the Longhorn UI's nginx as a reverse proxy.

Fix deleting an object store by removing the finalizer without waiting
on resources to be deleted. Since OwnerReferences are used for automatic
cleanup, the finalizer can be removed immediately.

Signed-off-by: Moritz Röhrich <[email protected]>
Fix websocket controller:
  Relay secret, object store and settings information as expected by
  the UI frontend and as used by the object store tab in the UI

Error handling:
  Remove custom error variables from the object store controller, since
  in Go, errors created with the errors.New function can not be compared
  with the errors.Is function as expected.
  This mechanism is replaced by just wrapping normal errors with
  errors.Wrapf.

Error handling:
  Fix some conditions in checkDeployment. When a deployment is scaled
  up, there is a brief period where it has 0 replicas and 0 unavailable
  replicas. Therefore it is insufficient to just check the count of
  unavailable replicas to make sure the deployment is ready; the total
  count of replicas has to be checked too.

Signed-off-by: Moritz Röhrich <[email protected]>
Expose size and usage information of the longhorn volume associated with
an object store via the internal API to the UI. This allows the Longhorn
UI to receive information about the actual size and free space in an
object store.

Signed-off-by: Moritz Röhrich <[email protected]>
Fix event trigger for PVC events. This had mistakenly been wired up to
call the event handler for Service events, but now it's fixed to call
the right event handler.

Signed-off-by: Moritz Röhrich <[email protected]>
Fix creating a secret when given credentials from the UI. Instead of
overwriting an existing secret or similar antics, generate a name
without a conflict for a new secret to use.

Signed-off-by: Moritz Röhrich <[email protected]>
Fix volume expansion:
- ensure a volume expansion is only tried when the new size is larger
  than the old size
- propagate the new size to the Longhorn volume, thereby expanding it,
  when an object store is updated

Fix update:
- Propagate the container image from the UI to the deployment, thereby
  allowing users to update the s3gw container images from the Longhorn
  UI even for existing object stores and even to newer versions than
  those that are originally shipped

Signed-off-by: Moritz Röhrich <[email protected]>
- Fix s3gw status page container port
- Force setting the `.spec.image` and `.spec.uiImage` properties of the
  object store via webhook on creation
- Don't transition the object store to error state when creating
  resources fails, instead retry
- Check PV and Longhorn Volume health
- Don't check service health, as a K8s Service resource doesn't have any
  status indicating healthiness. It either exists or it doesn't;
  readiness is dealt with on the deployment/pod level

Signed-off-by: Moritz Röhrich <[email protected]>
- Update CRDs, move size property under .spec, rename .spec.storage to
  .spec.volumeParameters
- Improve webhook to make storage attributes immutable if they don't
  also change the backing volume
- Improve error handling in the controller

Signed-off-by: Moritz Röhrich <[email protected]>
Update CRDs
Safe image modifications when updating images

Signed-off-by: Moritz Röhrich <[email protected]>
Fix secret creation with logic that supports reclaiming managed secrets
and errors out when user-created secrets are conflicting with
to-be-created ones

Adapt the kubernetes config map controller to be able to create and
manage multiple storage classes.

Add default settings to allow the kubernetes config map controller to
  manage the stub storage class for the volumes of object stores

Signed-off-by: Moritz Röhrich <[email protected]>
Stop propagating image settings to the object store controller. Applying
default image settings to object stores is handled in the webhooks.
Therefore the object store controller no longer needs access to this
information; besides, it would have to be fetched from the appropriate
setting anyway.

Signed-off-by: Moritz Röhrich <[email protected]>
Fix the race condition on start of a new object store. As long as there
is just the object store and no longhorn volume yet, some controller
needs to be responsible for the object store. Usually it would be the
one responsible for the volume, but this doesn't exist yet and there are
no provisions to track the owner another way. To fix this, just let the
controller on the first node handle it.

Fix error handling by setting the object store state to error only once
and otherwise propagate errors up to the reconcile function.

Signed-off-by: Moritz Röhrich <[email protected]>
Signed-off-by: Moritz Röhrich <[email protected]>
- Adjust container launch options for the s3gw to match the new launch
  options for s3gw v0.23.0 and up

- Clean up access credential secrets when they have been created through
  the longhorn manager and the object store is deleted

Signed-off-by: Moritz Röhrich <[email protected]>
- Fix potential nil pointer dereference when receiving API errors other
  than ErrNotFound

- Export Object Store size specs in the prometheus exporter

- Send the number of deployed object stores to the telemetry endpoint
  (if sending telemetry is enabled)
  longhorn/longhorn#6720

Signed-off-by: Moritz Röhrich <[email protected]>
- Add admission webhook for validating object store resources

- Force foreground cascading deletion of object stores. This ensures
  that the volume and PV are cleaned up by the object store controller
  and resources aren't leaked before removing the finalizer on the
  object store.

- Consolidate starting state. The object store's state is only set to
  starting in the `initializeObjectStore` function, not whenever
  resources are created

- Error conditions: All errors except server-side API errors will cause
  the object store state to become `error`.
  There is only one place in the code where the state can be set to
  `error`, which is at the end of the reconcile function.

- Telemetry setting: Depending on the Longhorn setting allowing for
  sending telemetry, the telemetry for the s3gw is switched on or off.

Signed-off-by: Moritz Röhrich <[email protected]>
- Reorder the health checks during the creation of an object store

When creating an object store, the responsibility for managing the
resources making up that object store rests with the controller that's
also responsible for the longhorn volume.
There is a brief moment while the object store is starting when the
information which controller is responsible for the longhorn volume has
not yet been propagated to all controllers. Therefore the controller
must first wait until the longhorn volume is healthy and has an owner
before it can determine for sure if it's responsible or not. Only then
can the other resources be created without running into race conditions
with other controllers that think they are responsible.

- Clean up utility functions and constants

Signed-off-by: Moritz Röhrich <[email protected]>
Rebase on master and update vendored modules

Fix logic in the test script. To ensure only the necessary unit tests are
executed (or, if nothing has been changed, all unit tests are executed),
the logic needs to make sure not to interpret vendored modules as
changed source, since it will fail to find unit tests for them.

Signed-off-by: Moritz Röhrich <[email protected]>
Simplify deletes by removing preceding gets and dealing with errors.

Signed-off-by: Moritz Röhrich <[email protected]>