Add sabakan-triggered automatic repair functionality
Signed-off-by: morimoto-cybozu <[email protected]>
morimoto-cybozu committed Apr 23, 2024
1 parent c78c22e commit d62ed0f
Showing 22 changed files with 728 additions and 55 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,10 @@ This project employs a versioning scheme described in [RELEASE.md](RELEASE.md#ve

## [Unreleased]

### Added

- Add sabakan-triggered automatic repair functionality in [#725](https://github.com/cybozu-go/cke/pull/725)

## [1.28.0]

### Changed
2 changes: 2 additions & 0 deletions constraints.go
@@ -8,6 +8,7 @@ type Constraints struct {
	MinimumWorkers           int `json:"minimum-workers"`
	MaximumWorkers           int `json:"maximum-workers"`
	RebootMaximumUnreachable int `json:"maximum-unreachable-nodes-for-reboot"`
	MaximumRepairs           int `json:"maximum-repair-queue-entries"`
}

// Check checks the cluster satisfies the constraints
@@ -41,5 +42,6 @@ func DefaultConstraints() *Constraints {
		MinimumWorkers:           1,
		MaximumWorkers:           0,
		RebootMaximumUnreachable: 0,
		MaximumRepairs:           0,
	}
}
30 changes: 28 additions & 2 deletions docs/ckecli.md
@@ -67,6 +67,11 @@ $ ckecli [--config FILE] <subcommand> args...
  - [`ckecli sabakan get-template`](#ckecli-sabakan-get-template)
  - [`ckecli sabakan set-variables FILE`](#ckecli-sabakan-set-variables-file)
  - [`ckecli sabakan get-variables`](#ckecli-sabakan-get-variables)
- [`ckecli auto-repair`](#ckecli-auto-repair)
  - [`ckecli auto-repair enable|disable`](#ckecli-auto-repair-enabledisable)
  - [`ckecli auto-repair is-enabled`](#ckecli-auto-repair-is-enabled)
  - [`ckecli auto-repair set-variables FILE`](#ckecli-auto-repair-set-variables-file)
  - [`ckecli auto-repair get-variables`](#ckecli-auto-repair-get-variables)
- [`ckecli status`](#ckecli-status)

## `ckecli cluster`
@@ -91,6 +96,7 @@ Set a constraint on the cluster configuration.
- `minimum-workers`
- `maximum-workers`
- `maximum-unreachable-nodes-for-reboot`
- `maximum-repair-queue-entries`

### `ckecli constraints show`

@@ -408,12 +414,32 @@ Get the cluster configuration template.

### `ckecli sabakan set-variables FILE`

Set the query variables to search machines in sabakan.
Set the query variables to search available machines in sabakan.
`FILE` should contain JSON as described in [sabakan integration](sabakan-integration.md#variables).

### `ckecli sabakan get-variables`

Get the query variables to search machines in sabakan.
Get the query variables to search available machines in sabakan.

## `ckecli auto-repair`

### `ckecli auto-repair enable|disable`

Enable or disable [sabakan-triggered automatic repair](sabakan-triggered-repair.md).

### `ckecli auto-repair is-enabled`

Show whether sabakan-triggered automatic repair is enabled.
It prints `true` if enabled and `false` otherwise.

### `ckecli auto-repair set-variables FILE`

Set the query variables to search non-healthy machines in sabakan.
`FILE` should contain JSON as described in [sabakan-triggered automatic repair](sabakan-triggered-repair.md#query).

### `ckecli auto-repair get-variables`

Get the query variables to search non-healthy machines in sabakan.

## `ckecli status`

1 change: 1 addition & 0 deletions docs/constraints.md
@@ -12,3 +12,4 @@ Cluster should satisfy these constraints.
| `minimum-workers` | int | 1 | The minimum number of worker nodes |
| `maximum-workers` | int | 0 | The maximum number of worker nodes. 0 means unlimited. |
| `maximum-unreachable-nodes-for-reboot` | int | 0 | The maximum number of unreachable nodes allowed for operating reboot. |
| `maximum-repair-queue-entries` | int | 0 | The maximum number of repair queue entries |
86 changes: 86 additions & 0 deletions docs/sabakan-triggered-repair.md
@@ -0,0 +1,86 @@
Automatic repair triggered by sabakan
=====================================

[Sabakan][sabakan] is management software for server machines in a data center.
It stores the status information of machines as well as their spec information.
By referring to machines' status information in sabakan, CKE can initiate the repair of a non-healthy machine.

This functionality is similar to [sabakan integration](sabakan-integration.md).

How it works
------------

CKE periodically queries sabakan to retrieve machines' status information in a data center.
If CKE finds non-healthy machines, it creates [repair queue entries](repair.md) for those machines.

The fields of a repair queue entry are determined based on the [information of the non-healthy machine](https://github.com/cybozu-go/sabakan/blob/main/docs/machine.md).
* `address`: `.spec.ipv4[0]`
* `machine_type`: `.spec.bmc.type`
* `operation`: `.status.state`
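
The mapping above can be sketched in Go as follows. This is only an illustration: `Machine`, `RepairEntry`, and `newRepairEntry` are hypothetical names, not CKE's or sabakan's actual types, and the sample values in `main` are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a hypothetical, trimmed-down view of a sabakan machine.
// Only the fields named in the list above are modeled.
type Machine struct {
	Spec struct {
		IPv4 []string
		BMC  struct{ Type string }
	}
	Status struct{ State string }
}

// RepairEntry is a hypothetical stand-in for a CKE repair queue entry.
type RepairEntry struct {
	Address     string
	MachineType string
	Operation   string
}

// newRepairEntry copies the three fields from a machine into an entry.
// It returns an error when the machine has no IPv4 address to use.
func newRepairEntry(m Machine) (RepairEntry, error) {
	if len(m.Spec.IPv4) == 0 {
		return RepairEntry{}, errors.New("machine has no IPv4 address")
	}
	return RepairEntry{
		Address:     m.Spec.IPv4[0],  // .spec.ipv4[0]
		MachineType: m.Spec.BMC.Type, // .spec.bmc.type
		Operation:   m.Status.State,  // .status.state
	}, nil
}

func main() {
	var m Machine
	m.Spec.IPv4 = []string{"10.0.0.1"}
	m.Spec.BMC.Type = "example-bmc" // sample value, not a real BMC type
	m.Status.State = "UNHEALTHY"

	e, err := newRepairEntry(m)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", e)
}
```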

Users can configure the query to choose non-healthy machines.
The queries are executed via the sabakan GraphQL [`searchMachines`](https://github.com/cybozu-go/sabakan/blob/master/docs/graphql.md) API.

Query
-----

CKE uses the following GraphQL query to retrieve machine information from sabakan.

```
query ckeSearch($having: MachineParams, $notHaving: MachineParams) {
  searchMachines(having: $having, notHaving: $notHaving) {
    # snip
  }
}
```

The following values are used for `$having` and `$notHaving` variables by default.
Users can change these values by [specifying a JSON object](ckecli.md#ckecli-auto-repair-set-variables-file).

```json
{
    "having": {
        "states": ["UNHEALTHY", "UNREACHABLE"]
    },
    "notHaving": {
        "roles": ["boot"]
    }
}
```

The type of `$having` and `$notHaving` is `MachineParams`.
Consult [GraphQL schema][schema] for the definition of `MachineParams`.
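
As a rough illustration of the shape of these variables, the default JSON object can be decoded with Go's standard `encoding/json` package. The `machineParams` and `queryVariables` types below are hypothetical and model only the two fields used by the defaults; they are not the types CKE itself uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// machineParams is a hypothetical subset of sabakan's MachineParams.
type machineParams struct {
	Roles  []string `json:"roles,omitempty"`
	States []string `json:"states,omitempty"`
}

// queryVariables mirrors the top-level "having"/"notHaving" object.
type queryVariables struct {
	Having    *machineParams `json:"having"`
	NotHaving *machineParams `json:"notHaving"`
}

// defaultVariables is the default JSON object shown above.
const defaultVariables = `{
    "having": {"states": ["UNHEALTHY", "UNREACHABLE"]},
    "notHaving": {"roles": ["boot"]}
}`

// parseVariables decodes a JSON document into queryVariables.
func parseVariables(data []byte) (*queryVariables, error) {
	vars := new(queryVariables)
	if err := json.Unmarshal(data, vars); err != nil {
		return nil, err
	}
	return vars, nil
}

func main() {
	vars, err := parseVariables([]byte(defaultVariables))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("having states:", vars.Having.States)
	fmt.Println("notHaving roles:", vars.NotHaving.Roles)
}
```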

Enqueue limiters
----------------

### Limiter for a single machine

In order not to repeat repair operations too quickly for a single unstable machine, CKE checks recent repair queue entries before enqueueing.
If it finds a recent entry for the machine in question, no matter whether the entry has finished or not, it refrains from creating an additional entry.

CKE considers all persisting queue entries as "recent" for simplicity.
A user should delete a finished repair queue entry for a machine once they consider the machine repaired.
* If a repair queue entry has finished with success and a user considers the machine stable, they should delete the finished entry.
* If a repair queue entry has finished with failure or a user considers the machine unstable, they should repair the machine manually. After the machine gets repaired, they should delete the finished entry.
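
The per-machine check described above amounts to a lookup by address. A minimal sketch, assuming a hypothetical `repairEntry` type that carries only an address and a finished flag (CKE's real entry type and lookup logic may differ):

```go
package main

import "fmt"

// repairEntry is a hypothetical, reduced repair queue entry.
type repairEntry struct {
	Address  string
	Finished bool
}

// hasRecentEntry reports whether the queue already holds an entry for
// the given address. Every persisting entry counts as "recent",
// whether it has finished or not.
func hasRecentEntry(queue []repairEntry, address string) bool {
	for _, e := range queue {
		if e.Address == address {
			return true
		}
	}
	return false
}

func main() {
	queue := []repairEntry{
		{Address: "10.0.0.1", Finished: true},
		{Address: "10.0.0.2", Finished: false},
	}
	for _, addr := range []string{"10.0.0.1", "10.0.0.3"} {
		if hasRecentEntry(queue, addr) {
			fmt.Println(addr, "skipped: an entry already exists")
		} else {
			fmt.Println(addr, "can be enqueued")
		}
	}
}
```

Under this rule, deleting the finished entry for a machine is what allows a later failure report for that machine to be enqueued again.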

### Limiter for a cluster

Sabakan may occasionally report false-positive non-healthy machines.
If CKE trusted every failure report and initiated many repair operations at once, the Kubernetes cluster could get stuck or, worse, be corrupted.

Even when the failure reports are correct, CKE should refrain from repairing too many machines at once.
For example, the failure of many servers might be caused by the temporary power failure of a whole server rack.
In that case, CKE should not mark the machines unrepairable as a result of pointless repair operations.
Once the machines are marked unrepairable, sabakan will delete all data on those machines.

In order not to initiate too many repair operations, CKE adds the number of recent repair queue entries to the number of new failure reports before enqueueing.
If the total is excessive, no matter whether the existing entries have finished or not, it refrains from creating additional entries.

The maximum number of recent repair queue entries and new failure reports is [configurable](ckecli.md#ckecli-constraints-set-name-value) as a [constraint `maximum-repair-queue-entries`](constraints.md).

As stated above, CKE considers all persisting queue entries as "recent" for simplicity.
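
The cluster-wide check can be sketched as a comparison of a sum against the constraint. The function name `allowEnqueue` is hypothetical, and the exact comparison (here, "sum must not exceed the maximum") is an assumption; this document does not state how CKE treats the boundary case or the default value 0.

```go
package main

import "fmt"

// allowEnqueue reports whether new repair entries may be created.
// queueLen is the number of persisting repair queue entries,
// newReports is the number of newly reported non-healthy machines,
// and maxRepairs corresponds to `maximum-repair-queue-entries`.
// Assumption: enqueueing is allowed only while the sum stays within the limit.
func allowEnqueue(queueLen, newReports, maxRepairs int) bool {
	return queueLen+newReports <= maxRepairs
}

func main() {
	const maxRepairs = 3
	fmt.Println(allowEnqueue(1, 1, maxRepairs)) // 2 entries in total, within the limit
	fmt.Println(allowEnqueue(3, 1, maxRepairs)) // 4 entries in total, over the limit
}
```

Note that in this sketch a limit of 0 blocks every enqueue; whether CKE behaves the same way with the default constraint should be confirmed against the implementation.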


[sabakan]: https://github.com/cybozu-go/sabakan
[schema]: https://github.com/cybozu-go/sabakan/blob/master/gql/schema.graphql
9 changes: 9 additions & 0 deletions mtest/ckecli_test.go
@@ -137,4 +137,13 @@ func testCKECLI() {
		ckecliSafe("sabakan", "enable")
		ckecliSafe("sabakan", "get-url")
	})

	It("should invoke auto-repair subcommand successfully", func() {
		ckecliSafe("auto-repair", "is-enabled")
		ckecliSafe("auto-repair", "disable")
		ckecliSafe("auto-repair", "enable")
		f := remoteTempFile(`{"having":{"states":["UNHEALTHY","UNREACHABLE"]},"notHaving":{"roles":["boot"]}}`)
		ckecliSafe("auto-repair", "set-variables", f)
		ckecliSafe("auto-repair", "get-variables")
	})
}
16 changes: 16 additions & 0 deletions pkg/ckecli/cmd/auto_repair.go
@@ -0,0 +1,16 @@
package cmd

import (
	"github.com/spf13/cobra"
)

// autoRepairCmd represents the auto-repair command
var autoRepairCmd = &cobra.Command{
	Use:   "auto-repair",
	Short: "auto-repair subcommand",
	Long:  `auto-repair subcommand`,
}

func init() {
	rootCmd.AddCommand(autoRepairCmd)
}
26 changes: 26 additions & 0 deletions pkg/ckecli/cmd/auto_repair_disable.go
@@ -0,0 +1,26 @@
package cmd

import (
	"context"

	"github.com/cybozu-go/well"
	"github.com/spf13/cobra"
)

var autoRepairDisableCmd = &cobra.Command{
	Use:   "disable",
	Short: "disable sabakan-triggered automatic repair",
	Long:  `Disable sabakan-triggered automatic repair.`,

	RunE: func(cmd *cobra.Command, args []string) error {
		well.Go(func(ctx context.Context) error {
			return storage.EnableAutoRepair(ctx, false)
		})
		well.Stop()
		return well.Wait()
	},
}

func init() {
	autoRepairCmd.AddCommand(autoRepairDisableCmd)
}
26 changes: 26 additions & 0 deletions pkg/ckecli/cmd/auto_repair_enable.go
@@ -0,0 +1,26 @@
package cmd

import (
	"context"

	"github.com/cybozu-go/well"
	"github.com/spf13/cobra"
)

var autoRepairEnableCmd = &cobra.Command{
	Use:   "enable",
	Short: "enable sabakan-triggered automatic repair",
	Long:  `Enable sabakan-triggered automatic repair.`,

	RunE: func(cmd *cobra.Command, args []string) error {
		well.Go(func(ctx context.Context) error {
			return storage.EnableAutoRepair(ctx, true)
		})
		well.Stop()
		return well.Wait()
	},
}

func init() {
	autoRepairCmd.AddCommand(autoRepairEnableCmd)
}
33 changes: 33 additions & 0 deletions pkg/ckecli/cmd/auto_repair_get_variables.go
@@ -0,0 +1,33 @@
package cmd

import (
	"context"
	"os"

	"github.com/cybozu-go/well"
	"github.com/spf13/cobra"
)

// autoRepairGetVariablesCmd represents the "auto-repair get-variables" command
var autoRepairGetVariablesCmd = &cobra.Command{
	Use:   "get-variables",
	Short: "get the query variables to search non-healthy machines in sabakan",
	Long:  `Get the query variables to search non-healthy machines in sabakan.`,

	RunE: func(cmd *cobra.Command, args []string) error {
		well.Go(func(ctx context.Context) error {
			data, err := storage.GetAutoRepairQueryVariables(ctx)
			if err != nil {
				return err
			}
			os.Stdout.Write(data)
			return nil
		})
		well.Stop()
		return well.Wait()
	},
}

func init() {
	autoRepairCmd.AddCommand(autoRepairGetVariablesCmd)
}
32 changes: 32 additions & 0 deletions pkg/ckecli/cmd/auto_repair_is_enabled.go
@@ -0,0 +1,32 @@
package cmd

import (
	"context"
	"fmt"

	"github.com/cybozu-go/well"
	"github.com/spf13/cobra"
)

var autoRepairIsEnabledCmd = &cobra.Command{
	Use:   "is-enabled",
	Short: "show sabakan-triggered automatic repair status",
	Long:  `Show whether sabakan-triggered automatic repair is enabled or not. "true" if enabled.`,

	RunE: func(cmd *cobra.Command, args []string) error {
		well.Go(func(ctx context.Context) error {
			disabled, err := storage.IsAutoRepairDisabled(ctx)
			if err != nil {
				return err
			}
			fmt.Println(!disabled)
			return nil
		})
		well.Stop()
		return well.Wait()
	},
}

func init() {
	autoRepairCmd.AddCommand(autoRepairIsEnabledCmd)
}
61 changes: 61 additions & 0 deletions pkg/ckecli/cmd/auto_repair_set_variables.go
@@ -0,0 +1,61 @@
package cmd

import (
	"context"
	"encoding/json"
	"os"

	"github.com/cybozu-go/cke/sabakan"
	"github.com/cybozu-go/well"
	"github.com/spf13/cobra"
)

// autoRepairSetVariablesCmd represents the "auto-repair set-variables" command
var autoRepairSetVariablesCmd = &cobra.Command{
	Use:   "set-variables FILE",
	Short: "set the query variables to search non-healthy machines in sabakan",
Long: `Set the query variables to search non-healthy machines in sabakan.
FILE should contain a JSON object like this:
{
"having": {
"labels": [{"name": "foo", "value": "bar"}],
"racks": [0, 1, 2],
"roles": ["worker"],
"states": ["UNREACHABLE"],
"minDaysBeforeRetire": 90
},
"notHaving": {
}
}
`,

	Args: cobra.ExactArgs(1),
	RunE: func(cmd *cobra.Command, args []string) error {
		data, err := os.ReadFile(args[0])
		if err != nil {
			return err
		}

		vars := new(sabakan.QueryVariables)
		err = json.Unmarshal(data, vars)
		if err != nil {
			return err
		}
		err = vars.IsValid()
		if err != nil {
			return err
		}

		well.Go(func(ctx context.Context) error {
			return storage.SetAutoRepairQueryVariables(ctx, string(data))
		})
		well.Stop()
		return well.Wait()
	},
}

func init() {
	autoRepairCmd.AddCommand(autoRepairSetVariablesCmd)
}