Add sabakan-triggered automatic repair functionality (backport of #725)
Signed-off-by: morimoto-cybozu <[email protected]>
1 parent 73da2e0, commit 99bbf7b
Showing 24 changed files with 731 additions and 57 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
Automatic repair triggered by sabakan | ||
===================================== | ||
|
||
[Sabakan][sabakan] is management software for server machines in a data center. | ||
It stores the status information of machines as well as their spec information. | ||
By referring to machines' status information in sabakan, CKE can initiate the repair of a non-healthy machine. | ||
|
||
This functionality is similar to [sabakan integration](sabakan-integration.md). | ||
|
||
How it works | ||
------------ | ||
|
||
CKE periodically queries sabakan to retrieve machines' status information in a data center. | ||
If CKE finds non-healthy machines, it creates [repair queue entries](repair.md) for those machines. | ||
|
||
The fields of a repair queue entry are determined based on the [information of the non-healthy machine](https://github.com/cybozu-go/sabakan/blob/main/docs/machine.md). | ||
* `address`: `.spec.ipv4[0]` | ||
* `machine_type`: `.spec.bmc.type` | ||
* `operation`: `.status.state` | ||
|
||
Users can configure the query variables used to select non-healthy machines. | ||
The queries are executed via sabakan's [GraphQL `searchMachines`](https://github.com/cybozu-go/sabakan/blob/master/docs/graphql.md) API. | ||
|
||
Query | ||
----- | ||
|
||
CKE uses the following GraphQL query to retrieve machine information from sabakan. | ||
|
||
``` | ||
query ckeSearch($having: MachineParams, $notHaving: MachineParams) { | ||
searchMachines(having: $having, notHaving: $notHaving) { | ||
# snip | ||
} | ||
} | ||
``` | ||
|
||
The following values are used for `$having` and `$notHaving` variables by default. | ||
Users can change these values by [specifying a JSON object](ckecli.md#ckecli-auto-repair-set-variables-file). | ||
|
||
```json | ||
{ | ||
"having": { | ||
"states": ["UNHEALTHY", "UNREACHABLE"] | ||
}, | ||
"notHaving": { | ||
"roles": ["boot"] | ||
} | ||
} | ||
``` | ||
|
||
The type of `$having` and `$notHaving` is `MachineParams`. | ||
Consult [GraphQL schema][schema] for the definition of `MachineParams`. | ||
|
||
Enqueue limiters | ||
---------------- | ||
|
||
### Limiter for a single machine | ||
|
||
In order not to repeat repair operations too quickly for a single unstable machine, CKE checks recent repair queue entries before enqueueing. | ||
If it finds a recent entry for the machine in question, no matter whether the entry has finished or not, it refrains from creating an additional entry. | ||
|
||
For simplicity, CKE treats every queue entry that still exists as "recent", regardless of its age. | ||
Users should therefore delete a finished repair queue entry once they consider the machine repaired. | ||
* If a repair queue entry has finished with success and a user considers the machine stable, they should delete the finished entry. | ||
* If a repair queue entry has finished with failure or a user considers the machine unstable, they should repair the machine manually. After the machine gets repaired, they should delete the finished entry. | ||
|
||
### Limiter for a cluster | ||
|
||
Sabakan may occasionally report false-positive non-healthy machines. | ||
If CKE believes all of the failure reports and initiates a lot of repair operations, the Kubernetes cluster will be stuck -- or worse, corrupted. | ||
|
||
Even when the failure reports are correct, CKE should refrain from repairing too many machines at once. | ||
For example, the failure of many servers might be caused by the temporary power failure of a whole server rack. | ||
In that case, CKE should not mark the machines unrepairable as a result of pointless repair operations. | ||
Once the machines are marked unrepairable, sabakan will delete all data on those machines. | ||
|
||
To avoid initiating too many repair operations, CKE adds the number of recent repair queue entries to the number of new failure reports before enqueueing. | ||
If the total is excessive, it refrains from creating additional entries, regardless of whether the existing entries have finished. | ||
|
||
The maximum number of recent repair queue entries and new failure reports is [configurable](ckecli.md#ckecli-constraints-set-name-value) as a [constraint `maximum-repair-queue-entries`](constraints.md). | ||
|
||
As stated above, CKE considers all persisting queue entries as "recent" for simplicity. | ||
|
||
|
||
[sabakan]: https://github.com/cybozu-go/sabakan | ||
[schema]: https://github.com/cybozu-go/sabakan/blob/main/gql/graph/schema.graphqls |
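The field mapping and the two enqueue limiters described in the document above can be sketched in Go as follows. This is only an illustration, not CKE's implementation: the `Machine` and `RepairEntry` types and the `newEntry` and `selectForRepair` functions are hypothetical names. The sketch also makes three assumptions: machines are matched by address, "new failure reports" means reported machines that have no persisting entry, and nothing is enqueued when the total exceeds the limit.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a hypothetical, trimmed-down view of a sabakan machine.
type Machine struct {
	IPv4    []string // .spec.ipv4
	BMCType string   // .spec.bmc.type
	State   string   // .status.state
}

// RepairEntry is a hypothetical stand-in for a CKE repair queue entry.
type RepairEntry struct {
	Address     string
	MachineType string
	Operation   string
}

// newEntry applies the field mapping listed in the document.
func newEntry(m Machine) (RepairEntry, error) {
	if len(m.IPv4) == 0 {
		return RepairEntry{}, errors.New("machine has no IPv4 address")
	}
	return RepairEntry{
		Address:     m.IPv4[0],
		MachineType: m.BMCType,
		Operation:   m.State,
	}, nil
}

// selectForRepair returns the entries to enqueue, given the persisting
// queue entries, the machines reported as non-healthy, and the limit
// (assumed to correspond to maximum-repair-queue-entries).
func selectForRepair(existing []RepairEntry, reported []Machine, limit int) []RepairEntry {
	// Limiter for a single machine: every persisting entry counts as
	// "recent", whether it has finished or not.
	seen := make(map[string]bool, len(existing))
	for _, e := range existing {
		seen[e.Address] = true
	}

	var fresh []RepairEntry
	for _, m := range reported {
		e, err := newEntry(m)
		if err != nil || seen[e.Address] {
			continue
		}
		seen[e.Address] = true
		fresh = append(fresh, e)
	}

	// Limiter for a cluster (assumption: enqueue nothing when the sum of
	// persisting entries and new reports exceeds the limit).
	if len(existing)+len(fresh) > limit {
		return nil
	}
	return fresh
}

func main() {
	existing := []RepairEntry{{Address: "10.0.0.1", MachineType: "example-bmc", Operation: "unreachable"}}
	reported := []Machine{
		{IPv4: []string{"10.0.0.1"}, BMCType: "example-bmc", State: "unreachable"},
		{IPv4: []string{"10.0.0.2"}, BMCType: "example-bmc", State: "unhealthy"},
	}
	for _, e := range selectForRepair(existing, reported, 3) {
		fmt.Printf("enqueue %s (%s, %s)\n", e.Address, e.MachineType, e.Operation)
	}
}
```

Because a machine with a persisting entry is skipped outright, deleting that finished entry is what makes the machine eligible for automatic repair again, which matches the guidance in the single-machine limiter section.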
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
package cmd | ||
|
||
import ( | ||
"github.com/spf13/cobra" | ||
) | ||
|
||
// autoRepairCmd represents the auto-repair command | ||
var autoRepairCmd = &cobra.Command{ | ||
Use: "auto-repair", | ||
Short: "auto-repair subcommand", | ||
Long: `auto-repair subcommand`, | ||
} | ||
|
||
func init() { | ||
rootCmd.AddCommand(autoRepairCmd) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
package cmd | ||
|
||
import ( | ||
"context" | ||
|
||
"github.com/cybozu-go/well" | ||
"github.com/spf13/cobra" | ||
) | ||
|
||
// autoRepairDisableCmd represents the "auto-repair disable" command | ||
var autoRepairDisableCmd = &cobra.Command{ | ||
Use: "disable", | ||
Short: "disable sabakan-triggered automatic repair", | ||
Long: `Disable sabakan-triggered automatic repair.`, | ||
|
||
RunE: func(cmd *cobra.Command, args []string) error { | ||
well.Go(func(ctx context.Context) error { | ||
return storage.EnableAutoRepair(ctx, false) | ||
}) | ||
well.Stop() | ||
return well.Wait() | ||
}, | ||
} | ||
|
||
func init() { | ||
autoRepairCmd.AddCommand(autoRepairDisableCmd) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
package cmd | ||
|
||
import ( | ||
"context" | ||
|
||
"github.com/cybozu-go/well" | ||
"github.com/spf13/cobra" | ||
) | ||
|
||
// autoRepairEnableCmd represents the "auto-repair enable" command | ||
var autoRepairEnableCmd = &cobra.Command{ | ||
Use: "enable", | ||
Short: "enable sabakan-triggered automatic repair", | ||
Long: `Enable sabakan-triggered automatic repair.`, | ||
|
||
RunE: func(cmd *cobra.Command, args []string) error { | ||
well.Go(func(ctx context.Context) error { | ||
return storage.EnableAutoRepair(ctx, true) | ||
}) | ||
well.Stop() | ||
return well.Wait() | ||
}, | ||
} | ||
|
||
func init() { | ||
autoRepairCmd.AddCommand(autoRepairEnableCmd) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
package cmd | ||
|
||
import ( | ||
"context" | ||
"os" | ||
|
||
"github.com/cybozu-go/well" | ||
"github.com/spf13/cobra" | ||
) | ||
|
||
// autoRepairGetVariablesCmd represents the "auto-repair get-variables" command | ||
var autoRepairGetVariablesCmd = &cobra.Command{ | ||
Use: "get-variables", | ||
Short: "get the query variables to search non-healthy machines in sabakan", | ||
Long: `Get the query variables to search non-healthy machines in sabakan.`, | ||
|
||
RunE: func(cmd *cobra.Command, args []string) error { | ||
well.Go(func(ctx context.Context) error { | ||
data, err := storage.GetAutoRepairQueryVariables(ctx) | ||
if err != nil { | ||
return err | ||
} | ||
_, err = os.Stdout.Write(data) | ||
return err | ||
}) | ||
well.Stop() | ||
return well.Wait() | ||
}, | ||
} | ||
|
||
func init() { | ||
autoRepairCmd.AddCommand(autoRepairGetVariablesCmd) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
package cmd | ||
|
||
import ( | ||
"context" | ||
"fmt" | ||
|
||
"github.com/cybozu-go/well" | ||
"github.com/spf13/cobra" | ||
) | ||
|
||
// autoRepairIsEnabledCmd represents the "auto-repair is-enabled" command | ||
var autoRepairIsEnabledCmd = &cobra.Command{ | ||
Use: "is-enabled", | ||
Short: "show sabakan-triggered automatic repair status", | ||
Long: `Show whether sabakan-triggered automatic repair is enabled or not. "true" if enabled.`, | ||
|
||
RunE: func(cmd *cobra.Command, args []string) error { | ||
well.Go(func(ctx context.Context) error { | ||
disabled, err := storage.IsAutoRepairDisabled(ctx) | ||
if err != nil { | ||
return err | ||
} | ||
fmt.Println(!disabled) | ||
return nil | ||
}) | ||
well.Stop() | ||
return well.Wait() | ||
}, | ||
} | ||
|
||
func init() { | ||
autoRepairCmd.AddCommand(autoRepairIsEnabledCmd) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
package cmd | ||
|
||
import ( | ||
"context" | ||
"encoding/json" | ||
"os" | ||
|
||
"github.com/cybozu-go/cke/sabakan" | ||
"github.com/cybozu-go/well" | ||
"github.com/spf13/cobra" | ||
) | ||
|
||
// autoRepairSetVariablesCmd represents the "auto-repair set-variables" command | ||
var autoRepairSetVariablesCmd = &cobra.Command{ | ||
Use: "set-variables FILE", | ||
Short: "set the query variables to search non-healthy machines in sabakan", | ||
Long: `Set the query variables to search non-healthy machines in sabakan. | ||
FILE should contain a JSON object like this: | ||
{ | ||
"having": { | ||
"labels": [{"name": "foo", "value": "bar"}], | ||
"racks": [0, 1, 2], | ||
"roles": ["worker"], | ||
"states": ["UNREACHABLE"], | ||
"minDaysBeforeRetire": 90 | ||
}, | ||
"notHaving": { | ||
} | ||
} | ||
`, | ||
|
||
Args: cobra.ExactArgs(1), | ||
RunE: func(cmd *cobra.Command, args []string) error { | ||
data, err := os.ReadFile(args[0]) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
vars := new(sabakan.QueryVariables) | ||
err = json.Unmarshal(data, vars) | ||
if err != nil { | ||
return err | ||
} | ||
err = vars.IsValid() | ||
if err != nil { | ||
return err | ||
} | ||
|
||
well.Go(func(ctx context.Context) error { | ||
return storage.SetAutoRepairQueryVariables(ctx, string(data)) | ||
}) | ||
well.Stop() | ||
return well.Wait() | ||
}, | ||
} | ||
|
||
func init() { | ||
autoRepairCmd.AddCommand(autoRepairSetVariablesCmd) | ||
} |
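The shape of the FILE accepted by `set-variables` can be illustrated with a small, self-contained decoder. This is a sketch under assumptions, not the actual `sabakan.QueryVariables` type or its `IsValid` method, neither of which appears in this diff. The `queryVariables`, `machineParams` and `label` types are hypothetical mirrors limited to the fields shown in the help text, and strict decoding stands in for whatever validation CKE really performs.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// label is a hypothetical mirror of one entry in "labels".
type label struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// machineParams is a hypothetical subset of sabakan's MachineParams,
// covering only the fields shown in the set-variables help text.
type machineParams struct {
	Labels              []label  `json:"labels,omitempty"`
	Racks               []int    `json:"racks,omitempty"`
	Roles               []string `json:"roles,omitempty"`
	States              []string `json:"states,omitempty"`
	MinDaysBeforeRetire *int     `json:"minDaysBeforeRetire,omitempty"`
}

// queryVariables holds the $having and $notHaving GraphQL variables.
type queryVariables struct {
	Having    *machineParams `json:"having"`
	NotHaving *machineParams `json:"notHaving"`
}

// parseVariables decodes the JSON object strictly, so a misspelled key
// is reported as an error instead of being silently ignored.
func parseVariables(data []byte) (*queryVariables, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	v := new(queryVariables)
	if err := dec.Decode(v); err != nil {
		return nil, err
	}
	return v, nil
}

func main() {
	// The default variables given in the documentation.
	data := []byte(`{
		"having": {"states": ["UNHEALTHY", "UNREACHABLE"]},
		"notHaving": {"roles": ["boot"]}
	}`)
	v, err := parseVariables(data)
	if err != nil {
		fmt.Println("invalid variables:", err)
		return
	}
	fmt.Println("states:", v.Having.States, "excluded roles:", v.NotHaving.Roles)
}
```

Rejecting unknown keys is a design choice of this sketch: a typo such as `state` instead of `states` would otherwise produce an empty filter rather than an error.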