Roadmap for CI Setup Improvements #5136
Labels
Category: Build/CI
Requests, Issues and Changes targeting gradle, groovy, Jenkins, etc.
Category: Doc
Requests, Issues and Changes targeting javadoc and module documentation
Size: L
Very big effort likely requiring a lot of research and work in many areas across the codebase
Status: Needs Discussion
Requires help discussing a reported issue or provided PR
Topic: Stabilization
Requests, Issues and Changes related to improving stability and reducing flakiness
Type: Improvement
Request for or addition/enhancement of a feature
Milestone
Motivation
Problems with our current/previous CI setup include high cost, complexity, artifactory downtimes, and lack of reproducibility.
Cost
Currently, CI workers cannot be scaled down below two, even though contributor activity is so low that on most days of the week we don't need even a single one.
The available resources for CI runs should be small by default to save cost as long as we don't have a lot of activity.
In times of high activity (e.g. during peak times or bigger efforts spanning a lot of repos / PRs), the setup should scale automatically, or privileged contributors should at least be able to enable a higher scale of available resources.
Complexity
The different CI jobs are unnecessarily entwined and complex, including e.g. copying around the build harness and other artifacts instead of publishing them to and consuming them from artifactory. Even for long-time contributors, the CI setup is hard to understand, debug, and fix. Often we need to wait for @Cervator to find time to resolve an issue, update configuration, etc.
Individual CI jobs should be independent of each other and use artifactory as the source of truth. Aside from test results, which don't need to be persisted long-term, all (build) artifacts should be published to and consumed from artifactory. Job contracts (required inputs, expected outputs) should be documented and supported with architecture and data flow diagrams to make the CI setup easier to understand and maintain for everyone. This will allow us to distribute the work better and react faster in case of issues.
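To make the idea of job contracts more concrete, here is a minimal sketch in Python. It describes each job only by the artifacts it consumes and publishes, and checks that every input is published by some job or is known to be external. The class name, the function name, and the artifact coordinates in the usage note are hypothetical and only illustrate the idea; they do not describe our actual jobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobContract:
    """Hypothetical description of a CI job: what it consumes and what it publishes."""
    name: str
    inputs: frozenset = frozenset()
    outputs: frozenset = frozenset()


def unresolved_inputs(jobs, external=frozenset()):
    """Map each job name to the inputs that no job publishes and that are not external."""
    produced = set(external)
    for job in jobs:
        produced |= job.outputs
    missing = {}
    for job in jobs:
        gap = job.inputs - produced
        if gap:
            missing[job.name] = sorted(gap)
    return missing
```

For example, with an "engine" job publishing `example:engine` and a "module" job consuming `example:engine` and `example:build-harness`, `unresolved_inputs` would report `example:build-harness` as missing for "module" until some job publishes it or it is declared external. A check like this could run alongside the documentation so that the documented contracts and the diagrams do not silently drift apart.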
Artifactory Downtimes
In the past, artifactory went down every few weeks/months (depending on activity), IIRC mostly due to out-of-space or out-of-memory issues. This negatively impacts active contributors building and testing locally, as well as CI runs consuming artifacts from or publishing artifacts to artifactory. While "old" contributors can often rely to a degree on cached information, artifactory downtimes hit new contributors much harder because they have no such cache.
Artifactory as the source of truth should be as stable and highly available as possible to avoid blocking contributors old or new. Space issues should be mitigated by adding more capacity or by archiving or rotating artifacts. Periodic health checks should verify that artifactory is available and at least attempt to restart it if it's not.
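A periodic health check with a restart attempt could look roughly like the following Python sketch. It is only an illustration: the probe URL and the restart action are assumptions that depend on how artifactory is hosted, and both function names are made up for this example. Artifactory's REST API offers a system ping endpoint that might serve as the probe target, but that should be verified against the version we run.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP error statuses, and malformed URLs
        # all count as "not healthy".
        return False


def check_and_restart(probe: Callable[[], bool],
                      restart: Callable[[], None],
                      max_restarts: int = 1) -> str:
    """Probe once; if the service is down, try restarting up to max_restarts times."""
    if probe():
        return "healthy"
    for _ in range(max_restarts):
        restart()  # hypothetical hook, e.g. restarting a service or container
        if probe():
            return "restarted"
    return "down"
```

A scheduled job could call `check_and_restart(lambda: is_healthy(ping_url), restart_artifactory)` and notify maintainers whenever the result is "down", where `ping_url` and `restart_artifactory` are placeholders for whatever the hosting setup provides.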
Lack of Reproducibility
Due to the complexity of our CI setup, especially interdependencies between jobs, as well as custom logic in Jenkinsfiles, CI runs are currently hard to reproduce for developers. Building a release in particular is currently not actionable locally due to the release tag not including information on the included module status (last commit at release time). In addition, the lack of proper (read: non-SNAPSHOT) versioning and BoM information across omega makes releases irreproducible.
In addition to reducing complexity, instead of maintaining a lot of logic in the Jenkinsfiles, this logic should be maintained as gradle tasks where possible such that it can be locally reproduced easily by developers. Furthermore, a workspace pinning mechanism similar to @skaldarnar's NodeGooey would already help to more easily reproduce issues of other contributors or (omega) releases by providing a kind of BoM.
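As a rough illustration of what workspace pinning could provide, the following Python sketch records the checked-out commit of each module in a small manifest and reports where a workspace differs from it. This is not how NodeGooey works; the manifest layout and all function names are assumptions made for this example.

```python
import json
import subprocess
from pathlib import Path


def git_head(repo: Path) -> str:
    """Return the commit hash currently checked out in the given git repository."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def write_manifest(commits: dict) -> str:
    """Serialize a module-name -> commit-hash mapping as a stable JSON manifest."""
    return json.dumps({"modules": commits}, indent=2, sort_keys=True)


def drift(manifest_json: str, current: dict) -> dict:
    """Return {module: (pinned, current)} for every module that differs from the manifest."""
    pinned = json.loads(manifest_json)["modules"]
    changed = {}
    for name in sorted(set(pinned) | set(current)):
        if pinned.get(name) != current.get(name):
            changed[name] = (pinned.get(name), current.get(name))
    return changed
```

A release job could build the mapping by calling `git_head` on each module directory and attach the resulting manifest to the release tag. A contributor could then compare their own workspace against it with `drift` before trying to reproduce an issue; a module missing on either side shows up with `None` in the corresponding position.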
Proposal
Concerns
Task Breakdown
tbd
Additional notes
Current CI Setup
Desired CI Setup