Document Service Catalogs + AI (#352)
Documents new functionality under our service catalog + AI.
michaeljguarino authored Dec 31, 2024
1 parent 5c311af commit 18653ed
Showing 11 changed files with 354 additions and 233 deletions.
29 changes: 29 additions & 0 deletions pages/ai/architecture.md
---
title: Plural AI Architecture
description: How Plural AI Works
---

## Overview

At its core, Plural AI has three main components:

* A causal graph of the high-level objects that define your infrastructure. For example, a Plural Service owns a Kubernetes Deployment, which owns a ReplicaSet, which owns a set of Pods.
* A permission engine that ensures any set of objects within the graph can only be interacted with by users authorized to see them. This hardens the governance process around access to our AI's completions. The presence of Plural's agent in your Kubernetes fleet also makes querying end clusters much more secure from a networking perspective.
* Our PR Automation framework - this allows us to hook into SCM providers and automate code fixes in a reviewable, safe way.

In the parlance of the AI industry, you can think of it as highly advanced RAG (retrieval-augmented generation) with agent-like behavior, since it's always on and triggered by any emergent issue in your infrastructure.
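To make the first two components concrete, here's a minimal sketch of an ownership graph with permission-based pruning. All names and data shapes here are illustrative, not Plural's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One object in the causal graph, e.g. a Service or a Deployment."""
    kind: str
    name: str
    children: list = field(default_factory=list)

def visible_subgraph(node, allowed_kinds):
    """Prune the graph down to objects the current user may read.

    `allowed_kinds` stands in for the permission engine; children of a
    hidden node are dropped too, so AI completions never see objects the
    user isn't entitled to.
    """
    if node.kind not in allowed_kinds:
        return None
    kept = [v for v in (visible_subgraph(c, allowed_kinds) for c in node.children) if v]
    return Node(node.kind, node.name, kept)

# Service -> Deployment -> ReplicaSet -> Pods, as in the example above.
graph = Node("Service", "api", [
    Node("Deployment", "api", [
        Node("ReplicaSet", "api-7d9", [Node("Pod", "api-7d9-x1"), Node("Pod", "api-7d9-x2")]),
    ]),
])

filtered = visible_subgraph(graph, {"Service", "Deployment"})
```

Here, a user entitled to read only Services and Deployments gets a graph with the ReplicaSet and Pods pruned away before anything reaches a prompt.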

## In Detail

Here's a detailed walkthrough of how the AI engine works in the case of a Plural Service with a failing Kubernetes deployment.

1. The engine is told the service is failing by our internal event bus
2. The failing components of that service are collected, with the failing deployment selected first
3. The metadata of the service is added to the prompt (what cluster it's on, how it's sourcing its configuration, etc.)
4. The events, ReplicaSets, and spec of the Kubernetes deployment are queried and added to the prompt
5. The failing pods for the deployment are selected from the ReplicaSets, and a random subset are queried individually
6. Each failing pod's events and spec are added to the growing prompt

This then crafts an insight for the Deployment node, which can be combined with insights from any other components and rolled up into a service-level insight.
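The numbered steps above amount to an evidence-gathering pass that assembles a prompt layer by layer. A rough sketch, with hypothetical function names and data shapes:

```python
import random

def build_prompt(service, max_pods=3):
    """Assemble an investigation prompt for a failing service.

    Mirrors the walkthrough: service metadata first, then each failing
    deployment's spec and events, then a random subset of failing pods
    so the prompt stays within token limits.
    """
    prompt = [f"Service {service['name']} on cluster {service['cluster']} is failing."]
    for dep in service["failing_deployments"]:
        prompt.append(f"Deployment spec: {dep['spec']}")
        prompt.append(f"Events: {dep['events']}")
        failing = [p for p in dep["pods"] if p["status"] != "Running"]
        for pod in random.sample(failing, min(max_pods, len(failing))):
            prompt.append(f"Pod {pod['name']}: {pod['events']}")
    return "\n".join(prompt)

service = {
    "name": "api", "cluster": "prod",
    "failing_deployments": [{
        "spec": {"replicas": 3},
        "events": ["BackOff"],
        "pods": [
            {"name": "api-1", "status": "CrashLoopBackOff", "events": ["OOMKilled"]},
            {"name": "api-2", "status": "Running", "events": []},
        ],
    }],
}
print(build_prompt(service))
```

Note how the healthy pod never makes it into the prompt; only failing evidence is spent on tokens.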

If this investigation were run again, we'd be able to reuse any non-stale cached insights and avoid rerunning inference where it's unnecessary.
24 changes: 24 additions & 0 deletions pages/ai/cost.md
---
title: Plural AI Cost Analysis
description: How much will Plural AI cost me?
---

Plural AI is built to be extremely cost efficient. We've found it will often cost far less than the spend on advanced APM tools, and likely even DIY Prometheus setups. That said, AI inference is not cheap in general, and we do a number of things to work around that:

* Our causal knowledge graph caches heavily at each layer of the tree. This ensures repeated attempts to generate the same insight are deduplicated, reducing inference API calls dramatically.
* You can split the model used by use case. Insight generation can leverage cheap, fast models, whereas the tool calls that ultimately generate PRs use smarter, more advanced models but are executed less frequently, so the cost is less pronounced.
* We use AI sparingly. Inference is only done when we know something is wrong.
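The deduplication in the first bullet can be illustrated with a small sketch (hypothetical names; the real engine caches per node of the knowledge graph):

```python
import time

class InsightCache:
    """Cache insights per graph node, keyed by a hash of the evidence.

    If the evidence hasn't changed and the entry isn't stale, the cached
    insight is returned and no inference API call is made.
    """
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # node_id -> (evidence_hash, insight, timestamp)
        self.inference_calls = 0

    def get_insight(self, node_id, evidence, run_inference):
        key = hash(evidence)
        cached = self.entries.get(node_id)
        if cached and cached[0] == key and time.time() - cached[2] < self.ttl:
            return cached[1]  # fresh hit: skip inference entirely
        self.inference_calls += 1
        insight = run_inference(evidence)
        self.entries[node_id] = (key, insight, time.time())
        return insight

cache = InsightCache()
fake_llm = lambda evidence: f"insight for {evidence}"
cache.get_insight("deploy/api", "CrashLoopBackOff", fake_llm)
cache.get_insight("deploy/api", "CrashLoopBackOff", fake_llm)  # cache hit, no new call
```

Only a change in the underlying evidence (or a stale TTL) triggers a new inference call for the same node.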

That said, what does that actually mean?

## Basic Cost Analysis

We at Plural dogfood our own AI functionality in our own infrastructure. This includes a sandbox test fleet of over 10 clusters and a production fleet of around 5 clusters for both our main services and Plural Cloud. Plural's AI engine has run on the management clusters for each of these domains since launch, and while we might only do a decent-ish job of caretaking those environments, our current OpenAI bill is roughly $2.64 per day, or about $81 per month.
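As a quick sanity check on those numbers:

```python
daily_spend = 2.64                # observed daily OpenAI bill, USD
monthly = daily_spend * 365 / 12  # scale by average month length
# lands at roughly $80, in line with the ~$81/month figure above
```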

This is staggeringly cost effective when you consider that a Datadog bill for our equivalent infrastructure would be at minimum $10k, and even a Prometheus setup is well over $100/mo for the necessary compute, including the datastore, Grafana, Grafana's database, load balancers, and agents. Granted, some of these services will ultimately be necessary for Plural AI to reach its full potential, but we could see a world where:

```sh
OpenTelemetry + Plural AI >> Datadog/New Relic
```

as a general debugging platform, while costing a minuscule fraction of the current alternatives.
25 changes: 25 additions & 0 deletions pages/ai/overview.md
---
title: Plural AI
description: Plural's AI Engine Removes the Gruntwork from Infrastructure
---

{% callout severity="info" %}
If you'd rather skip the text and see it in action, jump to the [demo video](/ai/overview#demo-video)
{% /callout %}

Managing infrastructure is full of mind-numbing tasks, from troubleshooting the same misconfiguration for the hundredth time, to whack-a-moling Datadog alerts, to playing internal IT support to application developers who cannot be bothered to learn the basics of foundational technology like Kubernetes. Plural AI allows you to outsource all those time-sucks to LLMs so you can focus on building value-added platforms for your enterprise.

In particular, Plural AI has a few differentiators to its approach:

* A bring-your-own-LLM model - use the LLM already approved by your enterprise and don't worry about us as a man-in-the-middle
* An always-on troubleshooting engine - takes signals from failed Kubernetes services, failed Terraform runs, and other misfires in your infrastructure to run a consistent investigative process and summarize the results. Eliminate manual digging and just fix the issue instead.
* Automated fixes - take any insight from our troubleshooting engine and generate a fix PR automatically, powered by our ability to introspect the GitOps code defining that piece of infrastructure.
* AI explanation - complex or domain-specific pages can be explained with one click, eliminating internal support burdens for engineers.
* AI chat - any workflow above can be further refined or expanded in a full ChatGPT-like experience. Paste additional context into chats automatically, or generate PRs once you and the AI have found the fix.


# Demo Video

To see this all in action, feel free to browse our live demo video on YouTube of our GenAI integration:

{% embed url="https://youtu.be/LxurfPikps8" aspectRatio="16 / 9" /%}
60 changes: 60 additions & 0 deletions pages/ai/setup.md
---
title: Setup Plural AI
description: How to configure Plural AI
---

Plural AI can easily be configured via the `DeploymentSettings` CRD or at `/settings/global/ai-provider` in your Plural Console instance. An example `DeploymentSettings` config is below:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: DeploymentSettings
metadata:
  name: global
  namespace: plrl-deploy-operator
spec:
  managementRepo: pluralsh/plrl-boot-aws

  ai:
    enabled: true
    provider: OPENAI
    anthropic: # example anthropic config
      model: claude-3-5-sonnet-latest
      tokenSecretRef:
        name: ai-config
        key: anthropic

    openAI: # example openai config
      tokenSecretRef:
        name: ai-config
        key: openai

    vertex: # example VertexAI config
      project: pluralsh-test-384515
      location: us-east1
      model: gemini-1.5-pro-002
      serviceAccountJsonSecretRef:
        name: ai-config
        key: vertex
```
You can see the full schema at our [Operator API Reference](/deployments/operator/api#deploymentsettings).

In all these cases, you need to create an additional secret in the `plrl-deploy-operator` namespace to hold API keys and auth secrets. It would look something like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ai-config
  namespace: plrl-deploy-operator
stringData:
  vertex: <service account json string>
  openai: <access-token>
  anthropic: <access-token>
```

{% callout severity="warn" %}
Be sure not to commit this secret resource into your Git repository in plain-text, as that will result in a git secret exposure.

Plural provides a number of mechanisms to manage secrets, or you can use the established patterns within your engineering organization.
{% /callout %}
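As an alternative to hand-writing the manifest, the same secret can be created imperatively and kept out of Git entirely. A sketch, assuming `kubectl` is pointed at your management cluster, the tokens are in environment variables, and the Vertex service account JSON is a local file:

```shell
kubectl create secret generic ai-config \
  --namespace plrl-deploy-operator \
  --from-literal=openai="$OPENAI_TOKEN" \
  --from-literal=anthropic="$ANTHROPIC_TOKEN" \
  --from-file=vertex=./vertex-service-account.json
```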
11 changes: 11 additions & 0 deletions pages/catalog/contributing.md
---
title: Contribution Program
description: Contributing to Plural's Service Catalog
---

We run a continuous Contributor Program to help maintain our catalog from the community. The bounties for the various tasks are as follows:

* $100 for an application update (note that many applications should auto-update)
* $250 for an application onboarding

To qualify for the bounty, you'll need to submit a PR to https://github.com/pluralsh/scaffolds.git. Once it's been approved and merged, DM a member of the Plural staff on Discord to receive your payout.
112 changes: 112 additions & 0 deletions pages/catalog/creation.md
---
title: Creating Your Own Catalog
description: Defining your own service catalogs with Plural
---

## Overview

{% callout severity="info" %}
TLDR: skip to [Examples](/catalog/creation#examples) for a link to the GitHub repository containing our full default catalog, with working examples
{% /callout %}

Plural service catalogs are ultimately driven by two Kubernetes custom resources: `Catalog` and `PrAutomation`. Here are examples of both:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: Catalog
metadata:
  name: data-engineering
spec:
  name: data-engineering
  category: data
  icon: https://docs.plural.sh/favicon-128.png
  author: Plural
  description: |
    Sets up OSS data infrastructure using Plural
  bindings:
    create:
    - groupName: developers # controls who can spawn prs from this catalog
```
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: airbyte
spec:
  name: airbyte
  icon: https://plural-assets.s3.us-east-2.amazonaws.com/uploads/repos/d79a69b7-dfcd-480a-a51d-518865fd6e7c/airbyte.png
  identifier: mgmt
  documentation: |
    Sets up an airbyte instance for a given cloud
  creates:
    git:
      ref: sebastian/prod-2981-set-up-catalog-pipeline # TODO set to main
      folder: catalogs/data/airbyte
    templates:
    - source: helm
      destination: helm/airbyte/{{ context.cluster }}
      external: true
    - source: services/oauth-proxy-ingress.yaml.liquid
      destination: services/apps/airbyte/oauth-proxy-ingress.yaml.liquid
      external: true
    - source: "terraform/{{ context.cloud }}"
      destination: "terraform/apps/airbyte/{{ context.cluster }}"
      external: true
    - source: airbyte-raw-servicedeployment.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/airbyte-raw-servicedeployment.yaml"
      external: true
    - source: airbyte-servicedeployment.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/airbyte-servicedeployment.yaml"
      external: true
    - source: airbyte-stack.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/airbyte-stack.yaml"
      external: true
    - source: oauth-proxy-config-servicedeployment.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/oauth-proxy-config-servicedeployment.yaml"
      external: true
    - source: README.md
      destination: documentation/airbyte/README.md
      external: true
  repositoryRef:
    name: scaffolds
  catalogRef: # <-- NOTE this references the Catalog CRD
    name: data-engineering
  scmConnectionRef:
    name: plural
  title: "Setting up airbyte on cluster {{ context.cluster }} for {{ context.cloud }}"
  message: |
    Set up airbyte on {{ context.cluster }} ({{ context.cloud }})
    Will set up an airbyte deployment, including object storage and postgres setup
  configuration:
  - name: cluster
    type: STRING
    documentation: Handle of the cluster you want to deploy airbyte to.
  - name: stackCluster
    type: STRING
    documentation: Handle of the cluster used to run Infrastructure Stacks for provisioning the infrastructure. Defaults to the management cluster.
    default: mgmt
  - name: cloud
    type: ENUM
    documentation: Cloud provider you want to deploy airbyte to.
    values:
    - aws
  - name: bucket
    type: STRING
    documentation: The name of the S3/GCS/Azure Blob bucket you'll use for airbyte logs. This must be globally unique.
  - name: hostname
    type: STRING
    documentation: The DNS name you'll host airbyte under.
  - name: region
    type: STRING
    documentation: The cloud provider region you're going to use to deploy cloud resources.
```
A catalog is a container for many PR automations, which themselves control the code generation that accomplishes the provisioning task at hand. In this case, we're provisioning [Airbyte](https://airbyte.com/). The real work is done in the referenced templates.
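The `{{ context.cluster }}` and `{{ context.cloud }}` references above are filled in from the user's configuration answers at PR-generation time. Plural uses Liquid templating for this; here is a simplified Python stand-in just to show the idea:

```python
import re

def render(template, context):
    """Replace {{ context.<key> }} placeholders with configured values."""
    def sub(match):
        return str(context[match.group(1)])
    return re.sub(r"\{\{\s*context\.(\w+)\s*\}\}", sub, template)

# Values a user might enter when launching this PR automation.
context = {"cluster": "mgmt", "cloud": "aws"}
dest = render("terraform/apps/airbyte/{{ context.cluster }}", context)
```

Each templated `source` and `destination` path is rendered along these lines before files are written into the generated PR.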
## Examples

The best way to get inspiration for writing your own templates is to look through some examples, which is why we've made our default service catalog open source. You can browse it here:

https://github.com/pluralsh/scaffolds/tree/main/setup/catalogs
25 changes: 25 additions & 0 deletions pages/catalog/overview.md
---
title: Service Catalog
description: Enterprise Grade Self-Service with Plural
---

{% callout severity="info" %}
If you'd rather skip the text and see it in action, jump to the [demo video](/catalog/overview#demo-video)
{% /callout %}

Plural provides a full-stack GitOps platform for provisioning resources with both IaC frameworks like Terraform and Kubernetes manifest tooling like Helm and Kustomize. This alone is very powerful, but most enterprises want to go a step beyond and implement full self-service. This provides two main benefits:

* Reduction of manual toil and error in repeatable infrastructure provisioning paths
* Ensuring compliance with enterprise cybersecurity and reliability standards in the creation of new infrastructure, e.g. the creation of "Golden Paths".

Plural accomplishes this via our Catalog feature, which allows [PR Automations](/deployments/pr-automation) to be bundled according to common infrastructure provisioning use cases. We like the code generation approach for a number of reasons:

* Clear tie-in with established review-and-approval mechanisms in the PR process
* Great customizability throughout the lifecycle
* Generality - in theory, any infrastructure provisioning task can be represented as some Terraform + GitOps code

# Demo Video

To see this all in action provisioning a relatively complex application, [Dagster](https://dagster.io/), feel free to browse our live demo video on YouTube:

{% embed url="https://youtu.be/5D6myZ7sm2k" aspectRatio="16 / 9" /%}
5 changes: 3 additions & 2 deletions pages/faq/certifications.md
---
title: Certifications
description: What certifications does Plural have?
---

Plural is currently a part of the **Cloud Native Computing Foundation** and **Cloud Native Landscape**. In addition we maintain the following certifications:

* **GDPR**
* **SOC 2 Type 2**
23 changes: 21 additions & 2 deletions pages/introduction.md
Plural is a unified cloud orchestrator for the management of Kubernetes at scale.

In addition, we support a robust, enterprise-ready [Architecture](/deployments/architecture). This uses a separation between the management cluster and an agent within each workload cluster to achieve scalability and enhanced security, compensating for the risks introduced by a Single-Pane-of-Glass to Kubernetes. The agent can only communicate with the management cluster via egress networking, and executes all write operations with local credentials, removing the need for the management cluster to be a repository of global credentials. If you want to learn more about the nuts-and-bolts, feel free to visit our [Architecture Page](/deployments/architecture).

## Plural AI

Plural integrates heavily with LLMs to enable complex automation within the realm of GitOps that ordinary deterministic methods struggle to get right. This includes:

* running root cause analysis on failing Kubernetes services, using a hand-tailored evidence graph Plural extracts from its own fleet control plane
* using AI troubleshooting insights to auto-generate fix PRs by introspecting Plural's own GitOps and IaC engines
* using AI code generation to generate PRs for scaling recommendations from our Kubecost integration
* "Explain with AI" and chat features that explain any complex Kubernetes object in the system, reducing the platform engineering support burden from app developers still new to Kubernetes

The goal of Plural's AI implementation is not to shoehorn LLMs into every infrastructure workflow, which is not just misguided but actually dangerous. Instead, we're trying to automate all the mindless gruntwork that comes with infrastructure, like troubleshooting well-known bugs, fixing YAML typos, and explaining the details of well-known, established technology like Kubernetes. This is the sort of thing that wastes precious engineering time, and bogs down enterprises trying to build serious developer platforms.

You can read more about it under [Plural AI](/ai/overview).

## Plural Service Catalog

We also maintain a catalog of open source applications like Airbyte, Airflow, etc. that can be deployed to Kubernetes on most major clouds. The entire catalog is extensible and recreatable by users and software vendors as well.

You can also define your own internal catalogs and vendors can share catalogs of their own software. It is meant to be a standard interface to support developer self-service for virtually any infrastructure provisioning workflow.

The full docs are available under [Service Catalog](/catalog/overview).


