Document Service Catalogs + AI (#352)
Documents new functionality under our service catalog + AI.
michaeljguarino authored Dec 31, 2024
1 parent 5c311af commit 18653ed
Showing 11 changed files with 354 additions and 233 deletions.
29 changes: 29 additions & 0 deletions pages/ai/architecture.md
---
title: Plural AI Architecture
description: How Plural AI Works
---

## Overview

At its core, Plural AI has three main components:

* A causal graph of the high-level objects that define your infrastructure. For example, a Plural Service owns a Kubernetes Deployment, which owns a ReplicaSet, which owns a set of Pods.
* A permission engine that ensures any set of objects within the graph can only be interacted with by users authorized to see them. This hardens the governance process around access to our AI's completions. The presence of Plural's agent in your Kubernetes fleet also makes querying end clusters much more secure from a networking perspective.
* Our PR Automation framework - this allows us to hook into SCM providers and automate code fixes in a reviewable, safe way.

In the parlance of the AI industry, you can think of it as highly advanced RAG (retrieval-augmented generation) with agent-like behavior, since it's always on and triggered by any emergent issue in your infrastructure.
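To make the first two components concrete, here's a minimal sketch of an ownership graph with permission-based pruning. All names and data shapes here are illustrative, not Plural's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One object in the causal graph, e.g. a Service or a Deployment."""
    kind: str
    name: str
    children: list = field(default_factory=list)

def visible_subgraph(node, allowed_kinds):
    """Prune the graph down to objects the current user may read.

    `allowed_kinds` stands in for the permission engine; children of a
    hidden node are dropped too, so AI completions never see objects the
    user isn't entitled to.
    """
    if node.kind not in allowed_kinds:
        return None
    kept = [v for v in (visible_subgraph(c, allowed_kinds) for c in node.children) if v]
    return Node(node.kind, node.name, kept)

# Service -> Deployment -> ReplicaSet -> Pods, as in the example above.
graph = Node("Service", "api", [
    Node("Deployment", "api", [
        Node("ReplicaSet", "api-7d9", [Node("Pod", "api-7d9-x1"), Node("Pod", "api-7d9-x2")]),
    ]),
])

filtered = visible_subgraph(graph, {"Service", "Deployment"})
```

Here, a user entitled to read only Services and Deployments gets a graph with the ReplicaSet and Pods pruned away before anything reaches a prompt.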

## In Detail

Here's a detailed walkthrough of how the AI engine works in the case of a Plural Service with a failing Kubernetes deployment.

1. The engine is told the service is failing by our internal event bus
2. The failing components of that service are collected, with the failing deployment selected first
3. The metadata of the service is added to the prompt (what cluster it's on, how it's sourcing its configuration, etc.)
4. The events, ReplicaSets, and spec of the Kubernetes deployment are queried and added to the prompt
5. The failing pods for the deployment are selected from the ReplicaSets, and a random subset are queried individually
6. Each failing pod's events and spec are added to the growing prompt

This then crafts an insight for the Deployment node, which can be combined with insights from any other components and rolled up into a service-level insight.
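The numbered steps above amount to an evidence-gathering pass that assembles a prompt layer by layer. A rough sketch, with hypothetical function names and data shapes:

```python
import random

def build_prompt(service, max_pods=3):
    """Assemble an investigation prompt for a failing service.

    Mirrors the walkthrough: service metadata first, then each failing
    deployment's spec and events, then a random subset of failing pods
    so the prompt stays within token limits.
    """
    prompt = [f"Service {service['name']} on cluster {service['cluster']} is failing."]
    for dep in service["failing_deployments"]:
        prompt.append(f"Deployment spec: {dep['spec']}")
        prompt.append(f"Events: {dep['events']}")
        failing = [p for p in dep["pods"] if p["status"] != "Running"]
        for pod in random.sample(failing, min(max_pods, len(failing))):
            prompt.append(f"Pod {pod['name']}: {pod['events']}")
    return "\n".join(prompt)

service = {
    "name": "api", "cluster": "prod",
    "failing_deployments": [{
        "spec": {"replicas": 3},
        "events": ["BackOff"],
        "pods": [
            {"name": "api-1", "status": "CrashLoopBackOff", "events": ["OOMKilled"]},
            {"name": "api-2", "status": "Running", "events": []},
        ],
    }],
}
print(build_prompt(service))
```

Note how the healthy pod never makes it into the prompt; only failing evidence is spent on tokens.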

If this investigation were run again, we'd be able to reuse any non-stale cached insights and avoid rerunning inference where it's unnecessary.
24 changes: 24 additions & 0 deletions pages/ai/cost.md
---
title: Plural AI Cost Analysis
description: How much will Plural AI cost me?
---

Plural AI is built to be extremely cost efficient. We've found it will often cost far less than the spend on advanced APM tools, and likely even DIY Prometheus setups. That said, AI inference is not cheap in general, and we do a number of things to work around that:

* Our causal knowledge graph caches heavily at each layer of the tree. This ensures repeated attempts to generate the same insight are deduplicated, reducing inference API calls dramatically.
* You can split the model used by use case. Insight generation can leverage cheap, fast models, whereas the tool calls that ultimately generate PRs use smarter, more advanced models but are executed less frequently, so the cost is less pronounced.
* We use AI sparingly. Inference is only done when we know something is wrong.
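The deduplication in the first bullet can be illustrated with a small sketch (hypothetical names; the real engine caches per node of the knowledge graph):

```python
import time

class InsightCache:
    """Cache insights per graph node, keyed by a hash of the evidence.

    If the evidence hasn't changed and the entry isn't stale, the cached
    insight is returned and no inference API call is made.
    """
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # node_id -> (evidence_hash, insight, timestamp)
        self.inference_calls = 0

    def get_insight(self, node_id, evidence, run_inference):
        key = hash(evidence)
        cached = self.entries.get(node_id)
        if cached and cached[0] == key and time.time() - cached[2] < self.ttl:
            return cached[1]  # fresh hit: skip inference entirely
        self.inference_calls += 1
        insight = run_inference(evidence)
        self.entries[node_id] = (key, insight, time.time())
        return insight

cache = InsightCache()
fake_llm = lambda evidence: f"insight for {evidence}"
cache.get_insight("deploy/api", "CrashLoopBackOff", fake_llm)
cache.get_insight("deploy/api", "CrashLoopBackOff", fake_llm)  # cache hit, no new call
```

Only a change in the underlying evidence (or a stale TTL) triggers a new inference call for the same node.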

That said, what does that actually mean?

## Basic Cost Analysis

We at Plural dogfood our own AI functionality in our own infrastructure. This includes a sandbox test fleet of over 10 clusters and a production fleet of around 5 clusters for both our main services and Plural Cloud. Plural's AI engine has run on the management clusters for each of these domains since launch, and while we might only do a decent-ish job of caretaking those environments, our current OpenAI bill is roughly $2.64 per day, or about $81 per month.
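As a quick sanity check on those numbers:

```python
daily_spend = 2.64                # observed daily OpenAI bill, USD
monthly = daily_spend * 365 / 12  # scale by average month length
# lands at roughly $80, in line with the ~$81/month figure above
```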

This is staggeringly cost effective when you consider that a Datadog bill for our equivalent infrastructure would be at minimum $10k, and even a Prometheus setup is well over $100/mo for the necessary compute, including the datastore, Grafana, Grafana's database, load balancers, and agents. Granted, some of these services will ultimately be necessary for Plural AI to reach its full potential, but we could see a world where:

```sh
OpenTelemetry + Plural AI >> Datadog/New Relic
```

as a general debugging platform, while costing a minuscule fraction of the current alternatives.
25 changes: 25 additions & 0 deletions pages/ai/overview.md
---
title: Plural AI
description: Plural's AI Engine Removes the Gruntwork from Infrastructure
---

{% callout severity="info" %}
If you'd rather skip the text and see it in action, jump to the [demo video](/ai/overview#demo-video)
{% /callout %}

Managing infrastructure is full of mind-numbing tasks, from troubleshooting the same misconfiguration for the hundredth time, to whack-a-moling Datadog alerts, to playing internal IT support to application developers who cannot be bothered to learn the basics of foundational technology like Kubernetes. Plural AI allows you to outsource all those time-sucks to LLMs so you can focus on building value-added platforms for your enterprise.

In particular, Plural AI has a few differentiators to its approach:

* A bring-your-own-LLM model - use the LLM already approved by your enterprise and don't worry about us as a man-in-the-middle
* An always-on troubleshooting engine - takes signals from failed Kubernetes services, failed Terraform runs, and other misfires in your infrastructure to run a consistent investigative process and summarize the results. Eliminate manual digging and just fix the issue instead.
* Automated fixes - take any insight from our troubleshooting engine and generate a fix PR automatically, powered by our ability to introspect the GitOps code defining that piece of infrastructure.
* AI explanation - complex or domain-specific pages can be explained with one click, eliminating internal support burdens for engineers.
* AI chat - any workflow above can be further refined or expanded in a full ChatGPT-like experience. Paste additional context into chats automatically, or generate PRs once you and the AI have found the fix.


# Demo Video

To see this all in action, feel free to browse our live demo video on YouTube of our GenAI integration:

{% embed url="https://youtu.be/LxurfPikps8" aspectRatio="16 / 9" /%}
60 changes: 60 additions & 0 deletions pages/ai/setup.md
---
title: Setup Plural AI
description: How to configure Plural AI
---

Plural AI can easily be configured via the `DeploymentSettings` CRD or at `/settings/global/ai-provider` in your Plural Console instance. An example `DeploymentSettings` config is below:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: DeploymentSettings
metadata:
  name: global
  namespace: plrl-deploy-operator
spec:
  managementRepo: pluralsh/plrl-boot-aws

  ai:
    enabled: true
    provider: OPENAI
    anthropic: # example anthropic config
      model: claude-3-5-sonnet-latest
      tokenSecretRef:
        name: ai-config
        key: anthropic

    openAI: # example openai config
      tokenSecretRef:
        name: ai-config
        key: openai

    vertex: # example VertexAI config
      project: pluralsh-test-384515
      location: us-east1
      model: gemini-1.5-pro-002
      serviceAccountJsonSecretRef:
        name: ai-config
        key: vertex
```
You can see the full schema at our [Operator API Reference](/deployments/operator/api#deploymentsettings).

In all these cases, you need to create an additional secret in the `plrl-deploy-operator` namespace to hold API keys and auth secrets. It would look something like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ai-config
  namespace: plrl-deploy-operator
stringData:
  vertex: <service account json string>
  openai: <access-token>
  anthropic: <access-token>
```

{% callout severity="warn" %}
Be sure not to commit this secret resource into your Git repository in plain-text, as that will result in a git secret exposure.

Plural provides a number of mechanisms to manage secrets, or you can use the established patterns within your engineering organization.
{% /callout %}
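As an alternative to hand-writing the manifest, the same secret can be created imperatively and kept out of Git entirely. A sketch, assuming `kubectl` is pointed at your management cluster, the tokens are in environment variables, and the Vertex service account JSON is a local file:

```shell
kubectl create secret generic ai-config \
  --namespace plrl-deploy-operator \
  --from-literal=openai="$OPENAI_TOKEN" \
  --from-literal=anthropic="$ANTHROPIC_TOKEN" \
  --from-file=vertex=./vertex-service-account.json
```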
11 changes: 11 additions & 0 deletions pages/catalog/contributing.md
---
title: Contribution Program
description: Contributing to Plural's Service Catalog
---

We run a continuous Contributor Program to help maintain our catalog from the community. The bounties for the various tasks are as follows:

* $100 for an application update (note that many applications should auto-update)
* $250 for an application onboarding

To qualify for the bounty, you'll need to submit a PR to https://github.com/pluralsh/scaffolds.git. Once it's been approved and merged, DM a member of the Plural staff on Discord to receive your payout.
112 changes: 112 additions & 0 deletions pages/catalog/creation.md
---
title: Creating Your Own Catalog
description: Defining your own service catalogs with Plural
---

## Overview

{% callout severity="info" %}
TLDR: skip to [Examples](/catalog/creation#examples) for a link to the GitHub repository containing our full default catalog, with working examples
{% /callout %}

Plural service catalogs are ultimately driven by two Kubernetes custom resources: `Catalog` and `PrAutomation`. Here are examples of both:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: Catalog
metadata:
  name: data-engineering
spec:
  name: data-engineering
  category: data
  icon: https://docs.plural.sh/favicon-128.png
  author: Plural
  description: |
    Sets up OSS data infrastructure using Plural
  bindings:
    create:
    - groupName: developers # controls who can spawn prs from this catalog
```
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: airbyte
spec:
  name: airbyte
  icon: https://plural-assets.s3.us-east-2.amazonaws.com/uploads/repos/d79a69b7-dfcd-480a-a51d-518865fd6e7c/airbyte.png
  identifier: mgmt
  documentation: |
    Sets up an airbyte instance for a given cloud
  creates:
    git:
      ref: sebastian/prod-2981-set-up-catalog-pipeline # TODO set to main
      folder: catalogs/data/airbyte
    templates:
    - source: helm
      destination: helm/airbyte/{{ context.cluster }}
      external: true
    - source: services/oauth-proxy-ingress.yaml.liquid
      destination: services/apps/airbyte/oauth-proxy-ingress.yaml.liquid
      external: true
    - source: "terraform/{{ context.cloud }}"
      destination: "terraform/apps/airbyte/{{ context.cluster }}"
      external: true
    - source: airbyte-raw-servicedeployment.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/airbyte-raw-servicedeployment.yaml"
      external: true
    - source: airbyte-servicedeployment.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/airbyte-servicedeployment.yaml"
      external: true
    - source: airbyte-stack.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/airbyte-stack.yaml"
      external: true
    - source: oauth-proxy-config-servicedeployment.yaml
      destination: "bootstrap/apps/airbyte/{{ context.cluster }}/oauth-proxy-config-servicedeployment.yaml"
      external: true
    - source: README.md
      destination: documentation/airbyte/README.md
      external: true
  repositoryRef:
    name: scaffolds
  catalogRef: # <-- NOTE this references the Catalog CRD
    name: data-engineering
  scmConnectionRef:
    name: plural
  title: "Setting up airbyte on cluster {{ context.cluster }} for {{ context.cloud }}"
  message: |
    Set up airbyte on {{ context.cluster }} ({{ context.cloud }})
    Will set up an airbyte deployment, including object storage and postgres setup
  configuration:
  - name: cluster
    type: STRING
    documentation: Handle of the cluster you want to deploy airbyte to.
  - name: stackCluster
    type: STRING
    documentation: Handle of the cluster used to run Infrastructure Stacks for provisioning the infrastructure. Defaults to the management cluster.
    default: mgmt
  - name: cloud
    type: ENUM
    documentation: Cloud provider you want to deploy airbyte to.
    values:
    - aws
  - name: bucket
    type: STRING
    documentation: The name of the S3/GCS/Azure Blob bucket you'll use for airbyte logs. This must be globally unique.
  - name: hostname
    type: STRING
    documentation: The DNS name you'll host airbyte under.
  - name: region
    type: STRING
    documentation: The cloud provider region you're going to use to deploy cloud resources.
```
A catalog is a container for many PR automations, which themselves control the code generation that accomplishes the provisioning task at hand. In this case, we're provisioning [Airbyte](https://airbyte.com/). The real work is done in the referenced templates.
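The `{{ context.cluster }}` and `{{ context.cloud }}` references above are filled in from the user's configuration answers at PR-generation time. Plural uses Liquid templating for this; here is a simplified Python stand-in just to show the idea:

```python
import re

def render(template, context):
    """Replace {{ context.<key> }} placeholders with configured values."""
    def sub(match):
        return str(context[match.group(1)])
    return re.sub(r"\{\{\s*context\.(\w+)\s*\}\}", sub, template)

# Values a user might enter when launching this PR automation.
context = {"cluster": "mgmt", "cloud": "aws"}
dest = render("terraform/apps/airbyte/{{ context.cluster }}", context)
```

Each templated `source` and `destination` path is rendered along these lines before files are written into the generated PR.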
## Examples

The best way to get inspiration for writing your own templates is to look through some examples, which is why we've made our default service catalog open source. You can browse it here:

https://github.com/pluralsh/scaffolds/tree/main/setup/catalogs
25 changes: 25 additions & 0 deletions pages/catalog/overview.md
---
title: Service Catalog
description: Enterprise Grade Self-Service with Plural
---

{% callout severity="info" %}
If you'd rather skip the text and see it in action, jump to the [demo video](/catalog/overview#demo-video)
{% /callout %}

Plural provides a full-stack GitOps platform for provisioning resources with both IaC frameworks like Terraform and Kubernetes manifest tooling like Helm and Kustomize. This alone is very powerful, but most enterprises want to go a step beyond and implement full self-service. This provides two main benefits:

* Reduction of manual toil and error in repeatable infrastructure provisioning paths
* Ensuring compliance with enterprise cybersecurity and reliability standards in the creation of new infrastructure, e.g. the creation of "Golden Paths".

Plural accomplishes this via our Catalog feature, which allows [PR Automations](/deployments/pr-automation) to be bundled according to common infrastructure provisioning use cases. We like the code generation approach for a number of reasons:

* Clear tie-in with established review-and-approval mechanisms in the PR process
* Great customizability throughout the lifecycle
* Generality - in theory, any infrastructure provisioning task can be represented as some Terraform + GitOps code

# Demo Video

To see this all in action provisioning a relatively complex application, [Dagster](https://dagster.io/), feel free to browse our live demo video on YouTube:

{% embed url="https://youtu.be/5D6myZ7sm2k" aspectRatio="16 / 9" /%}
5 changes: 3 additions & 2 deletions pages/faq/certifications.md
---
title: Certifications
description: What certifications does Plural have?
---

Plural is currently a part of the **Cloud Native Computing Foundation** and **Cloud Native Landscape**. In addition we maintain the following certifications:

* **GDPR**
* **SOC 2 Type 2**
23 changes: 21 additions & 2 deletions pages/introduction.md
Plural is a unified cloud orchestrator for the management of Kubernetes at scale.

In addition, we support a robust, enterprise-ready [Architecture](/deployments/architecture). This uses a separation between the management cluster and an agent within each workload cluster to achieve scalability and enhanced security, compensating for the risks introduced by a Single-Pane-of-Glass to Kubernetes. The agent can only communicate with the management cluster via egress networking, and executes all write operations with local credentials, removing the need for the management cluster to be a repository of global credentials. If you want to learn more about the nuts-and-bolts, feel free to visit our [Architecture Page](/deployments/architecture).

## Plural AI

Plural integrates heavily with LLMs to enable complex automation within the realm of GitOps that ordinary deterministic methods struggle to get right. This includes:

* running root cause analysis on failing Kubernetes services, using a hand-tailored evidence graph Plural extracts from its own fleet control plane
* using AI troubleshooting insights to auto-generate fix PRs by introspecting Plural's own GitOps and IaC engines
* using AI code generation to generate PRs for scaling recommendations from our Kubecost integration
* "Explain with AI" and chat features that explain any complex Kubernetes object in the system, reducing the platform engineering support burden from app developers still new to Kubernetes

The goal of Plural's AI implementation is not to shoehorn LLMs into every infrastructure workflow, which is not just misguided but actually dangerous. Instead, we're trying to automate all the mindless gruntwork that comes with infrastructure, like troubleshooting well-known bugs, fixing YAML typos, and explaining the details of well-known, established technology like Kubernetes. This is the sort of thing that wastes precious engineering time, and bogs down enterprises trying to build serious developer platforms.

You can read more about it under [Plural AI](/ai/overview).

## Plural Service Catalog

We also maintain a catalog of open source applications like Airbyte, Airflow, etc. that can be deployed to Kubernetes on most major clouds. The entire catalog is extensible and recreatable by users and software vendors as well.

You can also define your own internal catalogs and vendors can share catalogs of their own software. It is meant to be a standard interface to support developer self-service for virtually any infrastructure provisioning workflow.

The full docs are available under [Service Catalog](/catalog/overview).


