Cryptic SDS Error in Spire Agent #5709

sgav-twilio · 2024-12-12T23:37:14Z

spire-helm-charts-hardened version: 0.21.0 and 0.15.1
spire-agent version: v1.9.6 and v1.8.4
spire-server version: v1.9.6 and v1.8.4
subsystem: spire-agent

We're running Spire in tandem with Istio in an EKS environment. Our Istio sidecar proxies are connected to spire-agents via a workload socket. We use Spire to retrieve x509 certs/trust bundles as well as federated trust bundles.

We've also configured a liveness probe on our istio-proxies with the following command:

- /bin/sh
- -c
- |
 if ! curl -s http://localhost:15000/config_dump | grep -E -q '<federated_trust_domain>'; then
      echo "An incomplete trust bundle was found"
      exit 1
 fi

This command essentially tries to check the envoy config dump and then check for the existence of the federated trust domain in the config dump. We have this check to guard against potential problems we had seen with the federated trust bundle not being appended for certain workloads.

We've noticed this liveness probe failing for workloads whenever the error below for example occurs in the corresponding spire-agent pods:

Received error from stream secrets server" error="<nil>" method=StreamSecrets pid=3112366 service=SDS.v3 subsystem_name=endpoints

We've noticed very few spire-agent pods (1-2 in a 15 node cluster for example) having this error regularly however it does seem to be correlated with failures. We've also noticed that the nodes that have a spire-agent reporting this error do not seem to have any abnormal cpu/memory usage nor do we see anything abnormal in terms of cpu/memory from the workloads themselves. Would anyone know what this error means and why it would appear regularly for certain spire-agent/nodes?

The text was updated successfully, but these errors were encountered:

MarcosDY · 2024-12-17T20:50:14Z

Could you check if there are any attestation failures?
If possible, please share additional logs to help us investigate further.

This issue might be related to #5638.

sgav-twilio · 2024-12-19T22:10:58Z

Hey @MarcosDY, no attestation failures as far as I could tell. Any additional logs were not errors or warnings unfortunately and I'm not sure how much help they would be, but let me see about getting them. To switch gears, this is in fact related to the issue you posted actually, in fact the liveness probe's existence is because of #5638. I actually happen to have a permanent fix for that issue. I've tested it on a k8s cluster here, but let me summarize the problem and the fix.

The original problem in #5638 related to failures of the spire-agent in retrieving federated bundles if it lost contact with the kubelet for example, thereby when contact with the kubelet is restored, the agent gives the workload the bundle, but not the federated bundle. The culprit is actually in this function here where the agent would return the bundle but not the federated bundle since the workload triggering the SDS endpoint would not have its identity assigned when the agent lost contact with the kubelet. You can verify this by doing the following in a k8s cluster in an EKS environment (or similar):

Get into the EC2 instance (or equivalent) for a corresponding spire-agent and find the process running the spire-agent
Run the following in the EC2 instance to simulate losing communication with the kubelet:
nsenter -t <pid_of_spire_agent> --net iptables -A INPUT -s 127.0.0.1 -j REJECT -p tcp --dport 10250
Deploy a k8s pod manifest that connects to the spire-agent via a workload socket
Delete the iptables rule added in step 2 (e.g. nsenter -t <pid_of_spire_agent> --net iptables -D INPUT <number_of_rule>)
The federated trust bundle will not appear in the workload (we are using Istio sidecar proxies, so the trust bundle will not appear in the Envoy config dump)

The fix would involve:

Making the federated bundle url's for the default clusterspiffeid's available for the spire-agent cache. Basically allow specifying a new parameter called defaultFederatesWith specifying basically the same federatesWith url's used for the spire-server here, just for the spire-agent as an option. Or alternatively, allow the spire-agent to access ClusterSPIFFEID objects and just access federatesWith from the default ClusterSPIFFEID object provided a new setting like agentUseDefaultInFailure is set to true.
Have the buildWorkloadUpdate function mentioned earlier here have access to the defaultFederatesWith url's (or the federatesWith from the default ClusterSPIFFID object), check for their existence and if the length of w.Identities is zero, and return the corresponding value from the bundle map if so.

I've tested something like this on a k8s cluster and it works, and if this approach sounds acceptable, I can make a PR here too. A solution for that problem would really help, as the only way we could think of to guard against this problem is having a liveness probe to check if the federated bundle does not exist. But a liveness probe or a similar mechanism that simply forces re-attestation by restarting is not a good long-term solution and leads to problems of its own (such as likely, the one mentioned in this issue originally).

MarcosDY added the triage/in-progress Issue triage is in progress label Dec 17, 2024

MarcosDY self-assigned this Dec 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cryptic SDS Error in Spire Agent #5709

Cryptic SDS Error in Spire Agent #5709

sgav-twilio commented Dec 12, 2024 •

edited

Loading

MarcosDY commented Dec 17, 2024

sgav-twilio commented Dec 19, 2024 •

edited

Loading

Cryptic SDS Error in Spire Agent #5709

Cryptic SDS Error in Spire Agent #5709

Comments

sgav-twilio commented Dec 12, 2024 • edited Loading

MarcosDY commented Dec 17, 2024

sgav-twilio commented Dec 19, 2024 • edited Loading

sgav-twilio commented Dec 12, 2024 •

edited

Loading

sgav-twilio commented Dec 19, 2024 •

edited

Loading