Provide standardized traffic metrics #554

Closed
tomkerkhove opened this issue Feb 16, 2021 · 22 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tomkerkhove
Contributor

What would you like to be added:
Provide standardized traffic metrics for all gateways to implement so that other tools can rely on a common way to get the metrics.

Ideally, this would fully align with the metrics that the SMI spec uses, which would make integration even easier.

I had proposed extending SMI beyond service meshes, but then this project started, so it might be better to align rather than reinvent.

Why is this needed:

Tools & platforms need a unified way to get traffic metrics for all gateways, regardless of which gateway is actually in use.

In my case, we are building (experimental) HTTP-based autoscaling for KEDA, so we will rely on things such as SMI. For gateways, we were hoping to rely on a standard/SDK for getting the metrics as well, instead of implementing support for every gateway separately.

/cc @michelleN @bridgetkromhout

@tomkerkhove tomkerkhove added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 16, 2021
@robscott
Member

Hey @tomkerkhove, thanks for bringing this up! We're always interested in finding any overlap with SMI Spec or any other related projects and trying to develop a standard that works for all use cases. In this case, metrics are something that we haven't prioritized yet, but something that could be in scope in the future. This actually came up in Slack a couple weeks ago and I raised it at our community meeting as well.

A few quick follow up questions related to SMI metrics:

  1. Did you envision Gateway API supporting the same set of metrics? A subset/superset? Are there metrics that would be more relevant to only certain use cases?
  2. What level of standardization do you think would be ideal to provide?
  3. Would it be possible to expose this with standard Prometheus metrics instead of a CRD?
  4. In your experience, how difficult has it been for different implementations to provide this standardized set of metrics?
  5. Are there any areas you'd like to improve on with the metrics in SMI?
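
To make question 3 concrete, here is a minimal sketch of what a request counter could look like in the Prometheus text exposition format. The metric name `gateway_requests_total` and the `gateway` and `result` labels are assumptions made up for illustration; nothing in this thread or in Gateway API defines them. Label values are not escaped here, which a real exporter would need to do.

```python
from typing import Dict, Tuple

# Each sample is keyed by an ordered tuple of (label, value) pairs.
Labels = Tuple[Tuple[str, str], ...]


def render_counter(name: str, help_text: str, samples: Dict[Labels, int]) -> str:
    """Render one counter in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and labels, not part of any spec.
text = render_counter(
    "gateway_requests_total",
    "Requests handled by the gateway.",
    {
        (("gateway", "example"), ("result", "success")): 95,
        (("gateway", "example"), ("result", "failure")): 5,
    },
)
print(text, end="")
```

Consumers would scrape this over HTTP rather than read it from the API server, which is the main difference from the CRD approach.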

@tomkerkhove
Contributor Author

Thanks for your reply!

In terms of which metrics, I do not have a strong preference, but the number of requests is the minimum for me.

A CRD is ideal for us/me as we do not want to force Prometheus on everyone. For example, some companies (including mine) rely on their cloud provider for metrics, so they don't need it.

In terms of standardization, I'd hope every Gateway API implementation has to/will provide these metrics, ideally in a way that is compatible with SMI, so that we can have the same metric experience for service meshes, ingresses/gateways and, eventually, service-to-service traffic as well.

@beriberikix
Contributor

I would suggest success/failure counts rather than a raw count, since both the success rate and the number of requests can be calculated from them.
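
As a minimal sketch of that derivation, assuming a hypothetical sample type with `success` and `failure` fields (these names are illustrative and not taken from SMI or Gateway API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical traffic sample; field names are illustrative only."""

    success: int
    failure: int

    @property
    def total(self) -> int:
        # The raw request count is the sum of both outcomes.
        return self.success + self.failure

    @property
    def success_rate(self) -> float:
        # With no traffic there is nothing to divide by. Reporting 1.0 here
        # is an arbitrary choice made for this sketch.
        if self.total == 0:
            return 1.0
        return self.success / self.total


sample = RequestCounts(success=95, failure=5)
print(sample.total, sample.success_rate)  # prints: 100 0.95
```

The reverse does not hold: a raw count alone cannot be split back into successes and failures.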

@tomkerkhove
Contributor Author

Yes, but ideally both are available out of the box so that no computation is needed, which makes them easier to use.

@bowei
Contributor

bowei commented Feb 19, 2021

Given the SMI experience, are there any issues with using CRs (which are backed by etcd) to store metrics that change quickly and tend to be ephemeral? I know that the K8s metrics resources have special handling to avoid overwhelming the API server.

(https://github.com/kubernetes/metrics/blob/master/pkg/apis/metrics/v1beta1/types.go)

@evankanderson
Contributor

Other art happening concurrently:

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md

I'd share bowei's concern about the apiserver not necessarily being the best place for fast-moving metrics.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 1, 2021
@tomkerkhove
Contributor Author

Closing, as we discussed in standup that this is not on the radar for now.

@tomkerkhove
Contributor Author

tomkerkhove commented Dec 22, 2021

Since some time has passed, I'm wondering whether there are plans to integrate traffic metrics into the spec. If so, it would be nice to reuse parts of the SMI spec or the OpenTelemetry conventions.

@howardjohn
Contributor

I am admittedly not very familiar with them, but my understanding is that OpenTelemetry has also defined some standardized request metric schemas, which could be a reasonable API if this project decides to 'endorse' a metrics scheme.

@tomkerkhove
Contributor Author

That's definitely correct, but what if it's not endorsed and rather part of the spec to provide these?

This is what SMI does, and it allows end-users to rely on a standard way of getting metrics, regardless of which standard is used for the semantics. Tooling knows the metrics are there for every Gateway API implementation.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

@tomkerkhove
Contributor Author

Given the current state of the project, I think it's time to reconsider this so that metrics are consistent across implementations.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 2, 2022
@k8s-ci-robot
Contributor

@tomkerkhove: Reopened this issue.

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

@mikemorris
Contributor

mikemorris commented Jul 6, 2022

@tomkerkhove Some of the folks who have been involved in SMI are trying to start a new working group to discuss E/W mesh applications of Gateway API, which sounds like it may be of interest to you: kubernetes/community#6724

In early discussions though, we decided to not focus on a spec for telemetry at this time - it's been an under-implemented/adopted part of SMI, and the divergence in implementations, vendors and evolving standards has made it challenging to build the consensus needed for a standard to become widely adopted.

It has been encouraging to see some of the work OpenTelemetry has been doing, and I think for the near future it would be best to focus on implementation/adoption within that group, with the goal of laying the groundwork to eventually enable broader adoption in projects like Gateway API, rather than starting a parallel effort.

@tomkerkhove
Contributor Author

Thanks for the update.

I strongly believe that Gateway API is more than service meshes, and that seeing it otherwise is a common misconception. I will jump into that thread and see what gives, because in the end, if it's purely focusing on service meshes, then SMI has already covered that.

@youngnick
Contributor

I one thousand percent agree that Gateway API is more than service meshes, but I'm supportive of the new WG because the core API has so many todos already that we're not going to have the bandwidth to properly address service mesh use cases for some time. Having a WG that works on the service mesh problems, figures out how to integrate the work SMI has already done with Gateway API, and reports back will be super useful.

@robscott
Member

For anyone still interested in this - there's a related discussion in OpenTelemetry now: open-telemetry/semantic-conventions#1675
