Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc-alt: metrics #20728

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

rpc-alt: metrics #20728

wants to merge 8 commits into from

Conversation

amnn
Copy link
Member

@amnn amnn commented Dec 24, 2024

Description

Add metrics tracking to the RPC service (tracking latencies as well as success/failure/total counts for requests, broken down by method, and DB queries). This required some additional/surrounding changes:

  • Upgrading to the latest version of jsonrpsee (just for sui-indexer-alt-jsonrpc), which has better support for adding middleware that is aware of a JSON-RPC request's structure. This is used to add a middleware to track request latencies.
  • Factoring out the MetricsService from the indexing framework, to be used in the reader as well. This was mostly motivated by the fact that in a test cluster or local network, we would need to serve metrics for both the reader and the indexer from the same endpoint -- so we will want to run one instance of this service and register metrics for both the reader and the indexer (this change also simplifies factoring out the ingestion service because we can now factor out its metrics).
  • Similarly, factoring out the database stats collector, so we can collect stats about the connection pool on both the read and the write side.
  • Now that the indexer does not control the lifecycle of its metrics service, the graceful shutdown logic has been tweaked: Previously, the indexer would trigger a cancellation as part of it shutdown, which might cause other unrelated services to wind down as well. Now, the indexer is responsible for just its own shutdown and we move the responsibility to coordinate shutdowns between unrelated services to the code that starts the services together. This is achieved using "child" cancellation tokens, which will trigger if their parents trigger, but not vice versa.

The changes to introduce RPC discovery were also coupled with this change -- the rpc.discover endpoint was added using the existing sui-open-rpc crate -- the coupling happened because it was useful for the metrics service to know which methods were supported to avoid polluting metrics with unrecognised methods.

Test plan

Unit tests were added for rpc discovery, graceful shutdown and metrics:

sui$ cargo nextest run -p sui-indexer-alt-jsonrpc

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

amnn added 8 commits December 20, 2024 19:52
## Description

Upgrade jsonrpsee we're using for this indexer to the latest version
(0.24.7) rather than the fork of 0.18.2 that we were using before. This
is because 0.24.7 has better support for middleware that can access the
parsed JSONRPC method (useful for measuring the time spent processing
each method).

## Test plan

Run the service and query it:

```
sui$ cargo run -p sui-indexer-alt-jsonrpc
```

```
curl -LX POST  "http://localhost:6000" \
    --header 'Content-Type: application/json' \
    --data '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "suix_getReferenceGasPrice",
  "params": []
}' | jq .
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": "1000"
}
```
## Description

Track request latencies, success and failure rates, report them through
a dedicated metrics endpoint, and report them through tracing as well.

## Test plan

New unit tests for request counts and graceful shutdown, and manual
testing for tracing:

```
sui$ RUST_LOG=DEBUG cargo run -p -- sui-indexer-alt-jsonrpc
2024-12-21T00:02:46.720775Z  INFO sui_indexer_alt_jsonrpc: Starting JSON-RPC service on 0.0.0.0:6000
2024-12-21T00:02:46.720775Z  INFO sui_indexer_alt_jsonrpc::metrics: Starting metrics service on 0.0.0.0:9184
2024-12-21T00:02:49.676408Z DEBUG jsonrpsee-server: Accepting new connection 1/100
2024-12-21T00:02:49.691414Z DEBUG sui_indexer_alt_jsonrpc::api: SELECT "kv_epoch_starts"."reference_gas_price" FROM "kv_epoch_starts" ORDER BY "kv_epoch_starts"."epoch" DESC  LIMIT $1 -- binds: [1]
2024-12-21T00:02:49.694647Z  INFO sui_indexer_alt_jsonrpc::metrics::middleware: Request succeeded method="suix_getReferenceGasPrice" elapsed_ms=1.7868458000000002e-5
2024-12-21T00:03:02.630356Z DEBUG jsonrpsee-server: Accepting new connection 1/100
2024-12-21T00:03:02.630585Z  INFO sui_indexer_alt_jsonrpc::metrics::middleware: Request failed method="suix_getReferenceGasPrice_non_existent" code=-32601 elapsed_ms=3.9333e-8
```
## Description

Request metrics are labelled with their method, which is a user input.
Users can pollute our metrics by sending methods that the RPC does not
support. This change mitigates that by detecting unrecognized methods
and normalizing them to an "<UNKNOWN>" label.

## Test plan

Updated unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt-jsonrpc
```
## Description

Track how long each Database request is taking, how many requests we've
issued, and how many have succeeded/failed.

## Test plan
Run server, make request, check metrics.
## Description

Add an automatic function to advertise the RPC's schema (version,
methods, types) to clients.

## Test plan

New unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt-jsonrpc
```
## Description

The RPC and Indexer implementations use essentially the same metrics
service, and when they are combined in a single process they need to use
the same instance of that metrics service as well. This change starts
that process, by pulling out the implementation from `sui-indexer-alt`.

In a later change, the metrics service in the RPC implementation will
switch over to this common crate as well. This unblocks running both
services together in one process, which is something we will need to do
as part of E2E tests.

## Test plan

Run the indexer and make sure it can be cancelled (with Ctrl-C), and
that it also shuts down cleanly when run to a specific last checkpoint.
## Description

...and use it in the RPC implementation as well. A parameter has been
added to add a prefix to all metric names. This ensures that if an
Indexer and RPC service are sharing a metrics service, they won't stomp
on each others metrics.

## Test plan

Run the RPC service and check its metrics.
@amnn amnn self-assigned this Dec 24, 2024
Copy link

vercel bot commented Dec 24, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Dec 24, 2024 4:09pm
sui-kiosk ⬜️ Ignored (Inspect) Dec 24, 2024 4:09pm
sui-typescript-docs ⬜️ Ignored (Inspect) Dec 24, 2024 4:09pm

@amnn amnn mentioned this pull request Dec 24, 2024
7 tasks
@lxfind
Copy link
Contributor

lxfind commented Dec 24, 2024

Curious why do we have to move the metrics to a separate crate? It's still part of the framework, right?

@amnn
Copy link
Member Author

amnn commented Dec 24, 2024

Curious why do we have to move the metrics to a separate crate? It's still part of the framework, right?

In my mind the indexing framework is about writing to the database, so it didn't make sense that the read side would depend on the indexing framework, especially not to use its metrics service, which is a shared component to both (put another way, I wouldn't think to look inside the indexing framework for a metrics service).

This PR does not get us there, but I think ideally (when prepared for external folks to use) the indexing framework would just deal with the prometheus Registry type (and even then, gated behind a feature), and exposing those metrics would be left as an implementation detail. As it stands, I think the current metrics service is quite specific to how we report metrics from our services (I was surprised that we didn't already have a shared crate for this already).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants