SOLR-17587: Prometheus Writer duplicate TYPE information in exposition format #2902

Open
wants to merge 9 commits into
base: main

mlbiscoc
Contributor

https://issues.apache.org/jira/browse/SOLR-17587

Description

Solr's Prometheus writer duplicates # TYPE <metric name> <prometheus metric type> in its exposition format for core registry metrics.

This is an illegal format, and depending on how strictly a given tool verifies the Prometheus exposition format (Telegraf, for example), parsing will fail. The Prometheus server itself, for some reason, still accepts this output and collects the metrics just fine.

This happens because the Prometheus writer takes Dropwizard registries and exports them to Prometheus registries in order to expose them in Prometheus format. Solr creates a Dropwizard registry for every core and differentiates the metrics that way, even though they have the same metric names.

For Prometheus this is a problem, because metrics with the same name should be differentiated by their labels. When the metrics are written by the Prometheus response writer, there is one registry per core, and each registry does not know that the other core registries contain the same metric names, so the TYPE information is emitted once per core.

Solution

When metrics are exported for Prometheus, we merge all the core Dropwizard metric registries into a single registry and export that registry to Prometheus. Duplicate metric names are not allowed within a registry, so we also prefix each Dropwizard metric name with the core name to identify which core a metric belongs to, and parse the labels accordingly.
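
To illustrate the idea, here is a minimal sketch that uses plain maps as stand-ins for the Dropwizard registries. The class and method names are hypothetical and the metric names are only examples; this is not the code in the PR. It assumes the core name has already been sanitized (dots replaced with underscores), so the first '.' in the merged name marks the end of the core-name prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: each per-core "registry" is a Map from metric name to value.
final class MergedCoreRegistrySketch {

  // Merge every per-core registry into one map, prefixing each metric name with
  // "<coreName>." so identically named metrics from different cores do not collide.
  static Map<String, Object> merge(Map<String, Map<String, Object>> registriesByCore) {
    Map<String, Object> merged = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, Object>> registry : registriesByCore.entrySet()) {
      String coreName = registry.getKey();
      for (Map.Entry<String, Object> metric : registry.getValue().entrySet()) {
        merged.put(coreName + "." + metric.getKey(), metric.getValue());
      }
    }
    return merged;
  }

  // Chop off the "<coreName>." prefix to recover the original Dropwizard metric name.
  static String stripCoreName(String mergedName) {
    return mergedName.substring(mergedName.indexOf('.') + 1);
  }
}
```

With a single merged registry, each metric name is exported once, and the core is recovered from the prefix and turned into labels.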

This also allowed us to clean up and simplify some of the SolrPrometheusCoreFormatter code.

Tests

Updated the tests to account for the coreName now present in the Dropwizard metric names and for its parsing.

Also added an assert in testPrometheusStructureOutput to confirm there is no duplicate TYPE information in prometheus output.
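
A check along these lines could look like the following sketch. The helper name is hypothetical and this is not the actual assertion in testPrometheusStructureOutput; it only assumes that a TYPE line has the form "# TYPE <metric name> <metric type>".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: collects metric names that have more than one "# TYPE" line.
final class DuplicateTypeCheck {

  static List<String> duplicateTypeNames(String exposition) {
    Set<String> seen = new HashSet<>();
    List<String> duplicates = new ArrayList<>();
    for (String line : exposition.split("\n")) {
      if (!line.startsWith("# TYPE ")) {
        continue; // sample lines and other comments are ignored
      }
      String[] parts = line.trim().split("\\s+"); // "#", "TYPE", name, type
      if (parts.length < 4) {
        continue; // malformed TYPE line, out of scope for this sketch
      }
      String name = parts[2];
      // Set.add returns false when the name was already declared.
      if (!seen.add(name) && !duplicates.contains(name)) {
        duplicates.add(name);
      }
    }
    return duplicates;
  }
}
```

A test can then assert that the returned list is empty for the writer's output.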

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@mlbiscoc
Contributor Author

Going to bump this PR. It's not a blocker for standard Prometheus server metrics collection but it can potentially block users using other exporters/collectors.

Contributor

@dsmiley dsmiley left a comment

Correct me if I'm wrong, but had you kept the test that you had removed because hossman didn't like it (it was imperfect), a reviewer (like me but you too) would be able to see in this PR a diff against the output format to understand the impact (or lack of impact).

@dsmiley
Contributor

dsmiley commented Dec 19, 2024

I could say the same thing for dependencies. People add/remove dependencies in build files (Maven/Gradle/Ivy) but ultimately what we want to know is, what JARs changed in the ultimate distribution.

@mlbiscoc
Contributor Author

Correct me if I'm wrong, but had you kept the test that you had removed because hossman didn't like it (it was imperfect), a reviewer (like me but you too) would be able to see in this PR a diff against the output format to understand the impact (or lack of impact).

In theory, yes, you would see the difference here. But since you can't see the output, here is basically the change to the output:

Before:

curl 'localhost:8983/solr/admin/metrics?wt=prometheus' | grep "#" | sort
# TYPE solr_metrics_core_average_request_time gauge
# TYPE solr_metrics_core_average_request_time gauge
# TYPE solr_metrics_core_average_searcher_warmup_time gauge
# TYPE solr_metrics_core_average_searcher_warmup_time gauge
# TYPE solr_metrics_core_cache gauge
# TYPE solr_metrics_core_cache gauge
# TYPE solr_metrics_core_highlighter_requests_total counter
# TYPE solr_metrics_core_highlighter_requests_total counter
# TYPE solr_metrics_core_index_size_bytes gauge
# TYPE solr_metrics_core_index_size_bytes gauge
# TYPE solr_metrics_core_requests_time_total counter
# TYPE solr_metrics_core_requests_time_total counter
# TYPE solr_metrics_core_requests_total counter
# TYPE solr_metrics_core_requests_total counter
# TYPE solr_metrics_core_searcher_documents gauge
# TYPE solr_metrics_core_searcher_documents gauge
# TYPE solr_metrics_core_tlog_total counter
# TYPE solr_metrics_core_tlog_total counter
# TYPE solr_metrics_core_update_handler gauge
# TYPE solr_metrics_core_update_handler gauge
# TYPE solr_metrics_jetty_dispatches_total counter
# TYPE solr_metrics_jetty_requests_total counter
# TYPE solr_metrics_jetty_response_total counter
# TYPE solr_metrics_jvm_buffers gauge
# TYPE solr_metrics_jvm_buffers_bytes gauge
# TYPE solr_metrics_jvm_gc gauge
# TYPE solr_metrics_jvm_gc_seconds gauge
# TYPE solr_metrics_jvm_heap gauge
# TYPE solr_metrics_jvm_memory_pools_bytes gauge
# TYPE solr_metrics_jvm_threads gauge
# TYPE solr_metrics_node_connections gauge
# TYPE solr_metrics_node_core_root_fs_bytes gauge
# TYPE solr_metrics_node_cores gauge
# TYPE solr_metrics_node_requests_time_total counter
# TYPE solr_metrics_node_requests_total counter
# TYPE solr_metrics_node_thread_pool_total counter
# TYPE solr_metrics_os gauge

After:

curl 'localhost:8983/solr/admin/metrics?wt=prometheus' | grep "#" | sort
# TYPE solr_metrics_core_average_request_time gauge
# TYPE solr_metrics_core_average_searcher_warmup_time gauge
# TYPE solr_metrics_core_cache gauge
# TYPE solr_metrics_core_highlighter_requests_total counter
# TYPE solr_metrics_core_index_size_bytes gauge
# TYPE solr_metrics_core_requests_time_total counter
# TYPE solr_metrics_core_requests_total counter
# TYPE solr_metrics_core_searcher_documents gauge
# TYPE solr_metrics_core_tlog_total counter
# TYPE solr_metrics_core_update_handler gauge
# TYPE solr_metrics_jetty_dispatches_total counter
# TYPE solr_metrics_jetty_requests_total counter
# TYPE solr_metrics_jetty_response_total counter
# TYPE solr_metrics_jvm_buffers gauge
# TYPE solr_metrics_jvm_buffers_bytes gauge
# TYPE solr_metrics_jvm_gc gauge
# TYPE solr_metrics_jvm_gc_seconds gauge
# TYPE solr_metrics_jvm_heap gauge
# TYPE solr_metrics_jvm_memory_pools_bytes gauge
# TYPE solr_metrics_jvm_threads gauge
# TYPE solr_metrics_node_connections gauge
# TYPE solr_metrics_node_core_root_fs_bytes gauge
# TYPE solr_metrics_node_cores gauge
# TYPE solr_metrics_node_requests_time_total counter
# TYPE solr_metrics_node_requests_total counter
# TYPE solr_metrics_node_thread_pool_total counter
# TYPE solr_metrics_os gauge

# is just a comment in Prometheus, but # TYPE is special. Some technologies will verify whether there are dupes while others don't. This just removes the duplicates and merges the metrics under a single # TYPE.

Contributor

@dsmiley dsmiley left a comment

Getting much better now

@@ -552,6 +549,11 @@ private List<MetricType> parseMetricTypes(SolrParams params) {
  return metricTypes;
  }

+ private String getCoreNameFromRegistry(String registryName) {
+ String coreName = registryName.substring(registryName.indexOf('.') + 1);
+ return coreName.replaceAll("\\.", "_");
Contributor

for single char find & replace, just use replace. You'll see by code inspection it's much faster.
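
As a sketch of that suggestion (the class and method names are hypothetical, and the registry name below is only an example), String.replace(char, char) gives the same result as the regex-based replaceAll here without going through the regex engine:

```java
// Hypothetical comparison of the two variants for deriving a core name from a registry name.
final class CoreNameReplaceSketch {

  // Variant from the diff: replaceAll treats its first argument as a regular expression.
  static String withReplaceAll(String registryName) {
    String coreName = registryName.substring(registryName.indexOf('.') + 1);
    return coreName.replaceAll("\\.", "_");
  }

  // Suggested variant: a plain character-for-character replacement, no regex involved.
  static String withReplace(String registryName) {
    String coreName = registryName.substring(registryName.indexOf('.') + 1);
    return coreName.replace('.', '_');
  }
}
```

Both variants drop everything up to and including the first '.' and then turn the remaining dots into underscores.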

@@ -30,18 +30,18 @@ public abstract class SolrCoreMetric extends SolrMetric {
  public String coreName;

  public SolrCoreMetric(Metric dropwizardMetric, String metricName) {
- super(dropwizardMetric, metricName);
+ super(dropwizardMetric, metricName.substring(metricName.indexOf(".") + 1));
Contributor

when you do that, a little comment like // chop off ___ would be really helpful to the reader

Contributor Author

Done!

Comment on lines 37 to 44
- coreName = cloudPattern.group(1);
- labels.put("core", cloudPattern.group(1));
- labels.put("collection", cloudPattern.group(2));
- labels.put("shard", cloudPattern.group(3));
- labels.put("replica", cloudPattern.group(4));
+ coreName = cloudPattern.group("core");
+ labels.put("core", cloudPattern.group("core"));
+ labels.put("collection", cloudPattern.group("collection"));
+ labels.put("shard", cloudPattern.group("shard"));
+ labels.put("replica", cloudPattern.group("replica"));
  } else if (standalonePattern.find()) {
- coreName = standalonePattern.group(1);
- labels.put("core", standalonePattern.group(1));
+ coreName = standalonePattern.group("core");
+ labels.put("core", standalonePattern.group("core"));
Contributor

I meant for you to do this generically with a loop, thus without the code here actually referring to any names. See Matcher.namedGroups().

Contributor Author

Gotcha. Moved them into loops so we can remove all the hard-coded .put() calls.
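
For readers following along, a generic loop of this kind might look like the sketch below. The pattern is a hypothetical stand-in, not the real CLOUD_CORE_PATTERN, and Matcher.namedGroups() requires Java 20 or newer.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: derive labels from whatever named groups the pattern declares.
final class NamedGroupLabelsSketch {

  // Assumed shape for illustration only: core_<collection>_<shard>_<replica>.<metric name>
  static final Pattern CLOUD_CORE_PATTERN_SKETCH =
      Pattern.compile(
          "^core_(?<core>(?<collection>[^_]+)_(?<shard>shard\\d+)_(?<replica>replica_[a-z]\\d+))\\.");

  static Map<String, String> labelsFor(String metricName) {
    Map<String, String> labels = new TreeMap<>();
    Matcher matcher = CLOUD_CORE_PATTERN_SKETCH.matcher(metricName);
    if (matcher.find()) {
      // namedGroups() maps each group name to its index; only the names are needed here.
      for (String groupName : matcher.namedGroups().keySet()) {
        labels.put(groupName, matcher.group(groupName));
      }
    }
    return labels;
  }
}
```

Because the loop never mentions a group name, adding or renaming a group in the pattern changes the labels without touching this code.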

Comment on lines 34 to 35
Matcher cloudPattern = CLOUD_CORE_PATTERN.matcher(metricName);
Matcher standalonePattern = STANDALONE_CORE_PATTERN.matcher(metricName);
Contributor

Shouldn't those patterns be defined here on SolrCoreMetric where they are used?

Contributor Author

Yeah good point. I feel like we discussed this on the PR when I introduced this feature... Or maybe not. Regardless, it should probably be in SolrCoreMetric, so I moved it.

Comment on lines +117 to +120
throw new SolrException(
SolrException.ErrorCode.SERVER_ERROR,
"Error occurred exporting Dropwizard Metric to Prometheus",
e);
Contributor Author

Also wanted to mention this. I was thinking of just removing the log.warn and actually throwing a SolrException. At the time I thought that not failing metrics entirely, and getting even partial metrics, was OK. But after some thought: throwing here would fail the whole metrics response, but it means export problems actually get caught by tests, or by a user who hits a bug. WDYT?
