Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply allowlist filter on metricnames with labels #5

Open
mimaison opened this issue Jan 30, 2024 · 11 comments
Open

Apply allowlist filter on metricnames with labels #5

mimaison opened this issue Jan 30, 2024 · 11 comments

Comments

@mimaison
Copy link
Contributor

At the moment, the allowlist is applied on the metric names without the labels. We should allow filtering metrics with specific labels.

@OwenCorrigan76 OwenCorrigan76 self-assigned this Mar 20, 2024
@OwenCorrigan76 OwenCorrigan76 removed their assignment Jun 7, 2024
@mimaison
Copy link
Contributor Author

Currently the allowlist configuration is just applied on the metric names and ignores labels. I think it's important to allow filtering on labels too.

The main use cases where this is useful are:

  • Requests metrics: When investigating issues you often want fine grained requests metrics but typically only for a specific subset of requests. For example for consumer issues, you'd only enable metrics for JOIN_GROUP and HEARTBEAT requests and not on the whole ~70 requests types.
  • Topic metrics: In most cases users track the overall throughput across all topics. When investigating issues, it's important to get per topic throughput metrics.

The allowlist configuration accepts a list of regex patterns that are matched against the metrics in the Prometheus format. In the Prometheus format, labels are appended after the metric name between curly brackets, and separated by commas. For example:

kafka_network_requestmetrics_localtimems{request="CreateDelegationToken",quantile="0.999",}

This brings some challenges if we want to also filter labels:

  • Curly brackets have a special meaning in regex so there's potentially a conflict.
  • Labels are ordered arbitrarily (the order is fixed). As a user I'd like to specify labels in any order and even be able to omit labels I'm not interested in filtering on.
  • Prometheus injects a trailing commas after the last label. I think we shouldn't require users to set it in their filters.

With these issues, I see 2 potential implementations that solve all issues:

  1. Treat curly bracket in a special way and exclude them from the regex logic. For example, if an entry in the allowlist is some_metric_name{mylabel="my-value"}, the reporter will first extract and parse the text in between curly brackets. It will then only apply some_metric_name as a regex to filter metrics, and finally only keep metrics that have the labels provided.

    • Pros: Configuration stays inline. This will simplify making it dynamically reconfigurable at runtime in the future.
    • Cons: Users are not able to use {} for regex matching.
  2. Change the way users provide their metrics filter. One option is to do it via a configuration file. For example in YAML:

    - some_metric_name
      labels:
        - mylabel: "my-value"
    • Pros: Users can use complex regex including {}. YAML allows pretty much any characters if a key is surrounded by quotes.
    • Cons: Need to provide an additional file.

@k-wall @OwenCorrigan76 @scholzj @tombentley

  • Can you think of any other challenges in implementing this feature?
  • Do you have a preference in the 2 alternatives I described above? or propose something else?

@antonio-pedro99
Copy link
Member

antonio-pedro99 commented Jun 22, 2024

@mimaison I think [2] would be enough to handle also other label-matching operators. Therefore, [2] would be the way to go. But few things to take into consideration if we are extracting and then parsing the text inside the {}:

  • there are other label matching operators other than =
  • labels can also be regex, like, http_requests_total{replica!="rep-a",replica=~"rep.*"}. In the previous example, replica should not match rep-a, and should regex match "rep.*"(anything starting with rep."
  • {name="something"} is a valid filter, are we considering it here?

@mimaison
Copy link
Contributor Author

The goal is not to re-implement PromQL. We just need a basic filtering mechanism that supports labels too. I think we should really keep it simple as I see the extreme configurability of jmx_exporter one of its drawbacks. It allows writing filters that are hard to understand and ultimately expensive to compute making it pretty slow if you enable too many metrics.

@k-wall
Copy link

k-wall commented Jun 24, 2024

Do we really need this feature in the reporter? Do we really need it right now? Prometheus already has the ability to drop metrics at ingress time, so if the volume of metrics coming from the reporter was truly bothersome for a user, they'd already have that knob to twiddle. (Aside: doesn't this go beyond the scope of the original Strimzi Proposal?)

I'd say let's wait on this one (YAGNI). If we get user feedback saying this is important, then we can think then with a fresh proposal.

@antonio-pedro99
Copy link
Member

The goal is not to re-implement PromQL. We just need a basic filtering mechanism that supports labels too. I think we should really keep it simple as I see the extreme configurability of jmx_exporter one of its drawbacks. It allows writing filters that are hard to understand and ultimately expensive to compute making it pretty slow if you enable too many metrics.

I agree with you.

@antonio-pedro99
Copy link
Member

@mimaison what's the update of this?
Are we following @k-wall's suggestion? Otherwise it would be interesting to pickup this issue

@mimaison
Copy link
Contributor Author

mimaison commented Jul 1, 2024

@antonio-pedro99 Thanks for volunteering, however the first step is to decide what to do. If you have opinions on the various points being discussed, please share them.

@mimaison
Copy link
Contributor Author

mimaison commented Jul 1, 2024

Following the update of the Prometheus client 4c1fd9e, a trailing commas is no longer injected after the last label.

I think it would be worth checking whether we could change the regex separator (maybe to ;) and see how a filter with labels (so escaping the various conflicting characters) would look like.

@mimaison
Copy link
Contributor Author

I spent some time investigating our various options. The "best" (least worst?) solution I found was to allow configuring the separator for the allowlist and have logic to mimic the Prometheus metric name serialization to compute the expected final name with labels.

Here is a PoC: https://github.com/strimzi/metrics-reporter/compare/main...mimaison:metrics-reporter:label_filter?expand=1

This allows users to include labels in their allowlist. For example using the following allowlist:

prometheus.metrics.reporter.allowlist=kafka_server_kafka_network_requestmetrics_totaltimems\\{request="(Produce|Fetch)".*

I only get these metrics:

# TYPE kafka_server_kafka_network_requestmetrics_totaltimems summary
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.5"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.75"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.95"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.98"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.99"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.999"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems_count{request="Fetch"} 0
kafka_server_kafka_network_requestmetrics_totaltimems_sum{request="Fetch"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.5"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.75"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.95"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.98"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.99"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.999"} 0.0
kafka_server_kafka_network_requestmetrics_totaltimems_count{request="Produce"} 0
kafka_server_kafka_network_requestmetrics_totaltimems_sum{request="Produce"} 0.0

In the end, it's ok but not a great. The main issues are:

  • the allowlist syntax can easily get messy as you need to escape regex characters. You may need to change the separator if you want to use it in one of the label values.
  • we have to mimic the Prometheus text format logic. I went for a very basic implementation which you can does not take into account suffixes (kafka_server_kafka_network_requestmetrics_totaltimems_count{request="Produce"} does not technically match the allowlist)

Also while doing this work I identified #30

@k-wall
Copy link

k-wall commented Jul 19, 2024

prometheus.metrics.reporter.allowlist=kafka_server_kafka_network_requestmetrics_totaltimems\{request="(Produce|Fetch)".*

is label order in the serialised form of the metric something that's guaranteed?

@mimaison
Copy link
Contributor Author

mimaison commented Jul 19, 2024

Labels are sorted by name. This is mentioned in the javadoc: https://prometheus.github.io/client_java/api/io/prometheus/metrics/model/snapshots/Labels.html

I assume the exposition formats preserve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants