Additional Thread Metrics #13483

akats7 · 2025-03-10T02:30:53Z

Is your feature request related to a problem? Please describe.

The current scope of thread metrics appears to be limited to thread count. Other thread-based metrics are rather critical, such as thread CPU time and metrics based on thread state.

Describe the solution you'd like

Add additional thread metrics for:

jvm.thread.cpu_time
jvm.thread.user_time

Describe alternatives you've considered

Using the JMX Gatherer

Additional context

No response

steverao · 2025-03-10T15:00:53Z

Could you clarify whether the metrics jvm.thread.blocked, jvm.thread.waiting and jvm.thread.timed_waiting represent the number of threads in the corresponding states?

akats7 · 2025-03-10T15:06:40Z

Yeah, that is what I meant, but I now see that there's a count metric emitted per state. In that case, just CPU and user time seem to be a gap.

trask · 2025-03-10T16:35:08Z

hi @akats7!

what attributes would you propose on jvm.thread.cpu_time / jvm.thread.user_time?

akats7 · 2025-03-10T17:17:07Z

Hey @trask,

I'd have to dig a bit into the internals of the runtime metric modules, but one approach could be to just support this for JMX.

akats7 · 2025-03-13T20:54:17Z

@trask can we add CPU time to runtime metrics through the thread MBean using ManagementFactory? Then we would rely on MBean operations to get the time values. I get that there may be cardinality concerns since thread name/pool name would have to be an attribute, so it can be disabled by default.

trask · 2025-03-13T22:13:02Z

@SylvainJuge @robsunday I'm hesitant for people to add new JMX metrics in the middle of your convergence effort, so would like to defer to you here

akats7 · 2025-03-14T00:23:34Z

Thanks @trask! I do want to point out that these are rather important metrics. We've had a lot of internal requests for this from users who are migrating from vendor products that supported this out of the box.

SylvainJuge · 2025-03-14T15:20:16Z

To expand a bit on the "convergence effort" context: we are currently trying to add JVM metrics in a YAML descriptor with #13392. This YAML will NOT be used directly by instrumentation, but will in the future be used by jmx-scraper, a CLI program that replaces JMX Gatherer while using the same JMX implementation as instrumentation (and thus inheriting its YAML support).

What we are currently focusing on for JVM metrics in YAML is the ability to capture them in a way that is compliant with semantic conventions, which the instrumentation/runtime-telemetry modules already do, but in code.

The instrumentation/runtime-telemetry modules use code and JMX listeners that can't be replicated with YAML, so some of the metrics we can capture with YAML can't be replicated exactly.
For example, with jvm.thread.count we can't capture the jvm.thread.state or jvm.thread.daemon attributes. In short, depending on how the metric is captured, we might or might not provide the expected details and attributes, which then makes those attributes optional/recommended only.

I think we can add new metrics even while the current work is still in progress. I would suggest doing that in a few steps:

- experiment with YAML to see if and how those could be captured remotely through JMX with jmx-scraper
- discuss their definition here
- add them to the runtime-telemetry modules to validate that we can capture them as expected
- contribute their definitions to semantic conventions as experimental (this could mean having to change the implementations done previously)
- add their semconv-compliant definitions to the jvm.yaml that is being added in #13392 (jmx add jvm metrics yaml), which will hopefully have been merged in the meantime

As a temporary workaround, if you are able to capture those with YAML configuration, you should be able to provide a YAML file for them. However, this is not a great OOTB experience, and it could easily break if the metric definition changes when it is added to semconv.

akats7 · 2025-03-14T15:35:12Z

Hey @SylvainJuge, thanks for the context. Part of the issue is that I believe the jmx-scraper is only able to scrape attributes and not execute operations, which would be required for these metrics.

Regarding the experimentation, I've already done this with the JMX Gatherer, since it allows you to interact directly with the MBeans if using a custom script. However, since the gatherer instruments also only allow the use of attributes, I had to rely on transformation closures to overwrite other MBeans, which is not ideal.

Also, if possible, part of this ask is to be able to move away from the remote approach. I might be missing something, but is there a reason it's preferable to interact with a JMX server vs. just scraping it directly, since the javaagent runs in the same JVM?

SylvainJuge · 2025-03-14T15:45:18Z

> Also, if possible, part of this ask is to be able to move away from the remote approach. I might be missing something, but is there a reason it's preferable to interact with a JMX server vs. just scraping it directly, since the javaagent runs in the same JVM?

Ideally, we should not force users to deploy an instrumentation agent to capture runtime metrics if those could be obtained externally with JMX scraper or gatherer.

However, we already have the case of some metrics that can't be captured without instrumentation and explicit code, as they can only be captured from within the JVM, either because they require advanced JMX features or because they rely on JFR events. So this is something we can do already, but it adds more constraints on users. For example, the JVM metrics are not exactly the same on Java 17 and Java 8, which could lead to user confusion or missed expectations.

If I understand correctly, those metrics would fall more in the "runtime-telemetry only" category and would be very unlikely to be supported through YAML because they need some post-processing. Is that correct?

Also, could you elaborate a bit on their definitions/attributes and on which MBean attributes they would be captured from?

akats7 · 2025-03-14T15:52:02Z

Yep, that's exactly right. For example, to get cpu_time we'd likely need to get the AllThreadIds attribute and then call getThreadCpuTime, plus getThreadInfo for attributes such as name.
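For illustration, here is a minimal sketch of that in-JVM approach using the standard java.lang.management API. The class name ThreadCpuTimeSketch and the choice to sum CPU time by thread name are assumptions made for this example only; they are not how runtime-telemetry is implemented.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any OpenTelemetry module.
class ThreadCpuTimeSketch {

  /** Returns CPU time in nanoseconds keyed by thread name; threads sharing a name are summed. */
  static Map<String, Long> cpuTimeByThreadName() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    Map<String, Long> result = new LinkedHashMap<>();
    // Per-thread CPU time is optional: a JVM may not support it, or it may be disabled.
    if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
      return result;
    }
    for (long id : threads.getAllThreadIds()) {
      long cpuNanos = threads.getThreadCpuTime(id); // -1 if the thread is no longer alive
      ThreadInfo info = threads.getThreadInfo(id); // null if the thread is no longer alive
      if (cpuNanos < 0 || info == null) {
        continue; // the thread ended between the two calls
      }
      result.merge(info.getThreadName(), cpuNanos, Long::sum);
    }
    return result;
  }

  public static void main(String[] args) {
    cpuTimeByThreadName().forEach((name, nanos) -> System.out.println(name + " " + nanos));
  }
}
```

ThreadMXBean.getThreadUserTime(long) could be read the same way for a user-time value. Keying by raw thread name is exactly what raises the cardinality concern mentioned earlier in the thread.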

And I understand that this utility should still exist for users who want these metrics but don't need the other functionality of the agent. But the situation we find ourselves in is that the majority of our teams that are already leveraging the agent for instrumentation also need these metrics, so it would be ideal to not have to configure a JMX server and an additional scraping process when the agent is already in place.

SylvainJuge · 2025-03-14T17:28:52Z

I agree with you @akats7, this is probably a use case that we could either document or provide a dedicated config option for: capturing only runtime metrics (or JMX metrics) and sending them to OTLP, without any instrumentation or tracing involved.

For JMX metrics that are defined in YAML, this could help provide details on JVM runtime metrics while still allowing the capture of metrics defined in YAML. For example, if you run a Kafka broker or cluster, it would be relevant to capture both by adding the agent to the JVM.

akats7 · 2025-03-14T18:44:16Z

So just to clarify, is there a path forward to add these as experimental out-of-the-box JVM metrics? I'd be happy to contribute this.

SylvainJuge · 2025-03-17T08:42:01Z

If those new metrics are only captured through code, their implementation is part of instrumentation/runtime-telemetry and would not have to be replicated with YAML at all, which makes things a bit simpler.

In order to add or change things in semconv, we need at least an experimental implementation to validate that what is being added to semconv is correct and technically achievable. That creates a kind of chicken-and-egg problem, and you have to work on both sides at the same time.

I would suggest doing the following:

- add the definition of any new experimental metrics to semconv in a draft PR
- add support for those in instrumentation in a draft PR; the implementation would be in instrumentation/runtime-telemetry and would likely require adding variants for Java 8 (JMX) and Java 17 (JFR/JMX)
- update the semconv draft PR to match the implementation, make it non-draft, and start discussing any details like metric names and attributes if needed; if it reuses existing attributes and fits existing metrics, I don't expect this to raise much discussion
- once the semconv PR is merged, update the instrumentation PR with any semconv changes, then mark it as ready for review; this last step should be quite quick, as most of the discussion should have happened in the semconv PR

akats7 · 2025-03-17T22:54:27Z

@SylvainJuge That sounds like a plan to me, thanks!

jack-berg · 2025-03-18T18:16:17Z

Here's a little prototype of how this might end up looking in the runtime-telemetry instrumentation library: jack-berg/opentelemetry-java-docs@c295ad4

Some things to note:

Proposing a metric named jvm.thread.time, with dimensions:
- cpu.mode - from the attribute registry, with reported values user and system
- thread.name_template - there is no semantic convention for this yet, but we need a way to group threads together to avoid unbounded cardinality, since it's entirely possible to continuously start and end new threads over an application's lifecycle

thread.name_template is evaluated by matching each thread name against a set of well-known thread naming patterns expressed as glob patterns. If no match is found, the catch-all value other is returned. I used glob patterns for convenience, but there may be a better way to express this. We would likely want to make the set of thread name templates configurable, but with a reasonable default set based on values we've seen in practice. The algorithm can be improved by caching the thread name matching results, but we need to be able to clear cache entries that haven't been seen in a while.
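To make the template matching concrete, here is a minimal sketch of resolving a thread name to a thread.name_template value. It is not the code from the linked prototype: the class name ThreadNameTemplates, the hand-rolled glob-to-regex translation (only `*` and `?` are supported), and the example patterns are all assumptions for illustration, and the caching mentioned above is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the prototype.
class ThreadNameTemplates {
  private final List<String> globs = new ArrayList<>();
  private final List<Pattern> patterns = new ArrayList<>();

  ThreadNameTemplates(List<String> globPatterns) {
    for (String glob : globPatterns) {
      globs.add(glob);
      patterns.add(Pattern.compile(toRegex(glob), Pattern.DOTALL));
    }
  }

  /** Translates a glob where '*' matches any run of characters and '?' matches one character. */
  static String toRegex(String glob) {
    StringBuilder regex = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*') {
        regex.append(".*");
      } else if (c == '?') {
        regex.append('.');
      } else {
        // Quote everything else so characters like '.' or '[' are matched literally.
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return regex.toString();
  }

  /** Returns the first glob that matches the thread name, or "other" when none does. */
  String templateFor(String threadName) {
    for (int i = 0; i < patterns.size(); i++) {
      if (patterns.get(i).matcher(threadName).matches()) {
        return globs.get(i);
      }
    }
    return "other";
  }
}
```

With the patterns http-nio-\*-exec-\* and pool-\*-thread-\*, the name pool-3-thread-7 resolves to pool-\*-thread-\* and main resolves to other. Patterns are checked in order, so more specific globs would need to be listed first.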

SylvainJuge · 2025-03-19T16:08:49Z

This looks nice. Should we make the attributes optional and only recommended?

- if there is another way to capture the same aggregated value without any dimension through JMX, it could be something we capture without attributes from outside of the JVM, but this is just a bonus.
- for thread.name_template, I expect it might raise lots of questions on how to add new values, so maybe we could start without it and add it later. Also, I haven't tried, but could we leverage thread group identity to provide a single value here?

jack-berg · 2025-03-19T23:01:38Z

> This looks nice. Should we make the attributes optional and only recommended?

Could be recommended, or even opt-in, which would allow capture from JMX. But for what it's worth, this data is probably most useful with both of these attributes. Without the attributes, I would imagine it's not terribly far off from jvm.cpu.time, though I'm not sure.

From a configuration standpoint, maybe the registration accepts an optional Function<String, String> which is responsible for grouping thread names into thread.name_template. We could provide one or more options out of the box, allow easy env var / system property configuration to extend the built in patterns, or allow the user to completely override the logic with their own.

Good configuration options seem like the answer to controversy over how we group thread names into templates.

> Also, I haven't tried, but could we leverage thread group identity to provide a single value here?

I haven't found a way to access a thread group name for a given thread id, but if available this would be a good thing to look into.

akats7 · 2025-03-20T00:50:58Z

I think any out-of-the-box naming options would need to be pretty extensive to be valuable. I'm not exactly sure what is deemed acceptable cardinality, but for threads tied to common frameworks/tools maybe we can get away with just using thread names and escaping the numerical values, and setting anything that does not match to custom.
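A minimal sketch of that heuristic, written as the kind of Function<String, String> suggested earlier in the thread. The class name NumericThreadNameNormalizer, the `*` placeholder for digit runs, and the allow-list of known templates are assumptions for illustration only.

```java
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical grouping function: collapses digit runs, then falls back to "custom".
class NumericThreadNameNormalizer implements Function<String, String> {
  private static final Pattern DIGIT_RUN = Pattern.compile("\\d+");
  private final Set<String> knownTemplates;

  NumericThreadNameNormalizer(Set<String> knownTemplates) {
    this.knownTemplates = knownTemplates;
  }

  @Override
  public String apply(String threadName) {
    // "pool-12-thread-3" becomes "pool-*-thread-*"
    String template = DIGIT_RUN.matcher(threadName).replaceAll("*");
    // Only templates on the allow-list are reported, which keeps cardinality bounded.
    return knownTemplates.contains(template) ? template : "custom";
  }
}
```

The allow-list is what bounds cardinality here; without it, any application-defined thread name that contains no digits would still become its own attribute value.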

SylvainJuge · 2025-03-20T08:33:12Z

I am not sure if it's a good idea, but for grouping threads, I wonder if we could maybe instrument all ThreadFactory instances where the thread name is being set.

By default we could provide the configuration through a Map<String,String> where the key is the ThreadFactory FQN and the value is the otel attribute value to use. This would allow adding a dedicated value for each thread factory FQN. Also, finding implementations of ThreadFactory in publicly available OSS code could be easier than collecting all the thread name patterns.
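As a rough sketch of just the lookup part of this idea (the bytecode instrumentation of ThreadFactory is not shown), the map could be consulted as below. The class name ThreadFactoryLabels and the "other" fallback are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ThreadFactory;

// Hypothetical lookup from a ThreadFactory's fully qualified class name to an attribute value.
class ThreadFactoryLabels {
  private final Map<String, String> labelsByFactoryClass;

  ThreadFactoryLabels(Map<String, String> labelsByFactoryClass) {
    this.labelsByFactoryClass = labelsByFactoryClass;
  }

  /** Returns the configured label for the factory's class, or "other" if none is configured. */
  String labelFor(ThreadFactory factory) {
    return labelsByFactoryClass.getOrDefault(factory.getClass().getName(), "other");
  }
}
```

Note that factories written as lambdas or anonymous classes have synthetic class names, so a lookup by FQN would likely only be practical for named ThreadFactory implementations.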

In addition to that, some known ThreadFactory FQNs would be handled through explicit code. For example, the thread factories of Tomcat or other application servers would be included and could use different values.

Ideally this would be something extensible, where each instrumentation module could register its own known FQNs or naming strategies, to avoid having to maintain one potentially humongous list of known patterns that no one will ever attempt to remove things from.

Also, this would be more "labelling threads" rather than grouping them by pattern, which would then require some specification and description of the individual values. One advantage of the patterns is that it's quite easy to know which threads they refer to.

jack-berg · 2025-03-20T14:32:46Z

> maybe we can get away with just using thread names and escaping the numerical values

I bet that would be a pretty good heuristic.

> I wonder if we could maybe instrument all ThreadFactory instances where the thread name is being set.

I think that could work, but the downside is that it's only available in the context of the Java agent. Other JVM runtime metrics are available as standalone library instrumentation. It could be something where the standalone library instrumentation does something more simplistic, and the Java agent improves upon the strategy using techniques only available with bytecode instrumentation.

akats7 added the enhancement and needs triage labels · Mar 10, 2025

steverao added the needs author feedback label and removed the needs triage label · Mar 10, 2025

github-actions bot removed the needs author feedback label · Mar 10, 2025
