The OpenTelemetry Java agent instruments your application by running inside the same Java Virtual Machine (JVM). Like any other software agent, the Java agent consumes system resources such as CPU, memory, and network bandwidth. This resource consumption is called agent overhead or performance overhead. The OpenTelemetry Java agent has minimal impact on system performance when instrumenting JVM applications, although the final agent overhead depends on multiple factors.
Some factors that might increase agent overhead are environmental, such as the physical machine architecture, CPU frequency, amount and speed of memory, system temperature, and resource contention. Other factors include virtualization and containerization, the operating system and its libraries, the JVM version and vendor, JVM settings, the algorithmic design of the software being monitored, and software dependencies.
Due to the complexity of modern software and the broad diversity in deployment scenarios, it is impossible to come up with a single agent overhead estimate. To find the overhead of any instrumentation agent in a given deployment, you have to conduct experiments and collect measurements directly. Therefore, treat all statements about performance as general information and guidelines that are subject to evaluation in a specific system.
The following sections describe the minimum requirements of the OpenTelemetry Java agent, as well as potential constraints impacting performance, and guidelines to optimize and troubleshoot the performance of the agent.
The following best practices and techniques might help reduce overhead caused by the Java agent.
The volume of spans processed by the instrumentation might impact agent overhead. You can configure trace sampling to adjust the span volume and reduce resource usage. See Sampling.
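As a sketch, the following JVM options configure a parent-based, trace-ID-ratio sampler that keeps roughly 10% of traces; the 0.1 ratio is an illustrative assumption and should be tuned to your span volume:

```shell
-Dotel.traces.sampler=parentbased_traceidratio
-Dotel.traces.sampler.arg=0.1
```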
You can further reduce agent overhead by turning off instrumentations that aren’t needed or are producing too many spans. To turn off an instrumentation, use the `-Dotel.instrumentation.<name>.enabled=false` system property or the `OTEL_INSTRUMENTATION_<NAME>_ENABLED` environment variable, where `<name>` is the name of the instrumentation.

For example, the following option turns off the JDBC instrumentation:

```shell
-Dotel.instrumentation.jdbc.enabled=false
```
Increasing the maximum heap size of the JVM using the `-Xmx<size>` option might help alleviate agent overhead issues, as instrumentations can generate a large number of short-lived objects in memory.
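For example, assuming an application packaged as `app.jar` and the agent jar in the working directory (both file names are illustrative), a larger heap can be set alongside the agent:

```shell
java -javaagent:opentelemetry-javaagent.jar -Xmx2g -jar app.jar
```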
Too much manual instrumentation might introduce inefficiencies that increase agent overhead. For example, using `@WithSpan` on every method results in a high span volume, which in turn increases noise in the data and consumes more system resources.
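The following sketch illustrates selective annotation. The `@WithSpan` annotation is redeclared locally as a stand-in for `io.opentelemetry.instrumentation.annotations.WithSpan` so the example compiles without the OpenTelemetry dependency; the class and methods are hypothetical:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class OrderService {
    // Stand-in for io.opentelemetry.instrumentation.annotations.WithSpan.
    @Retention(RetentionPolicy.RUNTIME)
    @interface WithSpan {}

    // Good candidate: one span for a high-level business operation.
    @WithSpan
    public String processOrder(String orderId) {
        validate(orderId);
        return persist(orderId);
    }

    // Poor candidates: annotating trivial helpers like these would
    // multiply span volume without adding useful trace detail.
    private void validate(String orderId) { /* ... */ }

    private String persist(String orderId) {
        return "stored:" + orderId;
    }
}
```

Annotating only the outer operation keeps one meaningful span per request instead of one span per helper call.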
Make sure to provision enough resources for your instrumentation and for the Collector. The amount of resources, such as memory or disk, depends on your application architecture and needs. For example, a common setup is to run the instrumented application on the same host as the OpenTelemetry Collector. In that case, consider rightsizing the resources for the Collector and optimizing its settings. See Scaling.
In general, the more telemetry you collect from your application, the greater the impact on agent overhead. For example, tracing methods that aren’t relevant to your application can still produce considerable agent overhead because tracing such methods is computationally more expensive than running the method itself. Similarly, high cardinality tags in metrics might increase memory usage. Debug logging, if turned on, also increases write operations to disk and memory usage.
Some instrumentations, for example JDBC or Redis, produce high span volumes that increase agent overhead. For more information on how to turn off unnecessary instrumentations, see Turn off specific instrumentations.
> [!NOTE]
> Experimental features of the Java agent might increase agent overhead due to the experimental focus on functionality over performance. Stable features are safer in terms of agent overhead.
When troubleshooting agent overhead issues, do the following:
Consider taking the following actions to decrease agent overhead:
Measuring agent overhead in your own environment and deployments provides accurate data about the impact of instrumentation on the performance of your application or service. The following guidelines describe the general steps for collecting and comparing reliable agent overhead measurements.
Different users of your application or service might notice different aspects of agent overhead. For example, while end users might notice degradation in service latency, power users with heavy workloads pay more attention to CPU overhead. On the other hand, users who deploy frequently, for example due to elastic workloads, care more about startup time.
Reduce your measurements to factors that are sure to impact user experience, so your datasets don’t contain irrelevant information. Some examples of measurements include the following:
By measuring agent overhead in a controlled test environment you can better identify the factors affecting performance. When preparing a test environment, complete the following:
Design the tests that you run against the test environment to resemble typical workloads as much as possible. For example, if some REST API endpoints of your service are susceptible to high request volumes, create a test that simulates heavy network traffic.
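As one illustration, a load-testing tool such as `wrk` (assumed to be installed; the endpoint and parameters are hypothetical) can simulate sustained traffic against a busy endpoint:

```shell
# 4 threads, 100 open connections, for 60 seconds
wrk -t4 -c100 -d60s http://localhost:8080/api/orders
```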
For Java applications, use a warm-up phase prior to starting measurements. The JVM is a highly dynamic machine that performs a large number of optimizations through just-in-time compilation (JIT). The warm-up phase helps the application to finish most of its class loading and gives the JIT compiler time to run the majority of optimizations.
Make sure to run a large number of requests and to repeat the test pass many times. This repetition helps to ensure a representative data sample. Include error scenarios in your test data. Simulate an error rate similar to that of a normal workload, typically between 2% and 10%.
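The warm-up phase and repeated measurement passes described above can be sketched as follows. The workload method is a placeholder; substitute a call into the service under test:

```java
import java.util.ArrayList;
import java.util.List;

public class OverheadBenchmark {
    // Placeholder workload; replace with a request to the service under test.
    static long workload() {
        long acc = 0;
        for (int i = 0; i < 10_000; i++) {
            acc += i;
        }
        return acc;
    }

    // Runs warmupIters unmeasured iterations so class loading and JIT
    // compilation settle, then records per-call latencies in nanoseconds.
    static List<Long> measure(int warmupIters, int measureIters) {
        for (int i = 0; i < warmupIters; i++) {
            workload();
        }
        List<Long> nanos = new ArrayList<>();
        for (int i = 0; i < measureIters; i++) {
            long start = System.nanoTime();
            workload();
            nanos.add(System.nanoTime() - start);
        }
        return nanos;
    }
}
```

Running `measure` once with the agent attached and once without, across many passes, yields two latency samples you can compare.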
To identify which factors might be affecting performance and causing agent overhead, collect measurements in the same environment after modifying a single factor or condition.
After collecting data from multiple passes, you can plot results in a chart or compare averages using statistical tests to check for significant differences.
Consider that different stacks, applications, and environments might result in different operational characteristics and different agent overhead measurement results.