Internal telemetry

You can inspect the health of any OpenTelemetry Collector instance by checking its own internal telemetry. Read on to learn about this telemetry and how to configure it to help you monitor and troubleshoot the Collector.

Activate internal telemetry in the Collector

By default, the Collector exposes its own telemetry in two ways:

  • Internal metrics are exposed using a Prometheus interface which defaults to port 8888.
  • Logs are emitted to stderr by default.

Configure internal metrics

You can configure how internal metrics are generated and exposed by the Collector. By default, the Collector generates basic metrics about itself and exposes them using the OpenTelemetry Go Prometheus exporter for scraping at http://127.0.0.1:8888/metrics. You can expose the endpoint to one specific or all network interfaces when needed. For containerized environments, you might want to expose this port on a public interface.

Set the Prometheus config under service::telemetry::metrics:

service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888

You can adjust the verbosity of the Collector metrics output by setting the level field to one of the following values:

  • none: no telemetry is collected.
  • basic: essential service telemetry.
  • normal: the default level, adds standard indicators on top of basic.
  • detailed: the most verbose level, includes dimensions and views.

Each verbosity level represents a threshold at which certain metrics are emitted. For the complete list of metrics, with a breakdown by level, see Lists of internal metrics.

The default level for metrics output is normal. To use another level, set service::telemetry::metrics::level:

service:
  telemetry:
    metrics:
      level: detailed

Configure internal logs

Log output is found in stderr. You can configure logs in the config service::telemetry::logs. The configuration options are:

Field nameDefault valueDescription
levelINFOSets the minimum enabled logging level. Other possible values are DEBUG, WARN, and ERROR.
developmentfalsePuts the logger in development mode.
encodingconsoleSets the logger’s encoding. The other possible value is json.
disable_callerfalseStops annotating logs with the calling function’s file name and line number. By default, all logs are annotated.
disable_stacktracefalseDisables automatic stacktrace capturing. Stacktraces are captured for logs at WARN level and above in development and at ERROR level and above in production.
sampling::enabledtrueSets a sampling policy.
sampling::tick10sThe interval in seconds that the logger applies to each sampling.
sampling::initial10The number of messages logged at the start of each sampling::tick.
sampling::thereafter100Sets the sampling policy for subsequent messages after sampling::initial messages are logged. When sampling::thereafter is set to N, every Nth message is logged and all others are dropped. If N is zero, the logger drops all messages after sampling::initial messages are logged.
output_paths["stderr"]A list of URLs or file paths to write logging output to.
error_output_paths["stderr"]A list of URLs or file paths to write logger errors to.
initial_fieldsA collection of static key-value pairs added to all log entries to enrich logging context. By default, there is no initial field.

You can also see logs for the Collector on a Linux systemd system using journalctl:

journalctl | grep otelcol
journalctl | grep otelcol | grep Error

The following configuration can be used to emit internal logs from the Collector to an OTLP/gRPC backend:

service:
  telemetry:
    logs:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend:4317

Configure internal traces

The Collector does not expose traces by default, but it can be configured to.

The following configuration can be used to emit internal traces from the Collector to an OTLP/gRPC backend:

service:
  telemetry:
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend:4317

See the example configuration for additional options. Note that the tracer_provider section there corresponds to traces here.

Self-monitoring

The Collector can be configured to push its own telemetry to an OTLP receiver and send the data through configured pipelines. In the following example, the Collector is configured to push metrics, traces, and logs every 10s using OTLP gRPC to localhost:14317:

receivers:
  otlp/internal:
    protocols:
      grpc:
        endpoint: localhost:14317
exporters:
  debug:
service:
  pipelines:
    metrics:
      receivers: [otlp/internal]
      exporters: [debug]
    traces:
      receivers: [otlp/internal]
      exporters: [debug]
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 10000
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: http://localhost:14317
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: http://localhost:14317
    logs:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: http://localhost:14317

Types of internal telemetry

The OpenTelemetry Collector aims to be a model of observable service by clearly exposing its own operational metrics. Additionally, it collects host resource metrics that can help you understand if problems are caused by a different process on the same host. Specific components of the Collector can also emit their own custom telemetry. In this section, you will learn about the different types of observability emitted by the Collector itself.

Summary of values observable with internal metrics

The Collector emits internal metrics for at least the following values:

  • Process uptime and CPU time since start.
  • Process memory and heap usage.
  • For receivers: Items accepted and refused, per data type.
  • For processors: Incoming and outgoing items.
  • For exporters: Items the exporter sent, failed to enqueue, and failed to send, per data type.
  • For exporters: Queue size and capacity.
  • Count, duration, and size of HTTP/gRPC requests and responses.

A more detailed list is available in the following sections.

Lists of internal metrics

The following tables group each internal metric by level of verbosity: basic, normal, and detailed. Each metric is identified by name and description and categorized by instrumentation type.

basic-level metrics

Metric nameDescriptionType
otelcol_exporter_enqueue_failed_
log_records
Number of logs that exporter(s) failed to enqueue.Counter
otelcol_exporter_enqueue_failed_
metric_points
Number of metric points that exporter(s) failed to enqueue.Counter
otelcol_exporter_enqueue_failed_
spans
Number of spans that exporter(s) failed to enqueue.Counter
otelcol_exporter_queue_capacityFixed capacity of the sending queue, in batches.Gauge
otelcol_exporter_queue_sizeCurrent size of the sending queue, in batches.Gauge
otelcol_exporter_send_failed_
log_records
Number of logs that exporter(s) failed to send to destination.Counter
otelcol_exporter_send_failed_
metric_points
Number of metric points that exporter(s) failed to send to destination.Counter
otelcol_exporter_send_failed_
spans
Number of spans that exporter(s) failed to send to destination.Counter
otelcol_exporter_sent_log_recordsNumber of logs successfully sent to destination.Counter
otelcol_exporter_sent_metric_pointsNumber of metric points successfully sent to destination.Counter
otelcol_exporter_sent_spansNumber of spans successfully sent to destination.Counter
otelcol_process_cpu_secondsTotal CPU user and system time in seconds.Counter
otelcol_process_memory_rssTotal physical memory (resident set size) in bytes.Gauge
otelcol_process_runtime_heap_
alloc_bytes
Bytes of allocated heap objects (see ‘go doc runtime.MemStats.HeapAlloc’).Gauge
otelcol_process_runtime_total_
alloc_bytes
Cumulative bytes allocated for heap objects (see ‘go doc runtime.MemStats.TotalAlloc’).Counter
otelcol_process_runtime_total_
sys_memory_bytes
Total bytes of memory obtained from the OS (see ‘go doc runtime.MemStats.Sys’).Gauge
otelcol_process_uptimeUptime of the process in seconds.Counter
otelcol_processor_batch_batch_
send_size
Number of units in the batch that was sent.Histogram
otelcol_processor_batch_batch_size_
trigger_send
Number of times the batch was sent due to a size trigger.Counter
otelcol_processor_batch_metadata_
cardinality
Number of distinct metadata value combinations being processed.Counter
otelcol_processor_batch_timeout_
trigger_send
Number of times the batch was sent due to a timeout trigger.Counter
otelcol_processor_incoming_itemsNumber of items passed to the processor.Counter
otelcol_processor_outgoing_itemsNumber of items emitted from the processor.Counter
otelcol_receiver_accepted_
log_records
Number of logs successfully ingested and pushed into the pipeline.Counter
otelcol_receiver_accepted_
metric_points
Number of metric points successfully ingested and pushed into the pipeline.Counter
otelcol_receiver_accepted_spansNumber of spans successfully ingested and pushed into the pipeline.Counter
otelcol_receiver_refused_
log_records
Number of logs that could not be pushed into the pipeline.Counter
otelcol_receiver_refused_
metric_points
Number of metric points that could not be pushed into the pipeline.Counter
otelcol_receiver_refused_spansNumber of spans that could not be pushed into the pipeline.Counter
otelcol_scraper_errored_
metric_points
Number of metric points the Collector failed to scrape.Counter
otelcol_scraper_scraped_
metric_points
Number of metric points scraped by the Collector.Counter

Additional normal-level metrics

There are currently no metrics specific to normal verbosity.

Additional detailed-level metrics

Metric nameDescriptionType
http_client_active_requestsNumber of active HTTP client requests.Counter
http_client_connection_durationMeasures the duration of the successfully established outbound HTTP connections.Histogram
http_client_open_connectionsNumber of outbound HTTP connections that are active or idle on the client.Counter
http_client_request_sizeMeasures the size of HTTP client request bodies.Counter
http_client_durationMeasures the duration of HTTP client requests.Histogram
http_client_response_sizeMeasures the size of HTTP client response bodies.Counter
http_server_active_requestsNumber of active HTTP server requests.Counter
http_server_request_sizeMeasures the size of HTTP server request bodies.Counter
http_server_durationMeasures the duration of HTTP server requests.Histogram
http_server_response_sizeMeasures the size of HTTP server response bodies.Counter
otelcol_processor_batch_batch_
send_size_bytes
Number of bytes in the batch that was sent.Histogram
rpc_client_durationMeasures the duration of outbound RPC.Histogram
rpc_client_request_sizeMeasures the size of RPC request messages (uncompressed).Histogram
rpc_client_requests_per_rpcMeasures the number of messages received per RPC. Should be 1 for all non-streaming RPCs.Histogram
rpc_client_response_sizeMeasures the size of RPC response messages (uncompressed).Histogram
rpc_client_responses_per_rpcMeasures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs.Histogram
rpc_server_durationMeasures the duration of inbound RPC.Histogram
rpc_server_request_sizeMeasures the size of RPC request messages (uncompressed).Histogram
rpc_server_requests_per_rpcMeasures the number of messages received per RPC. Should be 1 for all non-streaming RPCs.Histogram
rpc_server_response_sizeMeasures the size of RPC response messages (uncompressed).Histogram
rpc_server_responses_per_rpcMeasures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs.Histogram

Events observable with internal logs

The Collector logs the following internal events:

  • A Collector instance starts or stops.
  • Data dropping begins due to throttling for a specified reason, such as local saturation, downstream saturation, downstream unavailable, etc.
  • Data dropping due to throttling stops.
  • Data dropping begins due to invalid data. A sample of the invalid data is included.
  • Data dropping due to invalid data stops.
  • A crash is detected, differentiated from a clean stop. Crash data is included if available.

Telemetry maturity levels

Traces

Tracing instrumentation is still under active development, and changes might be made to span names, attached attributes, instrumented endpoints, or other aspects of the telemetry. Until this feature graduates to stable, there are no guarantees of backwards compatibility for tracing instrumentation.

Metrics

The Collector’s metrics follow a four-stage lifecycle:

Alpha metric → Stable metric → Deprecated metric → Deleted metric

Alpha

Alpha metrics have no stability guarantees. These metrics can be modified or deleted at any time.

Stable

Stable metrics are guaranteed to not change. This means:

  • A stable metric without a deprecated signature will not be deleted or renamed.
  • A stable metric’s type and attributes will not be modified.

Deprecated

Deprecated metrics are slated for deletion but are still available for use. The description of these metrics include an annotation about the version in which they became deprecated. For example:

Before deprecation:

# HELP otelcol_exporter_queue_size this counts things
# TYPE otelcol_exporter_queue_size counter
otelcol_exporter_queue_size 0

After deprecation:

# HELP otelcol_exporter_queue_size (Deprecated since 1.15.0) this counts things
# TYPE otelcol_exporter_queue_size counter
otelcol_exporter_queue_size 0

Deleted

Deleted metrics are no longer published and cannot be used.

Logs

Individual log entries and their formatting might change from one release to the next. There are no stability guarantees at this time.

Use internal telemetry to monitor the Collector

This section recommends best practices for monitoring the Collector using its own telemetry.

Critical monitoring

Data loss

Use the rate of otelcol_processor_dropped_log_records > 0, otelcol_processor_dropped_spans > 0, and otelcol_processor_dropped_metric_points > 0 to detect data loss. Depending on your project’s requirements, select a narrow time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages.

Secondary monitoring

Queue length

Most exporters provide a queue and/or retry mechanism that is recommended for use in any production deployment of the Collector.

The otelcol_exporter_queue_capacity metric indicates the capacity, in batches, of the sending queue. The otelcol_exporter_queue_size metric indicates the current size of the sending queue. Use these two metrics to check if the queue capacity can support your workload.

Using the following three metrics, you can identify the number of spans, metric points, and log records that failed to reach the sending queue:

  • otelcol_exporter_enqueue_failed_spans
  • otelcol_exporter_enqueue_failed_metric_points
  • otelcol_exporter_enqueue_failed_log_records

These failures could be caused by a queue filled with unsettled elements. You might need to decrease your sending rate or horizontally scale Collectors.

The queue or retry mechanism also supports logging for monitoring. Check the logs for messages such as Dropping data because sending_queue is full.

Receive failures

Sustained rates of otelcol_receiver_refused_log_records, otelcol_receiver_refused_spans, and otelcol_receiver_refused_metric_points indicate that too many errors were returned to clients. Depending on the deployment and the clients’ resilience, this might indicate clients’ data loss.

Sustained rates of otelcol_exporter_send_failed_log_records, otelcol_exporter_send_failed_spans, and otelcol_exporter_send_failed_metric_points indicate that the Collector is not able to export data as expected. These metrics do not inherently imply data loss since there could be retries. But a high rate of failures could indicate issues with the network or backend receiving the data.

Data flow

You can monitor data ingress with the otelcol_receiver_accepted_log_records, otelcol_receiver_accepted_spans, and otelcol_receiver_accepted_metric_points metrics and data egress with the otelcol_exporter_sent_log_records, otelcol_exporter_sent_spans, and otelcol_exporter_sent_metric_points metrics.