With traces, you can observe requests as they move from one service to another in a distributed system. Tracing is highly practical for both high-level and in-depth analysis of systems.
However, if the large majority of your requests are successful and finish with acceptable latency and no errors, you do not need 100% of your traces to meaningfully observe your applications and systems. You just need the right sampling.
It’s important to use consistent terminology when discussing sampling. A trace or span is considered “sampled” or “not sampled”:

- **Sampled**: A trace or span is processed and exported. Because it is chosen by the sampler as a representative of the population, it is considered “sampled”.
- **Not sampled**: A trace or span is not processed or exported. Because it is not chosen by the sampler, it is considered “not sampled”.
Sometimes, the definitions of these terms get mixed up. You might find someone stating that they are “sampling out data”, or that data not processed or exported is considered “sampled”. These are incorrect statements.
Sampling is one of the most effective ways to reduce the costs of observability without losing visibility. Although there are other ways to lower costs, such as filtering or aggregating data, these other methods do not adhere to the concept of representativeness, which is crucial when performing in-depth analysis of application or system behavior.
Representativeness is the principle that a smaller group can accurately represent a larger group. Additionally, representativeness can be mathematically verified, meaning that you can have high confidence that a smaller sample of data accurately represents the larger group.
Additionally, the more data you generate, the less data you actually need to have a representative sample. For high-volume systems, it is quite common for a sampling rate of 1% or lower to very accurately represent the other 99% of data.
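You can verify this claim for yourself with a quick simulation. The following sketch (the request count and error rate are arbitrary, illustrative numbers) keeps a 1% random sample of a simulated population and compares the sample's error rate to the true rate:

```python
import random

random.seed(42)

# Simulate 1,000,000 requests where roughly 2% end in an error.
population = [random.random() < 0.02 for _ in range(1_000_000)]

# Keep a 1% uniform random sample of those requests.
sample = [outcome for outcome in population if random.random() < 0.01]

population_rate = sum(population) / len(population)
sample_rate = sum(sample) / len(sample)

print(f"population error rate: {population_rate:.4f}")
print(f"sample error rate:     {sample_rate:.4f}")
```

Even though the sample holds only about 10,000 of the 1,000,000 outcomes, its error rate lands within a fraction of a percentage point of the true rate.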
Consider sampling if you meet any of the following criteria:
Finally, consider your overall budget. If you have a limited budget for observability, but can afford to spend time sampling effectively, then sampling is generally worth it.
Sampling might not be appropriate for you. You might want to avoid sampling if you meet any of the following criteria:
Finally, consider the following three costs associated with sampling:
Sampling, while effective at reducing observability costs, might introduce other unexpected costs if not performed well. It could be cheaper to allocate more resources for observability instead, either with a vendor or compute when self-hosting, depending on your observability backend, the nature of your data, and your attempts to sample effectively.
Head sampling is a sampling technique used to make a sampling decision as early as possible. The decision to sample or drop a span or trace is made without inspecting the trace as a whole.
For example, the most common form of head sampling is Consistent Probability Sampling. This is also referred to as Deterministic Sampling. In this case, a sampling decision is made based on the trace ID and the desired percentage of traces to sample. This ensures that whole traces are sampled (no missing spans) at a consistent rate, such as 5% of all traces.
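The idea can be sketched in a few lines. This is an illustrative sketch, not the actual OpenTelemetry implementation (the SDKs ship this behavior as the `TraceIdRatioBased` sampler, which operates on the trace ID's bits directly); here the trace ID is hashed onto `[0, 1)` so every service reaches the same verdict for the same trace:

```python
import hashlib

def should_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-sampling decision based only on the trace ID.

    Every service that sees the same trace ID computes the same value,
    so sampled traces are always complete (no missing spans).
    """
    # Map the trace ID onto a uniform value in [0, 1).
    digest = hashlib.sha256(trace_id.encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < ratio

# Roughly 5% of trace IDs fall below the threshold.
sampled = sum(should_sample(f"{i:032x}", 0.05) for i in range(100_000))
print(f"sampled {sampled} of 100000 traces")
```

Because the decision is a pure function of the trace ID and the ratio, it can be made immediately when the root span starts, with no coordination between services.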
The upsides to head sampling are:
The primary downside to head sampling is that it is not possible to make a sampling decision based on data in the entire trace. For example, you cannot ensure that all traces with an error within them are sampled with head sampling alone. For this situation and many others, you need tail sampling.
Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Tail sampling gives you the option to sample traces based on specific criteria derived from different parts of a trace, which isn’t an option with head sampling.
Some examples of how you can use tail sampling include:
As you can see, tail sampling allows for a much higher degree of sophistication in how you sample data. For larger systems that must sample telemetry, it is almost always necessary to use tail sampling to balance data volume with the usefulness of that data.
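A minimal sketch of such a decision might look like the following. It assumes spans have already been buffered per trace; the span shape (plain dicts with `error` and `duration_ms` keys) and the thresholds are hypothetical, chosen for illustration:

```python
import random

def tail_sample(trace_spans, baseline_ratio=0.05, slow_ms=1000.0, rng=None):
    """Decide whether to keep a completed trace by inspecting all its spans.

    Keeps every trace containing an error or an unusually slow span, plus
    a small probabilistic baseline of the healthy, fast traces.
    """
    rng = rng or random.Random()
    if any(span["error"] for span in trace_spans):
        return True                       # always keep traces with errors
    if any(span["duration_ms"] > slow_ms for span in trace_spans):
        return True                       # always keep unusually slow traces
    return rng.random() < baseline_ratio  # baseline sample of the rest

# A trace containing an error is always kept:
error_trace = [{"error": False, "duration_ms": 20},
               {"error": True, "duration_ms": 5}]
print(tail_sample(error_trace))  # True
```

Note what this requires in practice: all spans of a trace must be routed to, and held in memory by, the same decision point until the trace is complete, which is where the operational costs of tail sampling come from.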
There are three primary downsides to tail sampling today:
Finally, for some systems, tail sampling might be used in conjunction with head sampling. For example, a set of services that produce an extremely high volume of trace data might first use head sampling to sample only a small percentage of traces, and then later in the telemetry pipeline use tail sampling to make more sophisticated sampling decisions before exporting to a backend. This is often done in the interest of protecting the telemetry pipeline from being overloaded.
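Composing the two stages is straightforward in principle. The sketch below (helper names and ratios are hypothetical) applies a cheap deterministic head gate first, then a richer tail decision only to traces that survive the gate:

```python
import hashlib
import random

def head_keep(trace_id: str, ratio: float) -> bool:
    # Cheap, deterministic gate applied when the trace starts.
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < ratio

def tail_keep(spans, baseline=0.10, rng=None) -> bool:
    # Richer decision made once the whole trace has been buffered.
    rng = rng or random.Random(0)
    if any(span["error"] for span in spans):
        return True
    return rng.random() < baseline

def pipeline_keep(trace_id, spans, head_ratio=0.25):
    """Head sampling protects the pipeline; tail sampling refines the rest."""
    return head_keep(trace_id, head_ratio) and tail_keep(spans)
```

Because the head stage discards most traces before they are ever buffered, the tail stage only has to hold a fraction of the total span volume in memory.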
The OpenTelemetry Collector includes the following sampling processors:
For the individual language-specific implementations of the OpenTelemetry API & SDK, you will find support for sampling in the respective documentation pages:
Many vendors offer comprehensive sampling solutions that incorporate head sampling, tail sampling, and other features that can support sophisticated sampling needs. These solutions may also be optimized specifically for the vendor’s backend. If you are sending telemetry to a vendor, consider using their sampling solutions.