The Algorithm
Platform Engineering · Cross-Industry · 10 min read · 2024-06-21

OpenTelemetry for Enterprise-Scale Distributed Tracing

OpenTelemetry has become the industry standard for instrumentation across traces, metrics, and logs. A CNCF incubating project with broad vendor support, it provides a single instrumentation API that decouples application code from observability backends. For enterprises running hundreds of services across multiple teams, the OpenTelemetry Collector architecture — not the auto-instrumentation libraries alone — is what makes enterprise-scale tracing operational. Tail-based sampling at the Collector tier, attribute-based routing to multiple backends, and the Collector pipeline configuration for regulated workloads all demand design decisions that the getting-started documentation does not cover.

OpenTelemetry entered CNCF incubation in 2021 and has since achieved near-universal support across observability vendors, cloud providers, and instrumentation libraries. The vendor lock-in problem that plagued observability is functionally resolved for greenfield deployments. The remaining adoption challenge for enterprise organisations is not vendor support or API stability. It is the collector architecture, sampling strategy, and organisational instrumentation discipline required to operate OpenTelemetry at enterprise scale across hundreds of services, multiple teams, and several observability backends simultaneously.

The Collector Architecture Most Teams Do Not Start With

The OpenTelemetry documentation describes two deployment modes: no-collector where applications send telemetry directly to the backend, and collector-as-agent where a collector sidecar or DaemonSet receives telemetry from local applications and forwards to the backend. For an enterprise deployment with multiple teams, multiple backends, and a need for centralised sampling and routing decisions, a two-tier collector architecture is required: agent collectors that run close to applications, and gateway collectors that aggregate from agents and perform centralised processing before routing to backends.
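The agent tier of this two-tier architecture can be sketched as a minimal Collector configuration that receives OTLP locally and forwards everything to the gateway, where tail sampling and routing live. The endpoint address and service names are illustrative, not prescribed:

```yaml
# Agent collector (sidecar or DaemonSet): receive locally, batch, forward.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}            # batch spans before export to reduce network overhead
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317  # illustrative gateway address
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The gateway collectors expose the same OTLP receiver and add the centralised processors (tail sampling, attribute scrubbing, multi-backend routing) that the agent tier deliberately omits.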

The gateway collector tier provides capabilities that agent collectors cannot. Tail-based sampling can only be implemented at the gateway tier, where all spans for a trace arrive before the sampling decision is made. Multi-backend routing can send traces to Jaeger for development teams and to Datadog for operations and compliance teams simultaneously. Sensitive attribute scrubbing can remove PII or PHI values from span attributes before they reach the backend. Implementing tail sampling requires sufficient gateway collector memory to buffer complete traces before evaluation, which is a capacity planning input that agent-only deployments do not have.

Auto-Instrumentation vs. Manual Instrumentation

OpenTelemetry provides auto-instrumentation for Java, Python, Node.js, and .NET through agents that inject tracing and metrics collection into application code without source modification, and for Go through an eBPF-based agent. Auto-instrumentation covers standard framework interactions such as HTTP requests, database queries, and message queue operations. It is the correct starting point for legacy services being onboarded to an enterprise observability platform.

Manual instrumentation is required for compliance-specific telemetry. Auto-instrumentation produces spans for HTTP calls and database queries; it does not produce spans with PHI access indicators, authorisation decision outcomes, or data classification labels. For regulated workloads where compliance audit evidence requires span-level context beyond what auto-instrumentation captures, developers must add manual span creation, attribute setting, and event recording in application code. Establishing enterprise standards for compliance span attributes and enforcing those standards through SDK wrappers distributed to all development teams is the instrumentation discipline component of the enterprise OpenTelemetry programme.
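The SDK-wrapper pattern can be sketched in Python as a thin layer over any OpenTelemetry-compatible tracer. The `compliance.*` attribute keys are this sketch's convention, not part of the OpenTelemetry semantic conventions, and the wrapper duck-types the tracer's `start_as_current_span` so the example stays self-contained:

```python
from contextlib import contextmanager

# Hypothetical enterprise attribute standard: every compliance span must
# carry these keys. The names are illustrative, not an OTel convention.
REQUIRED_KEYS = ("compliance.phi_access", "compliance.data_classification")


def compliance_attributes(*, phi_access, data_classification, authz_outcome):
    """Build the mandated attribute set for a compliance-relevant span."""
    return {
        "compliance.phi_access": str(bool(phi_access)).lower(),
        "compliance.data_classification": data_classification,
        "compliance.authz.outcome": authz_outcome,
    }


@contextmanager
def compliance_span(tracer, name, **kwargs):
    """Wrap tracer.start_as_current_span so teams cannot omit the
    mandated attributes. `tracer` is any object exposing the
    OpenTelemetry-style start_as_current_span context manager."""
    attrs = compliance_attributes(**kwargs)
    missing = [k for k in REQUIRED_KEYS if k not in attrs]
    if missing:
        raise ValueError(f"missing compliance attributes: {missing}")
    with tracer.start_as_current_span(name, attributes=attrs) as span:
        yield span
```

Distributing a wrapper like this through an internal package means the attribute standard is enforced at the call site rather than by after-the-fact review of each team's instrumentation.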

Sampling Strategy at Enterprise Scale

Enterprise-scale distributed systems generate trace volumes that would be cost-prohibitive to retain in full. The sampling strategy determines which traces are retained, at what cost, and with what coverage of compliance-relevant events. Three sampling strategies are relevant for regulated enterprise deployments. Head-based probabilistic sampling retains N percent of all traces but discards compliance-relevant traces at the same random rate as everything else. Head-based always-on sampling for flagged operations always retains traces for PHI access operations. Tail-based sampling, implemented at the gateway collector, retains traces that meet criteria evaluated after the full trace has completed, rather than deciding at trace initiation.

The Collector's tail sampling processor evaluates configured policies against completed traces. A policy that retains all traces with span attributes indicating PHI access, all traces with error status codes, and all traces with latency above the P99 threshold while probabilistically sampling 0.1 percent of routine traces achieves both compliance coverage and cost control. This is the correct sampling architecture for regulated enterprise deployments.
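The policy set described above can be sketched with the `tail_sampling` processor from the Collector contrib distribution. The attribute key, latency threshold, and buffer sizes are illustrative assumptions to be replaced with the organisation's own standards and measured P99:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s      # how long to buffer a trace before evaluating policies
    num_traces: 100000      # in-memory trace capacity: a gateway sizing input
    policies:
      - name: phi-access    # retain every trace flagged as touching PHI
        type: string_attribute
        string_attribute:
          key: compliance.phi_access   # illustrative attribute key
          values: ["true"]
      - name: errors        # retain every trace containing an error status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow          # retain traces above the latency threshold
        type: latency
        latency:
          threshold_ms: 2000  # stand-in for the measured P99
      - name: baseline      # 0.1 percent of routine traffic for ambient visibility
        type: probabilistic
        probabilistic:
          sampling_percentage: 0.1
```

Policies are evaluated independently and a trace is retained if any policy matches, which is what makes the compliance policies additive on top of the probabilistic baseline.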

Multi-Backend Routing for Enterprise Observability

Enterprise organisations rarely have a single observability backend. Development teams may use Jaeger for local debugging. Operations teams may use Grafana Tempo for correlated metrics and traces. Security and compliance teams may require that all telemetry be forwarded to a SIEM. The OpenTelemetry Collector's fan-out routing capability handles all of these simultaneously: a single gateway collector receives all telemetry and routes it to multiple exporters based on configurable rules.
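Fan-out requires no dedicated routing processor for the simple case: listing multiple exporters on one pipeline sends a copy of the telemetry to each. A gateway sketch with illustrative endpoints:

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317          # development-team debugging
  otlp/tempo:
    endpoint: tempo-distributor:4317         # operations metric/trace correlation
  otlphttp/siem:
    endpoint: https://siem.example.internal/otlp  # compliance audit archive
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger, otlp/tempo, otlphttp/siem]  # fan-out: each gets a copy
```

Attribute-conditional routing, where only compliance-flagged traces reach the SIEM, needs the contrib routing connector on top of this, but the fan-out pipeline is the foundation.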

For regulated organisations, the SIEM routing requirement is non-negotiable. Compliance audit evidence must reach a tamper-evident, long-retention store regardless of which operational observability backend is in use. Configuring the gateway collector with a dedicated SIEM exporter ensures that compliance-relevant telemetry reaches the audit archive even when the primary observability backend is unavailable. The compliance routing configuration is a higher-reliability path than the operational routing and should be treated as critical infrastructure accordingly.
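Treating the SIEM path as critical infrastructure translates, in Collector terms, into retry and persistent queueing on that exporter so telemetry survives backend outages and collector restarts. A sketch using the standard exporter retry settings and the contrib `file_storage` extension; endpoint and directory are illustrative:

```yaml
exporters:
  otlphttp/siem:
    endpoint: https://siem.example.internal/otlp  # illustrative SIEM endpoint
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0     # 0 = retry indefinitely rather than drop
    sending_queue:
      enabled: true
      storage: file_storage   # persist the queue to disk across restarts
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue
service:
  extensions: [file_storage]
```

The operational exporters can keep the default in-memory queue; only the compliance path pays the durability cost.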

Organisational Adoption Patterns

Enterprise OpenTelemetry adoption that begins with a single platform team attempting to instrument all services centrally consistently stalls. The pattern that succeeds is a platform team that provides opinionated SDK wrappers embedding compliance attribute standards, auto-instrumentation configurations that work with the enterprise's specific frameworks, collector infrastructure that handles routing and sampling, and an adoption programme that measures instrumentation coverage across the service catalogue. Service teams adopt the SDK wrappers and auto-instrumentation with minimal friction; the platform team ensures the resulting telemetry meets enterprise standards. Coverage improves incrementally without requiring central execution of instrumentation changes.
