Datadog Agent → OpenTelemetry Collector

OpenTelemetry Collector ↔ Datadog Agent: integration to migration path.

Your collection tier is load-bearing, so we never flip a switch. The OpenTelemetry Collector first deploys alongside the Datadog Agent (both run, with no pipeline overlap). Only then does the Collector take over collection, cluster by cluster, while Datadog stays the backend. There is no app-code redeploy and no flag day, and every phase up to Agent retirement rolls back in minutes.

This is an agent-tier swap, not a backend swap: the Collector ships straight into Datadog via its exporter, so the Datadog backend, dashboards and monitors stay intact. The honest gaps — Live Processes, NPM, CWS, DBM — we name up front.

The idea

Swap the agent, keep the backend.

The framing that makes this zero-outage: the Collector and the Datadog Agent are both collection-tier processes, and replacing one with the other does not require leaving Datadog. The Collector's Datadog exporter translates OTel resource attributes into Datadog tags in-process and ships directly to Datadog intake, so storage, indexing, dashboards and monitors are untouched. The Collector also adds upstream OTTL transforms, tail-based sampling and one config language across the fleet. Because backend migration is a separate, later workstream, the same Collector fleet can fan out to an OSS backend with a one-line exporter change. Each cluster cuts over independently, and each cutover is reversible.
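As a rough illustration of that in-process translation, here is a minimal Python sketch of how OTel resource attributes could map onto Datadog's Unified Service Tagging keys. It assumes the conventional pairing (service.name to service, deployment.environment.name or the older deployment.environment to env, service.version to version). It is not the Datadog exporter's code, and the real exporter applies many more rules than these three.

```python
from typing import Dict, List, Tuple

# Assumed mapping from OTel resource attributes to Datadog Unified Service
# Tagging keys. For env, the newer attribute name is tried first.
UST_MAPPING: List[Tuple[str, List[str]]] = [
    ("service", ["service.name"]),
    ("env", ["deployment.environment.name", "deployment.environment"]),
    ("version", ["service.version"]),
]


def resource_to_dd_tags(resource: Dict[str, str]) -> List[str]:
    """Return Datadog-style 'key:value' tags for the UST attributes present."""
    tags: List[str] = []
    for dd_key, otel_keys in UST_MAPPING:
        for otel_key in otel_keys:
            value = resource.get(otel_key)
            if value:
                tags.append(f"{dd_key}:{value}")
                break  # first matching attribute wins
    return tags


if __name__ == "__main__":
    resource = {
        "service.name": "checkout",
        "deployment.environment": "prod",
        "service.version": "1.4.2",
        "k8s.pod.name": "checkout-7d9f",  # not a UST attribute, so not emitted
    }
    print(resource_to_dd_tags(resource))
    # ['service:checkout', 'env:prod', 'version:1.4.2']
```

Keeping the lookup as data rather than code makes it easy to see which attributes a service must set before its dashboards light up under the same service, env and version facets.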

The phases

Seven steps. Each one reversible.

0

Baseline & inventory

We document every host's Datadog Agent version and enabled features, the custom conf.d integrations, the APM tracer language per service, and log, metric and trace volume. The work is read-only, and we explicitly classify each SaaS-only subagent as keep, replace OSS-side, or abandon.

Users see: No user impact.

Rollback: N/A
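The Phase 0 rule that no SaaS-only subagent is dropped by accident can be expressed as a simple gate over the inventory. The sketch below is a minimal illustration, assuming a hypothetical per-host record and hypothetical feature identifiers. It reports every SaaS-only feature enabled somewhere in the fleet that still lacks a keep, replace or abandon decision.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

# Hypothetical identifiers for the SaaS-only subagents named on this page.
SAAS_ONLY: Set[str] = {
    "live_processes", "live_containers", "npm", "usm", "cws", "cspm", "dbm",
}
ALLOWED_DECISIONS: Set[str] = {"keep", "replace", "abandon"}


@dataclass
class HostInventory:
    name: str
    agent_version: str
    features: List[str]  # enabled Agent features on this host


def unclassified_saas_features(
    hosts: Iterable[HostInventory], decisions: Dict[str, str]
) -> List[str]:
    """Return SaaS-only features enabled on any host that lack a valid decision."""
    enabled: Set[str] = set()
    for host in hosts:
        # Only SaaS-only features need an explicit keep/replace/abandon call.
        enabled |= set(host.features) & SAAS_ONLY
    return sorted(f for f in enabled if decisions.get(f) not in ALLOWED_DECISIONS)


if __name__ == "__main__":
    inventory = [
        HostInventory("node-1", "7.50.0", ["logs", "apm", "npm"]),
        HostInventory("node-2", "7.50.0", ["logs", "dbm", "live_processes"]),
    ]
    decisions = {"npm": "replace", "dbm": "keep"}
    print(unclassified_saas_features(inventory, decisions))  # ['live_processes']
```

An empty result is the exit criterion for Phase 0: every SaaS-only feature in real use has an owner and a decision.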

1

Stand up Collector in canary

An OpenTelemetry Collector deploys as a DaemonSet-plus-gateway pair in one canary cluster, with a single Datadog exporter fed by a subset of receivers. The Collector and the Datadog Agent both run, but no signal flows through both pipelines.

Users see: None for users.

Rollback: Helm uninstall — the Datadog Agent still owns everything. Under 15 minutes.

2

Collector takes over net-new signals

Every net-new app and cluster onboards Collector-first via OTLP, and the Datadog Agent on those nodes runs thin: logs and APM are disabled, and it retains only the SaaS-only subagents marked keep in Phase 0. Existing workloads are untouched.

Users see: None — net-new traffic flows through the Collector to Datadog.

Rollback: Re-enable logs and APM in the Datadog Agent on the cluster and restart; the Collector quiesces. Under 15 minutes.

3

Migrate existing sources, per cluster

Cluster by cluster, in waves ordered from lowest to highest blast radius, the Collector takes over logs, metrics and node signals. The equivalent Agent subsystems are disabled one at a time, with a 24-hour soak after each. We validate facets, monitors and volume in Datadog.

Users see: None — same data, same backend, different collection process.

Rollback: Re-enable the most recently disabled Agent subsystem and restart. Under 15 minutes.
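The Phase 3 cadence (waves ordered by blast radius, one Agent subsystem disabled at a time, a 24-hour soak after each) can be sketched as a schedule generator. This is a minimal illustration, assuming clusters are already grouped into waves and that clusters within a wave move in lockstep. The cluster and subsystem names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

SOAK = timedelta(hours=24)  # soak after each subsystem is disabled


def cutover_schedule(
    waves: List[List[str]], subsystems: List[str], start: datetime
) -> List[Tuple[datetime, str, str]]:
    """Plan (time, cluster, subsystem) disable steps.

    waves: cluster names grouped into waves, ordered from lowest to highest
    blast radius. Clusters within a wave disable the same subsystem together;
    the next wave starts only after the previous wave's final soak completes.
    """
    schedule: List[Tuple[datetime, str, str]] = []
    t = start
    for wave in waves:
        for subsystem in subsystems:
            for cluster in wave:
                schedule.append((t, cluster, subsystem))
            t += SOAK  # wait a full soak before the next disable
    return schedule


if __name__ == "__main__":
    plan = cutover_schedule(
        waves=[["dev-a", "dev-b"], ["prod-eu"]],
        subsystems=["logs", "metrics"],
        start=datetime(2024, 1, 1),
    )
    for when, cluster, subsystem in plan:
        print(when.isoformat(), cluster, subsystem)
```

Because each step disables exactly one subsystem, rollback is always the single most recent step, which is what keeps the rollback above to one re-enable and a restart.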

4

Dual-export proving, OSS in parallel

The Collector exports to both Datadog and an OSS backend, and every active dashboard and monitor is reproduced OSS-side against the same stream. Tail sampling and batching both run before the fan-out, so both backends receive identical data. Where parity is impossible (Watchdog, Log Patterns, DBM, NPM), we declare the gap.

Users see: None — engineers can compare both backends on identical data.

Rollback: Remove the OSS exporter from the pipelines. Under five minutes.
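Why sampling has to happen before the Phase 4 fan-out can be shown in a few lines. The Python below is a simplified model, not the tail_sampling processor itself. It decides per complete trace, loosely mirroring that processor's status_code, latency and probabilistic policy types, then hands one identical span list to every exporter. The 500 ms threshold and 10% baseline are hypothetical, and the real processor also buffers spans for a configurable decision wait before deciding.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: List[Span], latency_threshold_ms: float = 500.0, baseline_percent: int = 10
) -> bool:
    """Decide on a whole trace once all of its spans have arrived."""
    if any(s.is_error for s in spans):  # outcome: any error keeps the trace
        return True
    if max(s.duration_ms for s in spans) >= latency_threshold_ms:  # slow trace
        return True
    # Baseline sample: hash the trace ID so the decision is deterministic.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100 < baseline_percent


def sample_then_fan_out(
    spans: List[Span], exporters: List[Callable[[List[Span]], None]]
) -> List[Span]:
    """Sample once, then hand the identical result to every exporter."""
    by_trace: Dict[str, List[Span]] = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)
    kept = [s for group in by_trace.values() if keep_trace(group) for s in group]
    for export in exporters:
        export(list(kept))  # Datadog and OSS backends see the same spans
    return kept
```

If each exporter sampled independently, the two backends would keep different traces and the side-by-side comparison in Phase 4 would prove nothing.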

5

Cut Datadog intake

The Collector exports OSS-only and the Datadog exporter comes out of the pipelines. The thin Agent continues shipping only the retained SaaS-only signals. Datadog dashboards and monitors quiesce on cutover services.

Users see: None for end users — on-call works OSS-side, proven on real incidents or drills first.

Rollback: Re-add the Datadog exporter to the pipelines; dashboards reanimate on next intake. Under 15 minutes.

6

Datadog Agent retirement (or thin-Agent permanence)

There are two possible end states. Full retirement: no Agent, with SaaS-only features either replaced by OSS workstreams or abandoned. Thin-Agent permanence: the Agent stays only for the retained subagents, and the Datadog contract shrinks to those SKUs.

Users see: None.

Rollback: Within a 30-day evidence window, reinstall from your config tool. Beyond the window, out of scope.

Feature parity

What moves cleanly, and what doesn't.

Capability | OpenTelemetry Collector | Datadog Agent | Parity
--- | --- | --- | ---
Logs/metrics/traces receive | receivers: otlp, prometheus, filelog, hostmetrics, kubeletstats (≥80 contrib) | Autodiscovery-driven integration checks (≥600 integrations) | At parity
Integration breadth | prometheus + filelog + dedicated receivers; BYO Prom exporter for long-tail | ≥600 prebuilt DD integration checks + dashboards | SaaS only
Upstream transform | transform processor OTTL (replace_pattern, delete_key) | logs_config.processing_rules (mask only, log-only) | At parity
Tail-based sampling | tail_sampling processor (gateway tier, by outcome) | Head/ingest-control only (apm_config.error_sample_rate) | OSS only
k8s enrichment | k8sattributes processor (API-server pod/ns/node) | Autodiscovery ad_identifiers annotations | At parity
Exemplars / semconv | OTel semconv + histogram exemplar propagation (128-bit) | Unified Service Tagging (service/env/version) | At parity
Self-telemetry | localhost:8888/metrics (otelcol_exporter_*) scrapable anywhere | datadog.agent.* into DD (vendor-locked) | At parity
Config-as-code | Single YAML, offline otelcol validate, OTel Builder distros | datadog.yaml + conf.d/*.yaml (needs DD account) | Partial
Live Processes / Containers | None (no processreceiver parity) | DD process/container subagents | SaaS only
Runtime security / flow | None (Falco/Tetragon/Hubble separate workstreams) | DD CWS (eBPF), CSPM, NPM, USM subagents | SaaS only
Database monitoring | postgresqlreceiver / mysqlreceiver (metrics only) | DD DBM dbm-agent (per-query plan capture) | Partial
Backend portability | Exporter swap [datadog]→[otlphttp/signoz], one line | DD-Agent proprietary protobuf → DD only | OSS only
Compliance boundary | Self-managed fleet in your SSP (patch/TLS/key custody) | DD-managed vendor SOC 2 boundary | Partial
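To make the upstream-transform row concrete: an OTTL replace_pattern statement applies a regex replacement to a string field, such as a log body, inside the Collector, before any exporter sees the record. The Python below only approximates that effect and is not OTTL syntax. The email pattern and the placeholder text are hypothetical masking choices.

```python
import re

# Hypothetical masking rule: redact email addresses in a log body.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_log_body(body: str, replacement: str = "<redacted-email>") -> str:
    """Regex-replace every match in the body before the record is exported."""
    return EMAIL_PATTERN.sub(replacement, body)


if __name__ == "__main__":
    print(mask_log_body("login failed for alice@example.com"))
    # login failed for <redacted-email>
```

Doing this at the collection tier means the unmasked value never reaches either backend, which matters once the Collector fans out to more than one destination.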

What we're honest about

The caveats most vendors leave out.

SaaS-only subagents cannot be replicated

Live Processes, Live Containers, NPM, USM, CWS, CSPM and DBM ship proprietary protobuf via Datadog Agent subprocesses with no Collector exporter parity. You either keep a thin Datadog Agent permanently for those, or replace them OSS-side with Falco, Tetragon, Cilium Hubble or pganalyze — and we make that call explicit in Phase 0, never by accident.

Integration breadth is a real long tail

Datadog's 600-plus prebuilt integration checks have no one-to-one Collector equivalent. The Collector relies on the prometheus and filelog receivers plus a smaller dedicated set; long-tail services like Cassandra JMX or Couchbase need a bring-your-own Prometheus exporter and lose the prebuilt Datadog dashboard. We map each enabled integration to a receiver, a Prom exporter, or 'abandoned.'
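The integration mapping can be kept as an explicit lookup so nothing falls through silently. This is a minimal sketch under stated assumptions: the Datadog integration names and Collector contrib receiver names below reflect our understanding of current naming and should be checked against your versions, and the long-tail and abandoned sets are hypothetical examples. Anything unmapped comes back as needing a decision rather than defaulting to abandoned.

```python
from typing import Dict, Optional, Set, Tuple

# Assumed Datadog integration name -> Collector contrib receiver name.
RECEIVER_FOR: Dict[str, str] = {
    "postgres": "postgresql",
    "mysql": "mysql",
    "redisdb": "redis",
    "nginx": "nginx",
}
# Long-tail integrations covered by a bring-your-own Prometheus exporter,
# scraped by the Collector's prometheus receiver (hypothetical choices).
BYO_PROM_EXPORTER: Set[str] = {"cassandra", "couchbase"}
# Integrations explicitly signed off as abandoned (empty until decided).
ABANDONED: Set[str] = set()


def map_integration(name: str) -> Tuple[str, Optional[str]]:
    """Return (disposition, receiver) for one enabled Datadog integration."""
    if name in RECEIVER_FOR:
        return ("receiver", RECEIVER_FOR[name])
    if name in BYO_PROM_EXPORTER:
        return ("prom_exporter", "prometheus")
    if name in ABANDONED:
        return ("abandoned", None)
    # Nothing is abandoned by default: unmapped checks need a decision.
    return ("needs_decision", None)


if __name__ == "__main__":
    for integration in ["postgres", "cassandra", "custom_check"]:
        print(integration, map_integration(integration))
```

Remember that the BYO-exporter path restores the metrics but not Datadog's prebuilt dashboard for that integration.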

Database Monitoring drops to metrics-only

Datadog DBM's per-query plan capture against pg_stat_activity and SQL Server DMVs has no Collector parity — the postgresql and mysql receivers collect metrics only. If DBM is doing real work, that is a thin-Agent retain or a pg_stat_statements-plus-Grafana workstream, not a clean swap.

The Collector fleet is now your control

Controls that sat inside a managed vendor's SOC 2 boundary become your responsibility: patching, TLS, key rotation and a reproducible build all move into your SSP. That is exactly why we run it: gateway HA with anti-affinity, a file_storage durability queue, mTLS on the gateway tier, and a tested DR runbook. Managed, not just installed.

Why this beats a flag day

Reversible at every step.

Every phase through cutover is a Helm value or Collector YAML change with a rollback of under 15 minutes, and the Datadog Agent stays installed throughout as the break-glass collector of record: re-enable a disabled subsystem and restart, or re-add an exporter and dashboards reanimate on next intake. We never uninstall the Agent or drop Datadog SKUs until the OSS backend has proven parity through a green soak of at least 30 days and on-call coverage of real incidents. The soak gate is the point: the new fleet earns its keep before any bridge is burned.
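The soak gate is mechanical enough to encode. A minimal sketch, assuming a hypothetical daily green/red record (for example, whether the OSS side matched Datadog that day) and a hypothetical minimum of one real incident handled OSS-side; a day with no record counts as red, so gaps reset the streak.

```python
from datetime import date, timedelta
from typing import Dict


def trailing_green_days(daily_green: Dict[date, bool], today: date) -> int:
    """Count consecutive green days ending today; a missing day counts as red."""
    streak = 0
    day = today
    while daily_green.get(day, False):
        streak += 1
        day -= timedelta(days=1)
    return streak


def may_retire_agent(
    daily_green: Dict[date, bool],
    today: date,
    real_incidents_handled_oss: int,
    min_green_days: int = 30,
) -> bool:
    """Gate Agent removal on an unbroken soak plus on-call proof."""
    return (
        trailing_green_days(daily_green, today) >= min_green_days
        and real_incidents_handled_oss >= 1
    )


if __name__ == "__main__":
    today = date(2024, 3, 31)
    history = {today - timedelta(days=i): True for i in range(30)}
    print(may_retire_agent(history, today, real_incidents_handled_oss=1))  # True
```

Counting only the trailing streak, rather than total green days, means a single red day restarts the 30-day clock.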

See whether your agent tier swaps cleanly.

A 30-minute call with a senior observability engineer. We inventory your Datadog Agent features, classify which SaaS-only subagents — NPM, CWS, DBM, Live Processes — are in real use, and tell you honestly which need a thin Agent or an OSS workstream. Before you commit.

Map my migration →