Prometheus → Datadog Metrics
Prometheus ↔ Datadog Metrics: from integration to migration.
Alerting is load-bearing, so we never flip a switch. Datadog comes online alongside Prometheus first — a remote_write block fans the same samples to Datadog while Alertmanager keeps paging — and only then does Datadog take over alerting one service at a time. No instrumentation change, no flag day, and every phase rolls back in minutes.
Prometheus stays authoritative for alerting until Phase 4. The honest exception is the Datadog custom-metric bill — we name it up front and gate the migration on it, because nothing else matters if the invoice detonates.
The idea
Remote-write a curated subset first.
The topology that makes this zero-outage: your existing Prometheus servers keep scraping, keep evaluating recording and alerting rules, and keep their local TSDB — a single remote_write block fans the same samples to Datadog's intake, gated by an allow-list that decides exactly which series become billable custom metrics. No application team changes instrumentation. Datadog gains long-term storage, dashboards, Monitors and Watchdog on the shipped subset while Alertmanager still pages humans. That parallel run lets you rebuild dashboards, prove monitors in shadow mode, then cut the notification path per service — each step independently reversible.
The phases
Seven steps. Each one reversible.
Baseline & inventory
We inventory every scrape job, the top-50 cardinality offenders, all recording and alerting rules with their routing, your Grafana dashboards and the full Alertmanager route tree. Read-only — and we produce an estimated Datadog custom-metric count and cost.
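A minimal sketch of the cardinality inventory, assuming the series are available as label-set dictionaries in the shape Prometheus returns from its `/api/v1/series` endpoint. The function names, the included-series allotment and the per-series price are hypothetical placeholders, not Datadog's actual price list.

```python
from collections import Counter


def top_cardinality(series, n=50):
    """Return the n metric names with the most unique label sets."""
    # Deduplicate first: the same label set may appear more than once.
    unique = {(s["__name__"], frozenset(s.items())) for s in series}
    counts = Counter(name for name, _ in unique)
    return counts.most_common(n)


def estimate_monthly_cost(billable_series, included_series, price_per_series):
    """Rough overage estimate; all three inputs are assumptions you supply."""
    overage = max(0, billable_series - included_series)
    return overage * price_per_series


series = [
    {"__name__": "http_requests_total", "service": "a", "code": "200"},
    {"__name__": "http_requests_total", "service": "a", "code": "500"},
    {"__name__": "http_requests_total", "service": "a", "code": "200"},  # duplicate
    {"__name__": "up", "job": "node"},
]
print(top_cardinality(series, n=1))
```

The estimate is only as good as the price and allotment you plug in, which is why it is treated as an order-of-magnitude figure.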
Stand up parallel ship to Datadog
Prometheus remote-writes to Datadog with an aggressive allow-list — only the SLI metrics for migrated dashboards and SLOs. No Datadog monitors yet, no Prometheus alert disabled. Alertmanager stays authoritative.
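The allow-list behaves like a `keep` action in `write_relabel_configs`: a series ships only if its metric name fully matches one of the listed patterns (Prometheus anchors relabel regexes at both ends). A rough Python model of that decision, useful for dry-running a candidate list against the Phase 0 inventory; the metric names in `ALLOW_LIST` are hypothetical examples.

```python
import re

# Hypothetical SLI allow-list; replace with the metrics your dashboards need.
ALLOW_LIST = [
    r"http_request_duration_seconds_(bucket|sum|count)",
    r"http_requests_total",
    r"up",
]


def compile_allow_list(patterns):
    """Join patterns into one alternation, each in its own non-capturing group."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def should_ship(labels, allow):
    """Mimic a fully anchored `keep` on the __name__ label."""
    return allow.fullmatch(labels.get("__name__", "")) is not None


allow = compile_allow_list(ALLOW_LIST)
print(should_ship({"__name__": "up"}, allow))
print(should_ship({"__name__": "go_gc_duration_seconds"}, allow))
```

Running the inventory through `should_ship` before touching Prometheus gives a shipped-series count you can compare against the cost estimate.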
Rebuild dashboards; monitors in shadow mode
Every Grafana dashboard on shipped metrics gets a Datadog twin, and every Prometheus alerting rule gets a Datadog Monitor — defined but muted. Alertmanager remains the only path that pages a human.
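One way to judge a muted monitor in shadow mode is to compare when the Prometheus rule fired with when the Datadog Monitor would have triggered. The sketch below assumes both event lists have already been exported as Unix timestamps in seconds; the function name and the 120-second tolerance are illustrative assumptions, not a prescribed threshold.

```python
def shadow_agreement(prom_fired, dd_triggered, tolerance_s=120):
    """Pair each Prometheus firing with the first unused Datadog trigger in tolerance."""
    remaining = sorted(dd_triggered)
    matched, missed = [], []
    for t in sorted(prom_fired):
        hit = next((d for d in remaining if abs(d - t) <= tolerance_s), None)
        if hit is None:
            missed.append(t)  # Prometheus fired, Datadog stayed quiet
        else:
            matched.append((t, hit))
            remaining.remove(hit)
    return {
        "matched": matched,
        "missed_by_datadog": missed,
        "extra_in_datadog": remaining,  # Datadog triggered with no Prometheus firing
    }


report = shadow_agreement([1000, 5000, 9000], [1060, 9100, 20000])
print(report)
```

A service is a reasonable cutover candidate once both the missed and the extra lists stay empty over a representative period.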
Cut alerting to Datadog, per service
Service by service, low to high blast radius, the Prometheus rule is silenced in Alertmanager and the Datadog Monitor is unmuted with PagerDuty verified end-to-end. Prometheus still scrapes and still evaluates — only the notification path is cut.
Expand the metric set with cardinality controls
Datadog ingest broadens to everything dashboards and exploration need, but with Distribution Metrics for latency histograms and Metrics without Limits applied to the top-20 cardinality offenders.
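To see why restricting queryable tag keys helps, count the distinct tag combinations that remain once every other key is dropped. This is a counting illustration only, under the assumption that indexed volume tracks unique combinations of the retained tags; Datadog's actual billing rules are more detailed, and the tag names here are hypothetical.

```python
def indexed_series_count(tag_sets, keep_keys):
    """Count unique tag combinations after projecting onto keep_keys."""
    projected = {
        frozenset((k, v) for k, v in tags.items() if k in keep_keys)
        for tags in tag_sets
    }
    return len(projected)


# 2 services x 1 env x 3 pods = 6 combinations before reshaping.
tag_sets = [
    {"service": svc, "env": "prod", "pod": f"{svc}-{i}"}
    for svc in ("checkout", "search")
    for i in range(3)
]
print(indexed_series_count(tag_sets, {"service", "env", "pod"}))
print(indexed_series_count(tag_sets, {"service", "env"}))
```

Dropping the `pod` key collapses six combinations into two, which is the effect we target on the top cardinality offenders.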
Downscale Prometheus to short-retention
Alertmanager is decommissioned (or kept only for in-cluster operational rules), and Prometheus is reduced to a scrape engine with 2–6h of local retention that remote-writes to Datadog. Recording rules keep evaluating so derived series ship at full fidelity.
Final state
Either Prometheus stays in short-retention scrape-engine mode indefinitely (recommended, because the OpenMetrics exporter ecosystem is built to be scraped by it), or every scrape job is migrated to Datadog Agent integrations one-for-one, a 6–12 month workstream.
Feature parity
What moves cleanly, and what doesn't.
| Capability | Prometheus | Datadog Metrics | Parity |
|---|---|---|---|
| Metric ingest | Scrape /metrics (OpenMetrics) + remote_write | Datadog Agent / openmetrics check / intake /api/v2/series | At parity |
| Service discovery | kubernetes_sd_configs / consul_sd_configs / ec2_sd_configs | DD Agent Autodiscovery (per-host) | At parity |
| Query language | PromQL (rate, histogram_quantile, by()) | Datadog query syntax (.as_rate(), p95:); DDSQL preview | Partial |
| Histogram percentiles | Classic _bucket{le} client-side; native histograms 2.40+ | Datadog Distribution Metrics (DDSketch, post-hoc quantiles) | Partial |
| Alerting / routing | Alerting rules + Alertmanager (inhibition, grouping, silences) | Datadog Monitors (composite, anomaly, outlier, forecast) + downtime API | Partial |
| Dashboards-as-code | Grafana JSON via TF grafana_dashboard / Grizzly jsonnet | Datadog datadog_dashboard TF + datadog-sync-cli | At parity |
| Long-term retention | Local TSDB ~15d; Thanos/Mimir on your storage | Datadog 15-month custom-metric retention | At parity |
| Cardinality / cost control | Write-time write_relabel_configs discipline | Datadog Metrics without Limits (server-side tag reshaping) | SaaS only |
| Anomaly / AIOps | predict_linear() / Robust-MAD recording rules (per-signal) | Datadog Watchdog (zero-config, whole-tenant) | SaaS only |
| Forecasting | predict_linear() over a recording-rule window | Datadog forecast() + Forecast Alert monitor | At parity |
| Tag enforcement at intake | Accepts any label; relabel discipline upstream | DD intake lowercasing, ≤200 tags/record, length cap | SaaS only |
| Cross-signal correlation | Grafana ≥10 + Tempo + Loki + exemplar trace_id (four panes) | Datadog metric→trace exemplar→span→log (one click) | Partial |
| Compliance attestations | Self-hosted; the data store of record is yours | Datadog SOC 2 / ISO / FedRAMP site (US1-FED) | Partial |
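The "Partial" rating on histogram percentiles comes from how classic buckets work: `histogram_quantile` can only interpolate linearly inside whichever `le` bucket the target rank falls into, so accuracy is bounded by the bucket layout chosen at instrumentation time. A simplified Python sketch of that interpolation, omitting some edge cases the real PromQL function handles; the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, including a +Inf bucket."""
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev = buckets[i - 1] if i > 0 else (0.0, 0.0)
    if count == prev:
        return upper
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

Any latency SLO whose threshold sits between two bucket bounds should be re-validated after the move to Distribution Metrics, since the two systems will not report identical percentiles.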
What we're honest about
The caveats most vendors leave out.
The custom-metric bill is the real risk
Datadog bills per unique (metric_name, tag_set) combination, so a single latency histogram across services, environments and methods can be thousands of billable series. The write_relabel_configs allow-list is the only thing standing between your Prometheus cardinality and your Datadog invoice — we validate the post-relabel sample against Datadog's metric explorer in Phase 1 before expanding scope, and set a daily billing alert.
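The arithmetic behind "thousands of billable series" is a straight multiplication. The sketch below computes an upper bound for one classic histogram, assuming every label combination actually occurs and that each remote-written series counts as one custom metric; the label cardinalities are hypothetical.

```python
from math import prod


def histogram_series(label_cardinalities, bucket_count):
    """Upper bound on series for one classic histogram.

    bucket_count includes the +Inf bucket; _sum and _count add two more
    series per label combination.
    """
    combos = prod(label_cardinalities.values())
    return combos * (bucket_count + 2)


labels = {"service": 20, "env": 3, "method": 5}
print(histogram_series(labels, bucket_count=12))  # 300 combinations x 14 = 4200
```

Adding one more label with ten values would multiply that figure by ten, which is why the allow-list is reviewed label by label and not just metric by metric.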
Metrics without Limits has no OSS analog
Datadog's server-side cardinality reshaping — declare which tag keys stay queryable, the rest drop from billing but still aggregate — has no architectural equivalent in a Prometheus-shaped TSDB. Cardinality control on the OSS side is write-time relabel discipline; you cannot retroactively unship a label dimension. This is a SaaS-only lever, and we use it deliberately.
Alertmanager semantics don't all map 1:1
Inhibition becomes a composite monitor, and composite monitors don't compose recursively. Grouping, silences and routing have clean Datadog equivalents, while anomaly, outlier and forecast monitor types are where Datadog comes out ahead. For any inhibition chain deeper than two levels we redesign the alert topology rather than translate it blind.
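A single-level inhibition rule ("mute the target while the source fires") can be approximated as a composite monitor that alerts only when the target is alerting and no source is. The sketch below builds such a query string from monitor IDs using the `&&` and `!` operators of Datadog's composite monitor syntax. The IDs are hypothetical, and Alertmanager's `equal` label matching is not modeled, so treat it as a starting point and not a faithful translation.

```python
def inhibition_to_composite(target_monitor_id, source_monitor_ids):
    """Build 'target && !source1 && !source2 ...' for a composite monitor."""
    if not source_monitor_ids:
        raise ValueError("an inhibition rule needs at least one source monitor")
    suppressors = " && ".join(f"!{sid}" for sid in source_monitor_ids)
    return f"{target_monitor_id} && {suppressors}"


# Hypothetical IDs: 111 = service latency, 222 = node down, 333 = cluster down.
print(inhibition_to_composite(111, [222, 333]))
```

Because a composite cannot reference another composite, every source in the list has to be a plain monitor, which is what forces a redesign for deeper chains.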
Keep Prometheus as the scrape engine
Datadog Agent integrations don't cover every exporter on every version. The honest end state keeps Prometheus indefinitely as a short-retention scrape engine feeding Datadog — node_exporter, kube-state-metrics and cAdvisor are that engine. We won't promise full Prometheus retirement to a budget owner unless integration coverage is verified per-exporter.
Why this beats a flag day
Reversible at every step.
Every phase is a config change with a tight rollback window while both stacks run in parallel — commenting out a remote_write block reverts in under two minutes, a per-service alerting cut reverts in under 15, and the heavier downscale phases revert in under 30. We never decommission Alertmanager or downscale Prometheus until Datadog has held alerting authority through a minimum 30-day green soak with zero Alertmanager-fired pages. The soak gate is the point: the new path proves itself in production before any bridge is burned.
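The soak gate can be expressed as a small check. This sketch assumes the gate is a continuous green streak: any Alertmanager-fired page after cutover resets the 30-day clock. That reset behavior and the function name are our assumptions for illustration; timestamps are timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone


def soak_gate_passed(cutover, alertmanager_pages, now, soak_days=30):
    """True once soak_days have elapsed since cutover or the last page, whichever is later."""
    resets = [p for p in alertmanager_pages if p >= cutover]
    streak_start = max([cutover] + resets)
    return now - streak_start >= timedelta(days=soak_days)


utc = timezone.utc
cutover = datetime(2024, 1, 1, tzinfo=utc)
now = datetime(2024, 2, 5, tzinfo=utc)

print(soak_gate_passed(cutover, [], now))
print(soak_gate_passed(cutover, [datetime(2024, 1, 20, tzinfo=utc)], now))
```

Pages that fired before the cutover are ignored, since Alertmanager was still the authoritative path at that point.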
See whether your metrics migrate within budget.
A 30-minute call with a senior observability engineer. We run your top-50 cardinality offenders, estimate the Datadog custom-metric bill within an order of magnitude, and tell you honestly which Prometheus semantics — inhibition, native histograms — won't translate cleanly. Before you sign a contract.
Map my migration →