Prometheus → Datadog Metrics

Prometheus ↔ Datadog Metrics: integration to migration path.

Alerting is load-bearing, so we never flip a switch. Datadog comes online alongside Prometheus first — a remote_write block fans the same samples to Datadog while Alertmanager keeps paging — and only then does Datadog take over alerting one service at a time. No instrumentation change, no flag day, and every phase rolls back in minutes.

Prometheus stays authoritative for alerting until the per-service cutover in Phase 3. The honest exception is the Datadog custom-metric bill: we name it up front and gate the migration on it, because nothing else matters if the invoice detonates.

The idea

Remote-write a curated subset first.

The topology that makes this zero-outage: your existing Prometheus servers keep scraping, keep evaluating recording and alerting rules, and keep their local TSDB — a single remote_write block fans the same samples to Datadog's intake, gated by an allow-list that decides exactly which series become billable custom metrics. No application team changes instrumentation. Datadog gains long-term storage, dashboards, Monitors and Watchdog on the shipped subset while Alertmanager still pages humans. That parallel run lets you rebuild dashboards, prove monitors in shadow mode, then cut the notification path per service — each step independently reversible.
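As a rough sketch of that single block, the Python below renders a remote_write fragment whose write_relabel_configs keep rule admits only allow-listed metric names. The field names (url, write_relabel_configs, source_labels, regex, action) are standard Prometheus configuration. The intake URL and the metric names are placeholders invented for illustration, and authentication settings are left out, so treat the output as a starting shape rather than a working config.

```python
import json

# Placeholder values for illustration only; neither is a real endpoint or a required name.
INTAKE_URL = "https://intake.example.invalid/remote-write"
ALLOWED_METRICS = ["http_requests_total", "http_request_duration_seconds_bucket"]


def build_remote_write(url: str, allowed: list[str]) -> dict:
    """Return a remote_write fragment that keeps only allow-listed metric names."""
    # Prometheus anchors relabel regexes, so a plain alternation matches whole names only.
    keep_regex = "|".join(sorted(allowed))
    return {
        "remote_write": [
            {
                "url": url,
                "write_relabel_configs": [
                    {
                        "source_labels": ["__name__"],
                        "regex": keep_regex,
                        "action": "keep",
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    # JSON is also valid YAML flow syntax, which makes the output easy to adapt.
    print(json.dumps(build_remote_write(INTAKE_URL, ALLOWED_METRICS), indent=2))
```

Everything not matched by the keep rule is dropped before it leaves Prometheus, so the local TSDB and the rule evaluation are unaffected by how tight the list is.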

The phases

Seven steps. Each one reversible.

Phase 0: Baseline & inventory

We inventory every scrape job, the top-50 cardinality offenders, all recording and alerting rules with their routing, your Grafana dashboards and the full Alertmanager route tree. The pass is read-only, and it produces an estimated Datadog custom-metric count and cost.

Users see: No user impact.

Rollback: N/A
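The cardinality part of the inventory can be approximated offline. This sketch assumes you have exported the active series as a list of label dictionaries (the input format is our assumption); it counts distinct label sets per metric name and returns the largest offenders. Against a live server, a PromQL query such as topk(50, count by (__name__)({__name__=~".+"})) gives a comparable ranking.

```python
from collections import Counter


def top_cardinality(series: list[dict[str, str]], n: int = 50) -> list[tuple[str, int]]:
    """Count distinct label sets per metric name and return the n largest."""
    # Deduplicate first: the same label set scraped twice is still one series.
    unique = {frozenset(labels.items()) for labels in series}
    counts = Counter(dict(label_set)["__name__"] for label_set in unique)
    # Sort by count descending, then by name, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    sample = [
        {"__name__": "http_requests_total", "method": "GET"},
        {"__name__": "http_requests_total", "method": "POST"},
        {"__name__": "up", "job": "api"},
    ]
    for name, count in top_cardinality(sample, n=2):
        print(f"{name}: {count} series")
```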

Phase 1: Stand up parallel ship to Datadog

Prometheus remote-writes to Datadog with an aggressive allow-list — only the SLI metrics for migrated dashboards and SLOs. No Datadog monitors yet, no Prometheus alert disabled. Alertmanager stays authoritative.

Users see: No user impact.

Rollback: Comment out the remote_write: block and reload Prometheus. Under two minutes.
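Before shipping anything, it helps to dry-run the allow-list against the inventory. The sketch below mimics a keep rule on __name__; it relies on the fact that Prometheus anchors relabel regexes at both ends, which Python's fullmatch reproduces. Prometheus uses RE2 rather than Python's regex engine, so this is only a faithful preview for simple patterns, and the metric names are invented for illustration.

```python
import re


def apply_keep(series_names: list[str], regex: str) -> list[str]:
    """Return the metric names a `keep` rule on __name__ would let through."""
    pattern = re.compile(regex)
    # fullmatch mirrors Prometheus anchoring: a prefix match is not enough.
    return [name for name in series_names if pattern.fullmatch(name)]


if __name__ == "__main__":
    names = [
        "http_requests_total",
        "http_requests_total_debug",
        "go_gc_duration_seconds",
        "sli_latency_seconds_bucket",
    ]
    for name in apply_keep(names, "http_requests_total|sli_.*"):
        print("ships:", name)
```

The anchoring matters for the bill: an unanchored pattern would also admit http_requests_total_debug, which is exactly the kind of accidental series the allow-list exists to stop.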

Phase 2: Rebuild dashboards; monitors in shadow mode

Every Grafana dashboard on shipped metrics gets a Datadog twin, and every Prometheus alerting rule gets a Datadog Monitor — defined but muted. Alertmanager remains the only path that pages a human.

Users see: Engineers can browse Datadog dashboards; no pages route through Datadog yet.

Rollback: Delete the Datadog dashboards and monitors via Terraform destroy. Under 30 minutes.
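One way to judge a shadow run is to compare when each side fired. This sketch assumes you can export both histories as (alert name, timestamp) pairs and that the Datadog Monitor carries the same name as the Prometheus rule; both the export format and the five-minute tolerance are our assumptions.

```python
from datetime import datetime, timedelta

Event = tuple[str, datetime]


def shadow_mismatches(
    prom_firings: list[Event],
    dd_triggers: list[Event],
    tolerance: timedelta = timedelta(minutes=5),
) -> tuple[list[Event], list[Event]]:
    """Return (missed, extra): firings with no counterpart within tolerance."""

    def unmatched(left: list[Event], right: list[Event]) -> list[Event]:
        return [
            (name, ts)
            for name, ts in left
            if not any(
                name == other and abs(ts - other_ts) <= tolerance
                for other, other_ts in right
            )
        ]

    # missed: Prometheus fired, Datadog stayed quiet. extra: the reverse.
    return unmatched(prom_firings, dd_triggers), unmatched(dd_triggers, prom_firings)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    prom = [("HighErrorRate", t0), ("DiskFull", t0 + timedelta(hours=1))]
    dd = [("HighErrorRate", t0 + timedelta(minutes=3)), ("HighLatency", t0)]
    missed, extra = shadow_mismatches(prom, dd)
    print("missed by Datadog:", missed)
    print("extra in Datadog:", extra)
```

A monitor with an empty missed list and an empty extra list over the shadow window is a reasonable candidate for cutover; anything else needs its query or threshold revisited first.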

Phase 3: Cut alerting to Datadog, per service

Service by service, low to high blast radius, the Prometheus rule is silenced in Alertmanager and the Datadog Monitor is unmuted with PagerDuty verified end-to-end. Prometheus still scrapes and still evaluates — only the notification path is cut.

Users see: On-call receives pages from Datadog instead of Alertmanager; page content differs, so we coordinate runbooks.

Rollback: Restore the Prometheus rule's Alertmanager notification path and re-mute the Datadog Monitor, a single git revert and reload. Under 15 minutes per service.

Phase 4: Expand the metric set with cardinality controls

Datadog ingest broadens to everything dashboards and exploration need, with Distribution Metrics for latency histograms and Metrics without Limits applied to the top-20 cardinality offenders.

Users see: Richer Datadog content for engineers; alerts behave as before.

Rollback: Tighten the allow-list back to the Phase 3 set and disable Metrics without Limits per metric. Under 30 minutes.

Phase 5: Downscale Prometheus to short-retention

Alertmanager is decommissioned (or kept only for in-cluster operational rules), and Prometheus is reduced to a scrape engine with 2–6h of local retention that remote-writes to Datadog. Recording rules keep evaluating so derived series ship at full fidelity.

Users see: Grafana dashboards keep working for recent data; queries spanning more than the retention window repoint at Datadog or get rebuilt.

Rollback: Restore the original retention, redeploy Alertmanager from git, re-enable Prometheus alerting. Under 30 minutes if config is in git; up to two hours if Alertmanager was deleted from version control.
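The repointing rule for dashboards is simple enough to state as code. This sketch picks a backend from the query's start time; the function name and the backend labels are hypothetical, and the six-hour retention is just the upper end of the range above.

```python
from datetime import datetime, timedelta, timezone


def query_backend(start: datetime, now: datetime, retention: timedelta) -> str:
    """Pick the backend that can serve a query beginning at `start`."""
    # Prometheus can only answer if the whole range fits inside local retention.
    return "prometheus" if now - start <= retention else "datadog"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    retention = timedelta(hours=6)
    print(query_backend(now - timedelta(hours=2), now, retention))
    print(query_backend(now - timedelta(days=1), now, retention))
```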

Phase 6: Final state

Either Prometheus stays in short-retention scrape-engine mode indefinitely (recommended, because the OpenMetrics exporter ecosystem is built around it), or every scrape job is migrated one-for-one to Datadog Agent integrations, a 6–12 month workstream.

Users see: No user impact.

Rollback: Restore longer retention. Under 30 minutes.

Feature parity

What moves cleanly, and what doesn't.

| Capability | Prometheus | Datadog Metrics | Parity |
| --- | --- | --- | --- |
| Metric ingest | Scrape /metrics (OpenMetrics) + remote_write | Datadog Agent / openmetrics check / intake /api/v2/series | At parity |
| Service discovery | kubernetes_sd_configs / consul_sd_configs / ec2_sd_configs | DD Agent Autodiscovery (per-host) | At parity |
| Query language | PromQL (rate, histogram_quantile, by()) | Datadog query syntax (.as_rate(), p95:); DDSQL preview | Partial |
| Histogram percentiles | Classic _bucket{le} client-side; native histograms in Prometheus 2.40+ | Datadog Distribution Metrics (DDSketch, post-hoc quantiles) | Partial |
| Alerting / routing | Alerting rules + Alertmanager (inhibition, grouping, silences) | Datadog Monitors (composite, anomaly, outlier, forecast) + downtime API | Partial |
| Dashboards-as-code | Grafana JSON via TF grafana_dashboard / Grizzly jsonnet | Datadog datadog_dashboard TF + datadog-sync-cli | At parity |
| Long-term retention | Local TSDB ~15d; Thanos/Mimir on your storage | Datadog 15-month custom-metric retention | At parity |
| Cardinality / cost control | Write-time write_relabel_configs discipline | Datadog Metrics without Limits (server-side tag reshaping) | SaaS only |
| Anomaly / AIOps | predict_linear() / Robust-MAD recording rules (per-signal) | Datadog Watchdog (zero-config, whole-tenant) | SaaS only |
| Forecasting | predict_linear() over a recording-rule window | Datadog forecast() + Forecast Alert monitor | At parity |
| Tag enforcement at intake | Accepts any label; relabel discipline upstream | DD intake lowercasing, ≤200 tags/record, length cap | SaaS only |
| Cross-signal correlation | Grafana ≥10 + Tempo + Loki + exemplar trace_id (four panes) | Datadog metric→trace exemplar→span→log (one click) | Partial |
| Compliance attestations | Self-hosted; the data store of record is yours | Datadog SOC 2 / ISO / FedRAMP site (US1-FED) | Partial |

What we're honest about

The caveats most vendors leave out.

The custom-metric bill is the real risk

Datadog bills per unique (metric_name, tag_set) combination, so a single latency histogram across services, environments and methods can be thousands of billable series. The write_relabel_configs allow-list is the only thing standing between your Prometheus cardinality and your Datadog invoice — we validate the post-relabel sample against Datadog's metric explorer in Phase 1 before expanding scope, and set a daily billing alert.
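To see how quickly one histogram multiplies, a worst-case count is just the product of each tag's cardinality. The figures below are invented for illustration, and the result is an upper bound: real deployments rarely emit every combination, and the way histogram buckets are counted depends on how the metric is submitted.

```python
from math import prod


def billable_series(tag_cardinalities: dict[str, int]) -> int:
    """Worst-case number of unique tag sets for a single metric name."""
    # Every combination of tag values is a distinct (metric_name, tag_set).
    return prod(tag_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical latency histogram: 20 services, 3 environments,
    # 5 HTTP methods and 12 bucket boundaries.
    latency = {"service": 20, "env": 3, "method": 5, "le": 12}
    print(billable_series(latency))  # 20 * 3 * 5 * 12 = 3600
```

Dropping a single tag from the allow-listed series divides the total by that tag's cardinality, which is why the relabel rules deserve as much review as the metric names themselves.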

Metrics without Limits has no OSS analog

Datadog's server-side cardinality reshaping — declare which tag keys stay queryable, the rest drop from billing but still aggregate — has no architectural equivalent in a Prometheus-shaped TSDB. Cardinality control on the OSS side is write-time relabel discipline; you cannot retroactively unship a label dimension. This is a SaaS-only lever, and we use it deliberately.
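The counting effect of keeping only some tag keys can be modelled in a few lines. This is a sketch of the arithmetic only, not Datadog's billing algorithm: it projects each tag set onto the keys that stay queryable and counts what remains distinct. The sample series and the function name are hypothetical.

```python
from typing import Optional


def unique_tag_sets(
    series: list[dict[str, str]], keep_keys: Optional[set[str]] = None
) -> int:
    """Count distinct tag sets, optionally after keeping only some tag keys."""
    projected = {
        frozenset(
            (key, value)
            for key, value in tags.items()
            if keep_keys is None or key in keep_keys
        )
        for tags in series
    }
    return len(projected)


if __name__ == "__main__":
    series = [
        {"service": "api", "env": "prod", "pod": "api-1"},
        {"service": "api", "env": "prod", "pod": "api-2"},
        {"service": "api", "env": "staging", "pod": "api-3"},
        {"service": "web", "env": "prod", "pod": "web-1"},
    ]
    print("all tags:", unique_tag_sets(series))
    print("service+env only:", unique_tag_sets(series, {"service", "env"}))
```

With per-pod tags the saving grows with replica count, which is where the top cardinality offenders usually come from.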

Alertmanager semantics don't all map 1:1

Inhibition becomes a composite monitor, and composite monitors don't compose recursively. Grouping, silences and routing have clean Datadog equivalents, while anomaly, outlier and forecast monitor types are capabilities Alertmanager lacks. For any inhibition chain deeper than two levels we redesign the alert topology rather than translate it blind.
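Finding the chains that need redesign is a small graph problem. This sketch assumes the inhibit rules have been flattened into (source alert, inhibited alert) pairs, that the graph has no cycles, and that depth is counted in inhibition hops; all three are our assumptions, and the alert names are invented.

```python
from functools import lru_cache


def chain_depths(inhibit_edges: list[tuple[str, str]]) -> dict[str, int]:
    """Map each source alert to the longest inhibition chain (in hops) below it."""
    children: dict[str, list[str]] = {}
    for source, target in inhibit_edges:
        children.setdefault(source, []).append(target)

    @lru_cache(maxsize=None)
    def depth(alert: str) -> int:
        # A cycle would recurse without end; the input is assumed acyclic.
        return max((1 + depth(child) for child in children.get(alert, [])), default=0)

    return {alert: depth(alert) for alert in children}


if __name__ == "__main__":
    edges = [
        ("ClusterDown", "NodeDown"),
        ("NodeDown", "PodCrashLoop"),
        ("PodCrashLoop", "HighLatency"),
        ("DiskFull", "WriteErrors"),
    ]
    needs_redesign = sorted(a for a, d in chain_depths(edges).items() if d > 2)
    print(needs_redesign)
```

Chains at or under the threshold are candidates for a direct composite-monitor translation; the rest get a topology review.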

Keep Prometheus as the scrape engine

Datadog Agent integrations don't cover every exporter on every version. The honest end state keeps Prometheus indefinitely as a short-retention scrape engine feeding Datadog, scraping exporters such as node_exporter, kube-state-metrics and cAdvisor. We won't promise full Prometheus retirement to a budget owner unless integration coverage is verified per-exporter.

Why this beats a flag day

Reversible at every step.

Every phase is a config change with a tight rollback window while both stacks run in parallel: commenting out a remote_write block reverts in under two minutes, a per-service alerting cut reverts in under 15, and the heavier downscale phases revert in under 30 as long as the config is still in git. We never decommission Alertmanager or downscale Prometheus until Datadog has held alerting authority through a minimum 30-day green soak with zero Alertmanager-fired pages. The soak gate is the point: the new path proves itself in production before any bridge is burned.
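The soak gate can be written as a small check. This sketch takes one reading of the rule, namely that any Alertmanager-fired page after cutover restarts the 30-day clock; that interpretation, the function name and the date-based inputs are our assumptions.

```python
from datetime import date


def soak_passed(
    cutover: date,
    today: date,
    alertmanager_pages: list[date],
    min_days: int = 30,
) -> bool:
    """True once min_days have passed since cutover or the last Alertmanager page."""
    # Pages before cutover are expected and do not count against the soak.
    relevant = [page for page in alertmanager_pages if cutover <= page <= today]
    soak_start = max(relevant, default=cutover)
    return (today - soak_start).days >= min_days


if __name__ == "__main__":
    cutover = date(2024, 1, 1)
    print(soak_passed(cutover, date(2024, 1, 31), []))
    print(soak_passed(cutover, date(2024, 1, 31), [date(2024, 1, 10)]))
```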

See whether your metrics migrate within budget.

A 30-minute call with a senior observability engineer. We run your top-50 cardinality offenders, estimate the Datadog custom-metric bill within an order of magnitude, and tell you honestly which Prometheus semantics — inhibition, native histograms — won't translate cleanly. Before you sign a contract.
