Loki + Grafana → Datadog Logs

Loki + Grafana ↔ Datadog Logs: the path from integration to migration.

Logging is load-bearing for every incident, so we never flip a switch. Loki + Grafana deploys alongside Datadog first: one OpenTelemetry Collector tees the same log line to both backends, and only then does Loki take over indexed search in phases. No flag day, no re-credentialing, and every phase up to final retirement rolls back in minutes.

Datadog stays the indexed-search source of record until Phase 4. The honest exceptions — Watchdog, Sensitive Data Scanner, Cloud SIEM — we name up front, because the rest only matters if you can trust it.

The idea

Tee one Collector to both backends first.

The trick that makes this zero-outage: an OpenTelemetry Collector gateway sits between your apps and your backends, and its logs pipeline fans out to both the Datadog exporter and the Loki exporter from a single source. Apps send OTLP once; the same parsed, redacted line lands in both Datadog and Loki. That single switchable point of control lets Loki parallel-run at full fidelity while Datadog still owns indexed search — so you move dashboards, monitors and the source of record one layer at a time, each reversible, never betting an outage on a cutover.
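To make the tee concrete, here is a minimal sketch of the gateway config, generated from Python. The exporter settings (the `datadoghq.com` site, the `loki-gateway:3100` address, the `DD_API_KEY` variable) are illustrative assumptions, not values from any real estate, and the function name `build_tee_config` is ours.

```python
import json

def build_tee_config(datadog_enabled: bool = True, loki_enabled: bool = True) -> dict:
    """Return a Collector config whose single logs pipeline fans out to the enabled backends."""
    exporters = {}
    if datadog_enabled:
        # Assumed site and env var name; substitute your own.
        exporters["datadog"] = {"api": {"key": "${env:DD_API_KEY}", "site": "datadoghq.com"}}
    if loki_enabled:
        # Hypothetical in-cluster Loki address.
        exporters["loki"] = {"endpoint": "http://loki-gateway:3100/loki/api/v1/push"}
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        "exporters": exporters,
        "service": {
            "pipelines": {
                "logs": {
                    "receivers": ["otlp"],        # apps send OTLP once
                    "processors": ["batch"],
                    "exporters": list(exporters),  # the fan-out happens here
                }
            }
        },
    }

# JSON is valid YAML, so this output can be saved as a Collector config file.
print(json.dumps(build_tee_config(), indent=2))
```

Dropping one backend is a one-argument change (`build_tee_config(loki_enabled=False)`), which is the "single switchable point of control" in practice.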

The phases

Seven steps. Each one reversible.

Phase 0: Baseline & inventory

We map every Datadog log source by intake path, GB/day and EPS, plus the Pipelines, indexed facets, monitors, dashboards and retention tiers in play. Read-only — nothing is touched, and we pull a 90-day volume sample by service and env for Loki sizing.

Users see: No user impact.

Rollback: N/A
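The sizing arithmetic behind the inventory is simple. A minimal sketch, assuming the volume sample has already been exported as per-day byte and event totals (the function name and the input shape are our assumptions):

```python
from statistics import mean

def size_from_sample(daily_bytes: list[int], daily_events: list[int]) -> dict:
    """Summarise a per-day volume sample as average/peak GB/day and events per second."""
    gb = [b / 1e9 for b in daily_bytes]           # decimal gigabytes per day
    eps = [e / 86_400 for e in daily_events]      # events per second, averaged over each day
    return {
        "avg_gb_per_day": round(mean(gb), 2),
        "peak_gb_per_day": round(max(gb), 2),
        "avg_eps": round(mean(eps), 1),
        "peak_day_eps": round(max(eps), 1),
    }

# Made-up three-day sample for one service.
summary = size_from_sample(
    daily_bytes=[120_000_000_000, 150_000_000_000, 90_000_000_000],
    daily_events=[864_000_000, 1_728_000_000, 432_000_000],
)
```

Per-day averages hide intra-day bursts, so a real sizing pass would also look at peak-hour EPS.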

Phase 1: Loki goes live; tee a canary

Loki and Grafana stand up in your cloud on S3-backed storage, and an OpenTelemetry Collector gateway tees one canary service to both Datadog and Loki. Datadog stays the source of record.

Users see: None — engineers see the canary in both Datadog and Grafana Explore.

Rollback: Revert the canary pod spec; uninstall Loki — nothing depends on it.

Phase 2: Dual-export every service

Every service ships to both backends through the Collector, or via a Promtail/Alloy sidecar alongside the Datadog Agent on legacy hosts. Loki parallel-runs at full fidelity while Datadog remains authoritative.

Users see: None.

Rollback: Per service: revert the pod spec or stop Promtail. Under 15 minutes.
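"Full fidelity" is something you can check during the parallel run. A minimal sketch, assuming per-service line counts for the same time window have been pulled from each backend; the 1% tolerance is an arbitrary illustrative default:

```python
def parity_report(
    datadog_counts: dict[str, int],
    loki_counts: dict[str, int],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return services whose Loki line count drifts from Datadog's by more than `tolerance`."""
    drifted = {}
    for service, dd in datadog_counts.items():
        loki = loki_counts.get(service, 0)
        if dd:
            drift = abs(dd - loki) / dd
        else:
            # Datadog saw nothing; any Loki lines count as full drift.
            drift = 1.0 if loki else 0.0
        if drift > tolerance:
            drifted[service] = round(drift, 4)
    return drifted

report = parity_report(
    {"checkout": 1000, "auth": 500},
    {"checkout": 998, "auth": 400},
)
```

An empty report for every window is a reasonable precondition before moving on to Phase 3.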

Phase 3: Rebuild dashboards, monitors, pipelines on OSS

Every Datadog log dashboard gets a Grafana twin in git, every log-alert monitor a Loki ruler or Grafana unified-alert equivalent, and every Pipeline processor is reimplemented as OTTL or VRL upstream of the fan-out. Additive only.

Users see: None during build; engineers start using Grafana Explore for ad-hoc queries.

Rollback: Additive only; Datadog remains source of record.
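As an illustration of what a monitor's twin looks like, here is a sketch that builds a Loki ruler alerting rule for a simple "too many matching lines" monitor. The service name, threshold, `for` duration and severity label are made up; real monitors carry more conditions than this helper handles.

```python
import json

def loki_alert_rule(name: str, selector: str, needle: str, window: str, threshold: int) -> dict:
    """Build one ruler rule group that fires when matching lines exceed `threshold` per window."""
    expr = f'sum(count_over_time({selector} |= "{needle}" [{window}])) > {threshold}'
    return {
        "groups": [{
            "name": f"{name}-logs",
            "rules": [{
                "alert": name,
                "expr": expr,
                "for": "5m",                         # assumed hold time before firing
                "labels": {"severity": "page"},      # assumed routing label
                "annotations": {
                    "summary": f"{name}: more than {threshold} matching lines in {window}"
                },
            }],
        }]
    }

rule = loki_alert_rule("CheckoutErrors", '{service="checkout", env="prod"}', "error", "5m", 100)
print(json.dumps(rule, indent=2))
```

Because the output is plain data, it can live in git next to the dashboard JSON and be reviewed like any other change.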

Phase 4: Switch source of record to Loki

Loki becomes the indexed-search source of record and on-call defaults to Grafana. Datadog indexed retention is cut, moved to Flex, or reduced to a filtered sample through the Collector.

Users see: On-call default UI and runbook URLs change, with at least two weeks lead notice.

Rollback: Revert index retention, drop the filter, switch pager destinations back. Under 15 minutes.
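One way to think about the "filtered sample" is deterministic hash-based sampling: the same record always gets the same keep-or-drop decision. The sketch below shows the idea in Python; in a real Collector this would be done by a sampling or filter processor rather than custom code, and the function name and ID format are our assumptions.

```python
import hashlib

def keep_for_datadog(record_id: str, sample_percent: float) -> bool:
    """Deterministically keep roughly `sample_percent` of records, keyed on a stable ID."""
    digest = hashlib.sha256(record_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # bucket in 0..9999
    return bucket < sample_percent * 100                  # e.g. 10% keeps buckets 0..999

# Roughly one in ten hypothetical record IDs survives at a 10% sample.
kept = sum(keep_for_datadog(f"line-{i}", 10) for i in range(10_000))
```

Rolling back is then a matter of setting the sample back to 100, which matches the "drop the filter" rollback above.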

Phase 5: Retire Datadog-specific features (or accept the gap)

For each Datadog-only feature (Watchdog, Log Patterns, Sensitive Data Scanner, Cloud SIEM) we replace it OSS-side, swap in a narrower SaaS, or record it as an accepted gap with a compensating control. The Datadog sample shrinks toward zero.

Users see: Per feature; communicated accordingly.

Rollback: Re-enable the Datadog sample and feature config. Under 15 minutes if index config is preserved.

Phase 6: Retire Datadog Logs

The Datadog log exporter comes out of the Collector, indexes are set to size zero, and the Logs SKU is dropped at the next renewal. Archives are retained per your compliance hold in your own Object Lock buckets.

Users see: None — Grafana has been primary since Phase 4.

Rollback: Re-add the exporter and reissue the key in minutes — but if the SKU was dropped, rollback means re-contracting, so the 30-day Phase 5 soak gates this step.

Feature parity

What moves cleanly, and what doesn't.

| Capability | Loki + Grafana | Datadog Logs | Parity |
|---|---|---|---|
| Log signal coverage | Loki distributor /loki/api/v1/push ingest | Datadog Log intake http-intake.logs.<site>/api/v2/logs | At parity |
| Collection / agent | Promtail / Grafana Alloy / OTel Collector loki exporter | Datadog Agent logs: block / Lambda Forwarder | At parity |
| Query language | Loki LogQL (\| json, count_over_time, rate) | Datadog log search syntax (faceted, intake-indexed) | At parity |
| Dashboards-as-code | Grafana dashboard JSON via Terraform grafana_dashboard / Grizzly jsonnet | Datadog datadog_dashboard TF + datadog-sync-cli | At parity |
| Alerting | Loki ruler recording/alerting rules + Grafana unified alerting | Datadog log-alert Monitors | At parity |
| Live tail | Grafana Explore Live tailing / /loki/api/v1/tail WebSocket | Datadog Live Tail | At parity |
| RBAC + multi-tenancy | X-Scope-OrgID header + per-tenant limits_config overrides | Datadog orgs (separate accounts), parent-child views | At parity |
| Retention / storage tiers | S3 lifecycle Standard→IA→Glacier IR via compactor | Datadog Flex Logs + Archives (S3-backed) | Partial |
| Log-to-metric | Loki ruler recording rules → Prometheus remote-write | Datadog Generate Metrics from Logs (billable custom metric) | At parity |
| Anomaly / AIOps | None (no Watchdog equivalent; per-signal only) | Datadog Watchdog (zero-config cross-signal) | SaaS only |
| Pattern clustering | Grafana 11 log patterns / LogQL \| pattern extraction | Datadog Log Patterns (auto-clustering) | Partial |
| Intake PII redaction | None first-party (redact in Collector/Vector tier) | Datadog Sensitive Data Scanner (intake-side) | SaaS only |
| SIEM detection content | None (Wazuh/OpenSearch is a separate workstream) | Datadog Cloud SIEM (Sigma rules, ATT&CK) | SaaS only |
| Compliance attestations | Self-hosted; controls in your SSP (Object Lock, KMS) | Datadog SOC 2 / ISO / FedRAMP boundary | Partial |
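To show how the ingest and multi-tenancy rows fit together, here is a sketch that builds (but does not send) a push request to Loki's `/loki/api/v1/push` endpoint with an `X-Scope-OrgID` tenant header. The base URL, tenant name and labels are hypothetical.

```python
import json
import time
import urllib.request

def build_push_request(
    base_url: str, tenant: str, labels: dict[str, str], line: str
) -> urllib.request.Request:
    """Build a Loki push request for one log line, scoped to one tenant."""
    payload = {
        "streams": [{
            "stream": labels,                               # label set identifying the stream
            "values": [[str(time.time_ns()), line]],        # [nanosecond timestamp, log line]
        }]
    }
    return urllib.request.Request(
        url=f"{base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-Scope-OrgID": tenant},
        method="POST",
    )

req = build_push_request(
    "http://loki-gateway:3100", "team-a", {"service": "checkout", "env": "prod"}, "order placed"
)
# urllib.request.urlopen(req) would send it; left out so the sketch has no side effects.
```

In the tee'd setup the Collector or agent makes this call for you; the sketch only shows what travels over the wire.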

What we're honest about

The caveats most vendors leave out.

Watchdog AIOps has no OSS parity

Datadog Watchdog runs zero-config anomaly detection across logs, metrics and APM at whole-tenant scope. There is no general-purpose open-source replacement — per-signal anomaly via predict_linear() or isolation-forest is possible but never zero-config. Plan to accept the gap or replace it with a narrower-scope SaaS, and we will tell you which up front.
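To show why per-signal anomaly detection is never zero-config, here is a sketch of the idea behind predict_linear(): a least-squares line fitted to one signal and extrapolated forward. This is our own illustration, not Prometheus's implementation; it treats the last sample's timestamp as "now", and someone still has to choose the signal, the window and the threshold.

```python
def predict_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and extrapolate past the last one."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

# Error lines per minute rising by 10 each minute; where will the rate be one minute on?
forecast = predict_linear([(0, 10), (60, 20), (120, 30)], seconds_ahead=60)
```

An alert would compare `forecast` with a hand-picked threshold, per signal, which is exactly the configuration work Watchdog hides.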

Intake-time PII scanning moves upstream

Loki has no first-party Sensitive Data Scanner equivalent; it assumes lines arrive already redacted. So we move PII redaction into the Collector or Vector tier, upstream of the fan-out, where Presidio is the closest match for managed rule packs. It is a real workstream, not a checkbox, and we scope it before Phase 2.
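A minimal sketch of what upstream redaction does to a line, using two deliberately crude regular expressions. The patterns and placeholder tokens are our assumptions; a managed rule pack covers far more than this, and the card pattern would also catch any unrelated 13 to 16 digit run such as a millisecond timestamp.

```python
import re

# Crude illustrative patterns, not a complete PII rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")   # 13 to 16 digits, optional space/dash separators

def redact(line: str) -> str:
    """Replace email addresses and card-like digit runs before the line leaves the pipeline."""
    line = EMAIL.sub("[REDACTED_EMAIL]", line)
    return CARD.sub("[REDACTED_PAN]", line)

clean = redact("user=jane.doe@example.com card 4111 1111 1111 1111 declined")
```

Because this runs before the fan-out, both backends receive the same redacted line.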

Pattern clustering is a functional gap

Datadog Log Patterns auto-clusters semantically similar lines. Grafana 11's log patterns feature comes closer but is not equivalent, and LogQL's pattern extraction is a manual primitive, not auto-clustering. We treat this as an honest gap rather than pretend the UX is identical.
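To illustrate the difference, here is a naive sketch of template-based grouping: variable-looking tokens are masked with a `<_>` placeholder and identical templates are counted. It is our own toy, not how Datadog or Grafana cluster lines, and it shows the manual part: someone has to decide what counts as a variable.

```python
import re
from collections import Counter

# Hand-picked notion of "variable": numbers, dotted numbers, long hex strings.
VARIABLE = re.compile(r"\b(?:\d+(?:\.\d+)*|[0-9a-f]{8,})\b")

def template(line: str) -> str:
    """Mask variable-looking tokens so structurally identical lines collapse together."""
    return VARIABLE.sub("<_>", line)

def cluster(lines: list[str]) -> Counter:
    """Count lines per masked template."""
    return Counter(template(line) for line in lines)

groups = cluster([
    "user 42 logged in from 10.0.0.7",
    "user 7 logged in from 10.0.0.9",
    "payment failed for order 9913",
])
```

Lines that differ in wording but mean the same thing stay in separate groups here, which is the gap auto-clustering closes.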

Self-hosting means you own the math

Loki forces the cardinality conversation upfront, you own retention tiering on S3, and Cloud SIEM detection content is a separate Wazuh/OpenSearch workstream, not a logging-migration concern. At more than 2 TB/day indexed, self-hosted Loki typically wins on TCO; below that, Datadog often does. We size it honestly against your contracted rate.

Why this beats a flag day

Reversible at every step.

Every phase up to retirement is a Collector or agent config flip with an under-15-minute rollback window while the parallel pipeline is live — revert a pod spec, drop a filter, switch a pager destination back. We never cancel the Datadog Logs contract until Loki has held source-of-record on-call for a minimum 30-day green soak. The soak gate is the point: you only burn the bridge once the new path has proven itself in production.
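The soak gate itself is easy to state precisely. A minimal sketch, assuming one green/red verdict per day with the most recent day last; how a day earns "green" is up to your on-call criteria and is not defined here.

```python
def soak_gate_passed(daily_green: list[bool], required_days: int = 30) -> bool:
    """True once the most recent `required_days` days are all green."""
    if len(daily_green) < required_days:
        return False
    return all(daily_green[-required_days:])

# A red day four days ago resets the clock, even after 40 days on Loki.
history = [True] * 36 + [False] + [True] * 3
gate_open = soak_gate_passed(history)
```

Only a consecutive run counts, so a single red day restarts the 30-day wait.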

See whether your log estate migrates cleanly.

A 30-minute call with a senior observability engineer. We map your Datadog Pipelines, facets and monitors, size Loki against your real volume, and tell you honestly which Datadog-only features have no OSS parity — before you commit to anything.

Map my migration →