Monitoring tells you when something is broken. Observability tells you why. The distinction matters because modern distributed systems fail in ways that are impossible to predict — the combination of a slow database query, a network partition, and a memory leak on one service out of fifty produces symptoms that no predefined dashboard anticipated.

Observability is built on three pillars: metrics (what is happening), traces (how requests flow through the system), and logs (what happened at specific moments). The tools in this space collect, store, and query this data to help you understand your system's behavior.

OpenTelemetry: The Instrumentation Standard

OpenTelemetry (OTel) is not an observability platform — it is a vendor-neutral standard for instrumenting applications. OTel provides APIs, SDKs, and tools for generating and collecting telemetry data (metrics, traces, and logs) that you then send to the observability backend of your choice.

Why OTel Matters

Before OpenTelemetry, every observability vendor had its own instrumentation library. Switching from Datadog to New Relic meant re-instrumenting your entire application. OTel eliminates this lock-in — instrument once with OTel, and send data to any compatible backend.

Components

Auto-Instrumentation

For many languages (Java, Python, Node.js, .NET), OTel provides auto-instrumentation that captures traces and metrics from common frameworks (Express, Django, Spring Boot, gRPC) without code modifications. Install the agent, configure the export destination, and you get distributed tracing across your services.
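As a rough illustration of the "install the agent, configure the export destination" step, here is a minimal Python sketch of a launcher helper. The environment variable names come from the OpenTelemetry specification and `opentelemetry-instrument` is the wrapper command shipped with the Python auto-instrumentation packages; the service name `checkout`, the endpoint `http://localhost:4317`, and the `app.py` entry point are assumptions made up for this example.

```python
import os
import shlex


def otel_launch_env(service_name, otlp_endpoint, base_env=None):
    """Return a copy of the environment with OpenTelemetry export settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OTEL_SERVICE_NAME": service_name,             # name shown in the backend
        "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,  # collector or backend address
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_METRICS_EXPORTER": "otlp",
    })
    return env


def otel_launch_command(app_command):
    """Prefix the usual start command with the auto-instrumentation wrapper."""
    return ["opentelemetry-instrument", *app_command]


# Hypothetical service and endpoint, for illustration only.
env = otel_launch_env("checkout", "http://localhost:4317", base_env={})
cmd = otel_launch_command(["python", "app.py"])
print(shlex.join(cmd))  # opentelemetry-instrument python app.py
# To actually start the service: subprocess.run(cmd, env=env, check=True)
```

Because the application code is untouched, pointing the same service at a different backend is a matter of changing the endpoint value rather than re-instrumenting.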

Best for: Any team wanting vendor-neutral instrumentation. Use OTel regardless of which backend you choose.

Pricing: Free and open source.

The Grafana Stack (Open Source)

The Grafana stack provides an open-source observability platform using purpose-built databases for each telemetry type.

Grafana

Grafana is the visualization and dashboarding layer. It queries data from multiple sources — Prometheus, Loki, Tempo, and dozens of other data sources — and presents it in dashboards, alerts, and explorations.

Grafana's strength is its flexibility. A single dashboard can show metrics from Prometheus, logs from Loki, and traces from Tempo, with links between them for correlation.

Prometheus (Metrics)

Prometheus is the standard for metrics collection in cloud-native environments. It scrapes metrics endpoints from your services and stores them in a time-series database. PromQL (Prometheus Query Language) provides powerful querying for alerting and analysis.

Prometheus excels at infrastructure and application metrics — request rates, error rates, latency percentiles, CPU usage, memory consumption, and custom business metrics.

Scaling consideration: Single-node Prometheus has storage and query limitations for large deployments. Solutions include Thanos, Cortex, or Grafana Mimir for long-term storage and horizontal scaling.

Loki (Logs)

Loki is a log aggregation system designed to be cost-effective and easy to operate. Unlike Elasticsearch-based solutions that index the full text of every log, Loki indexes only metadata (labels) and stores log data in compressed chunks.

This design trade-off means Loki uses significantly less storage and compute than Elasticsearch, but full-text search across all logs is not as fast. For most operational use cases — "show me logs from service X in the last 10 minutes" — Loki performs well.
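The trade-off can be illustrated with a toy in-memory store. This is not Loki's implementation (there are no chunks, compression, or time ranges here), and the class and method names are invented for the example; it only shows the idea that labels select which streams to read, while text matching is a scan over the selected streams.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy store that indexes log streams by label set, never by log text."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self._streams = defaultdict(list)

    def push(self, labels, line):
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Select streams through the label index, then scan only those streams."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: labels only
                results.extend(line for line in lines if contains in line)  # scan
        return results


store = LabelIndexedLogStore()
store.push({"service": "checkout", "level": "error"}, "payment timeout after 30s")
store.push({"service": "checkout", "level": "info"}, "order 1 created")
store.push({"service": "search", "level": "error"}, "index timeout")

# Narrow label selector: only the checkout streams are read.
print(store.query({"service": "checkout"}, contains="timeout"))
# Empty selector: every stream has to be scanned for the text match.
print(store.query({}, contains="timeout"))
```

The second query is the case the text warns about: without labels to narrow the search, the store has nothing to consult except the log lines themselves.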

Tempo (Traces)

Tempo is a distributed tracing backend that stores traces in object storage (S3, GCS). According to Grafana, Tempo requires no sampling — it can store every trace, not just a sample.

Tempo integrates with OpenTelemetry for trace collection and with Grafana for visualization. The "Trace to Logs" and "Trace to Metrics" features in Grafana let you jump from a slow trace directly to the relevant logs and metrics.
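Jumping from a trace to its logs generally depends on the log lines carrying the trace ID. The sketch below shows one way to do that with Python's standard `logging` module. The `current_trace_id` context variable is a stand-in assumed for this example; in an instrumented service the ID would come from the tracing SDK's active span. The logger name and the `trace_id=` field format are also assumptions.

```python
import contextvars
import io
import logging

# Stand-in for the tracing SDK's notion of the active trace (assumption).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every record so logs can be joined to traces."""

    def filter(self, record):
        record.trace_id = current_trace_id.get() or "-"
        return True  # never drop the record, only enrich it


def build_logger(stream, name="checkout"):
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # example ID
log.warning("payment provider slow")
print(buffer.getvalue(), end="")
```

With the ID present in every line, a log query filtered on that value returns exactly the lines emitted while the slow request was being handled.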

Strengths of the Grafana Stack

Limitations

Best for: Teams with infrastructure expertise that want cost-effective, open-source observability.

Pricing: Free and open source. Grafana Cloud offers a managed version with a generous free tier.

Datadog

Datadog provides a managed observability platform covering metrics, traces, logs, security monitoring, and more. According to the company, Datadog provides over 750 integrations for monitoring infrastructure, applications, and third-party services.

Strengths

Limitations

Best for: Teams that want comprehensive, managed observability and can budget for it.

Pricing: Infrastructure from $18/host/month. APM from $35/host/month. Log Management from $0.10/GB ingested plus $2.55/million log events for 15-day retention. Free tier: up to 5 hosts.

Other Notable Options

New Relic

New Relic provides a managed observability platform with a generous free tier (100 GB/month of data ingest). The pricing model — per-GB ingestion rather than per-host — can be more predictable than Datadog for some workloads. New Relic's all-in-one platform covers APM, infrastructure monitoring, logs, browser monitoring, and synthetic checks. The platform includes AI-powered error analysis and supports full OpenTelemetry data ingestion. Starting price is $0.35/GB beyond the free tier, with no per-host fees.

Honeycomb

Honeycomb focuses on high-cardinality event analysis and distributed tracing. It is designed for debugging complex distributed systems where you need to ask arbitrary questions about your telemetry data — questions you did not anticipate when building dashboards. Honeycomb's BubbleUp feature automatically identifies the attributes that correlate with anomalous behavior, dramatically reducing investigation time. The platform is OpenTelemetry-native and excels at answering "why" questions during incident investigation. Free tier includes 20 million events/month.
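To give a feel for what "attributes that correlate with anomalous behavior" means, here is a toy sketch that compares how often each attribute value appears in slow events versus normal ones. It is not Honeycomb's algorithm and makes no claim about how BubbleUp works internally; the function name and the sample events are invented for illustration.

```python
from collections import Counter


def attribute_skew(anomalous, baseline, top=3):
    """Rank (attribute, value) pairs by how much more common they are in anomalous events."""

    def share(events):
        counts = Counter((key, val) for event in events for key, val in event.items())
        total = len(events) or 1
        return {pair: n / total for pair, n in counts.items()}

    bad, good = share(anomalous), share(baseline)
    diffs = {pair: frac - good.get(pair, 0.0) for pair, frac in bad.items()}
    # Largest difference first; ties broken by attribute name for stable output.
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))[:top]


# Hypothetical events: every slow request ran build v42.
slow = [
    {"region": "eu", "build": "v42", "plan": "free"},
    {"region": "us", "build": "v42", "plan": "pro"},
    {"region": "eu", "build": "v42", "plan": "pro"},
    {"region": "us", "build": "v42", "plan": "free"},
]
normal = [
    {"region": "eu", "build": "v41", "plan": "free"},
    {"region": "us", "build": "v41", "plan": "pro"},
    {"region": "eu", "build": "v42", "plan": "pro"},
    {"region": "us", "build": "v41", "plan": "free"},
]
print(attribute_skew(slow, normal, top=1))  # [(('build', 'v42'), 0.75)]
```

Region and plan are evenly spread across both groups, so only the build attribute stands out. This kind of comparison is only possible when events keep their high-cardinality attributes instead of being pre-aggregated.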

SigNoz

SigNoz is an open-source alternative to Datadog, providing metrics, traces, and logs in a single platform with OpenTelemetry-native instrumentation. Built on ClickHouse for fast query performance, SigNoz is easier to operate than the full Grafana stack (one platform instead of four separate components) while providing more integrated features than any single Grafana component. SigNoz supports alerts, dashboards, service maps, and flame graphs out of the box. Self-hosted is free with no limits; SigNoz Cloud starts at $199/month with usage-based pricing.

What Changed in 2026

OpenTelemetry

Grafana Stack

Datadog

Quick Comparison

| Feature | Grafana Stack | Datadog | New Relic | SigNoz |
| --- | --- | --- | --- | --- |
| Deployment | Self-hosted / Cloud | SaaS only | SaaS only | Self-hosted / Cloud |
| Metrics | Prometheus / Mimir | Built-in | Built-in | ClickHouse |
| Logs | Loki | Built-in | Built-in | ClickHouse |
| Traces | Tempo | Built-in | Built-in | ClickHouse |
| OTel Native | Yes (via Alloy) | Supported | Supported | Yes (native) |
| AI/ML Features | Limited | Watchdog AI, Bits AI | Applied Intelligence | Query builder |
| Free Tier | Unlimited (self-hosted) | 5 hosts, 14 days | 100 GB/month | Unlimited (self-hosted) |
| Starting Price | Free / $0 (Cloud free tier) | $18/host/month | $0.35/GB | Free / Cloud from $199/mo |

Decision Framework

Choose the Grafana Stack if:

Choose Datadog if:

Choose OpenTelemetry regardless:

Implementation Strategy

  1. Start with OTel instrumentation: Add OpenTelemetry to your services using auto-instrumentation
  2. Choose a backend: Start with Grafana Cloud free tier or Datadog free trial to evaluate
  3. Instrument the critical path: Focus on the request paths that generate revenue or serve users
  4. Set up alerts on SLOs: Alert on service level objectives (for example, 99.9% of requests complete in under 500 ms) rather than raw metrics
  5. Build investigation workflows: Practice navigating from alert to metrics to traces to logs to find root causes
  6. Iterate: Add custom metrics and traces as you discover gaps in your visibility
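Step 4 can be sketched in a few lines. The example below checks a batch of request latencies against the SLO used as the example above (99.9% of requests under 500 ms) and reports how much of the error budget the slow requests consumed. The function name, the returned fields, and the sample data are assumptions for illustration; a production setup would evaluate this in the metrics backend over a rolling window.

```python
def slo_report(latencies_ms, threshold_ms=500, target=0.999):
    """Summarize latency SLO compliance for a batch of requests."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("need at least one request")
    bad = sum(1 for ms in latencies_ms if ms >= threshold_ms)
    compliance = (total - bad) / total
    allowed_bad = (1 - target) * total  # the error budget, in requests
    budget_used = bad / allowed_bad if allowed_bad else float("inf")
    return {
        "compliance": compliance,
        "bad_requests": bad,
        "budget_used": budget_used,  # 1.0 means the budget is fully spent
        "breached": compliance < target,
    }


# Hypothetical window: 10,000 requests, 5 of them slow.
window = [120] * 9_995 + [900] * 5
report = slo_report(window)
print(report["bad_requests"], report["breached"])  # 5 False
```

Alerting on `breached`, or on `budget_used` climbing quickly, ties the alert to user-visible impact instead of to a raw resource metric such as CPU.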

The best observability system is the one your team actually uses to investigate incidents. Start simple, instrument the critical paths, and expand coverage as you learn what data you need.

FAQ

What is the difference between monitoring and observability?

Monitoring tells you when something is broken by tracking predefined metrics and thresholds. Observability tells you why it broke by letting you ask arbitrary questions about your system using metrics, traces, and logs. Monitoring is reactive (alert when CPU > 90%), while observability is investigative (why are requests to service X slow when called from service Y on Tuesdays?).

Should I use OpenTelemetry even if I use Datadog?

Yes. OpenTelemetry is a vendor-neutral instrumentation standard, not a competing platform. Using OTel for instrumentation protects you from vendor lock-in — if you later switch from Datadog to Grafana or another backend, you do not need to re-instrument your applications. Datadog fully supports receiving OTel data.

How much does Datadog cost compared to open-source alternatives?

Datadog starts at $18/host/month for infrastructure monitoring and $35/host/month for APM. Log management is $0.10/GB ingested. For a team with 50 hosts, if each host is billed for both infrastructure monitoring and APM, list prices come to 50 × ($18 + $35) = $2,650/month before any log charges. The Grafana stack (Prometheus, Loki, Tempo) and SigNoz are open source and free to self-host, so your costs are infrastructure only. However, self-hosting requires operational expertise.
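The arithmetic behind an estimate like this can be sketched as a small function. It uses the list prices quoted in this article and assumes every host is billed for both infrastructure monitoring and APM. It ignores log indexing charges, discounts, and any other line items, so treat the result as a rough lower bound rather than a quote.

```python
# List prices as quoted in this article (USD per month).
INFRA_PER_HOST = 18.00
APM_PER_HOST = 35.00
LOG_INGEST_PER_GB = 0.10


def datadog_monthly_estimate(hosts, log_gb=0.0):
    """Rough monthly cost: per-host infrastructure + APM, plus log ingestion."""
    per_host = INFRA_PER_HOST + APM_PER_HOST
    return round(hosts * per_host + log_gb * LOG_INGEST_PER_GB, 2)


print(datadog_monthly_estimate(50))               # 2650.0
print(datadog_monthly_estimate(50, log_gb=1000))  # 2750.0
```

The per-host term dominates at this scale, which is why host count is usually the first number to check when comparing against self-hosted options.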

What are the three pillars of observability?

The three pillars of observability are metrics (quantitative measurements like request rate, error rate, and latency), traces (records of how requests flow through distributed services), and logs (timestamped records of discrete events). Modern observability platforms correlate all three to help you investigate incidents efficiently.

Is Grafana free to use?

Yes, Grafana is open source (AGPL v3) and free to self-host with no limits. The full Grafana stack (Grafana for dashboards, Prometheus for metrics, Loki for logs, and Tempo for traces) can be run entirely on your own infrastructure. Grafana Cloud also offers a managed version with a generous free tier that includes 10,000 metric series, 50 GB of logs, and 50 GB of traces.