Monitoring tells you when something is broken. Observability tells you why. The distinction matters because modern distributed systems fail in ways that are impossible to predict — the combination of a slow database query, a network partition, and a memory leak on one service out of fifty produces symptoms that no predefined dashboard anticipated.
Observability is built on three pillars: metrics (what is happening), traces (how requests flow through the system), and logs (what happened at specific moments). The tools in this space collect, store, and query this data to help you understand your system's behavior.
OpenTelemetry: The Instrumentation Standard
OpenTelemetry (OTel) is not an observability platform — it is a vendor-neutral standard for instrumenting applications. OTel provides APIs, SDKs, and tools for generating and collecting telemetry data (metrics, traces, and logs) that you then send to the observability backend of your choice.
Why OTel Matters
Before OpenTelemetry, every observability vendor had its own instrumentation library. Switching from Datadog to New Relic meant re-instrumenting your entire application. OTel eliminates this lock-in — instrument once with OTel, and send data to any compatible backend.
Components
- API: Vendor-neutral interfaces for creating traces, metrics, and logs in your application code
- SDK: Implementations that process and export telemetry data
- Auto-instrumentation: Agents that automatically instrument common frameworks and libraries without code changes
- Collector: A proxy that receives, processes, and exports telemetry data. Can filter, sample, and route data to multiple backends simultaneously
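To make the Collector's role concrete, here is a minimal in-memory sketch of the receive, process, export flow described above. It is not the Collector's API (the real Collector is configured declaratively with receivers, processors, and exporters); the function names, the span dictionaries, and the `/healthz` route are all assumptions made for illustration.

```python
import random


def drop_health_checks(spans):
    """Filter processor: discard spans from a noisy endpoint (hypothetical route)."""
    return [s for s in spans if s["route"] != "/healthz"]


def make_sampler(ratio, seed=0):
    """Sampling processor: always keep error spans, keep about `ratio` of the rest."""
    rng = random.Random(seed)

    def sample(spans):
        return [s for s in spans if s["error"] or rng.random() < ratio]

    return sample


def run_pipeline(received, processors, exporters):
    """Run a batch through each processor, then fan it out to every exporter."""
    batch = list(received)
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


# Two in-memory lists stand in for two observability backends.
backend_a, backend_b = [], []
spans = [
    {"route": "/healthz", "error": False},
    {"route": "/checkout", "error": True},
    {"route": "/cart", "error": False},
]
kept = run_pipeline(
    spans,
    processors=[drop_health_checks, make_sampler(ratio=0.0)],
    exporters=[backend_a.extend, backend_b.extend],
)
```

With a sampling ratio of 0.0, only the error span survives, and both stand-in backends receive the same batch. That is the routing behavior the Collector provides without touching application code.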
Auto-Instrumentation
For many languages (Java, Python, Node.js, .NET), OTel provides auto-instrumentation that captures traces and metrics from common frameworks (Express, Django, Spring Boot, gRPC) without code modifications. Install the agent, configure the export destination, and you get distributed tracing across your services.
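Distributed tracing across services depends on each service passing trace context to the next one. OTel SDKs typically do this with the W3C Trace Context `traceparent` header, which auto-instrumentation injects and extracts for you. The sketch below only illustrates the header's shape and the rule that a child span keeps the trace ID while getting a new span ID; the helper names are hypothetical and this is not the OTel propagator implementation.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Continue an incoming trace; fall back to a new trace if the header is malformed."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    # Same trace ID, new span ID: this is what ties spans from different services together.
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()
downstream = child_traceparent(root)
```

Because every service forwards the same trace ID, a backend can reassemble the spans into one end-to-end trace. The sketch skips details a real propagator handles, such as rejecting all-zero IDs and carrying `tracestate`.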
Best for: Any team wanting vendor-neutral instrumentation. Use OTel regardless of which backend you choose.
Pricing: Free and open source.
The Grafana Stack (Open Source)
The Grafana stack provides an open-source observability platform using purpose-built databases for each telemetry type.
Grafana
Grafana is the visualization and dashboarding layer. It queries data from multiple sources — Prometheus, Loki, Tempo, and dozens of other data sources — and presents it in dashboards, alerts, and explorations.
Grafana's strength is its flexibility. A single dashboard can show metrics from Prometheus, logs from Loki, and traces from Tempo, with links between them for correlation.
Prometheus (Metrics)
Prometheus is the standard for metrics collection in cloud-native environments. It scrapes metrics endpoints from your services and stores them in a time-series database. PromQL (Prometheus Query Language) provides powerful querying for alerting and analysis.
Prometheus excels at infrastructure and application metrics — request rates, error rates, latency percentiles, CPU usage, memory consumption, and custom business metrics.
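The endpoint that Prometheus scrapes returns plain text in the Prometheus exposition format. In practice you would use an official client library rather than formatting this by hand; the standard-library sketch below, with a hypothetical `RequestCounter` class and metric name, only shows what a scrape response looks like for a labeled counter.

```python
from collections import Counter


class RequestCounter:
    """A labeled, monotonically increasing counter rendered as exposition text."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = Counter()

    def inc(self, **labels):
        # Each distinct label combination is its own time series.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            # Note: label values are not escaped here; a real client escapes quotes and newlines.
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{self.name}{{{label_str}}} {value}")
        return "\n".join(lines) + "\n"


requests_total = RequestCounter("http_requests_total", "Total HTTP requests.")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

Once Prometheus has scraped a counter like this, a PromQL expression such as `rate(http_requests_total[5m])` turns the raw count into a per-second request rate.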
Scaling consideration: Single-node Prometheus has storage and query limitations for large deployments. Solutions include Thanos, Cortex, or Grafana Mimir for long-term storage and horizontal scaling.
Loki (Logs)
Loki is a log aggregation system designed to be cost-effective and easy to operate. Unlike Elasticsearch-based solutions that index the full text of every log, Loki indexes only metadata (labels) and stores log data in compressed chunks.
This design trade-off means Loki uses significantly less storage and compute than Elasticsearch, but full-text search across all logs is not as fast. For most operational use cases — "show me logs from service X in the last 10 minutes" — Loki performs well.
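The trade-off is easier to see in a toy model. The sketch below is not how Loki is implemented; it is a hypothetical in-memory store that indexes only label sets and scans line contents at query time, which is the general idea behind the design.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy log store: labels are indexed, line contents are not."""

    def __init__(self):
        # label set -> list of lines (a plain list stands in for compressed chunks)
        self._streams = defaultdict(list)

    def push(self, labels, line):
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self._streams.items():
            # Cheap step: pick streams by label, using only the small index.
            if wanted <= stream_labels:
                # Expensive step: brute-force scan of the selected streams' contents.
                hits.extend(line for line in lines if contains in line)
        return hits


store = LabelIndexedLogStore()
store.push({"service": "checkout", "env": "prod"}, "payment timeout after 30s")
store.push({"service": "checkout", "env": "prod"}, "order 42 created")
store.push({"service": "search", "env": "prod"}, "query timeout")
matches = store.query({"service": "checkout"}, contains="timeout")
```

A narrow label selector keeps the scan small, which is why queries scoped to one service and a short time range stay fast. In LogQL the same query would look roughly like `{service="checkout"} |= "timeout"`.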
Tempo (Traces)
Tempo is a distributed tracing backend that stores traces in object storage (S3, GCS). According to Grafana, Tempo requires no sampling — it can store every trace, not just a sample.
Tempo integrates with OpenTelemetry for trace collection and with Grafana for visualization. The "Trace to Logs" and "Trace to Metrics" features in Grafana let you jump from a slow trace directly to the relevant logs and metrics.
Strengths of the Grafana Stack
- Open source: Run everything on your own infrastructure with full control
- Cost-effective: No per-host or per-GB pricing from a vendor. Your costs are infrastructure only
- Correlated data: Grafana links metrics, logs, and traces for seamless debugging
- Community: Large, active community with extensive documentation and examples
- Flexibility: Mix and match components. For example, keep Prometheus for metrics while using Elasticsearch instead of Loki for logs, or swap out a single component without replacing the rest
Limitations
- Operational burden: Running Prometheus, Loki, Tempo, and Grafana requires infrastructure expertise
- Scaling complexity: Each component has its own scaling model. High-availability setups are nontrivial
- No built-in APM: Application performance monitoring (code-level profiling, dependency maps) requires additional tools
- Alert management: Grafana alerting has improved, but on-call routing and escalation are not as sophisticated as in dedicated tools such as PagerDuty or Opsgenie
Best for: Teams with infrastructure expertise that want cost-effective, open-source observability.
Pricing: Free and open source. Grafana Cloud offers a managed version with a generous free tier.
Datadog
Datadog provides a managed observability platform covering metrics, traces, logs, security monitoring, and more. According to the company, Datadog provides over 750 integrations for monitoring infrastructure, applications, and third-party services.
Strengths
- Fully managed: No infrastructure to operate. Datadog handles storage, scaling, and availability
- Breadth: Metrics, traces, logs, profiling, security, synthetic monitoring, RUM (Real User Monitoring), CI visibility — everything in one platform
- Integrations: Out-of-the-box integrations with AWS, GCP, Azure, Kubernetes, Docker, databases, and hundreds of application frameworks
- APM: Deep application performance monitoring with code-level profiling, dependency maps, and runtime metrics
- Dashboards and alerting: Polished dashboarding with anomaly detection, forecasting, and sophisticated alert conditions
- Correlation: Navigate seamlessly between metrics, traces, and logs for a specific incident
- Watchdog AI: AI-powered anomaly detection that aims to surface issues before they impact users
Limitations
- Cost: Datadog is expensive. Per-host pricing for infrastructure, per-GB pricing for logs, per-million-spans pricing for traces — costs add up quickly at scale
- Vendor lock-in: While Datadog supports OpenTelemetry, many features work best with the Datadog agent and libraries
- Bill shock: Without careful cost management, Datadog bills can escalate unexpectedly as data volumes grow
- Complexity: The platform has so many features that teams can spend months configuring it
Best for: Teams that want comprehensive, managed observability and can budget for it.
Pricing: Infrastructure from $18/host/month. APM from $35/host/month. Log Management from $0.10/GB ingested plus $2.55/million log events for 15-day retention. Free tier: up to 5 hosts.
Other Notable Options
New Relic
New Relic provides a managed observability platform with a generous free tier (100 GB/month of data ingest). The pricing model — per-GB ingestion rather than per-host — can be more predictable than Datadog for some workloads. New Relic's all-in-one platform covers APM, infrastructure monitoring, logs, browser monitoring, and synthetic checks. The platform includes AI-powered error analysis and supports full OpenTelemetry data ingestion. Starting price is $0.35/GB beyond the free tier, with no per-host fees.
Honeycomb
Honeycomb focuses on high-cardinality event analysis and distributed tracing. It is designed for debugging complex distributed systems where you need to ask arbitrary questions about your telemetry data, including questions you did not anticipate when building dashboards. Honeycomb's BubbleUp feature automatically identifies the attributes that correlate with anomalous behavior, which can shorten an investigation considerably. The platform is OpenTelemetry-native and excels at answering "why" questions during incident investigation. Free tier includes 20 million events/month.
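The idea of finding attributes that correlate with anomalous behavior can be sketched in a few lines. This is not Honeycomb's algorithm, only a naive illustration under stated assumptions: events are dictionaries with an `attrs` field, and "skew" is simply how much more often an attribute value appears among outliers than in the baseline. The function name and event shape are hypothetical.

```python
from collections import Counter


def attribute_skew(events, is_outlier, top=3):
    """Rank (attribute, value) pairs by how over-represented they are among outliers."""
    outliers = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    if not outliers or not baseline:
        return []

    def shares(group):
        # Fraction of events in the group carrying each (attribute, value) pair.
        counts = Counter((k, v) for e in group for k, v in e["attrs"].items())
        return {pair: n / len(group) for pair, n in counts.items()}

    out_share, base_share = shares(outliers), shares(baseline)
    skew = {pair: share - base_share.get(pair, 0.0) for pair, share in out_share.items()}
    return sorted(skew.items(), key=lambda item: item[1], reverse=True)[:top]


events = [
    {"duration_ms": 900, "attrs": {"region": "eu", "version": "v1"}},
    {"duration_ms": 950, "attrs": {"region": "eu", "version": "v2"}},
    {"duration_ms": 870, "attrs": {"region": "eu", "version": "v1"}},
    {"duration_ms": 80, "attrs": {"region": "us", "version": "v1"}},
    {"duration_ms": 75, "attrs": {"region": "us", "version": "v2"}},
    {"duration_ms": 90, "attrs": {"region": "us", "version": "v1"}},
    {"duration_ms": 85, "attrs": {"region": "us", "version": "v2"}},
    {"duration_ms": 70, "attrs": {"region": "eu", "version": "v1"}},
]
ranking = attribute_skew(events, is_outlier=lambda e: e["duration_ms"] > 500)
```

In this made-up data every slow request comes from `region=eu`, so that pair ranks first while `version` barely moves. This kind of question is hard to answer from pre-aggregated metrics, because it needs the raw, high-cardinality events.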
SigNoz
SigNoz is an open-source alternative to Datadog, providing metrics, traces, and logs in a single platform with OpenTelemetry-native instrumentation. Built on ClickHouse for fast query performance, SigNoz is easier to operate than the full Grafana stack (one platform instead of four separate components) while providing more integrated features than any single Grafana component. SigNoz supports alerts, dashboards, service maps, and flame graphs out of the box. Self-hosted is free with no limits; SigNoz Cloud starts at $199/month with usage-based pricing.
What Changed in 2026
OpenTelemetry
- Logs API stable: OTel Logs API reached stable status across all major languages, completing the three-pillar instrumentation story
- Profiling signal: Continuous profiling added as a fourth signal type, with initial support in Java and Go SDKs
- OTel Collector improvements: OpAMP (Open Agent Management Protocol) support for remote configuration of collectors at scale
Grafana Stack
- Grafana 11: Revamped explore experience, improved alerting with Grafana Incident integration, and native support for OpenTelemetry semantic conventions
- Grafana Alloy: The OpenTelemetry Collector distribution from Grafana (replacing Grafana Agent) reached GA, simplifying the pipeline from instrumentation to backend
- Adaptive Metrics: Grafana Cloud now offers AI-driven recommendations to drop unused metrics, reducing costs by up to 30%
Datadog
- Bits AI GA: Datadog's AI assistant for incident investigation reached general availability — correlates metrics, traces, and logs to suggest root causes
- Universal Service Monitoring: eBPF-based service discovery and monitoring without any code instrumentation or agent configuration
- Cloud Cost Management: Expanded integration linking infrastructure costs to specific services, helping teams understand the cost of observability itself
Quick Comparison
| Feature | Grafana Stack | Datadog | New Relic | SigNoz |
|---|---|---|---|---|
| Deployment | Self-hosted / Cloud | SaaS only | SaaS only | Self-hosted / Cloud |
| Metrics | Prometheus / Mimir | Built-in | Built-in | ClickHouse |
| Logs | Loki | Built-in | Built-in | ClickHouse |
| Traces | Tempo | Built-in | Built-in | ClickHouse |
| OTel Native | Yes (via Alloy) | Supported | Supported | Yes (native) |
| AI/ML Features | Limited | Watchdog AI, Bits AI | Applied Intelligence | Query builder |
| Free Tier | Unlimited (self-hosted) | 5 hosts, 14 days | 100 GB/month | Unlimited (self-hosted) |
| Starting Price | Free / $0 (Cloud free tier) | $18/host/month | $0.35/GB | Free / Cloud from $199/mo |
Decision Framework
Choose the Grafana Stack if:
- You have infrastructure expertise and want to minimize vendor costs
- Data sovereignty or compliance requires keeping telemetry data on your infrastructure
- You want maximum flexibility in how you collect, store, and query telemetry
Choose Datadog if:
- You want a fully managed platform and can budget for it
- Breadth of features (APM, security, CI visibility, RUM) is valuable
- Your team does not have the capacity to operate observability infrastructure
Use OpenTelemetry in either case:
- Instrument with OTel whichever backend you pick, so the backend decision stays reversible
- It protects you from vendor lock-in and provides a consistent instrumentation experience
- Auto-instrumentation reduces the effort to get started
Implementation Strategy
- Start with OTel instrumentation: Add OpenTelemetry to your services using auto-instrumentation
- Choose a backend: Start with Grafana Cloud free tier or Datadog free trial to evaluate
- Instrument the critical path: Focus on the request paths that generate revenue or serve users
- Set up alerts on SLOs: Alert on service level objectives (for example, 99.9% of requests complete in under 500 ms) rather than on raw metrics
- Build investigation workflows: Practice navigating from alert to metrics to traces to logs to find root causes
- Iterate: Add custom metrics and traces as you discover gaps in your visibility
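The SLO step in the list above can be made concrete with a small calculation. This is a minimal sketch, assuming you already have per-request latencies for an evaluation window; the function name and the returned fields are hypothetical, and a real setup would compute the same ratio in the metrics backend rather than in application code.

```python
def slo_report(latencies_ms, threshold_ms=500.0, target=0.999):
    """Check a latency SLO: at least `target` of requests finish under `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("no requests in the evaluation window")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    total = len(latencies_ms)
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    # The error budget is the number of slow requests the SLO tolerates in this window.
    allowed_bad = (1.0 - target) * total
    return {
        "compliance": good / total,
        "met": good / total >= target,
        "error_budget_used": (total - good) / allowed_bad,
    }


# 10,000 requests, 5 of them slow: within a 99.9% objective, half the budget spent.
window = [120.0] * 9995 + [800.0] * 5
report = slo_report(window)
```

Alerting on `error_budget_used` approaching 1.0 is closer to user impact than alerting on CPU or memory, because it fires only when the objective itself is at risk.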
The best observability system is the one your team actually uses to investigate incidents. Start simple, instrument the critical paths, and expand coverage as you learn what data you need.
FAQ
What is the difference between monitoring and observability?
Monitoring tells you when something is broken by tracking predefined metrics and thresholds. Observability tells you why it broke by letting you ask arbitrary questions about your system using metrics, traces, and logs. Monitoring is reactive (alert when CPU > 90%), while observability is investigative (why are requests to service X slow when called from service Y on Tuesdays?).
Should I use OpenTelemetry even if I use Datadog?
Yes. OpenTelemetry is a vendor-neutral instrumentation standard, not a competing platform. Using OTel for instrumentation protects you from vendor lock-in: if you later switch from Datadog to Grafana or another backend, you do not need to re-instrument your applications. Datadog can ingest OTel data, although some of its features work best with the Datadog agent and libraries.
How much does Datadog cost compared to open-source alternatives?
Datadog starts at $18/host/month for infrastructure monitoring and $35/host/month for APM, and log management is $0.10/GB ingested. For a team with 50 hosts, assuming each host is billed for both infrastructure monitoring and APM, list prices come to 50 × ($18 + $35) = $2,650/month before any log ingestion or retention charges. The Grafana stack (Prometheus, Loki, Tempo) and SigNoz are open source and free to self-host, so your costs are infrastructure only. However, self-hosting requires operational expertise.
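For a rough estimate at your own scale, the list prices quoted in this article can be plugged into a small calculator. This is a sketch under stated assumptions: every host is billed for both infrastructure monitoring and APM, the log volume is a figure you supply, and per-event log retention charges, discounts, and other products are ignored. The constant and function names are hypothetical.

```python
# List prices quoted in this article, in USD per month.
INFRA_PER_HOST = 18.00
APM_PER_HOST = 35.00
LOG_INGEST_PER_GB = 0.10


def datadog_monthly_estimate(hosts, log_gb_per_month=0.0):
    """Estimate a monthly bill from host count and ingested log volume."""
    host_cost = hosts * (INFRA_PER_HOST + APM_PER_HOST)
    log_cost = log_gb_per_month * LOG_INGEST_PER_GB
    return host_cost + log_cost


# 50 hosts with an assumed 1,000 GB of logs per month.
estimate = datadog_monthly_estimate(hosts=50, log_gb_per_month=1000)
```

The host-based portion dominates here (50 × $53 = $2,650), with the assumed log volume adding about $100 of ingestion. Actual bills depend on retention settings and on which products are enabled.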
What are the three pillars of observability?
The three pillars of observability are: metrics (quantitative measurements like request rate, error rate, and latency), traces (records of how requests flow through distributed services), and logs (timestamped records of discrete events). Modern observability platforms correlate all three to help you investigate incidents efficiently.
Is Grafana free to use?
Yes, Grafana is open source (AGPL v3) and free to self-host with no limits. The full Grafana stack — Grafana for dashboards, Prometheus for metrics, Loki for logs, and Tempo for traces — can be run entirely on your own infrastructure. Grafana Cloud also offers a managed version with a generous free tier that includes 10,000 metrics series, 50 GB logs, and 50 GB traces.