OpenTelemetry in Practice: Setting Up Metrics, Traces, and Logs for Your Applications


Your deployment pipeline is green, the health check returns 200, and users are reporting that the site is slow. You check the server — CPU is fine, memory is fine, disk is fine. So where's the problem?

This is the observability gap. Most teams have some logging, maybe a dashboard showing CPU and memory, but no coherent way to answer the question that actually matters during an incident: what is happening, right now, across all the services involved in this request?

OpenTelemetry has become the industry standard for closing that gap. It's a CNCF project (the second most active after Kubernetes) that provides a single, vendor-neutral framework for collecting metrics, traces, and logs from any application — regardless of language, framework, or infrastructure. This guide covers what OpenTelemetry actually is, how to set it up, and how to connect it to your deployment pipeline so you always know whether the last release caused the problem.

The Three Pillars — and Why You Need All of Them

Observability rests on three signal types, each answering a different question:

Metrics answer "is something wrong?" They're aggregated numbers — request count, error rate, response time percentiles, CPU usage, queue depth. Metrics are cheap to collect, cheap to store, and the foundation of alerting. When your PagerDuty goes off at 2 AM, it's because a metric crossed a threshold.

Traces answer "where is it slow?" A distributed trace follows a single request as it moves across services, databases, message queues, and external APIs. Each operation is a span with a start time, duration, and metadata. When a user reports that checkout is slow, traces show you that 800ms of a 1.2s request is spent waiting for the payment gateway.

Logs answer "why did it break?" Logs provide the detailed context that metrics and traces can't: the exact SQL query that failed, the malformed JSON payload, the specific validation error. The key is correlating logs with traces — when a trace shows a slow span, the correlated logs explain what happened during that span.

Without all three, you're debugging with partial information:

  • Metrics without traces: you know something is slow, but not where
  • Traces without logs: you know where it's slow, but not why
  • Logs without traces: you have detail but can't connect it across services
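To make the log-trace correlation concrete, here is a stdlib-only sketch of what it looks like. In a real setup the OpenTelemetry SDK injects the IDs of the active span for you; the `logging.Filter` and the hard-coded trace ID below are stand-ins for that machinery.

```python
import logging

class TraceContextFilter(logging.Filter):
    """Attach a trace ID to every log record (hard-coded here;
    the OpenTelemetry SDK supplies the real active-span IDs)."""
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("4bf92f3577b34da6a3ce929d0e0e4736"))

logger.warning("payment gateway took 800ms")
# Every log line now carries the trace ID, so the backend can join
# this record to the slow span in the same trace.
```

With the IDs attached, a click on a slow span in the trace view can pull up exactly the log lines emitted during that span.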

Why OpenTelemetry Won

Before OpenTelemetry, every observability vendor had its own instrumentation library. If you used Datadog, you installed the Datadog agent. If you used New Relic, you installed their SDK. Switching vendors meant re-instrumenting your entire application.

OpenTelemetry solves this by separating instrumentation (collecting telemetry from your application) from export (sending it to a backend). You instrument once with OpenTelemetry, then export to whatever backend you choose — Grafana, Datadog, New Relic, Honeycomb, Jaeger, or any combination. Switch backends without changing a line of application code.

The project provides:

  • Language SDKs for most major languages (Python, JavaScript/Node.js, Java, Go, .NET, Ruby, PHP, Rust)
  • Auto-instrumentation agents that instrument popular libraries without code changes
  • The OpenTelemetry Collector — a proxy that receives, processes, and exports telemetry data
  • Semantic conventions — standardised attribute names so http.request.method means the same thing everywhere
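To see what the semantic conventions buy you, here is an illustrative span-attribute set for an HTTP request (the values are made up; the keys are the standardised ones):

```python
# Illustrative only: the same standardised keys appear in every
# language and library, so one backend query covers all services.
span_attributes = {
    "http.request.method": "POST",   # not "method", "verb", or "httpMethod"
    "http.route": "/api/orders",
    "http.response.status_code": 201,
    "server.address": "api.example.com",
}

# A filter like this works uniformly, whichever framework produced the span:
is_server_error = span_attributes["http.response.status_code"] >= 500
print(is_server_error)  # False
```

Because a Django service and a Spring service emit the same keys, a single dashboard panel or alert expression covers both.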

Getting Started: The Two Approaches

Zero-Code Auto-Instrumentation

The fastest way to start is auto-instrumentation — agents or libraries that hook into your runtime and automatically instrument HTTP requests, database queries, and other common operations without changing your application code.

Python:

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation
opentelemetry-instrument \
  --service_name my-app \
  --exporter_otlp_endpoint http://collector:4317 \
  python app.py

This automatically instruments Flask, Django, FastAPI, SQLAlchemy, requests, urllib3, psycopg2, and dozens of other Python libraries.

Node.js:

npm install @opentelemetry/auto-instrumentations-node \
  @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc

// tracing.js — require this BEFORE your application
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  serviceName: 'my-app',
  traceExporter: new OTLPTraceExporter({
    url: 'http://collector:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Then start your application with the tracing module loaded first:

node --require ./tracing.js app.js

This instruments Express, Fastify, Koa, pg, mysql2, mongodb, Redis, HTTP client calls, and more.

Java:

# Download the agent JAR
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Run with the agent attached
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-app \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar app.jar

The Java agent supports over 100 libraries — Spring, Quarkus, JDBC, Hibernate, Kafka, gRPC, and more. It uses bytecode manipulation, so no code changes are needed at all.

What auto-instrumentation costs: typically 2-5% runtime overhead and 5-15% on startup time. For most applications, this is negligible. For latency-critical hot paths, you can selectively disable specific instrumentations.

Manual SDK Instrumentation

Auto-instrumentation covers infrastructure operations (HTTP, databases, messaging) but can't instrument your business logic. If you want to trace "calculate shipping cost" or count orders placed per minute, you need the SDK.

Python example:

from opentelemetry import trace, metrics

tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")
orders_counter = meter.create_counter("orders.placed", description="Total orders placed")

def place_order(request):
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.items", len(request.items))
        span.set_attribute("order.customer_id", request.customer_id)

        try:
            order = process_order(request)
            orders_counter.add(1)
            span.set_attribute("order.id", order.id)
            return order
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise

Node.js example:

const { trace, metrics, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');
const meter = metrics.getMeter('order-service');
const ordersCounter = meter.createCounter('orders.placed');

async function placeOrder(request) {
  return tracer.startActiveSpan('place_order', async (span) => {
    span.setAttribute('order.items', request.items.length);
    span.setAttribute('order.customer_id', request.customerId);

    try {
      const order = await processOrder(request);
      ordersCounter.add(1);
      span.setAttribute('order.id', order.id);
      return order;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Auto-instrumentation and manual instrumentation work together seamlessly. The auto-instrumented spans (HTTP, database) and your custom business spans are linked into the same trace automatically.

The Collector: Your Telemetry Router

The OpenTelemetry Collector is a proxy that sits between your applications and your observability backend. Applications send telemetry to the collector, and the collector routes it to whichever backends you use.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/traces:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Why not export directly from your application? Three reasons:

  1. Decoupling. Your applications don't need to know about your observability backend. Switch from Jaeger to Grafana Tempo? Change the collector config, not your application code.
  2. Processing. The collector can sample, filter, enrich, and batch telemetry before forwarding — reducing storage costs and noise.
  3. Reliability. The collector buffers data during backend outages. Direct export risks dropping telemetry when the backend is unreachable.

Sampling: Don't Trace Everything in Production

Tracing every request in production generates enormous volumes of data. A service handling 1,000 requests per second produces 86 million traces per day. At approximately 1 KB per span with a 5-span average trace, that's ~430 GB/day.
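The arithmetic behind that estimate, as a quick sanity check:

```python
# Back-of-envelope trace volume for the numbers above.
requests_per_second = 1_000
traces_per_day = requests_per_second * 60 * 60 * 24  # one trace per request
spans_per_trace = 5
bytes_per_span = 1_000  # ~1 KB

storage_gb_per_day = traces_per_day * spans_per_trace * bytes_per_span / 1e9
print(traces_per_day)      # 86400000 (~86 million)
print(storage_gb_per_day)  # 432.0 (~430 GB/day)
```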

Head-based sampling decides at the start of a trace whether to record it:

# Environment variable — sample 10% of traces
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

Tail-based sampling (configured in the collector) decides after the trace completes, allowing you to keep all error traces and slow traces while sampling normal traffic:

processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

This keeps 100% of errors and slow requests (the ones you actually need to debug) while sampling 10% of healthy traffic. In practice, this reduces trace storage by 80-90% while preserving diagnostic value.
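A quick calculation shows where that reduction comes from. The traffic mix below is an assumption for illustration; plug in your own error and latency rates:

```python
# Fraction of traces retained under the tail-sampling policy above,
# for an assumed traffic mix.
error_rate = 0.01  # 1% of traces contain an error (assumption)
slow_rate = 0.02   # 2% are slow but otherwise healthy (assumption)
baseline = 0.10    # 10% probabilistic sample of everything else

kept = error_rate + slow_rate + (1 - error_rate - slow_rate) * baseline
print(f"{kept:.1%} of traces retained")  # 12.7% of traces retained
```

Keeping ~13% of traces while retaining every error and slow request is where the 80-90% storage reduction comes from.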

Connecting Observability to Your Deployment Pipeline

Observability becomes most valuable when it's connected to your deployment pipeline. Correlating deployments with metric changes answers the most common production question: "did the last deployment cause this?"

With DeployHQ, you can use post-deployment commands or webhooks to annotate your monitoring dashboards whenever a deployment completes:

# Post-deploy command: annotate Grafana with deployment event
curl -s -X POST http://grafana:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"text\": \"Deployed $DEPLOYHQ_REVISION to $DEPLOYHQ_SERVER\",
    \"tags\": [\"deployment\", \"$DEPLOYHQ_PROJECT\"]
  }"

When you see a metric anomaly on a dashboard, the deployment annotation immediately tells you whether it correlates with a release — saving the first 10 minutes of every incident investigation.

For zero-downtime deployments, observability is especially critical. You need to verify that the new version is healthy before the old version is taken down. Health check endpoints should go beyond a simple HTTP 200:

# Flask example — readiness probe that checks dependencies
@app.route('/health/ready')
def readiness():
    checks = {}

    try:
        db.session.execute(text('SELECT 1'))
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = f'failed: {e}'
        return jsonify(checks), 503

    try:
        redis_client.ping()
        checks['cache'] = 'ok'
    except Exception as e:
        checks['cache'] = f'failed: {e}'
        return jsonify(checks), 503

    checks['status'] = 'ready'
    return jsonify(checks), 200

A Practical Starting Stack

If you're starting from zero, here's a realistic observability stack that works for small-to-medium deployments without enterprise licensing costs:

Component         Tool                                       Purpose
Instrumentation   OpenTelemetry SDK + Auto-instrumentation   Collect traces, metrics, logs
Collector         OpenTelemetry Collector                    Route and process telemetry
Metrics           Prometheus + Grafana                       Metric storage, dashboards, alerting
Traces            Grafana Tempo                              Distributed trace storage and search
Logs              Grafana Loki                               Log aggregation with trace correlation
Alerting          Grafana Alerting                           Threshold-based alerts to Slack/PagerDuty
This entire stack is open source and can run on a single server for small deployments, or scale horizontally as your traffic grows. A Docker Compose setup can have the full stack running in minutes.
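A minimal Docker Compose sketch of that stack might look like the following. The image names are the official ones on Docker Hub, but version pins, volumes, and each tool's own config file are omitted here and left as assumptions:

```yaml
# docker-compose.yaml — minimal sketch, not a production setup
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
  prometheus:
    image: prom/prometheus
  tempo:
    image: grafana/tempo
  loki:
    image: grafana/loki
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```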

For teams that prefer managed solutions, OpenTelemetry exports to Datadog, New Relic, Honeycomb, Grafana Cloud, and AWS CloudWatch without any instrumentation changes — switch exporters in the collector config and you're done.

What to Alert On

The most common mistake with observability is alerting on symptoms instead of causes. "CPU at 90%" is a symptom. "Database connection pool exhausted" is a cause. Alert on the metrics closest to the actual problem.

Good alerts:

  • Error rate exceeds 1% of requests (5-minute rolling window)
  • P99 response time exceeds 2 seconds
  • Database connection pool active connections equals max pool size
  • Queue depth exceeding consumer throughput for >5 minutes

Bad alerts:

  • CPU above 80% (normal for many workloads)
  • Any single 500 error (noise — use error rate instead)
  • Disk usage above 70% (too early, creates alert fatigue)

Essential OpenTelemetry metrics to track:

  • http.server.request.duration — response time histogram, broken down by route and status code
  • http.server.active_requests — current in-flight requests (early saturation warning)
  • db.client.connections.usage — database pool utilization
  • process.runtime.*.memory — language-specific memory metrics (heap, RSS)
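As a concrete sketch, here is the error-rate alert from the "good alerts" list expressed as a Prometheus alerting rule. The metric and label names assume the collector's default OTLP-to-Prometheus name translation (`http.server.request.duration` becoming a `_seconds` histogram), so check the names your setup actually exports:

```yaml
# prometheus-rules.yaml — error rate > 1% over a 5-minute window
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
            / sum(rate(http_server_request_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: page
```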

Common Mistakes

Logging too much. Debug-level logging in production generates noise that obscures real signals. Use structured logging (JSON) with severity levels, and keep production at INFO or WARN. Use trace context to find detailed logs when you need them.

Ignoring trace propagation. OpenTelemetry propagates trace context automatically for supported libraries, but custom HTTP clients, manual thread pools, and async operations can break the chain. If your traces end at service boundaries, check that W3C Trace Context headers (traceparent, tracestate) are being forwarded.
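For reference, here is a stdlib-only sketch of the `traceparent` header that has to survive every hop. OpenTelemetry's propagators build and parse this for you; a hand-rolled HTTP client that drops it starts a brand-new trace downstream:

```python
# W3C Trace Context: version-traceid-spanid-flags, all lowercase hex.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # 16 bytes, hex
parent_span_id = "00f067aa0ba902b7"            # 8 bytes, hex
sampled_flag = "01"                            # this trace is sampled

traceparent = f"00-{trace_id}-{parent_span_id}-{sampled_flag}"
outgoing_headers = {"traceparent": traceparent}

print(outgoing_headers["traceparent"])
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

If your traces break at a service boundary, the first thing to check is whether this header (and `tracestate`, when present) appears on the outgoing request.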

Sampling too aggressively too early. Start with 100% sampling in development and staging. Only reduce sampling in production when storage costs justify it — and always keep 100% sampling for errors.

Treating observability as optional. "We'll add monitoring later" is technical debt that compounds. Instrument from day one, even if it's just the auto-instrumentation agent. The cost of adding it is trivial compared to the cost of debugging a production incident blind.

Not connecting deployments to metrics. Every deployment should be annotated in your monitoring dashboards. Without this correlation, the first 10 minutes of every incident are spent asking "did anything change?" — a question your CI/CD pipeline already knows the answer to.


Ready to deploy your observable application? DeployHQ handles the build and deployment pipeline while you focus on application quality. Post-deployment hooks make it easy to connect your releases to your monitoring — so you always know what changed and when.

Questions about setting up observability for your services? Reach out at support@deployhq.com or find us on Twitter/X.