
📊 Monitoring & Observability


Course outline

Monitoring is knowing that your system is broken. Observability is understanding why it is broken. This course gives you both, from the theory all the way to config files ready to deploy.

Together we will build a complete observability stack based on Prometheus, Grafana, Loki, Tempo and OpenTelemetry. By the end you will have a 360° view of your systems: metrics, logs and traces, correlated with each other.


Module 1: Monitoring and Observability Fundamentals

Monitoring vs observability: what's the difference?

Monitoring answers the question: "Is it working?" Observability answers: "Why isn't it working?"

Aspect | Monitoring | Observability
Approach | Reactive: alerts on known thresholds | Exploratory: investigating unknown problems
Questions | Predefined (CPU > 90%?) | Ad hoc (why did the P99 double?)
Data | Aggregated metrics | Correlated metrics + logs + traces
Best suited for | Known failure modes | Complex distributed systems

In practice you need both. Monitoring wakes you up at 3 a.m. Observability lets you fix the problem in 10 minutes instead of 2 hours.

The three pillars of observability

1. Metrics

Numeric values aggregated over time. Lightweight, fast to query, ideal for alerting.

# Example: HTTP requests per second
rate(http_requests_total{service="api"}[5m])

Metric types (a quick sketch in code follows the list):

  ‱ Counter: only ever goes up (total requests, total errors)
  ‱ Gauge: goes up and down (temperature, active connections)
  ‱ Histogram: a distribution of values (latency, response sizes)
  ‱ Summary: like a histogram, but quantiles are computed client-side
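
A minimal sketch of the four types with the official prometheus_client library; the metric names and HTTP port here are illustrative, not part of the course material:

# pip install prometheus-client
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

# Counter: only ever increases
http_requests = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
# Gauge: can go up and down
active_connections = Gauge("active_connections", "Currently open connections")
# Histogram: observations are bucketed, Prometheus computes quantiles server-side
request_latency = Histogram("http_request_duration_seconds", "Request latency",
                            buckets=[0.05, 0.1, 0.3, 1.0, 2.5])
# Summary: aggregated client-side (the Python client exposes count and sum)
response_size = Summary("http_response_size_bytes", "Response size")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        http_requests.labels(method="GET", status="200").inc()
        active_connections.set(random.randint(0, 50))
        request_latency.observe(random.uniform(0.01, 1.5))
        response_size.observe(random.randint(200, 5000))
        time.sleep(1)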

2. Logs

Timestamped text events. Rich in context, but bulky.

{
  "timestamp": "2026-03-20T08:15:32Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "message": "Timeout connecting to stripe API",
  "duration_ms": 30000
}

Logging best practices (a minimal sketch follows the list):

  ‱ Structured format (JSON), always
  ‱ Include the trace_id for correlation
  ‱ Consistent levels: DEBUG, INFO, WARN, ERROR
  ‱ No sensitive data (PII, tokens)
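
A small stdlib-only sketch of those practices; the trace_id here is a hypothetical value that you would normally take from your tracing context (see Module 6):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` (trace_id, duration_ms, ...) land on the record
        for field in ("trace_id", "duration_ms"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Timeout connecting to stripe API",
             extra={"trace_id": "abc123def456", "duration_ms": 30000})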

3. Distributed traces

The complete journey of a request through your microservices.

[Gateway] ──(12ms)──> [Auth Service] ──(3ms)──> [User DB]
     │
     └──(45ms)──> [Payment Service] ──(30000ms)──> [Stripe API] ← TIMEOUT

A trace is made up of spans; each span represents one operation (an HTTP call, a DB query, a processing step).

The RED method (for services)

For every request-driven service, measure these three things:

Signal | Description | Prometheus metric
Rate | Requests per second | rate(http_requests_total[5m])
Errors | Error rate | rate(http_requests_total{status=~"5.."}[5m])
Duration | Latency (histogram) | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Complete RED dashboard for one service
# Rate
sum(rate(http_requests_total{service="api"}[5m]))

# Error ratio
sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))

# Duration P99
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
)

The USE method (for resources)

For every infrastructure resource (CPU, memory, disk, network):

Signal | Description | Example
Utilization | % of capacity in use | CPU at 75%
Saturation | Queueing / backlog | Load average, disk I/O queue
Errors | Error count | ECC errors, packet drops
# USE for the CPU
# Utilization
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Saturation (load average / number of CPUs)
node_load1 / count without (cpu) (node_cpu_seconds_total{mode="idle"})

# Errors
rate(node_edac_correctable_errors_total[5m])

SLIs, SLOs, SLAs and error budgets

Definitions

  ‱ SLI (Service Level Indicator): the measured metric (e.g. % of requests under 300ms)
  ‱ SLO (Service Level Objective): the internal target (e.g. 99.9% of requests under 300ms)
  ‱ SLA (Service Level Agreement): the contractual commitment (e.g. 99.5%, always looser than the SLO)
  ‱ Error budget: the amount of failure you can tolerate before violating the SLO

A concrete example

Your payment API:

# SLI: latency
sli:
  type: latency
  metric: "http_request_duration_seconds"
  threshold: 0.3  # 300ms
  filter: '{service="payment-api", method!="OPTIONS"}'

# SLO: 99.9% over 30 days
slo:
  target: 0.999
  window: 30d

# Error budget math
# 30 days = 43,200 minutes
# Error budget = 0.1% = 43.2 minutes of allowed downtime
# Or: out of 1M requests, 1,000 may be slow
# Computing the SLI in PromQL
sum(rate(http_request_duration_seconds_bucket{
  service="payment-api",
  le="0.3"
}[30d]))
/
sum(rate(http_request_duration_seconds_count{
  service="payment-api"
}[30d]))

# Error budget remaining (1 = untouched, 0 = exhausted)
1 - (
  (1 - sli_value) / (1 - 0.999)
)

Error budget in practice

Remaining error budget | Action
> 50% | Normal deployments, experimentation OK
25-50% | Caution, stricter reviews
< 25% | Freeze non-critical deployments
0% | Incident mode: focus on reliability only
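
To make the arithmetic above concrete, here is a small sketch that turns an SLO target and a measured SLI into the remaining budget (the numbers are the ones from the example, not fetched from Prometheus):

def error_budget(slo_target: float, sli: float, window_days: int = 30) -> dict:
    """Error budget math for an SLO over a rolling window."""
    budget_ratio = 1 - slo_target                # 0.1% for a 99.9% SLO
    window_minutes = window_days * 24 * 60       # 30 days = 43,200 minutes
    allowed_bad_minutes = budget_ratio * window_minutes
    # Same formula as the PromQL above: 1 - (1 - sli) / (1 - slo)
    remaining = 1 - (1 - sli) / budget_ratio
    return {
        "allowed_bad_minutes": round(allowed_bad_minutes, 1),  # 43.2 for 99.9% / 30d
        "budget_remaining": round(remaining, 3),               # 1.0 = untouched, 0.0 = exhausted
    }

print(error_budget(slo_target=0.999, sli=0.9995))
# {'allowed_bad_minutes': 43.2, 'budget_remaining': 0.5}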

Module 2: Prometheus

Prometheus architecture

Prometheus works in pull mode: it fetches metrics from your services at regular intervals (scraping).

graph TB
    subgraph Prom["Prometheus"]
        Scraper["Scraper (pull)"] --> TSDB["TSDB (storage)"]
        TSDB --> Rules["Rule Engine (recording + alerting)"]
    end
    Targets["Targets /metrics (node_exporter, app, blackbox)"] -->|"scrape"| Scraper
    Rules -->|"push alerts"| AM["Alertmanager (Slack, PagerDuty, Email)"]
    style Prom fill:#1a2332,stroke:#dc2626,color:#f1f5f9
    style Scraper fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
    style TSDB fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
    style Rules fill:#1a2332,stroke:#a855f7,color:#f1f5f9
    style Targets fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style AM fill:#1a2332,stroke:#dc2626,color:#f1f5f9

Basic configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    env: prod

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus scrapes itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter: system metrics
  - job_name: "node"
    static_configs:
      - targets:
          - "node-exporter-1:9100"
          - "node-exporter-2:9100"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.+):9100"
        target_label: instance
        replacement: "${1}"

  # Automatic discovery via Docker
  - job_name: "docker"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: "true"
        action: keep
      - source_labels: [__meta_docker_container_label_prometheus_port]
        target_label: __address__
        regex: (.+)
        replacement: "${1}"
      - source_labels: [__meta_docker_container_name]
        target_label: container

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Blackbox Exporter: external black-box monitoring
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://devopslab.ch
          - https://api.devopslab.ch/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

PromQL: from beginner to advanced

The basics

# Instant value of a metric
up{job="node"}

# Filtering on labels
http_requests_total{method="GET", status="200"}

# Regex on labels
http_requests_total{status=~"5.."}      # 5xx
http_requests_total{path!~"/health.*"}  # exclude health endpoints

Rate functions

# rate(): per-second rate over a window (smoothed)
rate(http_requests_total[5m])

# irate(): instantaneous rate (last 2 samples)
irate(http_requests_total[5m])

# increase(): absolute increase over the window
increase(http_requests_total[1h])

Rule of thumb: use rate() for alerts and dashboards; keep irate() for high-resolution graphs only.

Aggregations

# Sum per service
sum by (service) (rate(http_requests_total[5m]))

# Average per instance
avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Top 5 pods by CPU
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

# Quantiles
quantile(0.95, rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))

Histograms and quantiles

# P50 (median) latency
histogram_quantile(0.5,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# P99 per service
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Apdex score (satisfied threshold = 0.3s, tolerating = 1.2s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
)
/ 2
/ sum(rate(http_request_duration_seconds_count[5m]))

Advanced queries

# Prediction: how many hours until the disk is full?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0

# Error rate with a minimum traffic threshold (avoids false positives)
(
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum by (service) (rate(http_requests_total[5m]))
) > 0.05
and
sum by (service) (rate(http_requests_total[5m])) > 10

# Anomaly detection: deviation from the 7-day average
(
  avg_over_time(rate(http_requests_total[5m])[1h:5m])
  -
  avg_over_time(rate(http_requests_total[5m])[7d:1h])
)
/
stddev_over_time(rate(http_requests_total[5m])[7d:1h])
> 3

# Cache hit ratio
sum(rate(cache_hits_total[5m]))
/
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))

Exporters

Node Exporter

System metrics (CPU, memory, disk, network):

# Installation
docker run -d \
  --name node-exporter \
  --net host \
  --pid host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter \
  --path.rootfs=/host

Key metrics:

# CPU usage
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory usage (%)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Remaining disk space (%)
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

# Network traffic
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8  # bits/s

Blackbox Exporter

Black-box monitoring (HTTP, TCP, DNS, ICMP):

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false

  http_post_json:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"healthcheck": true}'

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp_ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

  dns_resolution:
    prober: dns
    timeout: 5s
    dns:
      query_name: devopslab.ch
      query_type: A
      valid_rcodes:
        - NOERROR
# HTTP probe latency by phase
probe_http_duration_seconds{phase="resolve"}

# SSL certificate: days until expiry
(probe_ssl_earliest_cert_expiry - time()) / 86400

# Availability over 30 days
avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100

A custom exporter in Python

# custom_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import time
import psycopg2

# Metric definitions
db_connections = Gauge(
    'app_db_connections_active',
    'Number of active database connections',
    ['database']
)

orders_total = Counter(
    'app_orders_total',
    'Total number of orders processed',
    ['status', 'payment_method']
)

order_processing_time = Histogram(
    'app_order_processing_seconds',
    'Time to process an order',
    ['order_type'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def collect_metrics():
    """Collecte les métriques métier."""
    conn = psycopg2.connect("postgresql://localhost/myapp")
    cur = conn.cursor()

    # Active DB connections
    cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
    db_connections.labels(database="myapp").set(cur.fetchone()[0])

    # Orders by status over the last 5 minutes
    cur.execute("""
        SELECT status, payment_method, count(*)
        FROM orders
        WHERE created_at > now() - interval '5 minutes'
        GROUP BY status, payment_method
    """)
    for status, method, count in cur.fetchall():
        orders_total.labels(status=status, payment_method=method).inc(count)

    cur.close()
    conn.close()

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        collect_metrics()
        time.sleep(15)

Recording Rules

Recording rules pre-compute expensive queries:

# rules/recording.yml
groups:
  - name: http_rules
    interval: 30s
    rules:
      # Request rate per service (pre-computed)
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      # Error rate per service
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))

      # P99 latency per service
      - record: service:http_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: node_rules
    interval: 1m
    rules:
      - record: instance:node_cpu:utilization
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: instance:node_memory:utilization
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      - record: instance:node_disk:predicted_full_hours
        expr: |
          -1 * (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            deriv(node_filesystem_avail_bytes{mountpoint="/"}[6h])
          ) / 3600

  - name: slo_rules
    interval: 30s
    rules:
      # SLI: ratio of fast requests (< 300ms)
      - record: slo:api_latency:ratio5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="api", le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

      # Remaining error budget (30-day window, 99.9% SLO)
      - record: slo:api_latency:error_budget_remaining
        expr: |
          1 - (
            (1 - slo:api_latency:ratio5m)
            /
            (1 - 0.999)
          )

Module 3: Grafana

Installation

# docker-compose for a standalone Grafana
services:
  grafana:
    image: grafana/grafana:11.5.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning

volumes:
  grafana-data:

Provisioning data sources

# provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      serviceMap:
        datasourceUid: prometheus

Effective dashboards

What makes a good dashboard:

  1. One dashboard = one purpose (overview, service detail, debugging)
  2. Visual hierarchy: the important metrics at the top, and large
  3. Visual thresholds: green/yellow/red readable at a glance
  4. Links between dashboards: drill down from the overview to the detail

Provisioned dashboard JSON

# provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: default
    orgId: 1
    folder: "Production"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true

Variables and templating

Variables make your dashboards dynamic and reusable:

{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 2,
        "sort": 1
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
        "refresh": 2,
        "sort": 1,
        "multi": true,
        "includeAll": true
      },
      {
        "name": "percentile",
        "type": "custom",
        "query": "50,90,95,99",
        "current": {
          "text": "99",
          "value": "99"
        }
      }
    ]
  }
}

Using them in queries:

# Simple variable
rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m])

# Multi-value variable with a regex
rate(http_requests_total{service=~"${service:pipe}"}[$__rate_interval])

# Variable inside the quantile
histogram_quantile(0.$percentile,
  sum by (le) (rate(http_request_duration_seconds_bucket{service=~"$service"}[$__rate_interval]))
)

Dashboards as Code

Grafonnet (Jsonnet)

// dashboards/service-overview.jsonnet
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

local dashboard = grafana.dashboard;
local panel = grafana.panel;
local prometheus = grafana.query.prometheus;
local variable = grafana.dashboard.variable;

local ds = variable.datasource.new('datasource', 'prometheus');
local svc = variable.query.new('service')
  + variable.query.withDatasourceFromVariable(ds)
  + variable.query.queryTypes.withLabelValues('service', 'http_requests_total')
  + variable.query.withRefresh(2)
  + variable.query.selectionOptions.withMulti(true)
  + variable.query.selectionOptions.withIncludeAll(true);

local requestRatePanel =
  panel.timeSeries.new('Request Rate')
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('${datasource}',
      'sum by (service) (rate(http_requests_total{service=~"$service"}[$__rate_interval]))'
    )
    + prometheus.withLegendFormat('{{ service }}'),
  ])
  + panel.timeSeries.standardOptions.withUnit('reqps')
  + panel.timeSeries.gridPos.withW(12)
  + panel.timeSeries.gridPos.withH(8);

local errorRatePanel =
  panel.timeSeries.new('Error Rate')
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('${datasource}',
      |||
        sum by (service) (rate(http_requests_total{service=~"$service", status=~"5.."}[$__rate_interval]))
        /
        sum by (service) (rate(http_requests_total{service=~"$service"}[$__rate_interval]))
      |||
    )
    + prometheus.withLegendFormat('{{ service }}'),
  ])
  + panel.timeSeries.standardOptions.withUnit('percentunit')
  + panel.timeSeries.fieldConfig.defaults.custom.withThresholdsStyleMode('area')
  + panel.timeSeries.standardOptions.withOverrides([])
  + panel.timeSeries.gridPos.withW(12)
  + panel.timeSeries.gridPos.withH(8)
  + panel.timeSeries.gridPos.withX(12);

local latencyPanel =
  panel.timeSeries.new('Latency P99')
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('${datasource}',
      |||
        histogram_quantile(0.99,
          sum by (service, le) (rate(http_request_duration_seconds_bucket{service=~"$service"}[$__rate_interval]))
        )
      |||
    )
    + prometheus.withLegendFormat('{{ service }}'),
  ])
  + panel.timeSeries.standardOptions.withUnit('s')
  + panel.timeSeries.gridPos.withW(24)
  + panel.timeSeries.gridPos.withH(8);

dashboard.new('Service Overview')
+ dashboard.withUid('service-overview')
+ dashboard.withTags(['generated', 'service'])
+ dashboard.withTimezone('browser')
+ dashboard.withRefresh('30s')
+ dashboard.withVariables([ds, svc])
+ dashboard.withPanels([
  requestRatePanel,
  errorRatePanel,
  latencyPanel,
])

Terraform Provider Grafana

# main.tf
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.devopslab.ch"
  auth = var.grafana_api_key
}

resource "grafana_folder" "production" {
  title = "Production"
}

resource "grafana_dashboard" "service_overview" {
  folder      = grafana_folder.production.id
  config_json = file("${path.module}/dashboards/service-overview.json")
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
  })
}

# Alerting via Terraform
resource "grafana_contact_point" "slack" {
  name = "Slack Alerts"

  slack {
    url     = var.slack_webhook_url
    channel = "#alerts-prod"
    title   = "{{ .CommonLabels.alertname }}"
    text    = "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  }
}

resource "grafana_notification_policy" "default" {
  contact_point = grafana_contact_point.slack.name
  group_by      = ["alertname", "service"]

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point = grafana_contact_point.slack.name
    group_wait    = "30s"
    group_interval = "5m"
  }
}
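
If Jsonnet or Terraform feels heavy for your setup, the same dashboards-as-code idea also works with a few lines of Python against Grafana's HTTP API; a sketch, assuming a service-account token in the GRAFANA_TOKEN environment variable and a dashboard JSON exported from the UI:

import json
import os
import requests

GRAFANA_URL = "https://grafana.devopslab.ch"
headers = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

with open("dashboards/service-overview.json") as f:
    dashboard = json.load(f)

# POST /api/dashboards/db creates or updates the dashboard; overwrite=True makes it idempotent
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers=headers,
    json={"dashboard": dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["url"])

Run it in CI after every merge and your dashboards stay versioned in Git, just like the Terraform variant above.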

Module 4: Loki (Log Aggregation)

Loki architecture

Loki is a log aggregation system inspired by Prometheus. It does not index the content of the logs, only the labels. That is what keeps it lightweight and cheap to run.

graph TB
    PT["Promtail (agent)"] -->|"push"| Dist
    GA["Grafana Alloy (agent)"] -->|"push"| Dist
    DL["Docker Log Driver"] -->|"push"| Dist
    subgraph Loki["Loki"]
        Dist["Distributor"] --> Ing["Ingester"]
        Ing --> Store["Storage (S3/local)"]
    end
    Store --> Graf["Grafana (visualisation)"]
    style Loki fill:#1a2332,stroke:#dc2626,color:#f1f5f9
    style PT fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style GA fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style DL fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style Dist fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
    style Ing fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
    style Store fill:#1a2332,stroke:#a855f7,color:#f1f5f9
    style Graf fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
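
To make the labels-versus-content split concrete, here is what a raw push to Loki looks like; a sketch that talks to the push API directly (in practice Promtail or Alloy does this for you):

import json
import time
import requests

# Only the keys under "stream" become indexed labels; the log line itself is just stored.
payload = {
    "streams": [
        {
            "stream": {"service": "payment-api", "level": "error"},
            "values": [
                # [timestamp in nanoseconds (as a string), log line]
                [str(time.time_ns()), json.dumps({
                    "message": "Timeout connecting to stripe API",
                    "trace_id": "abc123def456",
                    "duration_ms": 30000,
                })],
            ],
        }
    ]
}

resp = requests.post("http://localhost:3100/loki/api/v1/push", json=payload, timeout=5)
resp.raise_for_status()  # Loki answers 204 No Content on success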

Loki configuration

# loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/storage

limits_config:
  retention_period: 30d
  max_query_length: 721h
  max_query_parallelism: 32
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  max_entries_limit_per_query: 10000

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true

Promtail, the log collection agent

# promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: default

scrape_configs:
  # Docker container logs
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service

  # System log files
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
      - targets:
          - localhost
        labels:
          job: authlog
          __path__: /var/log/auth.log

    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            msg: message
            trace_id: trace_id
            duration: duration_ms
      - labels:
          level:
          trace_id:
      - metrics:
          log_lines_total:
            type: Counter
            description: "Total log lines"
            source: level
            config:
              action: inc
              match_all: true

  # Kubernetes pod logs
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            trace_id: trace_id
      - labels:
          level:
      - structured_metadata:
          trace_id:

Grafana Alloy (Promtail's successor)

// alloy-config.alloy
// Grafana Alloy: unified collector (logs, metrics, traces)

// Discover Docker containers
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Collect Docker logs
loki.source.docker "docker_logs" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.process.pipeline.receiver]
}

// Processing pipeline
loki.process "pipeline" {
  // Parse JSON
  stage.json {
    expressions = {
      level    = "level",
      trace_id = "trace_id",
      msg      = "message",
    }
  }

  // Add the level as a label
  stage.labels {
    values = {
      level = "",
    }
  }

  // Structured metadata (not indexed, but filterable)
  stage.structured_metadata {
    values = {
      trace_id = "",
    }
  }

  // Drop health-check logs
  stage.drop {
    expression = ".*GET /health.*"
    drop_counter_reason = "healthcheck"
  }

  forward_to = [loki.write.default.receiver]
}

// Ship to Loki
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Bonus: collect metrics as well
prometheus.scrape "docker_metrics" {
  targets    = discovery.docker.containers.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

LogQL, the query language

Basic queries

# All the logs of one service
{service="payment-api"}

# Filter by level
{service="payment-api"} |= "error"
{service="payment-api"} | json | level="error"

# Exclude health checks
{service="api"} != "/health"

# Regex
{service="api"} |~ "timeout|connection refused"

# Parse JSON and filter
{service="api"}
  | json
  | level="error"
  | duration_ms > 5000
  | line_format "{{.timestamp}} [{{.level}}] {{.message}} ({{.duration_ms}}ms)"

Metric queries (aggregations)

# Number of error log lines per minute
count_over_time({service="api"} |= "error" [1m])

# Log rate per service
sum by (service) (rate({job="docker"} [5m]))

# Log bytes per service (to spot overly chatty services)
sum by (service) (bytes_over_time({job="docker"} [1h]))

# P99 of the duration extracted from the logs
{service="api"} | json | unwrap duration_ms | __error__=""
quantile_over_time(0.99, {service="api"} | json | unwrap duration_ms [5m])

# Top 5 most frequent error messages
topk(5,
  sum by (message) (count_over_time(
    {service="api"} | json | level="error" [1h]
  ))
)

# Pattern detection: logs containing "OOM"
count_over_time({job=~".+"} |= "OOM" [5m]) > 0

Correlating logs and metrics

The magic happens when you link logs and metrics together. In Grafana, configure derived fields on the Loki datasource:

# In the Loki datasource
jsonData:
  derivedFields:
    - datasourceUid: tempo     # Link to Tempo
      matcherRegex: '"trace_id":"(\w+)"'
      name: TraceID
      url: "$${__value.raw}"
    - datasourceUid: prometheus
      matcherRegex: '"service":"(\w+)"'
      name: ServiceDashboard
      url: "/d/service-detail/service-detail?var-service=$${__value.raw}"

A typical debugging flow:

  1. Prometheus alert → error rate > 5%
  2. Grafana dashboard → the payment-api service is affected
  3. Loki logs → {service="payment-api"} | json | level="error"
  4. Trace → click the trace_id → full view in Tempo
  5. Root cause → the span to Stripe shows a timeout

Module 5: Alerting

A complete Alertmanager configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: "alerts@devopslab.ch"
  smtp_smarthost: "smtp.example.com:587"
  smtp_auth_username: "alerts@devopslab.ch"
  smtp_auth_password: "${SMTP_PASSWORD}"
  smtp_require_tls: true
  slack_api_url: "${SLACK_WEBHOOK_URL}"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Custom templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Inhibition: suppress redundant alerts
inhibit_rules:
  # If the cluster is down, no need to alert on every single service
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - severity =~ "warning|info"
    equal: [cluster]

  # If a node is down, no need to alert on its services
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - alertname != NodeDown
    equal: [instance]

  # Critical inhibits warning for the same alert
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [alertname, service]

# Routing
route:
  receiver: slack-default
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty + Slack
    - receiver: pagerduty-critical
      matchers:
        - severity = critical
      group_wait: 10s
      repeat_interval: 1h
      continue: true  # Also continue to the following routes

    - receiver: slack-critical
      matchers:
        - severity = critical
      group_wait: 10s

    # Warning alerts → Slack only
    - receiver: slack-warning
      matchers:
        - severity = warning
      group_wait: 1m
      repeat_interval: 12h

    # Business alerts → dedicated channel
    - receiver: slack-business
      matchers:
        - team = business
      group_by: [alertname]
      repeat_interval: 24h

    # Watchdog (alerting heartbeat)
    - receiver: "null"
      matchers:
        - alertname = Watchdog

# Receivers
receivers:
  - name: "null"

  - name: slack-default
    slack_configs:
      - channel: "#alerts-default"
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.severity | toUpper }}* - {{ .Labels.service }}
          {{ .Annotations.description }}
          {{ if .Labels.instance }}Instance: `{{ .Labels.instance }}`{{ end }}
          {{ end }}
        actions:
          - type: button
            text: "📊 Dashboard"
            url: "{{ (index .Alerts 0).Annotations.dashboard_url }}"
          - type: button
            text: "📖 Runbook"
            url: "{{ (index .Alerts 0).Annotations.runbook_url }}"

  - name: slack-critical
    slack_configs:
      - channel: "#alerts-critical"
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '🚹 CRITICAL: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Service:* {{ .Labels.service }}
          *Description:* {{ .Annotations.description }}
          *Impact:* {{ .Annotations.impact }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}

  - name: slack-warning
    slack_configs:
      - channel: "#alerts-warning"
        send_resolved: true
        title: '⚠️ {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ end }}

  - name: slack-business
    slack_configs:
      - channel: "#alerts-business"
        send_resolved: true
        title: '📊 {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ end }}

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
        severity: critical
        description: "{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.description }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"
          service: "{{ .CommonLabels.service }}"
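
To check the routing tree and the receivers without waiting for a real incident, you can post a synthetic alert straight to Alertmanager's v2 API; a sketch, adjust the labels to exercise the route you want to test:

from datetime import datetime, timedelta, timezone
import requests

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "TestAlert",
        "severity": "warning",       # switch to "critical" to exercise the PagerDuty route
        "service": "payment-api",
    },
    "annotations": {
        "description": "Synthetic alert sent to validate Alertmanager routing.",
        "runbook_url": "https://wiki.devopslab.ch/runbooks/test",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),  # auto-resolves after 5 minutes
}]

resp = requests.post("http://localhost:9093/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Alert accepted:", resp.status_code)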

Prometheus alerting rules

# rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: service:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: >-
            Service {{ $labels.service }} has an error rate of
            {{ $value | humanizePercentage }} (threshold: 5%).
          impact: "Users are receiving 5xx errors."
          dashboard_url: "https://grafana.devopslab.ch/d/service-detail?var-service={{ $labels.service }}"
          runbook_url: "https://wiki.devopslab.ch/runbooks/high-error-rate"

      # High latency
      - alert: HighLatencyP99
        expr: service:http_latency:p99_5m > 1
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: >-
            The P99 of service {{ $labels.service }} is at {{ $value | humanizeDuration }}
            (threshold: 1s).

      # SLO at risk
      - alert: SLOBudgetBurning
        expr: slo:api_latency:error_budget_remaining < 0.25
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget critical for the API"
          description: >-
            Only {{ $value | humanizePercentage }} of the error budget remains.
            A deployment freeze is recommended.

  - name: infrastructure_alerts
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} unreachable"
          description: "The node exporter on {{ $labels.instance }} has not responded for 2 minutes."

      # CPU saturated
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU at {{ $value | humanizePercentage }} for 15 minutes."

      # Disk filling up
      - alert: DiskFillingUp
        expr: instance:node_disk:predicted_full_hours > 0 and instance:node_disk:predicted_full_hours < 24
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} filling up"
          description: >-
            At the current rate, the disk will be full in about
            {{ $value | humanize }} hours.

      # SSL certificate expiring soon
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate on {{ $labels.instance }} expires in {{ $value | humanize }} days"

      # Critical memory usage
      - alert: HighMemory
        expr: instance:node_memory:utilization > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage on {{ $labels.instance }}"
          description: "Memory usage at {{ $value | humanizePercentage }}."

  - name: watchdog
    rules:
      # Always-firing alert; if it disappears, alerting itself is broken
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Watchdog: alerting is working"

Best practices against alert fatigue

Alert fatigue kills reliability. Here is how to avoid it:

1. Every alert must be actionable

If you receive an alert and your reaction is "OK, I'll look at it tomorrow", it is not an alert, it is noise. Delete it or turn it into a dashboard panel.

2. Two severity levels are enough

Level | Action | Destination
critical | Wakes someone up | PagerDuty + Slack
warning | Handled during business hours | Slack only

No info, no notice. If it is purely informational, it belongs on a dashboard.

3. Always set thresholds with for

# ❌ Bad: fires on a 15-second spike
- alert: HighCPU
  expr: node_cpu_utilization > 0.8

# ✅ Good: a persistent problem
- alert: HighCPU
  expr: node_cpu_utilization > 0.85
  for: 15m

4. Require a minimum amount of traffic

# ❌ Alerts on a single failure
http_errors_ratio > 0.5

# ✅ Only when traffic is significant
http_errors_ratio > 0.05 and http_requests_rate > 10

5. Alert on symptoms, not causes

# ❌ Alerts on the cause (too low-level)
- alert: PodRestarting
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0

# ✅ Alerts on the symptom (user impact)
- alert: HighErrorRate
  expr: service:http_errors:ratio5m > 0.05
  for: 5m

6. Runbooks are mandatory

Every alert must link to a runbook that explains:

  ‱ What the alert means
  ‱ How to diagnose it
  ‱ How to fix it
  ‱ Who to escalate to if you get stuck

Module 6: Distributed Tracing

OpenTelemetry, the standard

OpenTelemetry (OTel) is the open standard for telemetry. It provides APIs, SDKs and a collector for metrics, logs and traces.

graph TB
    SA["Service A (OTel SDK)"] -->|"OTLP"| Coll
    SB["Service B (OTel SDK)"] -->|"OTLP"| Coll
    SC["Service C (OTel SDK)"] -->|"OTLP"| Coll
    Coll["OTel Collector
Receivers → Process → Exporters"] --> Tempo["Tempo (traces)"] Coll --> Prometheus["Prometheus (mĂ©triques)"] style SA fill:#1a2332,stroke:#3b82f6,color:#f1f5f9 style SB fill:#1a2332,stroke:#3b82f6,color:#f1f5f9 style SC fill:#1a2332,stroke:#3b82f6,color:#f1f5f9 style Coll fill:#1a2332,stroke:#dc2626,color:#f1f5f9 style Tempo fill:#1a2332,stroke:#a855f7,color:#f1f5f9 style Prometheus fill:#1a2332,stroke:#f59e0b,color:#f1f5f9

OTel Collector configuration

# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    policies:
      # Keep every trace that contains an error
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep 10% of normal traces
      - name: sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      # Keep slow traces (> 2s)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
    resource_to_telemetry_conversion:
      enabled: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]

Automatic vs manual instrumentation

Auto-instrumentation Python

# Installation
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run with auto-instrumentation
opentelemetry-instrument \
  --service_name payment-api \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  --exporter_otlp_protocol grpc \
  python app.py
# Or in code: programmatic configuration
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Configure the tracer provider
resource = Resource.create({
    "service.name": "payment-api",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument the frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

Manual instrumentation in Python

from flask import Flask, request
from opentelemetry import trace
import logging
import stripe

app = Flask(__name__)

tracer = trace.get_tracer("payment.service")
logger = logging.getLogger(__name__)

@app.route("/api/payment", methods=["POST"])
def process_payment():
    with tracer.start_as_current_span("process_payment") as span:
        # Add attributes to the span
        span.set_attribute("payment.method", request.json["method"])
        span.set_attribute("payment.amount", request.json["amount"])
        span.set_attribute("payment.currency", request.json["currency"])

        # Inject the trace_id into the logs
        ctx = trace.get_current_span().get_span_context()
        logger.info(
            "Processing payment",
            extra={
                "trace_id": format(ctx.trace_id, '032x'),
                "span_id": format(ctx.span_id, '016x'),
                "amount": request.json["amount"],
            }
        )

        # Child span for the DB call
        with tracer.start_as_current_span("validate_card") as child:
            child.set_attribute("db.system", "postgresql")
            child.set_attribute("db.statement", "SELECT * FROM cards WHERE...")
            card = validate_card(request.json["card_token"])

        # Child span for the external call
        with tracer.start_as_current_span("charge_stripe") as child:
            child.set_attribute("http.method", "POST")
            child.set_attribute("http.url", "https://api.stripe.com/v1/charges")
            try:
                result = stripe.Charge.create(
                    amount=request.json["amount"],
                    currency=request.json["currency"],
                    source=card.token,
                )
                child.set_attribute("payment.stripe_id", result.id)
            except stripe.error.CardError as e:
                child.set_status(trace.StatusCode.ERROR, str(e))
                child.record_exception(e)
                span.set_attribute("payment.status", "failed")
                raise

        span.set_attribute("payment.status", "success")
        return {"status": "ok", "charge_id": result.id}

Instrumentation in Go

package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
	"go.opentelemetry.io/otel/trace"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

var tracer trace.Tracer

func initTracer() func() {
	ctx := context.Background()

	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}

	res, _ := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("order-service"),
			semconv.ServiceVersion("2.1.0"),
			attribute.String("environment", "production"),
		),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.1), // 10% sampling
		)),
	)

	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	tracer = tp.Tracer("order-service")

	return func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		tp.Shutdown(ctx)
	}
}

func processOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "processOrder",
		trace.WithAttributes(
			attribute.String("order.id", orderID),
		),
	)
	defer span.End()

	// Child span: validation
	if err := validateOrder(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	// Child span: payment
	if err := chargePayment(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.SetAttributes(attribute.String("order.status", "completed"))
	return nil
}

func validateOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "validateOrder")
	defer span.End()

	span.SetAttributes(
		attribute.String("db.system", "postgresql"),
		attribute.String("db.operation", "SELECT"),
	)
	// ... validation logic
	return nil
}

func chargePayment(ctx context.Context, orderID string) error {
	// Hypothetical payment step so that the example compiles end to end
	_, span := tracer.Start(ctx, "chargePayment")
	defer span.End()
	// ... payment logic
	return nil
}

func main() {
	shutdown := initTracer()
	defer shutdown()

	// HTTP handler with automatic instrumentation
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		orderID := r.URL.Query().Get("id")
		if err := processOrder(r.Context(), orderID); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write([]byte(`{"status":"ok"}`))
	})

	// otelhttp automatically instruments incoming HTTP requests
	wrappedHandler := otelhttp.NewHandler(handler, "order-api")

	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}

Tempo configuration

# tempo-config.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 336h  # 14 days

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: production
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
      enable_client_server_prefix: true
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - service.version

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks

overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Correlating traces, metrics and logs

Full correlation, the holy grail of observability:

# In your application, make sure the trace_id is propagated everywhere

# 1. Structured logs with a trace_id
# → Loki can link to Tempo via derived fields

# 2. Metrics with exemplars
# → Prometheus stores trace_ids in histogram exemplars

# 3. Tempo generates metrics from spans
# → service_graphs and span_metrics end up in Prometheus

# Grafana configuration: Tempo datasource with correlation enabled
# (see provisioning/datasources above)
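
On the application side, instead of passing the trace_id by hand on every log call (as in the Flask example earlier in this module), you can attach it automatically with a logging filter; a sketch built on the OpenTelemetry API already used above:

import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Copy the current OTel trace/span ids onto every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
# Pair this with a JSON formatter so Loki's derived fields can pick up "trace_id"
logging.getLogger().addHandler(handler)

The opentelemetry-instrumentation-logging package offers a similar automatic injection if you would rather not maintain this filter yourself.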

The full debugging flow:

1. ALERT: "HighLatencyP99 on payment-api" (Prometheus → Alertmanager → Slack)
      │
2. DASHBOARD: open the RED dashboard, spot the latency spike
      │  → click a point on the graph, look at the exemplars
      │
3. TRACE: click the exemplar → full trace in Tempo
      │  → payment-api (12ms) → stripe-client (30045ms) ← THE CULPRIT
      │
4. LOGS: from the span, click "View Logs" → Loki
      │  → "Connection timeout to api.stripe.com after 30s"
      │
5. ROOT CAUSE: Stripe is having an incident. Nothing to fix on our side.
      → Add a circuit breaker with a fallback.

Bonus: the full LGTM stack in Docker Compose

Here is a complete Docker Compose file to deploy the LGTM-style stack: Loki + Grafana + Tempo, with Prometheus standing in for Mimir:

# docker-compose.yml
# LGTM stack - full observability
# Usage: docker compose up -d

services:
  # ============================================
  # Grafana - Visualization
  # ============================================
  grafana:
    image: grafana/grafana:11.5.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_FEATURE_TOGGLES_ENABLE: "traceQLStreaming,metricsSummary"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./config/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
      - loki
      - tempo
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Prometheus - Metrics
  # ============================================
  prometheus:
    image: prom/prometheus:v2.53.0
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-remote-write-receiver"
      - "--enable-feature=exemplar-storage"
      - "--enable-feature=native-histograms"
    volumes:
      - ./config/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./config/prometheus/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Alertmanager - Alerts
  # ============================================
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    volumes:
      - ./config/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Loki - Logs
  # ============================================
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - ./config/loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki-data:/loki
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Tempo - Traces
  # ============================================
  tempo:
    image: grafana/tempo:2.6.0
    ports:
      - "3200:3200"   # HTTP API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    command: -config.file=/etc/tempo/tempo-config.yml
    volumes:
      - ./config/tempo/tempo-config.yml:/etc/tempo/tempo-config.yml:ro
      - tempo-data:/var/tempo
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # OTel Collector - Unified collection
  # ============================================
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.115.0
    ports:
      - "4327:4317"   # OTLP gRPC (mapped pour pas conflit avec Tempo)
      - "4328:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics
    command: --config=/etc/otel/config.yml
    volumes:
      - ./config/otel/otel-collector-config.yml:/etc/otel/config.yml:ro
    depends_on:
      - tempo
      - prometheus
      - loki
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Grafana Alloy - Collection agent
  # ============================================
  alloy:
    image: grafana/alloy:v1.5.0
    ports:
      - "12345:12345"  # UI
    command:
      - run
      - /etc/alloy/config.alloy
      - --server.http.listen-addr=0.0.0.0:12345
    volumes:
      - ./config/alloy/config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/log:/var/log:ro
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Node Exporter - System metrics
  # ============================================
  node-exporter:
    image: quay.io/prometheus/node-exporter:v1.8.0
    ports:
      - "9100:9100"
    command:
      - "--path.rootfs=/host"
    pid: host
    volumes:
      - /:/host:ro,rslave
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Blackbox Exporter - External probes
  # ============================================
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    ports:
      - "9115:9115"
    volumes:
      - ./config/blackbox/blackbox.yml:/config/blackbox.yml:ro
    command:
      - "--config.file=/config/blackbox.yml"
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Demo app - to exercise the stack
  # ============================================
  demo-app:
    build:
      context: ./demo-app
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: demo-app
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=demo"
    labels:
      prometheus.scrape: "true"
      prometheus.port: "8080"
    depends_on:
      - otel-collector
    networks:
      - observability
    restart: unless-stopped

volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  tempo-data:
  alertmanager-data:

networks:
  observability:
    driver: bridge

Project layout

observability-stack/
├── docker-compose.yml
├── .env                          # GRAFANA_PASSWORD, SLACK_WEBHOOK_URL, etc.
├── config/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── recording.yml
│   │       └── alerts.yml
│   ├── alertmanager/
│   │   └── alertmanager.yml
│   ├── loki/
│   │   └── loki-config.yml
│   ├── tempo/
│   │   └── tempo-config.yml
│   ├── otel/
│   │   └── otel-collector-config.yml
│   ├── alloy/
│   │   └── config.alloy
│   ├── blackbox/
│   │   └── blackbox.yml
│   └── grafana/
│       └── provisioning/
│           ├── datasources/
│           │   └── datasources.yml
│           └── dashboards/
│               ├── dashboards.yml
│               └── json/
│                   └── service-overview.json
└── demo-app/
    ├── Dockerfile
    └── main.py

Quick-start script

#!/bin/bash
# start.sh - Start the observability stack

set -euo pipefail

echo "🚀 Starting the LGTM stack..."

# Check that Docker is installed
if ! command -v docker &> /dev/null; then
    echo "❌ Docker is not installed"
    exit 1
fi

# Create .env if it does not exist
if [ ! -f .env ]; then
    echo "📝 Creating the .env file..."
    cat > .env << 'EOF'
GRAFANA_PASSWORD=changeme-please
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx
PAGERDUTY_ROUTING_KEY=xxx
SMTP_PASSWORD=xxx
EOF
    echo "⚠️  Edit .env with your real credentials before continuing"
    exit 1
fi

# Create the config directories
mkdir -p config/{prometheus/rules,alertmanager,loki,tempo,otel,alloy,blackbox,grafana/provisioning/{datasources,dashboards/json}}

# Start the stack
docker compose up -d

echo ""
echo "✅ Stack started!"
echo ""
echo "📊 Grafana:       http://localhost:3000  (admin / see .env)"
echo "🔥 Prometheus:    http://localhost:9090"
echo "📋 Alertmanager:  http://localhost:9093"
echo "📝 Loki:          http://localhost:3100"
echo "🔍 Tempo:         http://localhost:3200"
echo "📡 OTel Collector: gRPC=localhost:4327, HTTP=localhost:4328"
echo "đŸ–„ïž  Node Exporter: http://localhost:9100"
echo "⚫ Blackbox:      http://localhost:9115"
echo "🔧 Alloy UI:      http://localhost:12345"
echo ""
echo "💡 Send your OTLP traces to localhost:4327 (gRPC) or localhost:4328 (HTTP)"

Recap and next steps

You now have the foundations to build a complete observability stack:

Component | Role | Port
Prometheus | Metrics (scraping + storage + alerting rules) | 9090
Grafana | Visualization (dashboards + exploration) | 3000
Loki | Logs (aggregation + queries) | 3100
Tempo | Traces (storage + correlation) | 3200
Alertmanager | Alert routing | 9093
OTel Collector | Unified collection (metrics + logs + traces) | 4317/4318
Alloy | Local agent (replaces Promtail and other separate agents) | 12345

Production readiness checklist

  ‱ Metrics: RED for every service, USE for every node
  ‱ SLOs: defined for critical services, error budgets monitored
  ‱ Alerts: actionable, with runbooks, two levels (critical/warning)
  ‱ Logs: structured (JSON), with a trace_id, centralized in Loki
  ‱ Traces: OpenTelemetry configured, sampling tuned to the volume
  ‱ Correlation: metrics → logs → traces links working in Grafana
  ‱ Dashboards: as code (Grafonnet or Terraform), versioned in Git
  ‱ Retention: defined per signal type (metrics 30d, logs 30d, traces 14d)
  ‱ High availability: Prometheus in HA pairs, Loki/Tempo in distributed mode for production

Going further

  ‱ Mimir: long-term storage for Prometheus metrics (multi-tenant, scalable)
  ‱ Pyroscope: continuous profiling, the fourth pillar of observability
  ‱ Kubernetes monitoring: kube-state-metrics, cAdvisor, kube-prometheus-stack (Helm)
  ‱ eBPF: zero-code instrumentation at the kernel level (Grafana Beyla)
  ‱ On-call: Grafana OnCall for managing on-call rotations

Observability is not a project you "finish". It is a practice you keep improving. Start simple, iterate, and add layers when you need them, not before.