
📊 Monitoring & Observability


Course outline

Monitoring is knowing that your system is broken. Observability is understanding why it is broken. This course gives you both, from the theory all the way to config files ready to deploy.

Together we will build a complete observability stack based on Prometheus, Grafana, Loki, Tempo and OpenTelemetry. By the end you will have a 360° view of your systems: metrics, logs and traces, correlated with each other.


Module 1: Monitoring and Observability Fundamentals

Monitoring vs observability: what's the difference?

Monitoring answers the question: "Is it working?" Observability answers: "Why isn't it working?"

Aspect | Monitoring | Observability
Approach | Reactive: alerts on known thresholds | Exploratory: investigating unknown problems
Questions | Predefined (CPU > 90%?) | Ad hoc (why did the P99 double?)
Data | Aggregated metrics | Correlated metrics + logs + traces
Best suited for | Known failure modes | Complex distributed systems

In practice you need both. Monitoring wakes you up at 3 a.m. Observability lets you fix the problem in 10 minutes instead of 2 hours.

The three pillars of observability

1. Metrics

Numeric values aggregated over time. Lightweight, fast to query, ideal for alerting.

# Example: HTTP requests per second
rate(http_requests_total{service="api"}[5m])

Metric types (a quick sketch in code follows the list):

  ‱ Counter: only ever goes up (total requests, total errors)
  ‱ Gauge: goes up and down (temperature, active connections)
  ‱ Histogram: a distribution of values (latency, response sizes)
  ‱ Summary: like a histogram, but quantiles are computed client-side
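
A minimal sketch of the four types with the official prometheus_client library; the metric names and HTTP port here are illustrative, not part of the course material:

# pip install prometheus-client
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

# Counter: only ever increases
http_requests = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
# Gauge: can go up and down
active_connections = Gauge("active_connections", "Currently open connections")
# Histogram: observations are bucketed, Prometheus computes quantiles server-side
request_latency = Histogram("http_request_duration_seconds", "Request latency",
                            buckets=[0.05, 0.1, 0.3, 1.0, 2.5])
# Summary: aggregated client-side (the Python client exposes count and sum)
response_size = Summary("http_response_size_bytes", "Response size")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        http_requests.labels(method="GET", status="200").inc()
        active_connections.set(random.randint(0, 50))
        request_latency.observe(random.uniform(0.01, 1.5))
        response_size.observe(random.randint(200, 5000))
        time.sleep(1)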

2. Logs

Timestamped text events. Rich in context, but bulky.

{
  "timestamp": "2026-03-20T08:15:32Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "message": "Timeout connecting to stripe API",
  "duration_ms": 30000
}

Logging best practices (a minimal sketch follows the list):

  ‱ Structured format (JSON), always
  ‱ Include the trace_id for correlation
  ‱ Consistent levels: DEBUG, INFO, WARN, ERROR
  ‱ No sensitive data (PII, tokens)
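
A small stdlib-only sketch of those practices; the trace_id here is a hypothetical value that you would normally take from your tracing context (see Module 6):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` (trace_id, duration_ms, ...) land on the record
        for field in ("trace_id", "duration_ms"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Timeout connecting to stripe API",
             extra={"trace_id": "abc123def456", "duration_ms": 30000})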

3. Distributed traces

The complete journey of a request through your microservices.

[Gateway] ──(12ms)──> [Auth Service] ──(3ms)──> [User DB]
     │
     └──(45ms)──> [Payment Service] ──(30000ms)──> [Stripe API] ← TIMEOUT

A trace is made up of spans; each span represents one operation (an HTTP call, a DB query, a processing step).

The RED method (for services)

For every request-driven service, measure these three things:

Signal | Description | Prometheus metric
Rate | Requests per second | rate(http_requests_total[5m])
Errors | Error rate | rate(http_requests_total{status=~"5.."}[5m])
Duration | Latency (histogram) | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Complete RED dashboard for one service
# Rate
sum(rate(http_requests_total{service="api"}[5m]))

# Error ratio
sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))

# Duration P99
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
)

The USE method (for resources)

For every infrastructure resource (CPU, memory, disk, network):

Signal | Description | Example
Utilization | % of capacity in use | CPU at 75%
Saturation | Queueing / backlog | Load average, disk I/O queue
Errors | Error count | ECC errors, packet drops
# USE for the CPU
# Utilization
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Saturation (load average / number of CPUs)
node_load1 / count without (cpu) (node_cpu_seconds_total{mode="idle"})

# Errors
rate(node_edac_correctable_errors_total[5m])

SLIs, SLOs, SLAs and error budgets

Definitions

  ‱ SLI (Service Level Indicator): the measured metric (e.g. % of requests under 300ms)
  ‱ SLO (Service Level Objective): the internal target (e.g. 99.9% of requests under 300ms)
  ‱ SLA (Service Level Agreement): the contractual commitment (e.g. 99.5%, always looser than the SLO)
  ‱ Error budget: the amount of failure you can tolerate before violating the SLO

A concrete example

Your payment API:

# SLI: latency
sli:
  type: latency
  metric: "http_request_duration_seconds"
  threshold: 0.3  # 300ms
  filter: '{service="payment-api", method!="OPTIONS"}'

# SLO: 99.9% over 30 days
slo:
  target: 0.999
  window: 30d

# Error budget math
# 30 days = 43,200 minutes
# Error budget = 0.1% = 43.2 minutes of allowed downtime
# Or: out of 1M requests, 1,000 may be slow
# Computing the SLI in PromQL
sum(rate(http_request_duration_seconds_bucket{
  service="payment-api",
  le="0.3"
}[30d]))
/
sum(rate(http_request_duration_seconds_count{
  service="payment-api"
}[30d]))

# Error budget remaining (1 = untouched, 0 = exhausted)
1 - (
  (1 - sli_value) / (1 - 0.999)
)

Error budget in practice

Remaining error budget | Action
> 50% | Normal deployments, experimentation OK
25-50% | Caution, stricter reviews
< 25% | Freeze non-critical deployments
0% | Incident mode: focus on reliability only
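
To make the arithmetic above concrete, here is a small sketch that turns an SLO target and a measured SLI into the remaining budget (the numbers are the ones from the example, not fetched from Prometheus):

def error_budget(slo_target: float, sli: float, window_days: int = 30) -> dict:
    """Error budget math for an SLO over a rolling window."""
    budget_ratio = 1 - slo_target                # 0.1% for a 99.9% SLO
    window_minutes = window_days * 24 * 60       # 30 days = 43,200 minutes
    allowed_bad_minutes = budget_ratio * window_minutes
    # Same formula as the PromQL above: 1 - (1 - sli) / (1 - slo)
    remaining = 1 - (1 - sli) / budget_ratio
    return {
        "allowed_bad_minutes": round(allowed_bad_minutes, 1),  # 43.2 for 99.9% / 30d
        "budget_remaining": round(remaining, 3),               # 1.0 = untouched, 0.0 = exhausted
    }

print(error_budget(slo_target=0.999, sli=0.9995))
# {'allowed_bad_minutes': 43.2, 'budget_remaining': 0.5}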

Module 2: Prometheus

Prometheus architecture

Prometheus works in pull mode: it fetches metrics from your services at regular intervals (scraping).

graph TB
    subgraph Prom["Prometheus"]
        Scraper["Scraper (pull)"] --> TSDB["TSDB (storage)"]
        TSDB --> Rules["Rule Engine (recording + alerting)"]
    end
    Targets["Targets /metrics (node_exporter, app, blackbox)"] -->|"scrape"| Scraper
    Rules -->|"push alerts"| AM["Alertmanager (Slack, PagerDuty, Email)"]
    style Prom fill:#1a2332,stroke:#dc2626,color:#f1f5f9
    style Scraper fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
    style TSDB fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
    style Rules fill:#1a2332,stroke:#a855f7,color:#f1f5f9
    style Targets fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style AM fill:#1a2332,stroke:#dc2626,color:#f1f5f9

Basic configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    env: prod

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus scrapes itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter: system metrics
  - job_name: "node"
    static_configs:
      - targets:
          - "node-exporter-1:9100"
          - "node-exporter-2:9100"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.+):9100"
        target_label: instance
        replacement: "${1}"

  # Automatic discovery via Docker
  - job_name: "docker"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: "true"
        action: keep
      - source_labels: [__meta_docker_container_label_prometheus_port]
        target_label: __address__
        regex: (.+)
        replacement: "${1}"
      - source_labels: [__meta_docker_container_name]
        target_label: container

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Blackbox Exporter: external black-box monitoring
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://devopslab.ch
          - https://api.devopslab.ch/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

PromQL: from beginner to advanced

The basics

# Instant value of a metric
up{job="node"}

# Filtering on labels
http_requests_total{method="GET", status="200"}

# Regex on labels
http_requests_total{status=~"5.."}      # 5xx
http_requests_total{path!~"/health.*"}  # exclude health endpoints

Rate functions

# rate(): per-second rate over a window (smoothed)
rate(http_requests_total[5m])

# irate(): instantaneous rate (last 2 samples)
irate(http_requests_total[5m])

# increase(): absolute increase over the window
increase(http_requests_total[1h])

Rule of thumb: use rate() for alerts and dashboards; keep irate() for high-resolution graphs only.

Aggregations

# Sum per service
sum by (service) (rate(http_requests_total[5m]))

# Average per instance
avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Top 5 pods by CPU
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

# Quantiles
quantile(0.95, rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))

Histograms and quantiles

# P50 (median) latency
histogram_quantile(0.5,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# P99 per service
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Apdex score (satisfied threshold = 0.3s, tolerating = 1.2s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
)
/ 2
/ sum(rate(http_request_duration_seconds_count[5m]))

Advanced queries

# Prediction: how many hours until the disk is full?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0

# Error rate with a minimum traffic threshold (avoids false positives)
(
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum by (service) (rate(http_requests_total[5m]))
) > 0.05
and
sum by (service) (rate(http_requests_total[5m])) > 10

# Anomaly detection: deviation from the 7-day average
(
  avg_over_time(rate(http_requests_total[5m])[1h:5m])
  -
  avg_over_time(rate(http_requests_total[5m])[7d:1h])
)
/
stddev_over_time(rate(http_requests_total[5m])[7d:1h])
> 3

# Cache hit ratio
sum(rate(cache_hits_total[5m]))
/
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))

Exporters

Node Exporter

System metrics (CPU, memory, disk, network):

# Installation
docker run -d \
  --name node-exporter \
  --net host \
  --pid host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter \
  --path.rootfs=/host

Key metrics:

# CPU usage
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory usage (%)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Remaining disk space (%)
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

# Network traffic
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8  # bits/s

Blackbox Exporter

Black-box monitoring (HTTP, TCP, DNS, ICMP):

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false

  http_post_json:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"healthcheck": true}'

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp_ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

  dns_resolution:
    prober: dns
    timeout: 5s
    dns:
      query_name: devopslab.ch
      query_type: A
      valid_rcodes:
        - NOERROR
# HTTP probe latency by phase
probe_http_duration_seconds{phase="resolve"}

# SSL certificate: days until expiry
(probe_ssl_earliest_cert_expiry - time()) / 86400

# Availability over 30 days
avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100

A custom exporter in Python

# custom_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import time
import psycopg2

# Metric definitions
db_connections = Gauge(
    'app_db_connections_active',
    'Number of active database connections',
    ['database']
)

orders_total = Counter(
    'app_orders_total',
    'Total number of orders processed',
    ['status', 'payment_method']
)

order_processing_time = Histogram(
    'app_order_processing_seconds',
    'Time to process an order',
    ['order_type'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def collect_metrics():
    """Collecte les métriques métier."""
    conn = psycopg2.connect("postgresql://localhost/myapp")
    cur = conn.cursor()

    # Active DB connections
    cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
    db_connections.labels(database="myapp").set(cur.fetchone()[0])

    # Orders by status over the last 5 minutes
    cur.execute("""
        SELECT status, payment_method, count(*)
        FROM orders
        WHERE created_at > now() - interval '5 minutes'
        GROUP BY status, payment_method
    """)
    for status, method, count in cur.fetchall():
        orders_total.labels(status=status, payment_method=method).inc(count)

    cur.close()
    conn.close()

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        collect_metrics()
        time.sleep(15)

Recording Rules

Recording rules pre-compute expensive queries:

# rules/recording.yml
groups:
  - name: http_rules
    interval: 30s
    rules:
      # Request rate per service (pre-computed)
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      # Error rate per service
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))

      # P99 latency per service
      - record: service:http_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: node_rules
    interval: 1m
    rules:
      - record: instance:node_cpu:utilization
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: instance:node_memory:utilization
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      - record: instance:node_disk:predicted_full_hours
        expr: |
          -1 * (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            deriv(node_filesystem_avail_bytes{mountpoint="/"}[6h])
          ) / 3600

  - name: slo_rules
    interval: 30s
    rules:
      # SLI: ratio of fast requests (< 300ms)
      - record: slo:api_latency:ratio5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="api", le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

      # Remaining error budget (30-day window, 99.9% SLO)
      - record: slo:api_latency:error_budget_remaining
        expr: |
          1 - (
            (1 - slo:api_latency:ratio5m)
            /
            (1 - 0.999)
          )

Module 3: Grafana

Installation

# docker-compose for a standalone Grafana
services:
  grafana:
    image: grafana/grafana:11.5.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning

volumes:
  grafana-data:

Provisioning data sources

# provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      serviceMap:
        datasourceUid: prometheus

Effective dashboards

What makes a good dashboard:

  1. One dashboard = one purpose (overview, service detail, debugging)
  2. Visual hierarchy: the important metrics at the top, and large
  3. Visual thresholds: green/yellow/red readable at a glance
  4. Links between dashboards: drill down from the overview to the detail

Provisioned dashboard JSON

# provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: default
    orgId: 1
    folder: "Production"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true

Variables and templating

Variables make your dashboards dynamic and reusable:

{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 2,
        "sort": 1
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
        "refresh": 2,
        "sort": 1,
        "multi": true,
        "includeAll": true
      },
      {
        "name": "percentile",
        "type": "custom",
        "query": "50,90,95,99",
        "current": {
          "text": "99",
          "value": "99"
        }
      }
    ]
  }
}

Using them in queries:

# Simple variable
rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m])

# Multi-value variable with a regex
rate(http_requests_total{service=~"${service:pipe}"}[$__rate_interval])

# Variable inside the quantile
histogram_quantile(0.$percentile,
  sum by (le) (rate(http_request_duration_seconds_bucket{service=~"$service"}[$__rate_interval]))
)

Dashboards as Code

Grafonnet (Jsonnet)

// dashboards/service-overview.jsonnet
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

local dashboard = grafana.dashboard;
local panel = grafana.panel;
local prometheus = grafana.query.prometheus;
local variable = grafana.dashboard.variable;

local ds = variable.datasource.new('datasource', 'prometheus');
local svc = variable.query.new('service')
  + variable.query.withDatasourceFromVariable(ds)
  + variable.query.queryTypes.withLabelValues('service', 'http_requests_total')
  + variable.query.withRefresh(2)
  + variable.query.selectionOptions.withMulti(true)
  + variable.query.selectionOptions.withIncludeAll(true);

local requestRatePanel =
  panel.timeSeries.new('Request Rate')
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('${datasource}',
      'sum by (service) (rate(http_requests_total{service=~"$service"}[$__rate_interval]))'
    )
    + prometheus.withLegendFormat('{{ service }}'),
  ])
  + panel.timeSeries.standardOptions.withUnit('reqps')
  + panel.timeSeries.gridPos.withW(12)
  + panel.timeSeries.gridPos.withH(8);

local errorRatePanel =
  panel.timeSeries.new('Error Rate')
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('${datasource}',
      |||
        sum by (service) (rate(http_requests_total{service=~"$service", status=~"5.."}[$__rate_interval]))
        /
        sum by (service) (rate(http_requests_total{service=~"$service"}[$__rate_interval]))
      |||
    )
    + prometheus.withLegendFormat('{{ service }}'),
  ])
  + panel.timeSeries.standardOptions.withUnit('percentunit')
  + panel.timeSeries.fieldConfig.defaults.custom.withThresholdsStyleMode('area')
  + panel.timeSeries.standardOptions.withOverrides([])
  + panel.timeSeries.gridPos.withW(12)
  + panel.timeSeries.gridPos.withH(8)
  + panel.timeSeries.gridPos.withX(12);

local latencyPanel =
  panel.timeSeries.new('Latency P99')
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('${datasource}',
      |||
        histogram_quantile(0.99,
          sum by (service, le) (rate(http_request_duration_seconds_bucket{service=~"$service"}[$__rate_interval]))
        )
      |||
    )
    + prometheus.withLegendFormat('{{ service }}'),
  ])
  + panel.timeSeries.standardOptions.withUnit('s')
  + panel.timeSeries.gridPos.withW(24)
  + panel.timeSeries.gridPos.withH(8);

dashboard.new('Service Overview')
+ dashboard.withUid('service-overview')
+ dashboard.withTags(['generated', 'service'])
+ dashboard.withTimezone('browser')
+ dashboard.withRefresh('30s')
+ dashboard.withVariables([ds, svc])
+ dashboard.withPanels([
  requestRatePanel,
  errorRatePanel,
  latencyPanel,
])

Terraform Provider Grafana

# main.tf
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.devopslab.ch"
  auth = var.grafana_api_key
}

resource "grafana_folder" "production" {
  title = "Production"
}

resource "grafana_dashboard" "service_overview" {
  folder      = grafana_folder.production.id
  config_json = file("${path.module}/dashboards/service-overview.json")
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
  })
}

# Alerting via Terraform
resource "grafana_contact_point" "slack" {
  name = "Slack Alerts"

  slack {
    url     = var.slack_webhook_url
    channel = "#alerts-prod"
    title   = "{{ .CommonLabels.alertname }}"
    text    = "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  }
}

resource "grafana_notification_policy" "default" {
  contact_point = grafana_contact_point.slack.name
  group_by      = ["alertname", "service"]

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point = grafana_contact_point.slack.name
    group_wait    = "30s"
    group_interval = "5m"
  }
}
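
If Jsonnet or Terraform feels heavy for your setup, the same dashboards-as-code idea also works with a few lines of Python against Grafana's HTTP API; a sketch, assuming a service-account token in the GRAFANA_TOKEN environment variable and a dashboard JSON exported from the UI:

import json
import os
import requests

GRAFANA_URL = "https://grafana.devopslab.ch"
headers = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

with open("dashboards/service-overview.json") as f:
    dashboard = json.load(f)

# POST /api/dashboards/db creates or updates the dashboard; overwrite=True makes it idempotent
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers=headers,
    json={"dashboard": dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["url"])

Run it in CI after every merge and your dashboards stay versioned in Git, just like the Terraform variant above.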

Module 4: Loki (Log Aggregation)

Loki architecture

Loki is a log aggregation system inspired by Prometheus. It does not index the content of the logs, only the labels. That is what keeps it lightweight and cheap to run.

graph TB
    PT["Promtail (agent)"] -->|"push"| Dist
    GA["Grafana Alloy (agent)"] -->|"push"| Dist
    DL["Docker Log Driver"] -->|"push"| Dist
    subgraph Loki["Loki"]
        Dist["Distributor"] --> Ing["Ingester"]
        Ing --> Store["Storage (S3/local)"]
    end
    Store --> Graf["Grafana (visualisation)"]
    style Loki fill:#1a2332,stroke:#dc2626,color:#f1f5f9
    style PT fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style GA fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style DL fill:#1a2332,stroke:#22c55e,color:#f1f5f9
    style Dist fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
    style Ing fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
    style Store fill:#1a2332,stroke:#a855f7,color:#f1f5f9
    style Graf fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
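
To make the labels-versus-content split concrete, here is what a raw push to Loki looks like; a sketch that talks to the push API directly (in practice Promtail or Alloy does this for you):

import json
import time
import requests

# Only the keys under "stream" become indexed labels; the log line itself is just stored.
payload = {
    "streams": [
        {
            "stream": {"service": "payment-api", "level": "error"},
            "values": [
                # [timestamp in nanoseconds (as a string), log line]
                [str(time.time_ns()), json.dumps({
                    "message": "Timeout connecting to stripe API",
                    "trace_id": "abc123def456",
                    "duration_ms": 30000,
                })],
            ],
        }
    ]
}

resp = requests.post("http://localhost:3100/loki/api/v1/push", json=payload, timeout=5)
resp.raise_for_status()  # Loki answers 204 No Content on success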

Loki configuration

# loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/storage

limits_config:
  retention_period: 30d
  max_query_length: 721h
  max_query_parallelism: 32
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  max_entries_limit_per_query: 10000

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true

Promtail, the log collection agent

# promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: default

scrape_configs:
  # Docker container logs
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service

  # System log files
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
      - targets:
          - localhost
        labels:
          job: authlog
          __path__: /var/log/auth.log

    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            msg: message
            trace_id: trace_id
            duration: duration_ms
      - labels:
          level:
          trace_id:
      - metrics:
          log_lines_total:
            type: Counter
            description: "Total log lines"
            source: level
            config:
              action: inc
              match_all: true

  # Kubernetes pod logs
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            trace_id: trace_id
      - labels:
          level:
      - structured_metadata:
          trace_id:

Grafana Alloy (Promtail's successor)

// alloy-config.alloy
// Grafana Alloy: unified collector (logs, metrics, traces)

// Discover Docker containers
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Collect Docker logs
loki.source.docker "docker_logs" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.process.pipeline.receiver]
}

// Processing pipeline
loki.process "pipeline" {
  // Parse JSON
  stage.json {
    expressions = {
      level    = "level",
      trace_id = "trace_id",
      msg      = "message",
    }
  }

  // Add the level as a label
  stage.labels {
    values = {
      level = "",
    }
  }

  // Structured metadata (not indexed, but filterable)
  stage.structured_metadata {
    values = {
      trace_id = "",
    }
  }

  // Drop health-check logs
  stage.drop {
    expression = ".*GET /health.*"
    drop_counter_reason = "healthcheck"
  }

  forward_to = [loki.write.default.receiver]
}

// Ship to Loki
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Bonus: collect metrics as well
prometheus.scrape "docker_metrics" {
  targets    = discovery.docker.containers.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

LogQL, the query language

Basic queries

# All the logs of one service
{service="payment-api"}

# Filter by level
{service="payment-api"} |= "error"
{service="payment-api"} | json | level="error"

# Exclude health checks
{service="api"} != "/health"

# Regex
{service="api"} |~ "timeout|connection refused"

# Parse JSON and filter
{service="api"}
  | json
  | level="error"
  | duration_ms > 5000
  | line_format "{{.timestamp}} [{{.level}}] {{.message}} ({{.duration_ms}}ms)"

Metric queries (aggregations)

# Number of error log lines per minute
count_over_time({service="api"} |= "error" [1m])

# Log rate per service
sum by (service) (rate({job="docker"} [5m]))

# Log bytes per service (to spot overly chatty services)
sum by (service) (bytes_over_time({job="docker"} [1h]))

# P99 of the duration extracted from the logs
{service="api"} | json | unwrap duration_ms | __error__=""
quantile_over_time(0.99, {service="api"} | json | unwrap duration_ms [5m])

# Top 5 most frequent error messages
topk(5,
  sum by (message) (count_over_time(
    {service="api"} | json | level="error" [1h]
  ))
)

# Pattern detection: logs containing "OOM"
count_over_time({job=~".+"} |= "OOM" [5m]) > 0

Correlating logs and metrics

The magic happens when you link logs and metrics together. In Grafana, configure derived fields on the Loki datasource:

# In the Loki datasource
jsonData:
  derivedFields:
    - datasourceUid: tempo     # Link to Tempo
      matcherRegex: '"trace_id":"(\w+)"'
      name: TraceID
      url: "$${__value.raw}"
    - datasourceUid: prometheus
      matcherRegex: '"service":"(\w+)"'
      name: ServiceDashboard
      url: "/d/service-detail/service-detail?var-service=$${__value.raw}"

A typical debugging flow:

  1. Prometheus alert → error rate > 5%
  2. Grafana dashboard → the payment-api service is affected
  3. Loki logs → {service="payment-api"} | json | level="error"
  4. Trace → click the trace_id → full view in Tempo
  5. Root cause → the span to Stripe shows a timeout

Module 5: Alerting

A complete Alertmanager configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: "alerts@devopslab.ch"
  smtp_smarthost: "smtp.example.com:587"
  smtp_auth_username: "alerts@devopslab.ch"
  smtp_auth_password: "${SMTP_PASSWORD}"
  smtp_require_tls: true
  slack_api_url: "${SLACK_WEBHOOK_URL}"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Custom templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Inhibition: suppress redundant alerts
inhibit_rules:
  # If the cluster is down, no need to alert on every single service
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - severity =~ "warning|info"
    equal: [cluster]

  # If a node is down, no need to alert on its services
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - alertname != NodeDown
    equal: [instance]

  # Critical inhibits warning for the same alert
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [alertname, service]

# Routing
route:
  receiver: slack-default
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty + Slack
    - receiver: pagerduty-critical
      matchers:
        - severity = critical
      group_wait: 10s
      repeat_interval: 1h
      continue: true  # Also continue to the following routes

    - receiver: slack-critical
      matchers:
        - severity = critical
      group_wait: 10s

    # Warning alerts → Slack only
    - receiver: slack-warning
      matchers:
        - severity = warning
      group_wait: 1m
      repeat_interval: 12h

    # Business alerts → dedicated channel
    - receiver: slack-business
      matchers:
        - team = business
      group_by: [alertname]
      repeat_interval: 24h

    # Watchdog (alerting heartbeat)
    - receiver: "null"
      matchers:
        - alertname = Watchdog

# Receivers
receivers:
  - name: "null"

  - name: slack-default
    slack_configs:
      - channel: "#alerts-default"
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.severity | toUpper }}* - {{ .Labels.service }}
          {{ .Annotations.description }}
          {{ if .Labels.instance }}Instance: `{{ .Labels.instance }}`{{ end }}
          {{ end }}
        actions:
          - type: button
            text: "📊 Dashboard"
            url: "{{ (index .Alerts 0).Annotations.dashboard_url }}"
          - type: button
            text: "📖 Runbook"
            url: "{{ (index .Alerts 0).Annotations.runbook_url }}"

  - name: slack-critical
    slack_configs:
      - channel: "#alerts-critical"
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '🚹 CRITICAL: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Service:* {{ .Labels.service }}
          *Description:* {{ .Annotations.description }}
          *Impact:* {{ .Annotations.impact }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}

  - name: slack-warning
    slack_configs:
      - channel: "#alerts-warning"
        send_resolved: true
        title: '⚠️ {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ end }}

  - name: slack-business
    slack_configs:
      - channel: "#alerts-business"
        send_resolved: true
        title: '📊 {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ end }}

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
        severity: critical
        description: "{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.description }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"
          service: "{{ .CommonLabels.service }}"
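
To check the routing tree and the receivers without waiting for a real incident, you can post a synthetic alert straight to Alertmanager's v2 API; a sketch, adjust the labels to exercise the route you want to test:

from datetime import datetime, timedelta, timezone
import requests

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "TestAlert",
        "severity": "warning",       # switch to "critical" to exercise the PagerDuty route
        "service": "payment-api",
    },
    "annotations": {
        "description": "Synthetic alert sent to validate Alertmanager routing.",
        "runbook_url": "https://wiki.devopslab.ch/runbooks/test",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),  # auto-resolves after 5 minutes
}]

resp = requests.post("http://localhost:9093/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Alert accepted:", resp.status_code)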

Prometheus alerting rules

# rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: service:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: >-
            Service {{ $labels.service }} has an error rate of
            {{ $value | humanizePercentage }} (threshold: 5%).
          impact: "Users are receiving 5xx errors."
          dashboard_url: "https://grafana.devopslab.ch/d/service-detail?var-service={{ $labels.service }}"
          runbook_url: "https://wiki.devopslab.ch/runbooks/high-error-rate"

      # High latency
      - alert: HighLatencyP99
        expr: service:http_latency:p99_5m > 1
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: >-
            The P99 of service {{ $labels.service }} is at {{ $value | humanizeDuration }}
            (threshold: 1s).

      # SLO at risk
      - alert: SLOBudgetBurning
        expr: slo:api_latency:error_budget_remaining < 0.25
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget critical for the API"
          description: >-
            Only {{ $value | humanizePercentage }} of the error budget remains.
            A deployment freeze is recommended.

  - name: infrastructure_alerts
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} unreachable"
          description: "The node exporter on {{ $labels.instance }} has not responded for 2 minutes."

      # CPU saturated
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU at {{ $value | humanizePercentage }} for 15 minutes."

      # Disk filling up
      - alert: DiskFillingUp
        expr: instance:node_disk:predicted_full_hours > 0 and instance:node_disk:predicted_full_hours < 24
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} filling up"
          description: >-
            At the current rate, the disk will be full in about
            {{ $value | humanize }} hours.

      # SSL certificate expiring soon
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate on {{ $labels.instance }} expires in {{ $value | humanize }} days"

      # Critical memory usage
      - alert: HighMemory
        expr: instance:node_memory:utilization > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage on {{ $labels.instance }}"
          description: "Memory usage at {{ $value | humanizePercentage }}."

  - name: watchdog
    rules:
      # Always-firing alert; if it disappears, alerting itself is broken
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Watchdog: alerting is working"

Best practices against alert fatigue

Alert fatigue kills reliability. Here is how to avoid it:

1. Every alert must be actionable

If you receive an alert and your reaction is "OK, I'll look at it tomorrow", it is not an alert, it is noise. Delete it or turn it into a dashboard panel.

2. Two severity levels are enough

Level | Action | Destination
critical | Wakes someone up | PagerDuty + Slack
warning | Handled during business hours | Slack only

No info, no notice. If it is purely informational, it belongs on a dashboard.

3. Always set thresholds with for

# ❌ Bad: fires on a 15-second spike
- alert: HighCPU
  expr: node_cpu_utilization > 0.8

# ✅ Good: a persistent problem
- alert: HighCPU
  expr: node_cpu_utilization > 0.85
  for: 15m

4. Require a minimum amount of traffic

# ❌ Alerts on a single failure
http_errors_ratio > 0.5

# ✅ Only when traffic is significant
http_errors_ratio > 0.05 and http_requests_rate > 10

5. Alert on symptoms, not causes

# ❌ Alerts on the cause (too low-level)
- alert: PodRestarting
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0

# ✅ Alerts on the symptom (user impact)
- alert: HighErrorRate
  expr: service:http_errors:ratio5m > 0.05
  for: 5m

6. Runbooks are mandatory

Every alert must link to a runbook that explains:

  ‱ What the alert means
  ‱ How to diagnose it
  ‱ How to fix it
  ‱ Who to escalate to if you get stuck

Module 6: Distributed Tracing

OpenTelemetry, the standard

OpenTelemetry (OTel) is the open standard for telemetry. It provides APIs, SDKs and a collector for metrics, logs and traces.

graph TB
    SA["Service A (OTel SDK)"] -->|"OTLP"| Coll
    SB["Service B (OTel SDK)"] -->|"OTLP"| Coll
    SC["Service C (OTel SDK)"] -->|"OTLP"| Coll
    Coll["OTel Collector
Receivers → Process → Exporters"] --> Tempo["Tempo (traces)"] Coll --> Prometheus["Prometheus (mĂ©triques)"] style SA fill:#1a2332,stroke:#3b82f6,color:#f1f5f9 style SB fill:#1a2332,stroke:#3b82f6,color:#f1f5f9 style SC fill:#1a2332,stroke:#3b82f6,color:#f1f5f9 style Coll fill:#1a2332,stroke:#dc2626,color:#f1f5f9 style Tempo fill:#1a2332,stroke:#a855f7,color:#f1f5f9 style Prometheus fill:#1a2332,stroke:#f59e0b,color:#f1f5f9

OTel Collector configuration

# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    policies:
      # Keep every trace that contains an error
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep 10% of normal traces
      - name: sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      # Keep slow traces (> 2s)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
    resource_to_telemetry_conversion:
      enabled: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]

Automatic vs manual instrumentation

Auto-instrumentation Python

# Installation
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run with auto-instrumentation
opentelemetry-instrument \
  --service_name payment-api \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  --exporter_otlp_protocol grpc \
  python app.py
# Or in code: programmatic configuration
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Configure the tracer provider
resource = Resource.create({
    "service.name": "payment-api",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument the frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

Manual instrumentation in Python

from flask import Flask, request
from opentelemetry import trace
import logging
import stripe

app = Flask(__name__)

tracer = trace.get_tracer("payment.service")
logger = logging.getLogger(__name__)

@app.route("/api/payment", methods=["POST"])
def process_payment():
    with tracer.start_as_current_span("process_payment") as span:
        # Add attributes to the span
        span.set_attribute("payment.method", request.json["method"])
        span.set_attribute("payment.amount", request.json["amount"])
        span.set_attribute("payment.currency", request.json["currency"])

        # Inject the trace_id into the logs
        ctx = trace.get_current_span().get_span_context()
        logger.info(
            "Processing payment",
            extra={
                "trace_id": format(ctx.trace_id, '032x'),
                "span_id": format(ctx.span_id, '016x'),
                "amount": request.json["amount"],
            }
        )

        # Child span for the DB call
        with tracer.start_as_current_span("validate_card") as child:
            child.set_attribute("db.system", "postgresql")
            child.set_attribute("db.statement", "SELECT * FROM cards WHERE...")
            card = validate_card(request.json["card_token"])

        # Child span for the external call
        with tracer.start_as_current_span("charge_stripe") as child:
            child.set_attribute("http.method", "POST")
            child.set_attribute("http.url", "https://api.stripe.com/v1/charges")
            try:
                result = stripe.Charge.create(
                    amount=request.json["amount"],
                    currency=request.json["currency"],
                    source=card.token,
                )
                child.set_attribute("payment.stripe_id", result.id)
            except stripe.error.CardError as e:
                child.set_status(trace.StatusCode.ERROR, str(e))
                child.record_exception(e)
                span.set_attribute("payment.status", "failed")
                raise

        span.set_attribute("payment.status", "success")
        return {"status": "ok", "charge_id": result.id}

Instrumentation in Go

package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
	"go.opentelemetry.io/otel/trace"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

var tracer trace.Tracer

func initTracer() func() {
	ctx := context.Background()

	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}

	res, _ := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("order-service"),
			semconv.ServiceVersion("2.1.0"),
			attribute.String("environment", "production"),
		),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.1), // 10% sampling
		)),
	)

	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	tracer = tp.Tracer("order-service")

	return func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		tp.Shutdown(ctx)
	}
}

func processOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "processOrder",
		trace.WithAttributes(
			attribute.String("order.id", orderID),
		),
	)
	defer span.End()

	// Child span: validation
	if err := validateOrder(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	// Child span: payment
	if err := chargePayment(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.SetAttributes(attribute.String("order.status", "completed"))
	return nil
}

func validateOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "validateOrder")
	defer span.End()

	span.SetAttributes(
		attribute.String("db.system", "postgresql"),
		attribute.String("db.operation", "SELECT"),
	)
	// ... validation logic
	return nil
}

func chargePayment(ctx context.Context, orderID string) error {
	// Hypothetical payment step so that the example compiles end to end
	_, span := tracer.Start(ctx, "chargePayment")
	defer span.End()
	// ... payment logic
	return nil
}

func main() {
	shutdown := initTracer()
	defer shutdown()

	// HTTP handler with automatic instrumentation
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		orderID := r.URL.Query().Get("id")
		if err := processOrder(r.Context(), orderID); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write([]byte(`{"status":"ok"}`))
	})

	// otelhttp automatically instruments incoming HTTP requests
	wrappedHandler := otelhttp.NewHandler(handler, "order-api")

	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}

Tempo configuration

# tempo-config.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 336h  # 14 days

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: production
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
      enable_client_server_prefix: true
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - service.version

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks

overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Correlating traces, metrics and logs

Full correlation, the holy grail of observability:

# In your application, make sure the trace_id is propagated everywhere

# 1. Structured logs with a trace_id
# → Loki can link to Tempo via derived fields

# 2. Metrics with exemplars
# → Prometheus stores trace_ids in histogram exemplars

# 3. Tempo generates metrics from spans
# → service_graphs and span_metrics end up in Prometheus

# Grafana configuration: Tempo datasource with correlation enabled
# (see provisioning/datasources above)
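
On the application side, instead of passing the trace_id by hand on every log call (as in the Flask example earlier in this module), you can attach it automatically with a logging filter; a sketch built on the OpenTelemetry API already used above:

import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Copy the current OTel trace/span ids onto every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
# Pair this with a JSON formatter so Loki's derived fields can pick up "trace_id"
logging.getLogger().addHandler(handler)

The opentelemetry-instrumentation-logging package offers a similar automatic injection if you would rather not maintain this filter yourself.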

The full debugging flow:

1. ALERT: "HighLatencyP99 on payment-api" (Prometheus → Alertmanager → Slack)
      │
2. DASHBOARD: open the RED dashboard, spot the latency spike
      │  → click a point on the graph, look at the exemplars
      │
3. TRACE: click the exemplar → full trace in Tempo
      │  → payment-api (12ms) → stripe-client (30045ms) ← THE CULPRIT
      │
4. LOGS: from the span, click "View Logs" → Loki
      │  → "Connection timeout to api.stripe.com after 30s"
      │
5. ROOT CAUSE: Stripe is having an incident. Nothing to fix on our side.
      → Add a circuit breaker with a fallback.

Bonus: the full LGTM stack in Docker Compose

Here is a complete Docker Compose file to deploy the LGTM-style stack: Loki + Grafana + Tempo, with Prometheus standing in for Mimir:

# docker-compose.yml
# LGTM stack - full observability
# Usage: docker compose up -d

services:
  # ============================================
  # Grafana - Visualization
  # ============================================
  grafana:
    image: grafana/grafana:11.5.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_FEATURE_TOGGLES_ENABLE: "traceQLStreaming,metricsSummary"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./config/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
      - loki
      - tempo
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Prometheus - Metrics
  # ============================================
  prometheus:
    image: prom/prometheus:v2.53.0
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-remote-write-receiver"
      - "--enable-feature=exemplar-storage"
      - "--enable-feature=native-histograms"
    volumes:
      - ./config/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./config/prometheus/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Alertmanager - Alerts
  # ============================================
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    volumes:
      - ./config/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Loki - Logs
  # ============================================
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - ./config/loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki-data:/loki
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Tempo - Traces
  # ============================================
  tempo:
    image: grafana/tempo:2.6.0
    ports:
      - "3200:3200"   # HTTP API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    command: -config.file=/etc/tempo/tempo-config.yml
    volumes:
      - ./config/tempo/tempo-config.yml:/etc/tempo/tempo-config.yml:ro
      - tempo-data:/var/tempo
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # OTel Collector - Unified collection
  # ============================================
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.115.0
    ports:
      - "4327:4317"   # OTLP gRPC (mapped pour pas conflit avec Tempo)
      - "4328:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics
    command: --config=/etc/otel/config.yml
    volumes:
      - ./config/otel/otel-collector-config.yml:/etc/otel/config.yml:ro
    depends_on:
      - tempo
      - prometheus
      - loki
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Grafana Alloy - Collection agent
  # ============================================
  alloy:
    image: grafana/alloy:v1.5.0
    ports:
      - "12345:12345"  # UI
    command:
      - run
      - /etc/alloy/config.alloy
      - --server.http.listen-addr=0.0.0.0:12345
    volumes:
      - ./config/alloy/config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/log:/var/log:ro
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Node Exporter - System metrics
  # ============================================
  node-exporter:
    image: quay.io/prometheus/node-exporter:v1.8.0
    ports:
      - "9100:9100"
    command:
      - "--path.rootfs=/host"
    pid: host
    volumes:
      - /:/host:ro,rslave
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Blackbox Exporter - External probes
  # ============================================
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    ports:
      - "9115:9115"
    volumes:
      - ./config/blackbox/blackbox.yml:/config/blackbox.yml:ro
    command:
      - "--config.file=/config/blackbox.yml"
    networks:
      - observability
    restart: unless-stopped

  # ============================================
  # Demo app - to exercise the stack
  # ============================================
  demo-app:
    build:
      context: ./demo-app
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: demo-app
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=demo"
    labels:
      prometheus.scrape: "true"
      prometheus.port: "8080"
    depends_on:
      - otel-collector
    networks:
      - observability
    restart: unless-stopped

volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  tempo-data:
  alertmanager-data:

networks:
  observability:
    driver: bridge

Project layout

observability-stack/
├── docker-compose.yml
├── .env                          # GRAFANA_PASSWORD, SLACK_WEBHOOK_URL, etc.
├── config/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── recording.yml
│   │       └── alerts.yml
│   ├── alertmanager/
│   │   └── alertmanager.yml
│   ├── loki/
│   │   └── loki-config.yml
│   ├── tempo/
│   │   └── tempo-config.yml
│   ├── otel/
│   │   └── otel-collector-config.yml
│   ├── alloy/
│   │   └── config.alloy
│   ├── blackbox/
│   │   └── blackbox.yml
│   └── grafana/
│       └── provisioning/
│           ├── datasources/
│           │   └── datasources.yml
│           └── dashboards/
│               ├── dashboards.yml
│               └── json/
│                   └── service-overview.json
└── demo-app/
    ├── Dockerfile
    └── main.py

Quick-start script

#!/bin/bash
# start.sh - Start the observability stack

set -euo pipefail

echo "🚀 Starting the LGTM stack..."

# Check that Docker is installed
if ! command -v docker &> /dev/null; then
    echo "❌ Docker is not installed"
    exit 1
fi

# Create .env if it does not exist
if [ ! -f .env ]; then
    echo "📝 Creating the .env file..."
    cat > .env << 'EOF'
GRAFANA_PASSWORD=changeme-please
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx
PAGERDUTY_ROUTING_KEY=xxx
SMTP_PASSWORD=xxx
EOF
    echo "⚠️  Edit .env with your real credentials before continuing"
    exit 1
fi

# Create the config directories
mkdir -p config/{prometheus/rules,alertmanager,loki,tempo,otel,alloy,blackbox,grafana/provisioning/{datasources,dashboards/json}}

# Start the stack
docker compose up -d

echo ""
echo "✅ Stack started!"
echo ""
echo "📊 Grafana:       http://localhost:3000  (admin / see .env)"
echo "🔥 Prometheus:    http://localhost:9090"
echo "📋 Alertmanager:  http://localhost:9093"
echo "📝 Loki:          http://localhost:3100"
echo "🔍 Tempo:         http://localhost:3200"
echo "📡 OTel Collector: gRPC=localhost:4327, HTTP=localhost:4328"
echo "đŸ–„ïž  Node Exporter: http://localhost:9100"
echo "⚫ Blackbox:      http://localhost:9115"
echo "🔧 Alloy UI:      http://localhost:12345"
echo ""
echo "💡 Send your OTLP traces to localhost:4327 (gRPC) or localhost:4328 (HTTP)"

Recap and next steps

You now have the foundations to build a complete observability stack:

Component | Role | Port
Prometheus | Metrics (scraping + storage + alerting rules) | 9090
Grafana | Visualization (dashboards + exploration) | 3000
Loki | Logs (aggregation + queries) | 3100
Tempo | Traces (storage + correlation) | 3200
Alertmanager | Alert routing | 9093
OTel Collector | Unified collection (metrics + logs + traces) | 4317/4318
Alloy | Local agent (replaces Promtail and other separate agents) | 12345

Production readiness checklist

  ‱ Metrics: RED for every service, USE for every node
  ‱ SLOs: defined for critical services, error budgets monitored
  ‱ Alerts: actionable, with runbooks, two levels (critical/warning)
  ‱ Logs: structured (JSON), with a trace_id, centralized in Loki
  ‱ Traces: OpenTelemetry configured, sampling tuned to the volume
  ‱ Correlation: metrics → logs → traces links working in Grafana
  ‱ Dashboards: as code (Grafonnet or Terraform), versioned in Git
  ‱ Retention: defined per signal type (metrics 30d, logs 30d, traces 14d)
  ‱ High availability: Prometheus in HA pairs, Loki/Tempo in distributed mode for production

Going further

  ‱ Mimir: long-term storage for Prometheus metrics (multi-tenant, scalable)
  ‱ Pyroscope: continuous profiling, the fourth pillar of observability
  ‱ Kubernetes monitoring: kube-state-metrics, cAdvisor, kube-prometheus-stack (Helm)
  ‱ eBPF: zero-code instrumentation at the kernel level (Grafana Beyla)
  ‱ On-call: Grafana OnCall for managing on-call rotations

Observability is not a project you "finish". It is a practice you keep improving. Start simple, iterate, and add layers when you need them, not before.