Monitoring & Observabilité
Programme
OpenTelemetry : architecture et Collector
Les 3 piliers de l'observabilité (traces, métriques, logs), l'architecture OpenTelemetry SDK → Collector → Backend, et la configuration du Collector en détail.
Instrumentation et sampling OpenTelemetry
Instrumentation automatique et manuelle avec OpenTelemetry, stratégies de sampling pour gérer le volume, semantic conventions et stack complÚte de démo.
Prometheus : architecture et PromQL
Architecture interne de Prometheus (TSDB, scraping, service discovery), PromQL avancé et recording/alerting rules pour le monitoring production.
Alertmanager et Grafana dashboards
Alertmanager avec routing et escalade (PagerDuty, Slack), Grafana dashboards efficaces, et stratégies de scaling Prometheus (Thanos, Cortex).
SRE : SLI/SLO et Error Budgets
Les principes SRE de Google : SLI/SLO/SLA avec calculs concrets, error budgets comme outil de décision, et élimination du toil.
Incident Management et Chaos Engineering
Incident management structuré, postmortems blameless, Chaos Engineering (Chaos Monkey, Litmus) et organisation des astreintes on-call.
Le monitoring, c'est savoir que ton système est cassé. L'observabilité, c'est comprendre pourquoi il est cassé. Cette formation te donne les deux : de la théorie aux fichiers de config prêts à déployer.
On va construire ensemble une stack d'observabilité complète basée sur Prometheus, Grafana, Loki, Tempo et OpenTelemetry. À la fin, tu auras une vision 360° de tes systèmes : métriques, logs et traces, corrélés entre eux.
Module 1 : Fondamentaux du Monitoring et de l'Observabilité
Monitoring vs Observabilité : quelle différence ?
Le monitoring répond à la question : « Est-ce que ça marche ? » L'observabilité répond à : « Pourquoi ça ne marche pas ? »
| Aspect | Monitoring | Observabilité |
|---|---|---|
| Approche | Réactive : alertes sur seuils connus | Exploratoire : investigation de problèmes inconnus |
| Questions | Prédéfinies (CPU > 90% ?) | Ad-hoc (pourquoi le P99 a doublé ?) |
| Données | Métriques agrégées | Métriques + logs + traces corrélés |
| Adapté pour | Pannes connues | SystÚmes distribués complexes |
En pratique, tu as besoin des deux. Le monitoring te réveille à 3h du matin. L'observabilité te permet de résoudre le problème en 10 minutes au lieu de 2 heures.
Les 3 piliers de l'observabilité
1. Métriques
Des valeurs numériques agrégées dans le temps. LégÚres, rapides à interroger, idéales pour les alertes.
# Exemple : taux de requêtes HTTP par seconde
rate(http_requests_total{service="api"}[5m])
Types de métriques :
- Counter : ne fait qu'augmenter (requêtes totales, erreurs totales)
- Gauge : monte et descend (température, connexions actives)
- Histogram : distribution de valeurs (latence, taille des réponses)
- Summary : comme histogram, mais calcule les quantiles côté client
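À titre d'illustration, voici comment ces quatre types se déclarent avec la bibliothèque Python prometheus_client (noms de métriques, labels et valeurs purement fictifs) :

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter : ne fait qu'augmenter
http_requests = Counter("http_requests_total", "Total des requêtes HTTP", ["method", "status"])
# Gauge : monte et descend
db_connections = Gauge("db_connections_active", "Connexions actives à la base")
# Histogram : buckets cumulatifs, quantiles calculés côté serveur (histogram_quantile)
request_latency = Histogram("http_request_duration_seconds", "Latence des requêtes",
                            buckets=[0.05, 0.1, 0.3, 1.0, 2.5])
# Summary : quantiles calculés côté client, non agrégeables entre instances
response_size = Summary("http_response_size_bytes", "Taille des réponses")

http_requests.labels(method="GET", status="200").inc()
db_connections.set(42)
request_latency.observe(0.27)
response_size.observe(512)
```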
2. Logs
Des événements textuels horodatés. Riches en contexte, mais volumineux.
{
"timestamp": "2026-03-20T08:15:32Z",
"level": "error",
"service": "payment-api",
"trace_id": "abc123def456",
"message": "Timeout connecting to stripe API",
"duration_ms": 30000
}
Bonnes pratiques logs :
- Format structuré (JSON) : toujours
- Inclure le trace_id pour la corrélation
- Niveaux cohérents : DEBUG, INFO, WARN, ERROR
- Pas de données sensibles (PII, tokens)
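Un sketch minimal de ces pratiques en Python, sans dépendance externe (le nom du service et les champs extra sont des exemples à adapter) :

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Formatter JSON minimal : un objet par ligne, prêt pour Loki."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": "payment-api",  # à adapter à ton service
            "message": record.getMessage(),
        }
        # Champs passés via extra={...} : trace_id pour la corrélation, durées, etc.
        for key in ("trace_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).error(
    "Timeout connecting to stripe API",
    extra={"trace_id": "abc123def456", "duration_ms": 30000},
)
```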
3. Traces distribuées
Le parcours complet d'une requête à travers tes microservices.
[Gateway] ──(12ms)──> [Auth Service] ──(3ms)──> [User DB]
    │
    └──(45ms)──> [Payment Service] ──(30000ms)──> [Stripe API]  ❌ TIMEOUT
Une trace est composée de spans : chaque span représente une opération (appel HTTP, requête DB, traitement).
La méthode RED (pour les services)
Pour chaque service orienté requêtes, mesure ces 3 choses :
| Signal | Description | Métrique Prometheus |
|---|---|---|
| Rate | Requêtes par seconde | rate(http_requests_total[5m]) |
| Errors | Taux d'erreurs | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Latence (histogramme) | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
# Dashboard RED complet pour un service
# Rate
sum(rate(http_requests_total{service="api"}[5m]))
# Error ratio
sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))
# Duration P99
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
)
La méthode USE (pour les ressources)
Pour chaque ressource infrastructure (CPU, mémoire, disque, réseau) :
| Signal | Description | Exemple |
|---|---|---|
| Utilization | % de la capacité utilisée | CPU à 75% |
| Saturation | File d'attente / backlog | Load average, disk I/O queue |
| Errors | Nombre d'erreurs | ECC errors, packet drops |
# USE pour le CPU
# Utilization
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# Saturation (load average / nombre de CPUs)
node_load1 / count without (cpu) (node_cpu_seconds_total{mode="idle"})
# Errors
rate(node_edac_correctable_errors_total[5m])
SLIs, SLOs, SLAs et Error Budgets
Définitions
- SLI (Service Level Indicator) : la métrique mesurée (ex : % de requêtes < 300ms)
- SLO (Service Level Objective) : l'objectif interne (ex : 99.9% des requêtes < 300ms)
- SLA (Service Level Agreement) : l'engagement contractuel (ex : 99.5%, toujours plus bas que le SLO)
- Error Budget : la marge d'erreur tolérée avant de violer le SLO
Exemple concret
Ton API de paiement :
# SLI : latence
sli:
type: latency
metric: "http_request_duration_seconds"
threshold: 0.3 # 300ms
filter: '{service="payment-api", method!="OPTIONS"}'
# SLO : 99.9% sur 30 jours
slo:
target: 0.999
window: 30d
# Calcul de l'error budget
# 30 jours = 43,200 minutes
# Error budget = 0.1% = 43.2 minutes de downtime autorisé
# Ou : sur 1M de requêtes, 1 000 peuvent être lentes
# Calcul du SLI en PromQL
sum(rate(http_request_duration_seconds_bucket{
service="payment-api",
le="0.3"
}[30d]))
/
sum(rate(http_request_duration_seconds_count{
service="payment-api"
}[30d]))
# Consommation de l'error budget
1 - (
(1 - sli_value) / (1 - 0.999)
)
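Le même calcul, posé en quelques lignes de Python pour fixer les idées (valeurs d'exemple, pas issues d'un vrai système) :

```python
# SLO de 99.9 % sur une fenêtre de 30 jours
slo_target = 0.999
window_minutes = 30 * 24 * 60  # 43 200 minutes

error_budget_minutes = (1 - slo_target) * window_minutes
print(f"Error budget : {error_budget_minutes:.1f} min")   # 43.2 min de downtime autorisé

# SLI mesuré : 99.95 % des requêtes sous 300 ms
sli = 0.9995
budget_remaining = 1 - (1 - sli) / (1 - slo_target)
print(f"Error budget restant : {budget_remaining:.0%}")   # 50 %
```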
Error budget en pratique
| Error Budget restant | Action |
|---|---|
| > 50% | Déploiements normaux, expérimentation OK |
| 25-50% | Prudence, review renforcée |
| < 25% | Gel des déploiements non-critiques |
| 0% | Incident : focus fiabilité uniquement |
Module 2 : Prometheus
Architecture de Prometheus
Prometheus fonctionne en pull : il va chercher les métriques sur tes services à intervalles réguliers (scraping).
graph TB
subgraph Prom["Prometheus"]
Scraper["Scraper (pull)"] --> TSDB["TSDB (stockage)"]
TSDB --> Rules["Rule Engine
recording + alerting"]
end
Targets["Targets /metrics
node_exporter, app, blackbox"] -->|"scrape"| Scraper
Rules -->|"push alerts"| AM["Alertmanager
Slack, PagerDuty, Email"]
style Prom fill:#1a2332,stroke:#dc2626,color:#f1f5f9
style Scraper fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
style TSDB fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
style Rules fill:#1a2332,stroke:#a855f7,color:#f1f5f9
style Targets fill:#1a2332,stroke:#22c55e,color:#f1f5f9
style AM fill:#1a2332,stroke:#dc2626,color:#f1f5f9
Configuration de base
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: production
env: prod
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus se scrape lui-même
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Node Exporter : métriques système
- job_name: "node"
static_configs:
- targets:
- "node-exporter-1:9100"
- "node-exporter-2:9100"
relabel_configs:
- source_labels: [__address__]
regex: "(.+):9100"
target_label: instance
replacement: "${1}"
# Découverte automatique via Docker
- job_name: "docker"
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 30s
relabel_configs:
- source_labels: [__meta_docker_container_label_prometheus_scrape]
regex: "true"
action: keep
- source_labels: [__meta_docker_container_label_prometheus_port]
target_label: __address__
regex: (.+)
replacement: "${1}"
- source_labels: [__meta_docker_container_name]
target_label: container
# Découverte Kubernetes
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Blackbox Exporter : monitoring externe
- job_name: "blackbox-http"
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://devopslab.ch
- https://api.devopslab.ch/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
PromQL : de débutant à avancé
Les bases
# Valeur instantanée d'une métrique
up{job="node"}
# Filtrage avec labels
http_requests_total{method="GET", status="200"}
# Regex sur labels
http_requests_total{status=~"5.."} # 5xx
http_requests_total{path!~"/health.*"} # exclure health
Fonctions de taux (rate)
# rate() : taux par seconde sur une fenêtre (lissé)
rate(http_requests_total[5m])
# irate() : taux instantané (2 derniers points)
irate(http_requests_total[5m])
# increase() : augmentation absolue sur la fenêtre
increase(http_requests_total[1h])
Règle d'or : utilise rate() pour les alertes et dashboards, irate() uniquement pour les graphes haute résolution.
Agrégations
# Somme par service
sum by (service) (rate(http_requests_total[5m]))
# Moyenne par instance
avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Top 5 des pods par CPU
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
# Quantile 0.95 de la latence moyenne, calculé sur l'ensemble des séries
quantile(0.95, rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
Histogrammes et quantiles
# P50 (médiane) de la latence
histogram_quantile(0.5,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# P99 par service
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Apdex score (seuil satisfait=0.3s, toléré=1.2s)
(
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
+
sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
)
/ 2
/ sum(rate(http_request_duration_seconds_count[5m]))
Requêtes avancées
# Prédiction : disque plein dans combien d'heures ?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
# Taux d'erreur avec seuil minimum de trafic (éviter les faux positifs)
(
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
) > 0.05
and
sum by (service) (rate(http_requests_total[5m])) > 10
# Détection d'anomalie : écart par rapport à la moyenne sur 7 jours
(
avg_over_time(rate(http_requests_total[5m])[1h:5m])
-
avg_over_time(rate(http_requests_total[5m])[7d:1h])
)
/
stddev_over_time(rate(http_requests_total[5m])[7d:1h])
> 3
# Ratio de cache hit
sum(rate(cache_hits_total[5m]))
/
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
Exporters
Node Exporter
Métriques système (CPU, mémoire, disque, réseau) :
# Installation
docker run -d \
--name node-exporter \
--net host \
--pid host \
-v /:/host:ro,rslave \
quay.io/prometheus/node-exporter \
--path.rootfs=/host
Métriques clés :
# CPU utilisé
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Mémoire utilisée (%)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Espace disque restant (%)
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
# Trafic réseau
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8 # bits/s
Blackbox Exporter
Monitoring boîte noire (HTTP, TCP, DNS, ICMP) :
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200, 301, 302]
follow_redirects: true
preferred_ip_protocol: ip4
tls_config:
insecure_skip_verify: false
http_post_json:
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"healthcheck": true}'
tcp_connect:
prober: tcp
timeout: 5s
icmp_ping:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: ip4
dns_resolution:
prober: dns
timeout: 5s
dns:
query_name: devopslab.ch
query_type: A
valid_rcodes:
- NOERROR
# Latence HTTP du probe
probe_http_duration_seconds{phase="resolve"}
# Certificat SSL : jours avant expiration
(probe_ssl_earliest_cert_expiry - time()) / 86400
# Disponibilité sur 30 jours
avg_over_time(probe_success{job="blackbox-http"}[30d]) * 100
Exporter custom en Python
# custom_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import time
import psycopg2
# Définition des métriques
db_connections = Gauge(
'app_db_connections_active',
'Number of active database connections',
['database']
)
orders_total = Counter(
'app_orders_total',
'Total number of orders processed',
['status', 'payment_method']
)
order_processing_time = Histogram(
'app_order_processing_seconds',
'Time to process an order',
['order_type'],
buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
def collect_metrics():
"""Collecte les métriques métier."""
conn = psycopg2.connect("postgresql://localhost/myapp")
cur = conn.cursor()
# Connexions actives
cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
db_connections.labels(database="myapp").set(cur.fetchone()[0])
# Commandes des 15 dernières secondes (fenêtre alignée sur la période de collecte pour éviter le double comptage)
cur.execute("""
SELECT status, payment_method, count(*)
FROM orders
WHERE created_at > now() - interval '15 seconds'
GROUP BY status, payment_method
""")
for status, method, count in cur.fetchall():
orders_total.labels(status=status, payment_method=method).inc(count)
cur.close()
conn.close()
if __name__ == '__main__':
start_http_server(8000)
while True:
collect_metrics()
time.sleep(15)
Recording Rules
Les recording rules pré-calculent des requêtes coûteuses :
# rules/recording.yml
groups:
- name: http_rules
interval: 30s
rules:
# Taux de requêtes par service (pré-calculé)
- record: service:http_requests:rate5m
expr: sum by (service) (rate(http_requests_total[5m]))
# Taux d'erreur par service
- record: service:http_errors:ratio5m
expr: |
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
# Latence P99 par service
- record: service:http_latency:p99_5m
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: node_rules
interval: 1m
rules:
- record: instance:node_cpu:utilization
expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
- record: instance:node_memory:utilization
expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
- record: instance:node_disk:predicted_full_hours
expr: |
-1 * (
node_filesystem_avail_bytes{mountpoint="/"}
/
deriv(node_filesystem_avail_bytes{mountpoint="/"}[6h])
) / 3600
- name: slo_rules
interval: 30s
rules:
# SLI : ratio de requêtes rapides (< 300ms)
- record: slo:api_latency:ratio5m
expr: |
sum(rate(http_request_duration_seconds_bucket{service="api", le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="api"}[5m]))
# Error budget restant (sur 30 jours, SLO 99.9%)
- record: slo:api_latency:error_budget_remaining
expr: |
1 - (
(1 - slo:api_latency:ratio5m)
/
(1 - 0.999)
)
Module 3 : Grafana
Installation
# docker-compose pour Grafana standalone
services:
grafana:
image: grafana/grafana:11.5.0
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
GF_USERS_ALLOW_SIGN_UP: "false"
GF_AUTH_ANONYMOUS_ENABLED: "false"
GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel"
volumes:
- grafana-data:/var/lib/grafana
- ./provisioning:/etc/grafana/provisioning
volumes:
grafana-data:
Provisioning des datasources
# provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
jsonData:
timeInterval: "15s"
exemplarTraceIdDestinations:
- name: traceID
datasourceUid: tempo
- name: Loki
type: loki
access: proxy
url: http://loki:3100
jsonData:
derivedFields:
- datasourceUid: tempo
matcherRegex: '"trace_id":"(\w+)"'
name: TraceID
url: "$${__value.raw}"
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
jsonData:
tracesToLogs:
datasourceUid: loki
filterByTraceID: true
filterBySpanID: true
tracesToMetrics:
datasourceUid: prometheus
nodeGraph:
enabled: true
serviceMap:
datasourceUid: prometheus
Dashboards efficaces
Les principes d'un bon dashboard :
- Un dashboard = un objectif (overview, service detail, debugging)
- Hiérarchie visuelle : les métriques importantes en haut et en gros
- Seuils visuels : vert/jaune/rouge immédiatement lisibles
- Liens entre dashboards : drill-down de l'overview vers le détail
Dashboard JSON provisionné
# provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: default
orgId: 1
folder: "Production"
type: file
disableDeletion: false
editable: true
options:
path: /etc/grafana/provisioning/dashboards/json
foldersFromFilesStructure: true
Variables et templating
Les variables rendent tes dashboards dynamiques et réutilisables :
{
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus"
},
{
"name": "namespace",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info, namespace)",
"refresh": 2,
"sort": 1
},
{
"name": "service",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
"refresh": 2,
"sort": 1,
"multi": true,
"includeAll": true
},
{
"name": "percentile",
"type": "custom",
"query": "50,90,95,99",
"current": {
"text": "99",
"value": "99"
}
}
]
}
}
Utilisation dans les requêtes :
# Variable simple
rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m])
# Variable avec regex multi-valeur
rate(http_requests_total{service=~"${service:pipe}"}[$__rate_interval])
# Variable dans le quantile
histogram_quantile(0.$percentile,
sum by (le) (rate(http_request_duration_seconds_bucket{service=~"$service"}[$__rate_interval]))
)
Dashboards as Code
Grafonnet (Jsonnet)
// dashboards/service-overview.jsonnet
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local panel = grafana.panel;
local prometheus = grafana.query.prometheus;
local variable = grafana.dashboard.variable;
local ds = variable.datasource.new('datasource', 'prometheus');
local svc = variable.query.new('service')
+ variable.query.withDatasourceFromVariable(ds)
+ variable.query.queryTypes.withLabelValues('service', 'http_requests_total')
+ variable.query.withRefresh(2)
+ variable.query.selectionOptions.withMulti(true)
+ variable.query.selectionOptions.withIncludeAll(true);
local requestRatePanel =
panel.timeSeries.new('Request Rate')
+ panel.timeSeries.queryOptions.withTargets([
prometheus.new('${datasource}',
'sum by (service) (rate(http_requests_total{service=~"$service"}[$__rate_interval]))'
)
+ prometheus.withLegendFormat('{{ service }}'),
])
+ panel.timeSeries.standardOptions.withUnit('reqps')
+ panel.timeSeries.gridPos.withW(12)
+ panel.timeSeries.gridPos.withH(8);
local errorRatePanel =
panel.timeSeries.new('Error Rate')
+ panel.timeSeries.queryOptions.withTargets([
prometheus.new('${datasource}',
|||
sum by (service) (rate(http_requests_total{service=~"$service", status=~"5.."}[$__rate_interval]))
/
sum by (service) (rate(http_requests_total{service=~"$service"}[$__rate_interval]))
|||
)
+ prometheus.withLegendFormat('{{ service }}'),
])
+ panel.timeSeries.standardOptions.withUnit('percentunit')
+ panel.timeSeries.fieldConfig.defaults.custom.withThresholdsStyleMode('area')
+ panel.timeSeries.standardOptions.withOverrides([])
+ panel.timeSeries.gridPos.withW(12)
+ panel.timeSeries.gridPos.withH(8)
+ panel.timeSeries.gridPos.withX(12);
local latencyPanel =
panel.timeSeries.new('Latency P99')
+ panel.timeSeries.queryOptions.withTargets([
prometheus.new('${datasource}',
|||
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket{service=~"$service"}[$__rate_interval]))
)
|||
)
+ prometheus.withLegendFormat('{{ service }}'),
])
+ panel.timeSeries.standardOptions.withUnit('s')
+ panel.timeSeries.gridPos.withW(24)
+ panel.timeSeries.gridPos.withH(8);
dashboard.new('Service Overview')
+ dashboard.withUid('service-overview')
+ dashboard.withTags(['generated', 'service'])
+ dashboard.withTimezone('browser')
+ dashboard.withRefresh('30s')
+ dashboard.withVariables([ds, svc])
+ dashboard.withPanels([
requestRatePanel,
errorRatePanel,
latencyPanel,
])
Terraform Provider Grafana
# main.tf
terraform {
required_providers {
grafana = {
source = "grafana/grafana"
version = "~> 3.0"
}
}
}
provider "grafana" {
url = "https://grafana.devopslab.ch"
auth = var.grafana_api_key
}
resource "grafana_folder" "production" {
title = "Production"
}
resource "grafana_dashboard" "service_overview" {
folder = grafana_folder.production.id
config_json = file("${path.module}/dashboards/service-overview.json")
}
resource "grafana_data_source" "prometheus" {
type = "prometheus"
name = "Prometheus"
url = "http://prometheus:9090"
json_data_encoded = jsonencode({
timeInterval = "15s"
})
}
# Alerting via Terraform
resource "grafana_contact_point" "slack" {
name = "Slack Alerts"
slack {
url = var.slack_webhook_url
channel = "#alerts-prod"
title = "{{ .CommonLabels.alertname }}"
text = "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
}
}
resource "grafana_notification_policy" "default" {
contact_point = grafana_contact_point.slack.name
group_by = ["alertname", "service"]
policy {
matcher {
label = "severity"
match = "="
value = "critical"
}
contact_point = grafana_contact_point.slack.name
group_wait = "30s"
group_interval = "5m"
}
}
Module 4 : Loki (Agrégation de Logs)
Architecture de Loki
Loki est un système d'agrégation de logs inspiré de Prometheus. Il n'indexe pas le contenu des logs, seulement les labels. C'est ce qui le rend léger et économique.
graph TB
PT["Promtail (agent)"] -->|"push"| Dist
GA["Grafana Alloy (agent)"] -->|"push"| Dist
DL["Docker Log Driver"] -->|"push"| Dist
subgraph Loki["Loki"]
Dist["Distributor"] --> Ing["Ingester"]
Ing --> Store["Storage (S3/local)"]
end
Store --> Graf["Grafana (visualisation)"]
style Loki fill:#1a2332,stroke:#dc2626,color:#f1f5f9
style PT fill:#1a2332,stroke:#22c55e,color:#f1f5f9
style GA fill:#1a2332,stroke:#22c55e,color:#f1f5f9
style DL fill:#1a2332,stroke:#22c55e,color:#f1f5f9
style Dist fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
style Ing fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
style Store fill:#1a2332,stroke:#a855f7,color:#f1f5f9
style Graf fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
Configuration Loki
# loki-config.yml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
filesystem:
directory: /loki/storage
limits_config:
retention_period: 30d
max_query_length: 721h
max_query_parallelism: 32
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
per_stream_rate_limit: 5MB
max_entries_limit_per_query: 10000
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
ruler:
storage:
type: local
local:
directory: /loki/rules
rule_path: /loki/rules-temp
alertmanager_url: http://alertmanager:9093
ring:
kvstore:
store: inmemory
enable_api: true
Promtail : l'agent de collecte
# promtail-config.yml
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
tenant_id: default
scrape_configs:
# Logs des containers Docker (via le socket Docker)
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: container
- source_labels: ['__meta_docker_container_log_stream']
target_label: stream
- source_labels: ['__meta_docker_container_label_com_docker_compose_service']
target_label: service
# Logs fichiers système
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: syslog
__path__: /var/log/syslog
- targets:
- localhost
labels:
job: authlog
__path__: /var/log/auth.log
pipeline_stages:
# Parser les logs JSON
- json:
expressions:
level: level
msg: message
trace_id: trace_id
duration: duration_ms
- labels:
level:
trace_id:
- metrics:
log_lines_total:
type: Counter
description: "Total log lines"
source: level
config:
action: inc
match_all: true
# Logs Kubernetes
- job_name: kubernetes
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
pipeline_stages:
- cri: {}
- json:
expressions:
level: level
trace_id: trace_id
- labels:
level:
- structured_metadata:
trace_id:
Grafana Alloy (successeur de Promtail)
// alloy-config.alloy
// Grafana Alloy : collecteur unifié (logs, métriques, traces)
// Découverte des containers Docker
discovery.docker "containers" {
host = "unix:///var/run/docker.sock"
}
// Collecte des logs Docker
loki.source.docker "docker_logs" {
host = "unix:///var/run/docker.sock"
targets = discovery.docker.containers.targets
forward_to = [loki.process.pipeline.receiver]
}
// Pipeline de traitement
loki.process "pipeline" {
// Parser JSON
stage.json {
expressions = {
level = "level",
trace_id = "trace_id",
msg = "message",
}
}
// Ajouter le level comme label
stage.labels {
values = {
level = "",
}
}
// Structured metadata (pas indexé mais filtrable)
stage.structured_metadata {
values = {
trace_id = "",
}
}
// Drop les logs de health check
stage.drop {
expression = ".*GET /health.*"
drop_counter_reason = "healthcheck"
}
forward_to = [loki.write.default.receiver]
}
// Envoi vers Loki
loki.write "default" {
endpoint {
url = "http://loki:3100/loki/api/v1/push"
}
}
// Bonus : collecte des métriques aussi
prometheus.scrape "docker_metrics" {
targets = discovery.docker.containers.targets
forward_to = [prometheus.remote_write.default.receiver]
}
prometheus.remote_write "default" {
endpoint {
url = "http://prometheus:9090/api/v1/write"
}
}
LogQL : le langage de requête
Requêtes de base
# Tous les logs d'un service
{service="payment-api"}
# Filtrer par niveau
{service="payment-api"} |= "error"
{service="payment-api"} | json | level="error"
# Exclure les health checks
{service="api"} != "/health"
# Regex
{service="api"} |~ "timeout|connection refused"
# Parser JSON et filtrer
{service="api"}
| json
| level="error"
| duration_ms > 5000
| line_format "{{.timestamp}} [{{.level}}] {{.message}} ({{.duration_ms}}ms)"
Requêtes métriques (agrégation)
# Nombre de logs d'erreur par minute
count_over_time({service="api"} |= "error" [1m])
# Taux de logs par service
sum by (service) (rate({job="docker"} [5m]))
# Bytes de logs par service (pour détecter les services trop bavards)
sum by (service) (bytes_over_time({job="docker"} [1h]))
# P99 de la durée extraite des logs (unwrap s'utilise dans une agrégation de plage)
quantile_over_time(0.99, {service="api"} | json | unwrap duration_ms | __error__="" [5m])
# Top 5 des messages d'erreur les plus fréquents
topk(5,
sum by (message) (count_over_time(
{service="api"} | json | level="error" [1h]
))
)
# Détection de patterns : logs contenant "OOM"
count_over_time({job=~".+"} |= "OOM" [5m]) > 0
Corrélation logs-métriques
La magie opère quand tu relies logs et métriques. Dans Grafana, configure les derived fields dans la datasource Loki :
# Dans la datasource Loki
jsonData:
derivedFields:
- datasourceUid: tempo # Lien vers Tempo
matcherRegex: '"trace_id":"(\w+)"'
name: TraceID
url: "$${__value.raw}"
- datasourceUid: prometheus
matcherRegex: '"service":"(\w+)"'
name: ServiceDashboard
url: "/d/service-detail/service-detail?var-service=$${__value.raw}"
Flux de debug typique :
1. Alerte Prometheus : taux d'erreur > 5%
2. Dashboard Grafana : le service payment-api est impacté
3. Logs Loki : {service="payment-api"} | json | level="error"
4. Trace : clic sur le trace_id, vue complète dans Tempo
5. Root cause : le span vers Stripe montre un timeout
Module 5 : Alerting
Configuration complète d'Alertmanager
# alertmanager.yml
global:
resolve_timeout: 5m
smtp_from: "alerts@devopslab.ch"
smtp_smarthost: "smtp.example.com:587"
smtp_auth_username: "alerts@devopslab.ch"
smtp_auth_password: "${SMTP_PASSWORD}"
smtp_require_tls: true
slack_api_url: "${SLACK_WEBHOOK_URL}"
pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
# Templates personnalisés
templates:
- "/etc/alertmanager/templates/*.tmpl"
# Inhibition : supprimer les alertes redondantes
inhibit_rules:
# Si le cluster est down, pas besoin d'alerter sur chaque service
- source_matchers:
- alertname = ClusterDown
target_matchers:
- severity =~ "warning|info"
equal: [cluster]
# Si un nœud est down, pas besoin d'alerter sur ses services
- source_matchers:
- alertname = NodeDown
target_matchers:
- alertname != NodeDown
equal: [instance]
# Critical inhibe warning pour la même alerte
- source_matchers:
- severity = critical
target_matchers:
- severity = warning
equal: [alertname, service]
# Routing
route:
receiver: slack-default
group_by: [alertname, cluster, service]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# Alertes critiques : PagerDuty + Slack
- receiver: pagerduty-critical
matchers:
- severity = critical
group_wait: 10s
repeat_interval: 1h
continue: true # Continue vers la route suivante aussi
- receiver: slack-critical
matchers:
- severity = critical
group_wait: 10s
# Alertes warning : Slack uniquement
- receiver: slack-warning
matchers:
- severity = warning
group_wait: 1m
repeat_interval: 12h
# Alertes business : channel dédié
- receiver: slack-business
matchers:
- team = business
group_by: [alertname]
repeat_interval: 24h
# Watchdog (heartbeat de l'alerting)
- receiver: "null"
matchers:
- alertname = Watchdog
# Receivers
receivers:
- name: "null"
- name: slack-default
slack_configs:
- channel: "#alerts-default"
send_resolved: true
title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*{{ .Labels.severity | toUpper }}* - {{ .Labels.service }}
{{ .Annotations.description }}
{{ if .Labels.instance }}Instance: `{{ .Labels.instance }}`{{ end }}
{{ end }}
actions:
- type: button
text: "đ Dashboard"
url: "{{ (index .Alerts 0).Annotations.dashboard_url }}"
- type: button
text: "đ Runbook"
url: "{{ (index .Alerts 0).Annotations.runbook_url }}"
- name: slack-critical
slack_configs:
- channel: "#alerts-critical"
send_resolved: true
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
title: '🚨 CRITICAL: {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*Service:* {{ .Labels.service }}
*Description:* {{ .Annotations.description }}
*Impact:* {{ .Annotations.impact }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
- name: slack-warning
slack_configs:
- channel: "#alerts-warning"
send_resolved: true
title: '⚠️ {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
{{ .Annotations.description }}
{{ end }}
- name: slack-business
slack_configs:
- channel: "#alerts-business"
send_resolved: true
title: '{{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
{{ .Annotations.description }}
{{ end }}
- name: pagerduty-critical
pagerduty_configs:
- routing_key: "${PAGERDUTY_ROUTING_KEY}"
severity: critical
description: "{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.description }}"
details:
firing: "{{ .Alerts.Firing | len }}"
resolved: "{{ .Alerts.Resolved | len }}"
service: "{{ .CommonLabels.service }}"
Règles d'alerte Prometheus
# rules/alerts.yml
groups:
- name: service_alerts
rules:
# Taux d'erreur élevé
- alert: HighErrorRate
expr: service:http_errors:ratio5m > 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Taux d'erreur élevé sur {{ $labels.service }}"
description: >-
Le service {{ $labels.service }} a un taux d'erreur de
{{ $value | humanizePercentage }} (seuil: 5%).
impact: "Les utilisateurs reçoivent des erreurs 5xx."
dashboard_url: "https://grafana.devopslab.ch/d/service-detail?var-service={{ $labels.service }}"
runbook_url: "https://wiki.devopslab.ch/runbooks/high-error-rate"
# Latence élevée
- alert: HighLatencyP99
expr: service:http_latency:p99_5m > 1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Latence P99 élevée sur {{ $labels.service }}"
description: >-
Le P99 du service {{ $labels.service }} est à {{ $value | humanizeDuration }}
(seuil: 1s).
# SLO en danger
- alert: SLOBudgetBurning
expr: slo:api_latency:error_budget_remaining < 0.25
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Error budget critique pour l'API"
description: >-
Il ne reste que {{ $value | humanizePercentage }} de l'error budget.
Gel des déploiements recommandé.
- name: infrastructure_alerts
rules:
# Nœud down
- alert: NodeDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "NĆud {{ $labels.instance }} injoignable"
description: "Le node exporter de {{ $labels.instance }} ne répond plus depuis 2 minutes."
# CPU saturé
- alert: HighCPU
expr: instance:node_cpu:utilization > 0.85
for: 15m
labels:
severity: warning
annotations:
summary: "CPU élevé sur {{ $labels.instance }}"
description: "CPU Ă {{ $value | humanizePercentage }} depuis 15 minutes."
# Disque bientôt plein
- alert: DiskFillingUp
expr: instance:node_disk:predicted_full_hours > 0 and instance:node_disk:predicted_full_hours < 24
for: 30m
labels:
severity: warning
annotations:
summary: "Disque de {{ $labels.instance }} bientĂŽt plein"
description: >-
Au rythme actuel, le disque sera plein dans
{{ $value | humanizeDuration }}.
# Certificat SSL expire bientôt
- alert: SSLCertExpiringSoon
expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
for: 1h
labels:
severity: warning
annotations:
summary: "Certificat SSL de {{ $labels.instance }} expire dans {{ $value | humanize }} jours"
# Mémoire critique
- alert: HighMemory
expr: instance:node_memory:utilization > 0.9
for: 10m
labels:
severity: critical
annotations:
summary: "Mémoire critique sur {{ $labels.instance }}"
description: "Utilisation mémoire à {{ $value | humanizePercentage }}."
- name: watchdog
rules:
# Alerte qui fire toujours : si elle disparaît, l'alerting est cassé
- alert: Watchdog
expr: vector(1)
labels:
severity: none
annotations:
summary: "Watchdog â l'alerting fonctionne"
Bonnes pratiques anti-alert fatigue
L'alert fatigue tue la fiabilité. Voici comment l'éviter :
1. Chaque alerte doit être actionnable
Si tu reçois une alerte et que la réaction est « ok, je regarde demain », c'est pas une alerte, c'est du bruit. Supprime-la ou transforme-la en dashboard.
2. Deux niveaux suffisent
| Niveau | Action | Destination |
|---|---|---|
| critical | Réveille quelqu'un | PagerDuty + Slack |
| warning | À traiter aux heures ouvrées | Slack uniquement |
Pas d'info, pas de notice. Si c'est informatif, c'est un dashboard.
3. Seuils avec for, toujours
# ❌ Mauvais : fire sur un spike de 15 secondes
- alert: HighCPU
expr: node_cpu_utilization > 0.8
# ✅ Bon : problème persistant
- alert: HighCPU
expr: node_cpu_utilization > 0.85
for: 15m
4. Seuil minimum de trafic
# ❌ Alerte sur un seul échec
http_errors_ratio > 0.5
# ✅ Seulement si trafic significatif
http_errors_ratio > 0.05 and http_requests_rate > 10
5. Alertes basées sur les symptômes, pas les causes
# ❌ Alerte sur la cause (trop bas niveau)
- alert: PodRestarting
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
# ✅ Alerte sur le symptôme (impact utilisateur)
- alert: HighErrorRate
expr: service:http_errors:ratio5m > 0.05
for: 5m
6. Runbooks obligatoires
Chaque alerte doit avoir un lien vers un runbook qui explique :
Ce que l'alerte signifie
- Comment diagnostiquer
- Comment résoudre
- Qui escalader si bloqué
Module 6 : Tracing Distribué
OpenTelemetry : le standard
OpenTelemetry (OTel) est le standard ouvert pour la télémétrie. Il fournit des APIs, SDKs et un collecteur pour métriques, logs et traces.
graph TB
SA["Service A (OTel SDK)"] -->|"OTLP"| Coll
SB["Service B (OTel SDK)"] -->|"OTLP"| Coll
SC["Service C (OTel SDK)"] -->|"OTLP"| Coll
Coll["OTel Collector
Receivers → Process → Exporters"] --> Tempo["Tempo (traces)"]
Coll --> Prometheus["Prometheus (métriques)"]
style SA fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
style SB fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
style SC fill:#1a2332,stroke:#3b82f6,color:#f1f5f9
style Coll fill:#1a2332,stroke:#dc2626,color:#f1f5f9
style Tempo fill:#1a2332,stroke:#a855f7,color:#f1f5f9
style Prometheus fill:#1a2332,stroke:#f59e0b,color:#f1f5f9
Configuration du Collector OTel
# otel-collector-config.yml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1000
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
attributes:
actions:
- key: environment
value: production
action: upsert
tail_sampling:
policies:
# Garder toutes les traces avec erreurs
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
# Garder 10% des traces normales
- name: sampling
type: probabilistic
probabilistic:
sampling_percentage: 10
# Garder les traces lentes (> 2s)
- name: slow-traces
type: latency
latency:
threshold_ms: 2000
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
namespace: otel
resource_to_telemetry_conversion:
enabled: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [memory_limiter, attributes, batch]
exporters: [loki]
Instrumentation automatique vs manuelle
Auto-instrumentation Python
# Installation
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Lancement avec auto-instrumentation
opentelemetry-instrument \
--service_name payment-api \
--exporter_otlp_endpoint http://otel-collector:4317 \
--exporter_otlp_protocol grpc \
python app.py
# Ou via le code : configuration programmatique
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# Configuration du provider
resource = Resource.create({
"service.name": "payment-api",
"service.version": "1.2.0",
"deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# Auto-instrumentation des frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
Instrumentation manuelle Python
from opentelemetry import trace
import logging
tracer = trace.get_tracer("payment.service")
logger = logging.getLogger(__name__)
@app.route("/api/payment", methods=["POST"])
def process_payment():
with tracer.start_as_current_span("process_payment") as span:
# Ajouter des attributs au span
span.set_attribute("payment.method", request.json["method"])
span.set_attribute("payment.amount", request.json["amount"])
span.set_attribute("payment.currency", request.json["currency"])
# Injecter le trace_id dans les logs
ctx = trace.get_current_span().get_span_context()
logger.info(
"Processing payment",
extra={
"trace_id": format(ctx.trace_id, '032x'),
"span_id": format(ctx.span_id, '016x'),
"amount": request.json["amount"],
}
)
# Span enfant pour l'appel DB
with tracer.start_as_current_span("validate_card") as child:
child.set_attribute("db.system", "postgresql")
child.set_attribute("db.statement", "SELECT * FROM cards WHERE...")
card = validate_card(request.json["card_token"])
# Span enfant pour l'appel externe
with tracer.start_as_current_span("charge_stripe") as child:
child.set_attribute("http.method", "POST")
child.set_attribute("http.url", "https://api.stripe.com/v1/charges")
try:
result = stripe.Charge.create(
amount=request.json["amount"],
currency=request.json["currency"],
source=card.token,
)
child.set_attribute("payment.stripe_id", result.id)
except stripe.error.CardError as e:
child.set_status(trace.StatusCode.ERROR, str(e))
child.record_exception(e)
span.set_attribute("payment.status", "failed")
raise
span.set_attribute("payment.status", "success")
return {"status": "ok", "charge_id": result.id}
Instrumentation Go
package main
import (
"context"
"log"
"net/http"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
"go.opentelemetry.io/otel/trace"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
var tracer trace.Tracer
func initTracer() func() {
ctx := context.Background()
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
log.Fatalf("failed to create exporter: %v", err)
}
res, _ := resource.Merge(
resource.Default(),
resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("order-service"),
semconv.ServiceVersion("2.1.0"),
attribute.String("environment", "production"),
),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1), // 10% sampling
)),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
tracer = tp.Tracer("order-service")
return func() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tp.Shutdown(ctx)
}
}
func processOrder(ctx context.Context, orderID string) error {
ctx, span := tracer.Start(ctx, "processOrder",
trace.WithAttributes(
attribute.String("order.id", orderID),
),
)
defer span.End()
// Span enfant : validation
if err := validateOrder(ctx, orderID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
// Span enfant : paiement
if err := chargePayment(ctx, orderID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
span.SetAttributes(attribute.String("order.status", "completed"))
return nil
}
func validateOrder(ctx context.Context, orderID string) error {
ctx, span := tracer.Start(ctx, "validateOrder")
defer span.End()
span.SetAttributes(
attribute.String("db.system", "postgresql"),
attribute.String("db.operation", "SELECT"),
)
// ... validation logic
return nil
}
func main() {
shutdown := initTracer()
defer shutdown()
// HTTP handler avec instrumentation automatique
handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
orderID := r.URL.Query().Get("id")
if err := processOrder(r.Context(), orderID); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Write([]byte(`{"status":"ok"}`))
})
// otelhttp instrumente automatiquement les requêtes HTTP
wrappedHandler := otelhttp.NewHandler(handler, "order-api")
log.Println("Starting server on :8080")
log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}
Configuration de Tempo
# tempo-config.yml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
ingester:
max_block_duration: 5m
compactor:
compaction:
block_retention: 336h # 14 jours
metrics_generator:
registry:
external_labels:
source: tempo
cluster: production
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus:9090/api/v1/write
send_exemplars: true
traces_storage:
path: /var/tempo/generator/traces
processor:
service_graphs:
dimensions:
- service.namespace
enable_client_server_prefix: true
span_metrics:
dimensions:
- http.method
- http.status_code
- service.version
storage:
trace:
backend: local
wal:
path: /var/tempo/wal
local:
path: /var/tempo/blocks
overrides:
defaults:
metrics_generator:
processors:
- service-graphs
- span-metrics
Corrélation traces-métriques-logs
La corrélation complète, le Saint Graal de l'observabilité :
# Dans ton application, assure-toi d'inclure le trace_id partout
# 1. Logs structurés avec trace_id
# → Loki peut lier vers Tempo via derived fields
# 2. Métriques avec exemplars
# → Prometheus stocke des trace_id dans les exemplars des histogrammes
# 3. Tempo génère des métriques de spans
# → service_graphs et span_metrics dans Prometheus
# Configuration Grafana : datasource Tempo avec corrélation
# (voir provisioning/datasources plus haut)
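Côté application, un sketch du point 2 (exemplars) avec prometheus_client : il suppose une version récente de la bibliothèque et une exposition au format OpenMetrics, sans quoi les exemplars ne sont pas exposés.

```python
from opentelemetry import trace
from prometheus_client import Histogram

request_latency = Histogram(
    "http_request_duration_seconds", "Latence des requêtes", ["service"]
)

def observe_with_exemplar(duration_seconds: float) -> None:
    # Attache le trace_id courant comme exemplar : Grafana peut alors
    # lier un point de l'histogramme à la trace correspondante dans Tempo.
    ctx = trace.get_current_span().get_span_context()
    request_latency.labels(service="payment-api").observe(
        duration_seconds,
        exemplar={"trace_id": format(ctx.trace_id, "032x")},
    )
```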
Flux de debug complet :
1. ALERTE: "HighLatencyP99 sur payment-api" (Prometheus → Alertmanager → Slack)
   ↓
2. DASHBOARD: Ouvre le dashboard RED, voit le spike de latence
   └─ Clique sur un point du graphe, voit les exemplars
   ↓
3. TRACE: Clique sur l'exemplar → trace complète dans Tempo
   └─ payment-api (12ms) → stripe-client (30045ms) ← LE COUPABLE
   ↓
4. LOGS: Depuis le span, clique "View Logs" → Loki
   └─ "Connection timeout to api.stripe.com after 30s"
   ↓
5. ROOT CAUSE: Stripe a un incident. Rien à faire de notre côté.
   → Ajouter un circuit breaker avec fallback.
Bonus : Stack LGTM complète en Docker Compose
Voici un Docker Compose complet pour déployer la stack Loki + Grafana + Tempo + Prometheus (l'esprit « LGTM », avec Prometheus à la place de Mimir ici) :
# docker-compose.yml
# Stack LGTM - Observabilité complète
# Usage: docker compose up -d
services:
# ============================================
# Grafana - Visualisation
# ============================================
grafana:
image: grafana/grafana:11.5.0
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
GF_AUTH_ANONYMOUS_ENABLED: "false"
GF_FEATURE_TOGGLES_ENABLE: "traceQLStreaming,metricsSummary"
volumes:
- grafana-data:/var/lib/grafana
- ./config/grafana/provisioning:/etc/grafana/provisioning
depends_on:
- prometheus
- loki
- tempo
networks:
- observability
restart: unless-stopped
# ============================================
# Prometheus - Métriques
# ============================================
prometheus:
image: prom/prometheus:v2.53.0
ports:
- "9090:9090"
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
- "--web.enable-remote-write-receiver"
- "--enable-feature=exemplar-storage"
- "--enable-feature=native-histograms"
volumes:
- ./config/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./config/prometheus/rules:/etc/prometheus/rules:ro
- prometheus-data:/prometheus
networks:
- observability
restart: unless-stopped
# ============================================
# Alertmanager - Alertes
# ============================================
alertmanager:
image: prom/alertmanager:v0.27.0
ports:
- "9093:9093"
command:
- "--config.file=/etc/alertmanager/alertmanager.yml"
- "--storage.path=/alertmanager"
volumes:
- ./config/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- alertmanager-data:/alertmanager
networks:
- observability
restart: unless-stopped
# ============================================
# Loki - Logs
# ============================================
loki:
image: grafana/loki:3.4.0
ports:
- "3100:3100"
command: -config.file=/etc/loki/loki-config.yml
volumes:
- ./config/loki/loki-config.yml:/etc/loki/loki-config.yml:ro
- loki-data:/loki
networks:
- observability
restart: unless-stopped
# ============================================
# Tempo - Traces
# ============================================
tempo:
image: grafana/tempo:2.6.0
ports:
- "3200:3200" # HTTP API
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
command: -config.file=/etc/tempo/tempo-config.yml
volumes:
- ./config/tempo/tempo-config.yml:/etc/tempo/tempo-config.yml:ro
- tempo-data:/var/tempo
networks:
- observability
restart: unless-stopped
# ============================================
# OTel Collector - Collecteur unifié
# ============================================
otel-collector:
image: otel/opentelemetry-collector-contrib:0.115.0
ports:
- "4327:4317" # OTLP gRPC (mapped pour pas conflit avec Tempo)
- "4328:4318" # OTLP HTTP
- "8889:8889" # Prometheus metrics
command: --config=/etc/otel/config.yml
volumes:
- ./config/otel/otel-collector-config.yml:/etc/otel/config.yml:ro
depends_on:
- tempo
- prometheus
- loki
networks:
- observability
restart: unless-stopped
# ============================================
# Grafana Alloy - Agent de collecte
# ============================================
alloy:
image: grafana/alloy:v1.5.0
ports:
- "12345:12345" # UI
command:
- run
- /etc/alloy/config.alloy
- --server.http.listen-addr=0.0.0.0:12345
volumes:
- ./config/alloy/config.alloy:/etc/alloy/config.alloy:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /var/log:/var/log:ro
networks:
- observability
restart: unless-stopped
# ============================================
# Node Exporter - Métriques système
# ============================================
node-exporter:
image: quay.io/prometheus/node-exporter:v1.8.0
ports:
- "9100:9100"
command:
- "--path.rootfs=/host"
pid: host
volumes:
- /:/host:ro,rslave
networks:
- observability
restart: unless-stopped
# ============================================
# Blackbox Exporter - Probes externes
# ============================================
blackbox-exporter:
image: prom/blackbox-exporter:v0.25.0
ports:
- "9115:9115"
volumes:
- ./config/blackbox/blackbox.yml:/config/blackbox.yml:ro
command:
- "--config.file=/config/blackbox.yml"
networks:
- observability
restart: unless-stopped
# ============================================
# App démo - pour tester la stack
# ============================================
demo-app:
build:
context: ./demo-app
dockerfile: Dockerfile
ports:
- "8080:8080"
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
OTEL_SERVICE_NAME: demo-app
OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=demo"
labels:
prometheus.scrape: "true"
prometheus.port: "8080"
depends_on:
- otel-collector
networks:
- observability
restart: unless-stopped
volumes:
grafana-data:
prometheus-data:
loki-data:
tempo-data:
alertmanager-data:
networks:
observability:
driver: bridge
Structure du projet
observability-stack/
├── docker-compose.yml
├── .env                      # GRAFANA_PASSWORD, SLACK_WEBHOOK_URL, etc.
├── config/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── recording.yml
│   │       └── alerts.yml
│   ├── alertmanager/
│   │   └── alertmanager.yml
│   ├── loki/
│   │   └── loki-config.yml
│   ├── tempo/
│   │   └── tempo-config.yml
│   ├── otel/
│   │   └── otel-collector-config.yml
│   ├── alloy/
│   │   └── config.alloy
│   ├── blackbox/
│   │   └── blackbox.yml
│   └── grafana/
│       └── provisioning/
│           ├── datasources/
│           │   └── datasources.yml
│           └── dashboards/
│               ├── dashboards.yml
│               └── json/
│                   └── service-overview.json
└── demo-app/
    ├── Dockerfile
    └── main.py
Script de démarrage rapide
#!/bin/bash
# start.sh - Démarrer la stack d'observabilité
set -euo pipefail
echo "đ DĂ©marrage de la stack LGTM..."
# Vérifier Docker
if ! command -v docker &> /dev/null; then
echo "â Docker n'est pas installĂ©"
exit 1
fi
# Créer le .env si absent
if [ ! -f .env ]; then
echo "đ CrĂ©ation du fichier .env..."
cat > .env << 'EOF'
GRAFANA_PASSWORD=changeme-please
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx
PAGERDUTY_ROUTING_KEY=xxx
SMTP_PASSWORD=xxx
EOF
echo "â ïž Ădite .env avec tes vrais credentials avant de continuer"
exit 1
fi
# Créer les répertoires
mkdir -p config/{prometheus/rules,alertmanager,loki,tempo,otel,alloy,blackbox,grafana/provisioning/{datasources,dashboards/json}}
# Démarrer
docker compose up -d
echo ""
echo "â
Stack démarrée !"
echo ""
echo "đ Grafana: http://localhost:3000 (admin / voir .env)"
echo "đ„ Prometheus: http://localhost:9090"
echo "đ Alertmanager: http://localhost:9093"
echo "đ Loki: http://localhost:3100"
echo "đ Tempo: http://localhost:3200"
echo "đĄ OTel Collector: gRPC=localhost:4327, HTTP=localhost:4328"
echo "đ„ïž Node Exporter: http://localhost:9100"
echo "â« Blackbox: http://localhost:9115"
echo "đ§ Alloy UI: http://localhost:12345"
echo ""
echo "đĄ Envoie tes traces OTLP vers localhost:4327 (gRPC) ou localhost:4328 (HTTP)"
Récapitulatif et prochaines étapes
Tu as maintenant les bases pour construire une stack d'observabilité complète :
| Composant | RĂŽle | Port |
|---|---|---|
| Prometheus | Métriques (scrape + stockage + alertes) | 9090 |
| Grafana | Visualisation (dashboards + exploration) | 3000 |
| Loki | Logs (agrégation + requêtes) | 3100 |
| Tempo | Traces (stockage + corrélation) | 3200 |
| Alertmanager | Routing des alertes | 9093 |
| OTel Collector | Collecte unifiée (métriques + logs + traces) | 4317/4318 |
| Alloy | Agent local (remplace Promtail + agents séparés) | 12345 |
Checklist de mise en production
- Métriques : RED pour chaque service, USE pour chaque nœud
- SLOs : Définis pour les services critiques, error budgets monitorés
- Alertes : Actionnables, avec runbooks, 2 niveaux (critical/warning)
- Logs : Structurés (JSON), avec trace_id, centralisés dans Loki
- Traces : OpenTelemetry configuré, sampling adapté au volume
- Corrélation : Liens métriques → logs → traces fonctionnels dans Grafana
- Dashboards : As Code (Grafonnet ou Terraform), versionnés en Git
- Rétention : Définie par type (métriques 30j, logs 30j, traces 14j)
- Haute dispo : Prometheus en mode HA, Loki/Tempo en mode distribué pour la prod
Pour aller plus loin
- Mimir : Stockage long-terme pour métriques Prometheus (multi-tenant, scalable)
- Pyroscope : Profiling continu, le 4e pilier de l'observabilité
- Kubernetes monitoring : kube-state-metrics, cAdvisor, kube-prometheus-stack (Helm)
- eBPF : Instrumentation zero-code au niveau kernel (Grafana Beyla)
- On-call : Grafana OnCall pour la gestion des astreintes
L'observabilité n'est pas un projet qu'on « termine ». C'est une pratique qu'on améliore continuellement. Commence simple, itère, et ajoute des couches quand tu en as besoin, pas avant.