Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
# Pulse Prometheus Metrics (v4.24.0+)
Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules.
## HTTP Request Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). |
| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. |
| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. |
**Alert suggestion:** `rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops.
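As a starting point, that suggestion maps onto a Prometheus rule file like the sketch below. The alert name, `for` window, and severity label are illustrative placeholders, not anything Pulse ships:

```yaml
# Illustrative alert for sustained server errors; tune names and thresholds.
groups:
  - name: pulse-http
    rules:
      - alert: PulseHTTPServerErrorRate
        expr: rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Pulse is returning >~3 server errors/min on {{ $labels.route }}"
```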
## Per-Node Poll Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration of each node poll. |
| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. |
| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error-type breakdown (connection, auth, internal, etc.). |
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of the last successful poll. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since the last success (−1 means no success yet). |
**Alert suggestion:** `max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes.
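A rule sketch for the staleness check, again with placeholder names and severities. Note that the −1 "no success yet" sentinel never crosses the threshold, so freshly added nodes will not fire it:

```yaml
# Illustrative staleness alert; alert name and labels are placeholders.
groups:
  - name: pulse-node-polls
    rules:
      - alert: PulseNodePollStale
        expr: max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300
        labels:
          severity: warn
        annotations:
          summary: "Node {{ $labels.node }} on {{ $labels.instance }} stale for 5+ minutes"
```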
## Scheduler Health Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_scheduler_queue_due_soon` | Gauge | — | Number of tasks due within 12 seconds. |
| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). |
| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. |
| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. |
| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit breaker state: 0=closed, 1=half-open, 2=open, -1=unknown. |
| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. |
| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker allows the next attempt. |
**Alert suggestions** (a rule sketch follows the list):

- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > <instance count * 1.5>`
- DLQ growth: `increase(pulse_scheduler_dead_letter_depth[10m]) > 0`
- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for more than 10 minutes
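Taken together, those three suggestions might look like the following rule group. Alert names, severities, and the fixed queue-depth threshold are placeholders to adapt; the queue threshold in particular must be derived from your own instance count:

```yaml
# Illustrative scheduler alerts; all names and thresholds are placeholders.
groups:
  - name: pulse-scheduler
    rules:
      - alert: PulseSchedulerQueueSaturated
        # Replace 15 with roughly 1.5x your monitored instance count.
        expr: max_over_time(pulse_scheduler_queue_depth[10m]) > 15
        for: 10m
        labels:
          severity: warn
      - alert: PulseSchedulerDLQGrowing
        expr: increase(pulse_scheduler_dead_letter_depth[10m]) > 0
        labels:
          severity: warn
      - alert: PulseSchedulerBreakerStuckOpen
        expr: pulse_scheduler_breaker_state == 2
        for: 10m
        labels:
          severity: page
```

For dashboards, the wait histogram pairs well with a p95 panel, e.g. `histogram_quantile(0.95, sum by (instance_type, le) (rate(pulse_scheduler_queue_wait_seconds_bucket[5m])))`, assuming the standard `_bucket` series a Prometheus histogram exposes.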
## Diagnostics Cache Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_diagnostics_cache_hits_total` | Counter | — | Diagnostics requests served from cache. |
| `pulse_diagnostics_cache_misses_total` | Counter | — | Requests that triggered a fresh probe. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | — | Time taken to refresh the diagnostics payload. |
**Alert suggestion:** a spike in `rate(pulse_diagnostics_cache_misses_total[5m])` alongside refresh durations above 20 s (taken from the `pulse_diagnostics_refresh_duration_seconds` histogram) can signal upstream slowness.
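A recording rule for the hit ratio plus a latency alert could look like this sketch; the rule and alert names are invented here, and the p95 expression assumes the standard `_bucket` series:

```yaml
# Illustrative recording rule and alert; names are placeholders.
groups:
  - name: pulse-diagnostics
    rules:
      - record: pulse:diagnostics_cache_hit_ratio:15m
        expr: |
          sum(rate(pulse_diagnostics_cache_hits_total[15m]))
          /
          (sum(rate(pulse_diagnostics_cache_hits_total[15m]))
           + sum(rate(pulse_diagnostics_cache_misses_total[15m])))
      - alert: PulseDiagnosticsRefreshSlow
        # p95 refresh latency above 20s over a sustained window.
        expr: histogram_quantile(0.95, sum by (le) (rate(pulse_diagnostics_refresh_duration_seconds_bucket[5m]))) > 20
        for: 10m
        labels:
          severity: warn
```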
## Existing Instance-Level Poll Metrics (for completeness)

The following metrics pre-date v4.24.0 but remain essential:

| Metric | Type | Description |
|---|---|---|
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last successful poll (instance-level). |
| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. |
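If you only want one instance-level panel, the p95 poll duration is a reasonable starting point. A recording-rule sketch (the rule name is invented, and the expression assumes standard histogram `_bucket` series):

```yaml
# Illustrative recording rule; the rule name is a placeholder.
groups:
  - name: pulse-instance-polls
    rules:
      - record: pulse:monitor_poll_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(pulse_monitor_poll_duration_seconds_bucket[5m])))
```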
Refer to this document whenever you build dashboards or craft alert policies. All metrics are exposed on the Pulse backend's `/metrics` endpoint (port 9091 by default for systemd installs).
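For completeness, a minimal Prometheus scrape configuration for such an install might look like the sketch below; the job name, target host, and interval are placeholders for your environment:

```yaml
# Illustrative scrape config; hostname and interval are placeholders.
scrape_configs:
  - job_name: pulse
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["pulse.example.internal:9091"]
```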