Pulse/docs/monitoring/PROMETHEUS_METRICS.md
rcourtman e0396c1362 docs: update documentation for diagnostics improvements
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.

**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
  Prometheus metrics with alert suggestions

**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
  explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
  using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
  with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
  defaults

These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards

Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
2025-10-21 12:45:19 +00:00


# Pulse Prometheus Metrics (v4.24.0+)

Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules.


## HTTP Request Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). |
| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. |
| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. |

**Alert suggestion:**

`rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops.
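The suggestion above could be wired into Prometheus as an alerting rule roughly like the sketch below; the group name, `for` duration, severity label, and annotation text are illustrative placeholders, not Pulse defaults:

```yaml
groups:
  - name: pulse-http-alerts        # placeholder group name
    rules:
      - alert: PulseHTTPServerErrorRateHigh
        # >0.05 errors/sec sustained over 5m ≈ more than ~3 server errors/min
        expr: rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Pulse server error rate high on {{ $labels.route }}"
```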


## Per-Node Poll Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration of each node poll. |
| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. |
| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error-type breakdown (connection, auth, internal, etc.). |
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of the last successful poll. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since the last success (-1 means no success yet). |

**Alert suggestion:**

`max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes.
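As a rule-file sketch, the staleness check might look like this; the group name, severity, and annotation wording are assumptions to tune per environment:

```yaml
groups:
  - name: pulse-node-polls         # placeholder group name
    rules:
      - alert: PulseNodePollStale
        # no successful poll for this node in 5+ minutes
        expr: max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} on {{ $labels.instance }} has not polled successfully for 5+ minutes"
```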


## Scheduler Health Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_scheduler_queue_due_soon` | Gauge | (none) | Number of tasks due within 12 seconds. |
| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). |
| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. |
| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. |
| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit-breaker state: 0 = closed, 1 = half-open, 2 = open, -1 = unknown. |
| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. |
| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker allows the next attempt. |

**Alert suggestions:**

- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > <instance count * 1.5>`
- DLQ growth: `delta(pulse_scheduler_dead_letter_depth[10m]) > 0` (the depth is a gauge, so `delta` rather than `increase` is the appropriate function)
- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for more than 10 minutes.
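The "breaker stuck open" condition maps naturally onto a rule with a `for` clause, so it fires only after the breaker has been open continuously. This sketch uses placeholder group and severity names:

```yaml
groups:
  - name: pulse-scheduler          # placeholder group name
    rules:
      - alert: PulseBreakerStuckOpen
        # 2 = open; page only if the breaker stays open for 10 minutes
        expr: pulse_scheduler_breaker_state == 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open for {{ $labels.instance }} ({{ $labels.instance_type }})"
```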

## Diagnostics Cache Metrics

| Metric | Type | Description |
| --- | --- | --- |
| `pulse_diagnostics_cache_hits_total` | Counter | Diagnostics requests served from cache. |
| `pulse_diagnostics_cache_misses_total` | Counter | Requests that triggered a fresh probe. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Time taken to refresh the diagnostics payload. |

**Alert suggestion:**

A spike in `rate(pulse_diagnostics_cache_misses_total[5m])` alongside refresh durations above ~20 s — e.g. `histogram_quantile(0.9, rate(pulse_diagnostics_refresh_duration_seconds_bucket[5m])) > 20` — can signal upstream slowness.
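For dashboards, a cache hit ratio is often more readable than the raw counters. A recording-rule sketch follows; the rule name is a suggested naming convention, not something shipped with Pulse:

```yaml
groups:
  - name: pulse-diagnostics        # placeholder group name
    rules:
      # fraction of diagnostics requests served from cache over 5m
      - record: pulse:diagnostics_cache_hit_ratio:rate5m
        expr: >
          rate(pulse_diagnostics_cache_hits_total[5m])
          /
          (rate(pulse_diagnostics_cache_hits_total[5m])
           + rate(pulse_diagnostics_cache_misses_total[5m]))
```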


## Existing Instance-Level Poll Metrics (for completeness)

The following metrics pre-date v4.24.0 but remain essential:

| Metric | Type | Description |
| --- | --- | --- |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since the last successful poll (instance-level). |
| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. |

Refer to this document whenever you build dashboards or craft alert policies. Scrape all metrics from the Pulse backend's `/metrics` endpoint (port 9091 by default for systemd installs).
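For reference, a minimal Prometheus scrape configuration for a systemd install might look like this; the target hostname is a placeholder for your Pulse host:

```yaml
scrape_configs:
  - job_name: pulse
    static_configs:
      # replace with your Pulse backend address; 9091 is the systemd default
      - targets: ["pulse-host.internal:9091"]
```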