Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
# Pulse Prometheus Metrics (v4.24.0+)
Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules.
## HTTP Request Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). |
| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. |
| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. |
**Alert suggestion:** `rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops.
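As a starting point, that suggestion maps onto a Prometheus rule file like the sketch below. The alert name, `for` window, and severity label are illustrative placeholders, not anything Pulse ships:

```yaml
# Illustrative alert for sustained server errors; tune names and thresholds.
groups:
  - name: pulse-http
    rules:
      - alert: PulseHTTPServerErrorRate
        expr: rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Pulse is returning >~3 server errors/min on {{ $labels.route }}"
```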
## Per-Node Poll Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration of each node poll. |
| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. |
| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error-type breakdown (connection, auth, internal, etc.). |
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of the last successful poll. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since the last success (−1 means no success yet). |
**Alert suggestion:** `max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes.
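A rule sketch for the staleness check, again with placeholder names and severities. Note that the −1 "no success yet" sentinel never crosses the threshold, so freshly added nodes will not fire it:

```yaml
# Illustrative staleness alert; alert name and labels are placeholders.
groups:
  - name: pulse-node-polls
    rules:
      - alert: PulseNodePollStale
        expr: max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300
        labels:
          severity: warn
        annotations:
          summary: "Node {{ $labels.node }} on {{ $labels.instance }} stale for 5+ minutes"
```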
## Scheduler Health Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_scheduler_queue_due_soon` | Gauge | — | Number of tasks due within 12 seconds. |
| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). |
| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. |
| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. |
| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit breaker state: 0=closed, 1=half-open, 2=open, -1=unknown. |
| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. |
| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker allows the next attempt. |
**Alert suggestions** (a rule sketch follows the list):

- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > <instance count * 1.5>`
- DLQ growth: `increase(pulse_scheduler_dead_letter_depth[10m]) > 0`
- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for more than 10 minutes
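Taken together, those three suggestions might look like the following rule group. Alert names, severities, and the fixed queue-depth threshold are placeholders to adapt; the queue threshold in particular must be derived from your own instance count:

```yaml
# Illustrative scheduler alerts; all names and thresholds are placeholders.
groups:
  - name: pulse-scheduler
    rules:
      - alert: PulseSchedulerQueueSaturated
        # Replace 15 with roughly 1.5x your monitored instance count.
        expr: max_over_time(pulse_scheduler_queue_depth[10m]) > 15
        for: 10m
        labels:
          severity: warn
      - alert: PulseSchedulerDLQGrowing
        expr: increase(pulse_scheduler_dead_letter_depth[10m]) > 0
        labels:
          severity: warn
      - alert: PulseSchedulerBreakerStuckOpen
        expr: pulse_scheduler_breaker_state == 2
        for: 10m
        labels:
          severity: page
```

For dashboards, the wait histogram pairs well with a p95 panel, e.g. `histogram_quantile(0.95, sum by (instance_type, le) (rate(pulse_scheduler_queue_wait_seconds_bucket[5m])))`, assuming the standard `_bucket` series a Prometheus histogram exposes.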
## Diagnostics Cache Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_diagnostics_cache_hits_total` | Counter | — | Diagnostics requests served from cache. |
| `pulse_diagnostics_cache_misses_total` | Counter | — | Requests that triggered a fresh probe. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | — | Time taken to refresh the diagnostics payload. |
**Alert suggestion:** a spike in `rate(pulse_diagnostics_cache_misses_total[5m])` alongside refresh durations above 20 s (taken from the `pulse_diagnostics_refresh_duration_seconds` histogram) can signal upstream slowness.
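A recording rule for the hit ratio plus a latency alert could look like this sketch; the rule and alert names are invented here, and the p95 expression assumes the standard `_bucket` series:

```yaml
# Illustrative recording rule and alert; names are placeholders.
groups:
  - name: pulse-diagnostics
    rules:
      - record: pulse:diagnostics_cache_hit_ratio:15m
        expr: |
          sum(rate(pulse_diagnostics_cache_hits_total[15m]))
          /
          (sum(rate(pulse_diagnostics_cache_hits_total[15m]))
           + sum(rate(pulse_diagnostics_cache_misses_total[15m])))
      - alert: PulseDiagnosticsRefreshSlow
        # p95 refresh latency above 20s over a sustained window.
        expr: histogram_quantile(0.95, sum by (le) (rate(pulse_diagnostics_refresh_duration_seconds_bucket[5m]))) > 20
        for: 10m
        labels:
          severity: warn
```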
## Existing Instance-Level Poll Metrics (for completeness)

The following metrics pre-date v4.24.0 but remain essential:

| Metric | Type | Description |
|---|---|---|
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last successful poll (instance-level). |
| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. |
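If you only want one instance-level panel, the p95 poll duration is a reasonable starting point. A recording-rule sketch (the rule name is invented, and the expression assumes standard histogram `_bucket` series):

```yaml
# Illustrative recording rule; the rule name is a placeholder.
groups:
  - name: pulse-instance-polls
    rules:
      - record: pulse:monitor_poll_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(pulse_monitor_poll_duration_seconds_bucket[5m])))
```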
Refer to this document whenever you build dashboards or craft alert policies. All metrics are exposed on the Pulse backend's `/metrics` endpoint (port 9091 by default for systemd installs).
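For completeness, a minimal Prometheus scrape configuration for such an install might look like the sketch below; the job name, target host, and interval are placeholders for your environment:

```yaml
# Illustrative scrape config; hostname and interval are placeholders.
scrape_configs:
  - job_name: pulse
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["pulse.example.internal:9091"]
```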