📊 Prometheus Metrics

Pulse exposes metrics at /metrics (default port 9091).

Example scrape target:

http://<pulse-host>:9091/metrics

This listener is separate from the main UI/API port (7655). In Docker and Kubernetes you must expose 9091 explicitly if you want to scrape it from outside the container/pod.

Helm note: the current chart exposes only port 7655, so Prometheus scraping requires an additional Service that targets 9091 (and a matching ServiceMonitor).

🌐 HTTP Ingress

Metric	Type	Description
`pulse_http_request_duration_seconds`	Histogram	Latency buckets by `method`, `route`, `status`.
`pulse_http_requests_total`	Counter	Total requests by `method`, `route`, `status`.
`pulse_http_request_errors_total`	Counter	Error totals by `method`, `route`, `status_class` (`client_error`, `server_error`, `none`).

🔄 Polling & Nodes

Metric	Type	Description
`pulse_monitor_poll_duration_seconds`	Histogram	Per-instance poll latency.
`pulse_monitor_poll_total`	Counter	Success/error counts per instance (`result` label).
`pulse_monitor_poll_errors_total`	Counter	Poll failures by `error_type`.
`pulse_monitor_poll_last_success_timestamp`	Gauge	Unix timestamp of last success.
`pulse_monitor_poll_staleness_seconds`	Gauge	Seconds since last success (`-1` if never succeeded).
`pulse_monitor_poll_queue_depth`	Gauge	Global queue depth.
`pulse_monitor_poll_inflight`	Gauge	In-flight polls by `instance_type`.
`pulse_monitor_node_poll_duration_seconds`	Histogram	Per-node poll latency.
`pulse_monitor_node_poll_total`	Counter	Success/error counts per node (`result` label).
`pulse_monitor_node_poll_errors_total`	Counter	Node poll failures by `error_type`.
`pulse_monitor_node_poll_last_success_timestamp`	Gauge	Unix timestamp of last node success.
`pulse_monitor_node_poll_staleness_seconds`	Gauge	Seconds since last node success (`-1` if never succeeded).

🧠 Scheduler Health

Metric	Type	Description
`pulse_scheduler_queue_due_soon`	Gauge	Tasks due within the next 12 seconds.
`pulse_scheduler_queue_depth`	Gauge	Queue depth per `instance_type`.
`pulse_scheduler_queue_wait_seconds`	Histogram	Wait time between task readiness and execution.
`pulse_scheduler_dead_letter_depth`	Gauge	DLQ depth by `instance_type` and `instance`.
`pulse_scheduler_breaker_state`	Gauge	`0`=Closed, `1`=Half-Open, `2`=Open, `-1`=Unknown.
`pulse_scheduler_breaker_failure_count`	Gauge	Consecutive failure count.
`pulse_scheduler_breaker_retry_seconds`	Gauge	Seconds until next retry allowed.

⚡ Diagnostics Cache

Metric	Type	Description
`pulse_diagnostics_cache_hits_total`	Counter	Cache hits.
`pulse_diagnostics_cache_misses_total`	Counter	Cache misses.
`pulse_diagnostics_refresh_duration_seconds`	Histogram	Refresh latency.

🚨 Alert Lifecycle

Metric	Type	Description
`pulse_alerts_active`	Gauge	Active alerts by `level` and `type`.
`pulse_alerts_fired_total`	Counter	Total alerts fired by `level` and `type`.
`pulse_alerts_resolved_total`	Counter	Total alerts resolved by `type`.
`pulse_alerts_acknowledged_total`	Counter	Total alerts acknowledged.
`pulse_alerts_suppressed_total`	Counter	Alerts suppressed by `reason` (quiet_hours, rate_limit, duplicate, etc.).
`pulse_alerts_rate_limited_total`	Counter	Alerts suppressed due to rate limiting.
`pulse_alert_duration_seconds`	Histogram	Time from alert fire to resolve (by `type`).

🚨 Alerting Examples

High Error Rate: rate(pulse_http_request_errors_total[5m]) > 0.05
Stale Node: pulse_monitor_node_poll_staleness_seconds > 300
Breaker Open: pulse_scheduler_breaker_state == 2

3.9 KiB Raw Blame History