📊 Prometheus Metrics

Pulse exposes metrics at /metrics (default port 9091).

Example scrape target:

  • http://<pulse-host>:9091/metrics

This listener is separate from the main UI/API port (7655). In Docker and Kubernetes you must expose 9091 explicitly if you want to scrape it from outside the container/pod.
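
As a quick check that the metrics listener is reachable, the scrape target above can be fetched and parsed with a few lines of Python. This is a minimal sketch, not part of Pulse: the helper names (`metrics_url`, `fetch_metrics`, `parse_samples`) are hypothetical, and the parser assumes well-formed Prometheus text exposition output where each sample line is a series name, optional `{labels}`, and a value.

```python
import urllib.request


def metrics_url(host: str, port: int = 9091) -> str:
    """Build the scrape URL; 9091 is the default metrics port."""
    return f"http://{host}:{port}/metrics"


def fetch_metrics(host: str, port: int = 9091, timeout: float = 5.0) -> str:
    """Download the raw text exposition payload from the metrics listener."""
    with urllib.request.urlopen(metrics_url(host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text: str) -> dict:
    """Map each series (name plus optional {labels}) to its float value.

    Simplified on purpose: comment lines are skipped and any trailing
    timestamp is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        # The series ends at the closing brace of the label block,
        # or at the first space when the line has no labels.
        end = line.rfind("}") + 1 if "{" in line else line.find(" ")
        series, rest = line[:end], line[end:].split()
        samples[series] = float(rest[0])
    return samples
```

For example, `parse_samples(fetch_metrics("pulse.example.internal"))` returns a dictionary keyed by series, where `pulse.example.internal` is a placeholder host. If the call times out from outside a container or pod, check that port 9091 is published.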

🌐 HTTP Ingress

| Metric | Type | Description |
| --- | --- | --- |
| `pulse_http_request_duration_seconds` | Histogram | Latency buckets by method, route, status. |
| `pulse_http_requests_total` | Counter | Total requests. |
| `pulse_http_request_errors_total` | Counter | 4xx/5xx errors. |

🔄 Polling & Nodes

| Metric | Type | Description |
| --- | --- | --- |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. |
| `pulse_monitor_node_poll_total` | Counter | Success/error counts per node. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last success. |
| `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. |

🧠 Scheduler Health

| Metric | Type | Description |
| --- | --- | --- |
| `pulse_scheduler_queue_depth` | Gauge | Queue depth per instance type. |
| `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth per instance. |
| `pulse_scheduler_breaker_state` | Gauge | 0=Closed, 1=Half-Open, 2=Open. |
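
When the breaker gauge is consumed outside Prometheus, for instance in a script or a dashboard helper, the numeric encoding has to be translated back into a state name. The sketch below shows one way to do that; `breaker_state_name` is a hypothetical helper and only encodes the 0/1/2 mapping documented for `pulse_scheduler_breaker_state`.

```python
# Encoding documented for pulse_scheduler_breaker_state.
BREAKER_STATES = {0: "Closed", 1: "Half-Open", 2: "Open"}


def breaker_state_name(value: float) -> str:
    """Translate the gauge value into a readable circuit breaker state.

    Scraped values arrive as floats; a float such as 2.0 hashes and compares
    equal to the int key 2, so the lookup works without conversion. Anything
    outside the documented range (including NaN) maps to "Unknown".
    """
    return BREAKER_STATES.get(value, "Unknown")
```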

Diagnostics Cache

| Metric | Type | Description |
| --- | --- | --- |
| `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. |

🚨 Alerting Examples

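  • High Error Rate: rate(pulse_http_request_errors_total[5m]) > 0.05 (more than 0.05 errors per second, averaged over 5 minutes; this is an absolute rate, not a fraction of total requests)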
  • Stale Node: pulse_monitor_node_poll_staleness_seconds > 300
  • Breaker Open: pulse_scheduler_breaker_state == 2
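
To make the three thresholds concrete, the following sketch evaluates them in plain Python. It is illustrative only: `per_second_rate` is a simplified stand-in for PromQL's `rate()` (two readings and basic counter-reset handling, without Prometheus's extrapolation), and the function and constant names are hypothetical.

```python
MAX_ERRORS_PER_SECOND = 0.05  # High Error Rate threshold
STALE_AFTER_SECONDS = 300     # Stale Node threshold
BREAKER_OPEN = 2              # pulse_scheduler_breaker_state value for Open


def per_second_rate(previous: float, current: float, elapsed_seconds: float) -> float:
    """Approximate a counter's per-second increase between two readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # A counter that went down was reset; count only what accrued since then.
    increase = current - previous if current >= previous else current
    return increase / elapsed_seconds


def firing_alerts(error_rate: float, staleness_seconds: float, breaker_state: float) -> list:
    """Return the names of the example alerts whose condition currently holds."""
    alerts = []
    if error_rate > MAX_ERRORS_PER_SECOND:
        alerts.append("High Error Rate")
    if staleness_seconds > STALE_AFTER_SECONDS:
        alerts.append("Stale Node")
    if breaker_state == BREAKER_OPEN:
        alerts.append("Breaker Open")
    return alerts
```

Note that 0.05 errors per second over a 5 minute window corresponds to 15 errors, so the first alert fires once more than 15 errors accumulate in that window.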