mirror of https://github.com/rcourtman/Pulse.git synced 2026-02-18 00:17:39 +01:00

Files

rcourtman 160adeb3b8 feat: add scheduler health API endpoint (Phase 2 Task 8)

Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.

2025-10-20 15:13:38 +00:00

5.5 KiB

Raw Blame History

Pulse Adaptive Polling – Phase 2 Summary

Executive Summary

Phase 2 delivers adaptive polling infrastructure that dynamically adjusts monitoring intervals based on instance freshness, error rates, and system load. The scheduler replaces fixed cadences with intelligent priority-based execution, dramatically improving resource efficiency while maintaining data freshness.

Completed Tasks (8/10 - 80%)

✅ Task 1: Poll Cycle Metrics

7 new Prometheus metrics (duration, staleness, queue depth, in-flight, errors)
Per-instance tracking with histogram/counter/gauge types
Integrated into all poll functions (PVE/PBS/PMG)
Metrics server on port 9091

✅ Task 2: Adaptive Scheduler Scaffold

Pluggable interfaces (StalenessSource, IntervalSelector, TaskEnqueuer)
BuildPlan generates ordered task lists with NextRun times
FilterDue/DispatchDue for queue management
Default no-op implementations for gradual rollout

✅ Task 3: Configuration & Feature Flags

ADAPTIVE_POLLING_ENABLED feature flag (default: false)
Min/max/base interval tuning (5s / 5m / 10s defaults)
Environment variable overrides
Persisted in system.json
Validation logic (min ≤ base ≤ max)

✅ Task 4: Staleness Tracker

Per-instance freshness metadata (last success/error/mutation)
SHA1 change hash detection
Normalized staleness scoring (0-1 scale)
Integration with PollMetrics for authoritative timestamps
Updates from all poll result handlers

✅ Task 5: Adaptive Interval Logic

EMA smoothing (alpha=0.6) to prevent oscillations
Staleness-based interpolation (min-max range)
Error penalty (0.6x per failure) for faster recovery detection
Queue depth stretch (0.1x per task) for backpressure
±5% jitter to avoid thundering herd

✅ Task 6: Priority Queue Execution

Min-heap (container/heap) ordered by NextRun + Priority
Worker goroutines with WaitNext() blocking
Tasks only execute when due (respects adaptive intervals)
Automatic rescheduling after execution
Upsert semantics prevent duplicates

✅ Task 7: Error Handling & Circuit Breakers

Circuit breaker with closed/open/half-open states
Trips after 3 consecutive failures
Exponential backoff (5s initial, 2x multiplier, 5min max)
±20% jitter on retry delays
Dead-letter queue after 5 transient failures
Error classification (transient vs permanent)

✅ Task 10: Documentation

Architecture guide in docs/monitoring/ADAPTIVE_POLLING.md
Configuration reference
Metrics catalog
Operational guidance & troubleshooting
Rollout plan (dev → staged → full)

Deferred Tasks (2/10 - 20%)

⏭ Task 8: API Surfaces (Future Phase)

Scheduler health endpoint
Dead-letter queue inspection/management
Circuit breaker state visibility
UI dashboard integration

⏭ Task 9: Testing Harness (Future Phase)

Unit tests for scheduler math
Integration tests with mock instances
Soak tests for queue stability
Regression suite for Phase 1 hardening

Key Metrics Delivered

Metric	Purpose
`pulse_monitor_poll_total`	Success/error rate tracking
`pulse_monitor_poll_duration_seconds`	Latency per instance
`pulse_monitor_poll_staleness_seconds`	Data freshness indicator
`pulse_monitor_poll_queue_depth`	Backpressure monitoring
`pulse_monitor_poll_inflight`	Concurrency tracking
`pulse_monitor_poll_errors_total`	Error type classification
`pulse_monitor_poll_last_success_timestamp`	Recovery timeline

Technical Achievements

Performance:

Adaptive intervals reduce unnecessary polls on idle instances
Queue-based execution prevents task pile-up
Circuit breakers stop hot loops on failing endpoints

Reliability:

Exponential backoff with jitter prevents thundering herd
Dead-letter queue isolates persistent failures
Transient error retry logic (5 attempts before DLQ)

Observability:

7 Prometheus metrics for complete visibility
Structured logging for all state transitions
Tamper-evident audit trail (from Phase 1)

Deployment Status

Current State: Feature flag disabled by default (ADAPTIVE_POLLING_ENABLED=false)

Activation Path:

Enable flag in dev/QA environment
Observe metrics for 24-48 hours
Staged rollout to subset of production clusters
Full activation after validation

Rollback: Set ADAPTIVE_POLLING_ENABLED=false to revert to fixed intervals

Git Commits

c048e7b9b - Tasks 1-3: Metrics + scheduler + config
8ce93c1df - Task 4: Staleness tracker
e8bd79c6c - Task 5: Adaptive interval logic
1d6fa9188 - Task 6: Priority queue execution
7d9aaa406 - Task 7: Circuit breakers & backoff
[current] - Task 10: Documentation

Phase 2 Success Criteria ✅

Metrics pipeline operational
Scheduler produces valid task plans
Queue respects adaptive intervals
Circuit breakers prevent runaway failures
Documentation enables ops team rollout
Feature flag allows safe activation

Known Limitations

Dead-letter queue state lost on restart (no persistence yet)
Circuit breaker state not exposed via API (Task 8)
No automated test coverage (Task 9)
Queue depth metric updated per-cycle (not real-time within cycle)

Next Steps

Immediate (Post-Phase 2):

Deploy to dev environment with flag enabled
Configure Grafana dashboards for new metrics
Set alerting thresholds (queue depth >50, staleness >60s)

Future Phases:

Task 8: REST API for scheduler introspection
Task 9: Comprehensive test suite
Phase 3: External sentinels and cross-cluster coordination

5.5 KiB Raw Blame History Unescape Escape