Pulse Adaptive Polling – Phase 2 Summary
Executive Summary
Phase 2 delivers adaptive polling infrastructure that dynamically adjusts monitoring intervals based on instance freshness, error rates, and system load. The scheduler replaces fixed cadences with priority-based execution, cutting poll volume on idle instances while maintaining data freshness.
Completed Tasks (8/10 - 80%)
✅ Task 1: Poll Cycle Metrics
- 7 new Prometheus metrics (poll count, duration, staleness, queue depth, in-flight, errors, last-success timestamp)
- Per-instance tracking with histogram/counter/gauge types
- Integrated into all poll functions (PVE/PBS/PMG)
- Metrics server on port 9091
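To make the wiring concrete, here is a minimal Go sketch of how these metrics could be declared and served with the Prometheus client library. The metric names match the catalog later in this summary; the label sets, package name, and `Serve` helper are illustrative assumptions, not Pulse's actual code.

```go
package pollmetrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Three of the seven metrics, one per metric type (counter/histogram/gauge).
var (
	PollTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "pulse_monitor_poll_total",
		Help: "Poll attempts by instance and outcome.",
	}, []string{"instance", "type", "outcome"})

	PollDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "pulse_monitor_poll_duration_seconds",
		Help:    "Poll latency per instance.",
		Buckets: prometheus.DefBuckets,
	}, []string{"instance", "type"})

	PollStaleness = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "pulse_monitor_poll_staleness_seconds",
		Help: "Seconds since the last successful poll.",
	}, []string{"instance", "type"})
)

// Serve exposes the default registry on the metrics port named above.
func Serve() error {
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(":9091", nil)
}
```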
✅ Task 2: Adaptive Scheduler Scaffold
- Pluggable interfaces (StalenessSource, IntervalSelector, TaskEnqueuer)
- BuildPlan generates ordered task lists with NextRun times
- FilterDue/DispatchDue for queue management
- Default no-op implementations for gradual rollout
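The summary names the pluggable seams but not their shapes, so the following sketch is a guess at plausible Go signatures. Only the names StalenessSource, IntervalSelector, and TaskEnqueuer come from this document; the Task fields and the FilterDue helper are hypothetical.

```go
package scheduler

import "time"

// Task is one planned poll; fields here are assumptions for illustration.
type Task struct {
	InstanceID string
	Type       string // "pve" | "pbs" | "pmg"
	NextRun    time.Time
	Priority   int
}

// StalenessSource reports normalized staleness (0-1) per instance.
type StalenessSource interface {
	Staleness(instanceID string) float64
}

// IntervalSelector picks the next polling interval for an instance.
type IntervalSelector interface {
	Interval(instanceID string, staleness float64) time.Duration
}

// TaskEnqueuer accepts planned tasks for execution.
type TaskEnqueuer interface {
	Enqueue(t Task)
}

// FilterDue returns the subset of a plan whose NextRun has arrived.
func FilterDue(plan []Task, now time.Time) []Task {
	due := plan[:0:0]
	for _, t := range plan {
		if !t.NextRun.After(now) {
			due = append(due, t)
		}
	}
	return due
}
```

Default no-op implementations of these interfaces let the scheduler ship dark and be swapped in behind the feature flag.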
✅ Task 3: Configuration & Feature Flags
- `ADAPTIVE_POLLING_ENABLED` feature flag (default: false)
- Min/max/base interval tuning (5s / 5m / 10s defaults)
- Environment variable overrides
- Persisted in system.json
- Validation logic (min ≤ base ≤ max)
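A minimal sketch of this configuration surface, assuming a struct loaded from environment variables and validated at startup. Only `ADAPTIVE_POLLING_ENABLED` and the 5s/10s/5m defaults come from this summary; the struct, field names, and loader are illustrative.

```go
package config

import (
	"fmt"
	"os"
	"time"
)

// AdaptivePolling holds the interval tuning knobs (names are assumptions).
type AdaptivePolling struct {
	Enabled      bool
	MinInterval  time.Duration // default 5s
	BaseInterval time.Duration // default 10s
	MaxInterval  time.Duration // default 5m
}

func DefaultAdaptivePolling() AdaptivePolling {
	return AdaptivePolling{
		Enabled:      os.Getenv("ADAPTIVE_POLLING_ENABLED") == "true",
		MinInterval:  5 * time.Second,
		BaseInterval: 10 * time.Second,
		MaxInterval:  5 * time.Minute,
	}
}

// Validate enforces the documented invariant min <= base <= max.
func (c AdaptivePolling) Validate() error {
	if c.MinInterval > c.BaseInterval || c.BaseInterval > c.MaxInterval {
		return fmt.Errorf("intervals must satisfy min <= base <= max, got %s/%s/%s",
			c.MinInterval, c.BaseInterval, c.MaxInterval)
	}
	return nil
}
```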
✅ Task 4: Staleness Tracker
- Per-instance freshness metadata (last success/error/mutation)
- SHA1 change hash detection
- Normalized staleness scoring (0-1 scale)
- Integration with PollMetrics for authoritative timestamps
- Updates from all poll result handlers
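A simplified illustration of the tracker's two core ideas: a SHA-1 hash of the poll payload detects mutation, and staleness is normalized to a 0-1 score against a saturation horizon. The type and method names are assumptions; the real tracker also records errors and consults PollMetrics for authoritative timestamps.

```go
package staleness

import (
	"crypto/sha1"
	"sync"
	"time"
)

type entry struct {
	lastSuccess time.Time
	lastHash    [sha1.Size]byte
}

type Tracker struct {
	mu      sync.Mutex
	entries map[string]*entry
	horizon time.Duration // age at which staleness saturates at 1.0
}

func NewTracker(horizon time.Duration) *Tracker {
	return &Tracker{entries: map[string]*entry{}, horizon: horizon}
}

// RecordSuccess stores the poll time and reports whether the payload changed.
func (t *Tracker) RecordSuccess(id string, payload []byte, at time.Time) (changed bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.entries[id]
	if !ok {
		e = &entry{}
		t.entries[id] = e
	}
	h := sha1.Sum(payload)
	changed = h != e.lastHash
	e.lastHash = h
	e.lastSuccess = at
	return changed
}

// Score returns normalized staleness: 0 = just polled, 1 = at or past horizon.
func (t *Tracker) Score(id string, now time.Time) float64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.entries[id]
	if !ok || e.lastSuccess.IsZero() {
		return 1.0 // never polled successfully: maximally stale
	}
	s := float64(now.Sub(e.lastSuccess)) / float64(t.horizon)
	if s > 1 {
		s = 1
	}
	return s
}
```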
✅ Task 5: Adaptive Interval Logic
- EMA smoothing (alpha=0.6) to prevent oscillations
- Staleness-based interpolation (min-max range)
- Error penalty (interval shrinks 0.6× per consecutive failure) for faster recovery detection
- Queue-depth stretch (+10% per queued task) for backpressure
- ±5% jitter to avoid thundering herd
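Putting those factors together, one plausible composition is sketched below. The constants (alpha 0.6, 0.6× penalty, 0.1 stretch per task, ±5% jitter) come from this summary; the order of operations and the final clamping are assumptions about the actual formula.

```go
package interval

import (
	"math/rand"
	"time"
)

const (
	emaAlpha     = 0.6  // weight of the newest sample in the EMA
	errorPenalty = 0.6  // interval multiplier per consecutive failure
	queueStretch = 0.1  // +10% interval per queued task (backpressure)
	jitterFrac   = 0.05 // ±5% jitter to avoid thundering herd
)

// Next computes the next polling interval from the previous one plus the
// current staleness score (0-1), consecutive failure count, and queue depth.
func Next(prev, min, max time.Duration, staleness float64, failures, queueDepth int) time.Duration {
	// Staleness interpolation: fresh data polls at max, stale data at min.
	raw := float64(max) - staleness*float64(max-min)

	// Error penalty: shrink the interval per failure so recovery is seen sooner.
	for i := 0; i < failures; i++ {
		raw *= errorPenalty
	}

	// Queue-depth stretch: back off when the queue is deep.
	raw *= 1 + queueStretch*float64(queueDepth)

	// EMA smoothing against the previous interval to prevent oscillation.
	smoothed := emaAlpha*raw + (1-emaAlpha)*float64(prev)

	// Apply jitter, then clamp to the configured bounds.
	smoothed *= 1 + jitterFrac*(2*rand.Float64()-1)
	d := time.Duration(smoothed)
	if d < min {
		d = min
	}
	if d > max {
		d = max
	}
	return d
}
```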
✅ Task 6: Priority Queue Execution
- Min-heap (container/heap) ordered by NextRun + Priority
- Worker goroutines with WaitNext() blocking
- Tasks only execute when due (respects adaptive intervals)
- Automatic rescheduling after execution
- Upsert semantics prevent duplicates
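A minimal container/heap ordering that matches this description: earliest NextRun first, higher Priority breaking ties. The upsert bookkeeping and blocking WaitNext() are omitted, and the Task fields shown are assumptions.

```go
package queue

import (
	"container/heap"
	"time"
)

type Task struct {
	InstanceID string
	NextRun    time.Time
	Priority   int
}

// taskHeap implements heap.Interface; tasks are (re)scheduled with heap.Push.
type taskHeap []*Task

func (h taskHeap) Len() int { return len(h) }
func (h taskHeap) Less(i, j int) bool {
	if !h[i].NextRun.Equal(h[j].NextRun) {
		return h[i].NextRun.Before(h[j].NextRun)
	}
	return h[i].Priority > h[j].Priority // higher priority first on ties
}
func (h taskHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)   { *h = append(*h, x.(*Task)) }
func (h *taskHeap) Pop() any {
	old := *h
	n := len(old)
	t := old[n-1]
	*h = old[:n-1]
	return t
}

// PopDue removes and returns the head task only if it is due, which is how
// the queue respects adaptive intervals: workers never run tasks early.
func PopDue(h *taskHeap, now time.Time) (*Task, bool) {
	if h.Len() == 0 || (*h)[0].NextRun.After(now) {
		return nil, false
	}
	return heap.Pop(h).(*Task), true
}
```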
✅ Task 7: Error Handling & Circuit Breakers
- Circuit breaker with closed/open/half-open states
- Trips after 3 consecutive failures
- Exponential backoff (5s initial, 2x multiplier, 5min max)
- ±20% jitter on retry delays
- Dead-letter queue after 5 transient failures
- Error classification (transient vs permanent)
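The retry delay described above reduces to a pure function: exponential from 5s with a 2× multiplier, capped at 5 minutes, with ±20% jitter applied last. The state machine around it (closed/open/half-open, trip after 3 failures, DLQ after 5) is omitted in this sketch.

```go
package breaker

import (
	"math/rand"
	"time"
)

const (
	initialDelay = 5 * time.Second
	maxDelay     = 5 * time.Minute
	multiplier   = 2.0
	jitterFrac   = 0.20 // ±20% jitter on retry delays
	tripAfter    = 3    // consecutive failures before the breaker opens
)

// RetryDelay returns the backoff for the n-th consecutive failure (n >= 1).
func RetryDelay(n int) time.Duration {
	d := float64(initialDelay)
	for i := 1; i < n; i++ {
		d *= multiplier
		if d >= float64(maxDelay) {
			d = float64(maxDelay)
			break
		}
	}
	// Jitter last, so retries from many instances do not align.
	d *= 1 + jitterFrac*(2*rand.Float64()-1)
	if d > float64(maxDelay) {
		d = float64(maxDelay)
	}
	return time.Duration(d)
}
```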
✅ Task 10: Documentation
- Architecture guide in `docs/monitoring/ADAPTIVE_POLLING.md`
- Configuration reference
- Metrics catalog
- Operational guidance & troubleshooting
- Rollout plan (dev → staged → full)
Deferred Tasks (2/10 - 20%)
⏭ Task 8: API Surfaces (Future Phase)
- Scheduler health endpoint
- Dead-letter queue inspection/management
- Circuit breaker state visibility
- UI dashboard integration
⏭ Task 9: Testing Harness (Future Phase)
- Unit tests for scheduler math
- Integration tests with mock instances
- Soak tests for queue stability
- Regression suite for Phase 1 hardening
Key Metrics Delivered
| Metric | Purpose |
|---|---|
| `pulse_monitor_poll_total` | Success/error rate tracking |
| `pulse_monitor_poll_duration_seconds` | Latency per instance |
| `pulse_monitor_poll_staleness_seconds` | Data freshness indicator |
| `pulse_monitor_poll_queue_depth` | Backpressure monitoring |
| `pulse_monitor_poll_inflight` | Concurrency tracking |
| `pulse_monitor_poll_errors_total` | Error type classification |
| `pulse_monitor_poll_last_success_timestamp` | Recovery timeline |
Technical Achievements
Performance:
- Adaptive intervals reduce unnecessary polls on idle instances
- Queue-based execution prevents task pile-up
- Circuit breakers stop hot loops on failing endpoints
Reliability:
- Exponential backoff with jitter prevents thundering herd
- Dead-letter queue isolates persistent failures
- Transient error retry logic (5 attempts before DLQ)
Observability:
- 7 Prometheus metrics for complete visibility
- Structured logging for all state transitions
- Tamper-evident audit trail (from Phase 1)
Deployment Status
Current State: Feature flag disabled by default (`ADAPTIVE_POLLING_ENABLED=false`)
Activation Path:
- Enable flag in dev/QA environment
- Observe metrics for 24-48 hours
- Staged rollout to subset of production clusters
- Full activation after validation
Rollback: Set `ADAPTIVE_POLLING_ENABLED=false` to revert to fixed intervals
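The rollback contract reduces to a single startup gate. A minimal sketch, assuming the flag is read once at boot (the two loop entry points are hypothetical stand-ins, not Pulse's actual function names):

```go
package main

import (
	"log"
	"os"
)

func main() {
	if os.Getenv("ADAPTIVE_POLLING_ENABLED") == "true" {
		log.Println("adaptive scheduler enabled")
		// runAdaptiveScheduler() (hypothetical entry point)
	} else {
		log.Println("using fixed polling intervals")
		// runFixedPolling() (hypothetical legacy path)
	}
}
```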
Git Commits
- `c048e7b9b` - Tasks 1-3: Metrics + scheduler + config
- `8ce93c1df` - Task 4: Staleness tracker
- `e8bd79c6c` - Task 5: Adaptive interval logic
- `1d6fa9188` - Task 6: Priority queue execution
- `7d9aaa406` - Task 7: Circuit breakers & backoff
- `[current]` - Task 10: Documentation
Phase 2 Success Criteria ✅
- Metrics pipeline operational
- Scheduler produces valid task plans
- Queue respects adaptive intervals
- Circuit breakers prevent runaway failures
- Documentation enables ops team rollout
- Feature flag allows safe activation
Known Limitations
- Dead-letter queue state lost on restart (no persistence yet)
- Circuit breaker state not exposed via API (Task 8)
- No automated test coverage (Task 9)
- Queue depth metric updates once per poll cycle (not in real time within a cycle)
Next Steps
Immediate (Post-Phase 2):
- Deploy to dev environment with flag enabled
- Configure Grafana dashboards for new metrics
- Set alerting thresholds (queue depth >50, staleness >60s)
Future Phases:
- Task 8: REST API for scheduler introspection
- Task 9: Comprehensive test suite
- Phase 3: External sentinels and cross-cluster coordination