Files
Pulse/docs/PHASE2_SUMMARY.md
rcourtman 160adeb3b8 feat: add scheduler health API endpoint (Phase 2 Task 8)
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.
2025-10-20 15:13:38 +00:00

5.5 KiB
Raw Blame History

Pulse Adaptive Polling Phase 2 Summary

Executive Summary

Phase 2 delivers adaptive polling infrastructure that dynamically adjusts monitoring intervals based on instance freshness, error rates, and system load. The scheduler replaces fixed cadences with intelligent priority-based execution, dramatically improving resource efficiency while maintaining data freshness.

Completed Tasks (8/10 - 80%)

Task 1: Poll Cycle Metrics

  • 7 new Prometheus metrics (duration, staleness, queue depth, in-flight, errors)
  • Per-instance tracking with histogram/counter/gauge types
  • Integrated into all poll functions (PVE/PBS/PMG)
  • Metrics server on port 9091

Task 2: Adaptive Scheduler Scaffold

  • Pluggable interfaces (StalenessSource, IntervalSelector, TaskEnqueuer)
  • BuildPlan generates ordered task lists with NextRun times
  • FilterDue/DispatchDue for queue management
  • Default no-op implementations for gradual rollout

Task 3: Configuration & Feature Flags

  • ADAPTIVE_POLLING_ENABLED feature flag (default: false)
  • Min/max/base interval tuning (5s / 5m / 10s defaults)
  • Environment variable overrides
  • Persisted in system.json
  • Validation logic (min ≤ base ≤ max)

Task 4: Staleness Tracker

  • Per-instance freshness metadata (last success/error/mutation)
  • SHA1 change hash detection
  • Normalized staleness scoring (0-1 scale)
  • Integration with PollMetrics for authoritative timestamps
  • Updates from all poll result handlers

Task 5: Adaptive Interval Logic

  • EMA smoothing (alpha=0.6) to prevent oscillations
  • Staleness-based interpolation (min-max range)
  • Error penalty (0.6x per failure) for faster recovery detection
  • Queue depth stretch (0.1x per task) for backpressure
  • ±5% jitter to avoid thundering herd

Task 6: Priority Queue Execution

  • Min-heap (container/heap) ordered by NextRun + Priority
  • Worker goroutines with WaitNext() blocking
  • Tasks only execute when due (respects adaptive intervals)
  • Automatic rescheduling after execution
  • Upsert semantics prevent duplicates

Task 7: Error Handling & Circuit Breakers

  • Circuit breaker with closed/open/half-open states
  • Trips after 3 consecutive failures
  • Exponential backoff (5s initial, 2x multiplier, 5min max)
  • ±20% jitter on retry delays
  • Dead-letter queue after 5 transient failures
  • Error classification (transient vs permanent)

Task 10: Documentation

  • Architecture guide in docs/monitoring/ADAPTIVE_POLLING.md
  • Configuration reference
  • Metrics catalog
  • Operational guidance & troubleshooting
  • Rollout plan (dev → staged → full)

Deferred Tasks (2/10 - 20%)

⏭ Task 8: API Surfaces (Future Phase)

  • Scheduler health endpoint
  • Dead-letter queue inspection/management
  • Circuit breaker state visibility
  • UI dashboard integration

⏭ Task 9: Testing Harness (Future Phase)

  • Unit tests for scheduler math
  • Integration tests with mock instances
  • Soak tests for queue stability
  • Regression suite for Phase 1 hardening

Key Metrics Delivered

Metric Purpose
pulse_monitor_poll_total Success/error rate tracking
pulse_monitor_poll_duration_seconds Latency per instance
pulse_monitor_poll_staleness_seconds Data freshness indicator
pulse_monitor_poll_queue_depth Backpressure monitoring
pulse_monitor_poll_inflight Concurrency tracking
pulse_monitor_poll_errors_total Error type classification
pulse_monitor_poll_last_success_timestamp Recovery timeline

Technical Achievements

Performance:

  • Adaptive intervals reduce unnecessary polls on idle instances
  • Queue-based execution prevents task pile-up
  • Circuit breakers stop hot loops on failing endpoints

Reliability:

  • Exponential backoff with jitter prevents thundering herd
  • Dead-letter queue isolates persistent failures
  • Transient error retry logic (5 attempts before DLQ)

Observability:

  • 7 Prometheus metrics for complete visibility
  • Structured logging for all state transitions
  • Tamper-evident audit trail (from Phase 1)

Deployment Status

Current State: Feature flag disabled by default (ADAPTIVE_POLLING_ENABLED=false)

Activation Path:

  1. Enable flag in dev/QA environment
  2. Observe metrics for 24-48 hours
  3. Staged rollout to subset of production clusters
  4. Full activation after validation

Rollback: Set ADAPTIVE_POLLING_ENABLED=false to revert to fixed intervals

Git Commits

  1. c048e7b9b - Tasks 1-3: Metrics + scheduler + config
  2. 8ce93c1df - Task 4: Staleness tracker
  3. e8bd79c6c - Task 5: Adaptive interval logic
  4. 1d6fa9188 - Task 6: Priority queue execution
  5. 7d9aaa406 - Task 7: Circuit breakers & backoff
  6. [current] - Task 10: Documentation

Phase 2 Success Criteria

  • Metrics pipeline operational
  • Scheduler produces valid task plans
  • Queue respects adaptive intervals
  • Circuit breakers prevent runaway failures
  • Documentation enables ops team rollout
  • Feature flag allows safe activation

Known Limitations

  • Dead-letter queue state lost on restart (no persistence yet)
  • Circuit breaker state not exposed via API (Task 8)
  • No automated test coverage (Task 9)
  • Queue depth metric updated per-cycle (not real-time within cycle)

Next Steps

Immediate (Post-Phase 2):

  • Deploy to dev environment with flag enabled
  • Configure Grafana dashboards for new metrics
  • Set alerting thresholds (queue depth >50, staleness >60s)

Future Phases:

  • Task 8: REST API for scheduler introspection
  • Task 9: Comprehensive test suite
  • Phase 3: External sentinels and cross-cluster coordination