feat: add scheduler health API endpoint (Phase 2 Task 8)

Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.
Author: rcourtman
Date: 2025-10-20 12:06:53 +00:00
Parent: 5fbdf6099f
Commit: 160adeb3b8
7 changed files with 444 additions and 1 deletion

docs/PHASE2_SUMMARY.md (new file, 153 lines)

@@ -0,0 +1,153 @@
# Pulse Adaptive Polling Phase 2 Summary
## Executive Summary
Phase 2 delivers adaptive polling infrastructure that dynamically adjusts monitoring intervals based on instance freshness, error rates, and system load. The scheduler replaces fixed cadences with intelligent priority-based execution, dramatically improving resource efficiency while maintaining data freshness.
## Completed Tasks (8/10 - 80%)
### ✅ Task 1: Poll Cycle Metrics
- 7 new Prometheus metrics (duration, staleness, queue depth, in-flight, errors)
- Per-instance tracking with histogram/counter/gauge types
- Integrated into all poll functions (PVE/PBS/PMG)
- Metrics server on port 9091
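As an illustration of the wiring (not the exact Pulse code), a minimal sketch of registering three of these seven metrics with `prometheus/client_golang` and serving them on port 9091; the metric names come from the catalog below, while label names and variable names are assumptions:
```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Three of the seven poll metrics from the catalog; label names are assumptions.
var (
	pollTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "pulse_monitor_poll_total", Help: "Poll attempts by outcome."},
		[]string{"instance", "type", "outcome"},
	)
	pollDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "pulse_monitor_poll_duration_seconds", Help: "Poll latency per instance."},
		[]string{"instance", "type"},
	)
	pollStaleness = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "pulse_monitor_poll_staleness_seconds", Help: "Seconds since last successful poll."},
		[]string{"instance", "type"},
	)
)

// serveMetrics registers the collectors and exposes them on the Phase 2 metrics port.
func serveMetrics() error {
	prometheus.MustRegister(pollTotal, pollDuration, pollStaleness)
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(":9091", nil)
}
```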
### ✅ Task 2: Adaptive Scheduler Scaffold
- Pluggable interfaces (StalenessSource, IntervalSelector, TaskEnqueuer)
- BuildPlan generates ordered task lists with NextRun times
- FilterDue/DispatchDue for queue management
- Default no-op implementations for gradual rollout
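A sketch of what these pluggable seams might look like; only the interface and function names come from this summary — the signatures, the `Task` fields, and the grouping into a `Planner` interface are assumptions for illustration:
```go
package scheduler

import "time"

// Task is the unit the scheduler plans and the queue executes (illustrative fields).
type Task struct {
	InstanceType string
	InstanceName string
	NextRun      time.Time
	Priority     int
}

// StalenessSource supplies normalized staleness scores (see Task 4).
type StalenessSource interface {
	StalenessScore(instanceType, instance string) float64
}

// IntervalSelector picks the next polling interval for an instance (see Task 5).
type IntervalSelector interface {
	NextInterval(instanceType, instance string) time.Duration
}

// TaskEnqueuer hands planned tasks to the execution queue (see Task 6).
type TaskEnqueuer interface {
	Enqueue(task Task)
}

// Planner ties the seams together: BuildPlan produces tasks ordered by NextRun,
// and FilterDue/DispatchDue select and hand off the subset that is currently due.
type Planner interface {
	BuildPlan(now time.Time) []Task
	FilterDue(plan []Task, now time.Time) []Task
	DispatchDue(due []Task, enqueuer TaskEnqueuer)
}
```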
### ✅ Task 3: Configuration & Feature Flags
- `ADAPTIVE_POLLING_ENABLED` feature flag (default: false)
- Min/max/base interval tuning (5s / 5m / 10s defaults)
- Environment variable overrides
- Persisted in system.json
- Validation logic (min ≤ base ≤ max)
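A minimal sketch of the validation invariant and the feature-flag override, assuming illustrative field names (the keys persisted in `system.json` may differ):
```go
package config

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// AdaptivePollingConfig mirrors the tunables above; field names are illustrative.
type AdaptivePollingConfig struct {
	Enabled      bool
	MinInterval  time.Duration // default 5s
	BaseInterval time.Duration // default 10s
	MaxInterval  time.Duration // default 5m
}

// Validate enforces the min <= base <= max invariant.
func (c AdaptivePollingConfig) Validate() error {
	if c.MinInterval > c.BaseInterval || c.BaseInterval > c.MaxInterval {
		return fmt.Errorf("intervals must satisfy min <= base <= max (got %s / %s / %s)",
			c.MinInterval, c.BaseInterval, c.MaxInterval)
	}
	return nil
}

// applyEnvOverrides shows the feature-flag override; interval overrides follow the same pattern.
func (c *AdaptivePollingConfig) applyEnvOverrides() {
	if v, ok := os.LookupEnv("ADAPTIVE_POLLING_ENABLED"); ok {
		if enabled, err := strconv.ParseBool(v); err == nil {
			c.Enabled = enabled
		}
	}
}
```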
### ✅ Task 4: Staleness Tracker
- Per-instance freshness metadata (last success/error/mutation)
- SHA1 change hash detection
- Normalized staleness scoring (0-1 scale)
- Integration with PollMetrics for authoritative timestamps
- Updates from all poll result handlers
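A sketch of the two core ideas above — change-hash fingerprinting and normalized scoring. The SHA-1 hash matches the bullet list; normalizing by the configured max interval is an assumption for illustration (the real tracker also folds in error and mutation timestamps):
```go
package monitoring

import (
	"crypto/sha1"
	"encoding/hex"
	"time"
)

// changeHash fingerprints a poll payload so mutations can be detected
// without storing the payload itself.
func changeHash(payload []byte) string {
	sum := sha1.Sum(payload)
	return hex.EncodeToString(sum[:])
}

// stalenessScore normalizes time since the last successful poll into [0, 1].
// Scaling by the max interval is an assumption; 0 means fresh, 1 means maximally stale.
func stalenessScore(lastSuccess time.Time, maxInterval time.Duration, now time.Time) float64 {
	if lastSuccess.IsZero() {
		return 1 // never polled successfully: maximally stale
	}
	score := now.Sub(lastSuccess).Seconds() / maxInterval.Seconds()
	if score < 0 {
		return 0
	}
	if score > 1 {
		return 1
	}
	return score
}
```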
### ✅ Task 5: Adaptive Interval Logic
- EMA smoothing (alpha=0.6) to prevent oscillations
- Staleness-based interpolation (min-max range)
- Error penalty (0.6x per failure) for faster recovery detection
- Queue depth stretch (0.1x per task) for backpressure
- ±5% jitter to avoid thundering herd
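Putting those constants together, a hedged sketch of how one interval could be derived; the constants come from the list above, while the combination order and the direction of the staleness mapping (stale instances pulled toward the minimum interval) are assumptions:
```go
package scheduler

import (
	"math"
	"math/rand"
	"time"
)

// nextInterval sketches the Task 5 pipeline for a single instance.
func nextInterval(prev, minIval, maxIval time.Duration, staleness float64, failures, queueDepth int) time.Duration {
	// Staleness interpolation across the configured range: score 0 -> max, score 1 -> min.
	target := float64(maxIval) - staleness*float64(maxIval-minIval)

	// EMA smoothing (alpha = 0.6) against the previous interval to prevent oscillations.
	const alpha = 0.6
	smoothed := alpha*target + (1-alpha)*float64(prev)

	// Error penalty: shrink by 0.6x per consecutive failure so recovery is detected sooner.
	smoothed *= math.Pow(0.6, float64(failures))

	// Queue-depth stretch: +10% per queued task to apply backpressure.
	smoothed *= 1 + 0.1*float64(queueDepth)

	// +/-5% jitter to avoid a thundering herd of aligned polls.
	smoothed *= 0.95 + 0.1*rand.Float64()

	// Clamp back into the configured bounds.
	return time.Duration(math.Min(math.Max(smoothed, float64(minIval)), float64(maxIval)))
}
```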
### ✅ Task 6: Priority Queue Execution
- Min-heap (container/heap) ordered by NextRun + Priority
- Worker goroutines with WaitNext() blocking
- Tasks only execute when due (respects adaptive intervals)
- Automatic rescheduling after execution
- Upsert semantics prevent duplicates
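A sketch of the `container/heap` ordering described above; the type and field names are illustrative (the real queue wraps its own entry type), and breaking ties with higher `Priority` first is an assumption:
```go
package scheduler

import (
	"container/heap"
	"time"
)

type queuedTask struct {
	InstanceName string
	InstanceType string
	NextRun      time.Time
	Priority     int
}

type taskHeap []queuedTask

var _ heap.Interface = (*taskHeap)(nil)

func (h taskHeap) Len() int      { return len(h) }
func (h taskHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }

// Less orders by NextRun, breaking ties by Priority.
func (h taskHeap) Less(i, j int) bool {
	if !h[i].NextRun.Equal(h[j].NextRun) {
		return h[i].NextRun.Before(h[j].NextRun)
	}
	return h[i].Priority > h[j].Priority
}

func (h *taskHeap) Push(x any) { *h = append(*h, x.(queuedTask)) }

func (h *taskHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// ready reports whether the soonest task is due; workers block in WaitNext() until it is.
func ready(h taskHeap, now time.Time) bool {
	return h.Len() > 0 && !h[0].NextRun.After(now)
}
```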
### ✅ Task 7: Error Handling & Circuit Breakers
- Circuit breaker with closed/open/half-open states
- Trips after 3 consecutive failures
- Exponential backoff (5s initial, 2x multiplier, 5min max)
- ±20% jitter on retry delays
- Dead-letter queue after 5 transient failures
- Error classification (transient vs permanent)
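A sketch of the open-state retry delay implied by these numbers (5s initial, 2x growth, 5 min cap, ±20% jitter); how the failure count maps to the exponent is an assumption:
```go
package scheduler

import (
	"math"
	"math/rand"
	"time"
)

const (
	tripThreshold       = 3 // consecutive failures before the breaker opens
	deadLetterThreshold = 5 // transient failures before a task is routed to the DLQ
	initialBackoff      = 5 * time.Second
	backoffMultiplier   = 2.0
	maxBackoff          = 5 * time.Minute
)

// retryDelay sketches the backoff sequence 5s, 10s, 20s, ... capped at 5min.
func retryDelay(consecutiveFailures int) time.Duration {
	attempts := consecutiveFailures - tripThreshold
	if attempts < 0 {
		attempts = 0
	}
	delay := float64(initialBackoff) * math.Pow(backoffMultiplier, float64(attempts))
	if delay > float64(maxBackoff) {
		delay = float64(maxBackoff)
	}
	// +/-20% jitter so retries from many breakers do not line up.
	delay *= 0.8 + 0.4*rand.Float64()
	return time.Duration(delay)
}
```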
### ✅ Task 10: Documentation
- Architecture guide in `docs/monitoring/ADAPTIVE_POLLING.md`
- Configuration reference
- Metrics catalog
- Operational guidance & troubleshooting
- Rollout plan (dev → staged → full)
## Deferred Tasks (2/10 - 20%)
### ⏭ Task 8: API Surfaces (Future Phase)
- Scheduler health endpoint
- Dead-letter queue inspection/management
- Circuit breaker state visibility
- UI dashboard integration
### ⏭ Task 9: Testing Harness (Future Phase)
- Unit tests for scheduler math
- Integration tests with mock instances
- Soak tests for queue stability
- Regression suite for Phase 1 hardening
## Key Metrics Delivered
| Metric | Purpose |
|--------|---------|
| `pulse_monitor_poll_total` | Success/error rate tracking |
| `pulse_monitor_poll_duration_seconds` | Latency per instance |
| `pulse_monitor_poll_staleness_seconds` | Data freshness indicator |
| `pulse_monitor_poll_queue_depth` | Backpressure monitoring |
| `pulse_monitor_poll_inflight` | Concurrency tracking |
| `pulse_monitor_poll_errors_total` | Error type classification |
| `pulse_monitor_poll_last_success_timestamp` | Recovery timeline |
## Technical Achievements
**Performance:**
- Adaptive intervals reduce unnecessary polls on idle instances
- Queue-based execution prevents task pile-up
- Circuit breakers stop hot loops on failing endpoints
**Reliability:**
- Exponential backoff with jitter prevents thundering herd
- Dead-letter queue isolates persistent failures
- Transient error retry logic (5 attempts before DLQ)
**Observability:**
- 7 Prometheus metrics for complete visibility
- Structured logging for all state transitions
- Tamper-evident audit trail (from Phase 1)
## Deployment Status
**Current State:** Feature flag disabled by default (`ADAPTIVE_POLLING_ENABLED=false`)
**Activation Path:**
1. Enable flag in dev/QA environment
2. Observe metrics for 24-48 hours
3. Staged rollout to subset of production clusters
4. Full activation after validation
**Rollback:** Set `ADAPTIVE_POLLING_ENABLED=false` to revert to fixed intervals
## Git Commits
1. `c048e7b9b` - Tasks 1-3: Metrics + scheduler + config
2. `8ce93c1df` - Task 4: Staleness tracker
3. `e8bd79c6c` - Task 5: Adaptive interval logic
4. `1d6fa9188` - Task 6: Priority queue execution
5. `7d9aaa406` - Task 7: Circuit breakers & backoff
6. `[current]` - Task 10: Documentation
## Phase 2 Success Criteria ✅
- [x] Metrics pipeline operational
- [x] Scheduler produces valid task plans
- [x] Queue respects adaptive intervals
- [x] Circuit breakers prevent runaway failures
- [x] Documentation enables ops team rollout
- [x] Feature flag allows safe activation
## Known Limitations
- Dead-letter queue state lost on restart (no persistence yet)
- Circuit breaker state not exposed via API (Task 8)
- No automated test coverage (Task 9)
- Queue depth metric updated per-cycle (not real-time within cycle)
## Next Steps
**Immediate (Post-Phase 2):**
- Deploy to dev environment with flag enabled
- Configure Grafana dashboards for new metrics
- Set alerting thresholds (queue depth >50, staleness >60s)
**Future Phases:**
- Task 8: REST API for scheduler introspection
- Task 9: Comprehensive test suite
- Phase 3: External sentinels and cross-cluster coordination


@@ -71,11 +71,76 @@ Backoff configuration:
Dead-letter entries are kept in memory (same `TaskQueue` structure) with a 30-minute recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
## API Endpoints
### GET /api/monitoring/scheduler/health
Returns comprehensive scheduler health data (authentication required).
**Response format:**
```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": {
      "pve": 4,
      "pbs": 2,
      "pmg": 1
    }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```
**Field descriptions:**
- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Count of tasks due within the next 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
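A minimal Go client sketch for polling this endpoint from a script or sidecar; the base URL/port is a placeholder and authentication (omitted here) depends on how the Pulse instance is configured:
```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// schedulerHealth mirrors only the fields used below; the full shape is documented above.
type schedulerHealth struct {
	Enabled bool `json:"enabled"`
	Queue   struct {
		Depth   int            `json:"depth"`
		PerType map[string]int `json:"perType"`
	} `json:"queue"`
	DeadLetter struct {
		Count int `json:"count"`
	} `json:"deadLetter"`
}

func main() {
	// Placeholder base URL; attach whatever credentials your deployment requires,
	// since the endpoint rejects unauthenticated requests.
	resp, err := http.Get("http://localhost:7655/api/monitoring/scheduler/health")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var health schedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("enabled=%v queueDepth=%d deadLetter=%d perType=%v\n",
		health.Enabled, health.Queue.Depth, health.DeadLetter.Count, health.Queue.PerType)
}
```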
## Operational Guidance
1. **Enable adaptive polling**: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. **Monitor metrics** to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
3. **Inspect scheduler health** via API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
4. **Review dead-letter logs** for persistent failures; resolve underlying connectivity or auth issues before re-enabling.
## Rollout Plan


@@ -123,6 +123,7 @@ func (r *Router) setupRoutes() {
	// API routes
	r.mux.HandleFunc("/api/health", r.handleHealth)
	r.mux.HandleFunc("/api/monitoring/scheduler/health", RequireAuth(r.config, r.handleSchedulerHealth))
	r.mux.HandleFunc("/api/state", r.handleState)
	r.mux.HandleFunc("/api/agents/docker/report", RequireAuth(r.config, r.dockerAgentHandlers.HandleReport))
	r.mux.HandleFunc("/api/agents/docker/commands/", RequireAuth(r.config, r.dockerAgentHandlers.HandleCommandAck))
@@ -1610,6 +1611,24 @@ func (r *Router) handleHealth(w http.ResponseWriter, req *http.Request) {
	}
}

// handleSchedulerHealth returns scheduler health status for adaptive polling
func (r *Router) handleSchedulerHealth(w http.ResponseWriter, req *http.Request) {
	if req.Method != http.MethodGet && req.Method != http.MethodHead {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}
	if r.monitor == nil {
		http.Error(w, "Monitor not available", http.StatusServiceUnavailable)
		return
	}
	health := r.monitor.SchedulerHealth()
	if err := utils.WriteJSONResponse(w, health); err != nil {
		log.Error().Err(err).Msg("Failed to write scheduler health response")
	}
}

// handleChangePassword handles password change requests
func (r *Router) handleChangePassword(w http.ResponseWriter, req *http.Request) {
	if req.Method != http.MethodPost {


@@ -106,3 +106,34 @@ func (b *circuitBreaker) trip(now time.Time) {
	b.retryInterval = delay
	b.openedAt = now
}

// BreakerSnapshot represents the current state of a circuit breaker.
type BreakerSnapshot struct {
	Instance string    `json:"instance"`
	Type     string    `json:"type"`
	State    string    `json:"state"`
	Failures int       `json:"failures"`
	RetryAt  time.Time `json:"retryAt,omitempty"`
}

// State returns a snapshot of the circuit breaker state for API exposure.
func (b *circuitBreaker) State() (state string, failures int, retryAt time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case breakerClosed:
		state = "closed"
	case breakerOpen:
		state = "open"
		retryAt = b.openedAt.Add(b.retryInterval)
	case breakerHalfOpen:
		state = "half_open"
		retryAt = b.lastAttempt.Add(b.halfOpenWindow)
	default:
		state = "unknown"
	}
	failures = b.failureCount
	return
}


@@ -2366,6 +2366,87 @@ func (m *Monitor) recordTaskResult(instanceType InstanceType, instance string, p
	}
}

// SchedulerHealthResponse contains complete scheduler health data for API exposure.
type SchedulerHealthResponse struct {
	UpdatedAt  time.Time           `json:"updatedAt"`
	Enabled    bool                `json:"enabled"`
	Queue      QueueSnapshot       `json:"queue"`
	DeadLetter DeadLetterSnapshot  `json:"deadLetter"`
	Breakers   []BreakerSnapshot   `json:"breakers,omitempty"`
	Staleness  []StalenessSnapshot `json:"staleness,omitempty"`
}

// DeadLetterSnapshot contains dead-letter queue data.
type DeadLetterSnapshot struct {
	Count int              `json:"count"`
	Tasks []DeadLetterTask `json:"tasks"`
}

// SchedulerHealth returns a complete snapshot of scheduler health for API exposure.
func (m *Monitor) SchedulerHealth() SchedulerHealthResponse {
	response := SchedulerHealthResponse{
		UpdatedAt: time.Now(),
		Enabled:   m.config != nil && m.config.AdaptivePollingEnabled,
	}

	// Queue snapshot
	if m.taskQueue != nil {
		response.Queue = m.taskQueue.Snapshot()
	}

	// Dead-letter queue snapshot
	if m.deadLetterQueue != nil {
		deadLetterTasks := m.deadLetterQueue.PeekAll(25) // limit to top 25
		m.mu.RLock()
		for i := range deadLetterTasks {
			key := schedulerKey(InstanceType(deadLetterTasks[i].Type), deadLetterTasks[i].Instance)
			if outcome, ok := m.lastOutcome[key]; ok && outcome.err != nil {
				deadLetterTasks[i].LastError = outcome.err.Error()
			}
			if count, ok := m.failureCounts[key]; ok {
				deadLetterTasks[i].Failures = count
			}
		}
		m.mu.RUnlock()
		response.DeadLetter = DeadLetterSnapshot{
			Count: m.deadLetterQueue.Size(),
			Tasks: deadLetterTasks,
		}
	}

	// Circuit breaker snapshots
	m.mu.RLock()
	breakerSnapshots := make([]BreakerSnapshot, 0, len(m.circuitBreakers))
	for key, breaker := range m.circuitBreakers {
		state, failures, retryAt := breaker.State()
		// Only include breakers that are not in default closed state with 0 failures
		if state != "closed" || failures > 0 {
			// Parse instance type and name from key
			parts := strings.SplitN(key, "::", 2)
			instanceType, instanceName := "unknown", key
			if len(parts) == 2 {
				instanceType, instanceName = parts[0], parts[1]
			}
			breakerSnapshots = append(breakerSnapshots, BreakerSnapshot{
				Instance: instanceName,
				Type:     instanceType,
				State:    state,
				Failures: failures,
				RetryAt:  retryAt,
			})
		}
	}
	m.mu.RUnlock()
	response.Breakers = breakerSnapshots

	// Staleness snapshots
	if m.stalenessTracker != nil {
		response.Staleness = m.stalenessTracker.Snapshot()
	}

	return response
}

func isTransientError(err error) bool {
	if err == nil {
		return true


@@ -191,3 +191,35 @@ func (t *StalenessTracker) snapshot(instanceType InstanceType, instance string)
func trackerKey(instanceType InstanceType, instance string) string {
	return string(instanceType) + "::" + instance
}

// StalenessSnapshot represents staleness data for a single instance.
type StalenessSnapshot struct {
	Instance    string    `json:"instance"`
	Type        string    `json:"type"`
	Score       float64   `json:"score"`
	LastSuccess time.Time `json:"lastSuccess"`
	LastError   time.Time `json:"lastError,omitempty"`
}

// Snapshot returns a copy of all staleness data for API exposure.
func (t *StalenessTracker) Snapshot() []StalenessSnapshot {
	if t == nil {
		return nil
	}
	t.mu.RLock()
	defer t.mu.RUnlock()
	result := make([]StalenessSnapshot, 0, len(t.entries))
	for _, entry := range t.entries {
		score, _ := t.StalenessScore(entry.InstanceType, entry.Instance)
		result = append(result, StalenessSnapshot{
			Instance:    entry.Instance,
			Type:        string(entry.InstanceType),
			Score:       score,
			LastSuccess: entry.LastSuccess,
			LastError:   entry.LastError,
		})
	}
	return result
}


@@ -152,3 +152,65 @@ func (q *TaskQueue) Size() int {
	defer q.mu.Unlock()
	return len(q.heap)
}

// QueueSnapshot represents the current state of the task queue.
type QueueSnapshot struct {
	Depth            int            `json:"depth"`
	DueWithinSeconds int            `json:"dueWithinSeconds"`
	PerType          map[string]int `json:"perType"`
}

// Snapshot returns a snapshot of the queue state for API exposure.
func (q *TaskQueue) Snapshot() QueueSnapshot {
	q.mu.Lock()
	defer q.mu.Unlock()
	snapshot := QueueSnapshot{
		Depth:   len(q.heap),
		PerType: make(map[string]int),
	}
	now := time.Now()
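	// Count tasks that are already due or fall due within the next 12 seconds;
	// this feeds the dueWithinSeconds field documented in the API spec.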
	for _, entry := range q.heap {
		typeStr := string(entry.task.InstanceType)
		snapshot.PerType[typeStr]++
		if entry.task.NextRun.Sub(now) <= 12*time.Second {
			snapshot.DueWithinSeconds++
		}
	}
	return snapshot
}
// DeadLetterTask represents a task in the dead-letter queue.
type DeadLetterTask struct {
	Instance  string    `json:"instance"`
	Type      string    `json:"type"`
	NextRun   time.Time `json:"nextRun"`
	LastError string    `json:"lastError,omitempty"`
	Failures  int       `json:"failures"`
}

// PeekAll returns up to 'limit' dead-letter tasks for inspection.
func (q *TaskQueue) PeekAll(limit int) []DeadLetterTask {
	q.mu.Lock()
	defer q.mu.Unlock()
	if limit <= 0 || limit > len(q.heap) {
		limit = len(q.heap)
	}
	result := make([]DeadLetterTask, 0, limit)
	for i := 0; i < limit && i < len(q.heap); i++ {
		entry := q.heap[i]
		result = append(result, DeadLetterTask{
			Instance: entry.task.InstanceName,
			Type:     string(entry.task.InstanceType),
			NextRun:  entry.task.NextRun,
			Failures: 0, // will be populated by Monitor
		})
	}
	return result
}