feat: add scheduler health API endpoint (Phase 2 Task 8)

Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.
Author: rcourtman
Date: 2025-10-20 12:06:53 +00:00
Parent: 5fbdf6099f
Commit: 160adeb3b8
7 changed files with 444 additions and 1 deletion

docs/PHASE2_SUMMARY.md (new file, 153 lines)

@@ -0,0 +1,153 @@
# Pulse Adaptive Polling Phase 2 Summary
## Executive Summary
Phase 2 delivers adaptive polling infrastructure that dynamically adjusts monitoring intervals based on instance freshness, error rates, and system load. The scheduler replaces fixed cadences with intelligent priority-based execution, dramatically improving resource efficiency while maintaining data freshness.
## Completed Tasks (8/10 - 80%)
### ✅ Task 1: Poll Cycle Metrics
- 7 new Prometheus metrics (duration, staleness, queue depth, in-flight, errors)
- Per-instance tracking with histogram/counter/gauge types
- Integrated into all poll functions (PVE/PBS/PMG)
- Metrics server on port 9091
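As an illustration of the wiring (not the exact Pulse code), a minimal sketch of registering three of these seven metrics with `prometheus/client_golang` and serving them on port 9091; the metric names come from the catalog below, while label names and variable names are assumptions:
```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Three of the seven poll metrics from the catalog; label names are assumptions.
var (
	pollTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "pulse_monitor_poll_total", Help: "Poll attempts by outcome."},
		[]string{"instance", "type", "outcome"},
	)
	pollDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "pulse_monitor_poll_duration_seconds", Help: "Poll latency per instance."},
		[]string{"instance", "type"},
	)
	pollStaleness = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "pulse_monitor_poll_staleness_seconds", Help: "Seconds since last successful poll."},
		[]string{"instance", "type"},
	)
)

// serveMetrics registers the collectors and exposes them on the Phase 2 metrics port.
func serveMetrics() error {
	prometheus.MustRegister(pollTotal, pollDuration, pollStaleness)
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(":9091", nil)
}
```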
### ✅ Task 2: Adaptive Scheduler Scaffold
- Pluggable interfaces (StalenessSource, IntervalSelector, TaskEnqueuer)
- BuildPlan generates ordered task lists with NextRun times
- FilterDue/DispatchDue for queue management
- Default no-op implementations for gradual rollout
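A sketch of what these pluggable seams might look like; only the interface and function names come from this summary — the signatures, the `Task` fields, and the grouping into a `Planner` interface are assumptions for illustration:
```go
package scheduler

import "time"

// Task is the unit the scheduler plans and the queue executes (illustrative fields).
type Task struct {
	InstanceType string
	InstanceName string
	NextRun      time.Time
	Priority     int
}

// StalenessSource supplies normalized staleness scores (see Task 4).
type StalenessSource interface {
	StalenessScore(instanceType, instance string) float64
}

// IntervalSelector picks the next polling interval for an instance (see Task 5).
type IntervalSelector interface {
	NextInterval(instanceType, instance string) time.Duration
}

// TaskEnqueuer hands planned tasks to the execution queue (see Task 6).
type TaskEnqueuer interface {
	Enqueue(task Task)
}

// Planner ties the seams together: BuildPlan produces tasks ordered by NextRun,
// and FilterDue/DispatchDue select and hand off the subset that is currently due.
type Planner interface {
	BuildPlan(now time.Time) []Task
	FilterDue(plan []Task, now time.Time) []Task
	DispatchDue(due []Task, enqueuer TaskEnqueuer)
}
```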
### ✅ Task 3: Configuration & Feature Flags
- `ADAPTIVE_POLLING_ENABLED` feature flag (default: false)
- Min/max/base interval tuning (5s / 5m / 10s defaults)
- Environment variable overrides
- Persisted in system.json
- Validation logic (min ≤ base ≤ max)
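A minimal sketch of the validation invariant and the feature-flag override, assuming illustrative field names (the keys persisted in `system.json` may differ):
```go
package config

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// AdaptivePollingConfig mirrors the tunables above; field names are illustrative.
type AdaptivePollingConfig struct {
	Enabled      bool
	MinInterval  time.Duration // default 5s
	BaseInterval time.Duration // default 10s
	MaxInterval  time.Duration // default 5m
}

// Validate enforces the min <= base <= max invariant.
func (c AdaptivePollingConfig) Validate() error {
	if c.MinInterval > c.BaseInterval || c.BaseInterval > c.MaxInterval {
		return fmt.Errorf("intervals must satisfy min <= base <= max (got %s / %s / %s)",
			c.MinInterval, c.BaseInterval, c.MaxInterval)
	}
	return nil
}

// applyEnvOverrides shows the feature-flag override; interval overrides follow the same pattern.
func (c *AdaptivePollingConfig) applyEnvOverrides() {
	if v, ok := os.LookupEnv("ADAPTIVE_POLLING_ENABLED"); ok {
		if enabled, err := strconv.ParseBool(v); err == nil {
			c.Enabled = enabled
		}
	}
}
```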
### ✅ Task 4: Staleness Tracker
- Per-instance freshness metadata (last success/error/mutation)
- SHA1 change hash detection
- Normalized staleness scoring (0-1 scale)
- Integration with PollMetrics for authoritative timestamps
- Updates from all poll result handlers
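A sketch of the two core ideas above — change-hash fingerprinting and normalized scoring. The SHA-1 hash matches the bullet list; normalizing by the configured max interval is an assumption for illustration (the real tracker also folds in error and mutation timestamps):
```go
package monitoring

import (
	"crypto/sha1"
	"encoding/hex"
	"time"
)

// changeHash fingerprints a poll payload so mutations can be detected
// without storing the payload itself.
func changeHash(payload []byte) string {
	sum := sha1.Sum(payload)
	return hex.EncodeToString(sum[:])
}

// stalenessScore normalizes time since the last successful poll into [0, 1].
// Scaling by the max interval is an assumption; 0 means fresh, 1 means maximally stale.
func stalenessScore(lastSuccess time.Time, maxInterval time.Duration, now time.Time) float64 {
	if lastSuccess.IsZero() {
		return 1 // never polled successfully: maximally stale
	}
	score := now.Sub(lastSuccess).Seconds() / maxInterval.Seconds()
	if score < 0 {
		return 0
	}
	if score > 1 {
		return 1
	}
	return score
}
```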
### ✅ Task 5: Adaptive Interval Logic
- EMA smoothing (alpha=0.6) to prevent oscillations
- Staleness-based interpolation (min-max range)
- Error penalty (0.6x per failure) for faster recovery detection
- Queue depth stretch (0.1x per task) for backpressure
- ±5% jitter to avoid thundering herd
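Putting those constants together, a hedged sketch of how one interval could be derived; the constants come from the list above, while the combination order and the direction of the staleness mapping (stale instances pulled toward the minimum interval) are assumptions:
```go
package scheduler

import (
	"math"
	"math/rand"
	"time"
)

// nextInterval sketches the Task 5 pipeline for a single instance.
func nextInterval(prev, minIval, maxIval time.Duration, staleness float64, failures, queueDepth int) time.Duration {
	// Staleness interpolation across the configured range: score 0 -> max, score 1 -> min.
	target := float64(maxIval) - staleness*float64(maxIval-minIval)

	// EMA smoothing (alpha = 0.6) against the previous interval to prevent oscillations.
	const alpha = 0.6
	smoothed := alpha*target + (1-alpha)*float64(prev)

	// Error penalty: shrink by 0.6x per consecutive failure so recovery is detected sooner.
	smoothed *= math.Pow(0.6, float64(failures))

	// Queue-depth stretch: +10% per queued task to apply backpressure.
	smoothed *= 1 + 0.1*float64(queueDepth)

	// +/-5% jitter to avoid a thundering herd of aligned polls.
	smoothed *= 0.95 + 0.1*rand.Float64()

	// Clamp back into the configured bounds.
	return time.Duration(math.Min(math.Max(smoothed, float64(minIval)), float64(maxIval)))
}
```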
### ✅ Task 6: Priority Queue Execution
- Min-heap (container/heap) ordered by NextRun + Priority
- Worker goroutines with WaitNext() blocking
- Tasks only execute when due (respects adaptive intervals)
- Automatic rescheduling after execution
- Upsert semantics prevent duplicates
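A sketch of the `container/heap` ordering described above; the type and field names are illustrative (the real queue wraps its own entry type), and breaking ties with higher `Priority` first is an assumption:
```go
package scheduler

import (
	"container/heap"
	"time"
)

type queuedTask struct {
	InstanceName string
	InstanceType string
	NextRun      time.Time
	Priority     int
}

type taskHeap []queuedTask

var _ heap.Interface = (*taskHeap)(nil)

func (h taskHeap) Len() int      { return len(h) }
func (h taskHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }

// Less orders by NextRun, breaking ties by Priority.
func (h taskHeap) Less(i, j int) bool {
	if !h[i].NextRun.Equal(h[j].NextRun) {
		return h[i].NextRun.Before(h[j].NextRun)
	}
	return h[i].Priority > h[j].Priority
}

func (h *taskHeap) Push(x any) { *h = append(*h, x.(queuedTask)) }

func (h *taskHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// ready reports whether the soonest task is due; workers block in WaitNext() until it is.
func ready(h taskHeap, now time.Time) bool {
	return h.Len() > 0 && !h[0].NextRun.After(now)
}
```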
### ✅ Task 7: Error Handling & Circuit Breakers
- Circuit breaker with closed/open/half-open states
- Trips after 3 consecutive failures
- Exponential backoff (5s initial, 2x multiplier, 5min max)
- ±20% jitter on retry delays
- Dead-letter queue after 5 transient failures
- Error classification (transient vs permanent)
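A sketch of the open-state retry delay implied by these numbers (5s initial, 2x growth, 5 min cap, ±20% jitter); how the failure count maps to the exponent is an assumption:
```go
package scheduler

import (
	"math"
	"math/rand"
	"time"
)

const (
	tripThreshold       = 3 // consecutive failures before the breaker opens
	deadLetterThreshold = 5 // transient failures before a task is routed to the DLQ
	initialBackoff      = 5 * time.Second
	backoffMultiplier   = 2.0
	maxBackoff          = 5 * time.Minute
)

// retryDelay sketches the backoff sequence 5s, 10s, 20s, ... capped at 5min.
func retryDelay(consecutiveFailures int) time.Duration {
	attempts := consecutiveFailures - tripThreshold
	if attempts < 0 {
		attempts = 0
	}
	delay := float64(initialBackoff) * math.Pow(backoffMultiplier, float64(attempts))
	if delay > float64(maxBackoff) {
		delay = float64(maxBackoff)
	}
	// +/-20% jitter so retries from many breakers do not line up.
	delay *= 0.8 + 0.4*rand.Float64()
	return time.Duration(delay)
}
```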
### ✅ Task 10: Documentation
- Architecture guide in `docs/monitoring/ADAPTIVE_POLLING.md`
- Configuration reference
- Metrics catalog
- Operational guidance & troubleshooting
- Rollout plan (dev → staged → full)
## Deferred Tasks (2/10 - 20%)
### ⏭ Task 8: API Surfaces (Future Phase)
- Scheduler health endpoint
- Dead-letter queue inspection/management
- Circuit breaker state visibility
- UI dashboard integration
### ⏭ Task 9: Testing Harness (Future Phase)
- Unit tests for scheduler math
- Integration tests with mock instances
- Soak tests for queue stability
- Regression suite for Phase 1 hardening
## Key Metrics Delivered
| Metric | Purpose |
|--------|---------|
| `pulse_monitor_poll_total` | Success/error rate tracking |
| `pulse_monitor_poll_duration_seconds` | Latency per instance |
| `pulse_monitor_poll_staleness_seconds` | Data freshness indicator |
| `pulse_monitor_poll_queue_depth` | Backpressure monitoring |
| `pulse_monitor_poll_inflight` | Concurrency tracking |
| `pulse_monitor_poll_errors_total` | Error type classification |
| `pulse_monitor_poll_last_success_timestamp` | Recovery timeline |
## Technical Achievements
**Performance:**
- Adaptive intervals reduce unnecessary polls on idle instances
- Queue-based execution prevents task pile-up
- Circuit breakers stop hot loops on failing endpoints
**Reliability:**
- Exponential backoff with jitter prevents thundering herd
- Dead-letter queue isolates persistent failures
- Transient error retry logic (5 attempts before DLQ)
**Observability:**
- 7 Prometheus metrics for complete visibility
- Structured logging for all state transitions
- Tamper-evident audit trail (from Phase 1)
## Deployment Status
**Current State:** Feature flag disabled by default (`ADAPTIVE_POLLING_ENABLED=false`)
**Activation Path:**
1. Enable flag in dev/QA environment
2. Observe metrics for 24-48 hours
3. Staged rollout to subset of production clusters
4. Full activation after validation
**Rollback:** Set `ADAPTIVE_POLLING_ENABLED=false` to revert to fixed intervals
## Git Commits
1. `c048e7b9b` - Tasks 1-3: Metrics + scheduler + config
2. `8ce93c1df` - Task 4: Staleness tracker
3. `e8bd79c6c` - Task 5: Adaptive interval logic
4. `1d6fa9188` - Task 6: Priority queue execution
5. `7d9aaa406` - Task 7: Circuit breakers & backoff
6. `[current]` - Task 10: Documentation
## Phase 2 Success Criteria ✅
- [x] Metrics pipeline operational
- [x] Scheduler produces valid task plans
- [x] Queue respects adaptive intervals
- [x] Circuit breakers prevent runaway failures
- [x] Documentation enables ops team rollout
- [x] Feature flag allows safe activation
## Known Limitations
- Dead-letter queue state lost on restart (no persistence yet)
- Circuit breaker state not exposed via API (Task 8)
- No automated test coverage (Task 9)
- Queue depth metric updated per-cycle (not real-time within cycle)
## Next Steps
**Immediate (Post-Phase 2):**
- Deploy to dev environment with flag enabled
- Configure Grafana dashboards for new metrics
- Set alerting thresholds (queue depth >50, staleness >60s)
**Future Phases:**
- Task 8: REST API for scheduler introspection
- Task 9: Comprehensive test suite
- Phase 3: External sentinels and cross-cluster coordination


@@ -71,11 +71,76 @@ Backoff configuration:
Dead-letter entries are kept in memory (same `TaskQueue` structure) with a 30-minute recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
## API Endpoints
### GET /api/monitoring/scheduler/health
Returns comprehensive scheduler health data (authentication required).
**Response format:**
```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": {
      "pve": 4,
      "pbs": 2,
      "pmg": 1
    }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```
**Field descriptions:**
- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Count of tasks due within the next 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
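A minimal Go client sketch for polling this endpoint from a script or sidecar; the base URL/port is a placeholder and authentication (omitted here) depends on how the Pulse instance is configured:
```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// schedulerHealth mirrors only the fields used below; the full shape is documented above.
type schedulerHealth struct {
	Enabled bool `json:"enabled"`
	Queue   struct {
		Depth   int            `json:"depth"`
		PerType map[string]int `json:"perType"`
	} `json:"queue"`
	DeadLetter struct {
		Count int `json:"count"`
	} `json:"deadLetter"`
}

func main() {
	// Placeholder base URL; attach whatever credentials your deployment requires,
	// since the endpoint rejects unauthenticated requests.
	resp, err := http.Get("http://localhost:7655/api/monitoring/scheduler/health")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var health schedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("enabled=%v queueDepth=%d deadLetter=%d perType=%v\n",
		health.Enabled, health.Queue.Depth, health.DeadLetter.Count, health.Queue.PerType)
}
```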
## Operational Guidance
1. **Enable adaptive polling**: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. **Monitor metrics** to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
3. **Inspect scheduler health** via API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
4. **Review dead-letter logs** for persistent failures; resolve underlying connectivity or auth issues before re-enabling.
## Rollout Plan


@@ -123,6 +123,7 @@ func (r *Router) setupRoutes() {
	// API routes
	r.mux.HandleFunc("/api/health", r.handleHealth)
	r.mux.HandleFunc("/api/monitoring/scheduler/health", RequireAuth(r.config, r.handleSchedulerHealth))
	r.mux.HandleFunc("/api/state", r.handleState)
	r.mux.HandleFunc("/api/agents/docker/report", RequireAuth(r.config, r.dockerAgentHandlers.HandleReport))
	r.mux.HandleFunc("/api/agents/docker/commands/", RequireAuth(r.config, r.dockerAgentHandlers.HandleCommandAck))
@@ -1610,6 +1611,24 @@ func (r *Router) handleHealth(w http.ResponseWriter, req *http.Request) {
	}
}

// handleSchedulerHealth returns scheduler health status for adaptive polling
func (r *Router) handleSchedulerHealth(w http.ResponseWriter, req *http.Request) {
	if req.Method != http.MethodGet && req.Method != http.MethodHead {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}
	if r.monitor == nil {
		http.Error(w, "Monitor not available", http.StatusServiceUnavailable)
		return
	}
	health := r.monitor.SchedulerHealth()
	if err := utils.WriteJSONResponse(w, health); err != nil {
		log.Error().Err(err).Msg("Failed to write scheduler health response")
	}
}

// handleChangePassword handles password change requests
func (r *Router) handleChangePassword(w http.ResponseWriter, req *http.Request) {
	if req.Method != http.MethodPost {


@@ -106,3 +106,34 @@ func (b *circuitBreaker) trip(now time.Time) {
	b.retryInterval = delay
	b.openedAt = now
}

// BreakerSnapshot represents the current state of a circuit breaker.
type BreakerSnapshot struct {
	Instance string    `json:"instance"`
	Type     string    `json:"type"`
	State    string    `json:"state"`
	Failures int       `json:"failures"`
	RetryAt  time.Time `json:"retryAt,omitempty"`
}

// State returns a snapshot of the circuit breaker state for API exposure.
func (b *circuitBreaker) State() (state string, failures int, retryAt time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case breakerClosed:
		state = "closed"
	case breakerOpen:
		state = "open"
		retryAt = b.openedAt.Add(b.retryInterval)
	case breakerHalfOpen:
		state = "half_open"
		retryAt = b.lastAttempt.Add(b.halfOpenWindow)
	default:
		state = "unknown"
	}
	failures = b.failureCount
	return
}


@@ -2366,6 +2366,87 @@ func (m *Monitor) recordTaskResult(instanceType InstanceType, instance string, p
	}
}

// SchedulerHealthResponse contains complete scheduler health data for API exposure.
type SchedulerHealthResponse struct {
	UpdatedAt  time.Time           `json:"updatedAt"`
	Enabled    bool                `json:"enabled"`
	Queue      QueueSnapshot       `json:"queue"`
	DeadLetter DeadLetterSnapshot  `json:"deadLetter"`
	Breakers   []BreakerSnapshot   `json:"breakers,omitempty"`
	Staleness  []StalenessSnapshot `json:"staleness,omitempty"`
}

// DeadLetterSnapshot contains dead-letter queue data.
type DeadLetterSnapshot struct {
	Count int              `json:"count"`
	Tasks []DeadLetterTask `json:"tasks"`
}

// SchedulerHealth returns a complete snapshot of scheduler health for API exposure.
func (m *Monitor) SchedulerHealth() SchedulerHealthResponse {
	response := SchedulerHealthResponse{
		UpdatedAt: time.Now(),
		Enabled:   m.config != nil && m.config.AdaptivePollingEnabled,
	}

	// Queue snapshot
	if m.taskQueue != nil {
		response.Queue = m.taskQueue.Snapshot()
	}

	// Dead-letter queue snapshot
	if m.deadLetterQueue != nil {
		deadLetterTasks := m.deadLetterQueue.PeekAll(25) // limit to top 25
		m.mu.RLock()
		for i := range deadLetterTasks {
			key := schedulerKey(InstanceType(deadLetterTasks[i].Type), deadLetterTasks[i].Instance)
			if outcome, ok := m.lastOutcome[key]; ok && outcome.err != nil {
				deadLetterTasks[i].LastError = outcome.err.Error()
			}
			if count, ok := m.failureCounts[key]; ok {
				deadLetterTasks[i].Failures = count
			}
		}
		m.mu.RUnlock()
		response.DeadLetter = DeadLetterSnapshot{
			Count: m.deadLetterQueue.Size(),
			Tasks: deadLetterTasks,
		}
	}

	// Circuit breaker snapshots
	m.mu.RLock()
	breakerSnapshots := make([]BreakerSnapshot, 0, len(m.circuitBreakers))
	for key, breaker := range m.circuitBreakers {
		state, failures, retryAt := breaker.State()
		// Only include breakers that are not in default closed state with 0 failures
		if state != "closed" || failures > 0 {
			// Parse instance type and name from key
			parts := strings.SplitN(key, "::", 2)
			instanceType, instanceName := "unknown", key
			if len(parts) == 2 {
				instanceType, instanceName = parts[0], parts[1]
			}
			breakerSnapshots = append(breakerSnapshots, BreakerSnapshot{
				Instance: instanceName,
				Type:     instanceType,
				State:    state,
				Failures: failures,
				RetryAt:  retryAt,
			})
		}
	}
	m.mu.RUnlock()
	response.Breakers = breakerSnapshots

	// Staleness snapshots
	if m.stalenessTracker != nil {
		response.Staleness = m.stalenessTracker.Snapshot()
	}

	return response
}

func isTransientError(err error) bool {
	if err == nil {
		return true


@@ -191,3 +191,35 @@ func (t *StalenessTracker) snapshot(instanceType InstanceType, instance string)
func trackerKey(instanceType InstanceType, instance string) string {
	return string(instanceType) + "::" + instance
}

// StalenessSnapshot represents staleness data for a single instance.
type StalenessSnapshot struct {
	Instance    string    `json:"instance"`
	Type        string    `json:"type"`
	Score       float64   `json:"score"`
	LastSuccess time.Time `json:"lastSuccess"`
	LastError   time.Time `json:"lastError,omitempty"`
}

// Snapshot returns a copy of all staleness data for API exposure.
func (t *StalenessTracker) Snapshot() []StalenessSnapshot {
	if t == nil {
		return nil
	}
	t.mu.RLock()
	defer t.mu.RUnlock()
	result := make([]StalenessSnapshot, 0, len(t.entries))
	for _, entry := range t.entries {
		score, _ := t.StalenessScore(entry.InstanceType, entry.Instance)
		result = append(result, StalenessSnapshot{
			Instance:    entry.Instance,
			Type:        string(entry.InstanceType),
			Score:       score,
			LastSuccess: entry.LastSuccess,
			LastError:   entry.LastError,
		})
	}
	return result
}


@@ -152,3 +152,65 @@ func (q *TaskQueue) Size() int {
	defer q.mu.Unlock()
	return len(q.heap)
}

// QueueSnapshot represents the current state of the task queue.
type QueueSnapshot struct {
	Depth            int            `json:"depth"`
	DueWithinSeconds int            `json:"dueWithinSeconds"`
	PerType          map[string]int `json:"perType"`
}

// Snapshot returns a snapshot of the queue state for API exposure.
func (q *TaskQueue) Snapshot() QueueSnapshot {
	q.mu.Lock()
	defer q.mu.Unlock()
	snapshot := QueueSnapshot{
		Depth:   len(q.heap),
		PerType: make(map[string]int),
	}
	now := time.Now()
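	// Count tasks that are already due or fall due within the next 12 seconds;
	// this feeds the dueWithinSeconds field documented in the API spec.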
	for _, entry := range q.heap {
		typeStr := string(entry.task.InstanceType)
		snapshot.PerType[typeStr]++
		if entry.task.NextRun.Sub(now) <= 12*time.Second {
			snapshot.DueWithinSeconds++
		}
	}
	return snapshot
}
// DeadLetterTask represents a task in the dead-letter queue.
type DeadLetterTask struct {
	Instance  string    `json:"instance"`
	Type      string    `json:"type"`
	NextRun   time.Time `json:"nextRun"`
	LastError string    `json:"lastError,omitempty"`
	Failures  int       `json:"failures"`
}

// PeekAll returns up to 'limit' dead-letter tasks for inspection.
func (q *TaskQueue) PeekAll(limit int) []DeadLetterTask {
	q.mu.Lock()
	defer q.mu.Unlock()
	if limit <= 0 || limit > len(q.heap) {
		limit = len(q.heap)
	}
	result := make([]DeadLetterTask, 0, limit)
	for i := 0; i < limit && i < len(q.heap); i++ {
		entry := q.heap[i]
		result = append(result, DeadLetterTask{
			Instance: entry.task.InstanceName,
			Type:     string(entry.task.InstanceType),
			NextRun:  entry.task.NextRun,
			Failures: 0, // will be populated by Monitor
		})
	}
	return result
}