# Adaptive Polling Architecture

## Overview

Phase 2 introduces a scheduler that adapts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.

```
┌──────────┐      ┌──────────────┐      ┌──────────────┐      ┌─────────────┐
│ PollLoop │─────▶│  Scheduler   │─────▶│  Priority Q  │─────▶│ TaskWorkers │
└──────────┘      └──────────────┘      └──────────────┘      └─────────────┘
      ▲                                                              │
      │                                                              │
      └────── Staleness, metrics, circuit breaker feedback ──────────┘
```

- **Scheduler** computes `ScheduledTask` entries using adaptive intervals.
- **Task queue** is a min-heap keyed by `NextRun`; only due tasks execute.
- **Workers** execute tasks, capture outcomes, and reschedule via the scheduler or backoff logic.

## Key Components

| Component         | File                                        | Responsibility                                         |
|-------------------|---------------------------------------------|--------------------------------------------------------|
| Scheduler         | `internal/monitoring/scheduler.go`          | Calculates adaptive intervals per instance.            |
| Staleness tracker | `internal/monitoring/staleness_tracker.go`  | Maintains freshness metadata and scores.               |
| Priority queue    | `internal/monitoring/task_queue.go`         | Orders `ScheduledTask` items by due time + priority.   |
| Circuit breaker   | `internal/monitoring/circuit_breaker.go`    | Trips on repeated failures, preventing hot loops.      |
| Backoff           | `internal/monitoring/backoff.go`            | Exponential retry delays with jitter.                  |
| Workers           | `internal/monitoring/monitor.go`            | Pop tasks, execute pollers, reschedule or dead-letter. |

## Configuration

Environment variables (defaults in `internal/config/config.go`):

| Variable                          | Default | Description                                |
|-----------------------------------|---------|--------------------------------------------|
| `ADAPTIVE_POLLING_ENABLED`        | `false` | Feature flag for the adaptive scheduler.   |
| `ADAPTIVE_POLLING_BASE_INTERVAL`  | `10s`   | Target cadence when the system is healthy. |
| `ADAPTIVE_POLLING_MIN_INTERVAL`   | `5s`    | Lower bound (active instances).            |
| `ADAPTIVE_POLLING_MAX_INTERVAL`   | `5m`    | Upper bound (idle instances).              |

All settings persist in `system.json` and respond to environment overrides. A sketch of how these bounds might map a staleness score to a poll interval follows the Circuit Breaker & Backoff section below.

## Metrics

Exposed via Prometheus (`:9091/metrics`):

| Metric                                   | Type      | Labels                                 | Description                             |
|------------------------------------------|-----------|----------------------------------------|-----------------------------------------|
| `pulse_monitor_poll_total`               | counter   | `instance_type`, `instance`, `result`  | Overall poll attempts (success/error).  |
| `pulse_monitor_poll_duration_seconds`    | histogram | `instance_type`, `instance`            | Poll latency per instance.              |
| `pulse_monitor_poll_staleness_seconds`   | gauge     | `instance_type`, `instance`            | Age since last success (0 on success).  |
| `pulse_monitor_poll_queue_depth`         | gauge     | —                                      | Size of the priority queue.             |
| `pulse_monitor_poll_inflight`            | gauge     | `instance_type`                        | Concurrent tasks per type.              |

## Circuit Breaker & Backoff

| State         | Trigger                                     | Recovery                              |
|---------------|---------------------------------------------|---------------------------------------|
| **Closed**    | Default. Failures counted.                  | —                                     |
| **Open**      | ≥3 consecutive failures. Polling suppressed.| Exponential delay (max 5 min).        |
| **Half-open** | Retry window elapsed. Limited re-attempt.   | Success ⇒ closed. Failure ⇒ open.     |

Backoff configuration (see the delay sketch below):

- Initial delay: 5 s
- Multiplier: ×2 per failure
- Jitter: ±20%
- Max delay: 5 minutes
- After 5 transient failures or any permanent failure, the task moves to the dead-letter queue for operator action.
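The actual implementation lives in `internal/monitoring/backoff.go` and is not reproduced here; the following is a minimal sketch of the delay calculation those parameters imply. The constant and function names are illustrative, not the real API:

```go
package monitoring

import (
	"math"
	"math/rand"
	"time"
)

// Illustrative constants mirroring the backoff parameters listed above.
const (
	initialDelay = 5 * time.Second
	multiplier   = 2.0
	jitterFrac   = 0.20 // ±20%
	maxDelay     = 5 * time.Minute
)

// backoffDelay returns the wait before the next retry after `failures`
// consecutive failures: initialDelay doubled per failure, with ±20%
// random jitter, capped at maxDelay.
func backoffDelay(failures int) time.Duration {
	if failures < 1 {
		failures = 1
	}
	d := float64(initialDelay) * math.Pow(multiplier, float64(failures-1))
	// Jitter spreads retries so many failing instances do not hammer
	// their targets in lockstep.
	d *= 1 + jitterFrac*(2*rand.Float64()-1)
	if d > float64(maxDelay) {
		d = float64(maxDelay)
	}
	return time.Duration(d)
}
```

With these numbers, a task that keeps failing waits roughly 5 s, 10 s, 20 s, 40 s, then 80 s (each ±20%) before the fifth transient failure routes it to the dead-letter queue.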
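How `internal/monitoring/scheduler.go` maps a staleness score onto a poll interval is implementation-specific; as one plausible shape under the configuration defaults above, a linear interpolation between the configured bounds. All names below are hypothetical, and the real scheduler also weighs errors and workload:

```go
package monitoring

import "time"

// Defaults from the configuration table above.
const (
	minInterval = 5 * time.Second
	maxInterval = 5 * time.Minute
)

// adaptiveInterval linearly interpolates between the configured bounds:
// score 0 (fresh, idle) maps to maxInterval, score 1 (maximally stale)
// maps to minInterval, so stale instances are polled most aggressively.
func adaptiveInterval(stalenessScore float64) time.Duration {
	if stalenessScore < 0 {
		stalenessScore = 0
	}
	if stalenessScore > 1 {
		stalenessScore = 1
	}
	span := float64(maxInterval - minInterval)
	return maxInterval - time.Duration(stalenessScore*span)
}
```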
## Dead-Letter Queue

Dead-letter entries are kept in memory (in the same `TaskQueue` structure) with a 30-minute recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.

## API Endpoints

### GET /api/monitoring/scheduler/health

Returns comprehensive scheduler health data (authentication required).

**Response format:**

```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": { "pve": 4, "pbs": 2, "pmg": 1 }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```

**Field descriptions:**

- `enabled`: Feature flag status.
- `queue.depth`: Total queued tasks.
- `queue.dueWithinSeconds`: Tasks due within 12 seconds.
- `queue.perType`: Distribution by instance type.
- `deadLetter.count`: Total dead-letter tasks.
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries.
- `breakers`: Circuit breaker states (only non-default states shown).
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale).

## Operational Guidance

1. **Enable adaptive polling**: set `ADAPTIVE_POLLING_ENABLED=true` via the UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. **Monitor metrics** to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
3. **Inspect scheduler health** via the `/api/monitoring/scheduler/health` endpoint for circuit breaker trips and dead-letter queue status (a minimal client sketch appears at the end of this document).
4. **Review dead-letter logs** for persistent failures; resolve the underlying connectivity or auth issues before re-enabling polling for the affected instance.

## Rollout Plan

1. **Dev/QA**: Run hot-dev with the feature flag enabled; observe metrics and logs for several cycles.
2. **Staged deploy**: Enable the flag on a subset of clusters; monitor queue depth (<50) and staleness (<45 s).
3. **Full rollout**: Toggle the flag globally once metrics are stable; document any overrides in release notes.
4. **Post-launch**: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work).

## Known Follow-ups

- Task 8: expose scheduler health & dead-letter statistics via API and UI panels.
- Task 9: add a dedicated unit/integration harness for the scheduler & workers.
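## Appendix: Health Endpoint Client Sketch

As an aid to step 3 of the Operational Guidance and the staged-deploy thresholds in the Rollout Plan, a minimal client sketch that reads the health endpoint and flags threshold breaches. The base URL, port, and `X-API-Token` header are assumptions, not the documented scheme; substitute whatever host and authentication the deployment actually uses:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// schedulerHealth mirrors only the fields checked below; the full
// response schema is shown in the API Endpoints section above.
type schedulerHealth struct {
	Enabled bool `json:"enabled"`
	Queue   struct {
		Depth int `json:"depth"`
	} `json:"queue"`
	Staleness []struct {
		Instance string  `json:"instance"`
		Score    float64 `json:"score"`
	} `json:"staleness"`
}

func main() {
	// Hypothetical base URL; adjust to the deployment.
	req, err := http.NewRequest("GET", "http://localhost:8080/api/monitoring/scheduler/health", nil)
	if err != nil {
		panic(err)
	}
	// Hypothetical auth header; the endpoint requires authentication.
	req.Header.Set("X-API-Token", os.Getenv("PULSE_API_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var h schedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		panic(err)
	}

	// Staged-deploy threshold from the rollout plan: queue depth < 50.
	if h.Queue.Depth >= 50 {
		fmt.Printf("WARN: queue depth %d exceeds rollout threshold\n", h.Queue.Depth)
	}
	// The normalized staleness score (0 = fresh, 1 = max stale) is used
	// here as a rough proxy for the < 45 s freshness target.
	for _, s := range h.Staleness {
		if s.Score > 0.9 {
			fmt.Printf("WARN: %s staleness score %.2f near maximum\n", s.Instance, s.Score)
		}
	}
}
```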