# Adaptive Polling Architecture

## Overview

Phase 2 introduces a scheduler that adapts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.

```
┌──────────┐      ┌──────────────┐      ┌──────────────┐      ┌─────────────┐
│ PollLoop │─────▶│  Scheduler   │─────▶│ Priority Q   │─────▶│ TaskWorkers │
└──────────┘      └──────────────┘      └──────────────┘      └─────────────┘
      ▲                   │                    │                    │
      │                   └─────► staleness, metrics, circuit breaker feedback ─────┘
```

- The scheduler computes `ScheduledTask` entries using adaptive intervals.
- The task queue is a min-heap keyed by `NextRun`; only due tasks execute (see the sketch below).
- Workers execute tasks, capture outcomes, and reschedule via the scheduler or backoff logic.
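
For orientation, here is a minimal sketch of such a min-heap built on Go's `container/heap`. The real types live in `internal/monitoring/task_queue.go`; the `ScheduledTask` fields below are assumed from the description above, not copied from the code.

```go
package monitoring

import (
	"container/heap"
	"time"
)

// ScheduledTask sketches a queue entry; the real struct may carry more fields.
type ScheduledTask struct {
	Instance string
	Type     string
	NextRun  time.Time
	Priority int // breaks ties between tasks due at the same time
}

// taskHeap orders tasks by NextRun, falling back to Priority.
type taskHeap []*ScheduledTask

func (h taskHeap) Len() int { return len(h) }
func (h taskHeap) Less(i, j int) bool {
	if !h[i].NextRun.Equal(h[j].NextRun) {
		return h[i].NextRun.Before(h[j].NextRun)
	}
	return h[i].Priority > h[j].Priority
}
func (h taskHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)   { *h = append(*h, x.(*ScheduledTask)) }
func (h *taskHeap) Pop() any {
	old := *h
	task := old[len(old)-1]
	*h = old[:len(old)-1]
	return task
}

// popDue hands out the head task only once it is due, so tasks scheduled
// in the future stay queued.
func popDue(h *taskHeap, now time.Time) (*ScheduledTask, bool) {
	if h.Len() == 0 || (*h)[0].NextRun.After(now) {
		return nil, false
	}
	return heap.Pop(h).(*ScheduledTask), true
}
```

Keying the heap on `NextRun` means the dispatch loop only ever inspects the head of the queue to decide whether any work is due, which keeps idle cycles cheap.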

## Key Components

| Component | File | Responsibility |
|---|---|---|
| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. |
| Staleness tracker | `internal/monitoring/staleness_tracker.go` | Maintains freshness metadata and scores. |
| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. |
| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. |
| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. |
| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. |

## Configuration

Environment variables (defaults defined in `internal/config/config.go`):

| Variable | Default | Description |
|---|---|---|
| `ADAPTIVE_POLLING_ENABLED` | `false` | Feature flag for the adaptive scheduler. |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Target cadence when the system is healthy. |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Lower bound (active instances). |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Upper bound (idle instances). |

All settings are persisted in `system.json` and can be overridden via environment variables.
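
The interval formula itself lives in `internal/monitoring/scheduler.go`. Purely as an illustration of how these bounds could interact with a staleness score, a linear interpolation looks like this (the formula is an assumption, not the shipped logic):

```go
package monitoring

import "time"

// adaptiveInterval interpolates between the configured bounds: a fully
// stale instance (score 1.0) is polled at minInterval, while a fresh,
// idle one (score 0.0) drifts toward maxInterval. Illustrative only;
// the real scheduler may also weigh errors and workload.
func adaptiveInterval(score float64, minInterval, maxInterval time.Duration) time.Duration {
	if score < 0 {
		score = 0
	} else if score > 1 {
		score = 1
	}
	span := float64(maxInterval - minInterval)
	return maxInterval - time.Duration(score*span)
}
```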

## Metrics

Metrics are exposed in Prometheus format at `:9091/metrics`:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error). |
| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance. |
| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success). |
| `pulse_monitor_poll_queue_depth` | gauge | — | Size of the priority queue. |
| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type. |
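
As a sketch of how these series might be registered with the standard `prometheus/client_golang` library (variable names are illustrative, not taken from the codebase):

```go
package monitoring

import "github.com/prometheus/client_golang/prometheus"

var (
	pollTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pulse_monitor_poll_total",
			Help: "Overall poll attempts (success/error).",
		},
		[]string{"instance_type", "instance", "result"},
	)
	pollQueueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pulse_monitor_poll_queue_depth",
		Help: "Size of the priority queue.",
	})
)

func init() {
	prometheus.MustRegister(pollTotal, pollQueueDepth)
}

// recordPoll would be called by a worker after each attempt, with
// result set to "success" or "error".
func recordPoll(instanceType, instance, result string) {
	pollTotal.WithLabelValues(instanceType, instance, result).Inc()
}
```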

## Circuit Breaker & Backoff

| State | Trigger | Recovery |
|---|---|---|
| Closed | Default. Failures counted. | — |
| Open | ≥3 consecutive failures. Polls suppressed. | Exponential delay (max 5 min). |
| Half-open | Retry window elapsed. Limited re-attempt. | Success ⇒ closed. Failure ⇒ open. |
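
`internal/monitoring/circuit_breaker.go` implements this state machine. The condensed sketch below encodes the transitions from the table, with the three-failure threshold taken from the Open trigger; the retry delay is supplied by the backoff logic described next.

```go
package monitoring

import "time"

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

type circuitBreaker struct {
	state    breakerState
	failures int
	retryAt  time.Time
}

// Allow reports whether a poll may proceed, promoting an open breaker
// to half-open once its retry window has elapsed.
func (cb *circuitBreaker) Allow(now time.Time) bool {
	if cb.state == stateOpen {
		if now.Before(cb.retryAt) {
			return false
		}
		cb.state = stateHalfOpen
	}
	return true
}

// Record updates the breaker after a poll attempt; retryDelay comes
// from the backoff logic (see the sketch in the next subsection).
func (cb *circuitBreaker) Record(success bool, now time.Time, retryDelay time.Duration) {
	if success {
		cb.state, cb.failures = stateClosed, 0
		return
	}
	cb.failures++
	if cb.failures >= 3 { // ≥3 consecutive failures trips the breaker
		cb.state = stateOpen
		cb.retryAt = now.Add(retryDelay)
	}
}
```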

Backoff configuration (see the sketch below):

- Initial delay: 5s
- Multiplier: ×2 per failure
- Jitter: ±20%
- Max delay: 5 minutes
- After 5 transient failures, or any permanent failure, the task moves to the dead-letter queue for operator action.
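
Those parameters translate into a delay function along these lines (a sketch of what `internal/monitoring/backoff.go` describes, not the shipped code):

```go
package monitoring

import (
	"math"
	"math/rand"
	"time"
)

const (
	initialDelay = 5 * time.Second
	maxDelay     = 5 * time.Minute
	jitterFrac   = 0.20 // ±20%
)

// backoffDelay doubles the delay per consecutive failure, caps it at
// maxDelay, and applies ±20% jitter so retries do not synchronize.
func backoffDelay(failures int) time.Duration {
	if failures < 1 {
		failures = 1
	}
	delay := float64(initialDelay) * math.Pow(2, float64(failures-1))
	if delay > float64(maxDelay) {
		delay = float64(maxDelay)
	}
	jitter := 1 + jitterFrac*(2*rand.Float64()-1) // uniform in ±20%
	return time.Duration(delay * jitter)
}
```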

## Dead-Letter Queue

Dead-letter entries are kept in memory (in the same `TaskQueue` structure) with a 30-minute recheck interval. Operators should watch the logs for `Routing task to dead-letter queue` messages, or inspect entries via the scheduler health endpoint described under API Endpoints below.

## API Endpoints

### `GET /api/monitoring/scheduler/health`

Returns comprehensive scheduler health data (authentication required).

Response format:

```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": {
      "pve": 4,
      "pbs": 2,
      "pmg": 1
    }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```

Field descriptions:

- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Tasks due within 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
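
For programmatic access, a minimal Go client could look like the sketch below. The host, port, and `X-API-Token` header are placeholders; adjust them to match how your Pulse deployment is exposed and authenticated.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// schedulerHealth mirrors the subset of the response fields used here.
type schedulerHealth struct {
	Enabled bool `json:"enabled"`
	Queue   struct {
		Depth   int            `json:"depth"`
		PerType map[string]int `json:"perType"`
	} `json:"queue"`
	DeadLetter struct {
		Count int `json:"count"`
	} `json:"deadLetter"`
}

func main() {
	// Placeholder URL and auth header: substitute your instance's
	// address and whatever credential scheme it is configured with.
	req, err := http.NewRequest("GET", "http://localhost:8080/api/monitoring/scheduler/health", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-API-Token", "YOUR_TOKEN")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var health schedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		panic(err)
	}
	fmt.Printf("enabled=%v queued=%d deadLetter=%d\n",
		health.Enabled, health.Queue.Depth, health.DeadLetter.Count)
}
```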

## Operational Guidance

1. Enable adaptive polling: set `ADAPTIVE_POLLING_ENABLED=true` via the UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. Monitor metrics to ensure queue depth and staleness stay within SLA. Configure alerting on `pulse_monitor_poll_staleness_seconds` and `pulse_monitor_poll_queue_depth`.
3. Inspect scheduler health via `GET /api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
4. Review dead-letter logs for persistent failures; resolve the underlying connectivity or auth issues before re-enabling.

## Rollout Plan

1. Dev/QA: run hot-dev with the feature flag enabled; observe metrics and logs for several cycles.
2. Staged deploy: enable the flag on a subset of clusters; monitor queue depth (<50) and staleness (<45s).
3. Full rollout: toggle the flag globally once metrics are stable; document any overrides in the release notes.
4. Post-launch: add Grafana panels for queue depth & staleness; alert on circuit breaker trips surfaced by the scheduler health endpoint.

## Known Follow-ups

- Task 8: surface scheduler health & dead-letter statistics in UI panels (the API endpoint is documented above).
- Task 9: add a dedicated unit/integration harness for the scheduler & workers.