# Adaptive Polling Architecture

## Overview

Pulse uses an adaptive polling scheduler that adjusts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.

```mermaid
flowchart LR
    Scheduler[Scheduler]
    Queue[Priority Queue<br/>by NextRun]
    Workers[Workers]

    Scheduler -->|schedule| Queue
    Queue -->|dequeue| Workers
    Workers -->|success| Scheduler
    Workers -->|failure| CB[Circuit Breaker]
    CB -->|backoff| Scheduler
```

- The scheduler computes `ScheduledTask` entries using adaptive intervals.
- The task queue is a min-heap keyed by `NextRun`; only due tasks execute.
- Workers execute tasks, capture outcomes, and reschedule via the scheduler or backoff logic.
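
The queue behaviour can be pictured with a short sketch. Below is a minimal due-time min-heap in Go built on `container/heap`; the `ScheduledTask` fields and the `PopDue` helper are illustrative stand-ins, not the actual types in `internal/monitoring/task_queue.go`.

```go
// Minimal sketch of a due-time priority queue; field and type names are
// illustrative, not the actual Pulse implementation.
package monitoring

import (
	"container/heap"
	"time"
)

type ScheduledTask struct {
	Instance string
	Type     string    // "pve", "pbs", "pmg", ...
	NextRun  time.Time // when the task becomes due
}

// taskHeap orders tasks so the earliest NextRun is popped first.
type taskHeap []*ScheduledTask

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].NextRun.Before(h[j].NextRun) }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*ScheduledTask)) }
func (h *taskHeap) Pop() any {
	old := *h
	n := len(old)
	t := old[n-1]
	*h = old[:n-1]
	return t
}

// PopDue returns the next task only if it is already due; otherwise nil,
// so workers never run ahead of the schedule.
func PopDue(h *taskHeap, now time.Time) *ScheduledTask {
	if h.Len() == 0 || (*h)[0].NextRun.After(now) {
		return nil
	}
	return heap.Pop(h).(*ScheduledTask)
}
```

Keying the heap on `NextRun` means workers only dequeue tasks that are already due, so idle instances simply sit in the heap until their longer interval elapses.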

## Key Components

| Component | File | Responsibility |
| --- | --- | --- |
| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. |
| Staleness tracker | `internal/monitoring/staleness_tracker.go` | Maintains freshness metadata and scores. |
| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. |
| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. |
| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. |
| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. |

## Configuration

**v4.24.0:** Adaptive polling is enabled by default but can be toggled without restart.

### Via UI

Navigate to Settings → System → Monitoring to enable/disable adaptive polling. Changes apply immediately without requiring a restart.

### Via Environment Variables

Environment variables (defaults in `internal/config/config.go`):

| Variable | Default | Description |
| --- | --- | --- |
| `ADAPTIVE_POLLING_ENABLED` | `true` | Changed in v4.24.0: now enabled by default |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Target cadence when the system is healthy |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Lower bound (active instances) |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Upper bound (idle instances) |

All settings persist in `system.json` and can be overridden by environment variables. Changes made via the UI apply without a restart.
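
To make the interval bounds concrete, here is a hypothetical sketch of how a per-instance interval could be derived from these settings and a staleness score; the actual calculation lives in `internal/monitoring/scheduler.go` and may differ.

```go
// Hypothetical interval calculation from the configured bounds and a
// staleness score (0 = fresh, 1 = max stale); not the literal Pulse code.
package monitoring

import "time"

type PollingConfig struct {
	BaseInterval time.Duration // ADAPTIVE_POLLING_BASE_INTERVAL (default 10s)
	MinInterval  time.Duration // ADAPTIVE_POLLING_MIN_INTERVAL  (default 5s)
	MaxInterval  time.Duration // ADAPTIVE_POLLING_MAX_INTERVAL  (default 5m)
}

// NextInterval shortens the interval for stale or changing instances and
// stretches it for idle ones, always clamped to [MinInterval, MaxInterval].
func NextInterval(cfg PollingConfig, staleness float64, idle bool) time.Duration {
	interval := cfg.BaseInterval
	if idle {
		interval = cfg.MaxInterval // back off on healthy, idle targets
	}
	// Pull the interval toward MinInterval as staleness approaches 1.
	adjusted := time.Duration(float64(interval) - staleness*float64(interval-cfg.MinInterval))
	if adjusted < cfg.MinInterval {
		adjusted = cfg.MinInterval
	}
	if adjusted > cfg.MaxInterval {
		adjusted = cfg.MaxInterval
	}
	return adjusted
}
```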

## Metrics

**v4.24.0:** Extended metrics for comprehensive monitoring.

Exposed via Prometheus (`:9091/metrics`):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error) |
| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance |
| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success) |
| `pulse_monitor_poll_queue_depth` | gauge | — | Size of the priority queue |
| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type |
| `pulse_monitor_poll_errors_total` | counter | `instance_type`, `instance`, `category` | Error counts by category (transient/permanent) |
| `pulse_monitor_poll_last_success_timestamp` | gauge | `instance_type`, `instance` | Unix timestamp of last successful poll |

**Alerting recommendations:**

- Alert when `pulse_monitor_poll_staleness_seconds > 120` for critical instances
- Alert when `pulse_monitor_poll_queue_depth > 50` (backlog building)
- Alert when `pulse_monitor_poll_errors_total` with `category=permanent` increases (auth/config issues)
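
As a reference for wiring such series up, the sketch below registers a few of them with `prometheus/client_golang` and serves them on `:9091/metrics`; the metric names and labels follow the table above, but this is not the actual Pulse registration code.

```go
// Minimal sketch, assuming prometheus/client_golang; metric names and labels
// mirror the table above, but the real registration in Pulse may differ.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	pollTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "pulse_monitor_poll_total",
		Help: "Overall poll attempts (success/error).",
	}, []string{"instance_type", "instance", "result"})

	pollDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "pulse_monitor_poll_duration_seconds",
		Help:    "Poll latency per instance.",
		Buckets: prometheus.DefBuckets,
	}, []string{"instance_type", "instance"})

	queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pulse_monitor_poll_queue_depth",
		Help: "Size of the priority queue.",
	})
)

func main() {
	prometheus.MustRegister(pollTotal, pollDuration, queueDepth)

	// Record a successful poll of a hypothetical PVE instance.
	pollTotal.WithLabelValues("pve", "pve-core", "success").Inc()
	pollDuration.WithLabelValues("pve", "pve-core").Observe(0.42)
	queueDepth.Set(7)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9091", nil)
}
```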

## Circuit Breaker & Backoff

| State | Trigger | Recovery |
| --- | --- | --- |
| Closed | Default state; failures are counted. | — |
| Open | ≥3 consecutive failures; polls suppressed. | Exponential delay (max 5 min). |
| Half-open | Retry window elapsed; limited re-attempt. | Success ⇒ closed, failure ⇒ open. |

```mermaid
stateDiagram-v2
    [*] --> Closed: Startup / reset
    Closed: Default state\nPolling active\nFailure counter increments
    Closed --> Open: ≥3 consecutive failures
    Open: Polls suppressed\nScheduler schedules backoff (max 5m)
    Open --> HalfOpen: Retry window elapsed
    HalfOpen: Single probe allowed\nBreaker watches probe result
    HalfOpen --> Closed: Probe success\nReset failure streak & delay
    HalfOpen --> Open: Probe failure\nIncrease streak & backoff
```
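
A minimal Go sketch of this state machine, assuming the 3-failure threshold shown above; the real logic is in `internal/monitoring/circuit_breaker.go` and may track more detail.

```go
// Illustrative circuit breaker matching the states above; not the literal
// Pulse implementation.
package monitoring

import "time"

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

type CircuitBreaker struct {
	state    breakerState
	failures int
	retryAt  time.Time
}

// Allow reports whether a poll may run now; an open breaker lets a single
// probe through once the retry window has elapsed (half-open).
func (cb *CircuitBreaker) Allow(now time.Time) bool {
	if cb.state == stateOpen && now.After(cb.retryAt) {
		cb.state = stateHalfOpen
	}
	return cb.state != stateOpen
}

// Record updates the state from a poll outcome. nextDelay is the backoff
// delay to wait before the next probe when the breaker (re)opens.
func (cb *CircuitBreaker) Record(success bool, now time.Time, nextDelay time.Duration) {
	if success {
		cb.state, cb.failures = stateClosed, 0
		return
	}
	cb.failures++
	if cb.state == stateHalfOpen || cb.failures >= 3 {
		cb.state = stateOpen
		cb.retryAt = now.Add(nextDelay)
	}
}
```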

Backoff configuration (illustrated in the sketch below):

- Initial delay: 5s
- Multiplier: ×2 per failure
- Jitter: ±20%
- Max delay: 5 minutes
- After 5 transient failures or any permanent failure, the task moves to the dead-letter queue for operator action.
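
The delay calculation might look like this, using the parameters listed above (5s initial delay, ×2 per failure, ±20% jitter, 5-minute cap); `internal/monitoring/backoff.go` may implement it differently.

```go
// Sketch of exponential backoff with jitter per the parameters above; not
// the literal code from internal/monitoring/backoff.go.
package monitoring

import (
	"math/rand"
	"time"
)

const (
	initialDelay = 5 * time.Second
	maxDelay     = 5 * time.Minute
	jitterFrac   = 0.20
)

// BackoffDelay returns the delay before retry number `failures` (1-based).
func BackoffDelay(failures int) time.Duration {
	delay := initialDelay
	for i := 1; i < failures; i++ {
		delay *= 2
		if delay >= maxDelay {
			delay = maxDelay
			break
		}
	}
	// Apply ±20% jitter so instances don't retry in lockstep.
	jitter := 1 + jitterFrac*(rand.Float64()*2-1)
	delay = time.Duration(float64(delay) * jitter)
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}
```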

## Dead-Letter Queue

Dead-letter entries are kept in memory (in the same `TaskQueue` structure) with a 30-minute recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
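
Continuing the queue sketch from earlier (same hypothetical `ScheduledTask`/`taskHeap` types, plus the standard `container/heap`, `log`, and `time` imports), dead-letter routing might look roughly like this; the function name and log format are illustrative.

```go
// Hypothetical helper routing a repeatedly failing task to the in-memory
// dead-letter queue with a 30-minute recheck; not the actual Pulse API.
func routeToDeadLetter(dlq *taskHeap, task *ScheduledTask, now time.Time, lastErr error) {
	const recheckInterval = 30 * time.Minute

	log.Printf("Routing task to dead-letter queue: instance=%s type=%s err=%v",
		task.Instance, task.Type, lastErr)

	task.NextRun = now.Add(recheckInterval) // revisit later for operator action
	heap.Push(dlq, task)
}
```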

## API Endpoints

### `GET /api/monitoring/scheduler/health`

Returns comprehensive scheduler health data (authentication required).

Response format:

```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": {
      "pve": 4,
      "pbs": 2,
      "pmg": 1
    }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```

Field descriptions:

- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Tasks due within 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
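
A small client sketch that fetches and decodes this response: the endpoint path and field names come from the example above, while the struct names, base URL, and bearer-token header are assumptions for illustration.

```go
// Sketch of a health-endpoint client; struct names, host, and auth scheme
// are placeholders, not the documented Pulse client API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type SchedulerHealth struct {
	UpdatedAt time.Time `json:"updatedAt"`
	Enabled   bool      `json:"enabled"`
	Queue     struct {
		Depth            int            `json:"depth"`
		DueWithinSeconds int            `json:"dueWithinSeconds"`
		PerType          map[string]int `json:"perType"`
	} `json:"queue"`
	DeadLetter struct {
		Count int `json:"count"`
	} `json:"deadLetter"`
}

func main() {
	base := "http://pulse.example.com" // placeholder host; adjust for your deployment
	req, _ := http.NewRequest(http.MethodGet, base+"/api/monitoring/scheduler/health", nil)
	req.Header.Set("Authorization", "Bearer <token>") // auth scheme is an assumption

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var health SchedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		panic(err)
	}
	fmt.Printf("enabled=%v queue depth=%d dead-letter=%d\n",
		health.Enabled, health.Queue.Depth, health.DeadLetter.Count)
}
```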

## Operational Guidance

1. Enable adaptive polling: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. Monitor metrics to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
3. Inspect scheduler health via the API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
4. Review dead-letter logs for persistent failures; resolve underlying connectivity or auth issues before re-enabling.

## Rollout Plan

  1. Dev/QA: Run hot-dev with feature flag enabled; observe metrics and logs for several cycles.
  2. Staged deploy: Enable flag on a subset of clusters; monitor queue depth (<50) and staleness (<45s).
  3. Full rollout: Toggle flag globally once metrics are stable; document any overrides in release notes.
  4. Post-launch: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work).

## Known Follow-ups

- Task 8: expose scheduler health & dead-letter statistics via API and UI panels.
- Task 9: add a dedicated unit/integration harness for the scheduler & workers.