# Adaptive Polling Architecture

## Overview

Pulse uses an adaptive polling scheduler that adjusts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.

```mermaid
flowchart LR
    Scheduler[Scheduler]
    Queue[Priority Queue<br/>by NextRun]
    Workers[Workers]

    Scheduler -->|schedule| Queue
    Queue -->|dequeue| Workers
    Workers -->|success| Scheduler
    Workers -->|failure| CB[Circuit Breaker]
    CB -->|backoff| Scheduler
```

- The scheduler computes `ScheduledTask` entries using adaptive intervals.
- The task queue is a min-heap keyed by `NextRun`; only due tasks execute.
- Workers execute tasks, capture outcomes, and reschedule via the scheduler or backoff logic.
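
The queue behaviour can be pictured with a short sketch. Below is a minimal due-time min-heap in Go built on `container/heap`; the `ScheduledTask` fields and the `PopDue` helper are illustrative stand-ins, not the actual types in `internal/monitoring/task_queue.go`.

```go
// Minimal sketch of a due-time priority queue; field and type names are
// illustrative, not the actual Pulse implementation.
package monitoring

import (
	"container/heap"
	"time"
)

type ScheduledTask struct {
	Instance string
	Type     string    // "pve", "pbs", "pmg", ...
	NextRun  time.Time // when the task becomes due
}

// taskHeap orders tasks so the earliest NextRun is popped first.
type taskHeap []*ScheduledTask

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].NextRun.Before(h[j].NextRun) }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*ScheduledTask)) }
func (h *taskHeap) Pop() any {
	old := *h
	n := len(old)
	t := old[n-1]
	*h = old[:n-1]
	return t
}

// PopDue returns the next task only if it is already due; otherwise nil,
// so workers never run ahead of the schedule.
func PopDue(h *taskHeap, now time.Time) *ScheduledTask {
	if h.Len() == 0 || (*h)[0].NextRun.After(now) {
		return nil
	}
	return heap.Pop(h).(*ScheduledTask)
}
```

Keying the heap on `NextRun` means workers only dequeue tasks that are already due, so idle instances simply sit in the heap until their longer interval elapses.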

## Key Components

| Component | File | Responsibility |
| --- | --- | --- |
| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. |
| Staleness tracker | `internal/monitoring/staleness_tracker.go` | Maintains freshness metadata and scores. |
| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. |
| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. |
| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. |
| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. |

## Configuration

**v4.24.0:** Adaptive polling is enabled by default but can be toggled without restart.

### Via UI

Navigate to Settings → System → Monitoring to enable/disable adaptive polling. Changes apply immediately without requiring a restart.

### Via Environment Variables

Environment variables (defaults in `internal/config/config.go`):

| Variable | Default | Description |
| --- | --- | --- |
| `ADAPTIVE_POLLING_ENABLED` | `true` | Changed in v4.24.0: now enabled by default |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Target cadence when the system is healthy |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Lower bound (active instances) |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Upper bound (idle instances) |

All settings persist in `system.json` and can be overridden by environment variables. Changes made via the UI apply without a restart.
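
To make the interval bounds concrete, here is a hypothetical sketch of how a per-instance interval could be derived from these settings and a staleness score; the actual calculation lives in `internal/monitoring/scheduler.go` and may differ.

```go
// Hypothetical interval calculation from the configured bounds and a
// staleness score (0 = fresh, 1 = max stale); not the literal Pulse code.
package monitoring

import "time"

type PollingConfig struct {
	BaseInterval time.Duration // ADAPTIVE_POLLING_BASE_INTERVAL (default 10s)
	MinInterval  time.Duration // ADAPTIVE_POLLING_MIN_INTERVAL  (default 5s)
	MaxInterval  time.Duration // ADAPTIVE_POLLING_MAX_INTERVAL  (default 5m)
}

// NextInterval shortens the interval for stale or changing instances and
// stretches it for idle ones, always clamped to [MinInterval, MaxInterval].
func NextInterval(cfg PollingConfig, staleness float64, idle bool) time.Duration {
	interval := cfg.BaseInterval
	if idle {
		interval = cfg.MaxInterval // back off on healthy, idle targets
	}
	// Pull the interval toward MinInterval as staleness approaches 1.
	adjusted := time.Duration(float64(interval) - staleness*float64(interval-cfg.MinInterval))
	if adjusted < cfg.MinInterval {
		adjusted = cfg.MinInterval
	}
	if adjusted > cfg.MaxInterval {
		adjusted = cfg.MaxInterval
	}
	return adjusted
}
```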

## Metrics

**v4.24.0:** Extended metrics for comprehensive monitoring.

Exposed via Prometheus (`:9091/metrics`):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error) |
| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance |
| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success) |
| `pulse_monitor_poll_queue_depth` | gauge | — | Size of the priority queue |
| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type |
| `pulse_monitor_poll_errors_total` | counter | `instance_type`, `instance`, `category` | Error counts by category (transient/permanent) |
| `pulse_monitor_poll_last_success_timestamp` | gauge | `instance_type`, `instance` | Unix timestamp of last successful poll |

**Alerting recommendations:**

- Alert when `pulse_monitor_poll_staleness_seconds > 120` for critical instances
- Alert when `pulse_monitor_poll_queue_depth > 50` (backlog building)
- Alert when `pulse_monitor_poll_errors_total` with `category=permanent` increases (auth/config issues)
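
As a reference for wiring such series up, the sketch below registers a few of them with `prometheus/client_golang` and serves them on `:9091/metrics`; the metric names and labels follow the table above, but this is not the actual Pulse registration code.

```go
// Minimal sketch, assuming prometheus/client_golang; metric names and labels
// mirror the table above, but the real registration in Pulse may differ.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	pollTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "pulse_monitor_poll_total",
		Help: "Overall poll attempts (success/error).",
	}, []string{"instance_type", "instance", "result"})

	pollDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "pulse_monitor_poll_duration_seconds",
		Help:    "Poll latency per instance.",
		Buckets: prometheus.DefBuckets,
	}, []string{"instance_type", "instance"})

	queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pulse_monitor_poll_queue_depth",
		Help: "Size of the priority queue.",
	})
)

func main() {
	prometheus.MustRegister(pollTotal, pollDuration, queueDepth)

	// Record a successful poll of a hypothetical PVE instance.
	pollTotal.WithLabelValues("pve", "pve-core", "success").Inc()
	pollDuration.WithLabelValues("pve", "pve-core").Observe(0.42)
	queueDepth.Set(7)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9091", nil)
}
```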

## Circuit Breaker & Backoff

| State | Trigger | Recovery |
| --- | --- | --- |
| Closed | Default state; failures are counted. | — |
| Open | ≥3 consecutive failures; polls suppressed. | Exponential delay (max 5 min). |
| Half-open | Retry window elapsed; limited re-attempt. | Success ⇒ closed, failure ⇒ open. |

```mermaid
stateDiagram-v2
    [*] --> Closed: Startup / reset
    Closed: Default state\nPolling active\nFailure counter increments
    Closed --> Open: ≥3 consecutive failures
    Open: Polls suppressed\nScheduler schedules backoff (max 5m)
    Open --> HalfOpen: Retry window elapsed
    HalfOpen: Single probe allowed\nBreaker watches probe result
    HalfOpen --> Closed: Probe success\nReset failure streak & delay
    HalfOpen --> Open: Probe failure\nIncrease streak & backoff
```
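
A minimal Go sketch of this state machine, assuming the 3-failure threshold shown above; the real logic is in `internal/monitoring/circuit_breaker.go` and may track more detail.

```go
// Illustrative circuit breaker matching the states above; not the literal
// Pulse implementation.
package monitoring

import "time"

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

type CircuitBreaker struct {
	state    breakerState
	failures int
	retryAt  time.Time
}

// Allow reports whether a poll may run now; an open breaker lets a single
// probe through once the retry window has elapsed (half-open).
func (cb *CircuitBreaker) Allow(now time.Time) bool {
	if cb.state == stateOpen && now.After(cb.retryAt) {
		cb.state = stateHalfOpen
	}
	return cb.state != stateOpen
}

// Record updates the state from a poll outcome. nextDelay is the backoff
// delay to wait before the next probe when the breaker (re)opens.
func (cb *CircuitBreaker) Record(success bool, now time.Time, nextDelay time.Duration) {
	if success {
		cb.state, cb.failures = stateClosed, 0
		return
	}
	cb.failures++
	if cb.state == stateHalfOpen || cb.failures >= 3 {
		cb.state = stateOpen
		cb.retryAt = now.Add(nextDelay)
	}
}
```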

Backoff configuration (illustrated in the sketch below):

- Initial delay: 5s
- Multiplier: ×2 per failure
- Jitter: ±20%
- Max delay: 5 minutes
- After 5 transient failures or any permanent failure, the task moves to the dead-letter queue for operator action.
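
The delay calculation might look like this, using the parameters listed above (5s initial delay, ×2 per failure, ±20% jitter, 5-minute cap); `internal/monitoring/backoff.go` may implement it differently.

```go
// Sketch of exponential backoff with jitter per the parameters above; not
// the literal code from internal/monitoring/backoff.go.
package monitoring

import (
	"math/rand"
	"time"
)

const (
	initialDelay = 5 * time.Second
	maxDelay     = 5 * time.Minute
	jitterFrac   = 0.20
)

// BackoffDelay returns the delay before retry number `failures` (1-based).
func BackoffDelay(failures int) time.Duration {
	delay := initialDelay
	for i := 1; i < failures; i++ {
		delay *= 2
		if delay >= maxDelay {
			delay = maxDelay
			break
		}
	}
	// Apply ±20% jitter so instances don't retry in lockstep.
	jitter := 1 + jitterFrac*(rand.Float64()*2-1)
	delay = time.Duration(float64(delay) * jitter)
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}
```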

## Dead-Letter Queue

Dead-letter entries are kept in memory (in the same `TaskQueue` structure) with a 30-minute recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
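
Continuing the queue sketch from earlier (same hypothetical `ScheduledTask`/`taskHeap` types, plus the standard `container/heap`, `log`, and `time` imports), dead-letter routing might look roughly like this; the function name and log format are illustrative.

```go
// Hypothetical helper routing a repeatedly failing task to the in-memory
// dead-letter queue with a 30-minute recheck; not the actual Pulse API.
func routeToDeadLetter(dlq *taskHeap, task *ScheduledTask, now time.Time, lastErr error) {
	const recheckInterval = 30 * time.Minute

	log.Printf("Routing task to dead-letter queue: instance=%s type=%s err=%v",
		task.Instance, task.Type, lastErr)

	task.NextRun = now.Add(recheckInterval) // revisit later for operator action
	heap.Push(dlq, task)
}
```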

## API Endpoints

### `GET /api/monitoring/scheduler/health`

Returns comprehensive scheduler health data (authentication required).

Response format:

```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": {
      "pve": 4,
      "pbs": 2,
      "pmg": 1
    }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```

Field descriptions:

- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Tasks due within 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
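
A small client sketch that fetches and decodes this response: the endpoint path and field names come from the example above, while the struct names, base URL, and bearer-token header are assumptions for illustration.

```go
// Sketch of a health-endpoint client; struct names, host, and auth scheme
// are placeholders, not the documented Pulse client API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type SchedulerHealth struct {
	UpdatedAt time.Time `json:"updatedAt"`
	Enabled   bool      `json:"enabled"`
	Queue     struct {
		Depth            int            `json:"depth"`
		DueWithinSeconds int            `json:"dueWithinSeconds"`
		PerType          map[string]int `json:"perType"`
	} `json:"queue"`
	DeadLetter struct {
		Count int `json:"count"`
	} `json:"deadLetter"`
}

func main() {
	base := "http://pulse.example.com" // placeholder host; adjust for your deployment
	req, _ := http.NewRequest(http.MethodGet, base+"/api/monitoring/scheduler/health", nil)
	req.Header.Set("Authorization", "Bearer <token>") // auth scheme is an assumption

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var health SchedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		panic(err)
	}
	fmt.Printf("enabled=%v queue depth=%d dead-letter=%d\n",
		health.Enabled, health.Queue.Depth, health.DeadLetter.Count)
}
```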

## Operational Guidance

1. Enable adaptive polling: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. Monitor metrics to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
3. Inspect scheduler health via the API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
4. Review dead-letter logs for persistent failures; resolve underlying connectivity or auth issues before re-enabling.

## Rollout Plan

  1. Dev/QA: Run hot-dev with feature flag enabled; observe metrics and logs for several cycles.
  2. Staged deploy: Enable flag on a subset of clusters; monitor queue depth (<50) and staleness (<45s).
  3. Full rollout: Toggle flag globally once metrics are stable; document any overrides in release notes.
  4. Post-launch: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work).

## Known Follow-ups

- Task 8: expose scheduler health & dead-letter statistics via API and UI panels.
- Task 9: add a dedicated unit/integration harness for the scheduler & workers.