# Adaptive Polling Architecture

## Overview
Pulse uses an adaptive polling scheduler that adjusts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.
```mermaid
flowchart LR
    Scheduler[Scheduler]
    Queue[Priority Queue<br/>by NextRun]
    Workers[Workers]
    Scheduler -->|schedule| Queue
    Queue -->|dequeue| Workers
    Workers -->|success| Scheduler
    Workers -->|failure| CB[Circuit Breaker]
    CB -->|backoff| Scheduler
```
- The scheduler computes `ScheduledTask` entries using adaptive intervals.
- The task queue is a min-heap keyed by `NextRun`; only due tasks execute.
- Workers execute tasks, capture outcomes, and reschedule via the scheduler or backoff logic.
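To make the queue behavior concrete, here is a minimal sketch of a min-heap keyed by `NextRun` using Go's `container/heap`. The `ScheduledTask` type name and `NextRun` field follow the text above; everything else (fields, the dispatch check) is illustrative, not the actual `task_queue.go` code:

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// ScheduledTask mirrors the fields the scheduler needs: an instance
// identifier and the time the next poll is due.
type ScheduledTask struct {
	Instance string
	NextRun  time.Time
}

// taskQueue is a min-heap ordered by NextRun, so the earliest-due
// task is always at the root.
type taskQueue []*ScheduledTask

func (q taskQueue) Len() int           { return len(q) }
func (q taskQueue) Less(i, j int) bool { return q[i].NextRun.Before(q[j].NextRun) }
func (q taskQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *taskQueue) Push(x any)        { *q = append(*q, x.(*ScheduledTask)) }
func (q *taskQueue) Pop() any {
	old := *q
	n := len(old)
	task := old[n-1]
	*q = old[:n-1]
	return task
}

func main() {
	q := &taskQueue{}
	heap.Init(q)
	now := time.Now()
	heap.Push(q, &ScheduledTask{Instance: "pve-core", NextRun: now.Add(10 * time.Second)})
	heap.Push(q, &ScheduledTask{Instance: "pbs-nas", NextRun: now.Add(-time.Second)}) // already due

	// Only tasks whose NextRun has passed are handed to workers.
	if next := (*q)[0]; !next.NextRun.After(now) {
		task := heap.Pop(q).(*ScheduledTask)
		fmt.Println("dispatching", task.Instance)
	}
}
```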
## Key Components
| Component | File | Responsibility |
|---|---|---|
| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. |
| Staleness tracker | `internal/monitoring/staleness_tracker.go` | Maintains freshness metadata and scores. |
| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. |
| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. |
| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. |
| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. |
## Configuration

**v4.24.0:** Adaptive polling is enabled by default and can be toggled without a restart.
### Via UI
Navigate to Settings → System → Monitoring to enable/disable adaptive polling. Changes apply immediately without requiring a restart.
### Via Environment Variables

Environment variables (defaults defined in `internal/config/config.go`):
| Variable | Default | Description |
|---|---|---|
| `ADAPTIVE_POLLING_ENABLED` | `true` | Changed in v4.24.0: now enabled by default |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Target cadence when the system is healthy |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Lower bound (active instances) |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Upper bound (idle instances) |
All settings persist in `system.json` and respond to environment overrides. Changes apply without restart when modified via the UI.
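For reference, a minimal sketch of how these variables could be resolved against the documented defaults with `time.ParseDuration`. The helper and struct names are assumptions for illustration, not the actual code in `internal/config/config.go`:

```go
package config

import (
	"os"
	"time"
)

// durationFromEnv reads an environment variable holding a Go duration
// string (e.g. "10s", "5m") and falls back to def when unset or invalid.
func durationFromEnv(key string, def time.Duration) time.Duration {
	if raw := os.Getenv(key); raw != "" {
		if d, err := time.ParseDuration(raw); err == nil {
			return d
		}
	}
	return def
}

// AdaptivePollingSettings collects the documented knobs and defaults.
type AdaptivePollingSettings struct {
	Enabled      bool
	BaseInterval time.Duration
	MinInterval  time.Duration
	MaxInterval  time.Duration
}

func LoadAdaptivePolling() AdaptivePollingSettings {
	return AdaptivePollingSettings{
		// Enabled unless explicitly set to "false" (default true as of v4.24.0).
		Enabled:      os.Getenv("ADAPTIVE_POLLING_ENABLED") != "false",
		BaseInterval: durationFromEnv("ADAPTIVE_POLLING_BASE_INTERVAL", 10*time.Second),
		MinInterval:  durationFromEnv("ADAPTIVE_POLLING_MIN_INTERVAL", 5*time.Second),
		MaxInterval:  durationFromEnv("ADAPTIVE_POLLING_MAX_INTERVAL", 5*time.Minute),
	}
}
```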
## Metrics

**v4.24.0:** Extended metrics for comprehensive monitoring.

Exposed via Prometheus (`:9091/metrics`):
| Metric | Type | Labels | Description |
|---|---|---|---|
| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error) |
| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance |
| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success) |
| `pulse_monitor_poll_queue_depth` | gauge | — | Size of priority queue |
| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type |
| `pulse_monitor_poll_errors_total` | counter | `instance_type`, `instance`, `category` | Error counts by category (transient/permanent) |
| `pulse_monitor_poll_last_success_timestamp` | gauge | `instance_type`, `instance` | Unix timestamp of last successful poll |
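As an illustration of how two of these series could be registered and updated with `prometheus/client_golang`, here is a sketch (not the project's actual registration code; the label sets follow the table above, the function name is assumed):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// pulse_monitor_poll_total: counts every poll attempt with its outcome.
	pollTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pulse_monitor_poll_total",
			Help: "Overall poll attempts (success/error).",
		},
		[]string{"instance_type", "instance", "result"},
	)

	// pulse_monitor_poll_staleness_seconds: age since the last successful poll.
	pollStaleness = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "pulse_monitor_poll_staleness_seconds",
			Help: "Age since last success (0 on success).",
		},
		[]string{"instance_type", "instance"},
	)
)

func init() {
	prometheus.MustRegister(pollTotal, pollStaleness)
}

// RecordSuccess updates both series after a successful poll.
func RecordSuccess(instanceType, instance string) {
	pollTotal.WithLabelValues(instanceType, instance, "success").Inc()
	pollStaleness.WithLabelValues(instanceType, instance).Set(0)
}
```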
**Alerting recommendations:**

- Alert when `pulse_monitor_poll_staleness_seconds` > 120 for critical instances.
- Alert when `pulse_monitor_poll_queue_depth` > 50 (backlog building).
- Alert when `pulse_monitor_poll_errors_total` with `category=permanent` increases (auth/config issues).
## Circuit Breaker & Backoff
| State | Trigger | Recovery |
|---|---|---|
| Closed | Default. Failures counted. | — |
| Open | ≥3 consecutive failures. Poll suppressed. | Exponential delay (max 5 min). |
| Half-open | Retry window elapsed. Limited re-attempt. | Success ⇒ closed. Failure ⇒ open. |
```mermaid
stateDiagram-v2
    [*] --> Closed: Startup / reset
    Closed: Default state\nPolling active\nFailure counter increments
    Closed --> Open: ≥3 consecutive failures
    Open: Polls suppressed\nScheduler schedules backoff (max 5m)
    Open --> HalfOpen: Retry window elapsed
    HalfOpen: Single probe allowed\nBreaker watches probe result
    HalfOpen --> Closed: Probe success\nReset failure streak & delay
    HalfOpen --> Open: Probe failure\nIncrease streak & backoff
```
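A condensed Go sketch of this three-state machine (illustrative only; the three-failure threshold, 5-second initial delay, and 5-minute cap come from this section, while the type and method names are assumptions):

```go
package breaker

import "time"

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Breaker trips open after three consecutive failures and allows a
// single probe once the retry window has elapsed.
type Breaker struct {
	state    State
	failures int
	retryAt  time.Time
	delay    time.Duration
}

// Allow reports whether a poll may proceed right now.
func (b *Breaker) Allow(now time.Time) bool {
	if b.state == Open && now.After(b.retryAt) {
		b.state = HalfOpen // retry window elapsed: permit one probe
	}
	return b.state != Open
}

// Record updates the breaker with a poll outcome.
func (b *Breaker) Record(success bool, now time.Time) {
	if success {
		b.state, b.failures, b.delay = Closed, 0, 0 // reset streak & delay
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= 3 {
		b.state = Open
		if b.delay == 0 {
			b.delay = 5 * time.Second
		} else {
			b.delay *= 2 // exponential growth per failure
		}
		if b.delay > 5*time.Minute {
			b.delay = 5 * time.Minute
		}
		b.retryAt = now.Add(b.delay)
	}
}
```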
Backoff configuration:

- Initial delay: 5s
- Multiplier: ×2 per failure
- Jitter: ±20%
- Max delay: 5 minutes
- After 5 transient failures or any permanent failure, the task moves to the dead-letter queue for operator action.
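Putting those parameters together, the delay for the n-th consecutive failure can be computed as in the sketch below (a direct rendering of the documented values, not the code in `internal/monitoring/backoff.go`):

```go
package backoff

import (
	"math/rand"
	"time"
)

const (
	initialDelay = 5 * time.Second
	multiplier   = 2.0
	jitter       = 0.20 // ±20%
	maxDelay     = 5 * time.Minute
)

// Delay returns the retry delay for the given consecutive-failure count
// (attempt 1 → ~5s, attempt 2 → ~10s, ...), capped at maxDelay.
func Delay(attempt int) time.Duration {
	d := float64(initialDelay)
	for i := 1; i < attempt; i++ {
		d *= multiplier
	}
	// Apply ±20% jitter so retries from many instances don't align.
	d *= 1 + jitter*(2*rand.Float64()-1)
	if d > float64(maxDelay) {
		d = float64(maxDelay)
	}
	return time.Duration(d)
}
```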
## Dead-Letter Queue

Dead-letter entries are kept in memory (the same `TaskQueue` structure) with a 30-minute recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
## API Endpoints

### `GET /api/monitoring/scheduler/health`
Returns comprehensive scheduler health data (authentication required).
Response format:
```json
{
  "updatedAt": "2025-03-21T18:05:00Z",
  "enabled": true,
  "queue": {
    "depth": 7,
    "dueWithinSeconds": 2,
    "perType": {
      "pve": 4,
      "pbs": 2,
      "pmg": 1
    }
  },
  "deadLetter": {
    "count": 2,
    "tasks": [
      {
        "instance": "pbs-nas",
        "type": "pbs",
        "nextRun": "2025-03-21T18:25:00Z",
        "lastError": "connection timeout",
        "failures": 7
      }
    ]
  },
  "breakers": [
    {
      "instance": "pve-core",
      "type": "pve",
      "state": "half_open",
      "failures": 3,
      "retryAt": "2025-03-21T18:05:45Z"
    }
  ],
  "staleness": [
    {
      "instance": "pve-core",
      "type": "pve",
      "score": 0.12,
      "lastSuccess": "2025-03-21T18:04:50Z"
    }
  ]
}
```
Field descriptions:
- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Tasks due within 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
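A minimal Go client for this endpoint might look like the following sketch. The struct covers only a subset of the fields above, and the base URL plus the bearer-token `Authorization` header are assumptions; substitute whatever host and authentication your deployment uses:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// SchedulerHealth models a subset of the health response shown above.
type SchedulerHealth struct {
	Enabled bool `json:"enabled"`
	Queue   struct {
		Depth            int            `json:"depth"`
		DueWithinSeconds int            `json:"dueWithinSeconds"`
		PerType          map[string]int `json:"perType"`
	} `json:"queue"`
	DeadLetter struct {
		Count int `json:"count"`
	} `json:"deadLetter"`
}

func main() {
	// Hypothetical base URL; adjust host/port for your deployment.
	req, err := http.NewRequest("GET", "http://localhost:7655/api/monitoring/scheduler/health", nil)
	if err != nil {
		panic(err)
	}
	// Hypothetical token auth; authentication is required, mechanism may differ.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("PULSE_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var health SchedulerHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		panic(err)
	}
	fmt.Printf("queue depth: %d, dead-letter: %d\n", health.Queue.Depth, health.DeadLetter.Count)
}
```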
## Operational Guidance

- Enable adaptive polling: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
- Monitor metrics to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
- Inspect scheduler health via the `/api/monitoring/scheduler/health` endpoint for circuit breaker trips and dead-letter queue status.
- Review dead-letter logs for persistent failures; resolve underlying connectivity or auth issues before re-enabling.
## Rollout Plan
- Dev/QA: Run hot-dev with feature flag enabled; observe metrics and logs for several cycles.
- Staged deploy: Enable flag on a subset of clusters; monitor queue depth (<50) and staleness (<45s).
- Full rollout: Toggle flag globally once metrics are stable; document any overrides in release notes.
- Post-launch: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work).
## Known Follow-ups
- Task 8: expose scheduler health & dead-letter statistics via API and UI panels.
- Task 9: add dedicated unit/integration harness for the scheduler & workers.