mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
This commit implements critical reliability features to prevent data loss and improve alert system robustness:

**Persistent Notification Queue:**
- SQLite-backed queue with WAL journaling for crash recovery
- Dead Letter Queue (DLQ) for notifications that exhaust retries
- Exponential backoff retry logic (100ms → 200ms → 400ms)
- Full audit trail for all notification delivery attempts
- New file: internal/notifications/queue.go (661 lines)

**DLQ Management API:**
- GET /api/notifications/dlq - Retrieve DLQ items
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry failed notifications
- POST /api/notifications/dlq/delete - Delete DLQ items
- New file: internal/api/notification_queue.go (145 lines)

**Prometheus Metrics:**
- 18 comprehensive metrics for alerts and notifications
- Metric hooks integrated via function pointers to avoid import cycles
- /metrics endpoint exposed for Prometheus scraping
- New file: internal/metrics/alert_metrics.go (193 lines)

**Alert History Reliability:**
- Exponential backoff retry for history saves (3 attempts)
- Automatic backup restoration on write failure
- Modified: internal/alerts/history.go

**Flapping Detection:**
- Detects and suppresses rapidly oscillating alerts
- Configurable window (default: 5 minutes)
- Configurable threshold (default: 5 state changes)
- Configurable cooldown (default: 15 minutes)
- Automatic cleanup of inactive flapping history

**Alert TTL & Auto-Cleanup:**
- MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days)
- MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day)
- AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours)
- Prevents memory leaks from long-running alerts

**WebSocket Broadcast Sequencer:**
- Channel-based sequencing ensures ordered message delivery
- 100ms coalescing window for rapid state updates
- Prevents race conditions in WebSocket broadcasts
- Modified: internal/websocket/hub.go

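For readers unfamiliar with the retry schedule listed for the notification queue (100ms → 200ms → 400ms), the following is a minimal sketch of doubling backoff in Go. It is not the code from internal/notifications/queue.go: the names `backoffDelay`, `deliverWithRetry`, and `errUnavailable` are hypothetical, and the sketch assumes one initial attempt followed by up to `maxRetries` retries, with the wait doubling before each retry.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// baseDelay is the first wait in the schedule from the commit message.
const baseDelay = 100 * time.Millisecond

// errUnavailable is a hypothetical transient error used for illustration.
var errUnavailable = errors.New("endpoint unavailable")

// backoffDelay returns the wait before the given retry (0-based):
// retry 0 waits 100ms, retry 1 waits 200ms, retry 2 waits 400ms.
func backoffDelay(retry int) time.Duration {
	return baseDelay << uint(retry)
}

// deliverWithRetry calls send once and then retries up to maxRetries times,
// sleeping with doubling delays between attempts. It returns the number of
// attempts made and the last error if every attempt failed.
func deliverWithRetry(maxRetries int, send func() error) (int, error) {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attempt > 0 {
			time.Sleep(backoffDelay(attempt - 1))
		}
		if err = send(); err == nil {
			return attempt + 1, nil
		}
	}
	return maxRetries + 1, fmt.Errorf("retries exhausted after %d attempts: %w", maxRetries+1, err)
}

func main() {
	calls := 0
	// Simulated sender: fails twice, then succeeds on the third attempt.
	attempts, err := deliverWithRetry(3, func() error {
		calls++
		if calls < 3 {
			return errUnavailable
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "error:", err)
}
```

In a queue like the one described, a notification whose error is still non-nil after the last retry would presumably be moved to the DLQ rather than dropped; that step is omitted here.
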
**Configuration Fields Added:**
- FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes
- MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours

All features are production-ready and build successfully.
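
To illustrate how the flapping fields might interact, here is a hedged sketch of a sliding-window detector in Go. It is an assumption about the general technique, not the implementation in this commit: the `FlappingConfig` struct, `flapDetector` type, and `RecordChange` method are hypothetical, only the four field names and their defaults come from the message above, and timestamps are plain Unix seconds to keep the example self-contained.

```go
package main

import "fmt"

// FlappingConfig groups the flapping fields named in the commit message.
// The struct itself is hypothetical.
type FlappingConfig struct {
	FlappingEnabled         bool
	FlappingWindowSeconds   int
	FlappingThreshold       int
	FlappingCooldownMinutes int
}

// defaultFlappingConfig uses the defaults listed in the commit message:
// 5 minute window, 5 state changes, 15 minute cooldown.
func defaultFlappingConfig() FlappingConfig {
	return FlappingConfig{true, 300, 5, 15}
}

type flapDetector struct {
	cfg             FlappingConfig
	changes         map[string][]int64 // alert ID -> recent state-change times
	suppressedUntil map[string]int64   // alert ID -> end of cooldown
}

func newFlapDetector(cfg FlappingConfig) *flapDetector {
	return &flapDetector{cfg, map[string][]int64{}, map[string]int64{}}
}

// RecordChange notes a state change at nowUnix and reports whether the alert
// should be suppressed as flapping.
func (d *flapDetector) RecordChange(alertID string, nowUnix int64) bool {
	if !d.cfg.FlappingEnabled {
		return false
	}
	if until, ok := d.suppressedUntil[alertID]; ok {
		if nowUnix < until {
			return true // still cooling down
		}
		// Cooldown over: forget old history and start fresh.
		delete(d.suppressedUntil, alertID)
		delete(d.changes, alertID)
	}
	// Keep only the changes that fall inside the window, then add this one.
	cutoff := nowUnix - int64(d.cfg.FlappingWindowSeconds)
	recent := d.changes[alertID][:0]
	for _, t := range d.changes[alertID] {
		if t > cutoff {
			recent = append(recent, t)
		}
	}
	recent = append(recent, nowUnix)
	d.changes[alertID] = recent
	if len(recent) >= d.cfg.FlappingThreshold {
		d.suppressedUntil[alertID] = nowUnix + int64(d.cfg.FlappingCooldownMinutes)*60
		return true
	}
	return false
}

func main() {
	d := newFlapDetector(defaultFlappingConfig())
	// Five state changes ten seconds apart: the fifth crosses the threshold.
	for i := int64(0); i < 5; i++ {
		fmt.Println("t =", i*10, "suppressed:", d.RecordChange("node1-cpu", i*10))
	}
}
```

Resetting the history when the cooldown ends is one possible design choice; a real detector could equally keep the window running through the cooldown.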