From a1dc451ed460a418d5604f4bab74d9ba1376ab9e Mon Sep 17 00:00:00 2001
From: rcourtman
Date: Thu, 6 Nov 2025 17:34:05 +0000
Subject: [PATCH] Document alert reliability features and DLQ API

Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
---
 docs/API.md           | 203 ++++++++++++++++++++++++++++++++++++++++++
 docs/CONFIGURATION.md |  40 +++++++++
 2 files changed, 243 insertions(+)

diff --git a/docs/API.md b/docs/API.md
index 09ae285f0..c71cf089c 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -702,6 +702,111 @@ curl -X POST http://localhost:7655/api/notifications/webhooks/test \
   }'
 ```
 
+### Notification Queue & Dead Letter Queue (DLQ)
+
+Pulse includes a persistent notification queue with retry logic and a Dead Letter Queue for failed notifications. This ensures notification reliability and provides visibility into delivery failures.
+
+#### Queue Statistics
+Get current queue statistics including pending, processing, completed, and failed notification counts.
+
+```bash
+GET /api/notifications/queue/stats
+```
+
+**Response:**
+```json
+{
+  "pending": 3,
+  "processing": 1,
+  "completed": 245,
+  "failed": 2,
+  "dlq": 2,
+  "oldestPending": "2024-11-06T12:30:00Z",
+  "queueDepth": 4
+}
+```
+
+#### Get Dead Letter Queue
+Retrieve notifications that have exhausted all retry attempts. These require manual intervention.
+
+```bash
+GET /api/notifications/dlq?limit=100
+```
+
+**Query Parameters:**
+- `limit` (optional): Maximum number of DLQ items to return (default: 100, max: 1000)
+
+**Response:**
+```json
+[
+  {
+    "id": "email-1699283400000",
+    "type": "email",
+    "status": "dlq",
+    "alerts": [...],
+    "attempts": 3,
+    "maxAttempts": 3,
+    "lastAttempt": "2024-11-06T12:35:00Z",
+    "lastError": "SMTP connection timeout",
+    "createdAt": "2024-11-06T12:30:00Z"
+  }
+]
+```
+
+#### Retry DLQ Item
+Retry a failed notification from the Dead Letter Queue.
+
+```bash
+POST /api/notifications/dlq/retry
+Content-Type: application/json
+
+{
+  "id": "email-1699283400000"
+}
+```
+
+**Response:**
+```json
+{
+  "success": true,
+  "message": "Notification scheduled for retry",
+  "id": "email-1699283400000"
+}
+```
+
+#### Delete DLQ Item
+Permanently remove a notification from the Dead Letter Queue.
+
+```bash
+POST /api/notifications/dlq/delete
+Content-Type: application/json
+
+{
+  "id": "email-1699283400000"
+}
+```
+
+Or using the DELETE method:
+```bash
+DELETE /api/notifications/dlq/delete
+Content-Type: application/json
+
+{
+  "id": "email-1699283400000"
+}
+```
+
+**Response:**
+```json
+{
+  "success": true,
+  "message": "DLQ item deleted",
+  "id": "email-1699283400000"
+}
+```
+
+**Note:** All notification queue endpoints require admin authentication.
+
 ### Alert Management
 
 Comprehensive alert management system.
@@ -1134,6 +1239,104 @@ GET /simple-stats
 ```
 
 Returns simplified metrics without authentication requirements.
 
+## Prometheus Metrics
+
+Pulse exposes Prometheus-compatible metrics for monitoring the monitoring system itself. These metrics provide observability into alert system health, notification delivery, and queue performance.
+
+### Metrics Endpoint
+
+```bash
+GET /metrics
+```
+
+**Authentication:** None required (public endpoint)
+
+**Response Format:** Prometheus text exposition format
+
+### Available Metrics
+
+#### Alert Metrics
+
+- **`pulse_alerts_active`** (Gauge) - Number of currently active alerts
+  - Labels: `level` (info/warning/critical), `type` (cpu/memory/disk/etc)
+
+- **`pulse_alerts_fired_total`** (Counter) - Total number of alerts fired
+  - Labels: `level`, `type`
+
+- **`pulse_alerts_resolved_total`** (Counter) - Total number of alerts resolved
+  - Labels: `type`
+
+- **`pulse_alerts_acknowledged_total`** (Counter) - Total number of alerts acknowledged
+
+- **`pulse_alerts_suppressed_total`** (Counter) - Total number of alerts suppressed
+  - Labels: `reason` (quiet_hours/flapping/rate_limit)
+
+- **`pulse_alert_duration_seconds`** (Histogram) - Duration alerts remain active before resolution
+  - Labels: `type`
+
+#### Notification Metrics
+
+- **`pulse_notifications_sent_total`** (Counter) - Total notifications sent
+  - Labels: `method` (email/webhook/apprise), `status` (success/failed)
+
+- **`pulse_notification_queue_depth`** (Gauge) - Number of queued notifications
+  - Labels: `status` (pending/processing/dlq)
+
+- **`pulse_notification_dlq_total`** (Counter) - Total notifications moved to the Dead Letter Queue
+
+- **`pulse_notification_retry_total`** (Counter) - Total notification retry attempts
+
+- **`pulse_notification_duration_seconds`** (Histogram) - Time to deliver notifications
+  - Labels: `method`
+
+#### Queue Metrics
+
+- **`pulse_queue_depth`** (Gauge) - Current queue depth by status
+  - Labels: `status`
+
+- **`pulse_queue_items_total`** (Counter) - Total items processed by the queue
+  - Labels: `status` (completed/failed/dlq)
+
+- **`pulse_queue_processing_duration_seconds`** (Histogram) - Time to process queued items
+
+#### System Metrics
+
+- **`pulse_history_save_errors_total`** (Counter) - Total alert history save failures
+
+- **`pulse_history_save_retries_total`** (Counter) - Total history save retry attempts
+
+### Example Prometheus Configuration
+
+```yaml
+scrape_configs:
+  - job_name: 'pulse'
+    static_configs:
+      - targets: ['pulse.example.com:7655']
+    metrics_path: '/metrics'
+    scrape_interval: 30s
+```
+
+### Example PromQL Queries
+
+```promql
+# Alert rate per minute
+rate(pulse_alerts_fired_total[5m]) * 60
+
+# Notification success rate
+rate(pulse_notifications_sent_total{status="success"}[5m]) /
+rate(pulse_notifications_sent_total[5m])
+
+# DLQ growth rate
+rate(pulse_notification_dlq_total[1h])
+
+# Active alerts by severity
+sum by (level) (pulse_alerts_active)
+
+# Average notification delivery time
+rate(pulse_notification_duration_seconds_sum[5m]) /
+rate(pulse_notification_duration_seconds_count[5m])
+```
+
 ## Rate Limiting
 
 **v4.24.0:** All responses include rate limit headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`). 429 responses add `Retry-After`.
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
index a2db6b74a..428a9bfce 100644
--- a/docs/CONFIGURATION.md
+++ b/docs/CONFIGURATION.md
@@ -361,6 +361,46 @@ In the Alerts page, the "Global Defaults" row for each resource table shows an e
 | Temperature | 5+ minutes | Fans need time to ramp up; short spikes are normal |
 | Restart Count | 10-30 seconds | Container crashes need immediate attention |
 
+#### Alert Reliability Configuration
+
+Pulse includes advanced reliability features to prevent data loss and manage long-running alerts:
+
+**Alert TTL (Time-To-Live):**
+```json
+{
+  "maxAlertAgeDays": 7,
+  "maxAcknowledgedAgeDays": 1,
+  "autoAcknowledgeAfterHours": 24
+}
+```
+
+- **`maxAlertAgeDays`** (default: `7`): Automatically removes unacknowledged alerts older than this many days. Prevents memory leaks from persistent issues. Set to `0` to disable.
+- **`maxAcknowledgedAgeDays`** (default: `1`): Faster cleanup for acknowledged alerts, since they have already been reviewed. Set to `0` to disable.
+- **`autoAcknowledgeAfterHours`** (default: `24`): Automatically acknowledges alerts that remain active for this duration. Useful for expected long-running conditions. Set to `0` to disable.
+
+**Flapping Detection:**
+```json
+{
+  "flappingEnabled": true,
+  "flappingWindowSeconds": 300,
+  "flappingThreshold": 5,
+  "flappingCooldownMinutes": 15
+}
+```
+
+- **`flappingEnabled`** (default: `true`): Enables detection of rapidly oscillating alerts.
+- **`flappingWindowSeconds`** (default: `300`): Time window, in seconds, over which state changes are tracked (the default is 5 minutes).
+- **`flappingThreshold`** (default: `5`): Number of state changes within the window that triggers suppression.
+- **`flappingCooldownMinutes`** (default: `15`): How long, in minutes, to suppress a flapping alert.
+
+When an alert flaps (rapid on/off cycling), Pulse automatically suppresses it to prevent notification storms. The suppression lasts for the cooldown period, after which the alert can fire normally again.
+
+**Common Flapping Scenarios:**
+- Network instability causing intermittent connectivity
+- Resource usage hovering around a threshold (use hysteresis instead)
+- Misconfigured health checks with tight timing
+- Container restart loops (use Docker restart alerts instead)
+
 > Tip: Back up `alerts.json` alongside `.env` during exports. Restoring it preserves all overrides, quiet-hour schedules, and webhook routing.
 
 ### `pulse-sensor-proxy/config.yaml`
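
The flapping options documented in this patch reduce to a simple sliding-window rule: once `flappingThreshold` state changes occur within `flappingWindowSeconds`, the alert is suppressed for `flappingCooldownMinutes`. The sketch below illustrates that rule only; it is not taken from Pulse's source (which is written in Go), and the `FlappingDetector` class is invented for the example.

```python
from collections import deque

class FlappingDetector:
    """Sliding-window flap detector matching the documented semantics.

    Illustrative sketch only: suppress an alert once `threshold` state
    changes occur within `window_seconds`, for `cooldown_minutes`.
    """

    def __init__(self, window_seconds=300, threshold=5, cooldown_minutes=15):
        self.window = window_seconds
        self.threshold = threshold
        self.cooldown = cooldown_minutes * 60
        self.changes = deque()       # timestamps of recent state changes
        self.suppressed_until = 0.0  # epoch seconds; 0 means not suppressed

    def record_change(self, now):
        """Record a state change at `now` (seconds); return True if the
        alert should be suppressed as flapping."""
        if now < self.suppressed_until:
            return True  # still inside the cooldown period
        self.changes.append(now)
        # Discard changes that have slid out of the window.
        while self.changes and now - self.changes[0] > self.window:
            self.changes.popleft()
        if len(self.changes) >= self.threshold:
            self.suppressed_until = now + self.cooldown
            self.changes.clear()
            return True
        return False

# With the defaults, four rapid transitions pass through and the
# fifth starts a 15-minute cooldown.
d = FlappingDetector()
print([d.record_change(t) for t in (0, 10, 20, 30, 40)])
# [False, False, False, False, True]
```

A production implementation would also track one detector per alert key and skip detection entirely when `flappingEnabled` is false; those details are omitted here.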