From a1dc451ed460a418d5604f4bab74d9ba1376ab9e Mon Sep 17 00:00:00 2001
From: rcourtman
Date: Thu, 6 Nov 2025 17:34:05 +0000
Subject: [PATCH] Document alert reliability features and DLQ API

Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
---
 docs/API.md           | 203 ++++++++++++++++++++++++++++++++++++++++++
 docs/CONFIGURATION.md |  40 +++++++++
 2 files changed, 243 insertions(+)

diff --git a/docs/API.md b/docs/API.md
index 09ae285f0..c71cf089c 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -702,6 +702,111 @@ curl -X POST http://localhost:7655/api/notifications/webhooks/test \
   }'
 ```
 
+### Notification Queue & Dead Letter Queue (DLQ)
+
+Pulse includes a persistent notification queue with retry logic and a Dead Letter Queue for failed notifications. This ensures notification reliability and provides visibility into delivery failures.
+
+#### Queue Statistics
+Get current queue statistics including pending, processing, completed, and failed notification counts.
+
+```bash
+GET /api/notifications/queue/stats
+```
+
+**Response:**
+```json
+{
+  "pending": 3,
+  "processing": 1,
+  "completed": 245,
+  "failed": 2,
+  "dlq": 2,
+  "oldestPending": "2024-11-06T12:30:00Z",
+  "queueDepth": 4
+}
+```
+
+#### Get Dead Letter Queue
+Retrieve notifications that have exhausted all retry attempts. These require manual intervention.
+
+```bash
+GET /api/notifications/dlq?limit=100
+```
+
+**Query Parameters:**
+- `limit` (optional): Maximum number of DLQ items to return (default: 100, max: 1000)
+
+**Response:**
+```json
+[
+  {
+    "id": "email-1699283400000",
+    "type": "email",
+    "status": "dlq",
+    "alerts": [...],
+    "attempts": 3,
+    "maxAttempts": 3,
+    "lastAttempt": "2024-11-06T12:35:00Z",
+    "lastError": "SMTP connection timeout",
+    "createdAt": "2024-11-06T12:30:00Z"
+  }
+]
+```
+
+#### Retry DLQ Item
+Retry a failed notification from the Dead Letter Queue.
+
+```bash
+POST /api/notifications/dlq/retry
+Content-Type: application/json
+
+{
+  "id": "email-1699283400000"
+}
+```
+
+**Response:**
+```json
+{
+  "success": true,
+  "message": "Notification scheduled for retry",
+  "id": "email-1699283400000"
+}
+```
+
+#### Delete DLQ Item
+Permanently remove a notification from the Dead Letter Queue.
+
+```bash
+POST /api/notifications/dlq/delete
+Content-Type: application/json
+
+{
+  "id": "email-1699283400000"
+}
+```
+
+Or using the DELETE method:
+```bash
+DELETE /api/notifications/dlq/delete
+Content-Type: application/json
+
+{
+  "id": "email-1699283400000"
+}
+```
+
+**Response:**
+```json
+{
+  "success": true,
+  "message": "DLQ item deleted",
+  "id": "email-1699283400000"
+}
+```
+
+**Note:** All notification queue endpoints require admin authentication.
+
 ### Alert Management
 
 Comprehensive alert management system.
@@ -1134,6 +1239,104 @@ GET /simple-stats
 ```
 
 Returns simplified metrics without authentication requirements.
 
+## Prometheus Metrics
+
+Pulse exposes Prometheus-compatible metrics for monitoring the monitoring system itself. These metrics provide observability into alert system health, notification delivery, and queue performance.
+
+### Metrics Endpoint
+
+```bash
+GET /metrics
+```
+
+**Authentication:** None required (public endpoint)
+
+**Response Format:** Prometheus text exposition format
+
+### Available Metrics
+
+#### Alert Metrics
+
+- **`pulse_alerts_active`** (Gauge) - Number of currently active alerts
+  - Labels: `level` (info/warning/critical), `type` (cpu/memory/disk/etc)
+
+- **`pulse_alerts_fired_total`** (Counter) - Total number of alerts fired
+  - Labels: `level`, `type`
+
+- **`pulse_alerts_resolved_total`** (Counter) - Total number of alerts resolved
+  - Labels: `type`
+
+- **`pulse_alerts_acknowledged_total`** (Counter) - Total number of alerts acknowledged
+
+- **`pulse_alerts_suppressed_total`** (Counter) - Total number of alerts suppressed
+  - Labels: `reason` (quiet_hours/flapping/rate_limit)
+
+- **`pulse_alert_duration_seconds`** (Histogram) - Duration alerts remain active before resolution
+  - Labels: `type`
+
+#### Notification Metrics
+
+- **`pulse_notifications_sent_total`** (Counter) - Total notifications sent
+  - Labels: `method` (email/webhook/apprise), `status` (success/failed)
+
+- **`pulse_notification_queue_depth`** (Gauge) - Number of queued notifications
+  - Labels: `status` (pending/processing/dlq)
+
+- **`pulse_notification_dlq_total`** (Counter) - Total notifications moved to the Dead Letter Queue
+
+- **`pulse_notification_retry_total`** (Counter) - Total notification retry attempts
+
+- **`pulse_notification_duration_seconds`** (Histogram) - Time to deliver notifications
+  - Labels: `method`
+
+#### Queue Metrics
+
+- **`pulse_queue_depth`** (Gauge) - Current queue depth by status
+  - Labels: `status`
+
+- **`pulse_queue_items_total`** (Counter) - Total items processed by the queue
+  - Labels: `status` (completed/failed/dlq)
+
+- **`pulse_queue_processing_duration_seconds`** (Histogram) - Time to process queued items
+
+#### System Metrics
+
+- **`pulse_history_save_errors_total`** (Counter) - Total alert history save failures
+
+- **`pulse_history_save_retries_total`** (Counter) - Total history save retry attempts
+
+### Example Prometheus Configuration
+
+```yaml
+scrape_configs:
+  - job_name: 'pulse'
+    static_configs:
+      - targets: ['pulse.example.com:7655']
+    metrics_path: '/metrics'
+    scrape_interval: 30s
+```
+
+### Example PromQL Queries
+
+```promql
+# Alert rate per minute
+rate(pulse_alerts_fired_total[5m]) * 60
+
+# Notification success rate
+rate(pulse_notifications_sent_total{status="success"}[5m]) /
+rate(pulse_notifications_sent_total[5m])
+
+# DLQ growth rate
+rate(pulse_notification_dlq_total[1h])
+
+# Active alerts by severity
+sum by (level) (pulse_alerts_active)
+
+# Average notification delivery time
+rate(pulse_notification_duration_seconds_sum[5m]) /
+rate(pulse_notification_duration_seconds_count[5m])
+```
+
 ## Rate Limiting
 
 **v4.24.0:** All responses include rate limit headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`). 429 responses add `Retry-After`.
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
index a2db6b74a..428a9bfce 100644
--- a/docs/CONFIGURATION.md
+++ b/docs/CONFIGURATION.md
@@ -361,6 +361,46 @@ In the Alerts page, the "Global Defaults" row for each resource table shows an e
 | Temperature | 5+ minutes | Fans need time to ramp up; short spikes are normal |
 | Restart Count | 10-30 seconds | Container crashes need immediate attention |
 
+#### Alert Reliability Configuration
+
+Pulse includes advanced reliability features to prevent data loss and manage long-running alerts:
+
+**Alert TTL (Time-To-Live):**
+```json
+{
+  "maxAlertAgeDays": 7,
+  "maxAcknowledgedAgeDays": 1,
+  "autoAcknowledgeAfterHours": 24
+}
+```
+
+- **`maxAlertAgeDays`** (default: `7`): Automatically removes unacknowledged alerts older than this many days. Prevents memory leaks from persistent issues. Set to `0` to disable.
+- **`maxAcknowledgedAgeDays`** (default: `1`): Faster cleanup for acknowledged alerts, since they have already been reviewed. Set to `0` to disable.
+- **`autoAcknowledgeAfterHours`** (default: `24`): Automatically acknowledges alerts that remain active for this duration. Useful for expected long-running conditions. Set to `0` to disable.
+
+**Flapping Detection:**
+```json
+{
+  "flappingEnabled": true,
+  "flappingWindowSeconds": 300,
+  "flappingThreshold": 5,
+  "flappingCooldownMinutes": 15
+}
+```
+
+- **`flappingEnabled`** (default: `true`): Enables detection of rapidly oscillating alerts.
+- **`flappingWindowSeconds`** (default: `300`): Time window, in seconds, over which state changes are tracked (the default is 5 minutes).
+- **`flappingThreshold`** (default: `5`): Number of state changes within the window that triggers suppression.
+- **`flappingCooldownMinutes`** (default: `15`): How long, in minutes, to suppress a flapping alert.
+
+When an alert flaps (rapid on/off cycling), Pulse automatically suppresses it to prevent notification storms. The suppression lasts for the cooldown period, after which the alert can fire normally again.
+
+**Common Flapping Scenarios:**
+- Network instability causing intermittent connectivity
+- Resource usage hovering around a threshold (use hysteresis instead)
+- Misconfigured health checks with tight timing
+- Container restart loops (use Docker restart alerts instead)
+
 > Tip: Back up `alerts.json` alongside `.env` during exports. Restoring it preserves all overrides, quiet-hour schedules, and webhook routing.
 
 ### `pulse-sensor-proxy/config.yaml`
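
The flapping options documented in this patch reduce to a simple sliding-window rule: once `flappingThreshold` state changes occur within `flappingWindowSeconds`, the alert is suppressed for `flappingCooldownMinutes`. The sketch below illustrates that rule only; it is not taken from Pulse's source (which is written in Go), and the `FlappingDetector` class is invented for the example.

```python
from collections import deque

class FlappingDetector:
    """Sliding-window flap detector matching the documented semantics.

    Illustrative sketch only: suppress an alert once `threshold` state
    changes occur within `window_seconds`, for `cooldown_minutes`.
    """

    def __init__(self, window_seconds=300, threshold=5, cooldown_minutes=15):
        self.window = window_seconds
        self.threshold = threshold
        self.cooldown = cooldown_minutes * 60
        self.changes = deque()       # timestamps of recent state changes
        self.suppressed_until = 0.0  # epoch seconds; 0 means not suppressed

    def record_change(self, now):
        """Record a state change at `now` (seconds); return True if the
        alert should be suppressed as flapping."""
        if now < self.suppressed_until:
            return True  # still inside the cooldown period
        self.changes.append(now)
        # Discard changes that have slid out of the window.
        while self.changes and now - self.changes[0] > self.window:
            self.changes.popleft()
        if len(self.changes) >= self.threshold:
            self.suppressed_until = now + self.cooldown
            self.changes.clear()
            return True
        return False

# With the defaults, four rapid transitions pass through and the
# fifth starts a 15-minute cooldown.
d = FlappingDetector()
print([d.record_change(t) for t in (0, 10, 20, 30, 40)])
# [False, False, False, False, True]
```

A production implementation would also track one detector per alert key and skip detection entirely when `flappingEnabled` is false; those details are omitted here.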