mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
Document alert reliability features and DLQ API
Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
203
docs/API.md
@@ -702,6 +702,111 @@ curl -X POST http://localhost:7655/api/notifications/webhooks/test \
}'
```

### Notification Queue & Dead Letter Queue (DLQ)

Pulse includes a persistent notification queue with retry logic and a Dead Letter Queue for failed notifications. This ensures notification reliability and provides visibility into delivery failures.

#### Queue Statistics
Get current queue statistics including pending, processing, completed, and failed notification counts.

```bash
GET /api/notifications/queue/stats
```

**Response:**
```json
{
  "pending": 3,
  "processing": 1,
  "completed": 245,
  "failed": 2,
  "dlq": 2,
  "oldestPending": "2024-11-06T12:30:00Z",
  "queueDepth": 4
}
```
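
As a rough illustration of how a client might act on this payload, the sketch below turns a stats response into human-readable warnings. The function names and the 300-second staleness limit are hypothetical choices for the example, not part of Pulse; only the field names (`dlq`, `oldestPending`) come from the response above.

```python
from datetime import datetime, timezone
from typing import List, Optional


def oldest_pending_age_seconds(stats: dict, now: datetime) -> Optional[float]:
    """Return the age of the oldest pending notification, or None if absent."""
    raw = stats.get("oldestPending")
    if not raw:
        return None
    # The sample response uses a trailing "Z"; normalise it for fromisoformat.
    oldest = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return (now - oldest).total_seconds()


def queue_warnings(stats: dict, now: datetime, max_age_seconds: float = 300.0) -> List[str]:
    """Build warnings from a queue stats payload (thresholds are assumptions)."""
    warnings = []
    if stats.get("dlq", 0) > 0:
        warnings.append(f"{stats['dlq']} notification(s) in the DLQ")
    age = oldest_pending_age_seconds(stats, now)
    if age is not None and age > max_age_seconds:
        warnings.append(f"oldest pending notification is {age:.0f}s old")
    return warnings
```

Passing `now` explicitly keeps the helper deterministic; a caller would typically supply `datetime.now(timezone.utc)`.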

#### Get Dead Letter Queue
Retrieve notifications that have exhausted all retry attempts. These require manual intervention.

```bash
GET /api/notifications/dlq?limit=100
```

**Query Parameters:**
- `limit` (optional): Maximum number of DLQ items to return (default: 100, max: 1000)

**Response:**
```json
[
  {
    "id": "email-1699283400000",
    "type": "email",
    "status": "dlq",
    "alerts": [...],
    "attempts": 3,
    "maxAttempts": 3,
    "lastAttempt": "2024-11-06T12:35:00Z",
    "lastError": "SMTP connection timeout",
    "createdAt": "2024-11-06T12:30:00Z"
  }
]
```
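
Since DLQ items need manual review, it can help to group them by failure cause first. The following is a minimal sketch, assuming the response has already been decoded into a list of dictionaries shaped like the example above; `summarize_dlq` is a hypothetical helper name.

```python
from collections import Counter


def summarize_dlq(items):
    """Count DLQ items per (type, lastError) pair, most frequent first."""
    counts = Counter(
        (item.get("type", "unknown"), item.get("lastError", ""))
        for item in items
    )
    return counts.most_common()
```

A cluster of identical errors (for example, repeated SMTP timeouts) usually points at one broken destination rather than many independent failures.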

#### Retry DLQ Item
Retry a failed notification from the Dead Letter Queue.

```bash
POST /api/notifications/dlq/retry
Content-Type: application/json

{
  "id": "email-1699283400000"
}
```

**Response:**
```json
{
  "success": true,
  "message": "Notification scheduled for retry",
  "id": "email-1699283400000"
}
```

#### Delete DLQ Item
Permanently remove a notification from the Dead Letter Queue.

```bash
POST /api/notifications/dlq/delete
Content-Type: application/json

{
  "id": "email-1699283400000"
}
```

Or using DELETE method:
```bash
DELETE /api/notifications/dlq/delete
Content-Type: application/json

{
  "id": "email-1699283400000"
}
```

**Response:**
```json
{
  "success": true,
  "message": "DLQ item deleted",
  "id": "email-1699283400000"
}
```

**Note:** All notification queue endpoints require admin authentication.
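
The retry and delete calls share the same request shape, so a client could build both with one helper. This is a sketch only: `build_dlq_request` is a hypothetical name, and because this section does not say how admin credentials are passed, the `auth_headers` parameter is left for the caller to fill in.

```python
import json
import urllib.request


def build_dlq_request(base_url, action, notification_id, auth_headers=None):
    """Build (but do not send) a POST request for a DLQ retry or delete."""
    if action not in ("retry", "delete"):
        raise ValueError(f"unsupported DLQ action: {action!r}")
    body = json.dumps({"id": notification_id}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Admin authentication is required; the exact header is an assumption
    # left to the caller.
    headers.update(auth_headers or {})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/notifications/dlq/{action}",
        data=body,
        headers=headers,
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req)` and the JSON response checked for `"success": true`.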


### Alert Management
Comprehensive alert management system.
@@ -1134,6 +1239,104 @@ GET /simple-stats

Returns simplified metrics without authentication requirements.

## Prometheus Metrics

Pulse exposes Prometheus-compatible metrics for monitoring the monitoring system itself. These metrics provide observability into alert system health, notification delivery, and queue performance.

### Metrics Endpoint

```bash
GET /metrics
```

**Authentication:** None required (public endpoint)

**Response Format:** Prometheus text exposition format
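
For quick scripting without a Prometheus server, the text exposition format is simple enough to read line by line. The sketch below is a deliberately minimal parser, not a full implementation of the format: it skips `# HELP` and `# TYPE` lines, ignores optional timestamps, and does not unescape label values. `parse_metrics` is a hypothetical helper name.

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For anything beyond ad hoc checks, an established Prometheus client library is the safer choice.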

### Available Metrics

#### Alert Metrics

- **`pulse_alerts_active`** (Gauge) - Number of currently active alerts
  - Labels: `level` (info/warning/critical), `type` (cpu/memory/disk/etc)

- **`pulse_alerts_fired_total`** (Counter) - Total number of alerts fired
  - Labels: `level`, `type`

- **`pulse_alerts_resolved_total`** (Counter) - Total number of alerts resolved
  - Labels: `type`

- **`pulse_alerts_acknowledged_total`** (Counter) - Total number of alerts acknowledged

- **`pulse_alerts_suppressed_total`** (Counter) - Total number of alerts suppressed
  - Labels: `reason` (quiet_hours/flapping/rate_limit)

- **`pulse_alert_duration_seconds`** (Histogram) - Duration alerts remain active before resolution
  - Labels: `type`

#### Notification Metrics

- **`pulse_notifications_sent_total`** (Counter) - Total notifications sent
  - Labels: `method` (email/webhook/apprise), `status` (success/failed)

- **`pulse_notification_queue_depth`** (Gauge) - Number of queued notifications
  - Labels: `status` (pending/processing/dlq)

- **`pulse_notification_dlq_total`** (Counter) - Total notifications moved to Dead Letter Queue

- **`pulse_notification_retry_total`** (Counter) - Total notification retry attempts

- **`pulse_notification_duration_seconds`** (Histogram) - Time to deliver notifications
  - Labels: `method`

#### Queue Metrics

- **`pulse_queue_depth`** (Gauge) - Current queue depth by status
  - Labels: `status`

- **`pulse_queue_items_total`** (Counter) - Total items processed by queue
  - Labels: `status` (completed/failed/dlq)

- **`pulse_queue_processing_duration_seconds`** (Histogram) - Time to process queued items

#### System Metrics

- **`pulse_history_save_errors_total`** (Counter) - Total alert history save failures

- **`pulse_history_save_retries_total`** (Counter) - Total history save retry attempts

### Example Prometheus Configuration

```yaml
scrape_configs:
  - job_name: 'pulse'
    static_configs:
      - targets: ['pulse.example.com:7655']
    metrics_path: '/metrics'
    scrape_interval: 30s
```

### Example PromQL Queries

```promql
# Alert rate per minute
rate(pulse_alerts_fired_total[5m]) * 60

# Notification success rate (aggregate first so the differing
# `status` labels do not prevent the two sides from matching)
sum(rate(pulse_notifications_sent_total{status="success"}[5m])) /
  sum(rate(pulse_notifications_sent_total[5m]))

# DLQ growth rate
rate(pulse_notification_dlq_total[1h])

# Active alerts by severity
sum by (level) (pulse_alerts_active)

# Average notification delivery time
rate(pulse_notification_duration_seconds_sum[5m]) /
  rate(pulse_notification_duration_seconds_count[5m])
```
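
The success-rate calculation can also be reproduced by hand from two scrapes of `pulse_notifications_sent_total`, which makes the aggregation step explicit: increases are summed across every `method` and `status` before dividing. This is an illustrative sketch with a hypothetical helper name; it ignores counter resets, which PromQL's `rate()` handles for you.

```python
def success_ratio(before, after):
    """Compute the success ratio between two counter scrapes.

    `before` and `after` map (method, status) -> counter value.
    Returns None when no notifications were sent in the interval.
    """
    increase = {key: after[key] - before.get(key, 0.0) for key in after}
    total = sum(increase.values())
    if total == 0:
        return None
    succeeded = sum(
        value for (_method, status), value in increase.items()
        if status == "success"
    )
    return succeeded / total
```

Because both sides cover the same interval, the scrape interval cancels out and does not need to be passed in.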

## Rate Limiting

**v4.24.0:** All responses include rate limit headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`). 429 responses add `Retry-After`.
@@ -361,6 +361,46 @@ In the Alerts page, the "Global Defaults" row for each resource table shows an e
| Temperature | 5+ minutes | Fans need time to ramp up; short spikes are normal |
| Restart Count | 10-30 seconds | Container crashes need immediate attention |

#### Alert Reliability Configuration

Pulse includes advanced reliability features to prevent data loss and manage long-running alerts:

**Alert TTL (Time-To-Live):**
```json
{
  "maxAlertAgeDays": 7,
  "maxAcknowledgedAgeDays": 1,
  "autoAcknowledgeAfterHours": 24
}
```

- **`maxAlertAgeDays`** (default: `7`): Automatically removes unacknowledged alerts older than this many days. Prevents memory leaks from persistent issues. Set to `0` to disable.
- **`maxAcknowledgedAgeDays`** (default: `1`): Faster cleanup for acknowledged alerts since they've been reviewed. Set to `0` to disable.
- **`autoAcknowledgeAfterHours`** (default: `24`): Automatically acknowledges alerts that remain active for this duration. Useful for expected long-running conditions. Set to `0` to disable.
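
To make the interaction of the three TTL settings concrete, here is a sketch of the decision they describe for a single alert. It is not Pulse's implementation. Two points are assumptions: the acknowledged-alert age is measured from the acknowledgement time, and removal is checked before auto-acknowledgement. `ttl_action` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def ttl_action(
    started_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    max_alert_age_days: int = 7,
    max_acknowledged_age_days: int = 1,
    auto_acknowledge_after_hours: int = 24,
) -> str:
    """Return "remove", "acknowledge", or "keep" for one alert."""
    if acknowledged_at is not None:
        # Assumption: acknowledged alerts age from the acknowledgement time.
        limit = timedelta(days=max_acknowledged_age_days)
        if max_acknowledged_age_days > 0 and now - acknowledged_at > limit:
            return "remove"
        return "keep"
    age = now - started_at
    # A value of 0 disables the corresponding rule.
    if max_alert_age_days > 0 and age > timedelta(days=max_alert_age_days):
        return "remove"
    if auto_acknowledge_after_hours > 0 and age > timedelta(hours=auto_acknowledge_after_hours):
        return "acknowledge"
    return "keep"
```

With the defaults, an unacknowledged alert is auto-acknowledged after a day and, once acknowledged, becomes eligible for cleanup a day later.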

**Flapping Detection:**
```json
{
  "flappingEnabled": true,
  "flappingWindowSeconds": 300,
  "flappingThreshold": 5,
  "flappingCooldownMinutes": 15
}
```

- **`flappingEnabled`** (default: `true`): Enable detection of rapidly oscillating alerts
- **`flappingWindowSeconds`** (default: `300`): Time window (5 minutes) to track state changes
- **`flappingThreshold`** (default: `5`): Number of state changes within the window to trigger suppression
- **`flappingCooldownMinutes`** (default: `15`): How long to suppress a flapping alert

When an alert flaps (rapid on/off cycling), Pulse automatically suppresses it to prevent notification storms. The suppression lasts for the cooldown period, after which the alert can fire normally again.
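
The window, threshold, and cooldown settings describe a sliding-window counter, which can be sketched as follows. This is an illustration of the general technique with one detector per alert, not Pulse's actual code; the class name is hypothetical, and treating the threshold as "at least N changes" is an assumption.

```python
from collections import deque


class FlappingDetector:
    """Track state changes for one alert and suppress it while it flaps."""

    def __init__(self, window_seconds=300, threshold=5, cooldown_minutes=15):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.cooldown_seconds = cooldown_minutes * 60
        self._changes = deque()  # timestamps of recent state changes
        self._suppressed_until = None

    def record_state_change(self, now):
        """Record a change at `now` (seconds); return True if suppressed."""
        self._changes.append(now)
        # Drop changes that have fallen out of the tracking window.
        while self._changes and now - self._changes[0] > self.window_seconds:
            self._changes.popleft()
        if len(self._changes) >= self.threshold:
            self._suppressed_until = now + self.cooldown_seconds
        return self.is_suppressed(now)

    def is_suppressed(self, now):
        return self._suppressed_until is not None and now < self._suppressed_until
```

With the defaults, five state changes inside five minutes start a 15-minute suppression; changes spaced further apart than the window never accumulate.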

**Common Flapping Scenarios:**
- Network instability causing intermittent connectivity
- Resource usage hovering around threshold (use hysteresis instead)
- Misconfigured health checks with tight timing
- Container restart loops (use Docker restart alerts instead)

> Tip: Back up `alerts.json` alongside `.env` during exports. Restoring it preserves all overrides, quiet-hour schedules, and webhook routing.

### `pulse-sensor-proxy/config.yaml`