- Add AI service with Anthropic, OpenAI, and Ollama providers
- Add AI chat UI component with streaming responses
- Add AI settings page for configuration
- Add agent exec framework for command execution
- Add API endpoints for AI chat and configuration
Add comprehensive documentation for new alert system reliability features:
**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
- GET /api/notifications/dlq - Retrieve failed notifications
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry DLQ items
- POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
- 18 metrics covering alerts, notifications, and queue health
- Example Prometheus configuration
- Example PromQL queries for common monitoring scenarios
**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
- maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
- flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms
All new features are fully documented with examples and default values.
Issues found during systematic audit after #642:
1. CRITICAL BUG - Rollback downloads were completely broken:
- Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
- Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
- This would cause 404 errors on all rollback attempts
- Fixed: Construct correct tarball URL with version
- Added: Extract tarball after download to get binary
2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
- Changed to use /latest/download/ for future-proof docs
3. API.md example had wrong filename format:
- Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
- Ensures example matches actual release asset naming
The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
- Add comprehensive test coverage for alerts package with 285+ new tests
- Implement ThresholdsTable component with metric thresholds display
- Enhance Alerts page UI with improved layout and metric filtering
- Add frontend component tests for Alerts page and ThresholdsTable
- Set up Vitest testing infrastructure for SolidJS components
- Improve config persistence with better validation
- Expand discovery tests with 333+ test cases
- Update API, configuration, and Docker monitoring documentation