The agent was crashing with 'fatal error: concurrent map writes' when
handleCheckUpdatesCommand spawned a goroutine that called collectOnce
concurrently with the main collection loop. Both code paths access
a.prevContainerCPU without synchronization.
Added a.cpuMu mutex to protect all accesses to prevContainerCPU in:
- pruneStaleCPUSamples()
- collectContainer() delete operation
- calculateContainerCPUPercent()
Related to #1063
This commit adds enterprise-grade reporting and audit capabilities:
Reporting:
- Refactored metrics store from internal/ to pkg/ for enterprise access
- Added pkg/reporting with shared interfaces for report generation
- Created API endpoint: GET /api/admin/reports/generate
- New ReportingPanel.tsx for PDF/CSV report configuration
Audit Webhooks:
- Extended pkg/audit with webhook URL management interface
- Added API endpoint: GET/POST /api/admin/webhooks/audit
- New AuditWebhookPanel.tsx for webhook configuration
- Updated Settings.tsx with Reporting and Webhooks tabs
Server Hardening:
- Enterprise hooks now execute outside mutex with panic recovery
- Removed dbPath from metrics Stats API to prevent path disclosure
- Added storage metrics persistence to polling loop
Documentation:
- Updated README.md feature table
- Updated docs/API.md with new endpoints
- Updated docs/PULSE_PRO.md with feature descriptions
- Updated docs/WEBHOOKS.md with audit webhooks section
- Replace barrel import in AuditLogPanel.tsx to fix ad-blocker crash
- Remove all Enterprise/Pro badges from nav and feature headers
- Simplify upgrade CTAs to clean 'Upgrade to Pro' links
- Update docs: PULSE_PRO.md, API.md, README.md, SECURITY.md
- Align terminology: single Pro tier, no separate Enterprise tier
Also includes prior refactoring:
- Move auth package to pkg/auth for enterprise reuse
- Export server functions for testability
- Stabilize CLI tests
- Add GET /api/audit endpoint for listing events with filters
- Add GET /api/audit/:id/verify endpoint for signature verification
- Add AuditLogPanel UI component with filtering and verification
- Update docs with audit API documentation
- Add localStorage utils for persisting UI state
- Update gitignore patterns
Implements server-side persistence for AI chat sessions, allowing users
to continue conversations across devices and browser sessions. Related
to #1059.
Backend:
- Add chat session CRUD API endpoints (GET/PUT/DELETE)
- Add persistence layer with per-user session storage
- Support session cleanup for old sessions (90 days)
- Multi-user support via auth context
Frontend:
- Rewrite aiChat store with server sync (debounced)
- Add session management UI (new conversation, switch, delete)
- Local storage as fallback/cache
- Initialize sync on app startup when AI is enabled
- Major updates to README.md and docs/README.md for Pulse v5
- Added technical deep-dives for Pulse Pro (docs/PULSE_PRO.md) and AI Patrol (docs/AI.md)
- Updated Prometheus metrics documentation and Helm schema for metrics separation
- Refreshed security, installation, and deployment documentation for unified agent models
- Cleaned up legacy summary files
The storage deduplication logic only checked cluster config's Shared
flag, but this required the cluster config API call to succeed. When
the per-node storage API already returns shared=1 (as the user
verified), we should use that directly.
Now we check three sources for shared storage detection:
1. Per-node API shared flag (storage.Shared)
2. Cluster config shared flag (if available)
3. Storage type heuristics (NFS, RBD, PBS, etc.)
Related to #1049
Add contextual help icons throughout the UI to improve feature
discoverability. Users can click (?) icons to see explanations
with examples for settings they might not understand.
- HelpIcon component with click-to-open popover
- Centralized help content registry in /content/help/
- FeatureTip component for dismissible contextual tips
- Help added to: alert delay, AI endpoints, update channel
Adds IncludeAllDeployments option to show all deployments, not just
problem ones (where replicas don't match desired). This provides parity
with the existing --kube-include-all-pods flag.
- Add IncludeAllDeployments to kubernetesagent.Config
- Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var
- Update collectDeployments to respect the new flag
- Add test for IncludeAllDeployments functionality
- Update UNIFIED_AGENT.md documentation
Addresses feedback from PR #855
- Fixed unused variables in wizard components
- Fixed invalid aiEnabled field in FeaturesStep (AI uses separate API)
- Added AI endpoints section to API.md
- Add AI service with Anthropic, OpenAI, and Ollama providers
- Add AI chat UI component with streaming responses
- Add AI settings page for configuration
- Add agent exec framework for command execution
- Add API endpoints for AI chat and configuration
Add comprehensive documentation for new alert system reliability features:
**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
- GET /api/notifications/dlq - Retrieve failed notifications
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry DLQ items
- POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
- 18 metrics covering alerts, notifications, and queue health
- Example Prometheus configuration
- Example PromQL queries for common monitoring scenarios
**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
- maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
- flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms
All new features are fully documented with examples and default values.
Issues found during systematic audit after #642:
1. CRITICAL BUG - Rollback downloads were completely broken:
- Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
- Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
- This would cause 404 errors on all rollback attempts
- Fixed: Construct correct tarball URL with version
- Added: Extract tarball after download to get binary
2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
- Changed to use /latest/download/ for future-proof docs
3. API.md example had wrong filename format:
- Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
- Ensures example matches actual release asset naming
The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
- Add comprehensive test coverage for alerts package with 285+ new tests
- Implement ThresholdsTable component with metric thresholds display
- Enhance Alerts page UI with improved layout and metric filtering
- Add frontend component tests for Alerts page and ThresholdsTable
- Set up Vitest testing infrastructure for SolidJS components
- Improve config persistence with better validation
- Expand discovery tests with 333+ test cases
- Update API, configuration, and Docker monitoring documentation