364 Commits

Author SHA1 Message Date
rcourtman
82cb9a45fa fix: ship alerting hotfixes and prepare 5.1.4
(cherry picked from commit d1e61d8a8a)
2026-02-07 23:15:43 +00:00
rcourtman
a44031f47d fix(monitoring): preserve recent PVE nodes on empty polls (#1094)
(cherry picked from commit 13af83f3fc)
2026-02-07 23:15:26 +00:00
rcourtman
93fd5788c9 fix(monitoring): add info logging for skipped recovery notifications 2026-02-05 09:59:05 +00:00
rcourtman
ee0e89871d fix: reduce metrics memory 86x by reverting buffer and adding LTTB downsampling
The in-memory metrics buffer was changed from 1000 to 86400 points per
metric to support 30-day sparklines, but this pre-allocated ~18 MB per
guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB —
explaining why users needed to double their LXC memory after upgrading
to 5.1.0.

- Revert in-memory buffer to 1000 points / 24h retention
- Remove eager slice pre-allocation (use append growth instead)
- Add LTTB (Largest Triangle Three Buckets) downsampling algorithm
- Chart endpoints now use a two-tier strategy: in-memory for ranges
  ≤ 2h, SQLite persistent store + LTTB for longer ranges
- Reduce frontend ring buffer from 86400 to 2000 points

Related to #1190
2026-02-04 19:49:52 +00:00
rcourtman
bcd0dbfc18 Add metrics history memory regression test 2026-02-04 19:35:19 +00:00
rcourtman
049a3e424c Add memory regression tests for agent and scheduler 2026-02-04 19:33:29 +00:00
rcourtman
64e57f0e0e fix: smooth I/O rates using sliding window like Prometheus rate()
Proxmox reports cumulative byte counters that update unevenly across
polling intervals, causing a steady 100 Mbps download to appear as
spikes up to 450 Mbps in sparkline charts. Replace per-interval rate
calculation with a 4-sample sliding window (30s at 10s polling) that
averages over the full span — the same approach Prometheus rate() uses.
2026-02-04 19:04:17 +00:00
rcourtman
c9547f226e fix: add rateTracker to host report tests and block direct tag pushes
Initialize rateTracker in ApplyHostReport test monitors to prevent nil
pointer panic when CalculateRates is called during host report processing.

Add pre-push hook guard that blocks pushing version tags directly —
releases must go through the create-release.yml workflow.
2026-02-04 16:47:31 +00:00
rcourtman
9d4d392026 fix: host network sparklines showing cumulative bytes instead of rates
Host network sparklines were displaying wildly incorrect values (e.g., 147 GB/s
for an idle Raspberry Pi) because cumulative byte counters (total bytes since
boot) were being stored directly instead of being converted to rates.

Changes:
- monitor.go: Use RateTracker to calculate network rates for hosts, matching
  the existing pattern used for VMs and containers. Only record network
  metrics when we have enough samples to calculate valid rates.
- router.go: Remove network metrics from live fallback for hosts since we
  can't calculate rates from a single snapshot. Better to show nothing than
  misleading cumulative totals.

The fix follows the established codebase pattern where:
1. Agent reports cumulative RXBytes/TXBytes
2. RateTracker compares consecutive samples to calculate bytes/second
3. Rates are stored in metrics history for sparkline display
2026-02-04 16:11:04 +00:00
rcourtman
cffb91f9ea Pre-populate node display name cache before guest polling
Guest polling (CheckGuest) runs before CheckNode in each poll cycle,
so the display name cache was empty when the first guest alert was
created. This caused the initial notification to use the raw Proxmox
node name. Fix by seeding the cache from modelNodes (which are already
available) before guest polling starts.

Related to #1188
2026-02-04 14:29:49 +00:00
rcourtman
05266d9062 Show node display name in alerts instead of raw Proxmox node name
Alerts previously showed the raw Proxmox node name (e.g., "on pve") even
when users configured a display name (e.g., "SPACEX") via Settings or the
host agent --hostname flag. This affected the alert UI, email notifications,
and webhook payloads.

Add NodeDisplayName field to the alert chain: cache display names in the
alert Manager (populated by CheckNode/CheckHost on every poll), resolve
them at alert creation via preserveAlertState, refresh on metric updates,
and enrich at read time in GetActiveAlerts. Update models.Alert, the
syncAlertsToState conversion, email templates, Apprise body text, webhook
payloads, and all frontend rendering paths.

Related to #1188
2026-02-04 14:26:44 +00:00
rcourtman
5c18748742 Add SMART disk lifecycle monitoring with historical charts
Expand the smartctl collector to capture detailed SMART attributes (SATA
and NVMe), propagate them through the full data pipeline, persist them
as time-series metrics, and display them in an interactive disk detail
drawer with historical sparkline charts.

Backend: add SMARTAttributes struct, writeSMARTMetrics for persistent
storage, "disk" resource type in metrics API with live fallback.
Frontend: enhanced DiskList with Power-On column and SMART warnings,
new DiskDetail drawer matching NodeDrawer styling patterns, generic
HistoryChart metric support with proper tooltip formatting.
2026-02-04 13:35:40 +00:00
rcourtman
902bdd92c2 fix: prefer status-mem over status-freemem for VM memory calculation
Proxmox's FreeMem field reports free memory relative to the balloon's
guest-visible total (total_mem), not relative to MaxMem. When ballooning
is active and the VM's memory has been reduced, subtracting FreeMem from
MaxMem produces wildly inflated usage (e.g. 97% when actual usage is 20%).

Proxmox's Mem field is already calculated as (total_mem - free_mem),
giving the correct used bytes regardless of balloon state. Swap the
priority so Mem is checked before FreeMem.

Related to #1185
2026-02-04 12:08:33 +00:00
rcourtman
5a990dd554 Fix sparkline data inconsistency and support 30d range 2026-02-03 22:39:50 +00:00
rcourtman
2ebe65bbc5 security: add scope checks to AI Patrol and agent profile endpoints
- AI Patrol mutation endpoints (acknowledge, dismiss, suppress, snooze, resolve,
  findings/note, suppressions/*) now require ai:execute scope to prevent
  low-privilege tokens from blinding patrol by hiding/suppressing findings

- Agent profile admin endpoints (/api/admin/profiles/*) now require
  settings:write scope to prevent low-privilege tokens from modifying
  fleet-wide agent behavior
2026-02-03 19:29:56 +00:00
rcourtman
1733bea15c feat(ui): show backup permission warnings on Backups page
When PVE backup polling detects permission errors (403/401/permission
denied), track them per instance and surface them via the scheduler
health endpoint.

The Backups page now fetches instance warnings and displays a banner
when backup permission issues are detected, telling users exactly how
to fix the problem.

Related to #1139
2026-02-03 19:27:10 +00:00
rcourtman
beae4c860c fix: address 6 security and reliability issues
Security fixes:
- Auto-register now requires settings:write scope for API tokens
- X-Forwarded-For in auto-register only trusted from verified proxies
- Public URL capture requires authentication (no loopback bypass)
- Lockout reset now uses RequireAdmin for session users

Reliability fixes:
- Docker stop command expiration clears PendingUninstall flag
- Cancelled notifications get completed_at set and are cleaned up
2026-02-03 17:32:44 +00:00
rcourtman
c7f4030c29 fix(monitoring): prevent memory leak from stale metrics history and rate tracker entries
MetricsHistory.Cleanup() was defined but never called, and even if called,
it only removed old data points without deleting map entries for deleted
containers/VMs. Each stale entry leaked ~224KB (7 pre-allocated slices).

Changes:
- Call metricsHistory.Cleanup() and rateTracker.Cleanup() in maintenance loop
- Delete map entries entirely when all data points have expired
- Return nil instead of empty slice in cleanupMetrics() to release backing arrays
- Add Cleanup() method to RateTracker with 24-hour stale threshold
- Add debug logging to track cleanup activity

Related to #1153
2026-02-03 17:16:06 +00:00
rcourtman
bd030c7c87 security: fix webhook SSRF, rate limit spoofing, metrics retention, and url poisoning
- Fix SSRF and rate limit bypass in SendEnhancedWebhook by validating the rendered URL.
- Fix rate limit spoofing in updates API by using secure IP extraction (trusted proxies).
- Fix memory leak in metrics history by correctly clearing fully stale data series.
- Fix public URL poisoning by preventing overwrites when explicitly configured.
2026-02-03 16:58:13 +00:00
rcourtman
4f40c3d751 fix: resolve critical stability and auth issues
- Fix data race in webhook notifications by removing shared state
- Fix duplicate monitors on config reload by stopping old instances
- Prevent metrics ID deletion on transient startup errors
- Support Bearer auth header for config export/import endpoints
2026-02-03 16:46:27 +00:00
rcourtman
aeca5e39fa Fix multi-tenant persistence and backend stability
- Initialize Alert and Notification managers with tenant-specific data directories

- Add panic recovery to WebSocket safeSend for stability

- Record host metrics to history for sparkline support
2026-02-03 16:24:42 +00:00
rcourtman
bea3bbe5f6 Fix API token authentication and multi-tenancy logic
- Fix AuthContextMiddleware to use tenant-specific config for token validation

- Resolve data race in token LastUsedAt update

- Fix invalid org IDs returning 501/402 instead of 400

- Prevent unauthenticated organization directory creation (DoS protection)
2026-02-03 16:24:28 +00:00
rcourtman
71f80c8a99 Fix: alert resolution now records incident timeline during quiet hours
- Fixed early return in handleAlertResolved that skipped incident recording
  when quiet hours suppressed recovery notifications
- Added Host Agent alert delay configuration (backend + UI)
- Host Agents now have dedicated time threshold settings like other resource types

Related to #1179
2026-02-03 12:49:41 +00:00
rcourtman
8495878553 Fix: improve mock metrics sampler startup performance
- Reduce minimum seed duration from 7 days to 1 hour for faster startup
  on resource-constrained systems (like demo server 1GB droplet)
- Reduce sleep times from 200ms to 50ms between resource processing
- Add diagnostic logging throughout mock metrics seeding to help debug
  issues where sparklines show no data
- Add progress logging for nodes, VMs, containers, storage, docker hosts
2026-02-03 12:03:06 +00:00
rcourtman
a61f1b387a Fix: data race in Docker detection test mock — add mutex for concurrent calls 2026-02-03 00:12:16 +00:00
rcourtman
c8483f8116 Fix: PBS backup verification status not updating after cache populated
The PBS backup snapshot cache only compared BackupCount and LastBackup
timestamp to decide whether to re-fetch. When PBS verify jobs complete,
neither field changes — only the Verification field on individual
snapshots changes — so the cache served stale data indefinitely.

Add a 10-minute TTL per backup group so verification status changes are
picked up periodically. Also add panic recovery to PBS and PVE backup
goroutines, and use runtimeCtx for PBS backup polling to respect
monitor shutdown.

Closes #1174
2026-02-02 23:12:26 +00:00
rcourtman
3b347b6548 fix: harden SQLite against I/O contention causing persistent lock errors
- Move all SQLite pragmas from db.Exec() to DSN parameters so every
  connection the pool creates gets busy_timeout and other settings.
  Previously only the first connection had these applied.
- Set MaxOpenConns(1) on audit, RBAC, and notification databases
  (metrics already had this). Fixes potential for multiple connections
  where new ones lack busy_timeout.
- Increase busy_timeout from 5s to 30s across all databases to
  tolerate disk I/O pressure during backup windows.
- Fix nested query deadlocks in GetRoles(), GetUserAssignments(), and
  CancelByAlertIDs() that would deadlock with MaxOpenConns(1).
- Fix circuit breaker retryInterval not resetting on recovery, which
  caused the next trip to start at 5-minute backoff instead of 5s.

Related to #1156
2026-02-02 17:29:14 +00:00
rcourtman
95a0d7a6bd feat(backend): implement AI Patrol, Investigation, and system-wide refactors 2026-01-30 19:02:14 +00:00
rcourtman
19a67dd4f3 Update core infrastructure components
Config:
- AI configuration improvements
- API tokens handling
- Persistence layer updates

Host Agent:
- Command execution improvements
- Better test coverage

Infrastructure Discovery:
- Service improvements
- Enhanced test coverage

Models:
- State snapshot updates
- Model improvements

Monitoring:
- Polling improvements
- Guest config handling
- Storage config support

WebSocket:
- Hub tenant test updates

Service Discovery:
- New service discovery module
2026-01-28 16:52:35 +00:00
rcourtman
70dbb495ad fix: address triage issues #1149, #1153, #1162, #1163
- #1163: Add node badges to storage resources in threshold tables
  (ResourceTable.tsx, ResourceCard.tsx)
- #1162: Fix PBS backup alerts showing datastore as node name
  (alerts.go - use "Unknown" for orphaned backups)
- #1153: Fix memory leaks in tracking maps
  - Add max 48 sample limit for pmgQuarantineHistory
  - Add max 10 entry limit for flappingHistory
  - Add cleanup for dockerUpdateFirstSeen
  - Add cleanupTrackingMaps() for auth, polling, and circuit breaker maps

Note: #1149 fix (chat sessions null check) is in AISettings.tsx
which has other pending changes - will be committed separately.
2026-01-26 22:21:10 +00:00
rcourtman
7f7edfceb4 test: expand backend coverage 2026-01-25 21:08:44 +00:00
rcourtman
1e77763870 feat: improve monitoring and temperature handling
Temperature Monitoring:
- Enhance temperature collection and processing
- Add temperature tests

Monitor Improvements:
- Improve monitor reload handling
- Add reload tests

Test Coverage:
- Add Ceph monitoring tests
- Add Docker commands tests
- Add host agent temperature tests
- Add extra coverage tests
2026-01-24 22:43:31 +00:00
rcourtman
c4ca169e2b feat: add multi-tenant isolation foundation (disabled by default)
Implements multi-tenant infrastructure for organization-based data isolation.
Feature is gated behind PULSE_MULTI_TENANT_ENABLED env var and requires
Enterprise license - no impact on existing users.

Core components:
- TenantMiddleware: extracts org ID, validates access, 501/402 responses
- AuthorizationChecker: token/user access validation for organizations
- MultiTenantChecker: WebSocket upgrade gating with license check
- Per-tenant audit logging via LogAuditEventForTenant
- Organization model with membership support

Gating behavior:
- Feature flag disabled: 501 Not Implemented for non-default orgs
- Flag enabled, no license: 402 Payment Required
- Default org always works regardless of flag/license

Documentation added: docs/MULTI_TENANT.md
2026-01-23 21:42:27 +00:00
rcourtman
4c19fa3c1b fix: resolve btrfs disk summing (#1158), podman disable flag (#1151), and diagnostics path (#1155) 2026-01-23 19:24:38 +00:00
rcourtman
8963d69764 feat: add metrics store point limiting and mock improvements
- Add point limiting to metrics queries
- Improve mock metrics history for testing
- Add monitor enhancements
2026-01-22 22:29:56 +00:00
rcourtman
5f56efa88a Refactor: Core monitoring and update managers multi-tenancy
- Updated monitoring reload and metrics history to be tenant-aware
- Refactored update manager and checksum validation for multi-tenancy
- Enhanced test coverage for agent updates and metrics storage
2026-01-22 16:43:24 +00:00
rcourtman
289d95374f feat: add multi-tenancy foundation (directory-per-tenant)
Implements Phase 1-2 of multi-tenancy support using a directory-per-tenant
strategy that preserves existing file-based persistence.

Key changes:
- Add MultiTenantPersistence manager for org-scoped config routing
- Add TenantMiddleware for X-Pulse-Org-ID header extraction and context propagation
- Add MultiTenantMonitor for per-tenant monitor lifecycle management
- Refactor handlers (ConfigHandlers, AlertHandlers, AIHandlers, etc.) to be
  context-aware with getConfig(ctx)/getMonitor(ctx) helpers
- Add Organization model for future tenant metadata
- Update server and router to wire multi-tenant components

All handlers maintain backward compatibility via legacy field fallbacks
for single-tenant deployments using the "default" org.
2026-01-22 13:39:06 +00:00
rcourtman
c75972d57c Fix mock metrics history and guest drawer controls 2026-01-22 09:39:53 +00:00
rcourtman
2e0da42a81 chore: reliability and maintenance improvements
Host agent:
- Add SHA256 checksum verification for downloaded binaries
- Verify checksum file matches expected bundle filename

WebSocket:
- Add write failure tracking with graceful disconnection
- Increase write deadline to 30s for large state payloads
- Better handling for slow clients (Raspberry Pi, slow networks)

Monitoring:
- Remove unused temperature proxy imports
- Add monitor polling improvements
- Expand test coverage

Other:
- Update package.json dependencies
- Fix generate-release-notes.sh path handling
- Minor reporting engine cleanup
2026-01-22 00:45:04 +00:00
rcourtman
c8b6cbfc6d feat(pro): long-term metrics history (30d/90d)
- Add FeatureLongTermMetrics license feature for Pro tier
- Implement tiered storage in metrics store (raw, minute, hourly, daily)
- Add covering index for unified history query performance
- Seed mock data for 90 days with appropriate aggregation tiers
- Update PULSE_PRO.md to document the feature
- 7-day history remains free, 30d/90d requires Pro license
2026-01-22 00:42:41 +00:00
rcourtman
925815c3e7 test: update config and monitoring tests after proxy removal
Remove references to sensor proxy config fields in test cases.
2026-01-21 12:03:30 +00:00
rcourtman
7049f5b43c refactor: simplify temperature monitoring after sensor proxy removal
Remove proxy-related temperature code paths:
- temperature.go: remove proxy client integration and fallback logic
- config.go: remove SensorProxyEnabled and related config fields
- monitor.go: remove proxy client initialization and state

Temperature monitoring now relies solely on the unified agent approach.
2026-01-21 12:00:28 +00:00
rcourtman
d4a6c0d2e8 refactor: remove legacy pulse-sensor-proxy temperature monitoring
The sensor proxy approach for temperature monitoring has been superseded
by the unified agent architecture where host agents report temperature
data directly. This removes:

- cmd/pulse-sensor-proxy/ - standalone proxy daemon
- internal/tempproxy/ - client library
- internal/api/*temperature_proxy* - API handlers and tests
- internal/api/sensor_proxy_gate* - feature gate
- internal/monitoring/*proxy_test* - proxy-specific tests
- scripts/*sensor-proxy* - installation and management scripts
- security/apparmor/, security/seccomp/ - proxy security profiles

Temperature monitoring remains available via the unified agent approach.
2026-01-21 11:59:04 +00:00
rcourtman
ebc29b4fdb feat: show pending apt updates for Proxmox nodes (#1083)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance

Requires Sys.Audit permission on Proxmox API token to read apt updates.
2026-01-21 10:53:36 +00:00
rcourtman
a6a8efaa65 test: Add comprehensive test coverage across packages
New test files with expanded coverage:

API tests:
- ai_handler_test.go: AI handler unit tests with mocking
- agent_profiles_tools_test.go: Profile management tests
- alerts_endpoints_test.go: Alert API endpoint tests
- alerts_test.go: Updated for interface changes
- audit_handlers_test.go: Audit handler tests
- frontend_embed_test.go: Frontend embedding tests
- metadata_handlers_test.go, metadata_provider_test.go: Metadata tests
- notifications_test.go: Updated for interface changes
- profile_suggestions_test.go: Profile suggestion tests
- saml_service_test.go: SAML authentication tests
- sensor_proxy_gate_test.go: Sensor proxy tests
- updates_test.go: Updated for interface changes

Agent tests:
- dockeragent/signature_test.go: Docker agent signature tests
- hostagent/agent_metrics_test.go: Host agent metrics tests
- hostagent/commands_test.go: Command execution tests
- hostagent/network_helpers_test.go: Network helper tests
- hostagent/proxmox_setup_test.go: Updated setup tests
- kubernetesagent/*_test.go: Kubernetes agent tests

Core package tests:
- monitoring/kubernetes_agents_test.go, reload_test.go
- remoteconfig/client_test.go, signature_test.go
- sensors/collector_test.go
- updates/adapter_installsh_*_test.go: Install adapter tests
- updates/manager_*_test.go: Update manager tests
- websocket/hub_*_test.go: WebSocket hub tests

Library tests:
- pkg/audit/export_test.go: Audit export tests
- pkg/metrics/store_test.go: Metrics store tests
- pkg/proxmox/*_test.go: Proxmox client tests
- pkg/reporting/reporting_test.go: Reporting tests
- pkg/server/*_test.go: Server tests
- pkg/tlsutil/extra_test.go: TLS utility tests

Total: ~8000 lines of new test code
2026-01-19 19:26:18 +00:00
rcourtman
204a9fe084 perf: Cache agent profiles to prevent disk I/O on every report. Related to #1094
GetHostAgentConfig was loading profiles and assignments from disk on
every agent report (every 10-30 seconds per host). With multiple hosts,
this caused disk I/O contention that eventually led to request timeouts.

Added in-memory caching with 60-second TTL:
- Fast path reads from cache without locks when valid
- Double-checked locking pattern for cache refresh
- Cache auto-invalidates after TTL, no manual invalidation needed
2026-01-17 22:31:02 +00:00
rcourtman
103eb9c3e0 feat(monitoring): auto-detect Docker inside LXC containers
Adds automatic Docker detection for Proxmox LXC containers:
- New HasDocker and DockerCheckedAt fields on Container model
- Docker socket check via connected agents on first run, restart, or start
- Parallel checking with timeouts for efficiency
- Caches results and only re-checks after state transitions

This enables the AI to know which LXC containers are Docker hosts
for better infrastructure guidance.
2026-01-17 14:42:52 +00:00
rcourtman
035436ad6e fix: add mutex to prevent concurrent map writes in Docker agent CPU tracking
The agent was crashing with 'fatal error: concurrent map writes' when
handleCheckUpdatesCommand spawned a goroutine that called collectOnce
concurrently with the main collection loop. Both code paths access
a.prevContainerCPU without synchronization.

Added a.cpuMu mutex to protect all accesses to prevContainerCPU in:
- pruneStaleCPUSamples()
- collectContainer() delete operation
- calculateContainerCPUPercent()

Related to #1063
2026-01-15 21:10:55 +00:00
rcourtman
9b49d3171d feat(pbs): add datastore exclusion to reduce PBS log noise
Users with removable/unmounted datastores (e.g., external HDDs for
offline backup) experienced excessive PBS log entries because Pulse
was querying all datastores including unavailable ones.

Added `excludeDatastores` field to PBS node configuration that accepts
patterns to exclude specific datastores from monitoring:
- Exact names: "exthdd1500gb"
- Prefix patterns: "ext*"
- Suffix patterns: "*hdd"
- Contains patterns: "*removable*"

Pattern matching is case-insensitive.

Fixes #1105
2026-01-14 12:26:18 +00:00
rcourtman
d389345153 fix(hosts): calculate Used memory from Total-Free for host agents in LXC
The previous fix (4090d981) addressed memory reporting for Docker agents
running in LXC containers, but the same issue also affects host agents.
When gopsutil runs inside an LXC, it can read Total and Free memory
from cgroup limits, but reports 0 for Used memory.

Added the same Total - Free fallback calculation to the host agent
processing path, which populates the Hosts tab.

Fixes #1075
2026-01-12 21:19:40 +00:00