Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-02-18 00:17:39 +01:00

Author	SHA1	Message	Date
rcourtman	2207642fa9	Related to #727 : normalize persisted Proxmox hosts	2025-11-20 19:58:05 +00:00
rcourtman	766cbe573e	Handle missing storage on cluster nodes	2025-11-18 15:57:29 +00:00
rcourtman	7c895df1f3	Fix Proxmox 9.x VM status endpoint incompatibility Proxmox VE 9.x removed support for the "full" parameter in the /nodes/{node}/qemu/{vmid}/status/current endpoint. When Pulse sent GetVMStatus() requests with ?full=1, Proxmox responded with: API error 400: {"errors":{"full":"property is not defined in schema..."}} This caused the cluster client to mark ALL endpoints as unhealthy, which cascaded into multiple failures: - VM status checks failed - Guest agent queries were blocked - Filesystem data collection stopped working - All Windows VMs showed disk:-1 (unknown) instead of actual disk usage The fix removes the ?full=1 parameter since Proxmox 9.x returns all data by default without needing this parameter. This maintains backward compatibility with older Proxmox versions while fixing the issue in 9.x. After this fix: - Cluster endpoints are correctly marked as healthy - Guest agent queries work properly - Windows VMs report actual disk usage (e.g., 26% on C:\ drive) - VM monitoring functions normally on Proxmox 9.x	2025-11-13 11:22:36 +00:00
rcourtman	f61b850179	Ensure VM status requests always return meminfo (Related to #694 )	2025-11-12 17:30:10 +00:00
rcourtman	2e1ef44ecd	Filter read-only filesystems from host agent disk metrics (related to #690 ) Squashfs snap mounts on Ubuntu (and similar read-only filesystems like erofs on Home Assistant OS) always report near-full usage and trigger false disk alerts. The filter logic existed in Proxmox monitoring but wasn't applied to host agents. Changes: - Extract read-only filesystem filter to shared pkg/fsfilters package - Apply filter in hostmetrics.collectDisks() for host/docker agents - Apply filter in monitor.ApplyHostReport() for backward compatibility - Convert internal/monitoring/fs_filters.go to wrapper functions This prevents squashfs, erofs, iso9660, cdfs, udf, cramfs, romfs, and saturated overlay filesystems from generating alerts. Filtering happens at both collection time (agents) and ingestion time (server) to ensure older agents don't cause false alerts until they're updated.	2025-11-12 09:47:02 +00:00
rcourtman	bb7ca93c18	feat: Add mdadm RAID monitoring support for host agents Implements comprehensive mdadm RAID array monitoring for Linux hosts via pulse-host-agent. Arrays are automatically detected and monitored with real-time status updates, rebuild progress tracking, and automatic alerting for degraded or failed arrays. Key changes: Backend: - Add mdadm package for parsing mdadm --detail output - Extend host agent report structure with RAID array data - Integrate mdadm collection into host agent (Linux-only, best-effort) - Add RAID array processing in monitoring system - Implement automatic alerting: - Critical alerts for degraded arrays or arrays with failed devices - Warning alerts for rebuilding/resyncing arrays with progress tracking - Auto-clear alerts when arrays return to healthy state Frontend: - Add TypeScript types for RAID arrays and devices - Display RAID arrays in host details drawer with: - Array status (clean/degraded/recovering) with color-coded indicators - Device counts (active/total/failed/spare) - Rebuild progress percentage and speed when applicable - Green for healthy, amber for rebuilding, red for degraded Documentation: - Document mdadm monitoring feature in HOST_AGENT.md - Explain requirements (Linux, mdadm installed, root access) - Clarify scope (software RAID only, hardware RAID not supported) Testing: - Add comprehensive tests for mdadm output parsing - Test parsing of healthy, degraded, and rebuilding arrays - Verify proper extraction of device states and rebuild progress All builds pass successfully. RAID monitoring is automatic and best-effort - if mdadm is not installed or no arrays exist, host agent continues reporting other metrics normally. Related to #676	2025-11-09 16:36:33 +00:00
rcourtman	a406fe42d8	Fix Proxmox 9.x RRD parameter incompatibility causing cluster health issues Proxmox VE 9.x removed support for the 'ds' parameter in RRD endpoints (/nodes/{node}/rrddata and /nodes/{node}/lxc/{vmid}/rrddata). When Pulse sent RRD requests with ds=memused,memavailable,etc., Proxmox responded with: API error 400: {"errors":{"ds":"property is not defined in schema..."}} This caused cluster nodes to be repeatedly marked unhealthy, which cascaded into storage polling failures showing 'All cluster endpoints are unhealthy' even though the nodes were actually healthy and reachable. Changes: - Added check in cluster_client.go executeWithFailover to recognize the ds parameter error as a capability issue rather than node health failure - Nodes with this error no longer get marked unhealthy - Storage polling and other operations now succeed even when RRD calls fail - The RRD data will be unavailable but core monitoring continues This fix maintains backward compatibility with older Proxmox versions while gracefully handling the API change in Proxmox 9.x.	2025-11-08 12:06:08 +00:00
rcourtman	48fabdd827	Improve Docker temperature monitoring documentation for clarity (related to #600 ) Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be more user-friendly and address common setup issues: - Added clear explanation of why the proxy is needed (containers can't access hardware) - Provided concrete IP example instead of placeholder - Showed full docker-compose.yml context with proper YAML structure - Added sudo to commands where needed - Updated docker-compose commands to v2 syntax with note about v1 - Expanded verification steps with clearer success indicators - Added reminder to check container name in verification commands These improvements should help users who encounter blank temperature displays due to missing proxy installation or bind mount configuration.	2025-11-07 15:09:42 +00:00
rcourtman	9199892115	Fix Windows VM disk accumulation bug by normalizing drive letters Related to #656 Windows guest agents can return multiple directory mountpoints (C:\, C:\Users, C:\Windows) all on the same physical drive. When the QEMU guest agent omits disk[] metadata, commit `5325ef481` falls back to using the mountpoint string as the disk identifier. This causes every Windows directory to be treated as a separate disk, accumulating to inflated totals (e.g., 1TB reported for a 250GB drive). Root cause: The fallback logic in pkg/proxmox/client.go:1585-1594 assigns fs.Disk = fs.Mountpoint when disk[] is missing. On Windows, every directory path is unique, so the deduplication guard in internal/monitoring/monitor_polling.go: 619-635 never triggers, causing all directories to be summed. Changes: - Detect Windows-style mountpoints (drive letter + colon + backslash) - Normalize to drive root when disk[] is missing (e.g., C:\Users → C:) - Preserve existing behavior for Linux/BSD and VMs with disk[] metadata - Add debug logging for synthesized Windows drive identifiers This fix maintains backward compatibility with commit `5325ef481` while preventing the Windows directory accumulation issue. LXC containers are unaffected as they use a different code path.	2025-11-07 12:27:11 +00:00
rcourtman	1a78dcbba2	Fix guest agent disk data regression on Proxmox 8.3+ Related to #630 Proxmox 8.3+ changed the VM status API to return the `agent` field as an object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This caused Pulse to incorrectly treat VMs as having no guest agent, resulting in missing disk usage data (disk:-1) even when the guest agent was running and functional. The issue manifested as: - VMs showing "Guest details unavailable" or missing disk data - Pulse logs showing no "Guest agent enabled, querying filesystem info" messages - `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly from the command line, confirming the agent was functional Root cause: The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent the new object format, JSON unmarshaling silently left the field at zero, causing Pulse to skip all guest agent queries. Changes: - Created VMAgentField type with custom UnmarshalJSON to handle both formats: * Legacy (Proxmox <8.3): integer (0 or 1) * Modern (Proxmox 8.3+): object {"enabled":N,"available":N} - Updated VMStatus.Agent from `int` to `VMAgentField` - Updated all references to `detailedStatus.Agent` to use `.Agent.Value` - The unmarshaler prioritizes the "available" field over "enabled" to ensure we only query when the agent is actually responding This fix maintains backward compatibility with older Proxmox versions while supporting the new format introduced in Proxmox 8.3+.	2025-11-06 18:42:46 +00:00
rcourtman	af55362009	Fix inflated RAM usage reporting for LXC containers Related to #553 ## Problem LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%, 96% when actual was 61%) because the code used the raw `mem` value from Proxmox's `/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which includes reclaimable cache and buffers, making memory appear nearly full even when plenty is available. ## Root Cause - Nodes: Had sophisticated cache-aware memory calculation with RRD fallbacks - VMs (qemu): Had detailed memory calculation using guest agent meminfo - LXCs: Naively used `res.Mem` directly without any cache-aware correction The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers (from cgroup memory accounting), which should be excluded for accurate "used" memory. ## Solution Implement cache-aware memory calculation for LXC containers by: 1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from `/nodes/{node}/lxc/{vmid}/rrddata` 2. Using RRD `memavailable` to calculate actual used memory (total - available) 3. Falling back to RRD `memused` if `memavailable` is not available 4. Only using cluster resources `mem` value as last resort This matches the approach already used for nodes and VMs, providing consistent cache-aware memory reporting across all resource types. ## Changes - Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox - Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations - Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available - Added guest memory snapshot recording for LXC containers - Updated test stubs to implement the new interface method ## Testing - Code compiles successfully - Follows the same proven pattern used for nodes and VMs - Includes diagnostic snapshot recording for troubleshooting	2025-11-06 00:16:18 +00:00
rcourtman	23691d5b41	Improve cluster health diagnostics and error messaging Related to #405 Enhances error reporting and logging when all cluster endpoints are unhealthy, making it easier to diagnose connectivity issues. Changes: 1. Enhanced error messages in cluster_client.go: - Error now includes list of unreachable endpoints - Added detailed logging when no healthy endpoints available - Log at WARN level (not DEBUG) when cluster health check fails - Better context in recovery attempts with start/completion summaries 2. Improved storage polling resilience in monitor_polling.go: - Better error context when cluster storage polling fails - Specific guidance for "no healthy nodes available" scenario - Storage polling continues with direct node queries even if cluster-wide query fails (already worked, but now clearer) 3. Better recovery logging: - Log when recovery attempts start with list of unhealthy endpoints - Log individual recovery failures at DEBUG level - Log recovery summary (success/failure counts) - Track throttled endpoints separately for clearer diagnostics These changes help users understand: - Which specific endpoints are unreachable - Whether it's a network/connectivity issue vs. API issue - That Pulse will continue trying to recover endpoints automatically - That storage monitoring continues via direct node queries The root issue is that Pulse's internal health tracking can mark all endpoints unhealthy when they're unreachable from the Pulse server, even if Proxmox reports them as "online" in cluster status. Better logging helps diagnose these network connectivity issues.	2025-11-05 19:44:29 +00:00
rcourtman	4c1d7a2797	Fix PMG API parameter issues causing 400 errors Related to #614 Corrects three issues with PMG monitoring: 1. Remove unsupported timeframe parameter from GetMailStatistics - PMG API /statistics/mail does not accept timeframe parameter - Previously sent "timeframe=day" causing 400 error - API returns current day statistics by default 2. Fix GetMailCount timespan parameter to use seconds - Changed from 24 (hours) to 86400 (seconds) - PMG API expects timespan in seconds, not hours - Previously sent "timespan=24" causing 400 error 3. Update function signature and tests - Renamed GetMailCount parameter from timespanHours to timespanSeconds - Updated test expectations to match corrected API calls - Tests verify parameters are sent correctly These changes align the PMG client with actual PMG API requirements, fixing the data population issues reported in v4.25.0.	2025-11-05 19:28:37 +00:00
rcourtman	c93581e1aa	Add DNS caching to reduce excessive DNS queries Related to #608 Implements DNS caching using rs/dnscache to dramatically reduce DNS query volume for frequently accessed Proxmox hosts. Users were reporting 260,000+ DNS queries in 37 hours for the same hostnames. Changes: - Added rs/dnscache dependency for DNS resolution caching - Created pkg/tlsutil/dnscache.go with DNS cache wrapper - Updated HTTP client creation to use cached DNS resolver - Added DNSCacheTimeout configuration option (default: 5 minutes) - Made DNS cache timeout configurable via: - system.json: dnsCacheTimeout field (seconds) - Environment variable: DNS_CACHE_TIMEOUT (duration string) - DNS cache periodically refreshes to prevent stale entries Benefits: - Reduces DNS query load on local DNS servers by ~99% - Reduces network traffic and DNS query log volume - Maintains fresh DNS entries through periodic refresh - Configurable timeout for different network environments Default behavior: 5-minute cache timeout with automatic refresh	2025-11-05 18:25:38 +00:00
rcourtman	6eb1a10d9b	Refactor: Code cleanup and localStorage consolidation This commit includes comprehensive codebase cleanup and refactoring: ## Code Cleanup - Remove dead TypeScript code (types/monitoring.ts - 194 lines duplicate) - Remove unused Go functions (GetClusterNodes, MigratePassword, GetClusterHealthInfo) - Clean up commented-out code blocks across multiple files - Remove unused TypeScript exports (helpTextClass, private tag color helpers) - Delete obsolete test files and components ## localStorage Consolidation - Centralize all storage keys into STORAGE_KEYS constant - Update 5 files to use centralized keys: * utils/apiClient.ts (AUTH, LEGACY_TOKEN) * components/Dashboard/Dashboard.tsx (GUEST_METADATA) * components/Docker/DockerHosts.tsx (DOCKER_METADATA) * App.tsx (PLATFORMS_SEEN) * stores/updates.ts (UPDATES) - Benefits: Single source of truth, prevents typos, better maintainability ## Previous Work Committed - Docker monitoring improvements and disk metrics - Security enhancements and setup fixes - API refactoring and cleanup - Documentation updates - Build system improvements ## Testing - All frontend tests pass (29 tests) - All Go tests pass (15 packages) - Production build successful - Zero breaking changes Total: 186 files changed, 5825 insertions(+), 11602 deletions(-)	2025-11-04 21:50:46 +00:00
rcourtman	32392d1212	Add disk metrics, block I/O, and mount details to Docker monitoring Extends Docker container monitoring with comprehensive disk and storage information: - Writable layer size and root filesystem usage displayed in new Disk column - Block I/O statistics (read/write bytes totals) shown in container drawer - Mount metadata including type, source, destination, mode, and driver details - Configurable via --collect-disk flag (enabled by default, can be disabled for large fleets) Also fixes config watcher to consistently use production auth config path instead of following PULSE_DATA_DIR when in mock mode.	2025-10-29 12:05:36 +00:00
rcourtman	f2acdd59af	Normalize docker agent version handling	2025-10-28 08:42:58 +00:00
rcourtman	68ce8e7520	feat: finalize swarm service monitoring (#598 )	2025-10-26 09:35:49 +00:00
rcourtman	5c54685f04	Add API token scopes and standalone host agent Introduces granular permission scopes for API tokens (docker:report, docker:manage, host-agent:report, monitoring:read/write, settings:read/write) allowing tokens to be restricted to minimum required access. Legacy tokens default to full access until scopes are explicitly configured. Adds standalone host agent for monitoring Linux, macOS, and Windows servers outside Proxmox/Docker estates. New Servers workspace in UI displays uptime, OS metadata, and capacity metrics from enrolled agents. Includes comprehensive token management UI overhaul with scope presets, inline editing, and visual scope indicators.	2025-10-23 11:40:31 +00:00
rcourtman	a885fb5472	Surface LXC interface IPs via PVE interfaces API (#596 )	2025-10-23 08:07:32 +00:00
rcourtman	b95c01066e	Capture dynamic LXC IP metrics (#596 )	2025-10-23 07:50:45 +00:00
rcourtman	be85459db2	Add LXC config metadata for guest drawers (#596 )	2025-10-23 07:30:32 +00:00
rcourtman	aac3dacd63	Improve LXC guest metrics visibility (#596 )	2025-10-22 22:24:33 +00:00
rcourtman	fe1533ea13	Improve PMG metric ingestion refs #551	2025-10-22 18:15:27 +00:00
rcourtman	3a3e0e080c	Add replication monitoring plumbing and UI Refs #395	2025-10-22 16:10:15 +00:00
rcourtman	c9543e8a7e	Add qemu guest agent version metadata	2025-10-22 15:24:07 +00:00
rcourtman	f8b6aa6c97	Treat 501 responses as non-fatal in cluster failover (#449 )	2025-10-22 14:23:13 +00:00
rcourtman	13e2577c57	Handle FreeBSD guest agent disk counters Refs #580	2025-10-22 14:06:45 +00:00
rcourtman	ff4dc49ae4	Update Pulse install flow and related components	2025-10-21 19:58:53 +00:00
rcourtman	7c00055047	feat: unify and improve Proxmox discovery/scanning architecture Replaced inconsistent per-product detection logic with a unified probe architecture using confidence scoring and product-specific matchers. Key improvements: - PBS detection now inspects TLS certs, auth headers (401/403), and probes PBS-specific endpoints (/api2/json/status, /config/datastore) fixing false negatives for self-signed and auth-protected servers - PMG detection uses header analysis first, then conditional endpoint probing, working consistently regardless of port - Single unified probeProxmoxService() replaces separate checkPort8006() and checkServer() code paths, eliminating duplication - Confidence scoring (0.0-1.0+) with evidence tracking for debugging - Consolidated hostname resolution and version handling Technical changes: - Added ProxmoxProbeResult with structured evidence and scoring - Added product matchers: applyPVEHeuristics, applyPMGHeuristics, applyPBSHeuristics - Removed legacy methods: checkPort8006, checkServer, isPMGServer, detectProductFromEndpoint, and duplicate hostname helpers - Updated all tests to use new unified probe architecture - Added probe_test_helpers.go for test access to internal methods All tests passing. Fixes PBS detection issues and improves consistency across PVE/PMG/PBS discovery.	2025-10-21 13:09:41 +00:00
rcourtman	56c6c0cc0c	feat: improve discovery with progress tracking, validation, and structured errors Significantly enhanced network discovery feature to eliminate false positives, provide real-time progress updates, and better error reporting. Key improvements: - Require positive Proxmox identification (version data, auth headers, or certificates) instead of reporting any service on ports 8006/8007 - Add real-time progress tracking with phase/target counts and completion percentage - Implement structured error reporting with IP, phase, type, and timestamp details - Fix TLS timeout handling to prevent hangs on unresponsive hosts - Expose progress and structured errors via WebSocket for UI consumption - Reduce log verbosity by moving discovery logs to debug level - Fix duplicate IP counting to ensure progress reaches 100% Breaking changes: None (backward compatible with legacy API methods)	2025-10-20 22:29:30 +00:00
rcourtman	5ebb32ce10	feat: enhance runtime configuration and system settings management Improves configuration handling and system settings APIs to support v4.24.0 features including runtime logging controls, adaptive polling configuration, and enhanced config export/persistence. Changes: - Add config override system for discovery service - Enhance system settings API with runtime logging controls - Improve config persistence and export functionality - Update security setup handling - Refine monitoring and discovery service integration These changes provide the backend support for the configuration features documented in the v4.24.0 release.	2025-10-20 17:41:19 +00:00
rcourtman	c91b7874ac	docs: comprehensive v4.24.0 documentation audit and updates Complete documentation overhaul for Pulse v4.24.0 release covering all new features and operational procedures. Documentation Updates (19 files): P0 Release-Critical: - Operations: Rewrote ADAPTIVE_POLLING_ROLLOUT.md as GA operations runbook - Operations: Updated ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md with DEFERRED status - Operations: Enhanced audit-log-rotation.md with scheduler health checks - Security: Updated proxy hardening docs with rate limit defaults - Docker: Added runtime logging and rollback procedures P1 Deployment & Integration: - KUBERNETES.md: Runtime logging config, adaptive polling, post-upgrade verification - PORT_CONFIGURATION.md: Service naming, change tracking via update history - REVERSE_PROXY.md: Rate limit headers, error pass-through, v4.24.0 verification - PROXY_AUTH.md, OIDC.md, WEBHOOKS.md: Runtime logging integration - TROUBLESHOOTING.md, VM_DISK_MONITORING.md, zfs-monitoring.md: Updated workflows Features Documented: - X-RateLimit-* headers for all API responses - Updates rollback workflow (UI & CLI) - Scheduler health API with rich metadata - Runtime logging configuration (no restart required) - Adaptive polling (GA, enabled by default) - Enhanced audit logging - Circuit breakers and dead-letter queue Supporting Changes: - Discovery service enhancements - Config handlers updates - Sensor proxy installer improvements Total Changes: 1,626 insertions(+), 622 deletions(-) Files Modified: 24 (19 docs, 5 code) All documentation is production-ready for v4.24.0 release.	2025-10-20 17:20:13 +00:00
rcourtman	7d422d2909	feat: add professional logging with runtime configuration and performance optimization Implements structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug level guards for hot paths, enriched error messages with actionable context, and stack trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.	2025-10-20 15:13:38 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00
rcourtman	b640347a78	fix: improve discovery performance and reliability Discovery Fixes: - Always update cache even when scan finds no servers (prevents stale data) - Remove automatic re-add of deleted nodes to discovery (was causing confusion) - Optimize Docker subnet scanning from 762 IPs to 254 IPs (3x faster) - Add getHostSubnetFromGateway() to detect host network from container Frontend Type Fixes: - Fix ThresholdsTable editScope type errors - Fix SnapshotAlertConfig index signature - Remove unused variable in Settings.tsx These changes make discovery faster, more reliable, and fix the issue where deleted nodes would persist in the discovery cache or immediately reappear.	2025-10-18 22:59:40 +00:00
Pulse Automation Bot	cfdfe896be	Adjust backup and snapshot alert handling	2025-10-18 20:11:01 +00:00
rcourtman	6fdef61710	Expand monitoring and discovery test coverage	2025-10-16 08:17:08 +00:00
rcourtman	4838793677	feat: enhance alerts system with tests and improved thresholds - Add comprehensive test coverage for alerts package with 285+ new tests - Implement ThresholdsTable component with metric thresholds display - Enhance Alerts page UI with improved layout and metric filtering - Add frontend component tests for Alerts page and ThresholdsTable - Set up Vitest testing infrastructure for SolidJS components - Improve config persistence with better validation - Expand discovery tests with 333+ test cases - Update API, configuration, and Docker monitoring documentation	2025-10-15 22:25:04 +00:00
rcourtman	91fecacfef	feat: add docker agent command handling	2025-10-15 19:27:19 +00:00
rcourtman	aaae27dc11	Log memory source transitions for diagnostics (#553 )	2025-10-15 19:19:11 +00:00
rcourtman	32421b36b8	Refs #533 : add total-minus-used memory fallback	2025-10-15 18:19:54 +00:00
rcourtman	5ce47a72ec	Improve discovery classification heuristics Refs #551	2025-10-15 14:08:05 +00:00
rcourtman	881b7f9a54	Fix false ZFS log/cache warnings	2025-10-14 20:57:43 +00:00
rcourtman	7e5fa9a147	fix: restore cache-aware node memory on PVE 8.4	2025-10-14 16:40:45 +00:00
rcourtman	2163d6f5a8	Use guest meminfo available for VM memory usage	2025-10-12 11:03:56 +00:00
rcourtman	f46ff1792b	Fix settings security tab navigation	2025-10-11 23:29:47 +00:00

47 Commits