Commit Graph

34 Commits

Author SHA1 Message Date
rcourtman
160adeb3b8 feat: add scheduler health API endpoint (Phase 2 Task 8)
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.
2025-10-20 15:13:38 +00:00
rcourtman
5fbdf6099f docs: add adaptive polling architecture guide (Phase 2 Task 10)
Comprehensive documentation for Phase 2 adaptive polling:
- Architecture overview with component diagram
- Configuration guide (env vars, defaults, feature flag)
- Prometheus metrics reference (7 new metrics)
- Circuit breaker & backoff behavior explanation
- Dead-letter queue operational guidance
- Rollout plan (dev/QA → staged → full)
- Troubleshooting guide for common issues

Task 10 of 10 complete. Phase 2: 8/10 tasks implemented (80%).
2025-10-20 15:13:37 +00:00
rcourtman
aa5c08ad4a feat: implement priority queue-based task execution (Phase 2 Task 6)
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates

Task 6 of 10 complete (60%). Ready for error/backoff policies.
2025-10-20 15:13:37 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
rcourtman
29f4879cd4 test: add comprehensive security tests and documentation
Implements all remaining Codex recommendations before launch:

1. Privileged Methods Tests:
   - TestPrivilegedMethodsCompleteness ensures all host-side RPCs are protected
   - Will fail if new privileged RPC is added without authorization
   - Verifies read-only methods are NOT in privilegedMethods

2. ID-Mapped Root Detection Tests:
   - TestIDMappedRootDetection covers all boundary conditions
   - Tests UID/GID range detection (both must be in range)
   - Tests multiple ID ranges, edge cases, disabled mode
   - 100% coverage of container identification logic

3. Authorization Tests:
   - TestPrivilegedMethodsBlocked verifies containers can't call privileged RPCs
   - TestIDMappedRootDisabled ensures feature can be disabled
   - Tests both container and host credentials

4. Comprehensive Security Documentation (23 KB):
   - Architecture overview with diagrams
   - Complete authentication & authorization flow
   - Rate limiting details (already implemented: 20/min per peer)
   - SSH security model and forced commands
   - Container isolation mechanisms
   - Monitoring & alerting recommendations
   - Development mode documentation (PULSE_DEV_ALLOW_CONTAINER_SSH)
   - Troubleshooting guide with common issues
   - Incident response procedures

Rate Limiting Status:
- Already implemented in throttle.go (20 req/min, burst 10, max 10 concurrent)
- Per-peer rate limiting at line 328 in main.go
- Per-node concurrency control at line 825 in main.go
- Exceeds Codex's requirements

All tests pass. Documentation covers all security aspects.

Addresses final Codex recommendations for production readiness.
2025-10-19 16:47:13 +00:00
Pulse Automation Bot
0b4e4f9c59 Add configurable backup polling interval 2025-10-18 13:06:41 +00:00
Pulse Automation Bot
d15ad1d0b4 Add Helm chart tooling, CI, and release packaging 2025-10-18 11:50:57 +00:00
Richard Courtman
02701ca22b fix: gracefully handle standalone node cleanup limitation
- Cleanup script now detects forced command restriction on standalone nodes
- Logs helpful message explaining limitation (security by design)
- Does not fail when standalone nodes cannot be cleaned up
- Documents that standalone node cleanup is limited by forced command security
- Automatic cleanup works fully for cluster nodes
- Manual cleanup command provided for standalone nodes if needed
2025-10-18 07:34:18 +00:00
Richard Courtman
b328a09e45 docs: add automatic cleanup documentation for node removal 2025-10-18 07:03:42 +00:00
Richard Courtman
de3bb47930 fix: improve turnkey temperature monitoring for standalone nodes
- Fix script input handling to work with standard curl | bash pattern by prioritizing /dev/tty
- Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors)
- Add comprehensive documentation for turnkey standalone node setup
- Fix printf formatting error in setup script
2025-10-18 06:51:56 +00:00
rcourtman
a5d4d57097 docs: implement Codex recommendations for temperature monitoring
Add comprehensive documentation improvements based on architectural review:

1. Enhanced Known Limitations section:
   - Document single proxy failure mode
   - Explain sensors output parsing brittleness with mitigation steps
   - Clarify cluster discovery dependencies and fallback options
   - Describe SSH fan-out scaling considerations for large clusters

2. Documented SSH key rotation workflow:
   - Promote automated rotation script as recommended approach
   - Include dry-run, execution, and rollback examples
   - Provide manual fallback process
   - Reference existing pulse-proxy-rotate-keys.sh script

3. Added Future Improvements roadmap:
   - Proxmox API integration (when available)
   - Agent-based architecture option
   - SNMP/IPMI support
   - Schema validation
   - Caching and throttling
   - Automated rotation timer
   - Health check endpoint

Instrumentation verified: proxy already has comprehensive Prometheus metrics
(RPC/SSH requests, latency, queue depth, rate limiting) and structured logging.
2025-10-17 12:03:31 +00:00
rcourtman
07fe382553 docs: update temperature monitoring guide to reflect removed UI button
- Replace references to 'Ensure cluster keys' button with instructions to re-run setup script
- Update troubleshooting section for new cluster nodes
- The setup script already handles SSH key distribution automatically
2025-10-17 11:46:31 +00:00
rcourtman
3a4fc044ea Add guest agent caching and update doc hints (refs #560) 2025-10-16 08:15:49 +00:00
rcourtman
4838793677 feat: enhance alerts system with tests and improved thresholds
- Add comprehensive test coverage for alerts package with 285+ new tests
- Implement ThresholdsTable component with metric thresholds display
- Enhance Alerts page UI with improved layout and metric filtering
- Add frontend component tests for Alerts page and ThresholdsTable
- Set up Vitest testing infrastructure for SolidJS components
- Improve config persistence with better validation
- Expand discovery tests with 333+ test cases
- Update API, configuration, and Docker monitoring documentation
2025-10-15 22:25:04 +00:00
rcourtman
91fecacfef feat: add docker agent command handling 2025-10-15 19:27:19 +00:00
rcourtman
78889ffedc Ignore read-only guest filesystems in disk aggregation 2025-10-14 16:13:53 +00:00
rcourtman
261bd7ac74 Adopt multi-token auth across docs, UI, and tooling 2025-10-14 15:47:49 +00:00
rcourtman
5cf0697157 Document optional host-script upgrade path 2025-10-14 13:19:38 +00:00
rcourtman
61020881c4 Align proxy upgrade messaging with node re-add workflow 2025-10-14 13:17:34 +00:00
rcourtman
eda3a08ae5 Document proxy installer upgrade path 2025-10-14 12:43:50 +00:00
rcourtman
e4c3b06f14 Automate sensor proxy container mount and auth 2025-10-14 12:41:48 +00:00
rcourtman
156fd34c50 Update Proxmox guest agent permissions docs and tooling (refs #548) 2025-10-14 10:21:52 +00:00
rcourtman
5c79d2516d feat: streamline docker agent onboarding 2025-10-14 09:45:32 +00:00
rcourtman
d3d4b9811a docs: add manual pulse-sensor-proxy install steps 2025-10-13 19:36:50 +00:00
rcourtman
fcd8b62705 refactor: Rename install-temp-proxy.sh to install-sensor-proxy.sh
Complete the pulse-sensor-proxy rename by updating the installer script name and all references to it.

Updated:
- Renamed scripts/install-temp-proxy.sh → scripts/install-sensor-proxy.sh
- Updated all documentation references
- Updated install.sh references
- Updated build-release.sh comments
2025-10-13 13:23:53 +00:00
rcourtman
b952444837 refactor: Rename pulse-temp-proxy to pulse-sensor-proxy
The name "temp-proxy" implied a temporary or incomplete implementation. The new name better reflects its purpose as a secure sensor data bridge for containerized Pulse deployments.

Changes:
- Renamed cmd/pulse-temp-proxy/ to cmd/pulse-sensor-proxy/
- Updated all path constants and binary references
- Renamed environment variables: PULSE_TEMP_PROXY_* to PULSE_SENSOR_PROXY_*
- Updated systemd service and service account name
- Updated installation, rotation, and build scripts
- Renamed hardening documentation
- Maintained backward compatibility for key removal during upgrades
2025-10-13 13:17:05 +00:00
rcourtman
97066d8351 docs: Update socket paths and add monitoring section to TEMPERATURE_MONITORING.md
Updated documentation to reflect new directory-level bind mount architecture:
- Changed socket path from /var/run/pulse-temp-proxy.sock to /run/pulse-temp-proxy/pulse-temp-proxy.sock
- Updated LXC bind mount syntax to directory-level (create=dir instead of create=file)
- Added "Monitoring the Proxy" section with manual monitoring commands
- Documents systemd restart-on-failure reliance for v1
- Notes future pulse-watchdog integration planned

Related to #528
2025-10-12 22:42:38 +00:00
rcourtman
47116bedb5 docs: Add comprehensive Operations & Troubleshooting section
Addresses operational documentation gaps for pulse-temp-proxy:

- Service management (restart, stop, start, enable/disable)
- Log locations and viewing commands
- SSH key rotation procedures (recommended every 90 days)
- Key revocation when nodes leave cluster
- Failure modes (proxy down, socket issues, pvecm absent, off-cluster)
- Known limitations (one per host, cluster membership, cross-cluster)
- Common issues with troubleshooting steps
- Diagnostic info collection for bug reports

This provides operators with everything they need to manage the proxy service
in production environments.
2025-10-12 21:50:55 +00:00
rcourtman
6d4694f019 security: Add SO_PEERCRED authentication to temperature proxy
Addresses security concern raised in code review:
- Socket permissions changed from 0666 to 0660
- Added SO_PEERCRED verification to authenticate connecting processes
- Only allows root (UID 0) or proxy's own user
- Prevents unauthorized processes from triggering SSH key rollout
- Documented passwordless root SSH requirement for clusters

This prevents any process on the host or in other containers from
accessing the proxy RPC endpoints.
2025-10-12 21:42:22 +00:00
rcourtman
e7bc338891 feat: Implement secure temperature proxy for containerized deployments
Addresses #528

Introduces pulse-temp-proxy architecture to eliminate SSH key exposure in containers:

**Architecture:**
- pulse-temp-proxy runs on Proxmox host (outside LXC/Docker)
- SSH keys stored on host filesystem (/var/lib/pulse-temp-proxy/ssh/)
- Pulse communicates via unix socket (bind-mounted into container)
- Proxy handles cluster discovery, key rollout, and temperature fetching

**Components:**
- cmd/pulse-temp-proxy: Standalone Go binary with unix socket RPC server
- internal/tempproxy: Client library for Pulse backend
- scripts/install-temp-proxy.sh: Idempotent installer for existing deployments
- scripts/pulse-temp-proxy.service: Systemd service for proxy

**Integration:**
- Pulse automatically detects and uses proxy when socket exists
- Falls back to direct SSH for native installations
- Installer automatically configures proxy for new LXC deployments
- Existing LXC users can upgrade by running install-temp-proxy.sh

**Security improvements:**
- Container compromise no longer exposes SSH keys
- SSH keys never enter container filesystem
- Maintains forced command restrictions
- Transparent to users - no workflow changes

**Documentation:**
- Updated TEMPERATURE_MONITORING.md with new architecture
- Added verification steps and upgrade instructions
- Preserved legacy documentation for native installs
2025-10-12 21:35:35 +00:00
rcourtman
c8e3c93516 fix: Add security gates for containerized temperature monitoring
Addresses #528

- Added opt-in confirmation prompt to setup script with security notice
- Added runtime warning when containerized Pulse uses SSH temperature monitoring
- Documented security considerations and hardening recommendations
- Users must explicitly confirm understanding before enabling in containers
2025-10-12 21:01:25 +00:00
rcourtman
18a88cb4cc Improve NVMe temperature handling 2025-10-12 16:06:55 +00:00
rcourtman
a74baed121 feat: capture Proxmox memory snapshots in diagnostics 2025-10-12 10:25:43 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation 2025-10-11 23:29:47 +00:00