Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
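As a sketch of the request-correlation workflow described above (the endpoint path is hypothetical; the two header names are the documented ones):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical diagnostics URL; substitute your Pulse host and path.
	resp, err := http.Get("http://pulse.example.com/api/diagnostics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// X-Request-ID correlates this call with server-side log lines;
	// X-Diagnostics-Cached-At shows when the cached payload was built.
	fmt.Println("request id:", resp.Header.Get("X-Request-ID"))
	fmt.Println("cached at: ", resp.Header.Get("X-Diagnostics-Cached-At"))
}
```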
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:
TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count
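  (Worked example, assuming polling_interval is in milliseconds: a 10 s
  polling cycle across 10 nodes gives interval_ms = 10,000 / 10 = 1,000,
  which matches the 1 req/sec per-peer default.)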
TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example
CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override
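Illustrative override using the documented field names, with values matching the shipped defaults (1 req/sec, burst 5); the authoritative layout is in CONFIGURATION.md:

```yaml
# pulse-sensor-proxy/config.yaml (sketch; values are the documented defaults)
rate_limit:
  per_peer_interval_ms: 1000  # 1 req/sec; lower for larger clusters per the sizing table
  per_peer_burst: 5           # short bursts tolerated above the steady rate
```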
pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands
TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments
Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:
- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow
Fixed parsing error in temperature proxy diagram (parentheses in
edge label causing render failure).
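Assuming the diagrams are Mermaid (an assumption; the real labels live in the docs), the fix amounts to quoting edge labels that contain parentheses:

```mermaid
flowchart LR
  %% Unquoted edge labels containing parentheses fail to parse; quoting fixes it.
  Pulse -->|"fetch temps (via proxy)"| Proxy
```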
Diagrams should clarify architecture, not recreate implementation.
The v2 installer rollout is complete: dist/install-docker-agent.sh
now contains the bundled v2 installer with embedded library modules.
This planning document served its purpose and is no longer relevant.
Replace internal development phase reference with clear description
of what the adaptive polling scheduler does. 'Phase 2' is internal
jargon that provides no value to users.
- Remove confusing --main flag, use --source for clarity
- Fix timeout issues when building from source in LXC containers
- Increase timeout from 5min to 20min for source builds
- Add PULSE_CONTAINER_TIMEOUT env var for custom timeouts
- Support PULSE_CONTAINER_TIMEOUT=0 to disable timeout
- Fix misleading "Latest version: vX.X.X" message during source builds
- Update documentation to use --source instead of --main
- Simplify auto-update script logic for source builds
Changes:
- install.sh: Check BUILD_FROM_SOURCE early to skip version detection
- install.sh: Adaptive timeout (300s binary, 1200s source builds; sketched below)
- install.sh: Better timeout error messages with recovery instructions
- README.md: Replace --main with --source in examples
- docs/INSTALL.md: Replace --main with --source in examples
- scripts/pulse-auto-update.sh: Remove --main special case
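A sketch of the adaptive-timeout selection (only PULSE_CONTAINER_TIMEOUT and BUILD_FROM_SOURCE are names from this change; the rest is illustrative):

```bash
# Pick the default timeout by install type, then let the env var override it.
if [ "$BUILD_FROM_SOURCE" = "true" ]; then
  timeout_secs=1200   # source builds: 20 min
else
  timeout_secs=300    # binary installs: 5 min
fi
# Operators can override, or set PULSE_CONTAINER_TIMEOUT=0 to disable.
timeout_secs="${PULSE_CONTAINER_TIMEOUT:-$timeout_secs}"
```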
Add detailed API reference and update rollout playbook:
**New: docs/api/SCHEDULER_HEALTH.md**
- Complete endpoint reference for /api/monitoring/scheduler/health
- Request/response structure with field descriptions
- Enhanced "instances" array documentation
- Example responses showing all states (healthy, transient, DLQ)
- Useful jq queries for troubleshooting:
- Find instances with errors
- List DLQ entries
- Show open circuit breakers
- Sort by failure streaks
- Migration guide (legacy → new fields)
- Troubleshooting examples with real scenarios
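In the spirit of the jq queries listed above, a hedged example (only instances[], pollStatus, breaker, and deadLetter are documented names; host, auth scheme, and the name field are assumptions):

```sh
# List dead-letter entries with their last poll status and breaker state.
curl -s -H "Authorization: Bearer $PULSE_TOKEN" \
  "$PULSE_URL/api/monitoring/scheduler/health" |
  jq '.instances[] | select(.deadLetter) | {name, pollStatus, breaker}'
```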
**Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md**
- Enhanced "Accessing Scheduler Health API" section (§6)
- Added examples using new instances[] array
- Updated queries to use pollStatus, breaker, deadLetter fields
- Practical jq commands for operators
**Key Documentation Features:**
- Complete JSON schema with examples
- All new fields documented with types and descriptions
- Real-world troubleshooting scenarios
- Copy-paste ready jq queries
- Migration path for existing integrations
- Backward compatibility notes
Operators can now:
- Find error messages without digging through logs
- Understand circuit breaker states
- Track DLQ entries with full context
- Diagnose issues with a single API call
Part of Phase 2 follow-up: enhanced observability
Document decision to defer mutation endpoints after soak testing:
**Assessment Results:**
- Integration tests (55s, 12 instances): Automatic recovery worked perfectly
- Soak tests (2-240min, 80 instances): No manual intervention needed
- Circuit breakers: Opened/closed automatically as designed
- DLQ routing: Permanent failures handled correctly
**Current Capabilities (Sufficient):**
- Read-only scheduler health API provides full visibility
- Operator workarounds: service restart, feature flag toggle
- Grafana alerting: queue depth, staleness, DLQ, breakers
**Why Defer:**
- No operational need demonstrated in testing
- Implementation requires auth/RBAC/audit/UI work
- Cost not justified until production usage reveals need
- Can add later when data shows actual pain points
**Future Design Notes:**
- POST /api/monitoring/breakers/{instance}/reset
- POST /api/monitoring/dlq/retry (all or specific)
- DELETE /api/monitoring/dlq/{instance}
- Auth, audit, rate limiting, UI integration required
**Re-evaluation Criteria:**
- Operators request controls >3x in 30 days
- Troubleshooting steps inadequate
- Service restarts too disruptive
- Production incidents need surgical controls
Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns.
Part of Phase 2 - Adaptive Polling completion
Removed all legacy Pulse+ agent metrics infrastructure (cloud-relay), which has been
fully replaced by the new docker agent and temperature agent implementations.
Changes:
- Remove cloud-relay directory and all related binaries (relay, relay-linux, etc.)
- Remove Pulse+ documentation (AGENT_METRICS_IMPLEMENTATION.md, AGENT_METRICS_SETUP.md)
- Clean up pulse-relay references in workflows and release checklist
- Add audit log rotation documentation for sensor proxy hash-chained logs
- Update .gitignore to remove cloud-relay/ entry
The new docker and temperature agents remain fully functional and unaffected by this cleanup.
Removed PHASE1_SUMMARY.md and PHASE2_SUMMARY.md as both phases are complete.
All relevant documentation has been integrated into the main docs:
- Security hardening docs in SECURITY.md
- Adaptive polling architecture in docs/monitoring/ADAPTIVE_POLLING.md
Updated PHASE2_SUMMARY.md to include:
- ✅ Task 8: Scheduler health API endpoint completion
- ✅ Task 9: Unit testing completion (40+ test cases)
- Updated git commit history (9 commits total)
- Revised known limitations (removed API/testing gaps)
- Updated future work section
Phase 2 achievements:
- 9/10 tasks complete (only integration/soak tests deferred)
- 40+ unit tests covering backoff, circuit breakers, staleness
- Full scheduler health API with authentication
- Comprehensive documentation and rollout plan
- Production-ready with feature flag control
Remaining work (deferred to future):
- Integration tests with mock PVE/PBS clients
- Soak tests for extended queue stability
- Write endpoints for circuit breaker/DLQ management
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance
New API endpoint:
GET /api/monitoring/scheduler/health (requires authentication)
New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data
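To illustrate the snapshot pattern these methods share, a minimal sketch shaped like circuitBreaker.State(); the exported fields match the list above, while locking and type names are assumptions:

```go
package scheduler

import (
	"sync"
	"time"
)

// BreakerState is the read-only view exported to the health endpoint.
type BreakerState struct {
	State    string    // e.g. "closed", "open", "half-open"
	Failures int       // consecutive failure count
	RetryAt  time.Time // when an open breaker next allows a probe
}

type circuitBreaker struct {
	mu       sync.Mutex
	state    string
	failures int
	retryAt  time.Time
}

// State copies the mutable fields under the lock so the health endpoint
// gets a consistent view without racing the scheduler.
func (b *circuitBreaker) State() BreakerState {
	b.mu.Lock()
	defer b.mu.Unlock()
	return BreakerState{State: b.state, Failures: b.failures, RetryAt: b.retryAt}
}
```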
Documentation updated with API spec, field descriptions, and usage examples.
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates
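Since NextRun ordering is the heart of this change, a self-contained container/heap sketch (names illustrative; the real TaskQueue layers WaitNext, upserts, and depth tracking on top):

```go
package main

import (
	"container/heap"
	"time"
)

// task is a minimal stand-in for a scheduled poll.
type task struct {
	Instance string
	NextRun  time.Time
}

// taskHeap keeps the earliest-due task at the root (a min-heap on NextRun),
// which is what lets workers block until the head task is due.
type taskHeap []*task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].NextRun.Before(h[j].NextRun) }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

func main() {
	h := &taskHeap{}
	heap.Init(h)
	heap.Push(h, &task{"pve-1", time.Now().Add(30 * time.Second)})
	heap.Push(h, &task{"pbs-1", time.Now().Add(5 * time.Second)})
	next := heap.Pop(h).(*task) // pbs-1: due soonest
	_ = next
}
```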
Task 6 of 10 complete (60%). Ready for error/backoff policies.
Implements all remaining Codex recommendations before launch:
1. Privileged Methods Tests:
- TestPrivilegedMethodsCompleteness ensures all host-side RPCs are protected
- Will fail if a new privileged RPC is added without authorization
- Verifies read-only methods are NOT in privilegedMethods
2. ID-Mapped Root Detection Tests:
- TestIDMappedRootDetection covers all boundary conditions
- Tests UID/GID range detection (both must be in range)
- Tests multiple ID ranges, edge cases, disabled mode
- 100% coverage of container identification logic
3. Authorization Tests:
- TestPrivilegedMethodsBlocked verifies containers can't call privileged RPCs
- TestIDMappedRootDisabled ensures feature can be disabled
- Tests both container and host credentials
4. Comprehensive Security Documentation (23 KB):
- Architecture overview with diagrams
- Complete authentication & authorization flow
- Rate limiting details (already implemented: 20/min per peer)
- SSH security model and forced commands
- Container isolation mechanisms
- Monitoring & alerting recommendations
- Development mode documentation (PULSE_DEV_ALLOW_CONTAINER_SSH)
- Troubleshooting guide with common issues
- Incident response procedures
Rate Limiting Status:
- Already implemented in throttle.go (20 req/min, burst 10, max 10 concurrent)
- Per-peer rate limiting at line 328 in main.go
- Per-node concurrency control at line 825 in main.go
- Exceeds Codex's requirements
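The documented limits map naturally onto golang.org/x/time/rate plus a buffered-channel semaphore; a sketch of equivalent limits (whether throttle.go uses this package is an assumption):

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// 20 requests/min per peer = one token every 3s, with a burst of 10.
	limiter := rate.NewLimiter(rate.Every(3*time.Second), 10)

	// Max 10 concurrent requests, enforced with a buffered-channel semaphore.
	sem := make(chan struct{}, 10)

	handle := func() {
		if !limiter.Allow() {
			return // over the per-peer rate: reject
		}
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			// ... process the request ...
		default:
			return // at the concurrency cap: reject
		}
	}
	handle()
}
```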
All tests pass. Documentation covers all security aspects.
Addresses final Codex recommendations for production readiness.
- Cleanup script now detects forced command restriction on standalone nodes
- Logs a helpful message explaining the limitation (security by design)
- Does not fail when standalone nodes cannot be cleaned up
- Documents that standalone node cleanup is limited by forced command security
- Automatic cleanup works fully for cluster nodes
- Manual cleanup command provided for standalone nodes if needed
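For context on why cleanup is restricted: a forced-command key can only ever run its pinned command, so a remote cleanup over that key cannot modify authorized_keys. Illustrative entry (command path and key are hypothetical):

```
command="/usr/local/bin/pulse-read-sensors",restrict ssh-ed25519 AAAA... pulse@host
```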
- Fix script input handling to work with the standard curl | bash pattern by prioritizing /dev/tty (sketch below)
- Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors)
- Add comprehensive documentation for turnkey standalone node setup
- Fix printf formatting error in setup script
- Replace references to 'Ensure cluster keys' button with instructions to re-run setup script
- Update troubleshooting section for new cluster nodes
- The setup script already handles SSH key distribution automatically
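The /dev/tty fix above works because under curl | bash, stdin is the piped script itself, so prompts must read from the controlling terminal when one is available. A sketch of the pattern:

```bash
# Prefer the controlling terminal for prompts; stdin is the piped script.
if [ -r /dev/tty ]; then
  read -rp "Continue? [y/N] " answer < /dev/tty
else
  answer="n"  # non-interactive: fall back to a safe default
fi
```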
- Add comprehensive test coverage for alerts package with 285+ new tests
- Implement ThresholdsTable component with metric thresholds display
- Enhance Alerts page UI with improved layout and metric filtering
- Add frontend component tests for Alerts page and ThresholdsTable
- Set up Vitest testing infrastructure for SolidJS components
- Improve config persistence with better validation
- Expand discovery tests with 333+ test cases
- Update API, configuration, and Docker monitoring documentation
Complete the pulse-sensor-proxy rename by updating the installer script name and all references to it.
Updated:
- Renamed scripts/install-temp-proxy.sh → scripts/install-sensor-proxy.sh
- Updated all documentation references
- Updated install.sh references
- Updated build-release.sh comments
The name "temp-proxy" implied a temporary or incomplete implementation. The new name better reflects its purpose as a secure sensor data bridge for containerized Pulse deployments.
Changes:
- Renamed cmd/pulse-temp-proxy/ to cmd/pulse-sensor-proxy/
- Updated all path constants and binary references
- Renamed environment variables: PULSE_TEMP_PROXY_* to PULSE_SENSOR_PROXY_*
- Updated systemd service and service account name
- Updated installation, rotation, and build scripts
- Renamed hardening documentation
- Maintained backward compatibility for key removal during upgrades
Updated documentation to reflect new directory-level bind mount architecture:
- Changed socket path from /var/run/pulse-temp-proxy.sock to /run/pulse-temp-proxy/pulse-temp-proxy.sock
- Updated LXC bind mount syntax to directory-level (create=dir instead of create=file)
- Added "Monitoring the Proxy" section with manual monitoring commands
- Documents that v1 relies on systemd restart-on-failure for recovery
- Notes that pulse-watchdog integration is planned for the future
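Illustrative container config line for the directory-level mount (the exact syntax is in the updated doc; treat this rendering as an assumption):

```
lxc.mount.entry: /run/pulse-temp-proxy run/pulse-temp-proxy none bind,create=dir 0 0
```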
Related to #528
Addresses operational documentation gaps for pulse-temp-proxy:
- Service management (restart, stop, start, enable/disable)
- Log locations and viewing commands (examples after this list)
- SSH key rotation procedures (recommended every 90 days)
- Key revocation when nodes leave cluster
- Failure modes (proxy down, socket issues, pvecm absent, off-cluster)
- Known limitations (one per host, cluster membership, cross-cluster)
- Common issues with troubleshooting steps
- Diagnostic info collection for bug reports
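Typical commands behind the first two items (unit name per scripts/pulse-temp-proxy.service; journald is assumed for logs):

```sh
systemctl restart pulse-temp-proxy                   # service management
systemctl enable --now pulse-temp-proxy
journalctl -u pulse-temp-proxy --since "1 hour ago"  # view recent logs
```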
This provides operators with everything they need to manage the proxy service
in production environments.
Addresses security concern raised in code review:
- Socket permissions changed from 0666 to 0660
- Added SO_PEERCRED verification to authenticate connecting processes
- Only allows root (UID 0) or proxy's own user
- Prevents unauthorized processes from triggering SSH key rollout
- Documented passwordless root SSH requirement for clusters
This prevents unauthorized processes on the host or in other containers from
accessing the proxy RPC endpoints.
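A minimal sketch of such a check using golang.org/x/sys/unix (the function shape is illustrative; the admitted UIDs are the ones listed above):

```go
package main

import (
	"net"
	"os"

	"golang.org/x/sys/unix"
)

// peerAllowed reads the connecting process's credentials from the kernel
// via SO_PEERCRED and admits only root or the proxy's own user.
func peerAllowed(conn *net.UnixConn) (bool, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return false, err
	}
	var cred *unix.Ucred
	var credErr error
	if err := raw.Control(func(fd uintptr) {
		cred, credErr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
	}); err != nil {
		return false, err
	}
	if credErr != nil {
		return false, credErr
	}
	return cred.Uid == 0 || cred.Uid == uint32(os.Getuid()), nil
}

func main() {
	// Listener wiring omitted; peerAllowed would run per accepted connection.
}
```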
Addresses #528
Introduces pulse-temp-proxy architecture to eliminate SSH key exposure in containers:
**Architecture:**
- pulse-temp-proxy runs on Proxmox host (outside LXC/Docker)
- SSH keys stored on host filesystem (/var/lib/pulse-temp-proxy/ssh/)
- Pulse communicates via unix socket (bind-mounted into container)
- Proxy handles cluster discovery, key rollout, and temperature fetching
**Components:**
- cmd/pulse-temp-proxy: Standalone Go binary with unix socket RPC server
- internal/tempproxy: Client library for Pulse backend
- scripts/install-temp-proxy.sh: Idempotent installer for existing deployments
- scripts/pulse-temp-proxy.service: Systemd service for proxy
**Integration:**
- Pulse automatically detects and uses the proxy when the socket exists
- Falls back to direct SSH for native installations (sketch at the end of this message)
- Installer automatically configures proxy for new LXC deployments
- Existing LXC users can upgrade by running install-temp-proxy.sh
**Security improvements:**
- Container compromise no longer exposes SSH keys
- SSH keys never enter container filesystem
- Maintains forced command restrictions
- Transparent to users: no workflow changes
**Documentation:**
- Updated TEMPERATURE_MONITORING.md with new architecture
- Added verification steps and upgrade instructions
- Preserved legacy documentation for native installs
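A sketch of the detect-and-fall-back behavior (helper name illustrative; socket path per this change, later moved under /run/pulse-temp-proxy/):

```go
package main

import (
	"fmt"
	"os"
)

// Socket path as of this change; the bind mount places it in the container.
const proxySocket = "/var/run/pulse-temp-proxy.sock"

// useProxy reports whether the bind-mounted proxy socket is present,
// in which case Pulse talks RPC to the host-side proxy instead of SSH.
func useProxy() bool {
	fi, err := os.Stat(proxySocket)
	return err == nil && fi.Mode()&os.ModeSocket != 0
}

func main() {
	if useProxy() {
		fmt.Println("temperature source: pulse-temp-proxy (unix socket RPC)")
	} else {
		fmt.Println("temperature source: direct SSH (native install)")
	}
}
```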