Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
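As a sketch of the request-correlation workflow described above (the endpoint path is hypothetical; the two header names are the documented ones):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical diagnostics URL; substitute your Pulse host and path.
	resp, err := http.Get("http://pulse.example.com/api/diagnostics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// X-Request-ID correlates this call with server-side log lines;
	// X-Diagnostics-Cached-At shows when the cached payload was built.
	fmt.Println("request id:", resp.Header.Get("X-Request-ID"))
	fmt.Println("cached at: ", resp.Header.Get("X-Diagnostics-Cached-At"))
}
```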
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:
TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count
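  (Worked example, assuming polling_interval is in milliseconds: a 10 s
  polling cycle across 10 nodes gives interval_ms = 10,000 / 10 = 1,000,
  which matches the 1 req/sec per-peer default.)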
TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example
CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override
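Illustrative override using the documented field names, with values matching the shipped defaults (1 req/sec, burst 5); the authoritative layout is in CONFIGURATION.md:

```yaml
# pulse-sensor-proxy/config.yaml (sketch; values are the documented defaults)
rate_limit:
  per_peer_interval_ms: 1000  # 1 req/sec; lower for larger clusters per the sizing table
  per_peer_burst: 5           # short bursts tolerated above the steady rate
```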
pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands
TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments
Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:
- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow
Fixed parsing error in temperature proxy diagram (parentheses in
edge label causing render failure).
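Assuming the diagrams are Mermaid (an assumption; the real labels live in the docs), the fix amounts to quoting edge labels that contain parentheses:

```mermaid
flowchart LR
  %% Unquoted edge labels containing parentheses fail to parse; quoting fixes it.
  Pulse -->|"fetch temps (via proxy)"| Proxy
```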
Diagrams should clarify architecture, not recreate implementation.
The v2 installer rollout is complete: dist/install-docker-agent.sh
now contains the bundled v2 installer with embedded library modules.
This planning document served its purpose and is no longer relevant.
Replace internal development phase reference with clear description
of what the adaptive polling scheduler does. 'Phase 2' is internal
jargon that provides no value to users.
- Remove confusing --main flag, use --source for clarity
- Fix timeout issues when building from source in LXC containers
- Increase timeout from 5min to 20min for source builds
- Add PULSE_CONTAINER_TIMEOUT env var for custom timeouts
- Support PULSE_CONTAINER_TIMEOUT=0 to disable timeout
- Fix misleading "Latest version: vX.X.X" message during source builds
- Update documentation to use --source instead of --main
- Simplify auto-update script logic for source builds
Changes:
- install.sh: Check BUILD_FROM_SOURCE early to skip version detection
- install.sh: Adaptive timeout (300s binary, 1200s source builds; sketched below)
- install.sh: Better timeout error messages with recovery instructions
- README.md: Replace --main with --source in examples
- docs/INSTALL.md: Replace --main with --source in examples
- scripts/pulse-auto-update.sh: Remove --main special case
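A sketch of the adaptive-timeout selection (only PULSE_CONTAINER_TIMEOUT and BUILD_FROM_SOURCE are names from this change; the rest is illustrative):

```bash
# Pick the default timeout by install type, then let the env var override it.
if [ "$BUILD_FROM_SOURCE" = "true" ]; then
  timeout_secs=1200   # source builds: 20 min
else
  timeout_secs=300    # binary installs: 5 min
fi
# Operators can override, or set PULSE_CONTAINER_TIMEOUT=0 to disable.
timeout_secs="${PULSE_CONTAINER_TIMEOUT:-$timeout_secs}"
```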
Add detailed API reference and update rollout playbook:
**New: docs/api/SCHEDULER_HEALTH.md**
- Complete endpoint reference for /api/monitoring/scheduler/health
- Request/response structure with field descriptions
- Enhanced "instances" array documentation
- Example responses showing all states (healthy, transient, DLQ)
- Useful jq queries for troubleshooting:
- Find instances with errors
- List DLQ entries
- Show open circuit breakers
- Sort by failure streaks
- Migration guide (legacy → new fields)
- Troubleshooting examples with real scenarios
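In the spirit of the jq queries listed above, a hedged example (only instances[], pollStatus, breaker, and deadLetter are documented names; host, auth scheme, and the name field are assumptions):

```sh
# List dead-letter entries with their last poll status and breaker state.
curl -s -H "Authorization: Bearer $PULSE_TOKEN" \
  "$PULSE_URL/api/monitoring/scheduler/health" |
  jq '.instances[] | select(.deadLetter) | {name, pollStatus, breaker}'
```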
**Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md**
- Enhanced "Accessing Scheduler Health API" section (§6)
- Added examples using new instances[] array
- Updated queries to use pollStatus, breaker, deadLetter fields
- Practical jq commands for operators
**Key Documentation Features:**
- Complete JSON schema with examples
- All new fields documented with types and descriptions
- Real-world troubleshooting scenarios
- Copy-paste ready jq queries
- Migration path for existing integrations
- Backward compatibility notes
Operators can now:
- Find error messages without digging through logs
- Understand circuit breaker states
- Track DLQ entries with full context
- Diagnose issues with a single API call
Part of Phase 2 follow-up: enhanced observability
Document decision to defer mutation endpoints after soak testing:
**Assessment Results:**
- Integration tests (55s, 12 instances): Automatic recovery worked perfectly
- Soak tests (2-240min, 80 instances): No manual intervention needed
- Circuit breakers: Opened/closed automatically as designed
- DLQ routing: Permanent failures handled correctly
**Current Capabilities (Sufficient):**
- Read-only scheduler health API provides full visibility
- Operator workarounds: service restart, feature flag toggle
- Grafana alerting: queue depth, staleness, DLQ, breakers
**Why Defer:**
- No operational need demonstrated in testing
- Implementation requires auth/RBAC/audit/UI work
- Cost not justified until production usage reveals need
- Can add later when data shows actual pain points
**Future Design Notes:**
- POST /api/monitoring/breakers/{instance}/reset
- POST /api/monitoring/dlq/retry (all or specific)
- DELETE /api/monitoring/dlq/{instance}
- Auth, audit, rate limiting, UI integration required
**Re-evaluation Criteria:**
- Operators request controls >3x in 30 days
- Troubleshooting steps inadequate
- Service restarts too disruptive
- Production incidents need surgical controls
Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns.
Part of Phase 2 - Adaptive Polling completion
Removed all legacy Pulse+ agent metrics infrastructure (cloud-relay), which has been
fully replaced by the new docker agent and temperature agent implementations.
Changes:
- Remove cloud-relay directory and all related binaries (relay, relay-linux, etc.)
- Remove Pulse+ documentation (AGENT_METRICS_IMPLEMENTATION.md, AGENT_METRICS_SETUP.md)
- Clean up pulse-relay references in workflows and release checklist
- Add audit log rotation documentation for sensor proxy hash-chained logs
- Update .gitignore to remove cloud-relay/ entry
The new docker and temperature agents remain fully functional and unaffected by this cleanup.
Removed PHASE1_SUMMARY.md and PHASE2_SUMMARY.md as both phases are complete.
All relevant documentation has been integrated into the main docs:
- Security hardening docs in SECURITY.md
- Adaptive polling architecture in docs/monitoring/ADAPTIVE_POLLING.md
Updated PHASE2_SUMMARY.md to include:
- ✅ Task 8: Scheduler health API endpoint completion
- ✅ Task 9: Unit testing completion (40+ test cases)
- Updated git commit history (9 commits total)
- Revised known limitations (removed API/testing gaps)
- Updated future work section
Phase 2 achievements:
- 9/10 tasks complete (only integration/soak tests deferred)
- 40+ unit tests covering backoff, circuit breakers, staleness
- Full scheduler health API with authentication
- Comprehensive documentation and rollout plan
- Production-ready with feature flag control
Remaining work (deferred to future):
- Integration tests with mock PVE/PBS clients
- Soak tests for extended queue stability
- Write endpoints for circuit breaker/DLQ management
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance
New API endpoint:
GET /api/monitoring/scheduler/health (requires authentication)
New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data
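To illustrate the snapshot pattern these methods share, a minimal sketch shaped like circuitBreaker.State(); the exported fields match the list above, while locking and type names are assumptions:

```go
package scheduler

import (
	"sync"
	"time"
)

// BreakerState is the read-only view exported to the health endpoint.
type BreakerState struct {
	State    string    // e.g. "closed", "open", "half-open"
	Failures int       // consecutive failure count
	RetryAt  time.Time // when an open breaker next allows a probe
}

type circuitBreaker struct {
	mu       sync.Mutex
	state    string
	failures int
	retryAt  time.Time
}

// State copies the mutable fields under the lock so the health endpoint
// gets a consistent view without racing the scheduler.
func (b *circuitBreaker) State() BreakerState {
	b.mu.Lock()
	defer b.mu.Unlock()
	return BreakerState{State: b.state, Failures: b.failures, RetryAt: b.retryAt}
}
```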
Documentation updated with API spec, field descriptions, and usage examples.
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates
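Since NextRun ordering is the heart of this change, a self-contained container/heap sketch (names illustrative; the real TaskQueue layers WaitNext, upserts, and depth tracking on top):

```go
package main

import (
	"container/heap"
	"time"
)

// task is a minimal stand-in for a scheduled poll.
type task struct {
	Instance string
	NextRun  time.Time
}

// taskHeap keeps the earliest-due task at the root (a min-heap on NextRun),
// which is what lets workers block until the head task is due.
type taskHeap []*task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].NextRun.Before(h[j].NextRun) }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

func main() {
	h := &taskHeap{}
	heap.Init(h)
	heap.Push(h, &task{"pve-1", time.Now().Add(30 * time.Second)})
	heap.Push(h, &task{"pbs-1", time.Now().Add(5 * time.Second)})
	next := heap.Pop(h).(*task) // pbs-1: due soonest
	_ = next
}
```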
Task 6 of 10 complete (60%). Ready for error/backoff policies.
Implements all remaining Codex recommendations before launch:
1. Privileged Methods Tests:
- TestPrivilegedMethodsCompleteness ensures all host-side RPCs are protected
- Will fail if a new privileged RPC is added without authorization
- Verifies read-only methods are NOT in privilegedMethods
2. ID-Mapped Root Detection Tests:
- TestIDMappedRootDetection covers all boundary conditions
- Tests UID/GID range detection (both must be in range)
- Tests multiple ID ranges, edge cases, disabled mode
- 100% coverage of container identification logic
3. Authorization Tests:
- TestPrivilegedMethodsBlocked verifies containers can't call privileged RPCs
- TestIDMappedRootDisabled ensures feature can be disabled
- Tests both container and host credentials
4. Comprehensive Security Documentation (23 KB):
- Architecture overview with diagrams
- Complete authentication & authorization flow
- Rate limiting details (already implemented: 20/min per peer)
- SSH security model and forced commands
- Container isolation mechanisms
- Monitoring & alerting recommendations
- Development mode documentation (PULSE_DEV_ALLOW_CONTAINER_SSH)
- Troubleshooting guide with common issues
- Incident response procedures
Rate Limiting Status:
- Already implemented in throttle.go (20 req/min, burst 10, max 10 concurrent)
- Per-peer rate limiting at line 328 in main.go
- Per-node concurrency control at line 825 in main.go
- Exceeds Codex's requirements
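The documented limits map naturally onto golang.org/x/time/rate plus a buffered-channel semaphore; a sketch of equivalent limits (whether throttle.go uses this package is an assumption):

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// 20 requests/min per peer = one token every 3s, with a burst of 10.
	limiter := rate.NewLimiter(rate.Every(3*time.Second), 10)

	// Max 10 concurrent requests, enforced with a buffered-channel semaphore.
	sem := make(chan struct{}, 10)

	handle := func() {
		if !limiter.Allow() {
			return // over the per-peer rate: reject
		}
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			// ... process the request ...
		default:
			return // at the concurrency cap: reject
		}
	}
	handle()
}
```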
All tests pass. Documentation covers all security aspects.
Addresses final Codex recommendations for production readiness.
- Cleanup script now detects forced command restriction on standalone nodes
- Logs a helpful message explaining the limitation (security by design)
- Does not fail when standalone nodes cannot be cleaned up
- Documents that standalone node cleanup is limited by forced command security
- Automatic cleanup works fully for cluster nodes
- Manual cleanup command provided for standalone nodes if needed
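For context on why cleanup is restricted: a forced-command key can only ever run its pinned command, so a remote cleanup over that key cannot modify authorized_keys. Illustrative entry (command path and key are hypothetical):

```
command="/usr/local/bin/pulse-read-sensors",restrict ssh-ed25519 AAAA... pulse@host
```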
- Fix script input handling to work with the standard curl | bash pattern by prioritizing /dev/tty (sketch below)
- Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors)
- Add comprehensive documentation for turnkey standalone node setup
- Fix printf formatting error in setup script
- Replace references to 'Ensure cluster keys' button with instructions to re-run setup script
- Update troubleshooting section for new cluster nodes
- The setup script already handles SSH key distribution automatically
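The /dev/tty fix above works because under curl | bash, stdin is the piped script itself, so prompts must read from the controlling terminal when one is available. A sketch of the pattern:

```bash
# Prefer the controlling terminal for prompts; stdin is the piped script.
if [ -r /dev/tty ]; then
  read -rp "Continue? [y/N] " answer < /dev/tty
else
  answer="n"  # non-interactive: fall back to a safe default
fi
```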
- Add comprehensive test coverage for alerts package with 285+ new tests
- Implement ThresholdsTable component with metric thresholds display
- Enhance Alerts page UI with improved layout and metric filtering
- Add frontend component tests for Alerts page and ThresholdsTable
- Set up Vitest testing infrastructure for SolidJS components
- Improve config persistence with better validation
- Expand discovery tests with 333+ test cases
- Update API, configuration, and Docker monitoring documentation
Complete the pulse-sensor-proxy rename by updating the installer script name and all references to it.
Updated:
- Renamed scripts/install-temp-proxy.sh → scripts/install-sensor-proxy.sh
- Updated all documentation references
- Updated install.sh references
- Updated build-release.sh comments
The name "temp-proxy" implied a temporary or incomplete implementation. The new name better reflects its purpose as a secure sensor data bridge for containerized Pulse deployments.
Changes:
- Renamed cmd/pulse-temp-proxy/ to cmd/pulse-sensor-proxy/
- Updated all path constants and binary references
- Renamed environment variables: PULSE_TEMP_PROXY_* to PULSE_SENSOR_PROXY_*
- Updated systemd service and service account name
- Updated installation, rotation, and build scripts
- Renamed hardening documentation
- Maintained backward compatibility for key removal during upgrades
Updated documentation to reflect new directory-level bind mount architecture:
- Changed socket path from /var/run/pulse-temp-proxy.sock to /run/pulse-temp-proxy/pulse-temp-proxy.sock
- Updated LXC bind mount syntax to directory-level (create=dir instead of create=file)
- Added "Monitoring the Proxy" section with manual monitoring commands
- Documents that v1 relies on systemd restart-on-failure for recovery
- Notes that pulse-watchdog integration is planned for the future
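Illustrative container config line for the directory-level mount (the exact syntax is in the updated doc; treat this rendering as an assumption):

```
lxc.mount.entry: /run/pulse-temp-proxy run/pulse-temp-proxy none bind,create=dir 0 0
```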
Related to #528
Addresses operational documentation gaps for pulse-temp-proxy:
- Service management (restart, stop, start, enable/disable)
- Log locations and viewing commands (examples after this list)
- SSH key rotation procedures (recommended every 90 days)
- Key revocation when nodes leave cluster
- Failure modes (proxy down, socket issues, pvecm absent, off-cluster)
- Known limitations (one per host, cluster membership, cross-cluster)
- Common issues with troubleshooting steps
- Diagnostic info collection for bug reports
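Typical commands behind the first two items (unit name per scripts/pulse-temp-proxy.service; journald is assumed for logs):

```sh
systemctl restart pulse-temp-proxy                   # service management
systemctl enable --now pulse-temp-proxy
journalctl -u pulse-temp-proxy --since "1 hour ago"  # view recent logs
```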
This provides operators with everything they need to manage the proxy service
in production environments.
Addresses security concern raised in code review:
- Socket permissions changed from 0666 to 0660
- Added SO_PEERCRED verification to authenticate connecting processes
- Only allows root (UID 0) or proxy's own user
- Prevents unauthorized processes from triggering SSH key rollout
- Documented passwordless root SSH requirement for clusters
This prevents unauthorized processes on the host or in other containers from
accessing the proxy RPC endpoints.
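A minimal sketch of such a check using golang.org/x/sys/unix (the function shape is illustrative; the admitted UIDs are the ones listed above):

```go
package main

import (
	"net"
	"os"

	"golang.org/x/sys/unix"
)

// peerAllowed reads the connecting process's credentials from the kernel
// via SO_PEERCRED and admits only root or the proxy's own user.
func peerAllowed(conn *net.UnixConn) (bool, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return false, err
	}
	var cred *unix.Ucred
	var credErr error
	if err := raw.Control(func(fd uintptr) {
		cred, credErr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
	}); err != nil {
		return false, err
	}
	if credErr != nil {
		return false, credErr
	}
	return cred.Uid == 0 || cred.Uid == uint32(os.Getuid()), nil
}

func main() {
	// Listener wiring omitted; peerAllowed would run per accepted connection.
}
```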
Addresses #528
Introduces pulse-temp-proxy architecture to eliminate SSH key exposure in containers:
**Architecture:**
- pulse-temp-proxy runs on Proxmox host (outside LXC/Docker)
- SSH keys stored on host filesystem (/var/lib/pulse-temp-proxy/ssh/)
- Pulse communicates via unix socket (bind-mounted into container)
- Proxy handles cluster discovery, key rollout, and temperature fetching
**Components:**
- cmd/pulse-temp-proxy: Standalone Go binary with unix socket RPC server
- internal/tempproxy: Client library for Pulse backend
- scripts/install-temp-proxy.sh: Idempotent installer for existing deployments
- scripts/pulse-temp-proxy.service: Systemd service for proxy
**Integration:**
- Pulse automatically detects and uses the proxy when the socket exists
- Falls back to direct SSH for native installations (sketch at the end of this message)
- Installer automatically configures proxy for new LXC deployments
- Existing LXC users can upgrade by running install-temp-proxy.sh
**Security improvements:**
- Container compromise no longer exposes SSH keys
- SSH keys never enter container filesystem
- Maintains forced command restrictions
- Transparent to users: no workflow changes
**Documentation:**
- Updated TEMPERATURE_MONITORING.md with new architecture
- Added verification steps and upgrade instructions
- Preserved legacy documentation for native installs
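A sketch of the detect-and-fall-back behavior (helper name illustrative; socket path per this change, later moved under /run/pulse-temp-proxy/):

```go
package main

import (
	"fmt"
	"os"
)

// Socket path as of this change; the bind mount places it in the container.
const proxySocket = "/var/run/pulse-temp-proxy.sock"

// useProxy reports whether the bind-mounted proxy socket is present,
// in which case Pulse talks RPC to the host-side proxy instead of SSH.
func useProxy() bool {
	fi, err := os.Stat(proxySocket)
	return err == nil && fi.Mode()&os.ModeSocket != 0
}

func main() {
	if useProxy() {
		fmt.Println("temperature source: pulse-temp-proxy (unix socket RPC)")
	} else {
		fmt.Println("temperature source: direct SSH (native install)")
	}
}
```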