The export/import payload is bumped to v4.1 to include API tokens alongside the
existing config bundle, eliminating a blind spot in disaster recovery scenarios.
## Key Features
**API Tokens in Exports (v4.1)**
- Exports now include API token metadata (ID, name, hash, prefix, suffix, timestamps)
- Export format version bumped from 4.0 to 4.1
- Fixes gap where API tokens were lost during config migrations
**Transactional Atomic Imports**
- New importTransaction helper stages all writes before committing
- On failure, automatic rollback restores original configs
- Prevents partial/corrupted imports that could break running systems
- All config writes (nodes, alerts, email, webhooks, apprise, system, OIDC, API tokens, guest metadata) now transaction-aware
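
The staging/rollback flow might look roughly like the sketch below. This is a minimal illustration, not the actual code in internal/config/import_transaction.go; everything beyond the importTransaction name (field names, file handling, permissions) is an assumption.

```go
// Minimal sketch of a staging/rollback import transaction.
package config

import "os"

type stagedWrite struct {
	path     string // destination config file
	data     []byte // new contents to write
	original []byte // previous contents, kept for rollback
	existed  bool   // whether the file existed before the import
}

type importTransaction struct {
	writes []stagedWrite
}

// Stage records a pending write together with the file's current contents.
func (tx *importTransaction) Stage(path string, data []byte) error {
	orig, err := os.ReadFile(path)
	existed := err == nil
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	tx.writes = append(tx.writes, stagedWrite{path, data, orig, existed})
	return nil
}

// Commit applies all staged writes; on the first failure it rolls every file
// back to its original contents so a partial import never lands on disk.
func (tx *importTransaction) Commit() error {
	for i, w := range tx.writes {
		if err := os.WriteFile(w.path, w.data, 0o600); err != nil {
			tx.rollback(i)
			return err
		}
	}
	return nil
}

func (tx *importTransaction) rollback(upTo int) {
	for i := 0; i < upTo; i++ {
		w := tx.writes[i]
		if w.existed {
			os.WriteFile(w.path, w.original, 0o600)
		} else {
			os.Remove(w.path)
		}
	}
}
```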
**Backward Compatibility**
- Version 4.0 exports (without API tokens) still import successfully
- The system logs a notice but proceeds, leaving existing API tokens untouched
- No breaking changes to existing export/import workflows
## Implementation
**Files Added:**
- internal/config/import_transaction.go - Transaction helper with staging/rollback
**Files Modified:**
- internal/config/export.go - v4.1 export, transactional ImportConfig wrapper
- internal/config/persistence.go - Transaction-aware Save* methods, beginTransaction/endTransaction helpers
- internal/config/persistence_test.go - 4 comprehensive unit tests
**Testing:**
- TestExportConfigIncludesAPITokens - Verifies API tokens in v4.1 exports
- TestImportConfigTransactionalSuccess - Validates atomic import success path
- TestImportConfigRollbackOnFailure - Confirms rollback on mid-import failure
- TestImportAcceptsVersion40Bundle - Ensures backward compatibility with v4.0
All tests passing ✅
## Migration Notes
- No manual migration required
- Users can re-export to generate v4.1 bundles with API tokens
- Existing 4.0 bundles remain valid for import
- Recommended: Re-run export after upgrade to ensure API tokens are captured
Co-authored-by: Codex (implementation)
Co-authored-by: Claude (coordination and testing)
When installing Pulse in an LXC container with temperature proxy
support, the installation now automatically:
- Configures PULSE_SENSOR_PROXY_SOCKET in /etc/pulse/.env
- Restarts Pulse service to pick up the configuration
This ensures temperature monitoring works immediately without
requiring manual configuration after installation.
Fixes a build failure caused by unescaped apostrophes in single-quoted
strings. The Vite/Babel parser was failing on "You'll" and "you'll"
in ActivationModal.tsx, preventing successful frontend builds.
Replaced inconsistent per-product detection logic with a unified probe
architecture using confidence scoring and product-specific matchers.
Key improvements:
- PBS detection now inspects TLS certs, auth headers (401/403), and
probes PBS-specific endpoints (/api2/json/status, /config/datastore),
fixing false negatives for self-signed and auth-protected servers
- PMG detection uses header analysis first, then conditional endpoint
probing, working consistently regardless of port
- Single unified probeProxmoxService() replaces separate checkPort8006()
and checkServer() code paths, eliminating duplication
- Confidence scoring (0.0-1.0+) with evidence tracking for debugging
- Consolidated hostname resolution and version handling
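
The scoring approach could be pictured like the sketch below. ProxmoxProbeResult and the matcher name come from this change, but the field set, the exact evidence signals, and the weights shown here are assumptions.

```go
// Illustrative only: field names, signals, and weights are not the real code.
package discovery

type ProxmoxProbeResult struct {
	Product    string   // "pve", "pbs", or "pmg"
	Confidence float64  // 0.0-1.0+, accumulated from matched evidence
	Evidence   []string // human-readable reasons, kept for debugging
	Hostname   string
	Version    string
}

// applyPBSHeuristics is one product matcher: it inspects signals gathered by
// probeProxmoxService and raises the confidence score with a note explaining why.
func applyPBSHeuristics(res *ProxmoxProbeResult, statusCode int, sawDatastoreEndpoint bool) {
	if statusCode == 401 || statusCode == 403 {
		res.Confidence += 0.3
		res.Evidence = append(res.Evidence, "API requires authentication (401/403)")
	}
	if sawDatastoreEndpoint {
		res.Confidence += 0.5
		res.Evidence = append(res.Evidence, "PBS /config/datastore endpoint responded")
		res.Product = "pbs"
	}
}
```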
Technical changes:
- Added ProxmoxProbeResult with structured evidence and scoring
- Added product matchers: applyPVEHeuristics, applyPMGHeuristics,
applyPBSHeuristics
- Removed legacy methods: checkPort8006, checkServer, isPMGServer,
detectProductFromEndpoint, and duplicate hostname helpers
- Updated all tests to use new unified probe architecture
- Added probe_test_helpers.go for test access to internal methods
All tests passing. Fixes PBS detection issues and improves consistency
across PVE/PMG/PBS discovery.
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
Complete the API token export/import feature with proper version
handling and backward compatibility:
- Bump export format to version 4.1 to indicate API token support
- Import API tokens when loading v4.1 exports
- Handle version compatibility gracefully:
- v4.1: Full support including API tokens
- v4.0: Logs a notice that tokens weren't included (backward compatible)
- Other versions: Logs a warning and attempts a best-effort import
- Initialize empty array instead of nil for cleaner JSON
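
A rough sketch of the version branches, assuming a hypothetical importBundle helper rather than the actual code in internal/config/export.go:

```go
// Sketch only: the helper name and signature are assumptions.
package config

import "log"

func importBundle(version string, restoreTokens func() error) error {
	switch version {
	case "4.1":
		// Full support: the bundle carries API tokens, so restore them as well.
		return restoreTokens()
	case "4.0":
		// Backward compatible: note the missing tokens, leave existing ones untouched.
		log.Printf("export version %s predates API token support; skipping token restore", version)
		return nil
	default:
		// Unknown version: warn, then continue with a best-effort import.
		log.Printf("unrecognized export version %s; importing best-effort", version)
		return nil
	}
}
```

The empty-array initialization matters because encoding/json marshals a nil slice as null but an empty slice as [], which keeps the exported JSON cleaner for consumers.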
This ensures API tokens are properly preserved when migrating or
restoring Pulse instances while maintaining backward compatibility
with older exports.
Enhance request ID middleware to support distributed tracing:
- Honor incoming X-Request-ID headers from upstream proxies/load balancers
- Use logging.WithRequestID() for consistent ID generation across codebase
- Return X-Request-ID in response headers for client correlation
- Include request_id in panic recovery logs for debugging
This enables better request tracing across multiple Pulse instances
and integrates with standard distributed tracing practices.
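
A minimal sketch of the middleware behavior, using a generic UUID generator in place of logging.WithRequestID() and a hypothetical handler name:

```go
package api

import (
	"log"
	"net/http"

	"github.com/google/uuid"
)

func requestIDMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Honor an ID supplied by an upstream proxy or load balancer;
		// otherwise mint a new one (the real code uses logging.WithRequestID).
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			id = uuid.NewString()
		}
		// Echo the ID back so clients can correlate responses with server logs.
		w.Header().Set("X-Request-ID", id)

		// Include the request ID when recovering from handler panics.
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("panic recovered: %v (request_id=%s)", rec, id)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```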
Add API tokens to the export data so they are included when
exporting/backing up configuration. This ensures API tokens are
preserved when migrating or restoring Pulse instances.
Changes:
- Add APITokens field to ExportData struct
- Load API tokens during export process
- Include tokens in exported JSON (omitempty if none exist)
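
The export structure might look roughly like this; only the APITokens field and the token metadata listed earlier come from this change, while the remaining fields and the exact JSON tags are assumptions:

```go
// Illustrative shape only, not the real internal/config definitions.
package config

import "time"

type APIToken struct {
	ID        string    `json:"id"`
	Name      string    `json:"name"`
	Hash      string    `json:"hash"`
	Prefix    string    `json:"prefix"`
	Suffix    string    `json:"suffix"`
	CreatedAt time.Time `json:"createdAt"`
}

type ExportData struct {
	Version string `json:"version"`
	// ... existing config sections (nodes, alerts, webhooks, etc.) ...
	APITokens []APIToken `json:"apiTokens,omitempty"` // omitted when no tokens exist
}
```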
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:
TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count
TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example
CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override
pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands
TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments
Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
Update config.example.yaml with:
- Recommendations for very large deployments (30+ nodes)
- Formula for calculating optimal rate limits based on node count
- Example calculation: 30 nodes with 10s polling = 300ms interval
- Security note about minimum safe intervals
This helps admins properly configure the proxy for enterprise
deployments with dozens of nodes.
Add support for configuring rate limits via config.yaml to allow
administrators to tune the proxy for different deployment sizes.
Changes:
- Add RateLimitConfig struct to config.go with per_peer_interval_ms and per_peer_burst
- Update newRateLimiter() to accept optional RateLimitConfig parameter
- Load rate limit config from YAML and apply overrides to defaults
- Update tests to pass nil for default behavior
- Add comprehensive config.example.yaml with documentation
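
A sketch of the configurable limiter described in the list above; the struct fields follow the YAML keys from this change, while the wiring shown here (golang.org/x/time/rate, a single limiter rather than one per peer) is a simplification.

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
)

type RateLimitConfig struct {
	PerPeerIntervalMs int `yaml:"per_peer_interval_ms"`
	PerPeerBurst      int `yaml:"per_peer_burst"`
}

// newRateLimiter applies YAML overrides on top of the conservative defaults
// (1 request per second, burst 5); passing nil keeps the defaults.
func newRateLimiter(cfg *RateLimitConfig) *rate.Limiter {
	interval := time.Second
	burst := 5
	if cfg != nil {
		if cfg.PerPeerIntervalMs > 0 {
			interval = time.Duration(cfg.PerPeerIntervalMs) * time.Millisecond
		}
		if cfg.PerPeerBurst > 0 {
			burst = cfg.PerPeerBurst
		}
	}
	return rate.NewLimiter(rate.Every(interval), burst)
}
```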
Configuration examples:
- Small (1-3 nodes): 1000ms interval, burst 5 (default)
- Medium (4-10 nodes): 500ms interval, burst 10
- Large (10+ nodes): 250ms interval, burst 20
Defaults remain conservative (1 req/sec, burst 5) to support most
deployments while allowing customization for larger environments.
Related: 46b8b8d08 (rate limit fix for multi-node support)
- Increase rate limit from 1 req/5sec to 1 req/sec (60/min)
- Increase burst from 2 to 5 requests
- Fixes temperature collection failures when monitoring 3+ nodes
- All requests from containerized Pulse use the same UID, so they share one per-peer rate limit
- New limits support 5-10 node deployments comfortably
Resolves issue where adding standalone nodes broke temperature monitoring
for all nodes due to aggressive rate limiting.
Replace harsh/technical language with clearer, more positive messaging:
BEFORE → AFTER:
- "No alert violations detected during observation yet" → "All systems healthy — no alerts triggered"
- "Monitoring is live; notifications will start after..." → "Monitoring is active. Review your settings..."
- "24h observation ending" → "24-hour setup period ending soon"
- "Review alerts before activating" → "Ready to activate notifications"
- "breached thresholds" → "triggered"
- "violations" → "alerts"
Key improvements:
- Removed jargon: "observation window", "during observation"
- Removed ominous language: "yet", harsh "violations"
- More conversational: "You'll receive" vs "will dispatch to configured destinations"
- Positive framing: "All systems healthy" vs absence-focused language
- Clearer actions: "turning on alerts" vs "enabling notifications"
- Enthusiastic success messages: "Notifications activated!" with exclamation
Affected components:
- ActivationBanner.tsx: 4 text improvements
- ActivationModal.tsx: 5 text improvements
Impact: Better first-run UX, less intimidating language, clearer call-to-action
Implement 5 medium/low priority improvements identified in systematic review:
UX IMPROVEMENTS:
- Notify existing critical alerts when activating from pending_review state
Previously: critical alerts raised during the observation window never produced notifications
Now: users receive notifications for active critical alerts after activation
Implementation: Added NotifyExistingAlert() method and logic in ActivateAlerts()
PERFORMANCE OPTIMIZATIONS:
- Replace per-alert cleanup goroutines with periodic batch cleanup
Prevents spawning 1000s of goroutines during alert flapping
recentlyResolved entries are now swept once per minute instead of spawning one goroutine per alert (see the sketch below)
- Simplify GetActiveAlerts() implementation
Removed the intermediate map copy; the lock is held slightly longer, but the operation is fast
Cleaner code with reduced memory allocation
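
A sketch of the periodic cleanup pattern; the recentlyResolved name comes from this change, while the manager shape, stop channel, and five-minute retention are assumptions.

```go
package alerts

import (
	"sync"
	"time"
)

// Minimal stand-in; the real alert manager in internal/alerts has far more state.
type Manager struct {
	mu               sync.Mutex
	recentlyResolved map[string]time.Time
}

// startResolvedCleanup replaces the old one-goroutine-per-alert pattern with a
// single periodic sweep, so alert flapping can no longer spawn thousands of goroutines.
func (m *Manager) startResolvedCleanup(stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				m.mu.Lock()
				for id, resolvedAt := range m.recentlyResolved {
					if time.Since(resolvedAt) > 5*time.Minute {
						delete(m.recentlyResolved, id) // deleting during range is safe in Go
					}
				}
				m.mu.Unlock()
			case <-stop:
				return
			}
		}
	}()
}
```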
CONFIGURATION VALIDATION:
- Validate timezone in quiet hours configuration
Invalid timezones now disable quiet hours with an error log instead of a silent fallback
Prevents unexpected behavior when the timezone is mistyped or invalid
GRACEFUL SHUTDOWN:
- Add 100ms delay in Stop() for background goroutine cleanup
Reduces risk of state corruption during shutdown
Allows escalation checker and periodic save to exit cleanly
Technical details:
- internal/alerts/alerts.go: Added NotifyExistingAlert(), optimized cleanup patterns
- internal/api/alerts.go: Enhanced ActivateAlerts() to notify existing critical alerts
- Removed ~20 lines of goroutine spawning code
- Added periodic cleanup for recentlyResolved map
- All changes preserve backward compatibility
Testing: Verified compilation with 'go build -o /dev/null ./...'
Fix 5 critical bugs identified through systematic code review:
CRITICAL FIXES (prevent service crashes):
- Add panic recovery to all alert callbacks (onAlert, onResolved, onEscalate)
- Clone alerts before passing to escalation callback to prevent data races
- Make clearAlertNoLock callback async to prevent deadlock
HIGH PRIORITY FIXES (prevent memory leaks):
- Add cleanup for stale pendingAlerts entries (deleted resources)
- Add cleanup for dockerRestartTracking (ephemeral containers in CI/CD)
MEDIUM PRIORITY FIXES (prevent stuck alerts):
- Validate hysteresis thresholds (ensure clear < trigger)
- Auto-fix invalid configurations with warning logs
Impact:
- Service stability: Malformed webhook URLs or email configs can no longer crash Pulse
- Memory management: Prevents unbounded growth in dynamic environments
- Alert reliability: Prevents alerts that never clear due to invalid thresholds
- Concurrency safety: Eliminates data races in escalation path
Technical details:
- Created safeCallResolvedCallback() and safeCallEscalateCallback() wrappers
- Added ensureValidHysteresis() validation helper
- Extended Cleanup() with pendingAlerts and dockerRestartTracking pruning
- All callbacks now have defer/recover panic handlers with detailed logging
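
Sketches of the wrapper and validation helpers named above; the function names come from this change, but their signatures and the clamping margin are assumptions.

```go
package alerts

import "log"

// safeCallResolvedCallback shields the alert manager from panics raised by a
// misbehaving notification callback (e.g. a handler for a malformed webhook URL).
func safeCallResolvedCallback(cb func(alertID string), alertID string) {
	if cb == nil {
		return
	}
	defer func() {
		if r := recover(); r != nil {
			// Log and keep the service running instead of crashing Pulse.
			log.Printf("onResolved callback panicked for alert %s: %v", alertID, r)
		}
	}()
	cb(alertID)
}

// ensureValidHysteresis auto-fixes configurations where the clear threshold is
// not below the trigger threshold, which would otherwise leave alerts stuck.
func ensureValidHysteresis(trigger, clear float64) (float64, float64) {
	if clear >= trigger {
		log.Printf("invalid hysteresis (clear %.1f >= trigger %.1f); clamping clear below trigger", clear, trigger)
		clear = trigger - 1 // assumed fix-up margin
	}
	return trigger, clear
}
```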
Testing: Verified compilation with 'go build -o /dev/null ./...'
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:
- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow
Fixed a parsing error in the temperature proxy diagram (parentheses in an
edge label were causing a render failure).
Diagrams should clarify architecture, not recreate implementation.
The v2 installer rollout is complete - dist/install-docker-agent.sh
now contains the bundled v2 installer with embedded library modules.
This planning document served its purpose and is no longer relevant.
Replace internal development phase reference with clear description
of what the adaptive polling scheduler does. 'Phase 2' is internal
jargon that provides no value to users.
Source builds use commit hashes (main-c147fa1) not semantic versions
(v4.23.0), so update checks would always fail or show misleading
"Update Available" banners.
Changes:
- Add IsSourceBuild flag to VersionInfo struct
- Detect source builds via BUILD_FROM_SOURCE marker file
- Skip update check for source builds (like Docker)
- Update frontend to show "Built from source" message
- Disable manual update check button for source builds
- Return "source" deployment type for source builds
Backend:
- internal/updates/version.go: Add isSourceBuildEnvironment() detection
- internal/updates/manager.go: Skip check with appropriate message
- internal/api/types.go: Add isSourceBuild to API response
- internal/api/router.go: Include isSourceBuild in version endpoint
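
A sketch of the backend detection and skip logic; isSourceBuildEnvironment and the IsSourceBuild flag come from this change, while the marker file path and the surrounding check are assumptions.

```go
package updates

import "os"

type VersionInfo struct {
	Version         string `json:"version"`
	IsSourceBuild   bool   `json:"isSourceBuild"`
	UpdateAvailable bool   `json:"updateAvailable"`
}

// isSourceBuildEnvironment reports whether this Pulse instance was built from
// source, based on the BUILD_FROM_SOURCE marker file written by install.sh.
func isSourceBuildEnvironment() bool {
	_, err := os.Stat("/etc/pulse/.build_from_source") // assumed marker location
	return err == nil
}

func checkForUpdates() VersionInfo {
	info := VersionInfo{IsSourceBuild: isSourceBuildEnvironment()}
	if info.IsSourceBuild {
		// Source builds carry commit-hash versions (e.g. main-c147fa1), so a
		// semver comparison is meaningless; skip the check, as Docker builds do.
		return info
	}
	// ... normal release check against published versions ...
	return info
}
```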
Frontend:
- src/api/updates.ts: Add isSourceBuild to VersionInfo type
- src/stores/updates.ts: Don't poll for updates on source builds
- src/components/Settings/Settings.tsx: Show "Built from source" message
Fixes the confusing "Update Available" banner for users who explicitly
chose --source to get latest main branch code.
Co-authored-by: Codex AI
Source builds use commit hashes (0.0.0-main-44ef8b6) not semantic
versions (v4.23.0), so auto-updates don't make sense. The auto-updater
would download release binaries, replacing the user's source build.
Changes:
- Skip auto-update question when BUILD_FROM_SOURCE=true
- Show informational message instead
- Applies to both Quick and Advanced modes
This prevents confusion when users explicitly choose --source to get
the latest main branch code instead of stable releases.
When installing temperature monitoring for a new container, stop any
existing pulse-sensor-proxy service before trying to overwrite the
binary. This prevents 'Text file busy' errors when the binary is
currently running.
Fixes the error that occurred when installing container 103 while
container 107's proxy was still running.
The prompt now says 'Enable temperature monitoring from first boot'
instead of 'Restart the container to activate' since we moved the
question to before container creation.
Also clarified 'Configure container with temperature monitoring bind mount'
to better reflect what actually happens.
Major improvements to the install script based on comprehensive review:
## 1. Temperature Monitoring - No Restart Required ✨
- Ask about temperature monitoring BEFORE container creation (not after)
- Add bind mount during `pct create` instead of requiring restart later
- Quick mode defaults to "yes", Advanced mode asks user
- Host path: /run/pulse-sensor-proxy → /mnt/pulse-proxy in container
- Support --skip-restart flag in install-sensor-proxy.sh
- Eliminates disruptive container restart on fresh installs
## 2. Shell Injection Prevention 🔒
- Replace `eval pct create` with array-based command building
- Prevents quoting bugs with special characters in hostnames/nameservers
- Safer handling of user input in container creation
## 3. Non-Interactive Install Support 🤖
- Replace bare `read` with `safe_read_with_default` in prompts
- Prevents hangs when running `curl | bash` non-interactively
- Proper fallback to sensible defaults
## 4. Cleanup on Interrupt 🧹
- Track container ID globally during creation
- Properly cleanup orphaned containers on Ctrl+C/SIGTERM
- New handle_install_interrupt() function
- Prevents leftover containers after cancelled installs
## 5. Air-Gapped Network Support 🌐
- Replace 8.8.8.8 ping check with `hostname -I` IP detection
- Supports restricted/firewalled networks where external ping fails
- More reliable for DHCP-only environments
Changes:
- install.sh: Refactor temperature prompt timing and mount setup
- install.sh: Convert pct create to array-based args (lines 1018-1055)
- install.sh: Add handle_install_interrupt trap (lines 38-48)
- install.sh: Replace ping check with IP detection (line 1082)
- scripts/install-sensor-proxy.sh: Add --skip-restart flag support
- scripts/install-sensor-proxy.sh: Improve mount detection and updates
Impact:
- Fresh installs now complete without any container restarts
- Temperature monitoring works immediately after first boot
- Safer and more robust for automation/CI scenarios
- Better experience on restricted networks
Co-authored-by: Codex AI
- Remove confusing --main flag, use --source for clarity
- Fix timeout issues when building from source in LXC containers
- Increase timeout from 5min to 20min for source builds
- Add PULSE_CONTAINER_TIMEOUT env var for custom timeouts
- Support PULSE_CONTAINER_TIMEOUT=0 to disable timeout
- Fix misleading "Latest version: vX.X.X" message during source builds
- Update documentation to use --source instead of --main
- Simplify auto-update script logic for source builds
Changes:
- install.sh: Check BUILD_FROM_SOURCE early to skip version detection
- install.sh: Adaptive timeout (300s binary, 1200s source builds)
- install.sh: Better timeout error messages with recovery instructions
- README.md: Replace --main with --source in examples
- docs/INSTALL.md: Replace --main with --source in examples
- scripts/pulse-auto-update.sh: Remove --main special case
Significantly enhanced the network discovery feature to eliminate false positives,
provide real-time progress updates, and improve error reporting.
Key improvements:
- Require positive Proxmox identification (version data, auth headers, or certificates)
instead of reporting any service on ports 8006/8007
- Add real-time progress tracking with phase/target counts and completion percentage
- Implement structured error reporting with IP, phase, type, and timestamp details
- Fix TLS timeout handling to prevent hangs on unresponsive hosts
- Expose progress and structured errors via WebSocket for UI consumption
- Reduce log verbosity by moving discovery logs to debug level
- Fix duplicate IP counting to ensure progress reaches 100%
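
The progress and error reporting might be shaped roughly like this sketch; the field names simply mirror the details listed above (phase, counts, percentage, IP, type, timestamp) and are otherwise assumptions.

```go
package discovery

import "time"

type DiscoveryProgress struct {
	Phase        string  `json:"phase"` // e.g. "scan", "probe"
	TargetsTotal int     `json:"targetsTotal"`
	TargetsDone  int     `json:"targetsDone"`
	PercentDone  float64 `json:"percentDone"`
}

type DiscoveryError struct {
	IP        string    `json:"ip"`
	Phase     string    `json:"phase"`
	Type      string    `json:"type"` // e.g. "tls_timeout", "unreachable"
	Timestamp time.Time `json:"timestamp"`
	Message   string    `json:"message"`
}

// update recomputes completion, guarding against duplicate IPs being counted
// twice so the percentage can actually reach 100%.
func (p *DiscoveryProgress) update(seen map[string]bool, ip string) {
	if seen[ip] {
		return
	}
	seen[ip] = true
	p.TargetsDone++
	if p.TargetsTotal > 0 {
		p.PercentDone = 100 * float64(p.TargetsDone) / float64(p.TargetsTotal)
	}
}
```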
Breaking changes: None (backward compatible with legacy API methods)
The installer was configuring pulse-backend.service.d but the actual
service is pulse.service, so the PULSE_SENSOR_PROXY_SOCKET environment
variable wasn't being set.
Changed: pulse-backend.service → pulse.service
This ensures Pulse actually uses the proxy socket for temperature
monitoring instead of attempting SSH connections.
Added containerized and containerId fields to the /api/version endpoint
to enable automatic temperature proxy installation for LXC containers.
Changes:
- Added Containerized bool field to VersionResponse
- Added ContainerId string field to VersionResponse
- Detect containerization by checking /run/systemd/container file
- Extract container ID from hostname for LXC containers
- Set deployment type from container type (lxc/docker)
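
A sketch of the detection path; the /run/systemd/container check and the response fields come from this change, while the helper shape and the hostname-to-ID step are simplified assumptions.

```go
package api

import (
	"os"
	"strings"
)

type VersionResponse struct {
	Version        string `json:"version"`
	DeploymentType string `json:"deploymentType"`
	Containerized  bool   `json:"containerized"`
	ContainerId    string `json:"containerId"`
}

func detectContainer() (containerized bool, containerType, containerID string) {
	// systemd writes the container manager name ("lxc", "docker", ...) to this
	// file when PID 1 runs inside a container.
	data, err := os.ReadFile("/run/systemd/container")
	if err != nil {
		return false, "", ""
	}
	containerType = strings.TrimSpace(string(data))
	// For LXC guests the container ID is derived from the hostname; returning
	// the hostname verbatim here is a simplification.
	hostname, _ := os.Hostname()
	return true, containerType, hostname
}
```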
This allows the PVE setup script to:
1. Detect that Pulse is running in a container
2. Find the container ID by matching IPs
3. Automatically install pulse-sensor-proxy on the host
4. Configure bind mount for secure socket communication
Fixes the issue where the setup script showed 'Proxy not available'
even when Pulse was containerized.
Critical bug fix: The setup script's format string had 33 placeholders
but was only receiving 27 arguments, causing:
- INSTALLER_URL to receive authToken instead of pulseURL
- This made curl try to resolve the token value as a hostname
- Error: 'curl: (6) Could not resolve host: N7AE3P'
- Token ID showed '%!s(MISSING)' in manual setup instructions
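
For context, this is Go's standard fmt behavior when a format string has more verbs than arguments: a missing argument early in the list shifts every later value into the wrong slot, and leftover verbs render as %!s(MISSING). The snippet below is a minimal illustration, not the actual setup-script template.

```go
package main

import "fmt"

func main() {
	// Three placeholders but only two arguments: the token lands in the URL
	// position and the final placeholder renders as %!s(MISSING).
	fmt.Println(fmt.Sprintf("INSTALLER_URL=%s --pulse-server=%s --token-id=%s",
		"N7AE3P", // an auth token shifted into the URL position
		"http://192.168.0.160:7655"))
	// Output: INSTALLER_URL=N7AE3P --pulse-server=http://192.168.0.160:7655 --token-id=%!s(MISSING)
}
```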
Fixed by:
- Added missing tokenName at position 7
- Added literal '%s' strings for version_ge printf placeholders
- Added authToken arguments for Authorization headers (positions 29, 31)
- Ensured all 33 format placeholders have corresponding arguments
Now generates correct URLs:
- INSTALLER_URL: http://192.168.0.160:7655/api/install/install-sensor-proxy.sh
- --pulse-server: http://192.168.0.160:7655
- Token ID: pulse-monitor@pam!pulse-192-168-0-160-[timestamp]
Changed curl flags from -fsSL to -fSL to enable error output.
The -s flag was silencing all curl errors including SSL/TLS issues,
making it impossible to diagnose download failures.
With -S (show errors), stderr now captures meaningful error messages
like certificate problems, connection failures, etc.
- Back up container config before making mount modifications
- Restore original config if socket verification fails
- Clean up backup file on success or when verification is skipped
- Leave host-level resources (user, binary, service) in place for idempotency
This ensures failed installations don't leave containers in an
inconsistent state while keeping successfully installed host services
for faster re-runs.