Commit Graph

236 Commits

Author SHA1 Message Date
rcourtman
7ae393c8ec Refine Proxmox node memory fallback (#582) 2025-10-22 15:36:26 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
f8b6aa6c97 Treat 501 responses as non-fatal in cluster failover (#449) 2025-10-22 14:23:13 +00:00
rcourtman
13e2577c57 Handle FreeBSD guest agent disk counters
Refs #580
2025-10-22 14:06:45 +00:00
rcourtman
cdf80dbab5 Align backup node column with current host (#577) 2025-10-22 13:47:51 +00:00
rcourtman
77108abc65 Propagate config updates to settings nodes (#588) 2025-10-22 13:45:13 +00:00
rcourtman
f61a61aad0 Add Apprise API controls to notification panel (#584) 2025-10-22 13:44:39 +00:00
rcourtman
be26f957c0 Add snapshot size alert thresholds (#585) 2025-10-22 13:30:40 +00:00
rcourtman
30879c3b7b Handle AMD Tctl temperature readings (refs #586) 2025-10-22 12:58:34 +00:00
rcourtman
4ba5dcc785 feat: improve mobile layout responsiveness 2025-10-22 12:39:53 +00:00
rcourtman
9d1b9a78c4 refactor: unify notifications into global toast 2025-10-22 12:39:01 +00:00
rcourtman
f83caf8933 Add collision-safe Docker host identifiers (#590) 2025-10-22 12:30:25 +00:00
rcourtman
971d55f334 Use lucide loader icon for discovery spinners 2025-10-22 12:30:04 +00:00
rcourtman
bc479643e4 release: prepare v4.25.0 v4.25.0 2025-10-22 10:46:18 +00:00
rcourtman
4eb8bed9b5 Fix initial setup caching and container discovery defaults 2025-10-22 07:34:32 +00:00
rcourtman
ff4dc49ae4 Update Pulse install flow and related components 2025-10-21 19:58:53 +00:00
rcourtman
999da6d900 feat: production-ready import/export with API tokens and transactional rollback
Export/import payload bumped to v4.1 to include API tokens alongside existing
config bundle, eliminating blind spots in disaster recovery scenarios.

## Key Features

**API Tokens in Exports (v4.1)**
- Exports now include API token metadata (ID, name, hash, prefix, suffix, timestamps)
- Export format version bumped from 4.0 to 4.1
- Fixes gap where API tokens were lost during config migrations

**Transactional Atomic Imports**
- New importTransaction helper stages all writes before committing
- On failure, automatic rollback restores original configs
- Prevents partial/corrupted imports that could break running systems
- All config writes (nodes, alerts, email, webhooks, apprise, system, OIDC, API tokens, guest metadata) now transaction-aware

**Backward Compatibility**
- Version 4.0 exports (without API tokens) still import successfully
- System logs notice but proceeds, leaving existing API tokens untouched
- No breaking changes to existing export/import workflows

## Implementation

**Files Added:**
- internal/config/import_transaction.go - Transaction helper with staging/rollback

**Files Modified:**
- internal/config/export.go - v4.1 export, transactional ImportConfig wrapper
- internal/config/persistence.go - Transaction-aware Save* methods, beginTransaction/endTransaction helpers
- internal/config/persistence_test.go - 4 comprehensive unit tests

**Testing:**
- TestExportConfigIncludesAPITokens - Verifies API tokens in v4.1 exports
- TestImportConfigTransactionalSuccess - Validates atomic import success path
- TestImportConfigRollbackOnFailure - Confirms rollback on mid-import failure
- TestImportAcceptsVersion40Bundle - Ensures backward compatibility with v4.0

All tests passing 

## Migration Notes

- No manual migration required
- Users can re-export to generate v4.1 bundles with API tokens
- Existing 4.0 bundles remain valid for import
- Recommended: Re-run export after upgrade to ensure API tokens are captured

Co-authored-by: Codex (implementation)
Co-authored-by: Claude (coordination and testing)
2025-10-21 14:37:44 +00:00
rcourtman
dfc0085048 fix: configure PULSE_SENSOR_PROXY_SOCKET env var during LXC install
When installing Pulse in an LXC container with temperature proxy
support, the installation now automatically:
- Configures PULSE_SENSOR_PROXY_SOCKET in /etc/pulse/.env
- Restarts Pulse service to pick up the configuration

This ensures temperature monitoring works immediately without
requiring manual configuration after installation.
2025-10-21 14:03:48 +00:00
rcourtman
29bb2460e4 fix: escape apostrophes in alert notification strings
Fixes build failure caused by unescaped apostrophes in single-quoted
strings. The Vite/Babel parser was failing on "You'll" and "you'll"
in ActivationModal.tsx, preventing successful frontend builds.
2025-10-21 13:33:56 +00:00
rcourtman
7c00055047 feat: unify and improve Proxmox discovery/scanning architecture
Replaced inconsistent per-product detection logic with a unified probe
architecture using confidence scoring and product-specific matchers.

Key improvements:
- PBS detection now inspects TLS certs, auth headers (401/403), and
  probes PBS-specific endpoints (/api2/json/status, /config/datastore)
  fixing false negatives for self-signed and auth-protected servers
- PMG detection uses header analysis first, then conditional endpoint
  probing, working consistently regardless of port
- Single unified probeProxmoxService() replaces separate checkPort8006()
  and checkServer() code paths, eliminating duplication
- Confidence scoring (0.0-1.0+) with evidence tracking for debugging
- Consolidated hostname resolution and version handling

Technical changes:
- Added ProxmoxProbeResult with structured evidence and scoring
- Added product matchers: applyPVEHeuristics, applyPMGHeuristics,
  applyPBSHeuristics
- Removed legacy methods: checkPort8006, checkServer, isPMGServer,
  detectProductFromEndpoint, and duplicate hostname helpers
- Updated all tests to use new unified probe architecture
- Added probe_test_helpers.go for test access to internal methods

All tests passing. Fixes PBS detection issues and improves consistency
across PVE/PMG/PBS discovery.
2025-10-21 13:09:41 +00:00
rcourtman
e0396c1362 docs: update documentation for diagnostics improvements
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.

**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
  Prometheus metrics with alert suggestions

**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
  explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
  using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
  with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
  defaults

These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards

Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
2025-10-21 12:45:19 +00:00
rcourtman
acedd18c07 fix: upgrade vite to 6.4.1 to resolve CVE-2025-62522
Fixes Dependabot alert #33 - path traversal vulnerability in vite's
server.fs.deny when using backslash on Windows. Upgraded from 6.3.5 to 6.4.1.
2025-10-21 12:41:08 +00:00
rcourtman
2786afdff0 feat: comprehensive diagnostics and observability improvements
Upgrade diagnostics infrastructure from 5/10 to 8/10 production readiness
with enhanced metrics, logging, and request correlation capabilities.

**Request Correlation**
- Wire request IDs through context in middleware
- Return X-Request-ID header in all API responses
- Enable downstream log correlation across request lifecycle

**HTTP/API Metrics** (18 new Prometheus metrics)
- pulse_http_request_duration_seconds - API latency histogram
- pulse_http_requests_total - request counter by method/route/status
- pulse_http_request_errors_total - error counter by type
- Path normalization to control label cardinality

**Per-Node Poll Metrics**
- pulse_monitor_node_poll_duration_seconds - per-node timing
- pulse_monitor_node_poll_total - success/error counts per node
- pulse_monitor_node_poll_errors_total - error breakdown per node
- pulse_monitor_node_poll_last_success_timestamp - freshness tracking
- pulse_monitor_node_poll_staleness_seconds - age since last success
- Enables multi-node hotspot identification

**Scheduler Health Metrics**
- pulse_scheduler_queue_due_soon - ready queue depth
- pulse_scheduler_queue_depth - by instance type
- pulse_scheduler_queue_wait_seconds - time in queue histogram
- pulse_scheduler_dead_letter_depth - failed task tracking
- pulse_scheduler_breaker_state - circuit breaker state
- pulse_scheduler_breaker_failure_count - consecutive failures
- pulse_scheduler_breaker_retry_seconds - time until retry
- Enable alerting on DLQ spikes, breaker opens, queue backlogs

**Diagnostics Endpoint Caching**
- pulse_diagnostics_cache_hits_total - cache performance
- pulse_diagnostics_cache_misses_total - cache misses
- pulse_diagnostics_refresh_duration_seconds - probe timing
- 45-second TTL prevents thundering herd on /api/diagnostics
- Thread-safe with RWMutex
- X-Diagnostics-Cached-At header shows cache freshness

**Debug Log Performance**
- Gate high-frequency debug logs behind IsLevelEnabled() checks
- Reduces CPU waste in production when debug disabled
- Covers scheduler loops, poll cycles, API handlers

**Persistent Logging**
- File logging with automatic rotation
- LOG_FILE, LOG_MAX_SIZE, LOG_MAX_AGE, LOG_COMPRESS env vars
- MultiWriter sends logs to both stderr and file
- Gzip compression support for rotated logs

Files modified:
- internal/api/diagnostics.go (caching layer)
- internal/api/middleware.go (request IDs, HTTP metrics)
- internal/api/http_metrics.go (NEW - HTTP metric definitions)
- internal/logging/logging.go (file logging with rotation)
- internal/monitoring/metrics.go (node + scheduler metrics)
- internal/monitoring/monitor.go (instrumentation, debug gating)

Impact: Dramatically improved production troubleshooting with per-node
visibility, scheduler health metrics, persistent logs, and cached
diagnostics. Fast incident response now possible for multi-node deployments.
2025-10-21 12:37:39 +00:00
rcourtman
bd13b966d0 feat: complete API token export/import with version handling
Complete the API token export/import feature with proper version
handling and backward compatibility:

- Bump export format to version 4.1 to indicate API token support
- Import API tokens when loading v4.1 exports
- Handle version compatibility gracefully:
  - v4.1: Full support including API tokens
  - v4.0: Notice that tokens weren't included (backward compatible)
  - Other: Warning but best-effort import
- Initialize empty array instead of nil for cleaner JSON

This ensures API tokens are properly preserved when migrating or
restoring Pulse instances while maintaining backward compatibility
with older exports.
2025-10-21 11:38:23 +00:00
rcourtman
59cd456428 feat: improve request ID handling in middleware
Enhance request ID middleware to support distributed tracing:

- Honor incoming X-Request-ID headers from upstream proxies/load balancers
- Use logging.WithRequestID() for consistent ID generation across codebase
- Return X-Request-ID in response headers for client correlation
- Include request_id in panic recovery logs for debugging

This enables better request tracing across multiple Pulse instances
and integrates with standard distributed tracing practices.
2025-10-21 11:37:57 +00:00
rcourtman
cdbc6057b0 feat: export API tokens in config export
Add API tokens to the export data so they are included when
exporting/backing up configuration. This ensures API tokens are
preserved when migrating or restoring Pulse instances.

Changes:
- Add APITokens field to ExportData struct
- Load API tokens during export process
- Include tokens in exported JSON (omitempty if none exist)
2025-10-21 11:37:25 +00:00
rcourtman
ddc9a7a068 docs: comprehensive documentation for rate limit fix and configurability
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:

TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count

TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example

CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override

pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands

TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments

Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
2025-10-21 11:36:07 +00:00
rcourtman
35adcf104f docs: add guidance for large deployments (30+ nodes) in rate limit config
Update config.example.yaml with:
- Recommendations for very large deployments (30+ nodes)
- Formula for calculating optimal rate limits based on node count
- Example calculation: 30 nodes with 10s polling = 300ms interval
- Security note about minimum safe intervals

This helps admins properly configure the proxy for enterprise
deployments with dozens of nodes.
2025-10-21 11:27:13 +00:00
rcourtman
44d5f91e92 feat: make pulse-sensor-proxy rate limits configurable
Add support for configuring rate limits via config.yaml to allow
administrators to tune the proxy for different deployment sizes.

Changes:
- Add RateLimitConfig struct to config.go with per_peer_interval_ms and per_peer_burst
- Update newRateLimiter() to accept optional RateLimitConfig parameter
- Load rate limit config from YAML and apply overrides to defaults
- Update tests to pass nil for default behavior
- Add comprehensive config.example.yaml with documentation

Configuration examples:
- Small (1-3 nodes): 1000ms interval, burst 5 (default)
- Medium (4-10 nodes): 500ms interval, burst 10
- Large (10+ nodes): 250ms interval, burst 20

Defaults remain conservative (1 req/sec, burst 5) to support most
deployments while allowing customization for larger environments.

Related: #46b8b8d08 (rate limit fix for multi-node support)
2025-10-21 11:25:21 +00:00
rcourtman
d856e75018 fix: increase pulse-sensor-proxy rate limits for multi-node support
- Increase rate limit from 1 req/5sec to 1 req/sec (60/min)
- Increase burst from 2 to 5 requests
- Fixes temperature collection failures when monitoring 3+ nodes
- All requests from containerized Pulse use same UID, causing rate limiting
- New limits support 5-10 node deployments comfortably

Resolves issue where adding standalone nodes broke temperature monitoring
for all nodes due to aggressive rate limiting.
2025-10-21 11:21:12 +00:00
rcourtman
c98fe537d3 fix: improve alert activation messaging for clarity and friendliness
Replace harsh/technical language with clearer, more positive messaging:

BEFORE → AFTER:
- "No alert violations detected during observation yet" → "All systems healthy — no alerts triggered"
- "Monitoring is live; notifications will start after..." → "Monitoring is active. Review your settings..."
- "24h observation ending" → "24-hour setup period ending soon"
- "Review alerts before activating" → "Ready to activate notifications"
- "breached thresholds" → "triggered"
- "violations" → "alerts"

Key improvements:
- Removed jargon: "observation window", "during observation"
- Removed ominous language: "yet", harsh "violations"
- More conversational: "You'll receive" vs "will dispatch to configured destinations"
- Positive framing: "All systems healthy" vs absence-focused language
- Clearer actions: "turning on alerts" vs "enabling notifications"
- Enthusiastic success messages: "Notifications activated!" with exclamation

Affected components:
- ActivationBanner.tsx: 4 text improvements
- ActivationModal.tsx: 5 text improvements

Impact: Better first-run UX, less intimidating language, clearer call-to-action
2025-10-21 11:09:59 +00:00
rcourtman
ad371bf412 feat: improve alert system performance, UX, and edge case handling
Implement 5 medium/low priority improvements identified in systematic review:

UX IMPROVEMENTS:
- Notify existing critical alerts when activating from pending_review state
  Previously: critical alerts during observation window would never notify
  Now: users receive notifications for active critical alerts after activation
  Implementation: Added NotifyExistingAlert() method and logic in ActivateAlerts()

PERFORMANCE OPTIMIZATIONS:
- Replace per-alert cleanup goroutines with periodic batch cleanup
  Prevents spawning 1000s of goroutines during alert flapping
  recentlyResolved entries now cleaned up once per minute instead of 1 goroutine per alert
- Simplify GetActiveAlerts() implementation
  Removed intermediate map copy, holds lock slightly longer but operation is fast
  Cleaner code with reduced memory allocation

CONFIGURATION VALIDATION:
- Validate timezone in quiet hours configuration
  Invalid timezones now disable quiet hours with error log instead of silent fallback
  Prevents unexpected behavior when timezone is typo'd or invalid

GRACEFUL SHUTDOWN:
- Add 100ms delay in Stop() for background goroutine cleanup
  Reduces risk of state corruption during shutdown
  Allows escalation checker and periodic save to exit cleanly

Technical details:
- internal/alerts/alerts.go: Added NotifyExistingAlert(), optimized cleanup patterns
- internal/api/alerts.go: Enhanced ActivateAlerts() to notify existing critical alerts
- Removed ~20 lines of goroutine spawning code
- Added periodic cleanup for recentlyResolved map
- All changes preserve backward compatibility

Testing: Verified compilation with 'go build -o /dev/null ./...'
2025-10-21 11:05:45 +00:00
rcourtman
06b5d5153b fix: resolve critical alert system bugs preventing crashes and memory leaks
Fix 5 critical bugs identified through systematic code review:

CRITICAL FIXES (prevent service crashes):
- Add panic recovery to all alert callbacks (onAlert, onResolved, onEscalate)
- Clone alerts before passing to escalation callback to prevent data races
- Make clearAlertNoLock callback async to prevent deadlock

HIGH PRIORITY FIXES (prevent memory leaks):
- Add cleanup for stale pendingAlerts entries (deleted resources)
- Add cleanup for dockerRestartTracking (ephemeral containers in CI/CD)

MEDIUM PRIORITY FIXES (prevent stuck alerts):
- Validate hysteresis thresholds (ensure clear < trigger)
- Auto-fix invalid configurations with warning logs

Impact:
- Service stability: Malformed webhook URLs or email configs can no longer crash Pulse
- Memory management: Prevents unbounded growth in dynamic environments
- Alert reliability: Prevents alerts that never clear due to invalid thresholds
- Concurrency safety: Eliminates data races in escalation path

Technical details:
- Created safeCallResolvedCallback() and safeCallEscalateCallback() wrappers
- Added ensureValidHysteresis() validation helper
- Extended Cleanup() with pendingAlerts and dockerRestartTracking pruning
- All callbacks now have defer/recover panic handlers with detailed logging

Testing: Verified compilation with 'go build -o /dev/null ./...'
2025-10-21 10:55:57 +00:00
rcourtman
2f43d67af9 docs: simplify Mermaid diagrams for better readability
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:

- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow

Fixed parsing error in temperature proxy diagram (parentheses in
edge label causing render failure).

Diagrams should clarify architecture, not recreate implementation.
2025-10-21 10:50:40 +00:00
rcourtman
7bfd6997ec docs: remove outdated installer v2 rollout planning doc
The v2 installer rollout is complete - dist/install-docker-agent.sh
now contains the bundled v2 installer with embedded library modules.
This planning document served its purpose and is no longer relevant.
2025-10-21 10:48:35 +00:00
rcourtman
10d52244f8 docs: remove internal 'Phase 2' reference from adaptive polling docs
Replace internal development phase reference with clear description
of what the adaptive polling scheduler does. 'Phase 2' is internal
jargon that provides no value to users.
2025-10-21 10:45:46 +00:00
rcourtman
85ffe10aed docs: add Mermaid diagrams to improve visual documentation
Enhance documentation with six Mermaid diagrams to better explain
complex system implementations:

- Adaptive polling lifecycle flowchart showing enqueue→execute→feedback
  cycle with scheduler, priority queue, and worker interactions
- Circuit breaker state machine diagram illustrating Closed↔Open↔Half-open
  transitions with triggers and recovery paths
- Temperature proxy architecture diagram highlighting trust boundaries,
  security controls, and data flow between host/container/cluster
- Sensor proxy request flow sequence diagram showing auth, rate limiting,
  validation, and SSH execution pipeline
- Alert webhook pipeline flowchart detailing template resolution, URL
  rendering, HTTP dispatch, and retry logic
- Script library workflow diagram illustrating dev→test→bundle→distribute
  lifecycle emphasizing modular design

These visualizations make it easier for operators and contributors to
understand Pulse's sophisticated architectural patterns.
2025-10-21 10:40:33 +00:00
rcourtman
f9cb96ceb8 feat: add --uninstall support to Docker agent and sensor proxy scripts
Users can now cleanly uninstall components with optional data removal.

Docker Agent (install-docker-agent.sh):
- --uninstall: Remove service, binary, systemd unit, Unraid startup hook
- --purge: Also remove log files (optional, must be used with --uninstall)
- Stops/disables service even if unit file is missing (resilient cleanup)
- Validates --purge requires --uninstall

Sensor Proxy (install-sensor-proxy.sh):
- --uninstall: Remove service, binary, cleanup scripts, socket directory
- Calls existing cleanup helper to remove SSH keys from cluster nodes
- Manual fallback if cleanup helper is missing
- --purge: Also remove state/logs and service account
- Validates --purge requires --uninstall

Usage:
  # Uninstall Docker agent (keep logs)
  curl ... | bash -s -- --uninstall

  # Uninstall Docker agent (remove everything)
  curl ... | bash -s -- --uninstall --purge

  # Uninstall sensor proxy (keep state/logs)
  curl ... | bash -s -- --uninstall

  # Uninstall sensor proxy (remove everything)
  curl ... | bash -s -- --uninstall --purge

Changes:
- scripts/install-docker-agent.sh: Add --purge flag, improve uninstall flow
- scripts/install-sensor-proxy.sh: Add perform_uninstall() function
- Both: Non-interactive, idempotent, resilient cleanup

Next: Update UI to show uninstall commands when removing hosts/nodes

Co-authored-by: Codex AI
2025-10-21 10:21:48 +00:00
rcourtman
66b97333f7 fix: skip update check for source builds and show appropriate UI message
Source builds use commit hashes (main-c147fa1) not semantic versions
(v4.23.0), so update checks would always fail or show misleading
"Update Available" banners.

Changes:
- Add IsSourceBuild flag to VersionInfo struct
- Detect source builds via BUILD_FROM_SOURCE marker file
- Skip update check for source builds (like Docker)
- Update frontend to show "Built from source" message
- Disable manual update check button for source builds
- Return "source" deployment type for source builds

Backend:
- internal/updates/version.go: Add isSourceBuildEnvironment() detection
- internal/updates/manager.go: Skip check with appropriate message
- internal/api/types.go: Add isSourceBuild to API response
- internal/api/router.go: Include isSourceBuild in version endpoint

Frontend:
- src/api/updates.ts: Add isSourceBuild to VersionInfo type
- src/stores/updates.ts: Don't poll for updates on source builds
- src/components/Settings/Settings.tsx: Show "Built from source" message

Fixes the confusing "Update Available" banner for users who explicitly
chose --source to get latest main branch code.

Co-authored-by: Codex AI
2025-10-21 10:08:00 +00:00
rcourtman
0e0661eb68 fix: skip auto-update prompt for source builds
Source builds use commit hashes (0.0.0-main-44ef8b6) not semantic
versions (v4.23.0), so auto-updates don't make sense. The auto-updater
would download release binaries, replacing the user's source build.

Changes:
- Skip auto-update question when BUILD_FROM_SOURCE=true
- Show informational message instead
- Applies to both Quick and Advanced modes

This prevents confusion when users explicitly choose --source to get
the latest main branch code instead of stable releases.
2025-10-21 09:41:46 +00:00
rcourtman
4c1ac06cdb fix: stop existing pulse-sensor-proxy service before binary update
When installing temperature monitoring for a new container, stop any
existing pulse-sensor-proxy service before trying to overwrite the
binary. This prevents 'Text file busy' errors when the binary is
currently running.

Fixes the error that occurred when installing container 103 while
container 107's proxy was still running.
2025-10-21 09:39:30 +00:00
rcourtman
63e056eb0a fix: update temperature monitoring prompt text for pre-creation flow
The prompt now says 'Enable temperature monitoring from first boot'
instead of 'Restart the container to activate' since we moved the
question to before container creation.

Also clarified 'Configure container with temperature monitoring bind mount'
to better reflect what actually happens.
2025-10-21 09:33:51 +00:00
rcourtman
7e871780f6 feat: improve LXC installer robustness and temperature monitoring UX
Major improvements to the install script based on comprehensive review:

## 1. Temperature Monitoring - No Restart Required 
- Ask about temperature monitoring BEFORE container creation (not after)
- Add bind mount during `pct create` instead of requiring restart later
- Quick mode defaults to "yes", Advanced mode asks user
- Host path: /run/pulse-sensor-proxy → /mnt/pulse-proxy in container
- Support --skip-restart flag in install-sensor-proxy.sh
- Eliminates disruptive container restart on fresh installs

## 2. Shell Injection Prevention 🔒
- Replace `eval pct create` with array-based command building
- Prevents quoting bugs with special characters in hostnames/nameservers
- Safer handling of user input in container creation

## 3. Non-Interactive Install Support 🤖
- Replace bare `read` with `safe_read_with_default` in prompts
- Prevents hangs when running `curl | bash` non-interactively
- Proper fallback to sensible defaults

## 4. Cleanup on Interrupt 🧹
- Track container ID globally during creation
- Properly cleanup orphaned containers on Ctrl+C/SIGTERM
- New handle_install_interrupt() function
- Prevents leftover containers after cancelled installs

## 5. Air-Gapped Network Support 🌐
- Replace 8.8.8.8 ping check with `hostname -I` IP detection
- Supports restricted/firewalled networks where external ping fails
- More reliable for DHCP-only environments

Changes:
- install.sh: Refactor temperature prompt timing and mount setup
- install.sh: Convert pct create to array-based args (lines 1018-1055)
- install.sh: Add handle_install_interrupt trap (lines 38-48)
- install.sh: Replace ping check with IP detection (line 1082)
- scripts/install-sensor-proxy.sh: Add --skip-restart flag support
- scripts/install-sensor-proxy.sh: Improve mount detection and updates

Impact:
- Fresh installs now complete without any container restarts
- Temperature monitoring works immediately after first boot
- Safer and more robust for automation/CI scenarios
- Better experience on restricted networks

Co-authored-by: Codex AI
2025-10-21 09:22:43 +00:00
rcourtman
b929fdcc6e feat: improve source build installation experience
- Remove confusing --main flag, use --source for clarity
- Fix timeout issues when building from source in LXC containers
  - Increase timeout from 5min to 20min for source builds
  - Add PULSE_CONTAINER_TIMEOUT env var for custom timeouts
  - Support PULSE_CONTAINER_TIMEOUT=0 to disable timeout
- Fix misleading "Latest version: vX.X.X" message during source builds
- Update documentation to use --source instead of --main
- Simplify auto-update script logic for source builds

Changes:
- install.sh: Check BUILD_FROM_SOURCE early to skip version detection
- install.sh: Adaptive timeout (300s binary, 1200s source builds)
- install.sh: Better timeout error messages with recovery instructions
- README.md: Replace --main with --source in examples
- docs/INSTALL.md: Replace --main with --source in examples
- scripts/pulse-auto-update.sh: Remove --main special case
2025-10-21 08:57:29 +00:00
rcourtman
56c6c0cc0c feat: improve discovery with progress tracking, validation, and structured errors
Significantly enhanced network discovery feature to eliminate false positives,
provide real-time progress updates, and better error reporting.

Key improvements:
- Require positive Proxmox identification (version data, auth headers, or certificates)
  instead of reporting any service on ports 8006/8007
- Add real-time progress tracking with phase/target counts and completion percentage
- Implement structured error reporting with IP, phase, type, and timestamp details
- Fix TLS timeout handling to prevent hangs on unresponsive hosts
- Expose progress and structured errors via WebSocket for UI consumption
- Reduce log verbosity by moving discovery logs to debug level
- Fix duplicate IP counting to ensure progress reaches 100%

Breaking changes: None (backward compatible with legacy API methods)
2025-10-20 22:29:30 +00:00
rcourtman
95c85f6e01 fix: use correct service name (pulse.service) for proxy environment override
The installer was configuring pulse-backend.service.d but the actual
service is pulse.service, so the PULSE_SENSOR_PROXY_SOCKET environment
variable wasn't being set.

Changed: pulse-backend.service → pulse.service

This ensures Pulse actually uses the proxy socket for temperature
monitoring instead of attempting SSH connections.
2025-10-20 22:28:33 +00:00
rcourtman
8194ce9e7a feat: add containerization detection to version endpoint
Added containerized and containerId fields to /api/version endpoint
to enable automatic temperature proxy installation for LXC containers.

Changes:
- Added Containerized bool field to VersionResponse
- Added ContainerId string field to VersionResponse
- Detect containerization by checking /run/systemd/container file
- Extract container ID from hostname for LXC containers
- Set deployment type from container type (lxc/docker)

This allows the PVE setup script to:
1. Detect that Pulse is running in a container
2. Find the container ID by matching IPs
3. Automatically install pulse-sensor-proxy on the host
4. Configure bind mount for secure socket communication

Fixes the issue where setup script showed 'Proxy not available'
even when Pulse was containerized.
2025-10-20 22:14:03 +00:00
rcourtman
d430efcecb fix: correct fmt.Sprintf argument alignment in PVE setup script
Critical bug fix: The setup script's format string had 33 placeholders
but was only receiving 27 arguments, causing:
- INSTALLER_URL to receive authToken instead of pulseURL
- This made curl try to resolve the token value as a hostname
- Error: 'curl: (6) Could not resolve host: N7AE3P'
- Token ID showed '%!s(MISSING)' in manual setup instructions

Fixed by:
- Added missing tokenName at position 7
- Added literal '%s' strings for version_ge printf placeholders
- Added authToken arguments for Authorization headers (positions 29, 31)
- Ensured all 33 format placeholders have corresponding arguments

Now generates correct URLs:
- INSTALLER_URL: http://192.168.0.160:7655/api/install/install-sensor-proxy.sh
- --pulse-server: http://192.168.0.160:7655
- Token ID: pulse-monitor@pam!pulse-192-168-0-160-[timestamp]
2025-10-20 21:58:37 +00:00
rcourtman
8faa9040fb fix: show curl errors in installer download failures
Changed curl flags from -fsSL to -fSL to enable error output.
The -s flag was silencing all curl errors including SSL/TLS issues,
making it impossible to diagnose download failures.

With -S (show errors), stderr now captures meaningful error messages
like certificate problems, connection failures, etc.
2025-10-20 21:31:54 +00:00
rcourtman
90d51a2b1b feat: add rollback mechanism for container config changes
- Back up container config before making mount modifications
- Restore original config if socket verification fails
- Clean up backup file on success or when verification is skipped
- Leave host-level resources (user, binary, service) in place for idempotency

This ensures failed installations don't leave containers in an
inconsistent state while keeping successfully installed host services
for faster re-runs.
2025-10-20 21:16:06 +00:00