diff --git a/docs/DOCKER_HUB_README.md b/docs/DOCKER_HUB_README.md deleted file mode 100644 index a73e9be9f..000000000 --- a/docs/DOCKER_HUB_README.md +++ /dev/null @@ -1,325 +0,0 @@ -# Pulse - -[![GitHub release](https://img.shields.io/github/v/release/rcourtman/Pulse)](https://github.com/rcourtman/Pulse/releases/latest) -[![Docker Pulls](https://img.shields.io/docker/pulls/rcourtman/pulse)](https://hub.docker.com/r/rcourtman/pulse) -[![License](https://img.shields.io/github/license/rcourtman/Pulse)](https://github.com/rcourtman/Pulse/blob/main/LICENSE) - -**Real-time monitoring for Proxmox VE, Proxmox Mail Gateway, PBS, and Docker infrastructure with alerts and webhooks.** - -Monitor your hybrid Proxmox and Docker estate from a single dashboard. Get instant alerts when nodes go down, containers misbehave, backups fail, or storage fills up. Supports email, Discord, Slack, Telegram, and more. - -**[Try the live demo →](https://demo.pulserelay.pro)** (read-only with mock data) - -## Support Pulse Development - -Pulse is built by a solo developer in evenings and weekends. Your support helps: -- Keep me motivated to add new features -- Prioritize bug fixes and user requests -- Ensure Pulse stays 100% free and open-source forever - -[![GitHub Sponsors](https://img.shields.io/github/sponsors/rcourtman?style=social&label=Sponsor)](https://github.com/sponsors/rcourtman) -[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/rcourtman) - -**Not ready to sponsor?** Star the project or share it with your homelab community! - -## Features - -- **Auto-Discovery**: Finds Proxmox nodes on your network, one-liner setup via generated scripts -- **Cluster Support**: Configure one node, monitor entire cluster -- **Enterprise Security**: - - Credentials encrypted at rest, masked in logs, never sent to frontend - - CSRF protection for all state-changing operations - - Rate limiting (500 req/min general, 10 attempts/min for auth) - - Account lockout after failed login attempts - - Secure session management with HttpOnly cookies - - bcrypt password hashing (cost 12) - passwords NEVER stored in plain text - - API tokens stored securely with restricted file permissions - - Security headers (CSP, X-Frame-Options, etc.) 
- - Comprehensive audit logging -- Live monitoring of VMs, containers, nodes, storage -- **Smart Alerts**: Email and webhooks (Discord, Slack, Telegram, Teams, ntfy.sh, Gotify) - - Example: "VM 'webserver' is down on node 'pve1'" - - Example: "Storage 'local-lvm' at 85% capacity" - - Example: "VM 'database' is back online" -- **Adaptive Thresholds**: Hysteresis-based trigger/clear levels, fractional network thresholds, per-metric search, reset-to-defaults, and Custom overrides with inline audit trail -- **Alert Timeline Analytics**: Rich history explorer with acknowledgement/clear markers, escalation breadcrumbs, and quick filters for noisy resources -- **Ceph Awareness**: Surface Ceph health, pool utilisation, and daemon status automatically when Proxmox exposes Ceph-backed storage -- Unified view of PBS backups, PVE backups, and snapshots -- **Interactive Backup Explorer**: Cross-highlighted bar chart + grid with quick time-range pivots (24h/7d/30d/custom) and contextual tooltips for the busiest jobs -- Proxmox Mail Gateway analytics: mail volume, spam/virus trends, quarantine health, and cluster node status -- Optional Docker container monitoring via lightweight agent -- Config export/import with encryption and authentication -- Automatic stable updates with safe rollback (opt-in) -- Runtime logging controls (switch level/format or mirror to file without downtime) -- Update history with rollback guidance captured in the UI -- Dark/light themes, responsive design -- Built with Go for minimal resource usage - -[View screenshots and full documentation on GitHub →](https://github.com/rcourtman/Pulse) - -## Privacy - -**Pulse respects your privacy:** -- No telemetry or analytics collection -- No phone-home functionality -- No external API calls (except for configured webhooks) -- All data stays on your server -- Open source - verify it yourself - -Your infrastructure data is yours alone. - -## Quick Start with Docker - -### Basic Setup - -```bash -docker run -d \ - --name pulse \ - -p 7655:7655 \ - -v pulse_data:/data \ - --restart unless-stopped \ - rcourtman/pulse:latest -``` - -Then open `http://localhost:7655` and complete the security setup wizard. - -### Network Discovery - -Pulse automatically discovers Proxmox nodes on your network! By default, it scans: -- 192.168.0.0/16 (home networks) -- 10.0.0.0/8 (private networks) -- 172.16.0.0/12 (Docker/internal networks) - -To scan a custom subnet instead: -```bash -docker run -d \ - --name pulse \ - -p 7655:7655 \ - -v pulse_data:/data \ - -e DISCOVERY_SUBNET="192.168.50.0/24" \ - --restart unless-stopped \ - rcourtman/pulse:latest -``` - -### Automated Deployment with Pre-configured Auth - -```bash -# Deploy with authentication pre-configured -docker run -d \ - --name pulse \ - -p 7655:7655 \ - -v pulse_data:/data \ - -e API_TOKENS="ansible-token,docker-agent-token" \ - -e PULSE_AUTH_USER="admin" \ - -e PULSE_AUTH_PASS="your-password" \ - --restart unless-stopped \ - rcourtman/pulse:latest - -# Plain text credentials are automatically hashed for security -# No setup required - API works immediately -``` - -### Docker Compose - -```yaml -services: - pulse: - image: rcourtman/pulse:latest - container_name: pulse - ports: - - "7655:7655" - volumes: - - pulse_data:/data - environment: - # NOTE: Env vars override UI settings. Remove env var to allow UI configuration. 
- - # Network discovery (usually not needed - auto-scans common networks) - # - DISCOVERY_SUBNET=192.168.50.0/24 # Only for non-standard networks - - # Ports - # - PORT=7655 # Backend port (default: 7655) - # - FRONTEND_PORT=7655 # Frontend port (default: 7655) - - # Security (all optional - runs open by default) - # - PULSE_AUTH_USER=admin # Username for web UI login - # - PULSE_AUTH_PASS=your-password # Plain text or bcrypt hash (auto-hashed if plain) - # - API_TOKENS=token-a,token-b # Comma-separated tokens (plain or SHA3-256 hashed) - # - API_TOKEN=legacy-token # Optional single-token fallback - # - ALLOW_UNPROTECTED_EXPORT=false # Allow export without auth (default: false) - - # Security: Plain text credentials are automatically hashed - # You can provide either: - # 1. Plain text (auto-hashed): PULSE_AUTH_PASS=mypassword - # 2. Pre-hashed (advanced): PULSE_AUTH_PASS='$$2a$$12$$...' - # Note: Escape $ as $$ in docker-compose.yml for pre-hashed values - - # Performance - # - CONNECTION_TIMEOUT=10 # Connection timeout in seconds (default: 10) - - # CORS & logging - # - ALLOWED_ORIGINS=https://app.example.com # CORS origins (default: none, same-origin only) - # - LOG_LEVEL=info # Log level: debug/info/warn/error (default: info) - # - LOG_FORMAT=auto # auto | json | console (default: auto) - # - LOG_FILE=/data/pulse.log # Optional mirrored logfile inside container - # - LOG_MAX_SIZE=100 # Rotate logfile after N MB - # - LOG_MAX_AGE=30 # Retain rotated logs for N days - # - LOG_COMPRESS=true # Compress rotated logs - restart: unless-stopped - -volumes: - pulse_data: - -### Updating & Rollbacks (v4.24.0+) - -```bash -# Update to the latest tagged image -docker pull rcourtman/pulse:latest -docker stop pulse && docker rm pulse -docker run -d --name pulse \ - -p 7655:7655 -v pulse_data:/data \ - --restart unless-stopped \ - rcourtman/pulse:latest -``` -- Every upgrade is logged in **Settings → System → Updates** with an `event_id` for change tracking. -- Need to revert? Redeploy the previous tag (for example `rcourtman/pulse:v4.23.2`). Record the rollback reason in your change notes and double-check `/api/monitoring/scheduler/health` once the container is back online. -``` - -## Initial Setup - -1. Open `http://:7655` -2. **Complete the mandatory security setup** (first-time only) -3. Create your admin username and password -4. Use **Settings → Security → API tokens** to issue dedicated tokens for automation (one token per integration makes revocation painless) - -## Configure Proxmox/PBS Nodes - -After logging in: - -1. Go to Settings → Nodes -2. Discovered nodes appear automatically -3. Click "Setup Script" next to any node -4. Click "Generate Setup Code" button (creates a 6-character code valid for 5 minutes) -5. Copy and run the provided one-liner on your Proxmox/PBS host -6. 
Node is configured and monitoring starts automatically - -**Example setup command:** -```bash -curl -sSL "http://pulse:7655/api/setup-script?type=pve&host=https://pve:8006&auth_token=ABC123" | bash -``` - -## Docker Updates - -```bash -# Latest stable -docker pull rcourtman/pulse:latest - -# Latest RC/pre-release -docker pull rcourtman/pulse:rc - -# Specific version -docker pull rcourtman/pulse:v4.22.0 - -# Then recreate your container -docker stop pulse && docker rm pulse -# Run your docker run or docker-compose command again -``` - -## Security - -- **Authentication required** - Protects your Proxmox infrastructure credentials -- **Quick setup wizard** - Secure your installation in under a minute -- **Multiple auth methods**: Password authentication, API tokens, proxy auth (SSO), or combinations -- **Proxy/SSO support** - Integrate with Authentik, Authelia, and other authentication proxies -- **Enterprise-grade protection**: - - Credentials encrypted at rest (AES-256-GCM) - - CSRF tokens for state-changing operations - - Rate limiting and account lockout protection - - Secure session management with HttpOnly cookies - - bcrypt password hashing (cost 12) - passwords NEVER stored in plain text - - API tokens stored securely with restricted file permissions - - Security headers (CSP, X-Frame-Options, etc.) - - Comprehensive audit logging -- **Security by design**: - - Frontend never receives node credentials - - API tokens visible only to authenticated users - - Export/import requires authentication when configured - -See [Security Documentation](https://github.com/rcourtman/Pulse/blob/main/docs/SECURITY.md) for details. - -## HTTPS/TLS Configuration - -Enable HTTPS by setting these environment variables: - -```bash -docker run -d -p 7655:7655 \ - -e HTTPS_ENABLED=true \ - -e TLS_CERT_FILE=/data/cert.pem \ - -e TLS_KEY_FILE=/data/key.pem \ - -v pulse_data:/data \ - -v /path/to/certs:/data/certs:ro \ - rcourtman/pulse:latest -``` - -## Troubleshooting - -### Authentication Issues - -#### Cannot login after setting up security -- **Docker**: Ensure bcrypt hash is exactly 60 characters and wrapped in single quotes -- **Docker Compose**: MUST escape $ characters as $$ (e.g., `$$2a$$12$$...`) -- **Example (docker run)**: `PULSE_AUTH_PASS='$2a$12$YTZXOCEylj4TaevZ0DCeI.notayQZ..b0OZ97lUZ.Q24fljLiMQHK'` -- **Example (docker-compose.yml)**: `PULSE_AUTH_PASS='$$2a$$12$$YTZXOCEylj4TaevZ0DCeI.notayQZ..b0OZ97lUZ.Q24fljLiMQHK'` -- If hash is truncated or mangled, authentication will fail -- Use Quick Security Setup in the UI to avoid manual configuration errors - -#### .env file not created (Docker) -- **Expected behavior**: When using environment variables, no .env file is created in /data -- The .env file is only created when using Quick Security Setup or password changes -- If you provide credentials via environment variables, they take precedence -- To use Quick Security Setup: Start container WITHOUT auth environment variables - -### VM Disk Stats Show "-" -- VMs require QEMU Guest Agent to report disk usage (Proxmox API returns 0 for VMs) -- Install guest agent in VM: `apt install qemu-guest-agent` (Linux) or virtio-win tools (Windows) -- Enable in VM Options → QEMU Guest Agent, then restart VM -- Container (LXC) disk stats always work (no guest agent needed) - -### Connection Issues -- Check Proxmox API is accessible (port 8006/8007) -- Verify credentials have PVEAuditor role plus VM.GuestAgent.Audit (PVE 9) or VM.Monitor (PVE 8); the setup script applies these via the PulseMonitor role (adds 
Sys.Audit when available) -- For PBS: ensure API token has Datastore.Audit permission - -### Logs -```bash -# View logs -docker logs pulse - -# Follow logs -docker logs -f pulse -``` - -## Documentation - -Full documentation available on GitHub: - -- [Complete Installation Guide](https://github.com/rcourtman/Pulse/blob/main/docs/INSTALL.md) -- [Configuration Guide](https://github.com/rcourtman/Pulse/blob/main/docs/CONFIGURATION.md) -- [VM Disk Monitoring](https://github.com/rcourtman/Pulse/blob/main/docs/VM_DISK_MONITORING.md) - Set up QEMU Guest Agent for accurate VM disk usage -- [Troubleshooting](https://github.com/rcourtman/Pulse/blob/main/docs/TROUBLESHOOTING.md) -- [API Reference](https://github.com/rcourtman/Pulse/blob/main/docs/API.md) -- [Webhook Guide](https://github.com/rcourtman/Pulse/blob/main/docs/WEBHOOKS.md) -- [Proxy Authentication](https://github.com/rcourtman/Pulse/blob/main/docs/PROXY_AUTH.md) - SSO integration with Authentik, Authelia, etc. -- [Reverse Proxy Setup](https://github.com/rcourtman/Pulse/blob/main/docs/REVERSE_PROXY.md) - nginx, Caddy, Apache, Traefik configs -- [Security](https://github.com/rcourtman/Pulse/blob/main/docs/SECURITY.md) -- [FAQ](https://github.com/rcourtman/Pulse/blob/main/docs/FAQ.md) - -## Links - -- [GitHub Repository](https://github.com/rcourtman/Pulse) -- [Releases & Changelog](https://github.com/rcourtman/Pulse/releases) -- [Issues & Feature Requests](https://github.com/rcourtman/Pulse/issues) -- [Live Demo](https://demo.pulserelay.pro) - -## License - -MIT - See [LICENSE](https://github.com/rcourtman/Pulse/blob/main/LICENSE) diff --git a/docs/PORT_CONFIGURATION.md b/docs/PORT_CONFIGURATION.md deleted file mode 100644 index 745dc4e82..000000000 --- a/docs/PORT_CONFIGURATION.md +++ /dev/null @@ -1,127 +0,0 @@ -# Port Configuration Guide - -Pulse supports multiple ways to configure the frontend port (default: 7655). - -> **Development tip:** The hot-reload workflow (`scripts/hot-dev.sh` or `make dev-hot`) loads `.env`, `.env.local`, and `.env.dev`. Set `FRONTEND_PORT` or `PULSE_DEV_API_PORT` there to run the backend on a different port while keeping the generated `curl` commands and Vite proxy in sync. - -## Recommended Methods - -### 1. During Installation (Easiest) -The installer prompts for the port. To skip the prompt, use: -```bash -FRONTEND_PORT=8080 curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/install.sh | bash -``` - -### 2. Using systemd override (For existing installations) -```bash -sudo systemctl edit pulse -``` -Add these lines: -```ini -[Service] -Environment="FRONTEND_PORT=8080" -``` -Then restart: `sudo systemctl restart pulse` - -### 3. Using system.json (Alternative method) -Edit `/etc/pulse/system.json`: -```json -{ - "frontendPort": 8080 -} -``` -Then restart: `sudo systemctl restart pulse` - -### 4. Using environment variables (Docker) -For Docker deployments: -```bash -docker run -e FRONTEND_PORT=8080 -p 8080:8080 rcourtman/pulse:latest -``` - -## Priority Order - -Pulse checks for port configuration in this order: -1. `FRONTEND_PORT` environment variable -2. `PORT` environment variable (legacy) -3. `frontendPort` in system.json -4. Default: 7655 - -Environment variables always override configuration files. - -## Why not .env? 
- -The `/etc/pulse/.env` file is reserved exclusively for authentication credentials: -- `API_TOKENS` - One or more API authentication tokens (hashed) -- `API_TOKEN` - Legacy single API token (hashed) -- `PULSE_AUTH_USER` - Web UI username -- `PULSE_AUTH_PASS` - Web UI password (hashed) - -Keeping application configuration separate from authentication credentials: -- Makes it clear what's a secret vs what's configuration -- Allows different permission models if needed -- Follows the principle of separation of concerns -- Makes it easier to backup/share configs without exposing credentials - -## Service Name Variations - -**Important:** Pulse uses different service names depending on the deployment environment: - -- **Systemd (default):** `pulse.service` or `pulse-backend.service` (legacy) -- **Hot-dev scripts:** `pulse-hot-dev` (development only) -- **Kubernetes/Helm:** Deployment `pulse`, Service `pulse` (port configured via Helm values) - -**To check the active service:** -```bash -# Systemd -systemctl list-units | grep pulse -systemctl status pulse - -# Kubernetes -kubectl -n pulse get svc pulse -kubectl -n pulse get deploy pulse -``` - -## Change Tracking (v4.24.0+) - -Port changes via environment variables or `system.json` take effect immediately after restart. **v4.24.0 records configuration changes in update history**—useful for audit trails and troubleshooting. - -**To view change history:** -```bash -# Via UI -# Navigate to Settings → System → Updates - -# Via API -curl -s http://localhost:7655/api/updates/history | jq '.entries[] | {timestamp, action, status}' -``` - -## Troubleshooting - -### Port not changing after configuration? -1. **Check which service name is in use:** - ```bash - systemctl list-units | grep pulse - ``` - It might be `pulse` (default), `pulse-backend` (legacy), or `pulse-hot-dev` (dev environment) depending on your installation method. - -2. **Verify the configuration is loaded:** - ```bash - # Systemd - sudo systemctl show pulse | grep Environment - - # Kubernetes - kubectl -n pulse get deploy pulse -o jsonpath='{.spec.template.spec.containers[0].env}' | jq - ``` - -3. **Check if another process is using the port:** - ```bash - sudo lsof -i :8080 - ``` - -4. **Verify post-restart** (v4.24.0+): - ```bash - # Check actual listening port - curl -s http://localhost:7655/api/version | jq - - # Check update history for restart event - curl -s http://localhost:7655/api/updates/history?limit=5 | jq - ``` diff --git a/docs/PULSE_SENSOR_PROXY_HARDENING.md b/docs/PULSE_SENSOR_PROXY_HARDENING.md deleted file mode 100644 index e484cba80..000000000 --- a/docs/PULSE_SENSOR_PROXY_HARDENING.md +++ /dev/null @@ -1,1018 +0,0 @@ -# Pulse Temperature Proxy - Security Hardening Guide - -## Overview - -The `pulse-sensor-proxy` is a host-side service that provides secure temperature monitoring for containerized Pulse deployments. It addresses a critical security concern: SSH keys stored inside LXC containers can be exfiltrated if the container is compromised. 
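A quick way to sanity-check this boundary from inside a running Pulse container is sketched below. It assumes the default container-side mount point (`/mnt/pulse-proxy`) described later in this guide and uses only standard shell tools; adjust the paths if your deployment differs.

```bash
# Hedged sketch: verify the isolation boundary from inside the Pulse container.
# 1) The proxy socket should be reachable at the default mount point...
test -S /mnt/pulse-proxy/pulse-sensor-proxy.sock \
  && echo "proxy socket present" \
  || echo "proxy socket missing - check the bind mount on the host"

# 2) ...while no SSH private key material should exist inside the container.
if find / -xdev -name 'id_ed25519*' -not -path '/proc/*' 2>/dev/null | grep -q .; then
  echo "WARNING: SSH key material found inside the container"
else
  echo "no SSH keys inside the container (expected)"
fi
```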
- -**Architecture:** -- Host-side proxy runs with minimal privileges on each Proxmox node -- Containerized Pulse communicates via Unix socket (inside the container at `/mnt/pulse-proxy/pulse-sensor-proxy.sock`, backed by `/run/pulse-sensor-proxy/pulse-sensor-proxy.sock` on the host) -- Proxy authenticates containers using Linux `SO_PEERCRED` (UID/PID verification) -- SSH keys never leave the host filesystem - -```mermaid -flowchart LR - subgraph Host Node - direction TB - PulseProxy["pulse-sensor-proxy service\n(systemd, user=pulse-sensor-proxy)"] - HostSocket["/run/pulse-sensor-proxy/\npulse-sensor-proxy.sock"] - PulseProxy -- Unix socket --> HostSocket - end - - subgraph LXC Container (Pulse) - direction TB - MountPoint["/mnt/pulse-proxy/\npulse-sensor-proxy.sock"] - PulseBackend["Pulse backend (Go)\n-hot-dev / production"] - MountPoint --> PulseBackend - end - - HostSocket == Proxmox mp mount ==> MountPoint - PulseBackend -.-> Sensors[(Cluster nodes via SSH 'sensors -j')] - PulseProxy -.-> Sensors -``` - -**Threat Model:** -- ✅ Container compromise cannot access SSH keys -- ✅ Container cannot directly SSH to cluster nodes -- ✅ Rate limiting prevents abuse via socket -- ✅ IP restrictions on SSH keys limit lateral movement -- ✅ Audit logging tracks all temperature requests - -## Prerequisites - -- Proxmox VE 7.0+ or Proxmox Backup Server 2.0+ -- LXC container running Pulse (unprivileged recommended) -- Root access to Proxmox host(s) -- `lm-sensors` installed on all nodes -- Cluster SSH access configured (root passwordless SSH between nodes) - -## Host Hardening - -### Service Account - -The proxy runs as the `pulse-sensor-proxy` user with these characteristics: -- System account (no login shell: `/usr/sbin/nologin`) -- No home directory -- Dedicated group: `pulse-sensor-proxy` -- Owns `/var/lib/pulse-sensor-proxy` and `/run/pulse-sensor-proxy` - -**Verify service account:** -```bash -# Check user exists -id pulse-sensor-proxy - -# Expected output: -# uid=XXX(pulse-sensor-proxy) gid=XXX(pulse-sensor-proxy) groups=XXX(pulse-sensor-proxy) - -# Check shell (should be /usr/sbin/nologin) -getent passwd pulse-sensor-proxy | cut -d: -f7 -``` - -### Systemd Unit Security - -The systemd unit includes comprehensive hardening directives: - -**Key security features:** -- `User=pulse-sensor-proxy` / `Group=pulse-sensor-proxy` - Unprivileged execution -- `NoNewPrivileges=true` - Prevents privilege escalation -- `ProtectSystem=strict` - Read-only `/usr`, `/boot`, `/efi` -- `ProtectHome=true` - Inaccessible `/home`, `/root`, `/run/user` -- `PrivateTmp=true` - Isolated `/tmp` and `/var/tmp` -- `SystemCallFilter=@system-service` - Restricted syscalls -- `CapabilityBoundingSet=` - No capabilities granted -- `RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6` - Socket restrictions - -**Verify systemd security:** -```bash -# Check service status -systemctl status pulse-sensor-proxy - -# Verify user/group -ps aux | grep pulse-sensor-proxy | grep -v grep - -# Expected: pulse-sensor-proxy user, not root - -# Check systemd security settings -systemctl show pulse-sensor-proxy | grep -E '(User=|NoNewPrivileges|ProtectSystem|CapabilityBoundingSet)' -``` - -### File Permissions - -**Critical paths and ownership:** -``` -/var/lib/pulse-sensor-proxy/ pulse-sensor-proxy:pulse-sensor-proxy 0750 -├── ssh/ pulse-sensor-proxy:pulse-sensor-proxy 0700 -│ ├── id_ed25519 pulse-sensor-proxy:pulse-sensor-proxy 0600 -│ └── id_ed25519.pub pulse-sensor-proxy:pulse-sensor-proxy 0640 -└── ssh.d/ pulse-sensor-proxy:pulse-sensor-proxy 
0750 - ├── next/ pulse-sensor-proxy:pulse-sensor-proxy 0750 - └── prev/ pulse-sensor-proxy:pulse-sensor-proxy 0750 - -/run/pulse-sensor-proxy/ pulse-sensor-proxy:pulse-sensor-proxy 0775 -└── pulse-sensor-proxy.sock pulse-sensor-proxy:pulse-sensor-proxy 0777 -/mnt/pulse-proxy/ nobody:nogroup (id-mapped) 0777 -└── pulse-sensor-proxy.sock nobody:nogroup 0777 -``` - -**Verify permissions:** -```bash -# Check base directory -ls -ld /var/lib/pulse-sensor-proxy/ -# Expected: drwxr-x--- pulse-sensor-proxy pulse-sensor-proxy - -# Check SSH keys -ls -l /var/lib/pulse-sensor-proxy/ssh/ -# Expected: -# -rw------- pulse-sensor-proxy pulse-sensor-proxy id_ed25519 -# -rw-r----- pulse-sensor-proxy pulse-sensor-proxy id_ed25519.pub - -# Check socket directory on host (note: 0775 for container access) -ls -ld /run/pulse-sensor-proxy/ -# Expected: drwxrwxr-x pulse-sensor-proxy pulse-sensor-proxy - -# Check socket directory inside container -ls -ld /mnt/pulse-proxy/ -# Expected: drwxrwxrwx nobody nogroup (id-mapped) -``` - -**Why 0775 on socket directory?** -The socket directory needs `0775` (not `0770`) to allow the container's unprivileged UID (e.g., 1001) to traverse into the directory and access the socket. The socket itself is `0777` as access control is enforced via `SO_PEERCRED`. - -## LXC Container Requirements - -### Configuration Summary - -| Setting | Value | Purpose | -|---------|-------|---------| -| `lxc.idmap` | `u 0 100000 65536`
`g 0 100000 65536` | Unprivileged UID/GID mapping | -| `lxc.apparmor.profile` | `generated` or custom | AppArmor confinement | -| `lxc.cap.drop` | `sys_admin` (optional) | Drop dangerous capabilities | -| `lxc.mount.entry` | `/run/pulse-sensor-proxy mnt/pulse-proxy none bind,create=dir 0 0` | Socket access from container (migration-safe) | - -### Sample LXC Configuration - -**In `/etc/pve/lxc/.conf`:** -```ini -# Unprivileged container (required) -unprivileged: 1 - -# AppArmor profile (recommended) -lxc.apparmor.profile: generated - -# Drop CAP_SYS_ADMIN if feasible (optional but recommended) -# WARNING: May break some container management operations -lxc.cap.drop: sys_admin - -# Bind mount proxy socket directory (REQUIRED) -# Use lxc.mount.entry to keep snapshots/migrations working with Proxmox bind mounts -lxc.mount.entry: /run/pulse-sensor-proxy mnt/pulse-proxy none bind,create=dir 0 0 -``` - -**Key points:** -- **Directory-level mount**: Mount `/run/pulse-sensor-proxy` directory, not the socket file itself (socket is recreated by systemd) -- **lxc.mount.entry instead of mp**: Proxmox refuses snapshots/migrations when `mpX` bind mounts target `/run/pulse-sensor-proxy`; `lxc.mount.entry` keeps the bind mount without blocking those workflows -- **Mode 0775**: Socket directory needs group+other execute permissions for container UID traversal -- **Socket 0777**: Actual socket is world-writable; security enforced via `SO_PEERCRED` authentication - -### Upgrading Existing Installations - -If you previously used an `mpX` bind mount (e.g., `mp0: /run/pulse-sensor-proxy,mp=/mnt/pulse-proxy`), upgrade by **removing each node in Pulse and then re-adding it using the “Copy install script” flow in Settings → Nodes**. The installer now: - -- Removes `mp*` entries referencing `/run/pulse-sensor-proxy` -- Writes the migration-safe `lxc.mount.entry: /run/pulse-sensor-proxy mnt/pulse-proxy none bind,create=dir 0 0` -- Keeps the systemd override so the backend automatically uses `/mnt/pulse-proxy/pulse-sensor-proxy.sock` - -This is the same workflow you used originally—no extra commands are required. Just remove the node from Pulse, click “Copy install script,” run it on the Proxmox host, and add the node again. If you prefer to refresh in place, rerun the host-side installer directly (e.g. `sudo /opt/pulse/scripts/install-sensor-proxy.sh --ctid --pulse-server http://:7655`). 
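To confirm the migration actually landed, the hedged check below (run on the Proxmox host; `CTID` is a placeholder for your container ID) looks for leftover `mpX` entries and for the new `lxc.mount.entry` line written by the installer:

```bash
# Hedged sketch - set CTID to your Pulse container ID before running.
CTID=101

# A leftover mpX bind mount means the installer has not been re-run yet.
if grep -E '^mp[0-9]+:.*pulse-sensor-proxy' "/etc/pve/lxc/${CTID}.conf"; then
  echo "legacy mpX bind mount still present - re-run the install script"
fi

# The migration-safe entry written by the installer should be present.
grep -F 'lxc.mount.entry: /run/pulse-sensor-proxy' "/etc/pve/lxc/${CTID}.conf" \
  || echo "migration-safe lxc.mount.entry missing"
```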
- -### Runtime Verification - -**Check container is unprivileged:** -```bash -# On host -pct config | grep unprivileged -# Expected: unprivileged: 1 - -# Inside container -cat /proc/self/uid_map -# Expected: 0 100000 65536 (or similar) -# NOT: 0 0 4294967295 (privileged) -``` - -**Check AppArmor confinement:** -```bash -# Inside container -cat /proc/self/attr/current -# Expected: lxc-_ (enforcing) or similar -# NOT: unconfined -``` - -**Check namespace isolation:** -```bash -# Inside container -ls -li /proc/self/ns/ -# Each namespace should have a unique inode number, different from host -``` - -**Check capabilities:** -```bash -# Inside container -capsh --print | grep Current -# Should show limited capability set -# If lxc.cap.drop: sys_admin is set, CAP_SYS_ADMIN should be absent -``` - -**Check bind mount:** -```bash -# Inside container -ls -la /mnt/pulse-proxy/ -# Expected: pulse-sensor-proxy.sock visible - -# Test socket access (requires Pulse to attempt connection) -socat - UNIX-CONNECT:/mnt/pulse-proxy/pulse-sensor-proxy.sock -# Should connect (may timeout waiting for input, but connection succeeds) -``` - -## Key Management - -### SSH Key Restrictions - -All SSH keys deployed to cluster nodes include these restrictions: -- `command="sensors -j"` - Forced command (only sensors allowed) -- `from=""` - IP address restrictions -- `no-port-forwarding` - Disable port forwarding -- `no-X11-forwarding` - Disable X11 forwarding -- `no-agent-forwarding` - Disable agent forwarding -- `no-pty` - Disable PTY allocation - -**Example authorized_keys entry:** -``` -from="192.168.0.0/24,10.0.0.0/8",command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... pulse-sensor-proxy -``` - -**Configure allowed subnets:** - -Create `/etc/pulse-sensor-proxy/config.yaml`: -```yaml -allowed_source_subnets: - - "192.168.0.0/24" # LAN subnet - - "10.0.0.0/8" # VPN subnet - -log_level: "info" # Logging verbosity: trace, debug, info, warn, error, fatal, disabled -``` - -Or use environment variables: -```bash -# In /etc/default/pulse-sensor-proxy (loaded by systemd) -PULSE_SENSOR_PROXY_ALLOWED_SUBNETS="192.168.0.0/24,10.0.0.0/8" -PULSE_SENSOR_PROXY_LOG_LEVEL="info" # Set logging verbosity -``` - -**Auto-detection:** -If no subnets are configured, the proxy auto-detects host IP addresses and uses them as `/32` (IPv4) or `/128` (IPv6) CIDRs. This is secure but brittle (breaks if host IP changes). Explicit configuration is recommended. - -**Verify SSH restrictions:** -```bash -# On any cluster node -grep pulse-sensor-proxy /root/.ssh/authorized_keys - -# Expected format: -# from="...",command="sensors -j",no-* ssh-ed25519 AAAA... pulse-sensor-proxy -``` - -### Key Rotation - -**Rotation cadence:** -- Recommended: Every 90 days -- Minimum: Every 180 days -- After incident: Immediately - -**Rotation workflow:** - -The `pulse-sensor-proxy-rotate-keys.sh` script performs staged rotation with verification: - -1. **Dry-run (recommended first):** - ```bash - /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --dry-run - ``` - Shows what would happen without making changes. - -2. 
**Perform rotation:** - ```bash - /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh - ``` - - **What happens:** - - Generates new Ed25519 keypair in `/var/lib/pulse-sensor-proxy/ssh.d/next/` - - Pushes new key to all cluster nodes (via RPC `ensure_cluster_keys`) - - Verifies SSH connectivity with new key on each node - - Atomically swaps keys: - - Current `/ssh/` → `/ssh.d/prev/` (backup) - - Staging `/ssh.d/next/` → `/ssh/` (active) - - Old keys preserved in `/ssh.d/prev/` for rollback - -3. **If rotation fails, rollback:** - ```bash - /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --rollback - ``` - - Restores previous keypair from `/ssh.d/prev/` and re-pushes to cluster nodes. - -**Post-rotation verification:** -```bash -# Check new key timestamp -stat /var/lib/pulse-sensor-proxy/ssh/id_ed25519 - -# Verify all nodes have new key -for node in pve1 pve2 pve3; do - echo "=== $node ===" - ssh root@$node "grep pulse-sensor-proxy /root/.ssh/authorized_keys | tail -1" -done - -# Test temperature fetch via proxy -curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \ - -d '{"correlation_id":"test","method":"get_temp","params":{"node":"pve1"}}' \ - | jq . -``` - -### Automated Rotation (Optional) - -**Create systemd timer:** - -`/etc/systemd/system/pulse-sensor-proxy-key-rotation.service`: -```ini -[Unit] -Description=Rotate pulse-sensor-proxy SSH keys -After=pulse-sensor-proxy.service -Requires=pulse-sensor-proxy.service - -[Service] -Type=oneshot -ExecStart=/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh -StandardOutput=journal -StandardError=journal -``` - -`/etc/systemd/system/pulse-sensor-proxy-key-rotation.timer`: -```ini -[Unit] -Description=Rotate pulse-sensor-proxy SSH keys every 90 days -Requires=pulse-sensor-proxy-key-rotation.service - -[Timer] -OnCalendar=quarterly -RandomizedDelaySec=1h -Persistent=true - -[Install] -WantedBy=timers.target -``` - -**Enable timer:** -```bash -systemctl daemon-reload -systemctl enable --now pulse-sensor-proxy-key-rotation.timer - -# Check next run -systemctl list-timers pulse-sensor-proxy-key-rotation.timer -``` - -## Monitoring & Auditing - -### Metrics Endpoint - -The proxy exposes Prometheus metrics on `127.0.0.1:9127` by default. 
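Because the exporter binds to loopback by default, it is worth confirming nothing has widened it to `0.0.0.0` before relying on it. A minimal check, assuming the default `127.0.0.1:9127` address and the `ss` tool from iproute2:

```bash
# Hedged sketch: confirm the metrics listener is loopback-only and responding.
ss -ltn | grep ':9127' || echo "metrics listener not found (disabled?)"
# Expect 127.0.0.1:9127 here; 0.0.0.0:9127 means the endpoint is reachable
# from other hosts and should be firewalled or reverted.

curl -fsS http://127.0.0.1:9127/metrics | head -n 5
```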
- -**Available metrics:** -- `pulse_proxy_rpc_requests_total{method, result}` - RPC request counter -- `pulse_proxy_rpc_latency_seconds{method}` - RPC handler latency histogram -- `pulse_proxy_ssh_requests_total{node, result}` - SSH request counter per node -- `pulse_proxy_ssh_latency_seconds{node}` - SSH latency histogram per node -- `pulse_proxy_queue_depth` - Concurrent RPC requests (gauge) -- `pulse_proxy_rate_limit_hits_total` - Rejected requests due to rate limiting -- `pulse_proxy_build_info{version}` - Build metadata - -**Configure metrics address:** - -In `/etc/default/pulse-sensor-proxy`: -```bash -# Listen on all interfaces (WARNING: exposes metrics externally) -PULSE_SENSOR_PROXY_METRICS_ADDR="0.0.0.0:9127" - -# Disable metrics -PULSE_SENSOR_PROXY_METRICS_ADDR="disabled" -``` - -**Test metrics endpoint:** -```bash -curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy -``` - -### Prometheus Integration - -**Sample scrape configuration:** - -```yaml -scrape_configs: - - job_name: 'pulse-sensor-proxy' - static_configs: - - targets: - - 'pve1:9127' - - 'pve2:9127' - - 'pve3:9127' - relabel_configs: - - source_labels: [__address__] - regex: '([^:]+):.+' - target_label: instance -``` - -### Alert Rules - -**Recommended Prometheus alerts:** - -```yaml -groups: - - name: pulse-sensor-proxy - rules: - # High SSH failure rate - - alert: PulseProxySSHFailureRate - expr: | - rate(pulse_proxy_ssh_requests_total{result="error"}[5m]) > 0.1 - for: 5m - labels: - severity: warning - annotations: - summary: "High SSH failure rate on {{ $labels.instance }}" - description: "{{ $value | humanize }} SSH requests/sec failing" - - # Rate limiting active - - alert: PulseProxyRateLimiting - expr: | - rate(pulse_proxy_rate_limit_hits_total[5m]) > 0 - for: 5m - labels: - severity: warning - annotations: - summary: "Rate limiting active on {{ $labels.instance }}" - description: "Proxy rejecting requests due to rate limits" - - # High queue depth - - alert: PulseProxyQueueDepth - expr: pulse_proxy_queue_depth > 5 - for: 5m - labels: - severity: warning - annotations: - summary: "High RPC queue depth on {{ $labels.instance }}" - description: "{{ $value }} concurrent requests (threshold: 5)" - - # Proxy down - - alert: PulseProxyDown - expr: up{job="pulse-sensor-proxy"} == 0 - for: 2m - labels: - severity: critical - annotations: - summary: "Pulse proxy down on {{ $labels.instance }}" -``` - -### Audit Logging - -**Log format:** -All RPC requests are logged with structured fields: -- `corr_id` - Correlation ID (UUID, tracks request lifecycle) -- `uid` / `pid` - Peer credentials from `SO_PEERCRED` -- `method` - RPC method called (`get_temp`, `register_nodes`, `ensure_cluster_keys`) - -**Example log entries:** -```json -{"level":"info","corr_id":"a7f3d..","uid":1001,"pid":12345,"method":"get_temp","node":"pve1","msg":"RPC request"} -{"level":"info","corr_id":"a7f3d..","node":"pve1","latency_ms":245,"msg":"Temperature fetch successful"} -``` - -**Query logs:** -```bash -# All RPC requests in last hour -journalctl -u pulse-sensor-proxy --since "1 hour ago" -o json | \ - jq -r 'select(.corr_id != null) | [.corr_id, .uid, .method, .node] | @tsv' - -# Failed SSH requests -journalctl -u pulse-sensor-proxy --since today | grep -E '(SSH.*failed|error)' - -# Rate limit hits -journalctl -u pulse-sensor-proxy --since today | grep "rate limit" - -# Specific correlation ID -journalctl -u pulse-sensor-proxy | grep "corr_id=a7f3d" -``` - -For rotation guidance, follow 
[operations/audit-log-rotation.md](operations/audit-log-rotation.md). After each rotation and proxy restart, verify the adaptive polling scheduler reports closed breakers and no DLQ entries for temperature pollers: - -```bash -curl -s http://localhost:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}' -``` - -### Rate Limiting - -**Current limits (per peer UID):** -- **Rate**: ~12 requests/minute (Go `rate.Every(5s)` token bucket) - > Allows short bursts of 2 requests; steady-state calls beyond 12/min are rejected. -- **Per-peer concurrency**: 2 simultaneous RPCs -- **Global concurrency**: 8 in-flight RPCs across all peers -- **Penalty**: 2 s enforced sleep when validation fails (payload too large, malformed JSON, unauthorized method) -- **Per-node guard**: only 1 SSH fetch per target node at a time (prevents hammering the same hypervisor) - -**Behavior on limit exceeded:** -- Request rejected immediately (no queuing) -- `pulse_proxy_rate_limit_hits_total` and `pulse_proxy_limiter_rejects_total{reason}` increment -- Audit log entry with `limiter.rejection`/reason code (`rate`, `peer_concurrency`, `global_concurrency`) -- Client receives an RPC error equivalent to HTTP 429 semantics - -**Adjust limits (advanced):** - -The defaults live in `cmd/pulse-sensor-proxy/throttle.go`. To customise them, edit and rebuild: -```go -const ( - defaultPerPeerBurst = 2 - defaultPerPeerConcurrency = 2 - defaultGlobalConcurrency = 8 -) - -var ( - defaultPerPeerRateInterval = 5 * time.Second // 12 req/min - defaultPenaltyDuration = 2 * time.Second -) -``` - -Then rebuild and restart: -```bash -go build -v ./cmd/pulse-sensor-proxy -systemctl restart pulse-sensor-proxy -``` - -## Incident Response - -### Suspected Compromise Checklist - -**If the proxy or host is suspected compromised:** - -1. **Isolate immediately:** - ```bash - # Stop proxy service - systemctl stop pulse-sensor-proxy - - # Block outbound SSH from host (if applicable) - iptables -A OUTPUT -p tcp --dport 22 -j REJECT - ``` - -2. **Rotate all keys:** - ```bash - # Remove compromised keys from all nodes - for node in pve1 pve2 pve3; do - ssh root@$node "sed -i '/pulse-sensor-proxy/d' /root/.ssh/authorized_keys" - done - - # Generate new keys (don't use rotation script - may be compromised) - rm -rf /var/lib/pulse-sensor-proxy/ssh* - mkdir -p /var/lib/pulse-sensor-proxy/ssh - ssh-keygen -t ed25519 -N '' -C "pulse-sensor-proxy emergency $(date -u +%Y%m%dT%H%M%SZ)" \ - -f /var/lib/pulse-sensor-proxy/ssh/id_ed25519 - chown -R pulse-sensor-proxy:pulse-sensor-proxy /var/lib/pulse-sensor-proxy/ssh - chmod 0700 /var/lib/pulse-sensor-proxy/ssh - chmod 0600 /var/lib/pulse-sensor-proxy/ssh/id_ed25519 - chmod 0640 /var/lib/pulse-sensor-proxy/ssh/id_ed25519.pub - ``` - -3. **Audit logs:** - ```bash - # Export all proxy logs - journalctl -u pulse-sensor-proxy --since "7 days ago" > /tmp/proxy-audit-$(date +%s).log - - # Look for anomalies: - # - Unusual correlation IDs - # - High rate limit hits - # - Unexpected UIDs/PIDs - # - SSH errors to unexpected nodes - ``` - -4. **Reinstall proxy:** - ```bash - # Re-run installation script - /opt/pulse/scripts/install-sensor-proxy.sh - - # Verify service status - systemctl status pulse-sensor-proxy - ``` - -5. **Re-push keys:** - ```bash - # Use proxy RPC to push new keys - /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh - ``` - -6. 
**Verify no persistence mechanisms:** - ```bash - # Check for unexpected systemd units - systemctl list-units --all | grep -i proxy - - # Check for unexpected cron jobs - crontab -l -u pulse-sensor-proxy - - # Check for unauthorized files in /var/lib/pulse-sensor-proxy - find /var/lib/pulse-sensor-proxy -type f ! -path '*/ssh/*' ! -path '*/ssh.d/*' - ``` - -### Post-Incident Hardening - -After an incident, consider: -- **Audit all LXC containers** for unexpected privilege escalation -- **Review bind mounts** on all containers (check for unauthorized mounts) -- **Enable full syscall auditing** (`auditd`) on host -- **Restrict network access** to proxy metrics endpoint (firewall `127.0.0.1:9127`) -- **Implement log aggregation** (forward `journald` to central SIEM) - -## Testing & Rollout - -### Development Testing - -Before deploying to production, verify the implementation with these safe tests: - -**1. Build Verification:** -```bash -# Compile proxy -cd /opt/pulse -go build -v ./cmd/pulse-sensor-proxy - -# Verify binary -./pulse-sensor-proxy version -# Expected: pulse-sensor-proxy dev (or version number) - -# Check help output -./pulse-sensor-proxy --help -``` - -**2. Rotation Script Syntax:** -```bash -# Syntax check -bash -n /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh - -# Help output -/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --help - -# Dry-run (requires root and socket) -sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --dry-run -``` - -**3. Configuration Validation:** -```bash -# Test config file parsing -cat > /tmp/test-config.yaml < /tmp/pulse-sensor-proxy-status-before.txt - ``` - -2. **Create service account:** - ```bash - # Run install script or manually create - if ! id -u pulse-sensor-proxy >/dev/null 2>&1; then - useradd --system --user-group --no-create-home --shell /usr/sbin/nologin pulse-sensor-proxy - fi - ``` - -3. **Update file ownership:** - ```bash - chown -R pulse-sensor-proxy:pulse-sensor-proxy /var/lib/pulse-sensor-proxy/ - chmod 0750 /var/lib/pulse-sensor-proxy/ - chmod 0700 /var/lib/pulse-sensor-proxy/ssh/ - chmod 0600 /var/lib/pulse-sensor-proxy/ssh/id_ed25519 - chmod 0640 /var/lib/pulse-sensor-proxy/ssh/id_ed25519.pub - ``` - -**Phase 2: Deploy Hardened Version** - -1. **Build and install binary:** - ```bash - cd /opt/pulse - go build -v -o /tmp/pulse-sensor-proxy ./cmd/pulse-sensor-proxy - - # Verify build - /tmp/pulse-sensor-proxy version - - # Install - sudo install -D -m 0755 -o root -g root /tmp/pulse-sensor-proxy /opt/pulse/sensor-proxy/bin/pulse-sensor-proxy - ``` - The installer and cleanup routines now expect the binary under `/opt/pulse/sensor-proxy/bin` to support read-only `/usr` mounts while keeping self-heal paths consistent. - -2. **Install hardened systemd unit:** - ```bash - # Copy hardened unit - sudo cp /opt/pulse/scripts/pulse-sensor-proxy.service /etc/systemd/system/ - - # Verify syntax - systemd-analyze verify /etc/systemd/system/pulse-sensor-proxy.service - - # Reload systemd - sudo systemctl daemon-reload - ``` - -3. **Update RuntimeDirectoryMode for LXC access:** - ```bash - # Ensure socket directory is accessible from container - sudo mkdir -p /etc/systemd/system/pulse-sensor-proxy.service.d/ - cat | sudo tee /etc/systemd/system/pulse-sensor-proxy.service.d/lxc-access.conf <<'EOF' -[Service] -RuntimeDirectoryMode=0775 -EOF - - sudo systemctl daemon-reload - ``` - -**Phase 3: Restart and Verify** - -1. 
**Restart service:** - ```bash - sudo systemctl restart pulse-sensor-proxy - - # Check status - sudo systemctl status pulse-sensor-proxy - ``` - -2. **Verify service user:** - ```bash - ps aux | grep pulse-sensor-proxy | grep -v grep - # Expected: pulse-sensor-proxy user, not root - ``` - -3. **Check socket permissions:** - ```bash - ls -ld /run/pulse-sensor-proxy/ - # Expected: drwxrwxr-x pulse-sensor-proxy pulse-sensor-proxy - - ls -l /run/pulse-sensor-proxy/pulse-sensor-proxy.sock - # Expected: srwxrwxrwx pulse-sensor-proxy pulse-sensor-proxy - ``` - -4. **Test from container:** - ```bash - # Inside LXC container running Pulse - ls -la /run/pulse-sensor-proxy/ - # Should show socket - - # Check Pulse logs for connection success - journalctl -u pulse -n 50 | grep -i temperature - ``` - -**Phase 4: End-to-End Validation** - -1. **Test RPC methods:** - ```bash - # On host, test socket connectivity - echo '{"correlation_id":"test-001","method":"register_nodes","params":{}}' | \ - sudo socat - UNIX-CONNECT:/run/pulse-sensor-proxy/pulse-sensor-proxy.sock | jq . - - # Should return cluster nodes list - ``` - -2. **Test temperature fetch:** - ```bash - # From container or via socket - echo '{"correlation_id":"test-002","method":"get_temp","params":{"node":"pve1"}}' | \ - socat - UNIX-CONNECT:/run/pulse-sensor-proxy/pulse-sensor-proxy.sock | jq . - - # Should return sensors JSON data - ``` - -3. **Verify metrics endpoint:** - ```bash - curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy - - # Should show metrics like: - # pulse_proxy_rpc_requests_total{method="get_temp",result="success"} N - # pulse_proxy_queue_depth 0 - ``` - -4. **Test SSH key rotation:** - ```bash - # Dry-run first - sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --dry-run - - # Full rotation (if confident) - sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh - - # Verify all nodes updated - for node in pve1 pve2 pve3; do - ssh root@$node "tail -1 /root/.ssh/authorized_keys" - done - ``` - -5. **Audit logging verification:** - ```bash - # Check logs include correlation IDs and peer credentials - sudo journalctl -u pulse-sensor-proxy --since "5 minutes ago" -o json | \ - jq -r 'select(.corr_id != null) | [.corr_id, .uid, .method] | @tsv' - - # Should show structured logging with UIDs - ``` - -**Phase 5: Monitoring Setup** - -1. **Configure Prometheus scraping:** - ```yaml - # Add to prometheus.yml - scrape_configs: - - job_name: 'pulse-sensor-proxy' - static_configs: - - targets: ['localhost:9127'] - ``` - -2. **Import alert rules:** - ```bash - # Copy alert rules from docs to Prometheus alerts directory - # Reload Prometheus configuration - ``` - -3. **Verify alerts fire (optional stress test):** - ```bash - # Generate rate limit hits (test alert) - for i in {1..50}; do - echo '{"correlation_id":"stress-'$i'","method":"register_nodes","params":{}}' | \ - socat - UNIX-CONNECT:/run/pulse-sensor-proxy/pulse-sensor-proxy.sock & - done - wait - - # Check rate limit metric increased - curl -s http://127.0.0.1:9127/metrics | grep rate_limit_hits - ``` - -### Rollback Procedure - -If issues occur during rollout: - -1. **Stop new service:** - ```bash - sudo systemctl stop pulse-sensor-proxy - ``` - -2. **Restore backup:** - ```bash - sudo cp /etc/systemd/system/pulse-sensor-proxy.service.backup \ - /etc/systemd/system/pulse-sensor-proxy.service - sudo systemctl daemon-reload - ``` - -3. 
**Restore SSH keys (if rotated):** - ```bash - # If rotation was performed and failed - sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --rollback - ``` - -4. **Restart with old configuration:** - ```bash - sudo systemctl restart pulse-sensor-proxy - sudo systemctl status pulse-sensor-proxy - ``` - -5. **Verify Pulse connectivity:** - ```bash - # Check Pulse can still fetch temperatures - # Monitor Pulse logs - ``` - -### Known Limitations - -- **No automated unit tests**: Code verification relies on build success and manual testing -- **Key rotation requires manual trigger**: Automated timer setup is optional -- **Metrics require Prometheus**: No built-in alerting without external monitoring -- **LXC bind mount required**: Container must have directory-level bind mount configured -- **Root required for rotation script**: Script needs root to run `ensure_cluster_keys` RPC - -### Future Improvements - -- Add Go unit tests for validation, throttling, and metrics logic -- Implement health check endpoint (e.g., `/health`) separate from metrics -- Add support for TLS on metrics endpoint -- Create automated integration test suite -- Add `--check` flag to rotation script for pre-flight validation -- Support for multiple LXC containers accessing same proxy instance - -## Appendix - -### Quick Verification Checklist - -**Host:** -- [ ] Service running as `pulse-sensor-proxy` user (not root) -- [ ] Keys in `/var/lib/pulse-sensor-proxy/ssh/` owned by `pulse-sensor-proxy:pulse-sensor-proxy` -- [ ] Private key permissions: `0600` -- [ ] Socket directory permissions: `0775` (not `0770`) -- [ ] Metrics endpoint accessible: `curl http://127.0.0.1:9127/metrics` - -**Container:** -- [ ] Container is unprivileged (`unprivileged: 1` in config) -- [ ] Bind mount exists: `ls /mnt/pulse-proxy/pulse-sensor-proxy.sock` -- [ ] AppArmor enforced: `cat /proc/self/attr/current` shows confinement -- [ ] Pulse can connect to socket (check Pulse logs) - -**SSH Keys:** -- [ ] All nodes have `pulse-sensor-proxy` key in `/root/.ssh/authorized_keys` -- [ ] Keys include `from="..."` restrictions -- [ ] Keys include `command="sensors -j"` forced command -- [ ] Keys include `no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty` - -**Monitoring:** -- [ ] Prometheus scraping metrics successfully -- [ ] Alerts configured for SSH failures, rate limiting, queue depth -- [ ] Logs forwarded to central logging (optional but recommended) - -### Reference Commands - -**Service Management:** -```bash -systemctl status pulse-sensor-proxy # Check service status -systemctl restart pulse-sensor-proxy # Restart service -journalctl -u pulse-sensor-proxy -f # Tail logs -``` - -**Key Management:** -```bash -/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --dry-run # Dry-run rotation -/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh # Perform rotation -/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --rollback # Rollback -``` - -**Metrics:** -```bash -curl http://127.0.0.1:9127/metrics # Fetch all metrics -curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy # Filter proxy metrics -``` - -**Manual RPC (Testing):** -```bash -# Using socat (inline JSON) -echo '{"correlation_id":"test","method":"get_temp","params":{"node":"pve1"}}' | \ - socat - UNIX-CONNECT:/run/pulse-sensor-proxy/pulse-sensor-proxy.sock - -# Using Python (proper JSON-RPC client) -python3 <<'PY' -import json, socket, uuid -payload = { - "correlation_id": str(uuid.uuid4()), - "method": "get_temp", - "params": {"node": "pve1"} -} -with 
socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s: - s.connect("/run/pulse-sensor-proxy/pulse-sensor-proxy.sock") - s.sendall((json.dumps(payload) + "\n").encode()) - s.shutdown(socket.SHUT_WR) - print(s.recv(65536).decode()) -PY -``` - -**Verification:** -```bash -# Check service user -ps aux | grep pulse-sensor-proxy | grep -v grep - -# Check file ownership -ls -lR /var/lib/pulse-sensor-proxy/ - -# Check bind mount in container -pct enter -ls -la /run/pulse-sensor-proxy/ - -# Check SSH keys on nodes -for node in pve1 pve2 pve3; do - echo "=== $node ===" - ssh root@$node "grep pulse-sensor-proxy /root/.ssh/authorized_keys" -done -``` - ---- - -**Document Version:** 1.0 -**Last Updated:** 2025-10-13 -**Applies To:** pulse-sensor-proxy v1.0+ diff --git a/docs/README.md b/docs/README.md index a53705167..a82c0d37f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -20,7 +20,6 @@ Welcome to the Pulse documentation portal. Here you'll find everything you need - **[Docker Guide](DOCKER.md)** – Advanced Docker & Compose configurations. - **[Kubernetes](KUBERNETES.md)** – Helm charts, ingress, and HA setups. - **[Reverse Proxy](REVERSE_PROXY.md)** – Nginx, Caddy, Traefik, and Cloudflare Tunnel recipes. -- **[Port Configuration](PORT_CONFIGURATION.md)** – Changing default ports. - **[Troubleshooting](TROUBLESHOOTING.md)** – Deep dive into common issues and logs. ## 🔐 Security diff --git a/docs/TEMPERATURE_MONITORING.md b/docs/TEMPERATURE_MONITORING.md index ecb605c52..5e3d7a4ad 100644 --- a/docs/TEMPERATURE_MONITORING.md +++ b/docs/TEMPERATURE_MONITORING.md @@ -324,7 +324,7 @@ journalctl -u pulse-sensor-proxy -f ``` Forward these logs off-host for retention by following -[operations/sensor-proxy-log-forwarding.md](operations/sensor-proxy-log-forwarding.md). +[operations/SENSOR_PROXY_LOGS.md](operations/SENSOR_PROXY_LOGS.md). In the Pulse container, check the logs at startup: ```bash @@ -718,7 +718,7 @@ pulse-sensor-proxy config set-allowed-nodes --replace --merge 192.168.0.1 - Installer uses CLI (no more shell/Python divergence) **See also:** -- [Sensor Proxy Config Management Guide](operations/sensor-proxy-config-management.md) - Complete runbook +- [Sensor Proxy Config Management Guide](operations/SENSOR_PROXY_CONFIG.md) - Complete runbook - [Sensor Proxy CLI Reference](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Full command documentation ## Control-Plane Sync & Migration diff --git a/docs/TEMPERATURE_MONITORING_SECURITY.md b/docs/TEMPERATURE_MONITORING_SECURITY.md deleted file mode 100644 index 6bf568424..000000000 --- a/docs/TEMPERATURE_MONITORING_SECURITY.md +++ /dev/null @@ -1,499 +0,0 @@ -# Temperature Monitoring Security Guide - -This document describes the security architecture of Pulse's temperature monitoring system with pulse-sensor-proxy. - -## Table of Contents -- [Architecture Overview](#architecture-overview) -- [Security Boundaries](#security-boundaries) -- [Authentication & Authorization](#authentication--authorization) -- [Rate Limiting](#rate-limiting) -- [SSH Security](#ssh-security) -- [Container Isolation](#container-isolation) -- [Monitoring & Alerting](#monitoring--alerting) -- [Development Mode](#development-mode) -- [Troubleshooting](#troubleshooting) - ---- - -## Architecture Overview - -```mermaid -graph TD - Container[Pulse Container] - Proxy[pulse-sensor-proxy
Host Service] - Cluster[Cluster Nodes
SSH sensors -j] - - Container -->|Unix Socket
Rate Limited| Proxy - Proxy -->|SSH
Forced Command| Cluster - Cluster -->|Temperature JSON| Proxy - Proxy -->|Temperature JSON| Container - - style Proxy fill:#e1f5e1 - style Container fill:#fff4e1 - style Cluster fill:#e1f0ff -``` - -**Key Principle**: SSH keys never enter containers. All SSH operations are performed by the host-side proxy. - ---- - -## Security Boundaries - -### 1. Host ↔ Container Boundary -- **Enforced by**: Method-level authorization + ID-mapped root detection -- **Container CAN**: - - ✅ Call `get_temperature` (read temperature data) - - ✅ Call `get_status` (check proxy health) -- **Container CANNOT**: - - ❌ Call `ensure_cluster_keys` (SSH key distribution) - - ❌ Call `register_nodes` (node discovery) - - ❌ Call `request_cleanup` (cleanup operations) - - ❌ Use direct SSH (blocked by container detection) - -### 2. Proxy ↔ Cluster Nodes Boundary -- **Enforced by**: SSH forced commands + IP filtering -- **SSH authorized_keys entry**: -```bash -from="192.168.0.0/24",command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... pulse-sensor-proxy -``` -- Proxy can ONLY run `sensors -j` on cluster nodes -- IP restrictions prevent lateral movement - -### 3. Client ↔ Proxy Boundary -- **Enforced by**: UID-based ACL + adaptive rate limiting -- SO_PEERCRED verifies caller's UID/GID/PID -- Rate limiting (defaults): ~12 requests per minute per UID (burst 2), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures -- Per-node guard: only 1 SSH fetch per node at a time - ---- - -## Authentication & Authorization - -### Authentication (Who can connect?) - -**Allowed UIDs**: -- Root (UID 0) - host processes -- Proxy's own UID (pulse-sensor-proxy user) -- Configured UIDs from `/etc/pulse-sensor-proxy/config.yaml` -- ID-mapped root ranges (containers, if enabled) - -**ID-Mapped Root Detection**: -- Reads `/etc/subuid` and `/etc/subgid` for UID/GID mapping ranges -- Containers typically use ranges like `100000-165535` -- Both UID AND GID must be in mapped ranges - -### Authorization (What can they call?) - -**Privileged Methods** (host-only): -```go -var privilegedMethods = map[string]bool{ - "ensure_cluster_keys": true, // SSH key distribution - "register_nodes": true, // Node registration - "request_cleanup": true, // Cleanup operations -} -``` - -**Authorization Check**: -```go -if privilegedMethods[method] && isIDMappedRoot(credentials) { - return "method requires host-level privileges" -} -``` - -**Read-Only Methods** (containers allowed): -- `get_temperature` - Fetch temperature data via proxy -- `get_status` - Check proxy health and version - ---- - -## Rate Limiting - -### Per-Peer Limits (commit 46b8b8d) - -- **Rate:** 1 request per second (`per_peer_interval_ms = 1000`) -- **Burst:** 5 requests (enough to sweep five nodes per polling window) -- **Per-peer concurrency:** Maximum 2 concurrent RPCs -- **Global concurrency:** 8 simultaneous RPCs across all peers -- **Penalty:** 2 s enforced delay on validation failures (oversized payloads, unauthorized methods) -- **Cleanup:** Peer entries expire after 10 minutes of inactivity - -### Configurable Overrides - -Administrators can raise or lower thresholds via `/etc/pulse-sensor-proxy/config.yaml`: - -```yaml -rate_limit: - per_peer_interval_ms: 500 # 2 rps - per_peer_burst: 10 # allow 10-node sweep -``` - -Security guidance: -- Keep `per_peer_interval_ms ≥ 100` in production; lower values expand the attack surface for noisy callers. 
-- Ensure UID/GID filters stay in place when increasing throughput, and continue to ship audit logs off-host. -- Monitor `pulse_proxy_limiter_penalties_total` alongside `pulse_proxy_limiter_rejects_total` to spot abusive or compromised clients. - -### Per-Node Concurrency -- **Limit**: 1 concurrent SSH request per node -- **Purpose**: Prevents SSH connection storms -- **Scope**: Applies to all peers requesting same node - -### Monitoring Rate Limits -```bash -# Check rate limit metrics -curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_limiter_rejects_total - -# Watch for rate limit warnings in logs -journalctl -u pulse-sensor-proxy -f | grep "Rate limit exceeded" -``` - ---- - -## SSH Security - -### SSH Key Management - -**Key Location**: `/var/lib/pulse-sensor-proxy/ssh/id_ed25519` -- **Owner**: `pulse-sensor-proxy:pulse-sensor-proxy` -- **Permissions**: `0600` (read/write for owner only) -- **Type**: Ed25519 (modern, secure) - -**Key Distribution**: -- Only host processes can trigger distribution (via `ensure_cluster_keys`) -- Containers are blocked from key distribution operations -- Keys are distributed with forced commands and IP restrictions - -### Forced Command Restrictions - -**On cluster nodes**, the SSH key can ONLY run: -```bash -sensors -j -``` - -**No other commands possible**: -- ❌ Shell access denied (`no-pty`) -- ❌ Port forwarding disabled (`no-port-forwarding`) -- ❌ X11 forwarding disabled (`no-X11-forwarding`) -- ❌ Agent forwarding disabled (`no-agent-forwarding`) - -### IP Filtering - -**Source IP restrictions**: -```bash -from="192.168.0.0/24,10.0.0.0/8" -``` -- Automatically detected from cluster node IPs -- Prevents SSH key use from outside the cluster -- Updated during key rotation - ---- - -## Container Isolation - -### Fallback SSH Protection - -**In containers**, direct SSH is blocked: -```go -if system.InContainer() && !devModeAllowSSH { - log.Error().Msg("SECURITY BLOCK: SSH temperature collection disabled in containers") - return &Temperature{Available: false}, nil -} -``` - -**Container Detection Methods**: -1. `PULSE_FORCE_CONTAINER=1` override for explicit opt-in -2. Presence of `/.dockerenv` or `/run/.containerenv` -3. `container=` hints from environment variables -4. `/proc/1/environ` and `/proc/1/cgroup` markers (`docker`, `lxc`, `containerd`, `kubepods`, etc.) - -**Bypass**: Only possible with explicit environment variable (see [Development Mode](#development-mode)) - -### ID-Mapped Root Detection - -**How it works**: -```go -// Check /etc/subuid and /etc/subgid for mapping ranges -// Example /etc/subuid: -// root:100000:65536 - -func isIDMappedRoot(cred *peerCredentials) bool { - return uidInRange(cred.uid, idMappedUIDRanges) && - gidInRange(cred.gid, idMappedGIDRanges) -} -``` - -**Why both UID and GID?**: -- Container root: `uid=100000, gid=100000` → ID-mapped -- Container app user: `uid=101001, gid=101001` → ID-mapped -- Host root: `uid=0, gid=0` → NOT ID-mapped -- Mixed: `uid=100000, gid=50` → NOT ID-mapped (fails check) - ---- - -## Monitoring & Alerting - -### Log Locations - -**Proxy logs**: -```bash -journalctl -u pulse-sensor-proxy -f -``` - -**Backend logs** (inside container): -```bash -journalctl -u pulse-backend -f -``` - -Want off-host retention? Forward `audit.log` and `proxy.log` using -[`scripts/setup-log-forwarding.sh`](operations/sensor-proxy-log-forwarding.md) -so events land in your SIEM with RELP + TLS. 
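Even with forwarding in place, a quick local triage can be run straight from journald. The grep patterns below are a sketch that mirrors the security-relevant log messages quoted later in this guide and may need adjusting if wording changes between releases:

```bash
# Hedged sketch: surface security-relevant proxy events from the last 24 hours.
journalctl -u pulse-sensor-proxy --since "24 hours ago" \
  | grep -E 'SECURITY|Rate limit exceeded|authorization failed' \
  || echo "no security-relevant events in the last 24 hours"
```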
- -**Audit rotation**: Use the steps in [operations/audit-log-rotation.md](operations/audit-log-rotation.md) to rotate `/var/log/pulse/sensor-proxy/audit.log`. After each rotation, restart the proxy and confirm temperature pollers are healthy in `/api/monitoring/scheduler/health` (closed breakers, no DLQ entries). - -### Security Events to Monitor - -#### 1. Privileged Method Denials -``` -SECURITY: Container attempted to call privileged method - access denied -method=ensure_cluster_keys uid=101000 gid=101000 pid=12345 -``` - -**Alert on**: Any occurrence (indicates attempted privilege escalation) - -#### 2. Rate Limit Violations -``` -Rate limit exceeded uid=101000 pid=12345 -``` - -**Alert on**: Sustained violations (>10/minute indicates possible abuse) - -#### 3. Authorization Failures -``` -Peer authorization failed uid=50000 gid=50000 -``` - -**Alert on**: Repeated failures from same UID (indicates misconfiguration or probing) - -#### 4. SSH Fallback Attempts -``` -SECURITY BLOCK: SSH temperature collection disabled in containers -``` - -**Alert on**: Any occurrence (should only happen during misconfigurations) - -### Metrics to Track - -```bash -# Rate limit hits -pulse_proxy_rate_limit_hits_total - -# RPC requests by method and result -pulse_proxy_rpc_requests_total{method="get_temperature",result="success"} -pulse_proxy_rpc_requests_total{method="ensure_cluster_keys",result="unauthorized"} - -# SSH request latency -pulse_proxy_ssh_latency_seconds{node="example-node"} - -# Active connections -pulse_proxy_queue_depth -pulse_proxy_global_concurrency_inflight -``` - -### Recommended Alerts - -1. **Privilege Escalation Attempts**: - ``` - pulse_proxy_rpc_requests_total{result="unauthorized"} > 0 - ``` - -2. **Rate Limit Abuse**: - ``` - rate(pulse_proxy_rate_limit_hits_total[5m]) > 1 - ``` - -3. **Proxy Unavailable**: - ``` - up{job="pulse-sensor-proxy"} == 0 - ``` - -4. **Scheduler Drift** (Pulse side – ensures temperature pollers stay healthy): - ``` - max_over_time(pulse_monitor_poll_queue_depth[5m]) > - ``` - Pair with a check of `/api/monitoring/scheduler/health` to confirm temperature instances report `breaker.state == "closed"`. - ---- - -## Development Mode - -### SSH Fallback Override - -**Purpose**: Allow direct SSH from containers during development/testing - -**Environment Variable**: -```bash -export PULSE_DEV_ALLOW_CONTAINER_SSH=true -``` - -**Security Implications**: -- ⚠️ **NEVER use in production** -- Allows container to use SSH keys if present -- Defeats the security isolation model -- Should only be used in trusted development environments - -**Example Usage**: -```bash -# In systemd override for pulse-backend -mkdir -p /etc/systemd/system/pulse-backend.service.d -cat < /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf -[Service] -Environment=PULSE_DEV_ALLOW_CONTAINER_SSH=true -EOF -systemctl daemon-reload -systemctl restart pulse-backend -``` - -**Monitoring**: -```bash -# Check if dev mode is active -journalctl -u pulse-backend | grep "dev mode" | tail -1 -``` - -**Disable dev mode**: -```bash -rm /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf -systemctl daemon-reload -systemctl restart pulse-backend -``` - ---- - -## Troubleshooting - -### "method requires host-level privileges" - -**Symptom**: Container gets this error when calling RPC - -**Cause**: Container attempted to call privileged method - -**Resolution**: This is expected behavior. 
Only these methods are restricted: -- `ensure_cluster_keys` -- `register_nodes` -- `request_cleanup` - -**If host process is blocked**: -1. Check UID is not in ID-mapped range: - ```bash - id - cat /etc/subuid /etc/subgid - ``` - -2. Verify proxy's allowed UIDs: - ```bash - cat /etc/pulse-sensor-proxy/config.yaml - ``` - -### "Rate limit exceeded" - -**Symptom**: Requests failing with rate limit error - -**Cause**: Peer exceeded ~12 requests/minute (or exhausted per-peer/global concurrency) - -**Resolution**: -1. Confirm workload is legitimate (look for retry loops or aggressive polling). -2. Allow the limiter to recover—penalty sleeps clear in ~2 s and idle peers expire after 10 minutes. -3. If sustained higher throughput is required, adjust the constants in `cmd/pulse-sensor-proxy/throttle.go` and rebuild. - -### Temperature monitoring unavailable - -**Symptom**: No temperature data in dashboard - -**Diagnosis**: -```bash -# 1. Check proxy is running -systemctl status pulse-sensor-proxy - -# 2. Check socket exists -ls -la /run/pulse-sensor-proxy/ - -# 3. Check socket is accessible in container -ls -la /mnt/pulse-proxy/ - -# 4. Test proxy from host -curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \ - -X POST -d '{"method":"get_status"}' | jq - -# 5. Check SSH connectivity -ssh root@example-node "sensors -j" - -# 6. Inspect adaptive polling for temperature pollers -curl -s http://localhost:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present, lastSuccess: .pollStatus.lastSuccess}' -``` - -### SSH key not distributed - -**Symptom**: Manual `ensure_cluster_keys` call fails - -**Check**: -1. Are you calling from host (not container)? -2. Is pvecm available? `command -v pvecm` -3. Can you reach cluster nodes? `pvecm status` -4. Check proxy logs: `journalctl -u pulse-sensor-proxy -f` - ---- - -## Best Practices - -### Production Deployments - -1. ✅ **Never use dev mode** (`PULSE_DEV_ALLOW_CONTAINER_SSH=true`) -2. ✅ **Monitor security logs** for unauthorized access attempts -3. ✅ **Use IP filtering** on SSH authorized_keys entries -4. ✅ **Rotate SSH keys** periodically (use `ensure_cluster_keys` with rotation) -5. ✅ **Limit allowed_peer_uids** to minimum necessary -6. ✅ **Enable audit logging** for privileged operations - -### Development Environments - -1. ✅ Use dev mode SSH override if needed (document why) -2. ✅ Test with actual ID-mapped containers -3. ✅ Verify privileged method blocking works -4. ✅ Test rate limiting under load - -### Incident Response - -**If container compromise suspected**: - -1. Check for privileged method attempts: - ```bash - journalctl -u pulse-sensor-proxy | grep "SECURITY:" - ``` - -2. Check rate limit violations: - ```bash - journalctl -u pulse-sensor-proxy | grep "Rate limit" - ``` - -3. Restart proxy to clear state: - ```bash - systemctl restart pulse-sensor-proxy - ``` - -4. 
Consider rotating SSH keys: - ```bash - # From host, call ensure_cluster_keys with new key - ``` - ---- - -## References - -- [Pulse Installation Guide](../README.md) -- [pulse-sensor-proxy Configuration](../cmd/pulse-sensor-proxy/README.md) -- [Security Audit Results](../SECURITY.md) -- [LXC ID Mapping Documentation](https://linuxcontainers.org/lxc/manpages/man5/lxc.container.conf.5.html#lbAJ) - ---- - -**Last Updated**: 2025-10-19 -**Security Contact**: File issues at https://github.com/rcourtman/Pulse/issues diff --git a/docs/api/SCHEDULER_HEALTH.md b/docs/api/SCHEDULER_HEALTH.md index 2b2332fb2..c6a841b0b 100644 --- a/docs/api/SCHEDULER_HEALTH.md +++ b/docs/api/SCHEDULER_HEALTH.md @@ -1,134 +1,11 @@ -# Scheduler Health API +# 🩺 Scheduler Health API -Adaptive scheduler health endpoint +**Endpoint**: `GET /api/monitoring/scheduler/health` +**Auth**: Required (Bearer token or Cookie) -Endpoint: `GET /api/monitoring/scheduler/health` +Returns a real-time snapshot of the adaptive scheduler, including queue state, circuit breakers, and dead-letter tasks. -Returns a snapshot of the adaptive polling scheduler, queue state, circuit breakers, and per-instance status. Requires authentication (session cookie or bearer token). - -**Key Features:** -- Real-time scheduler health monitoring -- Circuit breaker status per instance -- Dead-letter queue tracking (tasks that repeatedly fail) -- Per-instance staleness metrics -- No query parameters required -- Read-only endpoint (rate-limited under general 500 req/min bucket) - ---- - -## Request - -``` -GET /api/monitoring/scheduler/health -Authorization: Bearer -``` - -No query parameters are needed. - ---- - -## Response Overview - -```json -{ - "updatedAt": "2025-10-20T13:05:42Z", // RFC 3339 timestamp - "enabled": true, // Mirrors AdaptivePollingEnabled setting - "queue": {...}, - "deadLetter": {...}, - "breakers": [...], // legacy summary (for backward compatibility) - "staleness": [...], // legacy summary (for backward compatibility) - "instances": [ ... ] // authoritative per-instance view (v4.24.0+) -} -``` - -**Field Notes:** -- `updatedAt`: RFC 3339 timestamp of when this snapshot was generated -- `enabled`: Reflects the current `AdaptivePollingEnabled` system setting -- `breakers` and `staleness`: Legacy arrays maintained for backward compatibility; use `instances` for complete data -- `instances`: Authoritative source for per-instance health (v4.24.0+) - -### Queue Snapshot (`queue`) - -| Field | Type | Description | -|-------|------|-------------| -| `depth` | integer | Current queue size | -| `dueWithinSeconds` | integer | Items scheduled within the next 12 seconds | -| `perType` | object | Counts per instance type, e.g. `{"pve":4}` | - -### Dead-letter Snapshot (`deadLetter`) - -| Field | Type | Description | -|-------|------|-------------| -| `count` | integer | Total items in the dead-letter queue | -| `tasks` | array | **Limited to 25 entries** for performance. Each task includes `instance`, `type`, `nextRun`, `lastError`, and `failures` count. For complete per-instance DLQ data, use `instances[].deadLetter` | - -**Note:** The top-level `deadLetter.tasks` array is capped at 25 items to prevent large responses. Use the `instances` array for exhaustive coverage. - -### Instances (`instances`) - -Each element gives a complete view of one instance. - -| Field | Type | Description | -|-------|------|-------------| -| `key` | string | Unique key `type::name` | -| `type` | string | Instance type (`pve`, `pbs`, `pmg`, etc.) 
| -| `displayName` | string | Friendly name (falls back to host/name) | -| `instance` | string | Raw instance identifier | -| `connection` | string | Connection URL or host | -| `pollStatus` | object | Recent poll outcomes | -| `breaker` | object | Circuit breaker state | -| `deadLetter` | object | Dead-letter insight for this instance | - -#### Poll Status (`pollStatus`) - -| Field | Type | Description | -|-------|------|-------------| -| `lastSuccess` | timestamp nullable | RFC 3339 timestamp of most recent successful poll | -| `lastError` | object nullable | `{ at, message, category }` where `at` is RFC 3339, `message` describes the error, and `category` is `transient` (network issues, timeouts) or `permanent` (auth failures, invalid config) | -| `consecutiveFailures` | integer | Current failure streak length (resets on successful poll) | -| `firstFailureAt` | timestamp nullable | RFC 3339 timestamp when the current failure streak began. Useful for calculating failure duration | - -**Timing Metadata (v4.24.0+):** -- `firstFailureAt`: Tracks when a failure streak started, enabling "failing for X minutes" calculations -- Resets to `null` when a successful poll occurs -- Combine with `consecutiveFailures` to assess severity - -#### Breaker (`breaker`) - -| Field | Type | Description | -|-------|------|-------------| -| `state` | string | `closed` (healthy), `open` (failing), `half_open` (testing recovery), or `unknown` (not initialized) | -| `since` | timestamp nullable | RFC 3339 timestamp when the current state began. Use to calculate how long a breaker has been open | -| `lastTransition` | timestamp nullable | RFC 3339 timestamp of the most recent state change (e.g., closed → open) | -| `retryAt` | timestamp nullable | RFC 3339 timestamp of next scheduled retry attempt when breaker is open or half-open | -| `failureCount` | integer | Number of failures in the current breaker cycle. Resets when breaker closes | - -**Circuit Breaker Timing (v4.24.0+):** -- `since`: When did the current state start? (e.g., "breaker has been open for 5 minutes") -- `lastTransition`: When was the last state change? (useful for detecting flapping) -- `retryAt`: When will the next retry attempt occur? (for open/half-open states) -- `failureCount`: How many failures have accumulated? (triggers state transitions) - -**State Transitions:** -- `closed` → `open`: Triggered after N failures (default: 5) -- `open` → `half_open`: After timeout period, allows one test request -- `half_open` → `closed`: If test request succeeds -- `half_open` → `open`: If test request fails - -#### Dead-letter (`deadLetter`) - -| Field | Type | Description | -|-------|------|-------------| -| `present` | boolean | `true` if instance is in the DLQ | -| `reason` | string | `max_retry_attempts` or `permanent_failure` | -| `firstAttempt` | timestamp nullable | First time the instance hit DLQ | -| `lastAttempt` | timestamp nullable | Most recent DLQ enqueue | -| `retryCount` | integer | Number of DLQ attempts | -| `nextRetry` | timestamp nullable | Next scheduled retry time | - ---- - -## Example Response +## 📦 Response Format ```json { @@ -137,44 +14,13 @@ Each element gives a complete view of one instance. 
"queue": { "depth": 7, "dueWithinSeconds": 2, - "perType": { "pve": 4, "pbs": 2, "pmg": 1 } + "perType": { "pve": 4, "pbs": 2 } }, - "deadLetter": { - "count": 1, - "tasks": [ - { - "instance": "pbs-b", - "type": "pbs", - "nextRun": "2025-10-20T13:30:00Z", - "lastError": "401 unauthorized", - "failures": 5 - } - ] - }, - "breakers": [ - { - "instance": "pve-a", - "type": "pve", - "state": "half_open", - "failures": 3, - "retryAt": "2025-10-20T13:06:15Z" - } - ], - "staleness": [ - { - "instance": "pve-a", - "type": "pve", - "score": 0.42, - "lastSuccess": "2025-10-20T13:05:10Z", - "lastError": "2025-10-20T13:05:40Z" - } - ], "instances": [ { "key": "pve::pve-a", "type": "pve", "displayName": "Pulse PVE Cluster", - "instance": "pve-a", "connection": "https://pve-a:8006", "pollStatus": { "lastSuccess": "2025-10-20T13:05:10Z", @@ -187,133 +33,50 @@ Each element gives a complete view of one instance. "firstFailureAt": "2025-10-20T13:05:20Z" }, "breaker": { - "state": "half_open", - "since": "2025-10-20T13:05:40Z", - "lastTransition": "2025-10-20T13:05:40Z", + "state": "half_open", // closed, open, half_open "retryAt": "2025-10-20T13:06:15Z", "failureCount": 3 }, "deadLetter": { "present": false } - }, - { - "key": "pbs::pbs-b", - "type": "pbs", - "displayName": "Backup PBS", - "instance": "pbs-b", - "connection": "https://pbs-b:8007", - "pollStatus": { - "lastSuccess": "2025-10-20T12:55:00Z", - "lastError": { - "at": "2025-10-20T13:00:01Z", - "message": "401 unauthorized", - "category": "permanent" - }, - "consecutiveFailures": 5, - "firstFailureAt": "2025-10-20T12:58:30Z" - }, - "breaker": { - "state": "open", - "since": "2025-10-20T13:00:01Z", - "lastTransition": "2025-10-20T13:00:01Z", - "retryAt": "2025-10-20T13:02:01Z", - "failureCount": 5 - }, - "deadLetter": { - "present": true, - "reason": "max_retry_attempts", - "firstAttempt": "2025-10-20T12:58:30Z", - "lastAttempt": "2025-10-20T13:00:01Z", - "retryCount": 5, - "nextRetry": "2025-10-20T13:30:00Z" - } } ] } ``` ---- +## 🔍 Key Fields -## Useful `jq` Queries +### Instances (`instances`) +The authoritative source for per-instance health. -### Instances with recent errors +* **`pollStatus`**: + * `lastSuccess`: Timestamp of last successful poll. + * `lastError`: Details of the last error (message, category). + * `consecutiveFailures`: Current failure streak. +* **`breaker`**: + * `state`: `closed` (healthy), `open` (failing), `half_open` (recovering). + * `retryAt`: Next retry time if open/half-open. +* **`deadLetter`**: + * `present`: `true` if the instance is in the DLQ (stopped polling). + * `reason`: Why it was moved to DLQ (e.g., `permanent_failure`). 
-``` -curl -s http://HOST:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}' +## 🛠️ Common Queries (jq) + +**Find Failing Instances:** +```bash +curl -s http://HOST:7655/api/monitoring/scheduler/health | \ +jq '.instances[] | select(.pollStatus.consecutiveFailures > 0) | {key, failures: .pollStatus.consecutiveFailures}' ``` -### Current dead-letter queue entries - -``` -curl -s http://HOST:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}' +**Check Dead Letter Queue:** +```bash +curl -s http://HOST:7655/api/monitoring/scheduler/health | \ +jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason}' ``` -### Breakers not closed - +**Find Open Breakers:** +```bash +curl -s http://HOST:7655/api/monitoring/scheduler/health | \ +jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state}' ``` -curl -s http://HOST:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}' -``` - -### Stale instances (score > 0.5) - -``` -curl -s http://HOST:7655/api/monitoring/scheduler/health \ - | jq '.staleness[] | select(.score > 0.5)' -``` - -### Instances sorted by failure streak - -``` -curl -s http://HOST:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.pollStatus.consecutiveFailures > 0) | {key, failures: .pollStatus.consecutiveFailures}' -``` - ---- - -## Migration Notes - -| Legacy Field | Status | Replacement | -|--------------|--------|-------------| -| `breakers` array | retains summary | use `instances[].breaker` for detailed view | -| `deadLetter.tasks` | retains summary | use `instances[].deadLetter` for per-instance enrichment | -| `staleness` array | unchanged | combined with `pollStatus.lastSuccess` gives precise timestamps | - -The `instances` array centralizes per-instance telemetry; existing integrations can migrate at their own pace. - ---- - -## Operational Notes - -**v4.24.0 Behavior:** -- **Read-only endpoint**: This endpoint is informational only and does not modify scheduler state -- **Rate limiting**: Falls under the general API limit (500 requests/minute per IP) -- **Authentication required**: Must provide valid session cookie or API token -- **Adaptive polling disabled**: When adaptive polling is disabled (`enabled: false`), the response includes empty `breakers`, `staleness`, and `instances` arrays -- **Real-time data**: Reflects current scheduler state; not historical (for trends, use metrics/logs) -- **No query parameters**: Returns complete snapshot on every request -- **Automatic adjustments**: The `enabled` field automatically reflects the `AdaptivePollingEnabled` system setting - -**Use Cases:** -- **Monitoring dashboards**: Embed in Grafana/Prometheus for real-time scheduler health -- **Alerting**: Trigger alerts on open circuit breakers or high DLQ counts -- **Debugging**: Investigate why specific instances aren't polling successfully -- **Capacity planning**: Monitor queue depth trends to assess if polling intervals need adjustment - -**Breaking Changes:** -- **None**: v4.24.0 only adds fields; all existing consumers continue to work -- Consumers just gain access to richer metadata (`firstFailureAt`, breaker timestamps, DLQ retry windows) - ---- - -## Troubleshooting Examples - -1. 
**Transient outages:** look for `pollStatus.lastError.category == "transient"` to confirm network hiccups; check `breaker.retryAt` to see when retries resume. -2. **Permanent failures:** `deadLetter.present == true` with `reason == "permanent_failure"` indicates credential or configuration issues. -3. **Breaker stuck:** `breaker.state != "closed"` with `since` > 5 minutes suggests manual intervention or rollback. -4. **Staleness spike:** compare `pollStatus.lastSuccess` with `updatedAt` to estimate data age; cross-reference `staleness.score` for alert thresholds. - -Use Grafana dashboards for historical trends; the API complements dashboards by revealing instant state and precise failure context. diff --git a/docs/development/MOCK_MODE.md b/docs/development/MOCK_MODE.md index 2ac8d5962..5f09f6a94 100644 --- a/docs/development/MOCK_MODE.md +++ b/docs/development/MOCK_MODE.md @@ -1,111 +1,37 @@ -# Mock Mode Development Guide +# 🧪 Mock Mode Development -Pulse ships with a mock data pipeline so you can iterate on UI and backend -changes without touching real infrastructure. This guide collects everything you -need to know about running in mock mode during development. +Develop Pulse without real infrastructure using the mock data pipeline. ---- - -## Why Mock Mode? - -- Exercise dashboards, alert timelines, and charts with predictable sample data. -- Reproduce edge cases (offline nodes, noisy containers, backup failures) by - tweaking configuration values rather than waiting for production incidents. -- Swap between synthetic and live data without rebuilding services. - ---- - -## Starting the Dev Stack +## 🚀 Quick Start ```bash -# Launch backend + frontend with hot reload +# Start dev stack ./scripts/hot-dev.sh + +# Toggle mock mode +npm run mock:on # Enable +npm run mock:off # Disable +npm run mock:status # Check status ``` -The script exposes: -- Frontend: `http://localhost:7655` (Vite hot module reload) -- Backend API: `http://localhost:7656` +## ⚙️ Configuration +Edit `mock.env` (or `mock.env.local` for overrides): ---- +| Variable | Default | Description | +| :--- | :--- | :--- | +| `PULSE_MOCK_MODE` | `false` | Enable mock mode. | +| `PULSE_MOCK_NODES` | `7` | Number of synthetic nodes. | +| `PULSE_MOCK_VMS_PER_NODE` | `5` | VMs per node. | +| `PULSE_MOCK_LXCS_PER_NODE` | `8` | Containers per node. | +| `PULSE_MOCK_RANDOM_METRICS` | `true` | Jitter metrics. | +| `PULSE_MOCK_STOPPED_PERCENT` | `20` | % of offline guests. | -## Toggling Mock Data +## ℹ️ How it Works +* **Data**: Swaps `PULSE_DATA_DIR` to `/opt/pulse/tmp/mock-data`. +* **Restart**: Backend restarts automatically; Frontend hot-reloads. +* **Reset**: To regenerate data, delete `/opt/pulse/tmp/mock-data` and toggle mock mode on. -The npm helpers and `toggle-mock.sh` wrapper point the backend at different -`.env` files and restart the relevant services automatically. - -```bash -npm run mock:on # Enable mock mode -npm run mock:off # Return to real data -npm run mock:status # Display current state -npm run mock:edit # Open mock.env in $EDITOR -``` - -Equivalent shell invocations: - -```bash -./scripts/toggle-mock.sh on -./scripts/toggle-mock.sh off -./scripts/toggle-mock.sh status -``` - -When switching: -- `mock.env` (or `mock.env.local`) feeds configuration values to the backend. -- `PULSE_DATA_DIR` swaps between `/opt/pulse/tmp/mock-data` (synthetic) and - `/etc/pulse` (real data) so test credentials never mix with production ones. -- The backend process restarts; the frontend stays hot-reloading. 
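After toggling, a quick sanity check confirms the backend came back up against the synthetic data directory (paths and helpers as described above):

```bash
# Should report mock mode as enabled
npm run mock:status

# Fixtures are generated under the mock data directory on first start
ls /opt/pulse/tmp/mock-data
```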
- ---- - -## Customising Mock Fixtures - -`mock.env` exposes the knobs most developers care about: - -```bash -PULSE_MOCK_MODE=false # Enable/disable mock mode -PULSE_MOCK_NODES=7 # Number of synthetic nodes -PULSE_MOCK_VMS_PER_NODE=5 # Average VM count per node -PULSE_MOCK_LXCS_PER_NODE=8 # Average container count per node -PULSE_MOCK_RANDOM_METRICS=true # Toggle metric jitter -PULSE_MOCK_STOPPED_PERCENT=20 # Percentage of guests stopped/offline -PULSE_ALLOW_DOCKER_UPDATES=true # Treat Docker builds as update-capable (skips restart) -``` - -When `PULSE_ALLOW_DOCKER_UPDATES` (or `PULSE_MOCK_MODE`) is enabled the backend -exposes the full update flow inside containers, fakes the deployment type to -`mock`, and suppresses the automatic process exit that normally follows a -successful upgrade. This is what the Playwright update suite uses inside CI. - -Create `mock.env.local` for personal tweaks that should not be committed: - -```bash -cp mock.env mock.env.local -$EDITOR mock.env.local -``` - -The toggle script prioritises `.local` files, falling back to the shared -defaults when none are present. - ---- - -## Troubleshooting - -- **Backend did not restart:** flip mock mode off/on again (`npm run mock:off`, - then `npm run mock:on`) to force a reload. -- **Ports already in use:** confirm nothing else is listening on `7655`/`7656` - (`lsof -i :7655` / `lsof -i :7656`) and kill stray processes. -- **Data feels stale:** delete `/opt/pulse/tmp/mock-data` and toggle mock mode - back on to regenerate fixtures. - ---- - -## Limitations - -- Mock data focuses on happy-path flows; use real Proxmox/PBS environments - before shipping changes that touch API integrations. -- Webhook payloads are synthetically generated and omit provider-specific - quirks—test with real channels for production rollouts. -- Encrypt/decrypt flows still use the local crypto stack; do not treat mock mode - as a sandbox for experimenting with credential formats. - -For more advanced scenarios, inspect `scripts/hot-dev.sh` and the mock seeders -under `internal/mock` for additional entry points. +## ⚠️ Limitations +* **Happy Path**: Focuses on standard flows; use real infrastructure for complex edge cases. +* **Webhooks**: Synthetic payloads only. +* **Encryption**: Uses local crypto stack (not a sandbox for auth). diff --git a/docs/monitoring/ADAPTIVE_POLLING.md b/docs/monitoring/ADAPTIVE_POLLING.md index fe8a78627..c3144281e 100644 --- a/docs/monitoring/ADAPTIVE_POLLING.md +++ b/docs/monitoring/ADAPTIVE_POLLING.md @@ -1,187 +1,52 @@ -# Adaptive Polling Architecture +# 📉 Adaptive Polling -## Overview -Pulse uses an adaptive polling scheduler that adapts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets. +Pulse uses an adaptive scheduler to optimize polling based on instance health and activity. -```mermaid -flowchart LR - Scheduler[Scheduler] - Queue[Priority Queue
by NextRun] - Workers[Workers] +## 🧠 Architecture +* **Scheduler**: Calculates intervals based on health/staleness. +* **Priority Queue**: Min-heap keyed by `NextRun`. +* **Circuit Breaker**: Prevents hot loops on failing instances. +* **Backoff**: Exponential retry delays (5s to 5m). - Scheduler -->|schedule| Queue - Queue -->|dequeue| Workers - Workers -->|success| Scheduler - Workers -->|failure| CB[Circuit Breaker] - CB -->|backoff| Scheduler -``` +## ⚙️ Configuration +Adaptive polling is **enabled by default**. -- **Scheduler** computes `ScheduledTask` entries using adaptive intervals. -- **Task queue** is a min-heap keyed by `NextRun`; only due tasks execute. -- **Workers** execute tasks, capture outcomes, reschedule via scheduler or backoff logic. +### UI +**Settings → System → Monitoring**. -## Key Components +### Environment Variables +| Variable | Default | Description | +| :--- | :--- | :--- | +| `ADAPTIVE_POLLING_ENABLED` | `true` | Enable/disable. | +| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Healthy poll rate. | +| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Active/busy rate. | +| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Idle/backoff rate. | -| Component | File | Responsibility | -|-----------------------|-------------------------------------------|--------------------------------------------------------------| -| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. | -| Staleness tracker | `internal/monitoring/staleness_tracker.go`| Maintains freshness metadata and scores. | -| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. | -| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. | -| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. | -| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. | +## 📊 Metrics +Exposed at `:9091/metrics`. -## Configuration +| Metric | Type | Description | +| :--- | :--- | :--- | +| `pulse_monitor_poll_total` | Counter | Total poll attempts. | +| `pulse_monitor_poll_duration_seconds` | Histogram | Poll latency. | +| `pulse_monitor_poll_staleness_seconds` | Gauge | Age since last success. | +| `pulse_monitor_poll_queue_depth` | Gauge | Queue size. | +| `pulse_monitor_poll_errors_total` | Counter | Error counts by category. | -**v4.24.0:** Adaptive polling is **enabled by default** but can be toggled without restart. +## ⚡ Circuit Breaker +| State | Trigger | Recovery | +| :--- | :--- | :--- | +| **Closed** | Normal operation. | — | +| **Open** | ≥3 failures. | Backoff (max 5m). | +| **Half-open** | Retry window elapsed. | Success = Closed; Fail = Open. | -### Via UI -Navigate to **Settings → System → Monitoring** to enable/disable adaptive polling. Changes apply immediately without requiring a restart. +**Dead Letter Queue**: After 5 transient or 1 permanent failure, tasks move to DLQ (30m retry). 
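Breaker behaviour can also be watched without calling the health API: the scheduler exports breaker state as a gauge on the metrics endpoint (default port shown above; `0`=closed, `1`=half-open, `2`=open per the Prometheus metrics reference):

```bash
# Any series with value 2 indicates an instance whose breaker is currently open
curl -s http://localhost:9091/metrics | grep pulse_scheduler_breaker_state
```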
-### Via Environment Variables -Environment variables (default in `internal/config/config.go`): +## 🩺 Health API +`GET /api/monitoring/scheduler/health` (Auth required) -| Variable | Default | Description | -|-------------------------------------|---------|--------------------------------------------------| -| `ADAPTIVE_POLLING_ENABLED` | true | **Changed in v4.24.0**: Now enabled by default | -| `ADAPTIVE_POLLING_BASE_INTERVAL` | 10s | Target cadence when system is healthy | -| `ADAPTIVE_POLLING_MIN_INTERVAL` | 5s | Lower bound (active instances) | -| `ADAPTIVE_POLLING_MAX_INTERVAL` | 5m | Upper bound (idle instances) | - -All settings persist in `system.json` and respond to environment overrides. **Changes apply without restart** when modified via UI. - -## Metrics - -**v4.24.0:** Extended metrics for comprehensive monitoring. - -Exposed via Prometheus (`:9091/metrics`): - -| Metric | Type | Labels | Description | -|---------------------------------------------|-----------|---------------------------------------|-------------------------------------------------| -| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error) | -| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance | -| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success) | -| `pulse_monitor_poll_queue_depth` | gauge | — | Size of priority queue | -| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type | -| `pulse_monitor_poll_errors_total` | counter | `instance_type`, `instance`, `category` | Error counts by category (transient/permanent) | -| `pulse_monitor_poll_last_success_timestamp` | gauge | `instance_type`, `instance` | Unix timestamp of last successful poll | - -**Alerting Recommendations:** -- Alert when `pulse_monitor_poll_staleness_seconds` > 120 for critical instances -- Alert when `pulse_monitor_poll_queue_depth` > 50 (backlog building) -- Alert when `pulse_monitor_poll_errors_total` with `category=permanent` increases (auth/config issues) - -## Circuit Breaker & Backoff - -| State | Trigger | Recovery | -|-------------|---------------------------------------------|--------------------------------------------| -| **Closed** | Default. Failures counted. | — | -| **Open** | ≥3 consecutive failures. Poll suppressed. | Exponential delay (max 5 min). | -| **Half-open**| Retry window elapsed. Limited re-attempt. | Success ⇒ closed. Failure ⇒ open. | - -```mermaid -stateDiagram-v2 - [*] --> Closed: Startup / reset - Closed: Default state\nPolling active\nFailure counter increments - Closed --> Open: ≥3 consecutive failures - Open: Polls suppressed\nScheduler schedules backoff (max 5m) - Open --> HalfOpen: Retry window elapsed - HalfOpen: Single probe allowed\nBreaker watches probe result - HalfOpen --> Closed: Probe success\nReset failure streak & delay - HalfOpen --> Open: Probe failure\nIncrease streak & backoff -``` - -Backoff configuration: - -- Initial delay: 5 s -- Multiplier: x2 per failure -- Jitter: ±20 % -- Max delay: 5 minutes -- After 5 transient failures or any permanent failure, task moves to dead-letter queue for operator action. - -## Dead-Letter Queue - -Dead-letter entries are kept in memory (same `TaskQueue` structure) with a 30 min recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection. 
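On a host install those dead-letter routings land in the service journal, so a simple grep surfaces the affected instances (assuming the systemd unit is `pulse.service`; for Docker, use `docker logs pulse` instead):

```bash
# List tasks that were parked in the dead-letter queue over the last day
journalctl -u pulse --since "24 hours ago" | grep "Routing task to dead-letter queue"
```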
- -## API Endpoints - -### GET /api/monitoring/scheduler/health - -Returns comprehensive scheduler health data (authentication required). - -**Response format:** - -```json -{ - "updatedAt": "2025-03-21T18:05:00Z", - "enabled": true, - "queue": { - "depth": 7, - "dueWithinSeconds": 2, - "perType": { - "pve": 4, - "pbs": 2, - "pmg": 1 - } - }, - "deadLetter": { - "count": 2, - "tasks": [ - { - "instance": "pbs-nas", - "type": "pbs", - "nextRun": "2025-03-21T18:25:00Z", - "lastError": "connection timeout", - "failures": 7 - } - ] - }, - "breakers": [ - { - "instance": "pve-core", - "type": "pve", - "state": "half_open", - "failures": 3, - "retryAt": "2025-03-21T18:05:45Z" - } - ], - "staleness": [ - { - "instance": "pve-core", - "type": "pve", - "score": 0.12, - "lastSuccess": "2025-03-21T18:04:50Z" - } - ] -} -``` - -**Field descriptions:** - -- `enabled`: Feature flag status -- `queue.depth`: Total queued tasks -- `queue.dueWithinSeconds`: Tasks due within 12 seconds -- `queue.perType`: Distribution by instance type -- `deadLetter.count`: Total dead-letter tasks -- `deadLetter.tasks`: Up to 25 most recent dead-letter entries -- `breakers`: Circuit breaker states (only non-default states shown) -- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale) - -## Operational Guidance - -1. **Enable adaptive polling**: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`). -2. **Monitor metrics** to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`. -3. **Inspect scheduler health** via API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status. -4. **Review dead-letter logs** for persistent failures; resolve underlying connectivity or auth issues before re-enabling. - -## Rollout Plan - -1. **Dev/QA**: Run hot-dev with feature flag enabled; observe metrics and logs for several cycles. -2. **Staged deploy**: Enable flag on a subset of clusters; monitor queue depth (<50) and staleness (<45 s). -3. **Full rollout**: Toggle flag globally once metrics are stable; document any overrides in release notes. -4. **Post-launch**: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work). - -## Known Follow-ups - -- Task 8: expose scheduler health & dead-letter statistics via API and UI panels. -- Task 9: add dedicated unit/integration harness for the scheduler & workers. +Returns: +* Queue depth & breakdown. +* Dead-letter tasks. +* Circuit breaker states. +* Per-instance staleness. diff --git a/docs/monitoring/PROMETHEUS_METRICS.md b/docs/monitoring/PROMETHEUS_METRICS.md index aa1c77eda..b54751b0a 100644 --- a/docs/monitoring/PROMETHEUS_METRICS.md +++ b/docs/monitoring/PROMETHEUS_METRICS.md @@ -1,81 +1,36 @@ -# Pulse Prometheus Metrics (v4.24.0+) +# 📊 Prometheus Metrics -Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules. - ---- - -## HTTP Request Metrics - -| Metric | Type | Labels | Description | -| --- | --- | --- | --- | -| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). | -| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. 
| -| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. | - -**Alert suggestion:** -`rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops. - ---- - -## Per-Node Poll Metrics - -| Metric | Type | Labels | Description | -| --- | --- | --- | --- | -| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration for each node poll. | -| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. | -| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error type breakdown (connection, auth, internal, etc.). | -| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of last successful poll. | -| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since last success (−1 means no success yet). | - -**Alert suggestion:** -`max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes. - ---- - -## Scheduler Health Metrics - -| Metric | Type | Labels | Description | -| --- | --- | --- | --- | -| `pulse_scheduler_queue_due_soon` | Gauge | — | Number of tasks due within 12 seconds. | -| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). | -| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. | -| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. | -| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit breaker state: `0`=closed, `1`=half-open, `2`=open, `-1`=unknown. | -| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. | -| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker will allow the next attempt. | - -**Alert suggestions:** -- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > ` -- DLQ growth: `increase(pulse_scheduler_dead_letter_depth[10m]) > 0` -- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for > 10 minutes. - ---- - -## Diagnostics Cache Metrics - -| Metric | Type | Labels | Description | -| --- | --- | --- | --- | -| `pulse_diagnostics_cache_hits_total` | Counter | — | Diagnostics requests served from cache. | -| `pulse_diagnostics_cache_misses_total` | Counter | — | Requests that triggered a fresh probe. | -| `pulse_diagnostics_refresh_duration_seconds` | Histogram | — | Time taken to refresh diagnostics payload. | - -**Alert suggestion:** -`rate(pulse_diagnostics_cache_misses_total[5m])` spiking alongside `pulse_diagnostics_refresh_duration_seconds` > 20s can signal upstream slowness. - ---- - -## Existing Instance-Level Poll Metrics (for completeness) - -The following metrics pre-date v4.24.0 but remain essential: +Pulse exposes metrics at `/metrics` (default port `9091`). +## 🌐 HTTP Ingress | Metric | Type | Description | -| --- | --- | --- | -| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. | -| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. 
| -| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. | -| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. | -| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last successful poll (instance-level). | -| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. | -| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. | +| :--- | :--- | :--- | +| `pulse_http_request_duration_seconds` | Histogram | Latency buckets by `method`, `route`, `status`. | +| `pulse_http_requests_total` | Counter | Total requests. | +| `pulse_http_request_errors_total` | Counter | 4xx/5xx errors. | -Refer to this document whenever you build dashboards or craft alert policies. Scrape all metrics from the Pulse backend `/metrics` endpoint (9091 by default for systemd installs). +## 🔄 Polling & Nodes +| Metric | Type | Description | +| :--- | :--- | :--- | +| `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. | +| `pulse_monitor_node_poll_total` | Counter | Success/error counts per node. | +| `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last success. | +| `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. | + +## 🧠 Scheduler Health +| Metric | Type | Description | +| :--- | :--- | :--- | +| `pulse_scheduler_queue_depth` | Gauge | Queue depth per instance type. | +| `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth per instance. | +| `pulse_scheduler_breaker_state` | Gauge | `0`=Closed, `1`=Half-Open, `2`=Open. | + +## ⚡ Diagnostics Cache +| Metric | Type | Description | +| :--- | :--- | :--- | +| `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. | +| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. | + +## 🚨 Alerting Examples +* **High Error Rate**: `rate(pulse_http_request_errors_total[5m]) > 0.05` +* **Stale Node**: `pulse_monitor_node_poll_staleness_seconds > 300` +* **Breaker Open**: `pulse_scheduler_breaker_state == 2` diff --git a/docs/operations/ADAPTIVE_POLLING_ROLLOUT.md b/docs/operations/ADAPTIVE_POLLING_ROLLOUT.md index fdcfc5966..45532fb57 100644 --- a/docs/operations/ADAPTIVE_POLLING_ROLLOUT.md +++ b/docs/operations/ADAPTIVE_POLLING_ROLLOUT.md @@ -1,83 +1,30 @@ -# Adaptive Polling Rollout Runbook +# 🚀 Adaptive Polling Rollout -Adaptive polling (v4.24.0+) lets the scheduler dynamically adjust poll -intervals per resource. This runbook documents the safe way to enable, monitor, -and, if needed, disable the feature across environments. +Safely enable dynamic scheduling (v4.24.0+). -## Scope & Prerequisites +## 📋 Pre-Flight +1. **Snapshot Health**: + ```bash + curl -s http://localhost:7655/api/monitoring/scheduler/health | jq . + ``` +2. **Check Metrics**: Ensure `pulse_monitor_poll_queue_depth` is stable. -- Pulse **v4.24.0 or newer** -- Admin access to **Settings → System → Monitoring** -- Prometheus access to `pulse_monitor_*` metrics -- Ability to run authenticated `curl` commands against the Pulse API +## 🟢 Enable +Choose one method: +* **UI**: Settings → System → Monitoring → Adaptive Polling. +* **CLI**: `jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && mv tmp system.json` +* **Env**: `ADAPTIVE_POLLING_ENABLED=true` (Docker/K8s). -## Change Windows +## 🔍 Monitor (First 15m) +Watch for stability: +```bash +watch -n 5 'curl -s http://localhost:9091/metrics | grep pulse_monitor_poll_queue_depth' +``` +* **Success**: Queue depth < 50, no permanent errors. 
+* **Failure**: High queue depth, open breakers. -Run rollouts during a maintenance window where transient alert jitter is -acceptable. Adaptive polling touches every monitor queue; give yourself at least -15 minutes to observe steady state metrics. - -## Rollout Steps - -1. **Snapshot current health** - ```bash - curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.enabled, .queue.depth' - ``` - Record queue depth, breaker count, and dead-letter entries. - -2. **Enable adaptive polling** - - UI: toggle **Settings → System → Monitoring → Adaptive Polling** → Enable - - CLI: `jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && mv tmp system.json` - - Env override: `ADAPTIVE_POLLING_ENABLED=true` before starting Pulse (for - containers/k8s) - -3. **Watch metrics (first 5 minutes)** - ```bash - watch -n 5 'curl -s http://localhost:9091/metrics | grep -E "pulse_monitor_(poll_queue_depth|poll_staleness_seconds)" | head' - ``` - Targets: - - `pulse_monitor_poll_queue_depth < 50` - - `pulse_monitor_poll_staleness_seconds` under your SLA (typically < 60 s) - - No spikes in `pulse_monitor_poll_errors_total{category="permanent"}` - -4. **Validate scheduler state** - ```bash - curl -s http://localhost:7655/api/monitoring/scheduler/health \ - | jq '{enabled, queue: .queue.depth, breakers: [.breakers[]?.instance], deadLetter: .deadLetter.count}' - ``` - Expect `enabled: true`, empty breaker list, and `deadLetter.count == 0`. - -5. **Document overrides** - - Note any instances moved to manual polling (Settings → Nodes → Polling) - - Capture Grafana screenshots for queue depth/staleness widgets - -## Rollback - -If queue depth climbs uncontrollably or breakers remain open for >10 minutes: - -1. Disable the feature the same way you enabled it (UI/environment). -2. Restart Pulse if environment overrides were used, otherwise hot toggle is - immediate. -3. Continue monitoring until queue depth and staleness return to baseline. - -## Canary Strategy Suggestions - -| Stage | Action | Acceptance Criteria | -| --- | --- | --- | -| Dev | Enable flag in hot-dev (scripts/hot-dev.sh) | No scheduler panics, UI reflects flag instantly | -| Staging | Enable on one Pulse instance per region | `queue.depth` within ±20 % of baseline after 15 min | -| Production | Enable per cluster with 30 min soak | No more than 5 breaker openings per hour | - -## Instrumentation Checklist - -- Grafana dashboard with `queue.depth`, `poll_staleness_seconds`, - `poll_errors_total` by type -- Alert rule: `rate(pulse_monitor_poll_errors_total{category="permanent"}[5m]) > 0` -- Alert rule: `max_over_time(pulse_monitor_poll_queue_depth[5m]) > 75` -- JSON log search for `"scheduler":` warnings immediately after enablement - -## References - -- [Architecture doc](../monitoring/ADAPTIVE_POLLING.md) -- [Scheduler Health API](../api/SCHEDULER_HEALTH.md) -- [Kubernetes guidance](../KUBERNETES.md#adaptive-polling-configuration-v4250) +## ↩️ Rollback +If instability occurs > 10m: +1. **Disable**: Toggle off via UI or Env. +2. **Restart**: Required if using Env/CLI overrides. +3. **Verify**: Confirm queue drains. diff --git a/docs/operations/AUDIT_LOG_ROTATION.md b/docs/operations/AUDIT_LOG_ROTATION.md new file mode 100644 index 000000000..82e33ba21 --- /dev/null +++ b/docs/operations/AUDIT_LOG_ROTATION.md @@ -0,0 +1,51 @@ +# 🔄 Sensor Proxy Audit Log Rotation + +The proxy writes append-only, hash-chained logs to `/var/log/pulse/sensor-proxy/audit.log`. 
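Before rotating, it is worth confirming the protection described below is actually in place; the `a` flag in the `lsattr` output indicates append-only mode:

```bash
# Expect the 'a' (append-only) attribute on a correctly secured audit log
lsattr /var/log/pulse/sensor-proxy/audit.log
```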
+ +## ⚠️ Important +* **Do not delete**: The file is protected with `chattr +a`. +* **Rotate when**: >200MB or >30 days. + +## 🛠️ Manual Rotation + +Run as root: + +```bash +# 1. Unlock file +chattr -a /var/log/pulse/sensor-proxy/audit.log + +# 2. Rotate (copy & truncate) +cp -a /var/log/pulse/sensor-proxy/audit.log /var/log/pulse/sensor-proxy/audit.log.$(date +%Y%m%d) +: > /var/log/pulse/sensor-proxy/audit.log + +# 3. Relock & Restart +chown pulse-sensor-proxy:pulse-sensor-proxy /var/log/pulse/sensor-proxy/audit.log +chmod 0640 /var/log/pulse/sensor-proxy/audit.log +chattr +a /var/log/pulse/sensor-proxy/audit.log +systemctl restart pulse-sensor-proxy +``` + +## 🤖 Logrotate Config + +Create `/etc/logrotate.d/pulse-sensor-proxy`: + +```conf +/var/log/pulse/sensor-proxy/audit.log { + weekly + rotate 8 + compress + missingok + notifempty + create 0640 pulse-sensor-proxy pulse-sensor-proxy + sharedscripts + prerotate + /usr/bin/chattr -a /var/log/pulse/sensor-proxy/audit.log || true + endscript + postrotate + /bin/systemctl restart pulse-sensor-proxy.service || true + /usr/bin/chattr +a /var/log/pulse/sensor-proxy/audit.log || true + endscript +} +``` + +**Note**: Do NOT use `copytruncate`. The restart is required to reset the hash chain. diff --git a/docs/operations/AUTO_UPDATE.md b/docs/operations/AUTO_UPDATE.md new file mode 100644 index 000000000..05a8db77f --- /dev/null +++ b/docs/operations/AUTO_UPDATE.md @@ -0,0 +1,47 @@ +# 🔄 Automatic Updates +Manage Pulse auto-updates on host-mode installations. + +> **Note**: Docker/Kubernetes users should manage updates via their orchestrator. + +## ⚙️ Components +| File | Purpose | +| :--- | :--- | +| `pulse-update.timer` | Daily check (02:00 + jitter). | +| `pulse-update.service` | Runs the update script. | +| `pulse-auto-update.sh` | Fetches release & restarts Pulse. | + +## 🚀 Enable/Disable + +### Via UI (Recommended) +**Settings → System → Updates → Automatic Updates**. + +### Via CLI +```bash +# Enable +sudo jq '.autoUpdateEnabled=true' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json +sudo systemctl enable --now pulse-update.timer + +# Disable +sudo jq '.autoUpdateEnabled=false' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json +sudo systemctl disable --now pulse-update.timer +``` + +## 🧪 Manual Run +Test the update process: +```bash +sudo systemctl start pulse-update.service +journalctl -u pulse-update -f +``` + +## 🔍 Observability +* **History**: `curl -s http://localhost:7655/api/updates/history | jq` +* **Logs**: `/var/log/pulse/update-*.log` + +## ↩️ Rollback +If an update fails: +1. Check logs: `/var/log/pulse/update-YYYYMMDDHHMMSS.log`. +2. Revert manually: + ```bash + sudo /opt/pulse/install.sh --version v4.30.0 + ``` + Or use the **Rollback** button in the UI if available. diff --git a/docs/operations/SENSOR_PROXY_CONFIG.md b/docs/operations/SENSOR_PROXY_CONFIG.md new file mode 100644 index 000000000..eae06ae13 --- /dev/null +++ b/docs/operations/SENSOR_PROXY_CONFIG.md @@ -0,0 +1,40 @@ +# ⚙️ Sensor Proxy Configuration + +Safe configuration management using the CLI (v4.31.1+). + +## 📂 Files +* **`config.yaml`**: General settings (logging, metrics). +* **`allowed_nodes.yaml`**: Authorized node list (managed via CLI). + +## 🛠️ CLI Reference + +### Validation +Check for errors before restart. 
+```bash +pulse-sensor-proxy config validate +``` + +### Managing Nodes +**Add Nodes (Merge):** +```bash +pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10 +``` + +**Replace List:** +```bash +pulse-sensor-proxy config set-allowed-nodes --replace \ + --merge 192.168.0.1 --merge 192.168.0.2 +``` + +## ⚠️ Troubleshooting + +**Validation Fails:** +* Check for duplicate `allowed_nodes` blocks in `config.yaml`. +* Run `pulse-sensor-proxy config validate 2>&1` for details. + +**Lock Errors:** +* Remove stale locks if process is dead: `rm /etc/pulse-sensor-proxy/*.lock`. + +**Empty List:** +* Valid for IPC-only clusters. +* Populate manually if needed using `--replace`. diff --git a/docs/operations/SENSOR_PROXY_LOGS.md b/docs/operations/SENSOR_PROXY_LOGS.md new file mode 100644 index 000000000..5c56b4dd2 --- /dev/null +++ b/docs/operations/SENSOR_PROXY_LOGS.md @@ -0,0 +1,31 @@ +# 📝 Sensor Proxy Log Forwarding + +Forward `audit.log` and `proxy.log` to a central SIEM via RELP + TLS. + +## 🚀 Quick Start +Run the helper script with your collector details: + +```bash +sudo REMOTE_HOST=logs.example.com \ + REMOTE_PORT=6514 \ + CERT_DIR=/etc/pulse/log-forwarding \ + CA_CERT=/path/to/ca.crt \ + CLIENT_CERT=/path/to/client.crt \ + CLIENT_KEY=/path/to/client.key \ + /opt/pulse/scripts/setup-log-forwarding.sh +``` + +## 📋 What It Does +1. **Inputs**: Watches `/var/log/pulse/sensor-proxy/{audit,proxy}.log`. +2. **Queue**: Disk-backed queue (50k messages) for reliability. +3. **Output**: RELP over TLS to `REMOTE_HOST`. +4. **Mirror**: Local debug file at `/var/log/pulse/sensor-proxy/forwarding.log`. + +## ✅ Verification +1. **Check Status**: `sudo systemctl status rsyslog` +2. **View Mirror**: `tail -f /var/log/pulse/sensor-proxy/forwarding.log` +3. **Test**: Restart proxy and check remote collector for `pulse.audit` tag. + +## 🧹 Maintenance +* **Disable**: Remove `/etc/rsyslog.d/pulse-sensor-proxy.conf` and restart rsyslog. +* **Rotate Certs**: Replace files in `CERT_DIR` and restart rsyslog. diff --git a/docs/operations/audit-log-rotation.md b/docs/operations/audit-log-rotation.md deleted file mode 100644 index 4c09de20d..000000000 --- a/docs/operations/audit-log-rotation.md +++ /dev/null @@ -1,120 +0,0 @@ -# Sensor Proxy Audit Log Rotation - -The temperature sensor proxy writes append-only, hash-chained audit events to -`/var/log/pulse/sensor-proxy/audit.log`. The file is created with `0640` -permissions, owned by `pulse-sensor-proxy`, and protected with `chattr +a` via -`scripts/secure-sensor-files.sh`. Because the process keeps the file handle open -and enforces append-only mode, you **must** follow the steps below to rotate the -log without losing events. - -## When to Rotate - -- File exceeds **200 MB** or contains more than 30 days of history -- Prior to exporting evidence for an incident review -- Immediately before changing log-forwarding endpoints (rsyslog/RELp) - -The proxy falls back to stderr (systemd journal) only when the file cannot be -opened. Do not rely on the fallback for long-term retention. - -## Pre-flight Checklist - -1. Confirm the service is healthy: - ```bash - systemctl status pulse-sensor-proxy --no-pager - ``` -2. Make sure `/var/log/pulse/sensor-proxy` is mounted with enough free space: - ```bash - df -h /var/log/pulse/sensor-proxy - ``` -3. 
Note the current scheduler health inside Pulse for later verification: - ```bash - curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.queue.depth, .deadLetter.count' - ``` - -## Manual Rotation Procedure - -> Run these steps as **root** on the Proxmox host that runs the proxy. - -1. Remove the append-only flag (logrotate needs to truncate the file): - ```bash - chattr -a /var/log/pulse/sensor-proxy/audit.log - ``` -2. Copy the current file to an evidence path, then truncate in place: - ```bash - ts=$(date +%Y%m%d-%H%M%S) - cp -a /var/log/pulse/sensor-proxy/audit.log /var/log/pulse/sensor-proxy/audit.log.$ts - : > /var/log/pulse/sensor-proxy/audit.log - ``` -3. Restore permissions and the append-only flag: - ```bash - chown pulse-sensor-proxy:pulse-sensor-proxy /var/log/pulse/sensor-proxy/audit.log - chmod 0640 /var/log/pulse/sensor-proxy/audit.log - chattr +a /var/log/pulse/sensor-proxy/audit.log - ``` -4. Restart the proxy so the file descriptor is reopened: - ```bash - systemctl restart pulse-sensor-proxy - ``` -5. Verify the service recreated the correlation hash chain: - ```bash - journalctl -u pulse-sensor-proxy -n 20 | grep -i "audit" || true - ``` -6. Re-check Pulse adaptive polling health (temperature pollers rely on the - proxy): - ```bash - curl -s http://localhost:7655/api/monitoring/scheduler/health \ - | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}' - ``` - All temperature instances should show `breaker: "closed"` with - `deadLetter: false`. - -## Logrotate Configuration - -Automate rotation with `/etc/logrotate.d/pulse-sensor-proxy`. Copy the snippet -below and adjust retention to match your compliance needs: - -```conf -/var/log/pulse/sensor-proxy/audit.log { - weekly - rotate 8 - compress - missingok - notifempty - create 0640 pulse-sensor-proxy pulse-sensor-proxy - sharedscripts - prerotate - /usr/bin/chattr -a /var/log/pulse/sensor-proxy/audit.log || true - endscript - postrotate - /bin/systemctl restart pulse-sensor-proxy.service || true - /usr/bin/chattr +a /var/log/pulse/sensor-proxy/audit.log || true - endscript -} -``` - -Keep `copytruncate` disabled—the restart ensures the proxy writes to a fresh -file with a new hash chain. Always forward rotated files to your SIEM before -removing them. - -## Forwarding Validations - -If you forward audit logs over RELP using `scripts/setup-log-forwarding.sh`: - -1. Tail the forwarding log: - ```bash - tail -f /var/log/pulse/sensor-proxy/forwarding.log - ``` -2. Ensure queues drain (`action.resumeRetryCount=-1` keeps retrying). -3. Confirm the remote receiver ingests the new file (look for the `pulse.audit` -tag). - -## Troubleshooting - -| Symptom | Action | -| --- | --- | -| `Operation not permitted` when truncating | `chattr -a` was not executed or SELinux/AppArmor denies it. Check `auditd`. | -| Proxy fails to restart | Run `journalctl -u pulse-sensor-proxy -xe` for context. The proxy refuses to start if the audit file cannot be opened. | -| Temperature polls stop after rotation | Check `/api/monitoring/scheduler/health` for dead-letter entries. Restart the main Pulse service if breakers stay open. | - -Once logs are rotated and validated, upload the archived copy to your evidence -store and record the event in your change log. 
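For the evidence-store step, any off-host copy works. One hedged example using `scp` (the `$ts` variable comes from the rotation procedure above; the destination host and path are placeholders):

```bash
# Ship the rotated archive off-host before removing the local copy
scp /var/log/pulse/sensor-proxy/audit.log.$ts evidence@backup.example.com:/srv/pulse-audit/
```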
diff --git a/docs/operations/auto-update.md b/docs/operations/auto-update.md deleted file mode 100644 index b6055efb9..000000000 --- a/docs/operations/auto-update.md +++ /dev/null @@ -1,104 +0,0 @@ -# Pulse Automatic Update Runbook - -Automatic updates are handled by three systemd units that live on host-mode -installations: - -| Component | Purpose | File | -| --- | --- | --- | -| `pulse-update.timer` | Schedules daily checks (02:00 + 0‑4 h jitter) | `/etc/systemd/system/pulse-update.timer` | -| `pulse-update.service` | Runs a single update cycle when triggered | `/etc/systemd/system/pulse-update.service` | -| `scripts/pulse-auto-update.sh` | Fetches release metadata, downloads binaries, restarts Pulse | `/opt/pulse/scripts/pulse-auto-update.sh` | - -> Docker and Kubernetes deployments do **not** use this flow—manage upgrades via -> your orchestrator. - -## Prerequisites - -- `autoUpdateEnabled` must be `true` in `/var/lib/pulse/system.json` (or toggled in - **Settings → System → Updates → Automatic Updates**). -- `pulse.service` must be healthy—the update service short-circuits if Pulse is - not running. -- Host needs outbound HTTPS access to `github.com` and `objects.githubusercontent.com`. - -## Enable or Disable - -### From the UI -1. Navigate to **Settings → System → Updates**. -2. Toggle **Automatic Updates** on. The backend persists `autoUpdateEnabled:true` - and surfaces a reminder to enable the timer. -3. On the host, run: - ```bash - sudo systemctl enable --now pulse-update.timer - sudo systemctl status pulse-update.timer --no-pager - ``` -4. To disable later, toggle the UI switch off **and** run - `sudo systemctl disable --now pulse-update.timer`. - -### From the CLI only -```bash -# Opt in -sudo jq '.autoUpdateEnabled=true' /var/lib/pulse/system.json | sudo tee /var/lib/pulse/system.json >/dev/null -sudo systemctl daemon-reload -sudo systemctl enable --now pulse-update.timer - -# Opt out -sudo jq '.autoUpdateEnabled=false' /var/lib/pulse/system.json | sudo tee /var/lib/pulse/system.json >/dev/null -sudo systemctl disable --now pulse-update.timer -``` -> Editing `system.json` while Pulse is running is safe, but prefer the UI so -> validation rules stay in place. - -## Trigger a Manual Run - -Use this when testing new releases or after changing firewall rules: - -```bash -sudo systemctl start pulse-update.service -sudo journalctl -u pulse-update -n 50 -``` - -The oneshot service exits when the script finishes. A successful run logs the new -version and writes an entry to `update-history.jsonl`. - -## Observability Checklist - -- **Timer status**: `systemctl list-timers pulse-update` -- **History API**: `curl -s http://localhost:7655/api/updates/history | jq '.entries[0]'` -- **Raw log**: `/var/log/pulse/update-*.log` (referenced inside the history entry’s - `log_path` field) -- **Journal**: `journalctl -u pulse-update -f` -- **Backups**: The script records `backup_path` in history (defaults to - `/etc/pulse.backup.`). Ensure the path exists before acknowledging - the rollout. - -## Failure Handling & Rollback - -1. Inspect the failing history entry: - ```bash - curl -s http://localhost:7655/api/updates/history?limit=1 | jq '.entries[0]' - ``` - Common statuses: `failed`, `rolled_back`, `succeeded`. -2. Review `/var/log/pulse/update-YYYYMMDDHHMMSS.log` for the stack trace. -3. To revert, redeploy the previous release: - ```bash - sudo /opt/pulse/install.sh --version v4.30.0 - ``` - or use the main installer command from the update history output. 
The installer - restores the `backup_path` recorded earlier when you choose **Rollback** in the - UI. -4. Confirm Pulse is healthy (`systemctl status pulse.service`) and that - `/api/updates/history` now contains a `rolled_back` entry referencing the same - `event_id`. - -## Troubleshooting - -| Symptom | Resolution | -| --- | --- | -| `Auto-updates disabled in configuration` in journal | Set `autoUpdateEnabled:true` (UI or edit `system.json`) and restart the timer. | -| `pulse-update.timer` immediately exits | Ensure `systemd` knows about the units (`sudo systemctl daemon-reload`) and that `pulse.service` exists (installer may not have run with `--enable-auto-updates`). | -| `github.com` errors / rate limit | The script retries via the release redirect. For proxied environments set `https_proxy` before the service runs. | -| Update succeeds but Pulse stays on previous version | Check `journalctl -u pulse-update` for `restart failed`; Pulse only switches after the service restarts successfully. | -| Timer enabled but no history entries | Verify time has passed since enablement (timer includes random delay) or start the service manually to seed the first run. | - -Document each run (success or rollback) in your change journal with the -`event_id` from `/api/updates/history` so you can cross-reference audit trails. diff --git a/docs/operations/sensor-proxy-config-management.md b/docs/operations/sensor-proxy-config-management.md deleted file mode 100644 index c296d7526..000000000 --- a/docs/operations/sensor-proxy-config-management.md +++ /dev/null @@ -1,469 +0,0 @@ -# Sensor Proxy Configuration Management - -This guide covers safe configuration management for pulse-sensor-proxy, including the new CLI tools introduced in v4.31.1+ to prevent config corruption. - -## Overview - -Starting with v4.31.1, pulse-sensor-proxy uses a two-file configuration system: - -1. **Main config:** `/etc/pulse-sensor-proxy/config.yaml` - Contains all settings except allowed nodes -2. **Allowed nodes:** `/etc/pulse-sensor-proxy/allowed_nodes.yaml` - Separate file for the authorized node list - -This separation prevents corruption from concurrent updates by the installer, control-plane sync, and self-heal timer. - -## Architecture - -### Why Two Files? - -Earlier versions stored `allowed_nodes:` inline in `config.yaml`, causing corruption when: -- The installer updated node lists -- The self-heal timer ran (every 5 minutes) -- Control-plane sync modified the list -- Version detection had edge cases - -Multiple code paths (shell, Python, Go) would race to update the same YAML file, creating duplicate `allowed_nodes:` keys that broke YAML parsing. 
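If you suspect an older install hit this race, the symptom is easy to check for before upgrading. A minimal sketch, assuming the default config path; it only counts duplicate top-level `allowed_nodes:` keys:

```bash
# More than one top-level allowed_nodes: key in config.yaml indicates the
# corruption described above and means the v4.31.1+ migration has not run yet.
count=$(grep -c '^allowed_nodes:' /etc/pulse-sensor-proxy/config.yaml)
if [ "$count" -gt 1 ]; then
  echo "config.yaml has $count allowed_nodes blocks: run the v4.31.1+ migration"
else
  echo "no duplicate allowed_nodes blocks found"
fi
```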
- -### New System (v4.31.1+) - -**Phase 1 (Migration):** -- Force file-based mode exclusively -- Installer migrates inline blocks to `allowed_nodes.yaml` -- Self-heal timer includes corruption detection and repair - -**Phase 2 (Atomic Operations):** -- Go CLI replaces all shell/Python config manipulation -- File locking prevents concurrent writes -- Atomic writes (temp file + rename) ensure consistency -- systemd validation prevents startup with corrupt config - -## Configuration CLI Reference - -### Validate Configuration - -Check config files for errors before restarting the service: - -```bash -# Validate both config.yaml and allowed_nodes.yaml -pulse-sensor-proxy config validate - -# Validate specific config file -pulse-sensor-proxy config validate --config /path/to/config.yaml - -# Validate specific allowed_nodes file -pulse-sensor-proxy config validate --allowed-nodes /path/to/allowed_nodes.yaml -``` - -**Exit codes:** -- 0 = valid -- Non-zero = validation failed (check stderr for details) - -**Common validation errors:** -- "duplicate allowed_nodes blocks" - Run migration (see below) -- "failed to parse YAML" - Syntax error in config file -- "read_timeout must be positive" - Invalid timeout value - -### Manage Allowed Nodes - -The CLI provides two modes: - -**Merge mode (default):** Adds nodes to existing list -```bash -# Add single node -pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10 - -# Add multiple nodes -pulse-sensor-proxy config set-allowed-nodes \ - --merge 192.168.0.1 \ - --merge 192.168.0.2 \ - --merge node1.local -``` - -**Replace mode:** Overwrites entire list -```bash -# Replace with new list -pulse-sensor-proxy config set-allowed-nodes --replace \ - --merge 192.168.0.1 \ - --merge 192.168.0.2 - -# Clear the list (empty is valid for IPC-only clusters) -pulse-sensor-proxy config set-allowed-nodes --replace -``` - -**Custom paths:** -```bash -# Use non-default path -pulse-sensor-proxy config set-allowed-nodes \ - --allowed-nodes /custom/path.yaml \ - --merge 192.168.0.10 -``` - -### How It Works - -1. **File locking:** Uses `flock(LOCK_EX)` on separate `.lock` file -2. **Atomic writes:** Writes to temp file, syncs, then renames -3. **Deduplication:** Automatically removes duplicate entries -4. **Normalization:** Trims whitespace, sorts entries -5. **Empty lists allowed:** Useful for security lockdown or IPC-based discovery - -## Common Tasks - -### Adding Nodes After Cluster Expansion - -When you add a new node to your Proxmox cluster: - -```bash -# Add the new node to allowed list -pulse-sensor-proxy config set-allowed-nodes --merge new-node.local - -# Validate config -pulse-sensor-proxy config validate - -# Restart proxy to apply -sudo systemctl restart pulse-sensor-proxy - -# Verify in Pulse UI -# Check Settings → Diagnostics → Temperature Proxy -``` - -### Removing Decommissioned Nodes - -When removing a node from your cluster: - -```bash -# Get current list -cat /etc/pulse-sensor-proxy/allowed_nodes.yaml - -# Replace with updated list (without old node) -pulse-sensor-proxy config set-allowed-nodes --replace \ - --merge 192.168.0.1 \ - --merge 192.168.0.2 - # (omit the decommissioned node) - -# Validate and restart -pulse-sensor-proxy config validate -sudo systemctl restart pulse-sensor-proxy -``` - -**Note:** The proxy cleanup system automatically removes SSH keys from deleted nodes. See temperature monitoring docs for details. 
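Because the CLI has no `remove` subcommand, retiring a node means replaying the remaining entries with `--replace`. The helper below is a sketch that automates this; it assumes `allowed_nodes.yaml` stores entries as a plain YAML list (`- value` items), so check your file format before relying on it.

```bash
#!/bin/bash
# Usage: ./remove-allowed-node.sh <node-to-remove>
# Rebuilds the allowed-nodes list without the given entry, validates, restarts.
set -eo pipefail

remove="${1:?usage: $0 <node-to-remove>}"
file=/etc/pulse-sensor-proxy/allowed_nodes.yaml

# Extract "- value" entries and drop the node being retired (exact match).
mapfile -t keep < <(sed -n 's/^[[:space:]]*-[[:space:]]*//p' "$file" | grep -vxF "$remove")

args=(--replace)
for node in "${keep[@]}"; do
  args+=(--merge "$node")
done

pulse-sensor-proxy config set-allowed-nodes "${args[@]}"
pulse-sensor-proxy config validate
sudo systemctl restart pulse-sensor-proxy
```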
- -### Migrating from Inline Config - -If you're running an older version with inline `allowed_nodes:` in config.yaml: - -```bash -# Upgrade to latest version (auto-migrates) -curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \ - sudo bash -s -- --standalone --pulse-server http://your-pulse:7655 - -# Verify migration -pulse-sensor-proxy config validate - -# Check that allowed_nodes only appears in allowed_nodes.yaml -grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml -# Should show: allowed_nodes.yaml:3:allowed_nodes: -# Should NOT show duplicate entries in config.yaml -``` - -### Changing Other Config Settings - -For settings in `config.yaml` (not allowed_nodes): - -```bash -# Stop the service first -sudo systemctl stop pulse-sensor-proxy - -# Edit config.yaml manually -sudo nano /etc/pulse-sensor-proxy/config.yaml - -# Validate before starting -pulse-sensor-proxy config validate - -# Start service -sudo systemctl start pulse-sensor-proxy - -# Check for errors -sudo systemctl status pulse-sensor-proxy -journalctl -u pulse-sensor-proxy -n 50 -``` - -**Safe to edit in config.yaml:** -- `allowed_source_subnets` -- `allowed_peers` (UID/GID permissions) -- `rate_limit` settings -- `metrics_address` -- `http_*` settings (HTTPS mode) -- `pulse_control_plane` block - -**Never edit manually:** -- `allowed_nodes:` (use CLI instead, or it will be in allowed_nodes.yaml anyway) -- Lock files (`.lock`) - -## Troubleshooting - -### Config Validation Fails - -**Symptom:** `pulse-sensor-proxy config validate` returns error - -**Diagnosis:** -```bash -# Run validation with full output -pulse-sensor-proxy config validate 2>&1 - -# Check for duplicate blocks -grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml - -# Check YAML syntax -python3 -c "import yaml; yaml.safe_load(open('/etc/pulse-sensor-proxy/config.yaml'))" -``` - -**Common fixes:** -- Duplicate blocks: Run migration (upgrade to v4.31.1+) -- YAML syntax errors: Fix indentation, remove tabs, check colons -- Missing required fields: Add `read_timeout`, `write_timeout` - -### Service Won't Start After Config Change - -**Diagnosis:** -```bash -# Check systemd logs -journalctl -u pulse-sensor-proxy -n 100 - -# Look for validation errors -journalctl -u pulse-sensor-proxy | grep -i "validation\|corrupt\|duplicate" - -# Try starting in foreground for better errors -sudo -u pulse-sensor-proxy /opt/pulse/sensor-proxy/bin/pulse-sensor-proxy # legacy installs: /usr/local/bin/pulse-sensor-proxy -``` - -**Fix:** -```bash -# Validate config first -pulse-sensor-proxy config validate - -# If validation passes but service fails, check permissions -ls -la /etc/pulse-sensor-proxy/ -ls -la /var/lib/pulse-sensor-proxy/ - -# Ensure proxy user owns files -sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/ -sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /var/lib/pulse-sensor-proxy/ -``` - -### Lock File Errors - -**Symptom:** `failed to acquire file lock` or `failed to open lock file` - -**Cause:** Lock file has wrong permissions or process holds stale lock - -**Fix:** -```bash -# Check lock file permissions (should be 0600) -ls -la /etc/pulse-sensor-proxy/*.lock - -# Fix permissions -sudo chmod 0600 /etc/pulse-sensor-proxy/*.lock -sudo chown pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/*.lock - -# If stale lock, identify holder -sudo lsof /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock - -# Kill stale process if needed (use with caution) -sudo kill -``` - 
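Before killing anything, you can confirm the lock really is held by probing it non-blockingly. A sketch using the util-linux `flock` tool against the default lock path:

```bash
# Exits 0 and prints "free" if the lock can be taken; prints "held" otherwise.
sudo flock -n /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock -c 'echo free' || echo held
```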
-**Prevention:** Locks are automatically released when process exits. Don't manually delete lock files. - -### Allowed Nodes List is Empty - -**Symptom:** allowed_nodes.yaml exists but has no entries - -**Is this a problem?** Not necessarily: -- Empty list is valid for clusters using IPC discovery (pvecm status) -- Control-plane mode populates the list automatically -- Standalone nodes require manual node entries - -**To populate manually:** -```bash -# Add your cluster nodes -pulse-sensor-proxy config set-allowed-nodes --replace \ - --merge 192.168.0.1 \ - --merge 192.168.0.2 \ - --merge 192.168.0.3 - -# Verify -cat /etc/pulse-sensor-proxy/allowed_nodes.yaml -``` - -## Best Practices - -### General Guidelines - -1. **Always validate before restarting:** - ```bash - pulse-sensor-proxy config validate && sudo systemctl restart pulse-sensor-proxy - ``` - -2. **Use the CLI for allowed_nodes changes:** - - Don't edit `allowed_nodes.yaml` manually - - Use `config set-allowed-nodes` instead - -3. **Stop service before editing config.yaml:** - - Prevents race conditions with running process - - systemd validation will catch errors on startup - -4. **Back up config before major changes:** - ```bash - sudo cp /etc/pulse-sensor-proxy/config.yaml /etc/pulse-sensor-proxy/config.yaml.backup - sudo cp /etc/pulse-sensor-proxy/allowed_nodes.yaml /etc/pulse-sensor-proxy/allowed_nodes.yaml.backup - ``` - -5. **Monitor after changes:** - ```bash - journalctl -u pulse-sensor-proxy -f - # Check Pulse UI: Settings → Diagnostics → Temperature Proxy - ``` - -### Automation Scripts - -When scripting config changes: - -```bash -#!/bin/bash -set -euo pipefail - -# Function to safely update allowed nodes -update_allowed_nodes() { - local nodes=("$@") - - # Build command - local cmd="pulse-sensor-proxy config set-allowed-nodes --replace" - for node in "${nodes[@]}"; do - cmd="$cmd --merge $node" - done - - # Execute with validation - if eval "$cmd"; then - echo "Allowed nodes updated successfully" - else - echo "Failed to update allowed nodes" >&2 - return 1 - fi - - # Validate - if ! pulse-sensor-proxy config validate; then - echo "Config validation failed after update" >&2 - return 1 - fi - - # Restart service - if sudo systemctl restart pulse-sensor-proxy; then - echo "Service restarted successfully" - else - echo "Service restart failed" >&2 - return 1 - fi - - # Wait for service to be active - sleep 2 - if systemctl is-active --quiet pulse-sensor-proxy; then - echo "Service is running" - else - echo "Service failed to start" >&2 - journalctl -u pulse-sensor-proxy -n 20 - return 1 - fi -} - -# Example usage -update_allowed_nodes "192.168.0.1" "192.168.0.2" "node3.local" -``` - -### Monitoring Config Health - -Add to your monitoring system: - -```bash -# Check for config corruption (should return 0) -pulse-sensor-proxy config validate -echo $? 
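
# Cron-friendly variant of the validate check above (a sketch: route the
# message into whatever alerting your scheduler already uses)
pulse-sensor-proxy config validate >/dev/null 2>&1 && systemctl is-active --quiet pulse-sensor-proxy \
  || echo "pulse-sensor-proxy config or service unhealthy on $(hostname)" >&2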
- -# Check for duplicate blocks (should be empty) -grep "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml | wc -l - -# Check lock file permissions (should be 0600) -stat -c "%a" /etc/pulse-sensor-proxy/*.lock - -# Check service is running -systemctl is-active pulse-sensor-proxy -``` - -## Migration Path - -### Upgrading from Pre-v4.31.1 - -**Automatic migration** (recommended): -```bash -# Simply reinstall - migration runs automatically -curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \ - sudo bash -s -- --standalone --pulse-server http://your-pulse:7655 - -# Verify -pulse-sensor-proxy config validate -sudo systemctl status pulse-sensor-proxy -``` - -**Manual migration** (if needed): -```bash -# 1. Stop service -sudo systemctl stop pulse-sensor-proxy - -# 2. Extract allowed_nodes from config.yaml -grep -A 100 "^allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml > /tmp/nodes.txt - -# 3. Parse and add to allowed_nodes.yaml -# (Example for simple list - adjust for your format) -pulse-sensor-proxy config set-allowed-nodes --replace \ - --merge node1.local \ - --merge node2.local - -# 4. Remove allowed_nodes from config.yaml -# Edit manually or use sed: -sudo sed -i '/^allowed_nodes:/,/^[a-z_]/d' /etc/pulse-sensor-proxy/config.yaml - -# 5. Add reference to allowed_nodes.yaml -echo "allowed_nodes_file: /etc/pulse-sensor-proxy/allowed_nodes.yaml" | \ - sudo tee -a /etc/pulse-sensor-proxy/config.yaml - -# 6. Validate -pulse-sensor-proxy config validate - -# 7. Start service -sudo systemctl start pulse-sensor-proxy -``` - -## Related Documentation - -- [Temperature Monitoring](../TEMPERATURE_MONITORING.md) - Setup and troubleshooting -- [Sensor Proxy README](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Complete CLI reference -- [Audit Log Rotation](audit-log-rotation.md) - Managing append-only logs -- [Temperature Monitoring Security](../TEMPERATURE_MONITORING_SECURITY.md) - Security architecture - -## Support - -If config management issues persist after following this guide: - -1. Collect diagnostics: - ```bash - pulse-sensor-proxy config validate 2>&1 > /tmp/validate.log - sudo systemctl status pulse-sensor-proxy > /tmp/status.log - journalctl -u pulse-sensor-proxy -n 200 > /tmp/journal.log - grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml > /tmp/grep.log - ``` - -2. File an issue at https://github.com/rcourtman/Pulse/issues - -3. Include: - - Pulse version - - Sensor proxy version (`pulse-sensor-proxy --version`) - - Output from diagnostic commands above - - Steps that led to the issue diff --git a/docs/operations/sensor-proxy-log-forwarding.md b/docs/operations/sensor-proxy-log-forwarding.md deleted file mode 100644 index 143556621..000000000 --- a/docs/operations/sensor-proxy-log-forwarding.md +++ /dev/null @@ -1,73 +0,0 @@ -# Sensor Proxy Log Forwarding - -Forward `pulse-sensor-proxy` logs to a central syslog/SIEM endpoint so audit -records survive host loss and can drive alerting. Pulse ships a helper script -(`scripts/setup-log-forwarding.sh`) that configures rsyslog to ship both -`audit.log` and `proxy.log` over RELP + TLS. - -## Requirements - -- Debian/Ubuntu host with **rsyslog** and the `imfile` + `omrelp` modules (present - by default). -- Root privileges to install certificates and restart rsyslog. -- TLS assets for the RELP connection: - - `ca.crt` – CA that issued the remote collector certificate. - - `client.crt` / `client.key` – mTLS credentials for this host. 
-- Network access to the remote collector (`REMOTE_HOST`, default `logs.pulse.example`, - port `6514`). - -## Installation Steps - -1. Copy your CA and client certificates into a safe directory on the host (the - script defaults to `/etc/pulse/log-forwarding`). -2. Run the helper with environment overrides for your collector: - ```bash - sudo REMOTE_HOST=logs.company.tld \ - REMOTE_PORT=6514 \ - CERT_DIR=/etc/pulse/log-forwarding \ - CA_CERT=/etc/pulse/log-forwarding/ca.crt \ - CLIENT_CERT=/etc/pulse/log-forwarding/pulse.crt \ - CLIENT_KEY=/etc/pulse/log-forwarding/pulse.key \ - /opt/pulse/scripts/setup-log-forwarding.sh - ``` - The script writes `/etc/rsyslog.d/pulse-sensor-proxy.conf`, ensures the - certificate directory exists (`0750`), and restarts rsyslog. - -## What the Script Configures - -- Two `imfile` inputs that watch `/var/log/pulse/sensor-proxy/audit.log` and - `/var/log/pulse/sensor-proxy/proxy.log` with `Tag`s `pulse.audit` and - `pulse.app`. -- A local mirror file at `/var/log/pulse/sensor-proxy/forwarding.log` so you can - inspect rsyslog activity. -- An RELP action with TLS, infinite retry (`action.resumeRetryCount=-1`), and a - 50k message disk-backed queue to absorb collector outages. - -## Verification Checklist - -1. Confirm rsyslog picked up the new config: - ```bash - sudo rsyslogd -N1 - sudo systemctl status rsyslog --no-pager - ``` -2. Tail the local mirror to ensure entries stream through: - ```bash - sudo tail -f /var/log/pulse/sensor-proxy/forwarding.log - ``` -3. On the collector side, filter for the `pulse.audit` tag and make sure new - entries arrive. For Splunk/ELK, index on `programname`. -4. Simulate a test event (e.g., restart `pulse-sensor-proxy` or deny a fake peer) - and verify it appears remotely. - -## Maintenance - -- **Certificate rotation**: Replace the key/cert files, then restart rsyslog. - Because the config points at static paths, no additional edits are required. -- **Disable forwarding**: Remove `/etc/rsyslog.d/pulse-sensor-proxy.conf` and run - `sudo systemctl restart rsyslog`. The local audit log remains untouched. -- **Queue monitoring**: Track rsyslog’s main log or use `rsyslogd -N6` to check - for queue overflows. At scale, scrape `/var/log/pulse/sensor-proxy/forwarding.log` - for `action resumed` messages. - -For rotation guidance on the underlying audit file, see -[operations/audit-log-rotation.md](audit-log-rotation.md). diff --git a/docs/security/SENSOR_PROXY_APPARMOR.md b/docs/security/SENSOR_PROXY_APPARMOR.md new file mode 100644 index 000000000..44fb5a5bf --- /dev/null +++ b/docs/security/SENSOR_PROXY_APPARMOR.md @@ -0,0 +1,39 @@ +# 🛡️ Sensor Proxy Hardening + +Secure `pulse-sensor-proxy` with AppArmor and Seccomp. + +## 🛡️ AppArmor + +Profile: `security/apparmor/pulse-sensor-proxy.apparmor` +* **Allows**: Configs, logs, SSH keys, outbound TCP/SSH. +* **Blocks**: Raw sockets, module loading, ptrace, exec outside allowlist. + +### Install & Enforce +```bash +sudo install -m 0644 security/apparmor/pulse-sensor-proxy.apparmor /etc/apparmor.d/pulse-sensor-proxy +sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy +sudo aa-enforce pulse-sensor-proxy +``` + +## 🔒 Seccomp + +Profile: `security/seccomp/pulse-sensor-proxy.json` +* **Allows**: Go runtime syscalls, network, file IO. +* **Blocks**: Everything else (returns `EPERM`). 
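After installing the AppArmor profile above and applying the seccomp filter as shown below, it is worth confirming both actually attach to the running process. A quick verification sketch, assuming the default service name and an AppArmor-enabled kernel:

```bash
pid=$(systemctl show -p MainPID --value pulse-sensor-proxy)
sudo cat "/proc/$pid/attr/current"     # expect: pulse-sensor-proxy (enforce)
grep '^Seccomp:' "/proc/$pid/status"   # expect: Seccomp: 2 (a seccomp filter is active)
```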
+ +### Systemd (Classic) +Add to service override: +```ini +[Service] +AppArmorProfile=pulse-sensor-proxy +SystemCallFilter=@system-service +SystemCallAllow=accept;connect;recvfrom;sendto;recvmsg;sendmsg;sendmmsg;getsockname;getpeername;getsockopt;setsockopt;shutdown +``` + +### Containers (Docker/Podman) +```bash +podman run --seccomp-profile /opt/pulse/security/seccomp/pulse-sensor-proxy.json ... +``` + +## 🔍 Verification +Check status with `aa-status` or `journalctl -t auditbeat`. diff --git a/docs/security/SENSOR_PROXY_HARDENING.md b/docs/security/SENSOR_PROXY_HARDENING.md new file mode 100644 index 000000000..c03f99c2a --- /dev/null +++ b/docs/security/SENSOR_PROXY_HARDENING.md @@ -0,0 +1,57 @@ +# 🛡️ Sensor Proxy Hardening + +The `pulse-sensor-proxy` runs on the host to securely collect temperatures, keeping SSH keys out of containers. + +## 🏗️ Architecture +* **Host**: Runs `pulse-sensor-proxy` (unprivileged user). +* **Container**: Connects via Unix socket (`/run/pulse-sensor-proxy/pulse-sensor-proxy.sock`). +* **Auth**: Uses `SO_PEERCRED` to verify container UID/PID. + +## 🔒 Host Hardening + +### Service Account +Runs as `pulse-sensor-proxy` (no shell, no home). +```bash +id pulse-sensor-proxy # uid=XXX(pulse-sensor-proxy) +``` + +### Systemd Security +The service unit uses: +* `User=pulse-sensor-proxy` +* `NoNewPrivileges=true` +* `ProtectSystem=strict` +* `PrivateTmp=true` + +### File Permissions +| Path | Owner | Mode | +| :--- | :--- | :--- | +| `/var/lib/pulse-sensor-proxy/` | `pulse-sensor-proxy` | `0750` | +| `/var/lib/pulse-sensor-proxy/ssh/` | `pulse-sensor-proxy` | `0700` | +| `/run/pulse-sensor-proxy/` | `pulse-sensor-proxy` | `0775` | + +## 📦 LXC Configuration +Required for the container to access the proxy socket. + +**`/etc/pve/lxc/.conf`**: +```ini +unprivileged: 1 +lxc.apparmor.profile: generated +lxc.mount.entry: /run/pulse-sensor-proxy mnt/pulse-proxy none bind,create=dir 0 0 +``` + +## 🔑 Key Management +SSH keys are restricted to `sensors -j` only. + +**Rotation**: +```bash +/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh +``` +* **Dry Run**: Add `--dry-run`. +* **Rollback**: Add `--rollback`. + +## 🚨 Incident Response +If compromised: +1. **Stop Proxy**: `systemctl stop pulse-sensor-proxy`. +2. **Rotate Keys**: Remove old keys from nodes manually or use `pulse-sensor-proxy-rotate-keys.sh`. +3. **Audit Logs**: Check `journalctl -u pulse-sensor-proxy`. +4. **Reinstall**: Run `/opt/pulse/scripts/install-sensor-proxy.sh`. diff --git a/docs/security/SENSOR_PROXY_NETWORK.md b/docs/security/SENSOR_PROXY_NETWORK.md new file mode 100644 index 000000000..55c6139dd --- /dev/null +++ b/docs/security/SENSOR_PROXY_NETWORK.md @@ -0,0 +1,35 @@ +# 🌐 Sensor Proxy Network Segmentation + +Isolate the proxy to prevent lateral movement. + +## 🚧 Zones +* **Pulse App**: Connects to Proxy via Unix socket (local). +* **Sensor Proxy**: Outbound SSH to Proxmox nodes only. +* **Proxmox Nodes**: Accept SSH from Proxy. +* **Logging**: Accepts RELP/TLS from Proxy. 
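Before codifying the rules below, it can help to see what the proxy host actually talks to. A quick audit sketch using `ss` from iproute2; expect only SSH to the Proxmox nodes and RELP to the log collector:

```bash
pid=$(systemctl show -p MainPID --value pulse-sensor-proxy)
sudo ss -tnp state established | grep "pid=$pid," || echo "no established connections right now"
```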
+ +## 🛡️ Firewall Rules + +| Source | Dest | Port | Purpose | Action | +| :--- | :--- | :--- | :--- | :--- | +| **Pulse App** | Proxy | `unix` | RPC Requests | **Allow** (Local) | +| **Proxy** | Nodes | `22` | SSH (sensors) | **Allow** | +| **Proxy** | Logs | `6514` | Audit Logs | **Allow** | +| **Any** | Proxy | `22` | SSH Access | **Deny** (Use Bastion) | +| **Proxy** | Internet | `any` | Outbound | **Deny** | + +## 🔧 Implementation (iptables) +```bash +# Allow SSH to Proxmox +iptables -A OUTPUT -p tcp -d --dport 22 -j ACCEPT + +# Allow Log Forwarding +iptables -A OUTPUT -p tcp -d --dport 6514 -j ACCEPT + +# Drop all other outbound +iptables -P OUTPUT DROP +``` + +## 🚨 Monitoring +* Alert on outbound connections to non-whitelisted IPs. +* Monitor `pulse_proxy_limiter_rejects_total` for abuse. diff --git a/docs/security/TEMPERATURE_MONITORING.md b/docs/security/TEMPERATURE_MONITORING.md new file mode 100644 index 000000000..760419c9f --- /dev/null +++ b/docs/security/TEMPERATURE_MONITORING.md @@ -0,0 +1,31 @@ +# 🌡️ Temperature Monitoring Security + +Secure architecture for collecting hardware temperatures. + +## 🛡️ Security Model +* **Isolation**: SSH keys live on the host, not in the container. +* **Least Privilege**: Proxy runs as `pulse-sensor-proxy` (no shell). +* **Verification**: Container identity verified via `SO_PEERCRED`. + +## 🏗️ Components +1. **Pulse Backend**: Connects to Unix socket `/mnt/pulse-proxy/pulse-sensor-proxy.sock`. +2. **Sensor Proxy**: Validates request, executes SSH to node. +3. **Target Node**: Accepts SSH key restricted to `sensors -j`. + +## 🔒 Key Restrictions +SSH keys deployed to nodes are locked down: +``` +command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty +``` + +## 🚦 Rate Limiting +* **Per Peer**: ~12 req/min. +* **Concurrency**: Max 2 parallel requests per peer. +* **Global**: Max 8 concurrent requests. + +## 📝 Auditing +All requests logged to system journal: +```bash +journalctl -u pulse-sensor-proxy +``` +Logs include: `uid`, `pid`, `method`, `node`, `correlation_id`. diff --git a/docs/security/pulse-sensor-proxy-hardening.md b/docs/security/pulse-sensor-proxy-hardening.md deleted file mode 100644 index 8d1ed97dc..000000000 --- a/docs/security/pulse-sensor-proxy-hardening.md +++ /dev/null @@ -1,52 +0,0 @@ -# Pulse Sensor Proxy AppArmor & Seccomp Hardening - -## AppArmor Profile -- Profile path: `security/apparmor/pulse-sensor-proxy.apparmor` -- Grants read-only access to configs, logs, SSH keys, and binaries; allows outbound TCP/SSH; blocks raw sockets, module loading, ptrace, and absolute command execution outside the allowlist. - -### Installation -```bash -sudo install -m 0644 security/apparmor/pulse-sensor-proxy.apparmor /etc/apparmor.d/pulse-sensor-proxy -sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy -sudo ln -sf /etc/apparmor.d/pulse-sensor-proxy /etc/apparmor.d/force-complain/pulse-sensor-proxy # optional staged mode -sudo systemctl restart apparmor -``` - -### Enforce Mode -```bash -sudo aa-enforce pulse-sensor-proxy -``` -Monitor `/var/log/syslog` for `DENIED` events and update the profile as needed. - -## Seccomp Filter -- OCI-style profile: `security/seccomp/pulse-sensor-proxy.json` -- Allows standard Go runtime syscalls, network operations, file IO, and `execve` for whitelisted helpers; other syscalls return `EPERM`. 
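While the profile runs in enforce mode, follow denials live so the allowlist can be extended instead of silently breaking sensor collection. A sketch, assuming AppArmor audit messages land in the kernel ring buffer/journal (adjust if you ship them through auditd):

```bash
sudo journalctl -k -f | grep --line-buffered 'apparmor="DENIED".*pulse-sensor-proxy'
```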
- -### Apply via systemd (classic service) -Add to the override: -```ini -[Service] -AppArmorProfile=pulse-sensor-proxy -RestrictNamespaces=yes -NoNewPrivileges=yes -SystemCallFilter=@system-service -SystemCallArchitectures=native -SystemCallAllow=accept;connect;recvfrom;sendto;recvmsg;sendmsg;sendmmsg;getsockname;getpeername;getsockopt;setsockopt;shutdown -``` - -Reload and restart: -```bash -sudo systemctl daemon-reload -sudo systemctl restart pulse-sensor-proxy -``` - -### Apply seccomp JSON (containerised deployments) -- Profile: `security/seccomp/pulse-sensor-proxy.json` -- Use with Podman/Docker style runtimes: -```bash -podman run --seccomp-profile /opt/pulse/security/seccomp/pulse-sensor-proxy.json ... -``` - -## Operational Notes -- Use `journalctl -t auditbeat -g pulse-sensor-proxy` or `aa-status` to confirm profile status. -- Pair with network ACLs (see `docs/security/pulse-sensor-proxy-network.md`) and log shipping via [`scripts/setup-log-forwarding.sh` + the RELP runbook](../operations/sensor-proxy-log-forwarding.md). diff --git a/docs/security/pulse-sensor-proxy-network.md b/docs/security/pulse-sensor-proxy-network.md deleted file mode 100644 index 4218fc668..000000000 --- a/docs/security/pulse-sensor-proxy-network.md +++ /dev/null @@ -1,64 +0,0 @@ -# Pulse Sensor Proxy Network Segmentation - -## Overview -- **Proxy host** collects temperatures via SSH from Proxmox nodes and serves a Unix socket to the Pulse stack. -- Goals: isolate the proxy from production hypervisors, prevent lateral movement, and ensure log forwarding/audit channels remain available. - -## Zones & Connectivity -- **Pulse Application Zone (AZ-Pulse)** - - Hosts Pulse backend/frontend containers. - - Allowed to reach the proxy over Unix socket (local) or loopback if containerised via `socat`. -- **Sensor Proxy Zone (AZ-Sensor)** - - Dedicated VM/bare-metal host running `pulse-sensor-proxy`. - - Maintains outbound SSH to Proxmox management interfaces only. -- **Proxmox Management Zone (AZ-Proxmox)** - - Hypervisors / BMCs reachable on `tcp/22` (SSH) and optional IPMI UDP. -- **Logging/Monitoring Zone (AZ-Logging)** - - Receives forwarded audit/application logs (e.g. RELP/TLS on `tcp/6514`). - - Exposes Prometheus scrape port (default `tcp/9127`) if remote monitoring required. - -## Recommended Firewall Rules - -| Source Zone | Destination Zone | Protocol/Port | Purpose | Action | -|-------------|------------------|---------------|---------|--------| -| AZ-Pulse (localhost) | AZ-Sensor (Unix socket) | `unix` | RPC requests from Pulse | Allow (local only) | -| AZ-Sensor | AZ-Proxmox nodes | `tcp/22` | SSH for sensors/ipmitool wrapper | Allow (restricted to node list) | -| AZ-Sensor | AZ-Proxmox BMC | `udp/623` *(optional)* | IPMI if required for temperature data | Allow if needed | -| AZ-Proxmox | AZ-Sensor | `any` | Return SSH traffic | Allow stateful | -| AZ-Sensor | AZ-Logging | `tcp/6514` (TLS RELP) | Audit/application log forwarding | Allow | -| AZ-Logging | AZ-Sensor | `tcp/9127` *(optional)* | Prometheus scrape of proxy metrics | Allow if scraping remotely | -| Any | AZ-Sensor | `tcp/22` | Shell/SSH access | Deny (use management bastion) | -| AZ-Sensor | Internet | `any` | Outbound Internet | Deny (except package mirrors via proxy if required) | - -## Implementation Steps -1. Place proxy host in dedicated subnet/VLAN with ACLs enforcing the table above. -2. Populate `/etc/hosts` or routing so proxy resolves Proxmox nodes to management IPs only (no public networks). -3. 
Configure iptables/nftables on the proxy (replace the `<...>` placeholders with your management subnet, log collector, and Prometheus server):
-   ```bash
-   # Allow SSH to Proxmox nodes
-   iptables -A OUTPUT -p tcp -d <proxmox-subnet>/24 --dport 22 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
-   iptables -A INPUT -p tcp -s <proxmox-subnet>/24 --sport 22 -m conntrack --ctstate ESTABLISHED -j ACCEPT
-
-   # Allow log forwarding
-   iptables -A OUTPUT -p tcp -d <log-collector> --dport 6514 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
-   iptables -A INPUT -p tcp -s <log-collector> --sport 6514 -m conntrack --ctstate ESTABLISHED -j ACCEPT
-
-   # (Optional) allow Prometheus scrape of proxy metrics
-   iptables -A INPUT -p tcp -s <prometheus-server> --dport 9127 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
-   iptables -A OUTPUT -p tcp -d <prometheus-server> --sport 9127 -m conntrack --ctstate ESTABLISHED -j ACCEPT
-
-   # Drop everything else
-   iptables -P OUTPUT DROP
-   iptables -P INPUT DROP
-   ```
-4. Deny inbound SSH to the proxy except via the management bastion: block `tcp/22` entirely or whitelist bastion IPs only.
-5. Ensure log-forwarding TLS certificates are rotated regularly and stored under `/etc/pulse/log-forwarding`.
-
-## Monitoring & Alerting
-- Alert if the proxy initiates connections outside the permitted subnets (NetFlow or host firewall counters).
-- Monitor `pulse_proxy_limiter_*` metrics for unusual rate-limit hits that might signal abuse.
-- Track audit-log forwarding queue depth and remote collector availability; the rsyslog action queue already retries indefinitely (`action.resumeRetryCount=-1`), so alert on sustained queue growth rather than on individual retries.
-
-## Change Management
-- Document node IP changes and update firewall objects (`PROXMOX_NODES`) before redeploying certificates.
-- Capture segmentation in infrastructure-as-code (e.g. Terraform/security-group definitions) to avoid drift.
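Step 3 above shows the iptables variant; hosts that have moved to nftables can load a roughly equivalent ruleset instead. A sketch using the same placeholders (`<proxmox-subnet>`, `<log-collector>`, `<prometheus-server>`), which must be replaced before applying:

```bash
sudo nft -f - <<'EOF'
table inet pulse_proxy {
  chain output {
    type filter hook output priority 0; policy drop;
    ct state established,related accept
    oif "lo" accept
    ip daddr <proxmox-subnet> tcp dport 22 accept       # SSH to Proxmox nodes
    ip daddr <log-collector> tcp dport 6514 accept      # RELP/TLS log forwarding
  }
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    ip saddr <prometheus-server> tcp dport 9127 accept  # optional metrics scrape
  }
}
EOF
```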