refactor: finalize documentation overhaul

- Refactor specialized docs for conciseness and clarity
- Rename files to UPPER_CASE.md convention
- Verify accuracy against codebase
- Fix broken links
courtmanr@gmail.com
2025-11-25 00:45:20 +00:00
parent 8464a69abe
commit fd39196166
25 changed files with 492 additions and 3557 deletions

View File

@@ -1,325 +0,0 @@
# Pulse
[![GitHub release](https://img.shields.io/github/v/release/rcourtman/Pulse)](https://github.com/rcourtman/Pulse/releases/latest)
[![Docker Pulls](https://img.shields.io/docker/pulls/rcourtman/pulse)](https://hub.docker.com/r/rcourtman/pulse)
[![License](https://img.shields.io/github/license/rcourtman/Pulse)](https://github.com/rcourtman/Pulse/blob/main/LICENSE)
**Real-time monitoring for Proxmox VE, Proxmox Mail Gateway, PBS, and Docker infrastructure with alerts and webhooks.**
Monitor your hybrid Proxmox and Docker estate from a single dashboard. Get instant alerts when nodes go down, containers misbehave, backups fail, or storage fills up. Supports email, Discord, Slack, Telegram, and more.
**[Try the live demo →](https://demo.pulserelay.pro)** (read-only with mock data)
## Support Pulse Development
Pulse is built by a solo developer in evenings and weekends. Your support helps:
- Keep me motivated to add new features
- Prioritize bug fixes and user requests
- Ensure Pulse stays 100% free and open-source forever
[![GitHub Sponsors](https://img.shields.io/github/sponsors/rcourtman?style=social&label=Sponsor)](https://github.com/sponsors/rcourtman)
[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/rcourtman)
**Not ready to sponsor?** Star the project or share it with your homelab community!
## Features
- **Auto-Discovery**: Finds Proxmox nodes on your network, one-liner setup via generated scripts
- **Cluster Support**: Configure one node, monitor entire cluster
- **Enterprise Security**:
- Credentials encrypted at rest, masked in logs, never sent to frontend
- CSRF protection for all state-changing operations
- Rate limiting (500 req/min general, 10 attempts/min for auth)
- Account lockout after failed login attempts
- Secure session management with HttpOnly cookies
- bcrypt password hashing (cost 12) - passwords NEVER stored in plain text
- API tokens stored securely with restricted file permissions
- Security headers (CSP, X-Frame-Options, etc.)
- Comprehensive audit logging
- Live monitoring of VMs, containers, nodes, storage
- **Smart Alerts**: Email and webhooks (Discord, Slack, Telegram, Teams, ntfy.sh, Gotify)
- Example: "VM 'webserver' is down on node 'pve1'"
- Example: "Storage 'local-lvm' at 85% capacity"
- Example: "VM 'database' is back online"
- **Adaptive Thresholds**: Hysteresis-based trigger/clear levels, fractional network thresholds, per-metric search, reset-to-defaults, and Custom overrides with inline audit trail
- **Alert Timeline Analytics**: Rich history explorer with acknowledgement/clear markers, escalation breadcrumbs, and quick filters for noisy resources
- **Ceph Awareness**: Surface Ceph health, pool utilisation, and daemon status automatically when Proxmox exposes Ceph-backed storage
- Unified view of PBS backups, PVE backups, and snapshots
- **Interactive Backup Explorer**: Cross-highlighted bar chart + grid with quick time-range pivots (24h/7d/30d/custom) and contextual tooltips for the busiest jobs
- Proxmox Mail Gateway analytics: mail volume, spam/virus trends, quarantine health, and cluster node status
- Optional Docker container monitoring via lightweight agent
- Config export/import with encryption and authentication
- Automatic stable updates with safe rollback (opt-in)
- Runtime logging controls (switch level/format or mirror to file without downtime)
- Update history with rollback guidance captured in the UI
- Dark/light themes, responsive design
- Built with Go for minimal resource usage
[View screenshots and full documentation on GitHub →](https://github.com/rcourtman/Pulse)
## Privacy
**Pulse respects your privacy:**
- No telemetry or analytics collection
- No phone-home functionality
- No external API calls (except for configured webhooks)
- All data stays on your server
- Open source - verify it yourself
Your infrastructure data is yours alone.
## Quick Start with Docker
### Basic Setup
```bash
docker run -d \
--name pulse \
-p 7655:7655 \
-v pulse_data:/data \
--restart unless-stopped \
rcourtman/pulse:latest
```
Then open `http://localhost:7655` and complete the security setup wizard.
### Network Discovery
Pulse automatically discovers Proxmox nodes on your network! By default, it scans:
- 192.168.0.0/16 (home networks)
- 10.0.0.0/8 (private networks)
- 172.16.0.0/12 (Docker/internal networks)
To scan a custom subnet instead:
```bash
docker run -d \
--name pulse \
-p 7655:7655 \
-v pulse_data:/data \
-e DISCOVERY_SUBNET="192.168.50.0/24" \
--restart unless-stopped \
rcourtman/pulse:latest
```
### Automated Deployment with Pre-configured Auth
```bash
# Deploy with authentication pre-configured
docker run -d \
--name pulse \
-p 7655:7655 \
-v pulse_data:/data \
-e API_TOKENS="ansible-token,docker-agent-token" \
-e PULSE_AUTH_USER="admin" \
-e PULSE_AUTH_PASS="your-password" \
--restart unless-stopped \
rcourtman/pulse:latest
# Plain text credentials are automatically hashed for security
# No setup required - API works immediately
```
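To confirm the pre-configured credentials work, you can hit an authenticated endpoint with one of the tokens from the command above (a quick sketch; the scheduler health endpoint documented in this repo accepts bearer tokens):
```bash
# Verify API token auth immediately after the container starts
curl -s -H "Authorization: Bearer ansible-token" \
  http://localhost:7655/api/monitoring/scheduler/health | jq '.enabled'
```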
### Docker Compose
```yaml
services:
pulse:
image: rcourtman/pulse:latest
container_name: pulse
ports:
- "7655:7655"
volumes:
- pulse_data:/data
environment:
# NOTE: Env vars override UI settings. Remove env var to allow UI configuration.
# Network discovery (usually not needed - auto-scans common networks)
# - DISCOVERY_SUBNET=192.168.50.0/24 # Only for non-standard networks
# Ports
# - PORT=7655 # Backend port (default: 7655)
# - FRONTEND_PORT=7655 # Frontend port (default: 7655)
# Security (all optional - runs open by default)
# - PULSE_AUTH_USER=admin # Username for web UI login
# - PULSE_AUTH_PASS=your-password # Plain text or bcrypt hash (auto-hashed if plain)
# - API_TOKENS=token-a,token-b # Comma-separated tokens (plain or SHA3-256 hashed)
# - API_TOKEN=legacy-token # Optional single-token fallback
# - ALLOW_UNPROTECTED_EXPORT=false # Allow export without auth (default: false)
# Security: Plain text credentials are automatically hashed
# You can provide either:
# 1. Plain text (auto-hashed): PULSE_AUTH_PASS=mypassword
# 2. Pre-hashed (advanced): PULSE_AUTH_PASS='$$2a$$12$$...'
# Note: Escape $ as $$ in docker-compose.yml for pre-hashed values
# Performance
# - CONNECTION_TIMEOUT=10 # Connection timeout in seconds (default: 10)
# CORS & logging
# - ALLOWED_ORIGINS=https://app.example.com # CORS origins (default: none, same-origin only)
# - LOG_LEVEL=info # Log level: debug/info/warn/error (default: info)
# - LOG_FORMAT=auto # auto | json | console (default: auto)
# - LOG_FILE=/data/pulse.log # Optional mirrored logfile inside container
# - LOG_MAX_SIZE=100 # Rotate logfile after N MB
# - LOG_MAX_AGE=30 # Retain rotated logs for N days
# - LOG_COMPRESS=true # Compress rotated logs
restart: unless-stopped
volumes:
pulse_data:
```
### Updating & Rollbacks (v4.24.0+)
```bash
# Update to the latest tagged image
docker pull rcourtman/pulse:latest
docker stop pulse && docker rm pulse
docker run -d --name pulse \
-p 7655:7655 -v pulse_data:/data \
--restart unless-stopped \
rcourtman/pulse:latest
```
- Every upgrade is logged in **Settings → System → Updates** with an `event_id` for change tracking.
- Need to revert? Redeploy the previous tag (for example `rcourtman/pulse:v4.23.2`). Record the rollback reason in your change notes and double-check `/api/monitoring/scheduler/health` once the container is back online.
## Initial Setup
1. Open `http://<your-server>:7655`
2. **Complete the mandatory security setup** (first-time only)
3. Create your admin username and password
4. Use **Settings → Security → API tokens** to issue dedicated tokens for automation (one token per integration makes revocation painless)
## Configure Proxmox/PBS Nodes
After logging in:
1. Go to Settings → Nodes
2. Discovered nodes appear automatically
3. Click "Setup Script" next to any node
4. Click "Generate Setup Code" button (creates a 6-character code valid for 5 minutes)
5. Copy and run the provided one-liner on your Proxmox/PBS host
6. Node is configured and monitoring starts automatically
**Example setup command:**
```bash
curl -sSL "http://pulse:7655/api/setup-script?type=pve&host=https://pve:8006&auth_token=ABC123" | bash
```
## Docker Updates
```bash
# Latest stable
docker pull rcourtman/pulse:latest
# Latest RC/pre-release
docker pull rcourtman/pulse:rc
# Specific version
docker pull rcourtman/pulse:v4.22.0
# Then recreate your container
docker stop pulse && docker rm pulse
# Run your docker run or docker-compose command again
```
## Security
- **Authentication required** - Protects your Proxmox infrastructure credentials
- **Quick setup wizard** - Secure your installation in under a minute
- **Multiple auth methods**: Password authentication, API tokens, proxy auth (SSO), or combinations
- **Proxy/SSO support** - Integrate with Authentik, Authelia, and other authentication proxies
- **Enterprise-grade protection**:
- Credentials encrypted at rest (AES-256-GCM)
- CSRF tokens for state-changing operations
- Rate limiting and account lockout protection
- Secure session management with HttpOnly cookies
- bcrypt password hashing (cost 12) - passwords NEVER stored in plain text
- API tokens stored securely with restricted file permissions
- Security headers (CSP, X-Frame-Options, etc.)
- Comprehensive audit logging
- **Security by design**:
- Frontend never receives node credentials
- API tokens visible only to authenticated users
- Export/import requires authentication when configured
See [Security Documentation](https://github.com/rcourtman/Pulse/blob/main/docs/SECURITY.md) for details.
## HTTPS/TLS Configuration
Enable HTTPS by setting these environment variables:
```bash
docker run -d -p 7655:7655 \
-e HTTPS_ENABLED=true \
-e TLS_CERT_FILE=/data/certs/cert.pem \
-e TLS_KEY_FILE=/data/certs/key.pem \
-v pulse_data:/data \
-v /path/to/certs:/data/certs:ro \
rcourtman/pulse:latest
```
## Troubleshooting
### Authentication Issues
#### Cannot login after setting up security
- **Docker**: Ensure bcrypt hash is exactly 60 characters and wrapped in single quotes
- **Docker Compose**: MUST escape $ characters as $$ (e.g., `$$2a$$12$$...`)
- **Example (docker run)**: `PULSE_AUTH_PASS='$2a$12$YTZXOCEylj4TaevZ0DCeI.notayQZ..b0OZ97lUZ.Q24fljLiMQHK'`
- **Example (docker-compose.yml)**: `PULSE_AUTH_PASS='$$2a$$12$$YTZXOCEylj4TaevZ0DCeI.notayQZ..b0OZ97lUZ.Q24fljLiMQHK'`
- If hash is truncated or mangled, authentication will fail
- Use Quick Security Setup in the UI to avoid manual configuration errors
#### .env file not created (Docker)
- **Expected behavior**: When using environment variables, no .env file is created in /data
- The .env file is only created when using Quick Security Setup or password changes
- If you provide credentials via environment variables, they take precedence
- To use Quick Security Setup: Start container WITHOUT auth environment variables
### VM Disk Stats Show "-"
- VMs require QEMU Guest Agent to report disk usage (Proxmox API returns 0 for VMs)
- Install guest agent in VM: `apt install qemu-guest-agent` (Linux) or virtio-win tools (Windows)
- Enable in VM Options → QEMU Guest Agent, then restart VM
- Container (LXC) disk stats always work (no guest agent needed)
### Connection Issues
- Check Proxmox API is accessible (port 8006/8007)
- Verify credentials have PVEAuditor role plus VM.GuestAgent.Audit (PVE 9) or VM.Monitor (PVE 8); the setup script applies these via the PulseMonitor role (adds Sys.Audit when available)
- For PBS: ensure API token has Datastore.Audit permission
### Logs
```bash
# View logs
docker logs pulse
# Follow logs
docker logs -f pulse
```
## Documentation
Full documentation available on GitHub:
- [Complete Installation Guide](https://github.com/rcourtman/Pulse/blob/main/docs/INSTALL.md)
- [Configuration Guide](https://github.com/rcourtman/Pulse/blob/main/docs/CONFIGURATION.md)
- [VM Disk Monitoring](https://github.com/rcourtman/Pulse/blob/main/docs/VM_DISK_MONITORING.md) - Set up QEMU Guest Agent for accurate VM disk usage
- [Troubleshooting](https://github.com/rcourtman/Pulse/blob/main/docs/TROUBLESHOOTING.md)
- [API Reference](https://github.com/rcourtman/Pulse/blob/main/docs/API.md)
- [Webhook Guide](https://github.com/rcourtman/Pulse/blob/main/docs/WEBHOOKS.md)
- [Proxy Authentication](https://github.com/rcourtman/Pulse/blob/main/docs/PROXY_AUTH.md) - SSO integration with Authentik, Authelia, etc.
- [Reverse Proxy Setup](https://github.com/rcourtman/Pulse/blob/main/docs/REVERSE_PROXY.md) - nginx, Caddy, Apache, Traefik configs
- [Security](https://github.com/rcourtman/Pulse/blob/main/docs/SECURITY.md)
- [FAQ](https://github.com/rcourtman/Pulse/blob/main/docs/FAQ.md)
## Links
- [GitHub Repository](https://github.com/rcourtman/Pulse)
- [Releases & Changelog](https://github.com/rcourtman/Pulse/releases)
- [Issues & Feature Requests](https://github.com/rcourtman/Pulse/issues)
- [Live Demo](https://demo.pulserelay.pro)
## License
MIT - See [LICENSE](https://github.com/rcourtman/Pulse/blob/main/LICENSE)

View File

@@ -1,127 +0,0 @@
# Port Configuration Guide
Pulse supports multiple ways to configure the frontend port (default: 7655).
> **Development tip:** The hot-reload workflow (`scripts/hot-dev.sh` or `make dev-hot`) loads `.env`, `.env.local`, and `.env.dev`. Set `FRONTEND_PORT` or `PULSE_DEV_API_PORT` there to run the backend on a different port while keeping the generated `curl` commands and Vite proxy in sync.
## Recommended Methods
### 1. During Installation (Easiest)
The installer prompts for the port. To skip the prompt, use:
```bash
FRONTEND_PORT=8080 curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/install.sh | bash
```
### 2. Using systemd override (For existing installations)
```bash
sudo systemctl edit pulse
```
Add these lines:
```ini
[Service]
Environment="FRONTEND_PORT=8080"
```
Then restart: `sudo systemctl restart pulse`
### 3. Using system.json (Alternative method)
Edit `/etc/pulse/system.json`:
```json
{
"frontendPort": 8080
}
```
Then restart: `sudo systemctl restart pulse`
### 4. Using environment variables (Docker)
For Docker deployments:
```bash
docker run -e FRONTEND_PORT=8080 -p 8080:8080 rcourtman/pulse:latest
```
## Priority Order
Pulse checks for port configuration in this order:
1. `FRONTEND_PORT` environment variable
2. `PORT` environment variable (legacy)
3. `frontendPort` in system.json
4. Default: 7655
Environment variables always override configuration files.
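For example, if both a systemd override and `system.json` are set, the environment variable wins (port values below are illustrative):
```bash
grep frontendPort /etc/pulse/system.json        # "frontendPort": 9000
sudo systemctl show pulse | grep FRONTEND_PORT  # Environment=FRONTEND_PORT=8080
sudo lsof -i :8080                              # Pulse is listening on 8080, not 9000
```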
## Why not .env?
The `/etc/pulse/.env` file is reserved exclusively for authentication credentials:
- `API_TOKENS` - One or more API authentication tokens (hashed)
- `API_TOKEN` - Legacy single API token (hashed)
- `PULSE_AUTH_USER` - Web UI username
- `PULSE_AUTH_PASS` - Web UI password (hashed)
Keeping application configuration separate from authentication credentials:
- Makes it clear what's a secret vs what's configuration
- Allows different permission models if needed
- Follows the principle of separation of concerns
- Makes it easier to backup/share configs without exposing credentials
## Service Name Variations
**Important:** Pulse uses different service names depending on the deployment environment:
- **Systemd (default):** `pulse.service` or `pulse-backend.service` (legacy)
- **Hot-dev scripts:** `pulse-hot-dev` (development only)
- **Kubernetes/Helm:** Deployment `pulse`, Service `pulse` (port configured via Helm values)
**To check the active service:**
```bash
# Systemd
systemctl list-units | grep pulse
systemctl status pulse
# Kubernetes
kubectl -n pulse get svc pulse
kubectl -n pulse get deploy pulse
```
## Change Tracking (v4.24.0+)
Port changes made via environment variables or `system.json` take effect after a restart. **v4.24.0 records configuration changes in update history**, which is useful for audit trails and troubleshooting.
**To view change history:**
```bash
# Via UI
# Navigate to Settings → System → Updates
# Via API
curl -s http://localhost:7655/api/updates/history | jq '.entries[] | {timestamp, action, status}'
```
## Troubleshooting
### Port not changing after configuration?
1. **Check which service name is in use:**
```bash
systemctl list-units | grep pulse
```
It might be `pulse` (default), `pulse-backend` (legacy), or `pulse-hot-dev` (dev environment) depending on your installation method.
2. **Verify the configuration is loaded:**
```bash
# Systemd
sudo systemctl show pulse | grep Environment
# Kubernetes
kubectl -n pulse get deploy pulse -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
```
3. **Check if another process is using the port:**
```bash
sudo lsof -i :8080
```
4. **Verify post-restart** (v4.24.0+):
```bash
# Check actual listening port
curl -s http://localhost:7655/api/version | jq
# Check update history for restart event
curl -s http://localhost:7655/api/updates/history?limit=5 | jq
```

File diff suppressed because it is too large

View File

@@ -20,7 +20,6 @@ Welcome to the Pulse documentation portal. Here you'll find everything you need
- **[Docker Guide](DOCKER.md)** Advanced Docker & Compose configurations.
- **[Kubernetes](KUBERNETES.md)** Helm charts, ingress, and HA setups.
- **[Reverse Proxy](REVERSE_PROXY.md)** Nginx, Caddy, Traefik, and Cloudflare Tunnel recipes.
- **[Port Configuration](PORT_CONFIGURATION.md)** Changing default ports.
- **[Troubleshooting](TROUBLESHOOTING.md)** Deep dive into common issues and logs.
## 🔐 Security

View File

@@ -324,7 +324,7 @@ journalctl -u pulse-sensor-proxy -f
```
Forward these logs off-host for retention by following
[operations/sensor-proxy-log-forwarding.md](operations/sensor-proxy-log-forwarding.md).
[operations/SENSOR_PROXY_LOGS.md](operations/SENSOR_PROXY_LOGS.md).
In the Pulse container, check the logs at startup:
```bash
@@ -718,7 +718,7 @@ pulse-sensor-proxy config set-allowed-nodes --replace --merge 192.168.0.1
- Installer uses CLI (no more shell/Python divergence)
**See also:**
- [Sensor Proxy Config Management Guide](operations/sensor-proxy-config-management.md) - Complete runbook
- [Sensor Proxy Config Management Guide](operations/SENSOR_PROXY_CONFIG.md) - Complete runbook
- [Sensor Proxy CLI Reference](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Full command documentation
## Control-Plane Sync & Migration

View File

@@ -1,499 +0,0 @@
# Temperature Monitoring Security Guide
This document describes the security architecture of Pulse's temperature monitoring system with pulse-sensor-proxy.
## Table of Contents
- [Architecture Overview](#architecture-overview)
- [Security Boundaries](#security-boundaries)
- [Authentication & Authorization](#authentication--authorization)
- [Rate Limiting](#rate-limiting)
- [SSH Security](#ssh-security)
- [Container Isolation](#container-isolation)
- [Monitoring & Alerting](#monitoring--alerting)
- [Development Mode](#development-mode)
- [Troubleshooting](#troubleshooting)
---
## Architecture Overview
```mermaid
graph TD
Container[Pulse Container]
Proxy[pulse-sensor-proxy<br/>Host Service]
Cluster[Cluster Nodes<br/>SSH sensors -j]
Container -->|Unix Socket<br/>Rate Limited| Proxy
Proxy -->|SSH<br/>Forced Command| Cluster
Cluster -->|Temperature JSON| Proxy
Proxy -->|Temperature JSON| Container
style Proxy fill:#e1f5e1
style Container fill:#fff4e1
style Cluster fill:#e1f0ff
```
**Key Principle**: SSH keys never enter containers. All SSH operations are performed by the host-side proxy.
---
## Security Boundaries
### 1. Host ↔ Container Boundary
- **Enforced by**: Method-level authorization + ID-mapped root detection
- **Container CAN**:
- ✅ Call `get_temperature` (read temperature data)
- ✅ Call `get_status` (check proxy health)
- **Container CANNOT**:
- ❌ Call `ensure_cluster_keys` (SSH key distribution)
- ❌ Call `register_nodes` (node discovery)
- ❌ Call `request_cleanup` (cleanup operations)
- ❌ Use direct SSH (blocked by container detection)
### 2. Proxy ↔ Cluster Nodes Boundary
- **Enforced by**: SSH forced commands + IP filtering
- **SSH authorized_keys entry**:
```bash
from="192.168.0.0/24",command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... pulse-sensor-proxy
```
- Proxy can ONLY run `sensors -j` on cluster nodes
- IP restrictions prevent lateral movement
### 3. Client ↔ Proxy Boundary
- **Enforced by**: UID-based ACL + adaptive rate limiting
- SO_PEERCRED verifies caller's UID/GID/PID
- Rate limiting (defaults): ~12 requests per minute per UID (burst 2), per-UID concurrency 2, global concurrency 8, 2s penalty on validation failures
- Per-node guard: only 1 SSH fetch per node at a time
---
## Authentication & Authorization
### Authentication (Who can connect?)
**Allowed UIDs**:
- Root (UID 0) - host processes
- Proxy's own UID (pulse-sensor-proxy user)
- Configured UIDs from `/etc/pulse-sensor-proxy/config.yaml`
- ID-mapped root ranges (containers, if enabled)
**ID-Mapped Root Detection**:
- Reads `/etc/subuid` and `/etc/subgid` for UID/GID mapping ranges
- Containers typically use ranges like `100000-165535`
- Both UID AND GID must be in mapped ranges
### Authorization (What can they call?)
**Privileged Methods** (host-only):
```go
var privilegedMethods = map[string]bool{
"ensure_cluster_keys": true, // SSH key distribution
"register_nodes": true, // Node registration
"request_cleanup": true, // Cleanup operations
}
```
**Authorization Check**:
```go
if privilegedMethods[method] && isIDMappedRoot(credentials) {
return "method requires host-level privileges"
}
```
**Read-Only Methods** (containers allowed):
- `get_temperature` - Fetch temperature data via proxy
- `get_status` - Check proxy health and version
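As a quick sanity check, the read-only path can be exercised from the host with the same socket call used in the troubleshooting section below (default socket path assumed):
```bash
# Read-only RPC over the proxy socket; no privileged methods involved
curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \
  -X POST -d '{"method":"get_status"}' | jq
```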
---
## Rate Limiting
### Per-Peer Limits (commit 46b8b8d)
- **Rate:** 1 request per second (`per_peer_interval_ms = 1000`)
- **Burst:** 5 requests (enough to sweep five nodes per polling window)
- **Per-peer concurrency:** Maximum 2 concurrent RPCs
- **Global concurrency:** 8 simultaneous RPCs across all peers
- **Penalty:** 2s enforced delay on validation failures (oversized payloads, unauthorized methods)
- **Cleanup:** Peer entries expire after 10 minutes of inactivity
### Configurable Overrides
Administrators can raise or lower thresholds via `/etc/pulse-sensor-proxy/config.yaml`:
```yaml
rate_limit:
per_peer_interval_ms: 500 # 2 rps
per_peer_burst: 10 # allow 10-node sweep
```
Security guidance:
- Keep `per_peer_interval_ms ≥ 100` in production; lower values expand the attack surface for noisy callers.
- Ensure UID/GID filters stay in place when increasing throughput, and continue to ship audit logs off-host.
- Monitor `pulse_proxy_limiter_penalties_total` alongside `pulse_proxy_limiter_rejects_total` to spot abusive or compromised clients.
### Per-Node Concurrency
- **Limit**: 1 concurrent SSH request per node
- **Purpose**: Prevents SSH connection storms
- **Scope**: Applies to all peers requesting same node
### Monitoring Rate Limits
```bash
# Check rate limit metrics
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_limiter_rejects_total
# Watch for rate limit warnings in logs
journalctl -u pulse-sensor-proxy -f | grep "Rate limit exceeded"
```
---
## SSH Security
### SSH Key Management
**Key Location**: `/var/lib/pulse-sensor-proxy/ssh/id_ed25519`
- **Owner**: `pulse-sensor-proxy:pulse-sensor-proxy`
- **Permissions**: `0600` (read/write for owner only)
- **Type**: Ed25519 (modern, secure)
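A quick way to confirm the key matches these expectations:
```bash
# Expect: -rw------- owned by pulse-sensor-proxy:pulse-sensor-proxy
ls -l /var/lib/pulse-sensor-proxy/ssh/id_ed25519
stat -c '%a %U:%G' /var/lib/pulse-sensor-proxy/ssh/id_ed25519   # 600 pulse-sensor-proxy:pulse-sensor-proxy
```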
**Key Distribution**:
- Only host processes can trigger distribution (via `ensure_cluster_keys`)
- Containers are blocked from key distribution operations
- Keys are distributed with forced commands and IP restrictions
### Forced Command Restrictions
**On cluster nodes**, the SSH key can ONLY run:
```bash
sensors -j
```
**No other commands possible**:
- ❌ Shell access denied (`no-pty`)
- ❌ Port forwarding disabled (`no-port-forwarding`)
- ❌ X11 forwarding disabled (`no-X11-forwarding`)
- ❌ Agent forwarding disabled (`no-agent-forwarding`)
### IP Filtering
**Source IP restrictions**:
```bash
from="192.168.0.0/24,10.0.0.0/8"
```
- Automatically detected from cluster node IPs
- Prevents SSH key use from outside the cluster
- Updated during key rotation
---
## Container Isolation
### Fallback SSH Protection
**In containers**, direct SSH is blocked:
```go
if system.InContainer() && !devModeAllowSSH {
log.Error().Msg("SECURITY BLOCK: SSH temperature collection disabled in containers")
return &Temperature{Available: false}, nil
}
```
**Container Detection Methods**:
1. `PULSE_FORCE_CONTAINER=1` override for explicit opt-in
2. Presence of `/.dockerenv` or `/run/.containerenv`
3. `container=` hints from environment variables
4. `/proc/1/environ` and `/proc/1/cgroup` markers (`docker`, `lxc`, `containerd`, `kubepods`, etc.)
**Bypass**: Only possible with explicit environment variable (see [Development Mode](#development-mode))
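To see which of these markers are present on a given system (a manual spot check of the signals listed above):
```bash
# Any hit below suggests Pulse will treat the environment as a container
ls /.dockerenv /run/.containerenv 2>/dev/null
grep -aE 'docker|lxc|containerd|kubepods' /proc/1/cgroup 2>/dev/null | head
tr '\0' '\n' < /proc/1/environ 2>/dev/null | grep '^container='
```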
### ID-Mapped Root Detection
**How it works**:
```go
// Check /etc/subuid and /etc/subgid for mapping ranges
// Example /etc/subuid:
// root:100000:65536
func isIDMappedRoot(cred *peerCredentials) bool {
return uidInRange(cred.uid, idMappedUIDRanges) &&
gidInRange(cred.gid, idMappedGIDRanges)
}
```
**Why both UID and GID?**:
- Container root: `uid=100000, gid=100000` → ID-mapped
- Container app user: `uid=101001, gid=101001` → ID-mapped
- Host root: `uid=0, gid=0` → NOT ID-mapped
- Mixed: `uid=100000, gid=50` → NOT ID-mapped (fails check)
---
## Monitoring & Alerting
### Log Locations
**Proxy logs**:
```bash
journalctl -u pulse-sensor-proxy -f
```
**Backend logs** (inside container):
```bash
journalctl -u pulse-backend -f
```
Want off-host retention? Forward `audit.log` and `proxy.log` using
[`scripts/setup-log-forwarding.sh`](operations/sensor-proxy-log-forwarding.md)
so events land in your SIEM with RELP + TLS.
**Audit rotation**: Use the steps in [operations/audit-log-rotation.md](operations/audit-log-rotation.md) to rotate `/var/log/pulse/sensor-proxy/audit.log`. After each rotation, restart the proxy and confirm temperature pollers are healthy in `/api/monitoring/scheduler/health` (closed breakers, no DLQ entries).
### Security Events to Monitor
#### 1. Privileged Method Denials
```
SECURITY: Container attempted to call privileged method - access denied
method=ensure_cluster_keys uid=101000 gid=101000 pid=12345
```
**Alert on**: Any occurrence (indicates attempted privilege escalation)
#### 2. Rate Limit Violations
```
Rate limit exceeded uid=101000 pid=12345
```
**Alert on**: Sustained violations (>10/minute indicates possible abuse)
#### 3. Authorization Failures
```
Peer authorization failed uid=50000 gid=50000
```
**Alert on**: Repeated failures from same UID (indicates misconfiguration or probing)
#### 4. SSH Fallback Attempts
```
SECURITY BLOCK: SSH temperature collection disabled in containers
```
**Alert on**: Any occurrence (should only happen during misconfigurations)
### Metrics to Track
```bash
# Rate limit hits
pulse_proxy_rate_limit_hits_total
# RPC requests by method and result
pulse_proxy_rpc_requests_total{method="get_temperature",result="success"}
pulse_proxy_rpc_requests_total{method="ensure_cluster_keys",result="unauthorized"}
# SSH request latency
pulse_proxy_ssh_latency_seconds{node="example-node"}
# Active connections
pulse_proxy_queue_depth
pulse_proxy_global_concurrency_inflight
```
### Recommended Alerts
1. **Privilege Escalation Attempts**:
```
pulse_proxy_rpc_requests_total{result="unauthorized"} > 0
```
2. **Rate Limit Abuse**:
```
rate(pulse_proxy_rate_limit_hits_total[5m]) > 1
```
3. **Proxy Unavailable**:
```
up{job="pulse-sensor-proxy"} == 0
```
4. **Scheduler Drift** (Pulse side ensures temperature pollers stay healthy):
```
max_over_time(pulse_monitor_poll_queue_depth[5m]) > <baseline*1.5>
```
Pair with a check of `/api/monitoring/scheduler/health` to confirm temperature instances report `breaker.state == "closed"`.
---
## Development Mode
### SSH Fallback Override
**Purpose**: Allow direct SSH from containers during development/testing
**Environment Variable**:
```bash
export PULSE_DEV_ALLOW_CONTAINER_SSH=true
```
**Security Implications**:
- ⚠️ **NEVER use in production**
- Allows container to use SSH keys if present
- Defeats the security isolation model
- Should only be used in trusted development environments
**Example Usage**:
```bash
# In systemd override for pulse-backend
mkdir -p /etc/systemd/system/pulse-backend.service.d
cat <<EOF > /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf
[Service]
Environment=PULSE_DEV_ALLOW_CONTAINER_SSH=true
EOF
systemctl daemon-reload
systemctl restart pulse-backend
```
**Monitoring**:
```bash
# Check if dev mode is active
journalctl -u pulse-backend | grep "dev mode" | tail -1
```
**Disable dev mode**:
```bash
rm /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf
systemctl daemon-reload
systemctl restart pulse-backend
```
---
## Troubleshooting
### "method requires host-level privileges"
**Symptom**: Container gets this error when calling RPC
**Cause**: Container attempted to call privileged method
**Resolution**: This is expected behavior. Only these methods are restricted:
- `ensure_cluster_keys`
- `register_nodes`
- `request_cleanup`
**If host process is blocked**:
1. Check UID is not in ID-mapped range:
```bash
id
cat /etc/subuid /etc/subgid
```
2. Verify proxy's allowed UIDs:
```bash
cat /etc/pulse-sensor-proxy/config.yaml
```
### "Rate limit exceeded"
**Symptom**: Requests failing with rate limit error
**Cause**: Peer exceeded ~12 requests/minute (or exhausted per-peer/global concurrency)
**Resolution**:
1. Confirm workload is legitimate (look for retry loops or aggressive polling).
2. Allow the limiter to recover; penalty sleeps clear in ~2s and idle peers expire after 10 minutes.
3. If sustained higher throughput is required, adjust the constants in `cmd/pulse-sensor-proxy/throttle.go` and rebuild.
### Temperature monitoring unavailable
**Symptom**: No temperature data in dashboard
**Diagnosis**:
```bash
# 1. Check proxy is running
systemctl status pulse-sensor-proxy
# 2. Check socket exists
ls -la /run/pulse-sensor-proxy/
# 3. Check socket is accessible in container
ls -la /mnt/pulse-proxy/
# 4. Test proxy from host
curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \
-X POST -d '{"method":"get_status"}' | jq
# 5. Check SSH connectivity
ssh root@example-node "sensors -j"
# 6. Inspect adaptive polling for temperature pollers
curl -s http://localhost:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present, lastSuccess: .pollStatus.lastSuccess}'
```
### SSH key not distributed
**Symptom**: Manual `ensure_cluster_keys` call fails
**Check**:
1. Are you calling from host (not container)?
2. Is pvecm available? `command -v pvecm`
3. Can you reach cluster nodes? `pvecm status`
4. Check proxy logs: `journalctl -u pulse-sensor-proxy -f`
---
## Best Practices
### Production Deployments
1. ✅ **Never use dev mode** (`PULSE_DEV_ALLOW_CONTAINER_SSH=true`)
2. ✅ **Monitor security logs** for unauthorized access attempts
3. ✅ **Use IP filtering** on SSH authorized_keys entries
4. ✅ **Rotate SSH keys** periodically (use `ensure_cluster_keys` with rotation)
5. ✅ **Limit allowed_peer_uids** to minimum necessary
6. ✅ **Enable audit logging** for privileged operations
### Development Environments
1. ✅ Use dev mode SSH override if needed (document why)
2. ✅ Test with actual ID-mapped containers
3. ✅ Verify privileged method blocking works
4. ✅ Test rate limiting under load
### Incident Response
**If container compromise suspected**:
1. Check for privileged method attempts:
```bash
journalctl -u pulse-sensor-proxy | grep "SECURITY:"
```
2. Check rate limit violations:
```bash
journalctl -u pulse-sensor-proxy | grep "Rate limit"
```
3. Restart proxy to clear state:
```bash
systemctl restart pulse-sensor-proxy
```
4. Consider rotating SSH keys:
```bash
# From host, call ensure_cluster_keys with new key
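# Example (assumption): the RPC can be invoked over the proxy socket like get_status above;
# it may require additional parameters, so check the pulse-sensor-proxy CLI reference first.
curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \
  -X POST -d '{"method":"ensure_cluster_keys"}' | jq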
```
---
## References
- [Pulse Installation Guide](../README.md)
- [pulse-sensor-proxy Configuration](../cmd/pulse-sensor-proxy/README.md)
- [Security Audit Results](../SECURITY.md)
- [LXC ID Mapping Documentation](https://linuxcontainers.org/lxc/manpages/man5/lxc.container.conf.5.html#lbAJ)
---
**Last Updated**: 2025-10-19
**Security Contact**: File issues at https://github.com/rcourtman/Pulse/issues

View File

@@ -1,134 +1,11 @@
# Scheduler Health API
# 🩺 Scheduler Health API
Adaptive scheduler health endpoint
**Endpoint**: `GET /api/monitoring/scheduler/health`
**Auth**: Required (Bearer token or Cookie)
Endpoint: `GET /api/monitoring/scheduler/health`
Returns a real-time snapshot of the adaptive scheduler, including queue state, circuit breakers, and dead-letter tasks.
Returns a snapshot of the adaptive polling scheduler, queue state, circuit breakers, and per-instance status. Requires authentication (session cookie or bearer token).
**Key Features:**
- Real-time scheduler health monitoring
- Circuit breaker status per instance
- Dead-letter queue tracking (tasks that repeatedly fail)
- Per-instance staleness metrics
- No query parameters required
- Read-only endpoint (rate-limited under general 500 req/min bucket)
---
## Request
```
GET /api/monitoring/scheduler/health
Authorization: Bearer <token>
```
No query parameters are needed.
---
## Response Overview
```json
{
"updatedAt": "2025-10-20T13:05:42Z", // RFC 3339 timestamp
"enabled": true, // Mirrors AdaptivePollingEnabled setting
"queue": {...},
"deadLetter": {...},
"breakers": [...], // legacy summary (for backward compatibility)
"staleness": [...], // legacy summary (for backward compatibility)
"instances": [ ... ] // authoritative per-instance view (v4.24.0+)
}
```
**Field Notes:**
- `updatedAt`: RFC 3339 timestamp of when this snapshot was generated
- `enabled`: Reflects the current `AdaptivePollingEnabled` system setting
- `breakers` and `staleness`: Legacy arrays maintained for backward compatibility; use `instances` for complete data
- `instances`: Authoritative source for per-instance health (v4.24.0+)
### Queue Snapshot (`queue`)
| Field | Type | Description |
|-------|------|-------------|
| `depth` | integer | Current queue size |
| `dueWithinSeconds` | integer | Items scheduled within the next 12 seconds |
| `perType` | object | Counts per instance type, e.g. `{"pve":4}` |
### Dead-letter Snapshot (`deadLetter`)
| Field | Type | Description |
|-------|------|-------------|
| `count` | integer | Total items in the dead-letter queue |
| `tasks` | array | **Limited to 25 entries** for performance. Each task includes `instance`, `type`, `nextRun`, `lastError`, and `failures` count. For complete per-instance DLQ data, use `instances[].deadLetter` |
**Note:** The top-level `deadLetter.tasks` array is capped at 25 items to prevent large responses. Use the `instances` array for exhaustive coverage.
### Instances (`instances`)
Each element gives a complete view of one instance.
| Field | Type | Description |
|-------|------|-------------|
| `key` | string | Unique key `type::name` |
| `type` | string | Instance type (`pve`, `pbs`, `pmg`, etc.) |
| `displayName` | string | Friendly name (falls back to host/name) |
| `instance` | string | Raw instance identifier |
| `connection` | string | Connection URL or host |
| `pollStatus` | object | Recent poll outcomes |
| `breaker` | object | Circuit breaker state |
| `deadLetter` | object | Dead-letter insight for this instance |
#### Poll Status (`pollStatus`)
| Field | Type | Description |
|-------|------|-------------|
| `lastSuccess` | timestamp nullable | RFC 3339 timestamp of most recent successful poll |
| `lastError` | object nullable | `{ at, message, category }` where `at` is RFC 3339, `message` describes the error, and `category` is `transient` (network issues, timeouts) or `permanent` (auth failures, invalid config) |
| `consecutiveFailures` | integer | Current failure streak length (resets on successful poll) |
| `firstFailureAt` | timestamp nullable | RFC 3339 timestamp when the current failure streak began. Useful for calculating failure duration |
**Timing Metadata (v4.24.0+):**
- `firstFailureAt`: Tracks when a failure streak started, enabling "failing for X minutes" calculations
- Resets to `null` when a successful poll occurs
- Combine with `consecutiveFailures` to assess severity
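For example, these fields can be combined with `jq` into a "failing for N seconds" view per instance (a sketch using the documented field names):
```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.instances[]
        | select(.pollStatus.firstFailureAt != null)
        | {key, failures: .pollStatus.consecutiveFailures,
           failingForSeconds: (now - (.pollStatus.firstFailureAt | fromdate))}'
```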
#### Breaker (`breaker`)
| Field | Type | Description |
|-------|------|-------------|
| `state` | string | `closed` (healthy), `open` (failing), `half_open` (testing recovery), or `unknown` (not initialized) |
| `since` | timestamp nullable | RFC 3339 timestamp when the current state began. Use to calculate how long a breaker has been open |
| `lastTransition` | timestamp nullable | RFC 3339 timestamp of the most recent state change (e.g., closed → open) |
| `retryAt` | timestamp nullable | RFC 3339 timestamp of next scheduled retry attempt when breaker is open or half-open |
| `failureCount` | integer | Number of failures in the current breaker cycle. Resets when breaker closes |
**Circuit Breaker Timing (v4.24.0+):**
- `since`: When did the current state start? (e.g., "breaker has been open for 5 minutes")
- `lastTransition`: When was the last state change? (useful for detecting flapping)
- `retryAt`: When will the next retry attempt occur? (for open/half-open states)
- `failureCount`: How many failures have accumulated? (triggers state transitions)
**State Transitions:**
- `closed` → `open`: Triggered after N failures (default: 5)
- `open` → `half_open`: After timeout period, allows one test request
- `half_open` → `closed`: If test request succeeds
- `half_open` → `open`: If test request fails
#### Dead-letter (`deadLetter`)
| Field | Type | Description |
|-------|------|-------------|
| `present` | boolean | `true` if instance is in the DLQ |
| `reason` | string | `max_retry_attempts` or `permanent_failure` |
| `firstAttempt` | timestamp nullable | First time the instance hit DLQ |
| `lastAttempt` | timestamp nullable | Most recent DLQ enqueue |
| `retryCount` | integer | Number of DLQ attempts |
| `nextRetry` | timestamp nullable | Next scheduled retry time |
---
## Example Response
## 📦 Response Format
```json
{
@@ -137,44 +14,13 @@ Each element gives a complete view of one instance.
"queue": {
"depth": 7,
"dueWithinSeconds": 2,
"perType": { "pve": 4, "pbs": 2, "pmg": 1 }
"perType": { "pve": 4, "pbs": 2 }
},
"deadLetter": {
"count": 1,
"tasks": [
{
"instance": "pbs-b",
"type": "pbs",
"nextRun": "2025-10-20T13:30:00Z",
"lastError": "401 unauthorized",
"failures": 5
}
]
},
"breakers": [
{
"instance": "pve-a",
"type": "pve",
"state": "half_open",
"failures": 3,
"retryAt": "2025-10-20T13:06:15Z"
}
],
"staleness": [
{
"instance": "pve-a",
"type": "pve",
"score": 0.42,
"lastSuccess": "2025-10-20T13:05:10Z",
"lastError": "2025-10-20T13:05:40Z"
}
],
"instances": [
{
"key": "pve::pve-a",
"type": "pve",
"displayName": "Pulse PVE Cluster",
"instance": "pve-a",
"connection": "https://pve-a:8006",
"pollStatus": {
"lastSuccess": "2025-10-20T13:05:10Z",
@@ -187,133 +33,50 @@ Each element gives a complete view of one instance.
"firstFailureAt": "2025-10-20T13:05:20Z"
},
"breaker": {
"state": "half_open",
"since": "2025-10-20T13:05:40Z",
"lastTransition": "2025-10-20T13:05:40Z",
"state": "half_open", // closed, open, half_open
"retryAt": "2025-10-20T13:06:15Z",
"failureCount": 3
},
"deadLetter": {
"present": false
}
},
{
"key": "pbs::pbs-b",
"type": "pbs",
"displayName": "Backup PBS",
"instance": "pbs-b",
"connection": "https://pbs-b:8007",
"pollStatus": {
"lastSuccess": "2025-10-20T12:55:00Z",
"lastError": {
"at": "2025-10-20T13:00:01Z",
"message": "401 unauthorized",
"category": "permanent"
},
"consecutiveFailures": 5,
"firstFailureAt": "2025-10-20T12:58:30Z"
},
"breaker": {
"state": "open",
"since": "2025-10-20T13:00:01Z",
"lastTransition": "2025-10-20T13:00:01Z",
"retryAt": "2025-10-20T13:02:01Z",
"failureCount": 5
},
"deadLetter": {
"present": true,
"reason": "max_retry_attempts",
"firstAttempt": "2025-10-20T12:58:30Z",
"lastAttempt": "2025-10-20T13:00:01Z",
"retryCount": 5,
"nextRetry": "2025-10-20T13:30:00Z"
}
}
]
}
```
---
## 🔍 Key Fields
## Useful `jq` Queries
### Instances (`instances`)
The authoritative source for per-instance health.
### Instances with recent errors
* **`pollStatus`**:
* `lastSuccess`: Timestamp of last successful poll.
* `lastError`: Details of the last error (message, category).
* `consecutiveFailures`: Current failure streak.
* **`breaker`**:
* `state`: `closed` (healthy), `open` (failing), `half_open` (recovering).
* `retryAt`: Next retry time if open/half-open.
* **`deadLetter`**:
* `present`: `true` if the instance is in the DLQ (stopped polling).
* `reason`: Why it was moved to DLQ (e.g., `permanent_failure`).
```
curl -s http://HOST:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'
## 🛠️ Common Queries (jq)
**Find Failing Instances:**
```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health | \
jq '.instances[] | select(.pollStatus.consecutiveFailures > 0) | {key, failures: .pollStatus.consecutiveFailures}'
```
### Current dead-letter queue entries
```
curl -s http://HOST:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'
**Check Dead Letter Queue:**
```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health | \
jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason}'
```
### Breakers not closed
**Find Open Breakers:**
```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health | \
jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state}'
```
curl -s http://HOST:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
```
### Stale instances (score > 0.5)
```
curl -s http://HOST:7655/api/monitoring/scheduler/health \
| jq '.staleness[] | select(.score > 0.5)'
```
### Instances sorted by failure streak
```
curl -s http://HOST:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.pollStatus.consecutiveFailures > 0) | {key, failures: .pollStatus.consecutiveFailures}'
```
---
## Migration Notes
| Legacy Field | Status | Replacement |
|--------------|--------|-------------|
| `breakers` array | retains summary | use `instances[].breaker` for detailed view |
| `deadLetter.tasks` | retains summary | use `instances[].deadLetter` for per-instance enrichment |
| `staleness` array | unchanged | combined with `pollStatus.lastSuccess` gives precise timestamps |
The `instances` array centralizes per-instance telemetry; existing integrations can migrate at their own pace.
---
## Operational Notes
**v4.24.0 Behavior:**
- **Read-only endpoint**: This endpoint is informational only and does not modify scheduler state
- **Rate limiting**: Falls under the general API limit (500 requests/minute per IP)
- **Authentication required**: Must provide valid session cookie or API token
- **Adaptive polling disabled**: When adaptive polling is disabled (`enabled: false`), the response includes empty `breakers`, `staleness`, and `instances` arrays
- **Real-time data**: Reflects current scheduler state; not historical (for trends, use metrics/logs)
- **No query parameters**: Returns complete snapshot on every request
- **Automatic adjustments**: The `enabled` field automatically reflects the `AdaptivePollingEnabled` system setting
**Use Cases:**
- **Monitoring dashboards**: Embed in Grafana/Prometheus for real-time scheduler health
- **Alerting**: Trigger alerts on open circuit breakers or high DLQ counts
- **Debugging**: Investigate why specific instances aren't polling successfully
- **Capacity planning**: Monitor queue depth trends to assess if polling intervals need adjustment
**Breaking Changes:**
- **None**: v4.24.0 only adds fields; all existing consumers continue to work
- Consumers just gain access to richer metadata (`firstFailureAt`, breaker timestamps, DLQ retry windows)
---
## Troubleshooting Examples
1. **Transient outages:** look for `pollStatus.lastError.category == "transient"` to confirm network hiccups; check `breaker.retryAt` to see when retries resume.
2. **Permanent failures:** `deadLetter.present == true` with `reason == "permanent_failure"` indicates credential or configuration issues.
3. **Breaker stuck:** `breaker.state != "closed"` with `since` > 5 minutes suggests manual intervention or rollback.
4. **Staleness spike:** compare `pollStatus.lastSuccess` with `updatedAt` to estimate data age; cross-reference `staleness.score` for alert thresholds.
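A rough data-age calculation per instance, using the fields from example 4 (a sketch; field names as documented above):
```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '. as $snap | .instances[]
        | select(.pollStatus.lastSuccess != null)
        | {key, dataAgeSeconds: (($snap.updatedAt | fromdate) - (.pollStatus.lastSuccess | fromdate))}'
```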
Use Grafana dashboards for historical trends; the API complements dashboards by revealing instant state and precise failure context.

View File

@@ -1,111 +1,37 @@
# Mock Mode Development Guide
# 🧪 Mock Mode Development
Pulse ships with a mock data pipeline so you can iterate on UI and backend
changes without touching real infrastructure. This guide collects everything you
need to know about running in mock mode during development.
Develop Pulse without real infrastructure using the mock data pipeline.
---
## Why Mock Mode?
- Exercise dashboards, alert timelines, and charts with predictable sample data.
- Reproduce edge cases (offline nodes, noisy containers, backup failures) by
tweaking configuration values rather than waiting for production incidents.
- Swap between synthetic and live data without rebuilding services.
---
## Starting the Dev Stack
## 🚀 Quick Start
```bash
# Launch backend + frontend with hot reload
# Start dev stack
./scripts/hot-dev.sh
# Toggle mock mode
npm run mock:on # Enable
npm run mock:off # Disable
npm run mock:status # Check status
```
The script exposes:
- Frontend: `http://localhost:7655` (Vite hot module reload)
- Backend API: `http://localhost:7656`
## ⚙️ Configuration
Edit `mock.env` (or `mock.env.local` for overrides):
---
| Variable | Default | Description |
| :--- | :--- | :--- |
| `PULSE_MOCK_MODE` | `false` | Enable mock mode. |
| `PULSE_MOCK_NODES` | `7` | Number of synthetic nodes. |
| `PULSE_MOCK_VMS_PER_NODE` | `5` | VMs per node. |
| `PULSE_MOCK_LXCS_PER_NODE` | `8` | Containers per node. |
| `PULSE_MOCK_RANDOM_METRICS` | `true` | Jitter metrics. |
| `PULSE_MOCK_STOPPED_PERCENT` | `20` | % of offline guests. |
## Toggling Mock Data
## How it Works
* **Data**: Swaps `PULSE_DATA_DIR` to `/opt/pulse/tmp/mock-data`.
* **Restart**: Backend restarts automatically; Frontend hot-reloads.
* **Reset**: To regenerate data, delete `/opt/pulse/tmp/mock-data` and toggle mock mode on.
The npm helpers and `toggle-mock.sh` wrapper point the backend at different
`.env` files and restart the relevant services automatically.
```bash
npm run mock:on # Enable mock mode
npm run mock:off # Return to real data
npm run mock:status # Display current state
npm run mock:edit # Open mock.env in $EDITOR
```
Equivalent shell invocations:
```bash
./scripts/toggle-mock.sh on
./scripts/toggle-mock.sh off
./scripts/toggle-mock.sh status
```
When switching:
- `mock.env` (or `mock.env.local`) feeds configuration values to the backend.
- `PULSE_DATA_DIR` swaps between `/opt/pulse/tmp/mock-data` (synthetic) and
`/etc/pulse` (real data) so test credentials never mix with production ones.
- The backend process restarts; the frontend stays hot-reloading.
---
## Customising Mock Fixtures
`mock.env` exposes the knobs most developers care about:
```bash
PULSE_MOCK_MODE=false # Enable/disable mock mode
PULSE_MOCK_NODES=7 # Number of synthetic nodes
PULSE_MOCK_VMS_PER_NODE=5 # Average VM count per node
PULSE_MOCK_LXCS_PER_NODE=8 # Average container count per node
PULSE_MOCK_RANDOM_METRICS=true # Toggle metric jitter
PULSE_MOCK_STOPPED_PERCENT=20 # Percentage of guests stopped/offline
PULSE_ALLOW_DOCKER_UPDATES=true # Treat Docker builds as update-capable (skips restart)
```
When `PULSE_ALLOW_DOCKER_UPDATES` (or `PULSE_MOCK_MODE`) is enabled the backend
exposes the full update flow inside containers, fakes the deployment type to
`mock`, and suppresses the automatic process exit that normally follows a
successful upgrade. This is what the Playwright update suite uses inside CI.
Create `mock.env.local` for personal tweaks that should not be committed:
```bash
cp mock.env mock.env.local
$EDITOR mock.env.local
```
The toggle script prioritises `.local` files, falling back to the shared
defaults when none are present.
---
## Troubleshooting
- **Backend did not restart:** flip mock mode off/on again (`npm run mock:off`,
then `npm run mock:on`) to force a reload.
- **Ports already in use:** confirm nothing else is listening on `7655`/`7656`
(`lsof -i :7655` / `lsof -i :7656`) and kill stray processes.
- **Data feels stale:** delete `/opt/pulse/tmp/mock-data` and toggle mock mode
back on to regenerate fixtures.
---
## Limitations
- Mock data focuses on happy-path flows; use real Proxmox/PBS environments
before shipping changes that touch API integrations.
- Webhook payloads are synthetically generated and omit provider-specific
quirks—test with real channels for production rollouts.
- Encrypt/decrypt flows still use the local crypto stack; do not treat mock mode
as a sandbox for experimenting with credential formats.
For more advanced scenarios, inspect `scripts/hot-dev.sh` and the mock seeders
under `internal/mock` for additional entry points.
## ⚠️ Limitations
* **Happy Path**: Focuses on standard flows; use real infrastructure for complex edge cases.
* **Webhooks**: Synthetic payloads only.
* **Encryption**: Uses local crypto stack (not a sandbox for auth).

View File

@@ -1,187 +1,52 @@
# Adaptive Polling Architecture
# 📉 Adaptive Polling
## Overview
Pulse uses an adaptive polling scheduler that adapts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.
Pulse uses an adaptive scheduler to optimize polling based on instance health and activity.
```mermaid
flowchart LR
Scheduler[Scheduler]
Queue[Priority Queue<br/>by NextRun]
Workers[Workers]
## 🧠 Architecture
* **Scheduler**: Calculates intervals based on health/staleness.
* **Priority Queue**: Min-heap keyed by `NextRun`.
* **Circuit Breaker**: Prevents hot loops on failing instances.
* **Backoff**: Exponential retry delays (5s to 5m).
Scheduler -->|schedule| Queue
Queue -->|dequeue| Workers
Workers -->|success| Scheduler
Workers -->|failure| CB[Circuit Breaker]
CB -->|backoff| Scheduler
```
## ⚙️ Configuration
Adaptive polling is **enabled by default**.
- **Scheduler** computes `ScheduledTask` entries using adaptive intervals.
- **Task queue** is a min-heap keyed by `NextRun`; only due tasks execute.
- **Workers** execute tasks, capture outcomes, reschedule via scheduler or backoff logic.
### UI
**Settings → System → Monitoring**.
## Key Components
### Environment Variables
| Variable | Default | Description |
| :--- | :--- | :--- |
| `ADAPTIVE_POLLING_ENABLED` | `true` | Enable/disable. |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Healthy poll rate. |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Active/busy rate. |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Idle/backoff rate. |
| Component | File | Responsibility |
|-----------------------|-------------------------------------------|--------------------------------------------------------------|
| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. |
| Staleness tracker | `internal/monitoring/staleness_tracker.go`| Maintains freshness metadata and scores. |
| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. |
| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. |
| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. |
| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. |
## 📊 Metrics
Exposed at `:9091/metrics`.
## Configuration
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_monitor_poll_total` | Counter | Total poll attempts. |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll latency. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Age since last success. |
| `pulse_monitor_poll_queue_depth` | Gauge | Queue size. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts by category. |
**v4.24.0:** Adaptive polling is **enabled by default** but can be toggled without restart.
## ⚡ Circuit Breaker
| State | Trigger | Recovery |
| :--- | :--- | :--- |
| **Closed** | Normal operation. | — |
| **Open** | ≥3 failures. | Backoff (max 5m). |
| **Half-open** | Retry window elapsed. | Success = Closed; Fail = Open. |
### Via UI
Navigate to **Settings → System → Monitoring** to enable/disable adaptive polling. Changes apply immediately without requiring a restart.
**Dead Letter Queue**: After 5 transient or 1 permanent failure, tasks move to DLQ (30m retry).
### Via Environment Variables
Environment variables (default in `internal/config/config.go`):
## 🩺 Health API
`GET /api/monitoring/scheduler/health` (Auth required)
| Variable | Default | Description |
|-------------------------------------|---------|--------------------------------------------------|
| `ADAPTIVE_POLLING_ENABLED` | true | **Changed in v4.24.0**: Now enabled by default |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | 10s | Target cadence when system is healthy |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | 5s | Lower bound (active instances) |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | 5m | Upper bound (idle instances) |
All settings persist in `system.json` and respond to environment overrides. **Changes apply without restart** when modified via UI.
## Metrics
**v4.24.0:** Extended metrics for comprehensive monitoring.
Exposed via Prometheus (`:9091/metrics`):
| Metric | Type | Labels | Description |
|---------------------------------------------|-----------|---------------------------------------|-------------------------------------------------|
| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error) |
| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance |
| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success) |
| `pulse_monitor_poll_queue_depth` | gauge | — | Size of priority queue |
| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type |
| `pulse_monitor_poll_errors_total` | counter | `instance_type`, `instance`, `category` | Error counts by category (transient/permanent) |
| `pulse_monitor_poll_last_success_timestamp` | gauge | `instance_type`, `instance` | Unix timestamp of last successful poll |
**Alerting Recommendations:**
- Alert when `pulse_monitor_poll_staleness_seconds` > 120 for critical instances
- Alert when `pulse_monitor_poll_queue_depth` > 50 (backlog building)
- Alert when `pulse_monitor_poll_errors_total` with `category=permanent` increases (auth/config issues)
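To eyeball the raw series behind these recommendations straight from the exporter (endpoint and metric names as listed above):
```bash
curl -s http://localhost:9091/metrics \
  | grep -E 'pulse_monitor_poll_(staleness_seconds|queue_depth|errors_total)'
```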
## Circuit Breaker & Backoff
| State | Trigger | Recovery |
|-------------|---------------------------------------------|--------------------------------------------|
| **Closed** | Default. Failures counted. | — |
| **Open** | ≥3 consecutive failures. Poll suppressed. | Exponential delay (max 5min). |
| **Half-open**| Retry window elapsed. Limited re-attempt. | Success ⇒ closed. Failure ⇒ open. |
```mermaid
stateDiagram-v2
[*] --> Closed: Startup / reset
Closed: Default state\nPolling active\nFailure counter increments
Closed --> Open: ≥3 consecutive failures
Open: Polls suppressed\nScheduler schedules backoff (max 5m)
Open --> HalfOpen: Retry window elapsed
HalfOpen: Single probe allowed\nBreaker watches probe result
HalfOpen --> Closed: Probe success\nReset failure streak & delay
HalfOpen --> Open: Probe failure\nIncrease streak & backoff
```
Backoff configuration:
- Initial delay: 5s
- Multiplier: x2 per failure
- Jitter: ±20%
- Max delay: 5 minutes
- After 5 transient failures or any permanent failure, task moves to dead-letter queue for operator action.
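A rough illustration of the resulting delays, jitter omitted:
```bash
# failure 1 -> 5s, 2 -> 10s, 3 -> 20s, 4 -> 40s, 5 -> 80s, ... capped at 300s
for n in 1 2 3 4 5 6 7 8; do
  d=$(( 5 * 2 ** (n - 1) ))
  (( d > 300 )) && d=300
  echo "failure $n -> ${d}s"
done
```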
## Dead-Letter Queue
Dead-letter entries are kept in memory (same `TaskQueue` structure) with a 30min recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
## API Endpoints
### GET /api/monitoring/scheduler/health
Returns comprehensive scheduler health data (authentication required).
**Response format:**
```json
{
"updatedAt": "2025-03-21T18:05:00Z",
"enabled": true,
"queue": {
"depth": 7,
"dueWithinSeconds": 2,
"perType": {
"pve": 4,
"pbs": 2,
"pmg": 1
}
},
"deadLetter": {
"count": 2,
"tasks": [
{
"instance": "pbs-nas",
"type": "pbs",
"nextRun": "2025-03-21T18:25:00Z",
"lastError": "connection timeout",
"failures": 7
}
]
},
"breakers": [
{
"instance": "pve-core",
"type": "pve",
"state": "half_open",
"failures": 3,
"retryAt": "2025-03-21T18:05:45Z"
}
],
"staleness": [
{
"instance": "pve-core",
"type": "pve",
"score": 0.12,
"lastSuccess": "2025-03-21T18:04:50Z"
}
]
}
```
**Field descriptions:**
- `enabled`: Feature flag status
- `queue.depth`: Total queued tasks
- `queue.dueWithinSeconds`: Tasks due within 12 seconds
- `queue.perType`: Distribution by instance type
- `deadLetter.count`: Total dead-letter tasks
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
- `breakers`: Circuit breaker states (only non-default states shown)
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
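For quick triage, the fields above can be summarised with `jq` (default port 7655 assumed; supply whatever authentication your deployment requires):
```bash
# One-line scheduler health summary: backlog, DLQ size, and any non-closed breakers.
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '{
  enabled,
  queueDepth: .queue.depth,
  deadLetter: .deadLetter.count,
  breakers: [.breakers[]? | {instance, state, retryAt}]
}'
```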
## Operational Guidance
1. **Enable adaptive polling**: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
2. **Monitor metrics** to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
3. **Inspect scheduler health** via API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
4. **Review dead-letter logs** for persistent failures; resolve underlying connectivity or auth issues before re-enabling.
## Rollout Plan
1. **Dev/QA**: Run hot-dev with feature flag enabled; observe metrics and logs for several cycles.
2. **Staged deploy**: Enable flag on a subset of clusters; monitor queue depth (<50) and staleness (<45s).
3. **Full rollout**: Toggle flag globally once metrics are stable; document any overrides in release notes.
4. **Post-launch**: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work).
## Known Follow-ups
- Task8: expose scheduler health & dead-letter statistics via API and UI panels.
- Task9: add dedicated unit/integration harness for the scheduler & workers.
Returns:
* Queue depth & breakdown.
* Dead-letter tasks.
* Circuit breaker states.
* Per-instance staleness.

View File

@@ -1,81 +1,36 @@
# Pulse Prometheus Metrics (v4.24.0+)
# 📊 Prometheus Metrics
Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules.
---
## HTTP Request Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). |
| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. |
| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. |
**Alert suggestion:**
`rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops.
---
## Per-Node Poll Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration for each node poll. |
| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. |
| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error type breakdown (connection, auth, internal, etc.). |
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of last successful poll. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since last success (-1 means no success yet). |
**Alert suggestion:**
`max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes.
---
## Scheduler Health Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_scheduler_queue_due_soon` | Gauge | — | Number of tasks due within 12 seconds. |
| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). |
| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. |
| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. |
| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit breaker state: `0`=closed, `1`=half-open, `2`=open, `-1`=unknown. |
| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. |
| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker will allow the next attempt. |
**Alert suggestions:**
- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > <instance count * 1.5>`
- DLQ growth: `increase(pulse_scheduler_dead_letter_depth[10m]) > 0`
- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for > 10 minutes.
---
## Diagnostics Cache Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_diagnostics_cache_hits_total` | Counter | — | Diagnostics requests served from cache. |
| `pulse_diagnostics_cache_misses_total` | Counter | — | Requests that triggered a fresh probe. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | — | Time taken to refresh diagnostics payload. |
**Alert suggestion:**
`rate(pulse_diagnostics_cache_misses_total[5m])` spiking alongside `pulse_diagnostics_refresh_duration_seconds` > 20s can signal upstream slowness.
---
## Existing Instance-Level Poll Metrics (for completeness)
The following metrics pre-date v4.24.0 but remain essential:
Pulse exposes metrics at `/metrics` (default port `9091`).
## 🌐 HTTP Ingress
| Metric | Type | Description |
| --- | --- | --- |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last successful poll (instance-level). |
| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. |
| :--- | :--- | :--- |
| `pulse_http_request_duration_seconds` | Histogram | Latency buckets by `method`, `route`, `status`. |
| `pulse_http_requests_total` | Counter | Total requests. |
| `pulse_http_request_errors_total` | Counter | 4xx/5xx errors. |
Refer to this document whenever you build dashboards or craft alert policies. Scrape all metrics from the Pulse backend `/metrics` endpoint (9091 by default for systemd installs).
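Before pointing Prometheus at the endpoint, a quick reachability check confirms the expected metric families are being exported (localhost and port 9091 are assumptions for a default systemd install):
```bash
# A non-zero count means the Pulse metric families are present.
curl -sf http://localhost:9091/metrics | grep -cE '^pulse_(http|monitor|scheduler|diagnostics)_'
```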
## 🔄 Polling & Nodes
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. |
| `pulse_monitor_node_poll_total` | Counter | Success/error counts per node. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last success. |
| `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. |
## 🧠 Scheduler Health
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_scheduler_queue_depth` | Gauge | Queue depth per instance type. |
| `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth per instance. |
| `pulse_scheduler_breaker_state` | Gauge | `0`=Closed, `1`=Half-Open, `2`=Open. |
## ⚡ Diagnostics Cache
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. |
## 🚨 Alerting Examples
* **High Error Rate**: `rate(pulse_http_request_errors_total[5m]) > 0.05`
* **Stale Node**: `pulse_monitor_node_poll_staleness_seconds > 300`
* **Breaker Open**: `pulse_scheduler_breaker_state == 2`

View File

@@ -1,83 +1,30 @@
# Adaptive Polling Rollout Runbook
# 🚀 Adaptive Polling Rollout
Adaptive polling (v4.24.0+) lets the scheduler dynamically adjust poll
intervals per resource. This runbook documents the safe way to enable, monitor,
and, if needed, disable the feature across environments.
Safely enable dynamic scheduling (v4.24.0+).
## Scope & Prerequisites
## 📋 Pre-Flight
1. **Snapshot Health**:
```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq .
```
2. **Check Metrics**: Ensure `pulse_monitor_poll_queue_depth` is stable.
- Pulse **v4.24.0 or newer**
- Admin access to **Settings → System → Monitoring**
- Prometheus access to `pulse_monitor_*` metrics
- Ability to run authenticated `curl` commands against the Pulse API
## 🟢 Enable
Choose one method:
* **UI**: Settings → System → Monitoring → Adaptive Polling.
* **CLI**: `jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json`
* **Env**: `ADAPTIVE_POLLING_ENABLED=true` (Docker/K8s).
## Change Windows
## 🔍 Monitor (First 15m)
Watch for stability:
```bash
watch -n 5 'curl -s http://localhost:9091/metrics | grep pulse_monitor_poll_queue_depth'
```
* **Success**: Queue depth < 50, no permanent errors.
* **Failure**: High queue depth, open breakers.
Run rollouts during a maintenance window where transient alert jitter is
acceptable. Adaptive polling touches every monitor queue; give yourself at least
15 minutes to observe steady-state metrics.
## Rollout Steps
1. **Snapshot current health**
```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.enabled, .queue.depth'
```
Record queue depth, breaker count, and dead-letter entries.
2. **Enable adaptive polling**
- UI: toggle **Settings → System → Monitoring → Adaptive Polling** → Enable
   - CLI: `jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json`
- Env override: `ADAPTIVE_POLLING_ENABLED=true` before starting Pulse (for
containers/k8s)
3. **Watch metrics (first 5 minutes)**
```bash
watch -n 5 'curl -s http://localhost:9091/metrics | grep -E "pulse_monitor_(poll_queue_depth|poll_staleness_seconds)" | head'
```
Targets:
- `pulse_monitor_poll_queue_depth < 50`
- `pulse_monitor_poll_staleness_seconds` under your SLA (typically < 60s)
- No spikes in `pulse_monitor_poll_errors_total{category="permanent"}`
4. **Validate scheduler state**
```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health \
| jq '{enabled, queue: .queue.depth, breakers: [.breakers[]?.instance], deadLetter: .deadLetter.count}'
```
Expect `enabled: true`, empty breaker list, and `deadLetter.count == 0`.
5. **Document overrides**
- Note any instances moved to manual polling (Settings → Nodes → Polling)
- Capture Grafana screenshots for queue depth/staleness widgets
## Rollback
If queue depth climbs uncontrollably or breakers remain open for >10 minutes:
1. Disable the feature the same way you enabled it (UI/environment).
2. Restart Pulse if environment overrides were used, otherwise hot toggle is
immediate.
3. Continue monitoring until queue depth and staleness return to baseline.
## Canary Strategy Suggestions
| Stage | Action | Acceptance Criteria |
| --- | --- | --- |
| Dev | Enable flag in hot-dev (scripts/hot-dev.sh) | No scheduler panics, UI reflects flag instantly |
| Staging | Enable on one Pulse instance per region | `queue.depth` within ±20% of baseline after 15min |
| Production | Enable per cluster with 30min soak | No more than 5 breaker openings per hour |
## Instrumentation Checklist
- Grafana dashboard with `queue.depth`, `poll_staleness_seconds`,
`poll_errors_total` by type
- Alert rule: `rate(pulse_monitor_poll_errors_total{category="permanent"}[5m]) > 0`
- Alert rule: `max_over_time(pulse_monitor_poll_queue_depth[5m]) > 75`
- JSON log search for `"scheduler":` warnings immediately after enablement
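For the log-search item, a minimal post-enablement sweep (the `pulse` unit name is an assumption based on the systemd install; adjust for containers):
```bash
# Scan the 15 minutes after enablement for scheduler-related warnings.
journalctl -u pulse --since "-15min" --no-pager \
  | grep -iE '"scheduler"|circuit breaker|dead-letter' \
  || echo "no scheduler warnings logged"
```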
## References
- [Architecture doc](../monitoring/ADAPTIVE_POLLING.md)
- [Scheduler Health API](../api/SCHEDULER_HEALTH.md)
- [Kubernetes guidance](../KUBERNETES.md#adaptive-polling-configuration-v4250)
## ↩️ Rollback
If instability occurs > 10m:
1. **Disable**: Toggle off via UI or Env.
2. **Restart**: Required if using Env/CLI overrides.
3. **Verify**: Confirm queue drains.

View File

@@ -0,0 +1,51 @@
# 🔄 Sensor Proxy Audit Log Rotation
The proxy writes append-only, hash-chained logs to `/var/log/pulse/sensor-proxy/audit.log`.
## ⚠️ Important
* **Do not delete**: The file is protected with `chattr +a`.
* **Rotate when**: >200MB or >30 days.
## 🛠️ Manual Rotation
Run as root:
```bash
# 1. Unlock file
chattr -a /var/log/pulse/sensor-proxy/audit.log
# 2. Rotate (copy & truncate)
cp -a /var/log/pulse/sensor-proxy/audit.log /var/log/pulse/sensor-proxy/audit.log.$(date +%Y%m%d)
: > /var/log/pulse/sensor-proxy/audit.log
# 3. Relock & Restart
chown pulse-sensor-proxy:pulse-sensor-proxy /var/log/pulse/sensor-proxy/audit.log
chmod 0640 /var/log/pulse/sensor-proxy/audit.log
chattr +a /var/log/pulse/sensor-proxy/audit.log
systemctl restart pulse-sensor-proxy
```
## 🤖 Logrotate Config
Create `/etc/logrotate.d/pulse-sensor-proxy`:
```conf
/var/log/pulse/sensor-proxy/audit.log {
weekly
rotate 8
compress
missingok
notifempty
create 0640 pulse-sensor-proxy pulse-sensor-proxy
sharedscripts
prerotate
/usr/bin/chattr -a /var/log/pulse/sensor-proxy/audit.log || true
endscript
postrotate
/bin/systemctl restart pulse-sensor-proxy.service || true
/usr/bin/chattr +a /var/log/pulse/sensor-proxy/audit.log || true
endscript
}
```
**Note**: Do NOT use `copytruncate`. The restart is required to reset the hash chain.
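Before trusting the schedule, dry-run the configuration and then force one rotation to confirm the pre/post scripts behave (standard logrotate flags):
```bash
# Dry-run: show what logrotate would do without touching the files.
sudo logrotate -d /etc/logrotate.d/pulse-sensor-proxy
# Force a single rotation to exercise the chattr/restart hooks end to end.
sudo logrotate -f /etc/logrotate.d/pulse-sensor-proxy
```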

View File

@@ -0,0 +1,47 @@
# 🔄 Automatic Updates
Manage Pulse auto-updates on host-mode installations.
> **Note**: Docker/Kubernetes users should manage updates via their orchestrator.
## ⚙️ Components
| File | Purpose |
| :--- | :--- |
| `pulse-update.timer` | Daily check (02:00 + jitter). |
| `pulse-update.service` | Runs the update script. |
| `pulse-auto-update.sh` | Fetches release & restarts Pulse. |
## 🚀 Enable/Disable
### Via UI (Recommended)
**Settings → System → Updates → Automatic Updates**.
### Via CLI
```bash
# Enable
sudo jq '.autoUpdateEnabled=true' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json
sudo systemctl enable --now pulse-update.timer
# Disable
sudo jq '.autoUpdateEnabled=false' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json
sudo systemctl disable --now pulse-update.timer
```
## 🧪 Manual Run
Test the update process:
```bash
sudo systemctl start pulse-update.service
journalctl -u pulse-update -f
```
## 🔍 Observability
* **History**: `curl -s http://localhost:7655/api/updates/history | jq`
* **Logs**: `/var/log/pulse/update-*.log`
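To pull just the most recent run from the history API (field names follow the history entries described in this runbook and may vary slightly between versions):
```bash
# Latest auto-update attempt: status plus where its log and backup live.
curl -s http://localhost:7655/api/updates/history | jq '.entries[0] | {status, log_path, backup_path}'
```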
## ↩️ Rollback
If an update fails:
1. Check logs: `/var/log/pulse/update-YYYYMMDDHHMMSS.log`.
2. Revert manually:
```bash
sudo /opt/pulse/install.sh --version v4.30.0
```
Or use the **Rollback** button in the UI if available.

View File

@@ -0,0 +1,40 @@
# ⚙️ Sensor Proxy Configuration
Safe configuration management using the CLI (v4.31.1+).
## 📂 Files
* **`config.yaml`**: General settings (logging, metrics).
* **`allowed_nodes.yaml`**: Authorized node list (managed via CLI).
## 🛠️ CLI Reference
### Validation
Check for errors before restart.
```bash
pulse-sensor-proxy config validate
```
### Managing Nodes
**Add Nodes (Merge):**
```bash
pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10
```
**Replace List:**
```bash
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 --merge 192.168.0.2
```
## ⚠️ Troubleshooting
**Validation Fails:**
* Check for duplicate `allowed_nodes` blocks in `config.yaml`.
* Run `pulse-sensor-proxy config validate 2>&1` for details.
**Lock Errors:**
* Remove stale locks if process is dead: `rm /etc/pulse-sensor-proxy/*.lock`.
**Empty List:**
* Valid for IPC-only clusters.
* Populate manually if needed using `--replace`.

View File

@@ -0,0 +1,31 @@
# 📝 Sensor Proxy Log Forwarding
Forward `audit.log` and `proxy.log` to a central SIEM via RELP + TLS.
## 🚀 Quick Start
Run the helper script with your collector details:
```bash
sudo REMOTE_HOST=logs.example.com \
REMOTE_PORT=6514 \
CERT_DIR=/etc/pulse/log-forwarding \
CA_CERT=/path/to/ca.crt \
CLIENT_CERT=/path/to/client.crt \
CLIENT_KEY=/path/to/client.key \
/opt/pulse/scripts/setup-log-forwarding.sh
```
## 📋 What It Does
1. **Inputs**: Watches `/var/log/pulse/sensor-proxy/{audit,proxy}.log`.
2. **Queue**: Disk-backed queue (50k messages) for reliability.
3. **Output**: RELP over TLS to `REMOTE_HOST`.
4. **Mirror**: Local debug file at `/var/log/pulse/sensor-proxy/forwarding.log`.
## ✅ Verification
1. **Check Status**: `sudo systemctl status rsyslog`
2. **View Mirror**: `tail -f /var/log/pulse/sensor-proxy/forwarding.log`
3. **Test**: Restart proxy and check remote collector for `pulse.audit` tag.
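If entries never reach the collector, validate the generated rsyslog config before anything else:
```bash
# Syntax-check rsyslog (including /etc/rsyslog.d/pulse-sensor-proxy.conf).
sudo rsyslogd -N1
sudo systemctl status rsyslog --no-pager
```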
## 🧹 Maintenance
* **Disable**: Remove `/etc/rsyslog.d/pulse-sensor-proxy.conf` and restart rsyslog.
* **Rotate Certs**: Replace files in `CERT_DIR` and restart rsyslog.

View File

@@ -1,120 +0,0 @@
# Sensor Proxy Audit Log Rotation
The temperature sensor proxy writes append-only, hash-chained audit events to
`/var/log/pulse/sensor-proxy/audit.log`. The file is created with `0640`
permissions, owned by `pulse-sensor-proxy`, and protected with `chattr +a` via
`scripts/secure-sensor-files.sh`. Because the process keeps the file handle open
and enforces append-only mode, you **must** follow the steps below to rotate the
log without losing events.
## When to Rotate
- File exceeds **200MB** or contains more than 30 days of history
- Prior to exporting evidence for an incident review
- Immediately before changing log-forwarding endpoints (rsyslog/RELP)
The proxy falls back to stderr (systemd journal) only when the file cannot be
opened. Do not rely on the fallback for long-term retention.
## Pre-flight Checklist
1. Confirm the service is healthy:
```bash
systemctl status pulse-sensor-proxy --no-pager
```
2. Make sure `/var/log/pulse/sensor-proxy` is mounted with enough free space:
```bash
df -h /var/log/pulse/sensor-proxy
```
3. Note the current scheduler health inside Pulse for later verification:
```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.queue.depth, .deadLetter.count'
```
## Manual Rotation Procedure
> Run these steps as **root** on the Proxmox host that runs the proxy.
1. Remove the append-only flag (logrotate needs to truncate the file):
```bash
chattr -a /var/log/pulse/sensor-proxy/audit.log
```
2. Copy the current file to an evidence path, then truncate in place:
```bash
ts=$(date +%Y%m%d-%H%M%S)
cp -a /var/log/pulse/sensor-proxy/audit.log /var/log/pulse/sensor-proxy/audit.log.$ts
: > /var/log/pulse/sensor-proxy/audit.log
```
3. Restore permissions and the append-only flag:
```bash
chown pulse-sensor-proxy:pulse-sensor-proxy /var/log/pulse/sensor-proxy/audit.log
chmod 0640 /var/log/pulse/sensor-proxy/audit.log
chattr +a /var/log/pulse/sensor-proxy/audit.log
```
4. Restart the proxy so the file descriptor is reopened:
```bash
systemctl restart pulse-sensor-proxy
```
5. Verify the service recreated the correlation hash chain:
```bash
journalctl -u pulse-sensor-proxy -n 20 | grep -i "audit" || true
```
6. Re-check Pulse adaptive polling health (temperature pollers rely on the
proxy):
```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}'
```
All temperature instances should show `breaker: "closed"` with
`deadLetter: false`.
## Logrotate Configuration
Automate rotation with `/etc/logrotate.d/pulse-sensor-proxy`. Copy the snippet
below and adjust retention to match your compliance needs:
```conf
/var/log/pulse/sensor-proxy/audit.log {
weekly
rotate 8
compress
missingok
notifempty
create 0640 pulse-sensor-proxy pulse-sensor-proxy
sharedscripts
prerotate
/usr/bin/chattr -a /var/log/pulse/sensor-proxy/audit.log || true
endscript
postrotate
/bin/systemctl restart pulse-sensor-proxy.service || true
/usr/bin/chattr +a /var/log/pulse/sensor-proxy/audit.log || true
endscript
}
```
Keep `copytruncate` disabled—the restart ensures the proxy writes to a fresh
file with a new hash chain. Always forward rotated files to your SIEM before
removing them.
## Forwarding Validations
If you forward audit logs over RELP using `scripts/setup-log-forwarding.sh`:
1. Tail the forwarding log:
```bash
tail -f /var/log/pulse/sensor-proxy/forwarding.log
```
2. Ensure queues drain (`action.resumeRetryCount=-1` keeps retrying).
3. Confirm the remote receiver ingests the new file (look for the `pulse.audit`
tag).
## Troubleshooting
| Symptom | Action |
| --- | --- |
| `Operation not permitted` when truncating | `chattr -a` was not executed or SELinux/AppArmor denies it. Check `auditd`. |
| Proxy fails to restart | Run `journalctl -u pulse-sensor-proxy -xe` for context. The proxy refuses to start if the audit file cannot be opened. |
| Temperature polls stop after rotation | Check `/api/monitoring/scheduler/health` for dead-letter entries. Restart the main Pulse service if breakers stay open. |
Once logs are rotated and validated, upload the archived copy to your evidence
store and record the event in your change log.

View File

@@ -1,104 +0,0 @@
# Pulse Automatic Update Runbook
Automatic updates are handled by three systemd units that live on host-mode
installations:
| Component | Purpose | File |
| --- | --- | --- |
| `pulse-update.timer` | Schedules daily checks (02:00 + 0-4h jitter) | `/etc/systemd/system/pulse-update.timer` |
| `pulse-update.service` | Runs a single update cycle when triggered | `/etc/systemd/system/pulse-update.service` |
| `scripts/pulse-auto-update.sh` | Fetches release metadata, downloads binaries, restarts Pulse | `/opt/pulse/scripts/pulse-auto-update.sh` |
> Docker and Kubernetes deployments do **not** use this flow—manage upgrades via
> your orchestrator.
## Prerequisites
- `autoUpdateEnabled` must be `true` in `/var/lib/pulse/system.json` (or toggled in
**Settings → System → Updates → Automatic Updates**).
- `pulse.service` must be healthy—the update service short-circuits if Pulse is
not running.
- Host needs outbound HTTPS access to `github.com` and `objects.githubusercontent.com`.
## Enable or Disable
### From the UI
1. Navigate to **Settings → System → Updates**.
2. Toggle **Automatic Updates** on. The backend persists `autoUpdateEnabled:true`
and surfaces a reminder to enable the timer.
3. On the host, run:
```bash
sudo systemctl enable --now pulse-update.timer
sudo systemctl status pulse-update.timer --no-pager
```
4. To disable later, toggle the UI switch off **and** run
`sudo systemctl disable --now pulse-update.timer`.
### From the CLI only
```bash
# Opt in
sudo jq '.autoUpdateEnabled=true' /var/lib/pulse/system.json | sudo tee /var/lib/pulse/system.json >/dev/null
sudo systemctl daemon-reload
sudo systemctl enable --now pulse-update.timer
# Opt out
sudo jq '.autoUpdateEnabled=false' /var/lib/pulse/system.json | sudo tee /var/lib/pulse/system.json >/dev/null
sudo systemctl disable --now pulse-update.timer
```
> Editing `system.json` while Pulse is running is safe, but prefer the UI so
> validation rules stay in place.
## Trigger a Manual Run
Use this when testing new releases or after changing firewall rules:
```bash
sudo systemctl start pulse-update.service
sudo journalctl -u pulse-update -n 50
```
The oneshot service exits when the script finishes. A successful run logs the new
version and writes an entry to `update-history.jsonl`.
## Observability Checklist
- **Timer status**: `systemctl list-timers pulse-update`
- **History API**: `curl -s http://localhost:7655/api/updates/history | jq '.entries[0]'`
- **Raw log**: `/var/log/pulse/update-*.log` (referenced inside the history entry's
`log_path` field)
- **Journal**: `journalctl -u pulse-update -f`
- **Backups**: The script records `backup_path` in history (defaults to
`/etc/pulse.backup.<timestamp>`). Ensure the path exists before acknowledging
the rollout.
## Failure Handling & Rollback
1. Inspect the failing history entry:
```bash
curl -s http://localhost:7655/api/updates/history?limit=1 | jq '.entries[0]'
```
Common statuses: `failed`, `rolled_back`, `succeeded`.
2. Review `/var/log/pulse/update-YYYYMMDDHHMMSS.log` for the stack trace.
3. To revert, redeploy the previous release:
```bash
sudo /opt/pulse/install.sh --version v4.30.0
```
or use the main installer command from the update history output. The installer
restores the `backup_path` recorded earlier when you choose **Rollback** in the
UI.
4. Confirm Pulse is healthy (`systemctl status pulse.service`) and that
`/api/updates/history` now contains a `rolled_back` entry referencing the same
`event_id`.
## Troubleshooting
| Symptom | Resolution |
| --- | --- |
| `Auto-updates disabled in configuration` in journal | Set `autoUpdateEnabled:true` (UI or edit `system.json`) and restart the timer. |
| `pulse-update.timer` immediately exits | Ensure `systemd` knows about the units (`sudo systemctl daemon-reload`) and that `pulse.service` exists (installer may not have run with `--enable-auto-updates`). |
| `github.com` errors / rate limit | The script retries via the release redirect. For proxied environments set `https_proxy` before the service runs. |
| Update succeeds but Pulse stays on previous version | Check `journalctl -u pulse-update` for `restart failed`; Pulse only switches after the service restarts successfully. |
| Timer enabled but no history entries | Verify time has passed since enablement (timer includes random delay) or start the service manually to seed the first run. |
Document each run (success or rollback) in your change journal with the
`event_id` from `/api/updates/history` so you can cross-reference audit trails.

View File

@@ -1,469 +0,0 @@
# Sensor Proxy Configuration Management
This guide covers safe configuration management for pulse-sensor-proxy, including the new CLI tools introduced in v4.31.1+ to prevent config corruption.
## Overview
Starting with v4.31.1, pulse-sensor-proxy uses a two-file configuration system:
1. **Main config:** `/etc/pulse-sensor-proxy/config.yaml` - Contains all settings except allowed nodes
2. **Allowed nodes:** `/etc/pulse-sensor-proxy/allowed_nodes.yaml` - Separate file for the authorized node list
This separation prevents corruption from concurrent updates by the installer, control-plane sync, and self-heal timer.
## Architecture
### Why Two Files?
Earlier versions stored `allowed_nodes:` inline in `config.yaml`, causing corruption when:
- The installer updated node lists
- The self-heal timer ran (every 5 minutes)
- Control-plane sync modified the list
- Version detection had edge cases
Multiple code paths (shell, Python, Go) would race to update the same YAML file, creating duplicate `allowed_nodes:` keys that broke YAML parsing.
### New System (v4.31.1+)
**Phase 1 (Migration):**
- Force file-based mode exclusively
- Installer migrates inline blocks to `allowed_nodes.yaml`
- Self-heal timer includes corruption detection and repair
**Phase 2 (Atomic Operations):**
- Go CLI replaces all shell/Python config manipulation
- File locking prevents concurrent writes
- Atomic writes (temp file + rename) ensure consistency
- systemd validation prevents startup with corrupt config
## Configuration CLI Reference
### Validate Configuration
Check config files for errors before restarting the service:
```bash
# Validate both config.yaml and allowed_nodes.yaml
pulse-sensor-proxy config validate
# Validate specific config file
pulse-sensor-proxy config validate --config /path/to/config.yaml
# Validate specific allowed_nodes file
pulse-sensor-proxy config validate --allowed-nodes /path/to/allowed_nodes.yaml
```
**Exit codes:**
- 0 = valid
- Non-zero = validation failed (check stderr for details)
**Common validation errors:**
- "duplicate allowed_nodes blocks" - Run migration (see below)
- "failed to parse YAML" - Syntax error in config file
- "read_timeout must be positive" - Invalid timeout value
### Manage Allowed Nodes
The CLI provides two modes:
**Merge mode (default):** Adds nodes to existing list
```bash
# Add single node
pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10
# Add multiple nodes
pulse-sensor-proxy config set-allowed-nodes \
--merge 192.168.0.1 \
--merge 192.168.0.2 \
--merge node1.local
```
**Replace mode:** Overwrites entire list
```bash
# Replace with new list
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 \
--merge 192.168.0.2
# Clear the list (empty is valid for IPC-only clusters)
pulse-sensor-proxy config set-allowed-nodes --replace
```
**Custom paths:**
```bash
# Use non-default path
pulse-sensor-proxy config set-allowed-nodes \
--allowed-nodes /custom/path.yaml \
--merge 192.168.0.10
```
### How It Works
1. **File locking:** Uses `flock(LOCK_EX)` on separate `.lock` file
2. **Atomic writes:** Writes to temp file, syncs, then renames
3. **Deduplication:** Automatically removes duplicate entries
4. **Normalization:** Trims whitespace, sorts entries
5. **Empty lists allowed:** Useful for security lockdown or IPC-based discovery
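Conceptually, the lock-plus-atomic-rename pattern looks roughly like the shell sketch below; it is illustrative only, the real CLI does this in Go with proper YAML handling:
```bash
# Illustrative sketch of the CLI's locking + atomic write (run as root; not the real code).
cfg=/etc/pulse-sensor-proxy/allowed_nodes.yaml
(
  flock -x 200                                   # exclusive lock on the .lock file
  tmp=$(mktemp "${cfg}.XXXXXX")
  printf 'allowed_nodes:\n  - 192.168.0.10\n' > "$tmp"
  sync                                           # flush before the rename
  mv "$tmp" "$cfg"                               # atomic replace
) 200>"${cfg}.lock"
```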
## Common Tasks
### Adding Nodes After Cluster Expansion
When you add a new node to your Proxmox cluster:
```bash
# Add the new node to allowed list
pulse-sensor-proxy config set-allowed-nodes --merge new-node.local
# Validate config
pulse-sensor-proxy config validate
# Restart proxy to apply
sudo systemctl restart pulse-sensor-proxy
# Verify in Pulse UI
# Check Settings → Diagnostics → Temperature Proxy
```
### Removing Decommissioned Nodes
When removing a node from your cluster:
```bash
# Get current list
cat /etc/pulse-sensor-proxy/allowed_nodes.yaml
# Replace with updated list (without old node)
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 \
--merge 192.168.0.2
# (omit the decommissioned node)
# Validate and restart
pulse-sensor-proxy config validate
sudo systemctl restart pulse-sensor-proxy
```
**Note:** The proxy cleanup system automatically removes SSH keys from deleted nodes. See temperature monitoring docs for details.
### Migrating from Inline Config
If you're running an older version with inline `allowed_nodes:` in config.yaml:
```bash
# Upgrade to latest version (auto-migrates)
curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \
sudo bash -s -- --standalone --pulse-server http://your-pulse:7655
# Verify migration
pulse-sensor-proxy config validate
# Check that allowed_nodes only appears in allowed_nodes.yaml
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml
# Should show: allowed_nodes.yaml:3:allowed_nodes:
# Should NOT show duplicate entries in config.yaml
```
### Changing Other Config Settings
For settings in `config.yaml` (not allowed_nodes):
```bash
# Stop the service first
sudo systemctl stop pulse-sensor-proxy
# Edit config.yaml manually
sudo nano /etc/pulse-sensor-proxy/config.yaml
# Validate before starting
pulse-sensor-proxy config validate
# Start service
sudo systemctl start pulse-sensor-proxy
# Check for errors
sudo systemctl status pulse-sensor-proxy
journalctl -u pulse-sensor-proxy -n 50
```
**Safe to edit in config.yaml:**
- `allowed_source_subnets`
- `allowed_peers` (UID/GID permissions)
- `rate_limit` settings
- `metrics_address`
- `http_*` settings (HTTPS mode)
- `pulse_control_plane` block
**Never edit manually:**
- `allowed_nodes:` (use the CLI instead; the list lives in `allowed_nodes.yaml`)
- Lock files (`.lock`)
## Troubleshooting
### Config Validation Fails
**Symptom:** `pulse-sensor-proxy config validate` returns error
**Diagnosis:**
```bash
# Run validation with full output
pulse-sensor-proxy config validate 2>&1
# Check for duplicate blocks
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml
# Check YAML syntax
python3 -c "import yaml; yaml.safe_load(open('/etc/pulse-sensor-proxy/config.yaml'))"
```
**Common fixes:**
- Duplicate blocks: Run migration (upgrade to v4.31.1+)
- YAML syntax errors: Fix indentation, remove tabs, check colons
- Missing required fields: Add `read_timeout`, `write_timeout`
### Service Won't Start After Config Change
**Diagnosis:**
```bash
# Check systemd logs
journalctl -u pulse-sensor-proxy -n 100
# Look for validation errors
journalctl -u pulse-sensor-proxy | grep -i "validation\|corrupt\|duplicate"
# Try starting in foreground for better errors
sudo -u pulse-sensor-proxy /opt/pulse/sensor-proxy/bin/pulse-sensor-proxy # legacy installs: /usr/local/bin/pulse-sensor-proxy
```
**Fix:**
```bash
# Validate config first
pulse-sensor-proxy config validate
# If validation passes but service fails, check permissions
ls -la /etc/pulse-sensor-proxy/
ls -la /var/lib/pulse-sensor-proxy/
# Ensure proxy user owns files
sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/
sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /var/lib/pulse-sensor-proxy/
```
### Lock File Errors
**Symptom:** `failed to acquire file lock` or `failed to open lock file`
**Cause:** Lock file has wrong permissions or process holds stale lock
**Fix:**
```bash
# Check lock file permissions (should be 0600)
ls -la /etc/pulse-sensor-proxy/*.lock
# Fix permissions
sudo chmod 0600 /etc/pulse-sensor-proxy/*.lock
sudo chown pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/*.lock
# If stale lock, identify holder
sudo lsof /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock
# Kill stale process if needed (use with caution)
sudo kill <PID>
```
**Prevention:** Locks are automatically released when process exits. Don't manually delete lock files.
### Allowed Nodes List is Empty
**Symptom:** allowed_nodes.yaml exists but has no entries
**Is this a problem?** Not necessarily:
- Empty list is valid for clusters using IPC discovery (pvecm status)
- Control-plane mode populates the list automatically
- Standalone nodes require manual node entries
**To populate manually:**
```bash
# Add your cluster nodes
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 \
--merge 192.168.0.2 \
--merge 192.168.0.3
# Verify
cat /etc/pulse-sensor-proxy/allowed_nodes.yaml
```
## Best Practices
### General Guidelines
1. **Always validate before restarting:**
```bash
pulse-sensor-proxy config validate && sudo systemctl restart pulse-sensor-proxy
```
2. **Use the CLI for allowed_nodes changes:**
- Don't edit `allowed_nodes.yaml` manually
- Use `config set-allowed-nodes` instead
3. **Stop service before editing config.yaml:**
- Prevents race conditions with running process
- systemd validation will catch errors on startup
4. **Back up config before major changes:**
```bash
sudo cp /etc/pulse-sensor-proxy/config.yaml /etc/pulse-sensor-proxy/config.yaml.backup
sudo cp /etc/pulse-sensor-proxy/allowed_nodes.yaml /etc/pulse-sensor-proxy/allowed_nodes.yaml.backup
```
5. **Monitor after changes:**
```bash
journalctl -u pulse-sensor-proxy -f
# Check Pulse UI: Settings → Diagnostics → Temperature Proxy
```
### Automation Scripts
When scripting config changes:
```bash
#!/bin/bash
set -euo pipefail
# Function to safely update allowed nodes
update_allowed_nodes() {
local nodes=("$@")
# Build command
local cmd="pulse-sensor-proxy config set-allowed-nodes --replace"
for node in "${nodes[@]}"; do
cmd="$cmd --merge $node"
done
# Execute with validation
if eval "$cmd"; then
echo "Allowed nodes updated successfully"
else
echo "Failed to update allowed nodes" >&2
return 1
fi
# Validate
if ! pulse-sensor-proxy config validate; then
echo "Config validation failed after update" >&2
return 1
fi
# Restart service
if sudo systemctl restart pulse-sensor-proxy; then
echo "Service restarted successfully"
else
echo "Service restart failed" >&2
return 1
fi
# Wait for service to be active
sleep 2
if systemctl is-active --quiet pulse-sensor-proxy; then
echo "Service is running"
else
echo "Service failed to start" >&2
journalctl -u pulse-sensor-proxy -n 20
return 1
fi
}
# Example usage
update_allowed_nodes "192.168.0.1" "192.168.0.2" "node3.local"
```
### Monitoring Config Health
Add to your monitoring system:
```bash
# Check for config corruption (should return 0)
pulse-sensor-proxy config validate
echo $?
# Count inline allowed_nodes blocks in config.yaml (should be 0)
grep "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml | wc -l
# Check lock file permissions (should be 0600)
stat -c "%a" /etc/pulse-sensor-proxy/*.lock
# Check service is running
systemctl is-active pulse-sensor-proxy
```
## Migration Path
### Upgrading from Pre-v4.31.1
**Automatic migration** (recommended):
```bash
# Simply reinstall - migration runs automatically
curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \
sudo bash -s -- --standalone --pulse-server http://your-pulse:7655
# Verify
pulse-sensor-proxy config validate
sudo systemctl status pulse-sensor-proxy
```
**Manual migration** (if needed):
```bash
# 1. Stop service
sudo systemctl stop pulse-sensor-proxy
# 2. Extract allowed_nodes from config.yaml
grep -A 100 "^allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml > /tmp/nodes.txt
# 3. Parse and add to allowed_nodes.yaml
# (Example for simple list - adjust for your format)
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge node1.local \
--merge node2.local
# 4. Remove allowed_nodes from config.yaml
# Edit manually or use sed:
sudo sed -i '/^allowed_nodes:/,/^[a-z_]/d' /etc/pulse-sensor-proxy/config.yaml
# 5. Add reference to allowed_nodes.yaml
echo "allowed_nodes_file: /etc/pulse-sensor-proxy/allowed_nodes.yaml" | \
sudo tee -a /etc/pulse-sensor-proxy/config.yaml
# 6. Validate
pulse-sensor-proxy config validate
# 7. Start service
sudo systemctl start pulse-sensor-proxy
```
## Related Documentation
- [Temperature Monitoring](../TEMPERATURE_MONITORING.md) - Setup and troubleshooting
- [Sensor Proxy README](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Complete CLI reference
- [Audit Log Rotation](audit-log-rotation.md) - Managing append-only logs
- [Temperature Monitoring Security](../TEMPERATURE_MONITORING_SECURITY.md) - Security architecture
## Support
If config management issues persist after following this guide:
1. Collect diagnostics:
```bash
pulse-sensor-proxy config validate 2>&1 > /tmp/validate.log
sudo systemctl status pulse-sensor-proxy > /tmp/status.log
journalctl -u pulse-sensor-proxy -n 200 > /tmp/journal.log
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml > /tmp/grep.log
```
2. File an issue at https://github.com/rcourtman/Pulse/issues
3. Include:
- Pulse version
- Sensor proxy version (`pulse-sensor-proxy --version`)
- Output from diagnostic commands above
- Steps that led to the issue

View File

@@ -1,73 +0,0 @@
# Sensor Proxy Log Forwarding
Forward `pulse-sensor-proxy` logs to a central syslog/SIEM endpoint so audit
records survive host loss and can drive alerting. Pulse ships a helper script
(`scripts/setup-log-forwarding.sh`) that configures rsyslog to ship both
`audit.log` and `proxy.log` over RELP + TLS.
## Requirements
- Debian/Ubuntu host with **rsyslog** and the `imfile` + `omrelp` modules (present
by default).
- Root privileges to install certificates and restart rsyslog.
- TLS assets for the RELP connection:
- `ca.crt` CA that issued the remote collector certificate.
- `client.crt` / `client.key` mTLS credentials for this host.
- Network access to the remote collector (`REMOTE_HOST`, default `logs.pulse.example`,
port `6514`).
## Installation Steps
1. Copy your CA and client certificates into a safe directory on the host (the
script defaults to `/etc/pulse/log-forwarding`).
2. Run the helper with environment overrides for your collector:
```bash
sudo REMOTE_HOST=logs.company.tld \
REMOTE_PORT=6514 \
CERT_DIR=/etc/pulse/log-forwarding \
CA_CERT=/etc/pulse/log-forwarding/ca.crt \
CLIENT_CERT=/etc/pulse/log-forwarding/pulse.crt \
CLIENT_KEY=/etc/pulse/log-forwarding/pulse.key \
/opt/pulse/scripts/setup-log-forwarding.sh
```
The script writes `/etc/rsyslog.d/pulse-sensor-proxy.conf`, ensures the
certificate directory exists (`0750`), and restarts rsyslog.
## What the Script Configures
- Two `imfile` inputs that watch `/var/log/pulse/sensor-proxy/audit.log` and
`/var/log/pulse/sensor-proxy/proxy.log` with `Tag`s `pulse.audit` and
`pulse.app`.
- A local mirror file at `/var/log/pulse/sensor-proxy/forwarding.log` so you can
inspect rsyslog activity.
- An RELP action with TLS, infinite retry (`action.resumeRetryCount=-1`), and a
50k message disk-backed queue to absorb collector outages.
## Verification Checklist
1. Confirm rsyslog picked up the new config:
```bash
sudo rsyslogd -N1
sudo systemctl status rsyslog --no-pager
```
2. Tail the local mirror to ensure entries stream through:
```bash
sudo tail -f /var/log/pulse/sensor-proxy/forwarding.log
```
3. On the collector side, filter for the `pulse.audit` tag and make sure new
entries arrive. For Splunk/ELK, index on `programname`.
4. Simulate a test event (e.g., restart `pulse-sensor-proxy` or deny a fake peer)
and verify it appears remotely.
## Maintenance
- **Certificate rotation**: Replace the key/cert files, then restart rsyslog.
Because the config points at static paths, no additional edits are required.
- **Disable forwarding**: Remove `/etc/rsyslog.d/pulse-sensor-proxy.conf` and run
`sudo systemctl restart rsyslog`. The local audit log remains untouched.
- **Queue monitoring**: Track rsyslog's main log or use `rsyslogd -N6` to check
for queue overflows. At scale, scrape `/var/log/pulse/sensor-proxy/forwarding.log`
for `action resumed` messages.
For rotation guidance on the underlying audit file, see
[operations/audit-log-rotation.md](audit-log-rotation.md).

View File

@@ -0,0 +1,39 @@
# 🛡️ Sensor Proxy Hardening
Secure `pulse-sensor-proxy` with AppArmor and Seccomp.
## 🛡️ AppArmor
Profile: `security/apparmor/pulse-sensor-proxy.apparmor`
* **Allows**: Configs, logs, SSH keys, outbound TCP/SSH.
* **Blocks**: Raw sockets, module loading, ptrace, exec outside allowlist.
### Install & Enforce
```bash
sudo install -m 0644 security/apparmor/pulse-sensor-proxy.apparmor /etc/apparmor.d/pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy
sudo aa-enforce pulse-sensor-proxy
```
## 🔒 Seccomp
Profile: `security/seccomp/pulse-sensor-proxy.json`
* **Allows**: Go runtime syscalls, network, file IO.
* **Blocks**: Everything else (returns `EPERM`).
### Systemd (Classic)
Add to service override:
```ini
[Service]
AppArmorProfile=pulse-sensor-proxy
SystemCallFilter=@system-service
SystemCallAllow=accept;connect;recvfrom;sendto;recvmsg;sendmsg;sendmmsg;getsockname;getpeername;getsockopt;setsockopt;shutdown
```
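After editing the override, reload units and restart so the profile and syscall filter take effect:
```bash
sudo systemctl daemon-reload
sudo systemctl restart pulse-sensor-proxy
```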
### Containers (Docker/Podman)
```bash
podman run --seccomp-profile /opt/pulse/security/seccomp/pulse-sensor-proxy.json ...
```
## 🔍 Verification
Check status with `aa-status` or `journalctl -t auditbeat`.

View File

@@ -0,0 +1,57 @@
# 🛡️ Sensor Proxy Hardening
The `pulse-sensor-proxy` runs on the host to securely collect temperatures, keeping SSH keys out of containers.
## 🏗️ Architecture
* **Host**: Runs `pulse-sensor-proxy` (unprivileged user).
* **Container**: Connects via Unix socket (`/run/pulse-sensor-proxy/pulse-sensor-proxy.sock`).
* **Auth**: Uses `SO_PEERCRED` to verify container UID/PID.
## 🔒 Host Hardening
### Service Account
Runs as `pulse-sensor-proxy` (no shell, no home).
```bash
id pulse-sensor-proxy # uid=XXX(pulse-sensor-proxy)
```
### Systemd Security
The service unit uses:
* `User=pulse-sensor-proxy`
* `NoNewPrivileges=true`
* `ProtectSystem=strict`
* `PrivateTmp=true`
### File Permissions
| Path | Owner | Mode |
| :--- | :--- | :--- |
| `/var/lib/pulse-sensor-proxy/` | `pulse-sensor-proxy` | `0750` |
| `/var/lib/pulse-sensor-proxy/ssh/` | `pulse-sensor-proxy` | `0700` |
| `/run/pulse-sensor-proxy/` | `pulse-sensor-proxy` | `0775` |
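To confirm the installer left things in this state (paths taken from the table above):
```bash
# Expect 750 / 700 / 775 with pulse-sensor-proxy as owner.
stat -c '%a %U:%G %n' \
  /var/lib/pulse-sensor-proxy \
  /var/lib/pulse-sensor-proxy/ssh \
  /run/pulse-sensor-proxy
```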
## 📦 LXC Configuration
Required for the container to access the proxy socket.
**`/etc/pve/lxc/<VMID>.conf`**:
```ini
unprivileged: 1
lxc.apparmor.profile: generated
lxc.mount.entry: /run/pulse-sensor-proxy mnt/pulse-proxy none bind,create=dir 0 0
```
## 🔑 Key Management
SSH keys are restricted to `sensors -j` only.
**Rotation**:
```bash
/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh
```
* **Dry Run**: Add `--dry-run`.
* **Rollback**: Add `--rollback`.
## 🚨 Incident Response
If compromised:
1. **Stop Proxy**: `systemctl stop pulse-sensor-proxy`.
2. **Rotate Keys**: Remove old keys from nodes manually or use `pulse-sensor-proxy-rotate-keys.sh`.
3. **Audit Logs**: Check `journalctl -u pulse-sensor-proxy`.
4. **Reinstall**: Run `/opt/pulse/scripts/install-sensor-proxy.sh`.

View File

@@ -0,0 +1,35 @@
# 🌐 Sensor Proxy Network Segmentation
Isolate the proxy to prevent lateral movement.
## 🚧 Zones
* **Pulse App**: Connects to Proxy via Unix socket (local).
* **Sensor Proxy**: Outbound SSH to Proxmox nodes only.
* **Proxmox Nodes**: Accept SSH from Proxy.
* **Logging**: Accepts RELP/TLS from Proxy.
## 🛡️ Firewall Rules
| Source | Dest | Port | Purpose | Action |
| :--- | :--- | :--- | :--- | :--- |
| **Pulse App** | Proxy | `unix` | RPC Requests | **Allow** (Local) |
| **Proxy** | Nodes | `22` | SSH (sensors) | **Allow** |
| **Proxy** | Logs | `6514` | Audit Logs | **Allow** |
| **Any** | Proxy | `22` | SSH Access | **Deny** (Use Bastion) |
| **Proxy** | Internet | `any` | Outbound | **Deny** |
## 🔧 Implementation (iptables)
```bash
# Allow SSH to Proxmox
iptables -A OUTPUT -p tcp -d <PROXMOX_SUBNET> --dport 22 -j ACCEPT
# Allow Log Forwarding
iptables -A OUTPUT -p tcp -d <LOG_HOST> --dport 6514 -j ACCEPT
# Drop all other outbound
iptables -P OUTPUT DROP
```
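If you administer the host over the network, allow established return traffic before setting the DROP policy; the fuller rule set in the original guide used conntrack for this:
```bash
# Allow replies to sessions the host already accepted (e.g. your admin SSH).
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
```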
## 🚨 Monitoring
* Alert on outbound connections to non-whitelisted IPs.
* Monitor `pulse_proxy_limiter_rejects_total` for abuse.

View File

@@ -0,0 +1,31 @@
# 🌡️ Temperature Monitoring Security
Secure architecture for collecting hardware temperatures.
## 🛡️ Security Model
* **Isolation**: SSH keys live on the host, not in the container.
* **Least Privilege**: Proxy runs as `pulse-sensor-proxy` (no shell).
* **Verification**: Container identity verified via `SO_PEERCRED`.
## 🏗️ Components
1. **Pulse Backend**: Connects to Unix socket `/mnt/pulse-proxy/pulse-sensor-proxy.sock`.
2. **Sensor Proxy**: Validates request, executes SSH to node.
3. **Target Node**: Accepts SSH key restricted to `sensors -j`.
## 🔒 Key Restrictions
SSH keys deployed to nodes are locked down:
```
command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty
```
## 🚦 Rate Limiting
* **Per Peer**: ~12 req/min.
* **Concurrency**: Max 2 parallel requests per peer.
* **Global**: Max 8 concurrent requests.
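If the proxy's metrics listener is enabled (`metrics_address` in `config.yaml`; port 9127 was the default quoted in the segmentation guide), rejected requests surface as a counter worth watching:
```bash
# Non-zero growth here usually means a client is hammering the proxy.
curl -s http://localhost:9127/metrics | grep pulse_proxy_limiter_rejects_total
```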
## 📝 Auditing
All requests logged to system journal:
```bash
journalctl -u pulse-sensor-proxy
```
Logs include: `uid`, `pid`, `method`, `node`, `correlation_id`.

View File

@@ -1,52 +0,0 @@
# Pulse Sensor Proxy AppArmor & Seccomp Hardening
## AppArmor Profile
- Profile path: `security/apparmor/pulse-sensor-proxy.apparmor`
- Grants read-only access to configs, logs, SSH keys, and binaries; allows outbound TCP/SSH; blocks raw sockets, module loading, ptrace, and absolute command execution outside the allowlist.
### Installation
```bash
sudo install -m 0644 security/apparmor/pulse-sensor-proxy.apparmor /etc/apparmor.d/pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy
sudo ln -sf /etc/apparmor.d/pulse-sensor-proxy /etc/apparmor.d/force-complain/pulse-sensor-proxy # optional staged mode
sudo systemctl restart apparmor
```
### Enforce Mode
```bash
sudo aa-enforce pulse-sensor-proxy
```
Monitor `/var/log/syslog` for `DENIED` events and update the profile as needed.
## Seccomp Filter
- OCI-style profile: `security/seccomp/pulse-sensor-proxy.json`
- Allows standard Go runtime syscalls, network operations, file IO, and `execve` for whitelisted helpers; other syscalls return `EPERM`.
### Apply via systemd (classic service)
Add to the override:
```ini
[Service]
AppArmorProfile=pulse-sensor-proxy
RestrictNamespaces=yes
NoNewPrivileges=yes
SystemCallFilter=@system-service
SystemCallArchitectures=native
SystemCallAllow=accept;connect;recvfrom;sendto;recvmsg;sendmsg;sendmmsg;getsockname;getpeername;getsockopt;setsockopt;shutdown
```
Reload and restart:
```bash
sudo systemctl daemon-reload
sudo systemctl restart pulse-sensor-proxy
```
### Apply seccomp JSON (containerised deployments)
- Profile: `security/seccomp/pulse-sensor-proxy.json`
- Use with Podman/Docker style runtimes:
```bash
podman run --seccomp-profile /opt/pulse/security/seccomp/pulse-sensor-proxy.json ...
```
## Operational Notes
- Use `journalctl -t auditbeat -g pulse-sensor-proxy` or `aa-status` to confirm profile status.
- Pair with network ACLs (see `docs/security/pulse-sensor-proxy-network.md`) and log shipping via [`scripts/setup-log-forwarding.sh` + the RELP runbook](../operations/sensor-proxy-log-forwarding.md).

View File

@@ -1,64 +0,0 @@
# Pulse Sensor Proxy Network Segmentation
## Overview
- **Proxy host** collects temperatures via SSH from Proxmox nodes and serves a Unix socket to the Pulse stack.
- Goals: isolate the proxy from production hypervisors, prevent lateral movement, and ensure log forwarding/audit channels remain available.
## Zones & Connectivity
- **Pulse Application Zone (AZ-Pulse)**
- Hosts Pulse backend/frontend containers.
- Allowed to reach the proxy over Unix socket (local) or loopback if containerised via `socat`.
- **Sensor Proxy Zone (AZ-Sensor)**
- Dedicated VM/bare-metal host running `pulse-sensor-proxy`.
- Maintains outbound SSH to Proxmox management interfaces only.
- **Proxmox Management Zone (AZ-Proxmox)**
- Hypervisors / BMCs reachable on `tcp/22` (SSH) and optional IPMI UDP.
- **Logging/Monitoring Zone (AZ-Logging)**
- Receives forwarded audit/application logs (e.g. RELP/TLS on `tcp/6514`).
- Exposes Prometheus scrape port (default `tcp/9127`) if remote monitoring required.
## Recommended Firewall Rules
| Source Zone | Destination Zone | Protocol/Port | Purpose | Action |
|-------------|------------------|---------------|---------|--------|
| AZ-Pulse (localhost) | AZ-Sensor (Unix socket) | `unix` | RPC requests from Pulse | Allow (local only) |
| AZ-Sensor | AZ-Proxmox nodes | `tcp/22` | SSH for sensors/ipmitool wrapper | Allow (restricted to node list) |
| AZ-Sensor | AZ-Proxmox BMC | `udp/623` *(optional)* | IPMI if required for temperature data | Allow if needed |
| AZ-Proxmox | AZ-Sensor | `any` | Return SSH traffic | Allow stateful |
| AZ-Sensor | AZ-Logging | `tcp/6514` (TLS RELP) | Audit/application log forwarding | Allow |
| AZ-Logging | AZ-Sensor | `tcp/9127` *(optional)* | Prometheus scrape of proxy metrics | Allow if scraping remotely |
| Any | AZ-Sensor | `tcp/22` | Shell/SSH access | Deny (use management bastion) |
| AZ-Sensor | Internet | `any` | Outbound Internet | Deny (except package mirrors via proxy if required) |
## Implementation Steps
1. Place proxy host in dedicated subnet/VLAN with ACLs enforcing the table above.
2. Populate `/etc/hosts` or routing so proxy resolves Proxmox nodes to management IPs only (no public networks).
3. Configure iptables/nftables on proxy:
```bash
# Allow SSH to Proxmox nodes
iptables -A OUTPUT -p tcp -d <PROXMOX_SUBNET>/24 --dport 22 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp -s <PROXMOX_SUBNET>/24 --sport 22 -m conntrack --ctstate ESTABLISHED -j ACCEPT
# Allow log forwarding
iptables -A OUTPUT -p tcp -d <LOG_HOST> --dport 6514 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp -s <LOG_HOST> --sport 6514 -m conntrack --ctstate ESTABLISHED -j ACCEPT
# (Optional) allow Prometheus scrape
iptables -A INPUT -p tcp -s <SCRAPE_HOST> --dport 9127 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p tcp -d <SCRAPE_HOST> --sport 9127 -m conntrack --ctstate ESTABLISHED -j ACCEPT
# Drop everything else
iptables -P OUTPUT DROP
iptables -P INPUT DROP
```
4. Deny inbound SSH to proxy except via management bastion: block `tcp/22` or whitelist bastion IPs.
5. Ensure log-forwarding TLS certificates are rotated and stored under `/etc/pulse/log-forwarding`.
## Monitoring & Alerting
- Alert if proxy initiates connections outside permitted subnets (Netflow or host firewall counters).
- Monitor `pulse_proxy_limiter_*` metrics for unusual rate-limit hits that might signal abuse.
- Track `audit_log` forwarding queue depth and remote availability; on failure, emit alert via rsyslog action queue (set `action.resumeRetryCount=-1` already).
## Change Management
- Document node IP changes and update firewall objects (`PROXMOX_NODES`) before redeploying certificates.
- Capture segmentation in infrastructure-as-code (e.g. Terraform/security group definitions) to avoid drift.