mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
refactor: finalize documentation overhaul
- Refactor specialized docs for conciseness and clarity - Rename files to UPPER_CASE.md convention - Verify accuracy against codebase - Fix broken links
@@ -1,325 +0,0 @@
|
||||
# Pulse
|
||||
|
||||
[](https://github.com/rcourtman/Pulse/releases/latest)
|
||||
[](https://hub.docker.com/r/rcourtman/pulse)
|
||||
[](https://github.com/rcourtman/Pulse/blob/main/LICENSE)
|
||||
|
||||
**Real-time monitoring for Proxmox VE, Proxmox Mail Gateway, PBS, and Docker infrastructure with alerts and webhooks.**
|
||||
|
||||
Monitor your hybrid Proxmox and Docker estate from a single dashboard. Get instant alerts when nodes go down, containers misbehave, backups fail, or storage fills up. Supports email, Discord, Slack, Telegram, and more.
|
||||
|
||||
**[Try the live demo →](https://demo.pulserelay.pro)** (read-only with mock data)
|
||||
|
||||
## Support Pulse Development
|
||||
|
||||
Pulse is built by a solo developer in evenings and weekends. Your support helps:
|
||||
- Keep me motivated to add new features
|
||||
- Prioritize bug fixes and user requests
|
||||
- Ensure Pulse stays 100% free and open-source forever
|
||||
|
||||
[](https://github.com/sponsors/rcourtman)
|
||||
[](https://ko-fi.com/rcourtman)
|
||||
|
||||
**Not ready to sponsor?** Star the project or share it with your homelab community!
|
||||
|
||||
## Features
|
||||
|
||||
- **Auto-Discovery**: Finds Proxmox nodes on your network, one-liner setup via generated scripts
|
||||
- **Cluster Support**: Configure one node, monitor entire cluster
|
||||
- **Enterprise Security**:
|
||||
- Credentials encrypted at rest, masked in logs, never sent to frontend
|
||||
- CSRF protection for all state-changing operations
|
||||
- Rate limiting (500 req/min general, 10 attempts/min for auth)
|
||||
- Account lockout after failed login attempts
|
||||
- Secure session management with HttpOnly cookies
|
||||
- bcrypt password hashing (cost 12) - passwords NEVER stored in plain text
|
||||
- API tokens stored securely with restricted file permissions
|
||||
- Security headers (CSP, X-Frame-Options, etc.)
|
||||
- Comprehensive audit logging
|
||||
- Live monitoring of VMs, containers, nodes, storage
|
||||
- **Smart Alerts**: Email and webhooks (Discord, Slack, Telegram, Teams, ntfy.sh, Gotify)
|
||||
- Example: "VM 'webserver' is down on node 'pve1'"
|
||||
- Example: "Storage 'local-lvm' at 85% capacity"
|
||||
- Example: "VM 'database' is back online"
|
||||
- **Adaptive Thresholds**: Hysteresis-based trigger/clear levels, fractional network thresholds, per-metric search, reset-to-defaults, and Custom overrides with inline audit trail
|
||||
- **Alert Timeline Analytics**: Rich history explorer with acknowledgement/clear markers, escalation breadcrumbs, and quick filters for noisy resources
|
||||
- **Ceph Awareness**: Surface Ceph health, pool utilisation, and daemon status automatically when Proxmox exposes Ceph-backed storage
|
||||
- Unified view of PBS backups, PVE backups, and snapshots
|
||||
- **Interactive Backup Explorer**: Cross-highlighted bar chart + grid with quick time-range pivots (24h/7d/30d/custom) and contextual tooltips for the busiest jobs
|
||||
- Proxmox Mail Gateway analytics: mail volume, spam/virus trends, quarantine health, and cluster node status
|
||||
- Optional Docker container monitoring via lightweight agent
|
||||
- Config export/import with encryption and authentication
|
||||
- Automatic stable updates with safe rollback (opt-in)
|
||||
- Runtime logging controls (switch level/format or mirror to file without downtime)
|
||||
- Update history with rollback guidance captured in the UI
|
||||
- Dark/light themes, responsive design
|
||||
- Built with Go for minimal resource usage
|
||||
|
||||
[View screenshots and full documentation on GitHub →](https://github.com/rcourtman/Pulse)
|
||||
|
||||
## Privacy
|
||||
|
||||
**Pulse respects your privacy:**
|
||||
- No telemetry or analytics collection
|
||||
- No phone-home functionality
|
||||
- No external API calls (except for configured webhooks)
|
||||
- All data stays on your server
|
||||
- Open source - verify it yourself
|
||||
|
||||
Your infrastructure data is yours alone.
|
||||
|
||||
## Quick Start with Docker
|
||||
|
||||
### Basic Setup
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name pulse \
|
||||
-p 7655:7655 \
|
||||
-v pulse_data:/data \
|
||||
--restart unless-stopped \
|
||||
rcourtman/pulse:latest
|
||||
```
|
||||
|
||||
Then open `http://localhost:7655` and complete the security setup wizard.
|
||||
|
||||
### Network Discovery
|
||||
|
||||
Pulse automatically discovers Proxmox nodes on your network! By default, it scans:
|
||||
- 192.168.0.0/16 (home networks)
|
||||
- 10.0.0.0/8 (private networks)
|
||||
- 172.16.0.0/12 (Docker/internal networks)
|
||||
|
||||
To scan a custom subnet instead:
|
||||
```bash
|
||||
docker run -d \
|
||||
--name pulse \
|
||||
-p 7655:7655 \
|
||||
-v pulse_data:/data \
|
||||
-e DISCOVERY_SUBNET="192.168.50.0/24" \
|
||||
--restart unless-stopped \
|
||||
rcourtman/pulse:latest
|
||||
```
|
||||
|
||||
### Automated Deployment with Pre-configured Auth
|
||||
|
||||
```bash
|
||||
# Deploy with authentication pre-configured
|
||||
docker run -d \
|
||||
--name pulse \
|
||||
-p 7655:7655 \
|
||||
-v pulse_data:/data \
|
||||
-e API_TOKENS="ansible-token,docker-agent-token" \
|
||||
-e PULSE_AUTH_USER="admin" \
|
||||
-e PULSE_AUTH_PASS="your-password" \
|
||||
--restart unless-stopped \
|
||||
rcourtman/pulse:latest
|
||||
|
||||
# Plain text credentials are automatically hashed for security
|
||||
# No setup required - API works immediately
|
||||
```
|
||||
|
||||
### Docker Compose
|
||||
|
||||
```yaml
|
||||
services:
|
||||
pulse:
|
||||
image: rcourtman/pulse:latest
|
||||
container_name: pulse
|
||||
ports:
|
||||
- "7655:7655"
|
||||
volumes:
|
||||
- pulse_data:/data
|
||||
environment:
|
||||
# NOTE: Env vars override UI settings. Remove env var to allow UI configuration.
|
||||
|
||||
# Network discovery (usually not needed - auto-scans common networks)
|
||||
# - DISCOVERY_SUBNET=192.168.50.0/24 # Only for non-standard networks
|
||||
|
||||
# Ports
|
||||
# - PORT=7655 # Backend port (default: 7655)
|
||||
# - FRONTEND_PORT=7655 # Frontend port (default: 7655)
|
||||
|
||||
# Security (all optional - runs open by default)
|
||||
# - PULSE_AUTH_USER=admin # Username for web UI login
|
||||
# - PULSE_AUTH_PASS=your-password # Plain text or bcrypt hash (auto-hashed if plain)
|
||||
# - API_TOKENS=token-a,token-b # Comma-separated tokens (plain or SHA3-256 hashed)
|
||||
# - API_TOKEN=legacy-token # Optional single-token fallback
|
||||
# - ALLOW_UNPROTECTED_EXPORT=false # Allow export without auth (default: false)
|
||||
|
||||
# Security: Plain text credentials are automatically hashed
|
||||
# You can provide either:
|
||||
# 1. Plain text (auto-hashed): PULSE_AUTH_PASS=mypassword
|
||||
# 2. Pre-hashed (advanced): PULSE_AUTH_PASS='$$2a$$12$$...'
|
||||
# Note: Escape $ as $$ in docker-compose.yml for pre-hashed values
|
||||
|
||||
# Performance
|
||||
# - CONNECTION_TIMEOUT=10 # Connection timeout in seconds (default: 10)
|
||||
|
||||
# CORS & logging
|
||||
# - ALLOWED_ORIGINS=https://app.example.com # CORS origins (default: none, same-origin only)
|
||||
# - LOG_LEVEL=info # Log level: debug/info/warn/error (default: info)
|
||||
# - LOG_FORMAT=auto # auto | json | console (default: auto)
|
||||
# - LOG_FILE=/data/pulse.log # Optional mirrored logfile inside container
|
||||
# - LOG_MAX_SIZE=100 # Rotate logfile after N MB
|
||||
# - LOG_MAX_AGE=30 # Retain rotated logs for N days
|
||||
# - LOG_COMPRESS=true # Compress rotated logs
|
||||
restart: unless-stopped
|
||||
|
||||
volumes:
|
||||
  pulse_data:
```
|
||||
|
||||
### Updating & Rollbacks (v4.24.0+)
|
||||
|
||||
```bash
|
||||
# Update to the latest tagged image
|
||||
docker pull rcourtman/pulse:latest
|
||||
docker stop pulse && docker rm pulse
|
||||
docker run -d --name pulse \
|
||||
-p 7655:7655 -v pulse_data:/data \
|
||||
--restart unless-stopped \
|
||||
rcourtman/pulse:latest
|
||||
```
|
||||
- Every upgrade is logged in **Settings → System → Updates** with an `event_id` for change tracking.
|
||||
- Need to revert? Redeploy the previous tag (for example `rcourtman/pulse:v4.23.2`). Record the rollback reason in your change notes and double-check `/api/monitoring/scheduler/health` once the container is back online.
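For reference, a minimal rollback sketch, assuming `v4.23.2` is your last known-good release and the same flags as the Quick Start above:

```bash
# Redeploy the previous tag (replace v4.23.2 with your last known-good version)
docker pull rcourtman/pulse:v4.23.2
docker stop pulse && docker rm pulse
docker run -d --name pulse \
  -p 7655:7655 -v pulse_data:/data \
  --restart unless-stopped \
  rcourtman/pulse:v4.23.2

# Confirm the scheduler came back healthy (add an Authorization header if auth is enabled)
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.enabled'
```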
|
||||
|
||||
|
||||
## Initial Setup
|
||||
|
||||
1. Open `http://<your-server>:7655`
|
||||
2. **Complete the mandatory security setup** (first-time only)
|
||||
3. Create your admin username and password
|
||||
4. Use **Settings → Security → API tokens** to issue dedicated tokens for automation (one token per integration makes revocation painless)
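As a sketch of how a dedicated token might then be used from automation (the `Authorization: Bearer` scheme matches the Scheduler Health API reference; the token value and endpoint are placeholders):

```bash
PULSE_TOKEN="paste-a-token-from-settings-security"
curl -s -H "Authorization: Bearer ${PULSE_TOKEN}" http://localhost:7655/api/version | jq
```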
|
||||
|
||||
## Configure Proxmox/PBS Nodes
|
||||
|
||||
After logging in:
|
||||
|
||||
1. Go to Settings → Nodes
|
||||
2. Discovered nodes appear automatically
|
||||
3. Click "Setup Script" next to any node
|
||||
4. Click "Generate Setup Code" button (creates a 6-character code valid for 5 minutes)
|
||||
5. Copy and run the provided one-liner on your Proxmox/PBS host
|
||||
6. Node is configured and monitoring starts automatically
|
||||
|
||||
**Example setup command:**
|
||||
```bash
|
||||
curl -sSL "http://pulse:7655/api/setup-script?type=pve&host=https://pve:8006&auth_token=ABC123" | bash
|
||||
```
|
||||
|
||||
## Docker Updates
|
||||
|
||||
```bash
|
||||
# Latest stable
|
||||
docker pull rcourtman/pulse:latest
|
||||
|
||||
# Latest RC/pre-release
|
||||
docker pull rcourtman/pulse:rc
|
||||
|
||||
# Specific version
|
||||
docker pull rcourtman/pulse:v4.22.0
|
||||
|
||||
# Then recreate your container
|
||||
docker stop pulse && docker rm pulse
|
||||
# Run your docker run or docker-compose command again
|
||||
```
|
||||
|
||||
## Security
|
||||
|
||||
- **Authentication required** - Protects your Proxmox infrastructure credentials
|
||||
- **Quick setup wizard** - Secure your installation in under a minute
|
||||
- **Multiple auth methods**: Password authentication, API tokens, proxy auth (SSO), or combinations
|
||||
- **Proxy/SSO support** - Integrate with Authentik, Authelia, and other authentication proxies
|
||||
- **Enterprise-grade protection**:
|
||||
- Credentials encrypted at rest (AES-256-GCM)
|
||||
- CSRF tokens for state-changing operations
|
||||
- Rate limiting and account lockout protection
|
||||
- Secure session management with HttpOnly cookies
|
||||
- bcrypt password hashing (cost 12) - passwords NEVER stored in plain text
|
||||
- API tokens stored securely with restricted file permissions
|
||||
- Security headers (CSP, X-Frame-Options, etc.)
|
||||
- Comprehensive audit logging
|
||||
- **Security by design**:
|
||||
- Frontend never receives node credentials
|
||||
- API tokens visible only to authenticated users
|
||||
- Export/import requires authentication when configured
|
||||
|
||||
See [Security Documentation](https://github.com/rcourtman/Pulse/blob/main/docs/SECURITY.md) for details.
|
||||
|
||||
## HTTPS/TLS Configuration
|
||||
|
||||
Enable HTTPS by setting these environment variables:
|
||||
|
||||
```bash
|
||||
docker run -d -p 7655:7655 \
|
||||
-e HTTPS_ENABLED=true \
|
||||
-e TLS_CERT_FILE=/data/cert.pem \
|
||||
-e TLS_KEY_FILE=/data/key.pem \
|
||||
-v pulse_data:/data \
|
||||
-v /path/to/certs:/data/certs:ro \
|
||||
rcourtman/pulse:latest
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Authentication Issues
|
||||
|
||||
#### Cannot login after setting up security
|
||||
- **Docker**: Ensure bcrypt hash is exactly 60 characters and wrapped in single quotes
|
||||
- **Docker Compose**: MUST escape $ characters as $$ (e.g., `$$2a$$12$$...`)
|
||||
- **Example (docker run)**: `PULSE_AUTH_PASS='$2a$12$YTZXOCEylj4TaevZ0DCeI.notayQZ..b0OZ97lUZ.Q24fljLiMQHK'`
|
||||
- **Example (docker-compose.yml)**: `PULSE_AUTH_PASS='$$2a$$12$$YTZXOCEylj4TaevZ0DCeI.notayQZ..b0OZ97lUZ.Q24fljLiMQHK'`
|
||||
- If hash is truncated or mangled, authentication will fail
|
||||
- Use Quick Security Setup in the UI to avoid manual configuration errors
|
||||
|
||||
#### .env file not created (Docker)
|
||||
- **Expected behavior**: When using environment variables, no .env file is created in /data
|
||||
- The .env file is only created when using Quick Security Setup or password changes
|
||||
- If you provide credentials via environment variables, they take precedence
|
||||
- To use Quick Security Setup: Start container WITHOUT auth environment variables
|
||||
|
||||
### VM Disk Stats Show "-"
|
||||
- VMs require QEMU Guest Agent to report disk usage (Proxmox API returns 0 for VMs)
|
||||
- Install guest agent in VM: `apt install qemu-guest-agent` (Linux) or virtio-win tools (Windows)
|
||||
- Enable in VM Options → QEMU Guest Agent, then restart VM
|
||||
- Container (LXC) disk stats always work (no guest agent needed)
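For reference, a CLI sketch of the same steps, assuming a Debian-based guest with VMID 100:

```bash
# Inside the VM (Debian/Ubuntu)
apt install qemu-guest-agent
systemctl enable --now qemu-guest-agent

# On the Proxmox host: enable the agent option, then power-cycle the VM
qm set 100 --agent enabled=1
qm shutdown 100 && qm start 100
```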
|
||||
|
||||
### Connection Issues
|
||||
- Check Proxmox API is accessible (port 8006/8007)
|
||||
- Verify credentials have PVEAuditor role plus VM.GuestAgent.Audit (PVE 9) or VM.Monitor (PVE 8); the setup script applies these via the PulseMonitor role (adds Sys.Audit when available)
|
||||
- For PBS: ensure API token has Datastore.Audit permission
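A quick reachability sketch from the Pulse host (hostnames are placeholders; `-k` skips verification for self-signed certificates, and even a 401 response confirms the API is reachable):

```bash
curl -sk https://pve.example.com:8006/api2/json/version
curl -sk https://pbs.example.com:8007/api2/json/version
```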
|
||||
|
||||
### Logs
|
||||
```bash
|
||||
# View logs
|
||||
docker logs pulse
|
||||
|
||||
# Follow logs
|
||||
docker logs -f pulse
|
||||
```
|
||||
|
||||
## Documentation
|
||||
|
||||
Full documentation available on GitHub:
|
||||
|
||||
- [Complete Installation Guide](https://github.com/rcourtman/Pulse/blob/main/docs/INSTALL.md)
|
||||
- [Configuration Guide](https://github.com/rcourtman/Pulse/blob/main/docs/CONFIGURATION.md)
|
||||
- [VM Disk Monitoring](https://github.com/rcourtman/Pulse/blob/main/docs/VM_DISK_MONITORING.md) - Set up QEMU Guest Agent for accurate VM disk usage
|
||||
- [Troubleshooting](https://github.com/rcourtman/Pulse/blob/main/docs/TROUBLESHOOTING.md)
|
||||
- [API Reference](https://github.com/rcourtman/Pulse/blob/main/docs/API.md)
|
||||
- [Webhook Guide](https://github.com/rcourtman/Pulse/blob/main/docs/WEBHOOKS.md)
|
||||
- [Proxy Authentication](https://github.com/rcourtman/Pulse/blob/main/docs/PROXY_AUTH.md) - SSO integration with Authentik, Authelia, etc.
|
||||
- [Reverse Proxy Setup](https://github.com/rcourtman/Pulse/blob/main/docs/REVERSE_PROXY.md) - nginx, Caddy, Apache, Traefik configs
|
||||
- [Security](https://github.com/rcourtman/Pulse/blob/main/docs/SECURITY.md)
|
||||
- [FAQ](https://github.com/rcourtman/Pulse/blob/main/docs/FAQ.md)
|
||||
|
||||
## Links
|
||||
|
||||
- [GitHub Repository](https://github.com/rcourtman/Pulse)
|
||||
- [Releases & Changelog](https://github.com/rcourtman/Pulse/releases)
|
||||
- [Issues & Feature Requests](https://github.com/rcourtman/Pulse/issues)
|
||||
- [Live Demo](https://demo.pulserelay.pro)
|
||||
|
||||
## License
|
||||
|
||||
MIT - See [LICENSE](https://github.com/rcourtman/Pulse/blob/main/LICENSE)
|
||||
@@ -1,127 +0,0 @@
|
||||
# Port Configuration Guide
|
||||
|
||||
Pulse supports multiple ways to configure the frontend port (default: 7655).
|
||||
|
||||
> **Development tip:** The hot-reload workflow (`scripts/hot-dev.sh` or `make dev-hot`) loads `.env`, `.env.local`, and `.env.dev`. Set `FRONTEND_PORT` or `PULSE_DEV_API_PORT` there to run the backend on a different port while keeping the generated `curl` commands and Vite proxy in sync.
|
||||
|
||||
## Recommended Methods
|
||||
|
||||
### 1. During Installation (Easiest)
|
||||
The installer prompts for the port. To skip the prompt, use:
|
||||
```bash
|
||||
FRONTEND_PORT=8080 curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/install.sh | bash
|
||||
```
|
||||
|
||||
### 2. Using systemd override (For existing installations)
|
||||
```bash
|
||||
sudo systemctl edit pulse
|
||||
```
|
||||
Add these lines:
|
||||
```ini
|
||||
[Service]
|
||||
Environment="FRONTEND_PORT=8080"
|
||||
```
|
||||
Then restart: `sudo systemctl restart pulse`
|
||||
|
||||
### 3. Using system.json (Alternative method)
|
||||
Edit `/etc/pulse/system.json`:
|
||||
```json
|
||||
{
|
||||
"frontendPort": 8080
|
||||
}
|
||||
```
|
||||
Then restart: `sudo systemctl restart pulse`
|
||||
|
||||
### 4. Using environment variables (Docker)
|
||||
For Docker deployments:
|
||||
```bash
|
||||
docker run -e FRONTEND_PORT=8080 -p 8080:8080 rcourtman/pulse:latest
|
||||
```
|
||||
|
||||
## Priority Order
|
||||
|
||||
Pulse checks for port configuration in this order:
|
||||
1. `FRONTEND_PORT` environment variable
|
||||
2. `PORT` environment variable (legacy)
|
||||
3. `frontendPort` in system.json
|
||||
4. Default: 7655
|
||||
|
||||
Environment variables always override configuration files.
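One way to confirm which source is in effect, assuming the default paths and service name used in this guide:

```bash
# Environment variables set on the service (these win if present)
sudo systemctl show pulse | grep -E 'FRONTEND_PORT|PORT='

# Fallback value from system.json
grep frontendPort /etc/pulse/system.json

# What the process is actually listening on
sudo ss -tlnp | grep pulse
```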
|
||||
|
||||
## Why not .env?
|
||||
|
||||
The `/etc/pulse/.env` file is reserved exclusively for authentication credentials:
|
||||
- `API_TOKENS` - One or more API authentication tokens (hashed)
|
||||
- `API_TOKEN` - Legacy single API token (hashed)
|
||||
- `PULSE_AUTH_USER` - Web UI username
|
||||
- `PULSE_AUTH_PASS` - Web UI password (hashed)
|
||||
|
||||
Keeping application configuration separate from authentication credentials:
|
||||
- Makes it clear what's a secret vs what's configuration
|
||||
- Allows different permission models if needed
|
||||
- Follows the principle of separation of concerns
|
||||
- Makes it easier to backup/share configs without exposing credentials
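For illustration, a credentials-only `.env` might look like this (all values are placeholders, with hashes truncated):

```bash
# /etc/pulse/.env - authentication credentials only
PULSE_AUTH_USER=admin
PULSE_AUTH_PASS='$2a$12$...'
API_TOKENS=hashed-token-a,hashed-token-b
```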
|
||||
|
||||
## Service Name Variations
|
||||
|
||||
**Important:** Pulse uses different service names depending on the deployment environment:
|
||||
|
||||
- **Systemd (default):** `pulse.service` or `pulse-backend.service` (legacy)
|
||||
- **Hot-dev scripts:** `pulse-hot-dev` (development only)
|
||||
- **Kubernetes/Helm:** Deployment `pulse`, Service `pulse` (port configured via Helm values)
|
||||
|
||||
**To check the active service:**
|
||||
```bash
|
||||
# Systemd
|
||||
systemctl list-units | grep pulse
|
||||
systemctl status pulse
|
||||
|
||||
# Kubernetes
|
||||
kubectl -n pulse get svc pulse
|
||||
kubectl -n pulse get deploy pulse
|
||||
```
|
||||
|
||||
## Change Tracking (v4.24.0+)
|
||||
|
||||
Port changes made via environment variables or `system.json` take effect after a service restart. Starting with v4.24.0, Pulse records configuration changes in the update history, which is useful for audit trails and troubleshooting.
|
||||
|
||||
**To view change history:**
|
||||
```bash
|
||||
# Via UI
|
||||
# Navigate to Settings → System → Updates
|
||||
|
||||
# Via API
|
||||
curl -s http://localhost:7655/api/updates/history | jq '.entries[] | {timestamp, action, status}'
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Port not changing after configuration?
|
||||
1. **Check which service name is in use:**
|
||||
```bash
|
||||
systemctl list-units | grep pulse
|
||||
```
|
||||
It might be `pulse` (default), `pulse-backend` (legacy), or `pulse-hot-dev` (dev environment) depending on your installation method.
|
||||
|
||||
2. **Verify the configuration is loaded:**
|
||||
```bash
|
||||
# Systemd
|
||||
sudo systemctl show pulse | grep Environment
|
||||
|
||||
# Kubernetes
|
||||
kubectl -n pulse get deploy pulse -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
|
||||
```
|
||||
|
||||
3. **Check if another process is using the port:**
|
||||
```bash
|
||||
sudo lsof -i :8080
|
||||
```
|
||||
|
||||
4. **Verify post-restart** (v4.24.0+):
|
||||
```bash
|
||||
# Check actual listening port
|
||||
curl -s http://localhost:7655/api/version | jq
|
||||
|
||||
# Check update history for restart event
|
||||
curl -s http://localhost:7655/api/updates/history?limit=5 | jq
|
||||
```
|
||||
@@ -20,7 +20,6 @@ Welcome to the Pulse documentation portal. Here you'll find everything you need
|
||||
- **[Docker Guide](DOCKER.md)** – Advanced Docker & Compose configurations.
|
||||
- **[Kubernetes](KUBERNETES.md)** – Helm charts, ingress, and HA setups.
|
||||
- **[Reverse Proxy](REVERSE_PROXY.md)** – Nginx, Caddy, Traefik, and Cloudflare Tunnel recipes.
|
||||
- **[Port Configuration](PORT_CONFIGURATION.md)** – Changing default ports.
|
||||
- **[Troubleshooting](TROUBLESHOOTING.md)** – Deep dive into common issues and logs.
|
||||
|
||||
## 🔐 Security
|
||||
|
||||
@@ -324,7 +324,7 @@ journalctl -u pulse-sensor-proxy -f
|
||||
```
|
||||
|
||||
Forward these logs off-host for retention by following
|
||||
[operations/sensor-proxy-log-forwarding.md](operations/sensor-proxy-log-forwarding.md).
|
||||
[operations/SENSOR_PROXY_LOGS.md](operations/SENSOR_PROXY_LOGS.md).
|
||||
|
||||
In the Pulse container, check the logs at startup:
|
||||
```bash
|
||||
@@ -718,7 +718,7 @@ pulse-sensor-proxy config set-allowed-nodes --replace --merge 192.168.0.1
|
||||
- Installer uses CLI (no more shell/Python divergence)
|
||||
|
||||
**See also:**
|
||||
- [Sensor Proxy Config Management Guide](operations/sensor-proxy-config-management.md) - Complete runbook
|
||||
- [Sensor Proxy Config Management Guide](operations/SENSOR_PROXY_CONFIG.md) - Complete runbook
|
||||
- [Sensor Proxy CLI Reference](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Full command documentation
|
||||
|
||||
## Control-Plane Sync & Migration
|
||||
|
||||
@@ -1,499 +0,0 @@
|
||||
# Temperature Monitoring Security Guide
|
||||
|
||||
This document describes the security architecture of Pulse's temperature monitoring system with pulse-sensor-proxy.
|
||||
|
||||
## Table of Contents
|
||||
- [Architecture Overview](#architecture-overview)
|
||||
- [Security Boundaries](#security-boundaries)
|
||||
- [Authentication & Authorization](#authentication--authorization)
|
||||
- [Rate Limiting](#rate-limiting)
|
||||
- [SSH Security](#ssh-security)
|
||||
- [Container Isolation](#container-isolation)
|
||||
- [Monitoring & Alerting](#monitoring--alerting)
|
||||
- [Development Mode](#development-mode)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
Container[Pulse Container]
|
||||
Proxy[pulse-sensor-proxy<br/>Host Service]
|
||||
Cluster[Cluster Nodes<br/>SSH sensors -j]
|
||||
|
||||
Container -->|Unix Socket<br/>Rate Limited| Proxy
|
||||
Proxy -->|SSH<br/>Forced Command| Cluster
|
||||
Cluster -->|Temperature JSON| Proxy
|
||||
Proxy -->|Temperature JSON| Container
|
||||
|
||||
style Proxy fill:#e1f5e1
|
||||
style Container fill:#fff4e1
|
||||
style Cluster fill:#e1f0ff
|
||||
```
|
||||
|
||||
**Key Principle**: SSH keys never enter containers. All SSH operations are performed by the host-side proxy.
|
||||
|
||||
---
|
||||
|
||||
## Security Boundaries
|
||||
|
||||
### 1. Host ↔ Container Boundary
|
||||
- **Enforced by**: Method-level authorization + ID-mapped root detection
|
||||
- **Container CAN**:
|
||||
- ✅ Call `get_temperature` (read temperature data)
|
||||
- ✅ Call `get_status` (check proxy health)
|
||||
- **Container CANNOT**:
|
||||
- ❌ Call `ensure_cluster_keys` (SSH key distribution)
|
||||
- ❌ Call `register_nodes` (node discovery)
|
||||
- ❌ Call `request_cleanup` (cleanup operations)
|
||||
- ❌ Use direct SSH (blocked by container detection)
|
||||
|
||||
### 2. Proxy ↔ Cluster Nodes Boundary
|
||||
- **Enforced by**: SSH forced commands + IP filtering
|
||||
- **SSH authorized_keys entry**:
|
||||
```bash
|
||||
from="192.168.0.0/24",command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... pulse-sensor-proxy
|
||||
```
|
||||
- Proxy can ONLY run `sensors -j` on cluster nodes
|
||||
- IP restrictions prevent lateral movement
|
||||
|
||||
### 3. Client ↔ Proxy Boundary
|
||||
- **Enforced by**: UID-based ACL + adaptive rate limiting
|
||||
- SO_PEERCRED verifies caller's UID/GID/PID
|
||||
- Rate limiting (defaults): ~12 requests per minute per UID (burst 2), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures
|
||||
- Per-node guard: only 1 SSH fetch per node at a time
|
||||
|
||||
---
|
||||
|
||||
## Authentication & Authorization
|
||||
|
||||
### Authentication (Who can connect?)
|
||||
|
||||
**Allowed UIDs**:
|
||||
- Root (UID 0) - host processes
|
||||
- Proxy's own UID (pulse-sensor-proxy user)
|
||||
- Configured UIDs from `/etc/pulse-sensor-proxy/config.yaml`
|
||||
- ID-mapped root ranges (containers, if enabled)
|
||||
|
||||
**ID-Mapped Root Detection**:
|
||||
- Reads `/etc/subuid` and `/etc/subgid` for UID/GID mapping ranges
|
||||
- Containers typically use ranges like `100000-165535`
|
||||
- Both UID AND GID must be in mapped ranges
|
||||
|
||||
### Authorization (What can they call?)
|
||||
|
||||
**Privileged Methods** (host-only):
|
||||
```go
|
||||
var privilegedMethods = map[string]bool{
|
||||
"ensure_cluster_keys": true, // SSH key distribution
|
||||
"register_nodes": true, // Node registration
|
||||
"request_cleanup": true, // Cleanup operations
|
||||
}
|
||||
```
|
||||
|
||||
**Authorization Check**:
|
||||
```go
|
||||
if privilegedMethods[method] && isIDMappedRoot(credentials) {
|
||||
return "method requires host-level privileges"
|
||||
}
|
||||
```
|
||||
|
||||
**Read-Only Methods** (containers allowed):
|
||||
- `get_temperature` - Fetch temperature data via proxy
|
||||
- `get_status` - Check proxy health and version
|
||||
|
||||
---
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
### Per-Peer Limits (commit 46b8b8d)
|
||||
|
||||
- **Rate:** 1 request per second (`per_peer_interval_ms = 1000`)
|
||||
- **Burst:** 5 requests (enough to sweep five nodes per polling window)
|
||||
- **Per-peer concurrency:** Maximum 2 concurrent RPCs
|
||||
- **Global concurrency:** 8 simultaneous RPCs across all peers
|
||||
- **Penalty:** 2 s enforced delay on validation failures (oversized payloads, unauthorized methods)
|
||||
- **Cleanup:** Peer entries expire after 10 minutes of inactivity
|
||||
|
||||
### Configurable Overrides
|
||||
|
||||
Administrators can raise or lower thresholds via `/etc/pulse-sensor-proxy/config.yaml`:
|
||||
|
||||
```yaml
|
||||
rate_limit:
|
||||
per_peer_interval_ms: 500 # 2 rps
|
||||
per_peer_burst: 10 # allow 10-node sweep
|
||||
```
|
||||
|
||||
Security guidance:
|
||||
- Keep `per_peer_interval_ms ≥ 100` in production; lower values expand the attack surface for noisy callers.
|
||||
- Ensure UID/GID filters stay in place when increasing throughput, and continue to ship audit logs off-host.
|
||||
- Monitor `pulse_proxy_limiter_penalties_total` alongside `pulse_proxy_limiter_rejects_total` to spot abusive or compromised clients.
|
||||
|
||||
### Per-Node Concurrency
|
||||
- **Limit**: 1 concurrent SSH request per node
|
||||
- **Purpose**: Prevents SSH connection storms
|
||||
- **Scope**: Applies to all peers requesting same node
|
||||
|
||||
### Monitoring Rate Limits
|
||||
```bash
|
||||
# Check rate limit metrics
|
||||
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_limiter_rejects_total
|
||||
|
||||
# Watch for rate limit warnings in logs
|
||||
journalctl -u pulse-sensor-proxy -f | grep "Rate limit exceeded"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SSH Security
|
||||
|
||||
### SSH Key Management
|
||||
|
||||
**Key Location**: `/var/lib/pulse-sensor-proxy/ssh/id_ed25519`
|
||||
- **Owner**: `pulse-sensor-proxy:pulse-sensor-proxy`
|
||||
- **Permissions**: `0600` (read/write for owner only)
|
||||
- **Type**: Ed25519 (modern, secure)
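To spot-check these properties on a running host (paths as documented above):

```bash
stat -c '%U:%G %a %n' /var/lib/pulse-sensor-proxy/ssh/id_ed25519
# Expected: pulse-sensor-proxy:pulse-sensor-proxy 600 /var/lib/pulse-sensor-proxy/ssh/id_ed25519

ssh-keygen -l -f /var/lib/pulse-sensor-proxy/ssh/id_ed25519.pub   # confirms the key type is ED25519
```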
|
||||
|
||||
**Key Distribution**:
|
||||
- Only host processes can trigger distribution (via `ensure_cluster_keys`)
|
||||
- Containers are blocked from key distribution operations
|
||||
- Keys are distributed with forced commands and IP restrictions
|
||||
|
||||
### Forced Command Restrictions
|
||||
|
||||
**On cluster nodes**, the SSH key can ONLY run:
|
||||
```bash
|
||||
sensors -j
|
||||
```
|
||||
|
||||
**No other commands possible**:
|
||||
- ❌ Shell access denied (`no-pty`)
|
||||
- ❌ Port forwarding disabled (`no-port-forwarding`)
|
||||
- ❌ X11 forwarding disabled (`no-X11-forwarding`)
|
||||
- ❌ Agent forwarding disabled (`no-agent-forwarding`)
|
||||
|
||||
### IP Filtering
|
||||
|
||||
**Source IP restrictions**:
|
||||
```bash
|
||||
from="192.168.0.0/24,10.0.0.0/8"
|
||||
```
|
||||
- Automatically detected from cluster node IPs
|
||||
- Prevents SSH key use from outside the cluster
|
||||
- Updated during key rotation
|
||||
|
||||
---
|
||||
|
||||
## Container Isolation
|
||||
|
||||
### Fallback SSH Protection
|
||||
|
||||
**In containers**, direct SSH is blocked:
|
||||
```go
|
||||
if system.InContainer() && !devModeAllowSSH {
|
||||
log.Error().Msg("SECURITY BLOCK: SSH temperature collection disabled in containers")
|
||||
return &Temperature{Available: false}, nil
|
||||
}
|
||||
```
|
||||
|
||||
**Container Detection Methods**:
|
||||
1. `PULSE_FORCE_CONTAINER=1` override for explicit opt-in
|
||||
2. Presence of `/.dockerenv` or `/run/.containerenv`
|
||||
3. `container=` hints from environment variables
|
||||
4. `/proc/1/environ` and `/proc/1/cgroup` markers (`docker`, `lxc`, `containerd`, `kubepods`, etc.)
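A rough manual check of the same markers from a shell inside the container:

```bash
ls /.dockerenv /run/.containerenv 2>/dev/null
tr '\0' '\n' < /proc/1/environ | grep '^container=' || true
grep -E 'docker|lxc|containerd|kubepods' /proc/1/cgroup | head -n 3
```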
|
||||
|
||||
**Bypass**: Only possible with explicit environment variable (see [Development Mode](#development-mode))
|
||||
|
||||
### ID-Mapped Root Detection
|
||||
|
||||
**How it works**:
|
||||
```go
|
||||
// Check /etc/subuid and /etc/subgid for mapping ranges
|
||||
// Example /etc/subuid:
|
||||
// root:100000:65536
|
||||
|
||||
func isIDMappedRoot(cred *peerCredentials) bool {
|
||||
return uidInRange(cred.uid, idMappedUIDRanges) &&
|
||||
gidInRange(cred.gid, idMappedGIDRanges)
|
||||
}
|
||||
```
|
||||
|
||||
**Why both UID and GID?**:
|
||||
- Container root: `uid=100000, gid=100000` → ID-mapped
|
||||
- Container app user: `uid=101001, gid=101001` → ID-mapped
|
||||
- Host root: `uid=0, gid=0` → NOT ID-mapped
|
||||
- Mixed: `uid=100000, gid=50` → NOT ID-mapped (fails check)
|
||||
|
||||
---
|
||||
|
||||
## Monitoring & Alerting
|
||||
|
||||
### Log Locations
|
||||
|
||||
**Proxy logs**:
|
||||
```bash
|
||||
journalctl -u pulse-sensor-proxy -f
|
||||
```
|
||||
|
||||
**Backend logs** (inside container):
|
||||
```bash
|
||||
journalctl -u pulse-backend -f
|
||||
```
|
||||
|
||||
Want off-host retention? Forward `audit.log` and `proxy.log` using
|
||||
[`scripts/setup-log-forwarding.sh`](operations/sensor-proxy-log-forwarding.md)
|
||||
so events land in your SIEM with RELP + TLS.
|
||||
|
||||
**Audit rotation**: Use the steps in [operations/audit-log-rotation.md](operations/audit-log-rotation.md) to rotate `/var/log/pulse/sensor-proxy/audit.log`. After each rotation, restart the proxy and confirm temperature pollers are healthy in `/api/monitoring/scheduler/health` (closed breakers, no DLQ entries).
|
||||
|
||||
### Security Events to Monitor
|
||||
|
||||
#### 1. Privileged Method Denials
|
||||
```
|
||||
SECURITY: Container attempted to call privileged method - access denied
|
||||
method=ensure_cluster_keys uid=101000 gid=101000 pid=12345
|
||||
```
|
||||
|
||||
**Alert on**: Any occurrence (indicates attempted privilege escalation)
|
||||
|
||||
#### 2. Rate Limit Violations
|
||||
```
|
||||
Rate limit exceeded uid=101000 pid=12345
|
||||
```
|
||||
|
||||
**Alert on**: Sustained violations (>10/minute indicates possible abuse)
|
||||
|
||||
#### 3. Authorization Failures
|
||||
```
|
||||
Peer authorization failed uid=50000 gid=50000
|
||||
```
|
||||
|
||||
**Alert on**: Repeated failures from same UID (indicates misconfiguration or probing)
|
||||
|
||||
#### 4. SSH Fallback Attempts
|
||||
```
|
||||
SECURITY BLOCK: SSH temperature collection disabled in containers
|
||||
```
|
||||
|
||||
**Alert on**: Any occurrence (should only happen during misconfigurations)
|
||||
|
||||
### Metrics to Track
|
||||
|
||||
```bash
|
||||
# Rate limit hits
|
||||
pulse_proxy_rate_limit_hits_total
|
||||
|
||||
# RPC requests by method and result
|
||||
pulse_proxy_rpc_requests_total{method="get_temperature",result="success"}
|
||||
pulse_proxy_rpc_requests_total{method="ensure_cluster_keys",result="unauthorized"}
|
||||
|
||||
# SSH request latency
|
||||
pulse_proxy_ssh_latency_seconds{node="example-node"}
|
||||
|
||||
# Active connections
|
||||
pulse_proxy_queue_depth
|
||||
pulse_proxy_global_concurrency_inflight
|
||||
```
|
||||
|
||||
### Recommended Alerts
|
||||
|
||||
1. **Privilege Escalation Attempts**:
|
||||
```
|
||||
pulse_proxy_rpc_requests_total{result="unauthorized"} > 0
|
||||
```
|
||||
|
||||
2. **Rate Limit Abuse**:
|
||||
```
|
||||
rate(pulse_proxy_rate_limit_hits_total[5m]) > 1
|
||||
```
|
||||
|
||||
3. **Proxy Unavailable**:
|
||||
```
|
||||
up{job="pulse-sensor-proxy"} == 0
|
||||
```
|
||||
|
||||
4. **Scheduler Drift** (Pulse side – ensures temperature pollers stay healthy):
|
||||
```
|
||||
max_over_time(pulse_monitor_poll_queue_depth[5m]) > <baseline*1.5>
|
||||
```
|
||||
Pair with a check of `/api/monitoring/scheduler/health` to confirm temperature instances report `breaker.state == "closed"`.
|
||||
|
||||
---
|
||||
|
||||
## Development Mode
|
||||
|
||||
### SSH Fallback Override
|
||||
|
||||
**Purpose**: Allow direct SSH from containers during development/testing
|
||||
|
||||
**Environment Variable**:
|
||||
```bash
|
||||
export PULSE_DEV_ALLOW_CONTAINER_SSH=true
|
||||
```
|
||||
|
||||
**Security Implications**:
|
||||
- ⚠️ **NEVER use in production**
|
||||
- Allows container to use SSH keys if present
|
||||
- Defeats the security isolation model
|
||||
- Should only be used in trusted development environments
|
||||
|
||||
**Example Usage**:
|
||||
```bash
|
||||
# In systemd override for pulse-backend
|
||||
mkdir -p /etc/systemd/system/pulse-backend.service.d
|
||||
cat <<EOF > /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf
|
||||
[Service]
|
||||
Environment=PULSE_DEV_ALLOW_CONTAINER_SSH=true
|
||||
EOF
|
||||
systemctl daemon-reload
|
||||
systemctl restart pulse-backend
|
||||
```
|
||||
|
||||
**Monitoring**:
|
||||
```bash
|
||||
# Check if dev mode is active
|
||||
journalctl -u pulse-backend | grep "dev mode" | tail -1
|
||||
```
|
||||
|
||||
**Disable dev mode**:
|
||||
```bash
|
||||
rm /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf
|
||||
systemctl daemon-reload
|
||||
systemctl restart pulse-backend
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "method requires host-level privileges"
|
||||
|
||||
**Symptom**: Container gets this error when calling RPC
|
||||
|
||||
**Cause**: Container attempted to call privileged method
|
||||
|
||||
**Resolution**: This is expected behavior. Only these methods are restricted:
|
||||
- `ensure_cluster_keys`
|
||||
- `register_nodes`
|
||||
- `request_cleanup`
|
||||
|
||||
**If host process is blocked**:
|
||||
1. Check UID is not in ID-mapped range:
|
||||
```bash
|
||||
id
|
||||
cat /etc/subuid /etc/subgid
|
||||
```
|
||||
|
||||
2. Verify proxy's allowed UIDs:
|
||||
```bash
|
||||
cat /etc/pulse-sensor-proxy/config.yaml
|
||||
```
|
||||
|
||||
### "Rate limit exceeded"
|
||||
|
||||
**Symptom**: Requests failing with rate limit error
|
||||
|
||||
**Cause**: Peer exceeded ~12 requests/minute (or exhausted per-peer/global concurrency)
|
||||
|
||||
**Resolution**:
|
||||
1. Confirm workload is legitimate (look for retry loops or aggressive polling).
|
||||
2. Allow the limiter to recover—penalty sleeps clear in ~2 s and idle peers expire after 10 minutes.
|
||||
3. If sustained higher throughput is required, adjust the constants in `cmd/pulse-sensor-proxy/throttle.go` and rebuild.
|
||||
|
||||
### Temperature monitoring unavailable
|
||||
|
||||
**Symptom**: No temperature data in dashboard
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# 1. Check proxy is running
|
||||
systemctl status pulse-sensor-proxy
|
||||
|
||||
# 2. Check socket exists
|
||||
ls -la /run/pulse-sensor-proxy/
|
||||
|
||||
# 3. Check socket is accessible in container
|
||||
ls -la /mnt/pulse-proxy/
|
||||
|
||||
# 4. Test proxy from host
|
||||
curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \
  -X POST -d '{"method":"get_status"}' http://localhost/ | jq   # curl requires a URL even with --unix-socket; adjust the path if the proxy expects one
|
||||
|
||||
# 5. Check SSH connectivity
|
||||
ssh root@example-node "sensors -j"
|
||||
|
||||
# 6. Inspect adaptive polling for temperature pollers
|
||||
curl -s http://localhost:7655/api/monitoring/scheduler/health \
|
||||
| jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present, lastSuccess: .pollStatus.lastSuccess}'
|
||||
```
|
||||
|
||||
### SSH key not distributed
|
||||
|
||||
**Symptom**: Manual `ensure_cluster_keys` call fails
|
||||
|
||||
**Check**:
|
||||
1. Are you calling from host (not container)?
|
||||
2. Is pvecm available? `command -v pvecm`
|
||||
3. Can you reach cluster nodes? `pvecm status`
|
||||
4. Check proxy logs: `journalctl -u pulse-sensor-proxy -f`
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Production Deployments
|
||||
|
||||
1. ✅ **Never use dev mode** (`PULSE_DEV_ALLOW_CONTAINER_SSH=true`)
|
||||
2. ✅ **Monitor security logs** for unauthorized access attempts
|
||||
3. ✅ **Use IP filtering** on SSH authorized_keys entries
|
||||
4. ✅ **Rotate SSH keys** periodically (use `ensure_cluster_keys` with rotation)
|
||||
5. ✅ **Limit allowed_peer_uids** to minimum necessary
|
||||
6. ✅ **Enable audit logging** for privileged operations
|
||||
|
||||
### Development Environments
|
||||
|
||||
1. ✅ Use dev mode SSH override if needed (document why)
|
||||
2. ✅ Test with actual ID-mapped containers
|
||||
3. ✅ Verify privileged method blocking works
|
||||
4. ✅ Test rate limiting under load
|
||||
|
||||
### Incident Response
|
||||
|
||||
**If container compromise suspected**:
|
||||
|
||||
1. Check for privileged method attempts:
|
||||
```bash
|
||||
journalctl -u pulse-sensor-proxy | grep "SECURITY:"
|
||||
```
|
||||
|
||||
2. Check rate limit violations:
|
||||
```bash
|
||||
journalctl -u pulse-sensor-proxy | grep "Rate limit"
|
||||
```
|
||||
|
||||
3. Restart proxy to clear state:
|
||||
```bash
|
||||
systemctl restart pulse-sensor-proxy
|
||||
```
|
||||
|
||||
4. Consider rotating SSH keys:
|
||||
```bash
|
||||
# From host, call ensure_cluster_keys with new key
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Pulse Installation Guide](../README.md)
|
||||
- [pulse-sensor-proxy Configuration](../cmd/pulse-sensor-proxy/README.md)
|
||||
- [Security Audit Results](../SECURITY.md)
|
||||
- [LXC ID Mapping Documentation](https://linuxcontainers.org/lxc/manpages/man5/lxc.container.conf.5.html#lbAJ)
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: 2025-10-19
|
||||
**Security Contact**: File issues at https://github.com/rcourtman/Pulse/issues
|
||||
@@ -1,134 +1,11 @@
|
||||
# Scheduler Health API

**Endpoint**: `GET /api/monitoring/scheduler/health`
**Auth**: Required (session cookie or Bearer token)

Returns a real-time snapshot of the adaptive polling scheduler, including queue state, circuit breakers, dead-letter tasks, and per-instance status.
|
||||
|
||||
**Key Features:**
|
||||
- Real-time scheduler health monitoring
|
||||
- Circuit breaker status per instance
|
||||
- Dead-letter queue tracking (tasks that repeatedly fail)
|
||||
- Per-instance staleness metrics
|
||||
- No query parameters required
|
||||
- Read-only endpoint (rate-limited under general 500 req/min bucket)
|
||||
|
||||
---
|
||||
|
||||
## Request
|
||||
|
||||
```
|
||||
GET /api/monitoring/scheduler/health
|
||||
Authorization: Bearer <token>
|
||||
```
|
||||
|
||||
No query parameters are needed.
|
||||
|
||||
---
|
||||
|
||||
## Response Overview
|
||||
|
||||
```json
|
||||
{
|
||||
"updatedAt": "2025-10-20T13:05:42Z", // RFC 3339 timestamp
|
||||
"enabled": true, // Mirrors AdaptivePollingEnabled setting
|
||||
"queue": {...},
|
||||
"deadLetter": {...},
|
||||
"breakers": [...], // legacy summary (for backward compatibility)
|
||||
"staleness": [...], // legacy summary (for backward compatibility)
|
||||
"instances": [ ... ] // authoritative per-instance view (v4.24.0+)
|
||||
}
|
||||
```
|
||||
|
||||
**Field Notes:**
|
||||
- `updatedAt`: RFC 3339 timestamp of when this snapshot was generated
|
||||
- `enabled`: Reflects the current `AdaptivePollingEnabled` system setting
|
||||
- `breakers` and `staleness`: Legacy arrays maintained for backward compatibility; use `instances` for complete data
|
||||
- `instances`: Authoritative source for per-instance health (v4.24.0+)
|
||||
|
||||
### Queue Snapshot (`queue`)
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `depth` | integer | Current queue size |
|
||||
| `dueWithinSeconds` | integer | Items scheduled within the next 12 seconds |
|
||||
| `perType` | object | Counts per instance type, e.g. `{"pve":4}` |
|
||||
|
||||
### Dead-letter Snapshot (`deadLetter`)
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `count` | integer | Total items in the dead-letter queue |
|
||||
| `tasks` | array | **Limited to 25 entries** for performance. Each task includes `instance`, `type`, `nextRun`, `lastError`, and `failures` count. For complete per-instance DLQ data, use `instances[].deadLetter` |
|
||||
|
||||
**Note:** The top-level `deadLetter.tasks` array is capped at 25 items to prevent large responses. Use the `instances` array for exhaustive coverage.
|
||||
|
||||
### Instances (`instances`)
|
||||
|
||||
Each element gives a complete view of one instance.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `key` | string | Unique key `type::name` |
|
||||
| `type` | string | Instance type (`pve`, `pbs`, `pmg`, etc.) |
|
||||
| `displayName` | string | Friendly name (falls back to host/name) |
|
||||
| `instance` | string | Raw instance identifier |
|
||||
| `connection` | string | Connection URL or host |
|
||||
| `pollStatus` | object | Recent poll outcomes |
|
||||
| `breaker` | object | Circuit breaker state |
|
||||
| `deadLetter` | object | Dead-letter insight for this instance |
|
||||
|
||||
#### Poll Status (`pollStatus`)
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `lastSuccess` | timestamp nullable | RFC 3339 timestamp of most recent successful poll |
|
||||
| `lastError` | object nullable | `{ at, message, category }` where `at` is RFC 3339, `message` describes the error, and `category` is `transient` (network issues, timeouts) or `permanent` (auth failures, invalid config) |
|
||||
| `consecutiveFailures` | integer | Current failure streak length (resets on successful poll) |
|
||||
| `firstFailureAt` | timestamp nullable | RFC 3339 timestamp when the current failure streak began. Useful for calculating failure duration |
|
||||
|
||||
**Timing Metadata (v4.24.0+):**
|
||||
- `firstFailureAt`: Tracks when a failure streak started, enabling "failing for X minutes" calculations
|
||||
- Resets to `null` when a successful poll occurs
|
||||
- Combine with `consecutiveFailures` to assess severity
|
||||
|
||||
#### Breaker (`breaker`)
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `state` | string | `closed` (healthy), `open` (failing), `half_open` (testing recovery), or `unknown` (not initialized) |
|
||||
| `since` | timestamp nullable | RFC 3339 timestamp when the current state began. Use to calculate how long a breaker has been open |
|
||||
| `lastTransition` | timestamp nullable | RFC 3339 timestamp of the most recent state change (e.g., closed → open) |
|
||||
| `retryAt` | timestamp nullable | RFC 3339 timestamp of next scheduled retry attempt when breaker is open or half-open |
|
||||
| `failureCount` | integer | Number of failures in the current breaker cycle. Resets when breaker closes |
|
||||
|
||||
**Circuit Breaker Timing (v4.24.0+):**
|
||||
- `since`: When did the current state start? (e.g., "breaker has been open for 5 minutes")
|
||||
- `lastTransition`: When was the last state change? (useful for detecting flapping)
|
||||
- `retryAt`: When will the next retry attempt occur? (for open/half-open states)
|
||||
- `failureCount`: How many failures have accumulated? (triggers state transitions)
|
||||
|
||||
**State Transitions:**
|
||||
- `closed` → `open`: Triggered after N failures (default: 5)
|
||||
- `open` → `half_open`: After timeout period, allows one test request
|
||||
- `half_open` → `closed`: If test request succeeds
|
||||
- `half_open` → `open`: If test request fails
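A sketch pulling these timing fields together for any breaker that is not closed (host placeholder):

```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.instances[]
        | select(.breaker.state != "closed")
        | {key, state: .breaker.state, since: .breaker.since, retryAt: .breaker.retryAt, failures: .breaker.failureCount}'
```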
|
||||
|
||||
#### Dead-letter (`deadLetter`)
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `present` | boolean | `true` if instance is in the DLQ |
|
||||
| `reason` | string | `max_retry_attempts` or `permanent_failure` |
|
||||
| `firstAttempt` | timestamp nullable | First time the instance hit DLQ |
|
||||
| `lastAttempt` | timestamp nullable | Most recent DLQ enqueue |
|
||||
| `retryCount` | integer | Number of DLQ attempts |
|
||||
| `nextRetry` | timestamp nullable | Next scheduled retry time |
|
||||
|
||||
---
|
||||
|
||||
## Example Response
|
||||
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -137,44 +14,13 @@ Each element gives a complete view of one instance.
|
||||
"queue": {
|
||||
"depth": 7,
|
||||
"dueWithinSeconds": 2,
|
||||
"perType": { "pve": 4, "pbs": 2, "pmg": 1 }
|
||||
"perType": { "pve": 4, "pbs": 2 }
|
||||
},
|
||||
"deadLetter": {
|
||||
"count": 1,
|
||||
"tasks": [
|
||||
{
|
||||
"instance": "pbs-b",
|
||||
"type": "pbs",
|
||||
"nextRun": "2025-10-20T13:30:00Z",
|
||||
"lastError": "401 unauthorized",
|
||||
"failures": 5
|
||||
}
|
||||
]
|
||||
},
|
||||
"breakers": [
|
||||
{
|
||||
"instance": "pve-a",
|
||||
"type": "pve",
|
||||
"state": "half_open",
|
||||
"failures": 3,
|
||||
"retryAt": "2025-10-20T13:06:15Z"
|
||||
}
|
||||
],
|
||||
"staleness": [
|
||||
{
|
||||
"instance": "pve-a",
|
||||
"type": "pve",
|
||||
"score": 0.42,
|
||||
"lastSuccess": "2025-10-20T13:05:10Z",
|
||||
"lastError": "2025-10-20T13:05:40Z"
|
||||
}
|
||||
],
|
||||
"instances": [
|
||||
{
|
||||
"key": "pve::pve-a",
|
||||
"type": "pve",
|
||||
"displayName": "Pulse PVE Cluster",
|
||||
"instance": "pve-a",
|
||||
"connection": "https://pve-a:8006",
|
||||
"pollStatus": {
|
||||
"lastSuccess": "2025-10-20T13:05:10Z",
|
||||
@@ -187,133 +33,50 @@ Each element gives a complete view of one instance.
|
||||
"firstFailureAt": "2025-10-20T13:05:20Z"
|
||||
},
|
||||
"breaker": {
|
||||
"state": "half_open",
|
||||
"since": "2025-10-20T13:05:40Z",
|
||||
"lastTransition": "2025-10-20T13:05:40Z",
|
||||
"state": "half_open", // closed, open, half_open
|
||||
"retryAt": "2025-10-20T13:06:15Z",
|
||||
"failureCount": 3
|
||||
},
|
||||
"deadLetter": {
|
||||
"present": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"key": "pbs::pbs-b",
|
||||
"type": "pbs",
|
||||
"displayName": "Backup PBS",
|
||||
"instance": "pbs-b",
|
||||
"connection": "https://pbs-b:8007",
|
||||
"pollStatus": {
|
||||
"lastSuccess": "2025-10-20T12:55:00Z",
|
||||
"lastError": {
|
||||
"at": "2025-10-20T13:00:01Z",
|
||||
"message": "401 unauthorized",
|
||||
"category": "permanent"
|
||||
},
|
||||
"consecutiveFailures": 5,
|
||||
"firstFailureAt": "2025-10-20T12:58:30Z"
|
||||
},
|
||||
"breaker": {
|
||||
"state": "open",
|
||||
"since": "2025-10-20T13:00:01Z",
|
||||
"lastTransition": "2025-10-20T13:00:01Z",
|
||||
"retryAt": "2025-10-20T13:02:01Z",
|
||||
"failureCount": 5
|
||||
},
|
||||
"deadLetter": {
|
||||
"present": true,
|
||||
"reason": "max_retry_attempts",
|
||||
"firstAttempt": "2025-10-20T12:58:30Z",
|
||||
"lastAttempt": "2025-10-20T13:00:01Z",
|
||||
"retryCount": 5,
|
||||
"nextRetry": "2025-10-20T13:30:00Z"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
## Useful `jq` Queries

### Instances with recent errors

```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'
```

### Current dead-letter queue entries

```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'
```

### Breakers not closed

```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
```

### Stale instances (score > 0.5)

```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.staleness[] | select(.score > 0.5)'
```

### Instances sorted by failure streak

```bash
curl -s http://HOST:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.consecutiveFailures > 0) | {key, failures: .pollStatus.consecutiveFailures}'
```
|
||||
|
||||
---
|
||||
|
||||
## Migration Notes
|
||||
|
||||
| Legacy Field | Status | Replacement |
|
||||
|--------------|--------|-------------|
|
||||
| `breakers` array | retains summary | use `instances[].breaker` for detailed view |
|
||||
| `deadLetter.tasks` | retains summary | use `instances[].deadLetter` for per-instance enrichment |
|
||||
| `staleness` array | unchanged | combined with `pollStatus.lastSuccess` gives precise timestamps |
|
||||
|
||||
The `instances` array centralizes per-instance telemetry; existing integrations can migrate at their own pace.
|
||||
|
||||
---
|
||||
|
||||
## Operational Notes
|
||||
|
||||
**v4.24.0 Behavior:**
|
||||
- **Read-only endpoint**: This endpoint is informational only and does not modify scheduler state
|
||||
- **Rate limiting**: Falls under the general API limit (500 requests/minute per IP)
|
||||
- **Authentication required**: Must provide valid session cookie or API token
|
||||
- **Adaptive polling disabled**: When adaptive polling is disabled (`enabled: false`), the response includes empty `breakers`, `staleness`, and `instances` arrays
|
||||
- **Real-time data**: Reflects current scheduler state; not historical (for trends, use metrics/logs)
|
||||
- **No query parameters**: Returns complete snapshot on every request
|
||||
- **Automatic adjustments**: The `enabled` field automatically reflects the `AdaptivePollingEnabled` system setting
|
||||
|
||||
**Use Cases:**
|
||||
- **Monitoring dashboards**: Embed in Grafana/Prometheus for real-time scheduler health
|
||||
- **Alerting**: Trigger alerts on open circuit breakers or high DLQ counts
|
||||
- **Debugging**: Investigate why specific instances aren't polling successfully
|
||||
- **Capacity planning**: Monitor queue depth trends to assess if polling intervals need adjustment
|
||||
|
||||
**Breaking Changes:**
|
||||
- **None**: v4.24.0 only adds fields; all existing consumers continue to work
|
||||
- Consumers just gain access to richer metadata (`firstFailureAt`, breaker timestamps, DLQ retry windows)
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting Examples
|
||||
|
||||
1. **Transient outages:** look for `pollStatus.lastError.category == "transient"` to confirm network hiccups; check `breaker.retryAt` to see when retries resume.
|
||||
2. **Permanent failures:** `deadLetter.present == true` with `reason == "permanent_failure"` indicates credential or configuration issues.
|
||||
3. **Breaker stuck:** `breaker.state != "closed"` with `since` > 5 minutes suggests manual intervention or rollback.
|
||||
4. **Staleness spike:** compare `pollStatus.lastSuccess` with `updatedAt` to estimate data age; cross-reference `staleness.score` for alert thresholds.
|
||||
|
||||
Use Grafana dashboards for historical trends; the API complements dashboards by revealing instant state and precise failure context.
|
||||
|
||||
@@ -1,111 +1,37 @@
|
||||
# Mock Mode Development Guide

Pulse ships with a mock data pipeline so you can iterate on UI and backend
changes without touching real infrastructure. This guide collects everything you
need to know about running in mock mode during development.
|
||||
|
||||
---
|
||||
|
||||
## Why Mock Mode?
|
||||
|
||||
- Exercise dashboards, alert timelines, and charts with predictable sample data.
|
||||
- Reproduce edge cases (offline nodes, noisy containers, backup failures) by
|
||||
tweaking configuration values rather than waiting for production incidents.
|
||||
- Swap between synthetic and live data without rebuilding services.
|
||||
|
||||
---
|
||||
|
||||
## Starting the Dev Stack

```bash
# Launch backend + frontend with hot reload
./scripts/hot-dev.sh
```

The script exposes:
- Frontend: `http://localhost:7655` (Vite hot module reload)
- Backend API: `http://localhost:7656`
|
||||
|
||||
---
|
||||
|
||||
|
||||
## Toggling Mock Data

The npm helpers and `toggle-mock.sh` wrapper point the backend at different
`.env` files and restart the relevant services automatically.
|
||||
|
||||
```bash
|
||||
npm run mock:on # Enable mock mode
|
||||
npm run mock:off # Return to real data
|
||||
npm run mock:status # Display current state
|
||||
npm run mock:edit # Open mock.env in $EDITOR
|
||||
```
|
||||
|
||||
Equivalent shell invocations:
|
||||
|
||||
```bash
|
||||
./scripts/toggle-mock.sh on
|
||||
./scripts/toggle-mock.sh off
|
||||
./scripts/toggle-mock.sh status
|
||||
```
|
||||
|
||||
When switching:
|
||||
- `mock.env` (or `mock.env.local`) feeds configuration values to the backend.
|
||||
- `PULSE_DATA_DIR` swaps between `/opt/pulse/tmp/mock-data` (synthetic) and
|
||||
`/etc/pulse` (real data) so test credentials never mix with production ones.
|
||||
- The backend process restarts; the frontend stays hot-reloading.
|
||||
|
||||
---
|
||||
|
||||
## Customising Mock Fixtures
|
||||
|
||||
`mock.env` exposes the knobs most developers care about:
|
||||
|
||||
```bash
|
||||
PULSE_MOCK_MODE=false # Enable/disable mock mode
|
||||
PULSE_MOCK_NODES=7 # Number of synthetic nodes
|
||||
PULSE_MOCK_VMS_PER_NODE=5 # Average VM count per node
|
||||
PULSE_MOCK_LXCS_PER_NODE=8 # Average container count per node
|
||||
PULSE_MOCK_RANDOM_METRICS=true # Toggle metric jitter
|
||||
PULSE_MOCK_STOPPED_PERCENT=20 # Percentage of guests stopped/offline
|
||||
PULSE_ALLOW_DOCKER_UPDATES=true # Treat Docker builds as update-capable (skips restart)
|
||||
```
|
||||
|
||||
When `PULSE_ALLOW_DOCKER_UPDATES` (or `PULSE_MOCK_MODE`) is enabled the backend
|
||||
exposes the full update flow inside containers, fakes the deployment type to
|
||||
`mock`, and suppresses the automatic process exit that normally follows a
|
||||
successful upgrade. This is what the Playwright update suite uses inside CI.
|
||||
|
||||
Create `mock.env.local` for personal tweaks that should not be committed:
|
||||
|
||||
```bash
|
||||
cp mock.env mock.env.local
|
||||
$EDITOR mock.env.local
|
||||
```
|
||||
|
||||
The toggle script prioritises `.local` files, falling back to the shared
|
||||
defaults when none are present.
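
For illustration, a hypothetical `mock.env.local` that stresses the offline-guest path might look like this (values are examples, not recommendations):

```bash
# mock.env.local: personal overrides, not committed
PULSE_MOCK_MODE=true
PULSE_MOCK_NODES=3               # small cluster, easier to eyeball
PULSE_MOCK_VMS_PER_NODE=10
PULSE_MOCK_LXCS_PER_NODE=2
PULSE_MOCK_STOPPED_PERCENT=60    # majority of guests offline
PULSE_MOCK_RANDOM_METRICS=false  # deterministic charts for screenshots
```
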
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- **Backend did not restart:** flip mock mode off/on again (`npm run mock:off`,
|
||||
then `npm run mock:on`) to force a reload.
|
||||
- **Ports already in use:** confirm nothing else is listening on `7655`/`7656`
|
||||
(`lsof -i :7655` / `lsof -i :7656`) and kill stray processes.
|
||||
- **Data feels stale:** delete `/opt/pulse/tmp/mock-data` and toggle mock mode
|
||||
back on to regenerate fixtures.
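
A minimal reset sequence using the helpers above might be:

```bash
# Regenerate mock fixtures from scratch
npm run mock:off
rm -rf /opt/pulse/tmp/mock-data
npm run mock:on
```
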
|
||||
|
||||
---
|
||||
|
||||
## Limitations
|
||||
|
||||
- Mock data focuses on happy-path flows; use real Proxmox/PBS environments
|
||||
before shipping changes that touch API integrations.
|
||||
- Webhook payloads are synthetically generated and omit provider-specific
|
||||
quirks—test with real channels for production rollouts.
|
||||
- Encrypt/decrypt flows still use the local crypto stack; do not treat mock mode
|
||||
as a sandbox for experimenting with credential formats.
|
||||
|
||||
For more advanced scenarios, inspect `scripts/hot-dev.sh` and the mock seeders
|
||||
under `internal/mock` for additional entry points.
|
||||
## ⚠️ Limitations
|
||||
* **Happy Path**: Focuses on standard flows; use real infrastructure for complex edge cases.
|
||||
* **Webhooks**: Synthetic payloads only.
|
||||
* **Encryption**: Uses local crypto stack (not a sandbox for auth).
|
||||
|
||||
@@ -1,187 +1,52 @@
|
||||
# Adaptive Polling Architecture
|
||||
# 📉 Adaptive Polling
|
||||
|
||||
## Overview
|
||||
Pulse uses an adaptive scheduler that adjusts poll cadence based on freshness, errors, and workload. The goal is to prioritize stale or changing instances while backing off on healthy, idle targets.
|
||||
Pulse uses an adaptive scheduler to optimize polling based on instance health and activity.
|
||||
|
||||
```mermaid
flowchart LR
    Scheduler[Scheduler]
    Queue[Priority Queue<br/>by NextRun]
    Workers[Workers]
    Scheduler -->|schedule| Queue
    Queue -->|dequeue| Workers
    Workers -->|success| Scheduler
    Workers -->|failure| CB[Circuit Breaker]
    CB -->|backoff| Scheduler
```

## 🧠 Architecture
* **Scheduler**: Calculates intervals based on health/staleness.
* **Priority Queue**: Min-heap keyed by `NextRun`.
* **Circuit Breaker**: Prevents hot loops on failing instances.
* **Backoff**: Exponential retry delays (5s to 5m).
|
||||
## ⚙️ Configuration
|
||||
Adaptive polling is **enabled by default**.
|
||||
|
||||
- **Scheduler** computes `ScheduledTask` entries using adaptive intervals.
|
||||
- **Task queue** is a min-heap keyed by `NextRun`; only due tasks execute.
|
||||
- **Workers** execute tasks, capture outcomes, reschedule via scheduler or backoff logic.
|
||||
### UI
|
||||
**Settings → System → Monitoring**.
|
||||
|
||||
## Key Components
|
||||
### Environment Variables
|
||||
| Variable | Default | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `ADAPTIVE_POLLING_ENABLED` | `true` | Enable/disable. |
|
||||
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Healthy poll rate. |
|
||||
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Active/busy rate. |
|
||||
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Idle/backoff rate. |
|
||||
|
||||
| Component | File | Responsibility |
|
||||
|-----------------------|-------------------------------------------|--------------------------------------------------------------|
|
||||
| Scheduler | `internal/monitoring/scheduler.go` | Calculates adaptive intervals per instance. |
|
||||
| Staleness tracker | `internal/monitoring/staleness_tracker.go`| Maintains freshness metadata and scores. |
|
||||
| Priority queue | `internal/monitoring/task_queue.go` | Orders `ScheduledTask` items by due time + priority. |
|
||||
| Circuit breaker | `internal/monitoring/circuit_breaker.go` | Trips on repeated failures, preventing hot loops. |
|
||||
| Backoff | `internal/monitoring/backoff.go` | Exponential retry delays with jitter. |
|
||||
| Workers | `internal/monitoring/monitor.go` | Pop tasks, execute pollers, reschedule or dead-letter. |
|
||||
## 📊 Metrics
|
||||
Exposed at `:9091/metrics`.
|
||||
|
||||
## Configuration
|
||||
| Metric | Type | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `pulse_monitor_poll_total` | Counter | Total poll attempts. |
|
||||
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll latency. |
|
||||
| `pulse_monitor_poll_staleness_seconds` | Gauge | Age since last success. |
|
||||
| `pulse_monitor_poll_queue_depth` | Gauge | Queue size. |
|
||||
| `pulse_monitor_poll_errors_total` | Counter | Error counts by category. |
|
||||
|
||||
**v4.24.0:** Adaptive polling is **enabled by default** but can be toggled without restart.
|
||||
## ⚡ Circuit Breaker
|
||||
| State | Trigger | Recovery |
|
||||
| :--- | :--- | :--- |
|
||||
| **Closed** | Normal operation. | — |
|
||||
| **Open** | ≥3 failures. | Backoff (max 5m). |
|
||||
| **Half-open** | Retry window elapsed. | Success = Closed; Fail = Open. |
|
||||
|
||||
### Via UI
|
||||
Navigate to **Settings → System → Monitoring** to enable/disable adaptive polling. Changes apply immediately without requiring a restart.
|
||||
**Dead Letter Queue**: After 5 transient or 1 permanent failure, tasks move to DLQ (30m retry).
|
||||
|
||||
### Via Environment Variables
|
||||
Environment variables (default in `internal/config/config.go`):
|
||||
## 🩺 Health API
|
||||
`GET /api/monitoring/scheduler/health` (Auth required)
|
||||
|
||||
| Variable | Default | Description |
|
||||
|-------------------------------------|---------|--------------------------------------------------|
|
||||
| `ADAPTIVE_POLLING_ENABLED` | true | **Changed in v4.24.0**: Now enabled by default |
|
||||
| `ADAPTIVE_POLLING_BASE_INTERVAL` | 10s | Target cadence when system is healthy |
|
||||
| `ADAPTIVE_POLLING_MIN_INTERVAL` | 5s | Lower bound (active instances) |
|
||||
| `ADAPTIVE_POLLING_MAX_INTERVAL` | 5m | Upper bound (idle instances) |
|
||||
|
||||
All settings persist in `system.json` and respond to environment overrides. **Changes apply without restart** when modified via UI.
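
As a sketch, the same overrides can be supplied as environment variables for containerised deployments (container name and image tag below are assumptions; adjust to your setup):

```bash
docker run -d --name pulse \
  -e ADAPTIVE_POLLING_ENABLED=true \
  -e ADAPTIVE_POLLING_BASE_INTERVAL=10s \
  -e ADAPTIVE_POLLING_MAX_INTERVAL=5m \
  -p 7655:7655 \
  rcourtman/pulse:latest
```
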
|
||||
|
||||
## Metrics
|
||||
|
||||
**v4.24.0:** Extended metrics for comprehensive monitoring.
|
||||
|
||||
Exposed via Prometheus (`:9091/metrics`):
|
||||
|
||||
| Metric | Type | Labels | Description |
|
||||
|---------------------------------------------|-----------|---------------------------------------|-------------------------------------------------|
|
||||
| `pulse_monitor_poll_total` | counter | `instance_type`, `instance`, `result` | Overall poll attempts (success/error) |
|
||||
| `pulse_monitor_poll_duration_seconds` | histogram | `instance_type`, `instance` | Poll latency per instance |
|
||||
| `pulse_monitor_poll_staleness_seconds` | gauge | `instance_type`, `instance` | Age since last success (0 on success) |
|
||||
| `pulse_monitor_poll_queue_depth` | gauge | — | Size of priority queue |
|
||||
| `pulse_monitor_poll_inflight` | gauge | `instance_type` | Concurrent tasks per type |
|
||||
| `pulse_monitor_poll_errors_total` | counter | `instance_type`, `instance`, `category` | Error counts by category (transient/permanent) |
|
||||
| `pulse_monitor_poll_last_success_timestamp` | gauge | `instance_type`, `instance` | Unix timestamp of last successful poll |
|
||||
|
||||
**Alerting Recommendations:**
|
||||
- Alert when `pulse_monitor_poll_staleness_seconds` > 120 for critical instances
|
||||
- Alert when `pulse_monitor_poll_queue_depth` > 50 (backlog building)
|
||||
- Alert when `pulse_monitor_poll_errors_total` with `category=permanent` increases (auth/config issues); a quick spot-check sketch follows below
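
For an ad-hoc check of the first recommendation, a query against the Prometheus HTTP API works; the Prometheus hostname and port below are assumptions.

```bash
# List instances whose staleness currently exceeds the 120 s threshold
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=pulse_monitor_poll_staleness_seconds > 120' \
  | jq '.data.result[] | {instance: .metric.instance, staleness: .value[1]}'
```
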
|
||||
|
||||
## Circuit Breaker & Backoff
|
||||
|
||||
| State | Trigger | Recovery |
|
||||
|-------------|---------------------------------------------|--------------------------------------------|
|
||||
| **Closed** | Default. Failures counted. | — |
|
||||
| **Open** | ≥3 consecutive failures. Poll suppressed. | Exponential delay (max 5 min). |
|
||||
| **Half-open**| Retry window elapsed. Limited re-attempt. | Success ⇒ closed. Failure ⇒ open. |
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> Closed: Startup / reset
|
||||
Closed: Default state\nPolling active\nFailure counter increments
|
||||
Closed --> Open: ≥3 consecutive failures
|
||||
Open: Polls suppressed\nScheduler schedules backoff (max 5m)
|
||||
Open --> HalfOpen: Retry window elapsed
|
||||
HalfOpen: Single probe allowed\nBreaker watches probe result
|
||||
HalfOpen --> Closed: Probe success\nReset failure streak & delay
|
||||
HalfOpen --> Open: Probe failure\nIncrease streak & backoff
|
||||
```
|
||||
|
||||
Backoff configuration:
|
||||
|
||||
- Initial delay: 5 s
|
||||
- Multiplier: x2 per failure
|
||||
- Jitter: ±20 %
|
||||
- Max delay: 5 minutes
|
||||
- After 5 transient failures or any permanent failure, task moves to dead-letter queue for operator action.
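
A rough sketch of the resulting delay curve, using the same parameters as above (jitter approximated with bash integer math):

```bash
delay=5
for attempt in 1 2 3 4 5 6 7; do
  jitter=$(( (RANDOM % 41) - 20 ))              # roughly +/-20 % jitter
  echo "attempt ${attempt}: ~$(( delay + delay * jitter / 100 ))s (base ${delay}s)"
  delay=$(( delay * 2 )); (( delay > 300 )) && delay=300   # cap at 5 minutes
done
```
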
|
||||
|
||||
## Dead-Letter Queue
|
||||
|
||||
Dead-letter entries are kept in memory (same `TaskQueue` structure) with a 30 min recheck interval. Operators should inspect logs for `Routing task to dead-letter queue` messages. Future work (Task 8) will add API surfaces for inspection.
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### GET /api/monitoring/scheduler/health
|
||||
|
||||
Returns comprehensive scheduler health data (authentication required).
|
||||
|
||||
**Response format:**
|
||||
|
||||
```json
|
||||
{
|
||||
"updatedAt": "2025-03-21T18:05:00Z",
|
||||
"enabled": true,
|
||||
"queue": {
|
||||
"depth": 7,
|
||||
"dueWithinSeconds": 2,
|
||||
"perType": {
|
||||
"pve": 4,
|
||||
"pbs": 2,
|
||||
"pmg": 1
|
||||
}
|
||||
},
|
||||
"deadLetter": {
|
||||
"count": 2,
|
||||
"tasks": [
|
||||
{
|
||||
"instance": "pbs-nas",
|
||||
"type": "pbs",
|
||||
"nextRun": "2025-03-21T18:25:00Z",
|
||||
"lastError": "connection timeout",
|
||||
"failures": 7
|
||||
}
|
||||
]
|
||||
},
|
||||
"breakers": [
|
||||
{
|
||||
"instance": "pve-core",
|
||||
"type": "pve",
|
||||
"state": "half_open",
|
||||
"failures": 3,
|
||||
"retryAt": "2025-03-21T18:05:45Z"
|
||||
}
|
||||
],
|
||||
"staleness": [
|
||||
{
|
||||
"instance": "pve-core",
|
||||
"type": "pve",
|
||||
"score": 0.12,
|
||||
"lastSuccess": "2025-03-21T18:04:50Z"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Field descriptions:**
|
||||
|
||||
- `enabled`: Feature flag status
|
||||
- `queue.depth`: Total queued tasks
|
||||
- `queue.dueWithinSeconds`: Tasks due within 12 seconds
|
||||
- `queue.perType`: Distribution by instance type
|
||||
- `deadLetter.count`: Total dead-letter tasks
|
||||
- `deadLetter.tasks`: Up to 25 most recent dead-letter entries
|
||||
- `breakers`: Circuit breaker states (only non-default states shown)
|
||||
- `staleness`: Freshness scores per instance (0 = fresh, 1 = max stale)
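
As a small example, a healthcheck script could fail whenever dead-letter tasks accumulate, using the `deadLetter.count` field described above:

```bash
#!/usr/bin/env bash
# Fail when the scheduler reports any dead-letter tasks
count=$(curl -s http://localhost:7655/api/monitoring/scheduler/health | jq -r '.deadLetter.count // 0')
if [ "${count:-0}" -gt 0 ]; then
  echo "scheduler has ${count} dead-letter task(s)" >&2
  exit 1
fi
echo "dead-letter queue is empty"
```
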
|
||||
|
||||
## Operational Guidance
|
||||
|
||||
1. **Enable adaptive polling**: set `ADAPTIVE_POLLING_ENABLED=true` via UI or environment overrides, then restart hot-dev (`scripts/hot-dev.sh`).
|
||||
2. **Monitor metrics** to ensure queue depth and staleness remain within SLA. Configure alerting on `poll_staleness_seconds` and `poll_queue_depth`.
|
||||
3. **Inspect scheduler health** via API endpoint `/api/monitoring/scheduler/health` for circuit breaker trips and dead-letter queue status.
|
||||
4. **Review dead-letter logs** for persistent failures; resolve underlying connectivity or auth issues before re-enabling.
|
||||
|
||||
## Rollout Plan
|
||||
|
||||
1. **Dev/QA**: Run hot-dev with feature flag enabled; observe metrics and logs for several cycles.
|
||||
2. **Staged deploy**: Enable flag on a subset of clusters; monitor queue depth (<50) and staleness (<45 s).
|
||||
3. **Full rollout**: Toggle flag globally once metrics are stable; document any overrides in release notes.
|
||||
4. **Post-launch**: Add Grafana panels for queue depth & staleness; alert on circuit breaker trips (future API work).
|
||||
|
||||
## Known Follow-ups
|
||||
|
||||
- Task 8: expose scheduler health & dead-letter statistics via API and UI panels.
|
||||
- Task 9: add dedicated unit/integration harness for the scheduler & workers.
|
||||
Returns:
|
||||
* Queue depth & breakdown.
|
||||
* Dead-letter tasks.
|
||||
* Circuit breaker states.
|
||||
* Per-instance staleness.
|
||||
|
||||
@@ -1,81 +1,36 @@
|
||||
# Pulse Prometheus Metrics (v4.24.0+)
|
||||
# 📊 Prometheus Metrics
|
||||
|
||||
Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules.
|
||||
|
||||
---
|
||||
|
||||
## HTTP Request Metrics
|
||||
|
||||
| Metric | Type | Labels | Description |
|
||||
| --- | --- | --- | --- |
|
||||
| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). |
|
||||
| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. |
|
||||
| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. |
|
||||
|
||||
**Alert suggestion:**
|
||||
`rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops.
|
||||
|
||||
---
|
||||
|
||||
## Per-Node Poll Metrics
|
||||
|
||||
| Metric | Type | Labels | Description |
|
||||
| --- | --- | --- | --- |
|
||||
| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration for each node poll. |
|
||||
| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. |
|
||||
| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error type breakdown (connection, auth, internal, etc.). |
|
||||
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of last successful poll. |
|
||||
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since last success (−1 means no success yet). |
|
||||
|
||||
**Alert suggestion:**
|
||||
`max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes.
|
||||
|
||||
---
|
||||
|
||||
## Scheduler Health Metrics
|
||||
|
||||
| Metric | Type | Labels | Description |
|
||||
| --- | --- | --- | --- |
|
||||
| `pulse_scheduler_queue_due_soon` | Gauge | — | Number of tasks due within 12 seconds. |
|
||||
| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). |
|
||||
| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. |
|
||||
| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. |
|
||||
| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit breaker state: `0`=closed, `1`=half-open, `2`=open, `-1`=unknown. |
|
||||
| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. |
|
||||
| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker will allow the next attempt. |
|
||||
|
||||
**Alert suggestions:**
|
||||
- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > <instance count * 1.5>`
|
||||
- DLQ growth: `increase(pulse_scheduler_dead_letter_depth[10m]) > 0`
|
||||
- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for > 10 minutes.
|
||||
|
||||
---
|
||||
|
||||
## Diagnostics Cache Metrics
|
||||
|
||||
| Metric | Type | Labels | Description |
|
||||
| --- | --- | --- | --- |
|
||||
| `pulse_diagnostics_cache_hits_total` | Counter | — | Diagnostics requests served from cache. |
|
||||
| `pulse_diagnostics_cache_misses_total` | Counter | — | Requests that triggered a fresh probe. |
|
||||
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | — | Time taken to refresh diagnostics payload. |
|
||||
|
||||
**Alert suggestion:**
|
||||
`rate(pulse_diagnostics_cache_misses_total[5m])` spiking alongside `pulse_diagnostics_refresh_duration_seconds` > 20s can signal upstream slowness.
|
||||
|
||||
---
|
||||
|
||||
## Existing Instance-Level Poll Metrics (for completeness)
|
||||
|
||||
The following metrics pre-date v4.24.0 but remain essential:
|
||||
| Metric | Type | Description |
| --- | --- | --- |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last successful poll (instance-level). |
| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. |

Pulse exposes metrics at `/metrics` (default port `9091`).

## 🌐 HTTP Ingress
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_http_request_duration_seconds` | Histogram | Latency buckets by `method`, `route`, `status`. |
| `pulse_http_requests_total` | Counter | Total requests. |
| `pulse_http_request_errors_total` | Counter | 4xx/5xx errors. |
|
||||
|
||||
Refer to this document whenever you build dashboards or craft alert policies. Scrape all metrics from the Pulse backend `/metrics` endpoint (9091 by default for systemd installs).
|
||||
## 🔄 Polling & Nodes
|
||||
| Metric | Type | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. |
|
||||
| `pulse_monitor_node_poll_total` | Counter | Success/error counts per node. |
|
||||
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last success. |
|
||||
| `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. |
|
||||
|
||||
## 🧠 Scheduler Health
|
||||
| Metric | Type | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `pulse_scheduler_queue_depth` | Gauge | Queue depth per instance type. |
|
||||
| `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth per instance. |
|
||||
| `pulse_scheduler_breaker_state` | Gauge | `0`=Closed, `1`=Half-Open, `2`=Open. |
|
||||
|
||||
## ⚡ Diagnostics Cache
|
||||
| Metric | Type | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. |
|
||||
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. |
|
||||
|
||||
## 🚨 Alerting Examples
|
||||
* **High Error Rate**: `rate(pulse_http_request_errors_total[5m]) > 0.05`
|
||||
* **Stale Node**: `pulse_monitor_node_poll_staleness_seconds > 300`
|
||||
* **Breaker Open**: `pulse_scheduler_breaker_state == 2`
|
||||
|
||||
@@ -1,83 +1,30 @@
|
||||
# Adaptive Polling Rollout Runbook
|
||||
# 🚀 Adaptive Polling Rollout
|
||||
|
||||
Adaptive polling (v4.24.0+) lets the scheduler dynamically adjust poll
|
||||
intervals per resource. This runbook documents the safe way to enable, monitor,
|
||||
and, if needed, disable the feature across environments.
|
||||
Safely enable dynamic scheduling (v4.24.0+).
|
||||
|
||||
## Scope & Prerequisites
|
||||
## 📋 Pre-Flight
|
||||
1. **Snapshot Health**:
|
||||
```bash
|
||||
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq .
|
||||
```
|
||||
2. **Check Metrics**: Ensure `pulse_monitor_poll_queue_depth` is stable.
|
||||
|
||||
- Pulse **v4.24.0 or newer**
|
||||
- Admin access to **Settings → System → Monitoring**
|
||||
- Prometheus access to `pulse_monitor_*` metrics
|
||||
- Ability to run authenticated `curl` commands against the Pulse API
|
||||
## 🟢 Enable
|
||||
Choose one method:
|
||||
* **UI**: Settings → System → Monitoring → Adaptive Polling.
|
||||
* **CLI**: `jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && mv tmp /var/lib/pulse/system.json`
|
||||
* **Env**: `ADAPTIVE_POLLING_ENABLED=true` (Docker/K8s).
|
||||
|
||||
## Change Windows
|
||||
## 🔍 Monitor (First 15m)
|
||||
Watch for stability:
|
||||
```bash
|
||||
watch -n 5 'curl -s http://localhost:9091/metrics | grep pulse_monitor_poll_queue_depth'
|
||||
```
|
||||
* **Success**: Queue depth < 50, no permanent errors.
|
||||
* **Failure**: High queue depth, open breakers (see the breaker check below).
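
To check the failure criterion directly, list any breakers that are not closed (response shape as documented for the scheduler health endpoint):

```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health \
  | jq '[.breakers[]? | select(.state != "closed") | {instance, state, retryAt}]'
```
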
|
||||
|
||||
Run rollouts during a maintenance window where transient alert jitter is
|
||||
acceptable. Adaptive polling touches every monitor queue; give yourself at least
|
||||
15 minutes to observe steady state metrics.
|
||||
|
||||
## Rollout Steps
|
||||
|
||||
1. **Snapshot current health**
|
||||
```bash
|
||||
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.enabled, .queue.depth'
|
||||
```
|
||||
Record queue depth, breaker count, and dead-letter entries.
|
||||
|
||||
2. **Enable adaptive polling**
|
||||
- UI: toggle **Settings → System → Monitoring → Adaptive Polling** → Enable
|
||||
   - CLI: `jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && mv tmp /var/lib/pulse/system.json`
|
||||
- Env override: `ADAPTIVE_POLLING_ENABLED=true` before starting Pulse (for
|
||||
containers/k8s)
|
||||
|
||||
3. **Watch metrics (first 5 minutes)**
|
||||
```bash
|
||||
watch -n 5 'curl -s http://localhost:9091/metrics | grep -E "pulse_monitor_(poll_queue_depth|poll_staleness_seconds)" | head'
|
||||
```
|
||||
Targets:
|
||||
- `pulse_monitor_poll_queue_depth < 50`
|
||||
- `pulse_monitor_poll_staleness_seconds` under your SLA (typically < 60 s)
|
||||
- No spikes in `pulse_monitor_poll_errors_total{category="permanent"}`
|
||||
|
||||
4. **Validate scheduler state**
|
||||
```bash
|
||||
curl -s http://localhost:7655/api/monitoring/scheduler/health \
|
||||
| jq '{enabled, queue: .queue.depth, breakers: [.breakers[]?.instance], deadLetter: .deadLetter.count}'
|
||||
```
|
||||
Expect `enabled: true`, empty breaker list, and `deadLetter.count == 0`.
|
||||
|
||||
5. **Document overrides**
|
||||
- Note any instances moved to manual polling (Settings → Nodes → Polling)
|
||||
- Capture Grafana screenshots for queue depth/staleness widgets
|
||||
|
||||
## Rollback
|
||||
|
||||
If queue depth climbs uncontrollably or breakers remain open for >10 minutes:
|
||||
|
||||
1. Disable the feature the same way you enabled it (UI/environment).
|
||||
2. Restart Pulse if environment overrides were used, otherwise hot toggle is
|
||||
immediate.
|
||||
3. Continue monitoring until queue depth and staleness return to baseline.
|
||||
|
||||
## Canary Strategy Suggestions
|
||||
|
||||
| Stage | Action | Acceptance Criteria |
|
||||
| --- | --- | --- |
|
||||
| Dev | Enable flag in hot-dev (scripts/hot-dev.sh) | No scheduler panics, UI reflects flag instantly |
|
||||
| Staging | Enable on one Pulse instance per region | `queue.depth` within ±20 % of baseline after 15 min |
|
||||
| Production | Enable per cluster with 30 min soak | No more than 5 breaker openings per hour |
|
||||
|
||||
## Instrumentation Checklist
|
||||
|
||||
- Grafana dashboard with `queue.depth`, `poll_staleness_seconds`,
|
||||
`poll_errors_total` by type
|
||||
- Alert rule: `rate(pulse_monitor_poll_errors_total{category="permanent"}[5m]) > 0`
|
||||
- Alert rule: `max_over_time(pulse_monitor_poll_queue_depth[5m]) > 75`
|
||||
- JSON log search for `"scheduler":` warnings immediately after enablement
|
||||
|
||||
## References
|
||||
|
||||
- [Architecture doc](../monitoring/ADAPTIVE_POLLING.md)
|
||||
- [Scheduler Health API](../api/SCHEDULER_HEALTH.md)
|
||||
- [Kubernetes guidance](../KUBERNETES.md#adaptive-polling-configuration-v4250)
|
||||
## ↩️ Rollback
|
||||
If instability persists for more than 10 minutes:
|
||||
1. **Disable**: Toggle off via UI or Env.
|
||||
2. **Restart**: Required if using Env/CLI overrides.
|
||||
3. **Verify**: Confirm queue drains.
|
||||
|
||||
docs/operations/AUDIT_LOG_ROTATION.md (new file, 51 lines)
@@ -0,0 +1,51 @@
|
||||
# 🔄 Sensor Proxy Audit Log Rotation
|
||||
|
||||
The proxy writes append-only, hash-chained logs to `/var/log/pulse/sensor-proxy/audit.log`.
|
||||
|
||||
## ⚠️ Important
|
||||
* **Do not delete**: The file is protected with `chattr +a`.
|
||||
* **Rotate when**: >200MB or >30 days.
|
||||
|
||||
## 🛠️ Manual Rotation
|
||||
|
||||
Run as root:
|
||||
|
||||
```bash
|
||||
# 1. Unlock file
|
||||
chattr -a /var/log/pulse/sensor-proxy/audit.log
|
||||
|
||||
# 2. Rotate (copy & truncate)
|
||||
cp -a /var/log/pulse/sensor-proxy/audit.log /var/log/pulse/sensor-proxy/audit.log.$(date +%Y%m%d)
|
||||
: > /var/log/pulse/sensor-proxy/audit.log
|
||||
|
||||
# 3. Relock & Restart
|
||||
chown pulse-sensor-proxy:pulse-sensor-proxy /var/log/pulse/sensor-proxy/audit.log
|
||||
chmod 0640 /var/log/pulse/sensor-proxy/audit.log
|
||||
chattr +a /var/log/pulse/sensor-proxy/audit.log
|
||||
systemctl restart pulse-sensor-proxy
|
||||
```
|
||||
|
||||
## 🤖 Logrotate Config
|
||||
|
||||
Create `/etc/logrotate.d/pulse-sensor-proxy`:
|
||||
|
||||
```conf
|
||||
/var/log/pulse/sensor-proxy/audit.log {
|
||||
weekly
|
||||
rotate 8
|
||||
compress
|
||||
missingok
|
||||
notifempty
|
||||
create 0640 pulse-sensor-proxy pulse-sensor-proxy
|
||||
sharedscripts
|
||||
prerotate
|
||||
/usr/bin/chattr -a /var/log/pulse/sensor-proxy/audit.log || true
|
||||
endscript
|
||||
postrotate
|
||||
/bin/systemctl restart pulse-sensor-proxy.service || true
|
||||
/usr/bin/chattr +a /var/log/pulse/sensor-proxy/audit.log || true
|
||||
endscript
|
||||
}
|
||||
```
|
||||
|
||||
**Note**: Do NOT use `copytruncate`. The restart is required to reset the hash chain.
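
After a rotation (manual or via logrotate), a quick verification pass might look like this:

```bash
lsattr /var/log/pulse/sensor-proxy/audit.log            # expect the append-only 'a' flag
systemctl is-active pulse-sensor-proxy                   # expect "active"
journalctl -u pulse-sensor-proxy -n 20 | grep -i audit   # look for audit re-initialisation messages
```
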
|
||||
docs/operations/AUTO_UPDATE.md (new file, 47 lines)
@@ -0,0 +1,47 @@
|
||||
# 🔄 Automatic Updates
|
||||
Manage Pulse auto-updates on host-mode installations.
|
||||
|
||||
> **Note**: Docker/Kubernetes users should manage updates via their orchestrator.
|
||||
|
||||
## ⚙️ Components
|
||||
| File | Purpose |
|
||||
| :--- | :--- |
|
||||
| `pulse-update.timer` | Daily check (02:00 + jitter). |
|
||||
| `pulse-update.service` | Runs the update script. |
|
||||
| `pulse-auto-update.sh` | Fetches release & restarts Pulse. |
|
||||
|
||||
## 🚀 Enable/Disable
|
||||
|
||||
### Via UI (Recommended)
|
||||
**Settings → System → Updates → Automatic Updates**.
|
||||
|
||||
### Via CLI
|
||||
```bash
|
||||
# Enable
|
||||
sudo jq '.autoUpdateEnabled=true' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json
|
||||
sudo systemctl enable --now pulse-update.timer
|
||||
|
||||
# Disable
|
||||
sudo jq '.autoUpdateEnabled=false' /var/lib/pulse/system.json > tmp && sudo mv tmp /var/lib/pulse/system.json
|
||||
sudo systemctl disable --now pulse-update.timer
|
||||
```
|
||||
|
||||
## 🧪 Manual Run
|
||||
Test the update process:
|
||||
```bash
|
||||
sudo systemctl start pulse-update.service
|
||||
journalctl -u pulse-update -f
|
||||
```
|
||||
|
||||
## 🔍 Observability
|
||||
* **History**: `curl -s http://localhost:7655/api/updates/history | jq`
|
||||
* **Logs**: `/var/log/pulse/update-*.log`
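
To confirm the timer is actually armed, check its schedule and the most recent history entry:

```bash
systemctl list-timers pulse-update --no-pager
curl -s http://localhost:7655/api/updates/history | jq '.entries[0]'
```
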
|
||||
|
||||
## ↩️ Rollback
|
||||
If an update fails:
|
||||
1. Check logs: `/var/log/pulse/update-YYYYMMDDHHMMSS.log`.
|
||||
2. Revert manually:
|
||||
```bash
|
||||
sudo /opt/pulse/install.sh --version v4.30.0
|
||||
```
|
||||
Or use the **Rollback** button in the UI if available.
|
||||
docs/operations/SENSOR_PROXY_CONFIG.md (new file, 40 lines)
@@ -0,0 +1,40 @@
|
||||
# ⚙️ Sensor Proxy Configuration
|
||||
|
||||
Safe configuration management using the CLI (v4.31.1+).
|
||||
|
||||
## 📂 Files
|
||||
* **`config.yaml`**: General settings (logging, metrics).
|
||||
* **`allowed_nodes.yaml`**: Authorized node list (managed via CLI).
|
||||
|
||||
## 🛠️ CLI Reference
|
||||
|
||||
### Validation
|
||||
Check for errors before restart.
|
||||
```bash
|
||||
pulse-sensor-proxy config validate
|
||||
```
|
||||
|
||||
### Managing Nodes
|
||||
**Add Nodes (Merge):**
|
||||
```bash
|
||||
pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10
|
||||
```
|
||||
|
||||
**Replace List:**
|
||||
```bash
|
||||
pulse-sensor-proxy config set-allowed-nodes --replace \
|
||||
--merge 192.168.0.1 --merge 192.168.0.2
|
||||
```
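
A typical change is validate-then-restart, for example when a new node joins the cluster:

```bash
pulse-sensor-proxy config set-allowed-nodes --merge new-node.local
pulse-sensor-proxy config validate && sudo systemctl restart pulse-sensor-proxy
```
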
|
||||
|
||||
## ⚠️ Troubleshooting
|
||||
|
||||
**Validation Fails:**
|
||||
* Check for duplicate `allowed_nodes` blocks in `config.yaml`.
|
||||
* Run `pulse-sensor-proxy config validate 2>&1` for details.
|
||||
|
||||
**Lock Errors:**
|
||||
* Remove stale locks if process is dead: `rm /etc/pulse-sensor-proxy/*.lock`.
|
||||
|
||||
**Empty List:**
|
||||
* Valid for IPC-only clusters.
|
||||
* Populate manually if needed using `--replace` together with `--merge` entries.
|
||||
docs/operations/SENSOR_PROXY_LOGS.md (new file, 31 lines)
@@ -0,0 +1,31 @@
|
||||
# 📝 Sensor Proxy Log Forwarding
|
||||
|
||||
Forward `audit.log` and `proxy.log` to a central SIEM via RELP + TLS.
|
||||
|
||||
## 🚀 Quick Start
|
||||
Run the helper script with your collector details:
|
||||
|
||||
```bash
|
||||
sudo REMOTE_HOST=logs.example.com \
|
||||
REMOTE_PORT=6514 \
|
||||
CERT_DIR=/etc/pulse/log-forwarding \
|
||||
CA_CERT=/path/to/ca.crt \
|
||||
CLIENT_CERT=/path/to/client.crt \
|
||||
CLIENT_KEY=/path/to/client.key \
|
||||
/opt/pulse/scripts/setup-log-forwarding.sh
|
||||
```
|
||||
|
||||
## 📋 What It Does
|
||||
1. **Inputs**: Watches `/var/log/pulse/sensor-proxy/{audit,proxy}.log`.
|
||||
2. **Queue**: Disk-backed queue (50k messages) for reliability.
|
||||
3. **Output**: RELP over TLS to `REMOTE_HOST`.
|
||||
4. **Mirror**: Local debug file at `/var/log/pulse/sensor-proxy/forwarding.log`.
|
||||
|
||||
## ✅ Verification
|
||||
1. **Check Status**: `sudo systemctl status rsyslog`
|
||||
2. **View Mirror**: `tail -f /var/log/pulse/sensor-proxy/forwarding.log`
|
||||
3. **Test**: Restart proxy and check remote collector for `pulse.audit` tag.
|
||||
|
||||
## 🧹 Maintenance
|
||||
* **Disable**: Remove `/etc/rsyslog.d/pulse-sensor-proxy.conf` and restart rsyslog.
|
||||
* **Rotate Certs**: Replace files in `CERT_DIR` and restart rsyslog.
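
For example, to disable forwarding:

```bash
sudo rm /etc/rsyslog.d/pulse-sensor-proxy.conf
sudo systemctl restart rsyslog
```
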
|
||||
@@ -1,120 +0,0 @@
|
||||
# Sensor Proxy Audit Log Rotation
|
||||
|
||||
The temperature sensor proxy writes append-only, hash-chained audit events to
|
||||
`/var/log/pulse/sensor-proxy/audit.log`. The file is created with `0640`
|
||||
permissions, owned by `pulse-sensor-proxy`, and protected with `chattr +a` via
|
||||
`scripts/secure-sensor-files.sh`. Because the process keeps the file handle open
|
||||
and enforces append-only mode, you **must** follow the steps below to rotate the
|
||||
log without losing events.
|
||||
|
||||
## When to Rotate
|
||||
|
||||
- File exceeds **200 MB** or contains more than 30 days of history
|
||||
- Prior to exporting evidence for an incident review
|
||||
- Immediately before changing log-forwarding endpoints (rsyslog/RELP)
|
||||
|
||||
The proxy falls back to stderr (systemd journal) only when the file cannot be
|
||||
opened. Do not rely on the fallback for long-term retention.
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
1. Confirm the service is healthy:
|
||||
```bash
|
||||
systemctl status pulse-sensor-proxy --no-pager
|
||||
```
|
||||
2. Make sure `/var/log/pulse/sensor-proxy` is mounted with enough free space:
|
||||
```bash
|
||||
df -h /var/log/pulse/sensor-proxy
|
||||
```
|
||||
3. Note the current scheduler health inside Pulse for later verification:
|
||||
```bash
|
||||
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.queue.depth, .deadLetter.count'
|
||||
```
|
||||
|
||||
## Manual Rotation Procedure
|
||||
|
||||
> Run these steps as **root** on the Proxmox host that runs the proxy.
|
||||
|
||||
1. Remove the append-only flag (logrotate needs to truncate the file):
|
||||
```bash
|
||||
chattr -a /var/log/pulse/sensor-proxy/audit.log
|
||||
```
|
||||
2. Copy the current file to an evidence path, then truncate in place:
|
||||
```bash
|
||||
ts=$(date +%Y%m%d-%H%M%S)
|
||||
cp -a /var/log/pulse/sensor-proxy/audit.log /var/log/pulse/sensor-proxy/audit.log.$ts
|
||||
: > /var/log/pulse/sensor-proxy/audit.log
|
||||
```
|
||||
3. Restore permissions and the append-only flag:
|
||||
```bash
|
||||
chown pulse-sensor-proxy:pulse-sensor-proxy /var/log/pulse/sensor-proxy/audit.log
|
||||
chmod 0640 /var/log/pulse/sensor-proxy/audit.log
|
||||
chattr +a /var/log/pulse/sensor-proxy/audit.log
|
||||
```
|
||||
4. Restart the proxy so the file descriptor is reopened:
|
||||
```bash
|
||||
systemctl restart pulse-sensor-proxy
|
||||
```
|
||||
5. Verify the service recreated the correlation hash chain:
|
||||
```bash
|
||||
journalctl -u pulse-sensor-proxy -n 20 | grep -i "audit" || true
|
||||
```
|
||||
6. Re-check Pulse adaptive polling health (temperature pollers rely on the
|
||||
proxy):
|
||||
```bash
|
||||
curl -s http://localhost:7655/api/monitoring/scheduler/health \
|
||||
| jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}'
|
||||
```
|
||||
All temperature instances should show `breaker: "closed"` with
|
||||
`deadLetter: false`.
|
||||
|
||||
## Logrotate Configuration
|
||||
|
||||
Automate rotation with `/etc/logrotate.d/pulse-sensor-proxy`. Copy the snippet
|
||||
below and adjust retention to match your compliance needs:
|
||||
|
||||
```conf
|
||||
/var/log/pulse/sensor-proxy/audit.log {
|
||||
weekly
|
||||
rotate 8
|
||||
compress
|
||||
missingok
|
||||
notifempty
|
||||
create 0640 pulse-sensor-proxy pulse-sensor-proxy
|
||||
sharedscripts
|
||||
prerotate
|
||||
/usr/bin/chattr -a /var/log/pulse/sensor-proxy/audit.log || true
|
||||
endscript
|
||||
postrotate
|
||||
/bin/systemctl restart pulse-sensor-proxy.service || true
|
||||
/usr/bin/chattr +a /var/log/pulse/sensor-proxy/audit.log || true
|
||||
endscript
|
||||
}
|
||||
```
|
||||
|
||||
Keep `copytruncate` disabled—the restart ensures the proxy writes to a fresh
|
||||
file with a new hash chain. Always forward rotated files to your SIEM before
|
||||
removing them.
|
||||
|
||||
## Forwarding Validations
|
||||
|
||||
If you forward audit logs over RELP using `scripts/setup-log-forwarding.sh`:
|
||||
|
||||
1. Tail the forwarding log:
|
||||
```bash
|
||||
tail -f /var/log/pulse/sensor-proxy/forwarding.log
|
||||
```
|
||||
2. Ensure queues drain (`action.resumeRetryCount=-1` keeps retrying).
|
||||
3. Confirm the remote receiver ingests the new file (look for the `pulse.audit`
|
||||
tag).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Action |
|
||||
| --- | --- |
|
||||
| `Operation not permitted` when truncating | `chattr -a` was not executed or SELinux/AppArmor denies it. Check `auditd`. |
|
||||
| Proxy fails to restart | Run `journalctl -u pulse-sensor-proxy -xe` for context. The proxy refuses to start if the audit file cannot be opened. |
|
||||
| Temperature polls stop after rotation | Check `/api/monitoring/scheduler/health` for dead-letter entries. Restart the main Pulse service if breakers stay open. |
|
||||
|
||||
Once logs are rotated and validated, upload the archived copy to your evidence
|
||||
store and record the event in your change log.
|
||||
@@ -1,104 +0,0 @@
|
||||
# Pulse Automatic Update Runbook
|
||||
|
||||
Automatic updates are handled by three systemd units that live on host-mode
|
||||
installations:
|
||||
|
||||
| Component | Purpose | File |
|
||||
| --- | --- | --- |
|
||||
| `pulse-update.timer` | Schedules daily checks (02:00 + 0‑4 h jitter) | `/etc/systemd/system/pulse-update.timer` |
|
||||
| `pulse-update.service` | Runs a single update cycle when triggered | `/etc/systemd/system/pulse-update.service` |
|
||||
| `scripts/pulse-auto-update.sh` | Fetches release metadata, downloads binaries, restarts Pulse | `/opt/pulse/scripts/pulse-auto-update.sh` |
|
||||
|
||||
> Docker and Kubernetes deployments do **not** use this flow—manage upgrades via
|
||||
> your orchestrator.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- `autoUpdateEnabled` must be `true` in `/var/lib/pulse/system.json` (or toggled in
|
||||
**Settings → System → Updates → Automatic Updates**).
|
||||
- `pulse.service` must be healthy—the update service short-circuits if Pulse is
|
||||
not running.
|
||||
- Host needs outbound HTTPS access to `github.com` and `objects.githubusercontent.com`.
|
||||
|
||||
## Enable or Disable
|
||||
|
||||
### From the UI
|
||||
1. Navigate to **Settings → System → Updates**.
|
||||
2. Toggle **Automatic Updates** on. The backend persists `autoUpdateEnabled:true`
|
||||
and surfaces a reminder to enable the timer.
|
||||
3. On the host, run:
|
||||
```bash
|
||||
sudo systemctl enable --now pulse-update.timer
|
||||
sudo systemctl status pulse-update.timer --no-pager
|
||||
```
|
||||
4. To disable later, toggle the UI switch off **and** run
|
||||
`sudo systemctl disable --now pulse-update.timer`.
|
||||
|
||||
### From the CLI only
|
||||
```bash
|
||||
# Opt in
|
||||
sudo jq '.autoUpdateEnabled=true' /var/lib/pulse/system.json | sudo tee /var/lib/pulse/system.json >/dev/null
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now pulse-update.timer
|
||||
|
||||
# Opt out
|
||||
sudo jq '.autoUpdateEnabled=false' /var/lib/pulse/system.json | sudo tee /var/lib/pulse/system.json >/dev/null
|
||||
sudo systemctl disable --now pulse-update.timer
|
||||
```
|
||||
> Editing `system.json` while Pulse is running is safe, but prefer the UI so
|
||||
> validation rules stay in place.
|
||||
|
||||
## Trigger a Manual Run
|
||||
|
||||
Use this when testing new releases or after changing firewall rules:
|
||||
|
||||
```bash
|
||||
sudo systemctl start pulse-update.service
|
||||
sudo journalctl -u pulse-update -n 50
|
||||
```
|
||||
|
||||
The oneshot service exits when the script finishes. A successful run logs the new
|
||||
version and writes an entry to `update-history.jsonl`.
|
||||
|
||||
## Observability Checklist
|
||||
|
||||
- **Timer status**: `systemctl list-timers pulse-update`
|
||||
- **History API**: `curl -s http://localhost:7655/api/updates/history | jq '.entries[0]'`
|
||||
- **Raw log**: `/var/log/pulse/update-*.log` (referenced inside the history entry’s
|
||||
`log_path` field)
|
||||
- **Journal**: `journalctl -u pulse-update -f`
|
||||
- **Backups**: The script records `backup_path` in history (defaults to
|
||||
`/etc/pulse.backup.<timestamp>`). Ensure the path exists before acknowledging
|
||||
the rollout.
|
||||
|
||||
## Failure Handling & Rollback
|
||||
|
||||
1. Inspect the failing history entry:
|
||||
```bash
|
||||
curl -s http://localhost:7655/api/updates/history?limit=1 | jq '.entries[0]'
|
||||
```
|
||||
Common statuses: `failed`, `rolled_back`, `succeeded`.
|
||||
2. Review `/var/log/pulse/update-YYYYMMDDHHMMSS.log` for the stack trace.
|
||||
3. To revert, redeploy the previous release:
|
||||
```bash
|
||||
sudo /opt/pulse/install.sh --version v4.30.0
|
||||
```
|
||||
or use the main installer command from the update history output. The installer
|
||||
restores the `backup_path` recorded earlier when you choose **Rollback** in the
|
||||
UI.
|
||||
4. Confirm Pulse is healthy (`systemctl status pulse.service`) and that
|
||||
`/api/updates/history` now contains a `rolled_back` entry referencing the same
|
||||
`event_id`.
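
   A quick way to verify both conditions from the shell (the `status` field name is assumed from the statuses and `event_id` mentioned above):
   ```bash
   systemctl status pulse.service --no-pager
   curl -s "http://localhost:7655/api/updates/history?limit=5" \
     | jq '.entries[] | select(.status == "rolled_back") | {event_id, status}'
   ```
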
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Resolution |
|
||||
| --- | --- |
|
||||
| `Auto-updates disabled in configuration` in journal | Set `autoUpdateEnabled:true` (UI or edit `system.json`) and restart the timer. |
|
||||
| `pulse-update.timer` immediately exits | Ensure `systemd` knows about the units (`sudo systemctl daemon-reload`) and that `pulse.service` exists (installer may not have run with `--enable-auto-updates`). |
|
||||
| `github.com` errors / rate limit | The script retries via the release redirect. For proxied environments set `https_proxy` before the service runs. |
|
||||
| Update succeeds but Pulse stays on previous version | Check `journalctl -u pulse-update` for `restart failed`; Pulse only switches after the service restarts successfully. |
|
||||
| Timer enabled but no history entries | Verify time has passed since enablement (timer includes random delay) or start the service manually to seed the first run. |
|
||||
|
||||
Document each run (success or rollback) in your change journal with the
|
||||
`event_id` from `/api/updates/history` so you can cross-reference audit trails.
|
||||
@@ -1,469 +0,0 @@
|
||||
# Sensor Proxy Configuration Management
|
||||
|
||||
This guide covers safe configuration management for pulse-sensor-proxy, including the new CLI tools introduced in v4.31.1+ to prevent config corruption.
|
||||
|
||||
## Overview
|
||||
|
||||
Starting with v4.31.1, pulse-sensor-proxy uses a two-file configuration system:
|
||||
|
||||
1. **Main config:** `/etc/pulse-sensor-proxy/config.yaml` - Contains all settings except allowed nodes
|
||||
2. **Allowed nodes:** `/etc/pulse-sensor-proxy/allowed_nodes.yaml` - Separate file for the authorized node list
|
||||
|
||||
This separation prevents corruption from concurrent updates by the installer, control-plane sync, and self-heal timer.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Why Two Files?
|
||||
|
||||
Earlier versions stored `allowed_nodes:` inline in `config.yaml`, causing corruption when:
|
||||
- The installer updated node lists
|
||||
- The self-heal timer ran (every 5 minutes)
|
||||
- Control-plane sync modified the list
|
||||
- Version detection had edge cases
|
||||
|
||||
Multiple code paths (shell, Python, Go) would race to update the same YAML file, creating duplicate `allowed_nodes:` keys that broke YAML parsing.
|
||||
|
||||
### New System (v4.31.1+)
|
||||
|
||||
**Phase 1 (Migration):**
|
||||
- Force file-based mode exclusively
|
||||
- Installer migrates inline blocks to `allowed_nodes.yaml`
|
||||
- Self-heal timer includes corruption detection and repair
|
||||
|
||||
**Phase 2 (Atomic Operations):**
|
||||
- Go CLI replaces all shell/Python config manipulation
|
||||
- File locking prevents concurrent writes
|
||||
- Atomic writes (temp file + rename) ensure consistency
|
||||
- systemd validation prevents startup with corrupt config
|
||||
|
||||
## Configuration CLI Reference
|
||||
|
||||
### Validate Configuration
|
||||
|
||||
Check config files for errors before restarting the service:
|
||||
|
||||
```bash
|
||||
# Validate both config.yaml and allowed_nodes.yaml
|
||||
pulse-sensor-proxy config validate
|
||||
|
||||
# Validate specific config file
|
||||
pulse-sensor-proxy config validate --config /path/to/config.yaml
|
||||
|
||||
# Validate specific allowed_nodes file
|
||||
pulse-sensor-proxy config validate --allowed-nodes /path/to/allowed_nodes.yaml
|
||||
```
|
||||
|
||||
**Exit codes:**
|
||||
- 0 = valid
|
||||
- Non-zero = validation failed (check stderr for details)
|
||||
|
||||
**Common validation errors:**
|
||||
- "duplicate allowed_nodes blocks" - Run migration (see below)
|
||||
- "failed to parse YAML" - Syntax error in config file
|
||||
- "read_timeout must be positive" - Invalid timeout value
|
||||
|
||||
### Manage Allowed Nodes
|
||||
|
||||
The CLI provides two modes:
|
||||
|
||||
**Merge mode (default):** Adds nodes to existing list
|
||||
```bash
|
||||
# Add single node
|
||||
pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10
|
||||
|
||||
# Add multiple nodes
|
||||
pulse-sensor-proxy config set-allowed-nodes \
|
||||
--merge 192.168.0.1 \
|
||||
--merge 192.168.0.2 \
|
||||
--merge node1.local
|
||||
```
|
||||
|
||||
**Replace mode:** Overwrites entire list
|
||||
```bash
|
||||
# Replace with new list
|
||||
pulse-sensor-proxy config set-allowed-nodes --replace \
|
||||
--merge 192.168.0.1 \
|
||||
--merge 192.168.0.2
|
||||
|
||||
# Clear the list (empty is valid for IPC-only clusters)
|
||||
pulse-sensor-proxy config set-allowed-nodes --replace
|
||||
```
|
||||
|
||||
**Custom paths:**
|
||||
```bash
|
||||
# Use non-default path
|
||||
pulse-sensor-proxy config set-allowed-nodes \
|
||||
--allowed-nodes /custom/path.yaml \
|
||||
--merge 192.168.0.10
|
||||
```
|
||||
|
||||
### How It Works
|
||||
|
||||
1. **File locking:** Uses `flock(LOCK_EX)` on separate `.lock` file
|
||||
2. **Atomic writes:** Writes to temp file, syncs, then renames
|
||||
3. **Deduplication:** Automatically removes duplicate entries
|
||||
4. **Normalization:** Trims whitespace, sorts entries
|
||||
5. **Empty lists allowed:** Useful for security lockdown or IPC-based discovery
|
||||
|
||||
## Common Tasks
|
||||
|
||||
### Adding Nodes After Cluster Expansion
|
||||
|
||||
When you add a new node to your Proxmox cluster:
|
||||
|
||||
```bash
|
||||
# Add the new node to allowed list
|
||||
pulse-sensor-proxy config set-allowed-nodes --merge new-node.local
|
||||
|
||||
# Validate config
|
||||
pulse-sensor-proxy config validate
|
||||
|
||||
# Restart proxy to apply
|
||||
sudo systemctl restart pulse-sensor-proxy
|
||||
|
||||
# Verify in Pulse UI
|
||||
# Check Settings → Diagnostics → Temperature Proxy
|
||||
```
|
||||
|
||||
### Removing Decommissioned Nodes
|
||||
|
||||
When removing a node from your cluster:
|
||||
|
||||
```bash
|
||||
# Get current list
|
||||
cat /etc/pulse-sensor-proxy/allowed_nodes.yaml
|
||||
|
||||
# Replace with updated list (without old node)
|
||||
pulse-sensor-proxy config set-allowed-nodes --replace \
|
||||
--merge 192.168.0.1 \
|
||||
--merge 192.168.0.2
|
||||
# (omit the decommissioned node)
|
||||
|
||||
# Validate and restart
|
||||
pulse-sensor-proxy config validate
|
||||
sudo systemctl restart pulse-sensor-proxy
|
||||
```
|
||||
|
||||
**Note:** The proxy cleanup system automatically removes SSH keys from deleted nodes. See temperature monitoring docs for details.
|
||||
|
||||
### Migrating from Inline Config
|
||||
|
||||
If you're running an older version with inline `allowed_nodes:` in config.yaml:
|
||||
|
||||
```bash
|
||||
# Upgrade to latest version (auto-migrates)
|
||||
curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \
|
||||
sudo bash -s -- --standalone --pulse-server http://your-pulse:7655
|
||||
|
||||
# Verify migration
|
||||
pulse-sensor-proxy config validate
|
||||
|
||||
# Check that allowed_nodes only appears in allowed_nodes.yaml
|
||||
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml
|
||||
# Should show: allowed_nodes.yaml:3:allowed_nodes:
|
||||
# Should NOT show duplicate entries in config.yaml
|
||||
```
|
||||
|
||||
### Changing Other Config Settings
|
||||
|
||||
For settings in `config.yaml` (not allowed_nodes):
|
||||
|
||||
```bash
|
||||
# Stop the service first
|
||||
sudo systemctl stop pulse-sensor-proxy
|
||||
|
||||
# Edit config.yaml manually
|
||||
sudo nano /etc/pulse-sensor-proxy/config.yaml
|
||||
|
||||
# Validate before starting
|
||||
pulse-sensor-proxy config validate
|
||||
|
||||
# Start service
|
||||
sudo systemctl start pulse-sensor-proxy
|
||||
|
||||
# Check for errors
|
||||
sudo systemctl status pulse-sensor-proxy
|
||||
journalctl -u pulse-sensor-proxy -n 50
|
||||
```
|
||||
|
||||
**Safe to edit in config.yaml:**
|
||||
- `allowed_source_subnets`
|
||||
- `allowed_peers` (UID/GID permissions)
|
||||
- `rate_limit` settings
|
||||
- `metrics_address`
|
||||
- `http_*` settings (HTTPS mode)
|
||||
- `pulse_control_plane` block
|
||||
|
||||
**Never edit manually:**
|
||||
- `allowed_nodes:` (manage it with the CLI; the list itself lives in `allowed_nodes.yaml`)
|
||||
- Lock files (`.lock`)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Config Validation Fails
|
||||
|
||||
**Symptom:** `pulse-sensor-proxy config validate` returns error
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Run validation with full output
|
||||
pulse-sensor-proxy config validate 2>&1
|
||||
|
||||
# Check for duplicate blocks
|
||||
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml
|
||||
|
||||
# Check YAML syntax
|
||||
python3 -c "import yaml; yaml.safe_load(open('/etc/pulse-sensor-proxy/config.yaml'))"
|
||||
```
|
||||
|
||||
**Common fixes:**
|
||||
- Duplicate blocks: Run migration (upgrade to v4.31.1+)
|
||||
- YAML syntax errors: Fix indentation, remove tabs, check colons
|
||||
- Missing required fields: Add `read_timeout`, `write_timeout`
|
||||
|
||||
### Service Won't Start After Config Change
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check systemd logs
|
||||
journalctl -u pulse-sensor-proxy -n 100
|
||||
|
||||
# Look for validation errors
|
||||
journalctl -u pulse-sensor-proxy | grep -i "validation\|corrupt\|duplicate"
|
||||
|
||||
# Try starting in foreground for better errors
|
||||
sudo -u pulse-sensor-proxy /opt/pulse/sensor-proxy/bin/pulse-sensor-proxy # legacy installs: /usr/local/bin/pulse-sensor-proxy
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
# Validate config first
|
||||
pulse-sensor-proxy config validate
|
||||
|
||||
# If validation passes but service fails, check permissions
|
||||
ls -la /etc/pulse-sensor-proxy/
|
||||
ls -la /var/lib/pulse-sensor-proxy/
|
||||
|
||||
# Ensure proxy user owns files
|
||||
sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/
|
||||
sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /var/lib/pulse-sensor-proxy/
|
||||
```
|
||||
|
||||
### Lock File Errors
|
||||
|
||||
**Symptom:** `failed to acquire file lock` or `failed to open lock file`
|
||||
|
||||
**Cause:** Lock file has wrong permissions or process holds stale lock
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
# Check lock file permissions (should be 0600)
|
||||
ls -la /etc/pulse-sensor-proxy/*.lock
|
||||
|
||||
# Fix permissions
|
||||
sudo chmod 0600 /etc/pulse-sensor-proxy/*.lock
|
||||
sudo chown pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/*.lock
|
||||
|
||||
# If stale lock, identify holder
|
||||
sudo lsof /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock
|
||||
|
||||
# Kill stale process if needed (use with caution)
|
||||
sudo kill <PID>
|
||||
```
|
||||
|
||||
**Prevention:** Locks are automatically released when process exits. Don't manually delete lock files.
|
||||
|
||||
### Allowed Nodes List is Empty
|
||||
|
||||
**Symptom:** allowed_nodes.yaml exists but has no entries
|
||||
|
||||
**Is this a problem?** Not necessarily:
|
||||
- Empty list is valid for clusters using IPC discovery (pvecm status)
|
||||
- Control-plane mode populates the list automatically
|
||||
- Standalone nodes require manual node entries
|
||||
|
||||
**To populate manually:**
|
||||
```bash
|
||||
# Add your cluster nodes
|
||||
pulse-sensor-proxy config set-allowed-nodes --replace \
|
||||
--merge 192.168.0.1 \
|
||||
--merge 192.168.0.2 \
|
||||
--merge 192.168.0.3
|
||||
|
||||
# Verify
|
||||
cat /etc/pulse-sensor-proxy/allowed_nodes.yaml
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### General Guidelines
|
||||
|
||||
1. **Always validate before restarting:**
|
||||
```bash
|
||||
pulse-sensor-proxy config validate && sudo systemctl restart pulse-sensor-proxy
|
||||
```
|
||||
|
||||
2. **Use the CLI for allowed_nodes changes:**
|
||||
- Don't edit `allowed_nodes.yaml` manually
|
||||
- Use `config set-allowed-nodes` instead
|
||||
|
||||
3. **Stop service before editing config.yaml:**
|
||||
- Prevents race conditions with running process
|
||||
- systemd validation will catch errors on startup
|
||||
|
||||
4. **Back up config before major changes:**
|
||||
```bash
|
||||
sudo cp /etc/pulse-sensor-proxy/config.yaml /etc/pulse-sensor-proxy/config.yaml.backup
|
||||
sudo cp /etc/pulse-sensor-proxy/allowed_nodes.yaml /etc/pulse-sensor-proxy/allowed_nodes.yaml.backup
|
||||
```
|
||||
|
||||
5. **Monitor after changes:**
|
||||
```bash
|
||||
journalctl -u pulse-sensor-proxy -f
|
||||
# Check Pulse UI: Settings → Diagnostics → Temperature Proxy
|
||||
```
|
||||
|
||||
### Automation Scripts
|
||||
|
||||
When scripting config changes:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
set -euo pipefail
|
||||
|
||||
# Function to safely update allowed nodes
|
||||
update_allowed_nodes() {
|
||||
local nodes=("$@")
|
||||
|
||||
# Build command
|
||||
local cmd="pulse-sensor-proxy config set-allowed-nodes --replace"
|
||||
for node in "${nodes[@]}"; do
|
||||
cmd="$cmd --merge $node"
|
||||
done
|
||||
|
||||
# Execute with validation
|
||||
if eval "$cmd"; then
|
||||
echo "Allowed nodes updated successfully"
|
||||
else
|
||||
echo "Failed to update allowed nodes" >&2
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Validate
|
||||
if ! pulse-sensor-proxy config validate; then
|
||||
echo "Config validation failed after update" >&2
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Restart service
|
||||
if sudo systemctl restart pulse-sensor-proxy; then
|
||||
echo "Service restarted successfully"
|
||||
else
|
||||
echo "Service restart failed" >&2
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Wait for service to be active
|
||||
sleep 2
|
||||
if systemctl is-active --quiet pulse-sensor-proxy; then
|
||||
echo "Service is running"
|
||||
else
|
||||
echo "Service failed to start" >&2
|
||||
journalctl -u pulse-sensor-proxy -n 20
|
||||
return 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Example usage
|
||||
update_allowed_nodes "192.168.0.1" "192.168.0.2" "node3.local"
|
||||
```
|
||||
|
||||
### Monitoring Config Health
|
||||
|
||||
Add to your monitoring system:
|
||||
|
||||
```bash
|
||||
# Check for config corruption (should return 0)
|
||||
pulse-sensor-proxy config validate
|
||||
echo $?
|
||||
|
||||
# Check for duplicate blocks (should be empty)
|
||||
grep "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml | wc -l
|
||||
|
||||
# Check lock file permissions (should be 0600)
|
||||
stat -c "%a" /etc/pulse-sensor-proxy/*.lock
|
||||
|
||||
# Check service is running
|
||||
systemctl is-active pulse-sensor-proxy
|
||||
```
|
||||
|
||||
## Migration Path
|
||||
|
||||
### Upgrading from Pre-v4.31.1
|
||||
|
||||
**Automatic migration** (recommended):
|
||||
```bash
|
||||
# Simply reinstall - migration runs automatically
|
||||
curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \
|
||||
sudo bash -s -- --standalone --pulse-server http://your-pulse:7655
|
||||
|
||||
# Verify
|
||||
pulse-sensor-proxy config validate
|
||||
sudo systemctl status pulse-sensor-proxy
|
||||
```
|
||||
|
||||
**Manual migration** (if needed):
```bash
# 1. Stop service
sudo systemctl stop pulse-sensor-proxy

# 2. Extract allowed_nodes from config.yaml
grep -A 100 "^allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml > /tmp/nodes.txt

# 3. Parse and add to allowed_nodes.yaml
# (Example for simple list - adjust for your format)
pulse-sensor-proxy config set-allowed-nodes --replace \
  --merge node1.local \
  --merge node2.local

# 4. Remove allowed_nodes from config.yaml
# Edit manually or use sed (this variant deletes the block but keeps the next top-level key):
sudo sed -i '/^allowed_nodes:/,/^[a-z_]/{/^allowed_nodes:/d; /^[a-z_]/!d}' /etc/pulse-sensor-proxy/config.yaml

# 5. Add reference to allowed_nodes.yaml
echo "allowed_nodes_file: /etc/pulse-sensor-proxy/allowed_nodes.yaml" | \
  sudo tee -a /etc/pulse-sensor-proxy/config.yaml

# 6. Validate
pulse-sensor-proxy config validate

# 7. Start service
sudo systemctl start pulse-sensor-proxy
```

## Related Documentation

- [Temperature Monitoring](../TEMPERATURE_MONITORING.md) - Setup and troubleshooting
- [Sensor Proxy README](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Complete CLI reference
- [Audit Log Rotation](audit-log-rotation.md) - Managing append-only logs
- [Temperature Monitoring Security](../TEMPERATURE_MONITORING_SECURITY.md) - Security architecture

## Support

If config management issues persist after following this guide:

1. Collect diagnostics:
   ```bash
   pulse-sensor-proxy config validate > /tmp/validate.log 2>&1
   sudo systemctl status pulse-sensor-proxy > /tmp/status.log
   journalctl -u pulse-sensor-proxy -n 200 > /tmp/journal.log
   grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml > /tmp/grep.log
   ```

2. File an issue at https://github.com/rcourtman/Pulse/issues

3. Include:
   - Pulse version
   - Sensor proxy version (`pulse-sensor-proxy --version`)
   - Output from diagnostic commands above
   - Steps that led to the issue

@@ -1,73 +0,0 @@

# Sensor Proxy Log Forwarding

Forward `pulse-sensor-proxy` logs to a central syslog/SIEM endpoint so audit
records survive host loss and can drive alerting. Pulse ships a helper script
(`scripts/setup-log-forwarding.sh`) that configures rsyslog to ship both
`audit.log` and `proxy.log` over RELP + TLS.

## Requirements

- Debian/Ubuntu host with **rsyslog** and the `imfile` + `omrelp` modules (present
  by default).
- Root privileges to install certificates and restart rsyslog.
- TLS assets for the RELP connection:
  - `ca.crt` – CA that issued the remote collector certificate.
  - `client.crt` / `client.key` – mTLS credentials for this host.
- Network access to the remote collector (`REMOTE_HOST`, default `logs.pulse.example`,
  port `6514`).

## Installation Steps

1. Copy your CA and client certificates into a safe directory on the host (the
   script defaults to `/etc/pulse/log-forwarding`).
2. Run the helper with environment overrides for your collector:
   ```bash
   sudo REMOTE_HOST=logs.company.tld \
        REMOTE_PORT=6514 \
        CERT_DIR=/etc/pulse/log-forwarding \
        CA_CERT=/etc/pulse/log-forwarding/ca.crt \
        CLIENT_CERT=/etc/pulse/log-forwarding/pulse.crt \
        CLIENT_KEY=/etc/pulse/log-forwarding/pulse.key \
        /opt/pulse/scripts/setup-log-forwarding.sh
   ```
   The script writes `/etc/rsyslog.d/pulse-sensor-proxy.conf`, ensures the
   certificate directory exists (`0750`), and restarts rsyslog.

## What the Script Configures

- Two `imfile` inputs that watch `/var/log/pulse/sensor-proxy/audit.log` and
  `/var/log/pulse/sensor-proxy/proxy.log` with `Tag`s `pulse.audit` and
  `pulse.app`.
- A local mirror file at `/var/log/pulse/sensor-proxy/forwarding.log` so you can
  inspect rsyslog activity.
- An RELP action with TLS, infinite retry (`action.resumeRetryCount=-1`), and a
  50k message disk-backed queue to absorb collector outages (a sketch of the
  resulting drop-in follows below).

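For orientation, the drop-in the helper generates looks roughly like this (an illustrative sketch only; tags, paths, and queue sizing follow the bullets above, but the script's actual output and exact parameter spelling are authoritative):

```
# /etc/rsyslog.d/pulse-sensor-proxy.conf (sketch, not the script's literal output)
module(load="imfile")
module(load="omrelp")

ruleset(name="pulseForward") {
    # Local mirror for inspection
    action(type="omfile" file="/var/log/pulse/sensor-proxy/forwarding.log")

    # RELP + TLS forwarding with a disk-assisted queue
    action(type="omrelp" target="logs.pulse.example" port="6514" tls="on"
           tls.caCert="/etc/pulse/log-forwarding/ca.crt"
           tls.myCert="/etc/pulse/log-forwarding/client.crt"
           tls.myPrivKey="/etc/pulse/log-forwarding/client.key"
           action.resumeRetryCount="-1"
           queue.type="LinkedList" queue.filename="pulse_sensor_proxy_fwd"
           queue.size="50000" queue.saveOnShutdown="on")
}

input(type="imfile" File="/var/log/pulse/sensor-proxy/audit.log" Tag="pulse.audit" ruleset="pulseForward")
input(type="imfile" File="/var/log/pulse/sensor-proxy/proxy.log" Tag="pulse.app" ruleset="pulseForward")
```
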
## Verification Checklist

1. Confirm rsyslog picked up the new config:
   ```bash
   sudo rsyslogd -N1
   sudo systemctl status rsyslog --no-pager
   ```
2. Tail the local mirror to ensure entries stream through:
   ```bash
   sudo tail -f /var/log/pulse/sensor-proxy/forwarding.log
   ```
3. On the collector side, filter for the `pulse.audit` tag and make sure new
   entries arrive. For Splunk/ELK, index on `programname`.
4. Simulate a test event (e.g., restart `pulse-sensor-proxy` or deny a fake peer)
   and verify it appears remotely.

## Maintenance

- **Certificate rotation**: Replace the key/cert files, then restart rsyslog.
  Because the config points at static paths, no additional edits are required.
- **Disable forwarding**: Remove `/etc/rsyslog.d/pulse-sensor-proxy.conf` and run
  `sudo systemctl restart rsyslog`. The local audit log remains untouched.
- **Queue monitoring**: Track rsyslog’s main log or use `rsyslogd -N6` to check
  for queue overflows. At scale, scrape `/var/log/pulse/sensor-proxy/forwarding.log`
  for `action resumed` messages (see the sketch below).

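A crude way to spot collector outages from the mirror (a sketch; rsyslog's exact suspend/resume wording can vary by version):

```bash
grep -E "suspended|resumed" /var/log/pulse/sensor-proxy/forwarding.log | tail -n 20
```
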
For rotation guidance on the underlying audit file, see
[operations/audit-log-rotation.md](audit-log-rotation.md).

docs/security/SENSOR_PROXY_APPARMOR.md (new file, 39 lines)
@@ -0,0 +1,39 @@

# 🛡️ Sensor Proxy AppArmor & Seccomp Hardening

Secure `pulse-sensor-proxy` with AppArmor and Seccomp.

## 🛡️ AppArmor

Profile: `security/apparmor/pulse-sensor-proxy.apparmor`
* **Allows**: Configs, logs, SSH keys, outbound TCP/SSH.
* **Blocks**: Raw sockets, module loading, ptrace, exec outside allowlist.

### Install & Enforce
```bash
sudo install -m 0644 security/apparmor/pulse-sensor-proxy.apparmor /etc/apparmor.d/pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy
sudo aa-enforce pulse-sensor-proxy
```

## 🔒 Seccomp

Profile: `security/seccomp/pulse-sensor-proxy.json`
* **Allows**: Go runtime syscalls, network, file IO.
* **Blocks**: Everything else (returns `EPERM`).

### Systemd (Classic)
Add to service override:
```ini
[Service]
AppArmorProfile=pulse-sensor-proxy
SystemCallFilter=@system-service
# Extra allow-listed syscalls merge into the filter above (systemd has no SystemCallAllow= directive)
SystemCallFilter=accept connect recvfrom sendto recvmsg sendmsg sendmmsg getsockname getpeername getsockopt setsockopt shutdown
```

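One way to apply and confirm the override (a sketch; assumes the standard unit name used throughout these docs):

```bash
sudo systemctl edit pulse-sensor-proxy      # opens a drop-in; paste the [Service] keys above
sudo systemctl daemon-reload
sudo systemctl restart pulse-sensor-proxy
systemctl show pulse-sensor-proxy -p AppArmorProfile -p SystemCallFilter
```
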
### Containers (Docker/Podman)
```bash
podman run --security-opt seccomp=/opt/pulse/security/seccomp/pulse-sensor-proxy.json ...
```

## 🔍 Verification
Check status with `aa-status` or `journalctl -t auditbeat`.
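For example (a sketch; `--grep` needs a journalctl built with pattern-matching support):

```bash
sudo aa-status | grep pulse-sensor-proxy                               # should be listed in enforce mode
sudo journalctl -k -g 'apparmor=.*DENIED.*pulse-sensor-proxy' -n 20    # recent denials, if any
```
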
docs/security/SENSOR_PROXY_HARDENING.md (new file, 57 lines)
@@ -0,0 +1,57 @@

# 🛡️ Sensor Proxy Hardening

The `pulse-sensor-proxy` runs on the host to securely collect temperatures, keeping SSH keys out of containers.

## 🏗️ Architecture
* **Host**: Runs `pulse-sensor-proxy` (unprivileged user).
* **Container**: Connects via Unix socket (`/run/pulse-sensor-proxy/pulse-sensor-proxy.sock`).
* **Auth**: Uses `SO_PEERCRED` to verify container UID/PID.

## 🔒 Host Hardening

### Service Account
Runs as `pulse-sensor-proxy` (no shell, no home).
```bash
id pulse-sensor-proxy # uid=XXX(pulse-sensor-proxy)
```

### Systemd Security
The service unit uses:
* `User=pulse-sensor-proxy`
* `NoNewPrivileges=true`
* `ProtectSystem=strict`
* `PrivateTmp=true`

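To confirm the running unit actually carries these settings (a sketch; the property names mirror the directives above):

```bash
systemctl show pulse-sensor-proxy -p User -p NoNewPrivileges -p ProtectSystem -p PrivateTmp
```
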
### File Permissions
| Path | Owner | Mode |
| :--- | :--- | :--- |
| `/var/lib/pulse-sensor-proxy/` | `pulse-sensor-proxy` | `0750` |
| `/var/lib/pulse-sensor-proxy/ssh/` | `pulse-sensor-proxy` | `0700` |
| `/run/pulse-sensor-proxy/` | `pulse-sensor-proxy` | `0775` |

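A quick check against the table (sketch):

```bash
stat -c "%U %a %n" /var/lib/pulse-sensor-proxy /var/lib/pulse-sensor-proxy/ssh /run/pulse-sensor-proxy
```
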
## 📦 LXC Configuration
Required for the container to access the proxy socket.

**`/etc/pve/lxc/<VMID>.conf`**:
```ini
unprivileged: 1
lxc.apparmor.profile: generated
lxc.mount.entry: /run/pulse-sensor-proxy mnt/pulse-proxy none bind,create=dir 0 0
```

## 🔑 Key Management
SSH keys are restricted to `sensors -j` only.

**Rotation**:
```bash
/opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh
```
* **Dry Run**: Add `--dry-run`.
* **Rollback**: Add `--rollback`.

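For example (a sketch; run with sudo unless you invoke it as root):

```bash
sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --dry-run    # preview which node keys would change
sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh              # rotate for real
sudo /opt/pulse/scripts/pulse-sensor-proxy-rotate-keys.sh --rollback   # restore the previous keys if needed
```
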
## 🚨 Incident Response
If compromised:
1. **Stop Proxy**: `systemctl stop pulse-sensor-proxy`.
2. **Rotate Keys**: Remove old keys from nodes manually or use `pulse-sensor-proxy-rotate-keys.sh`.
3. **Audit Logs**: Check `journalctl -u pulse-sensor-proxy`.
4. **Reinstall**: Run `/opt/pulse/scripts/install-sensor-proxy.sh`.

docs/security/SENSOR_PROXY_NETWORK.md (new file, 35 lines)
@@ -0,0 +1,35 @@

# 🌐 Sensor Proxy Network Segmentation

Isolate the proxy to prevent lateral movement.

## 🚧 Zones
* **Pulse App**: Connects to Proxy via Unix socket (local).
* **Sensor Proxy**: Outbound SSH to Proxmox nodes only.
* **Proxmox Nodes**: Accept SSH from Proxy.
* **Logging**: Accepts RELP/TLS from Proxy.

## 🛡️ Firewall Rules

| Source | Dest | Port | Purpose | Action |
| :--- | :--- | :--- | :--- | :--- |
| **Pulse App** | Proxy | `unix` | RPC Requests | **Allow** (Local) |
| **Proxy** | Nodes | `22` | SSH (sensors) | **Allow** |
| **Proxy** | Logs | `6514` | Audit Logs | **Allow** |
| **Any** | Proxy | `22` | SSH Access | **Deny** (Use Bastion) |
| **Proxy** | Internet | `any` | Outbound | **Deny** |

## 🔧 Implementation (iptables)
```bash
# Allow loopback and established/related return traffic before dropping the rest
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow SSH to Proxmox
iptables -A OUTPUT -p tcp -d <PROXMOX_SUBNET> --dport 22 -j ACCEPT

# Allow Log Forwarding
iptables -A OUTPUT -p tcp -d <LOG_HOST> --dport 6514 -j ACCEPT

# Drop all other outbound
iptables -P OUTPUT DROP
```

## 🚨 Monitoring
* Alert on outbound connections to non-whitelisted IPs.
* Monitor `pulse_proxy_limiter_rejects_total` for abuse.

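The limiter counters come from the proxy's Prometheus endpoint; a quick local check (a sketch, assuming the default scrape port `9127` from the original network guide and a locally reachable listener):

```bash
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_limiter
```
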
docs/security/TEMPERATURE_MONITORING.md (new file, 31 lines)
@@ -0,0 +1,31 @@

# 🌡️ Temperature Monitoring Security

Secure architecture for collecting hardware temperatures.

## 🛡️ Security Model
* **Isolation**: SSH keys live on the host, not in the container.
* **Least Privilege**: Proxy runs as `pulse-sensor-proxy` (no shell).
* **Verification**: Container identity verified via `SO_PEERCRED`.

## 🏗️ Components
1. **Pulse Backend**: Connects to Unix socket `/mnt/pulse-proxy/pulse-sensor-proxy.sock`.
2. **Sensor Proxy**: Validates request, executes SSH to node.
3. **Target Node**: Accepts SSH key restricted to `sensors -j`.

## 🔒 Key Restrictions
SSH keys deployed to nodes are locked down:
```
command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty
```

## 🚦 Rate Limiting
* **Per Peer**: ~12 req/min.
* **Concurrency**: Max 2 parallel requests per peer.
* **Global**: Max 8 concurrent requests.

## 📝 Auditing
All requests logged to system journal:
```bash
journalctl -u pulse-sensor-proxy
```
Logs include: `uid`, `pid`, `method`, `node`, `correlation_id`.

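To narrow the journal down during an investigation (a sketch; `pve1` is a hypothetical node name and the exact field formatting follows the list above):

```bash
journalctl -u pulse-sensor-proxy --since "1 hour ago" | grep pve1    # entries touching one node
journalctl -u pulse-sensor-proxy -o json-pretty -n 5                 # full structured fields
```
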
@@ -1,52 +0,0 @@

# Pulse Sensor Proxy AppArmor & Seccomp Hardening

## AppArmor Profile
- Profile path: `security/apparmor/pulse-sensor-proxy.apparmor`
- Grants read-only access to configs, logs, SSH keys, and binaries; allows outbound TCP/SSH; blocks raw sockets, module loading, ptrace, and absolute command execution outside the allowlist.

### Installation
```bash
sudo install -m 0644 security/apparmor/pulse-sensor-proxy.apparmor /etc/apparmor.d/pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy
sudo ln -sf /etc/apparmor.d/pulse-sensor-proxy /etc/apparmor.d/force-complain/pulse-sensor-proxy # optional staged mode
sudo systemctl restart apparmor
```

### Enforce Mode
```bash
sudo aa-enforce pulse-sensor-proxy
```
Monitor `/var/log/syslog` for `DENIED` events and update the profile as needed.

## Seccomp Filter
- OCI-style profile: `security/seccomp/pulse-sensor-proxy.json`
- Allows standard Go runtime syscalls, network operations, file IO, and `execve` for whitelisted helpers; other syscalls return `EPERM`.

### Apply via systemd (classic service)
Add to the override:
```ini
[Service]
AppArmorProfile=pulse-sensor-proxy
RestrictNamespaces=yes
NoNewPrivileges=yes
SystemCallFilter=@system-service
SystemCallArchitectures=native
# Extra allow-listed syscalls merge into the filter above (systemd has no SystemCallAllow= directive)
SystemCallFilter=accept connect recvfrom sendto recvmsg sendmsg sendmmsg getsockname getpeername getsockopt setsockopt shutdown
```

Reload and restart:
```bash
sudo systemctl daemon-reload
sudo systemctl restart pulse-sensor-proxy
```

### Apply seccomp JSON (containerised deployments)
- Profile: `security/seccomp/pulse-sensor-proxy.json`
- Use with Podman/Docker style runtimes:
```bash
podman run --seccomp-profile /opt/pulse/security/seccomp/pulse-sensor-proxy.json ...
```

## Operational Notes
- Use `journalctl -t auditbeat -g pulse-sensor-proxy` or `aa-status` to confirm profile status.
- Pair with network ACLs (see `docs/security/pulse-sensor-proxy-network.md`) and log shipping via [`scripts/setup-log-forwarding.sh` + the RELP runbook](../operations/sensor-proxy-log-forwarding.md).

@@ -1,64 +0,0 @@

# Pulse Sensor Proxy Network Segmentation

## Overview
- **Proxy host** collects temperatures via SSH from Proxmox nodes and serves a Unix socket to the Pulse stack.
- Goals: isolate the proxy from production hypervisors, prevent lateral movement, and ensure log forwarding/audit channels remain available.

## Zones & Connectivity
- **Pulse Application Zone (AZ-Pulse)**
  - Hosts Pulse backend/frontend containers.
  - Allowed to reach the proxy over Unix socket (local) or loopback if containerised via `socat`.
- **Sensor Proxy Zone (AZ-Sensor)**
  - Dedicated VM/bare-metal host running `pulse-sensor-proxy`.
  - Maintains outbound SSH to Proxmox management interfaces only.
- **Proxmox Management Zone (AZ-Proxmox)**
  - Hypervisors / BMCs reachable on `tcp/22` (SSH) and optional IPMI UDP.
- **Logging/Monitoring Zone (AZ-Logging)**
  - Receives forwarded audit/application logs (e.g. RELP/TLS on `tcp/6514`).
  - Exposes Prometheus scrape port (default `tcp/9127`) if remote monitoring required.

## Recommended Firewall Rules

| Source Zone | Destination Zone | Protocol/Port | Purpose | Action |
|-------------|------------------|---------------|---------|--------|
| AZ-Pulse (localhost) | AZ-Sensor (Unix socket) | `unix` | RPC requests from Pulse | Allow (local only) |
| AZ-Sensor | AZ-Proxmox nodes | `tcp/22` | SSH for sensors/ipmitool wrapper | Allow (restricted to node list) |
| AZ-Sensor | AZ-Proxmox BMC | `udp/623` *(optional)* | IPMI if required for temperature data | Allow if needed |
| AZ-Proxmox | AZ-Sensor | `any` | Return SSH traffic | Allow stateful |
| AZ-Sensor | AZ-Logging | `tcp/6514` (TLS RELP) | Audit/application log forwarding | Allow |
| AZ-Logging | AZ-Sensor | `tcp/9127` *(optional)* | Prometheus scrape of proxy metrics | Allow if scraping remotely |
| Any | AZ-Sensor | `tcp/22` | Shell/SSH access | Deny (use management bastion) |
| AZ-Sensor | Internet | `any` | Outbound Internet | Deny (except package mirrors via proxy if required) |

## Implementation Steps
1. Place proxy host in dedicated subnet/VLAN with ACLs enforcing the table above.
2. Populate `/etc/hosts` or routing so proxy resolves Proxmox nodes to management IPs only (no public networks).
3. Configure iptables/nftables on proxy:
   ```bash
   # Allow SSH to Proxmox nodes
   iptables -A OUTPUT -p tcp -d <PROXMOX_SUBNET>/24 --dport 22 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
   iptables -A INPUT -p tcp -s <PROXMOX_SUBNET>/24 --sport 22 -m conntrack --ctstate ESTABLISHED -j ACCEPT

   # Allow log forwarding
   iptables -A OUTPUT -p tcp -d <LOG_HOST> --dport 6514 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
   iptables -A INPUT -p tcp -s <LOG_HOST> --sport 6514 -m conntrack --ctstate ESTABLISHED -j ACCEPT

   # (Optional) allow Prometheus scrape
   iptables -A INPUT -p tcp -s <SCRAPE_HOST> --dport 9127 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
   iptables -A OUTPUT -p tcp -d <SCRAPE_HOST> --sport 9127 -m conntrack --ctstate ESTABLISHED -j ACCEPT

   # Drop everything else
   iptables -P OUTPUT DROP
   iptables -P INPUT DROP
   ```
4. Deny inbound SSH to proxy except via management bastion: block `tcp/22` or whitelist bastion IPs.
5. Ensure log-forwarding TLS certificates are rotated and stored under `/etc/pulse/log-forwarding`.

## Monitoring & Alerting
- Alert if proxy initiates connections outside permitted subnets (Netflow or host firewall counters).
- Monitor `pulse_proxy_limiter_*` metrics for unusual rate-limit hits that might signal abuse.
- Track `audit_log` forwarding queue depth and remote availability; on failure, emit alert via rsyslog action queue (set `action.resumeRetryCount=-1` already).

## Change Management
- Document node IP changes and update firewall objects (`PROXMOX_NODES`) before redeploying certificates.
- Capture segmentation in infrastructure-as-code (e.g. Terraform/security group definitions) to avoid drift.