mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
docs: comprehensive documentation for rate limit fix and configurability
Document the pulse-sensor-proxy rate limiting bug fix and new configurability across all relevant documentation:

TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count

TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example

CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override

pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands

TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments

Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
@@ -284,6 +284,27 @@ PROXY_AUTH_LOGOUT_URL=/logout # URL for SSO logout

> Tip: Back up `alerts.json` alongside `.env` during exports. Restoring it preserves all overrides, quiet-hour schedules, and webhook routing.

### `pulse-sensor-proxy/config.yaml`

The sensor proxy reads `/etc/pulse-sensor-proxy/config.yaml` (or the path supplied via `PULSE_SENSOR_PROXY_CONFIG`). Key fields:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `allowed_source_subnets` | list(string) | auto-detected host CIDRs | Restrict which networks can reach the UNIX socket listener. |
| `allowed_peer_uids` / `allowed_peer_gids` | list(uint32) | empty | Required when Pulse runs in a container; use mapped UID/GID. |
| `allow_idmapped_root` | bool | `true` | Governs acceptance of ID-mapped root callers. |
| `allowed_idmap_users` | list(string) | `["root"]` | Restricts which ID-mapped usernames are accepted. |
| `metrics_address` | string | `default` (maps to `127.0.0.1:9127`) | Set to `"disabled"` to turn metrics off. |
| `rate_limit.per_peer_interval_ms` | int | `1000` | Milliseconds between allowed RPCs per UID. Set `>=100` in production. |
| `rate_limit.per_peer_burst` | int | `5` | Number of requests allowed in a burst; should meet or exceed node count. |

Example (also shipped as `cmd/pulse-sensor-proxy/config.example.yaml`):

```yaml
rate_limit:
  per_peer_interval_ms: 500 # 2 rps
  per_peer_burst: 10 # allow 10-node sweep
```
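Before editing the file, the two notes in the table (interval of at least 100 ms, burst at least the node count) can be checked with a small script. The sketch below is illustrative only: `check_rate_limit` is a hypothetical helper, not something shipped with pulse-sensor-proxy, and it takes the proposed values as arguments rather than parsing the YAML. Calling `check_rate_limit 500 10 10` prints `ok`; a failing check prints the reason and returns non-zero.

```bash
# Hypothetical helper (not part of pulse-sensor-proxy): sanity-check proposed
# rate_limit values against the notes in the field table.
# Usage: check_rate_limit <per_peer_interval_ms> <per_peer_burst> <node_count>
check_rate_limit() {
  local interval_ms="$1" burst="$2" nodes="$3" status=0
  if [ "$interval_ms" -lt 100 ]; then
    echo "per_peer_interval_ms=${interval_ms} is below the 100 ms production floor"
    status=1
  fi
  if [ "$burst" -lt "$nodes" ]; then
    echo "per_peer_burst=${burst} is smaller than node count ${nodes}"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "ok"
  fi
  return "$status"
}
```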

---

## 🔄 Automatic Updates

@@ -695,10 +695,46 @@ test -S /run/pulse-sensor-proxy/pulse-sensor-proxy.sock && echo "Socket OK" || e
- Standalone Proxmox nodes work but only monitor that single node
- Fallback: Re-run setup script manually to reconfigure cluster access

**SSH Fan-Out Scaling:**
- Proxy SSHs to each node sequentially during each polling cycle
- Large clusters (10+ nodes) at short intervals may trigger rate limiting or increase load
- Consider implementing caching or throttling if you experience SSH connection issues

**Rate Limiting & Scaling** (updated in commit 46b8b8d):

**What changed:** pulse-sensor-proxy now defaults to 1 request per second with a burst of 5 per calling UID. Earlier builds throttled after two calls every five seconds, which caused temperature tiles to flicker or fall back to `--` as soon as clusters reached three or more nodes.

**Symptoms of saturation:**
- Temperature widgets flicker between values and `--`, or entire node rows disappear after adding new hardware
- `Settings → System → Updates` shows no proxy restarts, yet scheduler health reports breaker openings for temperature pollers
- Proxy logs include `limiter.rejection` or `Rate limit exceeded` entries for the container UID

**Diagnose:**
1. Check scheduler health for temperature pollers:
   ```bash
   # Newlines inside the single-quoted jq filter are fine; a trailing backslash there is not.
   curl -s http://localhost:7655/api/monitoring/scheduler/health \
     | jq '.instances[]
           | select(.key | contains("temperature"))
           | {key, lastSuccess: .pollStatus.lastSuccess, breaker: .breaker.state, deadLetter: .deadLetter.present}'
   ```
   Breakers that remain `open` or repeated dead letters indicate the proxy is rejecting calls.
2. Inspect limiter metrics on the host:
   ```bash
   curl -s http://127.0.0.1:9127/metrics \
     | grep -E 'pulse_proxy_limiter_(rejects|penalties)_total'
   ```
   A rising counter confirms the limiter is backing off callers.
3. Review logs for throttling:
   ```bash
   journalctl -u pulse-sensor-proxy -n 100 | grep -i "rate limit"
   ```
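Step 2 relies on spotting a "rising" counter, which needs two scrapes. A minimal sketch of that comparison follows. It assumes the endpoint serves the standard Prometheus text format without per-sample timestamps, and that the counter may be split across label sets (the label names are not documented here, so the helper simply sums every sample of the metric). `sum_counter` and `counter_delta` are hypothetical names.

```bash
# Sum every sample of one counter in Prometheus text-format output read from stdin.
# Assumes the value is the last field on the line (no trailing timestamp).
sum_counter() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { total += $NF }
    END { printf "%d\n", total }
  '
}

# Hypothetical helper: print how much a counter grew between two scrapes.
# Usage: counter_delta <metric_name> <first_scrape_text> <second_scrape_text>
counter_delta() {
  local name="$1" before="$2" after="$3" a b
  a=$(printf '%s\n' "$before" | sum_counter "$name")
  b=$(printf '%s\n' "$after" | sum_counter "$name")
  echo $(( b - a ))
}

# Example (run on the proxy host):
#   before=$(curl -s http://127.0.0.1:9127/metrics)
#   sleep 30
#   after=$(curl -s http://127.0.0.1:9127/metrics)
#   counter_delta pulse_proxy_limiter_rejects_total "$before" "$after"
```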

**Tuning guidance:** Add a `rate_limit` block to `/etc/pulse-sensor-proxy/config.yaml` (see `cmd/pulse-sensor-proxy/config.example.yaml`) when clusters grow beyond the defaults. Use the formula `per_peer_interval_ms = polling_interval_ms / node_count` and set `per_peer_burst ≥ node_count` to allow one full sweep per polling window.

| Deployment size | Nodes | `per_peer_interval_ms` (10 s poll interval) | Suggested burst | Notes |
| --- | --- | --- | --- | --- |
| Small | 1–3 | 1000 (default) | 5 | Works for most single Proxmox hosts. |
| Medium | 4–10 | 500 | 10 | Halves wait time; keep burst ≥ node count. |
| Large | 10–20 | 250 | 20 | Monitor CPU on proxy; consider staggering polls. |
| XL | 30+ | 100–150 | 30–50 | Only enable after validating proxy host capacity. |

**Security note:** Lower intervals increase throughput and reduce UI staleness, but they also allow untrusted callers to issue more RPCs per second. Keep `per_peer_interval_ms ≥ 100` in production and continue to rely on UID allow-lists plus audit logs when raising limits.
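As a rough illustration of the tuning formula, the sketch below computes `per_peer_interval_ms = polling_interval_ms / node_count`, sets the burst to the node count, and clamps the interval to the 100 ms floor from the security note. `suggest_rate_limit` is a hypothetical helper, not a tool shipped with the proxy. Note that the sizing table suggests shorter intervals than the raw formula yields for the same node count (for example, 500 ms rather than 1000 ms for 10 nodes at a 10 s poll), so treat the output as an upper bound on the interval rather than a recommendation.

```bash
# Hypothetical helper: apply per_peer_interval_ms = polling_interval_ms / node_count
# and per_peer_burst = node_count, clamping the interval to the 100 ms floor.
# Usage: suggest_rate_limit <polling_interval_ms> <node_count>
suggest_rate_limit() {
  local polling_interval_ms="$1" node_count="$2" interval_ms
  if [ "$node_count" -lt 1 ]; then
    echo "node_count must be at least 1" >&2
    return 1
  fi
  interval_ms=$(( polling_interval_ms / node_count ))  # integer division
  if [ "$interval_ms" -lt 100 ]; then
    interval_ms=100
  fi
  printf 'rate_limit:\n  per_peer_interval_ms: %d\n  per_peer_burst: %d\n' \
    "$interval_ms" "$node_count"
}
```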

**SSH latency monitoring:**
- Monitor SSH latency metrics: `curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_ssh_latency`

**Requires Proxmox Cluster Membership:**

@@ -108,13 +108,29 @@ if privilegedMethods[method] && isIDMappedRoot(credentials) {

## Rate Limiting

### Per-Peer Limits
- **Rate**: ~12 requests per minute (`rate.Every(5s)`)
- **Burst**: 2 requests (short spikes are tolerated)
- **Per-peer concurrency**: Maximum 2 simultaneous RPCs
- **Global concurrency**: 8 total in-flight RPCs across all peers
- **Penalty**: 2 s enforced delay when validation fails (payloads too large, unauthorized methods)
- **Cleanup**: Idle peer entries removed after 10 minutes

### Per-Peer Limits (commit 46b8b8d)

- **Rate:** 1 request per second (`per_peer_interval_ms = 1000`)
- **Burst:** 5 requests (enough to sweep five nodes per polling window)
- **Per-peer concurrency:** Maximum 2 concurrent RPCs
- **Global concurrency:** 8 simultaneous RPCs across all peers
- **Penalty:** 2 s enforced delay on validation failures (oversized payloads, unauthorized methods)
- **Cleanup:** Peer entries expire after 10 minutes of inactivity
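To see how rate and burst interact during a polling sweep, the following sketch simulates a generic token bucket. It is an assumption that the proxy's limiter behaves this way (the earlier `rate.Every(5s)` notation suggests a token-bucket limiter, but this has not been checked against the proxy source). The function name `simulate_sweep` and the gap between requests are hypothetical, and the penalty and concurrency limits are not modelled.

```bash
# Count how many requests of one sweep a token bucket would reject.
# Credit is tracked in milliseconds: one request costs interval_ms, and the
# bucket holds at most burst * interval_ms. The bucket starts full.
# Usage: simulate_sweep <interval_ms> <burst> <requests> <gap_ms_between_requests>
simulate_sweep() {
  local interval_ms="$1" burst="$2" requests="$3" gap_ms="$4"
  local capacity credit rejected=0 i=0
  capacity=$(( burst * interval_ms ))
  credit=$capacity
  while [ "$i" -lt "$requests" ]; do
    if [ "$i" -gt 0 ]; then
      credit=$(( credit + gap_ms ))        # refill for the time elapsed
      if [ "$credit" -gt "$capacity" ]; then
        credit=$capacity                   # never exceed the burst size
      fi
    fi
    if [ "$credit" -ge "$interval_ms" ]; then
      credit=$(( credit - interval_ms ))   # spend one token
    else
      rejected=$(( rejected + 1 ))         # bucket empty: request is rejected
    fi
    i=$(( i + 1 ))
  done
  echo "$rejected"
}
```

Under these assumptions, `simulate_sweep 5000 2 3 100` (the previous limits, three nodes, a hypothetical 100 ms between requests) reports 1 rejection, while `simulate_sweep 1000 5 3 100` reports 0.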

### Configurable Overrides

Administrators can raise or lower thresholds via `/etc/pulse-sensor-proxy/config.yaml`:

```yaml
rate_limit:
  per_peer_interval_ms: 500 # 2 rps
  per_peer_burst: 10 # allow 10-node sweep
```

Security guidance:
- Keep `per_peer_interval_ms ≥ 100` in production; lower values expand the attack surface for noisy callers.
- Ensure UID/GID filters stay in place when increasing throughput, and continue to ship audit logs off-host.
- Monitor `pulse_proxy_limiter_penalties_total` alongside `pulse_proxy_limiter_rejects_total` to spot abusive or compromised clients.

### Per-Node Concurrency
- **Limit**: 1 concurrent SSH request per node

@@ -174,6 +174,30 @@ systemctl status pulse 2>/dev/null \
- Try a test service like webhook.site
- Check logs for response codes (temporarily set `LOG_LEVEL=debug` via **Settings → System → Logging** or export `LOG_LEVEL=debug` and restart; review `webhook.delivery` entries, then revert to `info`)

### Temperature Monitoring Issues

#### Temperature data flickers after adding nodes

**Symptoms:** Dashboard temperatures alternate between values and `--`, or new nodes never show readings. Proxy logs contain `limiter.rejection` messages.

**Diagnosis:**
1. Confirm you are running a build with commit 46b8b8d or later (defaults are 1 rps, burst 5). Older binaries throttle multi-node clusters aggressively.
2. Check limiter metrics:
   ```bash
   curl -s http://127.0.0.1:9127/metrics \
     | grep -E 'pulse_proxy_limiter_(rejects|penalties)_total'
   ```
   Any recent increment indicates rate-limit saturation.
3. Inspect scheduler health for temperature pollers (`breaker.state` should be `closed` and `deadLetter.present` should be `false`).

**Fix:** Raise the burst and shorten the interval in `/etc/pulse-sensor-proxy/config.yaml`:
```yaml
rate_limit:
  per_peer_interval_ms: 500 # medium cluster (≈10 nodes)
  per_peer_burst: 10
```
Restart `pulse-sensor-proxy`, verify limiter counters stop increasing, and confirm the dashboard stabilises. Document the change in your operations log.

### VM Disk Monitoring Issues

#### VMs show "-" for disk usage

@@ -6,7 +6,7 @@
- Logs: `/var/log/pulse/sensor-proxy/proxy.log`
- Audit trail: `/var/log/pulse/sensor-proxy/audit.log` (hash chained, forwarded via rsyslog)
- Metrics: `http://127.0.0.1:9127/metrics` (set `PULSE_SENSOR_PROXY_METRICS_ADDR` to change/disable)
- Limiters: ~12 requests/minute per UID (burst 2), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures
- Limiters: 1 request/sec per UID (burst 5), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures

## Monitoring Alerts & Response

@@ -71,10 +71,28 @@ Temperature instances should show recent `lastSuccess` timestamps with no DLQ en
```
Expect `breaker.state=="closed"` and `deadLetter.present==false` for all proxy-driven pollers.

### Adjust Rate Limits
1. Update `limiter_policy` environment overrides (future config).
2. Restart proxy; monitor limiter metrics to validate new thresholds.
3. Document change in security runbook.

### Rate Limit Tuning

| Profile | Nodes | `per_peer_interval_ms` | `per_peer_burst` | Notes |
| --- | --- | --- | --- | --- |
| Default | ≤5 | 1000 | 5 | Shipped with commit 46b8b8d; no action needed for single host clusters. |
| Medium | 6–10 | 500 | 10 | Doubles throughput; monitor `pulse_proxy_limiter_rejects_total`. |
| Large | 11–20 | 250 | 20 | Confirm proxy CPU stays below 70 % and audit logs remain clean. |
| XL | 21–40 | 150 | 30 | Requires high-trust environment; ensure UID filters are locked down. |

**Procedure:**
1. Edit `/etc/pulse-sensor-proxy/config.yaml` and set the desired profile values under `rate_limit`.
2. Restart the service:
   ```bash
   sudo systemctl restart pulse-sensor-proxy
   ```
3. Validate:
   ```bash
   curl -s http://127.0.0.1:9127/metrics \
     | grep pulse_proxy_limiter_rejects_total
   ```
   The counter should stop incrementing during steady-state polling.
4. Record the change in the operations log and review audit entries for unexpected callers.
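For step 1, the profile lookup can be scripted so the values come straight from the tuning table. The sketch below is one way to do that. `profile_for_nodes` is a hypothetical helper, the thresholds are copied from the table, and node counts above 40 are treated as out of range because the table defines no profile for them.

```bash
# Hypothetical helper: print the rate_limit block for the runbook profile that
# covers a given node count. Values are copied from the Rate Limit Tuning table.
# Usage: profile_for_nodes <node_count>
profile_for_nodes() {
  local nodes="$1" interval_ms burst
  if [ "$nodes" -le 5 ]; then
    interval_ms=1000; burst=5      # Default
  elif [ "$nodes" -le 10 ]; then
    interval_ms=500; burst=10      # Medium
  elif [ "$nodes" -le 20 ]; then
    interval_ms=250; burst=20      # Large
  elif [ "$nodes" -le 40 ]; then
    interval_ms=150; burst=30      # XL
  else
    echo "no profile defined for ${nodes} nodes" >&2
    return 1
  fi
  printf 'rate_limit:\n  per_peer_interval_ms: %d\n  per_peer_burst: %d\n' \
    "$interval_ms" "$burst"
}
```

For example, `profile_for_nodes 12` prints the Large profile (`per_peer_interval_ms: 250`, `per_peer_burst: 20`), which can be reviewed and then merged into the config file by hand.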

## Incident Handling
- **Unauthorized Command Attempt:** audit log shows `command.validation_failed` and limiter penalties; capture correlation ID, check Pulse side for compromised container.
