mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-19 07:50:43 +01:00
Document the pulse-sensor-proxy rate limiting bug fix and new configurability across all relevant documentation: TEMPERATURE_MONITORING.md: - Added 'Rate Limiting & Scaling' section with symptom diagnosis - Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments - Provided tuning formula: interval_ms = polling_interval / node_count TROUBLESHOOTING.md: - Added 'Temperature data flickers after adding nodes' section - Step-by-step diagnosis using limiter metrics and scheduler health - Quick fix with config example CONFIGURATION.md: - Added pulse-sensor-proxy/config.yaml reference section - Documented rate_limit.per_peer_interval_ms and per_peer_burst fields - Included defaults and example override pulse-sensor-proxy-runbook.md: - Updated quick reference with new defaults (1 req/sec, burst 5) - Added 'Rate Limit Tuning' procedure with 4 deployment profiles - Included validation steps and monitoring commands TEMPERATURE_MONITORING_SECURITY.md: - Updated rate limiting section with new defaults - Added configurable overrides guidance - Documented security considerations for production deployments Related commits: -46b8b8d08: Initial rate limit fix (hardcoded defaults) -ca534e2b6: Made rate limits configurable via YAML -e244da837: Added guidance for large deployments (30+ nodes)
4.9 KiB
4.9 KiB
Pulse Sensor Proxy Runbook
Quick Reference
- Binary:
/opt/pulse/sensor-proxy/bin/pulse-sensor-proxy - Unit:
pulse-sensor-proxy.service - Logs:
/var/log/pulse/sensor-proxy/proxy.log - Audit trail:
/var/log/pulse/sensor-proxy/audit.log(hash chained, forwarded via rsyslog) - Metrics:
http://127.0.0.1:9127/metrics(setPULSE_SENSOR_PROXY_METRICS_ADDRto change/disable) - Limiters: 1 request/sec per UID (burst 5), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures
Monitoring Alerts & Response
sequenceDiagram
participant Backend
participant Proxy
participant Node
Backend->>Proxy: get_temperature
Proxy->>Proxy: Check rate limit
Proxy->>Node: SSH sensors -j
Node->>Proxy: JSON response
Proxy->>Backend: Temperature data
Rate Limit Hits (pulse_proxy_limiter_rejections_total)
- Check audit log entries tagged
limiter.rejectionfor offending UID. - Confirm workload legitimacy; if expected, consider increasing limits via config override.
- If malicious, block source process/user and inspect Pulse audit logs.
Penalty Events (pulse_proxy_limiter_penalties_total)
- Review corresponding validation failures in audit log (
command.validation_failed). - If repeated invalid JSON/unknown methods, inspect caller code for regressions or intrusion attempts.
Audit Log Forwarder Down
journalctl -u rsyslogto confirm transmission errors.- Ensure
/etc/pulse/log-forwardingcerts valid & remote host reachable. - Forwarding queue stored locally in
/var/log/pulse/sensor-proxy/forwarding.log; ship manually if outage exceeds 1 hour.
Proxy Health Endpoint Fails
systemctl status pulse-sensor-proxy- Check
/var/log/pulse/sensor-proxy/proxy.logfor panic or limiter exhaustion. - Inspect
/var/log/pulse/sensor-proxy/audit.logfor recent privileged method denials.
Standard Procedures
Restart Proxy Safely
sudo systemctl stop pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy # if updating policy
sudo systemctl start pulse-sensor-proxy
Verify:
# Metrics endpoint exposes proxy build/health
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_build_info
# Ensure adaptive polling sees the proxy again
curl -s http://localhost:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.key | contains("temperature")) | {key, pollStatus}'
Temperature instances should show recent lastSuccess timestamps with no DLQ entries.
Rotate SSH Keys
- Run
scripts/secure-sensor-files.shto regenerate keys (ensure environment locked down). - Use RPC
ensure_cluster_keysto distribute new public key. - Confirm nodes accept
sshfrom proxy host. - Confirm the scheduler clears any temporary breakers/dlq entries:
Expect
curl -s http://localhost:7655/api/monitoring/scheduler/health \ | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}'breaker.state=="closed"anddeadLetter.present==falsefor all proxy-driven pollers.
Rate Limit Tuning
| Profile | Nodes | per_peer_interval_ms |
per_peer_burst |
Notes |
|---|---|---|---|---|
| Default | ≤5 | 1000 | 5 | Shipped with commit 46b8b8d; no action needed for single host clusters. |
| Medium | 6–10 | 500 | 10 | Doubles throughput; monitor pulse_proxy_limiter_rejects_total. |
| Large | 11–20 | 250 | 20 | Confirm proxy CPU stays below 70 % and audit logs remain clean. |
| XL | 21–40 | 150 | 30 | Requires high-trust environment; ensure UID filters are locked down. |
Procedure:
- Edit
/etc/pulse-sensor-proxy/config.yamland set the desired profile values underrate_limit. - Restart the service:
sudo systemctl restart pulse-sensor-proxy - Validate:
The counter should stop incrementing during steady-state polling.
curl -s http://127.0.0.1:9127/metrics \ | grep pulse_proxy_limiter_rejects_total - Record the change in the operations log and review audit entries for unexpected callers.
Incident Handling
- Unauthorized Command Attempt: audit log shows
command.validation_failedand limiter penalties; capture correlation ID, check Pulse side for compromised container. - Excessive Temperature Failures: refer to
pulse_proxy_ssh_requests_total{result="error"}; validate network ACLs and node health; escalate to Proxmox team if nodes unreachable. - Log Tampering Suspected: verify audit hash chain by replaying
eventHashvalues; compare with remote log store (immutable). Trigger security response if mismatch.
Postmortem Checklist
- Timeline: command audit entries, limiter stats, rsyslog queue depth.
- Verify AppArmor/seccomp status (
aa-status,systemctl show pulse-sensor-proxy -p AppArmorProfile). - Ensure firewall ACLs match
docs/security/pulse-sensor-proxy-network.md.