Files
Pulse/docs/operations/pulse-sensor-proxy-runbook.md
rcourtman 4463b7e015 docs: comprehensive documentation for rate limit fix and configurability
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:

TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count

TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example

CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override

pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands

TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments

Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
2025-10-21 11:36:07 +00:00

4.9 KiB
Raw Blame History

Pulse Sensor Proxy Runbook

Quick Reference

  • Binary: /opt/pulse/sensor-proxy/bin/pulse-sensor-proxy
  • Unit: pulse-sensor-proxy.service
  • Logs: /var/log/pulse/sensor-proxy/proxy.log
  • Audit trail: /var/log/pulse/sensor-proxy/audit.log (hash chained, forwarded via rsyslog)
  • Metrics: http://127.0.0.1:9127/metrics (set PULSE_SENSOR_PROXY_METRICS_ADDR to change/disable)
  • Limiters: 1 request/sec per UID (burst 5), per-UID concurrency 2, global concurrency 8, 2s penalty on validation failures

Monitoring Alerts & Response

sequenceDiagram
    participant Backend
    participant Proxy
    participant Node

    Backend->>Proxy: get_temperature
    Proxy->>Proxy: Check rate limit
    Proxy->>Node: SSH sensors -j
    Node->>Proxy: JSON response
    Proxy->>Backend: Temperature data

Rate Limit Hits (pulse_proxy_limiter_rejections_total)

  1. Check audit log entries tagged limiter.rejection for offending UID.
  2. Confirm workload legitimacy; if expected, consider increasing limits via config override.
  3. If malicious, block source process/user and inspect Pulse audit logs.

Penalty Events (pulse_proxy_limiter_penalties_total)

  1. Review corresponding validation failures in audit log (command.validation_failed).
  2. If repeated invalid JSON/unknown methods, inspect caller code for regressions or intrusion attempts.

Audit Log Forwarder Down

  1. journalctl -u rsyslog to confirm transmission errors.
  2. Ensure /etc/pulse/log-forwarding certs valid & remote host reachable.
  3. Forwarding queue stored locally in /var/log/pulse/sensor-proxy/forwarding.log; ship manually if outage exceeds 1 hour.

Proxy Health Endpoint Fails

  1. systemctl status pulse-sensor-proxy
  2. Check /var/log/pulse/sensor-proxy/proxy.log for panic or limiter exhaustion.
  3. Inspect /var/log/pulse/sensor-proxy/audit.log for recent privileged method denials.

Standard Procedures

Restart Proxy Safely

sudo systemctl stop pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy   # if updating policy
sudo systemctl start pulse-sensor-proxy

Verify:

# Metrics endpoint exposes proxy build/health
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_build_info

# Ensure adaptive polling sees the proxy again
curl -s http://localhost:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.key | contains("temperature")) | {key, pollStatus}'

Temperature instances should show recent lastSuccess timestamps with no DLQ entries.

Rotate SSH Keys

  1. Run scripts/secure-sensor-files.sh to regenerate keys (ensure environment locked down).
  2. Use RPC ensure_cluster_keys to distribute new public key.
  3. Confirm nodes accept ssh from proxy host.
  4. Confirm the scheduler clears any temporary breakers/dlq entries:
    curl -s http://localhost:7655/api/monitoring/scheduler/health \
      | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}'
    
    Expect breaker.state=="closed" and deadLetter.present==false for all proxy-driven pollers.

Rate Limit Tuning

Profile Nodes per_peer_interval_ms per_peer_burst Notes
Default ≤5 1000 5 Shipped with commit 46b8b8d; no action needed for single host clusters.
Medium 610 500 10 Doubles throughput; monitor pulse_proxy_limiter_rejects_total.
Large 1120 250 20 Confirm proxy CPU stays below 70% and audit logs remain clean.
XL 2140 150 30 Requires high-trust environment; ensure UID filters are locked down.

Procedure:

  1. Edit /etc/pulse-sensor-proxy/config.yaml and set the desired profile values under rate_limit.
  2. Restart the service:
    sudo systemctl restart pulse-sensor-proxy
    
  3. Validate:
    curl -s http://127.0.0.1:9127/metrics \
      | grep pulse_proxy_limiter_rejects_total
    
    The counter should stop incrementing during steady-state polling.
  4. Record the change in the operations log and review audit entries for unexpected callers.

Incident Handling

  • Unauthorized Command Attempt: audit log shows command.validation_failed and limiter penalties; capture correlation ID, check Pulse side for compromised container.
  • Excessive Temperature Failures: refer to pulse_proxy_ssh_requests_total{result="error"}; validate network ACLs and node health; escalate to Proxmox team if nodes unreachable.
  • Log Tampering Suspected: verify audit hash chain by replaying eventHash values; compare with remote log store (immutable). Trigger security response if mismatch.

Postmortem Checklist

  • Timeline: command audit entries, limiter stats, rsyslog queue depth.
  • Verify AppArmor/seccomp status (aa-status, systemctl show pulse-sensor-proxy -p AppArmorProfile).
  • Ensure firewall ACLs match docs/security/pulse-sensor-proxy-network.md.