Temperature Monitoring Security Guide

This document describes the security architecture of Pulse's temperature monitoring system with pulse-sensor-proxy.


Architecture Overview

graph TD
    Container[Pulse Container]
    Proxy[pulse-sensor-proxy<br/>Host Service]
    Cluster[Cluster Nodes<br/>SSH sensors -j]

    Container -->|Unix Socket<br/>Rate Limited| Proxy
    Proxy -->|SSH<br/>Forced Command| Cluster
    Cluster -->|Temperature JSON| Proxy
    Proxy -->|Temperature JSON| Container

    style Proxy fill:#e1f5e1
    style Container fill:#fff4e1
    style Cluster fill:#e1f0ff

Key Principle: SSH keys never enter containers. All SSH operations are performed by the host-side proxy.


Security Boundaries

1. Host ↔ Container Boundary

  • Enforced by: Method-level authorization + ID-mapped root detection
  • Container CAN:
    • Call get_temperature (read temperature data)
    • Call get_status (check proxy health)
  • Container CANNOT:
    • Call ensure_cluster_keys (SSH key distribution)
    • Call register_nodes (node discovery)
    • Call request_cleanup (cleanup operations)
    • Use direct SSH (blocked by container detection)

2. Proxy ↔ Cluster Nodes Boundary

  • Enforced by: SSH forced commands + IP filtering
  • SSH authorized_keys entry:
from="192.168.0.0/24",command="sensors -j",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... pulse-sensor-proxy
  • Proxy can ONLY run sensors -j on cluster nodes
  • IP restrictions prevent lateral movement

3. Client ↔ Proxy Boundary

  • Enforced by: UID-based ACL + adaptive rate limiting
  • SO_PEERCRED verifies caller's UID/GID/PID
  • Rate limiting (defaults, see Rate Limiting): 1 request per second per peer (burst 5), per-peer concurrency 2, global concurrency 8, 2s penalty on validation failures
  • Per-node guard: only 1 SSH fetch per node at a time
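
The SO_PEERCRED check above can be illustrated with a short Go sketch. It is not the proxy's actual code: `peerCred`, `readPeerCred`, `selfCheck`, and `matchesSelf` are hypothetical names, and the example is Linux-only because it relies on `syscall.GetsockoptUcred`. It opens a throwaway Unix socket, connects to it, and asks the kernel who is on the other end.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// peerCred is a hypothetical holder for the kernel-reported peer identity.
type peerCred struct {
	UID, GID uint32
	PID      int32
}

// readPeerCred queries SO_PEERCRED on an accepted Unix socket (Linux only).
// The values come from the kernel, so the client cannot forge them.
func readPeerCred(conn *net.UnixConn) (peerCred, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return peerCred{}, err
	}
	var ucred *syscall.Ucred
	var sockErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		ucred, sockErr = syscall.GetsockoptUcred(int(fd), syscall.SOL_SOCKET, syscall.SO_PEERCRED)
	})
	if ctrlErr != nil {
		return peerCred{}, ctrlErr
	}
	if sockErr != nil {
		return peerCred{}, sockErr
	}
	return peerCred{UID: ucred.Uid, GID: ucred.Gid, PID: ucred.Pid}, nil
}

// selfCheck connects to a temporary socket and returns the credentials
// that the accepting side observes for this very process.
func selfCheck() (peerCred, error) {
	dir, err := os.MkdirTemp("", "peercred")
	if err != nil {
		return peerCred{}, err
	}
	defer os.RemoveAll(dir)

	addr := &net.UnixAddr{Name: filepath.Join(dir, "demo.sock"), Net: "unix"}
	ln, err := net.ListenUnix("unix", addr)
	if err != nil {
		return peerCred{}, err
	}
	defer ln.Close()

	client, err := net.DialUnix("unix", nil, addr)
	if err != nil {
		return peerCred{}, err
	}
	defer client.Close()

	server, err := ln.AcceptUnix()
	if err != nil {
		return peerCred{}, err
	}
	defer server.Close()
	return readPeerCred(server)
}

// matchesSelf reports whether the credentials describe the current process.
func matchesSelf(c peerCred) bool {
	return c.UID == uint32(os.Geteuid()) &&
		c.GID == uint32(os.Getegid()) &&
		c.PID == int32(os.Getpid())
}

func main() {
	cred, err := selfCheck()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("uid=%d gid=%d pid=%d self=%v\n", cred.UID, cred.GID, cred.PID, matchesSelf(cred))
}
```

A server would typically call something like `readPeerCred` once per accepted connection and feed the UID/GID into its ACL and rate limiter.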

Authentication & Authorization

Authentication (Who can connect?)

Allowed UIDs:

  • Root (UID 0) - host processes
  • Proxy's own UID (pulse-sensor-proxy user)
  • Configured UIDs from /etc/pulse-sensor-proxy/config.yaml
  • ID-mapped root ranges (containers, if enabled)

ID-Mapped Root Detection:

  • Reads /etc/subuid and /etc/subgid for UID/GID mapping ranges
  • Containers typically use ranges like 100000-165535
  • Both UID AND GID must be in mapped ranges

Authorization (What can they call?)

Privileged Methods (host-only):

var privilegedMethods = map[string]bool{
    "ensure_cluster_keys": true,  // SSH key distribution
    "register_nodes":      true,  // Node registration
    "request_cleanup":     true,  // Cleanup operations
}

Authorization Check:

if privilegedMethods[method] && isIDMappedRoot(credentials) {
    return "method requires host-level privileges"
}

Read-Only Methods (containers allowed):

  • get_temperature - Fetch temperature data via proxy
  • get_status - Check proxy health and version

Rate Limiting

Per-Peer Limits (commit 46b8b8d)

  • Rate: 1 request per second (per_peer_interval_ms = 1000)
  • Burst: 5 requests (enough to sweep five nodes per polling window)
  • Per-peer concurrency: Maximum 2 concurrent RPCs
  • Global concurrency: 8 simultaneous RPCs across all peers
  • Penalty: 2s enforced delay on validation failures (oversized payloads, unauthorized methods)
  • Cleanup: Peer entries expire after 10 minutes of inactivity
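
To make the interval and burst settings concrete, here is a minimal token-bucket sketch in Go. It is an illustration only: the proxy's real limiter may be implemented differently, the names `limiter`, `bucket`, `allow`, and `demo` are hypothetical, and concurrency caps, penalties, and entry expiry are left out. Time is passed in explicitly so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// bucket tracks the remaining tokens for a single peer.
type bucket struct {
	tokens float64
	last   time.Time
}

// limiter is a hypothetical per-peer token bucket. It is not safe for
// concurrent use; a real server would guard it with a mutex.
type limiter struct {
	interval time.Duration // time to earn one token (per_peer_interval_ms)
	burst    float64       // maximum stored tokens (per_peer_burst)
	peers    map[uint32]*bucket
}

func newLimiter(interval time.Duration, burst int) *limiter {
	return &limiter{interval: interval, burst: float64(burst), peers: map[uint32]*bucket{}}
}

// allow reports whether the peer identified by uid may send a request at time now.
func (l *limiter) allow(uid uint32, now time.Time) bool {
	b, ok := l.peers[uid]
	if !ok {
		// New peers start with a full bucket.
		b = &bucket{tokens: l.burst, last: now}
		l.peers[uid] = b
	}
	// Refill in proportion to elapsed time, capped at the burst size.
	b.tokens += float64(now.Sub(b.last)) / float64(l.interval)
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demo sends seven back-to-back requests, then one more a second later.
func demo() (immediate []bool, afterOneSecond bool) {
	l := newLimiter(time.Second, 5) // documented defaults: 1 rps, burst 5
	start := time.Unix(0, 0)
	for i := 0; i < 7; i++ {
		immediate = append(immediate, l.allow(1000, start))
	}
	return immediate, l.allow(1000, start.Add(time.Second))
}

func main() {
	immediate, later := demo()
	fmt.Println(immediate, later)
}
```

With the default values, the first five back-to-back requests pass, the next two are rejected, and one more is admitted after a second has elapsed.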

Configurable Overrides

Administrators can raise or lower thresholds via /etc/pulse-sensor-proxy/config.yaml:

rate_limit:
  per_peer_interval_ms: 500   # 2 rps
  per_peer_burst: 10          # allow 10-node sweep

Security guidance:

  • Keep per_peer_interval_ms ≥ 100 in production; lower values let a noisy or compromised caller put far more load on the proxy.
  • Ensure UID/GID filters stay in place when increasing throughput, and continue to ship audit logs off-host.
  • Monitor pulse_proxy_limiter_penalties_total alongside pulse_proxy_limiter_rejects_total to spot abusive or compromised clients.

Per-Node Concurrency

  • Limit: 1 concurrent SSH request per node
  • Purpose: Prevents SSH connection storms
  • Scope: Applies to all peers requesting same node
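
One way to picture the per-node guard is a small map of in-flight nodes protected by a mutex. The Go sketch below is an assumption-laden illustration: `nodeGate`, `tryAcquire`, and `release` are hypothetical names, and this document does not state whether the real proxy queues or rejects a second request for a busy node. The sketch rejects, because that is the simplest behaviour to demonstrate.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeGate allows at most one in-flight SSH fetch per node, shared by all peers.
type nodeGate struct {
	mu   sync.Mutex
	busy map[string]bool
}

func newNodeGate() *nodeGate {
	return &nodeGate{busy: map[string]bool{}}
}

// tryAcquire claims the slot for node and reports whether the caller got it.
func (g *nodeGate) tryAcquire(node string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.busy[node] {
		return false
	}
	g.busy[node] = true
	return true
}

// release frees the slot so the next fetch for node can start.
func (g *nodeGate) release(node string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.busy, node)
}

func main() {
	g := newNodeGate()
	fmt.Println(g.tryAcquire("node-a")) // first fetch gets the slot
	fmt.Println(g.tryAcquire("node-a")) // second fetch for the same node is refused
	fmt.Println(g.tryAcquire("node-b")) // other nodes are unaffected
	g.release("node-a")
	fmt.Println(g.tryAcquire("node-a")) // slot is free again
}
```

Because the map is keyed by node name rather than by caller, the limit applies across all peers, as described above.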

Monitoring Rate Limits

# Check rate limit metrics
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_limiter_rejects_total

# Watch for rate limit warnings in logs
journalctl -u pulse-sensor-proxy -f | grep "Rate limit exceeded"

SSH Security

SSH Key Management

Key Location: /var/lib/pulse-sensor-proxy/ssh/id_ed25519

  • Owner: pulse-sensor-proxy:pulse-sensor-proxy
  • Permissions: 0600 (read/write for owner only)
  • Type: Ed25519 (modern, secure)

Key Distribution:

  • Only host processes can trigger distribution (via ensure_cluster_keys)
  • Containers are blocked from key distribution operations
  • Keys are distributed with forced commands and IP restrictions

Forced Command Restrictions

On cluster nodes, the SSH key can ONLY run:

sensors -j

No other commands possible:

  • Interactive terminal allocation disabled (no-pty); the forced command already prevents running a shell
  • Port forwarding disabled (no-port-forwarding)
  • X11 forwarding disabled (no-X11-forwarding)
  • Agent forwarding disabled (no-agent-forwarding)

IP Filtering

Source IP restrictions:

from="192.168.0.0/24,10.0.0.0/8"
  • Automatically detected from cluster node IPs
  • Prevents SSH key use from outside the cluster
  • Updated during key rotation
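
The restricted authorized_keys entry combines the source filter, the forced command, and the no-* options. The following Go sketch shows how such a line could be assembled from a list of source networks. The function name `authorizedKeysLine` is hypothetical, the key material is the same elided placeholder used in this guide, and how the proxy actually generates the entry is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// authorizedKeysLine builds a restricted authorized_keys entry.
// sources are the addresses or CIDR blocks allowed in the from= option;
// publicKey is the "type base64" pair; comment labels the key.
func authorizedKeysLine(sources []string, publicKey, comment string) string {
	options := []string{
		`from="` + strings.Join(sources, ",") + `"`,
		`command="sensors -j"`, // forced command: the only thing this key can run
		"no-port-forwarding",
		"no-X11-forwarding",
		"no-agent-forwarding",
		"no-pty",
	}
	return strings.Join(options, ",") + " " + publicKey + " " + comment
}

func main() {
	line := authorizedKeysLine(
		[]string{"192.168.0.0/24", "10.0.0.0/8"},
		"ssh-ed25519 AAAA...", // placeholder, not a real key
		"pulse-sensor-proxy",
	)
	fmt.Println(line)
}
```

Keeping every option in a single generated line makes it harder for a rotation step to drop a restriction by accident.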

Container Isolation

Fallback SSH Protection

In containers, direct SSH is blocked:

if system.InContainer() && !devModeAllowSSH {
    log.Error().Msg("SECURITY BLOCK: SSH temperature collection disabled in containers")
    return &Temperature{Available: false}, nil
}

Container Detection Methods:

  1. PULSE_FORCE_CONTAINER=1 override for explicit opt-in
  2. Presence of /.dockerenv or /run/.containerenv
  3. container= hints from environment variables
  4. /proc/1/environ and /proc/1/cgroup markers (docker, lxc, containerd, kubepods, etc.)
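
The detection order above can be sketched as a pure function over pre-collected signals. This is a simplified illustration, not the body of `system.InContainer()`: the `hostFacts` type and `looksLikeContainer` function are hypothetical, the /proc/1/environ check is omitted, and the marker list only covers the names given above.

```go
package main

import (
	"fmt"
	"strings"
)

// hostFacts is a hypothetical snapshot of the signals listed above,
// gathered by the caller so the decision itself stays easy to test.
type hostFacts struct {
	Env        map[string]string // process environment
	Files      map[string]bool   // paths known to exist
	PID1Cgroup string            // contents of /proc/1/cgroup
}

// cgroupMarkers are substrings that suggest a container runtime.
var cgroupMarkers = []string{"docker", "lxc", "containerd", "kubepods"}

// looksLikeContainer applies the checks in the documented order.
func looksLikeContainer(f hostFacts) bool {
	if f.Env["PULSE_FORCE_CONTAINER"] == "1" {
		return true // explicit opt-in
	}
	if f.Files["/.dockerenv"] || f.Files["/run/.containerenv"] {
		return true // runtime marker files
	}
	if f.Env["container"] != "" {
		return true // container= hint set by the runtime
	}
	for _, marker := range cgroupMarkers {
		if strings.Contains(f.PID1Cgroup, marker) {
			return true
		}
	}
	return false
}

func main() {
	host := hostFacts{PID1Cgroup: "0::/init.scope"}
	lxc := hostFacts{Env: map[string]string{"container": "lxc"}}
	fmt.Println(looksLikeContainer(host), looksLikeContainer(lxc))
}
```

Any single positive signal is enough to treat the process as containerized, which errs on the side of blocking SSH.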

Bypass: Only possible by explicitly setting PULSE_DEV_ALLOW_CONTAINER_SSH=true (see Development Mode)

ID-Mapped Root Detection

How it works:

// Check /etc/subuid and /etc/subgid for mapping ranges
// Example /etc/subuid:
//   root:100000:65536

func isIDMappedRoot(cred *peerCredentials) bool {
    return uidInRange(cred.uid, idMappedUIDRanges) &&
           gidInRange(cred.gid, idMappedGIDRanges)
}

Why both UID and GID?

  • Container root: uid=100000, gid=100000 → ID-mapped
  • Container app user: uid=101001, gid=101001 → ID-mapped
  • Host root: uid=0, gid=0 → NOT ID-mapped
  • Mixed: uid=100000, gid=50 → NOT ID-mapped (fails check)
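
The four cases above can be reproduced with a small, self-contained Go sketch that parses /etc/subuid-style content and applies the "both UID and GID" rule. The helper names (`idRange`, `parseSubIDs`, `inRanges`, `isIDMapped`) are hypothetical and the parsing is deliberately minimal, so treat it as an illustration of the rule rather than the proxy's implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idRange is one "name:start:count" entry from /etc/subuid or /etc/subgid.
type idRange struct {
	start, count uint32
}

// parseSubIDs extracts ranges from subuid/subgid file content,
// skipping blank, commented, or malformed lines.
func parseSubIDs(content string) []idRange {
	var ranges []idRange
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.Split(line, ":")
		if len(parts) != 3 {
			continue
		}
		start, err1 := strconv.ParseUint(parts[1], 10, 32)
		count, err2 := strconv.ParseUint(parts[2], 10, 32)
		if err1 != nil || err2 != nil {
			continue
		}
		ranges = append(ranges, idRange{start: uint32(start), count: uint32(count)})
	}
	return ranges
}

// inRanges reports whether id falls inside any range [start, start+count).
func inRanges(id uint32, ranges []idRange) bool {
	for _, r := range ranges {
		if id >= r.start && id-r.start < r.count {
			return true
		}
	}
	return false
}

// isIDMapped requires both the UID and the GID to be in mapped ranges.
func isIDMapped(uid, gid uint32, uidRanges, gidRanges []idRange) bool {
	return inRanges(uid, uidRanges) && inRanges(gid, gidRanges)
}

func main() {
	ranges := parseSubIDs("root:100000:65536\n")
	fmt.Println(isIDMapped(100000, 100000, ranges, ranges)) // container root
	fmt.Println(isIDMapped(101001, 101001, ranges, ranges)) // container app user
	fmt.Println(isIDMapped(0, 0, ranges, ranges))           // host root
	fmt.Println(isIDMapped(100000, 50, ranges, ranges))     // mixed
}
```

With the example entry `root:100000:65536`, the mapped range is 100000 through 165535 inclusive, which matches the typical container range mentioned earlier.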

Monitoring & Alerting

Log Locations

Proxy logs:

journalctl -u pulse-sensor-proxy -f

Backend logs (inside container):

journalctl -u pulse-backend -f

Want off-host retention? Forward audit.log and proxy.log using scripts/setup-log-forwarding.sh so events land in your SIEM with RELP + TLS.

Audit rotation: Use the steps in operations/audit-log-rotation.md to rotate /var/log/pulse/sensor-proxy/audit.log. After each rotation, restart the proxy and confirm temperature pollers are healthy in /api/monitoring/scheduler/health (closed breakers, no DLQ entries).

Security Events to Monitor

1. Privileged Method Denials

SECURITY: Container attempted to call privileged method - access denied
method=ensure_cluster_keys uid=101000 gid=101000 pid=12345

Alert on: Any occurrence (indicates attempted privilege escalation)

2. Rate Limit Violations

Rate limit exceeded uid=101000 pid=12345

Alert on: Sustained violations (>10/minute indicates possible abuse)

3. Authorization Failures

Peer authorization failed uid=50000 gid=50000

Alert on: Repeated failures from same UID (indicates misconfiguration or probing)

4. SSH Fallback Attempts

SECURITY BLOCK: SSH temperature collection disabled in containers

Alert on: Any occurrence (should only happen during misconfigurations)

Metrics to Track

# Rate limit hits
pulse_proxy_rate_limit_hits_total

# RPC requests by method and result
pulse_proxy_rpc_requests_total{method="get_temperature",result="success"}
pulse_proxy_rpc_requests_total{method="ensure_cluster_keys",result="unauthorized"}

# SSH request latency
pulse_proxy_ssh_latency_seconds{node="example-node"}

# Active connections
pulse_proxy_queue_depth
pulse_proxy_global_concurrency_inflight

Recommended Alerts

  1. Privilege Escalation Attempts:

    pulse_proxy_rpc_requests_total{result="unauthorized"} > 0
    
  2. Rate Limit Abuse:

    rate(pulse_proxy_rate_limit_hits_total[5m]) > 1
    
  3. Proxy Unavailable:

    up{job="pulse-sensor-proxy"} == 0
    
  4. Scheduler Drift (Pulse side ensures temperature pollers stay healthy):

    max_over_time(pulse_monitor_poll_queue_depth[5m]) > <baseline*1.5>
    

    Pair with a check of /api/monitoring/scheduler/health to confirm temperature instances report breaker.state == "closed".


Development Mode

SSH Fallback Override

Purpose: Allow direct SSH from containers during development/testing

Environment Variable:

export PULSE_DEV_ALLOW_CONTAINER_SSH=true

Security Implications:

  • ⚠️ NEVER use in production
  • Allows container to use SSH keys if present
  • Defeats the security isolation model
  • Should only be used in trusted development environments

Example Usage:

# In systemd override for pulse-backend
mkdir -p /etc/systemd/system/pulse-backend.service.d
cat <<EOF > /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf
[Service]
Environment=PULSE_DEV_ALLOW_CONTAINER_SSH=true
EOF
systemctl daemon-reload
systemctl restart pulse-backend

Monitoring:

# Check if dev mode is active
journalctl -u pulse-backend | grep "dev mode" | tail -1

Disable dev mode:

rm /etc/systemd/system/pulse-backend.service.d/dev-ssh.conf
systemctl daemon-reload
systemctl restart pulse-backend

Troubleshooting

"method requires host-level privileges"

Symptom: A container receives this error when calling an RPC method

Cause: The container attempted to call a privileged method

Resolution: This is expected behavior. Only these methods are restricted:

  • ensure_cluster_keys
  • register_nodes
  • request_cleanup

If host process is blocked:

  1. Check UID is not in ID-mapped range:

    id
    cat /etc/subuid /etc/subgid
    
  2. Verify proxy's allowed UIDs:

    cat /etc/pulse-sensor-proxy/config.yaml
    

"Rate limit exceeded"

Symptom: Requests failing with rate limit error

Cause: Peer exceeded the per-peer rate (default 1 request per second, burst 5) or exhausted per-peer/global concurrency

Resolution:

  1. Confirm workload is legitimate (look for retry loops or aggressive polling).
  2. Allow the limiter to recover: penalty sleeps clear in ~2s and idle peers expire after 10 minutes.
  3. If sustained higher throughput is required, adjust the rate_limit settings in /etc/pulse-sensor-proxy/config.yaml (see Configurable Overrides).

Temperature monitoring unavailable

Symptom: No temperature data in dashboard

Diagnosis:

# 1. Check proxy is running
systemctl status pulse-sensor-proxy

# 2. Check socket exists
ls -la /run/pulse-sensor-proxy/

# 3. Check socket is accessible in container
ls -la /mnt/pulse-proxy/

# 4. Test proxy from host
curl -s --unix-socket /run/pulse-sensor-proxy/pulse-sensor-proxy.sock \
  -X POST -d '{"method":"get_status"}' | jq

# 5. Check SSH connectivity
ssh root@example-node "sensors -j"

# 6. Inspect adaptive polling for temperature pollers
curl -s http://localhost:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present, lastSuccess: .pollStatus.lastSuccess}'

SSH key not distributed

Symptom: Manual ensure_cluster_keys call fails

Check:

  1. Are you calling from host (not container)?
  2. Is pvecm available? command -v pvecm
  3. Can you reach cluster nodes? pvecm status
  4. Check proxy logs: journalctl -u pulse-sensor-proxy -f

Best Practices

Production Deployments

  1. Never use dev mode (PULSE_DEV_ALLOW_CONTAINER_SSH=true)
  2. Monitor security logs for unauthorized access attempts
  3. Use IP filtering on SSH authorized_keys entries
  4. Rotate SSH keys periodically (use ensure_cluster_keys with rotation)
  5. Limit allowed_peer_uids to minimum necessary
  6. Enable audit logging for privileged operations

Development Environments

  1. Use dev mode SSH override if needed (document why)
  2. Test with actual ID-mapped containers
  3. Verify privileged method blocking works
  4. Test rate limiting under load

Incident Response

If container compromise suspected:

  1. Check for privileged method attempts:

    journalctl -u pulse-sensor-proxy | grep "SECURITY:"
    
  2. Check rate limit violations:

    journalctl -u pulse-sensor-proxy | grep "Rate limit"
    
  3. Restart proxy to clear state:

    systemctl restart pulse-sensor-proxy
    
  4. Consider rotating SSH keys:

    # From host, call ensure_cluster_keys with new key
    


Last Updated: 2025-10-19

Security Contact: File issues at https://github.com/rcourtman/Pulse/issues