Pulse AI

Pulse AI adds an optional assistant for troubleshooting and proactive monitoring. It is off by default and can be enabled per instance.

What Makes AI Patrol Different

Unlike a generic AI chat, where you have to describe your infrastructure by hand, Patrol runs automatically and sees your entire infrastructure at once: every node, VM, container, storage pool, backup job, and Kubernetes cluster. It's not just a static checklist; it's an LLM analyzing real-time data enriched with historical context.

Context Patrol Receives (That Generic LLMs Can't See)

Every patrol run passes the LLM comprehensive context about your environment:

| Data Category | What's Included |
| --- | --- |
| Proxmox Nodes | Status, CPU%, memory%, uptime, 24h/7d trend analysis |
| VMs & Containers | Full metrics, backup status, OCI images, historical trends, anomaly flags |
| Storage Pools | Usage %, capacity predictions, type (ZFS/LVM/Ceph), growth rates |
| Docker/Podman | Container counts, health states, unhealthy container lists |
| Kubernetes | Nodes, pods, deployments, services, DaemonSets, StatefulSets, namespaces |
| PBS/PMG | Datastore status, backup jobs, job failures, verification status |
| Ceph | Cluster health, OSD states, PG status |
| Agent Hosts | Load averages, memory, disk, RAID status, temperatures |
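
To make the shape of this concrete, here is a minimal Go sketch of what such a per-run snapshot might look like. The types and field names are illustrative only, not Pulse's actual schema:

```go
package main

import "fmt"

// Illustrative only: these types are NOT Pulse's real schema, just a sketch
// of the kind of snapshot Patrol assembles and serializes for each run.
type NodeStatus struct {
	Name       string
	Online     bool
	CPUPercent float64
	MemPercent float64
	Trend7d    string // e.g. "growing", "stable", "declining", "volatile"
}

type StoragePool struct {
	Name        string
	UsedPercent float64
	GrowthRate  float64 // percentage points per day
}

type PatrolContext struct {
	Nodes     []NodeStatus
	Storage   []StoragePool
	UserNotes []string // annotations like "runs hot for transcoding"
	Dismissed []string // suppressed findings the LLM must respect
}

func main() {
	ctx := PatrolContext{
		Nodes:     []NodeStatus{{Name: "pve1", Online: true, CPUPercent: 34, MemPercent: 72, Trend7d: "stable"}},
		UserNotes: []string{"plex runs hot for transcoding"},
	}
	fmt.Printf("%+v\n", ctx)
}
```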

Enriched Context (The Real Differentiator)

Beyond raw metrics, Patrol enriches the context with derived signals that turn raw data into actionable insights:

  • Trend analysis - 24h and 7d patterns showing growing, stable, declining, or volatile behavior
  • Learned baselines - Z-score anomaly detection based on what's normal for your environment
  • Capacity predictions - "Storage pool will be full in 12 days at current growth rate" (see the sketch after this list)
  • Infrastructure changes - Detected config changes, VM migrations, new deployments
  • Resource correlations - Pattern detection across related resources (e.g., containers on same host)
  • User notes - Your annotations explaining expected behavior ("runs hot for transcoding")
  • Dismissed findings - Respects your feedback and suppressed alerts
  • Incident memory - Learns from past investigations and successful remediations
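
Two of these enrichments reduce to simple math: a z-score flags a metric that sits far from its learned mean, and a linear growth extrapolation turns a storage trend into days-until-full. A minimal Go sketch, where the thresholds and function names are illustrative rather than Pulse's implementation:

```go
package main

import "fmt"

// zScore measures how many standard deviations x sits from the learned mean.
// A patrol-style check would only flag values well outside normal, e.g. |z| > 3.
func zScore(x, mean, stddev float64) float64 {
	if stddev == 0 {
		return 0
	}
	return (x - mean) / stddev
}

// daysToFull extrapolates current linear growth to 100% usage.
func daysToFull(usedPct, growthPctPerDay float64) float64 {
	if growthPctPerDay <= 0 {
		return -1 // not growing; never fills at the current rate
	}
	return (100 - usedPct) / growthPctPerDay
}

func main() {
	// Memory at 92% against a learned baseline of 60% +/- 8% -> z = 4.0, worth flagging.
	fmt.Printf("z-score: %.1f\n", zScore(92, 60, 8))

	// Pool at 85% growing 1.25 points/day -> full in 12 days, matching the
	// "full in 12 days at current growth rate" example above.
	fmt.Printf("days to full: %.0f\n", daysToFull(85, 1.25))
}
```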

Examples of What Patrol Catches

Because it's an LLM with full context, Patrol catches issues that static threshold-based alerting misses:

| Issue | Severity | Example |
| --- | --- | --- |
| Node offline | Critical | Proxmox node not responding |
| Disk approaching capacity | Warning/Critical | Storage at 85%+, or growing toward full |
| Backup failures | Warning | PBS job failed, no backup in 48+ hours |
| Service down | Critical | Docker container crashed, agent offline |
| High resource usage | Warning | Sustained memory >90%, CPU >85% |
| Storage issues | Critical | PBS datastore errors, ZFS pool degraded |
| Ceph problems | Warning/Critical | Degraded OSDs, unhealthy PGs |
| Kubernetes issues | Warning | Pods stuck in Pending/CrashLoopBackOff |
| Restart loops | Warning | VMs that keep restarting without errors |
| Clock drift | Warning | Node time drift affecting Ceph/HA |
| Unusual patterns | Varies | Any anomaly the LLM identifies as unusual for your setup |

What Patrol Ignores (by design)

Patrol is intentionally conservative to avoid noise:

  • Small baseline deviations ("CPU at 15% vs typical 10%")
  • Low utilization that's "elevated" but fine (disk at 40%)
  • Stopped VMs/containers that were intentionally stopped
  • Brief spikes that resolve on their own
  • Anything that doesn't require human action

Philosophy: If a finding wouldn't be worth waking someone up at 3am, Patrol won't create it.

Features

  • Interactive chat: Ask questions about current cluster state and get AI-assisted troubleshooting.
  • Patrol: Background checks every 15 minutes (configurable) that generate findings.
  • Alert analysis: Optional token-efficient analysis when alerts fire.
  • Command execution: When enabled, AI can run commands via connected agents.
  • Finding management: Dismiss, resolve, or suppress findings to prevent recurrence.
  • Cost tracking: Tracks token usage and supports monthly budget limits.

Configuration

Configure in the UI: Settings → AI

AI settings are stored encrypted at rest in ai.enc under the Pulse config directory (/etc/pulse for systemd installs, /data for Docker/Kubernetes).

Supported Providers

  • Anthropic (API key or OAuth)
  • OpenAI
  • DeepSeek
  • Google Gemini
  • Ollama (self-hosted, with tool/function calling support)
  • OpenAI-compatible base URL (for providers that implement the OpenAI API shape)

Models

Pulse uses model identifiers in the form provider:model-name (see the example below).

You can set separate models for:

  • Chat (chat_model)
  • Patrol (patrol_model)
  • Auto-fix remediation (auto_fix_model)
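
For illustration, each identifier pairs a provider prefix with that provider's model name. The model names below are placeholders, not a list of what Pulse supports:

```
chat_model:     anthropic:claude-sonnet-4-20250514
patrol_model:   openai:gpt-4o-mini
auto_fix_model: ollama:llama3.1
```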

Testing

  • Test provider connectivity: POST /api/ai/test and POST /api/ai/test/{provider}
  • List available models: GET /api/ai/models (see the example below)
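
As an example, a minimal Go client could call the models endpoint. The port and auth header below are deployment-specific assumptions, not fixed values:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Base URL, port, and auth header are assumptions about your deployment.
	req, err := http.NewRequest("GET", "http://localhost:7655/api/ai/models", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-API-Token", "YOUR_TOKEN") // hypothetical token header

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```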

Patrol Service (Pro Feature)

Patrol runs automated health checks on a configurable schedule (default: every 15 minutes). It passes comprehensive infrastructure context to the LLM (see "Context Patrol Receives" above) and generates findings when issues are detected.

Patrol requires a Pulse Pro license. Free users get full access to the AI Chat assistant (BYOK, bring your own key) and all other monitoring features.

Finding Severity

  • Critical: Immediate attention required (service down, data at risk)
  • Warning: Should be addressed soon (disk filling, backup stale)

Note: info- and watch-level findings are filtered out to reduce noise.

Managing Findings

Findings can be managed via the UI or API:

  • Get help: Chat with AI to troubleshoot the issue
  • Resolve: Mark as fixed (finding will reappear if the issue resurfaces)
  • Dismiss: Mark as expected behavior (creates suppression rule)

Dismissed and resolved findings persist across Pulse restarts.

AI-Assisted Remediation

When chatting with AI about a patrol finding, the AI can:

  • Run diagnostic commands on connected agents
  • Propose fixes with explanations
  • Automatically resolve findings after successful remediation

Safety Controls

Pulse includes settings that control how "active" AI features are:

  • Autonomous mode: When enabled, AI may execute safe commands without approval.
  • Patrol auto-fix: Allows patrol to attempt automatic remediation.
  • Alert-triggered analysis: Limits AI to analyzing specific events when alerts occur.

If you enable execution features, ensure agent tokens and scopes are appropriately restricted.

Troubleshooting

| Issue | Solution |
| --- | --- |
| AI not responding | Verify provider credentials in Settings → AI |
| No execution capability | Confirm at least one agent is connected |
| Findings not persisting | Check that Pulse has write access to the config directory |
| Too many findings | This shouldn't happen; please report it if it does |