Pulse/docs/AI.md
rcourtman fa1b74792e docs: add comprehensive deep-dive documentation for AI subsystems
Adds detailed architecture documentation for Pulse Patrol and Pulse Assistant. Updates AI.md and PULSE_PRO.md. Also includes additional tests.
2026-02-02 10:29:07 +00:00


Pulse AI

Pulse Patrol is available to everyone with BYOK (your own AI provider). Pulse Pro unlocks auto-fix and advanced analysis. Learn more at https://pulserelay.pro or see the technical overview in PULSE_PRO.md.


Overview

Pulse includes two AI-powered systems:

  1. Pulse Assistant — An interactive chat interface for ad-hoc troubleshooting, investigations, and infrastructure control.
  2. Pulse Patrol — A scheduled, context-aware analysis service that continuously monitors your infrastructure, learns what's normal, predicts issues, and generates actionable findings.

Both systems are built on the same tool-driven architecture where the LLM acts as a proposer and Go code enforces safety gates.

Not Just Another Chatbot

Pulse Assistant is a protocol-driven, safety-gated agentic system that:

  • Proactively gathers context — understands resources before you ask (context prefetcher)
  • Learns within sessions — extracts and caches facts to avoid redundant queries (knowledge accumulator)
  • Enforces workflow invariants — FSM prevents dangerous state transitions
  • Supports parallel tool execution — efficient batch operations with concurrency control
  • Detects and prevents hallucinations — phantom execution detection
  • Auto-recovers from errors — structured error envelopes enable self-correction

📖 For a deep technical dive into the Assistant architecture, see architecture/pulse-assistant-deep-dive.md.

Not Just Another Alerting System

Pulse Patrol is a multi-layered intelligence platform that:

  • Learns what's normal for your environment (baseline engine)
  • Predicts issues before they become critical (pattern detection + forecasting)
  • Correlates events across your entire infrastructure (root cause analysis)
  • Remembers past incidents and successful remediations (incident memory)
  • Investigates issues autonomously when configured (investigation orchestrator)
  • Verifies fixes and tracks remediation effectiveness (verification loops)

All while running entirely on your infrastructure with BYOK for complete privacy.

📖 For a deep technical dive into the intelligence subsystems, see architecture/pulse-patrol-deep-dive.md.

See architecture/pulse-assistant.md for the original safety architecture documentation.


Pulse Patrol

Patrol is a scheduled analysis pipeline that builds a rich, system-wide snapshot and produces actionable findings.

How Patrol Works

Scheduled/Event Trigger
        │
        ▼
buildSeedContext()  ── infrastructure snapshot
        │
        ▼
LLM analysis (with tools) ← pulse_storage, pulse_metrics, pulse_alerts, etc.
        │
        ▼
DetectSignals() ── deterministic signal detection from tool outputs
        │
        ▼
createFinding() ── validated, deduplicated findings stored
        │
        ▼ (if configured)
MaybeInvestigateFinding() ── automatic investigation + remediation

What Patrol Sees

Every patrol run passes the LLM comprehensive context about your environment:

| Data Category | What's Included |
| --- | --- |
| Proxmox Nodes | Status, CPU%, memory%, uptime, 24h/7d trend analysis |
| VMs & Containers | Full metrics, backup status, OCI images, historical trends, anomaly flags |
| Storage Pools | Usage %, capacity predictions, type (ZFS/LVM/Ceph), growth rates |
| Docker/Podman | Container counts, health states, unhealthy container lists |
| Kubernetes | Nodes, pods, deployments, services, DaemonSets, StatefulSets, namespaces |
| PBS/PMG | Datastore status, backup jobs, job failures, verification status |
| Ceph | Cluster health, OSD states, PG status |
| Agent Hosts | Load averages, memory, disk, RAID status, temperatures |

Enriched Context

Beyond raw metrics, Patrol enriches the context with intelligence:

  • Trend analysis — 24h and 7d patterns showing growing, stable, declining, or volatile behavior
  • Learned baselines — Z-score anomaly detection based on what's normal for your environment
  • Capacity predictions — "Storage pool will be full in 12 days at current growth rate"
  • Infrastructure changes — Detected config changes, VM migrations, new deployments
  • Resource correlations — Pattern detection across related resources
  • User notes — Your annotations explaining expected behavior
  • Dismissed findings — Respects your feedback and suppressed alerts
  • Incident memory — Learns from past investigations and successful remediations
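
As a rough illustration of the "learned baselines" item above, the following Go sketch computes a z-score for a new sample against a window of past samples. The function name `zScore`, the use of a population standard deviation, and the sample data are assumptions made for this example; the actual baseline engine may compute and threshold this differently.

```go
package main

import (
	"fmt"
	"math"
)

// zScore reports how many standard deviations value lies from the mean of
// history. It returns 0 when there are fewer than two samples or when the
// history has no variance, since no meaningful score exists in those cases.
// (Hypothetical helper, not Pulse's implementation.)
func zScore(history []float64, value float64) float64 {
	if len(history) < 2 {
		return 0
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))

	var squared float64
	for _, v := range history {
		squared += (v - mean) * (v - mean)
	}
	// Population standard deviation over the observed window.
	std := math.Sqrt(squared / float64(len(history)))
	if std == 0 {
		return 0
	}
	return (value - mean) / std
}

func main() {
	// Illustrative memory-usage samples (percent) for one resource.
	history := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	fmt.Printf("z-score for 9: %.2f\n", zScore(history, 9))
	fmt.Printf("z-score for 5: %.2f\n", zScore(history, 5))
}
```

A larger absolute z-score means the sample is further from what has been normal for that resource; where to draw the anomaly line is a tuning decision this sketch leaves open.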

Deterministic Signal Detection

Patrol doesn't rely solely on LLM judgment. It parses tool call outputs and fires deterministic signals for known problems:

| Signal Type | Trigger | Default Threshold |
| --- | --- | --- |
| smart_failure | SMART health status not OK/PASSED | N/A |
| high_cpu | Average CPU usage | 70% |
| high_memory | Average memory usage | 80% |
| high_disk | Storage pool usage | 75% (warning), 95% (critical) |
| backup_failed | Recent backup task with error status | Within 48h |
| backup_stale | No backup completed for VM/CT | 48+ hours |
| active_alert | Critical/warning alert in list | N/A |

These defaults can be overridden: thresholds follow the values configured in your alert settings.
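
To make the idea of deterministic signals concrete, here is a minimal Go sketch covering the three usage-based rows of the table above. The signal names and default thresholds come from the table; the `Thresholds` and `Signal` types, the function `detectUsageSignals`, the `>=` comparison, and the "warning" severity for CPU and memory are assumptions for illustration, not the real `DetectSignals()` code.

```go
package main

import "fmt"

// Thresholds holds usage limits in percent. Hypothetical type; the default
// values are taken from the signal table in this document.
type Thresholds struct {
	CPU          float64
	Memory       float64
	DiskWarning  float64
	DiskCritical float64
}

// DefaultThresholds returns the documented defaults.
func DefaultThresholds() Thresholds {
	return Thresholds{CPU: 70, Memory: 80, DiskWarning: 75, DiskCritical: 95}
}

// Signal is a deterministic finding candidate derived from tool output.
type Signal struct {
	Type     string
	Severity string
	Value    float64
}

// detectUsageSignals compares averaged metrics against thresholds without
// involving the LLM. Treating a value equal to the threshold as a hit is an
// assumption of this sketch.
func detectUsageSignals(avgCPU, avgMem, poolUsage float64, t Thresholds) []Signal {
	var out []Signal
	if avgCPU >= t.CPU {
		out = append(out, Signal{Type: "high_cpu", Severity: "warning", Value: avgCPU})
	}
	if avgMem >= t.Memory {
		out = append(out, Signal{Type: "high_memory", Severity: "warning", Value: avgMem})
	}
	switch {
	case poolUsage >= t.DiskCritical:
		out = append(out, Signal{Type: "high_disk", Severity: "critical", Value: poolUsage})
	case poolUsage >= t.DiskWarning:
		out = append(out, Signal{Type: "high_disk", Severity: "warning", Value: poolUsage})
	}
	return out
}

func main() {
	for _, s := range detectUsageSignals(50, 85, 96, DefaultThresholds()) {
		fmt.Printf("%s (%s) at %.0f%%\n", s.Type, s.Severity, s.Value)
	}
}
```

Because the checks are plain comparisons, the same tool output always yields the same signals, regardless of how the model words its analysis.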

Examples of What Patrol Catches

| Issue | Severity | Example |
| --- | --- | --- |
| Node offline | Critical | Proxmox node not responding |
| Disk approaching capacity | Warning/Critical | Storage at 85%+, or growing toward full |
| Backup failures | Warning | PBS job failed, no backup in 48+ hours |
| Service down | Critical | Docker container crashed, agent offline |
| High resource usage | Warning | Sustained memory >90%, CPU >85% |
| Storage issues | Critical | PBS datastore errors, ZFS pool degraded |
| Ceph problems | Warning/Critical | Degraded OSDs, unhealthy PGs |
| Kubernetes issues | Warning | Pods stuck in Pending/CrashLoopBackOff |
| SMART failures | Critical | Disk health check failed |

What Patrol Ignores (by design)

Patrol is intentionally conservative to avoid noise:

  • Small baseline deviations ("CPU at 15% vs typical 10%")
  • Low utilization that's "elevated" but fine (disk at 40%)
  • Stopped VMs/containers that were intentionally stopped
  • Brief spikes that resolve on their own
  • Anything that doesn't require human action

Philosophy: If a finding wouldn't be worth waking someone up at 3am, Patrol won't create it.

Finding Severity

  • Critical: Immediate attention required (service down, data at risk)
  • Warning: Should be addressed soon (disk filling, backup stale)

Note: info and watch level findings are filtered out to reduce noise.

Managing Findings

Findings can be managed via the UI or API:

  • Get help: Chat with AI to troubleshoot the issue
  • Resolve: Mark as fixed (finding will reappear if the issue resurfaces)
  • Dismiss: Mark as expected behavior (creates suppression rule)

Dismissed and resolved findings persist across Pulse restarts.


Autonomy Levels

Patrol supports three autonomy modes that control how much action it can take:

| Mode | Behavior | License |
| --- | --- | --- |
| Monitor | Detect issues only. No investigation or fixes. | Free (BYOK) |
| Investigate | Investigates findings and proposes fixes. All fixes require approval before execution. | Free (BYOK) |
| Auto-fix | Automatically fixes issues and verifies results. Critical findings still require approval by default. | Pro |
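
The table above can be read as a small decision rule. The Go sketch below encodes it, including the "Full autonomy unlock" toggle described under Safety Controls. The type names, the string values, and the function `decideFix` are hypothetical and only mirror the documented behavior; they are not taken from the Pulse source.

```go
package main

import "fmt"

// Mode is a Patrol autonomy mode (names follow the table; values are illustrative).
type Mode string

const (
	Monitor     Mode = "monitor"
	Investigate Mode = "investigate"
	AutoFix     Mode = "auto-fix"
)

// Decision describes what may happen to a proposed fix.
type Decision string

const (
	NoFix         Decision = "no_fix"
	NeedsApproval Decision = "needs_approval"
	AutoExecute   Decision = "auto_execute"
)

// decideFix applies the documented rules:
//   - Monitor never investigates or fixes.
//   - Investigate always queues fixes for approval.
//   - Auto-fix executes automatically, except that critical findings still
//     need approval unless full autonomy has been explicitly unlocked.
func decideFix(mode Mode, severity string, fullAutonomy bool) Decision {
	switch mode {
	case Investigate:
		return NeedsApproval
	case AutoFix:
		if severity == "critical" && !fullAutonomy {
			return NeedsApproval
		}
		return AutoExecute
	default:
		return NoFix
	}
}

func main() {
	fmt.Println(decideFix(Monitor, "critical", false))
	fmt.Println(decideFix(Investigate, "warning", false))
	fmt.Println(decideFix(AutoFix, "warning", false))
	fmt.Println(decideFix(AutoFix, "critical", false))
	fmt.Println(decideFix(AutoFix, "critical", true))
}
```

Defaulting unknown modes to `NoFix` keeps the sketch fail-safe: anything unrecognized behaves like Monitor.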

Investigation Flow

When a finding is created in Investigate or Auto-fix mode:

Finding created
      │
      ▼
MaybeInvestigateFinding()
      │
      ├─ Has orch + chatService?
      │        │
      │        ▼
      │   InvestigateFinding()
      │        │
      │        ▼
      │   Create chat session
      │        │
      │        ▼
      │   AI analysis (with tools)
      │        │
      │        ▼
      │   [Fix proposed?] ──Yes──► Queue approval (or auto-execute in full mode)
      │        │
      │        No
      │        ▼
      │   Update finding with outcome
      │
      └─ Skip investigation

Investigation Configuration

| Setting | Default | Description |
| --- | --- | --- |
| MaxTurns | 15 | Maximum agentic turns per investigation |
| Timeout | 10 min | Maximum duration per investigation |
| MaxConcurrent | 3 | Maximum concurrent investigations |
| MaxAttemptsPerFinding | 3 | Maximum investigation attempts per finding |
| CooldownDuration | 1 hour | Cooldown before re-investigating |
| TimeoutCooldownDuration | 10 min | Shorter cooldown for timeout failures |
| VerificationDelay | 30 sec | Wait before verifying fix |
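
For readers who think in code, the settings above could be modeled roughly as follows. The field names and default values come from the table; the struct layout, the helper methods `cooldownFor` and `canRetry`, and the exact retry rule are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// InvestigationConfig mirrors the settings table. Hypothetical struct.
type InvestigationConfig struct {
	MaxTurns                int
	Timeout                 time.Duration
	MaxConcurrent           int
	MaxAttemptsPerFinding   int
	CooldownDuration        time.Duration
	TimeoutCooldownDuration time.Duration
	VerificationDelay       time.Duration
}

// DefaultInvestigationConfig returns the documented defaults.
func DefaultInvestigationConfig() InvestigationConfig {
	return InvestigationConfig{
		MaxTurns:                15,
		Timeout:                 10 * time.Minute,
		MaxConcurrent:           3,
		MaxAttemptsPerFinding:   3,
		CooldownDuration:        time.Hour,
		TimeoutCooldownDuration: 10 * time.Minute,
		VerificationDelay:       30 * time.Second,
	}
}

// cooldownFor picks the shorter cooldown when the previous attempt ended
// with the timed_out outcome, and the regular cooldown otherwise.
func (c InvestigationConfig) cooldownFor(lastOutcome string) time.Duration {
	if lastOutcome == "timed_out" {
		return c.TimeoutCooldownDuration
	}
	return c.CooldownDuration
}

// canRetry reports whether another attempt is allowed: the attempt budget
// must not be exhausted and the applicable cooldown must have elapsed.
// (Assumed rule; the real orchestrator may apply further conditions.)
func (c InvestigationConfig) canRetry(attempts int, lastOutcome string, sinceLast time.Duration) bool {
	return attempts < c.MaxAttemptsPerFinding && sinceLast >= c.cooldownFor(lastOutcome)
}

func main() {
	cfg := DefaultInvestigationConfig()
	fmt.Println(cfg.canRetry(1, "timed_out", 15*time.Minute))
	fmt.Println(cfg.canRetry(1, "fix_failed", 15*time.Minute))
	fmt.Println(cfg.canRetry(3, "fix_failed", 2*time.Hour))
}
```

The separate timeout cooldown is what lets a timed-out investigation be retried sooner than one that failed for other reasons.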

Investigation Outcomes

| Outcome | Meaning |
| --- | --- |
| resolved | Issue resolved during investigation |
| fix_queued | Fix proposed, awaiting approval |
| fix_executed | Fix auto-executed successfully |
| fix_failed | Fix attempted but failed |
| fix_verified | Fix worked, issue confirmed resolved |
| fix_verification_failed | Fix ran but issue persists |
| needs_attention | Requires human intervention |
| cannot_fix | Issue cannot be automatically fixed |
| timed_out | Investigation timed out (will retry sooner) |

Pulse Assistant (Chat)

Pulse Assistant is a tool-driven chat interface. It does not "guess" system state — it calls live tools and reports their outputs.

The Model's Workflow (Discover → Investigate → Act)

  1. Discover: Uses pulse_query or pulse_discovery to find real resources and IDs
  2. Investigate: Uses pulse_read to run bounded, read-only commands and check status/logs
  3. Act (optional): Uses pulse_control for changes, then verifies with a read

Available Tools

| Tool | Classification | Purpose |
| --- | --- | --- |
| pulse_query, pulse_discovery | Resolve | Resource discovery and query |
| pulse_read | Read | Read-only operations: exec, file, find, tail, logs |
| pulse_metrics | Read | Performance metrics |
| pulse_storage | Read | Storage information |
| pulse_kubernetes | Read | Kubernetes cluster info |
| pulse_pmg | Read | Proxmox Mail Gateway stats |
| pulse_alerts | Read/Write | Alert management (resolve/dismiss are writes) |
| pulse_docker | Read/Write | Docker operations (control/update are writes) |
| pulse_knowledge | Read/Write | Knowledge persistence (remember/note/save are writes) |
| pulse_file_edit | Read/Write | File operations (write/append are writes) |
| pulse_control | Write | Guest control, service management |
| pulse_patrol | Read | Patrol findings and status |

Safety Gates

The assistant enforces multiple safety gates:

  1. Discovery Before Action — Action tools cannot operate on resources that weren't first discovered
  2. Verification After Write — After any write, the model must perform a read/status check before providing a final answer
  3. Read/Write Separation — Read operations route through pulse_read (stays in READING state); write operations route through pulse_control (enters VERIFYING state)
  4. Phantom Detection — Detects when the model claims execution without tool calls
  5. Approval Mode — In Controlled mode, every write requires explicit user approval
  6. Execution Context Binding — Commands execute within the resolved resource's context, not on parent hosts
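
Gates 1 to 3 above amount to a small state machine. The Go sketch below shows one way such a machine could look. The state names READING and VERIFYING appear in this document; the `Session` type, its methods, and the resource ID are hypothetical, and the real FSM likely tracks more states and conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// State is the session's workflow state.
type State string

const (
	Reading   State = "READING"
	Verifying State = "VERIFYING"
)

// ToolClass follows the Classification column of the tools table.
type ToolClass int

const (
	Resolve ToolClass = iota
	Read
	Write
)

// Session tracks discovered resources and the current state.
// Hypothetical type for illustration.
type Session struct {
	state      State
	discovered map[string]bool
}

// NewSession starts in READING with nothing discovered.
func NewSession() *Session {
	return &Session{state: Reading, discovered: map[string]bool{}}
}

// Call records a tool call and enforces the gates:
// a write on an undiscovered resource is rejected (gate 1), a successful
// write moves to VERIFYING (gate 3), and a read returns to READING.
func (s *Session) Call(class ToolClass, resourceID string) error {
	switch class {
	case Resolve:
		s.discovered[resourceID] = true
	case Read:
		s.state = Reading
	case Write:
		if !s.discovered[resourceID] {
			return errors.New("write blocked: resource was not discovered first")
		}
		s.state = Verifying
	}
	return nil
}

// CanFinalize is false while a write is still unverified (gate 2).
func (s *Session) CanFinalize() bool { return s.state != Verifying }

func main() {
	s := NewSession()
	fmt.Println("write before discovery:", s.Call(Write, "vm-101"))
	_ = s.Call(Resolve, "vm-101")
	fmt.Println("write after discovery:", s.Call(Write, "vm-101"))
	fmt.Println("can finalize right after write:", s.CanFinalize())
	_ = s.Call(Read, "vm-101")
	fmt.Println("can finalize after verifying read:", s.CanFinalize())
}
```

The point of enforcing this in code rather than in the prompt is that the model cannot talk its way past the gate: a final answer is simply refused until a read has followed the write.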

Control Levels

| Level | Behavior | License |
| --- | --- | --- |
| Read-only | AI can observe and query data only | Free |
| Controlled | AI asks for approval before executing commands | Free |
| Autonomous | AI executes actions without prompting | Pro |

Using Approvals (Controlled Mode)

When control level is Controlled, write actions pause for approval:

  1. Tool returns APPROVAL_REQUIRED: { approval_id, command, ... }
  2. Agentic loop emits approval_needed SSE event
  3. UI shows approval card with the proposed command
  4. Approve to execute and verify, or Deny to cancel
  5. Only users with admin privileges can approve/deny
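
As a sketch of step 1, the snippet below shows how a client or test harness might recognize an approval response in a tool result. The `APPROVAL_REQUIRED:` marker and the `approval_id` and `command` fields are taken from the step above; the assumption that the payload is JSON directly following the marker, and the names `ApprovalRequest` and `parseApproval`, are illustrative guesses rather than the documented wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// approvalPrefix is the marker named in the approval flow.
const approvalPrefix = "APPROVAL_REQUIRED:"

// ApprovalRequest holds the two fields this document names. Any additional
// fields in the payload are ignored by encoding/json.
type ApprovalRequest struct {
	ApprovalID string `json:"approval_id"`
	Command    string `json:"command"`
}

// parseApproval returns (request, true, nil) for a well-formed approval
// response, (nil, false, nil) when the output is an ordinary tool result,
// and an error when the marker is present but the payload is not valid JSON.
// Assumes the payload is a JSON object after the marker.
func parseApproval(toolOutput string) (*ApprovalRequest, bool, error) {
	if !strings.HasPrefix(toolOutput, approvalPrefix) {
		return nil, false, nil
	}
	payload := strings.TrimSpace(strings.TrimPrefix(toolOutput, approvalPrefix))
	var req ApprovalRequest
	if err := json.Unmarshal([]byte(payload), &req); err != nil {
		return nil, true, fmt.Errorf("malformed approval payload: %w", err)
	}
	return &req, true, nil
}

func main() {
	out := "APPROVAL_REQUIRED: {\"approval_id\":\"a1\",\"command\":\"uptime\",\"extra\":1}"
	req, isApproval, err := parseApproval(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if isApproval {
		fmt.Printf("approval %s wants to run %q\n", req.ApprovalID, req.Command)
	}
}
```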

Configuration

Configure in the UI: Settings → System → AI Assistant

Supported Providers

  • Anthropic (API key or OAuth)
  • OpenAI
  • DeepSeek
  • Google Gemini
  • Ollama (self-hosted, with tool/function calling support)
  • OpenAI-compatible base URL (for providers that implement the OpenAI API shape)

Models

Pulse uses model identifiers in the form: provider:model-name

You can set separate models for:

  • Chat (chat_model)
  • Patrol (patrol_model)
  • Auto-fix remediation (auto_fix_model)
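
Splitting a `provider:model-name` identifier is straightforward; the sketch below does it with `strings.Cut` from the Go standard library. The function name `parseModelID` and the example identifiers are assumptions for illustration. Splitting on the first colon only is a deliberate choice here, since some model names (for example Ollama tags) may themselves contain a colon.

```go
package main

import (
	"fmt"
	"strings"
)

// parseModelID splits "provider:model-name" at the first colon.
// Both parts must be non-empty. Hypothetical helper, not Pulse's parser.
func parseModelID(id string) (string, string, error) {
	provider, model, found := strings.Cut(id, ":")
	if !found || provider == "" || model == "" {
		return "", "", fmt.Errorf("model ID %q is not in provider:model-name form", id)
	}
	return provider, model, nil
}

func main() {
	for _, id := range []string{"openai:gpt-5.2", "ollama:llama3:8b", "gpt-5.2"} {
		provider, model, err := parseModelID(id)
		if err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Printf("provider=%s model=%s\n", provider, model)
	}
}
```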

Storage

AI settings are stored encrypted at rest in ai.enc under the Pulse config directory. Related files:

| File | Purpose |
| --- | --- |
| ai.enc | Encrypted AI configuration and credentials |
| ai_findings.json | Patrol findings |
| ai_patrol_runs.json | Patrol run history |
| ai_usage_history.json | Token usage data |
| ai_chat_sessions.json | Legacy chat sessions (UI sync) |
| baselines.json | Learned resource baselines |
| ai_correlations.json | Resource correlation data |
| ai_patterns.json | Detected patterns |

Config directory: /etc/pulse (systemd) or /data (Docker/Kubernetes)

Testing

  • Test provider connectivity: POST /api/ai/test and POST /api/ai/test/{provider}
  • List available models: GET /api/ai/models

Schedule and Triggers

Patrol runs on a configurable schedule:

| Interval | Description |
| --- | --- |
| Disabled | Patrol runs only when manually triggered |
| 10 min to 7 days | Configurable interval (default: 6 hours) |

Patrol can also be triggered by:

  • Manual run: Click "Run Patrol" in the UI
  • Alert-triggered analysis (Pro): Runs when an alert fires
  • API call: POST /api/ai/patrol/run
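
For the API trigger, a Go client could build the request as sketched below. Only the method and path (`POST /api/ai/patrol/run`) come from this document; the base URL is a placeholder, the helper name `newPatrolRunRequest` is hypothetical, and authentication is deliberately left out because it depends on how your instance is configured. The sketch builds the request without sending it.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newPatrolRunRequest builds (but does not send) the request that triggers
// a patrol run. Add whatever authentication your instance requires before
// passing it to an http.Client.
func newPatrolRunRequest(baseURL string) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/api/ai/patrol/run"
	return http.NewRequest(http.MethodPost, url, nil)
}

func main() {
	// Placeholder host and port; substitute your own Pulse address.
	req, err := newPatrolRunRequest("http://pulse.example:7655/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// To actually trigger the run: resp, err := http.DefaultClient.Do(req)
}
```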

AI Intelligence Layer

Pulse includes a unified intelligence system that aggregates data from all AI subsystems:

Components

| Component | Purpose |
| --- | --- |
| Baseline Engine | Learns normal behavior, detects anomalies via z-score |
| Pattern Detector | Identifies recurring issues and trends |
| Correlation Engine | Links related issues across resources |
| Incident Memory | Tracks past incidents and successful remediations |
| Knowledge Store | Persists user annotations and learned preferences |
| Forecast Engine | Predicts capacity issues and resource exhaustion |

Health Scoring

Each resource receives a health score (A-F) based on:

  • Current metrics vs baseline
  • Active findings and alerts
  • Recent incidents
  • Trend direction (improving/stable/declining)

Model Matrix (Pulse Assistant)

This table summarizes the most recent Pulse Assistant eval runs per model.

Update the table from eval reports:

EVAL_REPORT_DIR=tmp/eval-reports go run ./cmd/eval -scenario matrix -auto-models
python3 scripts/eval/render_model_matrix.py tmp/eval-reports --write-doc docs/AI.md

Or use the helper script:

scripts/eval/run_model_matrix.sh

| Model | Smoke | Read-only | Time (matrix) | Tokens (matrix) | Last run (UTC) |
| --- | --- | --- | --- | --- | --- |
| anthropic:claude-3-haiku-20240307 | | | 2m 42s | | 2026-01-29 |
| anthropic:claude-haiku-4-5-20251001 | | | 8s | 18,923 | 2026-01-29 |
| anthropic:claude-opus-4-5-20251101 | | | 9m 31s | 1,120,530 | 2026-01-29 |
| gemini:gemini-3-flash-preview | | | 7m 4s | | 2026-01-29 |
| gemini:gemini-3-pro-preview | | | 3m 54s | 1,914 | 2026-01-29 |
| openai:gpt-5.2 | | | 5s | 12,363 | 2026-01-29 |
| openai:gpt-5.2-chat-latest | | | 8s | 12,595 | 2026-01-29 |

Safety Controls

Pulse includes settings that control how "active" AI features are:

  • Autonomous mode (Pro): When enabled, AI may execute safe commands without approval
  • Patrol auto-fix (Pro): Allows patrol to attempt automatic remediation
  • Alert-triggered analysis (Pro): Limits AI to analyzing specific events when alerts occur
  • Full autonomy unlock (Pro): Enables auto-fix for critical findings without approval (requires explicit toggle)

If you enable execution features, ensure agent tokens and scopes are appropriately restricted.

Advanced Network Restrictions

Pulse blocks AI tool HTTP fetches to loopback and link-local addresses by default. For local development:

  • PULSE_AI_ALLOW_LOOPBACK=true

Use this only in trusted environments.
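
The kind of check described above can be expressed with the standard `net` package. The following Go sketch is an approximation under stated assumptions: the function name `isBlockedAddr` is hypothetical, the environment variable is assumed to lift only the loopback restriction (not the link-local one), and unparseable input is treated as blocked to fail closed. The real implementation may also resolve hostnames and cover further ranges.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// isBlockedAddr reports whether an IP literal should be refused as a fetch
// target. Loopback is blocked unless allowLoopback is set; link-local
// addresses are always blocked; anything that is not a valid IP literal is
// blocked as well (fail closed).
func isBlockedAddr(addr string, allowLoopback bool) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return true
	}
	if ip.IsLoopback() {
		return !allowLoopback
	}
	return ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast()
}

func main() {
	// Mirrors the documented opt-in for local development.
	allow := os.Getenv("PULSE_AI_ALLOW_LOOPBACK") == "true"
	for _, addr := range []string{"127.0.0.1", "::1", "169.254.169.254", "8.8.8.8"} {
		fmt.Printf("%-16s blocked=%v\n", addr, isBlockedAddr(addr, allow))
	}
}
```

Blocking link-local targets matters in cloud environments, where addresses in 169.254.0.0/16 are commonly used for instance metadata services.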


Privacy

Patrol runs on your server and only sends the minimal context needed for analysis to the configured provider (when AI is enabled). No telemetry is sent to Pulse by default.


Why Patrol Is Different From Traditional Alerts

Alerts are threshold-based and narrow. Patrol is context-based and cross-system.

  • Alerts: "Disk > 90%"
  • Patrol: "ZFS pool is 86% but trending +4%/day; projected to hit 95% within a week. Largest consumer is datastore X. Recommend prune or expand."
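
The projection in the Patrol example is simple arithmetic: with usage at 86% and growth of 4 percentage points per day, reaching 95% takes (95 - 86) / 4 = 2.25 days, which is comfortably "within a week". The Go sketch below shows that linear extrapolation; the function name `daysUntil` is hypothetical, and a real forecast engine may fit a trend over many samples rather than use a single growth rate.

```go
package main

import "fmt"

// daysUntil linearly extrapolates how many days remain until usage reaches
// targetPct, given the current usage and a constant daily growth (both in
// percentage points). The boolean is false when usage is flat or shrinking,
// in which case the target is never reached under this model.
func daysUntil(currentPct, targetPct, growthPerDay float64) (float64, bool) {
	if growthPerDay <= 0 {
		return 0, false
	}
	if currentPct >= targetPct {
		return 0, true
	}
	return (targetPct - currentPct) / growthPerDay, true
}

func main() {
	if days, ok := daysUntil(86, 95, 4); ok {
		fmt.Printf("projected to reach 95%% in %.2f days\n", days)
	}
	if _, ok := daysUntil(86, 95, 0); !ok {
		fmt.Println("no growth: target not projected to be reached")
	}
}
```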

Cost Tracking

Pulse tracks token usage and costs:

  • View usage summary: GET /api/ai/cost/summary
  • Reset counters: POST /api/ai/cost/reset (admin)
  • Set monthly budget limits in AI settings

Troubleshooting

| Issue | Solution |
| --- | --- |
| AI not responding | Verify provider credentials in Settings → System → AI Assistant |
| No execution capability | Confirm at least one agent is connected |
| Findings not persisting | Check Pulse has write access to ai_findings.json in the config directory |
| Too many findings | This shouldn't happen; please report it if it does |
| Investigation stuck | Check circuit breaker status at /api/ai/circuit/status; may auto-reset after cooldown |
| Model not available | Ensure provider API key is valid and model ID matches provider format |

  • Pulse Assistant Deep Dive: Complete technical breakdown of the agentic architecture: context prefetching, knowledge accumulation, FSM enforcement, parallel execution, phantom detection, auto-recovery
  • Pulse Patrol Deep Dive: Full intelligence layer documentation: baseline learning, pattern detection, forecasting, correlation analysis, incident memory, investigation orchestration

Reference Documentation