mirror of https://github.com/rcourtman/Pulse.git synced 2026-02-18 00:17:39 +01:00

Files

rcourtman fa1b74792e docs: add comprehensive deep-dive documentation for AI subsystems

Adds detailed architecture documentation for Pulse Patrol and Pulse Assistant. Updates AI.md and PULSE_PRO.md. Also includes additional tests.

2026-02-02 10:29:07 +00:00

19 KiB

Raw Blame History

Pulse AI

Pulse Patrol is available to everyone with BYOK (your own AI provider). Pulse Pro unlocks auto-fix and advanced analysis. Learn more at https://pulserelay.pro or see the technical overview in PULSE_PRO.md.

Overview

Pulse includes two AI-powered systems:

Pulse Assistant — An interactive chat interface for ad-hoc troubleshooting, investigations, and infrastructure control.
Pulse Patrol — A scheduled, context-aware analysis service that continuously monitors your infrastructure, learns what's normal, predicts issues, and generates actionable findings.

Both systems are built on the same tool-driven architecture where the LLM acts as a proposer and Go code enforces safety gates.

Not Just Another Chatbot

Pulse Assistant is a protocol-driven, safety-gated agentic system that:

Proactively gathers context — understands resources before you ask (context prefetcher)
Learns within sessions — extracts and caches facts to avoid redundant queries (knowledge accumulator)
Enforces workflow invariants — FSM prevents dangerous state transitions
Supports parallel tool execution — efficient batch operations with concurrency control
Detects and prevents hallucinations — phantom execution detection
Auto-recovers from errors — structured error envelopes enable self-correction

📖 For a deep technical dive into the Assistant architecture, see architecture/pulse-assistant-deep-dive.md.

Not Just Another Alerting System

Pulse Patrol is a multi-layered intelligence platform that:

Learns what's normal for your environment (baseline engine)
Predicts issues before they become critical (pattern detection + forecasting)
Correlates events across your entire infrastructure (root cause analysis)
Remembers past incidents and successful remediations (incident memory)
Investigates issues autonomously when configured (investigation orchestrator)
Verifies fixes and tracks remediation effectiveness (verification loops)

All while running entirely on your infrastructure with BYOK for complete privacy.

📖 For a deep technical dive into the intelligence subsystems, see architecture/pulse-patrol-deep-dive.md.

See architecture/pulse-assistant.md for the original safety architecture documentation.

Pulse Patrol

Patrol is a scheduled analysis pipeline that builds a rich, system-wide snapshot and produces actionable findings.

How Patrol Works

Scheduled/Event Trigger
        │
        ▼
buildSeedContext()  ── infrastructure snapshot
        │
        ▼
LLM analysis (with tools) ← pulse_storage, pulse_metrics, pulse_alerts, etc.
        │
        ▼
DetectSignals() ── deterministic signal detection from tool outputs
        │
        ▼
createFinding() ── validated, deduplicated findings stored
        │
        ▼ (if configured)
MaybeInvestigateFinding() ── automatic investigation + remediation

What Patrol Sees

Every patrol run passes the LLM comprehensive context about your environment:

Data Category	What's Included
Proxmox Nodes	Status, CPU%, memory%, uptime, 24h/7d trend analysis
VMs & Containers	Full metrics, backup status, OCI images, historical trends, anomaly flags
Storage Pools	Usage %, capacity predictions, type (ZFS/LVM/Ceph), growth rates
Docker/Podman	Container counts, health states, unhealthy container lists
Kubernetes	Nodes, pods, deployments, services, DaemonSets, StatefulSets, namespaces
PBS/PMG	Datastore status, backup jobs, job failures, verification status
Ceph	Cluster health, OSD states, PG status
Agent Hosts	Load averages, memory, disk, RAID status, temperatures

Enriched Context

Beyond raw metrics, Patrol enriches the context with intelligence:

Trend analysis — 24h and 7d patterns showing growing, stable, declining, or volatile behavior
Learned baselines — Z-score anomaly detection based on what's normal for your environment
Capacity predictions — "Storage pool will be full in 12 days at current growth rate"
Infrastructure changes — Detected config changes, VM migrations, new deployments
Resource correlations — Pattern detection across related resources
User notes — Your annotations explaining expected behavior
Dismissed findings — Respects your feedback and suppressed alerts
Incident memory — Learns from past investigations and successful remediations

Deterministic Signal Detection

Patrol doesn't rely solely on LLM judgment. It parses tool call outputs and fires deterministic signals for known problems:

Signal Type	Trigger	Default Threshold
`smart_failure`	SMART health status not OK/PASSED	N/A
`high_cpu`	Average CPU usage	70%
`high_memory`	Average memory usage	80%
`high_disk`	Storage pool usage	75% (warning), 95% (critical)
`backup_failed`	Recent backup task with error status	Within 48h
`backup_stale`	No backup completed for VM/CT	48+ hours
`active_alert`	Critical/warning alert in list	N/A

Thresholds can be configured via alert settings to match user-defined values.

Examples of What Patrol Catches

Issue	Severity	Example
Node offline	Critical	Proxmox node not responding
Disk approaching capacity	Warning/Critical	Storage at 85%+, or growing toward full
Backup failures	Warning	PBS job failed, no backup in 48+ hours
Service down	Critical	Docker container crashed, agent offline
High resource usage	Warning	Sustained memory >90%, CPU >85%
Storage issues	Critical	PBS datastore errors, ZFS pool degraded
Ceph problems	Warning/Critical	Degraded OSDs, unhealthy PGs
Kubernetes issues	Warning	Pods stuck in Pending/CrashLoopBackOff
SMART failures	Critical	Disk health check failed

What Patrol Ignores (by design)

Patrol is intentionally conservative to avoid noise:

Small baseline deviations ("CPU at 15% vs typical 10%")
Low utilization that's "elevated" but fine (disk at 40%)
Stopped VMs/containers that were intentionally stopped
Brief spikes that resolve on their own
Anything that doesn't require human action

Philosophy: If a finding wouldn't be worth waking someone up at 3am, Patrol won't create it.

Finding Severity

Critical: Immediate attention required (service down, data at risk)
Warning: Should be addressed soon (disk filling, backup stale)

Note: info and watch level findings are filtered out to reduce noise.

Managing Findings

Findings can be managed via the UI or API:

Get help: Chat with AI to troubleshoot the issue
Resolve: Mark as fixed (finding will reappear if the issue resurfaces)
Dismiss: Mark as expected behavior (creates suppression rule)

Dismissed and resolved findings persist across Pulse restarts.

Autonomy Levels

Patrol supports three autonomy modes that control how much action it can take:

Mode	Behavior	License
Monitor	Detect issues only. No investigation or fixes.	Free (BYOK)
Investigate	Investigates findings and proposes fixes. All fixes require approval before execution.	Free (BYOK)
Auto-fix	Automatically fixes issues and verifies results. Critical findings still require approval by default.	Pro

Investigation Flow

When a finding is created in Investigate or Auto-fix mode:

Finding created
      │
      ▼
MaybeInvestigateFinding()
      │
      ├─ Has orch + chatService?
      │        │
      │        ▼
      │   InvestigateFinding()
      │        │
      │        ▼
      │   Create chat session
      │        │
      │        ▼
      │   AI analysis (with tools)
      │        │
      │        ▼
      │   [Fix proposed?] ──Yes──► Queue approval (or auto-execute in full mode)
      │        │
      │        No
      │        ▼
      │   Update finding with outcome
      │
      └─ Skip investigation

Investigation Configuration

Setting	Default	Description
`MaxTurns`	15	Maximum agentic turns per investigation
`Timeout`	10 min	Maximum duration per investigation
`MaxConcurrent`	3	Maximum concurrent investigations
`MaxAttemptsPerFinding`	3	Maximum investigation attempts per finding
`CooldownDuration`	1 hour	Cooldown before re-investigating
`TimeoutCooldownDuration`	10 min	Shorter cooldown for timeout failures
`VerificationDelay`	30 sec	Wait before verifying fix

Investigation Outcomes

Outcome	Meaning
`resolved`	Issue resolved during investigation
`fix_queued`	Fix proposed, awaiting approval
`fix_executed`	Fix auto-executed successfully
`fix_failed`	Fix attempted but failed
`fix_verified`	Fix worked, issue confirmed resolved
`fix_verification_failed`	Fix ran but issue persists
`needs_attention`	Requires human intervention
`cannot_fix`	Issue cannot be automatically fixed
`timed_out`	Investigation timed out (will retry sooner)

Pulse Assistant (Chat)

Pulse Assistant is a tool-driven chat interface. It does not "guess" system state — it calls live tools and reports their outputs.

The Model's Workflow (Discover → Investigate → Act)

Discover: Uses pulse_query or pulse_discovery to find real resources and IDs
Investigate: Uses pulse_read to run bounded, read-only commands and check status/logs
Act (optional): Uses pulse_control for changes, then verifies with a read

Available Tools

Tool	Classification	Purpose
`pulse_query`, `pulse_discovery`	Resolve	Resource discovery and query
`pulse_read`	Read	Read-only operations: exec, file, find, tail, logs
`pulse_metrics`	Read	Performance metrics
`pulse_storage`	Read	Storage information
`pulse_kubernetes`	Read	Kubernetes cluster info
`pulse_pmg`	Read	Proxmox Mail Gateway stats
`pulse_alerts`	Read/Write	Alert management (resolve/dismiss are writes)
`pulse_docker`	Read/Write	Docker operations (control/update are writes)
`pulse_knowledge`	Read/Write	Knowledge persistence (remember/note/save are writes)
`pulse_file_edit`	Read/Write	File operations (write/append are writes)
`pulse_control`	Write	Guest control, service management
`pulse_patrol`	Read	Patrol findings and status

Safety Gates

The assistant enforces multiple safety gates:

Discovery Before Action — Action tools cannot operate on resources that weren't first discovered
Verification After Write — After any write, the model must perform a read/status check before providing a final answer
Read/Write Separation — Read operations route through pulse_read (stays in READING state); write operations route through pulse_control (enters VERIFYING state)
Phantom Detection — Detects when the model claims execution without tool calls
Approval Mode — In Controlled mode, every write requires explicit user approval
Execution Context Binding — Commands execute within the resolved resource's context, not on parent hosts

Control Levels

Level	Behavior	License
Read-only	AI can observe and query data only	Free
Controlled	AI asks for approval before executing commands	Free
Autonomous	AI executes actions without prompting	Pro

Using Approvals (Controlled Mode)

When control level is Controlled, write actions pause for approval:

Tool returns APPROVAL_REQUIRED: { approval_id, command, ... }
Agentic loop emits approval_needed SSE event
UI shows approval card with the proposed command
Approve to execute and verify, or Deny to cancel
Only users with admin privileges can approve/deny

Configuration

Configure in the UI: Settings → System → AI Assistant

Supported Providers

Anthropic (API key or OAuth)
OpenAI
DeepSeek
Google Gemini
Ollama (self-hosted, with tool/function calling support)
OpenAI-compatible base URL (for providers that implement the OpenAI API shape)

Models

Pulse uses model identifiers in the form: provider:model-name

You can set separate models for:

Chat (chat_model)
Patrol (patrol_model)
Auto-fix remediation (auto_fix_model)

Storage

AI settings are stored encrypted at rest in ai.enc under the Pulse config directory. Related files:

File	Purpose
`ai.enc`	Encrypted AI configuration and credentials
`ai_findings.json`	Patrol findings
`ai_patrol_runs.json`	Patrol run history
`ai_usage_history.json`	Token usage data
`ai_chat_sessions.json`	Legacy chat sessions (UI sync)
`baselines.json`	Learned resource baselines
`ai_correlations.json`	Resource correlation data
`ai_patterns.json`	Detected patterns

Config directory: /etc/pulse (systemd) or /data (Docker/Kubernetes)

Testing

Test provider connectivity: POST /api/ai/test and POST /api/ai/test/{provider}
List available models: GET /api/ai/models

Schedule and Triggers

Patrol runs on a configurable schedule:

Interval	Description
Disabled	Patrol runs only when manually triggered
10 min – 7 days	Configurable interval (default: 6 hours)

Patrol can also be triggered by:

Manual run: Click "Run Patrol" in the UI
Alert-triggered analysis (Pro): Runs when an alert fires
API call: POST /api/ai/patrol/run

AI Intelligence Layer

Pulse includes a unified intelligence system that aggregates data from all AI subsystems:

Components

Component	Purpose
Baseline Engine	Learns normal behavior, detects anomalies via z-score
Pattern Detector	Identifies recurring issues and trends
Correlation Engine	Links related issues across resources
Incident Memory	Tracks past incidents and successful remediations
Knowledge Store	Persists user annotations and learned preferences
Forecast Engine	Predicts capacity issues and resource exhaustion

Health Scoring

Each resource receives a health score (A–F) based on:

Current metrics vs baseline
Active findings and alerts
Recent incidents
Trend direction (improving/stable/declining)

Model Matrix (Pulse Assistant)

This table summarizes the most recent Pulse Assistant eval runs per model.

Update the table from eval reports:

EVAL_REPORT_DIR=tmp/eval-reports go run ./cmd/eval -scenario matrix -auto-models
python3 scripts/eval/render_model_matrix.py tmp/eval-reports --write-doc docs/AI.md

Or use the helper script:

scripts/eval/run_model_matrix.sh

Model	Smoke	Read-only	Time (matrix)	Tokens (matrix)	Last run (UTC)
anthropic:claude-3-haiku-20240307	✅	❌	2m 42s	—	2026-01-29
anthropic:claude-haiku-4-5-20251001	✅	✅	8s	18,923	2026-01-29
anthropic:claude-opus-4-5-20251101	✅	✅	9m 31s	1,120,530	2026-01-29
gemini:gemini-3-flash-preview	✅	✅	7m 4s	—	2026-01-29
gemini:gemini-3-pro-preview	✅	✅	3m 54s	1,914	2026-01-29
openai:gpt-5.2	✅	✅	5s	12,363	2026-01-29
openai:gpt-5.2-chat-latest	✅	✅	8s	12,595	2026-01-29

Safety Controls

Pulse includes settings that control how "active" AI features are:

Autonomous mode (Pro): When enabled, AI may execute safe commands without approval
Patrol auto-fix (Pro): Allows patrol to attempt automatic remediation
Alert-triggered analysis (Pro): Limits AI to analyzing specific events when alerts occur
Full autonomy unlock (Pro): Enables auto-fix for critical findings without approval (requires explicit toggle)

If you enable execution features, ensure agent tokens and scopes are appropriately restricted.

Advanced Network Restrictions

Pulse blocks AI tool HTTP fetches to loopback and link-local addresses by default. For local development:

PULSE_AI_ALLOW_LOOPBACK=true

Use this only in trusted environments.

Privacy

Patrol runs on your server and only sends the minimal context needed for analysis to the configured provider (when AI is enabled). No telemetry is sent to Pulse by default.

Why Patrol Is Different From Traditional Alerts

Alerts are threshold-based and narrow. Patrol is context-based and cross-system.

Alerts: "Disk > 90%"
Patrol: "ZFS pool is 86% but trending +4%/day; projected to hit 95% within a week. Largest consumer is datastore X. Recommend prune or expand."

Cost Tracking

Pulse tracks token usage and costs:

View usage summary: GET /api/ai/cost/summary
Reset counters: POST /api/ai/cost/reset (admin)
Set monthly budget limits in AI settings

Troubleshooting

Issue	Solution
AI not responding	Verify provider credentials in Settings → System → AI Assistant
No execution capability	Confirm at least one agent is connected
Findings not persisting	Check Pulse has write access to `ai_findings.json` in the config directory
Too many findings	This shouldn't happen — please report if it does
Investigation stuck	Check circuit breaker status at `/api/ai/circuit/status`; may auto-reset after cooldown
Model not available	Ensure provider API key is valid and model ID matches provider format

Deep Dives (Recommended for Technical Audiences)

Pulse Assistant Deep Dive — Complete technical breakdown of the agentic architecture: context prefetching, knowledge accumulation, FSM enforcement, parallel execution, phantom detection, auto-recovery
Pulse Patrol Deep Dive — Full intelligence layer documentation: baseline learning, pattern detection, forecasting, correlation analysis, incident memory, investigation orchestration

Reference Documentation

Architecture: Pulse Assistant (Safety Gates) — Detailed FSM states, tool protocol, and invariants
API Reference — Complete API endpoint documentation
Pulse Pro — Pro features and licensing

19 KiB Raw Blame History Unescape Escape