mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
Adds detailed architecture documentation for Pulse Patrol and Pulse Assistant. Updates AI.md and PULSE_PRO.md. Also includes additional tests.
473 lines
19 KiB
Markdown
473 lines
19 KiB
Markdown
# Pulse AI
|
||
|
||
Pulse Patrol is available to everyone with BYOK (your own AI provider). Pulse Pro unlocks auto-fix and advanced analysis. Learn more at <https://pulserelay.pro> or see the technical overview in [PULSE_PRO.md](PULSE_PRO.md).
|
||
|
||
---
|
||
|
||
## Overview
|
||
|
||
Pulse includes two AI-powered systems:
|
||
|
||
1. **Pulse Assistant** — An interactive chat interface for ad-hoc troubleshooting, investigations, and infrastructure control.
|
||
2. **Pulse Patrol** — A scheduled, context-aware analysis service that continuously monitors your infrastructure, learns what's normal, predicts issues, and generates actionable findings.
|
||
|
||
Both systems are built on the same tool-driven architecture where the LLM acts as a proposer and Go code enforces safety gates.
|
||
|
||
### Not Just Another Chatbot
|
||
|
||
Pulse Assistant is a **protocol-driven, safety-gated agentic system** that:
|
||
|
||
- **Proactively gathers context** — understands resources before you ask (context prefetcher)
|
||
- **Learns within sessions** — extracts and caches facts to avoid redundant queries (knowledge accumulator)
|
||
- **Enforces workflow invariants** — FSM prevents dangerous state transitions
|
||
- **Supports parallel tool execution** — efficient batch operations with concurrency control
|
||
- **Detects and prevents hallucinations** — phantom execution detection
|
||
- **Auto-recovers from errors** — structured error envelopes enable self-correction
|
||
|
||
📖 **For a deep technical dive into the Assistant architecture, see [architecture/pulse-assistant-deep-dive.md](architecture/pulse-assistant-deep-dive.md).**
|
||
|
||
### Not Just Another Alerting System
|
||
|
||
Pulse Patrol is a **multi-layered intelligence platform** that:
|
||
|
||
- **Learns** what's normal for your environment (baseline engine)
|
||
- **Predicts** issues before they become critical (pattern detection + forecasting)
|
||
- **Correlates** events across your entire infrastructure (root cause analysis)
|
||
- **Remembers** past incidents and successful remediations (incident memory)
|
||
- **Investigates** issues autonomously when configured (investigation orchestrator)
|
||
- **Verifies** fixes and tracks remediation effectiveness (verification loops)
|
||
|
||
All while running entirely on your infrastructure with BYOK for complete privacy.
|
||
|
||
📖 **For a deep technical dive into the intelligence subsystems, see [architecture/pulse-patrol-deep-dive.md](architecture/pulse-patrol-deep-dive.md).**
|
||
|
||
See [architecture/pulse-assistant.md](architecture/pulse-assistant.md) for the original safety architecture documentation.
|
||
|
||
---
|
||
|
||
## Pulse Patrol
|
||
|
||
Patrol is a scheduled analysis pipeline that builds a rich, system-wide snapshot and produces actionable findings.
|
||
|
||
### How Patrol Works
|
||
|
||
```
|
||
Scheduled/Event Trigger
|
||
│
|
||
▼
|
||
buildSeedContext() ── infrastructure snapshot
|
||
│
|
||
▼
|
||
LLM analysis (with tools) ← pulse_storage, pulse_metrics, pulse_alerts, etc.
|
||
│
|
||
▼
|
||
DetectSignals() ── deterministic signal detection from tool outputs
|
||
│
|
||
▼
|
||
createFinding() ── validated, deduplicated findings stored
|
||
│
|
||
▼ (if configured)
|
||
MaybeInvestigateFinding() ── automatic investigation + remediation
|
||
```
|
||
|
||
### What Patrol Sees
|
||
|
||
Every patrol run passes the LLM comprehensive context about your environment:
|
||
|
||
| Data Category | What's Included |
|
||
|---------------|-----------------|
|
||
| **Proxmox Nodes** | Status, CPU%, memory%, uptime, 24h/7d trend analysis |
|
||
| **VMs & Containers** | Full metrics, backup status, OCI images, historical trends, anomaly flags |
|
||
| **Storage Pools** | Usage %, capacity predictions, type (ZFS/LVM/Ceph), growth rates |
|
||
| **Docker/Podman** | Container counts, health states, unhealthy container lists |
|
||
| **Kubernetes** | Nodes, pods, deployments, services, DaemonSets, StatefulSets, namespaces |
|
||
| **PBS/PMG** | Datastore status, backup jobs, job failures, verification status |
|
||
| **Ceph** | Cluster health, OSD states, PG status |
|
||
| **Agent Hosts** | Load averages, memory, disk, RAID status, temperatures |
|
||
|
||
### Enriched Context
|
||
|
||
Beyond raw metrics, Patrol enriches the context with intelligence:
|
||
|
||
- **Trend analysis** — 24h and 7d patterns showing `growing`, `stable`, `declining`, or `volatile` behavior
|
||
- **Learned baselines** — Z-score anomaly detection based on what's *normal for your environment*
|
||
- **Capacity predictions** — "Storage pool will be full in 12 days at current growth rate"
|
||
- **Infrastructure changes** — Detected config changes, VM migrations, new deployments
|
||
- **Resource correlations** — Pattern detection across related resources
|
||
- **User notes** — Your annotations explaining expected behavior
|
||
- **Dismissed findings** — Respects your feedback and suppressed alerts
|
||
- **Incident memory** — Learns from past investigations and successful remediations
|
||
|
||
### Deterministic Signal Detection
|
||
|
||
Patrol doesn't rely solely on LLM judgment. It parses tool call outputs and fires deterministic signals for known problems:
|
||
|
||
| Signal Type | Trigger | Default Threshold |
|
||
|------------|---------|-------------------|
|
||
| `smart_failure` | SMART health status not OK/PASSED | N/A |
|
||
| `high_cpu` | Average CPU usage | 70% |
|
||
| `high_memory` | Average memory usage | 80% |
|
||
| `high_disk` | Storage pool usage | 75% (warning), 95% (critical) |
|
||
| `backup_failed` | Recent backup task with error status | Within 48h |
|
||
| `backup_stale` | No backup completed for VM/CT | 48+ hours |
|
||
| `active_alert` | Critical/warning alert in list | N/A |
|
||
|
||
Thresholds can be configured via alert settings to match user-defined values.
|
||
|
||
### Examples of What Patrol Catches
|
||
|
||
| Issue | Severity | Example |
|
||
|-------|----------|---------|
|
||
| **Node offline** | Critical | Proxmox node not responding |
|
||
| **Disk approaching capacity** | Warning/Critical | Storage at 85%+, or growing toward full |
|
||
| **Backup failures** | Warning | PBS job failed, no backup in 48+ hours |
|
||
| **Service down** | Critical | Docker container crashed, agent offline |
|
||
| **High resource usage** | Warning | Sustained memory >90%, CPU >85% |
|
||
| **Storage issues** | Critical | PBS datastore errors, ZFS pool degraded |
|
||
| **Ceph problems** | Warning/Critical | Degraded OSDs, unhealthy PGs |
|
||
| **Kubernetes issues** | Warning | Pods stuck in Pending/CrashLoopBackOff |
|
||
| **SMART failures** | Critical | Disk health check failed |
|
||
|
||
### What Patrol Ignores (by design)
|
||
|
||
Patrol is **intentionally conservative** to avoid noise:
|
||
|
||
- Small baseline deviations ("CPU at 15% vs typical 10%")
|
||
- Low utilization that's "elevated" but fine (disk at 40%)
|
||
- Stopped VMs/containers that were intentionally stopped
|
||
- Brief spikes that resolve on their own
|
||
- Anything that doesn't require human action
|
||
|
||
> **Philosophy**: If a finding wouldn't be worth waking someone up at 3am, Patrol won't create it.
|
||
|
||
### Finding Severity
|
||
|
||
- **Critical**: Immediate attention required (service down, data at risk)
|
||
- **Warning**: Should be addressed soon (disk filling, backup stale)
|
||
|
||
Note: `info` and `watch` level findings are filtered out to reduce noise.
|
||
|
||
### Managing Findings
|
||
|
||
Findings can be managed via the UI or API:
|
||
|
||
- **Get help**: Chat with AI to troubleshoot the issue
|
||
- **Resolve**: Mark as fixed (finding will reappear if the issue resurfaces)
|
||
- **Dismiss**: Mark as expected behavior (creates suppression rule)
|
||
|
||
Dismissed and resolved findings persist across Pulse restarts.
|
||
|
||
---
|
||
|
||
## Autonomy Levels
|
||
|
||
Patrol supports three autonomy modes that control how much action it can take:
|
||
|
||
| Mode | Behavior | License |
|
||
|------|----------|---------|
|
||
| **Monitor** | Detect issues only. No investigation or fixes. | Free (BYOK) |
|
||
| **Investigate** | Investigates findings and proposes fixes. All fixes require approval before execution. | Free (BYOK) |
|
||
| **Auto-fix** | Automatically fixes issues and verifies results. Critical findings still require approval by default. | Pro |
|
||
|
||
### Investigation Flow
|
||
|
||
When a finding is created in Investigate or Auto-fix mode:
|
||
|
||
```
|
||
Finding created
|
||
│
|
||
▼
|
||
MaybeInvestigateFinding()
|
||
│
|
||
├─ Has orch + chatService?
|
||
│ │
|
||
│ ▼
|
||
│ InvestigateFinding()
|
||
│ │
|
||
│ ▼
|
||
│ Create chat session
|
||
│ │
|
||
│ ▼
|
||
│ AI analysis (with tools)
|
||
│ │
|
||
│ ▼
|
||
│ [Fix proposed?] ──Yes──► Queue approval (or auto-execute in full mode)
|
||
│ │
|
||
│ No
|
||
│ ▼
|
||
│ Update finding with outcome
|
||
│
|
||
└─ Skip investigation
|
||
```
|
||
|
||
### Investigation Configuration
|
||
|
||
| Setting | Default | Description |
|
||
|---------|---------|-------------|
|
||
| `MaxTurns` | 15 | Maximum agentic turns per investigation |
|
||
| `Timeout` | 10 min | Maximum duration per investigation |
|
||
| `MaxConcurrent` | 3 | Maximum concurrent investigations |
|
||
| `MaxAttemptsPerFinding` | 3 | Maximum investigation attempts per finding |
|
||
| `CooldownDuration` | 1 hour | Cooldown before re-investigating |
|
||
| `TimeoutCooldownDuration` | 10 min | Shorter cooldown for timeout failures |
|
||
| `VerificationDelay` | 30 sec | Wait before verifying fix |
|
||
|
||
### Investigation Outcomes
|
||
|
||
| Outcome | Meaning |
|
||
|---------|---------|
|
||
| `resolved` | Issue resolved during investigation |
|
||
| `fix_queued` | Fix proposed, awaiting approval |
|
||
| `fix_executed` | Fix auto-executed successfully |
|
||
| `fix_failed` | Fix attempted but failed |
|
||
| `fix_verified` | Fix worked, issue confirmed resolved |
|
||
| `fix_verification_failed` | Fix ran but issue persists |
|
||
| `needs_attention` | Requires human intervention |
|
||
| `cannot_fix` | Issue cannot be automatically fixed |
|
||
| `timed_out` | Investigation timed out (will retry sooner) |
|
||
|
||
---
|
||
|
||
## Pulse Assistant (Chat)
|
||
|
||
Pulse Assistant is a **tool-driven** chat interface. It does not "guess" system state — it calls live tools and reports their outputs.
|
||
|
||
### The Model's Workflow (Discover → Investigate → Act)
|
||
|
||
1. **Discover**: Uses `pulse_query` or `pulse_discovery` to find real resources and IDs
|
||
2. **Investigate**: Uses `pulse_read` to run bounded, read-only commands and check status/logs
|
||
3. **Act** (optional): Uses `pulse_control` for changes, then verifies with a read
|
||
|
||
### Available Tools
|
||
|
||
| Tool | Classification | Purpose |
|
||
|------|---------------|---------|
|
||
| `pulse_query`, `pulse_discovery` | Resolve | Resource discovery and query |
|
||
| `pulse_read` | Read | Read-only operations: exec, file, find, tail, logs |
|
||
| `pulse_metrics` | Read | Performance metrics |
|
||
| `pulse_storage` | Read | Storage information |
|
||
| `pulse_kubernetes` | Read | Kubernetes cluster info |
|
||
| `pulse_pmg` | Read | Proxmox Mail Gateway stats |
|
||
| `pulse_alerts` | Read/Write | Alert management (resolve/dismiss are writes) |
|
||
| `pulse_docker` | Read/Write | Docker operations (control/update are writes) |
|
||
| `pulse_knowledge` | Read/Write | Knowledge persistence (remember/note/save are writes) |
|
||
| `pulse_file_edit` | Read/Write | File operations (write/append are writes) |
|
||
| `pulse_control` | Write | Guest control, service management |
|
||
| `pulse_patrol` | Read | Patrol findings and status |
|
||
|
||
### Safety Gates
|
||
|
||
The assistant enforces multiple safety gates:
|
||
|
||
1. **Discovery Before Action** — Action tools cannot operate on resources that weren't first discovered
|
||
2. **Verification After Write** — After any write, the model must perform a read/status check before providing a final answer
|
||
3. **Read/Write Separation** — Read operations route through `pulse_read` (stays in READING state); write operations route through `pulse_control` (enters VERIFYING state)
|
||
4. **Phantom Detection** — Detects when the model claims execution without tool calls
|
||
5. **Approval Mode** — In Controlled mode, every write requires explicit user approval
|
||
6. **Execution Context Binding** — Commands execute within the resolved resource's context, not on parent hosts
|
||
|
||
### Control Levels
|
||
|
||
| Level | Behavior | License |
|
||
|-------|----------|---------|
|
||
| **Read-only** | AI can observe and query data only | Free |
|
||
| **Controlled** | AI asks for approval before executing commands | Free |
|
||
| **Autonomous** | AI executes actions without prompting | Pro |
|
||
|
||
### Using Approvals (Controlled Mode)
|
||
|
||
When control level is **Controlled**, write actions pause for approval:
|
||
|
||
1. Tool returns `APPROVAL_REQUIRED: { approval_id, command, ... }`
|
||
2. Agentic loop emits `approval_needed` SSE event
|
||
3. UI shows approval card with the proposed command
|
||
4. **Approve** to execute and verify, or **Deny** to cancel
|
||
5. Only users with admin privileges can approve/deny
|
||
|
||
---
|
||
|
||
## Configuration
|
||
|
||
Configure in the UI: **Settings → System → AI Assistant**
|
||
|
||
### Supported Providers
|
||
|
||
- **Anthropic** (API key or OAuth)
|
||
- **OpenAI**
|
||
- **DeepSeek**
|
||
- **Google Gemini**
|
||
- **Ollama** (self-hosted, with tool/function calling support)
|
||
- **OpenAI-compatible base URL** (for providers that implement the OpenAI API shape)
|
||
|
||
### Models
|
||
|
||
Pulse uses model identifiers in the form: `provider:model-name`
|
||
|
||
You can set separate models for:
|
||
- Chat (`chat_model`)
|
||
- Patrol (`patrol_model`)
|
||
- Auto-fix remediation (`auto_fix_model`)
|
||
|
||
### Storage
|
||
|
||
AI settings are stored encrypted at rest in `ai.enc` under the Pulse config directory. Related files:
|
||
|
||
| File | Purpose |
|
||
|------|---------|
|
||
| `ai.enc` | Encrypted AI configuration and credentials |
|
||
| `ai_findings.json` | Patrol findings |
|
||
| `ai_patrol_runs.json` | Patrol run history |
|
||
| `ai_usage_history.json` | Token usage data |
|
||
| `ai_chat_sessions.json` | Legacy chat sessions (UI sync) |
|
||
| `baselines.json` | Learned resource baselines |
|
||
| `ai_correlations.json` | Resource correlation data |
|
||
| `ai_patterns.json` | Detected patterns |
|
||
|
||
Config directory: `/etc/pulse` (systemd) or `/data` (Docker/Kubernetes)
|
||
|
||
### Testing
|
||
|
||
- Test provider connectivity: `POST /api/ai/test` and `POST /api/ai/test/{provider}`
|
||
- List available models: `GET /api/ai/models`
|
||
|
||
---
|
||
|
||
## Schedule and Triggers
|
||
|
||
Patrol runs on a configurable schedule:
|
||
|
||
| Interval | Description |
|
||
|----------|-------------|
|
||
| Disabled | Patrol runs only when manually triggered |
|
||
| 10 min – 7 days | Configurable interval (default: 6 hours) |
|
||
|
||
Patrol can also be triggered by:
|
||
- **Manual run**: Click "Run Patrol" in the UI
|
||
- **Alert-triggered analysis (Pro)**: Runs when an alert fires
|
||
- **API call**: `POST /api/ai/patrol/run`
|
||
|
||
---
|
||
|
||
## AI Intelligence Layer
|
||
|
||
Pulse includes a unified intelligence system that aggregates data from all AI subsystems:
|
||
|
||
### Components
|
||
|
||
| Component | Purpose |
|
||
|-----------|---------|
|
||
| **Baseline Engine** | Learns normal behavior, detects anomalies via z-score |
|
||
| **Pattern Detector** | Identifies recurring issues and trends |
|
||
| **Correlation Engine** | Links related issues across resources |
|
||
| **Incident Memory** | Tracks past incidents and successful remediations |
|
||
| **Knowledge Store** | Persists user annotations and learned preferences |
|
||
| **Forecast Engine** | Predicts capacity issues and resource exhaustion |
|
||
|
||
### Health Scoring
|
||
|
||
Each resource receives a health score (A–F) based on:
|
||
- Current metrics vs baseline
|
||
- Active findings and alerts
|
||
- Recent incidents
|
||
- Trend direction (improving/stable/declining)
|
||
|
||
---
|
||
|
||
## Model Matrix (Pulse Assistant)
|
||
|
||
This table summarizes the most recent **Pulse Assistant** eval runs per model.
|
||
|
||
Update the table from eval reports:
|
||
```
|
||
EVAL_REPORT_DIR=tmp/eval-reports go run ./cmd/eval -scenario matrix -auto-models
|
||
python3 scripts/eval/render_model_matrix.py tmp/eval-reports --write-doc docs/AI.md
|
||
```
|
||
Or use the helper script:
|
||
```
|
||
scripts/eval/run_model_matrix.sh
|
||
```
|
||
|
||
<!-- MODEL_MATRIX_START -->
|
||
| Model | Smoke | Read-only | Time (matrix) | Tokens (matrix) | Last run (UTC) |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| anthropic:claude-3-haiku-20240307 | ✅ | ❌ | 2m 42s | — | 2026-01-29 |
|
||
| anthropic:claude-haiku-4-5-20251001 | ✅ | ✅ | 8s | 18,923 | 2026-01-29 |
|
||
| anthropic:claude-opus-4-5-20251101 | ✅ | ✅ | 9m 31s | 1,120,530 | 2026-01-29 |
|
||
| gemini:gemini-3-flash-preview | ✅ | ✅ | 7m 4s | — | 2026-01-29 |
|
||
| gemini:gemini-3-pro-preview | ✅ | ✅ | 3m 54s | 1,914 | 2026-01-29 |
|
||
| openai:gpt-5.2 | ✅ | ✅ | 5s | 12,363 | 2026-01-29 |
|
||
| openai:gpt-5.2-chat-latest | ✅ | ✅ | 8s | 12,595 | 2026-01-29 |
|
||
<!-- MODEL_MATRIX_END -->
|
||
|
||
---
|
||
|
||
## Safety Controls
|
||
|
||
Pulse includes settings that control how "active" AI features are:
|
||
|
||
- **Autonomous mode (Pro)**: When enabled, AI may execute safe commands without approval
|
||
- **Patrol auto-fix (Pro)**: Allows patrol to attempt automatic remediation
|
||
- **Alert-triggered analysis (Pro)**: Limits AI to analyzing specific events when alerts occur
|
||
- **Full autonomy unlock (Pro)**: Enables auto-fix for critical findings without approval (requires explicit toggle)
|
||
|
||
If you enable execution features, ensure agent tokens and scopes are appropriately restricted.
|
||
|
||
### Advanced Network Restrictions
|
||
|
||
Pulse blocks AI tool HTTP fetches to loopback and link-local addresses by default. For local development:
|
||
|
||
- `PULSE_AI_ALLOW_LOOPBACK=true`
|
||
|
||
Use this only in trusted environments.
|
||
|
||
---
|
||
|
||
## Privacy
|
||
|
||
Patrol runs on your server and only sends the minimal context needed for analysis to the configured provider (when AI is enabled). No telemetry is sent to Pulse by default.
|
||
|
||
---
|
||
|
||
## Why Patrol Is Different From Traditional Alerts
|
||
|
||
Alerts are threshold-based and narrow. Patrol is context-based and cross-system.
|
||
|
||
- **Alerts**: "Disk > 90%"
|
||
- **Patrol**: "ZFS pool is 86% but trending +4%/day; projected to hit 95% within a week. Largest consumer is datastore X. Recommend prune or expand."
|
||
|
||
---
|
||
|
||
## Cost Tracking
|
||
|
||
Pulse tracks token usage and costs:
|
||
|
||
- View usage summary: `GET /api/ai/cost/summary`
|
||
- Reset counters: `POST /api/ai/cost/reset` (admin)
|
||
- Set monthly budget limits in AI settings
|
||
|
||
---
|
||
|
||
## Troubleshooting
|
||
|
||
| Issue | Solution |
|
||
|-------|----------|
|
||
| AI not responding | Verify provider credentials in **Settings → System → AI Assistant** |
|
||
| No execution capability | Confirm at least one agent is connected |
|
||
| Findings not persisting | Check Pulse has write access to `ai_findings.json` in the config directory |
|
||
| Too many findings | This shouldn't happen — please report if it does |
|
||
| Investigation stuck | Check circuit breaker status at `/api/ai/circuit/status`; may auto-reset after cooldown |
|
||
| Model not available | Ensure provider API key is valid and model ID matches provider format |
|
||
|
||
## Related Documentation
|
||
|
||
### Deep Dives (Recommended for Technical Audiences)
|
||
|
||
- **[Pulse Assistant Deep Dive](architecture/pulse-assistant-deep-dive.md)** — Complete technical breakdown of the agentic architecture: context prefetching, knowledge accumulation, FSM enforcement, parallel execution, phantom detection, auto-recovery
|
||
- **[Pulse Patrol Deep Dive](architecture/pulse-patrol-deep-dive.md)** — Full intelligence layer documentation: baseline learning, pattern detection, forecasting, correlation analysis, incident memory, investigation orchestration
|
||
|
||
### Reference Documentation
|
||
|
||
- [Architecture: Pulse Assistant (Safety Gates)](architecture/pulse-assistant.md) — Detailed FSM states, tool protocol, and invariants
|
||
- [API Reference](API.md) — Complete API endpoint documentation
|
||
- [Pulse Pro](PULSE_PRO.md) — Pro features and licensing
|