mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
docs: update AI evaluation matrix and approval workflow documentation
65
docs/AI.md
@@ -125,6 +125,35 @@ Alert-triggered analysis runs attach a timeline event to the alert, so investiga

> **License note**: Kubernetes AI analysis is gated by the `kubernetes_ai` Pulse Pro feature.

## Pulse Assistant (Chat): How It Works

Pulse Assistant is **tool-driven**. It does not "guess" system state; it calls live tools and reports their outputs.

### The Model's Workflow (Discover → Investigate → Act)
- **Discover**: Uses `pulse_query` (or `pulse_discovery`) to find real resources and IDs.
- **Investigate**: Uses `pulse_read` to run bounded, read-only commands and check status/logs.
- **Act** (optional): Uses `pulse_control` for changes, then verifies with a read.

### Safety Gates That Make It Trustworthy
- **Strict Resolution (optional)**: When enabled, the assistant must discover a resource before it can act on it. This prevents fabricated IDs.
- **Read/Write separation**: Read-only commands go through `pulse_read`; write actions go through `pulse_control`. This lets the workflow state machine track when a write has occurred.
- **Verification after writes**: After any write, the assistant must perform a read check before it can finish the response.
- **Non-interactive guardrails**: Commands that could hang (e.g., `tail -f`) are rewritten into bounded, safe forms.
- **Approval mode**: In Controlled mode, every write requires explicit user approval. Autonomous mode is available only with Pro.
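
The gates above can be pictured as a small state machine. The following is a minimal sketch in Python, not Pulse's actual implementation: the `WorkflowGate` class, its method names, and `GateError` are hypothetical, and only the tool names in the comments come from this document.

```python
from dataclasses import dataclass, field


class GateError(Exception):
    """Raised when a tool call violates one of the safety gates."""


@dataclass
class WorkflowGate:
    """Hypothetical model of the gates described above (illustration only)."""

    strict_resolution: bool = True
    resolved: set = field(default_factory=set)  # resource IDs seen during discovery
    pending_verify: bool = False  # True after a write until a read happens

    def discover(self, resource_id: str) -> None:
        # Corresponds to pulse_query / pulse_discovery: record a real resource ID.
        self.resolved.add(resource_id)

    def read(self, resource_id: str) -> None:
        # Corresponds to pulse_read: a read settles any outstanding write.
        self.pending_verify = False

    def write(self, resource_id: str) -> None:
        # Corresponds to pulse_control: with strict resolution on, the target
        # must have been discovered first; every write must later be verified.
        if self.strict_resolution and resource_id not in self.resolved:
            raise GateError(f"resource {resource_id!r} was never discovered")
        self.pending_verify = True

    def finish(self) -> None:
        # The response may only end once the last write was followed by a read.
        if self.pending_verify:
            raise GateError("write must be verified by a read before finishing")
```

In this sketch, calling `write` on an undiscovered ID or calling `finish` straight after a write raises `GateError`, which mirrors the "discover first" and "verify after write" rules.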

### What You See As a User
- **Clear tool usage**: Each step shows which tool ran and what it returned.
- **Structured recovery**: If a tool is blocked, the assistant adapts (e.g., runs discovery, switches tools, or asks for approval).
- **Verified outcomes**: Changes are followed by a read check before the assistant claims success.

## Why It's Impressive (and Reliable)

Pulse Assistant behaves like a careful operator:
- It **grounds answers in live data** instead of assumptions.
- It **adapts** when guardrails block an action.
- It **verifies** changes before reporting success.
- It **keeps you in control** with explicit approval gates.

## Configuration

Configure in the UI: **Settings → System → AI Assistant**
@@ -149,6 +178,34 @@ You can set separate models for:
- Patrol (`patrol_model`)
- Auto-fix remediation (`auto_fix_model`)

## Model Matrix (Pulse Assistant)

This table summarizes the most recent **Pulse Assistant** eval runs per model. Patrol is still in development and is not scored yet.
Time/tokens reflect the combined **Smoke + Read-only** matrix run.
Transient provider errors (rate limits, unavailable chat endpoints) are skipped when rendering the table.

Update the table from eval reports:
```
EVAL_REPORT_DIR=tmp/eval-reports go run ./cmd/eval -scenario matrix -auto-models
python3 scripts/eval/render_model_matrix.py tmp/eval-reports --write-doc docs/AI.md
```
Or use the helper script:
```
scripts/eval/run_model_matrix.sh
```

<!-- MODEL_MATRIX_START -->
| Model | Smoke | Read-only | Time (matrix) | Tokens (matrix) | Last run (UTC) |
| --- | --- | --- | --- | --- | --- |
| anthropic:claude-3-haiku-20240307 | ✅ | ❌ | 2m 42s | — | 2026-01-29 |
| anthropic:claude-haiku-4-5-20251001 | ✅ | ✅ | 8s | 18,923 | 2026-01-29 |
| anthropic:claude-opus-4-5-20251101 | ✅ | ✅ | 9m 31s | 1,120,530 | 2026-01-29 |
| gemini:gemini-3-flash-preview | ✅ | ✅ | 7m 4s | — | 2026-01-29 |
| gemini:gemini-3-pro-preview | ✅ | ✅ | 3m 54s | 1,914 | 2026-01-29 |
| openai:gpt-5.2 | ✅ | ✅ | 5s | 12,363 | 2026-01-29 |
| openai:gpt-5.2-chat-latest | ✅ | ✅ | 8s | 12,595 | 2026-01-29 |
<!-- MODEL_MATRIX_END -->
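
The two HTML comments act as markers so a script can regenerate the table in place. As a rough illustration of that idea (this is not the code in `scripts/eval/render_model_matrix.py`, whose internals are not shown here), a marker-based replacement could look like the sketch below. The function name `replace_matrix` and the way rows are passed in are assumptions.

```python
import re

START = "<!-- MODEL_MATRIX_START -->"
END = "<!-- MODEL_MATRIX_END -->"

HEADER = [
    "| Model | Smoke | Read-only | Time (matrix) | Tokens (matrix) | Last run (UTC) |",
    "| --- | --- | --- | --- | --- | --- |",
]


def replace_matrix(doc: str, rows: list[str]) -> str:
    """Swap everything between the two markers for freshly rendered rows."""
    block = "\n".join([START, *HEADER, *rows, END])
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    if not pattern.search(doc):
        raise ValueError("matrix markers not found")
    # A function replacement keeps backslashes in rows from being read
    # as regex escape sequences.
    return pattern.sub(lambda _match: block, doc, count=1)
```

Text outside the markers is left untouched, so the surrounding prose can be edited by hand without being overwritten on the next run.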

### Testing

- Test provider connectivity: `POST /api/ai/test` and `POST /api/ai/test/{provider}`
@@ -202,6 +259,14 @@ Pulse uses three AI permission levels for infrastructure control:
- **Controlled**: AI asks for approval before executing commands or control actions.
- **Autonomous (Pro)**: AI executes actions without prompting.

### Using Approvals (Controlled Mode)

When control level is **Controlled**, write actions pause for approval:

- In chat, you’ll see an approval card with the proposed command.
- **Approve** to execute and verify the change, or **Deny** to cancel it.
- Only users with admin privileges can approve/deny.

### Advanced Network Restrictions

Pulse blocks AI tool HTTP fetches to loopback and link-local addresses by default. For local development, you can allow loopback targets:

43
docs/EVAL.md
@@ -20,6 +20,16 @@ Run a single scenario:
go run ./cmd/eval -scenario readonly
```

Run the model matrix quick set:
```
go run ./cmd/eval -scenario matrix
```

Auto-select models (latest per provider):
```
go run ./cmd/eval -scenario matrix -auto-models
```

## Environment Overrides

These env vars let you align the evals with your infrastructure naming:
@@ -35,6 +45,10 @@ EVAL_HOMEASSISTANT_CONTAINER
EVAL_MQTT_CONTAINER
EVAL_ZIGBEE_CONTAINER
EVAL_FRIGATE_CONTAINER
EVAL_MODEL (optional model override)
EVAL_MODEL_PROVIDERS (optional comma-separated provider filter for auto selection; defaults to openai,anthropic,deepseek,gemini,ollama)
EVAL_MODEL_LIMIT (optional per-provider limit for auto selection, default 2)
EVAL_MODEL_EXCLUDE_KEYWORDS (optional comma-separated keywords to skip models; default filters image/video/audio, codex, and specific pre-release IDs like openai:gpt-5.2-pro until chat support is live; set to "none" to disable)
```
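
To make the interaction of the three auto-selection variables concrete, here is a hedged Python sketch of how such a filter could behave. It is not the eval runner's actual logic: `select_models` is a hypothetical helper, the candidate list is assumed to arrive ordered newest first, and the default keyword list below is a simplified stand-in for the real defaults.

```python
DEFAULT_PROVIDERS = "openai,anthropic,deepseek,gemini,ollama"
# Simplified stand-in for the real default exclusions (assumption).
DEFAULT_EXCLUDE = "image,video,audio,codex"


def _split(value: str) -> list[str]:
    """Split a comma-separated setting into trimmed, lowercase items."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def select_models(candidates: list[str], env: dict) -> list[str]:
    """Pick up to EVAL_MODEL_LIMIT "provider:model" IDs per allowed provider."""
    providers = _split(env.get("EVAL_MODEL_PROVIDERS", DEFAULT_PROVIDERS))
    limit = int(env.get("EVAL_MODEL_LIMIT", "2"))
    raw_exclude = env.get("EVAL_MODEL_EXCLUDE_KEYWORDS", DEFAULT_EXCLUDE)
    # The literal value "none" disables keyword filtering entirely.
    keywords = [] if raw_exclude.strip().lower() == "none" else _split(raw_exclude)

    picked: list[str] = []
    per_provider: dict[str, int] = {}
    for model_id in candidates:
        provider = model_id.split(":", 1)[0].lower()
        if provider not in providers:
            continue  # provider filtered out
        if any(keyword in model_id.lower() for keyword in keywords):
            continue  # excluded by keyword
        if per_provider.get(provider, 0) >= limit:
            continue  # per-provider limit reached
        per_provider[provider] = per_provider.get(provider, 0) + 1
        picked.append(model_id)
    return picked
```

Passing `os.environ` as `env` would read the real variables; a plain dict works the same way for experimentation.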

Write/verify and strict-resolution controls:
@@ -51,12 +65,15 @@ EVAL_EXPECT_APPROVAL (set to 1 to assert approval_needed event)
Retry controls and reports:

```
EVAL_HTTP_TIMEOUT (seconds, default 300)
EVAL_STEP_RETRIES (default 2)
EVAL_RETRY_ON_PHANTOM (default 1)
EVAL_RETRY_ON_EXPLICIT_TOOL (default 1)
EVAL_RETRY_ON_STREAM_FAILURE (default 1)
EVAL_RETRY_ON_EMPTY_RESPONSE (default 1)
EVAL_RETRY_ON_TOOL_ERRORS (default 1)
EVAL_RETRY_ON_RATE_LIMIT (default 0)
EVAL_RATE_LIMIT_COOLDOWN (seconds, optional backoff before retry)
EVAL_PREFLIGHT (set to 1 to run a quick chat preflight)
EVAL_PREFLIGHT_TIMEOUT (seconds, default 15)
EVAL_REPORT_DIR (write JSON report per scenario)
@@ -106,12 +123,38 @@ EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-deny
```

Approval combo flow (approve + deny in one session):
```
EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-combo
```

Write then verify (safe no-op command by default):
```
EVAL_REQUIRE_WRITE_VERIFY=1 \
go run ./cmd/eval -scenario writeverify
```

## Model Matrix Workflow

Run the matrix and update the docs table in one step:
```
scripts/eval/run_model_matrix.sh
```

Key overrides:
```
PULSE_BASE_URL=http://127.0.0.1:7655
PULSE_EVAL_USER=admin
PULSE_EVAL_PASS=admin
EVAL_MODEL_PROVIDERS=openai,anthropic,gemini
EVAL_MODEL_LIMIT=2
EVAL_MODELS=anthropic:claude-haiku-4-5-20251001
EVAL_SCENARIO=matrix
EVAL_REPORT_DIR=tmp/eval-reports
EVAL_WRITE_DOC=1
```

## Notes

- The evals run against live infrastructure. Use only safe commands, or keep the default `EVAL_WRITE_COMMAND=true`, which runs the no-op shell command `true`.

@@ -57,6 +57,18 @@ Pulse Assistant is a **protocol-driven, safety-gated AI system** for infrastruct
3. **Writes must be verified.** FSM enforces read-after-write before final answer.
4. **Errors are recoverable.** Structured error responses enable self-correction without prompt engineering.

## 1.1 User-Visible Behavior (What Feels "Impressive")

When you use Pulse Assistant in chat, these behaviors are deliberate and enforced by the backend:

- **Grounded answers**: The assistant uses live tools and surfaces their outputs.
- **Discover → Investigate → Act**: It queries resources first, reads status/logs, and only then acts.
- **Verified changes**: After a write, it performs a read check before concluding.
- **Approval gates**: In Controlled mode, write actions emit approvals and wait for a decision.
- **Self-recovery**: If blocked (routing mismatch, read-only violation, strict resolution), it adapts and retries with a safe path.

These are not prompt conventions; they are enforced by the FSM + tool executor.

---

## 2. Core Design Principles (Invariants)
@@ -88,6 +100,18 @@ Resolved resources are **session-scoped** and **in-memory only**. They are never

**Enforcement:** `ResolvedContext` is not serialized; it is rebuilt each session in `chat/session.go`

### Approval Flow (Controlled Mode)

When `control_level=controlled`, write tools emit an approval request instead of executing:

1. Tool returns `APPROVAL_REQUIRED: { approval_id, command, ... }`
2. Agentic loop emits `approval_needed` SSE event
3. UI or API approves/denies via `/api/ai/approvals/{id}/approve|deny`
4. On approve, the tool re-executes with `_approval_id` and proceeds
5. On deny, the assistant returns `Command denied: <reason>`

This keeps the LLM in a proposer role while letting users explicitly authorize actions.
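
The five steps can be modelled in a few lines. The sketch below is a hypothetical in-memory illustration in Python, not Pulse's implementation: `ApprovalBroker` and its methods are invented names, the SSE event and HTTP endpoints are left out, and treating each approval as single-use is an assumption.

```python
import uuid


class ApprovalBroker:
    """Hypothetical in-memory model of the approval steps (illustration only)."""

    def __init__(self):
        self.pending = {}      # approval_id -> proposed command
        self.approved = set()  # approval IDs that a user has approved

    def run_write_tool(self, command, _approval_id=None):
        # Steps 1 and 4: execute only when called with an approved ID;
        # otherwise return an approval request instead of executing.
        if _approval_id in self.approved:
            self.approved.discard(_approval_id)  # single-use (assumption)
            return {"status": "executed", "command": command}
        new_id = str(uuid.uuid4())
        self.pending[new_id] = command
        return {"status": "APPROVAL_REQUIRED", "approval_id": new_id, "command": command}

    def approve(self, approval_id):
        # Step 3 (approve): record the decision, then re-run the tool (step 4).
        command = self.pending.pop(approval_id)
        self.approved.add(approval_id)
        return self.run_write_tool(command, _approval_id=approval_id)

    def deny(self, approval_id, reason):
        # Step 3 (deny) and step 5: drop the request and report the denial.
        self.pending.pop(approval_id)
        return f"Command denied: {reason}"
```

The model never executes a command on its own authority here: the only path to `"executed"` goes through `approve`, which is the "proposer role" described above.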

### Invariant 6: Read/Write Tool Separation

> **This is the most commonly violated invariant.** Read it carefully.
