mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
# Pulse Assistant Eval Harness

This is a live, end-to-end eval harness that exercises the AI chat API, tool calls, and safety gates.
It requires a running Pulse instance and valid credentials.

## Quickstart

List scenarios:
```
go run ./cmd/eval -list
```

Run the full suite:
```
go run ./cmd/eval -scenario full
```

Run a single scenario:
```
go run ./cmd/eval -scenario readonly
```

Run the model matrix quick set:
```
go run ./cmd/eval -scenario matrix
```

Auto-select models (latest per provider):
```
go run ./cmd/eval -scenario matrix -auto-models
```

## Environment Overrides

These env vars let you align the evals with your infrastructure naming:

```
EVAL_NODE
EVAL_NODE_CONTAINER
EVAL_DOCKER_HOST
EVAL_HOMEPAGE_CONTAINER
EVAL_JELLYFIN_CONTAINER
EVAL_GRAFANA_CONTAINER
EVAL_HOMEASSISTANT_CONTAINER
EVAL_MQTT_CONTAINER
EVAL_ZIGBEE_CONTAINER
EVAL_FRIGATE_CONTAINER
EVAL_MODEL (optional model override)
EVAL_MODEL_PROVIDERS (optional comma-separated provider filter for auto selection; defaults to openai,anthropic,deepseek,gemini,ollama)
EVAL_MODEL_LIMIT (optional per-provider limit for auto selection, default 2)
EVAL_MODEL_EXCLUDE_KEYWORDS (optional comma-separated keywords to skip models; default filters image/video/audio, codex, and specific pre-release IDs like openai:gpt-5.2-pro until chat support is live; set to "none" to disable)
```

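To make the `EVAL_MODEL_EXCLUDE_KEYWORDS` description more concrete, here is a minimal shell sketch of how a comma-separated keyword filter with a `none` sentinel could behave. The function name `model_excluded` is hypothetical, and case-insensitive substring matching is an assumption; the harness's actual matching rules may differ.

```shell
# Hypothetical helper, not part of Pulse: succeeds (exit 0) when the model ID
# should be skipped. Case-insensitive substring matching is an assumption.
model_excluded() {
  model=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  keywords=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  # The documented sentinel "none" disables filtering entirely.
  if [ "$keywords" = "none" ]; then
    return 1
  fi
  old_ifs=$IFS
  IFS=,
  for kw in $keywords; do
    IFS=$old_ifs
    # Skip empty fields such as the gap in "image,,video".
    [ -n "$kw" ] || continue
    case "$model" in
      *"$kw"*) return 0 ;;
    esac
  done
  IFS=$old_ifs
  return 1
}
```

For example, `model_excluded "$id" "image,video,audio"` would succeed for any ID containing one of those words, while passing `none` as the second argument lets every model through.
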
Write/verify and strict-resolution controls:

```
EVAL_WRITE_HOST (defaults to EVAL_NODE)
EVAL_WRITE_COMMAND (defaults to "true")
EVAL_REQUIRE_WRITE_VERIFY (set to 1 to assert pulse_control -> pulse_read)
EVAL_STRICT_RESOLUTION (set to 1 to expect STRICT_RESOLUTION block)
EVAL_REQUIRE_STRICT_RECOVERY (set to 1 to require pulse_query -> pulse_control)
EVAL_EXPECT_APPROVAL (set to 1 to assert approval_needed event)
```

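The two fallbacks documented above (`EVAL_WRITE_HOST` to `EVAL_NODE`, and `EVAL_WRITE_COMMAND` to `true`) can be previewed before a run with a small shell sketch. `resolve_write_target` is a hypothetical helper, not something the harness provides, and treating an empty value the same as an unset one is an assumption.

```shell
# Hypothetical helper mirroring the documented defaults; it does not read
# anything from the harness itself.
resolve_write_target() {
  # EVAL_WRITE_HOST falls back to EVAL_NODE when unset (or empty, by assumption).
  host="${EVAL_WRITE_HOST:-${EVAL_NODE:-}}"
  # EVAL_WRITE_COMMAND falls back to the no-op command "true".
  cmd="${EVAL_WRITE_COMMAND:-true}"
  printf 'host=%s command=%s\n' "$host" "$cmd"
}
```

Calling `resolve_write_target` after exporting your overrides prints the host and command that a write step would be expected to target under these assumptions.
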
Retry controls and reports:

```
EVAL_HTTP_TIMEOUT (seconds, default 300)
EVAL_STEP_RETRIES (default 2)
EVAL_RETRY_ON_PHANTOM (default 1)
EVAL_RETRY_ON_EXPLICIT_TOOL (default 1)
EVAL_RETRY_ON_STREAM_FAILURE (default 1)
EVAL_RETRY_ON_EMPTY_RESPONSE (default 1)
EVAL_RETRY_ON_TOOL_ERRORS (default 1)
EVAL_RETRY_ON_RATE_LIMIT (default 0)
EVAL_RATE_LIMIT_COOLDOWN (seconds, optional backoff before retry)
EVAL_PREFLIGHT (set to 1 to run a quick chat preflight)
EVAL_PREFLIGHT_TIMEOUT (seconds, default 15)
EVAL_REPORT_DIR (write JSON report per scenario)
```

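The retry variables above act on individual steps inside the harness. If you also want to retry a whole run from the outside, the same retry-then-cooldown idea can be sketched in shell. `retry_with_cooldown` is a hypothetical wrapper, and reading "N retries" as N extra attempts after the first failure is an assumption, not confirmed harness behaviour.

```shell
# Hypothetical outer wrapper; it does not reproduce the harness's internal
# EVAL_STEP_RETRIES logic.
# Usage: retry_with_cooldown <retries> <cooldown_seconds> <command> [args...]
retry_with_cooldown() {
  retries=$1
  cooldown=$2
  shift 2
  attempt=0
  while :; do
    # Run the command; the first success ends the loop.
    if "$@"; then
      return 0
    fi
    # Assumption: N retries means N extra attempts after the first failure.
    if [ "$attempt" -ge "$retries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    # Back off before the next attempt.
    sleep "$cooldown"
  done
}
```

A call such as `retry_with_cooldown 2 30 go run ./cmd/eval -scenario readonly` would then make up to three attempts with a 30-second pause between them.
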
## Recommended Runs

Full suite with custom resource names:
```
EVAL_NODE=delly EVAL_DOCKER_HOST=homepage-docker \
go run ./cmd/eval -scenario full
```

Strict-resolution block + recovery (requires server with PULSE_STRICT_RESOLUTION=true):
```
EVAL_STRICT_RESOLUTION=1 EVAL_REQUIRE_STRICT_RECOVERY=1 \
go run ./cmd/eval -scenario strict
```

Strict-resolution block only (no recovery):
```
EVAL_STRICT_RESOLUTION=1 \
go run ./cmd/eval -scenario strict-block
```

Strict-resolution recovery in a single step:
```
EVAL_STRICT_RESOLUTION=1 EVAL_REQUIRE_STRICT_RECOVERY=1 \
go run ./cmd/eval -scenario strict-recovery
```

Approval flow (requires Control Level = Controlled):
```
EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval
```

Approval approve flow (auto-approves approvals during the step):
```
EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-approve
```

Approval deny flow (auto-denies approvals during the step):
```
EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-deny
```

Approval combo flow (approve + deny in one session):
```
EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-combo
```

Write then verify (safe no-op command by default):
```
EVAL_REQUIRE_WRITE_VERIFY=1 \
go run ./cmd/eval -scenario writeverify
```

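When several of the runs above share the same environment, they can be chained with a small loop. The sketch below is one possible shape. `run_scenarios` and the `EVAL_RUNNER` variable are hypothetical names introduced here (Pulse does not read `EVAL_RUNNER`), and it assumes the eval command exits non-zero when a scenario fails.

```shell
# Hypothetical helper: runs each named scenario in turn and reports failures.
# EVAL_RUNNER exists only in this sketch so the command can be swapped out.
run_scenarios() {
  runner=${EVAL_RUNNER:-go run ./cmd/eval}
  failed=""
  for scenario in "$@"; do
    echo "== scenario: $scenario =="
    # $runner is intentionally unquoted so "go run ./cmd/eval" splits into words.
    if ! $runner -scenario "$scenario"; then
      failed="$failed $scenario"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed" >&2
    return 1
  fi
  echo "all scenarios passed"
}
```

For instance, after exporting `EVAL_EXPECT_APPROVAL=1`, calling `run_scenarios approval approval-approve approval-deny approval-combo` would run the four approval scenarios back to back and list any that failed.
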
## Model Matrix Workflow

Run the matrix and update the docs table in one step:
```
scripts/eval/run_model_matrix.sh
```

Key overrides:
```
PULSE_BASE_URL=http://127.0.0.1:7655
PULSE_EVAL_USER=admin
PULSE_EVAL_PASS=admin
EVAL_MODEL_PROVIDERS=openai,anthropic,gemini
EVAL_MODEL_LIMIT=2
EVAL_MODELS=anthropic:claude-haiku-4-5-20251001
EVAL_SCENARIO=matrix
EVAL_REPORT_DIR=tmp/eval-reports
EVAL_WRITE_DOC=1
```

## Notes

- The evals run against live infrastructure. Use safe commands or keep the default `EVAL_WRITE_COMMAND=true`.
- Scenario assertions are intentionally coarse; use stricter env flags to enforce write/verify or strict-recovery sequences.
- Live tests via `go test`:
```
go test -v ./internal/ai/eval -run TestQuickSmokeTest -live
```