Pulse Assistant Eval Harness

This is a live, end-to-end eval harness that exercises the AI chat API, tool calls, and safety gates. It requires a running Pulse instance and valid credentials.

Quickstart

List scenarios:

go run ./cmd/eval -list

Run the full suite:

go run ./cmd/eval -scenario full

Run a single scenario:

go run ./cmd/eval -scenario readonly

Run the model matrix quick set:

go run ./cmd/eval -scenario matrix

Auto-select models (latest per provider):

go run ./cmd/eval -scenario matrix -auto-models

Environment Overrides

These environment variables let you match the evals to the resource names used in your infrastructure:

EVAL_NODE
EVAL_NODE_CONTAINER
EVAL_DOCKER_HOST
EVAL_HOMEPAGE_CONTAINER
EVAL_JELLYFIN_CONTAINER
EVAL_GRAFANA_CONTAINER
EVAL_HOMEASSISTANT_CONTAINER
EVAL_MQTT_CONTAINER
EVAL_ZIGBEE_CONTAINER
EVAL_FRIGATE_CONTAINER
EVAL_MODEL                  (optional model override)
EVAL_MODEL_PROVIDERS        (optional comma-separated provider filter for auto selection; defaults to openai,anthropic,deepseek,gemini,ollama)
EVAL_MODEL_LIMIT            (optional per-provider limit for auto selection, default 2)
EVAL_MODEL_EXCLUDE_KEYWORDS (optional comma-separated keywords for skipping models; by default skips image/video/audio and codex models, plus specific pre-release IDs such as openai:gpt-5.2-pro until chat support is live; set to "none" to disable)
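
To make the interaction between the provider filter, the per-provider limit, and the exclude keywords concrete, here is a minimal sketch of how such a selection could work. It is not the harness's actual code: `selectModels`, `containsAny`, and the model IDs are hypothetical. The sketch also assumes that IDs use the `provider:model` form and that the input list is already sorted newest first.

```go
package main

import (
	"fmt"
	"strings"
)

// selectModels keeps at most `limit` models per allowed provider and skips
// any ID that contains an excluded keyword. Input order is preserved, so the
// caller is assumed to pass models sorted newest first.
// Hypothetical helper; the real harness may select models differently.
func selectModels(ids, providers, exclude []string, limit int) []string {
	allowed := make(map[string]bool, len(providers))
	for _, p := range providers {
		allowed[strings.TrimSpace(p)] = true
	}
	perProvider := make(map[string]int)
	var out []string
	for _, id := range ids {
		provider, _, ok := strings.Cut(id, ":")
		if !ok || !allowed[provider] || perProvider[provider] >= limit {
			continue
		}
		if containsAny(strings.ToLower(id), exclude) {
			continue
		}
		perProvider[provider]++
		out = append(out, id)
	}
	return out
}

// containsAny reports whether s contains any non-empty keyword
// (case-insensitive, assuming s is already lowercased).
func containsAny(s string, keywords []string) bool {
	for _, k := range keywords {
		k = strings.ToLower(strings.TrimSpace(k))
		if k != "" && strings.Contains(s, k) {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder IDs for illustration only.
	ids := []string{
		"openai:model-a", "openai:model-b-audio", "openai:model-c",
		"openai:model-d", "ollama:local-x", "other:model-e",
	}
	got := selectModels(ids, []string{"openai", "ollama"}, []string{"audio"}, 2)
	fmt.Println(got) // prints [openai:model-a openai:model-c ollama:local-x]
}
```

In this sketch an excluded model does not count toward the per-provider limit, so a provider with many filtered models can still contribute up to `limit` entries. Whether the harness behaves the same way is not stated here.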

Write/verify and strict-resolution controls:

EVAL_WRITE_HOST              (defaults to EVAL_NODE)
EVAL_WRITE_COMMAND           (defaults to "true")
EVAL_REQUIRE_WRITE_VERIFY    (set to 1 to assert pulse_control -> pulse_read)
EVAL_STRICT_RESOLUTION       (set to 1 to expect STRICT_RESOLUTION block)
EVAL_REQUIRE_STRICT_RECOVERY (set to 1 to require pulse_query -> pulse_control)
EVAL_EXPECT_APPROVAL         (set to 1 to assert approval_needed event)
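
The `->` notation above describes an ordering of tool calls within a step. As a rough sketch of what such an assertion might check, the hypothetical helper below (`hasOrderedPair` is not a name from the harness) reports whether one tool call is followed, at any later position, by another. It assumes the step's tool calls are available as an ordered list of names.

```go
package main

import "fmt"

// hasOrderedPair reports whether `first` occurs in calls and `second`
// occurs at some later position. Hypothetical helper for illustration;
// the harness's real assertion logic may be stricter.
func hasOrderedPair(calls []string, first, second string) bool {
	seenFirst := false
	for _, name := range calls {
		if seenFirst && name == second {
			return true
		}
		if name == first {
			seenFirst = true
		}
	}
	return false
}

func main() {
	calls := []string{"pulse_query", "pulse_control", "pulse_read"}

	// Write then verify: pulse_control -> pulse_read.
	fmt.Println(hasOrderedPair(calls, "pulse_control", "pulse_read")) // prints true

	// Strict recovery: pulse_query -> pulse_control.
	fmt.Println(hasOrderedPair(calls, "pulse_query", "pulse_control")) // prints true

	// The reverse order is not satisfied.
	fmt.Println(hasOrderedPair(calls, "pulse_read", "pulse_control")) // prints false
}
```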

Retry controls and reports:

EVAL_HTTP_TIMEOUT            (seconds, default 300)
EVAL_STEP_RETRIES            (default 2)
EVAL_RETRY_ON_PHANTOM        (default 1)
EVAL_RETRY_ON_EXPLICIT_TOOL  (default 1)
EVAL_RETRY_ON_STREAM_FAILURE (default 1)
EVAL_RETRY_ON_EMPTY_RESPONSE (default 1)
EVAL_RETRY_ON_TOOL_ERRORS    (default 1)
EVAL_RETRY_ON_RATE_LIMIT     (default 0)
EVAL_RATE_LIMIT_COOLDOWN     (seconds, optional backoff before retry)
EVAL_PREFLIGHT               (set to 1 to run a quick chat preflight)
EVAL_PREFLIGHT_TIMEOUT       (seconds, default 15)
EVAL_REPORT_DIR              (write JSON report per scenario)
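
Most of the controls above are integers with a documented default. A minimal sketch of reading them that way is shown below. `envInt` is a hypothetical helper, not part of the harness, and falling back to the default on an unparsable value is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// envInt returns the integer value of the environment variable `key`,
// or `def` when the variable is unset, empty, or not a valid integer.
// Hypothetical helper; the harness may treat invalid values differently.
func envInt(key string, def int) int {
	raw, ok := os.LookupEnv(key)
	if !ok || raw == "" {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return n
}

func main() {
	// Defaults mirror the values documented in the list above.
	timeout := time.Duration(envInt("EVAL_HTTP_TIMEOUT", 300)) * time.Second
	retries := envInt("EVAL_STEP_RETRIES", 2)
	retryOnRateLimit := envInt("EVAL_RETRY_ON_RATE_LIMIT", 0) != 0

	fmt.Println("http timeout:", timeout)
	fmt.Println("step retries:", retries)
	fmt.Println("retry on rate limit:", retryOnRateLimit)
}
```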

Full suite with custom resource names:

EVAL_NODE=delly EVAL_DOCKER_HOST=homepage-docker \
go run ./cmd/eval -scenario full

Strict-resolution block + recovery (requires server with PULSE_STRICT_RESOLUTION=true):

EVAL_STRICT_RESOLUTION=1 EVAL_REQUIRE_STRICT_RECOVERY=1 \
go run ./cmd/eval -scenario strict

Strict-resolution block only (no recovery):

EVAL_STRICT_RESOLUTION=1 \
go run ./cmd/eval -scenario strict-block

Strict-resolution recovery in a single step:

EVAL_STRICT_RESOLUTION=1 EVAL_REQUIRE_STRICT_RECOVERY=1 \
go run ./cmd/eval -scenario strict-recovery

Approval flow (requires Control Level = Controlled):

EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval

Approval flow with auto-approve (approval requests are approved automatically during the step):

EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-approve

Approval flow with auto-deny (approval requests are denied automatically during the step):

EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-deny

Approval combo flow (approve + deny in one session):

EVAL_EXPECT_APPROVAL=1 \
go run ./cmd/eval -scenario approval-combo

Write then verify (safe no-op command by default):

EVAL_REQUIRE_WRITE_VERIFY=1 \
go run ./cmd/eval -scenario writeverify

Model Matrix Workflow

Run the matrix and update the docs table in one step:

scripts/eval/run_model_matrix.sh

Key overrides:

PULSE_BASE_URL=http://127.0.0.1:7655
PULSE_EVAL_USER=admin
PULSE_EVAL_PASS=admin
EVAL_MODEL_PROVIDERS=openai,anthropic,gemini
EVAL_MODEL_LIMIT=2
EVAL_MODELS=anthropic:claude-haiku-4-5-20251001
EVAL_SCENARIO=matrix
EVAL_REPORT_DIR=tmp/eval-reports
EVAL_WRITE_DOC=1

Notes

  • The evals run against live infrastructure. Use safe commands or keep the default EVAL_WRITE_COMMAND=true.
  • Scenario assertions are intentionally coarse; set the stricter env flags (for example EVAL_REQUIRE_WRITE_VERIFY=1 or EVAL_REQUIRE_STRICT_RECOVERY=1) to enforce write/verify or strict-recovery sequences.
  • Live tests via go test:
    go test -v ./internal/ai/eval -run TestQuickSmokeTest -live