docs: update AI evaluation matrix and approval workflow documentation

2026-02-18 00:17:39 +01:00 · 2026-01-30 19:00:10 +00:00
parent 10df3e4d95
commit 17208cbf9d
10 changed files with 774 additions and 67 deletions
--- a/docs/EVAL.md
+++ b/docs/EVAL.md
@@ -20,6 +20,16 @@ Run a single scenario:
 go run ./cmd/eval -scenario readonly
 ```

+Run the model matrix quick set:
+```
+go run ./cmd/eval -scenario matrix
+```
+
+Auto-select models (latest per provider):
+```
+go run ./cmd/eval -scenario matrix -auto-models
+```
+
 ## Environment Overrides

 These env vars let you align the evals with your infrastructure naming:
@@ -35,6 +45,10 @@ EVAL_HOMEASSISTANT_CONTAINER
 EVAL_MQTT_CONTAINER
 EVAL_ZIGBEE_CONTAINER
 EVAL_FRIGATE_CONTAINER
+EVAL_MODEL                  (optional model override)
+EVAL_MODEL_PROVIDERS        (optional comma-separated provider filter for auto selection; defaults to openai,anthropic,deepseek,gemini,ollama)
+EVAL_MODEL_LIMIT            (optional per-provider limit for auto selection, default 2)
+EVAL_MODEL_EXCLUDE_KEYWORDS (optional comma-separated keywords to skip models; default filters image/video/audio, codex, and specific pre-release IDs like openai:gpt-5.2-pro until chat support is live; set to "none" to disable)
 ```

 Write/verify and strict-resolution controls:
@@ -51,12 +65,15 @@ EVAL_EXPECT_APPROVAL         (set to 1 to assert approval_needed event)
 Retry controls and reports:

 ```
+EVAL_HTTP_TIMEOUT           (seconds, default 300)
 EVAL_STEP_RETRIES            (default 2)
 EVAL_RETRY_ON_PHANTOM        (default 1)
 EVAL_RETRY_ON_EXPLICIT_TOOL  (default 1)
 EVAL_RETRY_ON_STREAM_FAILURE (default 1)
 EVAL_RETRY_ON_EMPTY_RESPONSE (default 1)
 EVAL_RETRY_ON_TOOL_ERRORS    (default 1)
+EVAL_RETRY_ON_RATE_LIMIT     (default 0)
+EVAL_RATE_LIMIT_COOLDOWN     (seconds, optional backoff before retry)
 EVAL_PREFLIGHT              (set to 1 to run a quick chat preflight)
 EVAL_PREFLIGHT_TIMEOUT       (seconds, default 15)
 EVAL_REPORT_DIR              (write JSON report per scenario)
@@ -106,12 +123,38 @@ EVAL_EXPECT_APPROVAL=1 \
 go run ./cmd/eval -scenario approval-deny
 ```

+Approval combo flow (approve + deny in one session):
+```
+EVAL_EXPECT_APPROVAL=1 \
+go run ./cmd/eval -scenario approval-combo
+```
+
 Write then verify (safe no-op command by default):
 ```
 EVAL_REQUIRE_WRITE_VERIFY=1 \
 go run ./cmd/eval -scenario writeverify
 ```

+## Model Matrix Workflow
+
+Run the matrix and update the docs table in one step:
+```
+scripts/eval/run_model_matrix.sh
+```
+
+Key overrides:
+```
+PULSE_BASE_URL=http://127.0.0.1:7655
+PULSE_EVAL_USER=admin
+PULSE_EVAL_PASS=admin
+EVAL_MODEL_PROVIDERS=openai,anthropic,gemini
+EVAL_MODEL_LIMIT=2
+EVAL_MODELS=anthropic:claude-haiku-4-5-20251001
+EVAL_SCENARIO=matrix
+EVAL_REPORT_DIR=tmp/eval-reports
+EVAL_WRITE_DOC=1
+```
+
 ## Notes

 - The evals run against live infrastructure. Use safe commands or keep the default `EVAL_WRITE_COMMAND=true`.