Refactor patrol eval runner to use a dual approach:
1. Poll GET /api/ai/patrol/status until Running=false (primary signal)
2. Best-effort SSE stream connection for tool event visibility
Changes:
- Add status polling loop with configurable timeout
- Make SSE stream optional (may not connect in time)
- Add Completed flag to PatrolRunResult
- Improve assertion error messages
- Add new scenarios and assertions
This is more reliable than relying solely on SSE stream which
may timeout waiting for headers during slow patrol initialization.
- Add retry logic for transient failures (phantom, stream, empty response)
- Add environment variable overrides for infrastructure naming
- Add JSON report output per scenario
- Expand assertions with new validation types
- Add more comprehensive test scenarios
- Add docs/EVAL.md with usage documentation
The eval harness now better handles flaky AI responses and provides
detailed reports for debugging.