mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
Document decision to defer mutation endpoints after soak testing:
**Assessment Results:**
- Integration tests (55s, 12 instances): Automatic recovery worked perfectly
- Soak tests (2-240min, 80 instances): No manual intervention needed
- Circuit breakers: Opened/closed automatically as designed
- DLQ routing: Permanent failures handled correctly
**Current Capabilities (Sufficient):**
- Read-only scheduler health API provides full visibility
- Operator workarounds: service restart, feature flag toggle
- Grafana alerting: queue depth, staleness, DLQ, breakers
**Why Defer:**
- No operational need demonstrated in testing
- Implementation requires auth/RBAC/audit/UI work
- Cost not justified until production usage reveals need
- Can add later when data shows actual pain points
**Future Design Notes:**
- POST /api/monitoring/breakers/{instance}/reset
- POST /api/monitoring/dlq/retry (all or specific)
- DELETE /api/monitoring/dlq/{instance}
- Auth, audit, rate limiting, UI integration required
**Re-evaluation Criteria:**
- Operators request controls >3x in 30 days
- Troubleshooting steps inadequate
- Service restarts too disruptive
- Production incidents need surgical controls
Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns.
Part of Phase 2 - Adaptive Polling completion