rcourtman ce5ad64810 docs: defer circuit breaker/DLQ management endpoints (Phase 2 Task 11)
Document decision to defer mutation endpoints after soak testing:

**Assessment Results:**
- Integration tests (55s, 12 instances): Automatic recovery worked perfectly
- Soak tests (2 to 240 min, 80 instances): No manual intervention needed
- Circuit breakers: Opened/closed automatically as designed
- DLQ routing: Permanent failures handled correctly

**Current Capabilities (Sufficient):**
- Read-only scheduler health API provides full visibility
- Operator workarounds: service restart, feature flag toggle
- Grafana alerting: queue depth, staleness, DLQ, breakers

**Why Defer:**
- No operational need demonstrated in testing
- Implementation requires auth/RBAC/audit/UI work
- Cost not justified until production usage reveals need
- Can add later when data shows actual pain points

**Future Design Notes:**
- POST /api/monitoring/breakers/{instance}/reset
- POST /api/monitoring/dlq/retry (all or specific)
- DELETE /api/monitoring/dlq/{instance}
- Auth, audit, rate limiting, UI integration required

**Re-evaluation Criteria:**
- Operators request manual controls more than 3 times in 30 days
- Documented troubleshooting steps prove inadequate
- Service restarts prove too disruptive
- Production incidents require surgical controls

Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns.

Part of Phase 2 - Adaptive Polling completion
2025-10-20 15:13:38 +00:00