docs: add adaptive polling production rollout playbook (Phase 2 Task 10)

Add comprehensive operator playbook for production enablement:

**Prerequisites:**
- Test suite validation (unit, integration, soak)
- Monitoring readiness (Grafana dashboards, alerts)
- Configuration management and rollback planning
- Stakeholder sign-off

**Staging Rollout:**
- Feature flag enablement steps
- Verification procedures (scheduler health API)
- 24-48h observation window with success criteria
- Metric checkpoints at 0h, 12h, 24h

**Production Rollout:**
- Gradual strategy (25% nodes every 2 hours)
- Low-traffic maintenance window
- Per-cluster monitoring during rollout
- Success criteria and completion validation

**Grafana/Alert Configuration:**
- Dashboard panels: queue depth, staleness, throughput, breakers/DLQ
- Alert thresholds:
  - Queue depth > 1.5× instances for >10min (Warning)
  - Staleness > 60s for >5min (Critical)
  - DLQ growth (Warning)
  - Stuck breakers >10min (Critical)

**Rollback Procedure:**
- Clear disable/restart steps
- Verification of rollback success
- Post-rollback actions and incident reporting

**Troubleshooting:**
- Symptom/cause/action table
- Scheduler health API access guide
- Immediate rollback triggers

Operators can now safely enable adaptive polling by following this step-by-step playbook.

Part of Phase 2 Task 10 (Documentation)
rcourtman
2025-10-20 13:26:39 +00:00
parent 14d06a1654
commit cb8be81f1d


@@ -0,0 +1,176 @@
# Adaptive Polling Rollout Playbook
This playbook guides operators through enabling the adaptive polling scheduler in
production. Follow the steps sequentially and record key checkpoints in the run
sheet for audit purposes.
---
## 1. Prerequisites
1. **Test suite status**
- `go test ./...` and `go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration`
- Adaptive polling soak test:
```
HARNESS_SOAK_MINUTES=15 go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerSoak -soak -timeout 30m
```
- All tests must pass within the last 24 hours (a pre-flight helper is sketched after this list).
2. **Monitoring readiness**
- Grafana dashboard updated with:
- `pulse_monitor_poll_queue_depth` (gauge)
- `pulse_monitor_poll_staleness_seconds` (gauge, per instance)
- `pulse_monitor_poll_total` and `pulse_monitor_poll_errors_total` (rate panels)
- Alerting panels for circuit breaker state (via scheduler health API).
- Alerts configured (see §4).
3. **Configuration management**
- Ensure staging and production environments are managed via `system.json` or appropriate env vars.
- Identify the operator owning flag toggles and service restarts.
4. **Rollback plan**
- Confirm ability to set `ADAPTIVE_POLLING_ENABLED=false` and restart `pulse-hot-dev` or equivalent service within 5 minutes.
- Document the `systemctl restart pulse-hot-dev` command path or container restart procedure.
5. **Stakeholder sign-off**
- Adaptive polling feature owner approves rollout window.
- SRE and on-call engineer acknowledge the playbook.
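
The test and rollback checks above can be bundled into one pre-flight run and recorded in the run sheet. A minimal sketch, assuming it is executed from the repository root and that `<staging-host>` is replaced with your staging instance:

```
set -e
go test ./...
go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration
HARNESS_SOAK_MINUTES=15 go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerSoak -soak -timeout 30m
# Confirm the scheduler health endpoint is reachable before booking the rollout window.
curl -sf http://<staging-host>:7655/api/monitoring/scheduler/health > /dev/null
echo "pre-flight OK $(date -u)"
```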
---
## 2. Staging Rollout
1. **Enable feature flag**
- Update staging configuration:
```
export ADAPTIVE_POLLING_ENABLED=true
```
or edit `system.json` and set `"adaptivePollingEnabled": true`.
- Restart hot-dev service / container to apply:
```
systemctl restart pulse-hot-dev
```
(Adapt to your env if using Docker/K8s.)
2. **Verification**
- `curl -s http://<staging-host>:7655/api/monitoring/scheduler/health | jq`
- Expect `"enabled": true`.
- Check Grafana dashboard for the staging cluster:
- Queue depth should stabilise near historic baseline (< instances × 1.5).
- Staleness gauges should stay below 60s for healthy instances.
- No persistent circuit breakers (`state != "closed"`) except known failing endpoints.
3. **Observation window**
- Monitor for 24-48 hours.
- Success criteria:
- No increase in polling failures or alert volume.
- Queue depth and staleness metrics remain within SLO (queue depth < 1.5× instance count, staleness < 60s).
- Scheduler health API shows empty dead-letter queue or expected entries only.
- Record key metric snapshots at 0h, 12h, 24h (a run-sheet snapshot helper is sketched after this list).
4. **Sign-off**
- If criteria met, proceed to production. Otherwise revert flag to false and investigate (§6).
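
The 0h/12h/24h snapshots can be captured with a small helper appended to the run sheet. A sketch; the output file name is arbitrary, and the `queue.depth` / `deadLetter.count` fields follow the health payload described in §6:

```
{
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  curl -s http://<staging-host>:7655/api/monitoring/scheduler/health \
    | jq '{enabled, queueDepth: .queue.depth, deadLetter: .deadLetter.count}'
} >> adaptive-polling-runsheet.log
```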
---
## 3. Production Rollout
1. **Rollout strategy**
- Perform during low-traffic maintenance window.
- Enable flag gradually by cluster or instance group (e.g., 25% of nodes every 2 hours); a per-wave helper is sketched after this list:
1. Update config (`ADAPTIVE_POLLING_ENABLED=true`) for first subset.
2. Restart service on those nodes.
3. Watch metrics for at least 30 minutes before continuing.
2. **Monitoring during rollout**
- Grafana dashboard per cluster:
- `poll_queue_depth`
- `poll_staleness_seconds`
- `poll_total` success/error ratio
- Scheduler health API:
```
curl -s http://<prod-host>:7655/api/monitoring/scheduler/health | jq
```
- Confirm `enabled: true`, `deadLetter.count` stable, `breakers` mostly empty.
3. **Success criteria**
- Queue depth rises temporarily but settles within threshold (< 1.5× instance count).
- Staleness stays below 60s for healthy instances.
- No unexplained increase in alert volume or API error rate.
- Dead-letter queue holds only known failing targets.
4. **Completion**
- After all nodes enabled, monitor for an additional 24h.
- Record final metric snapshot.
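
The per-wave enablement in step 1 can be scripted; a minimal sketch, assuming SSH access, the `pulse-hot-dev` unit name from the staging steps, and hypothetical host names for the first wave:

```
WAVE_HOSTS="pulse-prod-01 pulse-prod-02"   # hypothetical first 25% wave; substitute your nodes
for host in $WAVE_HOSTS; do
  # Flip the flag on the host first (env var or "adaptivePollingEnabled" in system.json),
  # then restart the service and confirm the scheduler picked it up.
  ssh "$host" 'sudo systemctl restart pulse-hot-dev'
  sleep 15
  curl -s "http://$host:7655/api/monitoring/scheduler/health" | jq '{enabled, queueDepth: .queue.depth}'
done
```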
---
## 4. Grafana & Alert Configuration
1. **Dashboard panels**
- **Queue Depth**: `pulse_monitor_poll_queue_depth`.
- Use single-stat with alert if > 1.5× active instances for > 10min.
- **Instance Staleness**: panel per instance type using `pulse_monitor_poll_staleness_seconds`.
- Alert threshold: > 60s for > 5min (excluding known failing instances).
- **Polling Throughput**: rate of `pulse_monitor_poll_total{result="success"}` vs `result="error"`.
- **Circuit Breakers / DLQ**: table from scheduler health API (via scripted datasource) highlighting non-closed breakers or DLQ entries.
2. **Alerts** (spot-check queries for these thresholds are sketched after this list)
- Queue depth > threshold for > 10min (Warning), > 20min (Critical).
- Staleness > 60s for > 5min (Critical).
- Dead-letter count increase > N (based on baseline) triggers Warning.
- Any breaker stuck in `open` for > 10min triggers Critical.
3. **Notification routing**
- Ensure alerts route to on-call + feature owner.
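
Before wiring the Grafana alerts, the threshold expressions can be spot-checked against Prometheus directly. A sketch, assuming Prometheus is reachable at `<prometheus-host>:9090` and using the metric names listed earlier in this playbook; replace `10` with your active instance count:

```
# Warning: queue depth above 1.5x the active instance count.
curl -sG http://<prometheus-host>:9090/api/v1/query \
  --data-urlencode 'query=pulse_monitor_poll_queue_depth > 1.5 * 10'
# Critical: any instance staler than 60 seconds.
curl -sG http://<prometheus-host>:9090/api/v1/query \
  --data-urlencode 'query=pulse_monitor_poll_staleness_seconds > 60'
# Throughput panel: overall error ratio over the last 5 minutes.
curl -sG http://<prometheus-host>:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(pulse_monitor_poll_total{result="error"}[5m])) / sum(rate(pulse_monitor_poll_total[5m]))'
```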
---
## 5. Rollback Procedure
1. **Disable adaptive polling**
- Set `ADAPTIVE_POLLING_ENABLED=false` (env or `system.json`).
- Restart service (`systemctl restart pulse-hot-dev` or equivalent).
2. **Verification**
- Scheduler health API should show `"enabled": false` (a combined disable-and-verify sketch follows this list).
- Queue depth returns to pre-feature baseline within 10-15 minutes.
- Staleness/queue alerts clear.
3. **Post-rollback actions**
- Notify stakeholders, capture metric snapshots showing recovery.
- File incident report if rollback triggered by outage.
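
Steps 1 and 2 can be run back-to-back on each affected host; a minimal sketch, mirroring the flag and restart mechanism used during enablement (adapt the unit name for Docker/K8s deployments):

```
export ADAPTIVE_POLLING_ENABLED=false   # or set "adaptivePollingEnabled": false in system.json
systemctl restart pulse-hot-dev
sleep 15
# Expect "false" before declaring the rollback complete.
curl -s http://<prod-host>:7655/api/monitoring/scheduler/health | jq '.enabled'
```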
---
## 6. Troubleshooting
| Symptom | Possible Cause | Action |
|---------|----------------|--------|
| Queue depth remains high (> 2× usual) | Insufficient workers, hidden breaker, misconfigured flag | Check scheduler health API for breaker states; consider increasing workers or reverting flag. |
| Staleness spikes across many instances | Backend API slowdown or connectivity issues | Inspect backend logs, network health; revert flag if duration > 15min. |
| Dead-letter count climbs rapidly | Downstream API failures | Investigate specific instances via scheduler health API; fix credential/connectivity issues or rollback. |
| Circuit breakers stuck half-open/open | Persistent transient failures | Review error logs; ensure backoff/rate limits are not starving retries; roll back if unresolved quickly. |
| Grafana panels flatline | Metrics exporter or job issue | Ensure Prometheus scraping is working; verify the service restarted with the flag. |
### Accessing Scheduler Health API
```
curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
```
Key fields (a one-call `jq` filter for these is sketched after the list):
- `queue.depth`, `queue.perType`
- `deadLetter.count`, task list
- `breakers` array with states (`closed`, `open`, `half_open`)
- `staleness` per instance
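
For quick triage, those fields can be pulled in a single call. A sketch; the per-entry `state` field name inside `breakers` is an assumption based on the states listed above:

```
curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '{
  queueDepth: .queue.depth,
  perType: .queue.perType,
  deadLetter: .deadLetter.count,
  unhealthyBreakers: [.breakers[]? | select(.state? != "closed")]
}'
```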
### When to Roll Back
Roll back immediately if any of the following occurs:
- Queue depth > 3× baseline for > 15min.
- Staleness > 120s on majority of instances.
- Dead-letter count doubles without clear cause.
- Customer-facing alerts or latency regressions attributed to adaptive polling.
Document the incident and notify stakeholders after rollback.