Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-02-18 00:17:39 +01:00

Author	SHA1	Message	Date
rcourtman	3f0808e9f9	docs: comprehensive core and Pro documentation overhaul - Major updates to README.md and docs/README.md for Pulse v5 - Added technical deep-dives for Pulse Pro (docs/PULSE_PRO.md) and AI Patrol (docs/AI.md) - Updated Prometheus metrics documentation and Helm schema for metrics separation - Refreshed security, installation, and deployment documentation for unified agent models - Cleaned up legacy summary files	2026-01-07 17:38:27 +00:00
rcourtman	9cfcdbb247	fix: Use per-node shared flag for storage deduplication The storage deduplication logic only checked cluster config's Shared flag, but this required the cluster config API call to succeed. When the per-node storage API already returns shared=1 (as the user verified), we should use that directly. Now we check three sources for shared storage detection: 1. Per-node API shared flag (storage.Shared) 2. Cluster config shared flag (if available) 3. Storage type heuristics (NFS, RBD, PBS, etc.) Related to #1049	2026-01-07 10:16:23 +00:00
rcourtman	2b48b0a459	feat: add --kube-include-all-deployments flag for Kubernetes agent Adds IncludeAllDeployments option to show all deployments, not just problem ones (where replicas don't match desired). This provides parity with the existing --kube-include-all-pods flag. - Add IncludeAllDeployments to kubernetesagent.Config - Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var - Update collectDeployments to respect the new flag - Add test for IncludeAllDeployments functionality - Update UNIFIED_AGENT.md documentation Addresses feedback from PR #855	2025-12-18 20:58:30 +00:00
rcourtman	8948e84fe5	feat: AI features, agent improvements, and host monitoring enhancements AI Chat Integration: - Multi-provider support (Anthropic, OpenAI, Ollama) - Streaming responses with markdown rendering - Agent command execution for remote troubleshooting - Context-aware conversations with host/container metadata Agent Updates: - Add --enable-proxmox flag for automatic PVE/PBS token setup - Improve auto-update with semver comparison (prevents downgrades) - Add updatedFrom tracking to report previous version after update - Reduce initial update check delay from 30s to 5s - Add agent version column to Hosts page table Host Metrics: - Add DiskIO stats collection (read/write bytes, ops, time) - Improve disk filtering to exclude Docker overlay mounts - Add RAID array monitoring via mdadm - Enhanced temperature sensor parsing Frontend: - New Agent Version column on Hosts overview table - Improved node modal with agent-first installation flow - Add DiskIO display in host drawer - Better responsive handling for metric bars	2025-12-05 10:37:02 +00:00
courtmanr@gmail.com	fd39196166	refactor: finalize documentation overhaul - Refactor specialized docs for conciseness and clarity - Rename files to UPPER_CASE.md convention - Verify accuracy against codebase - Fix broken links	2025-11-25 00:45:20 +00:00
rcourtman	b72fc2ab79	docs: align sensor proxy config with current defaults	2025-11-20 12:40:01 +00:00
rcourtman	e39c6a3660	docs(sensor-proxy): comprehensive config management documentation Adds complete documentation for the new sensor-proxy config management CLI implemented in Phase 2. Addresses user-facing aspects of the corruption fix. New Documentation: - docs/operations/sensor-proxy-config-management.md (469 lines) - Complete operations runbook for config management - Full CLI reference with examples - Migration guide from inline config - Architecture explanation - Common operational tasks - Troubleshooting guide - Best practices and automation Updated Documentation: - cmd/pulse-sensor-proxy/README.md - Configuration Management CLI section - Allowed Nodes File format - Enhanced troubleshooting - Config corruption recovery - docs/TEMPERATURE_MONITORING.md - Config validation failure troubleshooting - Configuration Management quick reference - Cross-links to detailed docs - docs/TROUBLESHOOTING.md - Sensor proxy config validation errors - Comprehensive diagnosis steps - Automatic and manual recovery - README.md & docs/README.md - Added new runbook to operations index - Positioned for discoverability Coverage: - Both CLI commands fully documented - Phase 1 & Phase 2 architecture explained - Migration path from pre-v4.31.1 - Config corruption recovery procedures - Safe config editing practices - Automation examples - Troubleshooting all failure modes Documentation Quality: - Cross-linked from 5 different documents - Clear examples for common use cases - Target audience: system administrators - Follows project documentation style - Production-ready This completes the sensor-proxy config corruption fix by providing users with comprehensive guidance for the new config management system. Related to Phase 2 commits `3dc073a28`, `804a638ea`, `131666bc1`	2025-11-19 10:01:33 +00:00
rcourtman	a4eb70af96	docs: document sensor proxy log forwarding	2025-11-14 01:12:25 +00:00
rcourtman	bffc8f3f83	docs: add auto-update runbook	2025-11-14 01:05:06 +00:00
rcourtman	3c41d3960c	docs: add operations runbooks and audit fixes	2025-11-14 01:01:21 +00:00
rcourtman	68ce8e7520	feat: finalize swarm service monitoring (#598)	2025-10-26 09:35:49 +00:00
rcourtman	e0396c1362	docs: update documentation for diagnostics improvements Add comprehensive operator documentation for the new observability features introduced in the previous commit. New Documentation: - docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new Prometheus metrics with alert suggestions Updated Documentation: - docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers, explain diagnostics endpoint caching behavior - docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs using request IDs - docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists with new per-node and scheduler metrics - docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation defaults These updates ensure operators understand: - How to set up monitoring/alerting for new metrics - How to configure file logging with rotation - How to troubleshoot using request correlation - What metrics are available for dashboards Related to: `495e6c794` (feat: comprehensive diagnostics improvements)	2025-10-21 12:45:19 +00:00
rcourtman	ddc9a7a068	docs: comprehensive documentation for rate limit fix and configurability Document the pulse-sensor-proxy rate limiting bug fix and new configurability across all relevant documentation: TEMPERATURE_MONITORING.md: - Added 'Rate Limiting & Scaling' section with symptom diagnosis - Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments - Provided tuning formula: interval_ms = polling_interval / node_count TROUBLESHOOTING.md: - Added 'Temperature data flickers after adding nodes' section - Step-by-step diagnosis using limiter metrics and scheduler health - Quick fix with config example CONFIGURATION.md: - Added pulse-sensor-proxy/config.yaml reference section - Documented rate_limit.per_peer_interval_ms and per_peer_burst fields - Included defaults and example override pulse-sensor-proxy-runbook.md: - Updated quick reference with new defaults (1 req/sec, burst 5) - Added 'Rate Limit Tuning' procedure with 4 deployment profiles - Included validation steps and monitoring commands TEMPERATURE_MONITORING_SECURITY.md: - Updated rate limiting section with new defaults - Added configurable overrides guidance - Documented security considerations for production deployments Related commits: - `46b8b8d08`: Initial rate limit fix (hardcoded defaults) - `ca534e2b6`: Made rate limits configurable via YAML - `e244da837`: Added guidance for large deployments (30+ nodes)	2025-10-21 11:36:07 +00:00
rcourtman	2f43d67af9	docs: simplify Mermaid diagrams for better readability The previous diagrams were too complex and overwhelming. Simplified all diagrams to show core concepts clearly: - Adaptive polling: reduced to basic scheduler→queue→workers flow - Temperature proxy: simplified to 3-box trust boundary view - Sensor proxy sequence: simplified to essential request flow - Webhook pipeline: reduced to template→send→retry flow - Script library: simplified to code→test→bundle→dist flow Fixed parsing error in temperature proxy diagram (parentheses in edge label causing render failure). Diagrams should clarify architecture, not recreate implementation.	2025-10-21 10:50:40 +00:00
rcourtman	85ffe10aed	docs: add Mermaid diagrams to improve visual documentation Enhance documentation with six Mermaid diagrams to better explain complex system implementations: - Adaptive polling lifecycle flowchart showing enqueue→execute→feedback cycle with scheduler, priority queue, and worker interactions - Circuit breaker state machine diagram illustrating Closed↔Open↔Half-open transitions with triggers and recovery paths - Temperature proxy architecture diagram highlighting trust boundaries, security controls, and data flow between host/container/cluster - Sensor proxy request flow sequence diagram showing auth, rate limiting, validation, and SSH execution pipeline - Alert webhook pipeline flowchart detailing template resolution, URL rendering, HTTP dispatch, and retry logic - Script library workflow diagram illustrating dev→test→bundle→distribute lifecycle emphasizing modular design These visualizations make it easier for operators and contributors to understand Pulse's sophisticated architectural patterns.	2025-10-21 10:40:33 +00:00
rcourtman	c91b7874ac	docs: comprehensive v4.24.0 documentation audit and updates Complete documentation overhaul for Pulse v4.24.0 release covering all new features and operational procedures. Documentation Updates (19 files): P0 Release-Critical: - Operations: Rewrote ADAPTIVE_POLLING_ROLLOUT.md as GA operations runbook - Operations: Updated ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md with DEFERRED status - Operations: Enhanced audit-log-rotation.md with scheduler health checks - Security: Updated proxy hardening docs with rate limit defaults - Docker: Added runtime logging and rollback procedures P1 Deployment & Integration: - KUBERNETES.md: Runtime logging config, adaptive polling, post-upgrade verification - PORT_CONFIGURATION.md: Service naming, change tracking via update history - REVERSE_PROXY.md: Rate limit headers, error pass-through, v4.24.0 verification - PROXY_AUTH.md, OIDC.md, WEBHOOKS.md: Runtime logging integration - TROUBLESHOOTING.md, VM_DISK_MONITORING.md, zfs-monitoring.md: Updated workflows Features Documented: - X-RateLimit-* headers for all API responses - Updates rollback workflow (UI & CLI) - Scheduler health API with rich metadata - Runtime logging configuration (no restart required) - Adaptive polling (GA, enabled by default) - Enhanced audit logging - Circuit breakers and dead-letter queue Supporting Changes: - Discovery service enhancements - Config handlers updates - Sensor proxy installer improvements Total Changes: 1,626 insertions(+), 622 deletions(-) Files Modified: 24 (19 docs, 5 code) All documentation is production-ready for v4.24.0 release.	2025-10-20 17:20:13 +00:00
rcourtman	469d11fc7e	docs: add comprehensive scheduler health API documentation Add detailed API reference and update rollout playbook: New: docs/api/SCHEDULER_HEALTH.md - Complete endpoint reference for /api/monitoring/scheduler/health - Request/response structure with field descriptions - Enhanced "instances" array documentation - Example responses showing all states (healthy, transient, DLQ) - Useful jq queries for troubleshooting: - Find instances with errors - List DLQ entries - Show open circuit breakers - Sort by failure streaks - Migration guide (legacy → new fields) - Troubleshooting examples with real scenarios Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Enhanced "Accessing Scheduler Health API" section (§6) - Added examples using new instances[] array - Updated queries to use pollStatus, breaker, deadLetter fields - Practical jq commands for operators Key Documentation Features: - Complete JSON schema with examples - All new fields documented with types and descriptions - Real-world troubleshooting scenarios - Copy-paste ready jq queries - Migration path for existing integrations - Backward compatibility notes Operators can now: - Find error messages without log digging - Understand circuit breaker states - Track DLQ entries with full context - Diagnose issues using single API call Part of Phase 2 follow-up - enhanced observability	2025-10-20 15:13:38 +00:00
rcourtman	ce5ad64810	docs: defer circuit breaker/DLQ management endpoints (Phase 2 Task 11) Document decision to defer mutation endpoints after soak testing: Assessment Results: - Integration tests (55s, 12 instances): Automatic recovery worked perfectly - Soak tests (2-240min, 80 instances): No manual intervention needed - Circuit breakers: Opened/closed automatically as designed - DLQ routing: Permanent failures handled correctly Current Capabilities (Sufficient): - Read-only scheduler health API provides full visibility - Operator workarounds: service restart, feature flag toggle - Grafana alerting: queue depth, staleness, DLQ, breakers Why Defer: - No operational need demonstrated in testing - Implementation requires auth/RBAC/audit/UI work - Cost not justified until production usage reveals need - Can add later when data shows actual pain points Future Design Notes: - POST /api/monitoring/breakers/{instance}/reset - POST /api/monitoring/dlq/retry (all or specific) - DELETE /api/monitoring/dlq/{instance} - Auth, audit, rate limiting, UI integration required Re-evaluation Criteria: - Operators request controls >3x in 30 days - Troubleshooting steps inadequate - Service restarts too disruptive - Production incidents need surgical controls Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns. Part of Phase 2 - Adaptive Polling completion	2025-10-20 15:13:38 +00:00
rcourtman	cb8be81f1d	docs: add adaptive polling production rollout playbook (Phase 2 Task 10) Add comprehensive operator playbook for production enablement: Prerequisites: - Test suite validation (unit, integration, soak) - Monitoring readiness (Grafana dashboards, alerts) - Configuration management and rollback planning - Stakeholder sign-off Staging Rollout: - Feature flag enablement steps - Verification procedures (scheduler health API) - 24-48h observation window with success criteria - Metric checkpoints at 0h, 12h, 24h Production Rollout: - Gradual strategy (25% nodes every 2 hours) - Low-traffic maintenance window - Per-cluster monitoring during rollout - Success criteria and completion validation Grafana/Alert Configuration: - Dashboard panels: queue depth, staleness, throughput, breakers/DLQ - Alert thresholds: - Queue depth > 1.5× instances for >10min (Warning) - Staleness > 60s for >5min (Critical) - DLQ growth (Warning) - Stuck breakers >10min (Critical) Rollback Procedure: - Clear disable/restart steps - Verification of rollback success - Post-rollback actions and incident reporting Troubleshooting: - Symptom/cause/action table - Scheduler health API access guide - Immediate rollback triggers Operators can now safely enable adaptive polling following this step-by-step playbook. Part of Phase 2 Task 10 (Documentation)	2025-10-20 15:13:38 +00:00
rcourtman	d5c7a3494b	chore: remove deprecated Pulse+ agent metrics and add audit log rotation docs Removed all legacy Pulse+ agent metrics infrastructure (cloud-relay) which has been fully replaced by the new docker agent and temperature agent implementations. Changes: - Remove cloud-relay directory and all related binaries (relay, relay-linux, etc.) - Remove Pulse+ documentation (AGENT_METRICS_IMPLEMENTATION.md, AGENT_METRICS_SETUP.md) - Clean up pulse-relay references in workflows and release checklist - Add audit log rotation documentation for sensor proxy hash-chained logs - Update .gitignore to remove cloud-relay/ entry The new docker and temp agents remain fully functional and unaffected by this cleanup.	2025-10-20 15:13:38 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00

21 Commits