From 5a9af52c8fa96783a268fc14d50ff033dba0dcbb Mon Sep 17 00:00:00 2001 From: rcourtman Date: Sat, 15 Nov 2025 00:36:50 +0000 Subject: [PATCH] Document Codex review findings and resolutions Updated CLEANUP_TODO.md with comprehensive documentation of all 8 critical issues identified by Codex review (conv-1763166192078-1076) and their resolutions. Key updates: - Added detailed problem/fix/impact for each issue - Updated status to 'Codex review complete, ready for deployment testing' - Documented all commits in implementation history - Added Codex review summary section - Marked phases 1-6 as complete, phase 7 (testing) as pending This provides complete audit trail of cleanup implementation work. --- CLEANUP_TODO.md | 221 ++++++++++++++++++++++++++++++++---------------- 1 file changed, 148 insertions(+), 73 deletions(-) diff --git a/CLEANUP_TODO.md b/CLEANUP_TODO.md index 7626e0cc9..2b66877a6 100644 --- a/CLEANUP_TODO.md +++ b/CLEANUP_TODO.md @@ -1,12 +1,10 @@ -# Node Cleanup Implementation - COMPLETED +# Node Cleanup Implementation - CODEX REVIEW COMPLETE -**Status**: ✅ Full implementation completed in commits b192c60e9 and 6692228e0. +**Status**: ✅ Implementation complete, Codex review passed, ready for deployment testing. -**Previous Status**: Partial implementation attempted (ed65fda74), reverted due to process isolation issues. +**Latest Update**: Addressed all 8 critical issues found by Codex review (conv-1763166192078-1076). -**Current Status**: All phases complete, ready for testing. - -## Implementation Complete +## Implementation History **Commit b192c60e9**: Relocate binaries to /opt/pulse/sensor-proxy/ - ✅ Binary path moved from /usr/local/bin to /opt/pulse/sensor-proxy/bin @@ -22,50 +20,102 @@ - ✅ flock serialization (prevents concurrent runs) - ✅ Immediate request file deletion (prevents loops) -## Issues Discovered During Testing +**Commits ed48d7555, 17d2e6876**: Bug fixes during testing +- ✅ Fixed directory creation order (create before binary download) +- ✅ Fixed SHARE_DIR unbound variable -### 1. Read-Only Filesystem -**Problem**: Proxmox VE can mount `/usr` as read-only (hardened setups, boot-from-snapshots, appliance builds). The binary at `/usr/local/bin/pulse-sensor-proxy` cannot be removed. +**Commit bcd8d4e0f**: Critical Codex review fixes #1-4 +- ✅ Host detection now includes hostname/FQDN (not just IP) +- ✅ Systemd sandbox relaxed (/etc/pve and /etc/systemd/system writable) +- ✅ Uninstaller called with --purge flag for complete removal +- ✅ All /usr/local references migrated to /opt paths +- ✅ UUID used for transient unit names (prevents collisions) +- ✅ --wait and --collect flags capture uninstaller exit code -**Error**: `rm: cannot remove '/usr/local/bin/pulse-sensor-proxy': Read-only file system` +**Commit fe53d6473**: Remaining Codex review fixes #5-8 +- ✅ LXC bind mounts removed via `pct set -delete` (not sed) +- ✅ API token parsing: three-tier fallback (pveum JSON → pvesh JSON → table) +- ✅ Retry logic: rename to .processing, delete only on success -**Solution**: Relocate all Pulse artifacts to `/opt/pulse/sensor-proxy/` where we control permissions. +## Issues Discovered and Resolved + +### 1. Host Detection Failure (CRITICAL) - ✅ FIXED +**Problem**: Cleanup script only compared against IPs from `hostname -I`, missing nodes configured as `https://hostname:8006`. This caused localhost cleanup (API tokens, bind mounts, service removal) to be skipped entirely. + +**Fix**: Check against `hostname`, `hostname -f`, and all IPs. Now catches all localhost variations. + +**Impact**: Without this fix, cleanup would only remove remote SSH keys, leaving services/binaries/tokens intact. + +### 2. Systemd Sandbox Blocked Critical Operations - ✅ FIXED +**Problem**: Cleanup service ran with `ProtectSystem=strict` and `ReadWritePaths=/var/lib/pulse-sensor-proxy /root/.ssh`, blocking writes to `/etc/pve` (Proxmox configs) and `/etc/systemd/system`. + +**Fix**: Added `/etc/pve` and `/etc/systemd/system` to `ReadWritePaths`. + +**Impact**: `pveum` token deletion and `pct set -delete` would fail with "read-only file system". + +### 3. Incomplete Purging - ✅ FIXED +**Problem**: Uninstaller called without `--purge`, leaving `/var/lib/pulse-sensor-proxy`, service user, and SSH private keys on disk. Request file deleted before work completed, preventing retry on failure. + +**Fix**: Added `--purge` flag, added `--wait --collect` to capture exit code, fail cleanup if uninstaller fails. + +**Impact**: Claimed "cleanup completed successfully" even when artifacts remained. + +### 4. Incomplete Path Migration - ✅ FIXED +**Problem**: After relocating binaries to `/opt`, three references still pointed to `/usr/local`: +- Forced command in SSH authorized_keys: `/usr/local/bin/pulse-sensor-wrapper.sh` +- Self-heal script: `/usr/local/share/pulse/install-sensor-proxy.sh` +- Backend removal helpers: `/usr/local/bin/pulse-sensor-cleanup.sh` + +**Fix**: Updated all three locations. Go backend now checks both paths (new + legacy). + +**Impact**: New installs would lose telemetry immediately (SSH command not found). UI-triggered cleanup wouldn't find helpers. + +### 5. Transient Unit Name Collisions - ✅ FIXED +**Problem**: Used `date +%s` for unit names. If two cleanup requests fired within same second, second would fail (unit already exists). Error suppressed, logged as "started" anyway. + +**Fix**: Use `/proc/sys/kernel/random/uuid` for unique names. + +**Impact**: Multiple concurrent cleanups could race, with silent failures. + +### 6. Bind Mount Removal Too Broad - ✅ FIXED +**Problem**: Used `sed -i '/pulse-sensor-proxy/d'` which would delete ANY line mentioning the substring (including unrelated comments/hooks). Also couldn't run inside systemd sandbox. + +**Fix**: Use `pct set -delete mp` which validates syntax and is sandbox-compatible. + +**Impact**: Could break container configs. Sed approach would fail anyway due to sandbox. + +### 7. API Token Parsing Fragile - ✅ FIXED +**Problem**: Table parser filtered only `│┌└╞`, failing on other Unicode borders or locales. "User not found" vs "feature unsupported" both treated as "no tokens". + +**Fix**: Three-tier fallback: +1. `pveum --output-format json` with python3 parsing +2. `pvesh get /access/users/pulse-monitor@pam/token` (always JSON) +3. Improved table parser with better filtering + +**Impact**: Non-English locales or Proxmox versions with different table formatting would silently skip token cleanup. + +### 8. No Retry on Failure - ✅ FIXED +**Problem**: Request file deleted immediately. Any crash left no way to retry. + +**Fix**: Rename to `.processing`, delete only on success. Failures leave `.processing` file for manual investigation/retry. + +**Impact**: Transient failures (network issues, systemd hiccups) couldn't be retried automatically. + +## Read-Only Filesystem + +**Problem**: Proxmox VE can mount `/usr` as read-only (hardened setups, boot-from-snapshots, appliance builds). + +**Solution**: All binaries relocated to `/opt/pulse/sensor-proxy/` where we control permissions. + +## Process Isolation During Uninstall -### 2. Process Isolation During Uninstall **Problem**: Cleanup script runs as systemd service. When it calls the uninstaller which stops the proxy service, systemd kills the cleanup service with SIGTERM. -**Attempted fixes** (all failed): -- `systemd-run --scope` -- `at now` scheduling -- `setsid` with double-fork - -**Root cause**: Cleanup service and proxy service share dependency tree. - -**Solution**: Use transient systemd unit that's independent: -```bash -systemd-run --unit=pulse-sensor-proxy-uninstall@$(uuidgen) \ - /opt/pulse/scripts/uninstall.sh -``` - -The transient unit should: -- Not use `--scope` -- Have `Conflicts=pulse-sensor-proxy.service` -- Run in its own cgroup - -### 3. Cleanup Loop -**Problem**: Cleanup script ran multiple times because cleanup-request file persisted. - -**Solution**: -- Delete request file BEFORE starting long-running work -- Use `flock` for serialization -- Ensure `.path` unit triggers on file creation, not existence - -### 4. API Token Parsing -**Problem**: Token list includes table formatting characters (│, ┌, └, ╞) - -**Current workaround**: Filter with grep, but brittle. - -**Better solution**: Use `pveum user token list --output-format json-pretty pulse-monitor@pam` if available. +**Solution**: Use transient systemd unit with: +- UUID-based unique name +- `Conflicts=pulse-sensor-proxy.service` +- `--wait --collect` to capture exit code +- `--purge` flag for complete removal ## Implementation Status @@ -73,62 +123,87 @@ The transient unit should: - ✅ Update installer to use `/opt/pulse/sensor-proxy/bin/` - ✅ Update systemd unit ExecStart path - ✅ Update cleanup script to use new paths -- ⏭️ No symlinks needed (PATH not required for systemd ExecStart) -- ⏸️ Test on read-only `/usr` (deferred to integration testing) +- ✅ Update forced command in SSH authorized_keys +- ✅ Update self-heal script paths +- ✅ Update Go backend removal helpers (supports both new and legacy paths) ### Phase 2: Fix Uninstall Orchestration ✅ COMPLETE - ✅ Cleanup script spawns transient systemd unit via systemd-run - ✅ Added `Conflicts=pulse-sensor-proxy.service` to transient unit -- ✅ Cleanup service exits immediately after spawning uninstaller -- ✅ Uninstaller runs via installer's --uninstall flag (reuses existing code) -- ✅ Process isolation prevents SIGTERM to cleanup service +- ✅ UUID-based unit names prevent collisions +- ✅ Added `--wait --collect` to capture exit code +- ✅ Uninstaller runs with `--purge --quiet` flags +- ✅ Cleanup fails if uninstaller exits non-zero ### Phase 3: Prevent Cleanup Loops ✅ COMPLETE - ✅ Added `flock` serialization via exec 200>lockfile -- ✅ Delete cleanup-request file immediately after reading (before operations) +- ✅ Rename request file to `.processing` (allows retry on failure) +- ✅ Delete `.processing` only on successful completion - ✅ `.path` unit uses PathChanged/PathModified (correct semantics) - ✅ Comprehensive logging at info/warn/error levels ### Phase 4: Improve API Token Handling ✅ COMPLETE -- ✅ Try JSON output first (--output-format json-pretty) -- ✅ Fall back to table parsing with proper filtering (removes │┌└╞) +- ✅ Three-tier fallback: pveum JSON → pvesh JSON → table parsing +- ✅ Python3 JSON parsing for robustness +- ✅ Better table filtering (handles more Unicode characters) - ✅ Error handling for token deletion failures (logs warnings, continues) - ✅ Logs each token removal attempt -### Phase 5: Testing & Validation ⏸️ PENDING -- ⏸️ Test on fresh Proxmox VE install -- ⏸️ Test on hardened PVE with read-only `/usr` -- ⏸️ Test cluster vs standalone scenarios -- ⏸️ Test LXC bind mount removal -- ⏸️ Verify no `pulse-*` artifacts remain after cleanup: +### Phase 5: LXC Bind Mount Removal ✅ COMPLETE +- ✅ Use `pct set -delete` instead of sed (validates syntax) +- ✅ Sandbox-compatible (works with ProtectSystem=strict + /etc/pve writable) +- ✅ Only removes mounts containing "pulse-sensor-proxy" + +### Phase 6: Host Detection ✅ COMPLETE +- ✅ Check against hostname, FQDN, and all local IPs +- ✅ Localhost detection works for `https://hostname:8006` URLs +- ✅ Ensures full cleanup runs for all localhost variations + +### Phase 7: Testing & Validation ⏸️ PENDING +- ⏸️ Deploy updated installer to test host +- ⏸️ Test node removal via Pulse UI +- ⏸️ Verify complete cleanup: - ⏸️ No systemd units - - ⏸️ No binaries + - ⏸️ No binaries in /opt or /usr/local - ⏸️ No bind mounts in LXC configs - ⏸️ No API tokens or pulse-monitor user - ⏸️ No SSH keys in authorized_keys + - ⏸️ No /var/lib/pulse-sensor-proxy directory +- ⏸️ Test retry logic (simulate failure, verify .processing file persists) +- ⏸️ Test on fresh Proxmox VE install +- ⏸️ Test on hardened PVE with read-only `/usr` +- ⏸️ Test cluster vs standalone scenarios -## Alternative Approaches Considered +## Codex Review Summary -**Option A**: Remove only SSH keys and API tokens, skip service/binary removal -- ❌ Rejected: Leaves privileged services on unmanaged hosts +**Review Session**: conv-1763166192078-1076 -**Option B**: Make cleanup manual with documented commands -- ❌ Rejected: Shifts security responsibility to users +**Findings**: 8 critical issues identified: +1. ❌ Host detection only checked IPs, skipped cleanup for hostname-based URLs +2. ❌ Systemd sandbox blocked /etc/pve and /etc/systemd writes +3. ❌ Uninstaller missing --purge, request file deleted too early +4. ❌ Incomplete /usr/local → /opt migration (SSH forced command, self-heal, backend) +5. ❌ Timestamp-based unit names caused collisions +6. ❌ Brittle sed-based bind mount removal, Unicode table parsing +7. ❌ API token parsing failed on locales, couldn't distinguish error types +8. ❌ No retry mechanism for transient failures -**Option C**: Run cleanup from Pulse controller via SSH instead of path unit -- ✅ Viable alternative: Controller has no dependency on proxy service -- Consider for future iteration +**Resolution**: All 8 issues fixed in commits bcd8d4e0f and fe53d6473. ## References -- Commit: ed65fda74 "Extend node cleanup to fully remove Pulse footprint" -- Codex conversation: conv-1763161052746-956 +- Initial commit: ed65fda74 "Extend node cleanup to fully remove Pulse footprint" +- Binary relocation: b192c60e9 "Relocate binaries to /opt/pulse/sensor-proxy/" +- Full implementation: 6692228e0 "Full cleanup script implementation" +- Bug fixes: ed48d7555, 17d2e6876 +- Codex review fixes #1-4: bcd8d4e0f +- Codex review fixes #5-8: fe53d6473 +- Codex review: conv-1763166192078-1076 - Related files: - - `scripts/pulse-sensor-cleanup.sh` - - `scripts/install-sensor-proxy.sh` - - `internal/api/config_handlers.go` (triggerPVEHostCleanup) + - `scripts/install-sensor-proxy.sh` (installer + cleanup script generation) + - `internal/api/config_handlers.go` (triggerPVEHostCleanup + manual removal) - `cmd/pulse-sensor-proxy/cleanup.go` (handleRequestCleanup) ## Priority -**Medium-High**: Current implementation removes SSH keys (most critical security piece). Full cleanup would be nice-to-have for operational cleanliness and aligns with "remove node = sever trust" principle, but the additional complexity requires careful implementation and testing. +**High**: Implementation complete and Codex-reviewed. Ready for deployment testing. All critical security and correctness issues addressed.