Commit Graph

10 Commits

Author SHA1 Message Date
CanbiZ (MickLesk)
ce5b7986ac fix: send telemetry BEFORE log collection in signal handlers
- Swap ensure_log_on_host/post_update_to_api order in on_interrupt, on_terminate, api_exit_script, and inline SIGHUP/SIGINT/SIGTERM traps
- For signal exits (>128): send telemetry immediately, then best-effort log collection
- Add 2>/dev/null || true to all I/O in signal handlers to prevent SIGPIPE
- Fix on_exit: exit_code=0 now reports 'done' instead of 'failed 1'
- Root cause: pct pull hangs on dying containers blocked telemetry updates, leaving 595+ records stuck in 'installing' daily
2026-02-17 16:38:28 +01:00
CanbiZ (MickLesk)
c9ecb1ccca core: smart recovery for failed installs | extend exit_codes (#11221)
* feat(build.func): smart error recovery menu for failed installations

Replace simple Y/n removal prompt with interactive recovery menu:

- Option 1: Remove container and exit (default, auto after 60s timeout)
- Option 2: Keep container for debugging
- Option 3: Retry installation with verbose mode enabled
- Option 4: Retry with 1.5x RAM and +1 CPU core (OOM errors only)

Improvements:
- Detect OOM errors (exit codes 137, 243) and offer resource increase
- Show human-readable error explanation using explain_exit_code()
- Recursive rebuild preserves ALL settings from advanced/app.vars/default.vars
- Settings preserved: Network (IP, Gateway, VLAN, MTU, Bridge), Features
  (Nesting, FUSE, TUN, GPU), Storage, SSH keys, Tags, Hostname, etc.
- Show rebuild summary before retry (old→new CTID, resources, network)
- New container ID generated automatically for rebuilds

This helps users recover from transient failures without re-running
the entire script manually.

* fix(api.func): fix duplicate exit codes and add missing error codes

Exit code fixes:
- Remove duplicate definitions for codes 243, 254 (Node.js vs DB)
- Reassign MySQL/MariaDB to 240-242, 244 (was 241-244)
- Reassign MongoDB to 250-253 (was 251-254)

New exit codes added (based on GitHub issues analysis):
- 6: curl couldn't resolve host (DNS failure)
- 7: curl failed to connect (network unreachable)
- 22: curl HTTP error (404, 429 rate limit, 500)
- 28: curl timeout (very common in download failures)
- 35: curl SSL error
- 102: APT lock held by another process
- 124: Command timeout
- 141: SIGPIPE (broken pipe)

Also update OOM detection to include exit code 134 (SIGABRT)
which is commonly seen in Node.js heap overflow issues.

Fixes based on analysis of ~500 GitHub issues.

* fix(exit-codes): sync error_handler.func and api.func with conflict-free code ranges

- Add curl error codes (6, 7, 22, 28, 35)
- Add APT lock code (102), timeout (124), signals (134, 141)
- Move Python codes: 210-212 → 160-162 (avoid Proxmox conflict)
- Move PostgreSQL codes: 231-234 → 170-173
- Move MySQL/MariaDB codes: 241-244 → 180-183
- Move MongoDB codes: 251-254 → 190-193
- Keep Node.js at 243-249, Proxmox at 200-231
- Both files now synchronized with identical mappings

* feat(exit-codes): add systemd and build error codes (150-154)

- 150: Systemd service failed to start
- 151: Systemd service unit not found
- 152: Permission denied (EACCES)
- 153: Build/compile failed (make/gcc/cmake)
- 154: Node.js native addon build failed (node-gyp)

Based on issue analysis: 57 service failures, 25 build failures, 22 node-gyp issues

* fix(build): restore smart recovery and add OOM/DNS retry paths

* feat(build): APT in-place repair, exit 1 subclassification, new exit codes

- Add APT/DPKG in-place recovery: detects exit 100/101/102/255 and exit 1
  with APT log patterns, offers to repair dpkg state and re-run install
  script without destroying the container
- Add exit 1 subclassification: analyzes combined log to identify root
  cause (APT, OOM, network, command-not-found) and routes to appropriate
  recovery option
- Add exit 10 hint: shows privileged mode / nesting suggestion
- Add exit 127 hint: extracts missing command name from logs
- Refactor recovery menu: use named option variables (APT_OPTION,
  OOM_OPTION, DNS_OPTION) instead of hardcoded option numbers, supports
  up to 6 dynamic options cleanly
- Map missing exit codes in api.func: curl 27/36/45/47/55, signals
  129 (SIGHUP) / 131 (SIGQUIT), npm 239

* feat(api+build): map 25 more exit codes, add SIGHUP trap, network/perm hints

api.func:
- Map 25+ new exit codes that were showing as 'Unknown' in telemetry:
  curl: 3, 16, 18, 24, 26, 32-34, 39, 44, 46, 48, 51, 52, 57, 59, 61,
  63, 79, 92, 95; signals: 125, 132, 144, 146
- Update code 8 description (FTP + apk untrusted key)
- Update header comment with full supported ranges

build.func:
- Add SIGHUP trap: reports 'failed/129' to API when terminal is closed,
  should significantly reduce the 2841 stuck 'installing' records
- Add exit 52 (empty reply) and 57 (poll error) to network issue
  detection for DNS override recovery option
- Add exit 125/126 hint: suggests privileged mode for permission errors

* fix: sync error_handler fallback, Alpine APK repair, retry limit

error_handler.func:
- Sync fallback explain_exit_code() with api.func: add 25+ codes that
  were missing (curl 16/18/24/26/27/32-34/36/39/44-48/51/52/55/57/59/
  61/63/79/92/95, signals 125/129/131/132/144/146, npm 239, code 3/8)
- Ensures consistent error descriptions even when api.func isn't loaded

build.func:
- Alpine APK repair: detect var_os=alpine and run 'apk fix && apk
  cache clean && apk update' instead of apt-get/dpkg commands
- Show 'Repair APK state' instead of 'APT/DPKG' in menu for Alpine
- Retry safety counter: OOM x2 retry limited to max 2 attempts
  (prevents infinite RAM doubling via RECOVERY_ATTEMPT env var)
- Show attempt count in rebuild summary

* fix(build): preserve exit code in ERR trap to prevent false exit_code=0

The ERR trap called ensure_log_on_host before post_update_to_api,
which reset \True to 0 (success). This caused ~15-20 records/day to be
reported as 'failed' with exit_code=0 instead of the actual error code.

Root cause chain:
1. Command fails with exit code N → ERR trap fires (\True = N)
2. ensure_log_on_host succeeds → \True becomes 0
3. post_update_to_api 'failed' '\True' → sends 'failed/0' (wrong!)
4. POST_UPDATE_DONE=true → EXIT trap skips the correct code

Fix: capture \True into _ERR_CODE before ensure_log_on_host runs.

* Implement telemetry settings and repo source detection

Add telemetry configuration and repository source detection function.
2026-02-17 12:14:46 +01:00
CanbiZ (MickLesk)
896714e06f core/vm's: ensure script state is sent on script exit (#11991)
* Ensure API update is sent on script exit

Add exit-time telemetry handling across scripts to avoid orphaned "installing" records. Introduce local exit_code capture in api_exit_script and cleanup handlers and, when POST_TO_API_DONE is true but POST_UPDATE_DONE is not, post a final status (marking failures on non-zero exit codes, or marking done/failed in VM cleanups based on exit code). Changes touch misc/build.func, misc/vm-core.func and various vm/*-vm.sh cleanup functions to reliably send post_update_to_api on normal or early exits.

* Update api.func

* fix(telemetry): add missing exit codes to explain_exit_code()

- Add curl error codes: 4, 5, 8, 23, 25, 30, 56, 78
- Add code 10: Docker/privileged mode required (used in ~15 scripts)
- Add code 75: Temporary failure (retry later)
- Add BSD sysexits.h codes: 64-77
- Sync error_handler.func fallback with canonical api.func
2026-02-16 17:14:00 +01:00
CanbiZ (MickLesk)
cecadf5681 core: improve error reporting with structured error strings and better categorization + output formatting (#11907)
* fix(telemetry): improve error reporting with structured error strings and better categorization

- Add build_error_string() that creates structured format:
  'exit_code=N | description\n---\n<last 20 log lines>'
- Fix categorize_error() to map ALL known exit codes:
  - Added: shell(1,2), proxmox(200-231), service(150-154),
    database(170-193), runtime(243-249), signal(139,141,143)
  - Split timeout from network (28 was in both)
  - Added DPKG(255) to dependency category
- Update all API functions to use build_error_string():
  post_update_to_api, post_update_to_api_extended,
  post_tool_to_api, post_addon_to_api
- Add ensure_log_on_host() calls to on_exit, on_interrupt,
  on_terminate handlers to prevent race condition where
  telemetry reports before container log is pulled to host

* fix(ui): improve error output formatting and remove redundant log paths

- error_handler: Use msg_info/msg_ok/msg_warn for container cleanup
  instead of raw echo with manual ANSI codes
- error_handler: Add  icon before 'Remove broken container?' prompt
- error_handler: Indent log output with TAB for visual consistency
- build.func: Use msg_custom for installation log path display
- build.func: Use msg_info → msg_ok for container removal flow
- build.func: Use msg_warn for 'kept for debugging' message
- core.func/vm-core.func: Remove redundant container-internal log
  path display (📋 View full log) since combined log on host is
  the canonical location shown after failure
2026-02-14 15:28:30 +01:00
CanbiZ (MickLesk)
f23414a1a8 Retry reporting with fallback payloads (#11885)
Enhance post_update_to_api to support a "force" mode and robust retry logic: add a 3rd-arg bypass to duplicate suppression, capture a short error summary, and perform up to three POST attempts (full payload, shortened error payload, minimal payload) with HTTP code checks and small backoffs. Mark POST_UPDATE_DONE on success (or after three attempts) to avoid infinite retries. Also invoke post_update_to_api with the "force" flag from cleanup paths in build.func and error_handler.func so a final status update is attempted after cleanup.
2026-02-13 13:49:49 +01:00
CanbiZ (MickLesk)
f7cf7c8adc fix(telemetry): prevent stuck 'installing' status in error_handler.func (#11845)
The catch_errors() function in CT scripts overrides the API telemetry
traps set by build.func. This caused on_exit, on_interrupt, and
on_terminate to never call post_update_to_api, leaving telemetry
records permanently stuck on 'installing'.

Changes:
- on_exit: Report orphaned 'installing' records on ANY exit where
  post_to_api was called but post_update_to_api was not
- on_interrupt: Call post_update_to_api('failed', '130') before exit
- on_terminate: Call post_update_to_api('failed', '143') before exit

All calls are guarded by POST_UPDATE_DONE flag to prevent duplicates.
2026-02-12 20:05:27 +01:00
CanbiZ (MickLesk)
406d53ea2f core: remove old Go API and extend misc/api.func with new backend (#11822)
* Remove Go API and extend misc/api.func

Delete the Go-based API (api/main.go, api/go.mod, api/go.sum, api/.env.example) and significantly enhance misc/api.func. The shell telemetry file now includes telemetry configuration, repo source detection, GPU/CPU/RAM detection, expanded explain_exit_code mappings, and refactored post_to_api/post_to_api_vm to send non-blocking telemetry to telemetry.community-scripts.org while respecting DIAGNOSTICS/DEV_MODE and adding richer metadata (cpu/gpu/ram/repo_source). Also updates header/author info and improves privacy/robustness and error handling.

* Start install timer and refine error reporting

Call start_install_timer during build startup and overhaul exit/error reporting.

Changes:
- Invoke start_install_timer early in misc/build.func to track install duration.
- Update api_exit_script comments to reference PocketBase/api.func and adjust ERR/SIGINT/SIGTERM traps to post numeric exit codes (use $? / 130 / 143) instead of command strings.
- Replace the previous explain_exit_code implementation with a conditional fallback: only define explain_exit_code if not already provided (api.func is the canonical source). Expanded and reorganized exit code mappings (curl, timeout, systemd, Node/Python/Postgres/MySQL/MongoDB, Proxmox, etc.).
- In error_handler: stop echoing the container log path (host shows combined log), and post a "failed" update to the API with the exit code before offering container cleanup.

Rationale: these changes make telemetry more consistent and robust (numeric codes), provide a safe fallback for exit descriptions when api.func isn't loaded, and ensure failures are reported to the API prior to any automatic cleanup.

* Report install start/failure to telemetry API

Add telemetry hooks in misc/build.func: call post_to_api at installation start to capture early or immediately-failing installs, and call post_update_to_api with status "failed" and the install exit code when a container installation fails. This improves visibility into install failures for monitoring/telemetry.
2026-02-12 11:55:13 +01:00
Tobias
c1fe8b91b4 chore: bump copyright to 2026 - happy new year (#10585)
* chore: bump copyright to 2026 - happy new year

* fix

* meilisearch fix source url

* livebook: fix space

* fix source cmd

* fix source cmd
2026-01-06 13:28:12 +01:00
CanbiZ
0f7c201b57 core: extend storage type support (rbd, nfs, cifs) and validation (iscidirect, isci, zfs, cephfs, pbs) (#9646)
* core: enhance storage type validation and error codes

Improve storage validation for LXC container creation with
explicit checks for unsupported storage types.

Changes:
- Add validation for storage types that don't support containers:
  - iscsidirect (exit 212) - VMs only
  - iscsi/zfs (exit 213) - no rootdir support
  - cephfs (exit 219) - use RBD instead
  - pbs (exit 224) - backups only
- Add connectivity check for network storage (linstor, rbd, nfs, cifs)
- Simplify storage content validation using pvesm status
- Reorganize Proxmox error codes (200-231) for consistency
- Update error messages to be more descriptive and actionable

This helps users identify storage compatibility issues early
before container creation fails with cryptic errors.

* Update build.func
2025-12-04 19:41:01 +01:00
CanbiZ
a826769899 Three-tier defaults system | security improvements | error_handler | improved logging | improved container creation | improved architecture (#9540)
* Refactor Core

Refactored misc/alpine-install.func to improve error handling, network checks, and MOTD setup. Added misc/alpine-tools.func and misc/error_handler.func for modular tool installation and error management. Enhanced misc/api.func with detailed exit code explanations and telemetry functions. Updated misc/core.func for better initialization, validation, and execution helpers. Removed misc/create_lxc.sh as part of cleanup.

* Delete config-file.func

* Update install.func

* Refactor stop_all_services function and variable names

Refactor service stopping logic and improve variable handling

* Refactor installation script and update copyright

Updated copyright information and adjusted package installation commands. Enhanced IPv6 disabling logic and improved container customization process.

* Update install.func

* Update license comment format in install.func

* Refactor IPv6 handling and enhance MOTD and SSH

Refactor IPv6 handling and update OS function. Enhance MOTD with additional details and configure SSH settings.

* big core refactor

* Enhance IPv6 configuration menu options

Updated IPv6 Address Management menu options for clarity and added a new option for fully disabling IPv6.

* Update default Node.js version to 24 LTS

* Update misc/alpine-tools.func

Co-authored-by: Michel Roegl-Brunner <73236783+michelroegl-brunner@users.noreply.github.com>

* indention

* remove debugf and duplicate codes

* Update whiptail backtitles and error codes

Removed '[dev]' from whiptail --backtitle strings for consistency. Refactored custom exit codes in build.func and error_handler.func: updated Proxmox error codes, shifted MySQL/MariaDB codes to 260-263, and removed unused MongoDB code. Updated error descriptions to match new codes.

* comments

* Refactor error handling and clean up debug comments

Standardized bash variable checks, removed unnecessary debug and commented code, and clarified error handling logic in container build and setup scripts. These changes improve code readability and maintainability without altering functional behavior.

* Update build.func

* feat: Improve LXC network checks and LINSTOR storage handling

Enhanced LXC container network setup to check for both IPv4 and IPv6 addresses, added connectivity (ping) tests, and provided troubleshooting tips on failure. Updated storage validation to support LINSTOR, including cluster connectivity checks and special handling for LINSTOR template storage.

---------

Co-authored-by: Michel Roegl-Brunner <73236783+michelroegl-brunner@users.noreply.github.com>
2025-12-04 07:52:18 +01:00