mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-02-18 00:17:39 +01:00
This change addresses intermittent "Guest details unavailable" and "Disk stats unavailable" errors affecting users with large VM deployments (50+ VMs) or high-load Proxmox environments.

Changes:
- Increased default guest agent timeouts (3-5s → 10-15s) to better handle environments under load
- Added automatic retry logic (1 retry by default) for transient timeout failures
- Made all timeouts and retry count configurable via environment variables:
  * GUEST_AGENT_FSINFO_TIMEOUT (default: 15s)
  * GUEST_AGENT_NETWORK_TIMEOUT (default: 10s)
  * GUEST_AGENT_OSINFO_TIMEOUT (default: 10s)
  * GUEST_AGENT_VERSION_TIMEOUT (default: 10s)
  * GUEST_AGENT_RETRIES (default: 1)
- Added comprehensive documentation in VM_DISK_MONITORING.md with configuration examples for different deployment scenarios

These improvements allow Pulse to gracefully handle intermittent API timeouts without immediately displaying errors, while remaining configurable for different network conditions and environment sizes.

Fixes: https://github.com/rcourtman/Pulse/discussions/592
327 lines
10 KiB
Markdown
# VM Disk Usage Monitoring

Pulse can show actual disk usage for VMs (just like containers) when the QEMU Guest Agent is installed and configured properly.

## Quick Summary

**Without QEMU Guest Agent:**
- VMs show "-" for disk usage (no data available)
- Cannot monitor actual disk usage inside the VM

**With QEMU Guest Agent:**
- VMs show real disk usage like containers do (e.g., "5.2GB used of 32GB / 16%")
- Accurate threshold alerts based on actual usage
- Better capacity planning with real data

## How It Works

Proxmox doesn't track VM disk usage natively (unlike containers, which share the host kernel). To get real disk usage from VMs:

1. The Proxmox API returns `disk=0` and `maxdisk=<allocated_size>` (this is normal)
2. Pulse automatically queries the QEMU Guest Agent API to get filesystem info
3. The guest agent reports all mounted filesystems from inside the VM
4. Pulse aggregates the data (filtering out special filesystems) and displays it

**Important**: This works with both API tokens and password authentication. API tokens work fine for guest agent queries when permissions are set correctly.

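The aggregation in step 4 can be sketched in shell. This is a minimal illustration, not Pulse's actual implementation: it assumes the filesystem list has already been filtered and reduced to one line per filesystem with hypothetical whitespace-separated columns `mountpoint used_bytes total_bytes`, and the function name `summarize_disk_usage` is made up for this example.

```bash
# Sketch only: sums per-filesystem usage into one VM-level figure.
# Input (stdin): "<mountpoint> <used_bytes> <total_bytes>" per line.
# This column layout is a hypothetical format chosen for the example;
# mount points containing spaces are not handled.
summarize_disk_usage() {
  local mount used total used_sum=0 total_sum=0
  while read -r mount used total; do
    [ -n "$mount" ] || continue          # skip blank lines
    used_sum=$((used_sum + used))
    total_sum=$((total_sum + total))
  done
  if [ "$total_sum" -eq 0 ]; then
    echo "-"                             # no usable data, mirroring the "-" shown in the UI
    return 0
  fi
  # Prints: <used_bytes> <total_bytes> <integer percent>%
  echo "$used_sum $total_sum $((used_sum * 100 / total_sum))%"
}

# Two filesystems: 4 GB of 20 GB and 1 GB of 30 GB
printf '%s\n' "/ 4000000000 20000000000" "/data 1000000000 30000000000" | summarize_disk_usage
# prints: 5000000000 50000000000 10%
```

Summing bytes before computing the percentage weights each filesystem by its size, so a nearly full small partition does not dominate the VM-level figure.
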
## Requirements

### 1. Install QEMU Guest Agent in Your VMs

**Linux VMs:**
```bash
# Debian/Ubuntu
apt-get install qemu-guest-agent
systemctl enable --now qemu-guest-agent

# RHEL/Rocky/AlmaLinux
yum install qemu-guest-agent
systemctl enable --now qemu-guest-agent

# Alpine
apk add qemu-guest-agent
rc-update add qemu-guest-agent
rc-service qemu-guest-agent start
```

**Windows VMs:**
- Download virtio-win guest tools from: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/
- Install the guest tools package which includes the QEMU Guest Agent
- The service starts automatically after installation

### 2. Enable Guest Agent in VM Options

In Proxmox web UI:
1. Select your VM
2. Go to **Options** → **QEMU Guest Agent**
3. Check **Enabled**
4. Stop and start the VM (a full power cycle; a reboot from inside the guest does not apply the change)

Or via CLI:
```bash
qm set <vmid> --agent enabled=1
```

### 3. Verify Guest Agent is Working

Check if the agent is responding:
```bash
qm agent <vmid> ping
```

Get filesystem info (what Pulse uses):
```bash
qm agent <vmid> get-fsinfo
```

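To run the ping check across every running VM on a node, a loop along these lines may help. It is a sketch under stated assumptions: the function name `check_guest_agents` is hypothetical, and it assumes `qm list` prints a header row followed by rows whose first and third columns are the VM ID and status.

```bash
# Sketch only: report which running VMs answer a guest agent ping.
# Assumes `qm list` output has a header row, then "VMID NAME STATUS ..." columns.
check_guest_agents() {
  local vmid
  for vmid in $(qm list | awk 'NR > 1 && $3 == "running" { print $1 }'); do
    if qm agent "$vmid" ping >/dev/null 2>&1; then
      echo "$vmid ok"
    else
      echo "$vmid no-response"
    fi
  done
}

# Run on the Proxmox host:
#   check_guest_agents
```

VMs reported as `no-response` are the ones to inspect with the steps in the Troubleshooting section.
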
### 4. Pulse Permissions

Pulse needs the right permissions to query the guest agent:

**Proxmox VE 8 and below:**
- Requires `VM.Monitor` for guest agent access
- `Sys.Audit` adds Ceph/cluster metrics and is applied when available
- Pulse setup script creates a `PulseMonitor` role with these privileges automatically

**Proxmox VE 9+:**
- Requires `VM.GuestAgent.Audit` for guest agent access
- `Sys.Audit` remains recommended for Ceph/cluster metrics
- Pulse setup script applies both via the `PulseMonitor` role (even if `PVEAuditor` lacks them)

**Both API tokens and passwords work** - tokens do NOT have any limitation accessing guest agent data.

When you run the Pulse setup script, it automatically detects your Proxmox version and sets the correct permissions. If setting up manually:

```bash
# Shared read-only access
pveum aclmod / -user pulse-monitor@pam -role PVEAuditor

# Extra privileges for guest metrics and Ceph
EXTRA_PRIVS=()

# Sys.Audit (Ceph, cluster status)
if pveum role list 2>/dev/null | grep -q "Sys.Audit"; then
    EXTRA_PRIVS+=(Sys.Audit)
else
    if pveum role add PulseTmpSysAudit -privs Sys.Audit 2>/dev/null; then
        EXTRA_PRIVS+=(Sys.Audit)
        pveum role delete PulseTmpSysAudit 2>/dev/null
    fi
fi

# VM guest agent / monitor privileges
VM_PRIV=""
if pveum role list 2>/dev/null | grep -q "VM.Monitor"; then
    VM_PRIV="VM.Monitor"
elif pveum role list 2>/dev/null | grep -q "VM.GuestAgent.Audit"; then
    VM_PRIV="VM.GuestAgent.Audit"
else
    if pveum role add PulseTmpVMMonitor -privs VM.Monitor 2>/dev/null; then
        VM_PRIV="VM.Monitor"
        pveum role delete PulseTmpVMMonitor 2>/dev/null
    elif pveum role add PulseTmpGuestAudit -privs VM.GuestAgent.Audit 2>/dev/null; then
        VM_PRIV="VM.GuestAgent.Audit"
        pveum role delete PulseTmpGuestAudit 2>/dev/null
    fi
fi

if [ -n "$VM_PRIV" ]; then
    EXTRA_PRIVS+=("$VM_PRIV")
fi

if [ ${#EXTRA_PRIVS[@]} -gt 0 ]; then
    PRIV_STRING="${EXTRA_PRIVS[*]}"
    pveum role delete PulseMonitor 2>/dev/null
    pveum role add PulseMonitor -privs "$PRIV_STRING"
    pveum aclmod / -user pulse-monitor@pam -role PulseMonitor
fi
```

## Troubleshooting

### Quick Diagnostic Tool

Pulse includes a diagnostic script that can identify why a VM isn't showing disk usage:

```bash
# Run on your Proxmox host (latest version from GitHub)
curl -sSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/test-vm-disk.sh | bash

# Or use the bundled copy installed with Pulse
/opt/pulse/scripts/test-vm-disk.sh
```

Enter the VM ID when prompted. The script will check:
- VM running status
- Guest agent configuration
- Guest agent runtime status
- Filesystem information
- API permissions

### Understanding Disk Display States

**Shows percentage** (e.g., "45%")
- Everything working correctly
- Guest agent installed and accessible

**Shows "-" with hover tooltip**
- Hover to see the specific reason
- Common reasons:
  - "Guest agent not running" - Agent not installed or service not started
  - "Guest agent disabled" - Not enabled in VM config
  - "Permission denied" - Token/user lacks required permissions
  - "Agent timeout" - Agent installed but not responding
  - "No filesystems" - Agent returned no usable filesystem data

### Guest Agent Not Responding

**Check if agent is running inside VM:**
```bash
# Linux
systemctl status qemu-guest-agent

# Windows (PowerShell)
Get-Service QEMU-GA
```

**Check VM configuration:**
```bash
# Should show "agent: 1" (or "agent: enabled=1" when set via the CLI command above)
qm config <vmid> | grep agent
```

**Check agent communication:**
```bash
# Should return without error
qm agent <vmid> ping
```

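The configuration check can be scripted so it accepts either spelling of the option. The helper below is a sketch: the name `agent_enabled` is hypothetical, and it assumes the `agent:` line in `qm config` output is a comma-separated list in which either `1` or `enabled=1` marks the agent as on.

```bash
# Sketch only: succeeds if `qm config <vmid>` output (on stdin) enables the agent.
# Assumes the option is written as "agent: 1" or "agent: enabled=1[,other=...]".
agent_enabled() {
  sed -n 's/^agent:[[:space:]]*//p' |   # keep only the value of the agent line
    tr ',' '\n' |                       # one sub-option per line
    grep -Eqx '1|enabled=1'             # exact match, so "fstrim_cloned_disks=1" does not count
}

# Run on the Proxmox host:
#   qm config <vmid> | agent_enabled && echo "agent enabled" || echo "agent disabled"
```
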
### Configuring Guest Agent Timeouts

**New in v4.27:** Guest agent timeouts and retry behavior can be configured via environment variables to handle high-load environments or slow networks (refs #592).

**Available Environment Variables:**

```bash
# Timeout for filesystem info queries (default: 15s, previously 5s)
GUEST_AGENT_FSINFO_TIMEOUT=15s

# Timeout for network interface queries (default: 10s, previously 5s)
GUEST_AGENT_NETWORK_TIMEOUT=10s

# Timeout for OS info queries (default: 10s, previously 3s)
GUEST_AGENT_OSINFO_TIMEOUT=10s

# Timeout for agent version queries (default: 10s, previously 3s)
GUEST_AGENT_VERSION_TIMEOUT=10s

# Number of retries for timeout failures (default: 1, meaning one retry after initial failure)
GUEST_AGENT_RETRIES=1
```

**When to Adjust:**
- **Large environments (50+ VMs):** Increase timeouts to 20-30s if you see frequent timeout errors
- **Slow networks/WAN:** Increase timeouts proportionally to network latency
- **High load periods:** Consider increasing retries to 2 for better resilience
- **Fast local network:** Can reduce timeouts to 5-8s for quicker feedback

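When raising both values, keep in mind that timeouts and retries multiply. The sketch below estimates an upper bound on how long a single query can block. It assumes each attempt may use the full timeout and that retries start immediately, which may not match Pulse's actual retry timing; `worst_case_wait` is a hypothetical helper that only understands whole-second values such as `15s`.

```bash
# Sketch only: upper bound on how long one guest agent query can block.
# Assumes every attempt may run for the full timeout and retries start immediately.
worst_case_wait() {
  local timeout="${1%s}"   # strip a trailing "s", e.g. "15s" -> "15"
  local retries="$2"
  echo "$(( timeout * (retries + 1) ))s"
}

worst_case_wait 15s 1   # defaults: prints 30s
worst_case_wait 20s 2   # tuned for a loaded cluster: prints 60s
```

If the worst case approaches your polling interval, raising the timeout alone is probably the safer first step.
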
**How to Apply:**

```bash
# Docker deployment - add to docker run or compose
docker run -e GUEST_AGENT_FSINFO_TIMEOUT=20s -e GUEST_AGENT_RETRIES=2 ...

# Systemd deployment - add to /etc/systemd/system/pulse.service
[Service]
Environment="GUEST_AGENT_FSINFO_TIMEOUT=20s"
Environment="GUEST_AGENT_RETRIES=2"
```

After changing environment variables, restart Pulse for the changes to take effect. For systemd deployments, run `systemctl daemon-reload` before restarting so the edited unit file is picked up.

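For systemd deployments, a drop-in file is an alternative to editing the unit file directly and is not overwritten if the unit is replaced. The following is a sketch: the helper name `write_pulse_dropin` and the file name `guest-agent.conf` are arbitrary choices for this example, and the unit name `pulse` is taken from the path above.

```bash
# Sketch only: write guest agent settings to a systemd drop-in file.
# The drop-in file name (guest-agent.conf) is an arbitrary choice for this example.
write_pulse_dropin() {
  local dir="${1:-/etc/systemd/system/pulse.service.d}"
  mkdir -p "$dir" || return 1
  cat > "$dir/guest-agent.conf" <<'EOF'
[Service]
Environment="GUEST_AGENT_FSINFO_TIMEOUT=20s"
Environment="GUEST_AGENT_RETRIES=2"
EOF
}

# On the Pulse host (as root):
#   write_pulse_dropin && systemctl daemon-reload && systemctl restart pulse
```
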
### Permission Denied Errors

If you see "permission denied" in Pulse logs when querying guest agent:

1. **Verify token/user permissions:**
   ```bash
   pveum user permissions pulse-monitor@pam
   ```

2. **For Proxmox 9+:** Ensure user has the `VM.GuestAgent.Audit` privilege (PulseMonitor role handles this)

3. **For Proxmox 8:** Ensure user has the `VM.Monitor` privilege (PulseMonitor role handles this)

4. **All versions:** Confirm `Sys.Audit` is present for Ceph metrics when applicable

5. **Re-run setup script** if you added the node before Pulse v4.7 (old scripts didn't add VM.Monitor/guest agent privileges)

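Steps 1 to 4 can be combined into one check. This is a sketch: `check_pulse_privileges` is a hypothetical helper, and it assumes only that the `pveum user permissions` output contains each granted privilege name as a separate word.

```bash
# Sketch only: report whether each named privilege appears in the text on stdin.
# Feed it the output of `pveum user permissions <user>`; returns non-zero if any is missing.
check_pulse_privileges() {
  local perms priv missing=0
  perms="$(cat)"
  for priv in "$@"; do
    # -F: literal match (dots are not wildcards), -w: whole word only
    if printf '%s\n' "$perms" | grep -qwF -- "$priv"; then
      echo "$priv present"
    else
      echo "$priv MISSING"
      missing=1
    fi
  done
  return "$missing"
}

# Proxmox 9+:
#   pveum user permissions pulse-monitor@pam | check_pulse_privileges VM.GuestAgent.Audit Sys.Audit
# Proxmox 8:
#   pveum user permissions pulse-monitor@pam | check_pulse_privileges VM.Monitor Sys.Audit
```
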
### Disk Usage Still Not Showing

If the agent is working but Pulse still shows "-":

1. **Check Pulse logs** for specific error messages:
   ```bash
   # Docker (2>&1 also captures log lines the container wrote to stderr)
   docker logs pulse 2>&1 | grep -i "guest agent\|fsinfo"

   # Systemd
   journalctl -u pulse -f | grep -i "guest agent\|fsinfo"
   ```

2. **Test guest agent manually** from Proxmox host:
   ```bash
   qm agent <vmid> get-fsinfo
   ```
   If this works but Pulse doesn't show data, check Pulse permissions and logs
   (v4.24.0: adjust **Settings → System → Logging** to `debug` temporarily if you need more detail, then revert to `info`).

3. **Check agent version** - Older agents might not support filesystem info

4. **Windows VMs** - Ensure virtio-win drivers are up to date

### Network Filesystems

The agent reports all mounted filesystems. Pulse automatically filters out:
- Network mounts (NFS, CIFS, SMB)
- Special filesystems (proc, sys, tmpfs, devtmpfs, etc.)
- Special Windows partitions ("System Reserved")
- Bind mounts and overlays
- Read-only appliance or optical images (squashfs, erofs, iso9660, CDFS, UDF, cramfs, romfs, fuse.cdfs)

Only local disk usage is counted toward the VM's total.

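The type-based part of this filtering can be sketched as a simple lookup. This is not Pulse's actual filter: `is_counted_fstype` is a hypothetical helper, the spellings `nfs4`, `smbfs`, `sysfs`, and `overlay` are assumed type names added for illustration, and bind mounts and the Windows "System Reserved" partition are not covered because they cannot be recognized from the filesystem type alone.

```bash
# Sketch only: succeeds if a filesystem type should count toward the VM total.
# The type list is illustrative and based on the categories listed above.
is_counted_fstype() {
  local fstype
  fstype="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"   # treat "CDFS" and "cdfs" alike
  case "$fstype" in
    nfs|nfs4|cifs|smb|smbfs) return 1 ;;                      # network mounts
    proc|sys|sysfs|tmpfs|devtmpfs|overlay) return 1 ;;        # special filesystems and overlays
    squashfs|erofs|iso9660|cdfs|udf|cramfs|romfs|fuse.cdfs) return 1 ;;  # read-only images
    *) return 0 ;;                                            # everything else is treated as local disk
  esac
}

for t in ext4 NFS tmpfs iso9660 xfs; do
  if is_counted_fstype "$t"; then echo "$t counted"; else echo "$t skipped"; fi
done
```
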
## Best Practices

1. **Install guest agent in VM templates** - New VMs will have it ready
2. **Monitor agent status** - Set up alerts if critical VMs lose agent connectivity
3. **Keep agents updated** - Update guest agents when updating VM operating systems
4. **Test after VM migrations** - Verify agent still works after moving VMs between nodes
5. **Check logs regularly** - Monitor Pulse logs for guest agent errors

## Platform-Specific Notes

### Cloud-Init Images
Most cloud images include qemu-guest-agent pre-installed but may need to be enabled:
```bash
systemctl enable --now qemu-guest-agent
```

### Docker/Kubernetes VMs
Container workloads can show high disk usage due to container layers. Consider:
- Using separate disks for container storage
- Monitoring container disk usage separately
- Setting appropriate thresholds for container hosts

### Database VMs
Databases often pre-allocate space. The guest agent shows actual usage, which might be less than what the database reports internally.

## Benefits

With QEMU Guest Agent disk monitoring:
- **Accurate alerts** - Alert on real usage, not allocated space
- **Better planning** - See actual growth trends
- **Prevent surprises** - Know when VMs are actually running out of space
- **Optimize storage** - Identify over-provisioned VMs
- **Consistent monitoring** - VMs and containers use the same metrics