52 KiB
Pulse Assistant Architecture
Status: Authoritative documentation for Pulse Assistant safety architecture Last Updated: 2026-01-28 Maintainers: Core team
CRITICAL INVARIANT (read this first)
Read-only operations must never route through write tools.
- Read operations →
pulse_read(ToolKindRead)- Write operations →
pulse_control(ToolKindWrite)- FSM VERIFYING is only entered after
ToolKindWritesuccessViolation causes: Read commands like
grep logstrigger VERIFYING state, blocking investigation workflows. See Invariant 6 for details.
1. High-Level Architecture
Pulse Assistant is a protocol-driven, safety-gated AI system for infrastructure management. The key insight is that the LLM is treated as an untrusted proposer - it can request tool calls but cannot execute them directly. All routing, permissions, and execution are enforced in Go code.
┌─────────────────────────────────────────────────────────────────────┐
│ User Request │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Agentic Loop (chat/agentic.go) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ LLM API │───▶│ FSM │───▶│ Tool Executor │ │
│ │ (proposer) │ │ (gating) │ │ (validation/execution) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Phantom │ │ Telemetry │ │ ResolvedContext │ │
│ │ Detection │ │ Counters │ │ (session-scoped truth) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Agent Execution Layer │
│ (CommandPolicy → AgentServer → Connected Agents) │
└─────────────────────────────────────────────────────────────────────┘
Core Philosophy
- LLM is a proposer, not an executor. The model suggests tool calls; Go code decides what runs.
- Resources must be discovered before controlled. Strict resolution prevents fabricated resource IDs.
- Writes must be verified. FSM enforces read-after-write before final answer.
- Errors are recoverable. Structured error responses enable self-correction without prompt engineering.
1.1 User-Visible Behavior (What Feels "Impressive")
When you use Pulse Assistant in chat, these behaviors are deliberate and enforced by the backend:
- Grounded answers: The assistant uses live tools and surfaces their outputs.
- Discover → Investigate → Act: It queries resources first, reads status/logs, and only then acts.
- Verified changes: After a write, it performs a read check before concluding.
- Approval gates: In Controlled mode, write actions emit approvals and wait for a decision.
- Self‑recovery: If blocked (routing mismatch, read‑only violation, strict resolution), it adapts and retries with a safe path.
These are not prompt conventions — they are enforced by the FSM + tool executor.
2. Core Design Principles (Invariants)
These are the structural guarantees that the system must maintain. They are enforced in code, not prompts.
Invariant 1: Discovery Before Action
An action tool cannot operate on a resource that wasn't first discovered via pulse_query or pulse_discovery. This prevents the model from fabricating resource IDs.
Enforcement: validateResolvedResource() in tools/tools_query.go
Invariant 2: Verification After Write
After any write operation, the model must perform a read/status check before providing a final answer. This catches hallucinated success.
Enforcement: FSM StateVerifying in chat/fsm.go, gate in chat/agentic.go:CanFinalAnswer()
Invariant 3: Consistent Error Envelopes
All tool responses use the ToolResponse envelope structure. Errors include structured codes (STRICT_RESOLUTION, FSM_BLOCKED) that enable auto-recovery.
Enforcement: ToolResponse / ToolError types and helper functions in tools/protocol.go
Invariant 4: Phantom Detection
If the model claims to have executed an action but produced no tool calls, the response is replaced with a safe failure message.
Enforcement: hasPhantomExecution() in chat/agentic.go
Invariant 5: Session-Scoped Truth
Resolved resources are session-scoped and in-memory only. They are never persisted because infrastructure state may change between sessions.
Enforcement: ResolvedContext not serialized, rebuilt each session in chat/session.go
Approval Flow (Controlled Mode)
When control_level=controlled, write tools emit an approval request instead of executing:
- Tool returns
APPROVAL_REQUIRED: { approval_id, command, ... } - Agentic loop emits
approval_neededSSE event - UI or API approves/denies via
/api/ai/approvals/{id}/approve|deny - On approve, the tool re-executes with
_approval_idand proceeds - On deny, the assistant returns
Command denied: <reason>
This keeps the LLM in a proposer role while letting users explicitly authorize actions.
Invariant 6: Read/Write Tool Separation
This is the most commonly violated invariant. Read it carefully.
Read operations and write operations must go through different tools to ensure correct FSM classification. The FSM uses tool classification, not command content, to determine state transitions.
The Rule:
Read operations → pulse_read → ToolKindRead → stays in READING
Write operations → pulse_control → ToolKindWrite → enters VERIFYING
Tool Classification:
| Tool | Classification | Purpose |
|---|---|---|
pulse_query, pulse_discovery |
ToolKindResolve | Resource discovery and query |
pulse_read |
ToolKindRead | Read-only operations: exec, file, find, tail, logs |
pulse_metrics |
ToolKindRead | Performance metrics |
pulse_storage |
ToolKindRead | Storage information |
pulse_kubernetes |
ToolKindRead | Kubernetes cluster info |
pulse_pmg |
ToolKindRead | Proxmox Mail Gateway stats |
pulse_control |
ToolKindWrite | Write operations: guest control, service management |
pulse_file_edit action=read |
ToolKindRead | File reading |
pulse_file_edit action=write/append |
ToolKindWrite | File modification |
pulse_alerts (action-dependent) |
resolve/dismiss → Write; others → Read | Alert management |
pulse_docker (action-dependent) |
control/update → Write; others → Read | Docker operations |
pulse_knowledge (action-dependent) |
remember/note/save → Write; others → Read | Knowledge persistence |
Enforcement Points:
- Tool registration:
tools/tools_read.go(read-only tool) - FSM classification:
classifyToolByName()inchat/fsm.go(pulse_read→ToolKindRead) - Read-only enforcement:
tools/tools_read.go(usesClassifyExecutionIntent) - Intent classification:
ClassifyExecutionIntent()intools/tools_query.go - Regression test:
chat/fsm_test.go:TestFSM_RegressionJellyfinLogsScenario
ExecutionIntent Classification:
pulse_read uses the ExecutionIntent abstraction to determine if a command is provably non-mutating.
| Intent | Meaning | Example |
|---|---|---|
IntentReadOnlyCertain |
Non-mutating by construction | cat, grep, ls, docker logs |
IntentReadOnlyConditional |
Proven read-only by content inspection | sqlite3 db.db "SELECT..." |
IntentWriteOrUnknown |
Cannot prove non-mutating | rm, curl -X POST, unknown binaries |
Classification Phases:
- Mutation-capability guards: Blocks
sudo, redirects, pipes to dual-use tools, shell chaining - Known write patterns: Matches destructive commands (
rm,shutdown,systemctl restart) - Read-only by construction: Matches commands that cannot mutate (
cat,grep,docker logs) - Content inspection: Uses
ContentInspectorinterface for dual-use tools (SQL CLIs) - Conservative fallback: Unknown commands →
IntentWriteOrUnknown
CRITICAL: Phase Ordering Contract
Any match in Phase 1–2 dominates and forces
IntentWriteOrUnknown. Never "optimize" by short-circuiting on allowlists (Phase 3) first. This ordering ensures guardrails cannot be bypassed by prepending a read-only command.
pulse_read Contract:
pulse_read accepts: IntentReadOnlyCertain, IntentReadOnlyConditional
pulse_read rejects: IntentWriteOrUnknown (with recovery hint)
Any Phase 1–2 guardrail → forces IntentWriteOrUnknown
Unknown commands → IntentWriteOrUnknown (conservative)
NonInteractiveOnly Invariant (Exit-Boundedness):
pulse_read must only execute commands that terminate deterministically.
Rule: Allowed = exits deterministically OR has explicit bound
Blocked = can run indefinitely without explicit bound
Categories (for telemetry labels):
┌─────────────────────┬──────────────────────────────────────────────────┐
│ [tty_flag] │ docker exec -it, kubectl exec -it │
│ [pager] │ less, more, vim, nano, emacs │
│ [unbounded_stream] │ top, htop, watch, tail -f, journalctl -f │
│ [interactive_repl] │ ssh host, mysql, psql, python, node (no command) │
└─────────────────────┴──────────────────────────────────────────────────┘
Exit bounds that allow streaming:
- Line count: -n, --lines, --tail
- Time window: --since, --until
- Timeout wrapper: timeout 5s <cmd>
Examples:
allow: journalctl --since "10 min ago" -f
allow: kubectl logs --since=10m --tail=100
allow: ssh host "ls -la"
allow: mysql -e "SELECT 1"
block: journalctl -f
block: ssh host (no command)
block: mysql (bare REPL)
Auto-recovery (bounded to 1 attempt):
When blocked with auto_recoverable=true and suggested_rewrite:
1. Agentic loop automatically applies the rewrite
2. Retries once with modified command
3. If still blocked, surfaces error to user
Rewrite templates:
journalctl -f → journalctl -n 200 --since "10 min ago"
docker logs -f x → docker logs --tail=200 x
kubectl logs -f x → kubectl logs --tail=200 --since=10m x
tail -f file → tail -n 200 file
Adding New Dual-Use Tools:
Implement the ContentInspector interface in tools/tools_query.go:
type ContentInspector interface {
Applies(cmdLower string) bool // Does this inspector handle this command?
IsReadOnly(cmdLower string) (bool, string) // Is the content read-only?
}
Then add to registeredInspectors slice. Example: sqlContentInspector inspects SQL CLI commands.
What happens if violated:
- User asks "check the logs on jellyfin"
- Model runs
grep -i error /var/log/*.logthroughpulse_control - FSM classifies
pulse_controlasToolKindWrite - FSM enters VERIFYING state
- Model tries to run another command → BLOCKED
- Error: "Must verify the previous write operation"
- User cannot investigate, stuck in verification loop
Correct behavior:
- User asks "check the logs on jellyfin"
- Model runs
grep -i error /var/log/*.logthroughpulse_read action=exec - FSM classifies
pulse_readasToolKindRead - FSM stays in READING state
- Model can run unlimited read operations
- Investigation succeeds
Invariant 7: Execution Context Binding
Tool execution must be context-bound to a resolved resource.
When the user targets a non-host resource (LXC/VM/Docker container), file operations and commands must execute within that resource's execution context, never on the parent host.
The Rule:
If user mentions @homepage-docker (resolves to lxc:delly:141):
- All file reads/writes MUST run inside that LXC context
- NOT on delly just because a path happens to exist there
What happens if violated:
- User asks "add InfluxDB to my @homepage-docker config"
- Model runs
pulse_file_editwithtarget_host="delly"(the Proxmox node) - File exists at
/opt/homepage/config/services.yamlon delly (coincidentally) - Model edits the file on the host
- Homepage (running in LXC 141) doesn't see the change
- User's request appears to succeed but nothing actually changes
Enforcement: validateRoutingContext() in tools/tools_query.go
- Blocks with
ROUTING_MISMATCHif targeting a Proxmox host when recently referenced child resources exist - Error response includes
auto_recoverable: trueand suggests the correct target - Applied to:
pulse_file_edit,pulse_read,pulse_control type=command
Critical Implementation Detail: LRU Access vs Explicit Access
LRU access is a cache/eviction mechanism; explicit access is an intent mechanism. They must never be conflated.
The routing validation uses two separate tracking mechanisms in ResolvedContext:
| Field | Purpose | Set By | Used For |
|---|---|---|---|
lastAccessed |
LRU eviction, TTL expiry | Every add/get | Cache management |
explicitlyAccessed |
Routing validation | MarkExplicitAccess() only |
Detecting user intent |
Why this matters:
- Bulk discovery (e.g.,
pulse_query action=search) returns many resources → setslastAccessedfor all - If routing validation used
lastAccessed, bulk discovery would block subsequent host operations - Instead, routing validation checks
explicitlyAccessedwhich is ONLY set for single-resource operations
Correct events to mark explicit access:
- ✅ User @mention resolved to a resource
- ✅
pulse_query action=getreturning a single resource - ✅ Explicit selection of a specific resource
Events that do NOT mark explicit access:
- ❌ Bulk discovery returning many resources
- ❌ Background prefetch operations
- ❌ General
lastAccessedupdates for LRU
Implementation: chat/types.go:ResolvedContext
WasRecentlyAccessed()checksexplicitlyAccessed, notlastAccessedMarkExplicitAccess()only setsexplicitlyAccessed- Registration methods:
registerResolvedResource()vsregisterResolvedResourceWithExplicitAccess()
Canonical Resource ID Format:
{kind}:{host}:{provider_uid} # Scoped resources (globally unique)
{kind}:{provider_uid} # Global resources (nodes, clusters)
Examples:
lxc:delly:141 # LXC container 141 on node delly
vm:minipc:203 # VM 203 on node minipc
docker_container:server1:abc123 # Docker container on host server1
node:delly # Proxmox node (no parent scope)
Regression tests:
tools/strict_resolution_test.go:TestRoutingMismatch_RegressionHomepageScenariotools/strict_resolution_test.go:TestRoutingValidation_BulkDiscoveryShouldNotPoisonRoutingtools/strict_resolution_test.go:TestRoutingValidation_ExplicitGetShouldMarkAccess
Error Response:
{
"ok": false,
"error": {
"code": "ROUTING_MISMATCH",
"message": "target_host 'delly' is a Proxmox node, but you recently referenced more specific resources on it: [homepage-docker].",
"blocked": true,
"details": {
"target_host": "delly",
"more_specific_resources": ["homepage-docker"],
"more_specific_resource_ids": ["lxc:delly:141"],
"target_resource_id": "lxc:delly:141",
"recovery_hint": "Retry with target_resource_id='lxc:delly:141' (preferred) or target_host='homepage-docker' (legacy)",
"auto_recoverable": true
}
}
}
Telemetry: pulse_ai_routing_mismatch_block_total{tool, target_kind, child_kind}
3. Tool Protocol
All tools return a consistent ToolResponse envelope defined in tools/protocol.go.
Response Structure
type ToolResponse struct {
OK bool `json:"ok"` // true if tool succeeded
Data interface{} `json:"data,omitempty"` // result data if ok=true
Error *ToolError `json:"error,omitempty"`// error details if ok=false
Meta map[string]interface{} `json:"meta,omitempty"` // optional metadata
}
type ToolError struct {
Code string `json:"code"` // e.g., "STRICT_RESOLUTION"
Message string `json:"message"` // Human-readable
Blocked bool `json:"blocked,omitempty"` // Policy/validation block
Failed bool `json:"failed,omitempty"` // Runtime failure
Retryable bool `json:"retryable,omitempty"` // Auto-retry might succeed
Details map[string]interface{} `json:"details,omitempty"` // Additional context
}
Error Codes (tools/protocol.go)
| Code | Meaning | Auto-Recoverable |
|---|---|---|
STRICT_RESOLUTION |
Resource not discovered | Yes (discover then retry) |
FSM_BLOCKED |
FSM state prevents operation | Yes (perform required action) |
NOT_FOUND |
Resource doesn't exist | No |
ACTION_NOT_ALLOWED |
Action not permitted | No |
POLICY_BLOCKED |
Security policy blocked | No |
APPROVAL_REQUIRED |
User approval needed | Yes (wait for approval) |
INVALID_INPUT |
Bad parameters | No |
EXECUTION_FAILED |
Runtime error | Depends on cause |
Helper Functions
NewToolSuccess(data) // Success response
NewToolBlockedError(code, message, details) // Policy/validation block
NewToolFailedError(code, message, retryable, details) // Runtime failure
4. Strict Resolution
Strict resolution prevents the model from operating on fabricated or hallucinated resource IDs.
Enabling
export PULSE_STRICT_RESOLUTION=true
Behavior (tools/tools_query.go)
When enabled (PULSE_STRICT_RESOLUTION=true):
- Write actions (
start,stop,restart,delete,exec,write,append) are blocked if the resource wasn't discovered first - Read actions are allowed if the session has any resolved context (scoped bypass)
- Error response includes
auto_recoverable: trueto signal the model can self-correct
Error Type (tools/tools_query.go)
type ErrStrictResolution struct {
ResourceID string // The resource that wasn't found
Action string // The action that was attempted
Message string // Human-readable message
}
// ToToolResponse returns consistent error envelope
func (e *ErrStrictResolution) ToToolResponse() ToolResponse {
return NewToolBlockedError(
ErrCodeStrictResolution,
e.Message,
map[string]interface{}{
"resource_id": e.ResourceID,
"action": e.Action,
"recovery_hint": "Use pulse_query action=search to discover the resource first",
"auto_recoverable": true,
},
)
}
Validation Flow (tools/tools_query.go:validateResolvedResource)
func (e *PulseToolExecutor) validateResolvedResource(resourceName, action string, skipIfNoContext bool) ValidationResult {
strictMode := isStrictResolutionEnabled()
isWrite := isWriteAction(action)
requireHardValidation := strictMode && isWrite
// 1. Check if context exists
if e.resolvedContext == nil {
if requireHardValidation {
return ValidationResult{StrictError: &ErrStrictResolution{...}}
}
// Soft validation: allow but log warning
}
// 2. Try to find resource by alias or ID
res, found := e.resolvedContext.GetResolvedResourceByAlias(resourceName)
if found {
// 3. Check if action is allowed
// ...
return ValidationResult{Resource: res}
}
// 4. Not found - block if strict mode
if requireHardValidation {
return ValidationResult{StrictError: &ErrStrictResolution{...}}
}
}
5. ResolvedContext
ResolvedContext is the session-scoped source of truth for discovered resources. It's defined in chat/types.go.
Design Principles
- Authoritative: Only query/discovery tools can add resources
- Session-scoped: Not persisted across sessions
- In-memory only: Infrastructure state may change
- Multi-indexed: By name, ID, and aliases
Structure (chat/types.go)
type ResolvedContext struct {
SessionID string
Resources map[string]*ResolvedResource // By name
ResourcesByID map[string]*ResolvedResource // By canonical ID
ResourcesByAlias map[string]*ResolvedResource // By any alias
lastAccessed map[string]time.Time // LRU tracking
pinned map[string]bool // Eviction protection
ttl time.Duration // Default: 45 minutes
maxEntries int // Default: 500
}
Features
| Feature | Default | Purpose |
|---|---|---|
| TTL | 45 minutes | Sliding window expiration |
| Max Entries | 500 | LRU eviction when exceeded |
| Pinning | - | Protect primary targets from eviction |
Resource Registration Interface (tools/executor.go)
type ResourceRegistration struct {
Kind string // "node", "vm", "lxc", "docker_container"
ProviderUID string // Stable provider ID (container ID, VMID)
Name string // Primary display name
Aliases []string // Additional names
HostUID string // Host identifier
HostName string // Host display name
VMID int // For Proxmox guests
Node string // For Proxmox guests
Executors []ExecutorRegistration // How to reach this resource
}
Coherent Reset (chat/session.go:ClearSessionState)
When clearing session state, both FSM and ResolvedContext must be reset together:
func (s *SessionStore) ClearSessionState(sessionID string, keepPinned bool) {
// Clear resolved context
ctx.Clear(keepPinned)
// Reset FSM coherently
if !keepPinned {
fsm.Reset() // Back to RESOLVING
} else if ctx.HasAnyResources() {
fsm.ResetKeepProgress() // Stay in READING
} else {
fsm.Reset() // No resources left, must rediscover
}
}
6. Command Risk Classification
Command risk classification (classifyCommandRisk() in tools/tools_query.go) determines how shell commands are treated by strict resolution.
Risk Levels
const (
CommandRiskReadOnly CommandRisk = 0 // Safe read-only commands
CommandRiskLowWrite CommandRisk = 1 // Low-risk writes
CommandRiskMediumWrite CommandRisk = 2 // Medium-risk writes
CommandRiskHighWrite CommandRisk = 3 // High-risk writes
)
Classification Algorithm
The classifier evaluates commands in 4 phases:
Phase 1: Shell Metacharacters
sudo→ HighWrite>,>>,tee,2>→ HighWrite (output redirection);,&&,||→ MediumWrite (command chaining)$(...), backticks → MediumWrite (command substitution)
Phase 2: High-Risk Patterns
rm,shutdown,reboot,poweroffsystemctl restart/stop/start- Package managers:
apt,yum,dnf,pacman - Dangerous docker commands:
docker rm,docker kill - System modification:
chmod,chown,iptables
Phase 3: Medium-Risk Patterns
- File operations:
mv,cp,touch,mkdir - In-place editing:
sed -i - Archive extraction:
tar -x,unzip - Curl with mutation:
-X POST,-X DELETE,--data
Phase 4: Read-Only Patterns
- File inspection:
cat,head,tail,less,grep - System status:
ps,top,free,df,du - Docker read:
docker ps,docker logs,docker inspect - Network diagnostics:
ping,netstat,ss,ip addr - Systemd status:
systemctl status,systemctl is-active
Strict Mode Behavior (tools/tools_query.go:validateResolvedResourceForExec)
func (e *PulseToolExecutor) validateResolvedResourceForExec(resourceName, command string, ...) {
risk := classifyCommandRisk(command)
if risk == CommandRiskReadOnly && isStrictResolutionEnabled() {
// Read-only commands allowed if session has ANY resolved context
if e.resolvedContext != nil && e.hasAnyResolvedHost() {
return ValidationResult{} // Allow with warning
}
// No context at all - require discovery
return ValidationResult{StrictError: ...}
}
// Write commands require specific resource discovery
return e.validateResolvedResource(resourceName, "exec", ...)
}
7. FSM (Finite State Machine)
The session workflow FSM (chat/fsm.go) enforces structural guarantees about discovery and verification.
States (chat/fsm.go)
┌─────────────┐ resolve/read ┌─────────────┐
│ RESOLVING │─────────────────────▶│ READING │
│ (initial) │ │ │
└─────────────┘ └──────┬──────┘
▲ │
│ Reset() write│success
│ ▼
│ ┌─────────────┐ ┌─────────────┐
└──────────────│ (done) │◀─│ VERIFYING │
└─────────────┘ └──────┬──────┘
▲ │
│ read │
└─────────────────┘
┌─────────────┐
│ WRITING │ (defined, transitional — all tools allowed;
│ │ not currently assigned by any transition)
└─────────────┘
| State | Description | Allowed Operations |
|---|---|---|
RESOLVING |
Initial state, no validated target | Resolve, Read |
READING |
Resources discovered, ready for actions | Resolve, Read, Write |
WRITING |
Transitional state (defined but not actively assigned by transitions) | Resolve, Read, Write |
VERIFYING |
Write performed, must verify before next write or final answer | Resolve, Read |
Tool Classification (chat/fsm.go)
const (
ToolKindResolve ToolKind = iota // Discovery/query tools
ToolKindRead // Read-only tools
ToolKindWrite // Mutating tools
)
Classification Function (chat/fsm.go:classifyToolByName)
ClassifyToolCall(toolName string, args map[string]interface{}) ToolKind is the single source of truth for tool classification:
func classifyToolByName(toolName string, args map[string]interface{}) ToolKind {
action, _ := args["action"].(string)
actionLower := strings.ToLower(action)
switch toolName {
// === Query/Discovery tools (Resolve) ===
case "pulse_query", "pulse_discovery":
return ToolKindResolve
// === Read-only tools (Read) ===
case "pulse_metrics", "pulse_storage", "pulse_kubernetes", "pulse_pmg":
return ToolKindRead
case "pulse_read":
return ToolKindRead // ALWAYS read-only, enforced at tool layer
// === Control tools (Write) ===
case "pulse_control", "pulse_run_command":
return ToolKindWrite
// === Action-dependent tools ===
case "pulse_alerts":
switch actionLower {
case "resolve", "dismiss":
return ToolKindWrite
default:
return ToolKindRead
}
case "pulse_docker":
switch actionLower {
case "control", "update", "check_updates", "trigger_update":
return ToolKindWrite
default:
return ToolKindRead
}
case "pulse_knowledge":
switch actionLower {
case "remember", "note", "save":
return ToolKindWrite
default:
return ToolKindRead
}
case "pulse_file_edit":
switch actionLower {
case "read":
return ToolKindRead
case "write", "append":
return ToolKindWrite
default:
return ToolKindRead
}
// === Legacy tool names (backwards compatibility) ===
case "pulse_control_guest", "pulse_control_docker":
return ToolKindWrite
case "pulse_search_resources", "pulse_get_resource", "pulse_get_topology",
"pulse_list_infrastructure", "pulse_get_connection_health":
return ToolKindResolve
case "pulse_get_docker_logs", "pulse_get_performance_metrics",
"pulse_get_temperatures", "pulse_get_baselines", "pulse_get_patterns":
return ToolKindRead
}
// Fallback: check action/operation parameter for write/read intent
// ...
// Default to WRITE for unknown tools (security-safe)
return ToolKindWrite
}
Key Methods (chat/fsm.go)
// Check if tool is allowed in current state
func (fsm *SessionFSM) CanExecuteTool(kind ToolKind, toolName string) error
// Check if final answer is allowed
func (fsm *SessionFSM) CanFinalAnswer() error
// Update state after successful tool execution
func (fsm *SessionFSM) OnToolSuccess(kind ToolKind, toolName string)
// Complete verification and return to READING
func (fsm *SessionFSM) CompleteVerification()
Recovery Tracking (chat/fsm.go)
The FSM tracks pending recoveries for metrics correlation:
type PendingRecovery struct {
RecoveryID string // Unique ID for correlation
ErrorCode string // FSM_BLOCKED or STRICT_RESOLUTION
Tool string // Tool that was blocked
CreatedAt time.Time
Attempts int
}
// Track a blocked operation
func (fsm *SessionFSM) TrackPendingRecovery(errorCode, tool string) string
// Check if a success resolves a pending recovery
func (fsm *SessionFSM) CheckRecoverySuccess(tool string) *PendingRecovery
8. Agentic Loop Enforcement
The agentic loop (chat/agentic.go) orchestrates LLM calls and enforces the FSM gates.
Enforcement Gate 1: Tool Execution (chat/agentic.go)
Before executing any tool:
// Check if tool is allowed in current state
toolKind := ClassifyToolCall(tc.Name, tc.Input)
if fsm != nil {
if fsmErr := fsm.CanExecuteTool(toolKind, tc.Name); fsmErr != nil {
// Record telemetry
metrics.RecordFSMToolBlock(fsm.State, tc.Name, toolKind)
// Track recovery opportunity
if fsmBlockedErr.Recoverable {
fsm.TrackPendingRecovery("FSM_BLOCKED", tc.Name)
metrics.RecordAutoRecoveryAttempt("FSM_BLOCKED", tc.Name)
}
// Return error with recovery hint
return ToolResult{
Content: fsmErr.Error() + " Use a discovery or read tool first, then retry.",
IsError: true,
}
}
}
Enforcement Gate 2: Final Answer (chat/agentic.go)
Before allowing the model to respond without tool calls:
if len(toolCalls) == 0 {
if fsm != nil {
if fsmErr := fsm.CanFinalAnswer(); fsmErr != nil {
// Record telemetry
metrics.RecordFSMFinalBlock(fsm.State)
// Inject verification constraint (factual, not narrative)
verifyPrompt := fmt.Sprintf(
"Verification required: perform a read or status check on %s before responding.",
fsm.LastWriteTool,
)
// Continue loop to force verification read
continue
}
}
}
Phantom Detection (chat/agentic.go:hasPhantomExecution)
Detects when the model claims execution without tool calls:
func hasPhantomExecution(content string) bool {
lower := strings.ToLower(content)
// Category 1: Concrete metrics/values that MUST come from tools
metricsPatterns := []string{
"cpu usage is ", "memory usage is ", "disk usage is ",
}
// Category 2: Claims of infrastructure state
statePatterns := []string{
"is currently running", "is now restarted",
"the logs show", "according to the output",
}
// Category 3: Fake tool call formatting
fakeToolPatterns := []string{
"```tool", "pulse_query(", "<tool_call>",
}
// Category 4: Past tense claims of specific actions
actionResultPatterns := []string{
"i restarted the", "successfully stopped",
}
}
State Transition After Success (chat/agentic.go)
if fsm != nil && !isError {
fsm.OnToolSuccess(toolKind, tc.Name)
// Check if this success resolves a pending recovery
if pr := fsm.CheckRecoverySuccess(tc.Name); pr != nil {
metrics.RecordAutoRecoverySuccess(pr.ErrorCode, pr.Tool)
}
}
9. Telemetry & Regression Detection
Prometheus counters (chat/metrics.go) provide operational visibility and regression detection.
Metrics
| Counter | Labels | Purpose |
|---|---|---|
pulse_ai_fsm_tool_block_total |
state, tool, kind | FSM blocks of tool execution |
pulse_ai_fsm_final_block_total |
state | FSM blocks of final answer |
pulse_ai_strict_resolution_block_total |
tool, action | Strict resolution blocks |
pulse_ai_phantom_detected_total |
provider, model | Phantom execution detected |
pulse_ai_auto_recovery_attempt_total |
error_code, tool | Auto-recovery attempts |
pulse_ai_auto_recovery_success_total |
error_code, tool | Successful auto-recoveries |
pulse_ai_agentic_iterations_total |
provider, model | Agentic loop turns |
Recording Points
FSM Tool Block (agentic.go — Gate 1):
metrics.RecordFSMToolBlock(fsm.State, tc.Name, toolKind)
FSM Final Block (agentic.go — Gate 2):
metrics.RecordFSMFinalBlock(fsm.State)
Strict Resolution Block (tools_query.go — validateResolvedResource):
e.telemetryCallback.RecordStrictResolutionBlock("validateResolvedResource", action)
Phantom Detection (agentic.go — phantom check):
metrics.RecordPhantomDetected(providerName, modelName)
Auto-Recovery Attempt (agentic.go — recovery tracking):
metrics.RecordAutoRecoveryAttempt("FSM_BLOCKED", tc.Name)
Auto-Recovery Success (agentic.go — OnToolSuccess):
metrics.RecordAutoRecoverySuccess(pr.ErrorCode, pr.Tool)
Label Sanitization (chat/metrics.go:sanitizeLabel)
All labels are sanitized to prevent cardinality explosion:
func sanitizeLabel(s string) string {
if s == "" {
return "unknown"
}
s = strings.ReplaceAll(s, " ", "_")
if len(s) > maxLabelLen {
s = s[:maxLabelLen]
}
return s
}
Recovery Rate Calculation
# Auto-recovery success rate
sum(rate(pulse_ai_auto_recovery_success_total[1h]))
/
sum(rate(pulse_ai_auto_recovery_attempt_total[1h]))
10. What NOT to Change
This section documents critical invariants that must not be modified without understanding the full system impact.
DO NOT: Allow Action Tools Without Discovery
// WRONG: Bypass validation
if validation.IsBlocked() {
// Just log and continue ← NO!
}
// CORRECT: Return error for auto-recovery
if validation.IsBlocked() {
return NewToolResponseResult(validation.StrictError.ToToolResponse())
}
Location: tools/tools_control.go, tools/tools_file.go (validation checks in action handlers)
DO NOT: Remove FSM Gates from Agentic Loop
// WRONG: Skip FSM check
// if fsm != nil { ... } ← Removing this breaks verification
// CORRECT: Always check FSM
if fsm != nil {
if fsmErr := fsm.CanExecuteTool(toolKind, tc.Name); fsmErr != nil {
// Handle block
}
}
Location: chat/agentic.go — Gate 1 (tool execution) and Gate 2 (final answer)
DO NOT: Change Unknown Tool Default to Read
// WRONG: Default to read (bypasses gates for new tools)
return ToolKindRead ← NO!
// CORRECT: Default to write (security-safe)
return ToolKindWrite
Location: classifyToolByName() default case in chat/fsm.go
DO NOT: Persist ResolvedContext
// WRONG: Save to disk
json.Marshal(resolvedContext) ← NO!
// CORRECT: Keep in-memory only
resolvedContexts map[string]*ResolvedContext // Not persisted
Location: chat/session.go (SessionStore), chat/types.go (ResolvedContext)
DO NOT: Reset FSM Without Context
// WRONG: Reset FSM only
fsm.Reset()
// CORRECT: Reset both coherently
s.ClearSessionState(sessionID, keepPinned) // Handles both
Location: ClearSessionState() in chat/session.go
DO NOT: Remove Phantom Detection
// WRONG: Skip phantom check
// if hasPhantomExecution(content) { ... } ← Removing this allows hallucinated actions
// CORRECT: Always check and replace
if hasPhantomExecution(assistantMsg.Content) {
metrics.RecordPhantomDetected(...)
resultMessages[...].Content = safeResponse
}
Location: hasPhantomExecution() check in chat/agentic.go
DO NOT: Route Read Operations Through Write Tools
Read-only operations must use pulse_read, NOT pulse_control:
// WRONG: Read command through write tool
pulse_control type="command" command="grep logs" ← Triggers VERIFYING!
// CORRECT: Read command through read tool
pulse_read action="exec" command="grep logs" ← Stays in READING
Why: pulse_control is classified as ToolKindWrite regardless of what command it runs. Running grep through pulse_control triggers VERIFYING state, blocking subsequent commands. Using pulse_read keeps the FSM in READING state where unlimited reads are allowed.
Location: tools/tools_read.go (read-only tool), classifyToolByName() in chat/fsm.go (classification)
DO NOT: Target Parent Host When Child Resources Exist
File and exec operations must target the specific resource, not the parent Proxmox host:
// WRONG: Target Proxmox host when LXC was mentioned
pulse_file_edit target_host="delly" path="/opt/homepage/config/services.yaml"
// ^ Edits file on host, not inside homepage-docker LXC!
// CORRECT: Target the specific resource
pulse_file_edit target_host="homepage-docker" path="/opt/homepage/config/services.yaml"
// ^ Edits file inside the LXC where Homepage actually runs
Why: Files at the same path may exist on both the host and inside containers, but they are completely different filesystems. Editing the host file has no effect on applications running inside containers.
Enforcement: validateRoutingContext() in tools/tools_query.go blocks with ROUTING_MISMATCH when targeting a Proxmox host that has discovered child resources.
Location: tools/tools_query.go:validateRoutingContext(), applied in tools_file.go and tools_read.go
DO NOT: Add Prompts to Error Recovery
The system uses structured error responses, not prompts, for recovery:
// WRONG: Add narrative prompts
return "I notice you haven't discovered resources yet. Let me suggest..." ← NO!
// CORRECT: Return structured error
return NewToolBlockedError(
ErrCodeStrictResolution,
"Resource not discovered",
map[string]interface{}{
"recovery_hint": "Use pulse_query action=search...",
"auto_recoverable": true,
},
)
DO NOT: Skip Telemetry Recording
Every block must be recorded for regression detection:
// WRONG: Silent block
return ValidationResult{StrictError: err}
// CORRECT: Record then return
if e.telemetryCallback != nil {
e.telemetryCallback.RecordStrictResolutionBlock(...)
}
return ValidationResult{StrictError: err}
Locations: All validation functions in tools/tools_query.go
Appendix A: Known Failure Modes & Fixes
This appendix documents failure modes that have been discovered and fixed. Use this for debugging when similar symptoms appear.
Failure Mode 1: VERIFYING Deadlock (Read Operations)
Symptom: write → read → write blocked repeatedly. User cannot investigate after any write.
Cause: Read operations routed through pulse_control (classified as ToolKindWrite) instead of pulse_read (classified as ToolKindRead).
Fix/Invariant: Read operations must use pulse_read. Tool classification determines FSM transitions, not command content.
Regression Test: TestFSM_RegressionJellyfinLogsScenario
Telemetry Signal: High fsm_tool_block_total{state="VERIFYING"} for read-like commands.
Failure Mode 2: VERIFYING Never Clears
Symptom: After write → read, subsequent writes still blocked. FSM stays in VERIFYING.
Cause: CompleteVerification() was only called when model gave final answer (no tool calls), not during tool execution loop.
Fix/Invariant: Call CompleteVerification() immediately after OnToolSuccess() when in VERIFYING and ReadAfterWrite=true. Verification must complete on first successful read, not wait for final answer.
Regression Test: TestFSM_RegressionWriteReadWriteSequence, TestFSM_RegressionMultipleReadsAfterWrite
Telemetry Signal: Persistent fsm_tool_block_total{state="VERIFYING"} even after successful reads.
Failure Mode 3: Stderr Redirects Blocked
Symptom: Commands like find ... 2>/dev/null rejected as "not read-only".
Cause: classifyCommandRisk() treated all > characters as dangerous output redirection.
Fix/Invariant: Strip safe stderr patterns (2>/dev/null, 2>&1) before checking for dangerous redirects.
Regression Test: TestClassifyCommandRisk cases for stderr redirects.
Failure Mode 4: Sticky ReadAfterWrite Flag
Symptom: Weird "already verified" states after multiple write cycles.
Cause: CompleteVerification() didn't reset ReadAfterWrite flag.
Fix/Invariant: CompleteVerification() must reset ReadAfterWrite=false for clean cycle.
Failure Mode 5: File Operations on Wrong Host (Routing Mismatch)
Symptom: File edits "succeed" but have no effect. Config changes don't appear in the application.
Cause: Model targeted a Proxmox host (target_host="delly") when user mentioned an LXC/VM (@homepage-docker). The file was edited on the host's filesystem, not inside the container where the application runs.
Fix/Invariant: validateRoutingContext() blocks operations that target a Proxmox host when the ResolvedContext contains LXC/VM children on that host. Returns ROUTING_MISMATCH with auto_recoverable: true.
Regression Test: TestRoutingMismatch_RegressionHomepageScenario
Telemetry Signal: Watch for ROUTING_MISMATCH error code in tool responses.
Why This Happens:
- Path
/opt/homepage/config/services.yamlexists on both the host and inside the LXC - Model picks the host because it's "simpler" or appears first
- Routing system correctly routes to the target it's given
- But the user intended the LXC context, not the host
Appendix B: Contributor Checklist
Use this checklist when modifying FSM, tools, or the agentic loop.
When Adding/Changing Tools
- Tool returns
ToolResponseenvelope (useNewToolSuccess,NewToolBlockedError, etc.) - Tool is explicitly classified in
classifyToolByName()(chat/fsm.go)- Unknown tools default to
ToolKindWrite(security-safe)
- Unknown tools default to
- Read operations use
pulse_reador are classifiedToolKindRead - Write operations are blocked in RESOLVING and require strict resolution
- Any new write requires verification read before final answer
- Routing context validation for file/exec tools (
validateRoutingContext())- Block if targeting Proxmox host when child resources (LXC/VM) exist in ResolvedContext
- Telemetry recorded for blocks (
RecordStrictResolutionBlock, etc.) - Add at least one FSM regression test for new behavior
When Changing FSM Logic
write → read → writesequence is supported (regression test must pass)- VERIFYING clears on successful
ToolKindRead CompleteVerification()resets appropriate flags- Verification gating is structural (code), not prompt-dependent
- Run all FSM tests:
go test ./internal/ai/chat/... -run FSM
When Changing Agentic Loop
- Gate 1 (tool execution) checks
CanExecuteTool()before every tool - Gate 2 (final answer) checks
CanFinalAnswer()before responding OnToolSuccess()called after every successful tool executionCompleteVerification()called when in VERIFYING andReadAfterWrite=true- Phantom detection remains active
- Recovery tracking records attempts and successes
When Changing Strict Resolution
- Write actions require discovered resources
- Read-only commands allowed with scoped bypass (session has context)
- Error responses include
auto_recoverable: trueandrecovery_hint - Command risk classification handles edge cases (stderr, pipes)
- Telemetry recorded for all blocks
Deployment Checklist
After deploying FSM changes, monitor for 24-48h:
# Should drop after fixing VERIFYING issues
rate(pulse_ai_fsm_tool_block_total{state="VERIFYING"}[5m])
# Should become rarer with proper verification clearing
rate(pulse_ai_fsm_final_block_total{state="VERIFYING"}[5m])
# Should maintain high rate (model self-corrects)
sum(rate(pulse_ai_auto_recovery_success_total[1h]))
/ sum(rate(pulse_ai_auto_recovery_attempt_total[1h]))
If VERIFYING blocks don't drop:
- Check if reads are being classified as
ToolKindRead - Check if reads are failing (look at tool error rates)
- Check if
CompleteVerification()is being called
If file operations have no effect:
- Check for
ROUTING_MISMATCHerrors in tool responses - Verify
target_hostmatches the intended resource (LXC/VM name, not Proxmox host) - Check if child resources are in ResolvedContext but model is targeting parent host
Appendix C: File Reference
| File | Purpose |
|---|---|
chat/fsm.go |
Session workflow state machine |
chat/fsm_test.go |
FSM unit tests |
chat/agentic.go |
Agentic loop with enforcement gates |
chat/session.go |
Session store with FSM/context management |
chat/types.go |
ResolvedContext and ResolvedResource |
chat/metrics.go |
Prometheus telemetry |
tools/protocol.go |
ToolResponse envelope |
tools/executor.go |
PulseToolExecutor and interfaces |
tools/registry.go |
Tool registry (registration, lookup, execution dispatch) |
tools/adapters.go |
MCP adapters bridging internal data to tool interfaces |
tools/types.go |
Tool-layer interfaces (AgentProfileManager, AgentScope) |
tools/tools_query.go |
Query tools, strict resolution, and routing validation |
tools/tools_read.go |
Read-only operations (exec, file, find, tail, logs) |
tools/tools_control.go |
Control tool handlers (legacy) |
tools/tools_control_consolidated.go |
Consolidated pulse_control (write-only) |
tools/tools_file.go |
File editing tools with routing validation |
tools/tools_alerts.go |
Alert management (pulse_alerts) |
tools/tools_docker.go |
Docker operations (pulse_docker) |
tools/tools_metrics.go |
Performance metrics (pulse_metrics) |
tools/tools_storage.go |
Storage information (pulse_storage) |
tools/tools_kubernetes.go |
Kubernetes cluster info (pulse_kubernetes) |
tools/tools_knowledge.go |
Knowledge persistence (pulse_knowledge) |
tools/tools_pmg_consolidated.go |
Proxmox Mail Gateway (pulse_pmg) |
tools/tools_discovery_consolidated.go |
Consolidated pulse_discovery |
tools/tools_infrastructure.go |
Infrastructure tools (backups, disks, temps, etc.) |
tools/tools_intelligence.go |
Correlation, incident windows, relationships |
tools/tools_patrol.go |
Baselines, patterns, findings, alerts |
tools/tools_profiles.go |
Agent scope/profile management |
memory/ |
Session memory subsystem |
investigation/ |
Investigation workflow support |
knowledge/ |
Knowledge base and recall |
cost/ |
Token/cost tracking |
safety/ |
Safety checks and guardrails |
correlation/ |
Event correlation engine |
patterns/ |
Pattern detection |
baseline/ |
Baseline metrics |
forecast/ |
Forecasting subsystem |
Appendix D: Tool Inventory
The assistant exposes 50+ tools across several categories. This appendix groups them by FSM classification.
Resolve Tools (ToolKindResolve)
| Consolidated Tool | Legacy Equivalents | Purpose |
|---|---|---|
pulse_query |
pulse_search_resources, pulse_get_resource, pulse_get_topology, pulse_list_infrastructure, pulse_get_connection_health |
Resource search, get, config, topology, list, health |
pulse_discovery |
pulse_get_discovery, pulse_list_discoveries |
Infrastructure discovery |
Read-Only Tools (ToolKindRead)
| Tool | Source File | Purpose |
|---|---|---|
pulse_read |
tools_read.go |
Exec, file read, find, tail, logs (enforced read-only at tool layer) |
pulse_metrics |
tools_metrics.go |
Performance metrics queries |
pulse_storage |
tools_storage.go |
Storage pool/volume information |
pulse_kubernetes |
tools_kubernetes.go |
Kubernetes cluster information |
pulse_pmg |
tools_pmg_consolidated.go |
Proxmox Mail Gateway status, queues, spam stats |
| Legacy patrol tools | tools_patrol.go |
pulse_get_metrics, pulse_get_baselines, pulse_get_patterns, pulse_list_alerts, pulse_list_findings, etc. |
| Legacy infrastructure tools | tools_infrastructure.go |
pulse_list_backups, pulse_get_temperatures, pulse_get_ceph_status, pulse_get_network_stats, etc. (25+ read-only tools) |
| Legacy intelligence tools | tools_intelligence.go |
pulse_get_incident_window, pulse_correlate_events, pulse_recall, etc. |
Write Tools (ToolKindWrite)
| Tool | Source File | Purpose |
|---|---|---|
pulse_control |
tools_control_consolidated.go |
Guest control, run command (always Write) |
pulse_run_command |
tools_control.go |
Legacy command execution (Write) |
Action-Dependent Tools
| Tool | Write Actions | Read Actions (default) | Source File |
|---|---|---|---|
pulse_alerts |
resolve, dismiss |
all others | tools_alerts.go |
pulse_docker |
control, update, check_updates, trigger_update |
services, tasks, swarm, list, etc. |
tools_docker.go |
pulse_knowledge |
remember, note, save |
recall, others |
tools_knowledge.go |
pulse_file_edit |
write, append |
read (default) |
tools_file.go |