§4

Response System

Graduated response levels (0–5), actuation via cgroups v2 and XDP, quarantine modes by environment class, selectivity policy, hysteresis, the Thalamic Filter, and decision audit trail.

Based on Whitepaper v2.1 — Sections 5.4–5.7 · ~30 min read

1. Design Philosophy — The Reflex Arc

The HOSA response system is modeled after the spinal reflex arc in the human nervous system. When you touch a hot surface, the nociceptive signal does not travel the full path to the cerebral cortex for contextual processing and conscious deliberation — the latency of that long path (hundreds of milliseconds) would result in tissue damage. Instead, the signal travels a short arc to the spinal cord, which executes a reflexive muscle contraction within tens of milliseconds, withdrawing the limb from the source of harm. The cortex is notified after the reflex executes.

Biological Reflex Arc

  • Sensor detects noxious stimulus
  • Signal → spinal cord (short path)
  • Immediate motor response
  • Cortex notified after the fact
  • → Tissue preserved

HOSA Response System

  • eBPF probes detect anomaly acceleration
  • Signal → user-space engine (local)
  • Immediate cgroups/XDP actuation
  • Orchestrator notified via webhook
  • → Node preserved

This pattern — immediate local action followed by contextual notification to command center — is the operational model of HOSA. The response system is not binary ("everything is fine" vs. "kill everything"). It implements a spectrum of proportional responses that escalate with the severity and rate of change of the detected anomaly.

Three non-negotiable principles govern all response actions:

  1. Proportionality. The severity of the response matches the severity and acceleration of the anomaly. Throttling before killing. Containment before isolation.
  2. Reversibility. Every action at Levels 0–4 is automatically reversible. No destructive action (process kill, interface deactivation) is executed below Level 5.
  3. Observability. Every autonomous action is logged locally with full mathematical justification — the exact DM value, derivative, threshold crossed, dimensional contribution, and action taken. The agent is fully auditable.

2. Response Levels Overview

HOSA implements six response levels (0–5), each with specific activation conditions, actions, and reversibility guarantees. The levels form a monotonically escalating spectrum of intervention intensity:

Figure 1 The graduated response spectrum: Level 0 (Homeostasis) → 1 (Vigilance) → 2 (Soft Containment) → 3 (Active Containment) → 4 (Severe Containment) → 5 (Quarantine), ranging from no intervention, through proportional response, to full isolation.
Table 1 Summary of all six response levels with activation conditions and reversibility.

Level | Name | Activation Condition | Action | Reversibility
0 | Homeostasis | DM < θ₁ and dDM/dt ≤ 0 | None. Suppress redundant telemetry (heartbeat only). | N/A
1 | Vigilance | DM > θ₁ or sustained dDM/dt > 0 | Local logging. Increase sampling rate. No intervention. | Automatic (return to L0 when condition ceases)
2 | Soft Containment | DM > θ₂ and dDM/dt > 0 | renice non-essential processes via cgroups. Webhook notification. | Automatic (gradual relaxation)
3 | Active Containment | DM > θ₃ and d²DM/dt² > 0 (positive acceleration) | CPU/memory throttling via cgroups. Partial load shedding via XDP. Urgent webhook. | Automatic with hysteresis
4 | Severe Containment | DM > θ₄ or convergence velocity indicates exhaustion in < T seconds | Aggressive throttling. XDP blocks all inbound except healthcheck. Freeze non-critical cgroups. | Requires sustained DM < θ₃
5 | Quarantine | Containment failure at previous levels. DM rising despite active mitigations. | Network isolation. Non-essential processes frozen (SIGSTOP). Detailed log persisted. | Manual intervention required

The thresholds θ₁ through θ₄ are not static constants. They are computed during the warm-up phase as multiples of the baseline standard deviation observed in homeostasis (e.g., θ₁ = 2σ, θ₂ = 3σ, θ₃ = 4σ, θ₄ = 5σ). This ensures the thresholds are adapted to the specific node's behavioral profile, not arbitrary values.
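The warm-up derivation above can be sketched in a few lines of Python (function name and the exact multipliers are illustrative; the whitepaper gives θ₁ = 2σ … θ₄ = 5σ as an example):

```python
import statistics

def compute_thresholds(warmup_dm_samples, multipliers=(2, 3, 4, 5)):
    """Derive θ1..θ4 from the DM values observed during the warm-up
    phase, as multiples of the baseline standard deviation."""
    sigma = statistics.stdev(warmup_dm_samples)
    return [k * sigma for k in multipliers]
```

Because σ is measured on this node's own homeostatic behavior, a noisy node gets wider thresholds than a quiet one — the same anomaly magnitude is judged relative to the node's normal variability.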

3. Level 0 — Homeostasis

The steady-state normal operation. DM is low and stable, the Load Direction Index φ oscillates around zero, derivatives are near zero, and the covariance structure is consistent with the baseline profile. HOSA performs no intervention — the system is operating within its expected behavioral envelope.

Level 0 ⟺ DM(t) < θ₁ ∧ dD̄M/dt ≤ 0
Condition — Homeostasis

During homeostasis, the primary activity of the agent is baseline refinement: μ and Σ continue to be updated incrementally via Welford (see §3 — Welford), continuously improving the statistical profile of the node.
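The incremental μ/Σ refinement can be sketched as follows (class name illustrative; this is the standard Welford co-moment form extended to a covariance matrix, consistent with §3):

```python
class WelfordBaseline:
    """Incrementally estimates the baseline mean vector μ and
    covariance matrix Σ, one state-vector sample at a time."""

    def __init__(self, dims):
        self.n = 0
        self.mean = [0.0] * dims
        # M2 accumulates co-moments; Σ = M2 / (n − 1)
        self.m2 = [[0.0] * dims for _ in range(dims)]

    def update(self, x):
        self.n += 1
        delta = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + d / self.n for mi, d in zip(self.mean, delta)]
        delta2 = [xi - mi for xi, mi in zip(x, self.mean)]
        for i in range(len(x)):
            for j in range(len(x)):
                self.m2[i][j] += delta[i] * delta2[j]

    def covariance(self):
        return [[v / max(self.n - 1, 1) for v in row] for row in self.m2]
```

The update is O(d²) per sample and requires no stored history, which is why baseline refinement can run continuously without competing with the workload for memory.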

3.1. The Thalamic Filter

In neuroscience, the thalamus acts as a relay station that filters sensory information before it reaches the cortex — suppressing redundant or low-priority signals to prevent cognitive overload. HOSA implements an analogous mechanism: the Thalamic Filter.

When the system is in homeostasis, the vast majority of telemetry data is redundant — it confirms what is already known: "the system is healthy." Transmitting this data to external monitoring systems (Prometheus, Datadog, etc.) incurs cost:

  • Network bandwidth — metric payloads consume egress
  • TSDB storage — each sample is stored, indexed, and retained
  • Query cost — more data increases query latency and compute
  • Financial cost — cloud monitoring is typically priced per metric per month

The Thalamic Filter suppresses detailed telemetry during homeostasis, emitting only a periodic heartbeat that confirms the node is alive, healthy, and in homeostasis. When DM begins to rise (transition to Level 1+), the filter is deactivated and full telemetry resumes immediately.

FinOps Impact

For a fleet of 1,000 nodes where 95% are in homeostasis at any given time, the Thalamic Filter can reduce metric ingestion volume by up to 90% — a direct and significant reduction in observability costs. The filter does not compromise detection capability because detection is performed locally by the agent, not by the external monitoring system.
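The filter's emit-or-suppress decision is simple; a minimal sketch (class and field names are illustrative, not HOSA's actual API):

```python
class ThalamicFilter:
    """Suppresses detailed telemetry while the node is at Level 0,
    emitting only a periodic heartbeat. emit() returns the records
    that would actually be transmitted externally."""

    def __init__(self, heartbeat_interval_s=60.0):
        self.heartbeat_interval_s = heartbeat_interval_s
        self._last_heartbeat = float("-inf")

    def emit(self, level, record, now):
        if level >= 1:
            return [record]   # filter deactivated: full telemetry resumes
        if now - self._last_heartbeat >= self.heartbeat_interval_s:
            self._last_heartbeat = now
            return [{"type": "heartbeat", "status": "homeostasis"}]
        return []             # redundant sample suppressed
```

Note that suppression is a transmission decision only: the agent still evaluates every sample locally, which is why the filter cannot mask an anomaly.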

4. Level 1 — Vigilance

The first stage of heightened awareness. HOSA has detected a deviation from baseline that warrants closer observation, but the deviation is not yet severe or accelerating enough to justify intervention.

Level 1 ⟺ DM(t) > θ₁ ∨ sustained dD̄M/dt > 0
Condition — Vigilance

Actions:

  • Sampling rate increase. The eBPF collection interval is reduced from the homeostasis rate (typically 100ms) to a heightened rate (typically 10ms), providing 10× temporal resolution for derivative estimation.
  • Local structured logging. HOSA begins recording the state vector, DM, derivatives, and dimensional contributions to the local audit log at full resolution.
  • Thalamic Filter deactivated. Full telemetry is transmitted to external monitoring systems, ensuring the operator has visibility.
  • No system intervention. No processes are modified, throttled, or signaled. HOSA is observing, not acting.

Reversibility: Automatic. When DM drops below θ₁ and the derivative is non-positive for a sustained period, HOSA returns to Level 0. The sampling rate is restored and the Thalamic Filter is re-engaged.

Why Not Intervene Immediately?

Many anomalies are transient — a brief CPU spike from a cron job, a momentary memory allocation for a large request, a burst of network traffic from a health check cascade. Level 1 provides a grace period during which the agent accumulates evidence before committing to intervention. This dramatically reduces false positive interventions while adding minimal latency (typically < 1 second) to genuine escalation.
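The "sustained" qualifier in the Level 1 condition amounts to a debounce: the condition must hold across consecutive samples before the transition commits. A sketch under that assumption (class name and sample count are illustrative):

```python
class VigilanceGate:
    """Debounces the Level 0 → 1 transition: a single noisy sample
    does not escalate; the condition must hold for `required`
    consecutive samples (the grace period described above)."""

    def __init__(self, theta1, required=3):
        self.theta1 = theta1
        self.required = required
        self.streak = 0

    def observe(self, dm, d_dm_dt):
        if dm > self.theta1 or d_dm_dt > 0:
            self.streak += 1
        else:
            self.streak = 0          # transient spike: evidence discarded
        return self.streak >= self.required
```

At a 100ms sampling interval, requiring a handful of consecutive positive samples adds well under a second of latency — consistent with the < 1 second figure above — while filtering out one-sample spikes entirely.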

5. Level 2 — Soft Containment

The anomaly is confirmed and growing. DM has crossed the second threshold and the first derivative remains positive — the system is moving away from homeostasis and not self-correcting. HOSA begins gentle intervention that reduces the priority of non-essential workloads without hard-limiting any process.

Level 2 ⟺ DM(t) > θ₂ ∧ dD̄M/dt > 0
Condition — Soft Containment

Actions:

  • renice non-essential processes. HOSA adjusts the scheduling priority of processes in non-protected cgroups via cpu.weight, giving protected processes (safelist) preferential access to CPU time without hard-capping any process.
  • Webhook notification (opportunistic). HOSA dispatches a POST to the configured webhook endpoint with severity warning, including the full state vector, dimensional contribution decomposition, and the suspected contributing cgroup.
  • Dimensional decomposition logging. The per-dimension contributions cⱼ to DM² are computed and logged, identifying which resources are driving the anomaly (see §3 — Dimensional Contribution).

Reversibility: Automatic. When DM drops below θ₂, the cpu.weight values are gradually restored to their original values over a configurable relaxation period (default: 30 seconds). The gradual relaxation prevents oscillation (flapping) between Level 1 and Level 2.

6. Level 3 — Active Containment

The critical transition point. The anomaly is not only present and growing — it is accelerating. The second derivative d²D̄M/dt² is positive, indicating that the rate of departure from homeostasis is itself increasing. Without intervention, the system will reach resource exhaustion.

Level 3 ⟺ DM(t) > θ₃ ∧ d²D̄M/dt² > 0
Condition — Active Containment

This is where HOSA deploys its primary actuation mechanisms: cgroups v2 for resource throttling and XDP for network load shedding.

6.1. Actuation via cgroups v2

cgroups v2 [1] provides the kernel's native interface for controlling resource allocation per process group. HOSA manipulates cgroup control files directly via the Linux VFS — no external libraries or daemons are required.

Resource | Control File | HOSA Action | Effect
Memory | memory.high | Reduce from current limit to a lower value | Kernel applies memory backpressure (aggressive reclaim). Process slows allocation rate but is not killed.
Memory (hard) | memory.max | Set as absolute ceiling (Level 4+ only) | Allocations beyond this limit trigger OOM within the cgroup, confined to the offending process group.
CPU | cpu.max | Reduce quota (e.g., from 100000/100000 to 50000/100000) | Process group is limited to the specified fraction of CPU time per period.
I/O | io.max | Set read/write bandwidth limits | I/O operations exceeding the limit are throttled by the block I/O scheduler.

The critical design decision: HOSA uses memory.high, not memory.max, at Level 3. The memory.high boundary is a soft limit that instructs the kernel to apply memory reclaim pressure — the process slows down but continues executing. This preserves in-flight transactions and avoids the destructive effects of OOM-kill.
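Because cgroup v2 control files are plain VFS files, actuation reduces to writes under /sys/fs/cgroup. A minimal sketch (function names are illustrative; the root override exists only to make the sketch testable):

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")   # cgroups v2 unified hierarchy

def set_memory_high(cgroup: str, limit_bytes: int, root: Path = CGROUP_ROOT):
    """Apply the Level 3 soft memory limit by writing memory.high.
    The kernel responds with reclaim pressure, not OOM-kill."""
    (root / cgroup / "memory.high").write_text(f"{limit_bytes}\n")

def clear_memory_high(cgroup: str, root: Path = CGROUP_ROOT):
    """Restore the default: 'max' means no soft limit."""
    (root / cgroup / "memory.high").write_text("max\n")
```

No library, daemon, or syscall wrapper is needed — which is precisely the property that lets the agent act with minimal dependencies during a resource crisis.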

6.2. Load Shedding via XDP

XDP (eXpress Data Path) [2] allows packet processing at the earliest possible point in the network stack — at the NIC driver level, before the kernel allocates an sk_buff structure. This makes XDP-based load shedding extraordinarily efficient: dropped packets consume near-zero CPU.

At Level 3, HOSA attaches an XDP program that implements partial load shedding:

  • New connections are dropped. SYN packets from addresses not in the existing connection table are discarded, preventing the server from accepting additional work.
  • Existing connections are preserved. Packets belonging to established connections (matched by 5-tuple) continue to be processed normally, allowing in-flight transactions to complete.
  • Healthcheck traffic is exempted. Packets from configured healthcheck sources (e.g., load balancer IP, Kubernetes API server) are always passed through, ensuring the node remains visible to the orchestrator.
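The actual program is eBPF C attached at the driver, but the verdict policy the three rules above encode can be sketched in Python (field and function names are illustrative):

```python
def xdp_verdict(packet, established, healthcheck_sources):
    """Level 3 partial load-shedding decision, evaluated per packet.
    packet: dict with src_ip, five_tuple, is_syn.
    established: set of known connection 5-tuples.
    healthcheck_sources: set of always-exempt source IPs."""
    if packet["src_ip"] in healthcheck_sources:
        return "PASS"   # orchestrator must always see the node
    if packet["five_tuple"] in established:
        return "PASS"   # preserve in-flight connections
    if packet["is_syn"]:
        return "DROP"   # refuse new work
    return "DROP"       # stray packet with no known flow
```

The order of checks matters: the healthcheck exemption is evaluated first so that no later rule can ever make the node invisible to its orchestrator.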
Why XDP over iptables/nftables?

Traditional packet filtering via iptables/nftables operates at the Netfilter layer — after the kernel has already allocated memory for the packet (sk_buff), parsed headers, and performed routing lookups. Under a DDoS flood, this processing itself can saturate the CPU. XDP drops packets before any of this work occurs, providing packet filtering that scales to millions of packets per second with minimal CPU overhead.

Reversibility: Automatic with hysteresis. When DM drops below θ₂ (not θ₃ — the lower threshold provides a buffer against oscillation) for a sustained period (default: 60 seconds), containment is gradually relaxed. The XDP program transitions from drop-new to rate-limit-new before being fully removed. cgroup limits are restored incrementally, not instantaneously.

7. Level 4 — Severe Containment

The anomaly is severe. Either DM has exceeded the fourth threshold, or the velocity of convergence toward resource exhaustion indicates that a critical resource (typically memory or disk) will be fully consumed within a short time window.

Level 4 ⟺ DM(t) > θ₄ ∨ TTF(t) < T_critical

where TTF = estimated Time To resource Failure based on derivative extrapolation
Condition — Severe Containment
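The TTF estimate is a first-derivative extrapolation: seconds until the resource reaches capacity at the observed consumption rate. A minimal sketch (function name illustrative):

```python
def estimate_ttf(current, capacity, rate_per_s):
    """Estimated Time To resource Failure. Returns None when the
    resource is not trending toward exhaustion."""
    if rate_per_s <= 0:
        return None
    return (capacity - current) / rate_per_s
```

For example, a node with 12 GB of 16 GB RAM consumed and a leak of 0.05 GB/s has TTF = (16 − 12) / 0.05 = 80 seconds; if T_critical is, say, 120 seconds, Level 4 triggers even though DM may not yet have crossed θ₄.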

Actions:

  • Aggressive throttling. cpu.max and memory.max (hard limits) are applied to contributing cgroups. The memory.high soft limit from Level 3 is replaced with a hard ceiling.
  • Full inbound traffic block. The XDP program is updated to drop all inbound traffic except: healthcheck probes from the orchestrator, and management traffic (SSH, IPMI) from configured addresses.
  • Non-critical cgroup freeze. cgroups identified as non-essential are frozen via cgroup.freeze, suspending all processes within them. This is equivalent to sending SIGSTOP to every process in the group, but managed at the cgroup level.
  • Critical webhook. A high-priority notification is dispatched with severity critical, including the full state vector, the estimated TTF, and the actions taken.

Reversibility: Requires sustained recovery. DM must drop below θ₃ and remain there for an extended period (default: 5 minutes) before Level 4 mitigations are relaxed. The extended hold period accounts for the severity of the situation — a brief dip in DM during a cascading failure should not trigger premature relaxation.

8. Level 5 — Autonomous Quarantine

The last resort. All previous containment levels have failed to stabilize the system. DM continues to rise despite active throttling and load shedding. The node is in uncontrolled collapse or has been compromised by activity that cannot be contained by resource throttling alone.

Level 5 ⟺ (Containment failure at Levels 3–4 ∧ DM(t) rising despite active mitigations)
∨ ICP(t) > ICP_critical
(high Propagation Behavior Index — node may be spreading the problem)
Condition — Autonomous Quarantine

Actions:

  • Network isolation. Programmatic deactivation of network interfaces (strategy varies by environment — see §9).
  • Process freeze. All non-essential processes are frozen via SIGSTOP. Kernel processes, the HOSA agent itself, and explicitly protected processes continue running.
  • Persistent logging. A detailed forensic log is written to persistent storage, including the complete timeline of detection, escalation, and actions taken.
  • Final webhook. Before network isolation, HOSA dispatches a last webhook signaling the quarantine state. If the webhook fails (network already compromised), the state is signaled through available out-of-band channels (IPMI, cloud metadata, Kubernetes API — depending on environment).

Reversibility: Manual intervention required. Level 5 is the only level that cannot be automatically reversed. An administrator must inspect the node, diagnose the root cause, and explicitly restore the system. This is a deliberate design decision: if the agent's full arsenal of containment measures failed to stabilize the system, the problem requires human judgment.

The ICP Trigger

Level 5 can also be triggered by a high Propagation Behavior Index (ICP) regardless of containment effectiveness. If the node shows signs of propagation behavior (outbound connection explosion, anomalous forks, destination entropy — see §3 — Supplementary Metrics), the priority shifts from preserving the node to protecting the cluster. Network isolation is applied preemptively to prevent lateral movement or cascading failure.

9. Quarantine Modes by Environment Class

The feasibility and strategy of network isolation vary fundamentally by infrastructure class. A bare-metal server with IPMI can safely deactivate its primary network interfaces because it remains accessible via out-of-band management. A cloud VM that deactivates its network interface becomes permanently unreachable. HOSA implements environment-aware quarantine modes, selected automatically during the Hardware Proprioception phase or configured explicitly by the operator.

9.1. Bare Metal with IPMI/iLO/iDRAC

Detection: Presence of IPMI interface via /sys/class/net/ and ipmi_* kernel modules.

Strategy: Programmatic deactivation of all network interfaces except the out-of-band management interface (IPMI/iLO/iDRAC). The node remains accessible via management console for diagnosis and restoration.

Recovery: Manual via IPMI console. Operator inspects HOSA logs, diagnoses root cause, restores interfaces, and restarts services.

9.2. Cloud VM (AWS, GCP, Azure)

Detection: Via DMI/SMBIOS (dmidecode), presence of metadata service (169.254.169.254), and hypervisor identification via /sys/hypervisor/ or CPUID.

Strategy: Does not deactivate network interfaces. Instead:

  1. XDP applies total drop on all inbound/outbound traffic except: traffic to the cloud metadata service (169.254.169.254), DHCP traffic (IP lease maintenance), and traffic to the orchestrator API endpoint.
  2. HOSA signals quarantine state via cloud-native mechanisms: instance tag/label via metadata service (e.g., hosa-quarantine=true), SNS/Pub-Sub publication, or healthcheck endpoint returning HTTP 503 with JSON body detailing the state.
  3. The external orchestrator (Kubernetes, ASG, etc.) is responsible for the terminate/replace decision.

Recovery: The external orchestrator terminates the instance and provisions a replacement. If the orchestrator does not act within a configurable timeout (default: 5 minutes), HOSA can execute self-termination via cloud provider API (when IAM credentials with appropriate permissions are available). Self-termination is disabled by default and requires explicit activation.

9.3. Kubernetes (Pod/Container)

Detection: Execution in container via /proc/1/cgroup with cgroup namespace, environment variables KUBERNETES_SERVICE_HOST, or service account mounted at /var/run/secrets/kubernetes.io/.

Strategy: HOSA operating as DaemonSet does not isolate the node (it lacks permission to deactivate host interfaces). Instead:

  1. Applies maximum containment via cgroups on pods identified as contributors.
  2. Updates the Node status via Kubernetes API with taint hosa.io/quarantine=true:NoExecute and condition HOSAQuarantine=True, causing automatic pod evacuation by the scheduler.
  3. Emits a Kubernetes Event in the namespace of the affected pod with type Warning and reason HOSAQuarantine.

Recovery: Operator or automation removes the taint after investigation. The node returns to the scheduling pool.

9.4. Edge/IoT Environments

For edge devices with physical access (flag environment: edge-physical): complete network interface deactivation. The device operates in isolated mode until physical intervention. Logs are preserved on local persistent storage (flash/eMMC). If the device has a status LED or display, HOSA signals quarantine state visually.

For edge devices without physical access (flag environment: edge-remote): quarantine with hardware watchdog timer. Network interfaces are deactivated, and a hardware watchdog (/dev/watchdog) is activated with a configurable timeout (default: 30 minutes). If no intervention occurs before the timeout, the watchdog reboots the device, which returns to pre-quarantine state with a persistent flag quarantine_recovery=true. On restart with this flag, HOSA enters conservative mode (logging only) for a configurable period to allow remote diagnosis.

9.5. Air-Gapped Networks

(flag environment: airgap): Identical to bare metal, with the addition that all opportunistic communication is permanently disabled (no webhooks, no endpoint exposure). HOSA operates in purely endogenous mode. Logs are written exclusively to local encrypted storage and collected periodically by authorized personnel with physical access.

Design Principle: Auto-Detect with Manual Override

HOSA attempts to automatically detect the environment class and select the appropriate quarantine mode. The operator can override this detection via explicit configuration. In case of ambiguity (e.g., VM in private cloud that doesn't respond to the standard metadata service), HOSA assumes the most conservative mode (cloud — does not deactivate interfaces), prioritizing recoverability over isolation.

10. Selectivity Policy — The Throttling Problem

Process throttling via cgroups is an effective mitigation against resource exhaustion, but it introduces secondary risks that must be explicitly addressed. This section formalizes HOSA's selectivity policy — the rules governing which processes are throttled and which are protected.

10.1. Risks of Throttling

Risk | Mechanism | Potential Impact
Cascading timeouts | A throttled HTTP backend accumulates connections upstream | Degradation propagates to services that depend on the throttled service
Transaction deadlocks | A process throttled during a database transaction holds locks indefinitely | Other transactions block, potentially freezing the entire database
Critical starvation | If kubelet is throttled, the node is marked NotReady | All pods are evacuated, causing more damage than the original problem
Self-referential failure | If the HOSA agent itself is throttled | Detection latency increases, potentially missing the window for effective mitigation

10.2. The Safelist

HOSA maintains a safelist of processes and cgroups that are never targeted for throttling, regardless of their resource consumption:

  • Kernel processes — kthreadd, ksoftirqd, kworker, etc.
  • The HOSA agent itself — always the first entry in the safelist
  • Orchestration agents — kubelet, containerd, dockerd (auto-detected when present)
  • Operator-designated processes — explicitly marked via configuration or cgroup label

The safelist is populated during the Hardware Proprioception phase and can be modified at runtime via the HOSA configuration file.

10.3. Contributor Targeting

Throttling is applied preferentially to the processes identified as the dominant contributors to the anomaly. The targeting algorithm uses two inputs:

  1. Dimensional decomposition (from §3 — Dimensional Contribution): identifies which resource is driving the anomaly (e.g., memory contributes 68% of DM²).
  2. Per-cgroup consumption delta: identifies which process group is consuming the most of the dominant resource. This is determined by comparing memory.current, cpu.stat, or io.stat across cgroups and identifying the group with the largest recent increase.

The combination of dimensional decomposition and per-cgroup attribution allows HOSA to answer the critical question: "What specific process is consuming what specific resource in a way that is driving the system away from homeostasis?" — and to target its intervention precisely at that intersection.
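That intersection logic can be sketched directly (function name illustrative; inputs follow the two-step description above):

```python
def select_target(contributions, cgroup_deltas, safelist):
    """Pick the cgroup to throttle: the non-safelisted group with the
    largest recent consumption increase in the resource dominating DM².
    contributions: {dimension: fraction of DM²}
    cgroup_deltas: {dimension: {cgroup: recent consumption delta}}
    Returns (dominant dimension, target cgroup or None)."""
    dominant = max(contributions, key=contributions.get)
    candidates = {cg: d for cg, d in cgroup_deltas.get(dominant, {}).items()
                  if cg not in safelist}
    if not candidates:
        return dominant, None    # nothing safe to throttle on this axis
    return dominant, max(candidates, key=candidates.get)
```

The safelist check is applied before selection, not after: a protected process can never be the answer, even if it is the largest consumer of the dominant resource.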

11. Hysteresis and De-Escalation

A naïve implementation that escalates at threshold θn and de-escalates at the same threshold would produce oscillation (flapping): the system crosses the threshold, containment activates, DM drops below the threshold, containment deactivates, DM rises again, and the cycle repeats indefinitely.

HOSA implements hysteresis on all level transitions:

Escalation: Level N → N+1 when DM > θ_{N+1}

De-escalation: Level N → N−1 when DM < θ_{N−1} for t > T_hold(N)
Equation 1 — Hysteresis Rules

Where:

  • Escalation threshold is the next level's threshold (θ_{N+1})
  • De-escalation threshold is the previous level's threshold (θ_{N−1}) — one level below the escalation threshold, creating a dead zone that prevents oscillation
  • T_hold(N) is the minimum hold time before de-escalation is permitted — longer for higher levels:
Transition | De-escalation Threshold | Minimum Hold Time
Level 1 → 0 | DM < θ₁ | 10 seconds
Level 2 → 1 | DM < θ₁ | 30 seconds
Level 3 → 2 | DM < θ₂ | 60 seconds
Level 4 → 3 | DM < θ₃ | 5 minutes
Level 5 → manual | N/A | N/A (manual only)

Additionally, de-escalation is always gradual: HOSA never jumps from Level 4 to Level 0 in a single step. Each level is traversed in sequence, with the hold time enforced at each transition. This ensures that relaxation of containment measures is cautious and monitored.
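The full transition logic — escalation at the next threshold, de-escalation one step at a time after a sustained stay below the de-escalation threshold — can be sketched as a small state machine (class name illustrative; Level 5, being manual-only, is omitted):

```python
class HysteresisController:
    """Sketch of the Equation 1 transitions with the per-level hold
    times from the table above."""

    def __init__(self, thetas, hold_seconds):
        self.thetas = thetas        # [θ1, θ2, θ3, θ4]
        self.hold = hold_seconds    # {level: minimum hold time in seconds}
        self.level = 0
        self._below_since = None

    def step(self, dm, now):
        # Escalation: climb while DM exceeds the next level's threshold.
        while self.level < 4 and dm > self.thetas[self.level]:
            self.level += 1
            self._below_since = None
        # De-escalation threshold: θ(N−1) for N ≥ 2, θ1 for N = 1.
        if self.level > 0:
            floor = self.thetas[max(self.level - 2, 0)]
            if dm < floor:
                if self._below_since is None:
                    self._below_since = now
                elif now - self._below_since >= self.hold[self.level]:
                    self.level -= 1           # one step only, never a jump
                    self._below_since = None
            else:
                self._below_since = None      # recovery interrupted; reset
        return self.level
```

A DM excursion above the floor at any point resets the hold timer, so only genuinely sustained recovery relaxes containment.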

The Second Derivative as De-Escalation Accelerator

When d²D̄M/dt² is strongly negative (the system is decelerating rapidly toward homeostasis — the mitigation is clearly working), the hold times can be optionally reduced by a configurable factor. This allows faster recovery when the evidence of effective mitigation is unambiguous. This optimization is disabled by default and requires explicit activation.

12. Decision Observability and Audit Trail

Every autonomous action taken by HOSA is recorded in a local structured log with full mathematical justification. This is not optional — it is Architectural Principle #5 (Decision Observability). An agent that executes autonomous mitigation on production systems must be auditable.

Each log entry contains:

Field | Description | Example
timestamp | Nanosecond-precision UTC timestamp | 2026-03-10T14:23:09.127Z
level | Response level (0–5) | 2
level_name | Human-readable level name | SOFT_CONTAINMENT
d_m | Current Mahalanobis Distance | 4.7
d_m_derivative | First derivative (velocity) | +2.1
d_m_acceleration | Second derivative (acceleration) | +0.5
phi | Load Direction Index | +1.8
dominant_dims | Top contributing dimensions with cⱼ percentages | mem_used:68%, mem_pressure:19%, io_latency:8%
target_cgroup | The cgroup targeted for action (if any) | /kubepods/pod-payment-7b4f
action | The specific action taken | memory.high 2G→1.6G
action_effect | Whether the action is having the desired effect | effective (d²DM/dt² < 0)
regime | Current regime classification | +4 (Local Failure)

Logs are written to /var/log/hosa/decisions.log in JSON Lines format. Log rotation is managed by the agent (configurable max file size and retention). When webhook connectivity is available, the same structured data is included in webhook payloads.
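One JSON Lines record is one json.dumps per decision. A sketch of building such a record (function name illustrative; field names follow the table above, with most fields elided for brevity):

```python
import datetime
import json

def decision_record(level, level_name, d_m, d_m_derivative,
                    target_cgroup, action):
    """Build one entry for /var/log/hosa/decisions.log
    (JSON Lines: one JSON object per line)."""
    return json.dumps({
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .isoformat(timespec="milliseconds"),
        "level": level,
        "level_name": level_name,
        "d_m": d_m,
        "d_m_derivative": d_m_derivative,
        "target_cgroup": target_cgroup,
        "action": action,
    })
```

Because each line is independently parseable, the audit trail survives truncation: a crash mid-write corrupts at most the final line.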

"The operator should be able to read HOSA's decision log and answer, for every action taken: What did the agent observe? What mathematical threshold was crossed? What action did it take? Was the action effective? The log is the agent's testimony."

13. Walkthrough: Memory Leak in Payment Service

This end-to-end scenario illustrates the full perceptive-motor cycle of the response system, from detection through containment and stabilization. Numerical values are representative and based on behavior observed in production systems.

Context:

  • Node: VM worker-node-07 in Kubernetes cluster, 8 vCPUs, 16GB RAM
  • Workload: 12 pods, including payment-service-7b4f (Java, 2GB cgroup limit)
  • External monitoring: Prometheus with 15s scrape interval, alert rule: memory > 1.8GB for 1m
  • HOSA: In homeostasis (Level 0) for 6 hours. Baseline calibrated. 8-dimensional state vector.
  • Memory leak rate: ~50MB/s (circular reference in session cache)
Time | System State | HOSA (Endogenous) | Prometheus (Exogenous)
t=0s | Leak starts. mem: 61% | DM=1.1. Level 0 (Homeostasis). | Last scrape 8s ago. Data shows: "healthy".
t=1s | mem: 64%. PSI: 18% | ⚡ DM=2.8. Level 0→1. Sampling: 100ms→10ms. Log: dominant dim = mem_used (72%). | (no scrape)
t=2s | mem: 68%. PSI: 29%. Swap activating. | ⚡ DM=4.7. Level 1→2. memory.high → 1.6G on payment-service-7b4f. Webhook dispatched. | (no scrape)
t=4s | mem: 72% (rate reduced by containment) | DM=5.9. dDM/dt decelerating (+2.1→+1.2). d²DM/dt² = −0.45. Containment effective. Maintains Level 2. | Scrape! mem=1.47GB. Rule: >1.8GB for 1m. Result: OK (!)
t=8s | mem: 74% (plateau — reclaim ≈ allocation) | DM=6.2. dDM/dt ≈ 0. Stabilized. System degraded but functional. Transactions preserved. | (no scrape)
t=35s | mem: 75% (held by containment) | Maintains containment. Operator received webhook with full dimensional context. | Scrape. mem=1.55GB. Rule: OK (!) — containment prevents threshold breach.

Counterfactual — Without HOSA:

  • t≈40s: Container allocates ~3GB, exceeds 2GB cgroup limit. OOM-Kill terminates the Java process. All in-flight payment transactions are aborted without graceful shutdown.
  • t≈80s: Kubelet restarts the pod. Memory leak persists. CrashLoopBackOff begins (~40s crash cycles).
  • t≈100s: Prometheus fires the alert (its 1-minute hold condition finally satisfied) — 60 seconds after the first crash. Customers have been experiencing 502 errors for a full minute.
Key Insight

HOSA transformed a destructive crash with data loss into controlled degradation with functional preservation. Detection took 1 second (vs. >60 seconds for the exogenous model). The containment preserved in-flight transactions. The operator received actionable information with dimensional context — not a binary alert 100 seconds too late.

14. Linux Capabilities by Level

When HOSA operates as a container (DaemonSet in Kubernetes), its access to host cgroups and network interfaces depends on Linux capabilities. Following the principle of least privilege, the required capabilities are documented per response level:

Levels | Required Capabilities | Access Needed
0–1 | CAP_BPF | Read-only access to /sys/. eBPF probe attachment for metric collection.
2 | CAP_BPF, CAP_SYS_ADMIN | Write access to cgroup control files (cpu.weight).
3–4 | CAP_BPF, CAP_SYS_ADMIN, CAP_NET_ADMIN | cgroup manipulation (memory.high, cpu.max). XDP program attachment.
5 | CAP_BPF, CAP_SYS_ADMIN, CAP_NET_ADMIN | Network interface manipulation. In Kubernetes mode: API access for taint application.

An operator who wants detection-only (Levels 0–1) without any actuation can deploy HOSA with only CAP_BPF — the agent will detect, log, and notify but never intervene. This is the recommended starting configuration for evaluation deployments.

15. Known Limitations

  1. Throttling side effects. Despite the safelist and contributor targeting mechanisms, throttling a process during a critical section (holding a database lock, mid-write to disk) can cause secondary failures. The experimental phase will quantify these side effects under controlled conditions.
  2. XDP compatibility. XDP requires driver support. Not all network drivers support XDP in native mode — some fall back to generic mode (SKB-based), which offers less performance benefit. The agent gracefully degrades: if native XDP is unavailable, it falls back to generic XDP; if generic XDP is unavailable, network-level load shedding is disabled and only cgroup-based containment is available.
  3. Level 5 irreversibility. The manual recovery requirement for Level 5 means that in environments without accessible operators (remote IoT without physical access and no watchdog), a quarantined node may remain isolated indefinitely. The watchdog timer mechanism (§9.4) mitigates but does not fully solve this in all scenarios.
  4. Threshold calibration sensitivity. The adaptive thresholds (θ₁–θ₄) depend on the quality of the warm-up phase baseline. A warm-up period that coincides with atypical system behavior (ongoing deployment, burst workload) will produce suboptimal thresholds. The experimental phase will quantify threshold sensitivity under various warm-up conditions.
  5. Multi-tenant interference. In shared environments (multiple pods on the same node), throttling one process can indirectly affect others through shared resources (CPU cache, memory bus, I/O scheduler). The decomposition identifies the dominant contributor, but the containment action may have second-order effects on co-located workloads.

16. References

  1. Heo, T. (2015). Control Group v2. Linux Kernel Documentation. kernel.org
  2. Vieira, M. A., Castanho, M. S., Pacífico, R. D. G., Santos, E. R. S., Júnior, E. P. M. C., & Vieira, L. F. M. (2020). Fast Packet Processing with eBPF and XDP: Concepts, Code, Challenges, and Applications. ACM Computing Surveys, 53(1), Article 16.
  3. Gregg, B. (2019). BPF Performance Tools: Linux System and Application Observability. Addison-Wesley Professional.
  4. Horn, P. (2001). Autonomic Computing: IBM's Perspective on the State of Information Technology. IBM Corporation.
  5. Bear, M. F., Connors, B. W., & Paradiso, M. A. (2015). Neuroscience: Exploring the Brain (4th ed.). Wolters Kluwer.
  6. Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media.
  7. Poettering, L. (2020). systemd-oomd: A userspace out-of-memory (OOM) killer. systemd Documentation.
  8. Weiner, J. (2018). PSI — Pressure Stall Information. Linux Kernel Documentation. kernel.org
  9. Tang, C., et al. (2020). FBAR: Facebook's Automated Remediation System. Proceedings of the ACM Symposium on Cloud Computing (SoCC).