1. Design Philosophy — The Reflex Arc
The HOSA response system is modeled on the spinal reflex arc of the human nervous system. When you touch a hot surface, the nociceptive signal does not travel the full path to the cerebral cortex for contextual processing and conscious deliberation — the latency of that long path (hundreds of milliseconds) would result in tissue damage. Instead, the signal travels a short arc to the spinal cord, which executes a reflexive muscle contraction in a fraction of that time, withdrawing the limb from the source of harm. The cortex is notified after the reflex executes.
| Biological Reflex Arc | HOSA Response System |
|---|---|
| Sensor detects noxious stimulus | eBPF probes detect anomaly acceleration |
| Signal → spinal cord (short path) | Signal → user-space engine (local) |
| Immediate motor response | Immediate cgroups/XDP actuation |
| Cortex notified after the fact | Orchestrator notified via webhook |
| → Tissue preserved | → Node preserved |
This pattern — immediate local action followed by contextual notification to command center — is the operational model of HOSA. The response system is not binary ("everything is fine" vs. "kill everything"). It implements a spectrum of proportional responses that escalate with the severity and rate of change of the detected anomaly.
Three non-negotiable principles govern all response actions:
- Proportionality. The severity of the response matches the severity and acceleration of the anomaly. Throttling before killing. Containment before isolation.
- Reversibility. Every action at Levels 0–4 is automatically reversible. No destructive action (process kill, interface deactivation) is executed below Level 5.
- Observability. Every autonomous action is logged locally with full mathematical justification — the exact DM value, derivative, threshold crossed, dimensional contribution, and action taken. The agent is fully auditable.
2. Response Levels Overview
HOSA implements six response levels (0–5), each with specific activation conditions, actions, and reversibility guarantees. The levels form a monotonically escalating spectrum of intervention intensity:
| Level | Name | Activation Condition | Action | Reversibility |
|---|---|---|---|---|
| 0 | Homeostasis | DM < θ₁ and dDM/dt ≤ 0 | None. Suppress redundant telemetry (heartbeat only). | N/A |
| 1 | Vigilance | DM > θ₁ or sustained dDM/dt > 0 | Local logging. Increase sampling rate. No intervention. | Automatic (return to L0 when condition ceases) |
| 2 | Soft Containment | DM > θ₂ and dDM/dt > 0 | renice non-essential processes via cgroups. Webhook notification. | Automatic (gradual relaxation) |
| 3 | Active Containment | DM > θ₃ and d²DM/dt² > 0 (positive acceleration) | CPU/memory throttling via cgroups. Partial load shedding via XDP. Urgent webhook. | Automatic with hysteresis |
| 4 | Severe Containment | DM > θ₄ or convergence velocity indicates exhaustion in < T seconds | Aggressive throttling. XDP blocks all inbound except healthcheck. Freeze non-critical cgroups. | Requires sustained DM < θ₃ |
| 5 | Quarantine | Containment failure at previous levels. DM rising despite active mitigations. | Network isolation. Non-essential processes frozen (SIGSTOP). Detailed log persisted. | Manual intervention required |
The thresholds θ₁ through θ₄ are not static constants. They are computed during the warm-up phase as multiples of the baseline standard deviation observed in homeostasis (e.g., θ₁ = 2σ, θ₂ = 3σ, θ₃ = 4σ, θ₄ = 5σ). This ensures the thresholds are adapted to the specific node's behavioral profile, not arbitrary values.
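The warm-up computation can be sketched as follows. This is a minimal illustration in Python; the multipliers match the example values above, and the use of the population standard deviation is an assumption, not a detail from the specification.

```python
import statistics

def compute_thresholds(baseline_dm, multipliers=(2, 3, 4, 5)):
    """Derive theta_1..theta_4 as multiples of the baseline standard
    deviation of DM observed during warm-up. Multipliers follow the
    2-5 sigma example in the text; pstdev is an assumption."""
    sigma = statistics.pstdev(baseline_dm)
    return tuple(m * sigma for m in multipliers)
```

Because the thresholds scale with σ, a node with a naturally noisy baseline receives proportionally wider tolerance bands than a quiet one, which is exactly the adaptation the text describes.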
3. Level 0 — Homeostasis
The steady-state normal operation. DM is low and stable, the Load Direction Index φ oscillates around zero, derivatives are near zero, and the covariance structure is consistent with the baseline profile. HOSA performs no intervention — the system is operating within its expected behavioral envelope.
During homeostasis, the primary activity of the agent is baseline refinement: μ and Σ continue to be updated incrementally via Welford (see §3 — Welford), continuously improving the statistical profile of the node.
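The incremental update can be sketched with Welford's online algorithm for a single dimension. This is an illustrative fragment: the actual agent also maintains the full covariance matrix Σ needed for the Mahalanobis Distance.

```python
class Welford:
    """Incremental mean/variance (Welford's online algorithm) for one
    dimension of the state vector. Sketch only; the real baseline also
    tracks the covariance structure."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # new running mean
        self.m2 += delta * (x - self.mean)   # uses old and new mean

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0
```

The update is O(1) per sample and numerically stable, which is why it suits an agent that refines its baseline continuously for hours.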
3.1. The Thalamic Filter
In neuroscience, the thalamus acts as a relay station that filters sensory information before it reaches the cortex — suppressing redundant or low-priority signals to prevent cognitive overload. HOSA implements an analogous mechanism: the Thalamic Filter.
When the system is in homeostasis, the vast majority of telemetry data is redundant — it confirms what is already known: "the system is healthy." Transmitting this data to external monitoring systems (Prometheus, Datadog, etc.) incurs cost:
- Network bandwidth — metric payloads consume egress
- TSDB storage — each sample is stored, indexed, and retained
- Query cost — more data increases query latency and compute
- Financial cost — cloud monitoring is typically priced per metric per month
The Thalamic Filter suppresses detailed telemetry during homeostasis, emitting only a periodic heartbeat that confirms the node is alive, healthy, and in homeostasis. When DM begins to rise (transition to Level 1+), the filter is deactivated and full telemetry resumes immediately.
For a fleet of 1,000 nodes where 95% are in homeostasis at any given time, the Thalamic Filter can reduce metric ingestion volume by up to 90% — a direct and significant reduction in observability costs. The filter does not compromise detection capability because detection is performed locally by the agent, not by the external monitoring system.
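The gating logic of the filter can be sketched as a small decision function. The return labels and the 60-second heartbeat interval are illustrative assumptions, not values from the specification.

```python
def thalamic_filter(level, now, last_heartbeat, heartbeat_interval=60.0):
    """Gate telemetry emission. At Level 0 only a periodic heartbeat
    leaves the node; at Level 1+ the filter is bypassed entirely.
    Labels and interval are illustrative."""
    if level >= 1:
        return "full_telemetry"   # filter deactivated on any escalation
    if now - last_heartbeat >= heartbeat_interval:
        return "heartbeat"        # periodic liveness confirmation
    return "suppress"             # redundant sample: do not transmit
```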
4. Level 1 — Vigilance
The first stage of heightened awareness. HOSA has detected a deviation from baseline that warrants closer observation, but the deviation is not yet severe or accelerating enough to justify intervention.
Actions:
- Sampling rate increase. The eBPF collection interval is reduced from the homeostasis rate (typically 100ms) to a heightened rate (typically 10ms), providing 10× temporal resolution for derivative estimation.
- Local structured logging. HOSA begins recording the state vector, DM, derivatives, and dimensional contributions to the local audit log at full resolution.
- Thalamic Filter deactivated. Full telemetry is transmitted to external monitoring systems, ensuring the operator has visibility.
- No system intervention. No processes are modified, throttled, or signaled. HOSA is observing, not acting.
Reversibility: Automatic. When DM drops below θ₁ and the derivative is non-positive for a sustained period, HOSA returns to Level 0. The sampling rate is restored and the Thalamic Filter is re-engaged.
Many anomalies are transient — a brief CPU spike from a cron job, a momentary memory allocation for a large request, a burst of network traffic from a health check cascade. Level 1 provides a grace period during which the agent accumulates evidence before committing to intervention. This dramatically reduces false positive interventions while adding minimal latency (typically < 1 second) to genuine escalation.
5. Level 2 — Soft Containment
The anomaly is confirmed and growing. DM has crossed the second threshold and the first derivative remains positive — the system is moving away from homeostasis and not self-correcting. HOSA begins gentle intervention that reduces the priority of non-essential workloads without hard-limiting any process.
Actions:
- `renice` non-essential processes. HOSA adjusts the scheduling priority of processes in non-protected cgroups via `cpu.weight`, giving protected processes (safelist) preferential access to CPU time without hard-capping any process.
- Webhook notification (opportunistic). HOSA dispatches a `POST` to the configured webhook endpoint with severity `warning`, including the full state vector, dimensional contribution decomposition, and the suspected contributing cgroup.
- Dimensional decomposition logging. The per-dimension contributions cⱼ to DM² are computed and logged, identifying which resources are driving the anomaly (see §3 — Dimensional Contribution).
Reversibility: Automatic. When DM drops below θ₂, the `cpu.weight` values are gradually restored to their original values over a configurable relaxation period (default: 30 seconds). The gradual relaxation prevents oscillation (flapping) between Level 1 and Level 2.
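The gradual restore can be sketched as a linear ramp. The function name and linear interpolation are illustrative assumptions; only the 30-second default comes from the text.

```python
def relaxed_weight(contained, original, elapsed, period=30.0):
    """Linear ramp from the containment cpu.weight back to the original
    value over the relaxation period (default 30 s, per the text).
    Restoring gradually instead of instantly is what prevents flapping."""
    if elapsed >= period:
        return original
    return round(contained + (original - contained) * (elapsed / period))
```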
6. Level 3 — Active Containment
The critical transition point. The anomaly is not only present and growing — it is accelerating. The second derivative d²DM/dt² is positive, indicating that the rate of departure from homeostasis is itself increasing. Without intervention, the system will reach resource exhaustion.
This is where HOSA deploys its primary actuation mechanisms: cgroups v2 for resource throttling and XDP for network load shedding.
6.1. Actuation via cgroups v2
cgroups v2 [1] provides the kernel's native interface for controlling resource allocation per process group. HOSA manipulates cgroup control files directly via the Linux VFS — no external libraries or daemons are required.
| Resource | Control File | HOSA Action | Effect |
|---|---|---|---|
| Memory | `memory.high` | Reduce from current limit to a lower value | Kernel applies memory backpressure (aggressive reclaim). Process slows allocation rate but is not killed. |
| Memory (hard) | `memory.max` | Set as absolute ceiling (Level 4+ only) | Allocations beyond this limit trigger OOM within the cgroup, confined to the offending process group. |
| CPU | `cpu.max` | Reduce quota (e.g., from 100000/100000 to 50000/100000) | Process group is limited to the specified fraction of CPU time per period. |
| I/O | `io.max` | Set read/write bandwidth limits | I/O operations exceeding the limit are throttled by the block I/O scheduler. |
The critical design decision: HOSA uses `memory.high`, not `memory.max`, at Level 3. The `memory.high` boundary is a soft limit that instructs the kernel to apply memory reclaim pressure — the process slows down but continues executing. This preserves in-flight transactions and avoids the destructive effects of OOM-kill.
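Writing a cgroup control file through the VFS is a one-line operation, sketched below. The directory is parameterized so the fragment is testable; on a real node it would be a path under `/sys/fs/cgroup/`, and the agent would need the capabilities listed in §14.

```python
from pathlib import Path

def set_memory_high(cgroup_dir, limit_bytes):
    """Apply a memory.high soft limit by writing the cgroup v2 control
    file directly through the VFS, as described above. cgroup_dir is a
    parameter here for testability; in production it is a cgroup
    directory under /sys/fs/cgroup/."""
    control = Path(cgroup_dir) / "memory.high"
    control.write_text(f"{limit_bytes}\n")
    return control.read_text().strip()
```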
6.2. Load Shedding via XDP
XDP (eXpress Data Path) [2] allows
packet processing at the earliest possible point in the network stack — at the
NIC driver level, before the kernel allocates an
sk_buff structure. This makes XDP-based load shedding
extraordinarily efficient: dropped packets consume near-zero CPU.
At Level 3, HOSA attaches an XDP program that implements partial load shedding:
- New connections are dropped. SYN packets from addresses not in the existing connection table are discarded, preventing the server from accepting additional work.
- Existing connections are preserved. Packets belonging to established connections (matched by 5-tuple) continue to be processed normally, allowing in-flight transactions to complete.
- Healthcheck traffic is exempted. Packets from configured healthcheck sources (e.g., load balancer IP, Kubernetes API server) are always passed through, ensuring the node remains visible to the orchestrator.
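The three rules above can be modeled in user space as a small decision function. This is a Python sketch of the policy only: the real program is eBPF attached at the NIC, and the packet dictionary, field names, and verdict labels here are illustrative.

```python
def shed_decision(pkt, established, healthcheck_sources):
    """User-space model of the Level 3 shedding policy: always pass
    healthcheck sources, pass established flows (5-tuple match), drop
    new SYNs. Illustrative only; the real logic runs as eBPF/XDP."""
    if pkt["src_ip"] in healthcheck_sources:
        return "XDP_PASS"         # node must stay visible to the orchestrator
    key = (pkt["src_ip"], pkt["src_port"],
           pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
    if key in established:
        return "XDP_PASS"         # in-flight transaction: let it complete
    if pkt.get("syn"):
        return "XDP_DROP"         # new connection attempt under load shedding
    return "XDP_PASS"             # stray non-SYN packet: defer to the kernel stack
```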
Traditional packet filtering via iptables/nftables operates at the Netfilter layer — after the kernel has already allocated memory for the packet (`sk_buff`), parsed headers, and performed routing lookups. Under a DDoS flood, this processing itself can saturate the CPU. XDP drops packets before any of this work occurs, providing packet filtering that scales to millions of packets per second with minimal CPU overhead.
Reversibility: Automatic with hysteresis. When DM drops below θ₂ (not θ₃ — the lower threshold provides a buffer against oscillation) for a sustained period (default: 60 seconds), containment is gradually relaxed. The XDP program transitions from drop-new to rate-limit-new before being fully removed. cgroup limits are restored incrementally, not instantaneously.
7. Level 4 — Severe Containment
The anomaly is severe. Either DM has exceeded the fourth threshold, or the velocity of convergence toward resource exhaustion indicates that a critical resource (typically memory or disk) will be fully consumed within a short time window.
DM(t) > θ₄  ∨  TTF(t) < Tcritical

where TTF = estimated Time To resource Failure based on derivative extrapolation
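The TTF estimate reduces to remaining headroom divided by consumption rate. A minimal sketch, with units (GB and GB/s here) left to the caller:

```python
def time_to_failure(current, capacity, rate):
    """Estimate TTF by linear extrapolation of the first derivative:
    remaining headroom divided by the consumption rate. Returns None
    when the resource is not trending toward exhaustion."""
    if rate <= 0:
        return None
    return (capacity - current) / rate
```

For the walkthrough in §13 (16 GB of RAM, 12 GB used, leaking at ~50 MB/s), this predicts exhaustion in roughly 80 seconds.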
Actions:
- Aggressive throttling. `cpu.max` and `memory.max` (hard limits) are applied to contributing cgroups. The `memory.high` soft limit from Level 3 is replaced with a hard ceiling.
- Full inbound traffic block. The XDP program is updated to drop all inbound traffic except: healthcheck probes from the orchestrator, and management traffic (SSH, IPMI) from configured addresses.
- Non-critical cgroup freeze. cgroups identified as non-essential are frozen via `cgroup.freeze`, suspending all processes within them. This is equivalent to sending SIGSTOP to every process in the group, but managed at the cgroup level.
- Critical webhook. A high-priority notification is dispatched with severity `critical`, including the full state vector, the estimated TTF, and the actions taken.
Reversibility: Requires sustained recovery. DM must drop below θ₃ and remain there for an extended period (default: 5 minutes) before Level 4 mitigations are relaxed. The extended hold period accounts for the severity of the situation — a brief dip in DM during a cascading failure should not trigger premature relaxation.
8. Level 5 — Autonomous Quarantine
The last resort. All previous containment levels have failed to stabilize the system. DM continues to rise despite active throttling and load shedding. The node is in uncontrolled collapse or has been compromised by activity that cannot be contained by resource throttling alone.
Level 4 containment active ∧ DM(t) rising despite active mitigations
∨ ICP(t) > ICPcritical (high Propagation Behavior Index — node may be spreading the problem)
Actions:
- Network isolation. Programmatic deactivation of network interfaces (strategy varies by environment — see §9).
- Process freeze. All non-essential processes are frozen via SIGSTOP. Kernel processes, the HOSA agent itself, and explicitly protected processes continue running.
- Persistent logging. A detailed forensic log is written to persistent storage, including the complete timeline of detection, escalation, and actions taken.
- Final webhook. Before network isolation, HOSA dispatches a last webhook signaling the quarantine state. If the webhook fails (network already compromised), the state is signaled through available out-of-band channels (IPMI, cloud metadata, Kubernetes API — depending on environment).
Reversibility: Manual intervention required. Level 5 is the only level that cannot be automatically reversed. An administrator must inspect the node, diagnose the root cause, and explicitly restore the system. This is a deliberate design decision: if the agent's full arsenal of containment measures failed to stabilize the system, the problem requires human judgment.
Level 5 can also be triggered by a high Propagation Behavior Index (ICP) regardless of containment effectiveness. If the node shows signs of propagation behavior (outbound connection explosion, anomalous forks, destination entropy — see §3 — Supplementary Metrics), the priority shifts from preserving the node to protecting the cluster. Network isolation is applied preemptively to prevent lateral movement or cascading failure.
9. Quarantine Modes by Environment Class
The feasibility and strategy of network isolation vary fundamentally by infrastructure class. A bare-metal server with IPMI can safely deactivate its primary network interfaces because it remains accessible via out-of-band management. A cloud VM that deactivates its network interface becomes permanently unreachable. HOSA implements environment-aware quarantine modes, selected automatically during the Hardware Proprioception phase or configured explicitly by the operator.
9.1. Bare Metal with IPMI/iLO/iDRAC
Detection: Presence of IPMI interface via `/sys/class/net/` and `ipmi_*` kernel modules.
Strategy: Programmatic deactivation of all network interfaces except the out-of-band management interface (IPMI/iLO/iDRAC). The node remains accessible via management console for diagnosis and restoration.
Recovery: Manual via IPMI console. Operator inspects HOSA logs, diagnoses root cause, restores interfaces, and restarts services.
9.2. Cloud VM (AWS, GCP, Azure)
Detection: Via DMI/SMBIOS (`dmidecode`), presence of metadata service (169.254.169.254), and hypervisor identification via `/sys/hypervisor/` or CPUID.
Strategy: Does not deactivate network interfaces. Instead:
- XDP applies total drop on all inbound/outbound traffic except: traffic to the cloud metadata service (169.254.169.254), DHCP traffic (IP lease maintenance), and traffic to the orchestrator API endpoint.
- HOSA signals quarantine state via cloud-native mechanisms: instance tag/label via metadata service (e.g., `hosa-quarantine=true`), SNS/Pub-Sub publication, or a healthcheck endpoint returning HTTP 503 with a JSON body detailing the state.
- The external orchestrator (Kubernetes, ASG, etc.) is responsible for the terminate/replace decision.
Recovery: The external orchestrator terminates the instance and provisions a replacement. If the orchestrator does not act within a configurable timeout (default: 5 minutes), HOSA can execute self-termination via cloud provider API (when IAM credentials with appropriate permissions are available). Self-termination is disabled by default and requires explicit activation.
9.3. Kubernetes (Pod/Container)
Detection: Execution in a container via `/proc/1/cgroup` with cgroup namespace, the `KUBERNETES_SERVICE_HOST` environment variable, or a service account mounted at `/var/run/secrets/kubernetes.io/`.
Strategy: HOSA operating as DaemonSet does not isolate the node (it lacks permission to deactivate host interfaces). Instead:
- Applies maximum containment via cgroups on pods identified as contributors.
- Updates the Node status via Kubernetes API with taint `hosa.io/quarantine=true:NoExecute` and condition `HOSAQuarantine=True`, causing automatic pod evacuation by the scheduler.
- Emits a Kubernetes Event in the namespace of the affected pod with type `Warning` and reason `HOSAQuarantine`.
Recovery: Operator or automation removes the taint after investigation. The node returns to the scheduling pool.
9.4. Edge/IoT Environments
For edge devices with physical access (flag `environment: edge-physical`): complete network interface deactivation. The device operates in isolated mode until physical intervention. Logs are preserved on local persistent storage (flash/eMMC). If the device has a status LED or display, HOSA signals quarantine state visually.
For edge devices without physical access (flag `environment: edge-remote`): quarantine with hardware watchdog timer. Network interfaces are deactivated, and a hardware watchdog (`/dev/watchdog`) is activated with a configurable timeout (default: 30 minutes). If no intervention occurs before the timeout, the watchdog reboots the device, which returns to pre-quarantine state with a persistent flag `quarantine_recovery=true`. On restart with this flag, HOSA enters conservative mode (logging only) for a configurable period to allow remote diagnosis.
9.5. Air-Gapped Networks
(flag `environment: airgap`): Identical to bare metal, with the addition that all opportunistic communication is permanently disabled (no webhooks, no endpoint exposure). HOSA operates in purely endogenous mode. Logs are written exclusively to local encrypted storage and collected periodically by authorized personnel with physical access.
HOSA attempts to automatically detect the environment class and select the appropriate quarantine mode. The operator can override this detection via explicit configuration. In case of ambiguity (e.g., VM in private cloud that doesn't respond to the standard metadata service), HOSA assumes the most conservative mode (cloud — does not deactivate interfaces), prioritizing recoverability over isolation.
10. Selectivity Policy — The Throttling Problem
Process throttling via cgroups is an effective mitigation against resource exhaustion, but it introduces secondary risks that must be explicitly addressed. This section formalizes HOSA's selectivity policy — the rules governing which processes are throttled and which are protected.
10.1. Risks of Throttling
| Risk | Mechanism | Potential Impact |
|---|---|---|
| Cascading timeouts | A throttled HTTP backend accumulates connections upstream | Degradation propagates to services that depend on the throttled service |
| Transaction deadlocks | A process throttled during a database transaction holds locks indefinitely | Other transactions block, potentially freezing the entire database |
| Critical starvation | If kubelet is throttled, the node is marked NotReady | All pods are evacuated, causing more damage than the original problem |
| Self-referential failure | If the HOSA agent itself is throttled | Detection latency increases, potentially missing the window for effective mitigation |
10.2. The Safelist
HOSA maintains a safelist of processes and cgroups that are never targeted for throttling, regardless of their resource consumption:
- Kernel processes — `kthreadd`, `ksoftirqd`, `kworker`, etc.
- The HOSA agent itself — always the first entry in the safelist
- Orchestration agents — `kubelet`, `containerd`, `dockerd` (auto-detected when present)
- Operator-designated processes — explicitly marked via configuration or cgroup label
The safelist is populated during the Hardware Proprioception phase and can be modified at runtime via the HOSA configuration file.
10.3. Contributor Targeting
Throttling is applied preferentially to the processes identified as the dominant contributors to the anomaly. The targeting algorithm uses two inputs:
- Dimensional decomposition (from §3 — Dimensional Contribution): identifies which resource is driving the anomaly (e.g., memory contributes 68% of DM²).
- Per-cgroup consumption delta: identifies which process group is consuming the most of the dominant resource. This is determined by comparing `memory.current`, `cpu.stat`, or `io.stat` across cgroups and identifying the group with the largest recent increase.
The combination of dimensional decomposition and per-cgroup attribution allows HOSA to answer the critical question: "What specific process is consuming what specific resource in a way that is driving the system away from homeostasis?" — and to target its intervention precisely at that intersection.
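The intersection of the two inputs can be sketched as a selection function. The dictionary shapes are illustrative assumptions: `contributions` maps each dimension to its share of DM², and `cgroup_deltas` maps each cgroup to its recent per-dimension increase.

```python
def pick_target(contributions, cgroup_deltas, safelist):
    """Intersect the dominant dimension of DM^2 with per-cgroup
    consumption deltas to select the throttling target, never choosing
    a safelisted group. Data shapes are illustrative."""
    dominant = max(contributions, key=contributions.get)
    candidates = {
        cg: deltas.get(dominant, 0.0)
        for cg, deltas in cgroup_deltas.items()
        if cg not in safelist
    }
    if not candidates:
        return None, dominant   # nothing eligible to throttle
    return max(candidates, key=candidates.get), dominant
```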
11. Hysteresis and De-Escalation
A naïve implementation that escalates at threshold θn and de-escalates at the same threshold would produce oscillation (flapping): the system crosses the threshold, containment activates, DM drops below the threshold, containment deactivates, DM rises again, and the cycle repeats indefinitely.
HOSA implements hysteresis on all level transitions:
Escalation: Level N → N+1 when the activation condition of level N+1 is met (DM > θN+1, plus any derivative condition)
De-escalation: Level N → N−1 when DM < θN−1 for t > Thold(N)
Where:
- Escalation threshold is the next level's threshold (θN+1)
- De-escalation threshold is the previous level's threshold (θN−1) — one level below the escalation threshold, creating a dead zone that prevents oscillation
- Thold(N) is the minimum hold time before de-escalation is permitted — longer for higher levels:
| Transition | De-escalation Threshold | Minimum Hold Time |
|---|---|---|
| Level 1 → 0 | DM < θ₁ | 10 seconds |
| Level 2 → 1 | DM < θ₁ | 30 seconds |
| Level 3 → 2 | DM < θ₂ | 60 seconds |
| Level 4 → 3 | DM < θ₃ | 5 minutes |
| Level 5 → manual | N/A | N/A (manual only) |
Additionally, de-escalation is always gradual: HOSA never jumps from Level 4 to Level 0 in a single step. Each level is traversed in sequence, with the hold time enforced at each transition. This ensures that relaxation of containment measures is cautious and monitored.
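One de-escalation step with hysteresis can be sketched as follows, using the thresholds and hold times from the table above. The function signature and state representation are illustrative.

```python
# De-escalation rules from the table above:
# level -> (index of the theta DM must stay below, minimum hold in seconds)
DEESC = {1: (1, 10), 2: (1, 30), 3: (2, 60), 4: (3, 300)}

def deescalate(level, dm, thresholds, below_since, now):
    """Drop from level N to N-1 only when DM has stayed below the
    de-escalation threshold for the full hold time. Level 5 is
    manual-only, and de-escalation is always one level at a time.
    below_since: when DM last went under the threshold (None if above)."""
    if level not in DEESC:
        return level  # Level 0 has nowhere to go; Level 5 requires a human
    theta_idx, hold = DEESC[level]
    if below_since is None or dm >= thresholds[theta_idx]:
        return level
    if now - below_since >= hold:
        return level - 1
    return level
```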
When d²DM/dt² is strongly negative (the system is decelerating rapidly toward homeostasis — the mitigation is clearly working), the hold times can be optionally reduced by a configurable factor. This allows faster recovery when the evidence of effective mitigation is unambiguous. This optimization is disabled by default and requires explicit activation.
12. Decision Observability and Audit Trail
Every autonomous action taken by HOSA is recorded in a local structured log with full mathematical justification. This is not optional — it is Architectural Principle #5 (Decision Observability). An agent that executes autonomous mitigation on production systems must be auditable.
Each log entry contains:
| Field | Description | Example |
|---|---|---|
| `timestamp` | Nanosecond-precision UTC timestamp | `2026-03-10T14:23:09.127Z` |
| `level` | Response level (0–5) | `2` |
| `level_name` | Human-readable level name | `SOFT_CONTAINMENT` |
| `d_m` | Current Mahalanobis Distance | `4.7` |
| `d_m_derivative` | First derivative (velocity) | `+2.1` |
| `d_m_acceleration` | Second derivative (acceleration) | `+0.5` |
| `phi` | Load Direction Index | `+1.8` |
| `dominant_dims` | Top contributing dimensions with cⱼ percentages | `mem_used:68%, mem_pressure:19%, io_latency:8%` |
| `target_cgroup` | The cgroup targeted for action (if any) | `/kubepods/pod-payment-7b4f` |
| `action` | The specific action taken | `memory.high 2G→1.6G` |
| `action_effect` | Whether the action is having the desired effect | `effective (d²DM/dt² < 0)` |
| `regime` | Current regime classification | `+4 (Local Failure)` |
Logs are written to `/var/log/hosa/decisions.log` in JSON Lines format. Log rotation is managed by the agent (configurable max file size and retention). When webhook connectivity is available, the same structured data is included in webhook payloads.
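Appending a record in JSON Lines format is deliberately simple: one self-contained JSON object per line. A minimal sketch, with the path parameterized for testability (the agent itself writes `/var/log/hosa/decisions.log`):

```python
import json

def log_decision(path, entry):
    """Append one audit record in JSON Lines format, one JSON object
    per line, using the field names from the table above. Rotation is
    omitted from this sketch."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, separators=(",", ":")) + "\n")
```

Because each line parses independently, a partially written final line after a crash corrupts at most one record, which suits a forensic audit trail.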
"The operator should be able to read HOSA's decision log and answer, for every action taken: What did the agent observe? What mathematical threshold was crossed? What action did it take? Was the action effective? The log is the agent's testimony."
13. Walkthrough: Memory Leak in Payment Service
This end-to-end scenario illustrates the full perceptive-motor cycle of the response system, from detection through containment and stabilization. Numerical values are representative and based on behavior observed in production systems.
Context:
- Node: VM `worker-node-07` in Kubernetes cluster, 8 vCPUs, 16GB RAM
- Workload: 12 pods, including `payment-service-7b4f` (Java, 2GB cgroup limit)
- External monitoring: Prometheus with 15s scrape interval, alert rule: `memory > 1.8GB for 1m`
- HOSA: In homeostasis (Level 0) for 6 hours. Baseline calibrated. 8-dimensional state vector.
- Memory leak rate: ~50MB/s (circular reference in session cache)
| Time | System State | HOSA (Endogenous) | Prometheus (Exogenous) |
|---|---|---|---|
| t=0s | Leak starts. mem: 61% | DM=1.1. Level 0 (Homeostasis). | Last scrape 8s ago. Data shows: "healthy". |
| t=1s | mem: 64%. PSI: 18% | ⚡ DM=2.8. Level 0→1. Sampling: 100ms→10ms. Log: dominant dim = mem_used (72%). | (no scrape) |
| t=2s | mem: 68%. PSI: 29%. Swap activating. | ⚡ DM=4.7. Level 1→2. `memory.high` → 1.6G on `payment-service-7b4f`. Webhook dispatched. | (no scrape) |
| t=4s | mem: 72% (rate reduced by containment) | DM=5.9. dDM/dt decelerating (+2.1→+1.2). d²DM/dt² = −0.45. Containment effective. Maintains Level 2. | Scrape! mem=1.47GB. Rule: >1.8GB for 1m. Result: OK (!) |
| t=8s | mem: 74% (plateau — reclaim ≈ allocation) | DM=6.2. dDM/dt ≈ 0. Stabilized. System degraded but functional. Transactions preserved. | (no scrape) |
| t=35s | mem: 75% (held by containment) | Maintains containment. Operator received webhook with full dimensional context. | Scrape. mem=1.55GB. Rule: OK (!) — containment prevents threshold breach. |
Counterfactual — Without HOSA:
- t≈40s: Container allocates ~3GB, exceeds 2GB cgroup limit. OOM-Kill terminates the Java process. All in-flight payment transactions are aborted without graceful shutdown.
- t≈80s: Kubelet restarts the pod. Memory leak persists. CrashLoopBackOff begins (~40s crash cycles).
- t≈100s: Prometheus fires alert (`for 1m` condition finally satisfied) — 60 seconds after the first crash. Customers have been experiencing 502 errors for a full minute.
HOSA transformed a destructive crash with data loss into controlled degradation with functional preservation. Detection took 1 second (vs. >60 seconds for the exogenous model). The containment preserved in-flight transactions. The operator received actionable information with dimensional context — not a binary alert 100 seconds too late.
14. Linux Capabilities by Level
When HOSA operates as a container (DaemonSet in Kubernetes), its access to host cgroups and network interfaces depends on Linux capabilities. Following the principle of least privilege, the required capabilities are documented per response level:
| Levels | Required Capabilities | Access Needed |
|---|---|---|
| 0–1 | `CAP_BPF` | Read-only access to `/sys/`. eBPF probe attachment for metric collection. |
| 2 | `CAP_BPF`, `CAP_SYS_ADMIN` | Write access to cgroup control files (`cpu.weight`). |
| 3–4 | `CAP_BPF`, `CAP_SYS_ADMIN`, `CAP_NET_ADMIN` | cgroup manipulation (`memory.high`, `cpu.max`). XDP program attachment. |
| 5 | `CAP_BPF`, `CAP_SYS_ADMIN`, `CAP_NET_ADMIN` | Network interface manipulation. In Kubernetes mode: API access for taint application. |
An operator who wants detection-only (Levels 0–1) without any actuation can deploy HOSA with only `CAP_BPF` — the agent will detect, log, and notify but never intervene. This is the recommended starting configuration for evaluation deployments.
15. Known Limitations
- Throttling side effects. Despite the safelist and contributor targeting mechanisms, throttling a process during a critical section (holding a database lock, mid-write to disk) can cause secondary failures. The experimental phase will quantify these side effects under controlled conditions.
- XDP compatibility. XDP requires driver support. Not all network drivers support XDP in native mode — some fall back to generic mode (SKB-based), which offers less performance benefit. The agent gracefully degrades: if native XDP is unavailable, it falls back to generic XDP; if generic XDP is unavailable, network-level load shedding is disabled and only cgroup-based containment is available.
- Level 5 irreversibility. The manual recovery requirement for Level 5 means that in environments without accessible operators (remote IoT without physical access and no watchdog), a quarantined node may remain isolated indefinitely. The watchdog timer mechanism (§9.4) mitigates but does not fully solve this in all scenarios.
- Threshold calibration sensitivity. The adaptive thresholds (θ₁–θ₄) depend on the quality of the warm-up phase baseline. A warm-up period that coincides with atypical system behavior (ongoing deployment, burst workload) will produce suboptimal thresholds. The experimental phase will quantify threshold sensitivity under various warm-up conditions.
- Multi-tenant interference. In shared environments (multiple pods on the same node), throttling one process can indirectly affect others through shared resources (CPU cache, memory bus, I/O scheduler). The decomposition identifies the dominant contributor, but the containment action may have second-order effects on co-located workloads.
16. References
- Heo, T. (2015). Control Group v2. Linux Kernel Documentation. kernel.org
- Vieira, M. A., Castanho, M. S., Pacífico, R. D. G., Santos, E. R. S., Júnior, E. P. M. C., & Vieira, L. F. M. (2020). Fast Packet Processing with eBPF and XDP: Concepts, Code, Challenges, and Applications. ACM Computing Surveys, 53(1), Article 16.
- Gregg, B. (2019). BPF Performance Tools: Linux System and Application Observability. Addison-Wesley Professional.
- Horn, P. (2001). Autonomic Computing: IBM's Perspective on the State of Information Technology. IBM Corporation.
- Bear, M. F., Connors, B. W., & Paradiso, M. A. (2015). Neuroscience: Exploring the Brain (4th ed.). Wolters Kluwer.
- Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media.
- Poettering, L. (2020). systemd-oomd: A userspace out-of-memory (OOM) killer. systemd Documentation.
- Weiner, J. (2018). PSI — Pressure Stall Information. Linux Kernel Documentation. kernel.org
- Tang, C., et al. (2020). FBAR: Facebook's Automated Remediation System. Proceedings of the ACM Symposium on Cloud Computing (SoCC).