How agents are changing the meaning of computer security
If OpenClaw and other agents are here to stay, we must ask the question: When it comes to AI, where does the model end and security begin?
Hello, and welcome. I’m Peter Hall. As a PhD candidate at NYU, I’ve worked on theoretical cryptography, and as a tech and science journalist, I’ve written about AI’s unique security vulnerabilities, agent-agent communication, and more. Today, I’m excited to write for Project Glitch about the challenge of securing AI agents—an increasingly important issue as agents make their way into wider use.
AI agents are a whole new kind of security challenge
In March, a tweet captured the scene on the ground in Shenzen, China: crowds of people, including “lots of grannies,” lined up to get OpenClaw agents installed on their personal computers.
Such enthusiasm certainly isn’t universal, but it is still a significant moment—agents are going mainstream, fast. They’re bringing with them entire new categories of vulnerabilities that are forcing researchers and developers to confront the fact that AI “security” works in ways we are only just beginning to understand.
OpenClaw is a new player in the agent space, but it’s caught on like wildfire. At a high level, it connects an LLM like OpenAI’s ChatGPT or Anthropic’s Claude to a user’s computer, giving the LLM essentially unfettered access to the user’s files and programs. The hope is that the agent can continuously and autonomously complete tasks like answer emails, gather and digest information, and monitor long-running programs. The reality may include all of this, but it has come with a large helping of side effects as well—agents unintentionally leaking sensitive API keys, deleting entire email inboxes, and wiping people’s hard drives.
The reason this is happening is structural—OpenClaw requires deep access to its owner’s device. The agents interact with other devices through messaging platforms where they are vulnerable to an attack known as prompt injection, in which a malicious actor sends them messages designed to get them to act in ways their creator did not intend. The combination is ripe for misuse.
The new security risks that agents pose are not only external. As AI researcher Xun Liu puts it, the issue lies in the fact that AI agents generally have “two failure modes.” “One is that there is a third party attacker,” says Liu, who researches agent security at the University of Illinois Urbana-Champaign. “The second thing is that the agent itself could fail.”
According to some practitioners, tackling the latter may be more important. Marco De Rossi, who works on AI for the blockchain company Consensys, says companies he’s interacted with think of agent security more in that sense: “Sure, they’re concerned about attacks, but they are even more concerned about their agent making mistakes.”
That’s part of what makes the emerging field of AI risk management so complex. In traditional computer security, attackers tend to be viewed as external actors who will take advantage of any vulnerability in a piece of software. In AI, security might fail due to the fact that models are by their nature always trying to extrapolate novel outputs. Even if a model is told to avoid certain kinds of responses, with enough querying in a dangerous direction (even unintentionally), the model is sure to eventually produce responses that veer off course. In other words, the program becomes its own adversary.
In that case, where does the model end and security begin?
The harness problem
Developers’ and researchers’ answers vary. On one end of the spectrum, interpretability research attempts to look inside a model’s weights and activations as it’s being trained or used to better understand how to align its actions with what users want it to do.
On the other, harness engineering builds out everything around the model, like the plugs that link agents to the web tools and programs they use and sandbox environments for the agents to play in. New open-source projects like IronClaw and NVIDIA’s NemoClaw, for example, are trying to suit up OpenClaw in secure harnesses that mitigate agent risks in different ways.
IronClaw, for instance, claims it achieves greater security by forcing the agent to use each program it has access to in a separate box. This allows the system to monitor the flow of information—especially private information like API keys—between these boxes, hopefully stopping any mismanaged information from passing to places it shouldn’t.
There’s a downside to the harness approach, however. Putting guardrails on an agent can limit what an agent can do, and thus prevent it from being an effective personal assistant.
Liu thinks that a better approach to creating good agents is to weave security into the underlying model from the start. He says that training models with security in mind may actually increase their utility, as long as they’re trained only to do what we want—and that means thinking about safety “as early as pre-training.” For example, developers might curate the training data such that no sensitive or dangerous information is present, or consider new model architectures that have contingencies for checking problematic responses. Or they could make updates to models via fine-tuning as researchers make interpretability gains.
Security from within, or without?
Liu led work on a paper appearing at a top AI conference last month focused on using AI agents to “red-team” or stress test models by simulating realistic security threats. He’s currently working on a follow-up extending this to red-teaming other agents as well.
The authors of that paper hope that the thousands of stress tests they performed in their work will help developers deploy safer and more secure AI agents. To people like Liu who believe agentic security should be approached at the model level, this could look like fine-tuning or training bespoke models for safe usage in agentic settings. But Zhaorun Chen, one of Liu’s co-authors on both the initial study and their follow-up work, believes good harnesses should be enough. “Security must be separated out from the language model,” says Chen, who researches AI security at the University of Chicago. To him, AI models will always be susceptible to attacks like prompt injections, so it’s fruitless to focus on the model when new holes in its alignment will always pop up. Instead, Chen suggests security should primarily rely on outside guardrails on agents to “decide whether to block actions or follow the instructions.”
Researchers are bound to pursue different avenues. But regardless of where the model ends and security begins, approaches to agent security are likely to converge, in that AI itself can be useful in helping make agents more secure. Content filters on models, for example, use learning algorithms to help detect disallowed prompts. Similarly, Liu and Chen’s red-teaming work is only feasible thanks to agents running the tests. Even OpenClaw itself was originally vibe-coded, using OpenAI’s Codex. When it comes to agent security, it’s AI all the way down.


