LLM Security

AI Agents: Chapter 2 - The Thin Line between AI Agents and Rogue Agents

•

10 min read

•

April 24, 2024

Introduction

In our previous article, we dissected Open AI’s and the broader LLM vendor market’s push for LLM agency and the respective lack of concern surrounding its security. What is clear however is that LLMs are breaking boundaries and will seemingly get more connected to external systems, data, and compute, elevating their capabilities and privileges. Yet, with the lingering risks surrounding prompt injections, such LLMs will become prone to a variety of attacks that can be hidden within vulnerable plugins or actions (with elevated privileges), and untrusted sources of information that will be used within prompts.

At the time of the release of the first GPTs, we saw a variety of well-known folks in the LLM space, ranging from investors to entrepreneurs (even some in security), showcase their GPTs built for a specific use-case and underpinned by proprietary data they ingested into the GPTs with RAG (retrieval augmented generation). In parallel, we also started seeing a lot of posts emerge showcasing how one could both extract the underlying data (owned by the GPT owner) and the configuration/instructions (i.e. system prompt) of the GPT. All of that was possible after ingesting just 2-3 prompts, highlighting the ease at which data leakage could occur. While these types of attacks seem trivial and “fun” in theory, the increasing complexity and interconnectivity of LLM applications may lead to more damaging situations in the near future. As referenced in the last article’s quote, “Random web pages and data will hijack your AI, steal your stuff, and spend your money”.

Writing on top of the great research work on LLM Security vulnerabilities that has been done by Johann Rehberger, Simon Willison, Eden Marco, and others, we will have a closer look at how these more damaging situations could potentially occur in the near future.

Unpacking the Vulnerabilities

While we mentioned Open AI quite a few times, we do want to note that these security risks are a horizontal problem and that other players moving in the same direction are not impervious to these potential breaches. Any LLM platform that interacts with untrusted information sources can deal with sources embedding indirect prompt injections, which will leave these applications effectively exposed to these risks. Besides that, any potential unrestricted interaction with plugins can compound these risks, creating a fertile ground for security breaches of larger magnitudes.

In May 2023, Rehberger, delved into how the confused deputy problem could be applicable to LLMs as well. As he noted down: “In information security the confused deputy problem is when a system can be tricked to misuse it’s authority”. Translated to LLMs, hidden prompt injections from untrusted sources could trigger plugins to perform actions and tasks that it is capable of. This situation, akin to a privilege escalation, represents a significant threat, especially in the realm of prompt injection which as we covered seemingly is an afterthought to many players.

Self-Replicating Zero Click AI Worms

We have now seen a glimpse of what such increasing complexity and blast radius may look like. Just a few weeks ago, researchers Ben Nassi, Stav Cohen, and Ron Bitton created the first AI worm, which they called Morris II (a reference back to the first computer worm Morris).

Screenshot from: https://www.youtube.com/watch?v=FL3qHH02Yd4

In short, they demonstrated that by targeting a specific email Agent which they built for the POC, they could inject the agent and force it to replicate the attack by crafting it into a new email. It was then sent by the affected user to all their contacts who may have the same vulnerable email agent running, after which they would repeat the cycle. They showed how this attack can be conducted during retrieval from a vector database, where the email is stored, into the LLM. They further showed how the attack could materialize through a hidden prompt injection embedded in an image, something we wrote about in a separate blog. While the attack POC has been made under lab conditions, we anticipate real-world examples to emerge in the coming year where such attacks will lead to brand damage (i.e. toxic content), phishing from trusted accounts (i.e. malicious self-replicating emails), data leakage, and much more.

Unknown Privileges

As agents further proliferate, from the email agent built by the Israeli researchers in the previous example, to scheduling appointments to running through your notes, they will continue to be fed a cocktail of both trusted and untrusted data, like asking them to summarize content from a sketchy website. If that site happens to be laced with hidden malicious instructions, it can trick the agent into performing unintended actions. To prevent such trickery, the current strategy involves getting a human to greenlight any significant action initiated by the AI. This has been Open AI’s response to limiting these security risks thusfar.

But let's face it, that's like asking someone to manually approve every step of a dance routine – it defeats the purpose of having an agent and could lead to 'dialogue fatigue,' where users robotically approve everything without a second thought. Now, imagine a more potent plugin, perhaps one that can impersonate a user, allowing the agent to act on your behalf without your direct input. When users enable such plugins, they often (unknowingly) grant them privileged access, opening the door to a range of risks. It's like handing over the keys to your data, tools, or infra. For example, a plugin like Zapier, when enabled, empowers ChatGPT to interact with emails, Google Drive, Slack, and more, turning it into a formidable workflow engine. In a more direct example, Rehberger showed how simply browsing a malicious website with hidden prompts could prompt the agent to trigger another plugin to book a flight, just because the website said so.

Unauthorized Actions

In one of his examples (see below), Rehberger showcased how the usage of a browsing plugin with the “chat with your code” plugin led a malicious website to prompt the plugin to create Github repos, steal private code, and switch Github repositories from private to public visibility. This eventually led Open AI to enforce the greenlighting scheme for any significant action undertaken by an agent.

Data Exfiltration

In another example, Rehberger highlights the risk of over-reliance by LLM users on prompt output. One potential straightforward data exfiltration technique could involve hyperlinks generated by the hidden prompt injection with an extended URL including your chat history. Many chat applications automatically inspect URLs by default and issue a server-side (sometimes client-side) request to the link. This is the exfiltration channel.

Expanding Threat Landscape

As Rehberger's research showed in June 2023, such potent plugins are already out there and ready to be exploited with 51 out of 445 approved plugins of OpenAI supporting OAuth. The reason why this is a good proxy to determine potent and vulnerable plugins is that they seem to do some form of delegation/impersonation and possibly have access to personal data or infra leading to potential access of PII, code execution on remote machines, secrets, etc. With the rise of Open AI’s GPT store, the number of potent and vulnerable agents is likely going to rise substantially. In parallel GPTs with embedded plugins/actions will proliferate over time we will likely see a concerted effort by adversarial actors to hide tailored made prompt injections to specific popular plugins or GPTs.

The examples we covered in this article are just the tip of the iceberg and are meant to exemplify what’s next as AI Agents and Plugins will proliferate rapidly without the necessary security in place. If you want to delve in deeper, we recommend the blogs of Simon Willison and Johann Rehberger who are great at unpacking and discovering new LLM security threats.

Stay tuned for our next article, in which we will discuss some mitigation techniques and how LLM Guard can secure enterprise interactions with AI Agents and LLMs.