LLM Security

AI Agents: Chapter 3 - Practical Approaches to AI Agents Security

•

7 min read

•

May 30, 2024

Introduction

In our last article, we discussed how GPTs or AI Agents’ risk could be boiled down to prompt injections, plugins (or actions) with elevated privileges, and untrusted sources of information. As a recap:

Prompt Injections: Attackers can manipulate LLM outputs by injecting malicious prompts, leading to misleading or harmful responses. It can further affect the behavior of the model in downstream queries as well as with plugins connected to the LLM. It is important to note that these attacks can be both direct and indirect.
- Direct: If the attacker interacts directly with the LLM to force it to make a specific response, it is a direct prompt injection.
- Indirect: Indirect attacks rely on untrusted sources of information wherein it is used to hide prompt injections that can be invoked alongside a specific query. The prompt injection is then inserted into the prompt to produce the response desired by the attacker.
Vulnerabilities in Plugins: Plugins or actions with elevated privileges can be exploited, leading to unauthorized actions and data breaches. As we saw in our last write-up, in June 2023 over 51 out of 445 approved plugins supported OAuth. This showcased plugins with elevated privileges which could cause a lot of havoc if they were prompt injected.
Untrusted Information Sources: LLMs processing data from untrusted sources can inadvertently integrate malicious content into their responses.

Open AI’s current security risk mitigation strategy is currently focused on offering both the disclaimer, as well as forcing the user for certain plugins to approve each action the plugin wants to execute with the LLM. This effectively shifts the security responsibility into the hands of the users but also defeats the point of agents due to having to approve each action.

Now, let’s dive into some of the mitigation strategies you as a user/company can implement to protect yourself against potential security breaches using AI Agents.

Mitigation Strategies

With the rise of GPTs and the broader adoption of the GPT store, the risk of plugins with elevated privileges ready to be exploited through the use of untrusted sources of information will be substantially increased. As these plugins become more advanced, they will gain significant control over external systems which will further increase the potential blast radius and damages of such exploits. Therefore it is paramount that there will be multiple layers of security that can wrap around any GPT to isolate any potential security breach. Beyond the need for continuous security assessments, This would require a security product that could enforce control over plugin interactions, human oversight, and adopting robust security frameworks:

Plugin Parameterization and IDs: Limiting plugin actions and creating respective parameters around them are essential. For example, an email plugin should only perform specific tasks like 'reply' or 'forward' with sanitized content. For this, Plugins/Agents will require unique identities with clear permission levels that can be set by the user. This also involves automated identification and flagging of potential high-risk plugins through marketplaces such as the upcoming GPT store of Open AI. The latter will drive more transparency and make the user aware of any capability that can be invoked by the plugin.
Sanitization and Prompt Injection Security: Given that the risk of exploitation in plugins primarily arises from either the attacker trying to abuse the LLM through direct prompt injections, or due to untrusted sources of data embedding prompt injections, we need a detection and sanitization layer. For example, an email plugin should remove any HTML elements in email bodies to prevent hidden exploitation. For that, we need solutions like (our very own) LLM Guard that remain up to date to detect both known and unknown prompt injections.
Explicit User Authorization: Crucial actions, especially those involving sensitive systems, must require user re-authorization. This includes a clear summary of the intended action, enhancing transparency and control. This is already implemented by Open AI yet begs the question of whether this would not lead to dialogue fatigue, effectively defeating the purpose of automation. Besides that, we believe the approach of “accept, deny, always accept, always deny” will likely emerge to further abstract plugin authorization alongside a management dashboard. While this is likely the most straightforward approach to avoid the exploitation of crucial actions with privileged plugins, we equally believe it will lead to significant plugin management chaos.
Sequential Plugin Authorization: Similarly to crucial actions requiring authorization, we see the need to require re-authorization when plugins are being invoked sequentially. This would effectively reduce the risk surrounding cross-plugin request forgeries. Yet again, this would require the user to have the ability also to set configurations within LLM applications to once/always allow specific plugins to be invoked together or not at all.
Isolation Approach:
1. Kernel LLMs versus Sandbox LLMs: This could be one approach to add a security layer through operation segregation. This approach was detailed at great length by Simon Willison. It would involve using two separate LLMs one to deal with trusted information and another one that is quarantined to deal with untrusted content. Yet it implies that it is limited in its functionality yet could prevent potential unknown prompt injections in untrusted sources of information. This is somewhat similar to approaches in file-based malware protection such as ReSec.
2. Contextual isolation: You could also enforce data restrictions for plugins to avoid having them access the entire conversation context, limiting their access to only essential data.

Automated Authorization Management

One thing is clear, GPTs, plugins, and prompt injections will lead to dialogue fatigue and exponential chaos in managing an ever-increasing amount of plugins/agents. Besides that, we don’t believe in the isolation approach. We view it as impractical for enterprises as it requires running two LLMs and hence introduces blockers surrounding latency and cost. Even if we assume the cost per token will decrease significantly, it is likely that the latency question will remain.

A more effective solution to mitigate dialogue fatigue, especially with the rising use of plugins, is the implementation of automated authorization controls. These controls would analyze the context of the prompt and the parameters of the invoked plugin. For instance, if a user requests to write an email, the system would compare the intent of the input with the actions about to be triggered by the plugin. If there’s a mismatch, the system will either block the plugin automatically or prompt a manual authorization request. This method would effectively analyze the intent behind the prompt input versus the plugin's actions about to be invoked.

This approach draws inspiration from modern IAM practices such as those by Entitle, Authomize, and Permit. AI Agent management could parallel these players so that plugin actions will be analyzed in the context of the prompt input, and other relevant factors. If the prompt input aligns with the scheduled actions of the plugin, authorization is granted automatically; if not, the authorization request is flagged, requiring manual permission from the user, security, or IT team. This would ensure a balance between scaling automation and stringent security.