LLM Security

Advancing LLM Adoption and Enhancing Security Against Invisible Prompt Injections with LLM Guard

•

6 min read

•

February 21, 2024

Introduction

As 2024 unfolds, it's shaping up to be a big year for LLM adoption as well as its respective security. We are especially excited to see LLMs cross the chasm between the MVP and production phases at many large enterprises. With it, we anticipate the initial return on investment trickling back to the enterprise, which will drive a stronger interest to allocate security budgets to LLM security. At the same time, this change might be more gradual than anticipated given the mishaps some vendors on Amazon’s marketplace encountered at the beginning of the year.

Nevertheless, one of the more interesting pieces to us at the beginning of 2024 has been the surfacing of a critical issue by Riley Goodside, a Staff Prompt Engineer at Scale AI. With a POC, he showcased the risk of LLM prompt injections via invisible instructions in pasted text. While it was further recognized by others in the industry such as Joseph Thacker as one of the most significant security concerns since prompt injections itself, it especially sparked our concern as we see the risk for those adopters using LLMs in tandem with untrusted sources of information. The fact that these prompt injections are invisible further increases the degree of complexity around the LLM security for such use-cases interfacing with untrusted sources of information.

Riley Goodside’s POC

Understanding Invisible Prompt Injections

Simply put, invisible prompt injections involve embedding hidden instructions within pasted text. In other words, the attack relies on Steganography: the practice of concealing messages or information within other non-secret text or data. The approach exploits Unicode characteristics, particularly a special category for Unicode characters that are not rendered by most fonts. With this category, the instructions remain hidden when text is rendered but they do remain interpretable by GPT-4, and potentially other models too. As Riley showcased in the POC, the process can be broken down into three stages:

A user poses a general question about pasted text.
The visible pasted text appears innocuous, like Zalgo or emoji sequences.
An invisible suffix of Unicode “tag” characters.

For instance, the invisible portion in a prompt might instruct the LLM to ignore a previous query and generate a specific response, such as drawing a cartoon or stating a particular phrase. This level of invisibility in prompt injections is unprecedented. Unlike older methods like "white on white" text, these injections can be embedded across any source of untrusted information - from product reviews to security logs, making detection by humans an order of magnitude harder. The versatility of the injection approach means they can appear in any text box or be part of data scraped by bots, rendering traditional detection methods ineffective. The only feasible solution may involve disabling Unicode in user interfaces, but this is a partial fix, applicable to specific platforms like ChatGPT.

We replicated and tested the invisible prompt attack (try it out for yourself) in Open AI’s API. We learned that while mixing both encoded and non-encoded characters no longer works, implying it has been patched, a prompt solely adding encoded characters does still work. When mixing both encoded and non-encoded characters, Open AI will just ignore the encoded characters and execute the prompt with the non-encoded characters.

Addressing the Challenge with LLM Guard

As soon as we learned about the potential implications of invisible prompt injections we got to work to extend LLM Guard’s capabilities and to secure our users against these emerging attacks. You can now find it among the input scanners under “invisible text”.

The “Invisible Text Scanner” within LLM Guard is specifically designed to detect and eliminate these invisible Unicode characters as you can see above. It focuses on non-printable characters in the Private Use Areas of Unicode, including Basic Multilingual Plane and Supplementary Private Use Areas. It effectively identifies and removes characters that are typically non-printable, thus maintaining the integrity of text inputs for LLMs. To get started, users can utilize the InvisibleText scanner from LLM Guard's input_scanners module. This assesses the text, determining its validity and associated risk score, thereby ensuring cleaner and safer inputs for LLM processing.

To a large extent, the security issue has already been patched for ChatGPT, Open AI API, Bedrock, and Mistral’s API. Yet, these sorts of security challenges make for an even stronger case for a firewall approach to LLM security that acts as a last line of defense and provides consistent security across any model a company or developer wants to deploy. We are actively building more capabilities into LLM Guard, and encourage you to join us on github to get started, and visit our website to learn more or book a demo.