PAI-favicon-120423 MLSecOps-favicon icon3

Hiding in Plain Sight: The Challenge of Prompt Injections in a Multi-Modal World


The announcement of Open AI’s Sora model which can create imaginative scenes from text prompts is another testament of the speed at which the AI space is moving and what is in store for 2024. It is yet another tool in the broader LLM toolbox that will further drive value to businesses adopting LLMs. At the same time, it will likely become another avenue that we need to consider from the perspective of security. Similarly, in September 2023 we saw that toolbox expand as well with the release of GPT-4V, making ChatGPT multi-modal, effectively allowing it and the developer to have computer vision capabilities. Other players have since also become multi-modal, including Google which just released its latest model: Gemini 1.5. In this blog post, we’ll break down the opportunities and security challenges surrounding multi-model LLMs today and in the future.  

Image by Sonali Pattnaik

Unlike text-only models, multi-modal LLMs are capable of processing and interacting with a variety of data types, including text, images, and audio. This ability not only broadens the application spectrum but also inches closer to a more holistic, human-like understanding of the digital environment by LLMs. While this is a significant step forward towards multi-modality shifting into the mainstream, it also unveils a new set of security challenges. We have seen several attack examples emerge wherein attackers craft adversarial perturbations, which are subtle alterations embedded within images to misguide LLMs, leading to either a targeted-output attack/indirect prompt instruction or dialog poisoning/context contamination/hidden prompt injections.

Example by Riley Goodside

These attacks are currently straightforward to execute, using off-the-shelf components like pre-trained encoders in a plug-and-play manner. In a targeted-output attack, the adversarial perturbation influences the LLM to produce a specific piece of text chosen by the attacker (i.e. malicious URLs), whereas dialog poisoning manipulates the LLM to follow a series of instructions, thus affecting the entire dialogue between the user and the LLM. The real-world ramifications of such adversarial exploits are vast, ranging from misinformation spread and data breaches to phishing and social engineering attacks.

Our Outlook

Although there was and still is a lot of hype around it, the solution seems to boil down to (1) better alignment of models in fine tuning so malicious images are refused directly, (2) improving existing classification models to a universal model that can also detect hidden/encoded texts and other adversarial attacks within different types of images, (3) better prompt injection detection from the extracted texts, and (4) combining the latter two into a single model that runs at low latency so that it doesn't impede on UX. As models will expand into other modalities such as audio, the blast radius and challenge set will expand significantly. While vision requires great classification models, the future of modality with audio will require great speech-to-text, all running on low latency and underpinned by a great prompt injection model for any extracted text. As we expect these expanded capabilities to proliferate quickly at large enterprises, we are actively working on tackling these challenges through LLM Guard.