Introduction
Since the release of our first prompt injection detection model at the very end of November 2023, we were happy to see the model’s widespread industry adoption with over 4.1M downloads among security and LLM development professionals. As a result, our model became one of the most popular text classification models on HuggingFace and the leading prompt injection detection model in the industry. Not only was it used to secure LLMs in deployment at large companies, it further also underpinned the security of LLM application startups and in some cases also for LLM Security Startups.
We have always believed that the security of LLMs will remain a moving target, as new offensive security techniques arise so do we need to remain on top of the research to deliver cutting edge security to our model and library (LLM Guard) users. One of the great benefits of being open source, is that a lot of this research and additional feedback is right at our fingertips. Our community has grown substantially (and fully organically) with queries and feedback in our Slack on a daily basis. It allowed us to establish a quick feedback loop about our model’s shortcomings and where we could improve it. Some of these improvements could be done directly in this new model, other things were improved in our LLM Guard library by extending the available scanners at hand. In any case, through the community, for which we are incredibly grateful, we have been able to create this long-awaited and substantially improved prompt injection model, which we are happy to introduce you to in this article.
Our Next-Gen v2 Prompt Injection Model
Our latest model is also a fine-tuned version of Microsoft’s Deberta Model (deberta-v3-base) on multiple combined datasets that we prepared based on extensive research. It is a text classification model that identifies prompt injections, classifying inputs into two categories: 0 for no injection and 1 for injection detected. With this release, we achieved the following result on the evaluation set, including a test comparison in performance to our previous model. The comparison is based on a separate list of +20k prompt injections which were excluded from training and were combined from 3 other datasets:
Model Version |
protectai/deberta-v3-base-prompt-injection (base model) |
|
Accuracy |
0.9525 |
0.9480 |
Recall |
0.9974 |
0.9964 |
Precision |
0.9159 |
0.9092 |
F1 |
0.9549 |
0.9508 |
License |
Apache license 2.0 |
Apache license 2.0 |
Base Model |
Over the coming quarter, we will continue to fine-tune multiple base models with different architectures that will only be available to our commercial partners within Layer. These models will have a higher accuracy, cover edge cases, and handle long-context windows for RAG.
Lessons Learned
Beyond the extensive valuable feedback from our community, we also delved into the research to evaluate novel offensive prompt techniques and their respective datasets. With it, and after a lot of cleaning up, we created a unique dataset with over +300k prompts. This further served as a basis for us to carve out issues found in the previous model through synthetic dataset generation. From this effort, we learned some valuable lessons and wanted to address some core questions that we are often asked about alternative solutions in the market.
Splitting Prompt Injections from Jailbreaks
Many individuals, including in our community, have asked to include better Jailbreak detection in our prompt injection model. In fact, many of our closed-source alternatives construe Jailbreaks with prompt injections within their offerings. Nevertheless, as greatly explained at length by Simon Willison there is a distinction to be made between the two:
- Prompt injection is a class of attacks against applications built on top of Large Language Models (LLMs) that work by concatenating untrusted user input with a trusted prompt constructed by the application’s developer.
- Jailbreaking is the class of attacks that attempt to subvert safety filters built into the LLMs themselves.
We fully agree and have decided to remove most jailbreaks from our training data set because we believe jailbreaks should be covered separately by another classification model - stay tuned for updates. Additionally, our opinion is that jailbreaks should be scanned by multiple scanners both on the prompt side and the output.
Addressing alternative approaches
Several of our competitors adopt a gamified approach to data collection through crowdsourcing adversarial prompts. While we think this method serves as a great educational tool for raising industry awareness about LLM Security, we would like to see more manual red teaming challenges based on a specific use-case that may draw more effective insights.
Moreover, while some competitors question the efficacy of BERT models, preferring vector databases instead, it’s important to consider their respective strengths and weaknesses. BERT models transform text into embeddings and identify contextual similarities. Vector databases take a similar approach yet lack the same level of contextualization that BERT models offer. This issue can be partially addressed through a RAG (retrieval-augmented generation) architecture that includes a reranking component, though scaling this effectively poses several challenges. In contrast, using large BERT models can significantly enhance accuracy and prove more cost-efficient compared to the extensive infrastructure required for RAG systems (including models for converting to vector embeddings, reranking, and maintaining the vector database itself). In contrast, the relatively low training costs of BERT models enable frequent updates—potentially weekly—with refreshed datasets to enhance the self-hardening capabilities of our solutions.
—
To get started, head over to our HuggingFace page. We also offer documentation to use our prompt injection model within Langchain. Additionally, if you want to tap into our broader LLM Security capabilities, head over to our LLM Guard library here.
If you have any questions, feel free to go to our contact page, or join our Slack channel!