The breakthrough of large language models (LLMs) has captivated the natural language processing (NLP) world, with their influence extending far beyond the research communities in which they originated. [1, 2] Industries like business, marketing, and content creation have embraced LLMs for editing, writing, and creative tasks. As a result, companies such as OpenAI and Google have deployed interfaces for these powerful models, making them widely accessible.
However, with great popularity comes great scrutiny. The rise of LLMs has also attracted the attention of bad actors. As with any new technology, there are always those who seek to test its boundaries and exploit its weaknesses. One notable type of attack on LLMs is the denial-of-service (DoS) attack, in which offenders aim to overwhelm the model and exhaust its resources.
A specific type of DoS attack involves crafting prompts designed to generate extremely long outputs, pushing the model to its limits. Let's take a look at some examples of such prompts:
These prompts are not innocent requests; they are deliberately engineered to exhaust the LLM's resources. By consuming all available context length or memory, such prompts can lead to degraded performance or unexpected behavior. Essentially, they try to force the model outside its operational boundaries.
While these issues may seem niche, the implications are significant. If attackers can exploit LLMs with "expensive" prompts (so-called "sponge attacks"), they could compromise availability, disrupt services, and even affect the reliability of deployed systems. Despite this, research in this area is still in its early stages. Some recent studies have explored similar challenges, but the problem of detecting and mitigating such expensive prompts remains largely uncharted territory. [3, 4, 5]
Recognizing this gap, we set out to investigate the problem. The goal? To develop methods for predicting the output length of an LLM response based on a given prompt. At first glance, this might sound ambitious—or even a little far-fetched—but as we dug deeper, it became clear that this is a valuable and promising research direction. The ability to estimate output length could provide an early warning system for resource-intensive prompts, helping mitigate potential DoS attacks and ensuring more efficient LLM deployment.
Like any good research project, we started by defining the problem and deciding what to model. Once this was clear, the next step was to compile a dataset that would allow us to train and evaluate our models effectively. This was no small task, as it involved designing prompts, collecting response data, and carefully curating examples that reflect real-world use cases as well as edge cases.
Our in-house dataset is the backbone of this research, containing prompts, their expected outputs, and various length-related metrics as targets. To compile the dataset, we drew from two primary sources: existing instruction-tuning datasets and synthetic prompts we generated ourselves.
Our dataset is divided into several subsets, each representing a distinct category of prompts and responses.
Figure 1: Percentages of subsets in our dataset.
These subsets ensure we cover a variety of use cases and output types.
Each subset helps us explore how LLMs handle different genres of text, such as plain text, code, or mathematical notation. By organizing the dataset this way, we ensure that it captures the diversity of real-world use cases.
A typical entry in our dataset includes a prompt, the expected output, and length-related metrics. Here's an example:
{
"prompt": "Write a Python function that iterates through a list of strings and returns a list containing the lengths of those strings.",
"output": "def get_string_lengths(strings):\nreturn [len(string) for string in strings]",
"num_words": 8,
"num_chars": 75,
"num_lines": 2
"label": "mid"
}
The label (mid, longish, long, etc.) indicates the length category of the output. This is derived from the metrics (word count, character count, and line count) and varies depending on the genre of the text. For example, code is measured in characters, while plain text uses word count as the primary metric.
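For illustration, these metrics can be computed with simple string operations. Below is a minimal Python sketch that reproduces the numbers in the example above; the exact counting rules used in our dataset pipeline may differ slightly.

def length_metrics(output: str) -> dict:
    # Illustrative sketch: words are whitespace-separated tokens,
    # characters are counted raw, and lines are newline-delimited.
    return {
        "num_words": len(output.split()),
        "num_chars": len(output),
        "num_lines": len(output.splitlines()),
    }

code_output = "def get_string_lengths(strings):\n    return [len(string) for string in strings]"
print(length_metrics(code_output))  # {'num_words': 8, 'num_chars': 79, 'num_lines': 2}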
To assign a length bin (mid, long, etc.) to each instance, we used genre-specific heuristics. The criteria vary depending on the type of output (code, math, or text).
Code (measured in characters): Code is dense and precise, with every character carrying weight (e.g., syntax, function names), so character count is the best measure for estimating output length.
Math (measured in characters): Mathematical text relies on dense notation, making character count the best proxy for complexity.
Text (measured in words): Plain text is less dense than code or math, so word count better reflects its length. A sketch of such a genre-aware binning heuristic is shown below.
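To make these heuristics concrete, here is a minimal Python sketch of a genre-aware binning function. The thresholds are illustrative assumptions for this post, not the exact cutoffs we used to label the dataset.

def assign_length_bin(genre: str, num_words: int, num_chars: int) -> str:
    # Code and math are measured in characters, plain text in words.
    value = num_chars if genre in {"code", "math"} else num_words
    # Hypothetical cutoffs; the real labeling heuristics differ per genre.
    thresholds = (
        [(500, "mid"), (2000, "longish"), (8000, "long"), (20000, "long-long")]
        if genre in {"code", "math"}
        else [(250, "mid"), (500, "longish"), (1500, "long"), (4000, "long-long")]
    )
    for limit, label in thresholds:
        if value <= limit:
            return label
    return "ultra-long"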
The number of newlines is particularly important for code and writing. For example, estimating newlines provides insight into the structure of code (e.g., function blocks) or the paragraphs in text-based outputs.
Applying these heuristics yields the following length/label distribution for our dataset:
Figure 2: Length distribution of our dataset
While most of our dataset falls into the mid length category due to the nature of instruction-tuning datasets, real-world LLM responses often extend into longer outputs—especially for complex topics. To address this imbalance, we generated synthetic data to enrich the long, long-long, and ultra-long categories.
For example, writing-focused prompts naturally contribute to longer outputs (e.g., essays or articles), and the synthetic prompts were specifically crafted to encourage extended responses. This ensures a better representation of real-world scenarios and improves the dataset’s robustness.
For this project, we used a multitask learning approach. The primary task was to classify instances into a length category, while auxiliary tasks involved estimating the number of words, characters, and newlines in the output. To implement this, we attached three regression heads and one classification head to an encoder. For the encoder, we experimented with RoBERTa, modernBERT, and the LLaMA family. We used cross-entropy loss for the classification task and Mean Absolute Error (MAE) for the regression tasks.
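To make this setup concrete, here is a minimal PyTorch sketch of the architecture: a Hugging Face encoder with one classification head and three regression heads. The pooling strategy, the default encoder name, and the number of length bins are illustrative assumptions rather than our exact configuration.

import torch.nn as nn
from transformers import AutoModel

class LengthPredictor(nn.Module):
    # One classification head (length bin) plus three regression heads
    # (log word, character, and newline counts) on top of an encoder.
    def __init__(self, encoder_name: str = "roberta-base", num_bins: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_bins)
        self.word_head = nn.Linear(hidden, 1)
        self.char_head = nn.Linear(hidden, 1)
        self.line_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        pooled = hidden_states[:, 0]  # first-token ([CLS]-style) pooling
        return {
            "bin_logits": self.classifier(pooled),
            "log_words": self.word_head(pooled).squeeze(-1),
            "log_chars": self.char_head(pooled).squeeze(-1),
            "log_lines": self.line_head(pooled).squeeze(-1),
        }

During training, these four outputs feed into the combined loss described below.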
Initially, we considered directly predicting output lengths via regression, but this approach was deemed impractical for several reasons. First, predicting exact lengths is inherently difficult for encoder models, as this task aligns more with generative modeling. Classifying into length bins offered a simpler and more feasible solution. Second, LLM outputs often vary slightly from the reference answer in instruction datasets. For example, while the dataset may define a target of 256 words, an LLM might generate 270 or 280 words by adding greetings or extra details. Length bins account for these variations without requiring precise alignment. Finally, minor changes in output length do not affect the bin classification, making it a more robust target for prediction.
One of the main challenges in modeling was the wide range of regression targets, which span from small integers to extremely large values (e.g., 10¹⁵). To address this, we applied logarithmic scaling to the regression targets, compressing their range and making them more manageable. Another issue arose with loss values: standard MAE would result in excessively large losses due to the scale of the targets. To mitigate this, we introduced a ScaledMAE loss function, where differences between predictions and targets were scaled by the absolute value of the targets. Additionally, the final loss was a weighted sum of the three regression losses and the classification loss. To ensure that large values from the regression losses did not overwhelm the classification loss, we applied scaling weights to balance their contributions effectively.
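Sketched in code, the loss design looks roughly like this. The epsilon term, the use of log1p for target scaling, and the 0.1 regression weight are illustrative assumptions, not the exact values from our experiments.

import torch
import torch.nn.functional as F

def scaled_mae(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # MAE with each error divided by the magnitude of its target,
    # so large targets do not dominate the loss.
    return (torch.abs(pred - target) / (torch.abs(target) + eps)).mean()

def total_loss(outputs: dict, batch: dict, reg_weight: float = 0.1) -> torch.Tensor:
    # Regression targets are log-scaled; the classification loss is
    # combined with a down-weighted sum of the three regression losses.
    cls_loss = F.cross_entropy(outputs["bin_logits"], batch["bin_label"])
    reg_loss = (
        scaled_mae(outputs["log_words"], torch.log1p(batch["num_words"].float()))
        + scaled_mae(outputs["log_chars"], torch.log1p(batch["num_chars"].float()))
        + scaled_mae(outputs["log_lines"], torch.log1p(batch["num_lines"].float()))
    )
    return cls_loss + reg_weight * reg_loss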
By combining multitask learning with careful loss adjustments and scaling strategies, we were able to train robust models capable of accurately classifying length bins while also estimating auxiliary metrics like word count, character count, and newline count.
Both modernBERT-base and RoBERTa-base achieved an F1 score of around 0.85 per class on the classification task, which is a strong result for this challenging problem. The confusion matrix, shown below, demonstrates that most predictions align with the diagonal, indicating accurate classifications.
Figure 3: Confusion matrix of the model
There are occasional misclassifications, primarily between the long and longish classes, which were mostly found in the Writing subset. This is understandable, as many of these cases were borderline outputs where the length could vary slightly depending on the content. These confusions are tolerable, as the length bins still provide a reasonable estimate of the output.
Turning to the regression tasks, the ablation studies confirmed that they work well as auxiliary objectives. Judged on their own, however, the regression predictions look less impressive at first glance, as the figure below shows.
Figure 4: True vs predicted word counts on the Writing subset
For true values of 0–500 words (the mid and longish classes), predictions align closely with the ground truth. In the 500–1,500 range (the long class), predictions begin to scatter and show more variance, though they still lie within the long band. Beyond that point, the exact predictions deteriorate further, yet they mostly remain in the long-long band; as the true length grows, the scatter increases, but the predicted length bin stays correct.
So while the exact word counts often deviate, the auxiliary regression objectives clearly improved overall model performance, and the model demonstrates a solid conceptual understanding of output length, which is sufficient for this task.
Surprisingly, encoder-based models like RoBERTa and modernBERT outperformed the LLaMA family, which we initially expected to excel. For example, LLaMA 3.1 8B achieved a slightly lower F1 score of 0.82, with the smaller 1B and 3B LLaMA models performing similarly. This result highlights the continued strength of encoder models in handling natural language understanding (NLU) tasks, even for a problem like length prediction, which involves a mix of classification and auxiliary regression.
Overall, these results show that encoder models are well suited for predicting length bins and estimating auxiliary metrics like word count, even in a challenging task that combines classification and regression. The results surpassed our initial expectations, proving that encoder models can effectively tackle tasks that seem generative in nature.
When we first started this project, we weren’t sure if encoder models like RoBERTa or modernBERT could handle something as tricky as predicting output lengths. After all, this feels more like a generative task. But to our surprise, these models totally delivered, hitting a solid 0.85 F1 score on the classification task. Even when the regression predictions weren’t exact, the models consistently got the right length bin, showing they really "get" the concept of output length.
And here’s the kicker: RoBERTa and modernBERT outperformed the mighty LLaMA models! Sure, we thought LLaMA 8B would crush this task, but the encoders held their ground and proved they’re still the top dogs of natural language understanding. This project not only tackled the challenge of detecting "expensive" prompts but also opened up exciting possibilities for optimizing LLM efficiency.
[1] Language Models are Few-Shot Learners, Brown et al., NeurIPS 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[2] LLaMA: Open and Efficient Foundation Language Models, Touvron et al., arXiv 2023. https://arxiv.org/abs/2302.13971
[3] Denial-of-Service Poisoning Attacks against Large Language Models, Gao et al., arXiv 2024. https://arxiv.org/abs/2410.10760
[4] Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models, Zhang et al., arXiv 2024. https://arxiv.org/abs/2410.02916
[5] Automatically Detecting Expensive Prompts and Configuring Firewall Rules to Mitigate Denial of Service Attacks on Large Language Models, Namer et al., 2024. https://www.tdcommons.org/cgi/viewcontent.cgi?article=7777&context=dpubs_series