A Brief Introduction to Optimized Batched Inference with vLLM


by Sergio Morales, Principal Data Engineer at Growth Acceleration Partners.

In a previous article, we talked about how off-the-shelf, pre-trained models made available through Hugging Face’s model hub could be leveraged to fulfill a wide range of Natural Language Processing (NLP) tasks. This included text classification, question answering, and content generation — either by taking advantage of their base knowledge or by fine-tuning them to create specialized models that homed in on particular subject matters or contexts.

In this article, we will introduce the vLLM library to optimize the performance of these models, and present a mechanism through which we can take advantage of a large language model (LLM)’s text generation capabilities to make it perform more specific and context-sensitive tasks.

Having access to efficient inference backends plays a pivotal role in optimizing the deployment and usage of natural language processing models and their availability to all kinds of teams and organizations. Given the memory and resource costs associated with LLMs, the ability to conserve computing resources during inference — contributing to reduced latency and improved scalability — is of great value.

Streamlining inference processes not only enhances real-time applications but is also crucial for minimizing operational costs, making it more economically viable to deploy large-scale language models. Industries that rely on resource-intensive tasks stand to benefit from being able to instantiate sophisticated language models in a sustainable and accessible way.

The vLLM Python Package

vLLM is a library designed for the efficient inference and serving of LLMs, similar to Hugging Face’s transformers backend. It provides high serving throughput and efficient attention key-value memory management using PagedAttention and continuous batching. PagedAttention is an optimized version of the classic attention algorithm, inspired by virtual memory and paging.

It seamlessly integrates with a variety of LLMs, such as Llama, OPT, Mixtral, StableLM, and Falcon, sharing many commonalities with Hugging Face in terms of available models. Per its developers, it’s capable of delivering up to 24x higher throughput than Hugging Face’s transformers, without requiring any model architecture changes.
Once installed on a suitable Python environment, the vLLM API is simple enough to use. In the following example, we instantiate a text generation model off of the Hugging Face model hub (jondurbin/airoboros-m-7b-3.1.2):

from vllm import LLM, SamplingParams

# Create an LLM
llm = LLM(model="jondurbin/airoboros-m-7b-3.1.2",
          gpu_memory_utilization=0.95,  # fraction of GPU memory for the model executor
          max_model_len=4096)           # maximum model context length (illustrative value)

# Provide prompts
prompts = ["Here are some tips for taking care of your skin: ",
           "To cook this recipe, we'll need "]

# Adjust sampling parameters as necessary for the task
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=256)  # generation length cap (illustrative value)

# Generate texts from the prompts
outputs = llm.generate(prompts, sampling_params)

# Print outputs
for i, o in zip(prompts, outputs):
    print(f"Prompt: {i}")
    print(f"Output: {o.outputs[0].text}")

In the above fragment, the model is referenced using the same identifier scheme one would use when working with the transformers library. Additional parameters are passed into the LLM class initializer, such as gpu_memory_utilization (the fraction of GPU memory to be used for the model executor, set to 0.95 in this case) and max_model_len (the maximum length for the model context).

Executing Batched Inferences on a Large Dataset

As can be expected, batched inference refers to the practice of processing multiple input sequences simultaneously during the inference phase, rather than one at a time. This strategy capitalizes on the parallelization capabilities of modern hardware, allowing the model to handle multiple inputs in a single computational step.

Batched inference significantly improves overall inference speed and efficiency, as the model processes several sequences concurrently, reducing the computational overhead associated with individual predictions. This technique is especially crucial in scenarios where large-scale language models are deployed for real-time applications, as it helps maximize the utilization of computational resources and ensures faster response times for tasks such as text generation, translation, and sentiment analysis.
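To make the contrast concrete, here is an illustrative sketch (not meant to be run as-is) of one-at-a-time versus batched generation. The `llm` and `sampling_params` objects are assumed to be the ones created in the earlier vLLM example:

```python
def generate_sequentially(llm, prompts, sampling_params):
    # One forward pass per prompt: the hardware's parallelism is largely wasted.
    return [llm.generate([p], sampling_params)[0] for p in prompts]

def generate_batched(llm, prompts, sampling_params):
    # A single call: vLLM schedules all prompts together using
    # continuous batching and PagedAttention.
    return llm.generate(prompts, sampling_params)

# A hypothetical workload of many independent prompts
prompts = [f"Summarize customer review #{i}: ..." for i in range(8)]
```

The batched variant is simply a matter of handing the whole list to a single `generate` call; vLLM handles the scheduling internally.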

In the following example, we’ll set up a simple pipeline that executes a specific, domain-sensitive task on a regular pandas dataframe containing a column of unstructured text. We will construct a dynamic, few-shot prompt using the model’s context, which allows us to provide a consistent prompt to the model while feeding it distinct inputs to perform inference on. We will continue to use the Llama 2 chat templating format.

summary_few_shot = """[INST] <<SYS>> You are a helpful assistant who converts long-form board game descriptions into a short standardized format. <</SYS>>
The following are paragraphs describing board games' themes and mechanics. Following each record is a single sentence describing the theme and discernible gameplay mechanics:

Record: Earthborne Rangers is a customizable, co-operative card game set in the wilderness of the far future. You take on the role of a Ranger, a protector of the mountain valley you call home: a vast wilderness transformed by monumental feats of science and technology devised to save the Earth from destruction long ago.
Description: Mechanics: Cooperative, Cards. Themes: Wilderness, Conservationism

Record: Ostia is a strategy game for 1-4 players. Players lead a large fleet to explore the ocean, trade and develop the port.
Make good use of the Mancala system to strengthen your personal board and aim for the highest honor!
Description: Mechanics: strategy, trading, mancala. Themes: Ocean exploration

Follow the above examples, and provide a description of the following:
Record: {record} [/INST]"""  # {record} is filled in with each row's text at inference time

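The pipeline itself can then be sketched as follows. This is a hypothetical, minimal version: the dataframe and column names (`games`, `description`) are assumptions, the template is a compact stand-in for the full few-shot prompt above, and the sampling settings are illustrative:

```python
import pandas as pd

# Compact stand-in for the few-shot template above, with a {record}
# placeholder filled in per row.
TEMPLATE = """[INST] <<SYS>> You are a helpful assistant who converts long-form board game descriptions into a short standardized format. <</SYS>>
Follow the earlier examples, and provide a description of the following:
Record: {record} [/INST]"""

def build_prompts(df, column="description"):
    # One prompt per row, all sharing the same few-shot instructions.
    return [TEMPLATE.format(record=text) for text in df[column]]

def summarize(df):
    # vLLM is imported locally so that prompt construction stays
    # testable on machines without a GPU.
    from vllm import LLM, SamplingParams
    llm = LLM(model="jondurbin/airoboros-m-7b-3.1.2",
              gpu_memory_utilization=0.95)
    sampling_params = SamplingParams(temperature=0.2, max_tokens=64)
    # A single batched call over every row in the dataframe.
    outputs = llm.generate(build_prompts(df), sampling_params)
    out = df.copy()
    out["summary"] = [o.outputs[0].text.strip() for o in outputs]
    return out
```

On a GPU-equipped machine, calling `summarize(games)` would return the dataframe with a new `summary` column containing the standardized descriptions, produced in one batched pass rather than row by row.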

As the above shows, building a batched inference pipeline with a package like vLLM can be relatively easy, as long as there is a clear understanding of the underlying data structures, the specific requirements of the inference task, and the limits of the technology involved. This approach proves particularly advantageous when dealing with large-scale language models, ensuring the deployment of sophisticated AI solutions remains both rapid and resource-efficient.

At GAP, our expertise resides not only in our ability to leverage state-of-the-art tools and frameworks such as vLLM to fulfill your AI needs, but also in integrating them into a bigger data infrastructure that emphasizes efficiency and smart resource utilization. Whether it’s deploying LLMs for natural language understanding, sentiment analysis, or other NLP tasks, our approach encompasses an integrated understanding of your organizational objectives.

By combining cutting-edge technologies with a strategic framework, GAP engineers ensure your AI efforts exceed expectations, delivering solutions that are both innovative and resource-efficient.