vLLM is a fast and easy-to-use library for LLM inference and serving, offering:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

This notebook goes over how to use an LLM with LangChain and vLLM.

To use this integration, you should have the vllm Python package installed.

#!pip install vllm -q
from langchain.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # mandatory for hf models
)

print(llm("What is the capital of France ?"))

    INFO 08-06 11:37:33] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
    INFO 08-06 11:37:41] # GPU blocks: 861, # CPU blocks: 512

Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.00it/s]

What is the capital of France ? The capital of France is Paris.
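The block counts in the engine log above come from PagedAttention's paged KV cache. Assuming vLLM's default block size of 16 tokens per block (an assumption; it is configurable via the engine's block_size argument), the reported counts translate into token capacity with simple arithmetic:

```python
BLOCK_SIZE = 16  # vLLM's default number of tokens per KV-cache block (assumed here)

# Block counts reported in the engine log above.
gpu_blocks = 861
cpu_blocks = 512

# Each block stores the attention key/value state for BLOCK_SIZE tokens.
gpu_token_capacity = gpu_blocks * BLOCK_SIZE
cpu_token_capacity = cpu_blocks * BLOCK_SIZE

print(gpu_token_capacity)  # 13776 tokens of GPU KV cache
print(cpu_token_capacity)  # 8192 tokens of CPU swap space
```

Because blocks are allocated on demand rather than reserved per sequence, this capacity is shared across all concurrently batched requests.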

Integrate the model in an LLMChain

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
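As a dependency-free sanity check, the substitution that PromptTemplate performs on this template is equivalent to plain Python str.format (a minimal sketch, not how LangChain implements it internally):

```python
# The same template text used with PromptTemplate above.
template = """Question: {question}

Answer: Let's think step by step."""

# PromptTemplate fills {question} much like str.format does.
filled = template.format(
    question="Who was the US president in the year the first Pokemon game was released?"
)
print(filled)
```

The filled string is what the chain ultimately sends to the vLLM model as the prompt.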

question = "Who was the US president in the year the first Pokemon game was released?"

print(llm_chain.run(question))

    Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]

1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.

Distributed Inference

vLLM supports distributed tensor-parallel inference and serving.

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

from langchain.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # mandatory for hf models
)

llm("What is the future of AI?")

OpenAI-Compatible Server

vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.

This server can be queried in the same format as the OpenAI API.
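For example, assuming a vLLM version that ships the vllm.entrypoints.openai.api_server module (the entrypoint and flags vary across vLLM releases, and the model name here is illustrative), the server can be started and queried like this:

```shell
# Start the OpenAI-compatible server; it listens on port 8000 by default.
python -m vllm.entrypoints.openai.api_server --model mosaicml/mpt-7b

# In another shell, query it with the standard OpenAI completions format.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mosaicml/mpt-7b", "prompt": "Rome is", "max_tokens": 16}'
```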

OpenAI-Compatible Completion

from langchain.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",  # the vLLM server does not validate the key
    openai_api_base="http://localhost:8000/v1",
    model_name="mosaicml/mpt-7b",  # whichever model the server is serving
    model_kwargs={"stop": ["."]},
)
print(llm("Rome is"))

     a city that is filled with history, ancient buildings, and art around every corner