vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving, offering:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels

This notebooks goes over how to use a LLM with langchain and vLLM.

To use, you should have the vllm python package installed.

%pip install --upgrade --quiet  vllm -q

from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm.invoke("What is the capital of France ?"))

INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512
``````output
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]
``````output

What is the capital of France ? The capital of France is Paris.

Integrate the model in an LLMChain

from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"

print(llm_chain.invoke(question))

API Reference:PromptTemplate

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]
``````output


1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.

Distributed Inference

vLLM supports distributed tensor-parallel inference and serving.

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs

from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # mandatory for hf models
)

llm.invoke("What is the future of AI?")

Quantization

vLLM supports awq quantization. To enable it, pass quantization to vllm_kwargs.

llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)

OpenAI-Compatible Server

vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.

This server can be queried in the same format as OpenAI API.

OpenAI-Compatible Completion

from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))

 a city that is filled with history, ancient buildings, and art around every corner

LoRA adapter

LoRA adapters can be used with any vLLM model that implements SupportsLoRA.

from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)
LoRA_ADAPTER_PATH = "path/to/adapter"
lora_adapter = LoRARequest("lora_adapter", 1, LoRA_ADAPTER_PATH)

print(
    llm.invoke("What are some popular Korean street foods?", lora_request=lora_adapter)
)

LLM conceptual guide
LLM how-to guides

Integrate the model in an LLMChain​

Distributed Inference​

Quantization​

OpenAI-Compatible Server​

OpenAI-Compatible Completion​

LoRA adapter​

Related​