

This notebook provides a quick overview for getting started with a chat model integrated with llama-cpp-python.


Integration details

Class | Package | Local | Serializable | JS support

Model features

Tool calling | Structured output | JSON mode | Image input | Audio input | Video input | Token-level streaming | Native async | Token usage | Logprobs


To get started and use all the features shown below, we recommend using a model that has been fine-tuned for tool calling.

We will use Hermes-2-Pro-Llama-3-8B-GGUF from NousResearch.

Hermes 2 Pro is an upgraded version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling.

See our guides on local models to go deeper:


The LangChain LlamaCpp integration lives in the langchain-community and llama-cpp-python packages:

%pip install -qU langchain-community llama-cpp-python


Now we can instantiate our model object and generate chat completions:

# Path to your model weights
local_model = "local/path/to/Hermes-2-Pro-Llama-3-8B-Q8_0.gguf"

import multiprocessing

from langchain_community.chat_models import ChatLlamaCpp

llm = ChatLlamaCpp(
    model_path=local_model,
    n_batch=300,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_threads=multiprocessing.cpu_count() - 1,
)
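The two tuning values above are easy to get wrong on small machines: `cpu_count() - 1` evaluates to 0 on a single-core host, and `n_batch` is meant to stay between 1 and `n_ctx`. A minimal sketch of guarding both follows. The helper names `pick_n_threads` and `clamp_n_batch` are hypothetical and not part of LangChain or llama-cpp-python, and the context size of 2048 is only an illustrative value.

```python
import os


def pick_n_threads(reserve: int = 1) -> int:
    """Leave `reserve` cores free, but never return less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, total - reserve)


def clamp_n_batch(n_batch: int, n_ctx: int) -> int:
    """Keep n_batch within [1, n_ctx], the range the comment above recommends."""
    return max(1, min(n_batch, n_ctx))


n_threads = pick_n_threads()
n_batch = clamp_n_batch(300, n_ctx=2048)  # 2048 is an illustrative context size
print(n_threads, n_batch)
```

The resulting values could then be passed as `n_threads` and `n_batch` when constructing the model.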


messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
print(ai_msg.content)
J'aime programmer. (In France, "programming" is often used in its original sense of scheduling or organizing events.) 

If you meant computer-programming:
Je suis amoureux de la programmation informatique.

(You might also say simply 'programmation', which would be understood as both meanings - depending on context).


We can chain our model with a prompt template like so:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that translates {input_language} to {output_language}.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm
chain.invoke(
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    }
)
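To see what reaches the model once the template variables are filled in, the substitution step can be imitated in plain Python. This is only a rough stand-in under the assumption that placeholders behave like `str.format` fields; `render` is a hypothetical helper, not LangChain code, and it returns simple (role, text) pairs rather than message objects.

```python
# Message templates as (role, template) pairs, mirroring the prompt above.
templates = [
    ("system", "You are a helpful assistant that translates {input_language} to {output_language}."),
    ("human", "{input}"),
]


def render(templates, variables):
    """Fill each template's placeholders; a plain-Python stand-in, not LangChain code."""
    return [(role, text.format(**variables)) for role, text in templates]


rendered = render(
    templates,
    {"input_language": "English", "output_language": "German", "input": "I love programming."},
)
for role, text in rendered:
    print(f"{role}: {text}")
```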

Tool calling

Tool calling with ChatLlamaCpp works mostly the same way as OpenAI function calling.

OpenAI has a tool calling API (we use "tool calling" and "function calling" interchangeably here) that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. Tool calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.

With ChatLlamaCpp.bind_tools, we can easily pass in Pydantic classes, dict schemas, LangChain tools, or even functions as tools to the model. Under the hood these are converted to OpenAI tool schemas, which look like:

{
    "name": "...",
    "description": "...",
    "parameters": {...}  # JSONSchema
}

and passed in every model invocation.
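To make the shape of that schema concrete, here is a rough, hand-rolled sketch that derives the three fields from a plain Python function using only the standard library. It is not LangChain's conversion logic: `sketch_tool_schema` and `_JSON_TYPES` are hypothetical names, and the type mapping covers only a few primitive annotations.

```python
import inspect
import json

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_tool_schema(func):
    """Build a name/description/parameters dict from a function signature."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def get_current_weather(location: str, unit: str = "celsius"):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


schema = sketch_tool_schema(get_current_weather)
print(json.dumps(schema, indent=2))
```

The function name becomes the tool name, the docstring becomes the description, and the parameters become a JSON Schema object.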

However, the model cannot trigger a function/tool automatically; we need to force it by specifying the tool_choice parameter. This parameter is typically formatted as shown below.

{"type": "function", "function": {"name": <<tool_name>>}}.
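A small helper can build this dict so the nesting is not retyped for every tool. `force_tool_choice` is a hypothetical name used here for illustration. Note that the name should be the one the tool is registered under, which may differ from the Python function name when a custom name is passed to the `@tool` decorator.

```python
def force_tool_choice(tool_name: str) -> dict:
    """Return a tool_choice dict that names one specific tool."""
    return {"type": "function", "function": {"name": tool_name}}


choice = force_tool_choice("get_current_weather")
print(choice)
```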

from langchain_core.tools import tool
from langchain_core.pydantic_v1 import BaseModel, Field


class WeatherInput(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: str = Field(enum=["celsius", "fahrenheit"])


@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)
ai_msg = llm_with_tools.invoke(
    "what is the weather like in HCMC in celsius",
)
ai_msg.tool_calls

[{'name': 'get_current_weather',
  'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},
  'id': 'call__0_get_current_weather_cmpl-394d9943-0a1f-425b-8139-d2826c1431f2'}]

class MagicFunctionInput(BaseModel):
    magic_function_input: int = Field(description="The input value for magic function")


@tool("get_magic_function", args_schema=MagicFunctionInput)
def magic_function(magic_function_input: int):
    """Get the value of magic function for an input."""
    return magic_function_input + 2


llm_with_tools = llm.bind_tools(
    tools=[magic_function],
    tool_choice={"type": "function", "function": {"name": "get_magic_function"}},
)

ai_msg = llm_with_tools.invoke(
    "What is magic function of 3?",
)
ai_msg.tool_calls

[{'name': 'get_magic_function',
  'args': {'magic_function_input': 3},
  'id': 'call__0_get_magic_function_cmpl-cd83a994-b820-4428-957c-48076c68335a'}]
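The model only returns a description of the call; running the function is left to the caller. One possible way to dispatch a list shaped like the output above is sketched here with plain Python stand-ins. The `TOOLS` registry and the shortened id are illustrative assumptions, and `magic_function` is redefined as an ordinary function so the snippet runs without a model.

```python
def magic_function(magic_function_input: int) -> int:
    """Same arithmetic as the tool defined above, as a plain function."""
    return magic_function_input + 2


# Registry keyed by the tool name the model emits.
TOOLS = {"get_magic_function": magic_function}

# Shaped like the output above; the id is shortened for readability.
tool_calls = [
    {
        "name": "get_magic_function",
        "args": {"magic_function_input": 3},
        "id": "call__0_get_magic_function",
    },
]

results = []
for call in tool_calls:
    func = TOOLS[call["name"]]  # raises KeyError for an unknown tool name
    results.append({"id": call["id"], "output": func(**call["args"])})

print(results)
```

Keeping the call id next to each output makes it possible to match results back to the requests that produced them.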

Structured output

from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool

class Joke(BaseModel):
    """A setup to a joke and the punchline."""

    setup: str
    punchline: str


dict_schema = convert_to_openai_tool(Joke)
structured_llm = llm.with_structured_output(dict_schema)
result = structured_llm.invoke("Tell me a joke about birds")
result

{'setup': '- Why did the chicken cross the playground?',
 'punchline': '\n\n- To get to its gilded cage on the other side!'}
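Because the result here is a plain dict, and the sample output carries stray newlines and leading dashes, some light checking and cleanup before use may help. The sketch below uses a standard-library dataclass; `JokeRecord` and `to_joke` are hypothetical names, and the cleanup rules are just one possible choice.

```python
from dataclasses import dataclass, fields


@dataclass
class JokeRecord:
    setup: str
    punchline: str


def to_joke(result: dict) -> JokeRecord:
    """Check the expected keys are present, then tidy the strings."""
    expected = {f.name for f in fields(JokeRecord)}
    missing = expected - result.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Strip surrounding whitespace, then any leading dashes and spaces.
    return JokeRecord(**{name: result[name].strip().lstrip("- ") for name in expected})


joke = to_joke(
    {
        "setup": "- Why did the chicken cross the playground?",
        "punchline": "\n\n- To get to its gilded cage on the other side!",
    }
)
print(joke.setup)
print(joke.punchline)
```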


Streaming

for chunk in llm.stream("what is 25x5"):
    print(chunk.content, end="\n", flush=True)
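The loop above prints every chunk on its own line. If the full reply is also needed afterwards, the chunks can be collected while they are printed. The sketch below runs without a model by using a hypothetical `FakeChunk` class and `fake_stream` generator as stand-ins for the streamed chunks; only the `.content` attribute is modeled, and the streamed text is made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed message chunk; only .content is modeled."""

    content: str


def fake_stream(text: str) -> Iterator[FakeChunk]:
    """Yield one chunk per space-separated piece, imitating token streaming."""
    for piece in text.split(" "):
        yield FakeChunk(content=piece + " ")


def collect(chunks: Iterable[FakeChunk]) -> str:
    """Print each chunk as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print()
    return "".join(parts).strip()


full_text = collect(fake_stream("25 x 5 = 125"))
```

With a real model, the same `collect` function could be given the iterator returned by `llm.stream(...)`, assuming each chunk exposes `.content` as in the loop above.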

API reference

For detailed documentation of all ChatLlamaCpp features and configurations head to the API reference:
