NVIDIARerank#

class langchain_nvidia_ai_endpoints.reranking.NVIDIARerank[source]#

Bases: BaseDocumentCompressor

LangChain Document Compressor that uses the NVIDIA NeMo Retriever Reranking API.

Create a new NVIDIARerank document compressor.

This class provides access to an NVIDIA NIM for reranking. By default, it connects to a hosted NIM, but it can be configured to connect to a local NIM using the base_url parameter. An API key is required to connect to the hosted NIM.

Args:

model (str): The model to use for reranking.
nvidia_api_key (str): The API key to use for connecting to the hosted NIM.
api_key (str): Alternative to nvidia_api_key.
base_url (str): The base URL of the NIM to connect to.
truncate (str): "NONE" or "END". Truncate input text if it exceeds the model's context length. Default is model dependent and is likely to raise an error if an input is too long.

API Key: The recommended way to provide the API key is through the NVIDIA_API_KEY environment variable.

Base URL: Connect to a self-hosted model with NVIDIA NIM by passing the base_url argument. For a NIM served on the local host at localhost:8000: ranker = NVIDIARerank(base_url="http://localhost:8000/v1")

Example:

>>> from langchain_nvidia_ai_endpoints import NVIDIARerank
>>> from langchain_core.documents import Document
>>> query = "What is the GPU memory bandwidth of H100 SXM?"
>>> passages = [
...     "The Hopper GPU is paired with the Grace CPU using NVIDIA's ultra-fast "
...     "chip-to-chip interconnect, delivering 900GB/s of bandwidth, 7X faster "
...     "than PCIe Gen5. This innovative design will deliver up to 30X higher "
...     "aggregate system memory bandwidth to the GPU compared to today's fastest "
...     "servers and up to 10X higher performance for applications running "
...     "terabytes of data.",
...     "A100 provides up to 20X higher performance over the prior generation and "
...     "can be partitioned into seven GPU instances to dynamically adjust to "
...     "shifting demands. The A100 80GB debuts the world's fastest memory "
...     "bandwidth at over 2 terabytes per second (TB/s) to run the largest "
...     "models and datasets.",
...     "Accelerated servers with H100 deliver the compute power—along with 3 "
...     "terabytes per second (TB/s) of memory bandwidth per GPU and scalability "
...     "with NVLink and NVSwitch™.",
... ]
>>> client = NVIDIARerank(
...     model="nvidia/nv-rerankqa-mistral-4b-v3",
...     api_key="$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC",
... )
>>> response = client.compress_documents(
...     query=query,
...     documents=[Document(page_content=passage) for passage in passages],
... )
>>> print(
...     f"Most relevant: {response[0].page_content}\n"
...     f"Least relevant: {response[-1].page_content}"
... )
Most relevant: Accelerated servers with H100 deliver the compute power—along with 3 terabytes per second (TB/s) of memory bandwidth per GPU and scalability with NVLink and NVSwitch™.
Least relevant: A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands. The A100 80GB debuts the world's fastest memory bandwidth at over 2 terabytes per second (TB/s) to run the largest models and datasets.
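The API key and base URL notes above can be combined into a small setup routine. The following is a minimal sketch, not part of the library: build_rerank_kwargs and make_reranker are hypothetical helper names, and the check on NVIDIA_API_KEY only mirrors the recommendation to supply the key through the environment when using the hosted NIM.

```python
import os
from typing import Optional


def build_rerank_kwargs(
    model: Optional[str] = None, base_url: Optional[str] = None
) -> dict:
    """Collect constructor arguments for NVIDIARerank (hypothetical helper)."""
    kwargs = {}
    if model is not None:
        kwargs["model"] = model
    if base_url is not None:
        # Self-hosted NIM: point the client at the local endpoint.
        kwargs["base_url"] = base_url
    elif "NVIDIA_API_KEY" not in os.environ:
        # Hosted NIM: this sketch expects the key in the environment.
        raise RuntimeError("Set NVIDIA_API_KEY before using the hosted NIM")
    return kwargs


def make_reranker(model: Optional[str] = None, base_url: Optional[str] = None):
    # Imported lazily so the helper above can be used without the package.
    from langchain_nvidia_ai_endpoints import NVIDIARerank

    return NVIDIARerank(**build_rerank_kwargs(model=model, base_url=base_url))
```

For example, make_reranker(base_url="http://localhost:8000/v1") would target a local NIM, while make_reranker(model="nvidia/nv-rerankqa-mistral-4b-v3") would rely on the key from the environment.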

param base_url: str | None = None#: Base URL for model listing and invocation.

param extra_headers: dict [Optional]#: Extra headers to include in the request.

param max_batch_size: int = 32#

The maximum batch size.

Constraints:

ge = 1

param model: str | None = None#: The model to use for reranking.

param top_n: int = 5#

The number of documents to return.

Constraints:

ge = 0
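To illustrate how max_batch_size and top_n might interact, here is a minimal pure-Python sketch. It assumes that documents are scored in batches of at most max_batch_size, that the scores are merged, and that the top_n highest-scoring documents are returned. This is an assumption about the general technique, not a statement of the library's internals; rerank_in_batches and overlap_score are hypothetical names, and the word-overlap scorer merely stands in for the reranking model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_in_batches(
    query: str,
    passages: Sequence[str],
    score: Callable[[str, str], float],
    max_batch_size: int = 32,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score passages batch by batch, then keep the top_n best overall."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    if top_n < 0:
        raise ValueError("top_n must be >= 0")

    scored: List[Tuple[str, float]] = []
    for start in range(0, len(passages), max_batch_size):
        # Each slice stands in for one request to the reranking service.
        batch = passages[start : start + max_batch_size]
        scored.extend((passage, score(query, passage)) for passage in batch)

    # Merge the per-batch results into one global ranking.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: number of query words found in the passage."""
    query_words = set(query.lower().split())
    return float(len(query_words & set(passage.lower().split())))
```

With top_n=0 the sketch returns an empty list, which is consistent with the ge = 0 constraint on top_n.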

param truncate: Literal['NONE', 'END'] | None = None#: Truncate input text if it exceeds the model’s maximum token length. Default is model dependent and is likely to raise an error if an input is too long.
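The difference between the two truncate values can be pictured with a small sketch. It is only an illustration under stated assumptions: real truncation is performed by the service using the model's own tokenizer, whereas apply_truncate is a hypothetical function that works on an already tokenized list and a made-up token limit.

```python
from typing import List


def apply_truncate(
    tokens: List[str], max_tokens: int, truncate: str = "NONE"
) -> List[str]:
    """Mimic the 'NONE' and 'END' policies on a list of tokens."""
    if len(tokens) <= max_tokens:
        # Input fits: both policies leave it unchanged.
        return list(tokens)
    if truncate == "END":
        # Drop tokens from the end until the input fits.
        return list(tokens[:max_tokens])
    # 'NONE': refuse over-long input instead of silently cutting it.
    raise ValueError(
        f"input has {len(tokens)} tokens, limit is {max_tokens}"
    )
```

In this sketch, "END" trades possible loss of trailing context for robustness, while "NONE" surfaces over-long inputs as errors so they can be handled upstream.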

classmethod get_available_models(**kwargs: Any) → List[Model][source]#

Get a list of available models that work with NVIDIARerank.

Parameters: kwargs (Any)
Return type: List[Model]
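A possible way to use get_available_models is to check that a preferred model is offered before constructing the compressor. The sketch below makes two assumptions that are not confirmed by this page: that each returned Model exposes an id attribute, and that querying the hosted endpoint requires NVIDIA_API_KEY to be set. list_reranker_ids and choose_model are hypothetical helper names.

```python
from typing import List, Sequence


def list_reranker_ids() -> List[str]:
    """Return the ids of models usable with NVIDIARerank."""
    # Imported lazily so choose_model below works without the package.
    from langchain_nvidia_ai_endpoints import NVIDIARerank

    models = NVIDIARerank.get_available_models()
    # Assumption: each Model object has an `id` attribute.
    return sorted(model.id for model in models)


def choose_model(available_ids: Sequence[str], preferred: str) -> str:
    """Pick the preferred model if offered, else fall back to the first one."""
    if not available_ids:
        raise ValueError("no reranking models available")
    return preferred if preferred in available_ids else available_ids[0]
```

A caller might then write choose_model(list_reranker_ids(), "nvidia/nv-rerankqa-mistral-4b-v3") and pass the result as the model argument.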

async acompress_documents( documents: Sequence[Document], query: str, callbacks: Callbacks | None = None, ) → Sequence[Document]#

Async compress retrieved documents given the query context.

Parameters:

documents (Sequence[Document]) – The retrieved documents.
query (str) – The query context.
callbacks (Optional[Callbacks]) – Optional callbacks to run during compression.

Returns:

The compressed documents.

Return type:

Sequence[Document]
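The asynchronous variant can be awaited from a coroutine. The following is a minimal sketch under the same assumptions as the class example (a valid API key in the environment and the model name shown there); rerank_async and main are hypothetical names, and nothing here runs until main is called.

```python
import asyncio
from typing import List, Sequence


async def rerank_async(query: str, passages: Sequence[str]) -> List[str]:
    """Rerank passages without blocking the event loop."""
    # Imported lazily so the module loads even without the packages installed.
    from langchain_core.documents import Document
    from langchain_nvidia_ai_endpoints import NVIDIARerank

    reranker = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3")
    documents = [Document(page_content=passage) for passage in passages]
    ranked = await reranker.acompress_documents(documents=documents, query=query)
    return [document.page_content for document in ranked]


def main() -> None:
    # Requires network access and NVIDIA_API_KEY for the hosted NIM.
    results = asyncio.run(
        rerank_async("What is NVLink?", ["NVLink is an interconnect.", "Unrelated."])
    )
    print(results[0])
```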

compress_documents( documents: Sequence[Document], query: str, callbacks: list[BaseCallbackHandler] | BaseCallbackManager | None = None, ) → Sequence[Document][source]#

Compress documents using the NVIDIA NeMo Retriever Reranking microservice API.

Parameters:

documents (Sequence[Document]) – A sequence of documents to compress.
query (str) – The query to use for compressing the documents.
callbacks (list[BaseCallbackHandler] | BaseCallbackManager | None) – Callbacks to run during the compression process.

Returns:

A sequence of compressed documents.

Return type:

Sequence[Document]

property available_models: List[Model]#: Get a list of available models that work with NVIDIARerank.