OpenAIMetadataTagger

class langchain_community.document_transformers.openai_functions.OpenAIMetadataTagger

Bases: BaseDocumentTransformer, BaseModel

Extract metadata tags from document contents using OpenAI functions.

Example:
from langchain.chains import create_tagging_chain
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_transformers import OpenAIMetadataTagger
from langchain_core.documents import Document

schema = {
    "properties": {
        "movie_title": { "type": "string" },
        "critic": { "type": "string" },
        "tone": {
            "type": "string",
            "enum": ["positive", "negative"]
        },
        "rating": {
            "type": "integer",
            "description": "The number of stars the critic rated the movie"
        }
    },
    "required": ["movie_title", "critic", "tone"]
}

# Must be an OpenAI model that supports functions
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
tagging_chain = create_tagging_chain(schema, llm)
document_transformer = OpenAIMetadataTagger(tagging_chain=tagging_chain)
original_documents = [
    Document(
        page_content=(
            "Review of The Bee Movie\n"
            "By Roger Ebert\n\n"
            "This is the greatest movie ever made. 4 out of 5 stars."
        )
    ),
    Document(
        page_content=(
            "Review of The Godfather\n"
            "By Anonymous\n\n"
            "This movie was super boring. 1 out of 5 stars."
        ),
        metadata={"reliable": False},
    ),
]

enhanced_documents = document_transformer.transform_documents(original_documents)
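The schema in the example marks three fields as required and restricts tone to two values. As a rough illustration of what those constraints mean for the tags attached to a document, the sketch below checks a metadata dict against the schema using only the standard library. The helper name schema_violations and the sample metadata are hypothetical and are not part of LangChain; the check covers only the schema keywords used here (required, type, enum).

```python
from typing import Any, Dict, List

# Same schema as in the example above.
schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {
            "type": "integer",
            "description": "The number of stars the critic rated the movie",
        },
    },
    "required": ["movie_title", "critic", "tone"],
}

# JSON Schema type names mapped to Python types (only those used above).
_PY_TYPES = {"string": str, "integer": int}


def schema_violations(metadata: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: describe each way `metadata` breaks `schema`."""
    problems = []
    for name in schema.get("required", []):
        if name not in metadata:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in metadata:
            continue
        value = metadata[name]
        expected = _PY_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for "integer".
        wrong_type = expected is not None and (
            not isinstance(value, expected) or isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: value not in {spec['enum']}")
    return problems


if __name__ == "__main__":
    # Hand-written sample metadata, not real model output.
    sample = {"movie_title": "The Godfather", "tone": "bored", "rating": "1"}
    for problem in schema_violations(sample, schema):
        print(problem)
```

A check like this could run over each document's metadata after transform_documents returns, to flag documents whose tags need review.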

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

param tagging_chain: Any [Required]

The chain used to extract metadata from each document.

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a list of documents.

Parameters:
  • documents (Sequence[Document]) – A sequence of Documents to be transformed.

  • kwargs (Any)

Returns:

A sequence of transformed Documents.

Return type:

Sequence[Document]
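Some transformers leave the asynchronous variant unimplemented, so it is worth checking the source for the installed version before relying on it. One option that does not depend on a native async implementation is to run the synchronous transform_documents in a worker thread. The sketch below uses stand-in classes (Doc and LengthTagger are hypothetical, defined here so the example runs without LangChain or an API key); with the real library, the transformer would be an OpenAIMetadataTagger instance.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class LengthTagger:
    """Hypothetical synchronous transformer used in place of a real tagger."""

    def transform_documents(self, documents: Sequence[Doc], **kwargs: Any) -> List[Doc]:
        # Attach a trivial tag so the effect of the transform is visible.
        return [
            Doc(d.page_content, {**d.metadata, "length": len(d.page_content)})
            for d in documents
        ]


async def transform_in_thread(transformer: Any, documents: Sequence[Doc], **kwargs: Any) -> List[Doc]:
    """Run a blocking transform_documents call without blocking the event loop."""
    return await asyncio.to_thread(transformer.transform_documents, documents, **kwargs)


async def main() -> List[Doc]:
    docs = [
        Doc("Review of The Bee Movie"),
        Doc("Review of The Godfather", {"reliable": False}),
    ]
    return await transform_in_thread(LengthTagger(), docs)


if __name__ == "__main__":
    for doc in asyncio.run(main()):
        print(doc.metadata)
```

asyncio.to_thread requires Python 3.9 or later. Because each call occupies a thread while it waits on the model, this approach suits moderate batch sizes rather than very high concurrency.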

transform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Automatically extract and populate metadata for each document according to the provided schema.

Parameters:
  • documents (Sequence[Document]) – A sequence of Documents to be transformed.

  • kwargs (Any)

Return type:

Sequence[Document]
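To make "extract and populate metadata" concrete, here is a minimal sketch of the per-document flow using stand-ins only. It rests on two assumptions that should be verified against the library source: that the tagging chain exposes a run(text) method returning a dict, and that keys already present in a document's metadata take precedence over extracted ones. Doc, FakeTaggingChain, and tag_documents are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeTaggingChain:
    """Hypothetical stub: returns canned tags instead of calling a model."""

    def __init__(self, tags: Dict[str, Any]) -> None:
        self.tags = tags

    def run(self, text: str) -> Dict[str, Any]:
        # A real chain would send `text` to the model; this stub ignores it.
        return dict(self.tags)


def tag_documents(chain: Any, documents: Sequence[Doc]) -> List[Doc]:
    """Return new documents whose metadata includes the extracted tags."""
    tagged = []
    for doc in documents:
        extracted = chain.run(doc.page_content)
        # Assumption: existing metadata wins when a key appears in both.
        merged = {**extracted, **doc.metadata}
        tagged.append(Doc(doc.page_content, merged))
    return tagged


if __name__ == "__main__":
    chain = FakeTaggingChain({"tone": "negative", "reliable": True})
    originals = [Doc("Review of The Godfather", {"reliable": False})]
    print(tag_documents(chain, originals)[0].metadata)
```

Building new documents rather than mutating the inputs keeps the originals available for comparison or for a retry with a different schema.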