document_transformers

Document Transformers are classes that transform Documents.

Document Transformers are usually used to transform many Documents in a single run.

Class hierarchy:

BaseDocumentTransformer --> <name>  # Examples: DoctranQATransformer, DoctranTextTranslator
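
The hierarchy above can be pictured with a minimal sketch. It does not import the library: `Document` and `BaseDocumentTransformer` here are simplified stand-ins written for illustration, and `UppercaseTransformer` is a hypothetical subclass. The sketch assumes the base class exposes a `transform_documents` method that takes a sequence of documents and returns a new sequence.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class Document:
    """Simplified stand-in for the library's Document helper (assumption)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseDocumentTransformer(ABC):
    """Simplified stand-in for the abstract base class (assumption)."""

    @abstractmethod
    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Return transformed versions of the given documents."""


class UppercaseTransformer(BaseDocumentTransformer):
    """Hypothetical transformer that upper-cases each document's text."""

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Build new Document objects so the inputs are left unmodified.
        return [
            Document(page_content=doc.page_content.upper(), metadata=dict(doc.metadata))
            for doc in documents
        ]


# Transform a whole batch of documents in a single call.
docs = [Document("hello", {"source": "a"}), Document("world", {"source": "b"})]
transformed = UppercaseTransformer().transform_documents(docs)
```

Taking and returning a whole sequence, rather than one document at a time, matches the batch-oriented use described above and lets a transformer compare documents with each other, as the embedding-based filters do.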

Main helpers:

Document

Classes

document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer()

Transform HTML content by extracting specific tags and removing unwanted ones.

document_transformers.doctran_text_extract.DoctranPropertyExtractor(...)

Extract properties from text documents using doctran.

document_transformers.doctran_text_qa.DoctranQATransformer([...])

Extract QA from text documents using doctran.

document_transformers.doctran_text_translate.DoctranTextTranslator([...])

Translate text documents using doctran.

document_transformers.embeddings_redundant_filter.EmbeddingsClusteringFilter

Perform K-means clustering on document vectors.

document_transformers.embeddings_redundant_filter.EmbeddingsRedundantFilter

Filter that drops redundant documents by comparing their embeddings.

document_transformers.html2text.Html2TextTransformer([...])

Convert HTML content to plain text using the html2text library.

document_transformers.long_context_reorder.LongContextReorder

Reorder long context.

document_transformers.markdownify.MarkdownifyTransformer([...])

Converts HTML documents to Markdown format with customizable options for handling links, images, other tags and heading styles using the markdownify library.

document_transformers.nuclia_text_transform.NucliaTextTransformer(nua)

Nuclia Text Transformer.

document_transformers.openai_functions.OpenAIMetadataTagger

Extract metadata tags from document contents using OpenAI functions.
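
As an illustration of the idea behind `EmbeddingsRedundantFilter` listed above, the following is a minimal sketch of dropping near-duplicate texts by comparing embedding vectors. It is not the library's implementation: the `drop_redundant` helper, the greedy keep-first strategy, the 0.95 threshold, and the hand-written two-dimensional vectors are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    threshold: float = 0.95,  # illustrative value, an assumption
) -> List[str]:
    """Hypothetical helper: keep a text only if it is not too similar
    to any text that has already been kept."""
    kept_indices: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) <= threshold for j in kept_indices):
            kept_indices.append(i)
    return [texts[i] for i in kept_indices]


# Hand-made vectors standing in for real embeddings.
texts = ["cats purr", "cats purr loudly", "stocks fell"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique_texts = drop_redundant(texts, vectors)
```

Here the second text is dropped because its vector is nearly parallel to the first, while the third is kept. A real embedding model would replace the hand-made vectors.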

Functions

document_transformers.beautiful_soup_transformer.get_navigable_strings(...)

Get all navigable strings from a BeautifulSoup element.

document_transformers.embeddings_redundant_filter.get_stateful_documents(...)

Convert a list of documents to a list of documents with state.

document_transformers.openai_functions.create_metadata_tagger(...)

Create a DocumentTransformer that uses an OpenAI function chain to automatically tag documents with metadata based on their content and an input schema.

Deprecated classes

document_transformers.google_translate.GoogleTranslateTransformer(...)

Deprecated since version 0.0.32: Use langchain_google_community.DocAIParser instead.