ZeroxPDFLoader
- class langchain_community.document_loaders.pdf.ZeroxPDFLoader(file_path: str | Path, model: str = 'gpt-4o-mini', **zerox_kwargs: Any)
Document loader built on the Zerox library (getomni-ai/zerox).
Zerox converts a PDF document into a series of images, one per page, and uses a vision-capable LLM to generate a Markdown representation.
Zerox relies on async operations. Therefore, when using this loader inside a Jupyter Notebook (or any environment that already runs an event loop), you need to apply nest_asyncio first:
```python
import nest_asyncio

nest_asyncio.apply()
```
Initialize with a file path.
- Parameters:
file_path (str | Path) – Either a local, S3 or web path to a PDF file.
headers – Headers to use for GET request to download a file from a web path.
model (str)
zerox_kwargs (Any)
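A minimal usage sketch of the constructor described above. It assumes `langchain_community` and the Zerox Python package are installed, and that credentials for the chosen model are available in the environment. The helper name `iter_markdown_pages` is hypothetical and not part of the library. The import is deferred until the generator is first iterated, so defining the helper does not require the packages.

```python
from pathlib import Path
from typing import Any, Iterator, Union


def iter_markdown_pages(
    pdf_path: Union[str, Path],
    model: str = "gpt-4o-mini",
    **zerox_kwargs: Any,
) -> Iterator[str]:
    """Yield the Markdown text of each Document produced by the loader.

    Hypothetical helper: only the ZeroxPDFLoader constructor arguments and
    lazy_load() come from the documented API.
    """
    # Deferred import so this module can be imported without the dependency.
    from langchain_community.document_loaders.pdf import ZeroxPDFLoader

    loader = ZeroxPDFLoader(file_path=str(pdf_path), model=model, **zerox_kwargs)
    for doc in loader.lazy_load():
        yield doc.page_content


# Example call (requires the packages, a real PDF, and model credentials):
# for markdown in iter_markdown_pages("report.pdf"):
#     print(markdown[:200])
```

Inside a notebook, `nest_asyncio.apply()` would need to run before the first iteration, as noted in the class description.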
Attributes
source
Methods
__init__(file_path[, model]) – Initialize with a file path.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Loads documents from a PDF using the Zerox library (getomni-ai/zerox).
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(file_path: str | Path, model: str = 'gpt-4o-mini', **zerox_kwargs: Any) -> None
Initialize with a file path.
- Parameters:
file_path (str | Path) – Either a local, S3 or web path to a PDF file.
headers – Headers to use for GET request to download a file from a web path.
model (str)
zerox_kwargs (Any)
- Return type:
None
- async alazy_load() -> AsyncIterator[Document]
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
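An `AsyncIterator` such as the one returned by `alazy_load()` is consumed with `async for`. The sketch below shows that consumption pattern against a hypothetical `StubLoader` that yields plain strings, so it runs without the library or a model. With the real loader, each item would be a `Document`.

```python
import asyncio
from typing import AsyncIterator, List


class StubLoader:
    """Hypothetical stand-in with the same alazy_load() shape as the loader."""

    async def alazy_load(self) -> AsyncIterator[str]:
        for page in ("# Page 1", "# Page 2", "# Page 3"):
            await asyncio.sleep(0)  # hand control back to the event loop
            yield page


async def take(loader: StubLoader, limit: int) -> List[str]:
    """Collect at most `limit` items, stopping the iteration early."""
    collected: List[str] = []
    async for item in loader.alazy_load():
        collected.append(item)
        if len(collected) >= limit:
            break
    return collected


pages = asyncio.run(take(StubLoader(), 2))
print(pages)
```

Because the iteration is lazy, stopping early means later items are never produced by the stub.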
- lazy_load() -> Iterator[Document]
Loads documents from a PDF using the Zerox library (getomni-ai/zerox).
- Returns:
An iterator over parsed Document instances.
- Return type:
Iterator[Document]
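The sketch below illustrates why a synchronous method that drives async code can fail where an event loop is already running, which is the situation the `nest_asyncio` note in the class description addresses. It assumes the synchronous call uses `asyncio.run()` internally, which this page does not confirm. `convert_pdf`, `sync_entry_point`, and `notebook_cell` are hypothetical stand-ins that use only the standard library.

```python
import asyncio


async def convert_pdf() -> str:
    # Hypothetical stand-in for the coroutine that performs the conversion.
    return "# Page 1"


def sync_entry_point() -> str:
    # Mirrors a synchronous method that drives a coroutine with asyncio.run().
    coro = convert_pdf()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


async def notebook_cell() -> str:
    # Simulates code executing while an event loop is already running.
    return sync_entry_point()


outside_loop = sync_entry_point()          # no loop running: succeeds
inside_loop = asyncio.run(notebook_cell())  # loop running: asyncio.run() refuses
print(outside_loop)
print(inside_loop)
```

`nest_asyncio.apply()` patches asyncio so that the nested call is permitted.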
- load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered to be deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
list[Document]
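Since this method should be considered deprecated, one possible alternative is to load the documents and apply a splitter explicitly. The sketch assumes the `langchain_text_splitters` package is installed. The helper name `load_then_split` and the chunk sizes are illustrative choices, not library defaults.

```python
from typing import Any, List


def load_then_split(
    loader: Any, chunk_size: int = 1000, chunk_overlap: int = 100
) -> List[Any]:
    """Load all Documents from `loader`, then split them into chunks.

    Hypothetical helper; `loader` is any object exposing load(), such as
    a ZeroxPDFLoader instance.
    """
    # Deferred import so defining the helper does not require the package.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    # split_documents() returns the chunks as new Document objects.
    return splitter.split_documents(loader.load())
```

Keeping the splitter as an explicit step makes the chunking parameters visible at the call site.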