PyPDFDirectoryLoader

class langchain_community.document_loaders.pdf.PyPDFDirectoryLoader(
    path: str | PurePath,
    glob: str = '**/[!.]*.pdf',
    silent_errors: bool = False,
    load_hidden: bool = False,
    recursive: bool = False,
    extract_images: bool = False,
    *,
    password: str | None = None,
    mode: Literal['single', 'page'] = 'page',
    images_parser: BaseImageBlobParser | None = None,
    headers: dict | None = None,
    extraction_mode: Literal['plain', 'layout'] = 'plain',
    extraction_kwargs: dict | None = None,
)

Load and parse a directory of PDF files using the pypdf library.

This class loads and parses every matching PDF document in a directory. It supports recursive search, password-protected files, image extraction, and a choice of extraction modes. PDF processing is delegated to the pypdf library, and documents can be loaded synchronously or asynchronously.

Examples

Setup:

pip install -U langchain-community pypdf

Instantiate the loader:

from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader(
    path="./example_data/",
    glob="**/[!.]*.pdf",
    silent_errors=False,
    load_hidden=False,
    recursive=False,
    extract_images=False,
    password=None,
    mode="page",
    images_parser=None,
    headers=None,
    extraction_mode="plain",
    # extraction_kwargs=None,
)

Load documents:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
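
In "page" mode a directory usually yields several Document objects per file, so it can help to tally them by origin. The following is a minimal sketch, assuming each document's metadata carries a "source" key holding the file path. `pages_per_source` is a hypothetical helper, not part of the library, and the `SimpleNamespace` objects are stand-ins for loaded documents so the snippet runs without any PDF files.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs):
    """Count documents per originating file.

    Assumes each item has a ``metadata`` dict with a "source" key;
    items without one are grouped under "<unknown>".
    """
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Stand-in objects shaped like loaded documents (hypothetical data).
sample_docs = [
    SimpleNamespace(page_content="...", metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="...", metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(page_content="...", metadata={"source": "b.pdf", "page": 0}),
]

for source, count in sorted(pages_per_source(sample_docs).items()):
    print(f"{source}: {count} document(s)")
```

With a real loader, the list returned by `loader.load()` would take the place of `sample_docs`.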

Load documents asynchronously:

docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Initialize with a directory path.

Parameters:

path (str | PurePath) – The path to the directory containing PDF files to be loaded.
glob (str) – The glob pattern to match files in the directory.
silent_errors (bool) – Whether to log errors instead of raising them.
load_hidden (bool) – Whether to include hidden files in the search.
recursive (bool) – Whether to search subdirectories recursively.
extract_images (bool) – Whether to extract images from PDFs.
password (str | None) – Optional password for opening encrypted PDFs.
mode (Literal['single', 'page']) – The extraction mode: “single” extracts the entire document as one unit, “page” extracts it page by page.
images_parser (BaseImageBlobParser | None) – Optional image blob parser.
headers (dict | None) – Optional headers to use for the GET request when downloading a file from a web path.
extraction_mode (Literal['plain', 'layout']) – “plain” for legacy functionality, “layout” for the experimental layout mode.
extraction_kwargs (dict | None) – Optional additional parameters for the extraction process.

Returns:

The constructor does not return data. Call the load method to retrieve parsed documents with their content and metadata.
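
To see which files the default glob pattern selects, the pattern can be tried against a scratch directory using only the standard library. This is a sketch under the assumption that the loader resolves `glob` with pathlib-style matching; `match_pdfs` is a hypothetical helper and the file names are made up. How the loader treats hidden directories (see `load_hidden`) is not modelled here.

```python
import tempfile
from pathlib import Path


def match_pdfs(root, pattern="**/[!.]*.pdf"):
    """Return sorted POSIX-style paths, relative to root, that match pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    # One visible PDF, one dot-prefixed PDF, one non-PDF, one nested PDF.
    for name in ("a.pdf", ".hidden.pdf", "notes.txt", "sub/b.pdf"):
        (base / name).touch()
    matches = match_pdfs(base)

print(matches)  # ['a.pdf', 'sub/b.pdf']
```

The `[!.]` character class rejects file names that begin with a dot, which is why `.hidden.pdf` is skipped, and the leading `**/` lets the pattern reach into `sub/`.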

Methods

`__init__`(path[, glob, silent_errors, ...])	Initialize with a directory path.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`lazy_load`()	A lazy loader for Documents.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load Documents and split into chunks.

__init__(
    path: str | PurePath,
    glob: str = '**/[!.]*.pdf',
    silent_errors: bool = False,
    load_hidden: bool = False,
    recursive: bool = False,
    extract_images: bool = False,
    *,
    password: str | None = None,
    mode: Literal['single', 'page'] = 'page',
    images_parser: BaseImageBlobParser | None = None,
    headers: dict | None = None,
    extraction_mode: Literal['plain', 'layout'] = 'plain',
    extraction_kwargs: dict | None = None,
)

Initialize with a directory path.

Parameters:

path (str | PurePath) – The path to the directory containing PDF files to be loaded.
glob (str) – The glob pattern to match files in the directory.
silent_errors (bool) – Whether to log errors instead of raising them.
load_hidden (bool) – Whether to include hidden files in the search.
recursive (bool) – Whether to search subdirectories recursively.
extract_images (bool) – Whether to extract images from PDFs.
password (str | None) – Optional password for opening encrypted PDFs.
mode (Literal['single', 'page']) – The extraction mode: “single” extracts the entire document as one unit, “page” extracts it page by page.
images_parser (BaseImageBlobParser | None) – Optional image blob parser.
headers (dict | None) – Optional headers to use for the GET request when downloading a file from a web path.
extraction_mode (Literal['plain', 'layout']) – “plain” for legacy functionality, “layout” for the experimental layout mode.
extraction_kwargs (dict | None) – Optional additional parameters for the extraction process.

Returns:

The constructor does not return data. Call the load method to retrieve parsed documents with their content and metadata.

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type: AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Return type: list[Document]

lazy_load() → Iterator[Document]

A lazy loader for Documents.

Return type: Iterator[Document]

load() → list[Document]

Load data into Document objects.

Return type: list[Document]

load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Returns: List of Documents.
Return type: list[Document]
