PyPDFDirectoryLoader

class langchain_community.document_loaders.pdf.PyPDFDirectoryLoader(path: str | PurePath, glob: str = '**/[!.]*.pdf', silent_errors: bool = False, load_hidden: bool = False, recursive: bool = False, extract_images: bool = False, *, password: str | None = None, mode: Literal['single', 'page'] = 'page', images_parser: BaseImageBlobParser | None = None, headers: dict | None = None, extraction_mode: Literal['plain', 'layout'] = 'plain', extraction_kwargs: dict | None = None)

Load and parse a directory of PDF files using the pypdf library.

This class loads and parses multiple PDF documents from a directory. It supports recursive search, password-protected files, image extraction, and a choice of extraction modes. It uses the pypdf library for PDF processing and exposes both synchronous (load, lazy_load) and asynchronous (aload, alazy_load) loading methods.

Examples

Setup:

pip install -U langchain-community pypdf

Instantiate the loader:

from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader(
    path="./example_data/",
    glob="**/[!.]*.pdf",
    silent_errors=False,
    load_hidden=False,
    recursive=False,
    extract_images=False,
    password=None,
    mode="page",
    images_parser=None,  # keyword name as given in the class signature
    headers=None,
    extraction_mode="plain",
    # extraction_kwargs=None,
)

Load documents:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Load documents asynchronously:

# `await` needs an async context, such as a notebook cell or an `async def` function.
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Initialize with a directory path.

Parameters:
  • path (str | PurePath) – The path to the directory containing PDF files to be loaded.

  • glob (str) – The glob pattern to match files in the directory.

  • silent_errors (bool) – Whether to log errors instead of raising them.

  • load_hidden (bool) – Whether to include hidden files in the search.

  • recursive (bool) – Whether to search subdirectories recursively.

  • extract_images (bool) – Whether to extract images from PDFs.

  • password (str | None) – Optional password for opening encrypted PDFs.

  • mode (Literal['single', 'page']) – How documents are emitted: “single” returns each PDF as one document, “page” returns one document per page.

  • images_parser (BaseImageBlobParser | None) – Optional image blob parser.

  • headers (dict | None) – Optional headers to use for the GET request when downloading a file from a web path.

  • extraction_mode (Literal['plain', 'layout']) – “plain” for legacy behavior, “layout” for the experimental layout mode.

  • extraction_kwargs (dict | None) – Optional additional parameters for the extraction process.

Returns:

The constructor does not return data. Call the load method to retrieve the parsed documents with their content and metadata.
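To make the interaction between glob and load_hidden more concrete, the following sketch uses only pathlib and does not call the loader itself. The helpers match_pdfs and make_sample_tree are hypothetical, and treating any path component that starts with a dot as hidden is an assumption made for illustration, not something stated in this reference.

```python
import tempfile
from pathlib import Path


def match_pdfs(root, pattern="**/[!.]*.pdf", load_hidden=False):
    """Approximate which files a glob pattern plus a hidden-path filter selects."""
    base = Path(root)
    selected = []
    for path in base.glob(pattern):
        if not path.is_file():
            continue
        rel = path.relative_to(base)
        # Assumption: "hidden" means any path component starting with a dot.
        is_hidden = any(part.startswith(".") for part in rel.parts)
        if is_hidden and not load_hidden:
            continue
        selected.append(rel.as_posix())
    return sorted(selected)


def make_sample_tree(root):
    """Create a few empty files so the patterns have something to match."""
    base = Path(root)
    for name in ["a.pdf", "sub/b.pdf", ".hidden.pdf", ".cache/c.pdf", "notes.txt"]:
        target = base / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()


with tempfile.TemporaryDirectory() as tmp:
    make_sample_tree(tmp)
    print(match_pdfs(tmp))                                      # default pattern
    print(match_pdfs(tmp, load_hidden=True))                    # keep dot-directories
    print(match_pdfs(tmp, pattern="*.pdf", load_hidden=True))   # top level only
```

In this sketch the default pattern already rejects file names that begin with a dot, so a file such as .hidden.pdf only appears when both a broader pattern and load_hidden=True are used.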

Methods

__init__(path[, glob, silent_errors, ...])

Initialize with a directory path.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

A lazy loader for Documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(path: str | PurePath, glob: str = '**/[!.]*.pdf', silent_errors: bool = False, load_hidden: bool = False, recursive: bool = False, extract_images: bool = False, *, password: str | None = None, mode: Literal['single', 'page'] = 'page', images_parser: BaseImageBlobParser | None = None, headers: dict | None = None, extraction_mode: Literal['plain', 'layout'] = 'plain', extraction_kwargs: dict | None = None)

Initialize with a directory path.

Parameters:
  • path (str | PurePath) – The path to the directory containing PDF files to be loaded.

  • glob (str) – The glob pattern to match files in the directory.

  • silent_errors (bool) – Whether to log errors instead of raising them.

  • load_hidden (bool) – Whether to include hidden files in the search.

  • recursive (bool) – Whether to search subdirectories recursively.

  • extract_images (bool) – Whether to extract images from PDFs.

  • password (str | None) – Optional password for opening encrypted PDFs.

  • mode (Literal['single', 'page']) – How documents are emitted: “single” returns each PDF as one document, “page” returns one document per page.

  • images_parser (BaseImageBlobParser | None) – Optional image blob parser.

  • headers (dict | None) – Optional headers to use for the GET request when downloading a file from a web path.

  • extraction_mode (Literal['plain', 'layout']) – “plain” for legacy behavior, “layout” for the experimental layout mode.

  • extraction_kwargs (dict | None) – Optional additional parameters for the extraction process.

Returns:

The constructor does not return data. Call the load method to retrieve the parsed documents with their content and metadata.

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Return type:

list[Document]
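Because aload is a coroutine, it must be awaited inside an event loop. The sketch below shows one way a plain script might run several loaders concurrently and flatten the results. FakeLoader is a hypothetical stand-in that only mimics the aload() shape so the example runs without PDF files, and load_many and load_many_sync are illustrative helper names, not part of the library.

```python
import asyncio


class FakeLoader:
    """Hypothetical stand-in exposing an async aload() like the loader documented here."""

    def __init__(self, docs):
        self._docs = docs

    async def aload(self):
        await asyncio.sleep(0)  # yield control once, as real I/O would
        return list(self._docs)


async def load_many(loaders):
    """Await every loader's aload() concurrently and flatten the results in order."""
    results = await asyncio.gather(*(loader.aload() for loader in loaders))
    return [doc for docs in results for doc in docs]


def load_many_sync(loaders):
    """Entry point for synchronous code that has no running event loop."""
    return asyncio.run(load_many(loaders))


print(load_many_sync([FakeLoader(["a", "b"]), FakeLoader(["c"])]))
```

With real loaders, the same helper would be called as load_many_sync([loader_a, loader_b]), where each argument is a PyPDFDirectoryLoader pointed at a different directory. Inside a notebook or another running event loop, await load_many(...) directly instead of using asyncio.run.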

lazy_load() → Iterator[Document]

A lazy loader for Documents.

Return type:

Iterator[Document]
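A lazy iterator can be consumed one Document at a time, so summaries can be built without holding every page in a list. The sketch below counts documents per source file. pages_per_source is a hypothetical helper, the SimpleNamespace objects are stand-ins for Document, and the "source" metadata key is assumed here for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(documents):
    """Count documents per metadata["source"], consuming the iterator once."""
    counts = Counter()
    for doc in documents:
        # Assumption: each document carries a "source" entry in its metadata.
        counts[doc.metadata.get("source", "<unknown>")] += 1
    return dict(counts)


# Stand-in stream with the same attributes as Document, so no PDF files are needed.
fake_stream = (
    SimpleNamespace(page_content="...", metadata={"source": name, "page": page})
    for name, page_count in [("a.pdf", 2), ("b.pdf", 1)]
    for page in range(page_count)
)
print(pages_per_source(fake_stream))
```

With a real loader, the call would be pages_per_source(loader.lazy_load()).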

load() → list[Document]

Load data into Document objects.

Return type:

list[Document]

load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters:

text_splitter (TextSplitter | None) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

list[Document]
