PyMuPDFParser

class langchain_community.document_loaders.parsers.pdf.PyMuPDFParser(text_kwargs: dict[str, Any] | None = None, extract_images: bool = False, *, password: str | None = None, mode: Literal['single', 'page'] = 'page', pages_delimiter: str = '\n\x0c', images_parser: BaseImageBlobParser | None = None, images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text', extract_tables: Literal['csv', 'markdown', 'html'] | None = None, extract_tables_settings: dict[str, Any] | None = None)

Parse a blob from a PDF using the PyMuPDF library.

This class parses a blob from a PDF document and supports options such as opening password-protected PDFs, extracting images, and choosing the extraction mode. It uses the PyMuPDF library for PDF processing and parses blobs synchronously.

Examples:

Setup:

pip install -U langchain-community pymupdf

Load a blob from a PDF file:

from langchain_core.documents.base import Blob

blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

Instantiate the parser:

from langchain_community.document_loaders.parsers import PyMuPDFParser

parser = PyMuPDFParser(
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # images_parser=TesseractBlobParser(),
    # extract_tables="markdown",
    # extract_tables_settings=None,
    # text_kwargs=None,
)

Lazily parse the blob:

docs = []
docs_lazy = parser.lazy_parse(blob)

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Initialize a parser based on PyMuPDF.

Parameters:
  • password (Optional[str]) – Optional password for opening encrypted PDFs.

  • mode (Literal['single', 'page']) – The extraction mode, either “single” for the entire document or “page” for page-wise extraction.

  • pages_delimiter (str) – A string delimiter to separate pages in single-mode extraction.

  • extract_images (bool) – Whether to extract images from the PDF.

  • images_parser (Optional[BaseImageBlobParser]) – Optional image blob parser.

  • images_inner_format (Literal['text', 'markdown-img', 'html-img']) – The format for the parsed image output.
      - “text”: return the content as is.
      - “markdown-img”: wrap the content in a Markdown image link whose target is #, i.e. ![body](#).
      - “html-img”: wrap the content as the alt text of an <img> tag whose source is #, i.e. <img alt="{body}" src="#"/>.

  • extract_tables (Union[Literal['csv', 'markdown', 'html'], None]) – Whether to extract tables in a specific format, such as “csv”, “markdown”, or “html”.

  • extract_tables_settings (Optional[dict[str, Any]]) – Optional dictionary of settings for customizing table extraction.

  • text_kwargs (Optional[dict[str, Any]]) – Optional keyword arguments passed to PyMuPDF's Page.get_text().
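
The three images_inner_format options can be illustrated with a small stand-alone sketch. The helper wrap_image_text below is hypothetical and is not part of langchain_community; it only mirrors the wrapping described for each option, and escaping the alt text with html.escape is an assumption made here for safety.

```python
import html


def wrap_image_text(body: str, inner_format: str = "text") -> str:
    """Hypothetical helper that mimics the documented images_inner_format options."""
    if inner_format == "text":
        # Return the parsed image content unchanged.
        return body
    if inner_format == "markdown-img":
        # Markdown image syntax: the content becomes the alt text, the target is "#".
        return f"![{body}](#)"
    if inner_format == "html-img":
        # HTML image tag: the content becomes the alt attribute (escaped by assumption).
        return f'<img alt="{html.escape(body, quote=True)}" src="#"/>'
    raise ValueError(f"Unknown inner format: {inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", wrap_image_text("a bar chart", fmt))
```

The actual output produced by an images_parser may differ in escaping and spacing; the sketch is only meant to show the shape of each format.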

Returns:

This method does not directly return data. Use the parse or lazy_parse methods to retrieve parsed documents with content and metadata.

Raises:
  • ValueError – If the mode is not “single” or “page”.

  • ValueError – If the extract_tables format is not “markdown”, “html”, or “csv”.
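
The two error conditions listed above can be sketched as a plain validation function. check_parser_options is a hypothetical stand-in written for illustration; it is not the library's code and its error messages are invented. It assumes, as the signature suggests, that extract_tables=None is accepted and simply disables table extraction.

```python
from typing import Optional

ALLOWED_MODES = ("single", "page")
ALLOWED_TABLE_FORMATS = ("csv", "markdown", "html")


def check_parser_options(mode: str = "page", extract_tables: Optional[str] = None) -> None:
    """Hypothetical check mirroring the documented ValueError conditions."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    # None is the default and is assumed to mean "do not extract tables".
    if extract_tables is not None and extract_tables not in ALLOWED_TABLE_FORMATS:
        raise ValueError(
            f"extract_tables must be one of {ALLOWED_TABLE_FORMATS} or None, got {extract_tables!r}"
        )


check_parser_options(mode="single", extract_tables="markdown")  # accepted

try:
    check_parser_options(mode="paragraph")
except ValueError as err:
    print("rejected:", err)
```

With the real parser, the same mistakes would be expected to surface when the parser is constructed, so wrapping construction in a try/except ValueError block is a reasonable way to report bad configuration early.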

Methods

__init__([text_kwargs, extract_images, ...])

Initialize a parser based on PyMuPDF.

lazy_parse(blob)

Lazy parsing interface.

parse(blob)

Eagerly parse the blob into a document or documents.

__init__(text_kwargs: dict[str, Any] | None = None, extract_images: bool = False, *, password: str | None = None, mode: Literal['single', 'page'] = 'page', pages_delimiter: str = '\n\x0c', images_parser: BaseImageBlobParser | None = None, images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text', extract_tables: Literal['csv', 'markdown', 'html'] | None = None, extract_tables_settings: dict[str, Any] | None = None) → None

Initialize a parser based on PyMuPDF.

Parameters:
  • password (str | None) – Optional password for opening encrypted PDFs.

  • mode (Literal['single', 'page']) – The extraction mode, either “single” for the entire document or “page” for page-wise extraction.

  • pages_delimiter (str) – A string delimiter to separate pages in single-mode extraction.

  • extract_images (bool) – Whether to extract images from the PDF.

  • images_parser (BaseImageBlobParser | None) – Optional image blob parser.

  • images_inner_format (Literal['text', 'markdown-img', 'html-img']) – The format for the parsed image output.
      - “text”: return the content as is.
      - “markdown-img”: wrap the content in a Markdown image link whose target is #, i.e. ![body](#).
      - “html-img”: wrap the content as the alt text of an <img> tag whose source is #, i.e. <img alt="{body}" src="#"/>.

  • extract_tables (Literal['csv', 'markdown', 'html'] | None) – Whether to extract tables in a specific format, such as “csv”, “markdown”, or “html”.

  • extract_tables_settings (dict[str, Any] | None) – Optional dictionary of settings for customizing table extraction.

  • text_kwargs (dict[str, Any] | None) – Optional keyword arguments passed to PyMuPDF's Page.get_text().
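
The role of pages_delimiter in “single” mode can be sketched without a PDF at hand. The snippet assumes that the single-mode text is the per-page text joined with the delimiter, which is what the parameter description suggests; the page strings are made-up sample data.

```python
# Default delimiter from the signature: newline followed by a form feed.
PAGES_DELIMITER = "\n\x0c"  # "\x0c" is the same character as "\f"

# Made-up per-page text standing in for what the parser would extract.
page_texts = ["Text of page 1", "Text of page 2", "Text of page 3"]

# "single" mode (by assumption): one string, pages separated by the delimiter.
single_document_text = PAGES_DELIMITER.join(page_texts)

# Splitting on the same delimiter recovers the page boundaries.
recovered_pages = single_document_text.split(PAGES_DELIMITER)

print(repr(single_document_text))
print(recovered_pages)
```

Choosing a delimiter that cannot occur in ordinary page text, such as one containing a form feed, keeps this round trip unambiguous.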

Returns:

This method does not directly return data. Use the parse or lazy_parse methods to retrieve parsed documents with content and metadata.

Raises:
  • ValueError – If the mode is not “single” or “page”.

  • ValueError – If the extract_tables format is not “markdown”, “html”, or “csv”.

Return type:

None

lazy_parse(blob: Blob) → Iterator[Document]

Lazy parsing interface.

Subclasses are required to implement this method.

Parameters:

blob (Blob) – Blob instance

Returns:

Generator of documents

Return type:

Iterator[Document]
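
The practical difference between a lazy iterator and an eager list can be sketched with a stand-in class. StandInParser is hypothetical and unrelated to PyMuPDF; it treats a list of strings as "pages" and counts how many it has processed. Its parse method simply materializes lazy_parse, which is an assumption about how the two methods relate.

```python
from itertools import islice
from typing import Iterator, List


class StandInParser:
    """Hypothetical parser: one output item per input 'page'."""

    def __init__(self) -> None:
        self.pages_parsed = 0

    def lazy_parse(self, pages: List[str]) -> Iterator[str]:
        for text in pages:
            # Work is done only when the caller asks for the next item.
            self.pages_parsed += 1
            yield text.upper()

    def parse(self, pages: List[str]) -> List[str]:
        # Eager variant: consume the whole iterator up front.
        return list(self.lazy_parse(pages))


pages = ["first page", "second page", "third page"]

lazy = StandInParser()
first_two = list(islice(lazy.lazy_parse(pages), 2))  # stop after two items

eager = StandInParser()
everything = eager.parse(pages)

print(first_two, lazy.pages_parsed)
print(everything, eager.pages_parsed)
```

In the sketch the lazy consumer processes only two pages while the eager call processes all three, which is why lazy iteration is usually preferred for large inputs.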

parse(blob: Blob) → list[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters:

blob (Blob) – Blob instance

Returns:

List of documents

Return type:

list[Document]