PDFMinerParser

class langchain_community.document_loaders.parsers.pdf.PDFMinerParser(extract_images: bool = False, *, password: str | None = None, mode: Literal['single', 'page'] = 'single', pages_delimiter: str = '\n\x0c', images_parser: BaseImageBlobParser | None = None, images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text', concatenate_pages: bool | None = None)

Parse a blob from a PDF using the pdfminer.six library.

This class parses a blob from a PDF document and supports several configurations: opening password-protected PDFs, extracting images, and selecting the extraction mode. It uses the pdfminer.six library for PDF processing and offers synchronous blob parsing.
Examples:
Setup:
pip install -U langchain-community pdfminer.six pillow
Load a blob from a PDF file:
from langchain_core.documents.base import Blob

blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:
from langchain_community.document_loaders.parsers import PDFMinerParser

parser = PDFMinerParser(
    # password = None,
    mode = "single",
    pages_delimiter = "\n\f",
    # extract_images = True,
    # images_to_text = convert_images_to_text_with_tesseract(),
)

Lazily parse the blob:

docs = []
docs_lazy = parser.lazy_parse(blob)

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Initialize a parser based on PDFMiner.

Parameters:

password (Optional[str]) – Optional password for opening encrypted PDFs.
mode (Literal['single', 'page']) – Extraction mode to use. Either “single” to return the whole PDF as one document, or “page” for page-wise extraction.
pages_delimiter (str) – A string delimiter to separate pages in single-mode extraction.
extract_images (bool) – Whether to extract images from PDF.
images_inner_format (Literal['text', 'markdown-img', 'html-img']) – The format for the parsed output.
- “text” = return the content as is
- “markdown-img” = wrap the content in a Markdown image link: ![body](#)
- “html-img” = wrap the content as the alt text of an <img> tag: <img alt="{body}" src="#"/>
concatenate_pages (Optional[bool]) – Deprecated. If True, concatenate all PDF pages into a single document. Otherwise, return one document per page.
images_parser (Optional[BaseImageBlobParser]) – Optional image blob parser applied to extracted images.
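
The three images_inner_format values can be illustrated with a small stand-alone helper. This is a minimal sketch of the wrapping described in the parameter list, not the library's code: wrap_image_text is a hypothetical name, and the sketch does no escaping of the image text.

```python
def wrap_image_text(body: str, images_inner_format: str = "text") -> str:
    """Illustrative only: mimic the three documented output formats."""
    if images_inner_format == "text":
        # Return the parsed image content unchanged.
        return body
    if images_inner_format == "markdown-img":
        # Use the content as the label of a Markdown image link.
        return f"![{body}](#)"
    if images_inner_format == "html-img":
        # Use the content as the alt text of an <img> tag.
        return f'<img alt="{body}" src="#"/>'
    raise ValueError(f"Unsupported images_inner_format: {images_inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", wrap_image_text("Figure 1: pipeline overview", fmt))
```

The “markdown-img” and “html-img” forms keep the image text visibly separate from the surrounding page text, which may help downstream splitters or renderers treat it differently.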

Returns:

This method does not directly return data. Use the parse or lazy_parse methods to retrieve parsed documents with content and metadata.

Raises:

ValueError – If the mode is not “single” or “page”.

Warning

The concatenate_pages parameter is deprecated. Use `mode='single'` or `mode='page'` instead.

Methods

`__init__`([extract_images, password, mode, ...])	Initialize a parser based on PDFMiner.
`decode_text`(s)	Decodes a PDFDocEncoding string to Unicode.
`lazy_parse`(blob)	Lazily parse the blob.
`parse`(blob)	Eagerly parse the blob into a document or documents.
`resolve_and_decode`(obj)	Recursively resolve the metadata values.

__init__(extract_images: bool = False, *, password: str | None = None, mode: Literal['single', 'page'] = 'single', pages_delimiter: str = '\n\x0c', images_parser: BaseImageBlobParser | None = None, images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text', concatenate_pages: bool | None = None)

Initialize a parser based on PDFMiner.

Parameters:

password (str | None) – Optional password for opening encrypted PDFs.
mode (Literal['single', 'page']) – Extraction mode to use. Either “single” to return the whole PDF as one document, or “page” for page-wise extraction.
pages_delimiter (str) – A string delimiter to separate pages in single-mode extraction.
extract_images (bool) – Whether to extract images from PDF.
images_inner_format (Literal['text', 'markdown-img', 'html-img']) – The format for the parsed output.
- “text” = return the content as is
- “markdown-img” = wrap the content in a Markdown image link: ![body](#)
- “html-img” = wrap the content as the alt text of an <img> tag: <img alt="{body}" src="#"/>
concatenate_pages (bool | None) – Deprecated. If True, concatenate all PDF pages into a single document. Otherwise, return one document per page.
images_parser (BaseImageBlobParser | None) – Optional image blob parser applied to extracted images.
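
Because pages_delimiter separates pages in single-mode output, the page boundaries can in principle be recovered from the combined text afterwards. The following is a minimal sketch under two assumptions: the pages were joined with the default delimiter '\n\x0c' shown in the signature, and that sequence does not otherwise occur in the page text. split_pages and the sample strings are hypothetical.

```python
from typing import List

PAGES_DELIMITER = "\n\x0c"  # default value from the signature (newline + form feed)


def split_pages(single_mode_text: str, pages_delimiter: str = PAGES_DELIMITER) -> List[str]:
    """Split single-mode page_content back into per-page strings."""
    return single_mode_text.split(pages_delimiter)


# Stand-in for docs[0].page_content produced with mode="single".
text = PAGES_DELIMITER.join(["Page one text", "Page two text", "Page three text"])

pages = split_pages(text)
for number, page in enumerate(pages, start=1):
    print(number, repr(page))
```

If per-page documents are what you need, mode=“page” is the more direct route; the split above is only useful when the single-mode text is already in hand.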

Returns:

This method does not directly return data. Use the parse or lazy_parse methods to retrieve parsed documents with content and metadata.

Raises:

ValueError – If the mode is not “single” or “page”.

Warning

The concatenate_pages parameter is deprecated. Use `mode='single'` or `mode='page'` instead.

static decode_text(s: bytes | str) → str

Decodes a PDFDocEncoding string to Unicode. Adds py3 compatibility to pdfminer’s version.

Parameters: s (bytes | str) – The string to decode.
Returns: The decoded Unicode string.
Return type: str
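
To give a feel for what decoding a PDF text string involves, here is a simplified sketch. It is not the library's implementation: decode_text_sketch is a hypothetical name, and Latin-1 stands in for pdfminer's PDFDocEncoding table, an approximation that holds for printable ASCII but not necessarily for other bytes. The UTF-16BE branch follows the PDF convention that such strings start with the byte order mark FE FF.

```python
from typing import Union


def decode_text_sketch(s: Union[bytes, str]) -> str:
    """Simplified illustration of decoding a PDF text string to Unicode."""
    if isinstance(s, bytes):
        if s.startswith(b"\xfe\xff"):
            # UTF-16BE string: drop the two-byte BOM and decode the rest.
            return s[2:].decode("utf-16-be", errors="ignore")
        # Assumption: Latin-1 as a rough stand-in for PDFDocEncoding.
        return s.decode("latin-1")
    # Already a Python string: nothing to decode in this sketch.
    return s


print(decode_text_sketch(b"\xfe\xff\x00H\x00i"))
print(decode_text_sketch(b"Annual Report"))
```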

lazy_parse(blob: Blob) → Iterator[Document]

Lazily parse the blob. Where possible, images are inserted between two paragraphs, so that a paragraph can continue onto the next page without being interrupted.

Parameters: blob (Blob) – The blob to parse.
Raises: ImportError – If the pdfminer.six or pillow package is not found.
Yields: An iterator over the parsed documents.
Return type: Iterator[Document]
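
Since lazy_parse returns an iterator, a caller can stop after the first few documents without materializing the rest. The sketch below shows the pattern with itertools.islice; first_n and fake_lazy_parse are hypothetical stand-ins so that the example runs without a PDF or pdfminer.six installed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def first_n(items: Iterable[T], n: int) -> List[T]:
    """Take at most n items from an iterable without consuming the rest."""
    return list(islice(items, n))


def fake_lazy_parse() -> Iterator[str]:
    """Stand-in for parser.lazy_parse(blob): yields one item per page."""
    for page in range(1, 1001):
        yield f"page {page}"


preview = first_n(fake_lazy_parse(), 3)
print(preview)
```

With a real parser the call would look like first_n(parser.lazy_parse(blob), 3). This is mainly useful with mode=“page”, because single mode is documented as concatenating all pages into one document.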

parse(blob: Blob) → list[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters: blob (Blob) – Blob instance
Returns: List of documents
Return type: list[Document]
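
The relationship between the eager and lazy methods can be sketched with a toy class. This assumes that parse simply materializes the iterator returned by lazy_parse, which is what “eagerly” suggests. ToyParser is a hypothetical stand-in and not a LangChain class.

```python
from typing import Iterator, List


class ToyParser:
    """Hypothetical stand-in showing the lazy/eager split."""

    def lazy_parse(self, blob: str) -> Iterator[str]:
        # Yield one result at a time; nothing is computed until iterated.
        for line in blob.splitlines():
            yield line.upper()

    def parse(self, blob: str) -> List[str]:
        # Eager convenience wrapper: consume the whole iterator into a list.
        return list(self.lazy_parse(blob))


parser = ToyParser()
print(parser.parse("first page\nsecond page"))
```

Under this design a subclass only needs to implement lazy_parse, which fits the advice above that subclasses should not override parse.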

static resolve_and_decode(obj: Any) → Any

Recursively resolve the metadata values.

Parameters: obj (Any) – The object to resolve and decode. It can be of any type.
Returns: The resolved and decoded object.
Return type: Any
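
Recursive resolution of metadata values might look roughly like the sketch below. It is an illustration under stated assumptions, not the library's code: FakeRef is a hypothetical stand-in for an indirect PDF object reference exposing a resolve() method, resolve_and_decode_sketch is a hypothetical name, and bytes are decoded as Latin-1 rather than with pdfminer's PDFDocEncoding table.

```python
from typing import Any


class FakeRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def resolve(self) -> Any:
        return self._target


def resolve_and_decode_sketch(obj: Any) -> Any:
    """Resolve references, then decode bytes and recurse into containers."""
    while hasattr(obj, "resolve"):
        obj = obj.resolve()
    if isinstance(obj, bytes):
        # Assumption: Latin-1 as a rough stand-in for PDFDocEncoding.
        return obj.decode("latin-1")
    if isinstance(obj, list):
        return [resolve_and_decode_sketch(item) for item in obj]
    if isinstance(obj, dict):
        return {key: resolve_and_decode_sketch(value) for key, value in obj.items()}
    # Numbers, strings, None and other values pass through unchanged.
    return obj


raw_metadata = {"Title": FakeRef(b"Report"), "Keywords": [b"pdf", "parser"], "Pages": 3}
clean_metadata = resolve_and_decode_sketch(raw_metadata)
print(clean_metadata)
```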