PyPDFParser
- class langchain_community.document_loaders.parsers.pdf.PyPDFParser(password: str | bytes | None = None, extract_images: bool = False, *, mode: Literal['single', 'page'] = 'page', pages_delimiter: str = '\n\x0c', images_parser: BaseImageBlobParser | None = None, images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text', extraction_mode: Literal['plain', 'layout'] = 'plain', extraction_kwargs: dict[str, Any] | None = None)
Parse a blob from a PDF using the pypdf library.
This class parses a blob from a PDF document and supports several configurations, such as opening password-protected PDFs and extracting images. It uses the pypdf library for PDF processing and offers synchronous blob parsing.
- Examples:
Setup:
pip install -U langchain-community pypdf
Load a blob from a PDF file:
from langchain_core.documents.base import Blob

blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:
from langchain_community.document_loaders.parsers import PyPDFParser

parser = PyPDFParser(
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # images_parser=TesseractBlobParser(),
)
Lazily parse the blob:
docs = []
docs_lazy = parser.lazy_parse(blob)

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Initialize a parser based on PyPDF.
- Parameters:
password (Optional[Union[str, bytes]]) – Optional password for opening encrypted PDFs.
extract_images (bool) – Whether to extract images from the PDF.
mode (Literal['single', 'page']) – The extraction mode, either “single” for the entire document or “page” for page-wise extraction.
pages_delimiter (str) – A string delimiter to separate pages in single-mode extraction.
images_parser (Optional[BaseImageBlobParser]) – Optional image blob parser.
images_inner_format (Literal['text', 'markdown-img', 'html-img']) – The format for the parsed output.
- “text” = return the content as is
- “markdown-img” = wrap the content in a Markdown image link whose target is # (![body](#))
- “html-img” = wrap the content as the alt text of an <img> tag whose source is # (<img alt="{body}" src="#"/>)
extraction_mode (Literal['plain', 'layout']) – “plain” for legacy functionality; “layout” extracts text in a fixed-width format that closely adheres to the rendered layout in the source PDF.
extraction_kwargs (Optional[dict[str, Any]]) – Optional additional parameters for the extraction process.
- Raises:
ValueError – If the mode is not “single” or “page”.
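The interaction between mode and pages_delimiter can be illustrated with a small, self-contained sketch. It does not call the parser; combine_pages is a hypothetical helper written only for this illustration, and it assumes that “single” mode joins the per-page texts with pages_delimiter, as the parameter descriptions above suggest.

```python
from typing import List

PAGES_DELIMITER = "\n\x0c"  # same value as the documented default


def combine_pages(
    page_texts: List[str],
    mode: str = "page",
    pages_delimiter: str = PAGES_DELIMITER,
) -> List[str]:
    """Hypothetical helper: mimic how the two modes shape the output."""
    if mode not in ("single", "page"):
        # Mirrors the documented ValueError for an unsupported mode.
        raise ValueError("mode must be single or page")
    if mode == "page":
        return list(page_texts)  # one text per page
    return [pages_delimiter.join(page_texts)]  # one text for the whole file


pages = ["First page.", "Second page."]
per_page = combine_pages(pages, mode="page")
single = combine_pages(pages, mode="single")

# Splitting on the delimiter recovers the page boundaries.
recovered = single[0].split(PAGES_DELIMITER)
```

Because the default delimiter contains a form feed character, it is unlikely to occur inside ordinary page text, which is what makes splitting on it a reasonable way to recover page boundaries.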
Methods
__init__([password, extract_images, mode, ...]) – Initialize a parser based on PyPDF.
extract_images_from_page(page) – Extract images from a PDF page and get the text using images_to_text.
lazy_parse(blob) – Lazily parse the blob.
parse(blob) – Eagerly parse the blob into a document or documents.
- __init__(password: str | bytes | None = None, extract_images: bool = False, *, mode: Literal['single', 'page'] = 'page', pages_delimiter: str = '\n\x0c', images_parser: BaseImageBlobParser | None = None, images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text', extraction_mode: Literal['plain', 'layout'] = 'plain', extraction_kwargs: dict[str, Any] | None = None)
Initialize a parser based on PyPDF.
- Parameters:
password (str | bytes | None) – Optional password for opening encrypted PDFs.
extract_images (bool) – Whether to extract images from the PDF.
mode (Literal['single', 'page']) – The extraction mode, either “single” for the entire document or “page” for page-wise extraction.
pages_delimiter (str) – A string delimiter to separate pages in single-mode extraction.
images_parser (BaseImageBlobParser | None) – Optional image blob parser.
images_inner_format (Literal['text', 'markdown-img', 'html-img']) – The format for the parsed output.
- “text” = return the content as is
- “markdown-img” = wrap the content in a Markdown image link whose target is # (![body](#))
- “html-img” = wrap the content as the alt text of an <img> tag whose source is # (<img alt="{body}" src="#"/>)
extraction_mode (Literal['plain', 'layout']) – “plain” for legacy functionality; “layout” extracts text in a fixed-width format that closely adheres to the rendered layout in the source PDF.
extraction_kwargs (dict[str, Any] | None) – Optional additional parameters for the extraction process.
- Raises:
ValueError – If the mode is not “single” or “page”.
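The three images_inner_format values can be pictured with a short sketch. format_image_text is a hypothetical helper, not part of langchain_community, and the HTML escaping of the alt text is an assumption made here so that the output stays valid HTML; the parser's own behavior may differ.

```python
import html


def format_image_text(body: str, images_inner_format: str = "text") -> str:
    """Hypothetical helper: wrap text recovered from an image."""
    if images_inner_format == "text":
        return body  # content returned as is
    if images_inner_format == "markdown-img":
        return f"![{body}](#)"  # Markdown image link pointing to "#"
    if images_inner_format == "html-img":
        # Escaping is an assumption of this sketch, not documented behavior.
        return f'<img alt="{html.escape(body, quote=True)}" src="#"/>'
    raise ValueError(f"unknown images_inner_format: {images_inner_format!r}")


ocr_text = 'Figure 1: "Layout" model'
as_text = format_image_text(ocr_text, "text")
as_markdown = format_image_text(ocr_text, "markdown-img")
as_html = format_image_text(ocr_text, "html-img")
```

The Markdown and HTML forms keep the recovered text visibly tied to an image placeholder, which can help downstream consumers tell image-derived text apart from the page's own text.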
- extract_images_from_page(page: pypdf._page.PageObject) → str
Extract images from a PDF page and get the text using images_to_text.
- Parameters:
page (pypdf._page.PageObject) – The page object from which to extract images.
- Returns:
The extracted text from the images on the page.
- Return type:
str
- lazy_parse(blob: Blob) → Iterator[Document]
Lazily parse the blob. Where possible, image content is inserted between two paragraphs, so that a paragraph can continue onto the next page.
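What “lazily” means for the caller can be shown with a plain generator. This is a sketch only: Doc is a simplified, hypothetical stand-in for Document, lazy_parse_pages is not the parser's implementation, and the "page" metadata key is chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Doc:
    """Hypothetical, simplified stand-in for a Document."""
    page_content: str
    metadata: Dict[str, int] = field(default_factory=dict)


parsed_pages: List[int] = []  # records which pages were actually processed


def lazy_parse_pages(page_texts: List[str]) -> Iterator[Doc]:
    """Yield one Doc per page, doing the work only when it is requested."""
    for number, text in enumerate(page_texts):
        parsed_pages.append(number)
        yield Doc(page_content=text, metadata={"page": number})


docs_lazy = lazy_parse_pages(["Intro", "Methods", "Results"])
first = next(docs_lazy)               # only page 0 has been processed so far
pages_after_first = list(parsed_pages)
remaining = list(docs_lazy)           # consuming the iterator processes the rest
```

An iterator of this shape lets a caller stop early or stream documents into a pipeline without holding the whole file's output in memory, whereas an eager call such as parse(blob) returns the documents all at once.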