DedocPDFLoader

class langchain_community.document_loaders.pdf.DedocPDFLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)

DedocPDFLoader is a document loader integration that loads PDF files using dedoc. The loader can automatically detect whether the textual layer of a PDF document is correct.

Note that the __init__ method supports parameters that differ from those of DedocBaseLoader.

Setup:

Install dedoc package.

pip install -U dedoc

Instantiate:

from langchain_community.document_loaders import DedocPDFLoader

loader = DedocPDFLoader(
    file_path="example.pdf",
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load:

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/pdf_handling.html):

with_attachments: enable attached files extraction
recursion_deep_attachments: recursion level for attached files extraction, works only when with_attachments==True
pdf_with_text_layer: type of handler for parsing, available options: [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]
language: language of the document for PDF without a textual layer, available options: [“eng”, “rus”, “rus+eng” (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages: page slice to define the reading range for parsing
is_one_column_document: detect number of columns for PDF without a textual layer, available options: [“true”, “false”, “auto” (default)]
document_orientation: fix document orientation (90, 180, 270 degrees) for PDF without a textual layer, available options: [“auto” (default), “no_change”]
need_header_footer_analysis: remove headers and footers from the output result
need_binarization: clean pages background (binarize) for PDF without a textual layer
need_pdf_table_analysis: parse tables for PDF without a textual layer
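As a hedged illustration of how these options can be combined, the sketch below collects settings one might pass for an English scanned PDF with no textual layer. The keyword names come from the class signature; the chosen values, the `SCANNED_PDF_OPTIONS` name, and the `load_scanned_pdf` helper are assumptions made for this example only.

```python
# Hypothetical option set for an English scanned PDF (values are illustrative).
SCANNED_PDF_OPTIONS = {
    "pdf_with_text_layer": "false",   # treat the PDF as having no textual layer
    "language": "eng",                # language used when there is no textual layer
    "is_one_column_document": "auto",
    "document_orientation": "auto",
    "need_binarization": True,        # clean the page background
    "need_pdf_table_analysis": True,  # parse tables
}


def load_scanned_pdf(file_path):
    """Load a PDF with the options above (hypothetical helper, requires dedoc)."""
    # Imported lazily so that defining this helper does not require dedoc.
    from langchain_community.document_loaders import DedocPDFLoader

    loader = DedocPDFLoader(file_path, **SCANNED_PDF_OPTIONS)
    return loader.load()
```

Calling `load_scanned_pdf("example.pdf")` would return a list of Document objects, assuming dedoc is installed and the file exists.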

Initialize with file path and parsing parameters.

Parameters:

file_path (str) – path to the file for processing
split (str) – type of document splitting into parts (each part is returned separately), default value “document”
    “document”: document text is returned as a single langchain Document object (don’t split)
    “page”: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)
    “node”: split document text into tree nodes (title nodes, list item nodes, raw text nodes)
    “line”: split document text into lines
with_tables (bool) – add tables to the result - each table is returned as a single langchain Document object

Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

with_attachments (str | bool) – enable attached files extraction
recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]
language (str) – language of the document for PDF without a textual layer and images, available options [“eng”, “rus”, “rus+eng” (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice to define the reading range for parsing PDF documents
is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options [“true”, “false”, “auto” (default)]
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options [“auto” (default), “no_change”]
need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images
need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images
delimiter (str | None) – column separator for CSV, TSV files
encoding (str | None) – encoding of TXT, CSV, TSV
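The `split` values listed above are plain strings, so a typo would only surface once the loader runs. The following is a minimal sketch of a wrapper that checks the value up front; `VALID_SPLITS` and `make_loader` are hypothetical names introduced here, not part of langchain_community.

```python
# The four split modes documented for the `split` parameter.
VALID_SPLITS = ("document", "page", "node", "line")


def make_loader(file_path, split="document", with_tables=True):
    """Build a DedocPDFLoader after validating `split` (hypothetical helper)."""
    if split not in VALID_SPLITS:
        raise ValueError(
            f"Unknown split {split!r}; expected one of {', '.join(VALID_SPLITS)}"
        )
    # Imported lazily so the validation above works even without dedoc installed.
    from langchain_community.document_loaders import DedocPDFLoader

    return DedocPDFLoader(file_path, split=split, with_tables=with_tables)
```

For example, `make_loader("example.pdf", split="page")` would yield one Document per page when loaded, while `split="chapter"` raises a ValueError before any parsing starts.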

Methods

`__init__`(file_path, *[, split, with_tables, ...])	Initialize with file path and parsing parameters.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`lazy_load`()	Lazily load documents.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load Documents and split into chunks.

__init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) → None

Initialize with file path and parsing parameters.

Parameters:

Same as the class parameters listed above.

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type: AsyncIterator[Document]
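Because `alazy_load()` returns an async iterator, it is consumed with `async for`. The sketch below shows one way to stop after a fixed number of documents; `collect_first` is a hypothetical helper that works with any object exposing an `alazy_load()` method.

```python
import asyncio


async def collect_first(loader, limit):
    """Collect at most `limit` documents from loader.alazy_load()."""
    docs = []
    if limit <= 0:
        return docs
    async for doc in loader.alazy_load():
        docs.append(doc)
        if len(docs) >= limit:
            break  # stop early instead of draining the iterator
    return docs
```

With a real loader this could be run as `asyncio.run(collect_first(loader, 5))`.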

async aload() → List[Document]

Load data into Document objects.

Return type: List[Document]

lazy_load() → Iterator[Document]

Lazily load documents.

Return type: Iterator[Document]
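Since `lazy_load()` returns an iterator, it can be combined with standard iterator tools rather than collected into a list first. As a small sketch, the hypothetical `preview` helper below uses `itertools.islice` to look at only the first few documents.

```python
from itertools import islice


def preview(loader, n=3, width=100):
    """Return the first `width` characters of the first `n` documents."""
    # islice stops pulling from the iterator once n items have been taken.
    return [doc.page_content[:width] for doc in islice(loader.lazy_load(), n)]
```

This is most useful together with a fine-grained `split` value such as “page” or “line”, where a loader yields many Document objects.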

load() → List[Document]

Load data into Document objects.

Return type: List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Returns: List of Documents.
Return type: List[Document]
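Given that this method is described as deprecated, one common alternative is to call `load()` and pass the result to a text splitter's `split_documents` method. The sketch below assumes that pattern; `load_chunks` is a hypothetical helper, and the splitter is supplied by the caller.

```python
def load_chunks(loader, text_splitter):
    """Load all documents, then split them into chunks with the given splitter."""
    documents = loader.load()
    # split_documents returns the chunks as a new list of Document objects.
    return text_splitter.split_documents(documents)
```

A caller might pass, for instance, a `RecursiveCharacterTextSplitter` from the `langchain_text_splitters` package as `text_splitter`.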

Examples using DedocPDFLoader

Dedoc