DedocFileLoader

class langchain_community.document_loaders.dedoc.DedocFileLoader( file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None, )

Document loader integration that loads files using dedoc.

The file loader automatically detects the file type, provided the file has the correct extension. The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1. See the documentation of DedocBaseLoader for more details.

Setup:

Install the dedoc package.

pip install -U dedoc

Instantiate:

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader(
    file_path="example.pdf",
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load:

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}
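
Because lazy_load() returns an iterator, a caller can stop early instead of building the full list, which matters mostly when split is "page", "node" or "line" and the loader yields many Documents. The sketch below assumes only that behaviour; take_first is a hypothetical helper and not part of LangChain or dedoc.

```python
from itertools import islice
from typing import Any, Iterable, List


def take_first(documents: Iterable[Any], n: int) -> List[Any]:
    """Consume at most n items from a lazily produced stream of documents."""
    return list(islice(documents, n))


# Usage with the loader instantiated above (requires dedoc to be installed):
# first_three = take_first(loader.lazy_load(), 3)
```

Items beyond the first n are never requested from the iterator.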

Initialize with file path and parsing parameters.

Parameters:

file_path (str) – path to the file for processing
split (str) – type of document splitting into parts (each part is returned separately), default value “document”:
“document”: document text is returned as a single langchain Document object (don’t split)
“page”: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)
“node”: split document text into tree nodes (title nodes, list item nodes, raw text nodes)
“line”: split document text into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object

Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

with_attachments (str | bool) – enable attached files extraction
recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]
language (str) – language of the document for PDF without a textual layer and images, available options [“eng”, “rus”, “rus+eng” (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice to define the reading range for parsing PDF documents
is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options [“true”, “false”, “auto” (default)]
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options [“auto” (default), “no_change”]
need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images
need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images
delimiter (str | None) – column separator for CSV, TSV files
encoding (str | None) – encoding of TXT, CSV, TSV
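
As an illustration of how these parameters combine, the following sketch collects options for a scanned, single-column English PDF and passes them to the constructor. The parameter names come from the list above; the chosen values, the SCANNED_PDF_OPTIONS name and the make_loader helper are assumptions made for the example. In particular, reading pdf_with_text_layer="false" as "treat the PDF as having no text layer" should be checked against the dedoc parameter documentation.

```python
from typing import Any, Dict

# Keyword arguments taken from the parameter list above; the values are
# illustrative choices, not recommended defaults.
SCANNED_PDF_OPTIONS: Dict[str, Any] = {
    "split": "page",                 # one Document per page
    "with_tables": True,             # each table is returned as its own Document
    "pdf_with_text_layer": "false",  # assumption: handle the PDF as a scan
    "language": "eng",
    "is_one_column_document": "true",
    "document_orientation": "auto",
    "need_binarization": True,
}


def make_loader(file_path: str, **overrides: Any):
    """Build a DedocFileLoader; the import is deferred so dedoc is only needed here."""
    from langchain_community.document_loaders import DedocFileLoader

    return DedocFileLoader(file_path, **{**SCANNED_PDF_OPTIONS, **overrides})


# Usage (hypothetical file name, requires dedoc to be installed):
# docs = make_loader("scan.pdf", language="rus+eng").load()
```

Keeping the options in a dictionary makes it easy to reuse one configuration across files and to override single values per call.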

Methods

`__init__`(file_path, *[, split, with_tables, ...])	Initialize with file path and parsing parameters.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`lazy_load`()	Lazily load documents.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load Documents and split into chunks.

__init__( file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None, ) → None

Initialize with file path and parsing parameters.

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Yields: the documents.
Return type: AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Returns: the documents.
Return type: list[Document]

lazy_load() → Iterator[Document]

Lazily load documents.

Return type: Iterator[Document]

load() → list[Document]

Load data into Document objects.

Returns: the documents.
Return type: list[Document]

load_and_split( text_splitter: TextSplitter | None = None, ) → list[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered to be deprecated!

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Raises: ImportError – If langchain-text-splitters is not installed and no text_splitter is provided.
Returns: List of Documents.
Return type: list[Document]
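
Given that load_and_split should be considered deprecated, one alternative is to call load() and apply a text splitter explicitly. The following is a minimal sketch of that approach, not the library's prescribed replacement. load_and_chunk is a hypothetical helper, the chunk sizes are arbitrary, and the import assumes the langchain-text-splitters package is installed.

```python
from typing import Any, List


def load_and_chunk(loader: Any, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[Any]:
    """Load documents, then split them with an explicitly constructed splitter."""
    # Deferred import: requires the langchain-text-splitters package.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(loader.load())


# Usage with a DedocFileLoader instance (requires dedoc to be installed):
# chunks = load_and_chunk(loader, chunk_size=500, chunk_overlap=50)
```

Constructing the splitter at the call site keeps the chunking parameters visible rather than relying on the default splitter.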

Examples using DedocFileLoader

Dedoc