DedocAPIFileLoader

class langchain_community.document_loaders.dedoc.DedocAPIFileLoader(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)

Load files using the dedoc API. The loader detects the file type automatically, even when the file has the wrong extension. By default, the loader calls a locally hosted dedoc API. More information about the dedoc API can be found in the dedoc documentation:

https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html

Please see the documentation of DedocBaseLoader to get more details.

Setup:

You don’t need to install the dedoc library to use this loader. Instead, the dedoc API must be running. You may use a Docker container for this purpose. Please see the dedoc documentation for more details:

https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
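Because the loader depends on a running API, it can help to confirm that something is listening on the configured port before loading files. The following is a minimal sketch using only the Python standard library. The helper name `dedoc_api_reachable` is hypothetical and not part of LangChain or dedoc, and the check only confirms that a TCP port accepts connections, not that the service behind it is dedoc.

```python
import socket
from urllib.parse import urlparse


def dedoc_api_reachable(url: str = "http://0.0.0.0:1231", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the host and port in `url` succeeds.

    Hypothetical helper: it checks only that the port is open.
    """
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
```

A caller might run this check once at startup and print a hint to start the Docker container when it returns False.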

Instantiate:

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    file_path="example.pdf",
    # url=...,
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load:

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}
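When `split` is set to "page", "node", or "line", each part is returned as a separate document, so it can be convenient to preview only the first few items from the iterator. The sketch below shows one way to do that with `itertools.islice`. The generator `fake_lazy_load` is a stand-in for `loader.lazy_load()` so that the example runs without a dedoc server, and `first_n` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def first_n(items: Iterable[T], n: int) -> List[T]:
    """Collect at most `n` items from any iterable, then stop iterating."""
    return list(islice(items, n))


def fake_lazy_load() -> Iterator[str]:
    """Stand-in for loader.lazy_load(); yields strings instead of Documents."""
    for i in range(1000):
        yield f"document {i}"


preview = first_n(fake_lazy_load(), 3)
print(preview)  # ['document 0', 'document 1', 'document 2']
```

With a real loader, the call would presumably be `first_n(loader.lazy_load(), 3)`.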

Initialize with file path, API url and parsing parameters.

Parameters:

file_path (str) – path to the file for processing
url (str) – URL to call dedoc API
split (str) – type of document splitting into parts (each part is returned separately), default value "document":
    "document": document is returned as a single langchain Document object (don’t split)
    "page": split document into pages (works for PDF, DJVU, PPTX, PPT, ODP)
    "node": split document into tree nodes (title nodes, list item nodes, raw text nodes)
    "line": split document into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object

Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

with_attachments (str | bool) – enable attached files extraction
recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options ["true", "false", "tabby", "auto", "auto_tabby" (default)]
language (str) – language of the document for PDF without a textual layer and images, available options ["eng", "rus", "rus+eng" (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice to define the reading range for parsing PDF documents
is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options ["true", "false", "auto" (default)]
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options ["auto" (default), "no_change"]
need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images
need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images
delimiter (str | None) – column separator for CSV, TSV files
encoding (str | None) – encoding of TXT, CSV, TSV
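Several of the string parameters above accept only a fixed set of options, so a typo such as "pages" instead of "page" is easy to make. The sketch below checks keyword arguments against the option lists documented here before they are passed to the loader. The names `ALLOWED_VALUES` and `check_loader_kwargs` are hypothetical, and nothing here implies that the loader performs such checks itself. `language` is left out because its list of options can be extended.

```python
from typing import Any, Dict

# Option lists copied from the parameter descriptions in this reference.
ALLOWED_VALUES = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_loader_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Raise ValueError for an unknown option; otherwise return kwargs unchanged."""
    for name, value in kwargs.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(
                f"{name}={value!r} is not one of {sorted(allowed)}"
            )
    return kwargs


params = check_loader_kwargs(split="page", pdf_with_text_layer="auto_tabby")
print(params)  # {'split': 'page', 'pdf_with_text_layer': 'auto_tabby'}
```

The checked dictionary could then be unpacked into the constructor, for example `DedocAPIFileLoader(file_path="example.pdf", **params)`.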

Methods

`__init__`(file_path, *[, url, split, ...])	Initialize with file path, API url and parsing parameters.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`lazy_load`()	Lazily load documents.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load Documents and split into chunks.

__init__(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) → None

Initialize with file path, API url and parsing parameters.

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type: AsyncIterator[Document]
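An `AsyncIterator` is consumed with `async for` inside a coroutine. The sketch below shows the pattern with a stand-in async generator, `fake_alazy_load`, in place of `loader.alazy_load()` so that it runs without a dedoc server. Both `fake_alazy_load` and `collect` are hypothetical names used only for illustration.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_alazy_load() -> AsyncIterator[str]:
    """Stand-in for loader.alazy_load(); yields strings instead of Documents."""
    for text in ("first part", "second part"):
        yield text


async def collect(parts: AsyncIterator[str]) -> List[str]:
    """Gather every item from an async iterator into a list."""
    return [part async for part in parts]


result = asyncio.run(collect(fake_alazy_load()))
print(result)  # ['first part', 'second part']
```

With a real loader, the equivalent call would presumably be `asyncio.run(collect(loader.alazy_load()))`.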

async aload() → List[Document]

Load data into Document objects.

Return type: List[Document]

lazy_load() → Iterator[Document]

Lazily load documents.

Return type: Iterator[Document]

load() → List[Document]

Load data into Document objects.

Return type: List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Returns: List of Documents.
Return type: List[Document]

Examples using DedocAPIFileLoader

Dedoc