DedocAPIFileLoader

class langchain_community.document_loaders.dedoc.DedocAPIFileLoader(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)

Load files using the dedoc API. The file loader automatically detects the file type, even when the file has the wrong extension. By default, the loader calls a locally hosted dedoc API. More information about the dedoc API can be found in the dedoc documentation.

Please see the documentation of DedocBaseLoader to get more details.

Setup:

You don’t need to install the dedoc library to use this loader. Instead, the dedoc API must be running, for example in a Docker container. See the dedoc documentation for more details:

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
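
Before constructing the loader, it can help to confirm that something is listening at the API URL. The following is a minimal sketch using only the Python standard library. `dedoc_api_reachable` is a hypothetical helper, not part of langchain or dedoc, and it only checks that a TCP connection can be opened; it does not verify that the service behind the port is actually dedoc.

```python
import socket
from urllib.parse import urlsplit


def dedoc_api_reachable(url: str = "http://0.0.0.0:1231", timeout: float = 2.0) -> bool:
    """Hypothetical helper: return True if a TCP connection to the URL's host/port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    # Fall back to the scheme's conventional port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("dedoc API reachable:", dedoc_api_reachable())
```

If the check fails, the container started with the commands above is probably not running, or the port mapping differs from the loader's `url` argument.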

Instantiate:

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    file_path="example.pdf",
    # url=...,
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Output:

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load:

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Output:

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}
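
The two usage patterns above differ only in when the Document objects are materialized: `load()` returns a complete list, while `lazy_load()` returns an iterator that yields one Document at a time. The sketch below illustrates that relationship with a stub so it runs without a dedoc server. `Doc` and `StubLoader` are hypothetical stand-ins, not langchain classes, and the sketch makes no claim about when the real loader contacts the API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Doc:
    """Hypothetical stand-in for a langchain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class StubLoader:
    """Hypothetical loader that yields pre-parsed parts instead of calling an API."""

    def __init__(self, parts: List[str]) -> None:
        self.parts = parts
        self.yielded = 0  # counts how many documents have been produced so far

    def lazy_load(self) -> Iterator[Doc]:
        for index, text in enumerate(self.parts):
            self.yielded += 1
            yield Doc(page_content=text, metadata={"part": index})

    def load(self) -> List[Doc]:
        # Eager loading is simply the lazy iterator collected into a list.
        return list(self.lazy_load())


loader = StubLoader(["first", "second", "third"])
first = next(loader.lazy_load())  # only one document is produced here
all_docs = loader.load()          # all three documents are produced here
```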

Initialize with file path, API url and parsing parameters.

Parameters:
  • file_path (str) – path to the file for processing

  • url (str) – URL to call dedoc API

  • split (str) – type of document splitting into parts (each part is returned separately), default value “document”:

    “document”: document is returned as a single langchain Document object (don’t split)

    “page”: split document into pages (works for PDF, DJVU, PPTX, PPT, ODP)

    “node”: split document into tree nodes (title nodes, list item nodes, raw text nodes)

    “line”: split document into lines

  • with_tables (bool) – add tables to the result - each table is returned as a single langchain Document object

  The remaining parameters are used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

  • with_attachments (str | bool) – enable attached files extraction

  • recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True

  • pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]

  • language (str) – language of the document for PDF without a textual layer and images, available options [“eng”, “rus”, “rus+eng” (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html

  • pages (str) – page slice to define the reading range for parsing PDF documents

  • is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options [“true”, “false”, “auto” (default)]

  • document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options [“auto” (default), “no_change”]

  • need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images

  • need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images

  • need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images

  • delimiter (str | None) – column separator for CSV, TSV files

  • encoding (str | None) – encoding of TXT, CSV, TSV
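
Several of the parameters above accept only a fixed set of string values. A small pre-flight check can catch typos before a request is sent. This is a sketch under assumptions: `ALLOWED_VALUES` and `check_loader_options` are hypothetical names, the allowed sets are copied from the parameter descriptions above, and whether the loader or the dedoc API performs its own validation is not stated here.

```python
from typing import Any, Dict

# Allowed values copied from the parameter descriptions in this reference.
ALLOWED_VALUES = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_loader_options(**options: Any) -> Dict[str, Any]:
    """Hypothetical helper: reject values that are not listed for a known option."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Options without a documented value set (e.g. with_tables) pass through.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


options = check_loader_options(
    split="page",
    pdf_with_text_layer="auto_tabby",
    with_tables=False,
)
```

The returned dictionary can then be unpacked into the constructor, for example `DedocAPIFileLoader("example.pdf", **options)`.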

Methods

  • __init__(file_path, *[, url, split, ...]) – Initialize with file path, API url and parsing parameters.

  • alazy_load() – A lazy loader for Documents.

  • aload() – Load data into Document objects.

  • lazy_load() – Lazily load documents.

  • load() – Load data into Document objects.

  • load_and_split([text_splitter]) – Load Documents and split into chunks.

__init__(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) → None

Initialize with file path, API url and parsing parameters.

Return type:

None
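
Four parameters in the signature above are typed `str | bool` (with_attachments, need_header_footer_analysis, need_binarization, need_pdf_table_analysis), while other options use the strings “true” and “false”. How the loader converts these values internally is not described here. For callers who want a uniform string form, for instance when logging or comparing configurations, a hypothetical helper might look like this:

```python
from typing import Union


def as_option_string(value: Union[str, bool]) -> str:
    """Hypothetical helper: map Python booleans to 'true'/'false', pass strings through."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return value


normalized = {
    name: as_option_string(value)
    for name, value in {
        "with_attachments": False,
        "need_pdf_table_analysis": True,
        "pdf_with_text_layer": "auto_tabby",
    }.items()
}
```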

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → List[Document]

Load data into Document objects.

Return type:

List[Document]
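
The async methods follow the usual Python pattern: `alazy_load()` is consumed with `async for`, and `aload()` is awaited to obtain the full list. The sketch below shows both call styles on a stub so it runs without a dedoc server. `StubAsyncLoader` is a hypothetical stand-in that yields plain strings rather than Document objects.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


class StubAsyncLoader:
    """Hypothetical loader mirroring the alazy_load()/aload() call pattern."""

    def __init__(self, parts: List[str]) -> None:
        self.parts = parts

    async def alazy_load(self) -> AsyncIterator[str]:
        for text in self.parts:
            await asyncio.sleep(0)  # yield control to the event loop between items
            yield text

    async def aload(self) -> List[str]:
        # Eager async loading: collect everything the lazy iterator yields.
        return [doc async for doc in self.alazy_load()]


async def main() -> Tuple[List[str], List[str]]:
    loader = StubAsyncLoader(["a", "b"])
    streamed = []
    async for doc in loader.alazy_load():
        streamed.append(doc)
    collected = await loader.aload()
    return streamed, collected


streamed, collected = asyncio.run(main())
```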

lazy_load() → Iterator[Document]

Lazily load documents.

Return type:

Iterator[Document]

load() → List[Document]

Load data into Document objects.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated!

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

List[Document]
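
Conceptually, `load_and_split` loads the documents and then passes each one through a text splitter, returning the resulting chunks as Documents. The sketch below imitates that flow with a naive fixed-size splitter. This is an assumption-laden illustration: `split_fixed` and `load_and_split_sketch` are hypothetical, documents are plain dictionaries, and the real default splitter (RecursiveCharacterTextSplitter) uses a different strategy than fixed-size slicing.

```python
from typing import Dict, List


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Naive stand-in splitter: cut text into consecutive fixed-size pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def load_and_split_sketch(docs: List[Dict], chunk_size: int = 5) -> List[Dict]:
    """Hypothetical sketch: split each document and keep its metadata on every chunk."""
    chunks = []
    for doc in docs:
        for piece in split_fixed(doc["page_content"], chunk_size):
            chunks.append({"page_content": piece, "metadata": dict(doc["metadata"])})
    return chunks


docs = [{"page_content": "abcdefghijkl", "metadata": {"file_name": "example.pdf"}}]
chunks = load_and_split_sketch(docs, chunk_size=5)
```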

Examples using DedocAPIFileLoader