DedocBaseLoader#

class langchain_community.document_loaders.dedoc.DedocBaseLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)[source]#

Base Loader that uses dedoc (https://dedoc.readthedocs.io).

Loader enables extracting text, tables and attached files from the given file:
  • Text can be split by pages, dedoc tree nodes, textual lines

    (according to the split parameter).

  • Attached files (when with_attachments=True)

    are split according to the split parameter. For attachments, langchain Document object has an additional metadata field `type`=”attachment”.

  • Tables (when with_tables=True) are not split - each table corresponds to one

    langchain Document object. For tables, Document object has additional metadata fields type`=”table” and `text_as_html with table HTML representation.

Initialize with file path and parsing parameters.

Parameters:
  • file_path (str) – path to the file for processing

  • split (str) –

    type of document splitting into parts (each part is returned separately), default value “document” “document”: document text is returned as a single langchain Document

    object (don’t split)

    ”page”: split document text into pages (works for PDF, DJVU, PPTX, PPT,

    ODP)

    ”node”: split document text into tree nodes (title nodes, list item

    nodes, raw text nodes)

    ”line”: split document text into lines

  • with_tables (bool) – add tables to the result - each table is returned as a single langchain Document object

  • dedoc (Parameters used for document parsing via) –

    (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

    with_attachments: enable attached files extraction recursion_deep_attachments: recursion level for attached files

    extraction, works only when with_attachments==True

    pdf_with_text_layer: type of handler for parsing PDF documents,

    available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]

    language: language of the document for PDF without a textual layer and

    images, available options [“eng”, “rus”, “rus+eng” (default)], the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html

    pages: page slice to define the reading range for parsing PDF documents is_one_column_document: detect number of columns for PDF without

    a textual layer and images, available options [“true”, “false”, “auto” (default)]

    document_orientation: fix document orientation (90, 180, 270 degrees)

    for PDF without a textual layer and images, available options [“auto” (default), “no_change”]

    need_header_footer_analysis: remove headers and footers from the output

    result for parsing PDF and images

    need_binarization: clean pages background (binarize) for PDF without a

    textual layer and images

    need_pdf_table_analysis: parse tables for PDF without a textual layer

    and images

    delimiter: column separator for CSV, TSV files encoding: encoding of TXT, CSV, TSV

  • with_attachments (str | bool) –

  • recursion_deep_attachments (int) –

  • pdf_with_text_layer (str) –

  • language (str) –

  • pages (str) –

  • is_one_column_document (str) –

  • document_orientation (str) –

  • need_header_footer_analysis (str | bool) –

  • need_binarization (str | bool) –

  • need_pdf_table_analysis (str | bool) –

  • delimiter (str | None) –

  • encoding (str | None) –

Methods

__init__(file_path, *[, split, with_tables, ...])

Initialize with file path and parsing parameters.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazily load documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) None[source]#

Initialize with file path and parsing parameters.

Parameters:
  • file_path (str) – path to the file for processing

  • split (str) –

    type of document splitting into parts (each part is returned separately), default value “document” “document”: document text is returned as a single langchain Document

    object (don’t split)

    ”page”: split document text into pages (works for PDF, DJVU, PPTX, PPT,

    ODP)

    ”node”: split document text into tree nodes (title nodes, list item

    nodes, raw text nodes)

    ”line”: split document text into lines

  • with_tables (bool) – add tables to the result - each table is returned as a single langchain Document object

  • dedoc (Parameters used for document parsing via) –

    (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

    with_attachments: enable attached files extraction recursion_deep_attachments: recursion level for attached files

    extraction, works only when with_attachments==True

    pdf_with_text_layer: type of handler for parsing PDF documents,

    available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]

    language: language of the document for PDF without a textual layer and

    images, available options [“eng”, “rus”, “rus+eng” (default)], the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html

    pages: page slice to define the reading range for parsing PDF documents is_one_column_document: detect number of columns for PDF without

    a textual layer and images, available options [“true”, “false”, “auto” (default)]

    document_orientation: fix document orientation (90, 180, 270 degrees)

    for PDF without a textual layer and images, available options [“auto” (default), “no_change”]

    need_header_footer_analysis: remove headers and footers from the output

    result for parsing PDF and images

    need_binarization: clean pages background (binarize) for PDF without a

    textual layer and images

    need_pdf_table_analysis: parse tables for PDF without a textual layer

    and images

    delimiter: column separator for CSV, TSV files encoding: encoding of TXT, CSV, TSV

  • with_attachments (str | bool) –

  • recursion_deep_attachments (int) –

  • pdf_with_text_layer (str) –

  • language (str) –

  • pages (str) –

  • is_one_column_document (str) –

  • document_orientation (str) –

  • need_header_footer_analysis (str | bool) –

  • need_binarization (str | bool) –

  • need_pdf_table_analysis (str | bool) –

  • delimiter (str | None) –

  • encoding (str | None) –

Return type:

None

async alazy_load() AsyncIterator[Document]#

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() List[Document]#

Load data into Document objects.

Return type:

List[Document]

lazy_load() Iterator[Document][source]#

Lazily load documents.

Return type:

Iterator[Document]

load() List[Document]#

Load data into Document objects.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) List[Document]#

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered to be deprecated!

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

List[Document]