DedocFileLoader

class langchain_community.document_loaders.dedoc.DedocFileLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)

DedocFileLoader document loader integration to load files using dedoc.

The loader detects the file type automatically, provided the file has the correct extension. The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1. See the documentation of DedocBaseLoader for more details.

Setup:

Install dedoc package.

pip install -U dedoc

Instantiate:

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader(
    file_path="example.pdf",
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load:

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Initialize with file path and parsing parameters.

Parameters:
  • file_path (str) – path to the file for processing

  • split (str) – type of document splitting into parts (each part is returned separately), default value “document”

    “document”: document text is returned as a single langchain Document object (don’t split)

    “page”: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)

    “node”: split document text into tree nodes (title nodes, list item nodes, raw text nodes)

    “line”: split document text into lines

  • with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object

  Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

  • with_attachments (str | bool) – enable attached files extraction

  • recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True

  • pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]

  • language (str) – language of the document for PDF without a textual layer and images, available options [“eng”, “rus”, “rus+eng” (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html

  • pages (str) – page slice to define the reading range for parsing PDF documents

  • is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options [“true”, “false”, “auto” (default)]

  • document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options [“auto” (default), “no_change”]

  • need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images

  • need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images

  • need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images

  • delimiter (str | None) – column separator for CSV, TSV files

  • encoding (str | None) – encoding of TXT, CSV, TSV
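
Several of the options above accept only a fixed set of strings, so a typo such as split="pages" is easy to make. The following is a minimal sketch of checking those strings before the loader is built. The helper names check_loader_kwargs and make_pdf_loader are hypothetical and not part of LangChain or dedoc, and the allowed values are copied from the parameter descriptions above rather than read from the library.

```python
from typing import Any, Dict

# Allowed values copied from the parameter descriptions above (an assumption:
# the library itself may accept more, e.g. extended language lists).
ALLOWED_VALUES = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_loader_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: reject option strings not listed in the docs above."""
    for name, value in kwargs.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return kwargs


def make_pdf_loader(file_path: str, **kwargs: Any):
    """Hypothetical helper: build a DedocFileLoader after checking the options."""
    # Imported lazily so the checks above can run without dedoc installed.
    from langchain_community.document_loaders import DedocFileLoader

    return DedocFileLoader(file_path, **check_loader_kwargs(**kwargs))


kwargs = check_loader_kwargs(split="page", pdf_with_text_layer="auto_tabby", pages=":")
print(kwargs)
```

With dedoc installed, make_pdf_loader("example.pdf", split="page") would then return a loader configured to yield one Document per page. Options not listed in ALLOWED_VALUES, such as language or pages, pass through unchecked.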

Methods

__init__(file_path, *[, split, with_tables, ...])

Initialize with file path and parsing parameters.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazily load documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) -> None

Initialize with file path and parsing parameters.

Return type:

None

async alazy_load() -> AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]
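
An AsyncIterator is consumed with async for inside a coroutine. The sketch below shows that consumption pattern with a toy stand-in class, so it runs without dedoc or a real file. ToyLoader and collect are hypothetical names, and the loader yields plain strings rather than Document objects.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class ToyLoader:
    """Stand-in for a document loader; yields plain strings instead of Documents."""

    def __init__(self, parts: List[str]) -> None:
        self.parts = parts

    def lazy_load(self) -> Iterator[str]:
        yield from self.parts

    async def alazy_load(self) -> AsyncIterator[str]:
        # Simplified: a real loader may hand blocking work to a worker thread.
        for part in self.lazy_load():
            await asyncio.sleep(0)  # yield control so other tasks can run
            yield part


async def collect(loader: ToyLoader) -> List[str]:
    # `async for` (here as an async comprehension) drives the async iterator.
    return [part async for part in loader.alazy_load()]


result = asyncio.run(collect(ToyLoader(["page 1", "page 2"])))
print(result)
```

With a real DedocFileLoader the same shape should apply: replace ToyLoader(...) with the loader instance and iterate over loader.alazy_load() inside a coroutine.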

async aload() -> List[Document]

Load data into Document objects.

Return type:

List[Document]

lazy_load() -> Iterator[Document]

Lazily load documents.

Return type:

Iterator[Document]
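
The practical difference between an Iterator and a List is that the caller of an iterator can stop early. The sketch below illustrates that general pattern with a toy class. CountingLoader is a hypothetical stand-in, not LangChain code, and this page does not state whether DedocFileLoader itself parses the file incrementally, so the example makes no claim about dedoc's parsing cost.

```python
from itertools import islice
from typing import Iterator, List


class CountingLoader:
    """Toy loader (not part of LangChain) that records how many items it produced."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.produced = 0

    def lazy_load(self) -> Iterator[str]:
        for i in range(self.total):
            self.produced += 1
            yield f"part {i}"

    def load(self) -> List[str]:
        # Eager variant: exhaust the lazy iterator into a list.
        return list(self.lazy_load())


lazy = CountingLoader(total=1000)
first_two = list(islice(lazy.lazy_load(), 2))  # stop after two items

eager = CountingLoader(total=1000)
everything = eager.load()  # produces all 1000 items

print(lazy.produced, eager.produced)
```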

load() -> List[Document]

Load data into Document objects.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) -> List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered to be deprecated!

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

List[Document]
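
Because this method is flagged as deprecated, one alternative is to do the two steps explicitly: load first, then split. The sketch below shows that flow with stand-ins so it runs without LangChain installed. Doc and split_docs are hypothetical, and the fixed-width splitter is a deliberately naive substitute for RecursiveCharacterTextSplitter, not a reimplementation of it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_docs(docs: List[Doc], chunk_size: int) -> List[Doc]:
    """Naive fixed-width splitter; each chunk gets a copy of its parent's metadata."""
    chunks: List[Doc] = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            text = doc.page_content[start:start + chunk_size]
            chunks.append(Doc(text, dict(doc.metadata)))
    return chunks


# Step 1 would be `docs = loader.load()`; a hand-made list stands in for it here.
docs = [Doc("abcdefghij", {"file_name": "example.pdf"})]

# Step 2: split the loaded documents into chunks.
chunks = split_docs(docs, chunk_size=4)
print([c.page_content for c in chunks])
```

With a real TextSplitter instance, step 2 would be a call to its split_documents(docs) method instead of split_docs.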
