DedocFileLoader
- class langchain_community.document_loaders.dedoc.DedocFileLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)
DedocFileLoader document loader integration to load files using dedoc.
The file loader automatically detects the file type (the file must have the correct extension). The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1. See the documentation of DedocBaseLoader for more details.
- Setup:
Install the dedoc package.
pip install -U dedoc
- Instantiate:
from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader(
    file_path="example.pdf",
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)
- Load:
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}
- Lazy load:
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}
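Lazy loading is most useful when only part of the output is needed. The sketch below takes the first few documents from `lazy_load()` and stops. `take_first` and `FakeLoader` are hypothetical names introduced for illustration; `FakeLoader` only stands in for a real DedocFileLoader so the example runs without dedoc installed.

```python
from itertools import islice
from typing import Any, Iterator, List


def take_first(loader: Any, limit: int) -> List[Any]:
    """Return at most `limit` items from loader.lazy_load() without exhausting it."""
    return list(islice(loader.lazy_load(), limit))


class FakeLoader:
    """Hypothetical stand-in for DedocFileLoader (assumption: same lazy_load() shape)."""

    def __init__(self, parts: List[str]) -> None:
        self.parts = parts
        self.produced = 0  # counts how many items were actually generated

    def lazy_load(self) -> Iterator[str]:
        for part in self.parts:
            self.produced += 1
            yield part


loader = FakeLoader(["line 1", "line 2", "line 3", "line 4"])
first_two = take_first(loader, 2)
```

With a real loader, the same helper could be called as `take_first(DedocFileLoader("example.pdf", split="line"), 2)`.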
Initialize with file path and parsing parameters.
- Parameters:
file_path (str) – path to the file for processing
split (str) – type of document splitting into parts (each part is returned separately), default value “document”:
- “document”: document text is returned as a single langchain Document object (don’t split)
- “page”: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)
- “node”: split document text into tree nodes (title nodes, list item nodes, raw text nodes)
- “line”: split document text into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object

Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):
with_attachments (str | bool) – enable attached files extraction
recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options [“true”, “false”, “tabby”, “auto”, “auto_tabby” (default)]
language (str) – language of the document for PDF without a textual layer and images, available options [“eng”, “rus”, “rus+eng” (default)]; the list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice to define the reading range for parsing PDF documents
is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options [“true”, “false”, “auto” (default)]
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options [“auto” (default), “no_change”]
need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images
need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images
delimiter (str | None) – column separator for CSV, TSV files
encoding (str | None) – encoding of TXT, CSV, TSV
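Several parameters accept only a fixed set of string values. As a minimal sketch, the values listed above can be checked before the loader is constructed. `ALLOWED_OPTIONS`, `build_loader_kwargs`, and `load_scanned_pdf` are hypothetical helpers, not part of langchain_community or dedoc, and the particular combination of options chosen for a scanned PDF is an assumption for illustration.

```python
from typing import Any, Dict, List

# Option values copied from the parameter descriptions above.
# `language` is left out because its list of values can be extended.
ALLOWED_OPTIONS = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def build_loader_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: reject string options that are not documented."""
    for name, value in overrides.items():
        allowed = ALLOWED_OPTIONS.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return dict(overrides)


def load_scanned_pdf(file_path: str) -> List[Any]:
    """Load a PDF page by page; needs langchain_community and dedoc installed."""
    # Imported lazily so the validation helper works without dedoc.
    from langchain_community.document_loaders import DedocFileLoader

    kwargs = build_loader_kwargs(
        split="page",                 # one Document per page
        pdf_with_text_layer="false",  # assumption: the PDF has no textual layer
        language="eng",
        need_binarization=True,
    )
    return DedocFileLoader(file_path, **kwargs).load()


page_kwargs = build_loader_kwargs(split="page", pages=":")
```

Validating early gives a clear error message in the calling code instead of a failure inside the parsing step.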
Methods
__init__(file_path, *[, split, with_tables, ...]) – Initialize with file path and parsing parameters.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Lazily load documents.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) -> None
Initialize with file path and parsing parameters.
- Return type:
None
- async alazy_load() -> AsyncIterator[Document]
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
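Because `alazy_load()` returns an async iterator, it is consumed with `async for` inside a coroutine. The following is a minimal sketch; `collect_async` and `FakeAsyncLoader` are hypothetical names, and the fake loader only mimics the shape of the method so the example runs without dedoc installed.

```python
import asyncio
from typing import Any, AsyncIterator, List


async def collect_async(loader: Any) -> List[Any]:
    """Gather everything produced by loader.alazy_load() into a list."""
    docs = []
    async for doc in loader.alazy_load():
        docs.append(doc)
    return docs


class FakeAsyncLoader:
    """Hypothetical stand-in for DedocFileLoader (assumption: same alazy_load() shape)."""

    async def alazy_load(self) -> AsyncIterator[str]:
        for part in ("page 1", "page 2"):
            yield part


collected = asyncio.run(collect_async(FakeAsyncLoader()))
```

In an application that already runs an event loop, `await collect_async(loader)` would be used instead of `asyncio.run`.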
- load_and_split(text_splitter: TextSplitter | None = None) -> List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered to be deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
List[Document]
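Since this method should be considered deprecated, one alternative is to load first and then pass the result to a splitter's `split_documents` method. The sketch below shows that pattern with stand-ins; `load_then_split`, `FakeLoader`, and `FakeSplitter` are hypothetical names, and plain strings are used in place of Document objects so the example runs without any installed packages.

```python
from typing import Any, List


def load_then_split(loader: Any, text_splitter: Any) -> List[Any]:
    """Load everything, then let the splitter cut the result into chunks."""
    return text_splitter.split_documents(loader.load())


class FakeLoader:
    """Hypothetical stand-in for DedocFileLoader."""

    def load(self) -> List[str]:
        return ["abcdef"]


class FakeSplitter:
    """Hypothetical stand-in for a TextSplitter; cuts strings into fixed-size chunks."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def split_documents(self, documents: List[str]) -> List[str]:
        chunks = []
        for text in documents:
            for start in range(0, len(text), self.chunk_size):
                chunks.append(text[start:start + self.chunk_size])
        return chunks


chunks = load_then_split(FakeLoader(), FakeSplitter(chunk_size=3))
```

With real objects, `loader` would be a DedocFileLoader and `text_splitter` a TextSplitter instance such as RecursiveCharacterTextSplitter.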