DocugamiLoader

class langchain_community.document_loaders.docugami.DocugamiLoader

Bases: BaseLoader, BaseModel

Deprecated since version 0.0.24: Use docugami_langchain.DocugamiLoader instead.

Load from Docugami.

To use this loader, install the dgml-utils Python package.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.
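A minimal construction sketch, assuming the langchain-community package is installed. The helper name build_remote_loader and the environment variable name DOCUGAMI_API_KEY are assumptions made for illustration. Because file_paths is marked required in the field list below, it is passed explicitly as None when loading from a remote docset.

```python
import os


def build_remote_loader(docset_id: str):
    """Return a DocugamiLoader for a remote docset (hypothetical helper)."""
    # Imported lazily so this snippet can be imported even where
    # langchain-community is not installed.
    from langchain_community.document_loaders.docugami import DocugamiLoader

    return DocugamiLoader(
        # Assumed variable name; any source for the token works.
        access_token=os.environ.get("DOCUGAMI_API_KEY"),
        docset_id=docset_id,
        # Assumption: None means "do not restrict to specific documents".
        document_ids=None,
        # Marked [Required] in the field list, so it is passed explicitly.
        file_paths=None,
    )
```

Calling build_remote_loader with your own docset ID and then load() on the result would return the parsed chunks as Document objects.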

param access_token: str | None = None

The Docugami API access token to use.

param api: str = 'https://api.docugami.com/v1preview1'

The Docugami API endpoint to use.

param docset_id: str | None = None

The Docugami API docset ID to use.

param document_ids: Sequence[str] | None = None

The Docugami API document IDs to use.

param file_paths: Sequence[Path | str] | None [Required]

The local file paths to use.

param include_project_metadata_in_doc_metadata: bool = True

Set to True if you want to include the project metadata in the doc metadata.

param include_xml_tags: bool = False

Set to True to include XML tags in the chunk output text.

param max_metadata_length: int = 512

Max length of metadata text returned.

param max_text_length: int = 4096

Max length of chunk text returned.

param min_text_length: int = 32

Threshold below which a chunk is appended to the next chunk to avoid over-chunking.
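The effect of this threshold can be illustrated with a small standalone sketch. It is not the loader's actual implementation: the function name merge_short_chunks and the single-space join are assumptions chosen only to show the idea of carrying short chunks forward.

```python
def merge_short_chunks(chunks: list[str], min_text_length: int = 32) -> list[str]:
    """Join chunks shorter than min_text_length onto the chunk that follows."""
    merged: list[str] = []
    pending = ""  # short text waiting to be joined to the next chunk
    for chunk in chunks:
        text = f"{pending} {chunk}".strip() if pending else chunk
        if len(text) < min_text_length:
            pending = text  # still too short: carry it forward
        else:
            merged.append(text)
            pending = ""
    if pending:
        merged.append(pending)  # trailing short text has no successor
    return merged


chunks = ["Title", "A" * 40, "end"]
print(merge_short_chunks(chunks))
```

Here the short heading "Title" is folded into the 40-character chunk after it, while the trailing "end" stays on its own because nothing follows it.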

param parent_hierarchy_levels: int = 0

Set appropriately to get parent chunks using the chunk hierarchy.

param parent_id_key: str = 'doc_id'

Metadata key for parent doc ID.

param sub_chunk_tables: bool = False

Set to True to return sub-chunks within tables.

param whitespace_normalize_text: bool = True

Set to False to preserve the full whitespace formatting of the original XML doc, including indentation.
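As a rough illustration of what whitespace normalization means, the sketch below collapses every run of whitespace to a single space. The helper normalize_whitespace is hypothetical, and the loader's own normalization may differ in detail.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs, and newlines to a single space."""
    return " ".join(text.split())


raw = "<Clause>\n    Term of   Lease\n</Clause>"
print(normalize_whitespace(raw))  # prints: <Clause> Term of Lease </Clause>
```

With normalization disabled, the indentation and line breaks in raw would be kept as they appear in the XML.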

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Return type:

list[Document]

lazy_load() → Iterator[Document]

A lazy loader for Documents.

Return type:

Iterator[Document]

load() → List[Document]

Load documents.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split them into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

list[Document]
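Since load_and_split should be considered deprecated, one alternative is to load and split in two explicit steps. The sketch below assumes the langchain-text-splitters package is installed. The helper name load_then_split and the chunk_size and chunk_overlap values are illustrative choices, not defaults taken from this page.

```python
def load_then_split(loader, chunk_size: int = 1000, chunk_overlap: int = 100):
    """Load all Documents from a loader, then split them in a separate step."""
    # Imported lazily; requires the langchain-text-splitters package.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # illustrative value
        chunk_overlap=chunk_overlap,  # illustrative value
    )
    # split_documents returns new Document objects, one per chunk.
    return splitter.split_documents(loader.load())
```

Keeping the two steps separate makes it easy to swap in a different splitter without touching the loader.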