DocugamiLoader#
- class langchain_community.document_loaders.docugami.DocugamiLoader#
Bases: BaseLoader, BaseModel
Deprecated since version 0.0.24: Use docugami_langchain.DocugamiLoader instead.
Load from Docugami.
To use, you should have the dgml-utils Python package installed.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
- param access_token: str | None = None#
The Docugami API access token to use.
- param api: str = 'https://api.docugami.com/v1preview1'#
The Docugami API endpoint to use.
- param docset_id: str | None = None#
The Docugami API docset ID to use.
- param document_ids: Sequence[str] | None = None#
The Docugami API document IDs to use.
- param file_paths: Sequence[Path | str] | None = None#
The local file paths to use.
- param include_project_metadata_in_doc_metadata: bool = True#
Set to True if you want to include the project metadata in the doc metadata.
- param include_xml_tags: bool = False#
Set to True to include XML tags in the chunk output text.
- param max_metadata_length: int = 512#
Max length of metadata text returned.
- param max_text_length: int = 4096#
Max length of chunk text returned.
- param min_text_length: int = 32#
Threshold under which chunks are appended to the next chunk to avoid over-chunking.
- param parent_hierarchy_levels: int = 0#
Set appropriately to get parent chunks using the chunk hierarchy.
- param parent_id_key: str = 'doc_id'#
Metadata key for parent doc ID.
- param sub_chunk_tables: bool = False#
Set to True to return sub-chunks within tables.
- param whitespace_normalize_text: bool = True#
Set to False if you want to keep the full whitespace formatting of the original XML doc, including indentation.
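The effect of min_text_length can be pictured with a small standalone sketch. This is not the loader's actual implementation: the function name merge_short_chunks, the single-space separator, and the handling of a trailing short chunk are all assumptions made for illustration. It only shows the documented idea that a chunk below the threshold is folded into the chunk that follows it.

```python
from typing import List


def merge_short_chunks(chunks: List[str], min_text_length: int = 32) -> List[str]:
    """Fold chunks shorter than min_text_length into the chunk that follows.

    Illustrative only: the space separator and the handling of a trailing
    short chunk are assumptions, not documented DocugamiLoader behavior.
    """
    merged: List[str] = []
    carry = ""  # short text waiting to be prepended to the next chunk
    for chunk in chunks:
        text = f"{carry} {chunk}" if carry else chunk
        if len(text) < min_text_length:
            carry = text  # still too short, keep accumulating
        else:
            merged.append(text)
            carry = ""
    if carry:
        merged.append(carry)  # nothing follows, so emit what is left
    return merged


sample = ["Section 1", "This lease agreement is made between the parties.", "Signed"]
print(merge_short_chunks(sample))
```

Here the short heading "Section 1" is merged into the sentence after it, while the final "Signed" has no successor and is kept as its own chunk.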
- async alazy_load() → AsyncIterator[Document]#
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
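Because alazy_load() returns an AsyncIterator, it is consumed with async for rather than a plain for loop. The sketch below shows that pattern using only the standard library. fake_alazy_load is a hypothetical stand-in that yields plain strings, so the example runs without Docugami credentials; with a real loader the items would be Document objects.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in for loader.alazy_load(); yields strings, not Documents."""
    for text in ("first chunk", "second chunk"):
        await asyncio.sleep(0)  # give control back to the event loop
        yield text


async def collect(source: AsyncIterator[str]) -> List[str]:
    """Drain an async iterator one item at a time."""
    items: List[str] = []
    async for item in source:
        items.append(item)
    return items


docs = asyncio.run(collect(fake_alazy_load()))
print(docs)
```

Items are produced one at a time, so a caller can start processing early results without waiting for the whole set to load.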
- load_and_split(text_splitter: TextSplitter | None = None) → List[Document]#
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered to be deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
List[Document]
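The load-then-split flow described above can be sketched in plain Python. Everything here is a hypothetical stand-in: load_and_split_sketch works on strings instead of Documents, and split_fixed is a naive fixed-width splitter used in place of RecursiveCharacterTextSplitter, which behaves differently. The point is only the shape of the call: an optional splitter argument, a fallback default, and a flat list of pieces as the result.

```python
from typing import Callable, Iterable, List, Optional


def split_fixed(text: str, size: int = 20) -> List[str]:
    """Hypothetical default splitter: cut text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def load_and_split_sketch(
    docs: Iterable[str],
    splitter: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Split every loaded text and return all pieces in one flat list."""
    splitter = splitter or split_fixed  # fall back to the default when none is given
    pieces: List[str] = []
    for doc in docs:
        pieces.extend(splitter(doc))
    return pieces


print(load_and_split_sketch(["a" * 45]))
```

With the default window of 20 characters, a 45-character input comes back as three pieces of 20, 20, and 5 characters.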