UpstageDocumentParseParser

class langchain_upstage.document_parse_parsers.UpstageDocumentParseParser(
api_key: str | None = None,
base_url: str = 'https://api.upstage.ai/v1/document-ai/document-parse',
model: str = 'document-parse',
split: Literal['none', 'page', 'element'] = 'none',
ocr: Literal['auto', 'force'] = 'auto',
output_format: Literal['text', 'html', 'markdown'] = 'html',
coordinates: bool = True,
base64_encoding: List[Literal['paragraph', 'table', 'figure', 'header', 'footer', 'caption', 'equation', 'heading1', 'list', 'index', 'footnote', 'chart']] = [],
)

Upstage Document Parse Parser.

To use, you should have the environment variable UPSTAGE_API_KEY set with your API key or pass it as a named parameter to the constructor.

Example

from langchain_upstage import UpstageDocumentParseParser

parser = UpstageDocumentParseParser(split="page", output_format="text")
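The key lookup described above (explicit argument first, then the UPSTAGE_API_KEY environment variable) can be sketched as follows. This is only an illustration: `resolve_upstage_api_key` and `make_parser` are hypothetical helpers, not part of langchain_upstage, and the library's own lookup logic may differ in detail.

```python
import os
from typing import Optional


def resolve_upstage_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else the UPSTAGE_API_KEY environment variable.

    Hypothetical helper mirroring the documented behaviour; not a library function.
    """
    key = api_key or os.environ.get("UPSTAGE_API_KEY")
    if not key:
        raise ValueError("Set UPSTAGE_API_KEY or pass api_key explicitly.")
    return key


def make_parser(api_key: Optional[str] = None):
    """Build a parser with the same options as the example above."""
    # Imported lazily so resolve_upstage_api_key works without the package installed.
    from langchain_upstage import UpstageDocumentParseParser

    return UpstageDocumentParseParser(
        api_key=resolve_upstage_api_key(api_key),
        split="page",
        output_format="text",
    )
```

Failing early with a clear message when no key is available keeps the error close to the configuration mistake instead of surfacing it at the first API call.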

Initializes an instance of the UpstageDocumentParseParser class.

Parameters:
  • api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.

  • base_url (str, optional) – The base URL for accessing the Upstage API. Defaults to “https://api.upstage.ai/v1/document-ai/document-parse”.

  • model (str, optional) – The model to be used for the document parse. Defaults to “document-parse”.

  • split (SplitType, optional) – The type of splitting to be applied: “none”, “page”, or “element”. Defaults to “none” (no splitting).

  • ocr (OCRMode, optional) – Controls whether OCR is used to extract text. With “force”, OCR is applied to extract text from images in the document. With “auto”, text is extracted directly from the PDF, and an error occurs if the input is not in PDF format. Defaults to “auto”.

  • output_format (OutputFormat, optional) – Format of the inference results: “text”, “html”, or “markdown”. Defaults to “html”.

  • coordinates (bool, optional) – Whether to include OCR coordinates in the output. Defaults to True.

  • base64_encoding (List[Category], optional) – The categories of elements to be encoded in base64. Defaults to an empty list.
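As a rough illustration of the accepted values, the sketch below validates options against the Literal sets shown in the signature before they are passed to the constructor. `build_parser_kwargs` and the constant names are hypothetical and exist only for this example; the library may validate differently or not at all.

```python
from typing import Any, Dict, List, Optional

# Allowed values copied from the constructor signature above.
SPLIT_TYPES = ("none", "page", "element")
OCR_MODES = ("auto", "force")
OUTPUT_FORMATS = ("text", "html", "markdown")
CATEGORIES = (
    "paragraph", "table", "figure", "header", "footer", "caption",
    "equation", "heading1", "list", "index", "footnote", "chart",
)


def build_parser_kwargs(
    split: str = "none",
    ocr: str = "auto",
    output_format: str = "html",
    coordinates: bool = True,
    base64_encoding: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Check each option against its allowed values and return constructor kwargs."""
    if split not in SPLIT_TYPES:
        raise ValueError(f"split must be one of {SPLIT_TYPES}, got {split!r}")
    if ocr not in OCR_MODES:
        raise ValueError(f"ocr must be one of {OCR_MODES}, got {ocr!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {OUTPUT_FORMATS}, got {output_format!r}"
        )
    # Use a fresh list per call instead of sharing one mutable default.
    categories = list(base64_encoding or [])
    unknown = [c for c in categories if c not in CATEGORIES]
    if unknown:
        raise ValueError(f"unknown base64_encoding categories: {unknown}")
    return {
        "split": split,
        "ocr": ocr,
        "output_format": output_format,
        "coordinates": coordinates,
        "base64_encoding": categories,
    }
```

The returned dictionary could then be unpacked into the constructor, for example `UpstageDocumentParseParser(**build_parser_kwargs(split="page"))`.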

Methods

__init__([api_key, base_url, model, split, ...])

Initializes an instance of the UpstageDocumentParseParser class.

lazy_parse(blob[, is_batch])

Lazily parses a document and yields Document objects based on the specified split type.

parse(blob)

Eagerly parse the blob into a document or documents.

__init__(
api_key: str | None = None,
base_url: str = 'https://api.upstage.ai/v1/document-ai/document-parse',
model: str = 'document-parse',
split: Literal['none', 'page', 'element'] = 'none',
ocr: Literal['auto', 'force'] = 'auto',
output_format: Literal['text', 'html', 'markdown'] = 'html',
coordinates: bool = True,
base64_encoding: List[Literal['paragraph', 'table', 'figure', 'header', 'footer', 'caption', 'equation', 'heading1', 'list', 'index', 'footnote', 'chart']] = [],
)

Initializes an instance of the UpstageDocumentParseParser class.

Parameters:
  • api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.

  • base_url (str, optional) – The base URL for accessing the Upstage API. Defaults to “https://api.upstage.ai/v1/document-ai/document-parse”.

  • model (str, optional) – The model to be used for the document parse. Defaults to “document-parse”.

  • split (SplitType, optional) – The type of splitting to be applied: “none”, “page”, or “element”. Defaults to “none” (no splitting).

  • ocr (OCRMode, optional) – Controls whether OCR is used to extract text. With “force”, OCR is applied to extract text from images in the document. With “auto”, text is extracted directly from the PDF, and an error occurs if the input is not in PDF format. Defaults to “auto”.

  • output_format (OutputFormat, optional) – Format of the inference results: “text”, “html”, or “markdown”. Defaults to “html”.

  • coordinates (bool, optional) – Whether to include OCR coordinates in the output. Defaults to True.

  • base64_encoding (List[Category], optional) – The categories of elements to be encoded in base64. Defaults to an empty list.

lazy_parse(
blob: Blob,
is_batch: bool = False,
) → Iterator[Document]

Lazily parses a document and yields Document objects based on the specified split type.

Parameters:
  • blob (Blob) – The input document blob to parse.

  • is_batch (bool, optional) – Whether to parse the document in batches. Defaults to False (single-page parsing).

Yields:

Document – The parsed document object.

Raises:

ValueError – If an invalid split type is provided.

Return type:

Iterator[Document]
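A minimal sketch of consuming lazy_parse one Document at a time is shown below. `iter_page_texts` and `parse_pdf_lazily` are hypothetical helpers for this example; the `Blob` import path (`langchain_core.documents.base`) and `Blob.from_path` are assumed to match the installed langchain-core version.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_page_texts(parser: Any, blob: Any) -> Iterator[Tuple[Dict[str, Any], str]]:
    """Yield (metadata, page_content) pairs as the parser produces them."""
    for doc in parser.lazy_parse(blob):
        # Each Document is handled as soon as it is yielded; nothing is buffered here.
        yield doc.metadata, doc.page_content


def parse_pdf_lazily(path: str) -> None:
    """Print the size of each parsed page of a local file (needs a valid API key)."""
    # Imported lazily so iter_page_texts can be used without these packages.
    from langchain_core.documents.base import Blob  # assumed import path
    from langchain_upstage import UpstageDocumentParseParser

    parser = UpstageDocumentParseParser(split="page", output_format="text")
    blob = Blob.from_path(path)
    for metadata, text in iter_page_texts(parser, blob):
        print(metadata, len(text))
```

With split="page", each yielded Document is expected to correspond to one page, so a loop like this can begin processing before the whole result has been collected.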

parse(
blob: Blob,
) → list[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters:

blob (Blob) – Blob instance

Returns:

List of documents

Return type:

list[Document]
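The difference between the eager and lazy paths can be sketched as below. It assumes that parse simply collects everything lazy_parse yields, which is how the base parser class in langchain-core is understood to behave; `parse_eagerly` and `first_n_documents` are hypothetical helpers for illustration only.

```python
import itertools
from typing import Any, List


def parse_eagerly(parser: Any, blob: Any) -> List[Any]:
    """Collect every Document into a list (assumed equivalent to parser.parse(blob))."""
    return list(parser.lazy_parse(blob))


def first_n_documents(parser: Any, blob: Any, n: int) -> List[Any]:
    """Take only the first n Documents and stop pulling from the iterator afterwards."""
    return list(itertools.islice(parser.lazy_parse(blob), n))
```

The eager form holds every Document in memory at once, which is convenient in a notebook. The lazy form lets a caller stop early or stream results onward, which is why lazy_parse is the recommended choice for production code.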