UpstageLayoutAnalysisLoader

class langchain_upstage.layout_analysis.UpstageLayoutAnalysisLoader(file_path: str | Path | List[str] | List[Path], output_type: Literal['text', 'html'] | dict = 'html', split: Literal['none', 'element', 'page'] = 'none', api_key: str | None = None, use_ocr: bool | None = None, exclude: list = ['header', 'footer'])

Upstage Layout Analysis.

To use, you should have the environment variable UPSTAGE_API_KEY set with your API key or pass it as a named parameter to the constructor.

Example

from langchain_upstage import UpstageLayoutAnalysisLoader

file_path = "/PATH/TO/YOUR/FILE.pdf"
loader = UpstageLayoutAnalysisLoader(file_path, split="page", output_type="text")

# Parse the file into a list of Document objects.
docs = loader.load()
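The key lookup described above (explicit argument first, then the UPSTAGE_API_KEY environment variable) can be sketched as follows. `resolve_api_key` is a hypothetical helper written for illustration, not part of langchain_upstage; the loader performs its own lookup internally, and the error raised here is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else read UPSTAGE_API_KEY."""
    key = api_key or os.environ.get("UPSTAGE_API_KEY")
    if not key:
        # Assumption: fail early with a clear message when no key is available.
        raise ValueError(
            "Pass api_key or set the UPSTAGE_API_KEY environment variable."
        )
    return key
```

Calling such a helper before constructing the loader surfaces a missing key before any request is sent.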

Initializes an instance of the Upstage document loader.

Parameters:
  • file_path (Union[str, Path, List[str], List[Path]]) – The path, or list of paths, to the document(s) to be loaded.

  • output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.

  • split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).

  • api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.

  • use_ocr (bool, optional) – Whether to use OCR to extract text from images in the document. If True, OCR is used to extract text from images. If False, text is extracted directly from the PDF, and an error occurs if the input is not a PDF. Defaults to None, in which case the API's default policy applies. See https://developers.upstage.ai/docs/apis/layout-analysis#request-body.

  • exclude (list, optional) – Exclude specific elements from the output. Defaults to [“header”, “footer”].
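The use_ocr rule above (False is only valid for PDF input) can be checked on the client before a request is made. The sketch below is a hypothetical guard, not part of langchain_upstage. It assumes the file type can be judged from the file extension, whereas the API may inspect the file contents instead.

```python
from pathlib import Path
from typing import Optional, Union


def check_use_ocr(file_path: Union[str, Path], use_ocr: Optional[bool]) -> None:
    """Hypothetical pre-flight check mirroring the documented use_ocr rule."""
    # Assumption: a ".pdf" suffix is treated as a PDF input.
    is_pdf = Path(file_path).suffix.lower() == ".pdf"
    if use_ocr is False and not is_pdf:
        raise ValueError(f"use_ocr=False requires a PDF input, got {file_path!r}")
```

Leaving use_ocr as None passes this check for any file type, matching the documented default of deferring to the API's policy.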

Methods

__init__(file_path[, output_type, split, ...])

Initializes an instance of the Upstage document loader.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazily loads and parses the document using the UpstageLayoutAnalysisParser.

load()

Loads and parses the document using the UpstageLayoutAnalysisParser.

load_and_split([text_splitter])

Load Documents and split into chunks.

merge_and_split(documents[, splitter])

Merges the page content and metadata of multiple documents into a single document, or splits the documents using a custom splitter.

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Return type:

list[Document]
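A minimal sketch of consuming alazy_load() with `async for`, stopping early instead of materializing every document as aload() would. `FakeLoader` is a stand-in defined here so the example runs without an API key, and `collect_until` is a hypothetical helper; only the alazy_load() method name comes from the reference above.

```python
import asyncio
from typing import Any, AsyncIterator, List


class FakeLoader:
    """Stand-in exposing the same alazy_load() shape as the loader (demo only)."""

    async def alazy_load(self) -> AsyncIterator[str]:
        for page in ("page 1", "page 2", "page 3"):
            yield page


async def collect_until(loader: Any, limit: int) -> List[Any]:
    """Consume documents one at a time and stop after `limit` items."""
    docs: List[Any] = []
    if limit <= 0:
        return docs
    async for doc in loader.alazy_load():
        docs.append(doc)
        if len(docs) >= limit:
            break
    return docs


result = asyncio.run(collect_until(FakeLoader(), 2))
```

With a real loader, the same helper would receive Document objects rather than strings.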

lazy_load() → Iterator[Document]

Lazily loads and parses the document using the UpstageLayoutAnalysisParser.

Returns:

An iterator of Document objects representing the parsed layout analysis.

Return type:

Iterator[Document]
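Because lazy_load() returns an iterator, a caller can take only the documents it needs. The sketch below illustrates this with `itertools.islice`. `CountingLoader` is a stand-in defined for the demo (it is not the Upstage loader), and `take` is a hypothetical helper.

```python
from itertools import islice
from typing import Any, Iterator, List


class CountingLoader:
    """Stand-in whose lazy_load() records how many items it has produced."""

    def __init__(self) -> None:
        self.produced = 0

    def lazy_load(self) -> Iterator[str]:
        for i in range(100):
            self.produced += 1
            yield f"document {i}"


def take(loader: Any, n: int) -> List[Any]:
    """Pull only the first n documents from loader.lazy_load()."""
    return list(islice(loader.lazy_load(), n))


loader = CountingLoader()
first_two = take(loader, 2)
```

Here only two items are ever produced by the stand-in, whereas calling load() would build the full list first.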

load() → List[Document]

Loads and parses the document using the UpstageLayoutAnalysisParser.

Returns:

A list of Document objects representing the parsed layout analysis.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

list[Document]

merge_and_split(documents: List[Document], splitter: object | None = None) → List[Document]

Merges the page content and metadata of multiple documents into a single document, or splits the documents using a custom splitter.

Parameters:
  • documents (list) – A list of Document objects to be merged or split.

  • splitter (object, optional) – An optional splitter object that implements the split_documents method. If provided, the documents will be split using this splitter. Defaults to None, in which case the documents are merged.

Returns:

A list of Document objects. If no splitter is provided, the list contains a single Document with the merged content and combined metadata. If a splitter is provided, the list contains the Documents produced by the splitter.

Return type:

list

Raises:
  • AssertionError – If a splitter is provided but it does not implement the split_documents method.
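The documented behavior of merge_and_split can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: `Doc` is a minimal stand-in for a LangChain Document, contents are joined with a single space, and metadata dictionaries are combined with later documents overriding earlier keys. How the real method joins content and combines metadata is not specified in this reference.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (demo only)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def merge_and_split_sketch(
    documents: List[Doc], splitter: Optional[object] = None
) -> List[Doc]:
    if splitter is not None:
        # Documented contract: the splitter must implement split_documents.
        assert hasattr(splitter, "split_documents"), (
            "splitter must implement split_documents"
        )
        return splitter.split_documents(documents)

    # Assumption: join contents with a space; later metadata keys win.
    merged_metadata: Dict[str, Any] = {}
    for doc in documents:
        merged_metadata.update(doc.metadata)
    merged_content = " ".join(doc.page_content for doc in documents)
    return [Doc(page_content=merged_content, metadata=merged_metadata)]


docs = [Doc("first page", {"page": 1}), Doc("second page", {"source": "a.pdf"})]
merged = merge_and_split_sketch(docs)
```

Passing an object without a split_documents method triggers the AssertionError described above.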