AmazonTextractPDFLoader

class langchain_community.document_loaders.pdf.AmazonTextractPDFLoader(file_path: str, textract_features: Sequence[str] | None = None, client: Any | None = None, credentials_profile_name: str | None = None, region_name: str | None = None, endpoint_url: str | None = None, headers: Dict | None = None, *, linearization_config: TextLinearizationConfig | None = None)

Load PDF files from a local file system, HTTP or S3.

To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

To use a specific credential profile, pass the name of that profile from the ~/.aws/credentials file as credentials_profile_name.

Make sure the credentials / roles used have the required policies to access the Amazon Textract service.

Example
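A minimal usage sketch, assuming langchain-community and the loader's optional dependencies are installed and AWS credentials are configured as described above. The helper name load_with_textract is hypothetical and exists only for illustration; the constructor arguments are the ones listed in the signature.

```python
from typing import Any, List, Optional


def load_with_textract(
    file_path: str,
    credentials_profile_name: Optional[str] = None,
    region_name: Optional[str] = None,
) -> List[Any]:
    """Run a file through Amazon Textract and return the loaded Documents."""
    # Imported lazily so that defining this helper does not require the
    # optional dependencies to be installed.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(
        file_path,
        credentials_profile_name=credentials_profile_name,
        region_name=region_name,
    )
    # load() returns the document as a list of pages.
    return loader.load()
```

For instance, load_with_textract("s3://my-bucket/sample.pdf", region_name="us-east-1") would load a PDF from S3; the bucket and key here are placeholders, not real resources.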

Initialize the loader.

Parameters:
  • file_path (str) – Local file path, URL, or S3 path of the input file

  • textract_features (Sequence[str] | None) – Features to use for extraction. Pass each feature as a str that matches a member of the Textract_Features enum; see the amazon-textract-caller package

  • client (Any | None) – boto3 Textract client (Optional)

  • credentials_profile_name (str | None) – AWS profile name, if not default (Optional)

  • region_name (str | None) – AWS region, e.g. us-east-1 (Optional)

  • endpoint_url (str | None) – Endpoint URL for the Textract service (Optional)

  • linearization_config (TextLinearizationConfig | None) – Config used to linearize the output; should be an instance of TextLinearizationConfig from the textractor package

  • headers (Dict | None)

Attributes

source

Methods

__init__(file_path[, textract_features, ...])

Initialize the loader.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazy load documents.

load()

Load given path as pages.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(file_path: str, textract_features: Sequence[str] | None = None, client: Any | None = None, credentials_profile_name: str | None = None, region_name: str | None = None, endpoint_url: str | None = None, headers: Dict | None = None, *, linearization_config: TextLinearizationConfig | None = None) → None

Initialize the loader.

Parameters:
  • file_path (str) – Local file path, URL, or S3 path of the input file

  • textract_features (Sequence[str] | None) – Features to use for extraction. Pass each feature as a str that matches a member of the Textract_Features enum; see the amazon-textract-caller package

  • client (Any | None) – boto3 Textract client (Optional)

  • credentials_profile_name (str | None) – AWS profile name, if not default (Optional)

  • region_name (str | None) – AWS region, e.g. us-east-1 (Optional)

  • endpoint_url (str | None) – Endpoint URL for the Textract service (Optional)

  • linearization_config (TextLinearizationConfig | None) – Config used to linearize the output; should be an instance of TextLinearizationConfig from the textractor package

  • headers (Dict | None)

Return type:

None
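The optional arguments can be assembled before constructing the loader. The sketch below is one way to do that, under stated assumptions: build_loader_kwargs, make_loader, and ASSUMED_FEATURES are hypothetical names, and the feature names are assumed to match the Textract_Features enum in amazon-textract-caller, which remains the authoritative list.

```python
from typing import Any, Dict, Optional, Sequence

# Assumption: these names mirror the Textract_Features enum members.
ASSUMED_FEATURES = {"FORMS", "TABLES", "QUERIES", "SIGNATURES", "LAYOUT"}


def build_loader_kwargs(
    textract_features: Optional[Sequence[str]] = None,
    credentials_profile_name: Optional[str] = None,
    region_name: Optional[str] = None,
    endpoint_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate feature names and drop options that were left unset."""
    if textract_features is not None:
        unknown = sorted(set(textract_features) - ASSUMED_FEATURES)
        if unknown:
            raise ValueError(f"Unknown Textract features: {unknown}")
    candidates = {
        "textract_features": (
            list(textract_features) if textract_features is not None else None
        ),
        "credentials_profile_name": credentials_profile_name,
        "region_name": region_name,
        "endpoint_url": endpoint_url,
    }
    # Keep only the options the caller actually provided.
    return {key: value for key, value in candidates.items() if value is not None}


def make_loader(file_path: str, **options: Any) -> Any:
    """Construct the loader from validated keyword arguments."""
    # Lazy import: the loader needs langchain-community and its extras.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    return AmazonTextractPDFLoader(file_path, **build_loader_kwargs(**options))
```

Validating the names up front is a design choice for this sketch: it surfaces a misspelled feature before any call is made to the service.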

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Return type:

list[Document]
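The two async methods can be consumed from an event loop roughly as follows. This is a sketch that works with any object exposing alazy_load() as an async iterator; collect_documents is a hypothetical helper name, not part of the library.

```python
import asyncio
from typing import Any, List


async def collect_documents(loader: Any) -> List[Any]:
    """Gather every Document yielded by the loader's async iterator."""
    # alazy_load() yields Documents one at a time; aload() would return
    # the same Documents as a single list.
    return [doc async for doc in loader.alazy_load()]
```

With a configured loader, asyncio.run(collect_documents(loader)) would drive the coroutine from synchronous code.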

lazy_load() → Iterator[Document]

Lazy load documents.

Return type:

Iterator[Document]
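Because lazy_load() returns an iterator, a caller can stop early instead of building the full list. The sketch below shows that pattern; first_pages is a hypothetical helper, and it makes no claim about when the loader itself contacts Textract.

```python
from itertools import islice
from typing import Any, List


def first_pages(loader: Any, limit: int) -> List[Any]:
    """Return at most `limit` Documents from the loader's iterator."""
    # islice stops pulling from the iterator once `limit` items are taken.
    return list(islice(loader.lazy_load(), limit))
```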

load() → List[Document]

Load given path as pages.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered to be deprecated!

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

list[Document]
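Since this method should be treated as deprecated, the same result can be sketched by calling load() and then the splitter's split_documents() explicitly. load_and_chunk is a hypothetical helper name; it assumes the splitter is a TextSplitter-style object with a split_documents() method.

```python
from typing import Any, List


def load_and_chunk(loader: Any, text_splitter: Any) -> List[Any]:
    """Load all Documents, then split them into chunks."""
    documents = loader.load()
    # Each chunk comes back as its own Document.
    return text_splitter.split_documents(documents)
```

A RecursiveCharacterTextSplitter instance, the default named above, would be a natural choice for text_splitter.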
