Google Cloud Document AI

Document AI is a document understanding platform from Google Cloud that transforms unstructured data from documents into structured data, making it easier to understand, analyze, and consume.

The module contains a PDF parser based on DocAI from Google Cloud.

To use this parser, install the langchain-google-community package with its docai extra:

%pip install --upgrade --quiet langchain-google-community[docai]

First, you need to set up a Google Cloud Storage (GCS) bucket and create your own Optical Character Recognition (OCR) processor, as described in the Google Cloud Document AI documentation.

GCS_OUTPUT_PATH should be a path to a folder on GCS (starting with gs://), and PROCESSOR_NAME should look like projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID or projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION_ID. You can obtain the processor name programmatically or copy it from the Prediction endpoint section of the Processor details tab in the Google Cloud Console.
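Because a malformed value only fails once a request reaches Google Cloud, it can help to check both values locally first. The following is a minimal sketch based only on the two name shapes described above; the helper names `is_valid_processor_name` and `is_gcs_path` are hypothetical and not part of any library, and the sample values are placeholders.

```python
import re

# Matches the two resource-name shapes described above:
#   projects/.../locations/.../processors/...
#   projects/.../locations/.../processors/.../processorVersions/...
_PROCESSOR_NAME_RE = re.compile(
    r"^projects/[^/]+/locations/[^/]+/processors/[^/]+"
    r"(/processorVersions/[^/]+)?$"
)


def is_valid_processor_name(name: str) -> bool:
    """Return True if the name matches one of the two expected shapes."""
    return _PROCESSOR_NAME_RE.match(name) is not None


def is_gcs_path(path: str) -> bool:
    """Return True if the path starts with gs:// and names a bucket."""
    return path.startswith("gs://") and len(path) > len("gs://")


# Placeholder values, for illustration only.
print(is_valid_processor_name("projects/123/locations/us/processors/abc"))
print(is_gcs_path("gs://my-bucket/docai-output"))
```

This only checks the shape of the strings; it does not confirm that the processor or bucket actually exists.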

from langchain_core.document_loaders.blob_loaders import Blob
from langchain_google_community import DocAIParser

Now, create a DocAIParser.

parser = DocAIParser(
    location="us", processor_name=PROCESSOR_NAME, gcs_output_path=GCS_OUTPUT_PATH
)

For this example, you can use an Alphabet earnings report that's uploaded to a public GCS bucket.


Wrap the document in a Blob and pass it to the lazy_parse() method to parse it.

blob = Blob(

We'll get one document per page, 11 in total:

docs = list(parser.lazy_parse(blob))

The approach above parses blobs end to end, one at a time. If you have many documents, it may be better to batch them together, and possibly to decouple starting the parsing from handling its results.

operations = parser.docai_parse([blob])
print([op.operation.name for op in operations])
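To decouple parsing from result handling across processes, one option is to persist the operation names and read them back later. The sketch below shows only that round trip; the helper names, the file name, and the sample operation name are hypothetical placeholders, and turning the stored names back into operation handles depends on your installed library version, so it is not shown here.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def save_operation_names(names: List[str], path: Path) -> None:
    """Write the operation names to a JSON file."""
    path.write_text(json.dumps(names))


def load_operation_names(path: Path) -> List[str]:
    """Read the operation names back from the JSON file."""
    return json.loads(path.read_text())


# Round trip with a placeholder operation name.
sample_names = ["projects/PROJECT_NUMBER/locations/us/operations/OPERATION_ID"]
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "docai_operations.json"
    save_operation_names(sample_names, state_file)
    restored = load_operation_names(state_file)

print(restored)
```

In a real workflow the file would live somewhere durable rather than in a temporary directory, so that a separate job can pick up the names after the batch has finished.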

You can check whether operations are finished:
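If you would rather wait for completion than check once, a small polling loop is one option. This is a minimal sketch, not the library's own mechanism: `wait_until_finished` is a hypothetical helper that accepts any zero-argument callable reporting whether work is still running, and the default interval and timeout are arbitrary.

```python
import time
from typing import Callable


def wait_until_finished(
    still_running: Callable[[], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> bool:
    """Poll until still_running() returns False.

    Returns True if the work finished, or False if the timeout elapsed first.
    """
    deadline = time.monotonic() + timeout_seconds
    while still_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


# Stand-in status check that reports "running" twice, then "finished".
statuses = iter([True, True, False])
finished = wait_until_finished(lambda: next(statuses), poll_seconds=0.0)
print(finished)
```

With the parser, the callable could be something like `lambda: parser.is_running(operations)`, assuming your installed version of the parser exposes an `is_running` method; check its API reference before relying on that name.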


And when they're finished, you can parse the results:

results = parser.get_results(operations)
print(results[0])
DocAIParsingResults(source_path='gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf', parsed_path='gs://vertex-pgt/test/run1/16447136779727347991/0')

And now we can finally generate Documents from parsed results:

docs = list(parser.parse_from_results(results))

