The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the
unstructured ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, follow these steps to install
unstructured and its dependencies on your machine.
Install the Python SDK with
pip install "unstructured[local-inference]"
Install the following system dependencies if they are not already available on your system. Depending on what document types you’re parsing, you may not need all of these.
poppler-utils (images and PDFs)
tesseract-ocr (images and PDFs)
libreoffice (MS Office docs)
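Before running a local loader, it can help to check which of these system packages are already available. The following is a minimal sketch that looks for one executable per package on your PATH. The executable names (pdftoppm, tesseract, soffice) are assumptions about what each package typically installs and may differ on your platform; find_missing_packages is a hypothetical helper, not part of unstructured or LangChain.

```python
import shutil

# One executable per system package. These names are assumptions about what
# each package typically installs; adjust them for your platform if needed.
REQUIRED_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}


def find_missing_packages(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return [package for package, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Possibly missing system packages:", ", ".join(missing))
    else:
        print("All checked executables were found on PATH.")
```

A package reported as missing only matters if you parse the document types it supports.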
If you want to get up and running with less setup, you can
pip install unstructured and use
UnstructuredAPIFileIOLoader, which processes your document using the hosted Unstructured API.
Note that currently (as of 1 May 2023) the Unstructured API is open, but it will soon require
an API key. The Unstructured documentation page will have
instructions on how to generate an API key once they’re available. Check out the instructions
if you’d like to self-host the Unstructured API or run it locally.
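As a rough sketch of the hosted-API route, the helper below opens a file and passes the file object to UnstructuredAPIFileIOLoader. The helper name load_via_api and the example file name are hypothetical, and the api_key keyword is assumed to match the loader's constructor in your installed langchain version; an empty key is assumed to work only while the API remains open.

```python
def load_via_api(path, api_key=""):
    """Load one document through the hosted Unstructured API.

    `api_key` is assumed to be accepted by the loader's constructor; leave it
    empty while the hosted API does not require a key.
    """
    # Imported lazily so that defining this helper does not require langchain.
    from langchain.document_loaders import UnstructuredAPIFileIOLoader

    # The loader reads from an open file object, so load() is called while
    # the file is still open.
    with open(path, "rb") as f:
        loader = UnstructuredAPIFileIOLoader(f, api_key=api_key)
        return loader.load()


# Example call (requires langchain, unstructured, and network access):
# docs = load_via_api("state_of_the_union.txt")
```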
The unstructured wrappers within
langchain are data loaders. The following
shows how to use the most basic unstructured data loader. There are other file-specific
data loaders available in the langchain.document_loaders module.
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("state_of_the_union.txt")
loader.load()
If you instantiate the loader with
UnstructuredFileLoader(mode="elements"), the loader
will track additional metadata like the page number and text type (e.g. title, narrative text)
when that information is available.
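To illustrate what the elements mode gives you, here is a minimal sketch that loads a file in that mode and counts the resulting documents by text type. The helper names load_elements and count_by_category are hypothetical, and the "category" metadata key is an assumption about how the loader records the text type; check the metadata of your own documents before relying on it.

```python
from collections import Counter


def load_elements(path):
    """Load a file with one document per element instead of one per file."""
    # Imported lazily so that defining this helper does not require langchain.
    from langchain.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(path, mode="elements")
    return loader.load()


def count_by_category(docs):
    """Count documents by text type.

    Assumes the text type is stored under the "category" metadata key;
    documents without that key are counted as "unknown".
    """
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


# Example call (requires langchain and unstructured):
# print(count_by_category(load_elements("state_of_the_union.txt")))
```

Using .get() keeps the summary working for elements where the information is not available.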