Unstructured

The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.

Installation and Setup​

If you are using a loader that runs locally, follow these steps to install unstructured and its dependencies.

  • Install the Python SDK with pip install unstructured.
    • You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]".
    • To install the dependencies for all document types, use pip install "unstructured[all-docs]".
  • Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
    • libmagic-dev (filetype detection)
    • poppler-utils (images and PDFs)
    • tesseract-ocr (images and PDFs)
    • libreoffice (MS Office docs)
    • pandoc (EPUBs)
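
To see which of these system dependencies are already present, you can probe for them from Python. The sketch below is only an illustration: the executable names (pdftotext, tesseract, soffice, pandoc) are assumptions about what each package typically installs, and the helper check_system_dependencies is hypothetical, not part of unstructured.

```python
import ctypes.util
import shutil

# Assumed executable names typically installed by each package; adjust for your platform.
EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_system_dependencies() -> dict[str, bool]:
    """Report which optional system dependencies appear to be installed."""
    found = {name: shutil.which(exe) is not None for name, exe in EXECUTABLES.items()}
    # libmagic is a shared library rather than a command, so it is looked up differently.
    found["libmagic"] = ctypes.util.find_library("magic") is not None
    return found


if __name__ == "__main__":
    for name, present in check_system_dependencies().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

A "missing" entry only matters if you parse the document types listed next to that dependency.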

If you want to get up and running with less setup, you can run pip install unstructured and use UnstructuredAPIFileLoader or UnstructuredAPIFileIOLoader, which process your document using the hosted Unstructured API.

The Unstructured API requires an API key to make requests. You can request an API key here and start using it today. Check out the README here to get started making API calls. We'd love to hear your feedback; let us know how it goes in our community Slack. Stay tuned for improvements to both quality and performance! Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
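
As a minimal sketch of the hosted route, the snippet below reads the API key from the environment and passes it to UnstructuredAPIFileLoader. The environment variable name UNSTRUCTURED_API_KEY and the helpers api_loader_kwargs and load_via_api are assumptions made for illustration; only the loader class and its api_key and mode arguments come from langchain_community.

```python
import os


def api_loader_kwargs(env=None) -> dict:
    """Build keyword arguments for UnstructuredAPIFileLoader from the environment.

    UNSTRUCTURED_API_KEY is an assumed variable name, not one mandated by the library.
    """
    env = os.environ if env is None else env
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError("Set UNSTRUCTURED_API_KEY before calling the hosted API")
    return {"api_key": api_key, "mode": "elements"}


def load_via_api(file_path: str):
    """Send a local file to the hosted API and return LangChain documents."""
    # Imported lazily so the helper above works without langchain_community installed.
    from langchain_community.document_loaders import UnstructuredAPIFileLoader

    loader = UnstructuredAPIFileLoader(file_path, **api_loader_kwargs())
    return loader.load()
```

Keeping the key in the environment rather than in source code avoids committing it by accident.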

Data Loaders​

The primary use of unstructured is in data loaders.

UnstructuredAPIFileIOLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredAPIFileIOLoader

UnstructuredAPIFileLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredAPIFileLoader

UnstructuredCHMLoader​

CHM stands for Microsoft Compiled HTML Help.

See a usage example in the API documentation.

from langchain_community.document_loaders import UnstructuredCHMLoader

UnstructuredCSVLoader​

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

See a usage example.

from langchain_community.document_loaders import UnstructuredCSVLoader
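
To make the record-and-field structure concrete, here is a small sketch that parses CSV text with Python's standard csv module. It does not use UnstructuredCSVLoader; the sample data and the read_records helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: one header line followed by two records.
SAMPLE_CSV = "name,city\nAda,London\nGrace,New York\n"


def read_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per record, keyed by the header fields."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE_CSV)
print(records)
```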

UnstructuredEmailLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredEmailLoader

UnstructuredEPubLoader​

EPUB is an e-book file format that uses the “.epub” file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.

See a usage example.

from langchain_community.document_loaders import UnstructuredEPubLoader

UnstructuredExcelLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredExcelLoader

UnstructuredFileIOLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredFileIOLoader

UnstructuredFileLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredFileLoader
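
A minimal usage sketch, assuming a local file and the loader's mode argument ("single", "elements", or "paged"). The wrapper load_file is a hypothetical helper, not part of the library.

```python
VALID_MODES = {"single", "elements", "paged"}


def load_file(file_path: str, mode: str = "single"):
    """Load a local file into LangChain documents with UnstructuredFileLoader."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    # Imported lazily so that argument validation works even without the package installed.
    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(file_path, mode=mode)
    return loader.load()
```

Calling load_file("example.txt", mode="elements"), where example.txt is a placeholder path, should return one document per detected element rather than a single combined document.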

UnstructuredHTMLLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredHTMLLoader

UnstructuredImageLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredImageLoader

UnstructuredMarkdownLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredMarkdownLoader

UnstructuredODTLoader​

The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.

See a usage example.

from langchain_community.document_loaders import UnstructuredODTLoader
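
The "ZIP-compressed XML" layout can be illustrated with the standard library alone. The sketch below builds a toy archive shaped like an ODT container and reads its paragraph text back. It is not a complete, valid ODT file and does not use UnstructuredODTLoader; build_toy_archive and read_paragraphs are hypothetical helpers.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from xml.sax.saxutils import escape

# XML namespaces used by OpenDocument content.xml files.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def build_toy_archive(paragraph: str) -> bytes:
    """Create an in-memory ZIP shaped like an ODT container (not a full ODT)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text><text:p>{escape(paragraph)}</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry is stored uncompressed; content.xml is deflated.
        archive.writestr(
            "mimetype",
            "application/vnd.oasis.opendocument.text",
            compress_type=zipfile.ZIP_STORED,
        )
        archive.writestr("content.xml", content)
    return buffer.getvalue()


def read_paragraphs(data: bytes) -> list[str]:
    """Open the archive and pull paragraph text out of content.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return [p.text or "" for p in root.iter(f"{{{TEXT_NS}}}p")]


paragraphs = read_paragraphs(build_toy_archive("Hello from a toy document"))
print(paragraphs)
```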

UnstructuredOrgModeLoader​

Org Mode is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.

See a usage example.

from langchain_community.document_loaders import UnstructuredOrgModeLoader

UnstructuredPDFLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredPDFLoader

UnstructuredPowerPointLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredPowerPointLoader

UnstructuredRSTLoader​

A reStructuredText (RST) file is a file format for textual data used primarily in the Python programming language community for technical documentation.

See a usage example.

from langchain_community.document_loaders import UnstructuredRSTLoader

UnstructuredRTFLoader​

See a usage example in the API documentation.

from langchain_community.document_loaders import UnstructuredRTFLoader

UnstructuredTSVLoader​

A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.

See a usage example.

from langchain_community.document_loaders import UnstructuredTSVLoader
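
Because tabs separate the values, a comma inside a value needs no quoting. The sketch below shows this with the standard csv module and made-up sample data; it does not use UnstructuredTSVLoader, and read_tsv_rows is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample data: newline-separated records, tab-separated values.
SAMPLE_TSV = "name\tcity\nAda\tLondon, UK\nGrace\tNew York, NY\n"


def read_tsv_rows(text: str) -> list[list[str]]:
    """Split TSV text into rows of values using the tab character as delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = read_tsv_rows(SAMPLE_TSV)
print(rows)
```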

UnstructuredURLLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredURLLoader
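
A minimal usage sketch, assuming you have a list of web pages to load. The pre-filtering helper keep_http_urls and the wrapper load_urls are hypothetical; the urls and continue_on_failure arguments belong to the loader.

```python
from urllib.parse import urlparse


def keep_http_urls(urls: list[str]) -> list[str]:
    """Drop entries that are not absolute http(s) URLs before loading them."""
    kept = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme in {"http", "https"} and parsed.netloc:
            kept.append(url)
    return kept


def load_urls(urls: list[str]):
    """Fetch each URL and return the parsed content as LangChain documents."""
    # Imported lazily so the filter above works without langchain_community installed.
    from langchain_community.document_loaders import UnstructuredURLLoader

    # continue_on_failure=True logs a failing URL and moves on instead of raising.
    loader = UnstructuredURLLoader(urls=keep_http_urls(urls), continue_on_failure=True)
    return loader.load()
```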

UnstructuredWordDocumentLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

UnstructuredXMLLoader​

See a usage example.

from langchain_community.document_loaders import UnstructuredXMLLoader
