document_loaders
Document Loaders are classes that load Documents.
They are typically used to load many Documents in a single run.
Class hierarchy:
BaseLoader --> <name>Loader # Examples: TextLoader, UnstructuredFileLoader
Main helpers:
Document, <name>TextSplitter
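The hierarchy above can be illustrated with a minimal sketch. It uses only the standard library: the `Document` and `BaseLoader` classes here are simplified stand-ins written for this example, not the library's own implementations, and `LineLoader` and `demo` are hypothetical names. The sketch assumes the usual loader contract, in which a subclass implements `lazy_load()` and `load()` collects the results into a list.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator, List


@dataclass
class Document:
    """Simplified stand-in for the Document helper (text plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseLoader:
    """Simplified stand-in for the BaseLoader interface."""

    def lazy_load(self) -> Iterator[Document]:
        raise NotImplementedError

    def load(self) -> List[Document]:
        # Eager loading is the lazy iterator collected into a list.
        return list(self.lazy_load())


class LineLoader(BaseLoader):
    """Hypothetical <name>Loader: yields one Document per non-empty line."""

    def __init__(self, file_path: str, encoding: str = "utf-8") -> None:
        self.file_path = file_path
        self.encoding = encoding

    def lazy_load(self) -> Iterator[Document]:
        with open(self.file_path, encoding=self.encoding) as handle:
            for number, line in enumerate(handle, start=1):
                text = line.strip()
                if text:
                    yield Document(text, {"source": self.file_path, "line": number})


def demo() -> List[Document]:
    """Write a small temporary file and load it in a single run."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "notes.txt"
        path.write_text("first note\n\nsecond note\n", encoding="utf-8")
        return LineLoader(str(path)).load()
```

Calling `demo()` returns two Documents, one for each non-empty line, each carrying its source path and line number in `metadata`. Keeping the work in `lazy_load()` lets callers stream large inputs instead of holding every Document in memory.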
Classes
|
Load acreom vault from a directory. |
Load with an Airbyte source connector implemented using the CDK. |
|
Load from Gong using an Airbyte source connector. |
|
Load from Hubspot using an Airbyte source connector. |
|
Load from Salesforce using an Airbyte source connector. |
|
Load from Shopify using an Airbyte source connector. |
|
Load from Stripe using an Airbyte source connector. |
|
Load from Typeform using an Airbyte source connector. |
|
Load from Zendesk Support using an Airbyte source connector. |
|
Load local Airbyte json files. |
|
Load the Airtable tables. |
|
Load datasets from Apify web scraping, crawling, and data extraction platform. |
|
Load records from an ArcGIS FeatureLayer. |
|
|
Load a query result from Arxiv. |
Load AssemblyAI audio transcripts. |
|
|
Load AssemblyAI audio transcripts. |
Transcript format to use for the document loader. |
|
Load HTML asynchronously. |
|
|
Load documents from AWS Athena. |
Load AZLyrics webpages. |
|
Load from Azure AI Data. |
|
|
Load from Azure Blob Storage container. |
|
Load from Azure Blob Storage files. |
|
Load from Baidu BOS directory. |
|
Load from Baidu Cloud BOS file. |
Base class for all loaders that use the O365 package. |
|
|
Load a bibtex file. |
Fetch and load transcripts from BiliBili videos. |
|
Load a Blackboard course. |
|
|
Load blobs from cloud URL or file:. |
|
Load blobs in the local file system. |
|
Load YouTube urls as audio file(s). |
Load elements from a blockchain smart contract. |
|
Enumerator of the supported blockchains. |
|
Load with Brave Search engine. |
|
Load pre-rendered web pages using a headless browser hosted on Browserbase. |
|
Load webpages with Browserless /content endpoint. |
|
Document Loader for Apache Cassandra. |
|
|
Load conversations from exported ChatGPT data. |
Microsoft Compiled HTML Help (CHM) Parser. |
|
Load CHM files using Unstructured. |
|
Scrape HTML pages from URLs using a headless instance of Chromium. |
|
|
Load College Confidential webpages. |
Load and parse Documents concurrently. |
|
Load Confluence pages. |
|
Enumerator of the content formats of Confluence page. |
|
|
Load CoNLL-U files. |
Load documents from Couchbase. |
|
|
Load a CSV file into a list of Documents. |
Load CSV files using Unstructured. |
|
Load Cube semantic layer metadata. |
|
Load Datadog logs. |
|
Initialize with dataframe object. |
|
Load Pandas DataFrame. |
|
Load files using dedoc API. The file loader automatically detects the file type (even with the wrong extension). By default, the loader makes a call to the locally hosted dedoc API. More information about dedoc API can be found in dedoc documentation: https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html. |
|
Base Loader that uses dedoc (https://dedoc.readthedocs.io). |
|
DedocFileLoader document loader integration to load files using dedoc. |
|
Load Diffbot json file. |
|
Load from a directory. |
|
Load Discord chat logs. |
|
|
Load a PDF with Azure Document Intelligence. |
Load from Docusaurus Documentation. |
|
Load files from Dropbox. |
|
Load from DuckDB. |
|
Loads Outlook Message files using extract_msg. |
|
Load email files using Unstructured. |
|
Load EPub files using Unstructured. |
|
Load transactions from Ethereum mainnet. |
|
Load from EverNote. |
|
Load Microsoft Excel files using Unstructured. |
|
Load Facebook Chat messages directory dump. |
|
|
Load from FaunaDB. |
Load Figma file. |
|
FireCrawlLoader document loader integration |
|
Generic Document Loader. |
|
Load geopandas Dataframe. |
|
|
Load Git repository files. |
|
Load GitBook data. |
Load GitHub repository Issues. |
|
Load issues of a GitHub repository. |
|
Load GitHub File |
|
Load table schemas from AWS Glue. |
|
Load from Gutenberg.org. |
|
File encoding as the NamedTuple. |
|
|
Load Hacker News data. |
Load HTML files using Unstructured. |
|
|
__ModuleName__ document loader integration |
|
Load from Hugging Face Hub datasets. |
|
Load model information from Hugging Face Hub, including README content. |
|
Load iFixit repair guides, device wikis and answers. |
Load PNG and JPG files using Unstructured. |
|
Load image captions. |
|
Load IMSDb webpages. |
|
|
Load from IUGU. |
Load notes from Joplin. |
|
Load a JSON file using a jq schema. |
|
Load from Kinetica API. |
|
Client for lakeFS. |
|
|
Load from lakeFS. |
Load from lakeFS as unstructured data. |
|
Load from LarkSuite (FeiShu). |
|
Load from LarkSuite (FeiShu) wiki. |
|
Load Documents using LLMSherpa. |
|
Load Markdown files using Unstructured. |
|
Load the Mastodon 'toots'. |
|
Load from Alibaba Cloud MaxCompute table. |
|
Load MediaWiki dump from an XML file. |
|
Merge documents from a list of loaders |
|
|
Parse MHTML files with BeautifulSoup. |
Load elements from a blockchain smart contract. |
|
Load from Modern Treasury. |
|
Load MongoDB documents. |
|
|
Load news articles from URLs using Unstructured. |
Load Jupyter notebook (.ipynb) files. |
|
Load Notion directory dump. |
|
Load from Notion DB. |
|
|
Load from any file type using Nuclia Understanding API. |
Load from Huawei OBS directory. |
|
Load from the Huawei OBS file. |
|
Load Obsidian files from directory. |
|
Load OpenOffice ODT files using Unstructured. |
|
Load documents from Microsoft OneDrive. |
|
Load a file from Microsoft OneDrive. |
|
Load pages from OneNote notebooks. |
|
Load from Open City. |
|
|
Load from Oracle Autonomous Database (ADB).
Read documents using OracleDocLoader. Parameters: conn (Oracle connection), params (loader parameters). |
|
Read a file |
|
Splitting text using Oracle chunker. |
|
Parse Oracle doc metadata... |
|
Load Org-Mode files using Unstructured. |
|
Transcribe and parse audio files using Azure OpenAI Whisper. |
|
Transcribe and parse audio files with faster-whisper. |
|
Transcribe and parse audio files. |
|
|
Transcribe and parse audio files with OpenAI Whisper model. |
Transcribe and parse audio files. |
|
|
Loads a PDF with Azure Document Intelligence (formerly Form Recognizer). |
Dataclass to store Document AI parsing results. |
|
Parser that uses mime-types to parse a blob. |
|
Load article PDF files using Grobid. |
|
Exception raised when the Grobid server is unavailable. |
|
Parse HTML files using Beautiful Soup. |
|
Code segmenter for C. |
|
|
Code segmenter for COBOL. |
|
Abstract class for the code segmenter. |
Code segmenter for C++. |
|
|
Code segmenter for C#. |
|
Code segmenter for Elixir. |
Code segmenter for Go. |
|
Code segmenter for Java. |
|
|
Code segmenter for JavaScript. |
|
Code segmenter for Kotlin. |
|
Parse using the respective programming language syntax. |
Code segmenter for Lua. |
|
Code segmenter for Perl. |
|
Code segmenter for PHP. |
|
|
Code segmenter for Python. |
Code segmenter for Ruby. |
|
Code segmenter for Rust. |
|
|
Code segmenter for Scala. |
|
Abstract class for `CodeSegmenter`s that use the tree-sitter library. |
|
Code segmenter for TypeScript. |
Parse the Microsoft Word documents from a blob. |
|
Send PDF files to Amazon Textract and parse them. |
|
|
Loads a PDF with Azure Document Intelligence (formerly Form Recognizer) and chunks at character level. |
Parse PDF using PDFMiner. |
|
Parse PDF with PDFPlumber. |
|
Parse PDF using PyMuPDF. |
|
Load PDF using pypdf |
|
Parse PDF with PyPDFium2. |
|
Parser for text blobs. |
|
Parser for vsdx files. |
|
Load PDF files from a local file system, HTTP or S3. |
|
|
Base Loader class for PDF files. |
|
DedocPDFLoader document loader integration to load PDF files using dedoc. The file loader can automatically detect the correctness of a textual layer in the PDF document. Note that the __init__ method supports parameters that differ from those of DedocBaseLoader. |
Load a PDF with Azure Document Intelligence |
|
|
Load PDF files using Mathpix service. |
|
Load online PDF. |
|
Load PDF files using PDFMiner. |
Load PDF files as HTML content using PDFMiner. |
|
|
Load PDF files using pdfplumber. |
alias of |
|
|
Load PDF files using PyMuPDF. |
Load a directory with PDF files using pypdf and chunks at character level. |
|
|
PyPDFLoader document loader integration |
|
Load PDF using pypdfium2 and chunks at character level. |
Load PDF files using Unstructured. |
|
|
Document loader utilizing Zerox library: getomni-ai/zerox |
Pebblo Safe Loader class is a wrapper around document loaders enabling the data to be scrutinized. |
|
Loader for text data. |
|
|
Load Polars DataFrame. |
|
Load Microsoft PowerPoint files using Unstructured. |
Load from Psychic.dev. |
|
Load from the PubMed biomedical library. |
|
|
Load PySpark DataFrames. |
|
Load Python files, respecting any non-default encoding if specified. |
|
Load Quip pages. |
Load ReadTheDocs documentation directory. |
|
|
Recursively load all child links from a root URL. |
Load Reddit posts. |
|
Load Roam files from a directory. |
|
Column not found error. |
|
Load from a Rockset database. |
|
|
Load content from RSpace notebooks, folders, documents or PDF Gallery files. |
|
Load news articles from RSS feeds using Unstructured. |
Load RST files using Unstructured. |
|
Load RTF files using Unstructured. |
|
Load from Amazon AWS S3 directory. |
|
|
Load from Amazon AWS S3 file. |
Turn a URL into LLM-accessible markdown with Scrapfly.io. |
|
Turn a URL into LLM-accessible markdown with ScrapingAnt. |
|
Load from SharePoint. |
|
|
Load a sitemap and its URLs. |
Load from a Slack directory dump. |
|
Load from Snowflake API. |
|
Load web pages as Documents using Spider AI. |
|
Load from Spreedly API. |
|
Load documents by querying database tables supported by SQLAlchemy. |
|
|
Load .srt (subtitle) files. |
|
Load from Stripe API. |
Load SurrealDB documents. |
|
Load Telegram chat json directory dump. |
|
Load from Telegram chat dump. |
|
alias of |
|
|
Load from Tencent Cloud COS directory. |
Load from Tencent Cloud COS file. |
|
|
Load from TensorFlow Dataset. |
|
Load text file. |
|
Load documents from TiDB. |
Load HTML using 2markdown API. |
|
|
Load TOML files. |
|
Load cards from a Trello board. |
Load TSV files using Unstructured. |
|
Load Twitter tweets. |
|
Base Loader that uses Unstructured. |
|
Load files from remote URLs using Unstructured. |
|
Abstract base class for all evaluators. |
|
Load HTML pages with Playwright and parse with Unstructured. |
|
|
Evaluate the page HTML content using the unstructured library. |
Load HTML pages with Selenium and parse with Unstructured. |
|
|
Initialize with file path. |
Load weather data with Open Weather Map API. |
|
WebBaseLoader document loader integration |
|
Load WhatsApp messages text file. |
|
Load from Wikipedia. |
|
Load DOCX file using docx2txt and chunks at character level. |
|
|
Load Microsoft Word file using Unstructured. |
Load XML file using Unstructured. |
|
Load Xorbits DataFrame. |
|
Generic Google API Client. |
|
Load all Videos from a YouTube Channel. |
|
Output formats of transcripts from YoutubeLoader. |
|
|
Load YouTube video transcripts. |
|
Load documents from Yuque. |
Functions
Fetch the mime types for the specified file types. |
|
Fetch the mime types for the specified file types. |
|
Combine message information in a readable format ready to be used. |
|
Combine message information in a readable format ready to be used. |
|
Try to detect the file encoding. |
|
Combine cells information in a readable format ready to be used. |
|
Recursively remove newlines, no matter the data structure they are stored in. |
|
|
Extract text from images with RapidOCR. |
Get a parser by parser name. |
|
Default joiner for content columns. |
|
Combine message information in a readable format ready to be used. |
|
Convert a string or list of strings to a list of Documents with metadata. |
|
Retrieve a list of elements from the Unstructured API. |
|
|
Check if the installed Unstructured version exceeds the minimum version for the feature in question. |
|
Raise an error if the Unstructured version does not exceed the specified minimum. |
Combine message information in a readable format ready to be used. |
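One helper listed above recursively removes newlines regardless of the data structure they are stored in. A minimal sketch of that idea follows. The name `strip_newlines` and the set of container types it handles are assumptions made for illustration; this is not the library's own function.

```python
from typing import Any


def strip_newlines(value: Any) -> Any:
    """Return a copy of `value` with newlines removed from every nested string."""
    if isinstance(value, str):
        return value.replace("\n", "")
    if isinstance(value, dict):
        # Recurse into dictionary values; keys are left untouched.
        return {key: strip_newlines(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type from the cleaned items.
        return type(value)(strip_newlines(item) for item in value)
    # Numbers, None, and other types pass through unchanged.
    return value
```

For example, `strip_newlines({"a": "x\ny", "b": ["1\n", ("2\n", 3)]})` returns `{"a": "xy", "b": ["1", ("2", 3)]}`.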
Deprecated classes