document_loaders

Document Loaders are classes to load Documents.

Document Loaders are usually used to load a lot of Documents in a single run.

Class hierarchy:

BaseLoader --> <name>Loader  # Examples: TextLoader, UnstructuredFileLoader
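
The hierarchy above can be sketched with plain Python stand-ins. The Document and BaseLoader classes below are simplified placeholders written for this illustration, not the library's own definitions, and InMemoryLinesLoader is a hypothetical <name>Loader.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Stand-in for a Document: text plus a metadata dict (assumption)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseLoader:
    """Stand-in base class: subclasses implement lazy_load()."""

    def lazy_load(self) -> Iterator[Document]:
        raise NotImplementedError

    def load(self) -> List[Document]:
        # Eagerly collect everything the lazy iterator yields.
        return list(self.lazy_load())


class InMemoryLinesLoader(BaseLoader):
    """Hypothetical <name>Loader: one Document per non-empty line."""

    def __init__(self, text: str, source: str = "memory") -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        for number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield Document(line, {"source": self.source, "line": number})


docs = InMemoryLinesLoader("first\n\nsecond").load()
```

A concrete loader in this sketch only has to provide lazy_load(); load() comes from the base class.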

Main helpers:

Document, <name>TextSplitter

Classes

document_loaders.acreom.AcreomLoader(path[, ...])

Load acreom vault from a directory.

document_loaders.airbyte.AirbyteCDKLoader(...)

Load with an Airbyte source connector implemented using the CDK.

document_loaders.airbyte.AirbyteGongLoader(...)

Load from Gong using an Airbyte source connector.

document_loaders.airbyte.AirbyteHubspotLoader(...)

Load from Hubspot using an Airbyte source connector.

document_loaders.airbyte.AirbyteSalesforceLoader(...)

Load from Salesforce using an Airbyte source connector.

document_loaders.airbyte.AirbyteShopifyLoader(...)

Load from Shopify using an Airbyte source connector.

document_loaders.airbyte.AirbyteStripeLoader(...)

Load from Stripe using an Airbyte source connector.

document_loaders.airbyte.AirbyteTypeformLoader(...)

Load from Typeform using an Airbyte source connector.

document_loaders.airbyte.AirbyteZendeskSupportLoader(...)

Load from Zendesk Support using an Airbyte source connector.

document_loaders.airbyte_json.AirbyteJSONLoader(...)

Load local Airbyte json files.

document_loaders.airtable.AirtableLoader(...)

Load the Airtable tables.

document_loaders.apify_dataset.ApifyDatasetLoader

Load datasets from Apify web scraping, crawling, and data extraction platform.

document_loaders.arcgis_loader.ArcGISLoader(layer)

Load records from an ArcGIS FeatureLayer.

document_loaders.arxiv.ArxivLoader(query[, ...])

Load a query result from Arxiv.

document_loaders.assemblyai.AssemblyAIAudioLoaderById(...)

Load AssemblyAI audio transcripts.

document_loaders.assemblyai.AssemblyAIAudioTranscriptLoader(...)

Load AssemblyAI audio transcripts.

document_loaders.assemblyai.TranscriptFormat(value)

Transcript format to use for the document loader.

document_loaders.async_html.AsyncHtmlLoader(...)

Load HTML asynchronously.

document_loaders.athena.AthenaLoader(query, ...)

Load documents from AWS Athena.

document_loaders.azlyrics.AZLyricsLoader([...])

Load AZLyrics webpages.

document_loaders.azure_ai_data.AzureAIDataLoader(url)

Load from Azure AI Data.

document_loaders.azure_blob_storage_container.AzureBlobStorageContainerLoader(...)

Load from Azure Blob Storage container.

document_loaders.azure_blob_storage_file.AzureBlobStorageFileLoader(...)

Load from Azure Blob Storage files.

document_loaders.baiducloud_bos_directory.BaiduBOSDirectoryLoader(...)

Load from Baidu BOS directory.

document_loaders.baiducloud_bos_file.BaiduBOSFileLoader(...)

Load from Baidu Cloud BOS file.

document_loaders.base_o365.O365BaseLoader

Base class for all loaders that use the O365 package.

document_loaders.bibtex.BibtexLoader(...[, ...])

Load a bibtex file.

document_loaders.bilibili.BiliBiliLoader(...)

Load transcripts from BiliBili videos.

document_loaders.blackboard.BlackboardLoader(...)

Load a Blackboard course.

document_loaders.blob_loaders.cloud_blob_loader.CloudBlobLoader(url, *)

Load blobs from a cloud URL or a file: URL.

document_loaders.blob_loaders.file_system.FileSystemBlobLoader(path, *)

Load blobs in the local file system.

document_loaders.blob_loaders.youtube_audio.YoutubeAudioLoader(...)

Load YouTube urls as audio file(s).

document_loaders.blockchain.BlockchainDocumentLoader(...)

Load elements from a blockchain smart contract.

document_loaders.blockchain.BlockchainType(value)

Enumerator of the supported blockchains.

document_loaders.brave_search.BraveSearchLoader(...)

Load with Brave Search engine.

document_loaders.browserbase.BrowserbaseLoader(urls)

Load pre-rendered web pages using a headless browser hosted on Browserbase.

document_loaders.browserless.BrowserlessLoader(...)

Load webpages with Browserless /content endpoint.

document_loaders.cassandra.CassandraLoader(...)

Document Loader for Apache Cassandra.

document_loaders.chatgpt.ChatGPTLoader(log_file)

Load conversations from exported ChatGPT data.

document_loaders.chm.CHMParser(path)

Microsoft Compiled HTML Help (CHM) Parser.

document_loaders.chm.UnstructuredCHMLoader(...)

Load CHM files using Unstructured.

document_loaders.chromium.AsyncChromiumLoader(urls, *)

Scrape HTML pages from URLs using a headless instance of Chromium.

document_loaders.college_confidential.CollegeConfidentialLoader([...])

Load College Confidential webpages.

document_loaders.concurrent.ConcurrentLoader(...)

Load and parse Documents concurrently.
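
As a rough illustration of concurrent loading, the sketch below runs a parse function over several sources with a thread pool. The helper load_concurrently and its parameters are hypothetical and do not reflect how ConcurrentLoader is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def load_concurrently(
    sources: Sequence[str],
    parse: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Apply `parse` to every source in worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as `sources`.
        return list(pool.map(parse, sources))


# str.upper stands in for a real parsing step.
parsed = load_concurrently(["a.txt", "b.txt"], parse=str.upper)
```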

document_loaders.confluence.ConfluenceLoader(url)

Load Confluence pages.

document_loaders.confluence.ContentFormat(value)

Enumerator of the content formats of a Confluence page.

document_loaders.conllu.CoNLLULoader(file_path)

Load CoNLL-U files.

document_loaders.couchbase.CouchbaseLoader(...)

Load documents from Couchbase.

document_loaders.csv_loader.CSVLoader(file_path)

Load a CSV file into a list of Documents.
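
The idea of turning a CSV file into a list of Documents can be sketched with the standard library alone. The function csv_rows_to_documents, the dict-based document shape, and the "column: value" rendering are assumptions made for this example rather than the CSVLoader implementation.

```python
import csv
import io
from typing import Dict, List


def csv_rows_to_documents(csv_text: str, source: str = "inline") -> List[Dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Render the row as one "column: value" line per column.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return documents


rows = csv_rows_to_documents("name,team\nAda,core\nLin,docs\n")
```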

document_loaders.csv_loader.UnstructuredCSVLoader(...)

Load CSV files using Unstructured.

document_loaders.cube_semantic.CubeSemanticLoader(...)

Load Cube semantic layer metadata.

document_loaders.datadog_logs.DatadogLogsLoader(...)

Load Datadog logs.

document_loaders.dataframe.BaseDataFrameLoader(...)

Initialize with dataframe object.

document_loaders.dataframe.DataFrameLoader(...)

Load Pandas DataFrame.

document_loaders.dedoc.DedocAPIFileLoader(...)

Load files using dedoc API. The file loader automatically detects the file type (even with the wrong extension). By default, the loader makes a call to the locally hosted dedoc API. More information about dedoc API can be found in dedoc documentation: https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html.

document_loaders.dedoc.DedocBaseLoader(...)

Base Loader that uses dedoc (https://dedoc.readthedocs.io).

document_loaders.dedoc.DedocFileLoader(...)

DedocFileLoader document loader integration to load files using dedoc.

document_loaders.diffbot.DiffbotLoader(...)

Load Diffbot json file.

document_loaders.directory.DirectoryLoader(...)

Load from a directory.

document_loaders.discord.DiscordChatLoader(...)

Load Discord chat logs.

document_loaders.doc_intelligence.AzureAIDocumentIntelligenceLoader(...)

Load a PDF with Azure Document Intelligence.

document_loaders.docusaurus.DocusaurusLoader(url)

Load from Docusaurus Documentation.

document_loaders.dropbox.DropboxLoader

Load files from Dropbox.

document_loaders.duckdb_loader.DuckDBLoader(query)

Load from DuckDB.

document_loaders.email.OutlookMessageLoader(...)

Loads Outlook Message files using extract_msg.

document_loaders.email.UnstructuredEmailLoader(...)

Load email files using Unstructured.

document_loaders.epub.UnstructuredEPubLoader(...)

Load EPub files using Unstructured.

document_loaders.etherscan.EtherscanLoader(...)

Load transactions from Ethereum mainnet.

document_loaders.evernote.EverNoteLoader(...)

Load from EverNote.

document_loaders.excel.UnstructuredExcelLoader(...)

Load Microsoft Excel files using Unstructured.

document_loaders.facebook_chat.FacebookChatLoader(path)

Load Facebook Chat messages directory dump.

document_loaders.fauna.FaunaLoader(query, ...)

Load from FaunaDB.

document_loaders.figma.FigmaFileLoader(...)

Load Figma file.

document_loaders.firecrawl.FireCrawlLoader(url, *)

FireCrawlLoader document loader integration

document_loaders.generic.GenericLoader(...)

Generic Document Loader.

document_loaders.geodataframe.GeoDataFrameLoader(...)

Load geopandas Dataframe.

document_loaders.git.GitLoader(repo_path[, ...])

Load Git repository files.

document_loaders.gitbook.GitbookLoader(web_page)

Load GitBook data.

document_loaders.github.BaseGitHubLoader

Load GitHub repository Issues.

document_loaders.github.GitHubIssuesLoader

Load issues of a GitHub repository.

document_loaders.github.GithubFileLoader

Load a GitHub file.

document_loaders.glue_catalog.GlueCatalogLoader(...)

Load table schemas from AWS Glue.

document_loaders.gutenberg.GutenbergLoader(...)

Load from Gutenberg.org.

document_loaders.helpers.FileEncoding(...)

File encoding as the NamedTuple.

document_loaders.hn.HNLoader([web_path, ...])

Load Hacker News data.

document_loaders.html.UnstructuredHTMLLoader(...)

Load HTML files using Unstructured.

document_loaders.html_bs.BSHTMLLoader(file_path)

BSHTMLLoader document loader integration

document_loaders.hugging_face_dataset.HuggingFaceDatasetLoader(path)

Load from Hugging Face Hub datasets.

document_loaders.hugging_face_model.HuggingFaceModelLoader(*)

Load model information from Hugging Face Hub, including README content.

document_loaders.ifixit.IFixitLoader(web_path)

Load iFixit repair guides, device wikis and answers.

document_loaders.image.UnstructuredImageLoader(...)

Load PNG and JPG files using Unstructured.

document_loaders.image_captions.ImageCaptionLoader(images)

Load image captions.

document_loaders.imsdb.IMSDbLoader([...])

Load IMSDb webpages.

document_loaders.iugu.IuguLoader(resource[, ...])

Load from IUGU.

document_loaders.joplin.JoplinLoader([...])

Load notes from Joplin.

document_loaders.json_loader.JSONLoader(...)

Load a JSON file using a jq schema.

document_loaders.kinetica_loader.KineticaLoader(...)

Load from Kinetica API.

document_loaders.lakefs.LakeFSClient(...)

Client for lakeFS.

document_loaders.lakefs.LakeFSLoader(...[, ...])

Load from lakeFS.

document_loaders.lakefs.UnstructuredLakeFSLoader(...)

Load from lakeFS as unstructured data.

document_loaders.larksuite.LarkSuiteDocLoader(...)

Load from LarkSuite (FeiShu).

document_loaders.larksuite.LarkSuiteWikiLoader(...)

Load from LarkSuite (FeiShu) wiki.

document_loaders.llmsherpa.LLMSherpaFileLoader(...)

Load Documents using LLMSherpa.

document_loaders.markdown.UnstructuredMarkdownLoader(...)

Load Markdown files using Unstructured.

document_loaders.mastodon.MastodonTootsLoader(...)

Load the Mastodon 'toots'.

document_loaders.max_compute.MaxComputeLoader(...)

Load from Alibaba Cloud MaxCompute table.

document_loaders.mediawikidump.MWDumpLoader(...)

Load MediaWiki dump from an XML file.

document_loaders.merge.MergedDataLoader(loaders)

Merge documents from a list of loaders
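
Merging can be pictured as chaining the output of several loaders in order. ListLoader and MergedLoader below are hypothetical stand-ins used only to show the pattern, and strings stand in for Documents.

```python
from itertools import chain
from typing import Iterable, Iterator, List


class ListLoader:
    """Hypothetical loader that yields pre-built items."""

    def __init__(self, items: List[str]) -> None:
        self.items = items

    def lazy_load(self) -> Iterator[str]:
        yield from self.items


class MergedLoader:
    """Hypothetical merger: yields from each wrapped loader in turn."""

    def __init__(self, loaders: Iterable[ListLoader]) -> None:
        self.loaders = list(loaders)

    def lazy_load(self) -> Iterator[str]:
        # Chain the lazy iterators so nothing is materialized early.
        return chain.from_iterable(loader.lazy_load() for loader in self.loaders)

    def load(self) -> List[str]:
        return list(self.lazy_load())


merged = MergedLoader([ListLoader(["a", "b"]), ListLoader(["c"])]).load()
```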

document_loaders.mhtml.MHTMLLoader(file_path)

Parse MHTML files with BeautifulSoup.

document_loaders.mintbase.MintbaseDocumentLoader(...)

Load elements from a blockchain smart contract.

document_loaders.modern_treasury.ModernTreasuryLoader(...)

Load from Modern Treasury.

document_loaders.mongodb.MongodbLoader(...)

Load MongoDB documents.

document_loaders.needle.NeedleLoader([...])

NeedleLoader is a document loader for managing documents stored in a collection.

document_loaders.news.NewsURLLoader(urls[, ...])

Load news articles from URLs using Unstructured.

document_loaders.notebook.NotebookLoader(path)

Load Jupyter notebook (.ipynb) files.

document_loaders.notion.NotionDirectoryLoader(path, *)

Load Notion directory dump.

document_loaders.notiondb.NotionDBLoader(...)

Load from Notion DB.

document_loaders.nuclia.NucliaLoader(path, ...)

Load from any file type using Nuclia Understanding API.

document_loaders.obs_directory.OBSDirectoryLoader(...)

Load from Huawei OBS directory.

document_loaders.obs_file.OBSFileLoader(...)

Load from the Huawei OBS file.

document_loaders.obsidian.ObsidianLoader(path)

Load Obsidian files from directory.

document_loaders.odt.UnstructuredODTLoader(...)

Load OpenOffice ODT files using Unstructured.

document_loaders.onedrive.OneDriveLoader

Load documents from Microsoft OneDrive.

document_loaders.onedrive_file.OneDriveFileLoader

Load a file from Microsoft OneDrive.

document_loaders.onenote.OneNoteLoader

Load pages from OneNote notebooks.

document_loaders.open_city_data.OpenCityDataLoader(...)

Load from Open City.

document_loaders.oracleadb_loader.OracleAutonomousDatabaseLoader(...)

Load from Oracle Autonomous Database (ADB).

document_loaders.oracleai.OracleDocLoader(...)

Read documents using OracleDocLoader. Parameters: conn (Oracle connection), params (loader parameters).

document_loaders.oracleai.OracleDocReader()

Read a file

document_loaders.oracleai.OracleTextSplitter(...)

Splitting text using Oracle chunker.

document_loaders.oracleai.ParseOracleDocMetadata()

Parse Oracle doc metadata...

document_loaders.org_mode.UnstructuredOrgModeLoader(...)

Load Org-Mode files using Unstructured.

document_loaders.parsers.audio.AzureOpenAIWhisperParser(*)

Transcribe and parse audio files using Azure OpenAI Whisper.

document_loaders.parsers.audio.FasterWhisperParser(*)

Transcribe and parse audio files with faster-whisper.

document_loaders.parsers.audio.OpenAIWhisperParser([...])

Transcribe and parse audio files.

document_loaders.parsers.audio.OpenAIWhisperParserLocal([...])

Transcribe and parse audio files with OpenAI Whisper model.

document_loaders.parsers.audio.YandexSTTParser(*)

Transcribe and parse audio files.

document_loaders.parsers.doc_intelligence.AzureAIDocumentIntelligenceParser(...)

Loads a PDF with Azure Document Intelligence (formerly Form Recognizer).

document_loaders.parsers.docai.DocAIParsingResults(...)

Dataclass to store Document AI parsing results.

document_loaders.parsers.documentloader_adapter.DocumentLoaderAsParser(...)

document_loaders.parsers.generic.MimeTypeBasedParser(...)

Parser that uses mime-types to parse a blob.
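
Dispatching on mime-type can be sketched as a lookup from a guessed mime-type to a handler function. The function parse_by_mime_type and the handlers mapping are assumptions made for this illustration and are not the MimeTypeBasedParser API.

```python
import mimetypes
from typing import Callable, Dict


def parse_by_mime_type(
    filename: str,
    data: bytes,
    handlers: Dict[str, Callable[[bytes], str]],
) -> str:
    """Pick a handler from the mime-type guessed from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type not in handlers:
        raise ValueError(f"No parser registered for {mime_type!r}")
    return handlers[mime_type](data)


# One handler per supported mime-type; only plain text is registered here.
handlers = {"text/plain": lambda raw: raw.decode("utf-8")}
text = parse_by_mime_type("notes.txt", b"hello", handlers)
```

Unregistered or unrecognized types raise ValueError in this sketch instead of falling back to a default parser.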

document_loaders.parsers.grobid.GrobidParser(...)

Load article PDF files using Grobid.

document_loaders.parsers.grobid.ServerUnavailableException

Exception raised when the Grobid server is unavailable.

document_loaders.parsers.html.bs4.BS4HTMLParser(*)

Parse HTML files using Beautiful Soup.

document_loaders.parsers.language.c.CSegmenter(code)

Code segmenter for C.

document_loaders.parsers.language.cobol.CobolSegmenter(code)

Code segmenter for COBOL.

document_loaders.parsers.language.code_segmenter.CodeSegmenter(code)

Abstract class for the code segmenter.

document_loaders.parsers.language.cpp.CPPSegmenter(code)

Code segmenter for C++.

document_loaders.parsers.language.csharp.CSharpSegmenter(code)

Code segmenter for C#.

document_loaders.parsers.language.elixir.ElixirSegmenter(code)

Code segmenter for Elixir.

document_loaders.parsers.language.go.GoSegmenter(code)

Code segmenter for Go.

document_loaders.parsers.language.java.JavaSegmenter(code)

Code segmenter for Java.

document_loaders.parsers.language.javascript.JavaScriptSegmenter(code)

Code segmenter for JavaScript.

document_loaders.parsers.language.kotlin.KotlinSegmenter(code)

Code segmenter for Kotlin.

document_loaders.parsers.language.language_parser.LanguageParser([...])

Parse using the respective programming language syntax.

document_loaders.parsers.language.lua.LuaSegmenter(code)

Code segmenter for Lua.

document_loaders.parsers.language.perl.PerlSegmenter(code)

Code segmenter for Perl.

document_loaders.parsers.language.php.PHPSegmenter(code)

Code segmenter for PHP.

document_loaders.parsers.language.python.PythonSegmenter(code)

Code segmenter for Python.

document_loaders.parsers.language.ruby.RubySegmenter(code)

Code segmenter for Ruby.

document_loaders.parsers.language.rust.RustSegmenter(code)

Code segmenter for Rust.

document_loaders.parsers.language.scala.ScalaSegmenter(code)

Code segmenter for Scala.

document_loaders.parsers.language.sql.SQLSegmenter(code)

Code segmenter for SQL.

document_loaders.parsers.language.tree_sitter_segmenter.TreeSitterSegmenter(code)

Abstract class for `CodeSegmenter`s that use the tree-sitter library.

document_loaders.parsers.language.typescript.TypeScriptSegmenter(code)

Code segmenter for TypeScript.

document_loaders.parsers.msword.MsWordParser()

Parse the Microsoft Word documents from a blob.

document_loaders.parsers.pdf.AmazonTextractPDFParser([...])

Send PDF files to Amazon Textract and parse them.

document_loaders.parsers.pdf.DocumentIntelligenceParser(...)

Loads a PDF with Azure Document Intelligence (formerly Form Recognizer) and chunks at character level.

document_loaders.parsers.pdf.PDFMinerParser([...])

Parse PDF using PDFMiner.

document_loaders.parsers.pdf.PDFPlumberParser([...])

Parse PDF with PDFPlumber.

document_loaders.parsers.pdf.PyMuPDFParser([...])

Parse PDF using PyMuPDF.

document_loaders.parsers.pdf.PyPDFParser([...])

Load PDF using pypdf

document_loaders.parsers.pdf.PyPDFium2Parser([...])

Parse PDF with PyPDFium2.

document_loaders.parsers.txt.TextParser()

Parser for text blobs.

document_loaders.parsers.vsdx.VsdxParser()

Parser for vsdx files.

document_loaders.pdf.AmazonTextractPDFLoader(...)

Load PDF files from a local file system, HTTP or S3.

document_loaders.pdf.BasePDFLoader(file_path, *)

Base Loader class for PDF files.

document_loaders.pdf.DedocPDFLoader(file_path, *)

DedocPDFLoader document loader integration to load PDF files using dedoc. The file loader can automatically detect the correctness of a textual layer in the PDF document. Note that __init__ method supports parameters that differ from ones of DedocBaseLoader.

document_loaders.pdf.DocumentIntelligenceLoader(...)

Load a PDF with Azure Document Intelligence

document_loaders.pdf.MathpixPDFLoader(file_path)

Load PDF files using Mathpix service.

document_loaders.pdf.OnlinePDFLoader(...[, ...])

Load online PDF.

document_loaders.pdf.PDFMinerLoader(file_path, *)

Load PDF files using PDFMiner.

document_loaders.pdf.PDFMinerPDFasHTMLLoader(...)

Load PDF files as HTML content using PDFMiner.

document_loaders.pdf.PDFPlumberLoader(file_path)

Load PDF files using pdfplumber.

document_loaders.pdf.PagedPDFSplitter

alias of PyPDFLoader

document_loaders.pdf.PyMuPDFLoader(file_path, *)

Load PDF files using PyMuPDF.

document_loaders.pdf.PyPDFDirectoryLoader(path)

Load a directory with PDF files using pypdf and chunks at character level.

document_loaders.pdf.PyPDFLoader(file_path)

PyPDFLoader document loader integration

document_loaders.pdf.PyPDFium2Loader(...[, ...])

Load PDF using pypdfium2 and chunks at character level.

document_loaders.pdf.UnstructuredPDFLoader(...)

Load PDF files using Unstructured.

document_loaders.pdf.ZeroxPDFLoader(file_path)

Document loader utilizing Zerox library: getomni-ai/zerox

document_loaders.pebblo.PebbloSafeLoader(...)

Pebblo Safe Loader class is a wrapper around document loaders enabling the data to be scrutinized.

document_loaders.pebblo.PebbloTextLoader(...)

Loader for text data.

document_loaders.polars_dataframe.PolarsDataFrameLoader(...)

Load Polars DataFrame.

document_loaders.powerpoint.UnstructuredPowerPointLoader(...)

Load Microsoft PowerPoint files using Unstructured.

document_loaders.psychic.PsychicLoader(...)

Load from Psychic.dev.

document_loaders.pubmed.PubMedLoader(query)

Load from the PubMed biomedical library.

document_loaders.pyspark_dataframe.PySparkDataFrameLoader([...])

Load PySpark DataFrames.

document_loaders.python.PythonLoader(file_path)

Load Python files, respecting any non-default encoding if specified.

document_loaders.quip.QuipLoader(api_url, ...)

Load Quip pages.

document_loaders.readthedocs.ReadTheDocsLoader(path)

Load ReadTheDocs documentation directory.

document_loaders.recursive_url_loader.RecursiveUrlLoader(url)

Recursively load all child links from a root URL.

document_loaders.reddit.RedditPostsLoader(...)

Load Reddit posts.

document_loaders.roam.RoamLoader(path)

Load Roam files from a directory.

document_loaders.rocksetdb.ColumnNotFoundError(...)

Column not found error.

document_loaders.rocksetdb.RocksetLoader(...)

Load from a Rockset database.

document_loaders.rspace.RSpaceLoader(global_id)

Load content from RSpace notebooks, folders, documents or PDF Gallery files.

document_loaders.rss.RSSFeedLoader([urls, ...])

Load news articles from RSS feeds using Unstructured.

document_loaders.rst.UnstructuredRSTLoader(...)

Load RST files using Unstructured.

document_loaders.rtf.UnstructuredRTFLoader(...)

Load RTF files using Unstructured.

document_loaders.s3_directory.S3DirectoryLoader(bucket)

Load from Amazon AWS S3 directory.

document_loaders.s3_file.S3FileLoader(...[, ...])

Load from Amazon AWS S3 file.

document_loaders.scrapfly.ScrapflyLoader(urls, *)

Turn a URL into LLM-accessible markdown with Scrapfly.io.

document_loaders.scrapingant.ScrapingAntLoader(urls, *)

Turn a URL into LLM-accessible markdown with ScrapingAnt.

document_loaders.sharepoint.SharePointLoader

Load from SharePoint.

document_loaders.sitemap.SitemapLoader(web_path)

Load a sitemap and its URLs.

document_loaders.slack_directory.SlackDirectoryLoader(...)

Load from a Slack directory dump.

document_loaders.snowflake_loader.SnowflakeLoader(...)

Load from Snowflake API.

document_loaders.spider.SpiderLoader(url, *)

Load web pages as Documents using Spider AI.

document_loaders.spreedly.SpreedlyLoader(...)

Load from Spreedly API.

document_loaders.sql_database.SQLDatabaseLoader(...)

Load documents by querying database tables supported by SQLAlchemy.

document_loaders.srt.SRTLoader(file_path)

Load .srt (subtitle) files.

document_loaders.stripe.StripeLoader(resource)

Load from Stripe API.

document_loaders.surrealdb.SurrealDBLoader([...])

Load SurrealDB documents.

document_loaders.telegram.TelegramChatApiLoader([...])

Load Telegram chat json directory dump.

document_loaders.telegram.TelegramChatFileLoader(path)

Load from Telegram chat dump.

document_loaders.telegram.TelegramChatLoader

alias of TelegramChatFileLoader

document_loaders.tencent_cos_directory.TencentCOSDirectoryLoader(...)

Load from Tencent Cloud COS directory.

document_loaders.tencent_cos_file.TencentCOSFileLoader(...)

Load from Tencent Cloud COS file.

document_loaders.tensorflow_datasets.TensorflowDatasetLoader(...)

Load from TensorFlow Dataset.

document_loaders.text.TextLoader(file_path)

Load text file.

document_loaders.tidb.TiDBLoader(...[, ...])

Load documents from TiDB.

document_loaders.tomarkdown.ToMarkdownLoader(...)

Load HTML using 2markdown API.

document_loaders.toml.TomlLoader(source)

Load TOML files.

document_loaders.trello.TrelloLoader(client, ...)

Load cards from a Trello board.

document_loaders.tsv.UnstructuredTSVLoader(...)

Load TSV files using Unstructured.

document_loaders.twitter.TwitterTweetLoader(...)

Load Twitter tweets.

document_loaders.unstructured.UnstructuredBaseLoader([...])

Base Loader that uses Unstructured.

document_loaders.url.UnstructuredURLLoader(urls)

Load files from remote URLs using Unstructured.

document_loaders.url_playwright.PlaywrightEvaluator()

Abstract base class for all evaluators.

document_loaders.url_playwright.PlaywrightURLLoader(urls)

Load HTML pages with Playwright and parse with Unstructured.

document_loaders.url_playwright.UnstructuredHtmlEvaluator([...])

Evaluate the page HTML content using the unstructured library.

document_loaders.url_selenium.SeleniumURLLoader(urls)

Load HTML pages with Selenium and parse with Unstructured.

document_loaders.vsdx.VsdxLoader(file_path)

Initialize with file path.

document_loaders.weather.WeatherDataLoader(...)

Load weather data with Open Weather Map API.

document_loaders.web_base.WebBaseLoader([...])

WebBaseLoader document loader integration

document_loaders.whatsapp_chat.WhatsAppChatLoader(path)

Load WhatsApp messages text file.

document_loaders.wikipedia.WikipediaLoader(query)

Load from Wikipedia.

document_loaders.word_document.Docx2txtLoader(...)

Load DOCX file using docx2txt and chunks at character level.

document_loaders.word_document.UnstructuredWordDocumentLoader(...)

Load Microsoft Word file using Unstructured.

document_loaders.xml.UnstructuredXMLLoader(...)

Load XML file using Unstructured.

document_loaders.xorbits.XorbitsLoader(...)

Load Xorbits DataFrame.

document_loaders.youtube.GoogleApiClient([...])

Generic Google API Client.

document_loaders.youtube.GoogleApiYoutubeLoader(...)

Load all Videos from a YouTube Channel.

document_loaders.youtube.TranscriptFormat(value)

Output formats of transcripts from YoutubeLoader.

document_loaders.youtube.YoutubeLoader(video_id)

Load YouTube video transcripts.

document_loaders.yuque.YuqueLoader(access_token)

Load documents from Yuque.

Functions

document_loaders.base_o365.fetch_extensions(...)

Fetch the file extensions for the specified file types.

document_loaders.base_o365.fetch_mime_types(...)

Fetch the mime types for the specified file types.

document_loaders.chatgpt.concatenate_rows(...)

Combine message information in a readable format ready to be used.

document_loaders.facebook_chat.concatenate_rows(row)

Combine message information in a readable format ready to be used.

document_loaders.helpers.detect_file_encodings(...)

Try to detect the file encoding.
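
One simple way to approach encoding detection is to try candidate encodings in order and keep the first that decodes cleanly. The helper first_working_encoding below is a hypothetical stdlib-only sketch of that idea and is not how detect_file_encodings works internally.

```python
from typing import List, Optional


def first_working_encoding(data: bytes, candidates: List[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes `data` without error."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# Latin-1 bytes for "café" are not valid UTF-8, so the second candidate wins.
guess = first_working_encoding("café".encode("latin-1"), ["utf-8", "latin-1"])
```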

document_loaders.notebook.concatenate_cells(...)

Combine cells information in a readable format ready to be used.

document_loaders.notebook.remove_newlines(x)

Recursively remove newlines, no matter the data structure they are stored in.
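
Recursive newline removal over nested data can be sketched as follows. The function strip_newlines is a hypothetical illustration covering strings, lists, tuples, and dicts. The library function may handle a different set of container types.

```python
from typing import Any


def strip_newlines(value: Any) -> Any:
    """Remove newline characters from strings nested at any depth."""
    if isinstance(value, str):
        return value.replace("\n", "")
    if isinstance(value, list):
        return [strip_newlines(item) for item in value]
    if isinstance(value, tuple):
        return tuple(strip_newlines(item) for item in value)
    if isinstance(value, dict):
        return {key: strip_newlines(item) for key, item in value.items()}
    # Anything else (numbers, None, ...) is returned unchanged.
    return value


cleaned = strip_newlines({"cell": ["a\nb", ("c\n", 1)]})
```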

document_loaders.parsers.pdf.extract_from_images_with_rapidocr(images)

Extract text from images with RapidOCR.

document_loaders.parsers.registry.get_parser(...)

Get a parser by parser name.

document_loaders.rocksetdb.default_joiner(docs)

Default joiner for content columns.

document_loaders.telegram.concatenate_rows(row)

Combine message information in a readable format ready to be used.

document_loaders.telegram.text_to_docs(text)

Convert a string or list of strings to a list of Documents with metadata.
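
Accepting either a string or a list of strings usually means normalizing the input to a list first. The sketch below shows that step with a hypothetical texts_to_documents helper. The dict-based document shape and the "page" metadata key are assumptions, and the library function may do more than this.

```python
from typing import Dict, List, Union


def texts_to_documents(text: Union[str, List[str]]) -> List[Dict]:
    """Wrap one string or many strings as document dicts with page metadata."""
    # Normalize a single string to a one-element list.
    pages = [text] if isinstance(text, str) else list(text)
    return [
        {"page_content": page, "metadata": {"page": number}}
        for number, page in enumerate(pages, start=1)
    ]


single = texts_to_documents("hello")
many = texts_to_documents(["a", "b"])
```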

document_loaders.unstructured.get_elements_from_api([...])

Retrieve a list of elements from the Unstructured API.

document_loaders.unstructured.satisfies_min_unstructured_version(...)

Check if the installed Unstructured version exceeds the minimum version for the feature in question.

document_loaders.unstructured.validate_unstructured_version(...)

Raise an error if the Unstructured version does not exceed the specified minimum.

document_loaders.whatsapp_chat.concatenate_rows(...)

Combine message information in a readable format ready to be used.

Deprecated classes