How to load HTML

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.

This covers how to load HTML documents into a LangChain Document objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.

Loading HTML with Unstructured

%pip install unstructured

from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'})]

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This will extract the text from the HTML into page_content, and the page title as title into metadata.

%pip install bs4

from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]

Loading HTML with Unstructured​

Loading HTML with BeautifulSoup4​

Loading HTML with Unstructured

Loading HTML with BeautifulSoup4