HTMLHeaderTextSplitter

class langchain_text_splitters.html.HTMLHeaderTextSplitter(headers_to_split_on: list[tuple[str, str]], return_each_element: bool = False)

Split HTML content into structured Documents based on specified headers.

Splits HTML content by detecting specified header tags (e.g., <h1>, <h2>) and creating hierarchical Document objects that reflect the semantic structure of the original content. For each identified section, the splitter associates the extracted text with metadata corresponding to the encountered headers.

If none of the specified headers are found, the entire content is returned as a single Document, so input without matching headers is still processed rather than rejected.

The splitter can either return each HTML element as a separate Document or aggregate elements that fall under the same header hierarchy into larger chunks. Nested header levels are supported: the metadata of each Document records every specified header in effect at that point in the content.
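To make the header-to-metadata association concrete, here is a minimal standard-library sketch of the idea. It is an illustration only, not the library's implementation: the `HeaderTracker` class is hypothetical, it assumes header tags of the form h1 to h6, and it ignores inline tags nested inside headers.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Hypothetical helper: pair each text node with the headers in effect."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag -> metadata key, e.g. {"h1": "Main Topic", "h2": "Sub Topic"}.
        self.names = dict(headers_to_split_on)
        self.active = {}        # tag -> header text currently in effect
        self.current_tag = None
        self.sections = []      # list of (text, metadata) tuples

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in self.names:
            # A new header replaces any header of the same or a deeper level.
            level = int(self.current_tag[1])
            self.active = {
                t: v for t, v in self.active.items() if int(t[1]) < level
            }
            self.active[self.current_tag] = text
        metadata = {self.names[t]: v for t, v in self.active.items()}
        self.sections.append((text, metadata))


html_content = (
    "<h1>Introduction</h1><p>Welcome.</p>"
    "<h2>Background</h2><p>Details.</p>"
    "<h1>Conclusion</h1><p>Final thoughts.</p>"
)
tracker = HeaderTracker([("h1", "Main Topic"), ("h2", "Sub Topic")])
tracker.feed(html_content)
tracker.close()
for text, metadata in tracker.sections:
    print(metadata, "->", text)
```

Note how the second h1 clears the "Sub Topic" entry: a header resets every deeper level, which is what keeps the metadata aligned with the document hierarchy.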

Example

from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=False
)

html_content = """
<html>
  <body>
    <h1>Introduction</h1>
    <p>Welcome to the introduction section.</p>
    <h2>Background</h2>
    <p>Some background details here.</p>
    <h1>Conclusion</h1>
    <p>Final thoughts.</p>
  </body>
</html>
"""

documents = splitter.split_text(html_content)

# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Background"
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Final thoughts."

Initialize with headers to split on.

Parameters:

headers_to_split_on (list[tuple[str, str]]) – A list of (header_tag, header_name) pairs representing the headers that define splitting boundaries. For example, [("h1", "Header 1"), ("h2", "Header 2")] will split content by <h1> and <h2> tags, assigning their textual content to the Document metadata.
return_each_element (bool) – If True, every HTML element encountered (including headers, paragraphs, etc.) is returned as a separate Document. If False, content under the same header hierarchy is aggregated into fewer Documents.
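The effect of return_each_element can be sketched with plain tuples. This is a simplified illustration, not the library's code: `group_elements` is a hypothetical helper, and it assumes (as the example above suggests) that header text is always emitted as its own Document while body text between headers is merged when return_each_element is False.

```python
def group_elements(elements, return_each_element=False):
    """Group (text, metadata, is_header) tuples given in document order.

    Hypothetical helper that mimics the two output modes described above.
    """
    if return_each_element:
        # One output item per element, no merging.
        return [(text, metadata) for text, metadata, _ in elements]

    chunks = []
    pending, pending_meta = [], {}
    for text, metadata, is_header in elements:
        if is_header:
            # Flush body text collected under the previous header.
            if pending:
                chunks.append((" ".join(pending), pending_meta))
                pending = []
            chunks.append((text, metadata))
        else:
            pending.append(text)
            pending_meta = metadata
    if pending:
        chunks.append((" ".join(pending), pending_meta))
    return chunks


intro = {"Header 1": "Intro"}
end = {"Header 1": "End"}
elements = [
    ("Intro", intro, True),
    ("First paragraph.", intro, False),
    ("Second paragraph.", intro, False),
    ("End", end, True),
    ("Bye.", end, False),
]
separate = group_elements(elements, return_each_element=True)
merged = group_elements(elements, return_each_element=False)
print(len(separate), len(merged))
```

With return_each_element=True every element stays separate; with False the two paragraphs under "Intro" collapse into a single chunk that shares their metadata.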

Methods

`__init__`(headers_to_split_on[, ...])	Initialize with headers to split on.
`split_text`(text)	Split the given text into a list of Document objects.
`split_text_from_file`(file)	Split HTML content from a file into a list of Document objects.
`split_text_from_url`(url[, timeout])	Fetch text content from a URL and split it into documents.

__init__(headers_to_split_on: list[tuple[str, str]], return_each_element: bool = False) → None

Initialize with headers to split on.

Parameters:

headers_to_split_on (list[tuple[str, str]]) – A list of (header_tag, header_name) pairs representing the headers that define splitting boundaries. For example, [("h1", "Header 1"), ("h2", "Header 2")] will split content by <h1> and <h2> tags, assigning their textual content to the Document metadata.
return_each_element (bool) – If True, every HTML element encountered (including headers, paragraphs, etc.) is returned as a separate Document. If False, content under the same header hierarchy is aggregated into fewer Documents.

Return type:

None

split_text(text: str) → list[Document]

Split the given text into a list of Document objects.

Parameters:

text (str) – The HTML text to split.

Returns:

A list of split Document objects. Each Document contains page_content holding the extracted text and metadata that maps the header hierarchy to their corresponding titles.

Return type:

list[Document]

split_text_from_file(file: str | IO[str]) → list[Document]

Split HTML content from a file into a list of Document objects.

Parameters:

file (str | IO[str]) – A file path or a file-like object containing HTML content.

Returns:

A list of split Document objects. Each Document contains page_content holding the extracted text and metadata that maps the header hierarchy to their corresponding titles.

Return type:

list[Document]
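Because the file argument may be either a path or an open file-like object, callers sometimes want to normalize both cases themselves, for example to pass the HTML to split_text. The sketch below shows one way to do that; `read_html` is a hypothetical helper, not part of the library, and the UTF-8 encoding is an assumption.

```python
import io
from typing import IO, Union


def read_html(file: Union[str, IO[str]]) -> str:
    """Hypothetical helper: return HTML text from a path or a text stream."""
    if isinstance(file, str):
        # Treat a string as a filesystem path (UTF-8 is assumed here).
        with open(file, encoding="utf-8") as handle:
            return handle.read()
    # Otherwise assume a text-mode, file-like object with a read() method.
    return file.read()


stream = io.StringIO("<h1>Title</h1><p>Body text.</p>")
html_text = read_html(stream)
print(html_text)
```

An in-memory io.StringIO is convenient in tests, since it exercises the file-like branch without touching the filesystem.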

split_text_from_url(url: str, timeout: int = 10, **kwargs: Any) → list[Document]

Fetch text content from a URL and split it into documents.

Parameters:

url (str) – The URL to fetch content from.
timeout (int) – Timeout for the request, in seconds. Defaults to 10.
**kwargs (Any) – Additional keyword arguments for the request.

Returns:

A list of split Document objects. Each Document contains page_content holding the extracted text and metadata that maps the header hierarchy to their corresponding titles.

Raises:

requests.RequestException – If the HTTP request fails.

Return type:

list[Document]

Examples using HTMLHeaderTextSplitter

How to split by HTML header