HTMLHeaderTextSplitter

class langchain_text_splitters.html.HTMLHeaderTextSplitter(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)

Splits HTML files based on specified headers.

Requires the lxml package.

Create a new HTMLHeaderTextSplitter.

Parameters:
  • headers_to_split_on (List[Tuple[str, str]]) – list of tuples, each pairing a header to track with an (arbitrary) key under which it is stored in the metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • return_each_element (bool) – Return each element with its associated headers.

Methods

__init__(headers_to_split_on[, ...])

Create a new HTMLHeaderTextSplitter.

aggregate_elements_to_chunks(elements)

Combine elements with common metadata into chunks.

split_text(text)

Split HTML text string.

split_text_from_file(file)

Split HTML file.

split_text_from_url(url, **kwargs)

Split HTML from web URL.

__init__(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)

Create a new HTMLHeaderTextSplitter.

Parameters:
  • headers_to_split_on (List[Tuple[str, str]]) – list of tuples, each pairing a header to track with an (arbitrary) key under which it is stored in the metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • return_each_element (bool) – Return each element with its associated headers.

aggregate_elements_to_chunks(elements: List[ElementType]) → List[Document]

Combine elements with common metadata into chunks.

Parameters:

elements (List[ElementType]) – HTML element content with associated identifying info and metadata

Return type:

List[Document]

split_text(text: str) → List[Document]

Split HTML text string.

Parameters:

text (str) – HTML text

Return type:

List[Document]

split_text_from_file(file: Any) → List[Document]

Split HTML file.

Parameters:

file (Any) – HTML file

Return type:

List[Document]

split_text_from_url(url: str, **kwargs: Any) → List[Document]

Split HTML from web URL.

Parameters:
  • url (str) – web URL

  • **kwargs (Any) – Arbitrary additional keyword arguments. These are usually passed to the request that fetches the URL content.

Return type:

List[Document]
