HTMLSectionSplitter

class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: list[tuple[str, str]], **kwargs: Any)

Splits HTML files into sections based on specified tags and font sizes.

Requires the lxml package.

Create a new HTMLSectionSplitter.

Parameters:

headers_to_split_on (list[tuple[str, str]]) – List of tuples pairing each header tag to track with an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
**kwargs (Any) – Additional optional arguments for customization.
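
A minimal usage sketch, assuming `langchain-text-splitters`, `lxml`, and `beautifulsoup4` are installed. The sample HTML and the `split_sample` helper are illustrative and not part of the library. The helper returns `None` when a dependency is missing, so the snippet runs either way.

```python
SAMPLE_HTML = """
<html>
  <body>
    <h1>Introduction</h1>
    <p>Opening paragraph.</p>
    <h2>Background</h2>
    <p>Some background text.</p>
  </body>
</html>
"""

# Each tuple pairs a header tag with the metadata key to record it under.
HEADERS_TO_SPLIT_ON = [("h1", "Header 1"), ("h2", "Header 2")]


def split_sample(html: str = SAMPLE_HTML):
    """Hypothetical helper: split `html` by section, or return None if a dependency is missing."""
    try:
        from langchain_text_splitters.html import HTMLSectionSplitter

        splitter = HTMLSectionSplitter(headers_to_split_on=HEADERS_TO_SPLIT_ON)
        return splitter.split_text(html)
    except ImportError:
        return None


if __name__ == "__main__":
    docs = split_sample()
    if docs is None:
        print("Optional dependencies are not installed.")
    else:
        # Each result is a Document carrying page_content and metadata.
        for doc in docs:
            print(doc.metadata, repr(doc.page_content))
```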

Methods

`__init__`(headers_to_split_on, **kwargs)	Create a new HTMLSectionSplitter.
`convert_possible_tags_to_header`(html_content)	Convert specific HTML tags to headers using an XSLT transformation.
`create_documents`(texts[, metadatas])	Create documents from a list of texts.
`split_documents`(documents)	Split documents.
`split_html_by_headers`(html_doc)	Split an HTML document into sections based on specified header tags.
`split_text`(text)	Split HTML text string.
`split_text_from_file`(file)	Split HTML content from a file into a list of Document objects.

__init__(headers_to_split_on: list[tuple[str, str]], **kwargs: Any) → None

Create a new HTMLSectionSplitter.

Parameters:

headers_to_split_on (list[tuple[str, str]]) – List of tuples pairing each header tag to track with an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
**kwargs (Any) – Additional optional arguments for customization.

Return type:

None

convert_possible_tags_to_header(html_content: str) → str

Convert specific HTML tags to headers using an XSLT transformation.

This method uses an XSLT file to transform the HTML content, converting certain tags into headers for easier parsing. If no XSLT path is provided, the HTML content is returned unchanged.

Parameters: html_content (str) – The HTML content to be transformed.
Returns: The transformed HTML content as a string.
Return type: str

create_documents(texts: list[str], metadatas: list[dict[Any, Any]] | None = None) → list[Document]

Create documents from a list of texts.

Parameters:

texts (list[str])
metadatas (list[dict[Any, Any]] | None)

Return type:

list[Document]

split_documents(documents: Iterable[Document]) → list[Document]

Split documents.

Parameters: documents (Iterable[Document])
Return type: list[Document]

split_html_by_headers(html_doc: str) → list[dict[str, str | None]]

Split an HTML document into sections based on specified header tags.

This method uses BeautifulSoup to parse the HTML content and divides it into sections based on headers defined in headers_to_split_on. Each section contains the header text, content under the header, and the tag name.

Parameters:

html_doc (str) – The HTML document to be split into sections.

Returns:

A list of dictionaries representing sections. Each dictionary contains:

- 'header': The header text or a default title for the first section.
- 'content': The content under the header.
- 'tag_name': The name of the header tag (e.g., "h1", "h2").

Return type:

list[dict[str, str | None]]
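
To make the return shape concrete, the sketch below consumes a hand-written list in the documented format and pairs each section with the metadata key configured in headers_to_split_on. The sample values and the `section_metadata` helper are assumptions for illustration, not actual library output or API.

```python
# Hand-written sample in the documented shape (not real library output).
sections = [
    {"header": "Introduction", "content": "Opening paragraph.", "tag_name": "h1"},
    {"header": "Background", "content": "Some background text.", "tag_name": "h2"},
]

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]


def section_metadata(section, headers):
    """Hypothetical helper: map a section to {metadata key: header text}."""
    key_by_tag = dict(headers)  # e.g. {"h1": "Header 1", "h2": "Header 2"}
    return {key_by_tag[section["tag_name"]]: section["header"]}


for section in sections:
    print(section_metadata(section, headers_to_split_on), "->", section["content"])
```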

split_text(text: str) → list[Document]

Split HTML text string.

Parameters: text (str) – The HTML text to split.
Return type: list[Document]

split_text_from_file(file: Any) → list[Document]

Split HTML content from a file into a list of Document objects.

Parameters: file (Any) – A file path or a file-like object containing HTML content.
Returns: A list of split Document objects.
Return type: list[Document]
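
A short sketch of passing a file-like object, assuming `langchain-text-splitters`, `lxml`, and `beautifulsoup4` are installed. Here `io.StringIO` stands in for an open HTML file. The `split_file_like` helper and the sample markup are illustrative, not part of the library.

```python
import io

HTML_PAGE = "<html><body><h1>Title</h1><p>Body text.</p></body></html>"


def split_file_like(html: str = HTML_PAGE):
    """Hypothetical helper: returns None when a dependency is unavailable."""
    try:
        from langchain_text_splitters.html import HTMLSectionSplitter

        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])
        # StringIO is an in-memory file-like object holding the HTML content.
        return splitter.split_text_from_file(io.StringIO(html))
    except ImportError:
        return None
```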

Examples using HTMLSectionSplitter

How to split by HTML sections