HTMLSemanticPreservingSplitter

class langchain_text_splitters.html.HTMLSemanticPreservingSplitter(headers_to_split_on: List[Tuple[str, str]], *, max_chunk_size: int = 1000, chunk_overlap: int = 0, separators: List[str] | None = None, elements_to_preserve: List[str] | None = None, preserve_links: bool = False, preserve_images: bool = False, preserve_videos: bool = False, preserve_audio: bool = False, custom_handlers: Dict[str, Callable[[Any], str]] | None = None, stopword_removal: bool = False, stopword_lang: str = 'english', normalize_text: bool = False, external_metadata: Dict[str, str] | None = None, allowlist_tags: List[str] | None = None, denylist_tags: List[str] | None = None, preserve_parent_metadata: bool = False, keep_separator: bool | Literal['start', 'end'] = True)

Beta

This feature is in beta. It is actively being worked on, so the API may change.

Split HTML content preserving semantic structure.

Splits HTML content by headers into generalized chunks, preserving semantic structure. If chunks exceed the maximum chunk size, it uses RecursiveCharacterTextSplitter for further splitting.

The splitter keeps the HTML elements listed in elements_to_preserve (e.g., <table>, <ul>) intact and, when preserve_links is enabled, converts links to Markdown-style links. It can likewise convert images, videos, and audio elements to Markdown format when the corresponding preserve_* option is enabled. Because preserved elements are never broken apart, some chunks may exceed max_chunk_size to maintain semantic integrity.

Parameters:

headers_to_split_on (List[Tuple[str, str]]) – HTML headers (e.g., “h1”, “h2”) that define content sections.
max_chunk_size (int) – Maximum size for each chunk, with allowance for exceeding this limit to preserve semantics.
chunk_overlap (int) – Number of characters to overlap between chunks to ensure contextual continuity.
separators (List[str]) – Delimiters used by RecursiveCharacterTextSplitter for further splitting.
elements_to_preserve (List[str]) – HTML tags (e.g., <table>, <ul>) to remain intact during splitting.
preserve_links (bool) – Converts <a> tags to Markdown links ([text](url)).
preserve_images (bool) – Converts <img> tags to Markdown images (![alt](src)).
preserve_videos (bool) – Converts <video> tags to Markdown video links (![video](src)).
preserve_audio (bool) – Converts <audio> tags to Markdown audio links (![audio](src)).
custom_handlers (Dict[str, Callable[[Any], str]]) – Optional custom handlers for specific HTML tags, allowing tailored extraction or processing.
stopword_removal (bool) – Optionally remove stopwords from the text.
stopword_lang (str) – The language of stopwords to remove.
normalize_text (bool) – Optionally normalize text (e.g., lowercasing, removing punctuation).
external_metadata (Optional[Dict[str, str]]) – Additional metadata to attach to the Document objects.
allowlist_tags (Optional[List[str]]) – Only these tags will be retained in the HTML.
denylist_tags (Optional[List[str]]) – These tags will be removed from the HTML.
preserve_parent_metadata (bool) – Whether to pass the parent document's metadata through to the split documents when calling transform_documents() or atransform_documents().
keep_separator (Union[bool, Literal["start", "end"]]) – Whether separators should be at the beginning of a chunk, at the end, or not at all.
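
To make the Markdown forms named for preserve_links and preserve_images concrete, here is a minimal standard-library sketch that rewrites <a> and <img> tags as [text](url) and ![alt](src). It is not the library's implementation (which parses HTML with BeautifulSoup); the names MarkdownLinkSketch and to_markdown_sketch are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class MarkdownLinkSketch(HTMLParser):
    """Illustrative only: rewrites <a> and <img> tags as Markdown-style text."""

    def __init__(self):
        super().__init__()
        self.parts = []        # output fragments, joined at the end
        self._href = None      # href of the <a> currently open, if any
        self._link_text = []   # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href = attrs.get("href", "")
            self._link_text = []
        elif tag == "img":
            # Same shape as the preserve_images description: ![alt](src)
            self.parts.append(f"![{attrs.get('alt', '')}]({attrs.get('src', '')})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Same shape as the preserve_links description: [text](url)
            self.parts.append(f"[{''.join(self._link_text)}]({self._href})")
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.parts.append(data)


def to_markdown_sketch(html):
    """Return the input with links and images rewritten; other tags are dropped."""
    parser = MarkdownLinkSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = 'See <a href="https://example.com">the docs</a> and <img src="logo.png" alt="Logo">.'
print(to_markdown_sketch(sample))
# prints: See [the docs](https://example.com) and ![Logo](logo.png).
```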

Example

from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

def custom_iframe_extractor(iframe_tag):
    """
    Custom handler function to extract the 'src' attribute from an <iframe> tag.
    Converts the iframe to a Markdown-like link: [iframe:<src>](src).

    Args:
        iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.

    Returns:
        str: A formatted string representing the iframe in Markdown-like format.
    """
    iframe_src = iframe_tag.get('src', '')
    return f"[iframe:{iframe_src}]({iframe_src})"

text_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    max_chunk_size=500,
    preserve_links=True,
    preserve_images=True,
    custom_handlers={"iframe": custom_iframe_extractor}
)
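
Because the handler in the example only calls .get('src', '') on its argument, its output format can be checked in isolation. The sketch below uses a plain dict as a stand-in for the bs4 Tag; that stand-in is an assumption made for quick testing, since the splitter itself passes a real tag object.

```python
def custom_iframe_extractor(iframe_tag):
    """Return a Markdown-like link for an <iframe>, as in the example above."""
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


# A dict exposes the same .get(key, default) call the handler relies on.
fake_iframe = {"src": "https://example.com/embed"}
print(custom_iframe_extractor(fake_iframe))
# prints: [iframe:https://example.com/embed](https://example.com/embed)

# A tag without a src attribute falls back to the empty-string default.
print(custom_iframe_extractor({}))
# prints: [iframe:]()
```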


Methods

`__init__`(headers_to_split_on, *[, ...])	Initialize splitter.
`atransform_documents`(documents, **kwargs)	Asynchronously transform a list of documents.
`split_text`(text)	Splits the provided HTML text into smaller chunks based on the configuration.
`transform_documents`(documents, **kwargs)	Transform sequence of documents by splitting them.

__init__(headers_to_split_on: List[Tuple[str, str]], *, max_chunk_size: int = 1000, chunk_overlap: int = 0, separators: List[str] | None = None, elements_to_preserve: List[str] | None = None, preserve_links: bool = False, preserve_images: bool = False, preserve_videos: bool = False, preserve_audio: bool = False, custom_handlers: Dict[str, Callable[[Any], str]] | None = None, stopword_removal: bool = False, stopword_lang: str = 'english', normalize_text: bool = False, external_metadata: Dict[str, str] | None = None, allowlist_tags: List[str] | None = None, denylist_tags: List[str] | None = None, preserve_parent_metadata: bool = False, keep_separator: bool | Literal['start', 'end'] = True)

Initialize splitter.

Parameters:

headers_to_split_on (List[Tuple[str, str]])
max_chunk_size (int)
chunk_overlap (int)
separators (List[str] | None)
elements_to_preserve (List[str] | None)
preserve_links (bool)
preserve_images (bool)
preserve_videos (bool)
preserve_audio (bool)
custom_handlers (Dict[str, Callable[[Any], str]] | None)
stopword_removal (bool)
stopword_lang (str)
normalize_text (bool)
external_metadata (Dict[str, str] | None)
allowlist_tags (List[str] | None)
denylist_tags (List[str] | None)
preserve_parent_metadata (bool)
keep_separator (bool | Literal['start', 'end'])

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a list of documents.

Parameters:

documents (Sequence[Document]) – A sequence of Documents to be transformed.
kwargs (Any)

Returns:

A sequence of transformed Documents.

Return type:

Sequence[Document]

split_text(text: str) → List[Document]

Splits the provided HTML text into smaller chunks based on the configuration.

Parameters:

text (str) – The HTML content to be split.

Returns:

A list of Document objects containing the split content.

Return type:

List[Document]

transform_documents(documents: Sequence[Document], **kwargs: Any) → List[Document]

Transform sequence of documents by splitting them.

Parameters:

documents (Sequence[Document])
kwargs (Any)

Return type:

List[Document]