HTMLSectionSplitter
- class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any)
Splits HTML files based on specified tags and font sizes.
Requires the lxml package.
Create a new HTMLSectionSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]) – List of tuples mapping each header tag to track to an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (Optional[str]) – Path to an XSLT file for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
**kwargs (Any) – Additional optional arguments for customizations.
Methods
__init__(headers_to_split_on[, xslt_path]) – Create a new HTMLSectionSplitter.
convert_possible_tags_to_header(html_content) – Convert specific HTML tags to headers using an XSLT transformation.
create_documents(texts[, metadatas]) – Create documents from a list of texts.
split_documents(documents) – Split documents.
split_html_by_headers(html_doc) – Split an HTML document into sections based on specified header tags.
split_text(text) – Split HTML text string.
split_text_from_file(file) – Split HTML file.
- __init__(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any) → None
Create a new HTMLSectionSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]) – List of tuples mapping each header tag to track to an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (str | None) – Path to an XSLT file for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
**kwargs (Any) – Additional optional arguments for customizations.
- Return type:
None
- convert_possible_tags_to_header(html_content: str) → str
Convert specific HTML tags to headers using an XSLT transformation.
This method uses an XSLT file to transform the HTML content, converting certain tags into headers for easier parsing. If no XSLT path is provided, the HTML content is returned unchanged.
- Parameters:
html_content (str) – The HTML content to be transformed.
- Returns:
The transformed HTML content as a string.
- Return type:
str
- create_documents(texts: List[str], metadatas: List[dict] | None = None) → List[Document]
Create documents from a list of texts.
- Parameters:
texts (List[str])
metadatas (List[dict] | None)
- Return type:
List[Document]
- split_html_by_headers(html_doc: str) → List[Dict[str, str | None]]
Split an HTML document into sections based on specified header tags.
This method uses BeautifulSoup to parse the HTML content and divides it into sections based on headers defined in headers_to_split_on. Each section contains the header text, content under the header, and the tag name.
- Parameters:
html_doc (str) – The HTML document to be split into sections.
- Returns:
A list of dictionaries representing sections.
Each dictionary contains:
- 'header': The header text or a default title for the first section.
- 'content': The content under the header.
- 'tag_name': The name of the header tag (e.g., "h1", "h2").
- Return type:
List[Dict[str, Optional[str]]]