HTMLSectionSplitter

class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any)

Splits HTML files based on specified tags and font sizes. Requires the lxml package.

Create a new HTMLSectionSplitter.

Parameters:
  • headers_to_split_on (List[Tuple[str, str]]) – list of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • xslt_path (Optional[str]) – path to the XSLT file for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.

  • kwargs (Any) –
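
A minimal usage sketch of the constructor and split_text, assuming the langchain-text-splitters package and lxml are installed (some releases also import beautifulsoup4 when splitting). The helper name split_sample and the SAMPLE_HTML string are hypothetical and exist only for illustration; the import sits inside the function so the module still loads when the optional dependencies are missing.

```python
from typing import Any, List, Tuple

# Hypothetical sample input, not taken from the library's documentation.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Introduction</h1>
    <p>Some introductory text.</p>
    <h2>Details</h2>
    <p>More specific text.</p>
  </body>
</html>
"""


def split_sample(html: str, headers: List[Tuple[str, str]]) -> List[Any]:
    """Split an HTML string into Document objects, one per tracked section."""
    # Imported lazily because the splitter depends on optional packages.
    from langchain_text_splitters.html import HTMLSectionSplitter

    # Each tuple maps a header tag to the metadata key used for that level.
    splitter = HTMLSectionSplitter(headers_to_split_on=headers)
    return splitter.split_text(html)
```

Calling split_sample(SAMPLE_HTML, [("h1", "Header 1"), ("h2", "Header 2")]) should return a list of Document objects whose metadata uses the keys given in the tuples. The exact section boundaries depend on the XSLT transformation in use, so no output is shown here.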

Methods

__init__(headers_to_split_on[, xslt_path])

Create a new HTMLSectionSplitter.

convert_possible_tags_to_header(html_content)

create_documents(texts[, metadatas])

Create documents from a list of texts.

split_documents(documents)

Split documents.

split_html_by_headers(html_doc)

split_text(text)

Split HTML text string

split_text_from_file(file)

Split HTML file

__init__(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any) → None

Create a new HTMLSectionSplitter.

Parameters:
  • headers_to_split_on (List[Tuple[str, str]]) – list of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • xslt_path (str | None) – path to the XSLT file for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.

  • kwargs (Any) –

Return type:

None

convert_possible_tags_to_header(html_content: str) → str
Parameters:

html_content (str) –

Return type:

str

create_documents(texts: List[str], metadatas: List[dict] | None = None) → List[Document]

Create documents from a list of texts.

Parameters:
  • texts (List[str]) –

  • metadatas (List[dict] | None) –

Return type:

List[Document]

split_documents(documents: Iterable[Document]) → List[Document]

Split documents.

Parameters:

documents (Iterable[Document]) –

Return type:

List[Document]

split_html_by_headers(html_doc: str) → List[Dict[str, str | None]]
Parameters:

html_doc (str) –

Return type:

List[Dict[str, str | None]]
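
The return type suggests one dictionary per section. The following is a standard-library sketch of header-based splitting, not the library's implementation: the class HeaderSectionParser, the function split_by_headers, and the dictionary keys "header", "content", and "tag_name" are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HeaderSectionParser(HTMLParser):
    """Group text into sections that begin at each tracked header tag."""

    def __init__(self, header_tags: List[str]) -> None:
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections: List[Dict[str, Optional[str]]] = []
        self._tag: Optional[str] = None   # tag of the current section's header
        self._in_header = False           # True while inside a header element
        self._header: List[str] = []      # text fragments of the header
        self._content: List[str] = []     # text fragments of the section body

    def _flush(self) -> None:
        # Store the section collected so far; sections without body text are dropped.
        content = " ".join(self._content)
        if content:
            self.sections.append(
                {
                    "header": " ".join(self._header) or None,
                    "content": content,
                    "tag_name": self._tag,
                }
            )
        self._header, self._content = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A tracked header closes the previous section and opens a new one.
            self._flush()
            self._tag = tag
            self._in_header = True

    def handle_endtag(self, tag):
        if self._in_header and tag == self._tag:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self._header if self._in_header else self._content).append(text)

    def close(self):
        super().close()
        self._flush()  # emit the final section


def split_by_headers(html: str, header_tags: List[str]) -> List[Dict[str, Optional[str]]]:
    parser = HeaderSectionParser(header_tags)
    parser.feed(html)
    parser.close()
    return parser.sections
```

For example, split_by_headers("<h1>Intro</h1><p>Hello world.</p><h2>Details</h2><p>More text.</p>", ["h1", "h2"]) yields two dictionaries, the first with header "Intro", content "Hello world.", and tag_name "h1".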

split_text(text: str) → List[Document]

Split HTML text string

Parameters:

text (str) – HTML text

Return type:

List[Document]

split_text_from_file(file: Any) → List[Document]

Split HTML file

Parameters:

file (Any) – HTML file

Return type:

List[Document]
