ExperimentalMarkdownSyntaxTextSplitter

class langchain_text_splitters.markdown.ExperimentalMarkdownSyntaxTextSplitter(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)

An experimental text splitter for handling Markdown syntax.

This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.

Key Features:

- Retains the original whitespace and formatting of the Markdown text.
- Extracts headers, code blocks, and horizontal rules as metadata.
- Splits out code blocks and includes the language in the "Code" metadata key.
- Splits text on horizontal rules (---) as well.
- Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.

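Parameters:

headers_to_split_on (List[Tuple[str, str]], optional): Headers to split on, given as (marker, metadata key) pairs such as ("#", "Header 1"). Defaults to the common Markdown headers if not specified.
return_each_line (bool, optional): When set to True, returns each line as a separate chunk. Default is False.
strip_headers (bool, optional): When set to True, header lines are removed from the chunk content and kept only as metadata. Default is True.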

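Usage example:

>>> from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter
>>> # Illustrative sample input
>>> text = "# Title\n\nIntro text.\n\n## Section\n\nSection text.\n"
>>> headers_to_split_on = [
...     ("#", "Header 1"),
...     ("##", "Header 2"),
... ]
>>> splitter = ExperimentalMarkdownSyntaxTextSplitter(
...     headers_to_split_on=headers_to_split_on
... )
>>> chunks = splitter.split_text(text)
>>> for chunk in chunks:
...     print(chunk)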

This class is currently experimental and subject to change based on feedback and further development.

Attributes

DEFAULT_HEADER_KEYS

Methods

`__init__`([headers_to_split_on, ...])
`split_text`(text)

__init__(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)

Parameters:

headers_to_split_on (List[Tuple[str, str]] | None)
return_each_line (bool)
strip_headers (bool)

split_text(text: str) → List[Document]

Parameters: text (str)
Return type: List[Document]
