ExperimentalMarkdownSyntaxTextSplitter
- class langchain_text_splitters.markdown.ExperimentalMarkdownSyntaxTextSplitter(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)
An experimental text splitter for handling Markdown syntax.
This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.
Key Features:
- Retains the original whitespace and formatting of the Markdown text.
- Extracts headers, code blocks, and horizontal rules as metadata.
- Splits out code blocks and includes the language in the "Code" metadata key.
- Splits text on horizontal rules (---) as well.
- Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.
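The features listed above can be pictured with a small standard-library sketch. This is not the library's implementation: `sketch_split`, the regular expressions, and the `(content, metadata)` tuple shape are all hypothetical, chosen only to illustrate splitting on headers, code fences, and horizontal rules while keeping the original whitespace. Dropping whitespace-only chunks is likewise an assumption of the sketch.

```python
import re
from typing import Dict, List, Optional, Tuple

# All names below are hypothetical; this only illustrates the described behavior.
FENCE = "`" * 3  # a Markdown code fence, built programmatically
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")            # "# Title" .. "###### Title"
FENCE_RE = re.compile(r"^`{3}(\w*)")                 # opening fence plus language
RULE_RE = re.compile(r"^(\*{3,}|-{3,}|_{3,})\s*$")   # horizontal rule

Chunk = Tuple[str, Dict[str, str]]  # (content, metadata)


def sketch_split(text: str) -> List[Chunk]:
    chunks: List[Chunk] = []
    headers: List[Tuple[int, str]] = []  # stack of (depth, title)
    buffer = ""

    def flush(extra: Optional[Dict[str, str]] = None) -> None:
        """Emit the buffered text as a chunk tagged with the active headers."""
        nonlocal buffer
        if buffer and not buffer.isspace():  # assumption: skip blank-only chunks
            meta = {f"Header {depth}": title for depth, title in headers}
            meta.update(extra or {})
            chunks.append((buffer, meta))
        buffer = ""

    lines = text.splitlines(keepends=True)  # keepends preserves exact whitespace
    i = 0
    while i < len(lines):
        line = lines[i]
        header = HEADER_RE.match(line)
        fence = FENCE_RE.match(line)
        if header:
            flush()
            depth = len(header.group(1))
            # A new header replaces any header at the same or a deeper level.
            while headers and headers[-1][0] >= depth:
                headers.pop()
            headers.append((depth, header.group(2).strip()))
        elif fence:
            flush()
            buffer = line
            i += 1
            # Consume lines up to and including the closing fence.
            while i < len(lines):
                buffer += lines[i]
                if lines[i].startswith(FENCE):
                    break
                i += 1
            flush({"Code": fence.group(1)})
        elif RULE_RE.match(line):
            flush()  # a rule ends the current chunk and is not kept
        else:
            buffer += line
        i += 1
    flush()
    return chunks


sample = (
    "# Title\n\nIntro text\n\n## Setup\n"
    + FENCE + "python\nprint('hi')\n" + FENCE + "\n"
    + "---\nAfter the rule\n"
)
for content, metadata in sketch_split(sample):
    print(repr(content), metadata)
```

In this sketch the headers are stripped from the chunk content and survive only as metadata, which mirrors what the signature's strip_headers=True default suggests; the real class may differ in details.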
Parameters:
- headers_to_split_on : List[Tuple[str, str]], optional
Headers to split on, given as (marker, metadata key) pairs such as ("#", "Header 1"). If not specified, the defaults in DEFAULT_HEADER_KEYS are used.
- return_each_line : bool, optional
When set to True, returns each line as a separate chunk. Default is False.
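As a rough illustration of what return_each_line=True means, the sketch below expands chunks into one chunk per line, with every line inheriting its parent chunk's metadata. The `explode_lines` helper and the `(content, metadata)` tuple shape are hypothetical, and dropping blank lines and trailing newlines is an assumption of this sketch rather than something the text above specifies.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, Dict[str, str]]  # hypothetical (content, metadata) shape


def explode_lines(chunks: List[Chunk]) -> List[Chunk]:
    """Return one chunk per non-blank line, copying the parent's metadata."""
    return [
        (line, dict(meta))  # copy so per-line chunks do not share one dict
        for content, meta in chunks
        for line in content.splitlines()
        if line and not line.isspace()  # assumption: blank lines are dropped
    ]


chunks = [("\nFirst line\nSecond line\n\n", {"Header 1": "Title"})]
per_line = explode_lines(chunks)
for content, metadata in per_line:
    print(repr(content), metadata)
```

The same post-processing can be done by the caller on the default output, which gives more control over how blank lines are treated.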
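Usage example:
>>> from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter
>>> text = "# Title\n\nIntro\n\n## Section\n\nBody\n"
>>> headers_to_split_on = [
...     ("#", "Header 1"),
...     ("##", "Header 2"),
... ]
>>> splitter = ExperimentalMarkdownSyntaxTextSplitter(
...     headers_to_split_on=headers_to_split_on
... )
>>> chunks = splitter.split_text(text)  # split on the configured headers
>>> for chunk in chunks:
...     print(chunk)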
This class is currently experimental and subject to change based on feedback and further development.
Attributes
DEFAULT_HEADER_KEYS
Methods
__init__([headers_to_split_on, ...])
split_text(text)
- Parameters:
headers_to_split_on (Union[List[Tuple[str, str]], None])
return_each_line (bool)
strip_headers (bool)