ExperimentalMarkdownSyntaxTextSplitter

class langchain_text_splitters.markdown.ExperimentalMarkdownSyntaxTextSplitter(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)

An experimental text splitter for handling Markdown syntax.

This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.

Key Features:

- Retains the original whitespace and formatting of the Markdown text.
- Extracts headers, code blocks, and horizontal rules as metadata.
- Splits out code blocks and includes the language in the "Code" metadata key.
- Splits text on horizontal rules (---) as well.
- Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.
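The three kinds of line that the features above depend on (headers, code fences, horizontal rules) can be recognized with simple regular expressions. The following is a minimal sketch, not the library's implementation: the pattern names and the classify_line helper are hypothetical, and the library's own expressions may differ.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Hypothetical patterns for illustration only.
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")             # "## Title" -> depth 2, "Title"
CODE_RE = re.compile(r"^" + re.escape(FENCE) + r"(\w*)")  # fence plus optional language
RULE_RE = re.compile(r"^(\*\*\*+|---+|___+)\s*$")     # ---, ***, ___ on their own line


def classify_line(line: str) -> tuple:
    """Return a (kind, detail) pair describing one line of Markdown."""
    header = HEADER_RE.match(line)
    if header:
        # detail is (depth, header text)
        return ("header", (len(header.group(1)), header.group(2).strip()))
    code = CODE_RE.match(line)
    if code:
        # detail is the language tag, possibly an empty string
        return ("code", code.group(1))
    if RULE_RE.match(line):
        return ("rule", None)
    return ("text", None)
```

In this sketch a header requires a space after the "#" characters, so a line such as "#hashtag" is treated as ordinary text.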

Parameters:

headers_to_split_on : List[Tuple[str, str]], optional

Headers to split on, defaulting to common Markdown headers if not specified.

return_each_line : bool, optional

When set to True, returns each line as a separate chunk. Default is False.

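Usage example:

>>> from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter
>>> headers_to_split_on = [
...     ("#", "Header 1"),
...     ("##", "Header 2"),
... ]
>>> splitter = ExperimentalMarkdownSyntaxTextSplitter(
...     headers_to_split_on=headers_to_split_on
... )
>>> text = "# Title\n\nIntro.\n\n## Section\n\nBody.\n"  # any Markdown string
>>> chunks = splitter.split_text(text)
>>> for chunk in chunks:
...     print(chunk)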

This class is currently experimental and subject to change based on feedback and further development.


Attributes

DEFAULT_HEADER_KEYS

Methods

__init__([headers_to_split_on, ...])

Initialize the text splitter with header splitting and formatting options.

split_text(text)

Split the input text into structured chunks.

__init__(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)

Initialize the text splitter with header splitting and formatting options.

This constructor sets up the required configuration for splitting text into chunks based on specified headers and formatting preferences.

Parameters:
  • headers_to_split_on (Union[List[Tuple[str, str]], None]) – A list of tuples, where each tuple pairs a Markdown header marker (e.g., "#", as in the usage example) with its corresponding metadata key. If None, default headers are used.

  • return_each_line (bool) – Whether to return each line as an individual chunk. Defaults to False, which aggregates lines into larger chunks.

  • strip_headers (bool) – Whether to exclude headers from the resulting chunks. Defaults to True.
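The fallback from headers_to_split_on to the defaults can be pictured as a small lookup table. This is a sketch under an assumption: that the defaults map "#" through "######" to "Header 1" through "Header 6". The contents of DEFAULT_HEADER_KEYS are not listed on this page, and resolve_headers is a hypothetical helper, not part of the class.

```python
from typing import Dict, List, Optional, Tuple

# Assumed defaults: one metadata key per Markdown header level.
DEFAULT_HEADER_KEYS: Dict[str, str] = {
    "#" * level: f"Header {level}" for level in range(1, 7)
}


def resolve_headers(
    headers_to_split_on: Optional[List[Tuple[str, str]]] = None,
) -> Dict[str, str]:
    """Map header markers to metadata keys, using the defaults when none are given."""
    if headers_to_split_on:
        return dict(headers_to_split_on)
    return dict(DEFAULT_HEADER_KEYS)


custom = resolve_headers([("#", "Title"), ("##", "Section")])
defaults = resolve_headers()
```

Here custom contains only the two markers that were passed in, while defaults covers all six header levels.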

split_text(text: str) → List[Document]

Split the input text into structured chunks.

This method processes the input text line by line, detecting headers, code blocks, and horizontal rules, and uses them as boundaries to split the text into structured chunks.

Parameters:

text (str) – The input text to be split into chunks.

Returns:

A list of Document objects representing the structured chunks of the input text. If return_each_line is enabled, each line is returned as a separate Document.
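To make the line-by-line process concrete, here is a simplified, standard-library-only sketch of the approach described above. It is not the library's code: Chunk is a hypothetical stand-in for Document, sketch_split is a hypothetical function, the "Header N" key names are assumed, and details such as header-stack handling may differ from the real splitter.

```python
import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # three backticks
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")
RULE_RE = re.compile(r"^(\*\*\*+|---+|___+)\s*$")


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str = ""
    metadata: dict = field(default_factory=dict)


def sketch_split(text, strip_headers=True, return_each_line=False):
    chunks, stack, content = [], [], ""

    def flush(extra=None):
        # Emit the accumulated text (if any) with the current header metadata.
        nonlocal content
        if content and not content.isspace():
            meta = {f"Header {depth}": title for depth, title in stack}
            meta.update(extra or {})
            chunks.append(Chunk(content, meta))
        content = ""

    lines = text.splitlines(keepends=True)  # keepends preserves original whitespace
    i = 0
    while i < len(lines):
        line = lines[i]
        header = HEADER_RE.match(line)
        if header:
            flush()
            depth = len(header.group(1))
            # Drop headers at the same or a deeper level, then push this one.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            stack.append((depth, header.group(2).strip()))
            if not strip_headers:
                content += line
        elif line.startswith(FENCE):
            flush()
            language = line[len(FENCE):].strip()
            content = line
            i += 1
            # Consume lines up to and including the closing fence.
            while i < len(lines):
                content += lines[i]
                if lines[i].startswith(FENCE):
                    break
                i += 1
            flush({"Code": language})
        elif RULE_RE.match(line):
            flush()  # a horizontal rule only ends the current chunk
        else:
            content += line
        i += 1
    flush()

    if return_each_line:
        return [
            Chunk(single, chunk.metadata)
            for chunk in chunks
            for single in chunk.page_content.splitlines()
            if single.strip()
        ]
    return chunks


sample = "\n".join([
    "# Guide",
    "Intro line.",
    "",
    "## Install",
    "Run the command:",
    FENCE + "bash",
    "pip install example",
    FENCE,
    "---",
    "After the rule.",
]) + "\n"

chunks = sketch_split(sample)
```

With this sample, the sketch produces four chunks: the introduction, the text under "Install", the fenced code block (carrying a "Code" entry of "bash"), and the text after the horizontal rule. Passing return_each_line=True instead yields one Chunk per non-blank line, each inheriting the metadata of the chunk it came from.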

Return type:

List[Document]
