langchain-text-splitters: 0.2.4#

Text Splitters are classes for splitting text.

Class hierarchy:

BaseDocumentTransformer --> TextSplitter --> <name>TextSplitter  # Example: CharacterTextSplitter
                                             RecursiveCharacterTextSplitter -->  <name>TextSplitter

Note: MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter do not derive from TextSplitter.

Main helpers:

Document, Tokenizer, Language, LineType, HeaderType

base#

Classes

base.Language(value[, names, module, ...])

Enum of the programming languages.

base.TextSplitter(chunk_size, chunk_overlap, ...)

Interface for splitting text into chunks.

base.TokenTextSplitter([encoding_name, ...])

Splitting text into tokens using a model tokenizer.

base.Tokenizer(chunk_overlap, ...)

Tokenizer data class.

Functions

base.split_text_on_tokens(*, text, tokenizer)

Split incoming text and return chunks using tokenizer.
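
The token-based chunking described above can be pictured as a sliding window over a token sequence. The following is a minimal standard-library sketch of that idea, not the library's implementation: `chunk_tokens` and its parameter `tokens_per_chunk` are hypothetical names chosen for illustration (only `chunk_overlap` appears in the listing above), and whitespace-separated words stand in for real model tokens.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], tokens_per_chunk: int, chunk_overlap: int
) -> List[List[str]]:
    """Hypothetical helper: slide a fixed-size window over a token list."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")

    # Each new window starts this many tokens after the previous one.
    step = tokens_per_chunk - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks


# Assumption: whitespace "tokens" stand in for a real tokenizer's output.
words = "the quick brown fox jumps over the lazy dog".split()
windows = chunk_tokens(words, tokens_per_chunk=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```

With an overlap of 1, the last token of each window is repeated as the first token of the next, which is the usual reason for overlapping chunks: context at a boundary is not lost entirely.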

character#

Classes

character.CharacterTextSplitter([separator, ...])

Splitting text by looking at characters.
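
As a rough illustration of character-based splitting, the sketch below cuts text on one separator and then greedily packs neighbouring pieces into chunks of bounded length. It uses only the standard library and is not the library's code: `split_on_separator`, its default separator, and its `chunk_size` default are assumptions made for this example, and chunk overlap is left out for brevity.

```python
from typing import List


def split_on_separator(
    text: str, separator: str = "\n\n", chunk_size: int = 40
) -> List[str]:
    """Hypothetical helper: split on one separator, then pack pieces greedily."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole in this sketch.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\n\nbeta\n\ngamma delta epsilon\n\nzeta"
packed = split_on_separator(sample, separator="\n\n", chunk_size=12)
print(packed)
```

Note that a single separator cannot guarantee the size limit: the piece "gamma delta epsilon" exceeds 12 characters and is emitted as-is, which is the gap that recursive splitting is meant to close.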

character.RecursiveCharacterTextSplitter([...])

Splitting text by recursively looking at characters.
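
One way to picture recursive character splitting is as a fallback chain of separators: split on the coarsest one first, and only re-split the pieces that are still too long with the next, finer separator. The sketch below is a simplified standard-library illustration under that assumption, not the library's algorithm: `recursive_split` is a hypothetical name, separators must be non-empty strings, and small pieces are not merged back together.

```python
from typing import List, Sequence


def recursive_split(
    text: str, separators: Sequence[str], chunk_size: int
) -> List[str]:
    """Hypothetical helper: split with coarse separators first, finer ones later."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(head):
        # Pieces that already fit are returned unchanged by the recursive call;
        # oversized ones are split again with the remaining separators.
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


document = "one two three\n\nfour five six seven"
parts = recursive_split(document, separators=["\n\n", " "], chunk_size=13)
print(parts)
```

The first paragraph fits within 13 characters and survives intact, while the second is broken down to single words. A fuller version would merge those words back into chunks close to the size limit.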

html#

Classes

html.ElementType

Element type as typed dict.

html.HTMLHeaderTextSplitter(headers_to_split_on)

Splitting HTML files based on specified headers.

html.HTMLSectionSplitter(headers_to_split_on)

Splitting HTML files based on specified tags and font sizes.

json#

Classes

json.RecursiveJsonSplitter([max_chunk_size, ...])

konlpy#

Classes

konlpy.KonlpyTextSplitter([separator])

Splitting text using the Konlpy package.

latex#

Classes

latex.LatexTextSplitter(**kwargs)

Attempts to split the text along Latex-formatted layout elements.

markdown#

Classes

markdown.ExperimentalMarkdownSyntaxTextSplitter([...])

An experimental text splitter for handling Markdown syntax.

markdown.HeaderType

Header type as typed dict.

markdown.LineType

Line type as typed dict.

markdown.MarkdownHeaderTextSplitter(...[, ...])

Splitting markdown files based on specified headers.
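
Header-based splitting can be thought of as grouping body lines under the headers currently in effect and attaching those headers as metadata. The sketch below illustrates that idea with the standard library only and is not the library's implementation: `split_by_headers`, the `(marker, name)` tuple format, and the dictionary output shape are assumptions made for this example, and fenced code blocks inside the Markdown are not treated specially.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    markdown: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Dict[str, object]]:
    """Hypothetical helper: group lines under the headers active above them."""
    # Test longer markers first so "##" is checked before "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections: List[Dict[str, object]] = []
    active: Dict[str, str] = {}  # header name -> current title
    body: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"content": content, "metadata": dict(active)})
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                # A new header replaces headers at the same or a deeper level.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        active.pop(other_name, None)
                active[name] = stripped[len(marker):].strip()
                break
        else:
            body.append(line)
    flush()
    return sections


md = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
sections = split_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
for section in sections:
    print(section["metadata"], "->", section["content"])
```

Because the output is a list of content-plus-metadata records rather than plain strings, a splitter built this way does not fit the string-in, strings-out shape of the other splitters.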

markdown.MarkdownTextSplitter(**kwargs)

Attempts to split the text along Markdown-formatted headings.

nltk#

Classes

nltk.NLTKTextSplitter([separator, language])

Splitting text using the NLTK package.

python#

Classes

python.PythonCodeTextSplitter(**kwargs)

Attempts to split the text along Python syntax.

sentence_transformers#

Classes

sentence_transformers.SentenceTransformersTokenTextSplitter([...])

Splitting text into tokens using a sentence model tokenizer.

spacy#

Classes

spacy.SpacyTextSplitter([separator, ...])

Splitting text using the Spacy package.