Text Splitter

Functionality for splitting text.

class langchain.text_splitter.CharacterTextSplitter(separator: str = '\n\n', **kwargs: Any)

Splits text at a given character separator.

split_text(text: str) → List[str]

Split incoming text and return chunks.
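
A minimal usage sketch (the separator and sizes here are illustrative, not defaults):

    from langchain.text_splitter import CharacterTextSplitter

    text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    # chunk_size counts characters by default; pieces produced by the
    # separator are merged back together up to chunk_size.
    splitter = CharacterTextSplitter(separator="\n\n", chunk_size=40, chunk_overlap=0)
    chunks = splitter.split_text(text)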

class langchain.text_splitter.Language(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Enum of the programming languages supported by the code-aware splitters.

CPP = 'cpp'
GO = 'go'
HTML = 'html'
JAVA = 'java'
JS = 'js'
LATEX = 'latex'
MARKDOWN = 'markdown'
PHP = 'php'
PROTO = 'proto'
PYTHON = 'python'
RST = 'rst'
RUBY = 'ruby'
RUST = 'rust'
SCALA = 'scala'
SWIFT = 'swift'

class langchain.text_splitter.LatexTextSplitter(**kwargs: Any)

Attempts to split the text along LaTeX-formatted layout elements.
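
A minimal sketch, splitting on sectioning commands (sizes are illustrative):

    from langchain.text_splitter import LatexTextSplitter

    latex_text = r"\section{Intro} Some text. \subsection{Background} More text."
    splitter = LatexTextSplitter(chunk_size=60, chunk_overlap=0)
    chunks = splitter.split_text(latex_text)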

class langchain.text_splitter.MarkdownTextSplitter(**kwargs: Any)

Attempts to split the text along Markdown-formatted headings.
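
A minimal sketch (sizes are illustrative):

    from langchain.text_splitter import MarkdownTextSplitter

    md = "# Title\n\nIntro text.\n\n## Section\n\nBody text."
    splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
    chunks = splitter.split_text(md)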

class langchain.text_splitter.NLTKTextSplitter(separator: str = '\n\n', **kwargs: Any)

Splits text at sentence boundaries using NLTK.

split_text(text: str) → List[str]

Split incoming text and return chunks.
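
A minimal sketch; this splitter assumes the nltk package and its "punkt" sentence model are installed:

    from langchain.text_splitter import NLTKTextSplitter

    # One-time prerequisite: import nltk; nltk.download("punkt")
    splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=0)
    chunks = splitter.split_text("First sentence. Second sentence. Third sentence.")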

class langchain.text_splitter.PythonCodeTextSplitter(**kwargs: Any)

Attempts to split the text along Python syntax elements such as class and function definitions.
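
A minimal sketch (sizes are illustrative):

    from langchain.text_splitter import PythonCodeTextSplitter

    code = "def foo():\n    return 1\n\nclass Bar:\n    def baz(self):\n        return 2\n"
    splitter = PythonCodeTextSplitter(chunk_size=50, chunk_overlap=0)
    chunks = splitter.split_text(code)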

class langchain.text_splitter.RecursiveCharacterTextSplitter(separators: Optional[List[str]] = None, keep_separator: bool = True, **kwargs: Any)

Splits text by looking at characters.

Recursively tries a list of separator characters, in order, until it finds one that produces chunks small enough.

classmethod from_language(language: langchain.text_splitter.Language, **kwargs: Any) → langchain.text_splitter.RecursiveCharacterTextSplitter
static get_separators_for_language(language: langchain.text_splitter.Language) → List[str]
split_text(text: str) → List[str]

Split text into multiple components.
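
A minimal sketch combining the Language enum with the two class helpers (sizes are illustrative):

    from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

    # Inspect the separator list used for a given language.
    seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

    code = "def hello():\n    print('hello')\n\ndef world():\n    print('world')\n"
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=50, chunk_overlap=0
    )
    chunks = splitter.split_text(code)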

class langchain.text_splitter.SentenceTransformersTokenTextSplitter(chunk_overlap: int = 50, model_name: str = 'sentence-transformers/all-mpnet-base-v2', tokens_per_chunk: Optional[int] = None, **kwargs: Any)

Splits text by counting tokens with a sentence-transformers tokenizer.

count_tokens(*, text: str) → int
split_text(text: str) → List[str]

Split text into multiple components.
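
A minimal sketch; assumes the sentence-transformers package is installed (the model is downloaded on first use):

    from langchain.text_splitter import SentenceTransformersTokenTextSplitter

    splitter = SentenceTransformersTokenTextSplitter(tokens_per_chunk=64, chunk_overlap=8)
    text = "Lorem ipsum dolor sit amet. " * 20
    n_tokens = splitter.count_tokens(text=text)  # note: keyword-only argument
    chunks = splitter.split_text(text)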

class langchain.text_splitter.SpacyTextSplitter(separator: str = '\n\n', pipeline: str = 'en_core_web_sm', **kwargs: Any)

Splits text at sentence boundaries using spaCy.

split_text(text: str) → List[str]

Split incoming text and return chunks.
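
A minimal sketch; assumes spacy and the en_core_web_sm pipeline are installed:

    from langchain.text_splitter import SpacyTextSplitter

    # One-time prerequisite: python -m spacy download en_core_web_sm
    splitter = SpacyTextSplitter(chunk_size=100, chunk_overlap=0)
    chunks = splitter.split_text("First sentence. Second sentence. Third sentence.")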

class langchain.text_splitter.TextSplitter(chunk_size: int = 4000, chunk_overlap: int = 200, length_function: typing.Callable[[str], int] = <built-in function len>, keep_separator: bool = False, add_start_index: bool = False)

Interface for splitting text into chunks.

async atransform_documents(documents: Sequence[langchain.schema.Document], **kwargs: Any) → Sequence[langchain.schema.Document]

Asynchronously transform a sequence of documents by splitting them.

create_documents(texts: List[str], metadatas: Optional[List[dict]] = None) → List[langchain.schema.Document]

Create documents from a list of texts.
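
A minimal sketch; each chunk produced from a text inherits that text's metadata dict (the file names here are hypothetical):

    from langchain.text_splitter import CharacterTextSplitter

    splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
    texts = ["Alpha document text.", "Beta document text."]
    metadatas = [{"source": "alpha.txt"}, {"source": "beta.txt"}]
    docs = splitter.create_documents(texts, metadatas=metadatas)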

classmethod from_huggingface_tokenizer(tokenizer: Any, **kwargs: Any) → langchain.text_splitter.TextSplitter

Text splitter that uses a HuggingFace tokenizer to count length.
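
A minimal sketch; assumes the transformers package is installed:

    from transformers import GPT2TokenizerFast
    from langchain.text_splitter import CharacterTextSplitter

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    # chunk_size is now measured in the tokenizer's tokens, not characters.
    splitter = CharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer, chunk_size=100, chunk_overlap=0
    )
    chunks = splitter.split_text("Some long text to split.")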

classmethod from_tiktoken_encoder(encoding_name: str = 'gpt2', model_name: Optional[str] = None, allowed_special: Union[Literal['all'], AbstractSet[str]] = {}, disallowed_special: Union[Literal['all'], Collection[str]] = 'all', **kwargs: Any) → langchain.text_splitter.TS

Text splitter that uses a tiktoken encoder to count length.
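
A minimal sketch; assumes the tiktoken package is installed:

    from langchain.text_splitter import CharacterTextSplitter

    # chunk_size is now measured in tokens of the named encoding.
    splitter = CharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="gpt2", chunk_size=100, chunk_overlap=0
    )
    chunks = splitter.split_text("Some long text to split.")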

split_documents(documents: Iterable[langchain.schema.Document]) → List[langchain.schema.Document]

Split documents.

abstract split_text(text: str) → List[str]

Split text into multiple components.
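
Subclasses only need to implement split_text(); document handling (create_documents, split_documents, transform_documents) comes from the base class. A hypothetical minimal subclass:

    from typing import List
    from langchain.text_splitter import TextSplitter

    class LineTextSplitter(TextSplitter):
        # Hypothetical example class: a naive splitter emitting one chunk
        # per non-empty line; a real implementation would also merge lines
        # up to the configured chunk size.
        def split_text(self, text: str) -> List[str]:
            return [line for line in text.splitlines() if line.strip()]

    docs = LineTextSplitter().create_documents(["line one\nline two"])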

transform_documents(documents: Sequence[langchain.schema.Document], **kwargs: Any) → Sequence[langchain.schema.Document]

Transform a sequence of documents by splitting them.

class langchain.text_splitter.TokenTextSplitter(encoding_name: str = 'gpt2', model_name: Optional[str] = None, allowed_special: Union[Literal['all'], AbstractSet[str]] = {}, disallowed_special: Union[Literal['all'], Collection[str]] = 'all', **kwargs: Any)

Splits text by counting tokens with a tiktoken encoder.

split_text(text: str) → List[str]

Split text into multiple components.
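
A minimal sketch; assumes the tiktoken package is installed (chunk_size and chunk_overlap are counted in tokens):

    from langchain.text_splitter import TokenTextSplitter

    splitter = TokenTextSplitter(encoding_name="gpt2", chunk_size=10, chunk_overlap=0)
    chunks = splitter.split_text("A sample sentence to be split into token-sized chunks.")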

class langchain.text_splitter.Tokenizer(chunk_overlap: 'int', tokens_per_chunk: 'int', decode: 'Callable[[list[int]], str]', encode: 'Callable[[str], List[int]]')

Data class bundling an encode/decode pair with the chunking parameters used by split_text_on_tokens().

chunk_overlap: int
decode: Callable[[list[int]], str]
encode: Callable[[str], List[int]]
tokens_per_chunk: int
langchain.text_splitter.split_text_on_tokens(*, text: str, tokenizer: langchain.text_splitter.Tokenizer) → List[str]

Split incoming text and return chunks.
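
A minimal sketch wiring a tiktoken encoding into the Tokenizer data class; assumes the tiktoken package is installed:

    import tiktoken
    from langchain.text_splitter import Tokenizer, split_text_on_tokens

    enc = tiktoken.get_encoding("gpt2")
    tok = Tokenizer(
        chunk_overlap=2,       # tokens shared between consecutive chunks
        tokens_per_chunk=10,   # window size in tokens
        decode=enc.decode,
        encode=enc.encode,
    )
    chunks = split_text_on_tokens(
        text="A longer piece of text that will be cut into overlapping token windows.",
        tokenizer=tok,
    )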