Text Splitter#

Functionality for splitting text.

class langchain.text_splitter.CharacterTextSplitter(separator: str = '\n\n', **kwargs: Any)[source]#

Implementation of splitting text that splits on a single fixed character separator (by default "\n\n").

split_text(text: str) → List[str][source]#

Split incoming text and return chunks.
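The behavior can be sketched as: split on the separator, then greedily merge adjacent pieces back together until each chunk approaches the chunk size. The helper below is a minimal illustration, not the library's actual implementation, and `split_by_separator` is a hypothetical name:

```python
from typing import List

def split_by_separator(text: str, separator: str = "\n\n",
                       chunk_size: int = 4000) -> List[str]:
    """Split on a fixed separator, then greedily merge pieces
    until each chunk approaches chunk_size characters."""
    pieces = text.split(separator)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Note that a single piece longer than `chunk_size` is emitted as-is here; handling oversized pieces is what the recursive variant below addresses.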

class langchain.text_splitter.LatexTextSplitter(**kwargs: Any)[source]#

Attempts to split the text along LaTeX-formatted layout elements.

class langchain.text_splitter.MarkdownTextSplitter(**kwargs: Any)[source]#

Attempts to split the text along Markdown-formatted headings.

class langchain.text_splitter.NLTKTextSplitter(separator: str = '\n\n', **kwargs: Any)[source]#

Implementation of splitting text into sentences using the NLTK sentence tokenizer.

split_text(text: str) → List[str][source]#

Split incoming text and return chunks.
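Sentence-based splitting first segments the text into sentences, then merges sentences into chunks as in the character splitter. A naive regex stands in here for NLTK's `sent_tokenize` (which requires the `punkt` data download); `split_sentences` is a hypothetical helper, not the library code:

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive regex stand-in for nltk.tokenize.sent_tokenize:
    split after sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

The real tokenizer handles abbreviations and other edge cases that this regex does not.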

class langchain.text_splitter.PythonCodeTextSplitter(**kwargs: Any)[source]#

Attempts to split the text along Python syntax.

class langchain.text_splitter.RecursiveCharacterTextSplitter(separators: Optional[List[str]] = None, **kwargs: Any)[source]#

Implementation of splitting text that recursively tries a list of separators (by default paragraphs, then lines, then words, then individual characters) until the resulting chunks fit within the chunk size.

split_text(text: str) → List[str][source]#

Split incoming text and return chunks.
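The recursive strategy can be sketched as follows: split on the coarsest separator, and for any piece still too long, recurse with the remaining, finer separators. This is an illustrative sketch with hypothetical names, not the library's implementation (which also merges small pieces and applies overlap):

```python
from typing import List, Optional

DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]

def recursive_split(text: str, chunk_size: int = 4000,
                    separators: Optional[List[str]] = None) -> List[str]:
    """Try separators in order; recurse on pieces that are
    still too long using the remaining, finer separators."""
    seps = separators or DEFAULT_SEPARATORS
    if len(text) <= chunk_size or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks: List[str] = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return [c for c in chunks if c]
```

The empty-string separator at the end guarantees termination: at worst the text is cut into single characters.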

class langchain.text_splitter.SpacyTextSplitter(separator: str = '\n\n', pipeline: str = 'en_core_web_sm', **kwargs: Any)[source]#

Implementation of splitting text into sentences using a spaCy pipeline (by default en_core_web_sm).

split_text(text: str) → List[str][source]#

Split incoming text and return chunks.

class langchain.text_splitter.TextSplitter(chunk_size: int = 4000, chunk_overlap: int = 200, length_function: typing.Callable[[str], int] = <built-in function len>)[source]#

Interface for splitting text into chunks.

create_documents(texts: List[str], metadatas: Optional[List[dict]] = None) → List[langchain.schema.Document][source]#

Create documents from a list of texts.
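The contract here is that each input text is split into chunks, and the metadata dict for that text is attached to every chunk produced from it. A minimal sketch of that pairing, using a stand-in `Document` dataclass rather than `langchain.schema.Document`:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Document:
    """Stand-in for langchain.schema.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def create_documents(texts: List[str],
                     split: Callable[[str], List[str]],
                     metadatas: Optional[List[dict]] = None) -> List[Document]:
    """Split each text and copy its metadata onto every resulting chunk."""
    metadatas = metadatas or [{} for _ in texts]
    docs: List[Document] = []
    for text, meta in zip(texts, metadatas):
        for chunk in split(text):
            docs.append(Document(page_content=chunk, metadata=dict(meta)))
    return docs
```

Copying the metadata dict per chunk (rather than sharing one reference) means later mutation of one document's metadata does not leak into its siblings.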

classmethod from_huggingface_tokenizer(tokenizer: Any, **kwargs: Any) → langchain.text_splitter.TextSplitter[source]#

Text splitter that uses a Hugging Face tokenizer to count length.

classmethod from_tiktoken_encoder(encoding_name: str = 'gpt2', allowed_special: Union[Literal['all'], AbstractSet[str]] = {}, disallowed_special: Union[Literal['all'], Collection[str]] = 'all', **kwargs: Any) → langchain.text_splitter.TextSplitter[source]#

Text splitter that uses tiktoken encoder to count length.
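Both classmethods above amount to swapping in a different `length_function`: instead of `len` (characters), chunk size is measured by a tokenizer's token count. A sketch of how a pluggable length function shapes the chunking, with a whitespace "tokenizer" standing in for tiktoken or a Hugging Face tokenizer (all names below are hypothetical):

```python
from typing import Callable, List

def split_with_length_fn(text: str,
                         length_function: Callable[[str], int],
                         chunk_size: int) -> List[str]:
    """Greedily pack words into chunks whose measured length
    (characters, tokens, ...) stays within chunk_size."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Whitespace token count as a stand-in for a real encoder's count:
token_count = lambda s: len(s.split())
```

With the real library, `length_function` would instead be something like `lambda s: len(encoding.encode(s))`.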

split_documents(documents: List[langchain.schema.Document]) → List[langchain.schema.Document][source]#

Split documents.

abstract split_text(text: str) → List[str][source]#

Split text into multiple components.

class langchain.text_splitter.TokenTextSplitter(encoding_name: str = 'gpt2', allowed_special: Union[Literal['all'], AbstractSet[str]] = {}, disallowed_special: Union[Literal['all'], Collection[str]] = 'all', **kwargs: Any)[source]#

Implementation of splitting text into chunks measured in tokens rather than characters.

split_text(text: str) → List[str][source]#

Split incoming text and return chunks.
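Token splitting can be sketched as: encode the text to tokens, take fixed-size windows that overlap by `chunk_overlap` tokens, and decode each window back to text. Whitespace tokenization stands in below for a real BPE encoder such as tiktoken; `split_by_tokens` is a hypothetical helper, not the library code:

```python
from typing import List

def split_by_tokens(text: str, chunk_size: int = 4,
                    chunk_overlap: int = 1) -> List[str]:
    """Take overlapping fixed-size windows over a token sequence
    and join each window back into a chunk of text."""
    tokens = text.split()                 # stand-in for encoding.encode(text)
    stride = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))   # stand-in for encoding.decode(window)
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means the last `chunk_overlap` tokens of one chunk reappear at the start of the next, which helps preserve context across chunk boundaries.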