Text Splitter#
Functionality for splitting text.
- class langchain.text_splitter.CharacterTextSplitter(separator: str = '\n\n', **kwargs: Any)[source]#
Implementation of splitting text that looks at characters.
- class langchain.text_splitter.LatexTextSplitter(**kwargs: Any)[source]#
Attempts to split the text along Latex-formatted layout elements.
- class langchain.text_splitter.MarkdownTextSplitter(**kwargs: Any)[source]#
Attempts to split the text along Markdown-formatted headings.
- class langchain.text_splitter.NLTKTextSplitter(separator: str = '\n\n', **kwargs: Any)[source]#
Implementation of splitting text that looks at sentences using NLTK.
- class langchain.text_splitter.PythonCodeTextSplitter(**kwargs: Any)[source]#
Attempts to split the text along Python syntax.
- class langchain.text_splitter.RecursiveCharacterTextSplitter(separators: Optional[List[str]] = None, **kwargs: Any)[source]#
Implementation of splitting text that looks at characters.
Recursively tries to split by different characters to find one that works.
- class langchain.text_splitter.SpacyTextSplitter(separator: str = '\n\n', pipeline: str = 'en_core_web_sm', **kwargs: Any)[source]#
Implementation of splitting text that looks at sentences using Spacy.
- class langchain.text_splitter.TextSplitter(chunk_size: int = 4000, chunk_overlap: int = 200, length_function: typing.Callable[[str], int] = <built-in function len>)[source]#
Interface for splitting text into chunks.
- create_documents(texts: List[str], metadatas: Optional[List[dict]] = None) List[langchain.schema.Document] [source]#
Create documents from a list of texts.
- classmethod from_huggingface_tokenizer(tokenizer: Any, **kwargs: Any) langchain.text_splitter.TextSplitter [source]#
Text splitter that uses HuggingFace tokenizer to count length.
- classmethod from_tiktoken_encoder(encoding_name: str = 'gpt2', allowed_special: Union[Literal['all'], AbstractSet[str]] = {}, disallowed_special: Union[Literal['all'], Collection[str]] = 'all', **kwargs: Any) langchain.text_splitter.TextSplitter [source]#
Text splitter that uses tiktoken encoder to count length.