
AI21SemanticTextSplitter

This example goes over how to use AI21SemanticTextSplitter in LangChain.

Installation

pip install langchain-ai21

Environment Setup

We'll need to get an AI21 API key and set the AI21_API_KEY environment variable:

import os
from getpass import getpass

os.environ["AI21_API_KEY"] = getpass()
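
The snippet above prompts every time it runs. As a minimal sketch, assuming you would rather prompt only when the variable is not already set, a small helper could look like the following. The name ensure_api_key and its prompt_fn parameter are hypothetical and not part of langchain-ai21.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="AI21_API_KEY", prompt_fn=getpass):
    """Return the key stored in env_var, prompting for it only if it is missing.

    Hypothetical helper for illustration; not part of langchain-ai21.
    """
    if not os.environ.get(env_var):
        # Nothing set (or empty string): ask the user and store the answer.
        os.environ[env_var] = prompt_fn(f"{env_var}: ")
    return os.environ[env_var]


# Usage (prompts interactively only when the variable is unset):
# ensure_api_key()
```

Passing prompt_fn as a parameter keeps the helper easy to exercise without an interactive terminal.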

Example Usage

Splitting text by semantic meaning

This example shows how to use AI21SemanticTextSplitter to split a text into chunks based on semantic meaning.

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
"Imagine a company that employs hundreds of thousands of employees. In today's information "
"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
"here, given that some of these documents are long and convoluted on purpose (did you know that "
"reading through all your privacy policies would take almost a quarter of a year?). Aside from "
"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
"Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
"users can (ideally) quickly extract relevant information from a text. With large language models, "
"the development of those tools is easier than ever, and you can offer your users a summary that is "
"specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
'them with several examples in the input ("few-shot prompt"), so they can follow through. '
"The process of creating the correct prompt for your problem is called prompt engineering, "
"and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
chunks = semantic_text_splitter.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")

Splitting text by semantic meaning with merge

This example shows how to use AI21SemanticTextSplitter to split a text into chunks based on semantic meaning and then merge those chunks according to chunk_size.

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
"Imagine a company that employs hundreds of thousands of employees. In today's information "
"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
"here, given that some of these documents are long and convoluted on purpose (did you know that "
"reading through all your privacy policies would take almost a quarter of a year?). Aside from "
"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
"Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
"users can (ideally) quickly extract relevant information from a text. With large language models, "
"the development of those tools is easier than ever, and you can offer your users a summary that is "
"specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
'them with several examples in the input ("few-shot prompt"), so they can follow through. '
"The process of creating the correct prompt for your problem is called prompt engineering, "
"and you can read more about it here."
)

semantic_text_splitter_chunks = AI21SemanticTextSplitter(chunk_size=1000)
chunks = semantic_text_splitter_chunks.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")
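
To build intuition for what merging by chunk_size means, here is a rough sketch of a greedy merge over already-split segments. It is an assumption about the general idea, not the library's actual implementation: merge_segments and its separator parameter are hypothetical, and the real splitter may measure length or join segments differently.

```python
def merge_segments(segments, chunk_size, separator="\n\n"):
    """Greedily join consecutive segments while the result fits in chunk_size.

    Illustrative only; not how AI21SemanticTextSplitter is implemented.
    A single segment longer than chunk_size is kept as its own chunk.
    """
    merged = []
    current = ""
    for segment in segments:
        candidate = segment if not current else current + separator + segment
        if current and len(candidate) > chunk_size:
            # Adding this segment would overflow: close the current chunk.
            merged.append(current)
            current = segment
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


# Three 4-character segments with a 10-character budget:
print(merge_segments(["aaaa", "bbbb", "cccc"], chunk_size=10, separator=" "))
# prints ['aaaa bbbb', 'cccc']
```

A larger chunk_size therefore tends to yield fewer, longer chunks, while segment order is preserved.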

Splitting text to documents

This example shows how to use AI21SemanticTextSplitter to split a text into Documents based on semantic meaning. The metadata will contain a type for each document.

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
"Imagine a company that employs hundreds of thousands of employees. In today's information "
"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
"here, given that some of these documents are long and convoluted on purpose (did you know that "
"reading through all your privacy policies would take almost a quarter of a year?). Aside from "
"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
"Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
"users can (ideally) quickly extract relevant information from a text. With large language models, "
"the development of those tools is easier than ever, and you can offer your users a summary that is "
"specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
'them with several examples in the input ("few-shot prompt"), so they can follow through. '
"The process of creating the correct prompt for your problem is called prompt engineering, "
"and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
documents = semantic_text_splitter.split_text_to_documents(TEXT)

print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"type: {doc.metadata['source_type']}")
    print(f"text: {doc.page_content}")
    print("====")
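
Once each Document carries a source_type, one thing you might do is group chunks by that type. The sketch below uses a small stand-in class instead of the real Document so that it runs without an API key. StandInDocument, group_by_source_type, and the type values shown are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in mirroring Document's page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_source_type(documents):
    """Map each source_type to the list of chunk texts that carry it."""
    groups = defaultdict(list)
    for doc in documents:
        # Fall back to "unknown" if a document has no source_type key.
        groups[doc.metadata.get("source_type", "unknown")].append(doc.page_content)
    return dict(groups)


# The type values below are illustrative, not a list of what the API returns.
docs = [
    StandInDocument("Intro", {"source_type": "title"}),
    StandInDocument("First paragraph.", {"source_type": "normal_text"}),
    StandInDocument("Second paragraph.", {"source_type": "normal_text"}),
]
print(group_by_source_type(docs))
```

The same function should work on real Document objects, since it only reads page_content and metadata.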

Creating Documents with Metadata

This example shows how to use AI21SemanticTextSplitter to create Documents from texts and add custom metadata to each Document.

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
"Imagine a company that employs hundreds of thousands of employees. In today's information "
"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
"here, given that some of these documents are long and convoluted on purpose (did you know that "
"reading through all your privacy policies would take almost a quarter of a year?). Aside from "
"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
"Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
"users can (ideally) quickly extract relevant information from a text. With large language models, "
"the development of those tools is easier than ever, and you can offer your users a summary that is "
"specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
'them with several examples in the input ("few-shot prompt"), so they can follow through. '
"The process of creating the correct prompt for your problem is called prompt engineering, "
"and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
texts = [TEXT]
documents = semantic_text_splitter.create_documents(
    texts=texts, metadatas=[{"pikachu": "pika pika"}]
)

print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"metadata: {doc.metadata}")
    print(f"text: {doc.page_content}")
    print("====")
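
In the example above, metadatas holds one dict for the single entry in texts. When passing several texts, the two lists are matched by position, so it can help to build them together. The helper below is a minimal sketch; pair_texts_with_metadata and the "source" key are hypothetical choices, not part of the library.

```python
def pair_texts_with_metadata(named_texts):
    """Turn a {name: text} mapping into index-aligned texts and metadatas lists.

    Hypothetical helper: the "source" metadata key is an arbitrary choice.
    """
    texts, metadatas = [], []
    for name, text in named_texts.items():
        texts.append(text)
        metadatas.append({"source": name})
    return texts, metadatas


texts, metadatas = pair_texts_with_metadata(
    {"report.txt": "First text.", "contract.txt": "Second text."}
)
print(texts)      # prints ['First text.', 'Second text.']
print(metadatas)  # prints [{'source': 'report.txt'}, {'source': 'contract.txt'}]
```

The resulting lists could then be passed as texts=texts, metadatas=metadatas to create_documents.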

Splitting text to documents with start index

This example shows how to use AI21SemanticTextSplitter to split a text into Documents based on semantic meaning. The metadata will contain a start index for each document. Note that the start index provides an indication of the order of the chunks rather than the actual start index for each chunk.

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
"Imagine a company that employs hundreds of thousands of employees. In today's information "
"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
"here, given that some of these documents are long and convoluted on purpose (did you know that "
"reading through all your privacy policies would take almost a quarter of a year?). Aside from "
"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
"Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
"users can (ideally) quickly extract relevant information from a text. With large language models, "
"the development of those tools is easier than ever, and you can offer your users a summary that is "
"specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
'them with several examples in the input ("few-shot prompt"), so they can follow through. '
"The process of creating the correct prompt for your problem is called prompt engineering, "
"and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter(add_start_index=True)
documents = semantic_text_splitter.create_documents(texts=[TEXT])
print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"start_index: {doc.metadata['start_index']}")
    print(f"text: {doc.page_content}")
    print("====")
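
Because start_index here reflects chunk order rather than a character offset, a natural use is restoring the original order after documents have been shuffled, filtered, or retrieved out of sequence. The following sketch uses a stand-in class so it runs offline. StandInDocument and in_original_order are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in mirroring Document's page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def in_original_order(documents):
    """Return documents sorted by their start_index metadata value."""
    return sorted(documents, key=lambda doc: doc.metadata["start_index"])


# start_index is treated purely as an ordering key, not as a text offset.
shuffled = [
    StandInDocument("third", {"start_index": 2}),
    StandInDocument("first", {"start_index": 0}),
    StandInDocument("second", {"start_index": 1}),
]
print([doc.page_content for doc in in_original_order(shuffled)])
# prints ['first', 'second', 'third']
```

Avoid using these values to slice the original text, since they are not guaranteed to be character positions.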

Splitting documents

This example shows how to use AI21SemanticTextSplitter to split a list of Documents into chunks based on semantic meaning.

from langchain_ai21 import AI21SemanticTextSplitter
from langchain_core.documents import Document

TEXT = (
"We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
"legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
"Imagine a company that employs hundreds of thousands of employees. In today's information "
"overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
"here, given that some of these documents are long and convoluted on purpose (did you know that "
"reading through all your privacy policies would take almost a quarter of a year?). Aside from "
"inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
"Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
"tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
"users can (ideally) quickly extract relevant information from a text. With large language models, "
"the development of those tools is easier than ever, and you can offer your users a summary that is "
"specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
"(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
'them with several examples in the input ("few-shot prompt"), so they can follow through. '
"The process of creating the correct prompt for your problem is called prompt engineering, "
"and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
document = Document(page_content=TEXT, metadata={"hello": "goodbye"})
documents = semantic_text_splitter.split_documents([document])
print(f"The document list has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"text: {doc.page_content}")
    print(f"metadata: {doc.metadata}")
    print("====")
