KeybertLinkExtractor#
- class langchain_community.graph_vectorstores.extractors.keybert_link_extractor.KeybertLinkExtractor(*, kind: str = 'kw', embedding_model: str = 'all-MiniLM-L6-v2', extract_keywords_kwargs: Dict[str, Any] | None = None)[source]#
Beta
This feature is in beta. It is actively being worked on, so the API may change.
Extract keywords using KeyBERT.
KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
The KeybertLinkExtractor uses KeyBERT to create links between documents that have keywords in common.
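Conceptually, each extracted keyword becomes a bidirectional link tag, and any two chunks carrying the same tag become neighbors in the graph. A minimal stdlib-only sketch of that idea (the chunk ids and keyword sets below are illustrative stand-ins for KeyBERT output, not the library's internals):

```python
from collections import defaultdict

# Hypothetical per-chunk keyword sets (stand-ins for KeyBERT output).
chunk_keywords = {
    "chunk-0": {"ukraine", "putin", "russia"},
    "chunk-1": {"economy", "inflation"},
    "chunk-2": {"russia", "sanctions"},
}

# Invert to an index: keyword tag -> ids of the chunks carrying that tag.
by_tag = defaultdict(set)
for chunk_id, keywords in chunk_keywords.items():
    for kw in keywords:
        by_tag[kw].add(chunk_id)

def neighbors(chunk_id: str) -> set:
    """Chunks reachable from chunk_id via at least one shared keyword tag."""
    linked = set()
    for kw in chunk_keywords[chunk_id]:
        linked |= by_tag[kw]
    linked.discard(chunk_id)
    return linked

print(neighbors("chunk-0"))  # {'chunk-2'}: both carry the 'russia' tag
```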
Example:
extractor = KeybertLinkExtractor()
results = extractor.extract_one("lorem ipsum...")
How to link Documents on common keywords using KeyBERT#
Preliminaries#
Install the keybert package:
pip install -q langchain_community keybert
Usage#
We load the state_of_the_union.txt file, chunk it, then for each chunk extract keyword links and add them to the chunk's metadata.
Using extract_one()#
We can use extract_one() on a document to get the links, then add them to the document's metadata with add_links():

from langchain_community.document_loaders import TextLoader
from langchain_community.graph_vectorstores.extractors import KeybertLinkExtractor
from langchain_core.graph_vectorstores.links import add_links
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("state_of_the_union.txt")
raw_documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

keyword_extractor = KeybertLinkExtractor()
for document in documents:
    links = keyword_extractor.extract_one(document)
    add_links(document, links)

print(documents[0].metadata)
{'source': 'state_of_the_union.txt', 'links': [Link(kind='kw', direction='bidir', tag='ukraine'), Link(kind='kw', direction='bidir', tag='ukrainian'), Link(kind='kw', direction='bidir', tag='putin'), Link(kind='kw', direction='bidir', tag='vladimir'), Link(kind='kw', direction='bidir', tag='russia')]}
Using LinkExtractorTransformer#
Using the LinkExtractorTransformer, we can simplify the link extraction:

from langchain_community.document_loaders import TextLoader
from langchain_community.graph_vectorstores.extractors import (
    KeybertLinkExtractor,
    LinkExtractorTransformer,
)
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("state_of_the_union.txt")
raw_documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

transformer = LinkExtractorTransformer([KeybertLinkExtractor()])
documents = transformer.transform_documents(documents)

print(documents[0].metadata)
{'source': 'state_of_the_union.txt', 'links': [Link(kind='kw', direction='bidir', tag='ukraine'), Link(kind='kw', direction='bidir', tag='ukrainian'), Link(kind='kw', direction='bidir', tag='putin'), Link(kind='kw', direction='bidir', tag='vladimir'), Link(kind='kw', direction='bidir', tag='russia')]}
The documents with keyword links can then be added to a GraphVectorStore:

from langchain_community.graph_vectorstores import CassandraGraphVectorStore

store = CassandraGraphVectorStore.from_documents(documents=documents, embedding=...)
- param kind:
Kind of links to produce with this extractor.
- param embedding_model:
Name of the embedding model to use with KeyBERT.
- param extract_keywords_kwargs:
Keyword arguments to pass to KeyBERT's extract_keywords method.
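The stored extract_keywords_kwargs are splatted into every call to KeyBERT's extract_keywords (which accepts options such as keyphrase_ngram_range, stop_words, and top_n; check the KeyBERT documentation for your version). The forwarding pattern itself can be sketched with a stand-in function, not the real KeyBERT API:

```python
# Stand-in for KeyBERT's extract_keywords, to show the kwargs plumbing only.
def fake_extract_keywords(text, top_n=5, keyphrase_ngram_range=(1, 1)):
    # A real call ranks candidates by embedding similarity; here we simply
    # take the first `top_n` whitespace tokens.
    return text.split()[:top_n]

class SketchExtractor:
    """Illustrative extractor that forwards stored kwargs on every call."""

    def __init__(self, extract_keywords_kwargs=None):
        # Stored once at construction, splatted into each extraction call.
        self.extract_keywords_kwargs = extract_keywords_kwargs or {}

    def extract_one(self, text):
        return fake_extract_keywords(text, **self.extract_keywords_kwargs)

extractor = SketchExtractor(extract_keywords_kwargs={"top_n": 2})
print(extractor.extract_one("lorem ipsum dolor sit amet"))  # ['lorem', 'ipsum']
```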
Methods
__init__(*[, kind, embedding_model, ...])
Extract keywords using KeyBERT.
extract_many(inputs)
Add edges from each input to the corresponding documents.
extract_one(input)
Add edges from the input to the corresponding document.
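extract_many is effectively the batched counterpart of extract_one: one result per input, in order. A stdlib sketch of that relationship (the class and its keyword logic are illustrative stand-ins, not the library's implementation):

```python
class SketchLinkExtractor:
    """Illustrative stand-in, not the real KeybertLinkExtractor."""

    def extract_one(self, text: str) -> set:
        # Stand-in "keywords": the distinct lowercase words of the text.
        return set(text.lower().split())

    def extract_many(self, texts) -> list:
        # Batched form: one result per input, preserving input order.
        return [self.extract_one(t) for t in texts]

extractor = SketchLinkExtractor()
print(extractor.extract_many(["Alpha beta", "beta Gamma"]))
# [{'alpha', 'beta'}, {'beta', 'gamma'}]
```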
- __init__(*, kind: str = 'kw', embedding_model: str = 'all-MiniLM-L6-v2', extract_keywords_kwargs: Dict[str, Any] | None = None)[source]#
Extract keywords using KeyBERT.
- Parameters:
kind (str) – Kind of links to produce with this extractor.
embedding_model (str) – Name of the embedding model to use with KeyBERT.
extract_keywords_kwargs (Dict[str, Any] | None) – Keyword arguments to pass to KeyBERT's extract_keywords method.