BigtableVectorStore
This guide covers the BigtableVectorStore
integration for using Google Cloud Bigtable as a vector store.
Bigtable is a key-value and wide-column store, ideal for fast access to structured, semi-structured, or unstructured data.
Overview
The BigtableVectorStore
uses Google Cloud Bigtable to store documents and their vector embeddings for similarity search and retrieval. It supports powerful metadata filtering to refine search results.
Integration details
Class | Package | Local | JS support |
---|---|---|---|
BigtableVectorStore | langchain-google-bigtable | ❌ | ❌ |
Setup
Prerequisites
To get started, you will need a Google Cloud project with an active Bigtable instance.
Installation
The integration is provided by the langchain-google-bigtable package. The command below also installs langchain-google-vertexai, which is used as the embedding service in this guide.
%pip install -qU langchain-google-bigtable langchain-google-vertexai
Colab only: Uncomment the following cell to restart the kernel, or use the button to restart it. For Vertex AI Workbench, you can restart the terminal using the button at the top.
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)
Set Your Google Cloud Project
Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.
If you don't know your project ID, try the following:
- Run gcloud config list.
- Run gcloud projects list.
- See the support page: Locate the project ID.
# @markdown Please fill in your project, instance, and a new table name.
PROJECT_ID = "your-project-id" # @param {type:"string"}
INSTANCE_ID = "your-instance-id" # @param {type:"string"}
TABLE_ID = "your-vector-store-table-3" # @param {type:"string"}
!gcloud config set project {PROJECT_ID}
🔐 Authentication
Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.
- If you are using Colab to run this notebook, use the cell below and continue.
- If you are using Vertex AI Workbench, check out the setup instructions here.
from google.colab import auth
auth.authenticate_user(project_id=PROJECT_ID)
Initialization
Initializing the BigtableVectorStore
involves three steps: setting up the embedding service, ensuring the Bigtable table is created, and configuring the store's parameters.
1. Set up Embedding Service
First, we need a model to create the vector embeddings for our documents. We'll use a Vertex AI model for this example.
from langchain_google_vertexai import VertexAIEmbeddings
embeddings = VertexAIEmbeddings(project=PROJECT_ID, model_name="gemini-embedding-001")
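As an optional sanity check, you can embed a short string and inspect the vector's dimensionality; embed_query is part of the standard LangChain embeddings interface, and the test string below is arbitrary.

# Optional sanity check: embed a short string and inspect the vector size.
test_vector = embeddings.embed_query("hello bigtable")
print(f"Embedding dimensionality: {len(test_vector)}")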
2. Initialize a Table
Before creating a BigtableVectorStore
, a table with the correct column families must exist. The init_vector_store_table
helper function is the recommended way to create and configure a table. If the table already exists, it will do nothing.
from langchain_google_bigtable.vector_store import init_vector_store_table
DATA_COLUMN_FAMILY = "doc_data"
try:
    init_vector_store_table(
        project_id=PROJECT_ID,
        instance_id=INSTANCE_ID,
        table_id=TABLE_ID,
        content_column_family=DATA_COLUMN_FAMILY,
        embedding_column_family=DATA_COLUMN_FAMILY,
    )
    print(f"Table '{TABLE_ID}' is ready.")
except ValueError as e:
    print(e)
3. Configure the Vector Store
Now we define the parameters that control how the vector store connects to Bigtable and how it handles data.
The BigtableEngine
A BigtableEngine
object manages clients and async operations. It is highly recommended to initialize a single engine and reuse it across multiple stores for better performance and resource management.
from langchain_google_bigtable import BigtableEngine
engine = await BigtableEngine.async_initialize(project_id=PROJECT_ID)
Collections
A collection
provides a logical namespace for your documents within a single Bigtable table. It is used as a prefix for the row keys, allowing multiple vector stores to coexist in the same table without interfering with each other.
collection_name = "my_docs"
Metadata Configuration
When creating a BigtableVectorStore
, you have two optional parameters for handling metadata:
- metadata_mappings: This is a list of VectorMetadataMapping objects. You must define a mapping for any metadata key you wish to use for filtering in your search queries. Each mapping specifies the data type (encoding) for the metadata field, which is crucial for correct filtering.
- metadata_as_json_column: This is an optional ColumnConfig that tells the store to save the entire metadata dictionary as a single JSON string in a specific column. This is useful for efficiently retrieving all of a document's metadata at once, including fields not defined in metadata_mappings. Note: Fields stored only in this JSON column cannot be used for filtering.
from langchain_google_bigtable import ColumnConfig, VectorMetadataMapping, Encoding
# Define mappings for metadata fields you want to filter on.
metadata_mappings = [
VectorMetadataMapping(metadata_key="author", encoding=Encoding.UTF8),
VectorMetadataMapping(metadata_key="year", encoding=Encoding.INT_BIG_ENDIAN),
VectorMetadataMapping(metadata_key="category", encoding=Encoding.UTF8),
VectorMetadataMapping(metadata_key="rating", encoding=Encoding.FLOAT),
]
# Define the optional column for storing all metadata as a single JSON string.
metadata_as_json_column = ColumnConfig(
column_family=DATA_COLUMN_FAMILY, column_qualifier="metadata_json"
)
4. Create the BigtableVectorStore Instance
# Configure the columns for your store.
content_column = ColumnConfig(
column_family=DATA_COLUMN_FAMILY, column_qualifier="content"
)
embedding_column = ColumnConfig(
column_family=DATA_COLUMN_FAMILY, column_qualifier="embedding"
)
from langchain_google_bigtable import BigtableVectorStore
vector_store = await BigtableVectorStore.create(
project_id=PROJECT_ID,
instance_id=INSTANCE_ID,
table_id=TABLE_ID,
engine=engine,
embedding_service=embeddings,
collection=collection_name,
metadata_mappings=metadata_mappings,
metadata_as_json_column=metadata_as_json_column,
content_column=content_column,
embedding_column=embedding_column,
)
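Because the collection acts as a row-key prefix, several independent stores can share the same table. The sketch below reuses the configuration defined above with a hypothetical second collection name; documents added to it will not interfere with the first store.

# A second, independent store in the same table, namespaced by a different collection.
other_vector_store = await BigtableVectorStore.create(
    project_id=PROJECT_ID,
    instance_id=INSTANCE_ID,
    table_id=TABLE_ID,
    engine=engine,  # reuse the same engine
    embedding_service=embeddings,
    collection="my_other_docs",  # hypothetical collection name
    metadata_mappings=metadata_mappings,
    metadata_as_json_column=metadata_as_json_column,
    content_column=content_column,
    embedding_column=embedding_column,
)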
Manage vector store
Add Documents
You can add documents with pre-defined IDs. If a Document
is added without an id
attribute, the vector store will automatically generate a uuid4
string for it.
from langchain_core.documents import Document
docs_to_add = [
Document(
page_content="A young farm boy, Luke Skywalker, is thrust into a galactic conflict.",
id="doc_1",
metadata={
"author": "George Lucas",
"year": 1977,
"category": "sci-fi",
"rating": 4.8,
},
),
Document(
page_content="A hobbit named Frodo Baggins must destroy a powerful ring.",
id="doc_2",
metadata={
"author": "J.R.R. Tolkien",
"year": 1954,
"category": "fantasy",
"rating": 4.9,
},
),
# Document without a pre-defined ID, one will be generated.
Document(
page_content="A group of children confront an evil entity emerging from the sewers.",
metadata={"author": "Stephen King", "year": 1986, "category": "horror"},
),
Document(
page_content="In a distant future, the noble House Atreides rules the desert planet Arrakis.",
id="doc_3",
metadata={
"author": "Frank Herbert",
"year": 1965,
"category": "sci-fi",
"rating": 4.9,
},
),
]
added_ids = await vector_store.aadd_documents(docs_to_add)
print(f"Added documents with IDs: {added_ids}")
Update Documents
BigtableVectorStore
handles updates by overwriting. To update a document, simply add it again with the same ID but with new content or metadata.
doc_to_update = [
Document(
page_content="An old hobbit, Frodo Baggins, must take a powerful ring to be destroyed.", # Updated content
id="doc_2", # Same ID
metadata={
"author": "J.R.R. Tolkien",
"year": 1954,
"category": "epic-fantasy",
"rating": 4.9,
}, # Updated metadata
)
]
await vector_store.aadd_documents(doc_to_update)
print("Document 'doc_2' has been updated.")
Delete Documents
is_deleted = await vector_store.adelete(ids=["doc_2"])
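In the standard LangChain VectorStore interface, adelete returns a success flag, so you can print it to confirm the operation (a small optional check).

# The flag indicates whether the deletion request succeeded.
print(f"Deletion successful: {is_deleted}")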
Query vector store
Search
results = await vector_store.asimilarity_search("a story about a powerful ring", k=1)
print(results[0].page_content)
Search with Filters
Apply filters before the vector search runs.
The kNN Search Algorithm and Filtering
By default, BigtableVectorStore
uses a k-Nearest Neighbors (kNN) search algorithm to find the k
vectors in the database that are most similar to your query vector. The vector store offers filtering to reduce the search space before the kNN search is performed, which can make queries faster and more relevant.
Configuring Queries with QueryParameters
All search settings are controlled via the QueryParameters
object. This object allows you to specify not only filters but also other important search aspects:
- algorithm: The search algorithm to use. Defaults to "kNN".
- distance_strategy: The metric used for comparison, such as COSINE (default) or EUCLIDEAN.
- vector_data_type: The data type of the stored vectors, like FLOAT32 or DOUBLE64. This should match the precision of your embeddings.
- filters: A dictionary defining the filtering logic to apply (see the example after this list).
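As a minimal sketch using only the fields shown in this guide, the snippet below keeps the default algorithm and distance strategy and applies a single equality filter on the author field defined earlier in metadata_mappings.

from langchain_google_bigtable.vector_store import QueryParameters

# Keep the defaults (kNN, COSINE); only narrow the search space by author.
author_params = QueryParameters(
    filters={"ColumnValueFilter": {"author": {"==": "Frank Herbert"}}}
)
herbert_results = await vector_store.asimilarity_search(
    "a desert planet ruled by a noble house", k=2, query_parameters=author_params
)
print([doc.id for doc in herbert_results])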
Understanding Encodings
To filter on metadata fields, you must define them in metadata_mappings
with the correct encoding
so Bigtable can properly interpret the data. Supported encodings include:
- String: UTF8, UTF16, ASCII for text-based metadata.
- Numeric: INT_BIG_ENDIAN or INT_LITTLE_ENDIAN for integers, and FLOAT or DOUBLE for decimal numbers.
- Boolean: BOOL for true/false values.
Filtering Support Table
Filter Category | Key / Operator | Meaning |
---|---|---|
Row Key | RowKeyFilter | Narrows search to document IDs with a specific prefix. |
Metadata Key | ColumnQualifiers | Checks for the presence of one or more exact metadata keys. |
Metadata Key | ColumnQualifierPrefix | Checks if a metadata key starts with a given prefix. |
Metadata Key | ColumnQualifierRegex | Checks if a metadata key matches a regular expression. |
Metadata Value | ColumnValueFilter | Container for all value-based conditions. |
Metadata Value | == | Equality |
Metadata Value | != | Inequality |
Metadata Value | > | Greater than |
Metadata Value | < | Less than |
Metadata Value | >= | Greater than or equal |
Metadata Value | <= | Less than or equal |
Metadata Value | in | Value is in a list. |
Metadata Value | nin | Value is not in a list. |
Metadata Value | contains | Checks for substring presence. |
Metadata Value | like | Performs a regex match on a string. |
Logical | ColumnValueChainFilter | Logical AND for combining value conditions. |
Logical | ColumnValueUnionFilter | Logical OR for combining value conditions. |
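Before the nested example in the next section, here is a simple single-condition value filter built from the operators above. The >= condition follows the same dictionary shape as the > and < comparisons used in the complex example below; treat the exact operator spelling as something to verify against the package documentation.

from langchain_google_bigtable.vector_store import QueryParameters

# A single-condition value filter: only documents with year >= 1960.
recent_filter = {"ColumnValueFilter": {"year": {">=": 1960}}}
recent_params = QueryParameters(filters=recent_filter)
recent_results = await vector_store.asimilarity_search(
    "an epic adventure", k=3, query_parameters=recent_params
)
for doc in recent_results:
    print(doc.id, doc.metadata.get("year"))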
Complex Filter Example
This example uses nested logical filters. It searches for documents that match either (category is 'sci-fi' AND year is between 1970 and 2000) OR (author is 'J.R.R. Tolkien').
from langchain_google_bigtable.vector_store import QueryParameters
complex_filter = {
    "ColumnValueFilter": {
        "ColumnValueUnionFilter": {  # OR
            "ColumnValueChainFilter": {  # First AND condition
                "category": {"==": "sci-fi"},
                "year": {">": 1970, "<": 2000},
            },
            "author": {"==": "J.R.R. Tolkien"},
        }
    }
}
query_params_complex = QueryParameters(filters=complex_filter)
complex_results = await vector_store.asimilarity_search(
"a story about a hero's journey", k=5, query_parameters=query_params_complex
)
print(f"Found {len(complex_results)} documents matching the complex filter:")
for doc in complex_results:
    print(f"- ID: {doc.id}, Metadata: {doc.metadata}")
Search with score
You can also retrieve the distance score along with the documents.
results_with_scores = await vector_store.asimilarity_search_with_score(
query="an evil entity", k=1
)
for doc, score in results_with_scores:
    print(f"* [SCORE={score:.4f}] {doc.page_content} [{doc.metadata}]")
Use as Retriever
The vector store can be easily used as a retriever in RAG applications. You can specify the search type (e.g., similarity
or mmr
) and pass search-time arguments like k
and query_parameters
.
# Define a filter to use with the retriever
retriever_filter = {"ColumnValueFilter": {"category": {"==": "horror"}}}
retriever_query_params = QueryParameters(filters=retriever_filter)
retriever = vector_store.as_retriever(
search_type="mmr", # Specify MMR for retrieval
search_kwargs={
"k": 1,
"lambda_mult": 0.8,
"query_parameters": retriever_query_params, # Pass filter parameters
},
)
retrieved_docs = await retriever.ainvoke("a story about a hobbit")
print(retrieved_docs[0].page_content)
Usage for retrieval-augmented generation
For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG tutorials and how-to guides in the LangChain documentation.
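As a minimal sketch of RAG usage, the snippet below builds a fresh retriever from the vector store and feeds it into a simple LCEL chain. The chat model name is an assumption; substitute whichever Vertex AI chat model is available in your project.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_vertexai import ChatVertexAI

# Assumed model name; replace with a Vertex AI chat model you have access to.
llm = ChatVertexAI(model="gemini-2.0-flash", project=PROJECT_ID)

# A plain retriever without filters, returning the top 2 matches.
rag_retriever = vector_store.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join retrieved documents into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": rag_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(await rag_chain.ainvoke("Who has to destroy a powerful ring?"))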
API reference
For full details on the BigtableVectorStore
class, see the source code on GitHub.
Related
- Vector store conceptual guide
- Vector store how-to guides