A vector database is a specialized type of database designed to efficiently store, retrieve, and query data in vector form. These vectors are typically numerical feature embeddings of high-dimensional data (e.g., images, text, audio) produced by machine learning models. An embedding captures the essential characteristics of the data, such as its semantic meaning, by encoding it as a point in vector space.
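For example, embeddings of related concepts end up close together in vector space, and closeness is often measured with cosine similarity. A minimal standard-library sketch with made-up toy vectors (real embeddings typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (invented for illustration)
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.75, 0.15, 0.05]
car = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, car))     # low: semantically distant
```

A vector database answers "which stored embeddings are most similar to this query?" efficiently, instead of comparing against every vector one by one.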
Vector databases play a critical role in machine learning tasks where similarity search or clustering of high-dimensional data is needed. Common usage scenarios include:
Semantic search: retrieving text passages by meaning rather than exact keywords (including retrieval-augmented generation)
Image, audio, and video similarity search
Recommendation systems: matching user embeddings against item embeddings
Clustering, deduplication, and anomaly detection over embeddings
LanceDB is an open-source vector database designed for efficient and fast storage, retrieval, and management of high-dimensional vectors. It focuses on providing real-time performance and scalability for machine learning and AI applications. LanceDB is built for handling vector search workloads, allowing users to store embeddings (from text, images, or other data types) and perform similarity searches with high efficiency. Key features include:
Performance-Oriented: Built to handle the performance needs of real-time vector search applications.
Machine Learning-Friendly: Specifically designed to fit within the machine learning ecosystem, making it easy to integrate with modern AI pipelines.
Self-Hosted: Gives users full control over their data without the need for external APIs or services.
#-------------------------------------------------------------#
# pip install lancedb
#-------------------------------------------------------------#
import lancedb
import numpy as np
# Connect to (or create) a LanceDB database at the given path
db = lancedb.connect("/path/to/lancedb")  # specify where the database will be stored
# Generate random vector data (e.g., 1000 vectors of dimensionality 128)
vectors = np.random.rand(1000, 128).tolist()
# Pair each vector with associated metadata (here, an integer ID)
data = [{"id": i, "vector": vec} for i, vec in enumerate(vectors)]
# Create a table (similar to a table in a traditional DB) and insert the records
table = db.create_table("example_collection", data=data)
# Query the table for the nearest neighbors of a new vector
query_vector = np.random.rand(128).tolist()  # generate a random query vector
results = table.search(query_vector).limit(5).to_list()  # limit the result to top 5
# Display the nearest neighbors and their distances (reported as "_distance")
for result in results:
    print(f"ID: {result['id']}, Distance: {result['_distance']}")
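The printed distances come from whichever distance metric the index uses. To make the two common choices concrete, here is a small standard-library sketch (toy vectors, not tied to any particular database): Euclidean distance is sensitive to vector magnitude, while cosine distance depends only on direction.

```python
import math

def euclidean(a, b):
    # Straight-line distance; grows when magnitudes differ
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends only on direction, not length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]  # same direction as v, but twice the length

print(euclidean(v, w))        # nonzero: magnitudes differ
print(cosine_distance(v, w))  # ~0.0: identical direction
```

Which metric is appropriate depends on whether vector length carries meaning for your embeddings; cosine is a common default for text embeddings.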
The .search() call performs a nearest-neighbor search, returning the stored vectors most similar to the query according to a distance metric (e.g., Euclidean distance or cosine similarity).
Annoy is an open-source library developed by Spotify for finding approximate nearest neighbors. It is designed for situations where you want fast search over large datasets and are willing to trade some accuracy for performance.
#-------------------------------------------------------------#
# pip install annoy
#-------------------------------------------------------------#
from annoy import AnnoyIndex
import numpy as np
# Create an index with 128-dimensional vectors and angular distance metric
f = 128 # dimension of vectors
index = AnnoyIndex(f, 'angular')
# Add vectors to the index
for i in range(1000):
    vector = np.random.random(f).tolist()
    index.add_item(i, vector)
# Build the index (tree count affects speed/accuracy)
index.build(10)
# Query for the nearest neighbors
nearest_neighbors = index.get_nns_by_item(0, 5) # top-5 neighbors of item 0 (the item itself is included)
print("Nearest neighbors:", nearest_neighbors)
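Annoy's answers are approximate. For intuition about what is being approximated, an exact brute-force nearest-neighbor search scores every stored vector against the query, which costs O(n) per query; that full scan is what tree-based indexes like Annoy avoid at scale. A minimal standard-library sketch with toy data and a hypothetical helper name:

```python
import math

def brute_force_nns(query, items, k):
    # Score every stored vector against the query, then keep the k closest.
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scored = sorted(items.items(), key=lambda kv: euclidean(query, kv[1]))
    return [item_id for item_id, _ in scored[:k]]

# Toy 2-D vectors keyed by ID
items = {
    0: [0.0, 0.0],
    1: [1.0, 1.0],
    2: [0.1, 0.1],
    3: [5.0, 5.0],
}
print(brute_force_nns([0.02, 0.02], items, 2))  # → [0, 2]
```

Annoy trades the guarantee of this exact answer for much faster queries by only exploring a subset of candidates chosen via its random-projection trees.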
Pinecone is a fully managed, cloud-hosted vector database service: you send embeddings to its API, and it handles indexing, scaling, and similarity search for you.
#-------------------------------------------------------------#
# pip install pinecone-client
#-------------------------------------------------------------#
import pinecone
import numpy as np
pinecone.init(api_key="your_api_key", environment="us-west1-gcp")  # classic pinecone-client API
# Create a vector index (assumes a vector dimensionality of 128)
index_name = "example-index"
pinecone.create_index(index_name, dimension=128)
# Connect to the index
index = pinecone.Index(index_name)
# Example data: vectors representing feature embeddings
data = [
("item1", np.random.rand(128).tolist()),
("item2", np.random.rand(128).tolist()),
("item3", np.random.rand(128).tolist())
]
# Insert vectors into the index
index.upsert(vectors=data)
# Perform similarity search (find top 3 similar vectors to a query vector)
query_vector = np.random.rand(128).tolist()
results = index.query(vector=query_vector, top_k=3)  # single-vector query returns matches directly
# Print out results
for match in results['matches']:
    print(f"ID: {match['id']}, Score: {match['score']}")
# Deleting the index when no longer needed
pinecone.delete_index(index_name)
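Conceptually, top_k simply asks the service for the k highest-scoring matches. A local sketch of that selection over hypothetical (id, score) pairs, using only the standard library (illustrative, not Pinecone code):

```python
import heapq

# Hypothetical scored matches, as a similarity search might produce
matches = [
    {"id": "item1", "score": 0.92},
    {"id": "item2", "score": 0.31},
    {"id": "item3", "score": 0.77},
    {"id": "item4", "score": 0.88},
]

# Keep the 3 highest-scoring matches, like query(..., top_k=3)
top_3 = heapq.nlargest(3, matches, key=lambda m: m["score"])
print([m["id"] for m in top_3])  # → ['item1', 'item4', 'item3']
```

The hard part a managed service solves is not this final selection but computing scores over millions of vectors quickly, which requires approximate indexes like those discussed above.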