Mosaic AI Vector Search

Mosaic AI Vector Search is Databricks' fully managed vector database, purpose-built to live alongside the Lakehouse. Rather than standing up a separate vector store and shipping embeddings out of your governed data plane, Vector Search runs inside the workspace, reads directly from Delta tables, and inherits Unity Catalog permissions. The result is a retrieval layer that stays in sync with the source-of-truth data, never drifts during nightly refreshes, and is governed by the same grants that protect the underlying rows.

For RAG applications, agents, semantic search over enterprise content, and recommendation features, Vector Search collapses what is typically a four-product stack — embedding pipeline, vector DB, sync job, and access-control glue — into one declarative resource.


1. What Mosaic AI Vector Search Is

Vector Search is a managed similarity-search service that indexes embedding vectors stored in (or generated from) Delta tables. The service handles index construction, sharding, replication, and online query serving — there is no cluster to size, no JVM to tune, no shard-rebalancing playbook. From a developer's standpoint it appears as a Python SDK and a REST endpoint that returns the top-k nearest neighbors with their associated metadata columns.

2. Index Types: Delta Sync vs Direct Vector Access

Vector Search supports two index types. Picking the right one is the most important architectural decision.

Delta Sync Index

A Delta Sync Index is bound to a Delta table. Whenever rows are inserted, updated, or deleted in the source table, the index is automatically reconciled. You never call an "upsert vector" API. This is the index type to default to: it inherits the lineage, governance, and observability of the underlying Delta table, and removes the entire class of bugs where the vector store falls out of sync with the warehouse.

Direct Vector Access Index

A Direct Vector Access Index has no Delta source. Vectors are pushed in via the SDK with explicit upsert calls. Use this only when the embedding lifecycle genuinely lives outside the lakehouse — for example, a third-party SaaS pushes vectors over a webhook, or you are doing rapid experimentation with vectors that should not be persisted to a table.

For 90% of use cases, Delta Sync is the right answer: it makes the lakehouse the source of truth and the index a derived view.

3. Embedding Options

Vector Search supports two embedding modes per index. Choose at index creation; you cannot switch later without rebuilding.

Databricks-Managed Embeddings

You point the index at a text column and a Databricks Foundation Model serving endpoint. Databricks computes embeddings during sync and at query time on your behalf. Embedding models such as databricks-gte-large-en and databricks-bge-large-en (the model used in the examples below) ship out of the box behind pay-per-token serving endpoints.

The benefit of managed embeddings is operational: nothing to deploy, no drift between query-time and index-time encoders, and no separate billing for the embedding compute outside of Databricks token consumption.

Bring Your Own (BYO) Embeddings

You compute vectors yourself — with OpenAI text-embedding-3-large, a custom fine-tuned model on Mosaic AI Model Serving, sentence-transformers in a notebook, etc. — write them as an ARRAY<FLOAT> column to the source Delta table, and tell Vector Search which column holds the vectors.

Use BYO when you need a model the platform doesn't ship (multilingual E5, code-specific embeddings, image embeddings via CLIP), when you are migrating from an existing pipeline, or when cross-system reproducibility matters.
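With BYO embeddings, malformed vectors in the source column are a common cause of failed syncs, so it is worth a pre-flight check before creating the index. The sketch below validates rows pulled into Python (for example via spark.table(...).collect()); the function name and row shape are illustrative, not part of the Vector Search API. The only hard requirement it encodes is that every vector is non-null, finite, and exactly as long as the embedding_dimension you will declare at index creation.

```python
# Pre-flight validation for a BYO embedding column. Rows are (primary_key,
# vector) pairs; `dim` must match the embedding_dimension passed to
# create_delta_sync_index. Names here are illustrative.
import math

def validate_embeddings(rows, dim):
    """Return a list of (primary_key, reason) for rows likely to fail sync."""
    bad = []
    for pk, vec in rows:
        if vec is None:
            bad.append((pk, "null vector"))
        elif len(vec) != dim:
            bad.append((pk, f"dimension {len(vec)} != {dim}"))
        elif any(not math.isfinite(x) for x in vec):
            bad.append((pk, "non-finite value"))
    return bad

rows = [
    ("doc-1", [0.1] * 1536),
    ("doc-2", [0.2] * 768),   # wrong dimension
    ("doc-3", None),          # missing embedding
]
problems = validate_embeddings(rows, dim=1536)
```

Running this over a sample of the source table before the first sync is much cheaper than debugging a half-built index afterward.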

4. Creating an Index

The recommended workflow: create or pick an endpoint, prepare the source Delta table, then create the index. Endpoints are workspace-level resources and are billed by uptime.

Create an Endpoint


from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

vsc.create_endpoint(
    name="prod_vs_endpoint",
    endpoint_type="STANDARD",  # or "STORAGE_OPTIMIZED" for very large indexes
)

Delta Sync Index with Databricks-Managed Embeddings

This is the simplest production setup. The source table just needs a primary key and a text column.


-- Source Delta table must have CDF enabled
CREATE TABLE main.rag.product_docs (
  doc_id        STRING NOT NULL,
  product_sku   STRING,
  category      STRING,
  doc_text      STRING,
  updated_at    TIMESTAMP
)
TBLPROPERTIES (delta.enableChangeDataFeed = true);

ALTER TABLE main.rag.product_docs
  ADD CONSTRAINT pk_product_docs PRIMARY KEY (doc_id);

index = vsc.create_delta_sync_index(
    endpoint_name="prod_vs_endpoint",
    index_name="main.rag.product_docs_index",
    source_table_name="main.rag.product_docs",
    pipeline_type="TRIGGERED",          # or "CONTINUOUS"
    primary_key="doc_id",
    embedding_source_column="doc_text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)

Delta Sync Index with BYO Embeddings


# Source table has a precomputed ARRAY<FLOAT> column called `embedding`
index = vsc.create_delta_sync_index(
    endpoint_name="prod_vs_endpoint",
    index_name="main.rag.product_docs_byo_index",
    source_table_name="main.rag.product_docs",
    pipeline_type="CONTINUOUS",
    primary_key="doc_id",
    embedding_dimension=1536,           # must match your model
    embedding_vector_column="embedding",
)

Direct Vector Access Index


index = vsc.create_direct_access_index(
    endpoint_name="prod_vs_endpoint",
    index_name="main.rag.scratch_index",
    primary_key="id",
    embedding_dimension=1536,
    embedding_vector_column="embedding",
    schema={
        "id": "string",
        "text": "string",
        "embedding": "array<float>",
        "source": "string",
    },
)

index.upsert([
    {"id": "doc-1", "text": "...", "embedding": [...], "source": "wiki"},
    {"id": "doc-2", "text": "...", "embedding": [...], "source": "wiki"},
])

Triggering a Sync


# For TRIGGERED pipelines, kick off a sync after batch ETL completes
idx = vsc.get_index(
    endpoint_name="prod_vs_endpoint",
    index_name="main.rag.product_docs_index",
)
idx.sync()
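sync() kicks off an asynchronous reconciliation, so batch jobs typically poll the index before routing traffic to it. The SDK exposes idx.describe() for index state, but the exact status fields vary by SDK version, so this sketch takes a status callable rather than hard-coding a field name; wait_until_ready and the "ONLINE" status string are assumptions to verify against describe() output in your workspace.

```python
import time

def wait_until_ready(get_status, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll `get_status` (e.g. a wrapper around idx.describe()) until it
    reports "ONLINE"; raise if the timeout elapses. `sleep` is injectable
    so the loop can be exercised in tests without real waiting."""
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        if status == "ONLINE":
            return waited
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"index not ready after {timeout_s}s (last status: {status})")
```

In a scheduled ETL job this runs right after idx.sync(), gating the step that flips the retriever to the fresh index.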

5. Querying the Index

The query API is a single method, similarity_search. It accepts either a query string (when the index uses managed embeddings) or a query vector (when BYO).


idx = vsc.get_index(
    endpoint_name="prod_vs_endpoint",
    index_name="main.rag.product_docs_index",
)

results = idx.similarity_search(
    query_text="how do I reset the device to factory settings",
    columns=["doc_id", "product_sku", "category", "doc_text"],
    num_results=5,
)

for row in results["result"]["data_array"]:
    doc_id, sku, category, text, score = row
    print(f"{score:.3f}  {sku}  {doc_id}")
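Because data_array rows are positional (the requested columns in order, with the similarity score appended, as in the loop above), a small helper that zips them into dicts keeps downstream code readable when the column list changes. This is a convenience sketch built on the response shape shown here, not an SDK API.

```python
def rows_to_dicts(results, columns, score_key="score"):
    """Convert positional data_array rows into dicts keyed by column name,
    assuming each row is the requested columns in order plus a trailing
    similarity score (the shape used in the example above)."""
    out = []
    for row in results["result"]["data_array"]:
        rec = dict(zip(columns, row))  # zip stops before the trailing score
        rec[score_key] = row[-1]
        out.append(rec)
    return out

fake = {"result": {"data_array": [["d1", "SKU-1", "support", "reset steps...", 0.91]]}}
recs = rows_to_dicts(fake, ["doc_id", "product_sku", "category", "doc_text"])
```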

For BYO indexes, swap query_text for query_vector:


from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
qvec = client.embeddings.create(
    input="how do I reset the device",
    model="text-embedding-3-large",
).data[0].embedding

results = idx.similarity_search(
    query_vector=qvec,
    columns=["doc_id", "doc_text"],
    num_results=10,
)

Metadata Filtering

Filters are applied during ANN traversal — they prune the candidate set rather than post-filtering, so recall stays high even with selective filters. Filters are passed as a JSON-style dict against any indexed metadata column.


results = idx.similarity_search(
    query_text="warranty replacement policy",
    columns=["doc_id", "product_sku", "category", "doc_text"],
    filters={"category": "support", "product_sku": ["SKU-100", "SKU-101"]},
    num_results=5,
)

Range and boolean operators are supported via the operator suffix syntax:


results = idx.similarity_search(
    query_text="recent compliance changes",
    columns=["doc_id", "doc_text", "updated_at"],
    filters={
        "updated_at >=": "2025-01-01",
        "category NOT": "deprecated",
    },
    num_results=10,
)

Hybrid Search

Pure vector search misses on rare proper nouns, error codes, and SKUs — embeddings smear them into a fuzzy neighborhood. Hybrid search blends vector similarity with BM25-style lexical scoring and is almost always the right default for enterprise retrieval. Set query_type="HYBRID" on a managed-embedding index.


results = idx.similarity_search(
    query_text="error code E-4471 firmware patch",
    columns=["doc_id", "product_sku", "doc_text"],
    query_type="HYBRID",
    num_results=10,
)

Hybrid mode is what bridges the gap between "semantic search demo" and "production retrieval that finds the exact ticket your customer asked about."
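Databricks manages the score fusion internally, so there is nothing to tune, but the intuition is worth seeing. Reciprocal rank fusion (RRF) is one standard way to blend ranked lists; whether Vector Search uses exactly this formula is not specified here, so treat the sketch below as conceptual, not as the service's implementation.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score each doc by sum(1 / (k + rank)) across
    the ranked lists. A doc that is #1 lexically (an exact error-code or SKU
    match) surfaces near the top even if the vector ranking buried it."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_rank  = ["doc-7", "doc-2", "doc-9", "doc-4"]   # semantic neighbors
lexical_rank = ["doc-4", "doc-7", "doc-1"]            # BM25: exact "E-4471" hit
fused = rrf([vector_rank, lexical_rank])
```

Note how doc-4, mediocre by vector similarity but top-ranked lexically, moves ahead of pure-semantic neighbors in the fused list.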

6. Unity Catalog Governance

Vector Search indexes are first-class Unity Catalog objects. They live under a catalog and schema (main.rag.product_docs_index), inherit ownership, and are protected by standard GRANT statements.


GRANT USE CATALOG ON CATALOG main TO `rag-app-service-principal`;
GRANT USE SCHEMA  ON SCHEMA  main.rag TO `rag-app-service-principal`;
GRANT SELECT      ON TABLE   main.rag.product_docs_index
                            TO `rag-app-service-principal`;

Two governance properties fall out of this model. First, there is no second permission system to keep in sync: revoking SELECT on the index (or USE on the catalog above it) revokes retrieval, full stop. Second, a RAG application inherits the governance of its corpus for free, because the same grants that protect the source rows protect the index derived from them.

7. RAG Retrieval Layer for Databricks Apps

A typical Mosaic AI RAG chatbot deployed as a Databricks App wires together: Vector Search for retrieval, Foundation Model APIs for generation, and the Agent Framework for orchestration. The retrieval call sits behind a LangChain-compatible retriever:


from databricks_langchain import DatabricksVectorSearch
from databricks_langchain import ChatDatabricks
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

retriever = DatabricksVectorSearch(
    endpoint="prod_vs_endpoint",
    index_name="main.rag.product_docs_index",
    columns=["doc_id", "product_sku", "doc_text"],
).as_retriever(search_kwargs={"k": 5, "query_type": "HYBRID"})

llm = ChatDatabricks(endpoint="databricks-meta-llama-3-1-70b-instruct")

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. Cite doc_id."),
    ("human",  "Context:\n{context}\n\nQuestion: {question}"),
])

def format_docs(docs):
    return "\n\n".join(f"[{d.metadata['doc_id']}] {d.page_content}" for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

answer = chain.invoke("how do I reset the device to factory settings")

Deployed behind a Databricks App, the same chain is exposed as an HTTPS endpoint and can be embedded into a Streamlit, Gradio, or React frontend with workspace-level OAuth.

8. When to Use Vector Search vs pgvector vs Pinecone

The decision boils down to where the data lives and who runs the cluster.

| Option | Best For | Trade-offs |
| --- | --- | --- |
| Mosaic AI Vector Search | Data already in Delta + UC; want governance and zero-ops. | Locked to a Databricks workspace; endpoint billed by uptime. |
| pgvector / pgvectorscale | Existing Postgres app; want vectors next to transactional rows; cost-sensitive. | You own the index tuning, the autovacuum, the read replicas. Doesn't scale to 100M+ vectors gracefully. |
| Pinecone / Weaviate / Qdrant | Multi-cloud, multi-source, very large scale; want a SaaS independent of the data warehouse. | Separate auth, separate sync pipelines, separate bill, no UC lineage. |

The short version: if the corpus already lives in a Delta table and you want a single set of grants to govern who can ask questions of it, use Mosaic AI Vector Search. If you are a Postgres shop with under a few million vectors, pgvector is the cheapest correct answer. If you need a vector store independent of any warehouse, Pinecone-style SaaS is the cleanest choice.


Common Interview Questions:

What is Mosaic AI Vector Search and what problem does it solve?

It is Databricks' managed vector index that sits on top of a Delta table, providing approximate nearest-neighbor search with the source-of-truth corpus governed by Unity Catalog. The problem it solves is consistency: in a typical RAG stack the corpus lives in a warehouse and a separate vector DB has to be kept in sync — that pipeline is the most common source of stale answers. Vector Search collapses the two so the index is automatically kept in lockstep with the table.

Delta Sync Index vs Direct Access Index — when do you use each?

Delta Sync is the right default: you point the index at a Delta table, and Databricks watches the change data feed and reconciles inserts, updates, and deletes automatically (with a configurable trigger). Direct Access is for cases where you compute embeddings outside Databricks and want to push vectors via the API on your own schedule — useful for streaming low-latency updates or when the corpus does not live in a Delta table. Pick Delta Sync unless you have a specific reason not to; Direct Access trades the consistency guarantee for control.

How do you choose an embedding model for vector search?

Three axes: dimensionality (768 is a good default; 1536+ costs more storage and query CPU for marginal recall gains on most corpora), domain fit (general models like bge-large-en-v1.5 for English text, all-MiniLM-L6-v2 when latency matters, code-specific or multilingual models for niche corpora), and serving cost. Always benchmark on your own evaluation set with recall@k and MRR — the leaderboard ordering on MTEB does not always hold on private corpora. If retrieval quality is poor, the embedding model is usually a bigger lever than re-ranking.
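The benchmarking step above is simple enough to keep in-repo. A minimal sketch, assuming an eval set of (query, relevant_doc_id) pairs with a single gold document per query and any retrieval callable (a wrapper around similarity_search, or a competing model) that returns a ranked list of doc IDs:

```python
def evaluate(retrieve, eval_set, k=5):
    """Compute recall@k and MRR@k for a retrieval callable. `eval_set` is a
    list of (query, relevant_id) pairs; the reciprocal rank is 0 when the
    gold doc is not in the top k (the truncated-MRR convention)."""
    hits, rr_sum = 0, 0.0
    for query, relevant_id in eval_set:
        ranked = retrieve(query)[:k]
        if relevant_id in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(relevant_id) + 1)
    n = len(eval_set)
    return {"recall@k": hits / n, "mrr": rr_sum / n}
```

Run it once per candidate embedding model over the same eval set; the model with the best recall@k on your corpus wins, regardless of its MTEB position.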

How does Unity Catalog govern a vector index?

The index is a Unity Catalog object with the same three-level namespace and grant model as a table — GRANT USE CATALOG, USE SCHEMA, SELECT ON catalog.schema.index. Row-level filters and column masks on the source Delta table are honored by Delta Sync indexes, so a user querying the index sees only rows they would see querying the table. That means RAG applications inherit governance for free: the same data product can serve customer-segment-specific queries without building a separate index per segment.

What index type does Mosaic AI Vector Search use under the hood?

It uses HNSW (Hierarchical Navigable Small World) graphs — the standard approximate nearest-neighbor structure — with the parameters managed by Databricks rather than tuned by hand. This trades the fine-grained control of building your own FAISS or Milvus index for a serverless experience: you specify the embedding column and the endpoint scales automatically. For workloads needing IVF-PQ-style memory compression on billions of vectors, a self-managed store still has the edge, but for the typical enterprise RAG corpus of millions to low tens of millions of chunks it is more than enough.

How do you keep a vector index from drifting from the source corpus?

For Delta Sync indexes the platform handles it via the Delta change data feed — enable CDF on the source table, set the index trigger to either continuous (low-latency, billed) or scheduled (cheaper, latency-tolerant). Track sync lag explicitly with the system tables; alert if lag exceeds the freshness SLA. For Direct Access indexes the responsibility is yours: emit a write to the index from the same job that writes to the source table, ideally inside a transactional pattern (write to staging, then commit both) so partial failures do not leave the index inconsistent.

