Retrieval-Augmented Generation (RAG) Using Snowflake Cortex

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Instead of relying solely on a standalone large language model (LLM), RAG first retrieves relevant data from a knowledge base and provides that context to the LLM. This approach makes responses more accurate and grounded in up-to-date information without requiring model retraining.

Why Use Snowflake Cortex for RAG?

Fully Managed Service: No need to set up integrations, manage infrastructure, or move data outside Snowflake's governance boundary
Data Security: Your data stays within Snowflake, maintaining performance, scalability, and governance
Hybrid Search: Combines vector (semantic) search with keyword search for optimal retrieval quality
Automatic Embeddings: Cortex Search automatically generates embeddings without manual configuration
Semantic Reranking: Automatically reranks results for higher relevance

Key Components of Snowflake Cortex for RAG

1. Cortex Search Service

A fully managed hybrid search service that powers the retrieval component of RAG applications.

Enables low-latency, high-quality "fuzzy" search over Snowflake data
Automatically handles embedding, infrastructure maintenance, and index refreshes
Supports filtering based on metadata tags

2. Cortex LLM Functions

SQL and Python functions that provide access to industry-leading LLMs for the generation component.

COMPLETE: Generates responses from prompts using LLMs like Mistral, Llama, Claude, and more
EMBED_TEXT: Converts text into vector embeddings
SUMMARIZE: Summarizes text content
EXTRACT_ANSWER: Extracts answers from unstructured data

RAG Architecture with Snowflake Cortex

Data Ingestion: Store your documents in Snowflake tables
Chunking: Split documents into smaller contextually rich blocks (recommended: 512 tokens or less)
Indexing: Create a Cortex Search Service that automatically embeds and indexes your text
Retrieval: Query the search service to find relevant document chunks
Augmentation: Combine retrieved context with the user query
Generation: Pass the augmented prompt to an LLM via the COMPLETE function

Implementation Steps

Step 1: Set Up Database and Warehouse

-- Create database and warehouse
CREATE DATABASE IF NOT EXISTS cortex_search_db;

CREATE OR REPLACE WAREHOUSE cortex_search_wh WITH
   WAREHOUSE_SIZE='X-SMALL';

CREATE OR REPLACE SCHEMA cortex_search_db.services;

Step 2: Create and Populate Your Knowledge Base Table

-- Create table for documents
CREATE OR REPLACE TABLE support_transcripts (
    transcript_text VARCHAR,
    region VARCHAR,
    agent_id VARCHAR
);

-- Insert sample data
INSERT INTO support_transcripts VALUES
    ('My internet has been down since yesterday, can you help?', 'North America', 'AG1001'),
    ('I was overcharged for my last bill, need an explanation.', 'Europe', 'AG1002'),
    ('How do I reset my password? The email link is not working.', 'Asia', 'AG1003'),
    ('I received a faulty router, can I get it replaced?', 'North America', 'AG1004');

Step 3: Create the Cortex Search Service

-- Create Cortex Search Service for RAG retrieval
CREATE OR REPLACE CORTEX SEARCH SERVICE transcript_search_service
  ON transcript_text
  ATTRIBUTES region
  WAREHOUSE = cortex_search_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
  AS (
    SELECT
        transcript_text,
        region,
        agent_id
    FROM support_transcripts
);

Step 4: Query the Search Service (Retrieval)

-- Preview search results
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      'cortex_search_db.services.transcript_search_service',
      '{
        "query": "internet issues",
        "columns":["transcript_text", "region"],
        "filter": {"@eq": {"region": "North America"} },
        "limit": 5
      }'
  )
)['results'] as results;

Step 5: Generate Response Using LLM (Generation)

-- Use COMPLETE function for RAG response generation
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'Based on the following context, answer the user question.\n\n',
        'Context: ', retrieved_context, '\n\n',
        'Question: ', user_question, '\n\n',
        'Answer:'
    )
) AS response;

Complete RAG Query Example

-- Full RAG pipeline combining retrieval and generation
WITH retrieved_docs AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'cortex_search_db.services.transcript_search_service',
            '{
                "query": "billing overcharge",
                "columns": ["transcript_text", "region"],
                "limit": 3
            }'
        )
    )['results'] AS results
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'You are a helpful customer support assistant. ',
        'Based on the following support transcripts, provide a helpful response.\n\n',
        'Relevant transcripts: ', results::VARCHAR, '\n\n',
        'Customer question: How do I dispute a billing charge?\n\n',
        'Response:'
    )
) AS rag_response
FROM retrieved_docs;

Python Implementation with Snowpark

from snowflake.core import Root
from snowflake.snowpark import Session
from snowflake.cortex import Complete

# Create session
CONNECTION_PARAMETERS = {
    "account": "your_account",
    "user": "your_user",
    "password": "your_password",
    "warehouse": "cortex_search_wh",
    "database": "cortex_search_db",
    "schema": "services"
}

session = Session.builder.configs(CONNECTION_PARAMETERS).create()
root = Root(session)

# Get the search service
search_service = (root
    .databases["cortex_search_db"]
    .schemas["services"]
    .cortex_search_services["transcript_search_service"]
)

# Retrieval: Search for relevant documents
user_query = "internet connection problems"
search_response = search_service.search(
    query=user_query,
    columns=["transcript_text", "region"],
    filter={"@eq": {"region": "North America"}},
    limit=3
)

# Get retrieved context
retrieved_context = search_response.to_json()

# Generation: Use LLM with retrieved context
prompt = f"""Based on the following customer support transcripts, 
provide a helpful response to the user's question.

Retrieved transcripts: {retrieved_context}

User question: {user_query}

Response:"""

response = Complete(
    model="mistral-large",
    prompt=prompt,
    session=session
)

print(response)

Available Embedding Models for Cortex Search

snowflake-arctic-embed-m-v1.5: Default model, English-only, 512 token context, fastest indexing
snowflake-arctic-embed-l-v2.0: Multilingual support, 512 token context, high quality
snowflake-arctic-embed-l-v2.0-8k: Multilingual, 8192 token context for longer documents
voyage-multilingual-2: Voyage AI model, multilingual, 32,000 token context

Available LLMs for Generation

mistral-large: High capability, good baseline for most tasks
mixtral-8x7b: Efficient mixture-of-experts model
llama3.1-70b: Meta's powerful open model
claude-3-5-sonnet: Anthropic's model for complex reasoning
snowflake-arctic: Snowflake's enterprise-grade open model

Best Practices

For Optimal Retrieval Quality

Split text into chunks of no more than 512 tokens (approximately 385 English words)
Use Snowflake's SPLIT_TEXT_RECURSIVE_CHARACTER function for chunking
Enable change tracking on source tables for automatic index updates
Use metadata filters to narrow search scope when applicable

For Better Generation

Provide clear instructions in your prompt
Include relevant context from retrieval results
Specify the desired output format
Use temperature settings to control response creativity

Cost Considerations

Virtual Warehouse Compute: Used for refreshing the search service
Embedding Tokens: Cost per token embedded into vector space
Serving Compute: Per GB/month of indexed data
LLM Inference: Cost per token processed by COMPLETE function
Storage: For materialized tables and index structures

Key Benefits Summary

No External Dependencies: Everything runs within Snowflake
Enterprise Security: Data never leaves Snowflake's governance boundary
Automatic Updates: Search index refreshes automatically when data changes
Hybrid Search: Combines semantic and keyword search for best results
Scalability: Supports up to 100M rows per search service
SQL-First Approach: Build RAG applications using familiar SQL syntax