Snowflake Cortex Agents

Cortex Agents is the orchestration runtime that ties Cortex Search and Cortex Analyst together into multi-step, tool-using assistants. An agent receives a natural-language request; decides whether to retrieve documents, run governed SQL, call a custom HTTP tool, or answer directly; executes that plan; and returns a response with citations and a tool-use trace. The whole loop runs inside the Snowflake account; the LLM is one of the Cortex-hosted foundation models (Claude, Mistral, Llama variants) and never sees data the calling user couldn't have read directly.

The mental model: Cortex Search gives you grounded text retrieval; Cortex Analyst gives you grounded text-to-SQL; Cortex Agents is the planner that picks between them, chains them, and renders an answer. As of 2026 it is generally available across most Snowflake regions and is billed by Cortex token consumption plus the warehouse credits any tool-invoked SQL consumes.


1. What Cortex Agents Is

A Cortex Agent is a server-side configuration plus a REST endpoint. The configuration declares the model to drive the agent, the system instructions, and a set of tools; the endpoint accepts a conversation and runs the model in a loop until it produces a final answer or hits a tool-call limit.
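That loop can be sketched in a few lines. This is a conceptual illustration only, not Snowflake's implementation; llm and tools are stand-ins for the hosted model and the declared tool set.

```python
def agent_loop(messages, tools, llm, max_tool_calls=5):
    """Conceptual sketch of the agent runtime (not Snowflake's code).

    `llm` stands in for the hosted model: given the conversation and the
    tool set, it returns either a final answer or a proposed tool call.
    """
    for _ in range(max_tool_calls):
        step = llm(messages, tools)                  # model plans the next action
        if step["type"] == "final_answer":
            return step                              # done: emit answer
        result = tools[step["tool"]](step["input"])  # runtime executes the call
        # feed the tool result back so the model can reason over it
        messages = messages + [step, {"role": "tool", "content": result}]
    return {"type": "final_answer", "text": "tool-call limit reached"}
```

The max_tool_calls bound is what prevents a runaway plan from looping forever; the real runtime enforces the same kind of limit.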

2. Defining an Agent

Agents can be defined inline in a request or persisted as Snowflake objects via CREATE AGENT. Persisted agents are the right shape for production: they are governed by grants, versioned in source control via Snowflake CLI or Terraform, and reusable across applications.

Tools, Models, Instructions

A typical support-assistant agent has two tools (a knowledge-base search and a Cortex Analyst over the support-tickets warehouse), a Claude-class model, and a short system prompt that pins behavior.


name:        support_assistant
description: |
  Customer support assistant. Answers product questions from the KB,
  pulls live ticket and order data via Cortex Analyst, and cites sources.

models:
  orchestration: claude-3-7-sonnet     # planner / responder

instructions:
  response: |
    You are a customer support assistant for ACME devices.
    - Always cite knowledge-base articles by doc_id when answering from retrieval.
    - Use the analytics tool only for live data questions (open tickets, recent orders).
    - If the user asks about a SKU you do not recognize, ask for clarification.
    - Decline questions outside the support domain.
  sample_questions:
    - "How do I reset the X9 to factory settings?"
    - "How many open tickets do we have for SKU-100 this week?"
    - "What is the warranty policy on refurbished units?"

tools:
  - tool_spec:
      type: cortex_search
      name: kb_search
      description: Search the published support knowledge base.
    tool_resources:
      name:         ANALYTICS.SUPPORT.support_kb_search
      max_results:  6
      id_column:    doc_id
      title_column: doc_id

  - tool_spec:
      type: cortex_analyst_text_to_sql
      name: support_analytics
      description: Run governed SQL over support tickets and orders.
    tool_resources:
      semantic_model_file: "@ANALYTICS.SUPPORT.SEMANTIC_MODELS/support_ops.yaml"

Persist it with a DDL statement, then grant usage to the application role:


CREATE OR REPLACE AGENT ANALYTICS.SUPPORT.support_assistant
  FROM @ANALYTICS.SUPPORT.AGENT_SPECS/support_assistant.yaml;

GRANT USAGE ON AGENT ANALYTICS.SUPPORT.support_assistant
  TO ROLE SUPPORT_APP_ROLE;

3. REST API and Streaming Responses

Agents are invoked through the same Cortex REST surface as Analyst. The request shape is a conversation; the response shape is a stream of server-sent events that the client concatenates into a final message and a tool-use trace.


curl -N -X POST \
  "https://${ACCOUNT}.snowflakecomputing.com/api/v2/cortex/agent:run" \
  -H "Authorization: Bearer ${SNOWFLAKE_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "agent_name": "ANALYTICS.SUPPORT.support_assistant",
    "messages": [
      {"role": "user", "content": [{"type": "text", "text": "How many open tickets for SKU-100 this week, and is there a known fix?"}]}
    ]
  }'

Each event in the stream is one of: message.delta (incremental text), tool_use (the LLM is calling a tool), tool_result (the runtime is feeding the tool's response back), or message.stop (final answer). A minimal Python collector:


import json, os, requests, snowflake.connector

conn = snowflake.connector.connect(
    account="abc12345",
    user="support_app",
    authenticator="OAUTH",
    token=os.environ["SNOWFLAKE_OAUTH_TOKEN"],
    role="SUPPORT_APP_ROLE",
    warehouse="AGENT_WH",
)

def run_agent(question: str, history=None) -> dict:
    history = history or []
    history.append({
        "role": "user",
        "content": [{"type": "text", "text": question}],
    })

    resp = requests.post(
        f"https://{conn.host}/api/v2/cortex/agent:run",
        headers={
            "Authorization": f"Bearer {conn.rest.token}",
            "X-Snowflake-Authorization-Token-Type": "OAUTH",
            "Content-Type":  "application/json",
            "Accept":        "text/event-stream",
        },
        json={
            "agent_name": "ANALYTICS.SUPPORT.support_assistant",
            "messages":   history,
        },
        stream=True,
        timeout=120,
    )
    resp.raise_for_status()

    final_text  = []
    tool_trace  = []
    citations   = []

    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        evt = json.loads(line[5:].strip())
        etype = evt.get("event")

        if etype == "message.delta":
            for block in evt["data"]["delta"]["content"]:
                if block.get("type") == "text":
                    final_text.append(block["text"])

        elif etype == "tool_use":
            tool_trace.append({
                "tool":  evt["data"]["name"],
                "input": evt["data"]["input"],
            })

        elif etype == "tool_result":
            result = evt["data"]["content"]
            tool_trace[-1]["output_summary"] = (
                f"{len(result.get('searchResults', []))} hits"
                if "searchResults" in result
                else f"{len(result.get('rows', []))} rows"
            )
            for hit in result.get("searchResults", []):
                citations.append(hit.get("doc_id"))

        elif etype == "message.stop":
            break

    return {
        "answer":     "".join(final_text),
        "tool_trace": tool_trace,
        "citations":  citations,
    }


out = run_agent("How many open tickets for SKU-100 this week, and is there a known fix?")
print(out["answer"])
print(f"Tools called: {[t['tool'] for t in out['tool_trace']]}")
print(f"Citations:    {out['citations']}")
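Note that run_agent appends the user turn to history but not the agent's reply, so multi-turn callers must add the assistant message back themselves before the next call. A small illustrative helper keeps that symmetric:

```python
def append_assistant_turn(history: list, answer: str) -> list:
    """Record the agent's reply in the conversation so the next
    run_agent call sees the full multi-turn context."""
    history.append({
        "role": "assistant",
        "content": [{"type": "text", "text": answer}],
    })
    return history
```

Usage: keep one history list per conversation, call run_agent(question, history), then append_assistant_turn(history, out["answer"]) before the follow-up question.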

4. Parsing the Tool-Use Trace

The trace is the most valuable debugging artifact the agent produces. It is the equivalent of a query plan: it tells you what the LLM decided to do, what data came back, and in what order. A typical multi-tool answer for the example question above produces a trace like:


[
  {
    "tool":  "support_analytics",
    "input": {"query": "count of open tickets for SKU-100 in the last 7 days"},
    "output_summary": "1 row"
  },
  {
    "tool":  "kb_search",
    "input": {"query": "SKU-100 known fix open issue"},
    "output_summary": "4 hits"
  }
]

A few things to inspect when an agent answer looks wrong:

- Which tools were called, and in what order. A missing call means the model decided that tool wasn't needed, which is usually an instructions or tool-description problem, not a runtime bug.
- The rewritten tool inputs. The LLM's query may have dropped a constraint from the user's question (a date range, a SKU, a severity filter).
- The output summaries. Zero hits or zero rows means the final answer is not grounded in tool results and should be treated as suspect.
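Those checks are mechanical enough to automate. An illustrative helper over the trace shape shown above (the tool names come from the example agent; this is not part of the Cortex API):

```python
def audit_trace(trace: list, expect_retrieval: bool = False) -> list:
    """Flag common failure modes in an agent tool-use trace."""
    issues = []
    tools_called = [t["tool"] for t in trace]
    # no retrieval call means any citations in the answer are ungrounded
    if expect_retrieval and "kb_search" not in tools_called:
        issues.append("no retrieval call: citations in the answer are suspect")
    for t in trace:
        # an empty tool result means that step contributed nothing
        if t.get("output_summary", "").startswith("0 "):
            issues.append(f"{t['tool']} returned nothing for input {t['input']}")
    return issues
```

Running this over every trace in a log table is a cheap way to find the questions the agent is silently failing on.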

5. Custom HTTP Tools

Beyond the built-in cortex_search and cortex_analyst_text_to_sql tool types, agents can call arbitrary HTTPS endpoints — useful for bridging to existing internal APIs (a ticketing system, a pricing service, an external knowledge base) without first ETLing those systems into Snowflake. The custom tool is declared with an OpenAPI-style spec; the runtime handles the auth and the call, and feeds the JSON response back to the LLM.


tools:
  - tool_spec:
      type: generic
      name: open_ticket
      description: Open a support ticket in the internal ticketing system.
      input_schema:
        type: object
        properties:
          customer_id: {type: string, description: "ACME customer id"}
          severity:    {type: string, enum: ["low", "med", "high"]}
          summary:     {type: string, maxLength: 200}
          body:        {type: string}
        required: [customer_id, severity, summary]
    tool_resources:
      endpoint:    "https://tickets.internal.acme.com/api/v1/tickets"
      method:      POST
      auth_secret: SUPPORT_DB.SECRETS.tickets_api_oauth
      headers:
        Content-Type: application/json

The agent will only invoke the tool when it has parameters that match the input schema, which is the right way to think about the boundary: the LLM proposes structured calls, the runtime validates and executes them, and the response is just more JSON to reason over.
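A minimal sketch of that validation boundary, hand-rolling the three checks from the open_ticket schema above (required keys, enum membership, maxLength); Snowflake's actual validator is internal to the runtime:

```python
# Rules transcribed from the open_ticket input_schema above.
OPEN_TICKET_RULES = {
    "required": ["customer_id", "severity", "summary"],
    "enum":     {"severity": ["low", "med", "high"]},
    "max_len":  {"summary": 200},
}

def validate_tool_call(args: dict, rules: dict) -> list:
    """Return a list of violations; an empty list means the proposed call may run."""
    errors = [f"missing required field: {k}" for k in rules["required"] if k not in args]
    for field, allowed in rules["enum"].items():
        if field in args and args[field] not in allowed:
            errors.append(f"{field} must be one of {allowed}")
    for field, limit in rules["max_len"].items():
        if field in args and len(args[field]) > limit:
            errors.append(f"{field} exceeds {limit} characters")
    return errors
```

A call that fails validation is rejected before any HTTP request is made, so a hallucinated severity value never reaches the ticketing system.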

6. Governance: RBAC, Row Access, Masking

The single most important property of Cortex Agents in a regulated environment is that there is no privileged service identity bypassing data controls. The agent runs under the calling user's role; the tools it invokes inherit that role.


-- Network rule + external access integration are required for custom HTTP tools
CREATE OR REPLACE NETWORK RULE tickets_api_rule
  TYPE = HOST_PORT
  MODE = EGRESS
  VALUE_LIST = ('tickets.internal.acme.com:443');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION tickets_api_int
  ALLOWED_NETWORK_RULES         = (tickets_api_rule)
  ALLOWED_AUTHENTICATION_SECRETS = (SUPPORT_DB.SECRETS.tickets_api_oauth)
  ENABLED = TRUE;

GRANT USAGE ON INTEGRATION tickets_api_int TO ROLE SUPPORT_APP_ROLE;

The pattern that falls out of this: per-tenant or per-role isolation does not require per-tenant agents. One agent plus row access policies on the source tables yields correct multi-tenant behavior, with the LLM never seeing data the user couldn't have read manually.
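Concretely, the tenant isolation can live entirely in a row access policy on the source tables. A hedged sketch; the table, column, and mapping-table names here are illustrative, not from the example agent:

```sql
-- Map each application role to the tenant rows it may see.
CREATE OR REPLACE ROW ACCESS POLICY ANALYTICS.SUPPORT.tenant_rows
  AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
    EXISTS (
      SELECT 1
      FROM SUPPORT_DB.SECURITY.ROLE_TENANT_MAP m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.tenant_id = tenant_id
    );

ALTER TABLE ANALYTICS.SUPPORT.TICKETS
  ADD ROW ACCESS POLICY ANALYTICS.SUPPORT.tenant_rows ON (tenant_id);
```

Any SQL the agent generates against TICKETS, whether through Cortex Analyst or a search service's source query, is filtered by this policy at execution time.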

7. Cortex Agents vs Bedrock Agents vs LangGraph

Cortex Agents
  Runs:        inside the Snowflake account.
  Strengths:   zero data egress; native Search/Analyst tools; row-access policies flow through; one bill.
  Trade-offs:  Snowflake-only model catalog; less expressive than code-defined graphs.

Amazon Bedrock Agents
  Runs:        in the Bedrock service, inside the AWS account.
  Strengths:   wide model catalog (Anthropic, Mistral, Cohere, Amazon); IAM-based tool auth; Knowledge Bases for RAG.
  Trade-offs:  data has to flow to Bedrock unless retrieval is also in AWS; Snowflake access requires cross-account integration.

LangGraph
  Runs:        wherever you deploy your Python.
  Strengths:   most expressive: explicit state graph, branch/merge, human-in-the-loop nodes, full control.
  Trade-offs:  you own deployment, governance, secret management, and observability; no native warehouse integration.

The deciding axis is again where the data and the controls live. If the answer must come from Snowflake-governed data and the audit story matters, Cortex Agents is the path of least resistance. If the orchestration logic is the hard part — long-running workflows, conditional branches, human approvals — LangGraph (or a similar code-defined graph runtime) is more honest and the warehouse becomes one of the tools the graph calls.
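That composition is mostly plumbing: the agent call is just another function the graph invokes. A framework-agnostic sketch of such a node, written as plain Python rather than LangGraph's actual API; ask_agent stands in for the run_agent collector from section 3:

```python
def snowflake_qa_node(state: dict, ask_agent) -> dict:
    """One node in a larger workflow: delegate the governed-data question
    to the Cortex Agent and merge its answer and citations into the state."""
    out = ask_agent(state["question"])
    return {**state, "answer": out["answer"], "citations": out["citations"]}
```

The rest of the graph never touches Snowflake credentials or SQL; it only sees the structured answer the node returns.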

8. Interview Q&A

Q: How does a Cortex Agent decide whether to call Cortex Search or Cortex Analyst for a given question?

It is a pure LLM decision driven by tool descriptions and system instructions. Each tool block has a description the LLM reads at planning time; the agent's instructions.response field can pin defaults like "use the analytics tool only for live data questions". For questions that need both — e.g. "what is the known fix for the SKU with the most open tickets this week" — the LLM calls Analyst first, reads the row, and then calls Search with the resulting SKU. The full sequence is in the tool-use trace returned with the response.

Q: An agent is fabricating citations. How would you fix it?

Tighten instructions.response to require that any factual claim be tied to a citation returned by a tool, and forbid the LLM from inventing IDs. Inspect the trace — if the agent answered without calling search, the model decided retrieval wasn't needed; either rewrite the system prompt to require search for KB-style questions, or route those questions to a sub-agent whose only tool is search. As a last resort, post-validate: drop any citation in the answer that doesn't appear in the trace's results.

Q: How do row access policies interact with a Cortex Agent serving multiple tenants?

The agent runs SQL as the calling user, so row access policies on the underlying tables apply at query time. One agent plus tenant-aware policies is the right shape — a per-tenant isolated answer falls out of the data layer without the agent layer needing to know about tenancy. The same applies to Cortex Search: the search service's source SELECT runs against the calling role, and its results are filtered before they reach the LLM.

Q: When would you reach for LangGraph instead of Cortex Agents?

When the orchestration logic is the hard part. Long-running workflows that span minutes or hours, branches that depend on human approvals, retries with backoff, parallel fan-out across tools — these are graph problems and LangGraph (or a similar code-defined runtime) is more honest. Cortex Agents is optimized for "single-turn or short multi-turn assistants grounded in Snowflake data". The two compose: LangGraph orchestrates the workflow, Cortex Agents serves as one of the nodes when the step is "ask a question against governed Snowflake data".

Q: How do you bound cost on a Cortex Agent?

Three knobs. First, model choice — pick the smallest model that meets quality (Mistral or smaller Llama variants for routine support, Claude-class for reasoning-heavy questions). Second, max-tool-call limits in the request to prevent runaway loops. Third, per-tool result caps: max_results on search, narrow Analyst semantic models so generated SQL stays small. Beyond that, instrument the tool-use trace and log token counts per turn; pathological queries usually show up as long traces with redundant tool calls.

Q: What does the streaming response shape buy you over a single JSON reply?

Two things. Latency-perceived throughput — the user sees text appearing within a second instead of waiting for the full multi-tool answer to compose. And debuggability — the tool-use events arrive in real time, so a UI can render "searching the knowledge base" and "running analytics query" status indicators while the agent works. If you are calling the agent server-to-server and don't need progressive rendering, you can collect the stream and treat it as a single response; the underlying contract is the same.

