Snowpark Container Services (SPCS)

Snowpark Container Services is a managed container runtime that runs OCI images inside the Snowflake account. It is essentially "Kubernetes-with-the-rough-edges-removed" co-located with your warehouses, with native, no-egress access to your tables and stages and with the same RBAC, network policies, and audit boundary as the rest of Snowflake. You bring an image; SPCS schedules it onto a compute pool, gives it an internal endpoint (and optionally a public endpoint with Snowflake-issued auth), and hooks it back into the database via service functions or job services.

The reason it exists is straightforward: Cortex hosts a curated catalog of foundation models, but production AI work eventually needs something the catalog can't provide — a fine-tuned open-weights model behind vLLM, a Streamlit dashboard over model outputs, a FastAPI front end for a custom Triton-served model, an embedding service the Cortex catalog doesn't ship. SPCS is the answer to "host that next to my data" instead of standing up an EKS cluster, exporting tables to S3, and round-tripping every inference over the public internet.


1. What SPCS Is

SPCS introduces three primitives on top of normal Snowflake objects: compute pools (the underlying instances, including GPU instances), the image registry (an OCI registry living inside a Snowflake schema), and services (the running workloads, declared with a YAML spec). All three are first-class objects governed by grants and visible to the audit log.
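Because all three are ordinary Snowflake objects, the usual SHOW commands confirm what exists and who can see it:

```sql
SHOW COMPUTE POOLS;        -- account-level
SHOW IMAGE REPOSITORIES;   -- schema-level
SHOW SERVICES;             -- schema-level
```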

2. Compute Pools and Instance Families

A compute pool is a fleet of instances of a single instance family. It has a min and max node count, an auto-suspend timer, and a chosen family. Services request a pool by name when they start. Instance family naming follows a CPU-and-memory or GPU class scheme; common families as of 2026 include the CPU tiers (XS through 6XL standard memory, plus high-memory variants) and GPU tiers including NVIDIA A10, L40S/L40, and H100 in regions where Snowflake has provisioned them.


-- A small CPU pool for a Streamlit dashboard
CREATE COMPUTE POOL IF NOT EXISTS streamlit_pool
  MIN_NODES        = 1
  MAX_NODES        = 2
  INSTANCE_FAMILY  = CPU_X64_S
  AUTO_RESUME      = TRUE
  AUTO_SUSPEND_SECS = 600;

-- A GPU pool for a vLLM endpoint serving an open-weights LLM
CREATE COMPUTE POOL IF NOT EXISTS llm_pool
  MIN_NODES        = 1
  MAX_NODES        = 4
  INSTANCE_FAMILY  = GPU_NV_S          -- single A10G-class GPU; family names and region availability vary, so verify
  AUTO_RESUME      = TRUE
  AUTO_SUSPEND_SECS = 1800;

GRANT USAGE ON COMPUTE POOL llm_pool TO ROLE ML_PLATFORM_ROLE;
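Pools are inspected and driven with ordinary DDL. A short sketch of the lifecycle commands, using the pool names from the examples above:

```sql
-- Current state, node counts, owner
DESCRIBE COMPUTE POOL llm_pool;
SHOW COMPUTE POOLS LIKE 'llm_pool';

-- Manual lifecycle control; AUTO_RESUME / AUTO_SUSPEND_SECS usually cover this
ALTER COMPUTE POOL llm_pool SUSPEND;
ALTER COMPUTE POOL llm_pool RESUME;
```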

A few rules learned the hard way:

- A pool runs exactly one instance family, so CPU apps and GPU inference need separate pools.
- AUTO_SUSPEND_SECS is the main idle-cost lever, but MIN_NODES = 1 keeps a node warm (and billed) around the clock; pick the trade-off per workload.
- MAX_NODES caps horizontal scale for every service sharing the pool, not per service.
- Grant USAGE on the pool to each role that will create services on it, or their CREATE SERVICE fails.

3. Image Registry Inside the Account

SPCS ships an OCI registry that lives in a Snowflake schema. Pushing images is a normal docker login + docker push against a registry URL derived from the account locator and the registry's full path.


# 1. Create an image repository inside a schema (one-time)
snowsql -q "
  CREATE IMAGE REPOSITORY ML.IMAGES.MODELS;
  GRANT READ, WRITE ON IMAGE REPOSITORY ML.IMAGES.MODELS TO ROLE ML_PLATFORM_ROLE;
"

# 2. Get the repository URL (SHOW output parsing varies by client; adjust as needed)
REGISTRY_URL=$(snowsql -q "SHOW IMAGE REPOSITORIES IN SCHEMA ML.IMAGES;" \
  | awk '/MODELS/ {print $NF}')
# example: abc12345.registry.snowflakecomputing.com/ml/images/models

# 3. Build, tag, push
docker build -t vllm-llama3-8b:0.1 -f Dockerfile .
docker tag  vllm-llama3-8b:0.1 ${REGISTRY_URL}/vllm-llama3-8b:0.1

docker login ${REGISTRY_URL%%/*} -u kevin@example.com    # password = Snowflake PAT
docker push  ${REGISTRY_URL}/vllm-llama3-8b:0.1

The registry is governed like any other Snowflake object: grants control who can push and pull, and the audit log records every operation. Images are stored in account-managed storage with no separate retention policy to set.
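To confirm a push landed, list the repository's contents from SQL:

```sql
SHOW IMAGES IN IMAGE REPOSITORY ML.IMAGES.MODELS;
```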

4. Service Specification YAML

A service spec is a YAML document declaring containers, volumes, endpoints, and ingress. It is uploaded to a stage (or passed inline) and referenced by CREATE SERVICE. The spec resembles a stripped-down Pod spec with Snowflake-specific extensions for the registry path, stage-mounted volumes, and Snowflake-issued OAuth on public endpoints.


# vllm-service.yaml
spec:
  containers:
    - name:  vllm
      image: /ml/images/models/vllm-llama3-8b:0.1
      env:
        MODEL_NAME:    "meta-llama/Meta-Llama-3-8B-Instruct"
        MAX_MODEL_LEN: "8192"
        TENSOR_PARALLEL_SIZE: "1"
      args:
        - "--model"
        - "$(MODEL_NAME)"
        - "--port"
        - "8000"
        - "--max-model-len"
        - "$(MAX_MODEL_LEN)"
        - "--tensor-parallel-size"
        - "$(TENSOR_PARALLEL_SIZE)"
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: "24Gi"
        limits:
          nvidia.com/gpu: 1
          memory: "32Gi"
      secrets:
        - snowflakeSecret: ML.SECRETS.huggingface_token
          secretKeyRef: secret_string
          envVarName: HF_TOKEN          # read by huggingface_hub for gated models
      volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface

  endpoints:
    - name:    api
      port:    8000
      public:  true                # exposes via account-locator URL with Snowflake OAuth
      protocol: HTTP

  volumes:
    - name:   model-cache
      source: "@ML.MODELS.HF_CACHE"   # internal Snowflake stage

  # Internet egress (vLLM pulling weights from HF on first start) is granted by
  # the EXTERNAL_ACCESS_INTEGRATIONS clause on CREATE SERVICE, not in the spec.

DDL to create the service:


PUT file:///path/to/vllm-service.yaml @ML.MODELS.SERVICE_SPECS AUTO_COMPRESS=FALSE OVERWRITE=TRUE;

CREATE SERVICE ML.MODELS.vllm_llama3_8b
  IN COMPUTE POOL llm_pool
  FROM @ML.MODELS.SERVICE_SPECS
  SPECIFICATION_FILE = 'vllm-service.yaml'
  MIN_INSTANCES      = 1
  MAX_INSTANCES      = 2
  EXTERNAL_ACCESS_INTEGRATIONS = (huggingface_egress_int);

GRANT USAGE ON SERVICE ML.MODELS.vllm_llama3_8b TO ROLE ML_PLATFORM_ROLE;

5. Pattern: Hosting a vLLM Endpoint

The most common SPCS pattern in 2026 is hosting an open-weights LLM behind vLLM, close to the data, instead of paying a per-token rate to a hosted provider. The image is small — a vLLM base image plus an entrypoint — and the service spec above is essentially complete.

A minimal Dockerfile:


# Dockerfile
FROM vllm/vllm-openai:latest

# Pre-warm any tokenizer/config files at build time if you want faster cold-start;
# weights themselves are too large and should ride the model-cache volume.
RUN pip install --no-cache-dir huggingface_hub

ENV HF_HOME=/root/.cache/huggingface

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]

Once the service is healthy, you can call it from a Snowpark Python session over the OpenAI-compatible REST surface vLLM exposes; the public ingress URL Snowflake assigned to the endpoint comes from SHOW ENDPOINTS.


import json, requests
from snowflake.snowpark import Session

session = Session.builder.configs({...}).create()

# Look up the public URL Snowflake assigned to the `api` endpoint
url_row = session.sql("""
    SHOW ENDPOINTS IN SERVICE ML.MODELS.vllm_llama3_8b;
""").collect()[0]
endpoint_url = url_row["ingress_url"]   # e.g. abc-xyz.snowflakecomputing.app

# Snowflake-issued OAuth token for service-to-service auth
token = session.connection.rest.token

resp = requests.post(
    f"https://{endpoint_url}/v1/chat/completions",
    headers={
        "Authorization": f"Snowflake Token=\"{token}\"",
        "Content-Type":  "application/json",
    },
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Summarize the latest support ticket."},
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])

6. Service Functions and Job Services

Two integration patterns close the loop with SQL.

Service Functions

A service function is a SQL function bound to a long-running service, letting analysts call your model from a normal SELECT. The function is declared with the service name, the endpoint, and the path to invoke; Snowflake handles the auth and the round-trip. One caveat: Snowflake POSTs rows in the external-function batch format (a JSON body whose data array holds [row_index, arg, ...] rows), so the path must be a route in your container that speaks that format, not vLLM's OpenAI route directly.


CREATE OR REPLACE FUNCTION ML.MODELS.summarize(text VARCHAR)
  RETURNS VARCHAR
  SERVICE  = ML.MODELS.vllm_llama3_8b
  ENDPOINT = 'api'
  AS '/summarize';   -- a wrapper route speaking the batch row format, not /v1/chat/completions

-- Use it in a query
SELECT ticket_id,
       ML.MODELS.summarize(ticket_body) AS summary
FROM   SUPPORT.PUBLIC.OPEN_TICKETS
WHERE  created_at >= DATEADD('day', -1, CURRENT_TIMESTAMP());
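The wrapper route the function targets has to translate between Snowflake's batch row format and the model call. A minimal sketch of that translation as a pure function; `summarize_fn` and the route name are stand-ins for the actual vLLM call:

```python
# Sketch of the request/response translation a service-function route performs.
# Snowflake POSTs {"data": [[row_index, arg1, ...], ...]} and expects
# {"data": [[row_index, result], ...]} back, preserving row indexes.
# `summarize_fn` stands in for the real call into vLLM.

def handle_service_function_batch(payload, summarize_fn):
    out = []
    for row in payload["data"]:
        row_index, text = row[0], row[1]
        out.append([row_index, summarize_fn(text)])
    return {"data": out}
```

In the container this sits behind the endpoint's HTTP route (for example a FastAPI handler); the payload shape is the same one external functions use.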

Job Services

A job service is a finite-lifetime container — the SPCS analog of a Kubernetes Job. Use it for batch inference, fine-tuning runs, or scheduled embeddings refresh. It is created with EXECUTE JOB SERVICE, runs to completion, and writes results to a stage or table.


EXECUTE JOB SERVICE
  IN COMPUTE POOL llm_pool
  NAME = ML.MODELS.nightly_embeddings_job
  FROM @ML.MODELS.SERVICE_SPECS
  SPECIFICATION_FILE = 'embeddings-job.yaml';

-- Inspect status / logs
SELECT SYSTEM$GET_SERVICE_STATUS('ML.MODELS.nightly_embeddings_job');
SELECT SYSTEM$GET_SERVICE_LOGS  ('ML.MODELS.nightly_embeddings_job', 0, 'embedder');

7. Other Patterns: Streamlit, Gradio, FastAPI

SPCS is not LLM-specific. Three other recurring shapes:

- Streamlit dashboards on a small CPU pool (the streamlit_pool above), reading tables directly with no data leaving the account.
- Gradio demo UIs for models, behind the same public-endpoint-with-Snowflake-auth pattern.
- FastAPI front ends for custom model servers such as Triton, wired back into SQL as service functions.

8. SPCS vs Databricks Model Serving vs SageMaker vs Bedrock

Snowpark Container Services
  Best for:   data already in Snowflake; zero egress and governance reuse; running arbitrary OCI images close to the warehouse.
  Trade-offs: GPU family and region availability is narrower than the hyperscalers; no per-request auto-scaling the way a serverless serving endpoint offers.

Databricks Model Serving
  Best for:   models in the Unity Catalog Model Registry; auto-scale-to-zero serverless serving with UC governance.
  Trade-offs: more opinionated than SPCS: you ship a model artifact, not an arbitrary container (custom containers are supported but secondary).

SageMaker Endpoints
  Best for:   AWS-native ML platforms; multi-model endpoints; broad instance catalog; deep AWS integration.
  Trade-offs: data has to flow to AWS; if it's already in Snowflake or Databricks you pay for the round-trip and lose warehouse-side governance.

Bedrock Provisioned Throughput
  Best for:   hosted Anthropic/Mistral/Cohere/Amazon models with reserved-capacity SLA; AWS-managed; no infra at all.
  Trade-offs: limited to the Bedrock model catalog; no custom or fine-tuned open-weights models outside the supported customization paths.

SPCS wins specifically when (a) the data already lives in Snowflake, (b) the model is off the Cortex catalog so you actually need a container, and (c) the governance story matters enough that round-tripping the data to a hyperscaler endpoint is not acceptable. For Snowflake shops, that combination covers most "host an open-weights model next to my data" use cases.

9. Interview Q&A

Q: Why would you host a model in SPCS instead of just calling Cortex COMPLETE?

Three reasons. First, the model isn't in the Cortex catalog — a fine-tuned open-weights variant, a domain-specific embedder, a custom classifier. Second, cost — at high request volume, a dedicated GPU pool can be cheaper than per-token Cortex pricing for the same workload (do the math). Third, control — specific quantization, vLLM speculative decoding settings, custom logits processors, or anything that requires an inference server you actually control. If none of those apply, Cortex COMPLETE is simpler and you should use it.
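The "do the math" step is a one-liner. The sketch below uses made-up placeholder prices; substitute your actual pool rate and per-token rate before drawing conclusions:

```python
# Break-even throughput between a per-token service and a dedicated GPU pool.
# ALL prices here are illustrative placeholders, not real Snowflake pricing.

def breakeven_tokens_per_hour(pool_dollars_per_hour, dollars_per_million_tokens):
    """Sustained tokens/hour above which the dedicated pool is cheaper."""
    return pool_dollars_per_hour / dollars_per_million_tokens * 1_000_000

# e.g. a $4/hr single-GPU pool vs $0.50 per million tokens:
# above 8M sustained tokens/hour, the pool wins under these assumptions.
```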

Q: What does cold-start look like for a vLLM service in SPCS?

Three sequential phases. Image pull from the in-account registry (seconds to a minute, depending on image size). Model-weights download to the cache volume on first start (minutes for a 7-8B model, much longer for 70B+; this is why you mount a stage as a model cache). Engine warmup inside vLLM (CUDA graph capture, KV-cache allocation; typically tens of seconds). The standard mitigations are: keep MIN_NODES >= 1 so a node stays warm, pre-populate the cache volume with a one-time job, and pick a small enough model that the warmup is short.
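The weights-download phase dominates, and it is worth sizing up front. A rough estimate, assuming fp16 weights (2 bytes per parameter) and an assumed sustained transfer rate; both numbers are placeholders to vary:

```python
# Rough cold-start download estimate: bytes = params * bytes_per_param,
# time = bytes / bandwidth. The bandwidth is an assumption; measure your own.

def download_minutes(n_params, bytes_per_param=2, mb_per_sec=200):
    total_mb = n_params * bytes_per_param / 1e6
    return total_mb / mb_per_sec / 60

# An 8B model is ~16 GB in fp16: about 1.3 minutes at 200 MB/s.
# A 70B model is ~140 GB: closer to 12 minutes at the same rate.
```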

Q: How do you authenticate to a public SPCS endpoint?

With a Snowflake-issued OAuth token. The same identity that authenticates to SQL — a user, a service principal with key-pair auth — gets a token that the SPCS ingress validates. The header looks like Authorization: Snowflake Token="...". There is no separate IdP for the model endpoint, which is the main operational benefit of running it inside Snowflake.
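The header construction is small enough to pin down exactly; a sketch matching the pattern in the Python client above, with the token coming from an authenticated Snowflake session:

```python
# Builds the Authorization header the SPCS ingress expects. The token itself
# comes from an authenticated Snowflake session (see the client example above).

def spcs_auth_headers(token: str) -> dict:
    return {
        "Authorization": f'Snowflake Token="{token}"',
        "Content-Type": "application/json",
    }
```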

Q: How do you bound GPU cost on an SPCS pool?

Auto-suspend with a sensible idle window — typically 600 to 1800 seconds, picked from how spiky the traffic is. Min and max node count to cap horizontal scaling. Match GPU class to the model: an L40-class instance is overkill for a 3B model and underprovisioned for a 70B. And monitor — SHOW SERVICES and SHOW COMPUTE POOLS plus the account-usage views give you per-pool credit consumption. Put quota alerts on the pool (budgets are the mechanism that covers compute pools; warehouse resource monitors don't attach to them) and fire the alerts before you hit the cap.

Q: When would you use a job service instead of a long-running service?

Whenever the workload has a defined endpoint and doesn't need to be online for ad hoc requests. Batch embedding refresh of a corpus, nightly fine-tuning runs, scheduled report generation, ETL-style transformations that need GPU. Job services run to completion and release the pool; long-running services are billed for as long as instances are warm even with no traffic. The dividing line is "is this triggered by a request or by a schedule".

Q: How does data egress work — or not — for a service running in SPCS?

By default it doesn't. A service has access to internal Snowflake networking — tables, stages, internal registry, the Cortex API — but no public internet egress. To allow egress (for example to pull model weights from Hugging Face on first start), you create an external access integration listing the destinations and reference it on CREATE SERVICE. The same network-rule and integration model that gates external functions and stored procedures applies, which means the same audit story: every allowed destination is declared in DDL.
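The huggingface_egress_int referenced on CREATE SERVICE earlier is built from a network rule plus an integration. A sketch; the hostnames are the usual Hugging Face download hosts, so verify them against what your model actually pulls:

```sql
-- Egress destinations, declared in DDL (and therefore audited)
CREATE OR REPLACE NETWORK RULE ML.MODELS.hf_egress_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('huggingface.co', 'cdn-lfs.huggingface.co');

-- The integration CREATE SERVICE references via EXTERNAL_ACCESS_INTEGRATIONS
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION huggingface_egress_int
  ALLOWED_NETWORK_RULES = (ML.MODELS.hf_egress_rule)
  ENABLED = TRUE;
```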

