Amazon SageMaker JumpStart

SageMaker JumpStart is the model hub built into SageMaker. It packages hundreds of pre-trained foundation and task-specific models — Llama 3.x, Mistral, Falcon, Stable Diffusion, BGE embeddings, plus traditional CV/NLP models — together with deployable inference containers and ready-to-run fine-tuning recipes. JumpStart is the fastest route from "I want to host an open-weights model on AWS" to "I have an HTTPS endpoint in my VPC." It's also where you go when Bedrock doesn't host the model you need.


1. What JumpStart Is

JumpStart wraps three things into one feature:

- A model hub: a browsable catalog of pre-trained foundation and task-specific models.
- Deployable inference containers: one call takes a catalog entry to a live SageMaker endpoint with sensible defaults.
- Fine-tuning recipes: ready-to-run training scripts with tested hyperparameter defaults for adapting catalog models to your data.

Under the hood every JumpStart model is just a regular SageMaker model: you can swap the container, override the entry-point script, attach a private VPC, encrypt with KMS, and use the resulting endpoint exactly like any other SageMaker endpoint.


2. The Model Hub

Representative families currently available (the catalog updates frequently — check the SageMaker Studio UI for the live list):

- Text generation: Llama 3.x, Mistral, Falcon
- Image generation: Stable Diffusion
- Embeddings: BGE
- Task-specific CV/NLP: classification, detection, and other traditional models

Each entry exposes a stable model_id (e.g. meta-textgeneration-llama-3-1-8b-instruct) plus a model_version. Pin both in production code; "*" for version is fine for prototyping but will eventually drift.
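Pinning is easy to enforce mechanically. A minimal sketch — the `require_pinned` helper is hypothetical, not part of the SageMaker SDK — that rejects wildcard versions before anything is deployed:

```python
# Hypothetical helper: refuse to deploy a model spec whose version
# contains a wildcard. Call it before constructing JumpStartModel.
def require_pinned(model_id: str, model_version: str) -> None:
    """Raise if model_version contains a wildcard (e.g. "*" or "2.*")."""
    if "*" in model_version:
        raise ValueError(
            f"{model_id}: version {model_version!r} is not pinned; "
            "use an exact version in production."
        )

require_pinned("meta-textgeneration-llama-3-1-8b-instruct", "2.0.4")  # OK, no exception
```

Wire this into your deploy pipeline so a wildcard that was fine in a notebook never reaches production.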


3. Deploying a Foundation Model

The high-level JumpStartModel class encapsulates container selection, environment variables, and endpoint config. A minimal deploy is just a constructor call plus deploy():


from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,  # required for gated models like Llama
    endpoint_name="llama3-1-8b-instruct",
)

resp = predictor.predict({
    "inputs": [[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",   "content": "Summarize CAP theorem in two sentences."},
    ]],
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
})
print(resp[0]["generated_text"])
  

3.1 VPC, KMS, and Network Isolation

Override the deploy call to put the endpoint in a private subnet and encrypt the model artifacts with a customer-managed KMS key:


predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
    endpoint_name="llama3-1-8b-instruct-vpc",
    vpc_config={
        "Subnets":          ["subnet-aaaa1111", "subnet-bbbb2222"],
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    kms_key="arn:aws:kms:us-west-2:111111111111:key/abcd-1234",
    enable_network_isolation=True,  # container has no internet egress at all
)
  

enable_network_isolation=True is the right default for any model that doesn't need to phone home — once the container is built, no outbound traffic leaves it.

3.2 Async and Serverless Endpoints

For long generations (Stable Diffusion video, 30k-token summaries), an Async endpoint accepts the request, returns an S3 output location immediately, and writes the response there when generation finishes. For bursty low-volume usage on smaller models, a Serverless endpoint cold-starts on demand and bills per millisecond of compute.


from sagemaker.async_inference import AsyncInferenceConfig
predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-async-out/llama-results/",
        max_concurrent_invocations_per_instance=4,
    ),
)
  


4. Instance Selection (g5 / g6 / p5)

The right instance is mostly a function of model size, target throughput, and budget. JumpStart populates a default; override it deliberately. As a rough map: a single-GPU g5 (A10G, 24 GB) or g6 (L4, 24 GB) serves 7-8B models in FP16; multi-GPU g5/g6 sizes, or a quantized build, cover the 13-70B range; p5 (H100, 80 GB per GPU) is for 70B+ at full precision or latency-sensitive high-throughput serving.

Rule of thumb for inference sizing: a model in FP16 needs ~2 bytes per parameter (so 7B ~ 14 GB, 70B ~ 140 GB), plus 20–40% headroom for the KV cache at your target context length. Quantization (INT8, AWQ, GPTQ) cuts that by 2–4x at a measurable but usually acceptable quality cost.
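The rule of thumb above is easy to script. A rough sizing sketch — the 30% default headroom is an assumption to tune for your context length:

```python
def fp16_memory_gb(params_billion: float, kv_headroom: float = 0.3) -> float:
    """Rough GPU memory needed to serve a model in FP16.

    2 bytes per parameter, plus fractional headroom for the KV cache
    and activations. Quantization (INT8, AWQ, GPTQ) divides the
    weight term by roughly 2-4x.
    """
    weights_gb = params_billion * 2  # 1e9 params * 2 bytes = 2 GB per billion
    return weights_gb * (1 + kv_headroom)

for size in (7, 8, 70):
    print(f"{size}B -> ~{fp16_memory_gb(size):.0f} GB")
```

Compare the result against the aggregate GPU memory of the candidate instance, not per-GPU memory, if the serving stack shards the model.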


5. Fine-Tuning with QLoRA

JumpStart ships fine-tuning recipes for most LLMs in the catalog. QLoRA — 4-bit quantized base weights plus trainable LoRA adapters — is the default and is what you want for cost-effective domain adaptation on a single g5/g6 instance.


from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    environment={"accept_eula": "true"},
    hyperparameters={
        "epoch":                  "3",
        "learning_rate":          "1e-4",
        "per_device_train_batch_size": "2",
        "gradient_accumulation_steps": "4",
        "max_input_length":       "2048",

        # PEFT / QLoRA knobs
        "instruction_tuned":      "True",
        "int8_quantization":      "False",
        "enable_fsdp":            "False",
        "peft_type":              "lora",
        "lora_r":                 "16",
        "lora_alpha":             "32",
        "lora_dropout":           "0.05",
        "load_in_4bit":           "True",
    },
)

estimator.fit({"training": "s3://my-bucket/jumpstart-train/instruction-data/"})

# Deploy the fine-tuned model
predictor = estimator.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
)
  

5.1 Training Data Format

Most JumpStart instruction-tuning recipes expect a train.jsonl in S3, one example per line, with an instruction-style schema:


{"instruction": "Summarize the following ticket.", "context": "Customer reports that...", "response": "Customer reports a billing error..."}
{"instruction": "Classify the sentiment.",         "context": "I love the new release.",  "response": "positive"}
  

The recipe wraps each row in the model's expected chat template (Llama 3 instruct, Mistral instruct, etc.) before tokenization. A template.json sidecar in the same S3 prefix can override the wrapping.
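To make that wrapping concrete, here is an illustrative sketch of a Llama-3-instruct-style template applied to one train.jsonl row. The recipe (or a template.json override) does this internally; the exact template is model- and version-specific, so treat this as a shape, not the recipe's actual code:

```python
import json

# Illustrative: wrap one instruction/context/response row in a
# Llama-3-instruct-style chat template before tokenization.
def wrap_llama3(row: dict) -> str:
    user = row["instruction"]
    if row.get("context"):
        user += "\n\n" + row["context"]
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{row['response']}<|eot_id|>"
    )

row = json.loads('{"instruction": "Classify the sentiment.", '
                 '"context": "I love the new release.", '
                 '"response": "positive"}')
print(wrap_llama3(row))
```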

5.2 Distributed Fine-Tuning

For 70B fine-tuning on a single node, set enable_fsdp=True and pick a multi-GPU instance (g5.48xlarge, p4d.24xlarge, p5.48xlarge). For multi-node, set instance_count>1 — the recipe uses NCCL over EFA on supported instances.
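Sketched as overrides on the section 5 estimator. The 70B model_id below is an assumption patterned on the 8B id; check the catalog for the exact string, and note that supported hyperparameter combinations vary by model_version:

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Single-node, multi-GPU 70B fine-tune with FSDP sharding.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-70b-instruct",  # assumed id
    model_version="2.*",
    instance_type="ml.p4d.24xlarge",  # 8x A100
    instance_count=1,                 # >1 for multi-node (NCCL over EFA)
    environment={"accept_eula": "true"},
    hyperparameters={
        "enable_fsdp": "True",  # shard params/grads/optimizer state
        "peft_type":   "lora",
        "lora_r":      "16",
    },
)
```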


6. JumpStart Endpoint vs Bedrock Provisioned Throughput

Dimension       | JumpStart Endpoint                                           | Bedrock Provisioned Throughput
----------------|--------------------------------------------------------------|-------------------------------
Pricing unit    | Per instance-hour, regardless of utilization                 | Per Model Unit (MU) per hour, with 1- or 6-month commit options
Model selection | Anything in the JumpStart catalog or your own container      | Only models offered on Bedrock (Claude, Llama, Mistral, Cohere, Titan, Nova)
Custom weights  | Yes — fine-tuned, distilled, or fully custom                 | Only Bedrock-fine-tuned variants; you can't bring arbitrary weights
Networking      | VPC-attached, network-isolated, KMS — full control           | VPC endpoint (PrivateLink) into Bedrock; model itself runs in AWS-managed VPC
Scaling         | SageMaker autoscaling on instance count                      | Scale by purchasing additional Model Units
Ops surface     | SageMaker endpoint config, IAM, CW metrics, traffic shifting | Bedrock APIs only — no instance to size or patch
Cold start      | None (instances always on)                                   | None (provisioned units always warm)

Both options give you predictable throughput; the choice is mostly about which model and how much control. JumpStart hands you the keys to the instance; Bedrock PT keeps the kitchen closed.


7. When to Choose JumpStart over Bedrock

Choose JumpStart when the model you need isn't on Bedrock, when you must deploy your own fine-tuned or custom weights, when you need full control of the serving stack (VPC, network isolation, KMS, instance type, container), or when per-instance-hour pricing beats per-token pricing at your sustained volume.

The flip side: choose Bedrock when you want a serverless, pay-per-token API with no instances to manage, when you need Knowledge Bases / Agents / Guardrails out of the box, or when the model you want is on Bedrock and you don't need to fine-tune beyond what Bedrock supports. A common pattern is Bedrock for the chat surface (Claude on-demand) and JumpStart endpoints for the supporting models (a BGE embedder, a reranker, a vision model) that Bedrock doesn't host.


8. Evaluating a JumpStart Model Before Production

Before pinning a JumpStart model into a production endpoint, run it through SageMaker Clarify FM evaluation or your own held-out test set. The goal is to baseline accuracy, latency, and cost on your actual workload — not on the vendor's marketing benchmarks.


import os, time

# Simple latency + token-throughput probe over a directory of prompt files
prompts = [open(f"prompts/{p}").read() for p in os.listdir("prompts")]

start = time.time()
total_out_tokens = 0
for p in prompts:
    resp = predictor.predict({
        "inputs": [[{"role": "user", "content": p}]],
        "parameters": {"max_new_tokens": 256, "temperature": 0.0},
    })
    total_out_tokens += resp[0].get("usage", {}).get("completion_tokens", 0)

elapsed = time.time() - start
print(f"throughput: {total_out_tokens/elapsed:.1f} tok/s, "
      f"per-request avg: {elapsed/len(prompts)*1000:.0f} ms")
  

Pair this with a quality eval — golden-answer comparison, LLM-as-judge against Claude on Bedrock, or domain-specific metrics — and you have the three numbers that matter: quality, latency, and dollars per million tokens. Re-run on every new JumpStart version before promoting.
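The dollars-per-million-tokens figure falls out of the same probe. A small helper, assuming an always-on endpoint; the price and throughput below are placeholders, not quotes:

```python
def dollars_per_million_tokens(instance_price_per_hour: float,
                               tokens_per_second: float) -> float:
    """Cost to generate 1M output tokens on an always-on endpoint."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_price_per_hour / tokens_per_hour * 1_000_000

# Example: a hypothetical $1.50/hr instance sustaining 40 tok/s
print(f"${dollars_per_million_tokens(1.50, 40):.2f} per 1M tokens")
# -> $10.42 per 1M tokens
```

Feed in the measured tok/s from the probe above and your actual instance price to compare candidates on equal footing.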


9. Operational Tips

- Pin model_id and model_version; re-run your evaluation suite before promoting a new version.
- Default to enable_network_isolation=True plus a customer-managed KMS key for anything sensitive.
- Endpoints bill per instance-hour whether or not they serve traffic; delete idle endpoints, or use async/serverless for spiky workloads.
- Configure autoscaling on instance count before launch, not after the first traffic spike.

