Amazon SageMaker JumpStart

SageMaker JumpStart is the model hub built into SageMaker. It packages hundreds of pre-trained foundation and task-specific models — Llama 3.x, Mistral, Falcon, Stable Diffusion, BGE embeddings, plus traditional CV/NLP models — together with deployable inference containers and ready-to-run fine-tuning recipes. JumpStart is the fastest route from "I want to host an open-weights model on AWS" to "I have an HTTPS endpoint in my VPC." It's also where you go when Bedrock doesn't host the model you need.


1. What JumpStart Is

JumpStart wraps three things into one feature:

- A model hub: a browsable catalog of pre-trained foundation and task-specific models.
- Deployable inference containers: one call takes a catalog entry to a live SageMaker endpoint with sensible defaults.
- Fine-tuning recipes: ready-to-run training scripts with tested hyperparameter defaults for adapting catalog models to your data.

Under the hood every JumpStart model is just a regular SageMaker model: you can swap the container, override the entry-point script, attach a private VPC, encrypt with KMS, and use the resulting endpoint exactly like any other SageMaker endpoint.


2. The Model Hub

Representative families currently available (the catalog updates frequently — check the SageMaker Studio UI for the live list):

- Text generation: Llama 3.x, Mistral, Falcon
- Image generation: Stable Diffusion
- Embeddings: BGE
- Task-specific CV/NLP: classification, detection, and other traditional models

Each entry exposes a stable model_id (e.g. meta-textgeneration-llama-3-1-8b-instruct) plus a model_version. Pin both in production code; "*" for version is fine for prototyping but will eventually drift.
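Pinning is easy to enforce mechanically. A minimal sketch — the `require_pinned` helper is hypothetical, not part of the SageMaker SDK — that rejects wildcard versions before anything is deployed:

```python
# Hypothetical helper: refuse to deploy a model spec whose version
# contains a wildcard. Call it before constructing JumpStartModel.
def require_pinned(model_id: str, model_version: str) -> None:
    """Raise if model_version contains a wildcard (e.g. "*" or "2.*")."""
    if "*" in model_version:
        raise ValueError(
            f"{model_id}: version {model_version!r} is not pinned; "
            "use an exact version in production."
        )

require_pinned("meta-textgeneration-llama-3-1-8b-instruct", "2.0.4")  # OK, no exception
```

Wire this into your deploy pipeline so a wildcard that was fine in a notebook never reaches production.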


3. Deploying a Foundation Model

The high-level JumpStartModel class encapsulates container selection, environment variables, and endpoint config. A minimal deploy is just a constructor call plus deploy():


from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,  # required for gated models like Llama
    endpoint_name="llama3-1-8b-instruct",
)

resp = predictor.predict({
    "inputs": [[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",   "content": "Summarize CAP theorem in two sentences."},
    ]],
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
})
print(resp[0]["generated_text"])
  

3.1 VPC, KMS, and Network Isolation

Override the deploy call to put the endpoint in a private subnet and encrypt the model artifacts with a customer-managed KMS key:


predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
    endpoint_name="llama3-1-8b-instruct-vpc",
    vpc_config={
        "Subnets":          ["subnet-aaaa1111", "subnet-bbbb2222"],
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    kms_key="arn:aws:kms:us-west-2:111111111111:key/abcd-1234",
    enable_network_isolation=True,  # container has no internet egress at all
)
  

enable_network_isolation=True is the right default for any model that doesn't need to phone home — once the container is built, no outbound traffic leaves it.

3.2 Async and Serverless Endpoints

For long generations (Stable Diffusion video, 30k-token summaries), an Async endpoint accepts the request, returns an S3 output location immediately, and writes the response there when generation finishes. For bursty low-volume usage on smaller models, a Serverless endpoint cold-starts on demand and bills per millisecond of compute.


from sagemaker.async_inference import AsyncInferenceConfig
predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-async-out/llama-results/",
        max_concurrent_invocations_per_instance=4,
    ),
)
  


4. Instance Selection (g5 / g6 / p5)

The right instance is mostly a function of model size, target throughput, and budget. JumpStart populates a default; override it deliberately. As a rough map: a single-GPU g5 (A10G, 24 GB) or g6 (L4, 24 GB) serves 7-8B models in FP16; multi-GPU g5/g6 sizes, or a quantized build, cover the 13-70B range; p5 (H100, 80 GB per GPU) is for 70B+ at full precision or latency-sensitive high-throughput serving.

Rule of thumb for inference sizing: a model in FP16 needs ~2 bytes per parameter (so 7B ~ 14 GB, 70B ~ 140 GB), plus 20–40% headroom for the KV cache at your target context length. Quantization (INT8, AWQ, GPTQ) cuts that by 2–4x at a measurable but usually acceptable quality cost.
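The rule of thumb above is easy to script. A rough sizing sketch — the 30% default headroom is an assumption to tune for your context length:

```python
def fp16_memory_gb(params_billion: float, kv_headroom: float = 0.3) -> float:
    """Rough GPU memory needed to serve a model in FP16.

    2 bytes per parameter, plus fractional headroom for the KV cache
    and activations. Quantization (INT8, AWQ, GPTQ) divides the
    weight term by roughly 2-4x.
    """
    weights_gb = params_billion * 2  # 1e9 params * 2 bytes = 2 GB per billion
    return weights_gb * (1 + kv_headroom)

for size in (7, 8, 70):
    print(f"{size}B -> ~{fp16_memory_gb(size):.0f} GB")
```

Compare the result against the aggregate GPU memory of the candidate instance, not per-GPU memory, if the serving stack shards the model.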


5. Fine-Tuning with QLoRA

JumpStart ships fine-tuning recipes for most LLMs in the catalog. QLoRA — 4-bit quantized base weights plus trainable LoRA adapters — is the default and is what you want for cost-effective domain adaptation on a single g5/g6 instance.


from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    environment={"accept_eula": "true"},
    hyperparameters={
        "epoch":                  "3",
        "learning_rate":          "1e-4",
        "per_device_train_batch_size": "2",
        "gradient_accumulation_steps": "4",
        "max_input_length":       "2048",

        # PEFT / QLoRA knobs
        "instruction_tuned":      "True",
        "int8_quantization":      "False",
        "enable_fsdp":            "False",
        "peft_type":              "lora",
        "lora_r":                 "16",
        "lora_alpha":             "32",
        "lora_dropout":           "0.05",
        "load_in_4bit":           "True",
    },
)

estimator.fit({"training": "s3://my-bucket/jumpstart-train/instruction-data/"})

# Deploy the fine-tuned model
predictor = estimator.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
)
  

5.1 Training Data Format

Most JumpStart instruction-tuning recipes expect a train.jsonl in S3, one example per line, with an instruction-style schema:


{"instruction": "Summarize the following ticket.", "context": "Customer reports that...", "response": "Customer reports a billing error..."}
{"instruction": "Classify the sentiment.",         "context": "I love the new release.",  "response": "positive"}
  

The recipe wraps each row in the model's expected chat template (Llama 3 instruct, Mistral instruct, etc.) before tokenization. A template.json sidecar in the same S3 prefix can override the wrapping.
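To make that wrapping concrete, here is an illustrative sketch of a Llama-3-instruct-style template applied to one train.jsonl row. The recipe (or a template.json override) does this internally; the exact template is model- and version-specific, so treat this as a shape, not the recipe's actual code:

```python
import json

# Illustrative: wrap one instruction/context/response row in a
# Llama-3-instruct-style chat template before tokenization.
def wrap_llama3(row: dict) -> str:
    user = row["instruction"]
    if row.get("context"):
        user += "\n\n" + row["context"]
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{row['response']}<|eot_id|>"
    )

row = json.loads('{"instruction": "Classify the sentiment.", '
                 '"context": "I love the new release.", '
                 '"response": "positive"}')
print(wrap_llama3(row))
```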

5.2 Distributed Fine-Tuning

For 70B fine-tuning on a single node, set enable_fsdp=True and pick a multi-GPU instance (g5.48xlarge, p4d.24xlarge, p5.48xlarge). For multi-node, set instance_count>1 — the recipe uses NCCL over EFA on supported instances.
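Sketched as overrides on the section 5 estimator. The 70B model_id below is an assumption patterned on the 8B id; check the catalog for the exact string, and note that supported hyperparameter combinations vary by model_version:

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Single-node, multi-GPU 70B fine-tune with FSDP sharding.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-70b-instruct",  # assumed id
    model_version="2.*",
    instance_type="ml.p4d.24xlarge",  # 8x A100
    instance_count=1,                 # >1 for multi-node (NCCL over EFA)
    environment={"accept_eula": "true"},
    hyperparameters={
        "enable_fsdp": "True",  # shard params/grads/optimizer state
        "peft_type":   "lora",
        "lora_r":      "16",
    },
)
```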


6. JumpStart Endpoint vs Bedrock Provisioned Throughput

Dimension       | JumpStart Endpoint                                           | Bedrock Provisioned Throughput
----------------|--------------------------------------------------------------|-------------------------------
Pricing unit    | Per instance-hour, regardless of utilization                 | Per Model Unit (MU) per hour, with 1- or 6-month commit options
Model selection | Anything in the JumpStart catalog or your own container      | Only models offered on Bedrock (Claude, Llama, Mistral, Cohere, Titan, Nova)
Custom weights  | Yes — fine-tuned, distilled, or fully custom                 | Only Bedrock-fine-tuned variants; you can't bring arbitrary weights
Networking      | VPC-attached, network-isolated, KMS — full control           | VPC endpoint (PrivateLink) into Bedrock; model itself runs in AWS-managed VPC
Scaling         | SageMaker autoscaling on instance count                      | Scale by purchasing additional Model Units
Ops surface     | SageMaker endpoint config, IAM, CW metrics, traffic shifting | Bedrock APIs only — no instance to size or patch
Cold start      | None (instances always on)                                   | None (provisioned units always warm)

Both options give you predictable throughput; the choice is mostly about which model and how much control. JumpStart hands you the keys to the instance; Bedrock PT keeps the kitchen closed.


7. When to Choose JumpStart over Bedrock

Choose JumpStart when the model you need isn't on Bedrock, when you must deploy your own fine-tuned or custom weights, when you need full control of the serving stack (VPC, network isolation, KMS, instance type, container), or when per-instance-hour pricing beats per-token pricing at your sustained volume.

The flip side: choose Bedrock when you want a serverless, pay-per-token API with no instances to manage, when you need Knowledge Bases / Agents / Guardrails out of the box, or when the model you want is on Bedrock and you don't need to fine-tune beyond what Bedrock supports. A common pattern is Bedrock for the chat surface (Claude on-demand) and JumpStart endpoints for the supporting models (a BGE embedder, a reranker, a vision model) that Bedrock doesn't host.


8. Evaluating a JumpStart Model Before Production

Before pinning a JumpStart model into a production endpoint, run it through SageMaker Clarify FM evaluation or your own held-out test set. The goal is to baseline accuracy, latency, and cost on your actual workload — not on the vendor's marketing benchmarks.


import os, time

# Simple latency + token-throughput probe over a directory of prompt files
prompts = [open(f"prompts/{p}").read() for p in os.listdir("prompts")]

start = time.time()
total_out_tokens = 0
for p in prompts:
    resp = predictor.predict({
        "inputs": [[{"role": "user", "content": p}]],
        "parameters": {"max_new_tokens": 256, "temperature": 0.0},
    })
    total_out_tokens += resp[0].get("usage", {}).get("completion_tokens", 0)

elapsed = time.time() - start
print(f"throughput: {total_out_tokens/elapsed:.1f} tok/s, "
      f"per-request avg: {elapsed/len(prompts)*1000:.0f} ms")
  

Pair this with a quality eval — golden-answer comparison, LLM-as-judge against Claude on Bedrock, or domain-specific metrics — and you have the three numbers that matter: quality, latency, and dollars per million tokens. Re-run on every new JumpStart version before promoting.
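The dollars-per-million-tokens figure falls out of the same probe. A small helper, assuming an always-on endpoint; the price and throughput below are placeholders, not quotes:

```python
def dollars_per_million_tokens(instance_price_per_hour: float,
                               tokens_per_second: float) -> float:
    """Cost to generate 1M output tokens on an always-on endpoint."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_price_per_hour / tokens_per_hour * 1_000_000

# Example: a hypothetical $1.50/hr instance sustaining 40 tok/s
print(f"${dollars_per_million_tokens(1.50, 40):.2f} per 1M tokens")
# -> $10.42 per 1M tokens
```

Feed in the measured tok/s from the probe above and your actual instance price to compare candidates on equal footing.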


9. Operational Tips

- Pin model_id and model_version; re-run your evaluation suite before promoting a new version.
- Default to enable_network_isolation=True plus a customer-managed KMS key for anything sensitive.
- Endpoints bill per instance-hour whether or not they serve traffic; delete idle endpoints, or use async/serverless for spiky workloads.
- Configure autoscaling on instance count before launch, not after the first traffic spike.

