Confidential Computing for On-Prem Inference

When privileged matters are pinned to an on-prem model, the trust boundary shrinks to the physical server running the weights. But on-prem is not the same as private: the operating system, hypervisor, and anyone with root on the host can, in principle, read the model's prompt and response memory. For attorney–client privileged content, that residual surface is not acceptable — a privileged document opened inside a model process is still privileged.

Confidential computing uses CPU-level memory encryption and attestation (Intel TDX, AMD SEV-SNP, Arm CCA, NVIDIA H100/H200 confidential compute mode) to create a trusted execution environment (TEE) whose memory is inaccessible even to the host OS and hypervisor. The model runs inside the TEE; infra operators cannot read what it processes.

1. Threat Model: What TEEs Defend Against

A confidential VM defends against attackers who hold privileged access to the host itself:

- a compromised or malicious host OS or hypervisor reading guest memory;
- infrastructure operators (or intruders) with root on the host;
- physical attacks on DRAM contents, such as cold-boot attacks or bus probing.

What a TEE gives you: every page of guest memory is encrypted with a key the CPU generates and never releases. Even mmap-ing the guest's physical memory from the host returns ciphertext.
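From inside the guest, you can at least sanity-check that you are running in a confidential VM. Upstream Linux exposes attestation device nodes for TEE guests (`/dev/tdx_guest` for Intel TDX, `/dev/sev-guest` for AMD SEV-SNP); exact paths vary by kernel version, so treat this sketch as a heuristic, not proof:

```python
import os

# Device nodes exposed by upstream Linux inside confidential guests.
# Paths differ across kernel versions; adjust for your distribution.
GUEST_DEVICES = {
    "/dev/tdx_guest": "intel-tdx",
    "/dev/sev-guest": "amd-sev-snp",
}


def detect_confidential_guest(devices=GUEST_DEVICES):
    """Return the TEE flavor if a known guest device exists, else None."""
    for path, kind in devices.items():
        if os.path.exists(path):
            return kind
    return None  # not running inside a known TEE guest
```

Note that presence of the device only means the kernel has the guest driver loaded; proof that memory is actually protected comes from remote attestation, covered in section 3.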


2. Hardware Options

As noted in the introduction, the practical options are Intel TDX or AMD SEV-SNP for confidential VMs on x86 servers, Arm CCA on Arm platforms, and NVIDIA H100/H200 in confidential compute mode for the GPU side (section 4).


3. Remote Attestation

TEE memory encryption is worthless if you cannot verify that the thing you are sending prompts to is actually a TEE running the expected image. Remote attestation solves this: before the orchestrator releases the data-encryption key that unwraps the prompt, it requests a signed quote from the TEE describing the measured boot state and the workload hash. The quote is signed by the CPU vendor's root of trust; the orchestrator verifies it against expected values.

import os
from dataclasses import dataclass

# verify_vendor_signature() and the audit logger are assumed to be
# provided by the surrounding service; they are not defined here.


@dataclass
class AttestationQuote:
    cpu_vendor: str            # "intel-tdx" | "amd-sev-snp"
    measurement: bytes         # hash of TEE initial state (firmware + kernel)
    workload_hash: bytes       # hash of the inference container image
    nonce: bytes               # client-supplied freshness value
    signature: bytes           # signed by CPU vendor root key


EXPECTED_MEASUREMENTS = {
    "intel-tdx": {b"\x8a\x7f..."},   # pinned after provisioning
}
EXPECTED_WORKLOADS = {b"\xde\xad..."} # SHA-256 of the signed model server image


def verify_quote(q: AttestationQuote, expected_nonce: bytes) -> bool:
    if q.nonce != expected_nonce:
        return False
    if q.measurement not in EXPECTED_MEASUREMENTS.get(q.cpu_vendor, set()):
        return False
    if q.workload_hash not in EXPECTED_WORKLOADS:
        return False
    return verify_vendor_signature(q)   # PCCS for Intel, KDS for AMD


def release_key_if_attested(tee_endpoint, data_key_material) -> bool:
    nonce = os.urandom(32)
    quote = tee_endpoint.get_quote(nonce=nonce)
    if not verify_quote(quote, expected_nonce=nonce):
        audit.log("attestation.failed", endpoint=tee_endpoint.url)
        return False
    # Only now do we transfer the key that lets the TEE decrypt the prompt.
    tee_endpoint.wrap_and_send(data_key_material)
    return True

Attestation turns "trust the server" into "trust the CPU vendor + our image signing" — a much smaller and more auditable trust base.
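The gate can be exercised end to end with a throwaway harness. Everything below is illustrative: the HMAC is a toy stand-in for the vendor signature chain (real quotes chain to Intel's or AMD's root of trust via PCCS/KDS), and the stub TEE simply answers with its pinned state:

```python
import hashlib
import hmac
import os
from dataclasses import dataclass

# Toy stand-in for the CPU vendor's signing key (illustration only).
VENDOR_KEY = b"toy-vendor-root-key"


def _sign(measurement: bytes, workload: bytes, nonce: bytes) -> bytes:
    return hmac.new(VENDOR_KEY, measurement + workload + nonce,
                    hashlib.sha256).digest()


@dataclass
class StubQuote:
    measurement: bytes
    workload_hash: bytes
    nonce: bytes
    signature: bytes


class StubTEE:
    """Fakes a guest that answers quote requests with its pinned state."""

    def __init__(self, measurement: bytes, workload_hash: bytes):
        self.measurement = measurement
        self.workload_hash = workload_hash
        self.received_keys: list[bytes] = []

    def get_quote(self, nonce: bytes) -> StubQuote:
        return StubQuote(self.measurement, self.workload_hash, nonce,
                         _sign(self.measurement, self.workload_hash, nonce))

    def wrap_and_send(self, key: bytes) -> None:
        self.received_keys.append(key)  # real code would HPKE-wrap this


def release_key_if_attested(tee: StubTEE, key: bytes,
                            expected_measurements: set[bytes],
                            expected_workloads: set[bytes]) -> bool:
    nonce = os.urandom(32)
    q = tee.get_quote(nonce)
    ok = (q.nonce == nonce
          and q.measurement in expected_measurements
          and q.workload_hash in expected_workloads
          and hmac.compare_digest(
              q.signature, _sign(q.measurement, q.workload_hash, q.nonce)))
    if ok:
        tee.wrap_and_send(key)
    return ok


# A TEE running the expected image gets the key; a tampered one does not.
good = StubTEE(b"meas-a", b"image-1")
evil = StubTEE(b"meas-a", b"image-evil")
assert release_key_if_attested(good, b"k", {b"meas-a"}, {b"image-1"})
assert not release_key_if_attested(evil, b"k", {b"meas-a"}, {b"image-1"})
```

The fail-closed shape matters: the data key never leaves the orchestrator unless every check passes, so a misconfigured or tampered guest simply never sees a prompt.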


4. Confidential GPUs and Model Weights

Modern inference is GPU-bound. NVIDIA H100 and H200 support confidential compute mode: the GPU attests its firmware state alongside the CPU TEE, and the PCIe transport between CPU and GPU is encrypted. Without this, a CPU TEE alone is insufficient — the prompt would be re-exposed the moment it crossed the bus to the GPU.
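A sketch of how the orchestrator might gate on the GPU side of the path. The field names here are hypothetical; in practice the GPU attestation report is fetched and verified with NVIDIA's attestation tooling (e.g. the nvTrust SDK) alongside the CPU quote:

```python
from dataclasses import dataclass


@dataclass
class GpuEvidence:
    cc_mode_on: bool      # GPU booted in confidential-compute mode
    link_protected: bool  # CPU<->GPU PCIe traffic is encrypted
    vbios_hash: bytes     # measured GPU firmware


# Placeholder value; pinned at provisioning like the CPU measurements.
EXPECTED_GPU_FIRMWARE = {b"fw-v1"}


def gpu_path_trusted(ev: GpuEvidence) -> bool:
    # Refuse to route prompts through any hop where plaintext would be
    # exposed: CC mode off or an unencrypted bus fails closed.
    return (ev.cc_mode_on
            and ev.link_protected
            and ev.vbios_hash in EXPECTED_GPU_FIRMWARE)
```

In a combined deployment, `release_key_if_attested` would require both the CPU quote and this GPU evidence before releasing the data key.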


5. Operational Considerations

Expect some overhead: memory encryption and encrypted PCIe transfers cost throughput. More operationally significant, every firmware, kernel, or image update changes the expected measurements, so the pinned values in the attestation verifier must be rotated in lockstep with deployments or attestation will fail closed.


6. What TEEs Do Not Protect Against

A TEE protects data from the infrastructure, not from the workload. A vulnerable or malicious model server inside the enclave can still leak what it processes; side channels (timing, cache) remain an active research area; and nothing stops the attested application from being prompted into emitting privileged content in its responses. Availability is also out of scope: the host can always refuse to run the VM.

