Azure OpenAI Service

Azure OpenAI Service is Microsoft's hosted offering of OpenAI's models — GPT-4o, GPT-4o mini, GPT-4.1, o1, o3-mini, DALL·E 3, Whisper, and the text-embedding-3 family — with enterprise controls on top. It is the recommended path for running OpenAI models when you need Azure-native identity (Entra ID), private networking (VNet, Private Endpoints), regional data residency, and contractual guarantees around data handling and content filtering.


Azure OpenAI vs. OpenAI's Own API:

- Endpoint and auth: OpenAI's API is one public endpoint with API keys; Azure gives each resource its own endpoint and supports Entra ID (including managed identities) alongside keys.
- Model access: OpenAI exposes models directly by name; on Azure you create named deployments and call those.
- Enterprise surface: Azure adds VNet/Private Endpoint networking, regional data residency, and Azure-native billing, monitoring, and RBAC.
- Freshness: new models and features generally land on OpenAI's own API first, then roll out to Azure regions.


Key Concepts:

- Resource: an Azure OpenAI resource in a chosen region; its endpoint looks like https://<name>.openai.azure.com/.
- Deployment: you deploy a model under a name you choose and reference that deployment name in API calls, not the underlying model name.
- API version: every request carries an api-version query parameter (e.g. 2024-10-21); newer features are gated behind newer versions.
- Quota: tokens-per-minute (TPM) quota is granted per model and region and divided among your deployments.
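The SDK hides the wire format, but it helps to know that a deployment is addressed through a resource-specific REST route; a sketch of the URL the client calls under the hood (resource name, deployment name, and version below are illustrative):

```python
# Azure OpenAI routes each request to a deployment under your resource endpoint;
# the api-version query parameter is mandatory on every call.
endpoint = "https://myco.openai.azure.com"   # your resource (illustrative)
deployment = "gpt-4o-prod"                   # your deployment name
api_version = "2024-10-21"

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
print(url)
```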


Examples

1. Chat Completion with GPT-4o (Python)

The official openai SDK has an AzureOpenAI client that points at your Azure endpoint.


```python
from openai import AzureOpenAI
import os

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],   # https://myco.openai.azure.com/
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

resp = client.chat.completions.create(
    model="gpt-4o-prod",  # your DEPLOYMENT name, not the model name
    messages=[
        {"role": "system", "content": "You are a concise financial analyst."},
        {"role": "user",   "content": "Summarize Q3 sales trends in 3 bullets."},
    ],
    max_tokens=512,
    temperature=0.2,
)

print(resp.choices[0].message.content)
```


2. Streaming Responses for Chat UIs


```python
stream = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=[{"role": "user", "content": "Explain CAP theorem to a new engineer."}],
    stream=True,
)

for chunk in stream:
    # On Azure the first chunk may carry only content-filter results, with no choices
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
```


3. Tool Calling (Function Calling)


```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of a customer order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order A-482?"}]
resp = client.chat.completions.create(model="gpt-4o-prod", messages=messages, tools=tools)

msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)  # e.g. {"order_id": "A-482"}
    # Pretend this calls your order system
    tool_result = {"order_id": args["order_id"], "status": "In transit, ETA Fri"}

    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(tool_result),
    })
    final = client.chat.completions.create(model="gpt-4o-prod", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```


4. GPT-4o with Vision


```python
import base64

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What trend does this chart show? Return one sentence."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```


5. Entra ID (Managed Identity) Authentication — No API Keys

The cleanest production pattern: drop API keys entirely and authenticate with the caller's Azure identity.


```python
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint="https://myco.openai.azure.com/",
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
)

resp = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=[{"role": "user", "content": "Ping"}],
)
print(resp.choices[0].message.content)
```


6. Reasoning Models (o1 / o3-mini)

Reasoning models spend extra internal "thinking" tokens before responding. They take a developer message instead of a system message, use max_completion_tokens instead of max_tokens (reasoning tokens are billed as output and count against that limit), and don't support sampling parameters such as temperature.


```python
resp = client.chat.completions.create(
    model="o3-mini-prod",
    messages=[
        {"role": "developer", "content": "Think step by step."},
        {"role": "user", "content": "A train leaves Chicago at 2pm at 60mph. Another leaves NY at 3pm at 80mph going the opposite way. Distance is 800 miles. When do they meet?"},
    ],
    reasoning_effort="medium",       # low | medium | high
    max_completion_tokens=4000,      # includes the hidden reasoning tokens
)
print(resp.choices[0].message.content)
```


7. Embeddings with text-embedding-3-large


```python
vec = client.embeddings.create(
    model="embedding-3-large-prod",
    input=["Azure OpenAI Service hosts OpenAI models in your Azure tenant."],
    dimensions=1536,   # optional truncation; default is 3072
).data[0].embedding

print(len(vec), vec[:5])
```
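Whether truncated or full-length, embeddings are typically compared with cosine similarity; a minimal pure-Python sketch (the vectors here are toy values, not real model output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.3, 0.5]   # toy values; real ones come from client.embeddings.create
doc_vec   = [0.1, 0.3, 0.5]
print(cosine_similarity(query_vec, doc_vec))  # identical vectors give ~1.0
```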


8. RAG via "On Your Data" (Data Sources parameter)

Azure OpenAI can run RAG against Azure AI Search, Azure Cosmos DB for MongoDB vCore, Azure Blob Storage, or Elasticsearch without you building the retrieval loop.


```python
completion = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=[{"role": "user", "content": "What is our 2026 parental-leave policy?"}],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://myco-search.search.windows.net",
                "index_name": "hr-policies",
                "authentication": {"type": "system_assigned_managed_identity"},
                "query_type": "vector_semantic_hybrid",
                "embedding_dependency": {
                    "type": "deployment_name",
                    "deployment_name": "embedding-3-large-prod",
                },
                "semantic_configuration": "default",
            },
        }],
    },
)
print(completion.choices[0].message.content)
for ctx in completion.choices[0].message.context.get("citations", []):
    print("-", ctx["title"], ctx.get("url"))
```


9. DALL·E 3 Image Generation


```python
img = client.images.generate(
    model="dalle3-prod",
    prompt="A watercolor illustration of a quiet mountain lake at sunrise.",
    size="1024x1024",
    quality="hd",
    n=1,
)
print(img.data[0].url)
```


10. Whisper Speech-to-Text


```python
with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-prod",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )
print(transcript.text)
```


11. Batch API (50% Cheaper, 24-Hour Completion Window)

Submit a JSONL file of requests; Azure processes them asynchronously at half price.
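Each line of requests.jsonl is one self-contained request with a custom_id for matching results back to inputs; a sketch of producing the file (deployment name and prompts are illustrative):

```python
import json

requests = [
    {
        "custom_id": f"task-{i}",        # echoed back in the output file
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "gpt-4o-prod",      # your DEPLOYMENT name
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"])
]

with open("requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```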


```python
# 1) Upload the JSONL file
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2) Create the batch job
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",
    completion_window="24h",
)
print("batch id:", batch.id, "status:", batch.status)

# 3) Poll (or event-drive): client.batches.retrieve(batch.id) until status is
#    "completed", then fetch results via client.files.content(batch.output_file_id)
```


Content Filtering & Prompt Shields

Every call runs through Azure's content filter. Check resp.prompt_filter_results and choices[0].content_filter_results to see per-category scores. Enable Prompt Shields on a deployment to block jailbreak attempts and indirect prompt injections from retrieved documents. Enable Groundedness Detection on RAG responses to flag hallucinations.
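The filter annotations are plain nested dicts; a sketch of scanning them for blocked categories (the sample payload below is hypothetical, shaped like the API's content_filter_results):

```python
# Hypothetical sample shaped like Azure's content_filter_results annotations.
sample_filter_results = {
    "hate":      {"filtered": False, "severity": "safe"},
    "self_harm": {"filtered": False, "severity": "safe"},
    "sexual":    {"filtered": False, "severity": "safe"},
    "violence":  {"filtered": True,  "severity": "medium"},
}

def flagged_categories(results: dict) -> list[str]:
    """Return the categories the content filter actually blocked."""
    return [cat for cat, verdict in results.items() if verdict.get("filtered")]

print(flagged_categories(sample_filter_results))  # ['violence']
```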


Cost Optimization:

- Batch anything that can wait 24 hours; the Batch API runs at half price.
- Route simple, high-volume tasks to smaller models like GPT-4o mini.
- Cap max_tokens / max_completion_tokens so runaway generations don't run up the bill.
- Truncate embeddings with the dimensions parameter when index storage dominates cost.
- For steady, high-volume traffic, consider provisioned throughput (PTUs) instead of pay-as-you-go.
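Token prices vary by model, region, and contract, so treat them as inputs rather than constants; a sketch of per-request cost estimation (the rates below are placeholders, not real prices):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Estimate one request's cost from token counts and per-1K-token rates."""
    return (prompt_tokens / 1000) * input_rate_per_1k \
         + (completion_tokens / 1000) * output_rate_per_1k

# Placeholder rates -- look up your model's actual pricing.
cost = request_cost(1200, 300, input_rate_per_1k=0.005, output_rate_per_1k=0.015)
print(f"${cost:.4f}")
```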