Amazon Textract

Amazon Textract is a document-AI service that extracts printed text, handwriting, forms (key-value pairs), and tables from scanned documents and PDFs, and detects where signatures appear. Unlike simple OCR, Textract preserves document structure, making it a natural upstream step for document-processing pipelines and generative-AI ingestion.

Key Features:

Text & Handwriting OCR: Detects lines and words with bounding boxes and confidence scores. Supported languages are English, Spanish, German, Italian, French, and Portuguese; non-Latin scripts such as Chinese, Japanese, Korean, Hindi, and Arabic are not supported.
Forms Extraction: Returns key-value pairs for form fields (e.g., "Name → John Smith", "DOB → 1985-02-14") without templates or training.
Tables Extraction: Preserves rows, columns, and cells — critical for invoices, financial statements, and tabular data.
Queries: Ask natural-language questions ("What is the invoice total?") and Textract returns the answer text found in the document, along with a confidence score and its location on the page.
Analyze Expense & Analyze ID: Purpose-built APIs that recognize semantic fields on receipts/invoices (vendor, total, line items) and identity documents (passport, driver's license).
Analyze Lending: End-to-end mortgage-document workflow that classifies pages and extracts fields for common loan forms.
Sync & Async: Synchronous APIs return results in real time for images and single-page PDF/TIFF files; asynchronous jobs handle multi-page documents stored in S3.
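
The Queries feature above returns its results as linked blocks: each QUERY block points to one or more QUERY_RESULT blocks through an ANSWER relationship. The sketch below shows one way to read those answers out of a response. The helper name `query_answers`, the alias `TOTAL`, the variable `page_bytes` in the comment, and the sample blocks are illustrative assumptions, not real Textract output.

```python
from typing import Dict, List, Tuple

# Request shape, for reference only (page_bytes is a hypothetical variable
# holding the bytes of a single-page image):
#   textract.analyze_document(
#       Document={"Bytes": page_bytes},
#       FeatureTypes=["QUERIES"],
#       QueriesConfig={"Queries": [
#           {"Text": "What is the invoice total?", "Alias": "TOTAL"},
#       ]},
#   )


def query_answers(blocks: List[dict]) -> Dict[str, Tuple[str, float]]:
    """Map each query (alias if set, else the question) to (answer, confidence).

    When a query has several candidate answers, keep the most confident one.
    Queries with no answer are left out of the result.
    """
    by_id = {b["Id"]: b for b in blocks}
    answers: Dict[str, Tuple[str, float]] = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        label = block["Query"].get("Alias") or block["Query"]["Text"]
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = by_id[result_id]
                best = answers.get(label)
                if best is None or result["Confidence"] > best[1]:
                    answers[label] = (result["Text"], result["Confidence"])
    return answers


# Hand-written sample in the shape of a Queries response (assumed values).
sample_blocks = [
    {
        "Id": "q1",
        "BlockType": "QUERY",
        "Query": {"Text": "What is the invoice total?", "Alias": "TOTAL"},
        "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
    },
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "$1,250.00", "Confidence": 98.5},
    {
        "Id": "q2",
        "BlockType": "QUERY",
        "Query": {"Text": "What is the PO number?"},  # no answer found
    },
]

print(query_answers(sample_blocks))
```

Setting an alias per query keeps downstream code independent of the exact question wording, which tends to change as prompts are tuned.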

Common Use Cases:

Invoice & Receipt Processing: Auto-extract line items and totals into accounting systems.
KYC / Onboarding: Parse identity documents and supporting forms during account creation.
Claims & Loan Processing: Automate field capture from insurance claims or mortgage applications, reducing manual data entry.
Generative AI Ingestion: Convert scanned PDFs into structured text chunks for Bedrock Knowledge Bases or custom RAG pipelines.
Historical Archive Digitization: Turn legacy paper records into searchable, structured data.
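
For the generative-AI ingestion case, one simple approach is to group Textract's LINE blocks into page-scoped text chunks before embedding them. The sketch below assumes the blocks arrive in the order Textract returned them and uses a character budget as a rough stand-in for a token limit. The function name `lines_to_chunks` and the `max_chars` parameter are hypothetical, not part of any AWS SDK.

```python
from typing import Iterator, List


def lines_to_chunks(blocks: List[dict], max_chars: int = 1000) -> Iterator[dict]:
    """Yield {"page", "text"} chunks built from LINE blocks.

    A chunk never spans two pages. A single line longer than max_chars is
    still emitted as its own chunk rather than being split.
    """
    current_page = None
    buffer: List[str] = []
    size = 0
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        page = block.get("Page", 1)  # default to page 1 if the field is absent
        text = block["Text"]
        # Flush when the page changes or the chunk would exceed the budget.
        if buffer and (page != current_page or size + len(text) + 1 > max_chars):
            yield {"page": current_page, "text": "\n".join(buffer)}
            buffer, size = [], 0
        current_page = page
        buffer.append(text)
        size += len(text) + 1  # +1 accounts for the joining newline
    if buffer:
        yield {"page": current_page, "text": "\n".join(buffer)}
```

Keeping the page number on every chunk lets a RAG pipeline cite the source page alongside a retrieved passage.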

Example: Extract Forms & Tables from a PDF in S3


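import time

import boto3

textract = boto3.client("textract", region_name="us-west-2")

# Start an asynchronous analysis job for a PDF stored in S3.
job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-docs", "Name": "invoices/inv-0421.pdf"}},
    FeatureTypes=["FORMS", "TABLES"],
)
job_id = job["JobId"]

# Poll until the job leaves IN_PROGRESS, then stop on anything but success.
while True:
    page = textract.get_document_analysis(JobId=job_id)
    if page["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(2)
if page["JobStatus"] != "SUCCEEDED":
    raise RuntimeError(f"Textract job {job_id} ended with status {page['JobStatus']}")

# Results are paginated; follow NextToken to collect every block.
blocks = list(page["Blocks"])
while "NextToken" in page:
    page = textract.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
    blocks.extend(page["Blocks"])
by_id = {b["Id"]: b for b in blocks}

def text_of(block):
    # KEY_VALUE_SET blocks carry no Text; join the WORD blocks they reference as children.
    ids = [i for r in block.get("Relationships", []) if r["Type"] == "CHILD" for i in r["Ids"]]
    return " ".join(by_id[i]["Text"] for i in ids if by_id[i]["BlockType"] == "WORD")

for block in blocks:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        print(text_of(block), "->", block.get("Confidence"))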
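
The example prints form keys only. To get usable output, keys must be paired with their values and table cells placed back into rows, both of which rely on the relationships between blocks. The following is a minimal sketch, assuming the full list of blocks from a completed job is already in memory. The helper names (`_children`, `_text`, `form_fields`, `table_rows`) are hypothetical, and merged cells are not handled.

```python
from typing import Dict, List


def _children(block: dict, rel_type: str) -> List[str]:
    """Return the IDs a block points to through one relationship type."""
    return [
        i
        for r in block.get("Relationships", [])
        if r["Type"] == rel_type
        for i in r["Ids"]
    ]


def _text(block: dict, by_id: Dict[str, dict]) -> str:
    """Join a block's WORD children; checkboxes become their selection status."""
    parts = []
    for child_id in _children(block, "CHILD"):
        child = by_id[child_id]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])  # SELECTED or NOT_SELECTED
    return " ".join(parts)


def form_fields(blocks: List[dict]) -> Dict[str, str]:
    """Pair each KEY block with the text of the VALUE block(s) it references."""
    by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [_text(by_id[v], by_id) for v in _children(block, "VALUE")]
            fields[_text(block, by_id)] = " ".join(values)
    return fields


def table_rows(blocks: List[dict]) -> List[List[List[str]]]:
    """Rebuild each TABLE as a grid of strings from its CELL children."""
    by_id = {b["Id"]: b for b in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = [by_id[c] for c in _children(block, "CHILD") if by_id[c]["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [[""] * n_cols for _ in range(n_rows)]
        for cell in cells:
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = _text(cell, by_id)
        tables.append(grid)
    return tables
```

Given the block list from a finished job, `form_fields` returns a plain dict such as "Name" mapped to "John Smith", and `table_rows` returns one grid per table, ready to write to CSV or load into a DataFrame. Repeated keys on a form overwrite each other in this sketch; collecting values in a list per key would avoid that.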

Textract pairs naturally with Bedrock: use Textract to convert documents into structured text, then hand the output to a foundation model for reasoning, summarization, or question answering.