Amazon Textract

Amazon Textract is a document-AI service that extracts printed text, handwriting, forms (key-value pairs), tables, and signatures from scanned documents and PDFs. Unlike simple OCR, Textract preserves document structure, making it a natural upstream step for document-processing pipelines and generative-AI ingestion.


Key Features:


Common Use Cases:


Example: Extract Forms & Tables from a PDF in S3


import boto3, time

textract = boto3.client("textract", region_name="us-west-2")

job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-docs", "Name": "invoices/inv-0421.pdf"}},
    FeatureTypes=["FORMS", "TABLES"],
)
job_id = job["JobId"]

while True:
    status = textract.get_document_analysis(JobId=job_id)
    if status["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(2)

for block in status["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        print(block.get("Text", ""), "->", block.get("Confidence"))
  

Textract pairs naturally with Bedrock: use Textract to convert documents into structured text, then hand the output to a foundation model for reasoning, summarization, or question answering.