Amazon Macie

Amazon Macie is a managed sensitive-data discovery service for Amazon S3. It uses ML and pattern matching to find PII, PHI, financial data, and credentials inside S3 objects, then publishes severity-graded findings to Security Hub. Macie also continuously monitors S3 bucket-level configuration (public access, encryption, sharing) so the data-discovery findings come paired with their exposure context.


1. Overview & Data Flow

Macie works in two modes: continuous bucket inventory (always on once enabled, billed per bucket rather than per GB) and on-demand or scheduled object content scanning (charged per GB). The diagram below shows the content-scan path: S3 objects flow into the Macie scan engine, are matched against managed and custom identifiers, and emerge as findings routed downstream.

┌──────────────────────────────────────────────────────────────────────────────┐
│                         S3 DATA SOURCES                                      │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │  Data Lake Bkt   │  │  App Logs Bkt    │  │  Backups Bkt     │            │
│  │  (Parquet, CSV)  │  │  (JSON, gzip)    │  │  (DB dumps)      │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                       MACIE SCAN ENGINE                                      │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │ Managed Data IDs │  │  Custom Regex    │  │  Sample-Based    │            │
│  │ PII / PHI / Cred │  │   Identifiers    │  │   Discovery      │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                  FINDINGS (Sensitive Data)                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │  SSN / DOB / CC  │  │  AWS Access Keys │  │  Health Records  │            │
│  │  Severity: High  │  │  Severity: High  │  │  Severity: High  │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                DOWNSTREAM ROUTING & ACTION                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │  Security Hub    │  │   EventBridge    │  │  Lambda / Jira   │            │
│  │  ASFF Aggregate  │  │  Severity Routes │  │  Tag, Quarantine │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
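The bucket inventory side of the flow can also be queried programmatically. The sketch below is a minimal example, assuming a boto3 macie2 client is passed in; it pages through describe_buckets and yields only the buckets whose effective permission is reported as public. The helper name iter_public_buckets is hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_public_buckets(macie_client: Any) -> Iterator[Dict[str, Any]]:
    """Yield inventory entries for buckets Macie reports as publicly accessible."""
    kwargs: Dict[str, Any] = {
        # Filter on the inventory field that holds the effective public-access state
        "criteria": {"publicAccess.effectivePermission": {"eq": ["PUBLIC"]}},
        "maxResults": 50,
    }
    while True:
        page = macie_client.describe_buckets(**kwargs)
        yield from page.get("buckets", [])
        token = page.get("nextToken")
        if not token:
            break
        # Ask for the next page on the following iteration
        kwargs["nextToken"] = token
```

A typical call would be iter_public_buckets(boto3.client("macie2")), reading bucketName from each entry. Taking the client as a parameter keeps the helper easy to exercise with a stub.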

2. Managed Data Identifiers

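Macie ships with 100+ managed identifiers maintained by AWS. They cover the standard regulated-data categories (credentials, financial information, personal health information, and personally identifiable information), and AWS updates them over time to reduce false positives.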

3. Custom Data Identifiers

For internal patterns that AWS doesn't ship (employee IDs, product license keys, customer numbers), define a custom identifier with a regex, optional keywords, an ignore-words list, and a maximum match distance.


# List-valued options take space-separated values, one per keyword or ignore word
aws macie2 create-custom-data-identifier \
  --name "EmployeeId" \
  --regex "EMP-[0-9]{6}" \
  --keywords "employee" "emp_id" "personnel" \
  --maximum-match-distance 50 \
  --ignore-words "EMP-000000" "EMP-999999"

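A job evaluates a custom identifier only when the identifier is selected in the job's configuration, whether or not the job also uses managed identifiers. Matches from custom identifiers produce sensitive data findings just as managed identifiers do, and the severity of those findings can be tuned per identifier with occurrence-count thresholds.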
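Before registering an identifier, it can help to try the regex, keywords, and ignore words against sample text locally. The sketch below is a rough local approximation, not Macie's matching engine: it assumes a keyword counts only if it ends no more than the maximum match distance before the end of the regex match, that keywords are case-insensitive, and that ignore words are case-sensitive. The function name approximate_matches is hypothetical.

```python
import re
from typing import List, Sequence


def approximate_matches(
    text: str,
    regex: str,
    keywords: Sequence[str] = (),
    ignore_words: Sequence[str] = (),
    max_distance: int = 50,
) -> List[str]:
    """Return regex matches that pass the ignore-word and keyword-proximity checks."""
    lowered = text.lower()
    hits: List[str] = []
    for match in re.finditer(regex, text):
        value = match.group(0)
        # Drop matches that contain any ignore word
        if any(word in value for word in ignore_words):
            continue
        if keywords and not _keyword_nearby(lowered, keywords, match.end(), max_distance):
            continue
        hits.append(value)
    return hits


def _keyword_nearby(lowered: str, keywords: Sequence[str], match_end: int, max_distance: int) -> bool:
    """True if some keyword ends within max_distance characters before match_end."""
    for keyword in keywords:
        for found in re.finditer(re.escape(keyword.lower()), lowered):
            if 0 <= match_end - found.end() <= max_distance:
                return True
    return False
```

With the EmployeeId settings above, a line such as "employee id: EMP-123456" yields a match, while EMP-000000 is dropped as an ignore word and an ID far from any keyword is skipped.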

4. Job Types

Jobs run either once or on a recurring schedule (daily, weekly, or monthly). Job scope can be filtered by tag, prefix, file type (JSON / CSV / Parquet / Avro / Excel / archives), object size, and last-modified date. Use these filters to skip uninteresting cold storage and concentrate spend on hot data.
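The scope filters map onto the scoping block of a classification job request. The following sketch builds such a request as a plain dictionary, assuming the boto3 create_classification_job parameter shapes. The helper name build_job_request and all argument values are illustrative.

```python
import uuid
from typing import Any, Dict, List, Optional


def build_job_request(
    name: str,
    account_id: str,
    buckets: List[str],
    extensions: List[str],
    modified_after: str,
    prefix: Optional[str] = None,
    sampling_percentage: int = 100,
    schedule: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for macie2.create_classification_job."""
    includes: List[Dict[str, Any]] = [
        # Only objects with these file name extensions
        {"simpleScopeTerm": {"key": "OBJECT_EXTENSION", "comparator": "EQ", "values": extensions}},
        # Only objects modified on or after this ISO 8601 UTC timestamp
        {"simpleScopeTerm": {"key": "OBJECT_LAST_MODIFIED_DATE", "comparator": "GTE",
                             "values": [modified_after]}},
    ]
    if prefix:
        includes.append(
            {"simpleScopeTerm": {"key": "OBJECT_KEY", "comparator": "STARTS_WITH", "values": [prefix]}}
        )
    request: Dict[str, Any] = {
        "clientToken": str(uuid.uuid4()),
        "name": name,
        "jobType": "SCHEDULED" if schedule else "ONE_TIME",
        "samplingPercentage": sampling_percentage,
        "s3JobDefinition": {
            "bucketDefinitions": [{"accountId": account_id, "buckets": buckets}],
            "scoping": {"includes": {"and": includes}},
        },
    }
    if schedule:
        request["scheduleFrequency"] = schedule
    return request
```

The result would be passed as boto3.client("macie2").create_classification_job(**request). A weekly job would pass schedule={"weeklySchedule": {"dayOfWeek": "MONDAY"}}.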

5. Findings & Severity

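Macie produces two finding categories: policy findings, which flag bucket-level security or privacy issues such as public access or disabled encryption, and sensitive data findings, which report sensitive data detected inside objects.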

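Severity levels: Low, Medium, and High, reported in severity.description with a numeric severity.score of 1, 2, or 3 respectively. Each sensitive data finding includes the bucket and object path, the identifier(s) matched, the number of occurrences of each, and the location of those occurrences in the object. The sensitive values themselves are not included in the finding.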
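To make the shape of a sensitive data finding concrete, the sketch below flattens one finding into the fields described above. It assumes the dictionary layout returned by the boto3 get_findings call (resourcesAffected, classificationDetails.result). The function name summarize_finding is hypothetical.

```python
from typing import Any, Dict


def summarize_finding(finding: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a sensitive data finding to location, severity, and match counts."""
    resources = finding.get("resourcesAffected", {})
    result = finding.get("classificationDetails", {}).get("result", {})
    counts: Dict[str, int] = {}
    # Managed identifier detections are grouped by sensitive-data category
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            key = detection["type"]
            counts[key] = counts.get(key, 0) + detection.get("count", 0)
    # Custom identifier detections are reported separately, keyed by name
    for detection in result.get("customDataIdentifiers", {}).get("detections", []):
        key = detection["name"]
        counts[key] = counts.get(key, 0) + detection.get("count", 0)
    return {
        "severity": finding.get("severity", {}).get("description"),
        "bucket": resources.get("s3Bucket", {}).get("name"),
        "key": resources.get("s3Object", {}).get("key"),
        "matches": counts,
        "total": sum(counts.values()),
    }
```

Policy findings carry no classificationDetails, so for them the summary simply reports an empty match set.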

6. Cost-Optimization Patterns

Content scanning charges per GB; bucket inventory does not. The cost levers are object selection and sampling.

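Approximate pricing as of 2026: ~$1.00 per GB scanned by sensitive-data discovery jobs (first pricing tier) and ~$0.10 per bucket per month for the inventory. Rates vary by region and volume tier, so confirm them on the Macie pricing page. A 30-day free trial is available.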
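The effect of sampling on spend is simple arithmetic, sketched below. The rates are inputs rather than quoted prices, and the estimate assumes sampled objects are about the same average size as the rest of the eligible data. The function name estimate_monthly_cost is hypothetical.

```python
def estimate_monthly_cost(
    eligible_gb: float,
    sampling_percentage: float,
    per_gb_rate: float,
    bucket_count: int,
    per_bucket_rate: float,
) -> float:
    """Estimate one month of Macie spend: sampled content scanning plus bucket inventory."""
    if not 0 < sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be in (0, 100]")
    # Only the sampled fraction of eligible data is scanned and billed per GB
    scan_cost = eligible_gb * (sampling_percentage / 100) * per_gb_rate
    # Inventory is billed per bucket, independent of data volume
    inventory_cost = bucket_count * per_bucket_rate
    return round(scan_cost + inventory_cost, 2)
```

For example, with placeholder rates of 0.5 per GB and 2.0 per bucket, scanning a 20% sample of 1,000 GB across 10 buckets comes to 100 for scanning plus 20 for inventory. The same job without sampling would cost 500 for scanning alone.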

7. Integration with Security Hub

Macie publishes findings to EventBridge in near real time and to Security Hub in ASFF format. Security Hub aggregation lets a single dashboard show "buckets with public access AND containing PII" by joining Macie sensitive data findings with Macie policy findings (and Config rules for redundancy).


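from datetime import datetime, timedelta, timezone

import boto3

macie = boto3.client("macie2")

# Start of the 7-day window; Macie expects timestamps in epoch milliseconds
since_ms = int((datetime.now(timezone.utc) - timedelta(days=7)).timestamp() * 1000)
sort = {"attributeName": "severity.score", "orderBy": "DESC"}

# List High-severity sensitive-data findings from the last 7 days
resp = macie.list_findings(
    findingCriteria={
        "criterion": {
            "severity.description": {"eq": ["High"]},
            "category": {"eq": ["CLASSIFICATION"]},
            "updatedAt": {"gte": since_ms},
        }
    },
    maxResults=50,
    sortCriteria=sort,
)

# get_findings accepts a batch of IDs, so fetch the whole page in one call
# (it rejects an empty list, hence the guard)
finding_ids = resp["findingIds"]
findings = (
    macie.get_findings(findingIds=finding_ids, sortCriteria=sort)["findings"]
    if finding_ids
    else []
)

for detail in findings:
    print(f"{detail['severity']['description']:<6} "
          f"{detail['resourcesAffected']['s3Bucket']['name']}/"
          f"{detail['resourcesAffected']['s3Object']['key']}")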

Common runbook: a High Macie finding on a public-access bucket triggers an EventBridge rule that (1) flips the bucket's BlockPublicAccess settings to all-true, (2) creates a Jira ticket assigned to the bucket's owning team via tag lookup, (3) Slack-pages the data-protection on-call.
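Step (1) of this runbook can be sketched as an EventBridge pattern plus a small handler. The pattern assumes Macie's published event source and detail type. The handler takes an S3 client as a parameter and only acts when the finding reports the bucket as public. The names EVENT_PATTERN, BLOCK_ALL, and remediate are hypothetical, and the Jira and Slack steps are left out.

```python
from typing import Any, Dict

# EventBridge rule pattern: High-severity Macie findings only
EVENT_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {"severity": {"description": ["High"]}},
}

# All four S3 Block Public Access settings switched on
BLOCK_ALL = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def remediate(event: Dict[str, Any], s3_client: Any) -> Dict[str, str]:
    """Turn on Block Public Access for the bucket named in a Macie finding event."""
    bucket = event["detail"]["resourcesAffected"]["s3Bucket"]
    name = bucket["name"]
    permission = bucket.get("publicAccess", {}).get("effectivePermission")
    if permission != "PUBLIC":
        # Bucket is not reported as public, so leave its settings alone
        return {"bucket": name, "action": "none"}
    s3_client.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration=BLOCK_ALL,
    )
    return {"bucket": name, "action": "blocked_public_access"}
```

In a Lambda function, the entry point would call remediate(event, boto3.client("s3")). Returning the action taken makes it easy for the later ticketing and paging steps to report what changed.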

