# Amazon Textract: Phase 1 Enterprise Access Hardening (PrivateLink + Org Guardrails + Endpoint Policy)

**Scope (Phase 1 only):**
This design focuses on the *three foundational controls* required to expose Amazon Textract safely across application teams:

1) **Network boundary:** Interface VPC Endpoint (AWS PrivateLink) + Private DNS  
2) **Org-level enforcement:** SCP / Permission Boundary to **deny non-VPCE Textract calls**  
3) **Ingress gate:** VPC Endpoint Policy to control **who may use the endpoint**

> Async job orchestration (SNS/SQS, result storage patterns, etc.) is explicitly **out of scope** for Phase 1 and will be addressed in Phase 2.

---

## 1. Why this is needed (and why IAM-only is not enough)

### IAM-only is necessary but not sufficient
Identity-based IAM policies answer: **“who is allowed to call Textract?”**  
They do **not** reliably enforce: **“from where can Textract be called?”**

If credentials are abused (e.g., SSRF, role credential theft, CI misuse), IAM-only typically still permits calling Textract from *outside your controlled network* unless you add a network-bound guardrail.

### What Phase 1 adds (defense-in-depth)
This design ensures Textract can only be used:

- **From inside approved VPCs via a specific VPC Endpoint (VPCE)**  
- **By approved principals only (endpoint policy allowlist)**  
- **With org-level enforcement (SCP/boundary) that teams cannot bypass**

AWS explicitly supports accessing Textract via interface endpoints and Private DNS using the default regional DNS name.  
See “Amazon Textract and interface VPC endpoints” (Textract Developer Guide).

---

## 2. Target state (control stack overview)

### Control 1 — Network Boundary (default private access)
- Create **Interface VPC Endpoint** for Textract:
  - `com.amazonaws.<region>.textract` (optional FIPS: `textract-fips`)
- Enable **Private DNS**
- Apps keep using the standard endpoint DNS name (no code changes), but DNS resolves to VPCE private IPs

AWS Textract docs confirm:
- Textract supports interface endpoints (PrivateLink)
- Private DNS allows using the default DNS name (e.g., `textract.us-east-1.amazonaws.com`).

### Control 2 — Org-Level Enforcement (hard deny for non-VPCE calls)
- Apply **SCP** (preferred) or **Permission Boundary**
- Deny `textract:*` unless request context contains the expected `aws:SourceVpce`

SCPs define “permission guardrails” across accounts and don’t grant permissions themselves.

### Control 3 — Endpoint Policy Gate (who can use the endpoint)
- Attach a **VPC Endpoint Policy** to the Textract interface endpoint
- Allowlist approved IAM roles/principals
- Endpoint policy doesn’t replace IAM; both must allow the request

An endpoint policy is a resource-based policy attached to a VPC endpoint; see “Control access to VPC endpoints using endpoint policies” (Amazon VPC User Guide).

---

## 3. Detailed design

### 3.1 Network boundary — Interface VPCE + Private DNS

#### Design decisions
- **Interface VPCE (PrivateLink)** is mandatory for production VPCs that use Textract.
- **Private DNS enabled** is mandatory to avoid app-side endpoint overrides and keep SDK usage standard.

#### What this achieves
- Workloads in private subnets can reach Textract without requiring IGW/NAT/public IPs (removes broad egress dependency).
- Textract calls traverse AWS private networking through the VPCE entry point.

Textract service name and Private DNS behavior are documented in “Amazon Textract and interface VPC endpoints” (Textract Developer Guide).

#### Required infra artifacts (Terraform-managed)
- `aws_vpc_endpoint` (type = `Interface`)
- Subnets: dedicated endpoint subnets (or shared with app subnets)
- Security group: allow inbound HTTPS (TCP 443) only from app workloads; limit egress as needed
- `private_dns_enabled = true`
- Endpoint policy (see §3.3)

> Note: VPCE is the *access path*, not the *security boundary alone*. Enforcement comes from SCP/boundary + endpoint policy.
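
One way to confirm that a deployed endpoint matches the artifacts listed above is to inspect its description. The sketch below is illustrative only. It assumes a dictionary shaped like one entry of `VpcEndpoints` in the EC2 `DescribeVpcEndpoints` response (for example, taken from `aws ec2 describe-vpc-endpoints` output); the function name `endpoint_findings` and all sample IDs are hypothetical.

```python
def endpoint_findings(endpoint: dict, region: str) -> list:
    """Return human-readable findings; an empty list means the baseline is met."""
    findings = []
    expected_services = {
        f"com.amazonaws.{region}.textract",
        f"com.amazonaws.{region}.textract-fips",
    }
    if endpoint.get("ServiceName") not in expected_services:
        findings.append("endpoint is not a Textract endpoint for this region")
    if endpoint.get("VpcEndpointType") != "Interface":
        findings.append("endpoint type is not Interface")
    if not endpoint.get("PrivateDnsEnabled"):
        findings.append("Private DNS is not enabled")
    if not endpoint.get("Groups"):
        findings.append("no security group attached")
    if not endpoint.get("SubnetIds"):
        findings.append("no subnets attached")
    return findings


# Hypothetical sample entry; all IDs are placeholders.
sample = {
    "VpcEndpointId": "vpce-xxxxxxxxxxxxxxxxx",
    "VpcEndpointType": "Interface",
    "ServiceName": "com.amazonaws.us-east-1.textract",
    "PrivateDnsEnabled": True,
    "Groups": [{"GroupId": "sg-xxxxxxxxxxxxxxxxx", "GroupName": "textract-vpce"}],
    "SubnetIds": ["subnet-aaaaaaaaaaaaaaaaa", "subnet-bbbbbbbbbbbbbbbbb"],
}

for finding in endpoint_findings(sample, "us-east-1"):
    print("FINDING:", finding)
```

A check like this complements Terraform rather than replacing it: Terraform stays the source of truth, and the check only flags drift.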

---

### 3.2 Org-level enforcement — SCP / Permission Boundary (deny non-VPCE)

#### Recommendation: SCP over boundary
- **SCP** is the strongest control: centrally enforced at OU/account level, app teams can’t loosen it.
- Permission boundaries can be used in environments without Organizations or for special cases, but SCP is preferred for enterprise guardrails.

AWS Organizations SCP overview: “Service control policies (SCPs)” (AWS Organizations User Guide).  
SCP examples and syntax guidance: “SCP examples” and “SCP syntax” (AWS Organizations User Guide).

#### Enforcement pattern
**Deny** Textract calls unless they come from the approved VPCE ID.

`aws:SourceVpce` is a standard global condition context key used to limit access to a specified VPC endpoint (AWS references this approach broadly, including service best practices). See “AWS global condition context keys” (IAM User Guide).

##### SCP example (Phase 1 baseline)
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyTextractUnlessFromApprovedVPCE",
      "Effect": "Deny",
      "Action": "textract:*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpce": "vpce-xxxxxxxxxxxxxxxxx"
        }
      }
    }
  ]
}
```

#### Operational impact (must communicate to app teams)

Any Textract call outside the approved VPC/VPCE path will fail with an explicit deny. This includes:

- local developer machines
- CI runners outside the controlled VPC
- workloads not configured to use the VPC endpoint path

This is intentional: it prevents credential abuse from uncontrolled networks.

**Optional enhancement (future):** Add break-glass exceptions or tighter source bindings (e.g., VPC-bound credential usage patterns). The AWS Security Blog discusses advanced patterns to restrict where credentials can be used from (see References).
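
To make the operational impact concrete, here is a simplified model of how the deny statement treats different callers. It is a sketch, not the AWS policy engine: `scp_denies` and the caller labels are hypothetical, and the only behavior modeled is that `StringNotEquals` is a negated operator, so it also matches when `aws:SourceVpce` is absent from the request context.

```python
from typing import Optional

APPROVED_VPCE = "vpce-xxxxxxxxxxxxxxxxx"  # placeholder ID from the SCP example


def scp_denies(action: str, source_vpce: Optional[str]) -> bool:
    """Simplified model of the DenyTextractUnlessFromApprovedVPCE statement.

    source_vpce is None when the request did not pass through any VPC
    endpoint, in which case aws:SourceVpce is absent from the request context.
    """
    if not action.startswith("textract:"):
        return False  # the statement only covers textract:* actions
    # Negated operator: true when the key is absent or the value differs.
    return source_vpce is None or source_vpce != APPROVED_VPCE


callers = {
    "workload via approved VPCE": APPROVED_VPCE,
    "workload via another VPCE": "vpce-0123456789abcdef0",
    "developer laptop (public endpoint)": None,
}
for name, vpce in callers.items():
    denied = scp_denies("textract:AnalyzeDocument", vpce)
    print(f"{name}: {'DENY' if denied else 'not denied by SCP'}")
```

Only the first caller escapes the deny. A request that the SCP does not deny still needs an allow from both the endpoint policy and IAM.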


---

### 3.3 Ingress gate: VPCE Endpoint Policy (who can use the endpoint)

#### Why endpoint policy matters

Even with a VPCE in place, without endpoint policy hardening you can unintentionally allow broad usage inside the VPC. Endpoint policies provide a second gate:

- SCP/boundary: “Requests must arrive via this VPCE”
- Endpoint policy: “Only these principals may use this VPCE”
- IAM: “These principals may call Textract actions”

Endpoint policy definition and behavior: see “Control access to VPC endpoints using endpoint policies” (Amazon VPC User Guide).

#### Endpoint policy baseline (allowlist principals)

Use `aws:PrincipalArn` allowlisting (or account/OU patterns) to limit who can use this endpoint:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowApprovedRolesToUseTextractEndpoint",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument",
        "textract:StartDocumentTextDetection",
        "textract:StartDocumentAnalysis",
        "textract:GetDocumentTextDetection",
        "textract:GetDocumentAnalysis"
      ],
      "Resource": "*",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/app-*",
            "arn:aws:iam::*:role/platform-textract-*"
          ]
        }
      }
    }
  ]
}
```

Notes:

- Keep the endpoint policy aligned with your approved usage model (sync vs async can still be allowed in Phase 1; orchestration comes later).
- The endpoint policy does not override IAM; both must allow the request.
- You can attach an endpoint policy only when the service supports it (Textract does; the AWS ML Blog post “Using Amazon Textract with AWS PrivateLink” demonstrates this specifically).
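
Before requesting a role, teams can check whether its name would match the allowlist patterns. The sketch below approximates `ArnLike` by comparing the six colon-delimited ARN fields with shell-style wildcards from Python's `fnmatch`. It is an approximation for illustration, not the IAM evaluation engine; the helper names and account IDs are hypothetical.

```python
from fnmatch import fnmatchcase

ALLOWED_PATTERNS = [
    "arn:aws:iam::*:role/app-*",
    "arn:aws:iam::*:role/platform-textract-*",
]


def arn_like(arn: str, pattern: str) -> bool:
    """Approximate ArnLike: match the six colon-delimited ARN fields one by one."""
    arn_parts = arn.split(":", 5)
    pattern_parts = pattern.split(":", 5)
    if len(arn_parts) != 6 or len(pattern_parts) != 6:
        return False
    return all(fnmatchcase(a, p) for a, p in zip(arn_parts, pattern_parts))


def endpoint_policy_allows(principal_arn: str) -> bool:
    """True if the principal ARN matches any allowlisted pattern."""
    return any(arn_like(principal_arn, p) for p in ALLOWED_PATTERNS)


candidates = [
    "arn:aws:iam::111122223333:role/app-orders",
    "arn:aws:iam::444455556666:role/platform-textract-ocr",
    "arn:aws:iam::111122223333:role/admin",
]
for arn in candidates:
    print(arn, "->", "allowed" if endpoint_policy_allows(arn) else "blocked")
```

Note that the `*` in the account field matches roles in any account, so the role-name prefix is the only thing these two patterns actually constrain.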


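---

## 4. Terraform delivery model (Phase 1)

### Module A: `textract_vpce_foundation` (per VPC + region)

#### Creates
- Interface VPCE for Textract (`com.amazonaws.<region>.textract`)
- Private DNS enabled
- VPCE Security Group
- VPCE Endpoint Policy

#### Outputs
- `textract_vpce_id`
- `textract_vpce_sg_id`
- `textract_vpce_dns_entries`

Textract interface endpoint requirements: “Amazon Textract and interface VPC endpoints” (Textract Developer Guide).  
General interface endpoint creation & endpoint policy support: “Access an AWS service using an interface VPC endpoint” and “Control access to VPC endpoints using endpoint policies” (Amazon VPC User Guide).

### Module B: `textract_org_guardrails` (per OU/account)

#### Delivers
- SCP JSON policy templates (deny non-VPCE Textract)
- Rollout plan and testing guidance

SCPs are guardrails; deployment should be staged and tested. See “Service control policies (SCPs)” (AWS Organizations User Guide).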
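
Because Module A runs per VPC and region, an organization will usually end up with several `textract_vpce_id` values, and the SCP template then needs to list all of them. A minimal sketch of such a template renderer follows. The function name `render_textract_scp` and the loose ID check are assumptions made for illustration, not part of either module.

```python
import json
import re

# Loose sanity check on the ID shape; this pattern is an assumption, not an AWS rule.
VPCE_ID = re.compile(r"^vpce-[0-9a-f]+$")


def render_textract_scp(vpce_ids):
    """Render the deny-non-VPCE SCP for one or more approved endpoint IDs."""
    if not vpce_ids:
        raise ValueError("at least one approved VPCE ID is required")
    bad = [v for v in vpce_ids if not VPCE_ID.match(v)]
    if bad:
        raise ValueError(f"not a VPCE ID: {bad}")
    statement = {
        "Sid": "DenyTextractUnlessFromApprovedVPCE",
        "Effect": "Deny",
        "Action": "textract:*",
        "Resource": "*",
        # With several values, StringNotEquals matches only when the
        # request's aws:SourceVpce equals none of them.
        "Condition": {"StringNotEquals": {"aws:SourceVpce": sorted(set(vpce_ids))}},
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)


# Placeholder IDs standing in for Module A outputs.
print(render_textract_scp(["vpce-0aaaaaaaaaaaaaaaa", "vpce-0bbbbbbbbbbbbbbbb"]))
```

Sorting and de-duplicating the IDs keeps the rendered policy stable between runs, which makes staged rollouts easier to review.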

---

## 5. Responsibilities & operating model

### Infra / Platform team owns
- VPCE creation, Private DNS enablement
- Endpoint policy governance
- SCP/boundary enforcement policy ownership
- Reference IAM policy sets (least privilege patterns)

### Application teams own
- Use standard AWS SDK calls (no endpoint overrides)
- Run workloads in approved VPCs/subnets
- Request approved IAM role patterns (matching allowlist rules)


---

## 6. Validation plan (how we prove it works)

### Test A: “Default private path”
- Run a Textract call from a workload inside the approved VPC
- Confirm success without requiring public egress/NAT dependency
- Optional: validate DNS resolves to VPCE private IPs

Private DNS behavior is described in the Textract VPCE documentation (“Amazon Textract and interface VPC endpoints”).
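
The optional DNS check in Test A can be scripted. The sketch below resolves a hostname and tests whether every returned address falls inside the VPC's CIDR range. The helper names and the `10.20.0.0/16` CIDR are hypothetical; substitute the CIDR of the approved VPC.

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str) -> list:
    """Return the distinct IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def inside_vpc(addresses: list, vpc_cidr: str) -> bool:
    """True only if there is at least one address and all are inside vpc_cidr."""
    network = ipaddress.ip_network(vpc_cidr)
    return bool(addresses) and all(
        ipaddress.ip_address(addr) in network for addr in addresses
    )
```

Run from a workload in the approved VPC, `inside_vpc(resolve_ipv4("textract.us-east-1.amazonaws.com"), "10.20.0.0/16")` should return `True` when Private DNS is working. A `False` result suggests the name still resolves to the public endpoint.
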

### Test B: “Hard deny outside VPCE”
- Attempt a Textract call from a non-approved network path
- Expect an explicit deny due to SCP/boundary

SCP evaluation model and deny-by-default behavior: “SCP evaluation” (AWS Organizations User Guide).

### Test C: “Endpoint policy gate works”
- Use a role NOT in the allowlist inside the VPC
- Expect denial even though the call is via VPCE (endpoint policy blocks)

Endpoint policy purpose and constraints: “Control access to VPC endpoints using endpoint policies” (Amazon VPC User Guide).

---

## 7. Known limitations (Phase 1)

- PrivateLink ensures a private access path, but the service remains a managed AWS service (data is processed by AWS). The security objective is controlling network path + enforcement, not “keeping data inside the VPC boundary”.
- If some workloads must call Textract from outside the VPC (e.g., a developer laptop), they will be blocked by design. Handle via separate non-prod policies or dedicated dev accounts/OUs.


---

## 8. Phase 2 (out of scope here)

- Standard async orchestration: SNS/SQS patterns, DLQ, retries
- Output storage patterns, KMS, lifecycle policies
- Optional “Textract Gateway Service” for rate limiting, auditing enrichment, payload sanitization


---

## References (official docs & authoritative sources)

Accessed: 2026-01-19

- Amazon Textract and interface VPC endpoints (Textract Developer Guide)  
  https://docs.aws.amazon.com/textract/latest/dg/vpc-interface-endpoints.html
- Control access to VPC endpoints using endpoint policies (Amazon VPC User Guide / PrivateLink)  
  https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-access.html
- Access an AWS service using an interface VPC endpoint (PrivateLink)  
  https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html
- Service control policies (SCPs) - AWS Organizations  
  https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html
- SCP examples - AWS Organizations  
  https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps_examples.html
- SCP syntax - AWS Organizations  
  https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps_syntax.html
- SCP evaluation - AWS Organizations  
  https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps_evaluation.html
- AWS global condition context keys (IAM)  
  https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html
- AWS Security Blog: Restrict where EC2 instance credentials can be used from (network-bound enforcement patterns)  
  https://aws.amazon.com/blogs/security/how-to-use-policies-to-restrict-where-ec2-instance-credentials-can-be-used-from/
- AWS ML Blog: Using Amazon Textract with AWS PrivateLink (Textract + PrivateLink + endpoint policy)  
  https://aws.amazon.com/blogs/machine-learning/using-amazon-textract-with-aws-privatelink/
