nicksieger/terraform-ci-workflow-proposal.md

## terraform-ci-workflow-proposal.md

      
    Ideal CI Workflow for Terraform Infrastructure as Code


Proposal for optimizing CI/CD workflows for a multi-stack, multi-account Terraform repository on GitHub Actions

Executive Summary

This document proposes best practices for managing Terraform infrastructure modules, with a focus on:

Smart change detection: Plan only changed modules + dependents (60-80% faster CI)
Efficient drift detection: Multi-tiered nightly checks with automated alerting
Safety mechanisms: Preserved while improving speed and developer experience


Current Repository Structure

Overview


14 root modules (stacks) in infra/ managing different AWS services
9 reusable modules in modules/ with semantic versioning
25+ stack/workspace combinations across 7+ AWS accounts
Terraform 1.13.3 with S3 backend and workspace-based isolation
GitHub Actions with OIDC authentication to AWS

Existing Workflows


terraform-trigger.yml: Dispatches all 25+ combinations on every PR/push
terraform-plan.yml: Reusable workflow for plan/apply operations
terraform-lint.yml: Enforces terraform fmt standards
tag-modules.yml: Auto-versions modules on changes
renovate.yml: Automated dependency updates

Key Stacks


business-workloads: Business workload infrastructure (ECS)
untrusted-compute: UC data planes (ECS zones)
untrusted-compute-control: UC control plane (EKS)
events: MSK Kafka cluster
bootstrap: Account initialization, OIDC roles
tailscale: VPN subnet routers
single-tenant-workloads: Customer-specific deployments
shared-workloads: Customer VM infrastructure


Problem Statement

Current Pain Points


❌ Every PR triggers 25+ plans: Takes 30+ minutes even for single-file changes
❌ Noisy PR comments: 20+ plan outputs make reviews difficult
❌ High cost: Wastes GitHub Actions minutes
❌ No drift detection: Manual changes go unnoticed until next deployment
❌ Difficult to focus: Hard to identify which changes are relevant


Recommendation 1: Smart Change Detection

Strategy: Plan Changed Modules + Dependents

Instead of running all 25+ combinations on every PR, detect:

Which stacks have direct file changes
Which modules have changed
Which stacks depend on those modules
Run plans ONLY for affected stacks
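
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the workflow's actual implementation: `affected_stacks`, the `module_users` mapping (module name to the stacks that reference it) and the `ecs-zone` module name are hypothetical, and the paths assume the `infra/<stack>/` and `modules/<module>/` layout described earlier.

```python
def affected_stacks(changed_files, module_users):
    """Return the sorted stacks that need a plan for the given changed files.

    module_users maps a module name to the stacks that reference it
    (hypothetical input; in CI it could be built by grepping `source` lines).
    """
    stacks = set()
    for path in changed_files:
        parts = path.split("/")
        if len(parts) < 3:
            continue  # not inside infra/<stack>/ or modules/<module>/
        root, name = parts[0], parts[1]
        if root == "infra":
            stacks.add(name)  # step 1: the stack itself changed
        elif root == "modules":
            # steps 2-3: a module changed, so include every stack that uses it
            stacks.update(module_users.get(name, ()))
    return sorted(stacks)  # step 4: plan only these stacks


# Hypothetical example data
changed = ["infra/events/main.tf", "modules/ecs-zone/variables.tf", "README.md"]
users = {"ecs-zone": ["untrusted-compute", "business-workloads"]}
print(affected_stacks(changed, users))
```

A change that touches neither directory yields an empty list, which corresponds to skipping the plan jobs entirely.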

Implementation

Create .github/workflows/terraform-detect-changes.yml:
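name: Detect Terraform Changes

# Reusable workflow: terraform-trigger.yml calls it with `uses:`, so it must be
# triggered by workflow_call and expose its results as workflow-level outputs.
# Path filters (infra/**/*.tf, infra/**/*.tfvars, modules/**/*.tf,
# .terraform-version) belong on the calling workflow's pull_request trigger.
on:
  workflow_call:
    outputs:
      matrix:
        description: 'JSON matrix of affected stack/workspace/account combinations'
        value: ${{ jobs.detect-changes.outputs.matrix }}
      has_changes:
        description: "'true' when at least one stack is affected"
        value: ${{ jobs.detect-changes.outputs.has_changes }}

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.generate-matrix.outputs.matrix }}
      has_changes: ${{ steps.generate-matrix.outputs.has_changes }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for change detection

      - name: Detect changed stacks and modules
        id: generate-matrix
        env:
          # base_ref is empty on workflow_dispatch, so fall back to the default branch
          BASE_REF: ${{ github.base_ref || github.event.repository.default_branch }}
        run: |
          # Get changed files
          CHANGED_FILES=$(git diff --name-only "origin/${BASE_REF}...HEAD")

          # Parse changed stacks (|| true keeps an empty result from failing the step)
          CHANGED_STACKS=$(echo "$CHANGED_FILES" | grep -E '^infra/[^/]+/' | cut -d'/' -f2 | sort -u || true)

          # Parse changed modules
          CHANGED_MODULES=$(echo "$CHANGED_FILES" | grep -E '^modules/[^/]+/' | cut -d'/' -f2 | sort -u || true)

          # Find dependent stacks using grep
          DEPENDENT_STACKS=""
          for module in $CHANGED_MODULES; do
            # Find all stacks referencing this module; the trailing [/"?] keeps
            # a module name from also matching longer names that start with it
            DEPS=$(grep -rlE "source.*modules/${module}[/\"?]" infra/ | cut -d'/' -f2 | sort -u || true)
            DEPENDENT_STACKS="$DEPENDENT_STACKS $DEPS"
          done

          # Combine and deduplicate into a JSON array of stack names
          AFFECTED_JSON=$(echo "$CHANGED_STACKS $DEPENDENT_STACKS" | tr ' ' '\n' | sort -u \
            | jq -R -s -c 'split("\n") | map(select(length > 0))')

          # Generate matrix by filtering the full list of combinations.
          # ASSUMPTION: .github/terraform-matrix.json is a hypothetical file holding
          # the existing terraform-trigger.yml matrix as a JSON array of
          # {workspace, stack, account} objects.
          MATRIX_JSON=$(jq -c --argjson stacks "$AFFECTED_JSON" \
            '{include: map(select(.stack | IN($stacks[])))}' \
            .github/terraform-matrix.json)

          echo "matrix=$MATRIX_JSON" >> "$GITHUB_OUTPUT"
          if [[ $(jq '.include | length' <<< "$MATRIX_JSON") -gt 0 ]]; then
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
          else
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
          fi

  # Planning is done by the caller (the plan-changed job in terraform-trigger.yml),
  # which reads the matrix and has_changes outputs above.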
Update terraform-trigger.yml

name: Terraform CI/CD

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      scope:
        description: 'Scope to plan/apply'
        required: true
        type: choice
        options:
          - changed-only
          - all-stacks
        default: 'changed-only'

jobs:
  # Use smart detection for PRs
  detect-changes:
    if: |
      github.event_name == 'pull_request' ||
      (github.event_name == 'workflow_dispatch' && inputs.scope == 'changed-only')
    uses: ./.github/workflows/terraform-detect-changes.yml

  # Plan changed stacks
  plan-changed:
    needs: detect-changes
    if: needs.detect-changes.outputs.has_changes == 'true'
    strategy:
      fail-fast: false
      matrix: ${{ fromJson(needs.detect-changes.outputs.matrix) }}
    uses: ./.github/workflows/terraform-plan.yml
    with:
      workspace: ${{ matrix.workspace }}
      stack: ${{ matrix.stack }}
      account: ${{ matrix.account }}
      concurrency: ${{ matrix.stack }}-${{ matrix.workspace }}
    secrets: inherit

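  # Plan all stacks (manual trigger or [plan-all] in commit message).
  # github.event.head_commit is only populated on push events, so the
  # [plan-all] marker takes effect on pushes, not on pull_request runs.
  plan-all:
    if: |
      (github.event_name == 'workflow_dispatch' && inputs.scope == 'all-stacks') ||
      contains(github.event.head_commit.message, '[plan-all]')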
    strategy:
      fail-fast: false
      matrix:
        # Your existing full matrix (25+ combinations)
        include:
          - { workspace: prod, stack: business-workloads, account: business-workloads }
          - { workspace: bws, stack: business-workloads, account: business-workloads-staging }
          # ... all 25+ combinations
    uses: ./.github/workflows/terraform-plan.yml
    with:
      workspace: ${{ matrix.workspace }}
      stack: ${{ matrix.stack }}
      account: ${{ matrix.account }}
      concurrency: ${{ matrix.stack }}-${{ matrix.workspace }}
    secrets: inherit
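
The `plan-changed` job reads `matrix.workspace` and `matrix.account` as well as `matrix.stack`, so the detected stack names have to be expanded back into full combinations. A minimal sketch of that filtering step, assuming the full matrix is available as a list of dictionaries (the function name `build_matrix` is hypothetical; the sample rows are taken from the matrices in this proposal):

```python
import json


def build_matrix(full_matrix, affected):
    """Keep only the workspace/stack/account combinations whose stack is affected."""
    wanted = set(affected)
    include = [combo for combo in full_matrix if combo["stack"] in wanted]
    return {"include": include}


full_matrix = [
    {"workspace": "prod", "stack": "business-workloads", "account": "business-workloads"},
    {"workspace": "bws", "stack": "business-workloads", "account": "business-workloads-staging"},
    {"workspace": "bw", "stack": "events", "account": "business-workloads"},
]

matrix = build_matrix(full_matrix, ["events"])
# The JSON string is what `fromJson(...)` would consume in the workflow
print(json.dumps(matrix))
# An empty include list means there is nothing to plan
print("has_changes=" + str(bool(matrix["include"])).lower())
```

Emitting an `include` list rather than a bare list of stack names keeps every workspace of an affected stack in the plan set.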
Benefits


⚡ 60-80% faster CI for typical single-stack changes
🎯 Focused PR reviews: Only see plans for affected stacks
💰 Cost reduction: Fewer GitHub Actions minutes consumed
🔍 Still safe: Stacks that reference a changed module are included automatically
🛡️ Safety net: Manual "plan all" option always available

When to Plan All

Keep full planning for:

✅ Manual workflow dispatch (user selects "all-stacks")
✅ Commit message contains [plan-all]
✅ Changes to .terraform-version
✅ Changes to provider version constraints
✅ Changes to backend configuration
✅ Weekly scheduled runs (for drift detection)
✅ Release branches
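
The conditions above could be collected into a single decision function. The sketch below is illustrative only: `needs_full_plan` is a hypothetical helper, `.terraform-version` is the only file name taken from this proposal, and `versions.tf`, `backend.tf` and the `release/` branch prefix are assumed conventions for provider constraints, backend configuration and release branches.

```python
from fnmatch import fnmatch

# ASSUMPTION: only .terraform-version comes from the proposal; the other two
# patterns are guessed file names for provider constraints and backend config.
PLAN_ALL_PATTERNS = [".terraform-version", "infra/*/versions.tf", "infra/*/backend.tf"]


def needs_full_plan(changed_files, event_name="pull_request", scope="",
                    commit_message="", ref=""):
    """Return True when every stack/workspace combination should be planned."""
    if event_name == "schedule":
        return True  # weekly scheduled run
    if event_name == "workflow_dispatch" and scope == "all-stacks":
        return True  # user explicitly asked for everything
    if "[plan-all]" in commit_message:
        return True
    if ref.startswith("refs/heads/release/"):  # assumed release branch naming
        return True
    # Any change to a repository-wide file forces a full plan
    return any(fnmatch(path, pattern)
               for path in changed_files
               for pattern in PLAN_ALL_PATTERNS)


print(needs_full_plan([".terraform-version"]))
print(needs_full_plan(["infra/events/main.tf"]))
```

Keeping the patterns in one list makes it straightforward to add further repository-wide files later.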


Recommendation 2: Multi-Tiered Drift Detection

Strategy: Nightly Production, Weekly Staging

Drift detection with different frequencies based on criticality:

🔴 Critical stacks (production): Every night
🟡 Normal stacks (staging): Weekly (Mondays)
🟢 Development: On-demand only
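
The tiering rule can be stated as a small function. This is a sketch under stated assumptions: `should_check` and the `development` priority label are hypothetical names, and manual runs are assumed to bypass the weekday gate.

```python
from datetime import date


def should_check(priority, event_name, today, weekly_day=1):
    """Decide whether a stack is drift-checked on this run.

    priority: 'critical', 'normal', or anything else (treated as development).
    weekly_day: ISO weekday for the weekly tier (1 = Monday).
    """
    if event_name == "workflow_dispatch":
        return True  # on-demand runs check whatever was requested
    if priority == "critical":
        return True  # production: every night
    if priority == "normal":
        return today.isoweekday() == weekly_day  # staging: Mondays only
    return False  # development: on-demand only


monday = date(2026, 1, 12)
tuesday = date(2026, 1, 13)
print(should_check("critical", "schedule", tuesday))
print(should_check("normal", "schedule", monday))
print(should_check("normal", "schedule", tuesday))
```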

Implementation

Create .github/workflows/terraform-drift-detection.yml:
name: Drift Detection

on:
  schedule:
    # Run nightly at 3am UTC (after Renovate completes)
    - cron: '0 3 * * *'
  workflow_dispatch:
    inputs:
      scope:
        description: 'Scope of drift check'
        required: true
        type: choice
        options:
          - all
          - production-only
          - critical-stacks
        default: 'all'

jobs:
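  drift-check:
    permissions:
      id-token: write  # Required for OIDC authentication to AWS
      contents: read   # Required by actions/checkout
      issues: write    # Required to create and update drift issues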
    strategy:
      fail-fast: false  # Continue checking all stacks even if one drifts
      max-parallel: 5   # Avoid AWS API rate limits
      matrix:
        # Tiered approach
        include:
          # Production workloads (check nightly)
          - { stack: business-workloads, workspace: prod, account: business-workloads, priority: critical }
          - { stack: untrusted-compute, workspace: uc, account: untrusted-compute, priority: critical }
          - { stack: untrusted-compute-control, workspace: uc-control-use2-a, account: untrusted-compute, priority: critical }
          - { stack: events, workspace: bw, account: business-workloads, priority: critical }

          # Infrastructure foundations (check nightly)
          - { stack: bootstrap, workspace: bw, account: business-workloads, priority: critical }
          - { stack: bootstrap, workspace: uc, account: untrusted-compute, priority: critical }

          # Staging environments (check weekly - Monday only)
          - { stack: business-workloads, workspace: bws, account: business-workloads-staging, priority: normal, day: 1 }
          - { stack: untrusted-compute, workspace: ucs, account: untrusted-compute-staging, priority: normal, day: 1 }
          - { stack: bootstrap, workspace: bws, account: business-workloads-staging, priority: normal, day: 1 }

    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

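      # Skip non-critical stacks on wrong day
      - name: Check if should run
        id: should-run
        env:
          SCOPE: ${{ inputs.scope }}
          PRIORITY: ${{ matrix.priority }}
          MATRIX_DAY: ${{ matrix.day || 0 }}
        run: |
          DAY_OF_WEEK=$(date -u +%u)  # 1=Monday, 7=Sunday (UTC, like the cron schedule)

          if [[ ( "$SCOPE" == "production-only" || "$SCOPE" == "critical-stacks" ) && "$PRIORITY" != "critical" ]]; then
            # Both restricted scopes check only the stacks marked critical
            echo "skip=true" >> "$GITHUB_OUTPUT"
          elif [[ "${{ github.event_name }}" == "schedule" && "$MATRIX_DAY" -ne 0 && "$DAY_OF_WEEK" -ne "$MATRIX_DAY" ]]; then
            # The weekday gate applies to scheduled runs only; manual runs check everything in scope
            echo "skip=true" >> "$GITHUB_OUTPUT"
          else
            echo "skip=false" >> "$GITHUB_OUTPUT"
          fi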

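      - name: Configure AWS Credentials
        if: steps.should-run.outputs.skip == 'false'
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # matrix.account is an account name, but the role ARN needs the 12-digit account ID.
          # ASSUMPTION: AWS_ACCOUNT_IDS is a hypothetical repository variable holding a
          # JSON object that maps account names to account IDs.
          role-to-assume: arn:aws:iam::${{ fromJson(vars.AWS_ACCOUNT_IDS)[matrix.account] }}:role/GitHubActionsTerraformPlan
          aws-region: us-east-2
          role-session-name: drift-check-${{ github.run_id }}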

      - name: Setup Terraform
        if: steps.should-run.outputs.skip == 'false'
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.13.3
          # Call the real binary: the wrapper script can hide the exit code 2
          # that -detailed-exitcode uses to signal changes
          terraform_wrapper: false

      - name: Terraform Init
        if: steps.should-run.outputs.skip == 'false'
        working-directory: infra/${{ matrix.stack }}
        run: terraform init

      - name: Select Workspace
        if: steps.should-run.outputs.skip == 'false'
        working-directory: infra/${{ matrix.stack }}
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Plan (Drift Detection)
        if: steps.should-run.outputs.skip == 'false'
        id: plan
        working-directory: infra/${{ matrix.stack }}
        shell: bash
        run: |
          # Disable errexit so exit code 2 (changes present) does not abort the script
          set +e
          terraform plan \
            -var-file=vars/${{ matrix.workspace }}.tfvars \
            -detailed-exitcode \
            -out=drift-plan.tfplan \
            -no-color | tee plan-output.txt

          # $? would report tee's status; PIPESTATUS[0] is terraform's
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"

          # Exit code 0 = no changes, 1 = error, 2 = changes detected (drift)
          if [[ $EXIT_CODE -eq 2 ]]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
          else
            echo "drift_detected=false" >> "$GITHUB_OUTPUT"
          fi

          # Fail the step only when the plan itself errored
          [[ $EXIT_CODE -eq 1 ]] && exit 1
          exit 0

      - name: Parse Drift Summary
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true'
        id: summary
        working-directory: infra/${{ matrix.stack }}
        run: |
          # Extract the "Plan: X to add, Y to change, Z to destroy." line itself
          SUMMARY=$(grep -m1 '^Plan:' plan-output.txt || echo "Unable to parse")
          echo "summary=$SUMMARY" >> $GITHUB_OUTPUT

          # Extract changed resources (first 20)
          CHANGED_RESOURCES=$(grep -E "^\s+[~+-]" plan-output.txt | head -20 || echo "No resources listed")
          echo "resources<<EOF" >> $GITHUB_OUTPUT
          echo "$CHANGED_RESOURCES" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT

      - name: Upload Drift Plan
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan-${{ matrix.stack }}-${{ matrix.workspace }}
          path: infra/${{ matrix.stack }}/drift-plan.tfplan
          retention-days: 30

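      - name: Create GitHub Issue on Drift
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true' && matrix.priority == 'critical'
        uses: actions/github-script@v7
        env:
          # Pass plan text through the environment so backticks or ${...} in it
          # cannot break (or inject into) the script source
          DRIFT_SUMMARY: ${{ steps.summary.outputs.summary }}
          DRIFT_RESOURCES: ${{ steps.summary.outputs.resources }}
        with:
          script: |
            const stack = '${{ matrix.stack }}';
            const workspace = '${{ matrix.workspace }}';
            const summary = process.env.DRIFT_SUMMARY;
            const resources = process.env.DRIFT_RESOURCES;
            const issueTitle = `[Drift Detection] ${stack}/${workspace}`;

            // Check if an issue (open or closed) already exists for this stack/workspace
            const issues = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'all',
              labels: 'drift-detection',
              per_page: 100
            });

            // Match the exact title used when the issue is created below
            const existingIssue = issues.data.find(issue => issue.title === issueTitle);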

            const body = `## 🚨 Drift Detected in Infrastructure

            **Stack**: \`${stack}\`
            **Workspace**: \`${workspace}\`
            **Account**: \`${{ matrix.account }}\`
            **Detection Time**: ${new Date().toISOString()}
            **Priority**: ${{ matrix.priority }}

            ### Summary
            \`\`\`
            ${summary}
            \`\`\`

            ### Changed Resources (first 20)
            \`\`\`diff
            ${resources}
            \`\`\`

            ### Action Required
            - [ ] Review drift and determine if expected
            - [ ] If expected: update Terraform to match infrastructure, then apply
            - [ ] If unexpected: investigate who made manual changes (check CloudTrail)
            - [ ] Document decision in this issue
            - [ ] Close issue once remediated

            ### Resources
            - [View Workflow Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})
            - Drift plan artifact: \`drift-plan-${stack}-${workspace}\`

            ### Investigation Commands
            \`\`\`bash
            # Download and inspect the drift plan
            gh run download ${{ github.run_id }} -n drift-plan-${stack}-${workspace}

            # Show the full plan
            cd infra/${stack}
            terraform workspace select ${workspace}
            terraform show drift-plan.tfplan

            # Check CloudTrail for manual changes
            aws cloudtrail lookup-events \\
              --lookup-attributes AttributeKey=ResourceType,AttributeValue=<resource-type> \\
              --max-results 50
            \`\`\`
            `;

            if (existingIssue) {
              // Update existing issue with new comment
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: existingIssue.number,
                body: `### 🔄 Drift Still Present (${new Date().toISOString()})\n\n${body}`
              });

              // Re-open if closed
              if (existingIssue.state === 'closed') {
                await github.rest.issues.update({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: existingIssue.number,
                  state: 'open'
                });
              }
            } else {
              // Create new issue
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `[Drift Detection] ${stack}/${workspace}`,
                body: body,
                labels: ['drift-detection', 'infrastructure', 'needs-triage', '${{ matrix.priority }}']
              });
            }

      - name: Slack Notification (Critical Drift)
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true' && matrix.priority == 'critical'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK_INFRA }}
          STACK: ${{ matrix.stack }}
          WORKSPACE: ${{ matrix.workspace }}
          SUMMARY: ${{ steps.summary.outputs.summary }}
          RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
        run: |
          # Example Slack webhook notification
          if [[ -n "$SLACK_WEBHOOK" ]]; then
            # Build the payload with jq so quotes in the summary cannot break the JSON
            PAYLOAD=$(jq -n \
              --arg stack "$STACK" --arg workspace "$WORKSPACE" \
              --arg summary "$SUMMARY" --arg url "$RUN_URL" \
              '{
                "text": "🚨 Critical Infrastructure Drift Detected",
                "blocks": [
                  {
                    "type": "section",
                    "text": {
                      "type": "mrkdwn",
                      "text": "*Drift Detected in Production Infrastructure*\n\n*Stack:* `\($stack)`\n*Workspace:* `\($workspace)`\n*Summary:* \($summary)"
                    }
                  },
                  {
                    "type": "actions",
                    "elements": [
                      {
                        "type": "button",
                        "text": { "type": "plain_text", "text": "View Workflow" },
                        "url": $url
                      }
                    ]
                  }
                ]
              }')

            curl -sS -X POST "$SLACK_WEBHOOK" \
              -H 'Content-Type: application/json' \
              -d "$PAYLOAD"
          fi

  drift-summary:
    needs: drift-check
    runs-on: ubuntu-latest
    if: always()
    steps:
      - name: Generate Drift Summary Report
        uses: actions/github-script@v7
        with:
          script: |
            // Generate aggregate summary of all drift checks
            const results = ${{ toJson(needs.drift-check) }};
            console.log('Drift detection run completed');
            console.log('Results:', results);

            // Optional: Post summary to Slack or create a digest issue
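
The drift check hinges on how `terraform plan -detailed-exitcode` reports its result: 0 means no changes, 1 means the plan failed, and 2 means the plan succeeded with changes. A minimal sketch of that interpretation, with the summary line parsed into counts, follows; `classify_plan` is a hypothetical helper and not part of the workflow above.

```python
import re

# Matches e.g. "Plan: 1 to add, 2 to change, 0 to destroy."
# An optional "N to import, " prefix is tolerated.
PLAN_LINE = re.compile(
    r"^Plan: (?:\d+ to import, )?(\d+) to add, (\d+) to change, (\d+) to destroy\.",
    re.MULTILINE,
)


def classify_plan(exit_code, output):
    """Map a -detailed-exitcode result to a status and (add, change, destroy) counts."""
    if exit_code == 0:
        return {"status": "clean"}
    if exit_code == 2:
        match = PLAN_LINE.search(output)
        # counts stay None when no summary line is present (e.g. output-only changes)
        counts = tuple(int(n) for n in match.groups()) if match else None
        return {"status": "drift", "counts": counts}
    return {"status": "error"}


sample = "Terraform will perform the following actions:\n\nPlan: 1 to add, 2 to change, 0 to destroy.\n"
print(classify_plan(2, sample))
```

Treating exit code 1 separately matters here: a failed plan is not evidence that the infrastructure is free of drift.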
Drift Prevention Mechanisms

In addition to detection, implement prevention:


AWS Config Rules
# Add to bootstrap stack
resource "aws_config_config_rule" "terraform_managed_only" {
  name = "terraform-managed-resources-only"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  scope {
    compliance_resource_types = [
      "AWS::EC2::Instance",
      "AWS::RDS::DBInstance",
      "AWS::ECS::Service",
      # Add all critical resource types
    ]
  }

  input_parameters = jsonencode({
    tag1Key   = "ManagedBy"
    tag1Value = "Terraform"
  })
}


IAM Policies (restrict console access to Terraform-managed resources)
# Deny modification of resources with ManagedBy=Terraform tag
data "aws_iam_policy_document" "prevent_terraform_resource_modification" {
  statement {
    effect = "Deny"
    actions = [
      "ec2:TerminateInstances",
      "rds:DeleteDBInstance",
      "ecs:UpdateService",
      # Add relevant modify/delete actions
    ]

    resources = ["*"]

    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/ManagedBy"
      values   = ["Terraform"]
    }
  }
}


CloudTrail Monitoring
resource "aws_cloudwatch_event_rule" "terraform_resource_modification" {
  name        = "terraform-resource-manual-modification"
  description = "Alert on manual changes to Terraform-managed resources"

  event_pattern = jsonencode({
    source      = ["aws.ec2", "aws.rds", "aws.ecs"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventName = [
        "TerminateInstances",
        "ModifyDBInstance",
        "UpdateService"
      ]
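      # Exclude GitHub Actions role. A "prefix" filter treats "*" literally, so a
      # wildcard match is used to cover every account ID.
      userIdentity = {
        arn = [{
          "anything-but" = {
            wildcard = "arn:aws:sts::*:assumed-role/GitHubActionsTerraform*"
          }
        }]
      }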
    }
  })
}


Additional Best Practices

1. Pre-Commit Validation

Add comprehensive validation that runs on every PR, before the more expensive plan jobs:
# .github/workflows/terraform-validate.yml
name: Terraform Validation

on:
  pull_request:
    paths:
      - '**/*.tf'
      - '**/*.tfvars'

jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        stack:
          - business-workloads
          - events
          - bootstrap
          - untrusted-compute
          - untrusted-compute-control
          - tailscale
          - cluster-permissions
          - kafka-connect
          - shared-workloads
          - single-tenant-workloads
          - synthetics-tests
          - workspaces
          - monitoring
          - state

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.13.3

      - name: Terraform Init (backend=false)
        working-directory: infra/${{ matrix.stack }}
        run: terraform init -backend=false

      - name: Terraform Validate
        working-directory: infra/${{ matrix.stack }}
        run: terraform validate

      - name: TFLint
        uses: terraform-linters/setup-tflint@v4
        with:
          tflint_version: latest

      - name: Run TFLint
        working-directory: infra/${{ matrix.stack }}
        run: |
          tflint --init
          tflint --format=compact

      - name: Checkov Security Scan
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infra/${{ matrix.stack }}
          framework: terraform
          soft_fail: true  # Don't block PRs, just warn
          output_format: github_failed_only
2. Cost Estimation with Infracost

Add cost visibility to PRs:
# Add to terraform-plan.yml
- name: Setup Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Generate cost estimate
  run: |
    infracost breakdown \
      --path=infra/${{ inputs.stack }} \
      --terraform-workspace=${{ inputs.workspace }} \
      --format=json \
      --out-file=/tmp/infracost-base.json

- name: Post cost comment to PR
  if: github.event_name == 'pull_request'
  run: |
    infracost comment github \
      --path=/tmp/infracost-base.json \
      --repo=${{ github.repository }} \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }} \
      --behavior=update
3. Module Dependency Visualization

Enhance documentation with visual dependency graphs:
#!/bin/bash
# scripts/generate-dep-graph.sh

echo "digraph TerraformDeps {" > terraform-deps.dot
echo "  rankdir=LR;" >> terraform-deps.dot
echo "  node [shape=box, style=rounded];" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Add module nodes
echo "  // Modules" >> terraform-deps.dot
echo "  subgraph cluster_modules {" >> terraform-deps.dot
echo "    label=\"Reusable Modules\";" >> terraform-deps.dot
echo "    style=filled;" >> terraform-deps.dot
echo "    color=lightgrey;" >> terraform-deps.dot

for module in modules/*/; do
  module_name=$(basename "$module")
  echo "    \"module:$module_name\" [color=blue];" >> terraform-deps.dot
done

echo "  }" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Add stack nodes
echo "  // Stacks" >> terraform-deps.dot
echo "  subgraph cluster_stacks {" >> terraform-deps.dot
echo "    label=\"Root Modules (Stacks)\";" >> terraform-deps.dot
echo "    style=filled;" >> terraform-deps.dot
echo "    color=lightblue;" >> terraform-deps.dot

for stack in infra/*/; do
  stack_name=$(basename "$stack")
  [[ "$stack_name" == "state" ]] && continue
  echo "    \"stack:$stack_name\" [color=green];" >> terraform-deps.dot
done

echo "  }" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Module -> Module dependencies
echo "  // Module dependencies" >> terraform-deps.dot
jq -r '.modules | to_entries[] | "  \"module:\(.key)\" -> \"module:\(.value[])\";"' \
  < .github/module-deps.json >> terraform-deps.dot 2>/dev/null || true

echo "" >> terraform-deps.dot

# Stack -> Module dependencies
echo "  // Stack dependencies on modules" >> terraform-deps.dot
for stack in infra/*/; do
  stack_name=$(basename "$stack")
  [[ "$stack_name" == "state" ]] && continue

  grep -h "source.*modules/" "$stack"/*.tf 2>/dev/null | \
    sed -n 's/.*modules\/\([^"?\/]*\).*/  "stack:'"$stack_name"'" -> "module:\1";/p' | \
    sort -u >> terraform-deps.dot
done

echo "}" >> terraform-deps.dot

# Generate PNG
dot -Tpng terraform-deps.dot -o terraform-deps.png
echo "Dependency graph generated: terraform-deps.png"
Add to CI:
- name: Generate dependency graph
  run: bash scripts/generate-dep-graph.sh

- name: Upload dependency graph
  uses: actions/upload-artifact@v4
  with:
    name: terraform-dependency-graph
    path: terraform-deps.png
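
For readers who prefer to see the stack-to-module extraction outside of `sed`, the same idea can be sketched in Python. The regular expression mirrors the one in the script above; `module_edges` and the sample module names are hypothetical.

```python
import re

# Capture the directory name that follows "modules/" inside a source string,
# stopping at a slash, a quote, or a "?ref=..." query.
SOURCE_RE = re.compile(r'source\s*=\s*"[^"]*modules/([^"/?]+)')


def module_edges(stack_name, tf_text):
    """Return DOT edge lines from one stack to each module it references."""
    modules = sorted(set(SOURCE_RE.findall(tf_text)))
    return [f'  "stack:{stack_name}" -> "module:{m}";' for m in modules]


# Hypothetical Terraform snippet with a local and a git-based module source
sample_tf = '''
module "zone" {
  source = "../../modules/ecs-zone"
}
module "net" {
  source = "git::https://example.com/org/repo.git//modules/network?ref=v1.2.0"
}
'''

for line in module_edges("untrusted-compute", sample_tf):
    print(line)
```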
4. Performance Optimizations

# Optimize terraform-plan.yml

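# 1. Increase plugin cache effectiveness
#    (Terraform only fills ~/.terraform.d/plugin-cache when TF_PLUGIN_CACHE_DIR,
#    or plugin_cache_dir in the CLI config, points at it)
- uses: actions/cache@v4
  with:
    path: |
      ~/.terraform.d/plugin-cache
      infra/${{ inputs.stack }}/.terraform/providers
    # Expressions cannot nest ${{ }}; build the lock file path with format() instead
    key: terraform-${{ runner.os }}-${{ inputs.stack }}-${{ hashFiles(format('infra/{0}/.terraform.lock.hcl', inputs.stack)) }}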
    restore-keys: |
      terraform-${{ runner.os }}-${{ inputs.stack }}-
      terraform-${{ runner.os }}-

# 2. Limit parallel executions to avoid rate limits
strategy:
  max-parallel: 10  # Adjust based on AWS API limits

# 3. Use larger runners for faster execution
runs-on: ubuntu-latest-4-cores  # If available

Migration Plan

Phase 1: Smart Change Detection (Week 1-2)

Goals: Reduce CI time by 60-80%
Tasks:

✅ Create .github/workflows/terraform-detect-changes.yml
✅ Update .github/workflows/terraform-trigger.yml to use detection
✅ Test on feature branch with various change scenarios:

Single stack change
Module change affecting multiple stacks
Multi-stack change
No Terraform changes


✅ Monitor CI execution times and gather metrics
✅ Adjust matrix generation logic if needed
✅ Merge to main after 1 week of successful testing

Success Criteria:

✅ CI time reduced by 60%+ for single-stack changes
✅ All dependent stacks correctly identified
✅ No false negatives (missing required plans)

Phase 2: Drift Detection (Week 3-4)

Goals: Proactive drift identification and alerting
Tasks:

✅ Create .github/workflows/terraform-drift-detection.yml
✅ Set up GitHub issue label: drift-detection
✅ Configure Slack webhook for critical alerts (optional)
✅ Run manually for 1 week to establish baseline:

Identify expected vs. unexpected drift
Tune alert thresholds
Document known drift sources


✅ Enable nightly schedule for production stacks
✅ Add weekly schedule for staging stacks
✅ Create runbook for drift response

Success Criteria:

✅ All production stacks checked nightly
✅ Drift issues created within 5 minutes of detection
✅ Zero false positives after tuning period

Phase 3: Enhancements (Week 5-6)

Goals: Add validation, cost estimation, and documentation
Tasks:

✅ Add terraform-validate.yml workflow
✅ Integrate TFLint and Checkov
✅ (Optional) Add Infracost for cost visibility
✅ Generate dependency graph visualization
✅ Document all workflows in repository README
✅ Create runbooks for common scenarios:

Responding to drift alerts
Manually triggering full plans
Emergency changes
Rolling back changes


Success Criteria:

✅ All PRs pass validation before planning
✅ Cost estimates visible on infrastructure PRs
✅ Team trained on new workflows


Summary: Key Decisions


| Decision Area | Recommendation | Rationale |
| --- | --- | --- |
| Plan Strategy | Smart detection: changed stacks + module dependents | 60-80% faster CI, focused reviews, lower cost |
| Plan All Fallback | Manual dispatch + `[plan-all]` commit message + weekly schedule | Safety net for comprehensive validation |
| Drift Detection Frequency | Nightly (prod) + Weekly (staging) | Early detection without excessive overhead |
| Drift Alerting | GitHub issues + Slack for critical | Trackable, auditable, actionable |
| Drift Prevention | AWS Config + IAM policies + CloudTrail monitoring | Multi-layered defense against manual changes |
| Validation | terraform fmt + validate + TFLint + Checkov | Catch errors before expensive plans |
| Module Versioning | Keep existing tag-modules.yml + Renovate | Working well, no changes needed |
| Cost Visibility | Optional Infracost integration | Helpful for cost-sensitive changes |
| Performance | Plugin caching + parallel limits + larger runners | Optimize execution time and reliability |


Expected Outcomes

Metrics to Track

Before:

🐌 Average PR CI time: ~30 minutes (25+ plans)
💸 GitHub Actions minutes per PR: ~250 minutes
🕒 Time to detect drift: Variable (on next deployment)
📝 PR review complexity: High (20+ plan outputs)

After (projected):

⚡ Average PR CI time: ~6 minutes (3-5 plans)
💰 GitHub Actions minutes per PR: ~50 minutes (80% reduction)
🎯 Time to detect drift: <24 hours (nightly checks)
✅ PR review complexity: Low (only relevant plans)

ROI Calculation

Cost Savings (monthly estimate):

GitHub Actions minutes saved: ~40,000 minutes/month (implies roughly 200 PRs/month at ~200 minutes saved each)
Developer time saved (faster PR feedback): ~20 hours/month
Incident prevention (drift detection): 1-2 incidents avoided

Time Investment:

Initial setup: ~40 hours
Ongoing maintenance: ~4 hours/month

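Break-even: About 2 to 3 months on developer time alone (40 hours of setup against a net saving of ~16 hours/month), sooner once the saved Actions minutes and avoided incidents are counted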
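
The break-even arithmetic can be checked directly from the figures in this section. The sketch below counts developer hours only; the saved Actions minutes and avoided incidents are left out, so they would shorten the result. `break_even_months` is a hypothetical helper.

```python
def break_even_months(setup_hours, saved_per_month, maintenance_per_month):
    """Months until cumulative net savings cover the one-off setup effort."""
    net = saved_per_month - maintenance_per_month
    if net <= 0:
        return None  # the investment never pays back on hours alone
    return setup_hours / net


# Figures from this section: 40 h setup, 20 h/month saved, 4 h/month maintenance
months = break_even_months(40, 20, 4)
print(months)

# Minutes saved per PR from the before/after metrics, and the PR volume
# implied by the ~40,000 minutes/month estimate
saved_per_pr = 250 - 50
implied_prs = 40_000 / saved_per_pr
print(saved_per_pr, implied_prs)
```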

Risks and Mitigations


| Risk | Impact | Mitigation |
| --- | --- | --- |
| False negatives (missed dependencies) | High | Keep "plan all" fallback, test thoroughly, use `.github/module-deps.json` |
| False positives (unnecessary plans) | Low | Better to over-plan than under-plan |
| Drift alert fatigue | Medium | Tune alert thresholds, fix underlying issues, separate critical/normal |
| GitHub API rate limits | Low | Use `max-parallel` limits, spread checks throughout day |
| Initial setup complexity | Medium | Phased rollout, thorough testing, comprehensive documentation |
| Team adoption | Medium | Training sessions, runbooks, gradual rollout |
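
One source of false negatives is a module that depends on another module: a change to the inner module affects stacks that only reference the outer one. A sketch of how `.github/module-deps.json` could be used to close that gap is shown below. It assumes, based on the graph script's use of `.modules`, that the file maps each module to the modules it depends on; `modules_affected_by` and the module names are hypothetical.

```python
from collections import deque


def modules_affected_by(changed, module_deps):
    """Return the changed modules plus every module that depends on them, transitively.

    module_deps maps a module name to the list of modules it depends on.
    """
    # Invert the mapping: dependency -> modules that use it
    dependents = {}
    for module, deps in module_deps.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk upward from the changed modules
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


# Hypothetical contents of the "modules" key in module-deps.json
deps = {"ecs-zone": ["network"], "service": ["ecs-zone"], "dns": []}
print(modules_affected_by(["network"], deps))
```

The expanded module list can then be fed into the same stack lookup used for directly changed modules.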


Conclusion

Your Terraform repository already has a solid foundation with good separation of concerns, workspace isolation, and automated module versioning. These recommendations optimize for:

Speed: Smart change detection reduces CI time by 60-80%
Safety: Drift detection catches manual changes within 24 hours
Cost: Reduced GitHub Actions minutes and faster developer feedback
Maintainability: Clear workflows, automated alerts, comprehensive documentation

The key insight is to plan intelligently, not exhaustively, while maintaining safety nets (manual "plan all", weekly full checks, drift detection). This balances speed with safety, and developer experience with operational reliability.
Next Steps


Review this proposal with your team
Prioritize which recommendations to implement first
Start with Phase 1 (smart change detection) for immediate wins
Gradually add drift detection and enhancements
Measure and iterate based on real-world results


Document Version: 1.0
Last Updated: 2026-01-09
Feedback: Share experiences and improvements with the team