@nicksieger
Created January 9, 2026 18:13
Ideal CI Workflow for Terraform Infrastructure as Code - Multi-stack, multi-account repository optimization guide

Ideal CI Workflow for Terraform Infrastructure as Code

Proposal for optimizing CI/CD workflows for a multi-stack, multi-account Terraform repository on GitHub Actions

Executive Summary

This document proposes best practices for managing Terraform infrastructure modules, with a focus on:

  • Smart change detection: Plan only changed modules + dependents (60-80% faster CI)
  • Efficient drift detection: Multi-tiered nightly checks with automated alerting
  • Safety mechanisms: Preserved while improving speed and developer experience

Current Repository Structure

Overview

  • 14 root modules (stacks) in infra/ managing different AWS services
  • 9 reusable modules in modules/ with semantic versioning
  • 25+ stack/workspace combinations across 7+ AWS accounts
  • Terraform 1.13.3 with S3 backend and workspace-based isolation
  • GitHub Actions with OIDC authentication to AWS

Existing Workflows

  1. terraform-trigger.yml: Dispatches all 25+ combinations on every PR/push
  2. terraform-plan.yml: Reusable workflow for plan/apply operations
  3. terraform-lint.yml: Enforces terraform fmt standards
  4. tag-modules.yml: Auto-versions modules on changes
  5. renovate.yml: Automated dependency updates

Key Stacks

  • business-workloads: Business workload infrastructure (ECS)
  • untrusted-compute: UC data planes (ECS zones)
  • untrusted-compute-control: UC control plane (EKS)
  • events: MSK Kafka cluster
  • bootstrap: Account initialization, OIDC roles
  • tailscale: VPN subnet routers
  • single-tenant-workloads: Customer-specific deployments
  • shared-workloads: Customer VM infrastructure

Problem Statement

Current Pain Points

  1. ❌ Every PR triggers 25+ plans: Takes 30+ minutes even for single-file changes
  2. ❌ Noisy PR comments: 20+ plan outputs make reviews difficult
  3. ❌ High cost: Wastes GitHub Actions minutes
  4. ❌ No drift detection: Manual changes go unnoticed until next deployment
  5. ❌ Difficult to focus: Hard to identify which changes are relevant

Recommendation 1: Smart Change Detection

Strategy: Plan Changed Modules + Dependents

Instead of running all 25+ combinations on every PR, detect:

  1. Which stacks have direct file changes
  2. Which modules have changed
  3. Which stacks depend on those modules
  4. Run plans ONLY for affected stacks

Implementation

Create .github/workflows/terraform-detect-changes.yml:

name: Detect Terraform Changes

on:
  # terraform-trigger.yml calls this file with `uses:`, which requires a
  # workflow_call trigger; path filters belong on the caller's pull_request trigger.
  workflow_call:
    outputs:
      matrix:
        description: 'JSON matrix of affected stack/workspace/account entries'
        value: ${{ jobs.detect-changes.outputs.matrix }}
      has_changes:
        description: 'true when at least one stack is affected'
        value: ${{ jobs.detect-changes.outputs.has_changes }}

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.generate-matrix.outputs.matrix }}
      has_changes: ${{ steps.generate-matrix.outputs.has_changes }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for change detection

      - name: Detect changed stacks and modules
        id: generate-matrix
        run: |
          # Get changed files (base_ref is empty outside pull_request events)
          CHANGED_FILES=$(git diff --name-only "origin/${{ github.base_ref || 'main' }}...HEAD")

          # Parse changed stacks ("|| true" because grep exits 1 when nothing matches)
          CHANGED_STACKS=$(echo "$CHANGED_FILES" | grep -E '^infra/[^/]+/' | cut -d'/' -f2 | sort -u || true)

          # Parse changed modules
          CHANGED_MODULES=$(echo "$CHANGED_FILES" | grep -E '^modules/[^/]+/' | cut -d'/' -f2 | sort -u || true)

          # Find dependent stacks using grep
          DEPENDENT_STACKS=""
          for module in $CHANGED_MODULES; do
            # Find all stacks referencing this module; the trailing character class
            # keeps one module name from matching another that merely starts with it
            DEPS=$(grep -rlE "source.*modules/${module}[\"/?]" infra/ | cut -d'/' -f2 | sort -u || true)
            DEPENDENT_STACKS="$DEPENDENT_STACKS $DEPS"
          done

          # Combine and deduplicate
          ALL_AFFECTED_STACKS=$(echo "$CHANGED_STACKS $DEPENDENT_STACKS" | tr ' ' '\n' | sort -u | grep -v '^$' || true)

          # Generate matrix (filter the full matrix by affected stacks).
          # ASSUMPTION: .github/terraform-matrix.json is a hypothetical file holding the same
          # list of {workspace, stack, account} objects as the plan-all matrix in terraform-trigger.yml.
          STACKS_JSON=$(echo "$ALL_AFFECTED_STACKS" | jq -R -s -c 'split("\n") | map(select(length > 0))')
          MATRIX_JSON=$(jq -c --argjson stacks "$STACKS_JSON" \
            '{include: map(select(.stack as $s | any($stacks[]; . == $s)))}' .github/terraform-matrix.json)

          echo "matrix=$MATRIX_JSON" >> "$GITHUB_OUTPUT"
          # An empty include list would make the matrix strategy fail, so report it explicitly
          if [[ "$(echo "$MATRIX_JSON" | jq '.include | length')" -gt 0 ]]; then
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
          else
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
          fi

# Planning itself happens in the plan-changed job of terraform-trigger.yml,
# which consumes the matrix and has_changes outputs above.

Update terraform-trigger.yml

name: Terraform CI/CD

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      scope:
        description: 'Scope to plan/apply'
        required: true
        type: choice
        options:
          - changed-only
          - all-stacks
        default: 'changed-only'

jobs:
  # Use smart detection for PRs
  detect-changes:
    if: |
      github.event_name == 'pull_request' ||
      (github.event_name == 'workflow_dispatch' && inputs.scope == 'changed-only')
    uses: ./.github/workflows/terraform-detect-changes.yml

  # Plan changed stacks
  plan-changed:
    needs: detect-changes
    if: needs.detect-changes.outputs.has_changes == 'true'
    strategy:
      fail-fast: false
      matrix: ${{ fromJson(needs.detect-changes.outputs.matrix) }}
    uses: ./.github/workflows/terraform-plan.yml
    with:
      workspace: ${{ matrix.workspace }}
      stack: ${{ matrix.stack }}
      account: ${{ matrix.account }}
      concurrency: ${{ matrix.stack }}-${{ matrix.workspace }}
    secrets: inherit

  # Plan all stacks (manual trigger or [plan-all] in commit message;
  # github.event.head_commit is only populated on push events)
  plan-all:
    if: |
      (github.event_name == 'workflow_dispatch' && inputs.scope == 'all-stacks') ||
      contains(github.event.head_commit.message, '[plan-all]')
    strategy:
      fail-fast: false
      matrix:
        # Your existing full matrix (25+ combinations)
        include:
          - { workspace: prod, stack: business-workloads, account: business-workloads }
          - { workspace: bws, stack: business-workloads, account: business-workloads-staging }
          # ... all 25+ combinations
    uses: ./.github/workflows/terraform-plan.yml
    with:
      workspace: ${{ matrix.workspace }}
      stack: ${{ matrix.stack }}
      account: ${{ matrix.account }}
      concurrency: ${{ matrix.stack }}-${{ matrix.workspace }}
    secrets: inherit

Benefits

  • ⚡ 60-80% faster CI for typical single-stack changes
  • 🎯 Focused PR reviews: Only see plans for affected stacks
  • 💰 Cost reduction: Fewer GitHub Actions minutes consumed
  • 🔐 Still safe: All dependents are automatically included
  • 🛡️ Safety net: Manual "plan all" option always available

When to Plan All

Keep full planning for:

  • ✅ Manual workflow dispatch (user selects "all-stacks")
  • ✅ Commit message contains [plan-all]
  • ✅ Changes to .terraform-version
  • ✅ Changes to provider version constraints
  • ✅ Changes to backend configuration
  • ✅ Weekly scheduled runs (for drift detection)
  • ✅ Release branches
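
The file-based triggers in this list (version pin, provider constraints, backend settings) can be detected from the changed-file list. A minimal sketch, assuming provider constraints live in `versions.tf` or `providers.tf` and backend settings in `backend.tf`; those file names and the function name `needs_full_plan` are illustrative assumptions, not something the repository is known to use:

```bash
# needs_full_plan: reads changed file paths on stdin (one per line) and
# succeeds (exit status 0) when any of them should trigger a plan of every stack.
# ASSUMPTION: versions.tf / providers.tf hold provider constraints and
# backend.tf holds backend settings; adjust the pattern to the real layout.
needs_full_plan() {
  grep -qE '(^|/)(\.terraform-version|versions\.tf|providers\.tf|backend\.tf)$'
}
```

In the detection step this could gate the fallback, for example `if echo "$CHANGED_FILES" | needs_full_plan; then ...`, so that such changes emit the full matrix instead of the filtered one.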

Recommendation 2: Multi-Tiered Drift Detection

Strategy: Nightly Production, Weekly Staging

Drift detection with different frequencies based on criticality:

  • 🔴 Critical stacks (production): Every night
  • 🟡 Normal stacks (staging): Weekly (Mondays)
  • 🟢 Development: On-demand only

Implementation

Create .github/workflows/terraform-drift-detection.yml:

name: Drift Detection

on:
  schedule:
    # Run nightly at 3am UTC (after Renovate completes)
    - cron: '0 3 * * *'
  workflow_dispatch:
    inputs:
      scope:
        description: 'Scope of drift check'
        required: true
        type: choice
        options:
          - all
          - production-only
          - critical-stacks
        default: 'all'

jobs:
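  drift-check:
    permissions:
      id-token: write  # Required for OIDC authentication to AWS
      contents: read
      issues: write    # Required to open or comment on drift issues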
    strategy:
      fail-fast: false  # Continue checking all stacks even if one drifts
      max-parallel: 5   # Avoid AWS API rate limits
      matrix:
        # Tiered approach
        include:
          # Production workloads (check nightly)
          - { stack: business-workloads, workspace: prod, account: business-workloads, priority: critical }
          - { stack: untrusted-compute, workspace: uc, account: untrusted-compute, priority: critical }
          - { stack: untrusted-compute-control, workspace: uc-control-use2-a, account: untrusted-compute, priority: critical }
          - { stack: events, workspace: bw, account: business-workloads, priority: critical }

          # Infrastructure foundations (check nightly)
          - { stack: bootstrap, workspace: bw, account: business-workloads, priority: critical }
          - { stack: bootstrap, workspace: uc, account: untrusted-compute, priority: critical }

          # Staging environments (check weekly - Monday only)
          - { stack: business-workloads, workspace: bws, account: business-workloads-staging, priority: normal, day: 1 }
          - { stack: untrusted-compute, workspace: ucs, account: untrusted-compute-staging, priority: normal, day: 1 }
          - { stack: bootstrap, workspace: bws, account: business-workloads-staging, priority: normal, day: 1 }

    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

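      # Skip non-critical stacks on wrong day
      - name: Check if should run
        id: should-run
        run: |
          DAY_OF_WEEK=$(date +%u)  # 1=Monday, 7=Sunday
          MATRIX_DAY="${{ matrix.day || 0 }}"
          SCOPE="${{ inputs.scope }}"  # Empty on scheduled runs

          # Both restricted scopes limit the run to critical-priority entries
          if [[ ( "$SCOPE" == "production-only" || "$SCOPE" == "critical-stacks" ) && "${{ matrix.priority }}" != "critical" ]]; then
            echo "skip=true" >> $GITHUB_OUTPUT
          # The weekday filter applies to scheduled runs only, so a manual "all" run checks everything
          elif [[ "${{ github.event_name }}" == "schedule" && "$MATRIX_DAY" -ne 0 && "$DAY_OF_WEEK" -ne "$MATRIX_DAY" ]]; then
            echo "skip=true" >> $GITHUB_OUTPUT
          else
            echo "skip=false" >> $GITHUB_OUTPUT
          fi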

      - name: Configure AWS Credentials
        if: steps.should-run.outputs.skip == 'false'
        uses: aws-actions/configure-aws-credentials@v4
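        with:
          # NOTE: the ARN needs a 12-digit account ID, while matrix.account holds an
          # account name in the matrix above; map the name to its ID before using it here.
          role-to-assume: arn:aws:iam::${{ matrix.account }}:role/GitHubActionsTerraformPlan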
          aws-region: us-east-2
          role-session-name: drift-check-${{ github.run_id }}

      - name: Setup Terraform
        if: steps.should-run.outputs.skip == 'false'
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.13.3
          terraform_wrapper: false  # Let the plan step see terraform's own exit code and stdout

      - name: Terraform Init
        if: steps.should-run.outputs.skip == 'false'
        working-directory: infra/${{ matrix.stack }}
        run: terraform init

      - name: Select Workspace
        if: steps.should-run.outputs.skip == 'false'
        working-directory: infra/${{ matrix.stack }}
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Plan (Drift Detection)
        if: steps.should-run.outputs.skip == 'false'
        id: plan
        working-directory: infra/${{ matrix.stack }}
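        run: |
          # Disable errexit so exit code 2 (changes present) does not abort the script
          set +e
          terraform plan \
            -var-file=vars/${{ matrix.workspace }}.tfvars \
            -detailed-exitcode \
            -out=drift-plan.tfplan \
            -no-color | tee plan-output.txt

          # $? would be tee's status; PIPESTATUS[0] is terraform's
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=$EXIT_CODE" >> $GITHUB_OUTPUT

          # Exit code 2 means changes detected (drift)
          if [[ $EXIT_CODE -eq 2 ]]; then
            echo "drift_detected=true" >> $GITHUB_OUTPUT
          else
            echo "drift_detected=false" >> $GITHUB_OUTPUT
          fi

          # Exit code 1 means the plan itself failed; surface that instead of "no drift"
          if [[ $EXIT_CODE -eq 1 ]]; then
            exit 1
          fi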
        continue-on-error: true

      - name: Parse Drift Summary
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true'
        id: summary
        working-directory: infra/${{ matrix.stack }}
        run: |
          # Extract the "Plan: X to add, Y to change, Z to destroy." line itself
          SUMMARY=$(grep -m1 -E '^Plan:' plan-output.txt || echo "Unable to parse")
          echo "summary=$SUMMARY" >> $GITHUB_OUTPUT

          # Extract changed resources (first 20)
          CHANGED_RESOURCES=$(grep -E "^\s+[~+-]" plan-output.txt | head -20 || echo "No resources listed")
          echo "resources<<EOF" >> $GITHUB_OUTPUT
          echo "$CHANGED_RESOURCES" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT

      - name: Upload Drift Plan
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan-${{ matrix.stack }}-${{ matrix.workspace }}
          path: infra/${{ matrix.stack }}/drift-plan.tfplan
          retention-days: 30

      - name: Create GitHub Issue on Drift
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true' && matrix.priority == 'critical'
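        uses: actions/github-script@v7
        env:
          # Pass plan text through the environment so backticks or ${...} in it cannot break the script
          DRIFT_SUMMARY: ${{ steps.summary.outputs.summary }}
          DRIFT_RESOURCES: ${{ steps.summary.outputs.resources }}
        with:
          script: |
            const stack = '${{ matrix.stack }}';
            const workspace = '${{ matrix.workspace }}';
            const summary = process.env.DRIFT_SUMMARY;
            const resources = process.env.DRIFT_RESOURCES;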

            // Check if issue already exists
            const issues = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'open',
              labels: 'drift-detection'
            });

            const existingIssue = issues.data.find(issue =>
              issue.title.includes(`[${stack}/${workspace}]`)
            );

            const body = `## 🚨 Drift Detected in Infrastructure

            **Stack**: \`${stack}\`
            **Workspace**: \`${workspace}\`
            **Account**: \`${{ matrix.account }}\`
            **Detection Time**: ${new Date().toISOString()}
            **Priority**: ${{ matrix.priority }}

            ### Summary
            \`\`\`
            ${summary}
            \`\`\`

            ### Changed Resources (first 20)
            \`\`\`diff
            ${resources}
            \`\`\`

            ### Action Required
            - [ ] Review drift and determine if expected
            - [ ] If expected: update Terraform to match infrastructure, then apply
            - [ ] If unexpected: investigate who made manual changes (check CloudTrail)
            - [ ] Document decision in this issue
            - [ ] Close issue once remediated

            ### Resources
            - [View Workflow Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})
            - Drift plan artifact: \`drift-plan-${stack}-${workspace}\`

            ### Investigation Commands
            \`\`\`bash
            # Download and inspect the drift plan
            gh run download ${{ github.run_id }} -n drift-plan-${stack}-${workspace}

            # Show the full plan
            cd infra/${stack}
            terraform workspace select ${workspace}
            terraform show drift-plan.tfplan

            # Check CloudTrail for manual changes
            aws cloudtrail lookup-events \\
              --lookup-attributes AttributeKey=ResourceType,AttributeValue=<resource-type> \\
              --max-results 50
            \`\`\`
            `;

            if (existingIssue) {
              // Update existing issue with new comment
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: existingIssue.number,
                body: `### 🔄 Drift Still Present (${new Date().toISOString()})\n\n${body}`
              });

              // Re-open if closed
              if (existingIssue.state === 'closed') {
                await github.rest.issues.update({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: existingIssue.number,
                  state: 'open'
                });
              }
            } else {
              // Create new issue
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `[Drift Detection] ${stack}/${workspace}`,
                body: body,
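                // `matrix` is not a JavaScript variable here, so interpolate the value from the workflow expression
                labels: ['drift-detection', 'infrastructure', 'needs-triage', '${{ matrix.priority }}']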
              });
            }

      - name: Slack Notification (Critical Drift)
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true' && matrix.priority == 'critical'
        run: |
          # Example Slack webhook notification
          SLACK_WEBHOOK="${{ secrets.SLACK_WEBHOOK_INFRA }}"

          if [[ -n "$SLACK_WEBHOOK" ]]; then
            curl -X POST "$SLACK_WEBHOOK" \
              -H 'Content-Type: application/json' \
              -d '{
                "text": "🚨 Critical Infrastructure Drift Detected",
                "blocks": [
                  {
                    "type": "section",
                    "text": {
                      "type": "mrkdwn",
                      "text": "*Drift Detected in Production Infrastructure*\n\n*Stack:* `${{ matrix.stack }}`\n*Workspace:* `${{ matrix.workspace }}`\n*Summary:* ${{ steps.summary.outputs.summary }}"
                    }
                  },
                  {
                    "type": "actions",
                    "elements": [
                      {
                        "type": "button",
                        "text": {
                          "type": "plain_text",
                          "text": "View Workflow"
                        },
                        "url": "https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                      }
                    ]
                  }
                ]
              }'
          fi

  drift-summary:
    needs: drift-check
    runs-on: ubuntu-latest
    if: always()
    steps:
      - name: Generate Drift Summary Report
        uses: actions/github-script@v7
        with:
          script: |
            // Generate aggregate summary of all drift checks
            const results = ${{ toJson(needs.drift-check) }};
            console.log('Drift detection run completed');
            console.log('Results:', results);

            // Optional: Post summary to Slack or create a digest issue
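
The drift check above hinges on the exit status of `terraform plan -detailed-exitcode`, which has three meanings: 0 for a successful plan with no changes, 1 for an error, and 2 for a successful plan with changes present. A small sketch of that mapping, which could help keep plan failures from being reported as "no drift"; the function name `classify_plan_exit` and the labels it prints are illustrative assumptions:

```bash
# classify_plan_exit CODE: maps the exit status of
# `terraform plan -detailed-exitcode` to a short label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;  # plan succeeded, no changes
    2) echo "drift" ;;  # plan succeeded, changes present
    *) echo "error" ;;  # 1 (or anything unexpected): the plan itself failed
  esac
}
```

A step could branch on the label, opening an issue for `drift` and failing the job for `error`, rather than treating every status other than 2 as a clean result.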

Drift Prevention Mechanisms

In addition to detection, implement prevention:

  1. AWS Config Rules

    # Add to bootstrap stack
    resource "aws_config_config_rule" "terraform_managed_only" {
      name = "terraform-managed-resources-only"
    
      source {
        owner             = "AWS"
        source_identifier = "REQUIRED_TAGS"
      }
    
      scope {
        compliance_resource_types = [
          "AWS::EC2::Instance",
          "AWS::RDS::DBInstance",
          "AWS::ECS::Service",
          # Add all critical resource types
        ]
      }
    
      input_parameters = jsonencode({
        tag1Key   = "ManagedBy"
        tag1Value = "Terraform"
      })
    }
  2. IAM Policies (restrict console access to Terraform-managed resources)

    # Deny modification of resources with ManagedBy=Terraform tag
    data "aws_iam_policy_document" "prevent_terraform_resource_modification" {
      statement {
        effect = "Deny"
        actions = [
          "ec2:TerminateInstances",
          "rds:DeleteDBInstance",
          "ecs:UpdateService",
          # Add relevant modify/delete actions
        ]
    
        resources = ["*"]
    
        condition {
          test     = "StringEquals"
          variable = "aws:ResourceTag/ManagedBy"
          values   = ["Terraform"]
        }
      }
    }
  3. CloudTrail Monitoring

    # Current account ID, used to build the role ARN prefix below
    data "aws_caller_identity" "current" {}

    resource "aws_cloudwatch_event_rule" "terraform_resource_modification" {
      name        = "terraform-resource-manual-modification"
      description = "Alert on manual changes to Terraform-managed resources"
    
      event_pattern = jsonencode({
        source      = ["aws.ec2", "aws.rds", "aws.ecs"]
        detail-type = ["AWS API Call via CloudTrail"]
        detail = {
          eventName = [
            "TerminateInstances",
            "ModifyDBInstance",
            "UpdateService"
          ]
          # Exclude GitHub Actions role. Prefix matching is literal,
          # so "*" cannot stand in for the account ID.
          userIdentity = {
            arn = [{
              "anything-but" = {
                prefix = "arn:aws:sts::${data.aws_caller_identity.current.account_id}:assumed-role/GitHubActionsTerraform"
              }
            }]
          }
        }
      })
    }

Additional Best Practices

1. Pre-Commit Validation

Add comprehensive validation that runs on every PR, ahead of the more expensive plans:

# .github/workflows/terraform-validate.yml
name: Terraform Validation

on:
  pull_request:
    paths:
      - '**/*.tf'
      - '**/*.tfvars'

jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        stack:
          - business-workloads
          - events
          - bootstrap
          - untrusted-compute
          - untrusted-compute-control
          - tailscale
          - cluster-permissions
          - kafka-connect
          - shared-workloads
          - single-tenant-workloads
          - synthetics-tests
          - workspaces
          - monitoring
          - state

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.13.3

      - name: Terraform Init (backend=false)
        working-directory: infra/${{ matrix.stack }}
        run: terraform init -backend=false

      - name: Terraform Validate
        working-directory: infra/${{ matrix.stack }}
        run: terraform validate

      - name: TFLint
        uses: terraform-linters/setup-tflint@v4
        with:
          tflint_version: latest

      - name: Run TFLint
        working-directory: infra/${{ matrix.stack }}
        run: |
          tflint --init
          tflint --format=compact

      - name: Checkov Security Scan
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infra/${{ matrix.stack }}
          framework: terraform
          soft_fail: true  # Don't block PRs, just warn
          output_format: github_failed_only

2. Cost Estimation with Infracost

Add cost visibility to PRs:

# Add to terraform-plan.yml
- name: Setup Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Generate cost estimate
  run: |
    infracost breakdown \
      --path=infra/${{ inputs.stack }} \
      --terraform-workspace=${{ inputs.workspace }} \
      --format=json \
      --out-file=/tmp/infracost-base.json

- name: Post cost comment to PR
  if: github.event_name == 'pull_request'
  run: |
    infracost comment github \
      --path=/tmp/infracost-base.json \
      --repo=${{ github.repository }} \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }} \
      --behavior=update

3. Module Dependency Visualization

Enhance documentation with visual dependency graphs:

#!/bin/bash
# scripts/generate-dep-graph.sh

echo "digraph TerraformDeps {" > terraform-deps.dot
echo "  rankdir=LR;" >> terraform-deps.dot
echo "  node [shape=box, style=rounded];" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Add module nodes
echo "  // Modules" >> terraform-deps.dot
echo "  subgraph cluster_modules {" >> terraform-deps.dot
echo "    label=\"Reusable Modules\";" >> terraform-deps.dot
echo "    style=filled;" >> terraform-deps.dot
echo "    color=lightgrey;" >> terraform-deps.dot

for module in modules/*/; do
  module_name=$(basename "$module")
  echo "    \"module:$module_name\" [color=blue];" >> terraform-deps.dot
done

echo "  }" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Add stack nodes
echo "  // Stacks" >> terraform-deps.dot
echo "  subgraph cluster_stacks {" >> terraform-deps.dot
echo "    label=\"Root Modules (Stacks)\";" >> terraform-deps.dot
echo "    style=filled;" >> terraform-deps.dot
echo "    color=lightblue;" >> terraform-deps.dot

for stack in infra/*/; do
  stack_name=$(basename "$stack")
  [[ "$stack_name" == "state" ]] && continue
  echo "    \"stack:$stack_name\" [color=green];" >> terraform-deps.dot
done

echo "  }" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Module -> Module dependencies
echo "  // Module dependencies" >> terraform-deps.dot
jq -r '.modules | to_entries[] | "  \"module:\(.key)\" -> \"module:\(.value[])\";"' \
  < .github/module-deps.json >> terraform-deps.dot 2>/dev/null || true

echo "" >> terraform-deps.dot

# Stack -> Module dependencies
echo "  // Stack dependencies on modules" >> terraform-deps.dot
for stack in infra/*/; do
  stack_name=$(basename "$stack")
  [[ "$stack_name" == "state" ]] && continue

  grep -h "source.*modules/" "$stack"/*.tf 2>/dev/null | \
    sed -n 's/.*modules\/\([^"?\/]*\).*/  "stack:'"$stack_name"'" -> "module:\1";/p' | \
    sort -u >> terraform-deps.dot
done

echo "}" >> terraform-deps.dot

# Generate PNG
dot -Tpng terraform-deps.dot -o terraform-deps.png
echo "Dependency graph generated: terraform-deps.png"

Add to CI:

- name: Generate dependency graph
  run: bash scripts/generate-dep-graph.sh

- name: Upload dependency graph
  uses: actions/upload-artifact@v4
  with:
    name: terraform-dependency-graph
    path: terraform-deps.png

4. Performance Optimizations

# Optimize terraform-plan.yml

# 1. Increase plugin cache effectiveness
- uses: actions/cache@v4
  with:
    path: |
      ~/.terraform.d/plugin-cache
      infra/${{ inputs.stack }}/.terraform/providers
    # Expressions cannot be nested inside hashFiles(), so build the path with format()
    key: terraform-${{ runner.os }}-${{ inputs.stack }}-${{ hashFiles(format('infra/{0}/.terraform.lock.hcl', inputs.stack)) }}
    restore-keys: |
      terraform-${{ runner.os }}-${{ inputs.stack }}-
      terraform-${{ runner.os }}-

# 2. Limit parallel executions to avoid rate limits
strategy:
  max-parallel: 10  # Adjust based on AWS API limits

# 3. Use larger runners for faster execution
runs-on: ubuntu-latest-4-cores  # If available

Migration Plan

Phase 1: Smart Change Detection (Week 1-2)

Goals: Reduce CI time by 60-80%

Tasks:

  1. ✅ Create .github/workflows/terraform-detect-changes.yml
  2. ✅ Update .github/workflows/terraform-trigger.yml to use detection
  3. ✅ Test on feature branch with various change scenarios:
    • Single stack change
    • Module change affecting multiple stacks
    • Multi-stack change
    • No Terraform changes
  4. ✅ Monitor CI execution times and gather metrics
  5. ✅ Adjust matrix generation logic if needed
  6. ✅ Merge to main after 1 week of successful testing

Success Criteria:

  • ✅ CI time reduced by 60%+ for single-stack changes
  • ✅ All dependent stacks correctly identified
  • ✅ No false negatives (missing required plans)
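
The change scenarios listed under task 3 can be exercised locally before touching CI. A minimal sketch that factors the path parsing from the detection step into functions fed with hand-written file lists; the function names are illustrative assumptions, and the paths in the usage note are made-up examples:

```bash
# changed_dirs PREFIX: reads changed paths on stdin and prints the unique
# second path component of every path under PREFIX/ (one name per line).
changed_dirs() {
  grep -E "^$1/[^/]+/" | cut -d'/' -f2 | LC_ALL=C sort -u
}

# Stacks and modules touched directly by a change set
changed_stacks()  { changed_dirs infra; }
changed_modules() { changed_dirs modules; }
```

For example, `printf '%s\n' infra/events/main.tf infra/bootstrap/iam.tf | changed_stacks` lists both stacks, while a change set containing only `README.md` yields no output, which corresponds to the "No Terraform changes" scenario.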

Phase 2: Drift Detection (Week 3-4)

Goals: Proactive drift identification and alerting

Tasks:

  1. ✅ Create .github/workflows/terraform-drift-detection.yml
  2. ✅ Set up GitHub issue label: drift-detection
  3. ✅ Configure Slack webhook for critical alerts (optional)
  4. ✅ Run manually for 1 week to establish baseline:
    • Identify expected vs. unexpected drift
    • Tune alert thresholds
    • Document known drift sources
  5. ✅ Enable nightly schedule for production stacks
  6. ✅ Add weekly schedule for staging stacks
  7. ✅ Create runbook for drift response

Success Criteria:

  • ✅ All production stacks checked nightly
  • ✅ Drift issues created within 5 minutes of detection
  • ✅ Zero false positives after tuning period

Phase 3: Enhancements (Week 5-6)

Goals: Add validation, cost estimation, and documentation

Tasks:

  1. ✅ Add terraform-validate.yml workflow
  2. ✅ Integrate TFLint and Checkov
  3. ✅ (Optional) Add Infracost for cost visibility
  4. ✅ Generate dependency graph visualization
  5. ✅ Document all workflows in repository README
  6. ✅ Create runbooks for common scenarios:
    • Responding to drift alerts
    • Manually triggering full plans
    • Emergency changes
    • Rolling back changes

Success Criteria:

  • ✅ All PRs pass validation before planning
  • ✅ Cost estimates visible on infrastructure PRs
  • ✅ Team trained on new workflows

Summary: Key Decisions

| Decision Area | Recommendation | Rationale |
| --- | --- | --- |
| Plan Strategy | Smart detection: changed stacks + module dependents | 60-80% faster CI, focused reviews, lower cost |
| Plan All Fallback | Manual dispatch + [plan-all] commit message + weekly schedule | Safety net for comprehensive validation |
| Drift Detection Frequency | Nightly (prod) + Weekly (staging) | Early detection without excessive overhead |
| Drift Alerting | GitHub issues + Slack for critical | Trackable, auditable, actionable |
| Drift Prevention | AWS Config + IAM policies + CloudTrail monitoring | Multi-layered defense against manual changes |
| Validation | terraform fmt + validate + TFLint + Checkov | Catch errors before expensive plans |
| Module Versioning | Keep existing tag-modules.yml + Renovate | Working well, no changes needed |
| Cost Visibility | Optional Infracost integration | Helpful for cost-sensitive changes |
| Performance | Plugin caching + parallel limits + larger runners | Optimize execution time and reliability |

Expected Outcomes

Metrics to Track

Before:

  • 🐌 Average PR CI time: ~30 minutes (25+ plans)
  • 💸 GitHub Actions minutes per PR: ~250 minutes
  • 🕒 Time to detect drift: Variable (on next deployment)
  • 📝 PR review complexity: High (20+ plan outputs)

After (projected):

  • ⚡ Average PR CI time: ~6 minutes (3-5 plans)
  • 💰 GitHub Actions minutes per PR: ~50 minutes (80% reduction)
  • 🎯 Time to detect drift: <24 hours (nightly checks)
  • ✅ PR review complexity: Low (only relevant plans)

ROI Calculation

Cost Savings (monthly estimate):

  • GitHub Actions minutes saved: ~40,000 minutes/month
  • Developer time saved (faster PR feedback): ~20 hours/month
  • Incident prevention (drift detection): 1-2 incidents avoided

Time Investment:

  • Initial setup: ~40 hours
  • Ongoing maintenance: ~4 hours/month

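Break-even: Within 2 months once Actions-minute savings and avoided incidents are counted; on developer time alone, 40 hours / (20 - 4) hours per month ≈ 2.5 months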


Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| False negatives (missed dependencies) | High | Keep "plan all" fallback, test thoroughly, use .github/module-deps.json |
| False positives (unnecessary plans) | Low | Better to over-plan than under-plan |
| Drift alert fatigue | Medium | Tune alert thresholds, fix underlying issues, separate critical/normal |
| GitHub API rate limits | Low | Use max-parallel limits, spread checks throughout day |
| Initial setup complexity | Medium | Phased rollout, thorough testing, comprehensive documentation |
| Team adoption | Medium | Training sessions, runbooks, gradual rollout |
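
The false-negative risk is largest when one reusable module depends on another, because a grep over stack sources only finds direct references. A minimal sketch of expanding a set of changed modules to everything that transitively depends on them; it assumes a plain-text edge file with one `<module> <dependency>` pair per line (a hypothetical format that could be exported from .github/module-deps.json), and the function name is likewise an assumption:

```bash
# transitive_dependents EDGES_FILE MODULE...
# EDGES_FILE holds "<module> <dependency>" pairs, one per line.
# Prints the given modules plus every module that depends on them,
# directly or indirectly, one per line. Uses global variables for portability.
transitive_dependents() {
  edges_file=$1
  shift
  affected=" $* "   # space-padded so whole names can be matched
  changed=1
  while [ "$changed" -eq 1 ]; do
    changed=0
    while read -r mod dep; do
      [ -n "$mod" ] && [ -n "$dep" ] || continue
      case "$affected" in
        *" $mod "*) continue ;;   # already recorded
      esac
      case "$affected" in
        *" $dep "*) affected="$affected$mod "; changed=1 ;;
      esac
    done < "$edges_file"
  done
  # Intentionally unquoted: split the list into one name per line
  printf '%s\n' $affected | LC_ALL=C sort -u
}
```

Feeding the result into the dependent-stack lookup instead of the raw list of changed modules would extend planning to stacks that are only reachable through an intermediate module.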

Conclusion

Your Terraform repository already has a solid foundation with good separation of concerns, workspace isolation, and automated module versioning. These recommendations optimize for:

  1. Speed: Smart change detection reduces CI time by 60-80%
  2. Safety: Drift detection catches manual changes within 24 hours
  3. Cost: Reduced GitHub Actions minutes and faster developer feedback
  4. Maintainability: Clear workflows, automated alerts, comprehensive documentation

The key insight is to plan intelligently, not exhaustively, while maintaining safety nets (manual "plan all", weekly full checks, drift detection). This balances speed with safety and developer experience with operational reliability.

Next Steps

  1. Review this proposal with your team
  2. Prioritize which recommendations to implement first
  3. Start with Phase 1 (smart change detection) for immediate wins
  4. Gradually add drift detection and enhancements
  5. Measure and iterate based on real-world results

Document Version: 1.0
Last Updated: 2026-01-09
Feedback: Share experiences and improvements with the team
