Proposal for optimizing CI/CD workflows for a multi-stack, multi-account Terraform repository on GitHub Actions
This document proposes best practices for managing Terraform infrastructure modules, with a focus on:
- Smart change detection: Plan only changed modules + dependents (60-80% faster CI)
- Efficient drift detection: Multi-tiered nightly checks with automated alerting
- Safety mechanisms: Preserved while improving speed and developer experience
- 14 root modules (stacks) in infra/ managing different AWS services
- 9 reusable modules in modules/ with semantic versioning
- 25+ stack/workspace combinations across 7+ AWS accounts
- Terraform 1.13.3 with S3 backend and workspace-based isolation
- GitHub Actions with OIDC authentication to AWS
- terraform-trigger.yml: Dispatches all 25+ combinations on every PR/push
- terraform-plan.yml: Reusable workflow for plan/apply operations
- terraform-lint.yml: Enforces terraform fmt standards
- tag-modules.yml: Auto-versions modules on changes
- renovate.yml: Automated dependency updates
- business-workloads: Business workload infrastructure (ECS)
- untrusted-compute: UC data planes (ECS zones)
- untrusted-compute-control: UC control plane (EKS)
- events: MSK Kafka cluster
- bootstrap: Account initialization, OIDC roles
- tailscale: VPN subnet routers
- single-tenant-workloads: Customer-specific deployments
- shared-workloads: Customer VM infrastructure

- Every PR triggers 25+ plans: Takes 30+ minutes even for single-file changes
- Noisy PR comments: 20+ plan outputs make reviews difficult
- High cost: Wastes GitHub Actions minutes
- No drift detection: Manual changes go unnoticed until next deployment
- Difficult to focus: Hard to identify which changes are relevant
Instead of running all 25+ combinations on every PR, detect:
- Which stacks have direct file changes
- Which modules have changed
- Which stacks depend on those modules
- Run plans ONLY for affected stacks
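The detection steps above can be dry-run locally before they are wired into a workflow. The following is a minimal sketch under assumptions that are not in the original: the function name `affected_stacks` is hypothetical, changed paths arrive on stdin (for example from `git diff --name-only`), and stacks reference modules through a `source` string containing `modules/<name>`. It follows direct stack-to-module references only.

```shell
#!/bin/sh
# affected_stacks ROOT: reads changed paths on stdin and prints the affected
# stack names, one per line, sorted and unique. Hypothetical helper; ROOT is
# the checkout that contains infra/ and modules/.
affected_stacks() (
  root=${1:-.}
  changed=$(cat)

  # Stacks with direct file changes: infra/<stack>/...
  direct=$(printf '%s\n' "$changed" | sed -n 's|^infra/\([^/]*\)/.*|\1|p')

  # Modules with file changes: modules/<module>/...
  modules=$(printf '%s\n' "$changed" | sed -n 's|^modules/\([^/]*\)/.*|\1|p' | sort -u)

  # Stacks whose files reference a changed module in a source line.
  # The trailing ["/?] stops "vpc" from also matching "vpc-peering".
  dependents=""
  for module in $modules; do
    hits=$(cd "$root" && grep -rlE "source.*modules/${module}[\"/?]" infra 2>/dev/null \
      | cut -d/ -f2)
    dependents="$dependents
$hits"
  done

  printf '%s\n%s\n' "$direct" "$dependents" | sed '/^$/d' | sort -u
)
```

A possible invocation is `git diff --name-only origin/main...HEAD | affected_stacks .`, where the base branch name is an assumption.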
Create .github/workflows/terraform-detect-changes.yml:
```yaml
name: Detect Terraform Changes

# Reusable workflow: terraform-trigger.yml calls it through `uses:`, which
# requires the workflow_call trigger and workflow-level outputs.
on:
  workflow_call:
    outputs:
      matrix:
        description: JSON matrix of affected stacks
        value: ${{ jobs.detect-changes.outputs.matrix }}
      has_changes:
        description: "'true' when at least one stack is affected"
        value: ${{ jobs.detect-changes.outputs.has_changes }}

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.generate-matrix.outputs.matrix }}
      has_changes: ${{ steps.generate-matrix.outputs.has_changes }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history for change detection

      - name: Detect changed stacks and modules
        id: generate-matrix
        run: |
          # Get changed files (github.base_ref is only set on pull_request events)
          CHANGED_FILES=$(git diff --name-only origin/${{ github.base_ref }}...HEAD)

          # Parse changed stacks
          CHANGED_STACKS=$(echo "$CHANGED_FILES" | grep -E '^infra/[^/]+/' | cut -d'/' -f2 | sort -u)

          # Parse changed modules
          CHANGED_MODULES=$(echo "$CHANGED_FILES" | grep -E '^modules/[^/]+/' | cut -d'/' -f2 | sort -u)

          # Find dependent stacks using grep
          DEPENDENT_STACKS=""
          for module in $CHANGED_MODULES; do
            # Find all stacks referencing this module; the trailing ["/?]
            # keeps "vpc" from also matching "vpc-peering"
            DEPS=$(grep -rlE "source.*modules/${module}[\"/?]" infra/ | cut -d'/' -f2 | sort -u)
            DEPENDENT_STACKS="$DEPENDENT_STACKS $DEPS"
          done

          # Combine and deduplicate (|| true: grep exits 1 when nothing is left)
          ALL_AFFECTED_STACKS=$(echo "$CHANGED_STACKS $DEPENDENT_STACKS" | tr ' ' '\n' | sort -u | grep -v '^$' || true)

          # Generate matrix (filter terraform-trigger.yml matrix by affected stacks)
          # NOTE: this emits stack names only. The plan job also reads
          # matrix.workspace and matrix.account, so the full matrix still has
          # to be filtered by these names before it is used.
          MATRIX_JSON=$(echo "$ALL_AFFECTED_STACKS" | jq -R -s -c 'split("\n") | map(select(length > 0))')
          echo "matrix={\"stack\":$MATRIX_JSON}" >> "$GITHUB_OUTPUT"

          if [[ -n "$ALL_AFFECTED_STACKS" ]]; then
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
          else
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
          fi
```

Update .github/workflows/terraform-trigger.yml to use detection:

```yaml
name: Terraform CI/CD

on:
  pull_request:
    branches: [main]
    paths:
      - 'infra/**/*.tf'
      - 'infra/**/*.tfvars'
      - 'modules/**/*.tf'
      - '.terraform-version'
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      scope:
        description: 'Scope to plan/apply'
        required: true
        type: choice
        options:
          - changed-only
          - all-stacks
        default: 'changed-only'

jobs:
  # Use smart detection for PRs
  detect-changes:
    if: |
      github.event_name == 'pull_request' ||
      (github.event_name == 'workflow_dispatch' && inputs.scope == 'changed-only')
    uses: ./.github/workflows/terraform-detect-changes.yml

  # Plan changed stacks
  plan-changed:
    needs: detect-changes
    if: needs.detect-changes.outputs.has_changes == 'true'
    strategy:
      fail-fast: false
      matrix: ${{ fromJson(needs.detect-changes.outputs.matrix) }}
    uses: ./.github/workflows/terraform-plan.yml
    with:
      workspace: ${{ matrix.workspace }}
      stack: ${{ matrix.stack }}
      account: ${{ matrix.account }}
      concurrency: ${{ matrix.stack }}-${{ matrix.workspace }}
    secrets: inherit

  # Plan all stacks (manual trigger or [plan-all] in commit message)
  plan-all:
    if: |
      (github.event_name == 'workflow_dispatch' && inputs.scope == 'all-stacks') ||
      contains(github.event.head_commit.message, '[plan-all]')
    strategy:
      fail-fast: false
      matrix:
        # Your existing full matrix (25+ combinations)
        include:
          - { workspace: prod, stack: business-workloads, account: business-workloads }
          - { workspace: bws, stack: business-workloads, account: business-workloads-staging }
          # ... all 25+ combinations
    uses: ./.github/workflows/terraform-plan.yml
    with:
      workspace: ${{ matrix.workspace }}
      stack: ${{ matrix.stack }}
      account: ${{ matrix.account }}
      concurrency: ${{ matrix.stack }}-${{ matrix.workspace }}
    secrets: inherit
```

- 60-80% faster CI for typical single-stack changes
- Focused PR reviews: Only see plans for affected stacks
- Cost reduction: Fewer GitHub Actions minutes consumed
- Still safe: All dependents are automatically included
- Safety net: Manual "plan all" option always available
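The detection step produces stack names, while the plan job reads `matrix.workspace` and `matrix.account`, so the names still have to be joined against the full stack/workspace/account list. A minimal sketch of that join, assuming the list is kept in a plain-text file with one `stack workspace account` triple per line; the file layout and the function name `expand_matrix` are hypothetical and not part of the repository:

```shell
#!/bin/sh
# expand_matrix MATRIX_FILE: reads affected stack names on stdin and prints a
# GitHub Actions matrix as JSON:
#   {"include":[{"stack":"..","workspace":"..","account":".."}, ...]}
# MATRIX_FILE lines look like: business-workloads prod business-workloads
expand_matrix() (
  matrix_file=$1
  wanted=$(tr '\n' ' ')   # stack names from stdin, space-separated
  awk -v wanted_list="$wanted" '
    BEGIN {
      # Build a lookup set of the affected stack names
      n = split(wanted_list, names, " ")
      for (i = 1; i <= n; i++) wanted[names[i]] = 1
    }
    # Keep complete rows whose stack (first column) is affected
    NF >= 3 && ($1 in wanted) {
      row = sprintf("{\"stack\":\"%s\",\"workspace\":\"%s\",\"account\":\"%s\"}", $1, $2, $3)
      out = (out == "" ? row : out "," row)
    }
    END { printf "{\"include\":[%s]}\n", out }
  ' "$matrix_file"
)
```

With no affected stacks the function prints an empty `include` list, so the plan job should stay guarded by `has_changes`, since GitHub Actions rejects an empty matrix.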
Keep full planning for:
- Manual workflow dispatch (user selects "all-stacks")
- Commit message contains [plan-all]
- Changes to .terraform-version
- Changes to provider version constraints
- Changes to backend configuration
- Weekly scheduled runs (for drift detection)
- Release branches
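The file-based conditions in this list (.terraform-version, provider constraints, backend configuration) can be checked from the changed paths alone; the event-based ones (dispatch, commit message, schedule, release branches) belong in workflow `if:` expressions. A minimal sketch of the file-based check, assuming provider constraints live in `versions.tf` and backend settings in `backend.tf`; both file names and the function name `needs_full_plan` are assumptions, not taken from the repository:

```shell
#!/bin/sh
# needs_full_plan: reads changed paths on stdin; exits 0 when any path should
# force a plan of every stack, 1 otherwise.
# versions.tf and backend.tf are assumed file names; adjust to the repository.
needs_full_plan() {
  grep -Eq '(^|/)\.terraform-version$|(^|/)versions\.tf$|(^|/)backend\.tf$'
}

# Example use inside a detection step:
#   if git diff --name-only "origin/main...HEAD" | needs_full_plan; then
#     echo "scope=all-stacks"
#   else
#     echo "scope=changed-only"
#   fi
```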
Drift detection with different frequencies based on criticality:
- 🔴 Critical stacks (production): Every night
- 🟡 Normal stacks (staging): Weekly (Mondays)
- 🟢 Development: On-demand only
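Whatever the tier, each check rests on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error and 2 when changes are present. One pitfall is that piping the plan through `tee` makes `$?` report the status of `tee`, not of Terraform. A minimal sketch that sidesteps the pipe by logging to a file first; the function names `classify_plan_exit` and `run_plan_logged` are hypothetical:

```shell
#!/bin/sh
# classify_plan_exit CODE: maps a `terraform plan -detailed-exitcode` status
# to a label. 0 = no changes, 2 = changes present (drift), anything else = error.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# run_plan_logged LOGFILE CMD...: runs CMD, saves its output to LOGFILE, prints
# the output, and returns the exit status of CMD itself.
run_plan_logged() {
  log=$1
  shift
  status=0
  "$@" > "$log" 2>&1 || status=$?   # "|| status=$?" also survives `set -e`
  cat "$log"
  return "$status"
}

# Example use (the plan arguments are illustrative):
#   code=0
#   run_plan_logged plan-output.txt terraform plan -detailed-exitcode -no-color || code=$?
#   echo "result=$(classify_plan_exit "$code")"
```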
Create .github/workflows/terraform-drift-detection.yml:
```yaml
name: Drift Detection

on:
  schedule:
    # Run nightly at 3am UTC (after Renovate completes)
    - cron: '0 3 * * *'
  workflow_dispatch:
    inputs:
      scope:
        description: 'Scope of drift check'
        required: true
        type: choice
        options:
          - all
          - production-only
          - critical-stacks
        default: 'all'

# OIDC role assumption needs id-token; issue creation needs issues
permissions:
  id-token: write
  contents: read
  issues: write

jobs:
  drift-check:
    strategy:
      fail-fast: false # Continue checking all stacks even if one drifts
      max-parallel: 5 # Avoid AWS API rate limits
      matrix:
        # Tiered approach
        include:
          # Production workloads (check nightly)
          - { stack: business-workloads, workspace: prod, account: business-workloads, priority: critical }
          - { stack: untrusted-compute, workspace: uc, account: untrusted-compute, priority: critical }
          - { stack: untrusted-compute-control, workspace: uc-control-use2-a, account: untrusted-compute, priority: critical }
          - { stack: events, workspace: bw, account: business-workloads, priority: critical }
          # Infrastructure foundations (check nightly)
          - { stack: bootstrap, workspace: bw, account: business-workloads, priority: critical }
          - { stack: bootstrap, workspace: uc, account: untrusted-compute, priority: critical }
          # Staging environments (check weekly - Monday only)
          - { stack: business-workloads, workspace: bws, account: business-workloads-staging, priority: normal, day: 1 }
          - { stack: untrusted-compute, workspace: ucs, account: untrusted-compute-staging, priority: normal, day: 1 }
          - { stack: bootstrap, workspace: bws, account: business-workloads-staging, priority: normal, day: 1 }
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Skip non-critical stacks on wrong day
      - name: Check if should run
        id: should-run
        run: |
          DAY_OF_WEEK=$(date +%u) # 1=Monday, 7=Sunday
          MATRIX_DAY="${{ matrix.day || 0 }}"
          if [[ "${{ inputs.scope }}" == "production-only" && "${{ matrix.priority }}" != "critical" ]]; then
            echo "skip=true" >> "$GITHUB_OUTPUT"
          elif [[ "$MATRIX_DAY" -ne 0 && "$DAY_OF_WEEK" -ne "$MATRIX_DAY" ]]; then
            echo "skip=true" >> "$GITHUB_OUTPUT"
          else
            echo "skip=false" >> "$GITHUB_OUTPUT"
          fi

      - name: Configure AWS Credentials
        if: steps.should-run.outputs.skip == 'false'
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # matrix.account holds an account name in the matrix above; a role
          # ARN needs the 12-digit account ID, so map name to ID before use
          role-to-assume: arn:aws:iam::${{ matrix.account }}:role/GitHubActionsTerraformPlan
          aws-region: us-east-2
          role-session-name: drift-check-${{ github.run_id }}

      - name: Setup Terraform
        if: steps.should-run.outputs.skip == 'false'
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.13.3

      - name: Terraform Init
        if: steps.should-run.outputs.skip == 'false'
        working-directory: infra/${{ matrix.stack }}
        run: terraform init

      - name: Select Workspace
        if: steps.should-run.outputs.skip == 'false'
        working-directory: infra/${{ matrix.stack }}
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Plan (Drift Detection)
        if: steps.should-run.outputs.skip == 'false'
        id: plan
        working-directory: infra/${{ matrix.stack }}
        run: |
          terraform plan \
            -var-file=vars/${{ matrix.workspace }}.tfvars \
            -detailed-exitcode \
            -out=drift-plan.tfplan \
            -no-color | tee plan-output.txt
          # $? would be tee's status; PIPESTATUS[0] is terraform's
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"

          # Exit code 2 means changes detected (drift)
          if [[ $EXIT_CODE -eq 2 ]]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
          else
            echo "drift_detected=false" >> "$GITHUB_OUTPUT"
          fi
        continue-on-error: true

      - name: Parse Drift Summary
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true'
        id: summary
        working-directory: infra/${{ matrix.stack }}
        run: |
          # Extract the "Plan: X to add, Y to change, Z to destroy." line
          SUMMARY=$(grep -m1 '^Plan:' plan-output.txt || echo "Unable to parse")
          echo "summary=$SUMMARY" >> "$GITHUB_OUTPUT"

          # Extract changed resources (first 20)
          CHANGED_RESOURCES=$(grep -E "^\s+[~+-]" plan-output.txt | head -20 || echo "No resources listed")
          echo "resources<<EOF" >> "$GITHUB_OUTPUT"
          echo "$CHANGED_RESOURCES" >> "$GITHUB_OUTPUT"
          echo "EOF" >> "$GITHUB_OUTPUT"

      - name: Upload Drift Plan
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan-${{ matrix.stack }}-${{ matrix.workspace }}
          path: infra/${{ matrix.stack }}/drift-plan.tfplan
          retention-days: 30

      - name: Create GitHub Issue on Drift
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true' && matrix.priority == 'critical'
        uses: actions/github-script@v7
        env:
          # Plan text goes through the environment so that backticks or ${...}
          # in the output cannot break the script
          SUMMARY: ${{ steps.summary.outputs.summary }}
          RESOURCES: ${{ steps.summary.outputs.resources }}
        with:
          script: |
            const stack = '${{ matrix.stack }}';
            const workspace = '${{ matrix.workspace }}';
            const summary = process.env.SUMMARY;
            const resources = process.env.RESOURCES;

            // Check if issue already exists (state 'all' so that a closed
            // issue can be found and re-opened below)
            const issues = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'all',
              labels: 'drift-detection'
            });
            const existingIssue = issues.data.find(issue =>
              issue.title.includes(`[${stack}/${workspace}]`)
            );

            const body = `## 🚨 Drift Detected in Infrastructure

            **Stack**: \`${stack}\`
            **Workspace**: \`${workspace}\`
            **Account**: \`${{ matrix.account }}\`
            **Detection Time**: ${new Date().toISOString()}
            **Priority**: ${{ matrix.priority }}

            ### Summary
            \`\`\`
            ${summary}
            \`\`\`

            ### Changed Resources (first 20)
            \`\`\`diff
            ${resources}
            \`\`\`

            ### Action Required
            - [ ] Review drift and determine if expected
            - [ ] If expected: update Terraform to match infrastructure, then apply
            - [ ] If unexpected: investigate who made manual changes (check CloudTrail)
            - [ ] Document decision in this issue
            - [ ] Close issue once remediated

            ### Resources
            - [View Workflow Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})
            - Drift plan artifact: \`drift-plan-${stack}-${workspace}\`

            ### Investigation Commands
            \`\`\`bash
            # Download and inspect the drift plan
            gh run download ${{ github.run_id }} -n drift-plan-${stack}-${workspace}

            # Show the full plan
            cd infra/${stack}
            terraform workspace select ${workspace}
            terraform show drift-plan.tfplan

            # Check CloudTrail for manual changes
            aws cloudtrail lookup-events \\
              --lookup-attributes AttributeKey=ResourceType,AttributeValue=<resource-type> \\
              --max-results 50
            \`\`\`
            `;

            if (existingIssue) {
              // Update existing issue with new comment
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: existingIssue.number,
                body: `### Drift Still Present (${new Date().toISOString()})\n\n${body}`
              });
              // Re-open if closed
              if (existingIssue.state === 'closed') {
                await github.rest.issues.update({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: existingIssue.number,
                  state: 'open'
                });
              }
            } else {
              // Create new issue
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `[Drift Detection] ${stack}/${workspace}`,
                body: body,
                labels: ['drift-detection', 'infrastructure', 'needs-triage', '${{ matrix.priority }}']
              });
            }

      - name: Slack Notification (Critical Drift)
        if: steps.should-run.outputs.skip == 'false' && steps.plan.outputs.drift_detected == 'true' && matrix.priority == 'critical'
        run: |
          # Example Slack webhook notification
          SLACK_WEBHOOK="${{ secrets.SLACK_WEBHOOK_INFRA }}"
          if [[ -n "$SLACK_WEBHOOK" ]]; then
            curl -X POST "$SLACK_WEBHOOK" \
              -H 'Content-Type: application/json' \
              -d '{
                "text": "🚨 Critical Infrastructure Drift Detected",
                "blocks": [
                  {
                    "type": "section",
                    "text": {
                      "type": "mrkdwn",
                      "text": "*Drift Detected in Production Infrastructure*\n\n*Stack:* `${{ matrix.stack }}`\n*Workspace:* `${{ matrix.workspace }}`\n*Summary:* ${{ steps.summary.outputs.summary }}"
                    }
                  },
                  {
                    "type": "actions",
                    "elements": [
                      {
                        "type": "button",
                        "text": {
                          "type": "plain_text",
                          "text": "View Workflow"
                        },
                        "url": "https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                      }
                    ]
                  }
                ]
              }'
          fi

  drift-summary:
    needs: drift-check
    runs-on: ubuntu-latest
    if: always()
    steps:
      - name: Generate Drift Summary Report
        uses: actions/github-script@v7
        with:
          script: |
            // Generate aggregate summary of all drift checks
            const results = ${{ toJson(needs.drift-check) }};
            console.log('Drift detection run completed');
            console.log('Results:', results);
            // Optional: Post summary to Slack or create a digest issue
```

In addition to detection, implement prevention:
- AWS Config Rules

  ```hcl
  # Add to bootstrap stack
  resource "aws_config_config_rule" "terraform_managed_only" {
    name = "terraform-managed-resources-only"

    source {
      owner             = "AWS"
      source_identifier = "REQUIRED_TAGS"
    }

    scope {
      compliance_resource_types = [
        "AWS::EC2::Instance",
        "AWS::RDS::DBInstance",
        "AWS::ECS::Service",
        # Add all critical resource types
      ]
    }

    input_parameters = jsonencode({
      tag1Key   = "ManagedBy"
      tag1Value = "Terraform"
    })
  }
  ```

- IAM Policies (restrict console access to Terraform-managed resources)

  ```hcl
  # Deny modification of resources with ManagedBy=Terraform tag
  data "aws_iam_policy_document" "prevent_terraform_resource_modification" {
    statement {
      effect = "Deny"
      actions = [
        "ec2:TerminateInstances",
        "rds:DeleteDBInstance",
        "ecs:UpdateService",
        # Add relevant modify/delete actions
      ]
      resources = ["*"]

      condition {
        test     = "StringEquals"
        variable = "aws:ResourceTag/ManagedBy"
        values   = ["Terraform"]
      }
    }
  }
  ```

- CloudTrail Monitoring

  ```hcl
  resource "aws_cloudwatch_event_rule" "terraform_resource_modification" {
    name        = "terraform-resource-manual-modification"
    description = "Alert on manual changes to Terraform-managed resources"

    event_pattern = jsonencode({
      source      = ["aws.ec2", "aws.rds", "aws.ecs"]
      detail-type = ["AWS API Call via CloudTrail"]
      detail = {
        eventName = [
          "TerminateInstances",
          "ModifyDBInstance",
          "UpdateService"
        ]
        # Exclude GitHub Actions role
        userIdentity = {
          arn = [{
            "anything-but" = {
              prefix = "arn:aws:sts::*:assumed-role/GitHubActionsTerraform"
            }
          }]
        }
      }
    })
  }
  ```
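One caveat on the exclusion in the event pattern: EventBridge prefix matching compares the string literally, so the `*` in the account position is not treated as a wildcard there. Where alerts are triaged by a script (for example over CloudTrail lookup results), a shell glob can express "this role in any account". A minimal sketch; the function name `is_pipeline_identity` is hypothetical:

```shell
#!/bin/sh
# is_pipeline_identity ARN: exits 0 when ARN is an assumed-role session of a
# role whose name starts with GitHubActionsTerraform, in any account.
is_pipeline_identity() {
  case "$1" in
    arn:aws:sts::[0-9]*:assumed-role/GitHubActionsTerraform*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use: keep only events that did not come from the pipeline
#   if ! is_pipeline_identity "$event_arn"; then echo "manual change by $event_arn"; fi
```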
Add comprehensive validation before CI runs:
```yaml
# .github/workflows/terraform-validate.yml
name: Terraform Validation

on:
  pull_request:
    paths:
      - '**/*.tf'
      - '**/*.tfvars'

jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        stack:
          - business-workloads
          - events
          - bootstrap
          - untrusted-compute
          - untrusted-compute-control
          - tailscale
          - cluster-permissions
          - kafka-connect
          - shared-workloads
          - single-tenant-workloads
          - synthetics-tests
          - workspaces
          - monitoring
          - state
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.13.3

      - name: Terraform Init (backend=false)
        working-directory: infra/${{ matrix.stack }}
        run: terraform init -backend=false

      - name: Terraform Validate
        working-directory: infra/${{ matrix.stack }}
        run: terraform validate

      - name: TFLint
        uses: terraform-linters/setup-tflint@v4
        with:
          tflint_version: latest

      - name: Run TFLint
        working-directory: infra/${{ matrix.stack }}
        run: |
          tflint --init
          tflint --format=compact

      - name: Checkov Security Scan
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infra/${{ matrix.stack }}
          framework: terraform
          soft_fail: true # Don't block PRs, just warn
          output_format: github_failed_only
```

Add cost visibility to PRs:
```yaml
# Add to terraform-plan.yml
- name: Setup Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Generate cost estimate
  run: |
    infracost breakdown \
      --path=infra/${{ inputs.stack }} \
      --terraform-workspace=${{ inputs.workspace }} \
      --format=json \
      --out-file=/tmp/infracost-base.json

- name: Post cost comment to PR
  if: github.event_name == 'pull_request'
  run: |
    infracost comment github \
      --path=/tmp/infracost-base.json \
      --repo=${{ github.repository }} \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }} \
      --behavior=update
```

Enhance documentation with visual dependency graphs:
```bash
#!/bin/bash
# scripts/generate-dep-graph.sh

echo "digraph TerraformDeps {" > terraform-deps.dot
echo "  rankdir=LR;" >> terraform-deps.dot
echo "  node [shape=box, style=rounded];" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Add module nodes
echo "  // Modules" >> terraform-deps.dot
echo "  subgraph cluster_modules {" >> terraform-deps.dot
echo "    label=\"Reusable Modules\";" >> terraform-deps.dot
echo "    style=filled;" >> terraform-deps.dot
echo "    color=lightgrey;" >> terraform-deps.dot
for module in modules/*/; do
  module_name=$(basename "$module")
  echo "    \"module:$module_name\" [color=blue];" >> terraform-deps.dot
done
echo "  }" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Add stack nodes
echo "  // Stacks" >> terraform-deps.dot
echo "  subgraph cluster_stacks {" >> terraform-deps.dot
echo "    label=\"Root Modules (Stacks)\";" >> terraform-deps.dot
echo "    style=filled;" >> terraform-deps.dot
echo "    color=lightblue;" >> terraform-deps.dot
for stack in infra/*/; do
  stack_name=$(basename "$stack")
  [[ "$stack_name" == "state" ]] && continue
  echo "    \"stack:$stack_name\" [color=green];" >> terraform-deps.dot
done
echo "  }" >> terraform-deps.dot
echo "" >> terraform-deps.dot

# Module -> Module dependencies
echo "  // Module dependencies" >> terraform-deps.dot
jq -r '.modules | to_entries[] | "  \"module:\(.key)\" -> \"module:\(.value[])\";"' \
  < .github/module-deps.json >> terraform-deps.dot 2>/dev/null || true
echo "" >> terraform-deps.dot

# Stack -> Module dependencies
echo "  // Stack dependencies on modules" >> terraform-deps.dot
for stack in infra/*/; do
  stack_name=$(basename "$stack")
  [[ "$stack_name" == "state" ]] && continue
  grep -h "source.*modules/" "$stack"/*.tf 2>/dev/null | \
    sed -n 's/.*modules\/\([^"?\/]*\).*/  "stack:'"$stack_name"'" -> "module:\1";/p' | \
    sort -u >> terraform-deps.dot
done
echo "}" >> terraform-deps.dot

# Generate PNG
dot -Tpng terraform-deps.dot -o terraform-deps.png
echo "Dependency graph generated: terraform-deps.png"
```

Add to CI:
```yaml
- name: Generate dependency graph
  run: bash scripts/generate-dep-graph.sh

- name: Upload dependency graph
  uses: actions/upload-artifact@v4
  with:
    name: terraform-dependency-graph
    path: terraform-deps.png
```

```yaml
# Optimize terraform-plan.yml (fragments to merge into the existing workflow)

# 1. Increase plugin cache effectiveness
- uses: actions/cache@v4
  with:
    path: |
      ~/.terraform.d/plugin-cache
      infra/${{ inputs.stack }}/.terraform/providers
    # Expressions cannot be nested, so build the path with format()
    key: terraform-${{ runner.os }}-${{ inputs.stack }}-${{ hashFiles(format('infra/{0}/.terraform.lock.hcl', inputs.stack)) }}
    restore-keys: |
      terraform-${{ runner.os }}-${{ inputs.stack }}-
      terraform-${{ runner.os }}-

# 2. Limit parallel executions to avoid rate limits
strategy:
  max-parallel: 10 # Adjust based on AWS API limits

# 3. Use larger runners for faster execution
runs-on: ubuntu-latest-4-cores # If available
```

Goals: Reduce CI time by 60-80%
Tasks:
- Create .github/workflows/terraform-detect-changes.yml
- Update .github/workflows/terraform-trigger.yml to use detection
- Test on feature branch with various change scenarios:
  - Single stack change
  - Module change affecting multiple stacks
  - Multi-stack change
  - No Terraform changes
- Monitor CI execution times and gather metrics
- Adjust matrix generation logic if needed
- Merge to main after 1 week of successful testing

Success Criteria:
- CI time reduced by 60%+ for single-stack changes
- All dependent stacks correctly identified
- No false negatives (missing required plans)
Goals: Proactive drift identification and alerting
Tasks:
- Create .github/workflows/terraform-drift-detection.yml
- Set up GitHub issue label: drift-detection
- Configure Slack webhook for critical alerts (optional)
- Run manually for 1 week to establish baseline:
  - Identify expected vs. unexpected drift
  - Tune alert thresholds
  - Document known drift sources
- Enable nightly schedule for production stacks
- Add weekly schedule for staging stacks
- Create runbook for drift response

Success Criteria:
- All production stacks checked nightly
- Drift issues created within 5 minutes of detection
- Zero false positives after tuning period
Goals: Add validation, cost estimation, and documentation
Tasks:
- Add terraform-validate.yml workflow
- Integrate TFLint and Checkov
- (Optional) Add Infracost for cost visibility
- Generate dependency graph visualization
- Document all workflows in repository README
- Create runbooks for common scenarios:
  - Responding to drift alerts
  - Manually triggering full plans
  - Emergency changes
  - Rolling back changes

Success Criteria:
- All PRs pass validation before planning
- Cost estimates visible on infrastructure PRs
- Team trained on new workflows
| Decision Area | Recommendation | Rationale |
|---|---|---|
| Plan Strategy | Smart detection: changed stacks + module dependents | 60-80% faster CI, focused reviews, lower cost |
| Plan All Fallback | Manual dispatch + [plan-all] commit message + weekly schedule | Safety net for comprehensive validation |
| Drift Detection Frequency | Nightly (prod) + Weekly (staging) | Early detection without excessive overhead |
| Drift Alerting | GitHub issues + Slack for critical | Trackable, auditable, actionable |
| Drift Prevention | AWS Config + IAM policies + CloudTrail monitoring | Multi-layered defense against manual changes |
| Validation | terraform fmt + validate + TFLint + Checkov | Catch errors before expensive plans |
| Module Versioning | Keep existing tag-modules.yml + Renovate | Working well, no changes needed |
| Cost Visibility | Optional Infracost integration | Helpful for cost-sensitive changes |
| Performance | Plugin caching + parallel limits + larger runners | Optimize execution time and reliability |
Before:
- Average PR CI time: ~30 minutes (25+ plans)
- GitHub Actions minutes per PR: ~250 minutes
- Time to detect drift: Variable (on next deployment)
- PR review complexity: High (20+ plan outputs)

After (projected):
- Average PR CI time: ~6 minutes (3-5 plans)
- GitHub Actions minutes per PR: ~50 minutes (80% reduction)
- Time to detect drift: <24 hours (nightly checks)
- PR review complexity: Low (only relevant plans)
Cost Savings (monthly estimate):
- GitHub Actions minutes saved: ~40,000 minutes/month
- Developer time saved (faster PR feedback): ~20 hours/month
- Incident prevention (drift detection): 1-2 incidents avoided
Time Investment:
- Initial setup: ~40 hours
- Ongoing maintenance: ~4 hours/month
Break-even: Within 2 months
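The arithmetic behind these estimates can be checked with a few lines of shell. This is a sketch that only rearranges the figures quoted above; the function name `ci_savings_summary` is hypothetical, and the implied PR volume is derived rather than stated in the proposal.

```shell
#!/bin/sh
# ci_savings_summary: prints figures derived from the estimates in this section.
ci_savings_summary() (
  minutes_before=250          # Actions minutes per PR, before
  minutes_after=50            # Actions minutes per PR, after (projected)
  minutes_saved_month=40000   # estimated Actions minutes saved per month
  setup_hours=40              # initial setup
  dev_hours_saved_month=20    # developer time saved per month
  maintenance_hours_month=4   # ongoing maintenance per month

  saved_per_pr=$((minutes_before - minutes_after))
  reduction_pct=$((saved_per_pr * 100 / minutes_before))
  implied_prs_month=$((minutes_saved_month / saved_per_pr))
  net_hours_month=$((dev_hours_saved_month - maintenance_hours_month))
  # Ceiling division: whole months until net hours cover the setup hours
  breakeven_months=$(( (setup_hours + net_hours_month - 1) / net_hours_month ))

  echo "minutes saved per PR: $saved_per_pr"
  echo "reduction: ${reduction_pct}%"
  echo "implied PRs per month: $implied_prs_month"
  echo "net developer hours per month: $net_hours_month"
  echo "break-even on developer hours alone: month $breakeven_months"
)
```

On developer hours alone the setup cost is recovered during the third month (40 / 16 = 2.5), so the two-month break-even presumably also counts the saved Actions minutes and the avoided incidents.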
| Risk | Impact | Mitigation |
|---|---|---|
| False negatives (missed dependencies) | High | Keep "plan all" fallback, test thoroughly, use .github/module-deps.json |
| False positives (unnecessary plans) | Low | Better to over-plan than under-plan |
| Drift alert fatigue | Medium | Tune alert thresholds, fix underlying issues, separate critical/normal |
| GitHub API rate limits | Low | Use max-parallel limits, spread checks throughout day |
| Initial setup complexity | Medium | Phased rollout, thorough testing, comprehensive documentation |
| Team adoption | Medium | Training sessions, runbooks, gradual rollout |
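The false-negative mitigation relies on .github/module-deps.json, which the dependency graph script reads as a `.modules` map from each module to the modules it uses. The proposal does not show how that file is produced; one possible generator is sketched below, assuming modules reference each other through relative sources such as `source = "../<name>"`. The function name `module_deps_json` and the path convention are assumptions.

```shell
#!/bin/sh
# module_deps_json ROOT: prints {"modules":{"<module>":["<dep>",...],...}}
# where each dep is a sibling module referenced as  source = "../<dep>".
# ROOT is the checkout that contains modules/.
module_deps_json() (
  root=${1:-.}
  cd "$root/modules" || exit 1
  first=1
  printf '{"modules":{'
  for dir in */; do
    [ -d "$dir" ] || continue
    name=${dir%/}
    # Collect sibling-module references, quote them, and join with commas
    deps=$(grep -hE 'source[[:space:]]*=[[:space:]]*"\.\./[^"/?]+' "$dir"*.tf 2>/dev/null \
      | sed -E 's|.*"\.\./([^"/?]+).*|"\1"|' | sort -u | paste -sd, -)
    [ "$first" -eq 1 ] || printf ','
    first=0
    printf '"%s":[%s]' "$name" "$deps"
  done
  printf '}}\n'
)
```

Running `module_deps_json . > .github/module-deps.json` from the repository root would regenerate the file; comparing it in CI against the committed copy is one way to keep the dependency data from going stale.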
Your Terraform repository already has a solid foundation with good separation of concerns, workspace isolation, and automated module versioning. These recommendations optimize for:
- Speed: Smart change detection reduces CI time by 60-80%
- Safety: Drift detection catches manual changes within 24 hours
- Cost: Reduced GitHub Actions minutes and faster developer feedback
- Maintainability: Clear workflows, automated alerts, comprehensive documentation
The key insight is to plan intelligently, not exhaustively, while maintaining safety nets (manual "plan all", weekly full checks, drift detection). This balances speed with safety, and developer experience with operational reliability.
- Review this proposal with your team
- Prioritize which recommendations to implement first
- Start with Phase 1 (smart change detection) for immediate wins
- Gradually add drift detection and enhancements
- Measure and iterate based on real-world results

Document Version: 1.0
Last Updated: 2026-01-09
Feedback: Share experiences and improvements with the team