jchadwick/copilot-auto-deploy-architecture.md

## copilot-auto-deploy-architecture.md

      
    Auto-Deploy Copilot Agent PRs to Testing Clusters

Overview

When a Copilot coding agent opens a PR in any org repo and the Docker image build succeeds, automatically deploy that image to the repo's designated testing cluster by generating a service orders folder in the corresponding deployment repo. When the PR is closed or merged, automatically clean up the deployment.

Key constraints:

- Only testing clusters, never production
- Org-level setup, minimal per-repo configuration
- No personal access tokens, GitHub Apps only
- Source repos never hold deployment write credentials


System Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        SOURCE REPO                                  │
│                   (e.g. glg/apollo-admin)                           │
│                                                                     │
│  .deploy.yml          Existing CI Workflow                          │
│  ┌────────────┐       ┌─────────────────────────────────────────┐   │
│  │ cluster:   │       │ 1. PR opened by Copilot agent           │   │
│  │   i22      │       │ 2. Build Docker image → pr-42-abc1234   │   │
│  │ service:   │       │ 3. Push to registry                     │   │
│  │   apollo-  │       │ 4. Generate glg-deploy-dispatcher token │   │
│  │   admin    │       │ 5. repository_dispatch → deploy-auto    │   │
│  └────────────┘       │    payload: {repo, pr#, tag, sha}       │   │
│                       └────────────────────┬────────────────────┘   │
│                                            │                        │
│  Secrets available:                        │                        │
│   DISPATCHER_APP_ID (org secret)           │                        │
│   DISPATCHER_PRIVATE_KEY (org secret)      │                        │
└────────────────────────────────────────────┼────────────────────────┘
                                             │ repository_dispatch
                                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    glg/deploy-automation                             │
│               (central orchestration repo)                          │
│                                                                     │
│  Workflow: on repository_dispatch                                   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ VALIDATION PHASE                                             │   │
│  │  a. Fetch PR from source repo API → verify user.type == Bot  │   │
│  │  b. Verify pr_author in strict actor allowlist               │   │
│  │  c. Fetch .deploy.yml from DEFAULT BRANCH of source repo     │   │
│  │  d. Fetch clusters.yml from glg/deploy-config default branch │   │
│  │  e. Verify cluster is in allowed_clusters list               │   │
│  │  f. Derive deployment_repo from cluster ID naming convention │   │
│  │  g. Validate image_tag matches ^pr-\d+-[a-f0-9]{7,40}$      │   │
│  │  h. Check active PR deployment count < threshold (e.g. 3)    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ DEPLOYMENT PHASE                                             │   │
│  │  i. Generate glg-deploy-bot token (contents:write on         │   │
│  │     deployment repos)                                        │   │
│  │  j. Clone deployment repo                                    │   │
│  │  k. Generate orders folder for {service}-pr-{number}         │   │
│  │     (copy + modify existing service, or from template)       │   │
│  │  l. Commit to main with message:                             │   │
│  │     "deploy: {service} pr-{number} from {source_repo}#{pr}" │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Secrets (repo-level only):                                         │
│   DEPLOY_BOT_APP_ID                                                 │
│   DEPLOY_BOT_PRIVATE_KEY                                            │
│                                                                     │
│  Also has:                                                          │
│   Scheduled GC workflow (cron)                                      │
│   Cleanup handler (on cleanup-pr dispatch)                          │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ git push (via deploy-bot token)
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│              DEPLOYMENT REPO                                        │
│     (e.g. glg/gds.clusterconfig.i22)                                │
│                                                                     │
│  services/                                                          │
│    apollo-admin/              ← existing production-like deploy     │
│      orders                                                         │
│      ...                                                            │
│    apollo-admin-pr-42/        ← created by automation               │
│      orders                   ← dockerdeploy .../apollo-admin/      │
│      ...                         pr-42-abc1234                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    glg/deploy-config                                 │
│              (locked-down config repo)                               │
│                                                                     │
│  clusters.yml                                                       │
│  ┌──────────────────────────────────────────┐                       │
│  │ allowed_clusters:                        │                       │
│  │   - i22                                  │                       │
│  │   - i25                                  │                       │
│  │                                          │                       │
│  │ # Repo is derived from cluster ID:       │                       │
│  │ # glg/gds.clusterconfig.{cluster_id}     │                       │
│  └──────────────────────────────────────────┘                       │
│                                                                     │
│  actor_allowlist.yml                                                │
│  ┌──────────────────────────────────────────┐                       │
│  │ allowed_actors:                          │                       │
│  │   - copilot-swe-agent[bot]              │                       │
│  │   - github-actions[bot]                  │                       │
│  └──────────────────────────────────────────┘                       │
│                                                                     │
│  Branch protection: require 2 reviewers                             │
│  CODEOWNERS: @glg/platform-team                                     │
└─────────────────────────────────────────────────────────────────────┘


Components

1. GitHub Apps

Two apps provide clean separation of privileges:


| | glg-deploy-dispatcher | glg-deploy-bot |
|---|---|---|
| Purpose | Source repos dispatch events to deploy-automation | deploy-automation writes to deployment repos |
| Permissions | contents: read, metadata: read | contents: write, metadata: read |
| Installed on | All source repos + deploy-automation + deploy-config | gds.clusterconfig.* deployment repos only |
| Secrets stored in | Org secrets, scoped to source repos only | Repo secrets on deploy-automation only |
| Blast radius if compromised | Can read source code and dispatch events. Cannot write to any repo. | Can write to gds.clusterconfig.* deployment repos. But key is only in deploy-automation, not exposed to source repos. |


2. Config Repo — glg/deploy-config

A dedicated repo with strict access controls, owned by the platform/security team.
clusters.yml — allowlist of testing clusters:
allowed_clusters:
  - i22
  - i25
  - i30

# Deployment repo is derived from cluster ID: glg/gds.clusterconfig.{cluster_id}
# No explicit mapping needed — the naming convention is enforced by the workflow.
actor_allowlist.yml — strict list of bot actors allowed to trigger deployments:
allowed_actors:
  - copilot-swe-agent[bot]
  - github-actions[bot]
Access controls:

- Branch protection on main, require 2 reviewers
- CODEOWNERS: @glg/platform-team
- No direct pushes
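
The exact-match semantics of the actor allowlist can be sketched in plain bash. This is a minimal illustration, not the workflow's implementation: the `is_allowed_actor` helper is hypothetical, and it assumes the file keeps the simple one-entry-per-line layout shown above rather than parsing arbitrary YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if ACTOR appears as an exact list entry
# in allowlist text of the form "  - name" (one entry per line).
is_allowed_actor() {
  local actor="$1" allowlist="$2"
  # Strip the leading "- " from list entries, then require a whole-line,
  # fixed-string match so "[bot]" is not treated as a regex character class.
  printf '%s\n' "$allowlist" \
    | sed -n 's/^[[:space:]]*-[[:space:]]*//p' \
    | grep -qxF -- "$actor"
}

ALLOWLIST='allowed_actors:
  - copilot-swe-agent[bot]
  - github-actions[bot]'

is_allowed_actor 'copilot-swe-agent[bot]' "$ALLOWLIST" && echo "allowed"
is_allowed_actor 'copilot-swe-agent' "$ALLOWLIST" || echo "rejected"
```

The whole-line match matters here: a prefix such as `copilot-swe-agent` without the `[bot]` suffix is rejected.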

3. Per-Repo Config — .deploy.yml

Lives in each source repo's root on the default branch. The workflow always reads this from the default branch, never the PR branch.
cluster: i22
service_path: services/apollo-admin
Note: There is no deployment_repo field. The cluster ID is used to derive the deployment repo name via the convention glg/gds.clusterconfig.{cluster_id}. The cluster ID is validated against the allowlist in clusters.yml. This prevents a malicious .deploy.yml from targeting production clusters or arbitrary repos.
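
A minimal sketch of that derivation, assuming the allowlist has already been loaded into a bash array. The `derive_deploy_repo` helper and the `prod1` cluster ID are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the deployment repo for a cluster ID, or fail
# if the cluster is not one of the allowed testing clusters.
derive_deploy_repo() {
  local cluster="$1"; shift
  local allowed
  for allowed in "$@"; do
    if [[ "$cluster" == "$allowed" ]]; then
      # Naming convention: glg/gds.clusterconfig.{cluster_id}
      printf 'glg/gds.clusterconfig.%s\n' "$cluster"
      return 0
    fi
  done
  echo "cluster '${cluster}' is not in the allowlist" >&2
  return 1
}

ALLOWED_CLUSTERS=(i22 i25 i30)
derive_deploy_repo i22 "${ALLOWED_CLUSTERS[@]}"
derive_deploy_repo prod1 "${ALLOWED_CLUSTERS[@]}" || echo "rejected"
```

Because the repo name is only ever built from an allowlisted cluster ID, a .deploy.yml cannot name a deployment repo directly.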

4. Central Orchestration Repo — glg/deploy-automation

Contains all deployment logic:

- deploy-pr dispatch handler workflow
- cleanup-pr dispatch handler workflow
- Scheduled garbage collection workflow


Workflow Details

Source Repo Workflow Addition

Each source repo adds two small jobs to their existing Docker build workflow. This is the only per-repo setup required beyond .deploy.yml:
# Added to the existing docker-build.yml workflow

on:
  pull_request:
    types: [opened, synchronize, reopened, closed]

jobs:
  build:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.tag.outputs.image_tag }}
    steps:
      # ... existing Docker build steps ...
      - name: Set image tag
        id: tag
        run: |
          SHORT_SHA=$(echo "${{ github.sha }}" | cut -c1-7)
          echo "image_tag=pr-${{ github.event.pull_request.number }}-${SHORT_SHA}" >> "$GITHUB_OUTPUT"
      # ... push to registry ...

  trigger-deploy:
    needs: build
    if: |
      github.event.action != 'closed'
      && github.event.pull_request.user.type == 'Bot'
    runs-on: ubuntu-latest
    steps:
      - name: Generate dispatcher token
        id: app-token
        uses: actions/create-github-app-token@v1  # pin to SHA in practice
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg
          repositories: deploy-automation

      - name: Trigger deployment
        uses: peter-evans/repository-dispatch@v3  # pin to SHA in practice
        with:
          token: ${{ steps.app-token.outputs.token }}
          repository: glg/deploy-automation
          event-type: deploy-pr
          client-payload: >-
            {
              "source_repo": "${{ github.repository }}",
              "pr_number": ${{ github.event.pull_request.number }},
              "pr_author": "${{ github.event.pull_request.user.login }}",
              "image_tag": "${{ needs.build.outputs.image_tag }}",
              "sha": "${{ github.sha }}",
              "default_branch": "${{ github.event.repository.default_branch }}"
            }

  trigger-cleanup:
    if: |
      github.event.action == 'closed'
      && github.event.pull_request.user.type == 'Bot'
    runs-on: ubuntu-latest
    steps:
      - name: Generate dispatcher token
        id: app-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg
          repositories: deploy-automation

      - name: Trigger cleanup
        uses: peter-evans/repository-dispatch@v3
        with:
          token: ${{ steps.app-token.outputs.token }}
          repository: glg/deploy-automation
          event-type: cleanup-pr
          client-payload: >-
            {
              "source_repo": "${{ github.repository }}",
              "pr_number": ${{ github.event.pull_request.number }}
            }
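
The tag format produced by the "Set image tag" step can be reproduced locally. The sketch below is illustrative only: `build_image_tag` is a hypothetical helper, and the SHA is a made-up value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the pr-<number>-<short sha> tag from a PR number
# and a commit SHA, truncating the SHA to its first 7 characters.
build_image_tag() {
  local pr_number="$1" sha="$2"
  printf 'pr-%s-%s\n' "$pr_number" "${sha:0:7}"
}

build_image_tag 42 abc1234def5678
```

With these inputs the helper prints `pr-42-abc1234`, the same shape as the tag shown in the system diagram.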

Deploy-Automation: Deploy Handler

# glg/deploy-automation/.github/workflows/deploy-pr.yml
name: Deploy PR to Testing Cluster

on:
  repository_dispatch:
    types: [deploy-pr]

env:
  MAX_PR_DEPLOYMENTS: 3

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Extract payload
        id: payload
        env:
          # Payload fields are untrusted; pass them through env vars so they
          # are never interpolated into the script text.
          SOURCE_REPO: ${{ github.event.client_payload.source_repo }}
          PR_NUMBER: ${{ github.event.client_payload.pr_number }}
          PR_AUTHOR: ${{ github.event.client_payload.pr_author }}
          IMAGE_TAG: ${{ github.event.client_payload.image_tag }}
          SHA: ${{ github.event.client_payload.sha }}
          DEFAULT_BRANCH: ${{ github.event.client_payload.default_branch }}
        run: |
          # Reject a non-numeric PR number before it reaches later steps
          if [[ ! "$PR_NUMBER" =~ ^[0-9]+$ ]]; then
            echo "::error::Invalid PR number"
            exit 1
          fi
          {
            echo "source_repo=${SOURCE_REPO}"
            echo "pr_number=${PR_NUMBER}"
            echo "pr_author=${PR_AUTHOR}"
            echo "image_tag=${IMAGE_TAG}"
            echo "sha=${SHA}"
            echo "default_branch=${DEFAULT_BRANCH}"
          } >> "$GITHUB_OUTPUT"

      # --- VALIDATION PHASE ---

      - name: Validate image tag format
        env:
          TAG: ${{ steps.payload.outputs.image_tag }}
        run: |
          if [[ ! "$TAG" =~ ^pr-[0-9]+-[a-f0-9]{7,40}$ ]]; then
            echo "::error::Invalid image tag format: $TAG"
            exit 1
          fi

      - name: Validate source repo is in org
        env:
          REPO: ${{ steps.payload.outputs.source_repo }}
        run: |
          # Anchor the whole name so nothing unexpected can follow "glg/"
          if [[ ! "$REPO" =~ ^glg/[A-Za-z0-9._-]+$ ]]; then
            echo "::error::Source repo is not a valid glg repo name: $REPO"
            exit 1
          fi

      - name: Generate dispatcher token (for reading configs)
        id: dispatcher-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg

      - name: Validate PR author is a bot
        id: pr-author
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
          REPO: ${{ steps.payload.outputs.source_repo }}
          PR_NUM: ${{ steps.payload.outputs.pr_number }}
        run: |
          PR_DATA=$(gh api "repos/${REPO}/pulls/${PR_NUM}" --jq '{type: .user.type, login: .user.login, state: .state}')
          USER_TYPE=$(echo "$PR_DATA" | jq -r '.type')
          USER_LOGIN=$(echo "$PR_DATA" | jq -r '.login')
          PR_STATE=$(echo "$PR_DATA" | jq -r '.state')

          if [[ "$PR_STATE" != "open" ]]; then
            echo "::error::PR #${PR_NUM} is not open (state: ${PR_STATE})"
            exit 1
          fi

          if [[ "$USER_TYPE" != "Bot" ]]; then
            echo "::error::PR author is not a bot (type: ${USER_TYPE})"
            exit 1
          fi

          # Expose the API-verified login to later steps (requires the step id above)
          echo "pr_author_login=${USER_LOGIN}" >> "$GITHUB_OUTPUT"

      - name: Fetch actor allowlist
        id: allowlist
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
          # Check the login verified via the API, not the value sent in the dispatch payload
          AUTHOR: ${{ steps.pr-author.outputs.pr_author_login }}
        run: |
          ALLOWLIST=$(gh api "repos/glg/deploy-config/contents/actor_allowlist.yml" --jq '.content' | base64 -d)

          # Exact, whole-line match against the parsed list entries
          if ! echo "$ALLOWLIST" | yq '.allowed_actors[]' | grep -qxF -- "$AUTHOR"; then
            echo "::error::Actor '${AUTHOR}' is not in the allowlist"
            exit 1
          fi

          echo "Actor '${AUTHOR}' is in the allowlist"

      - name: Fetch .deploy.yml from default branch
        id: deploy-config
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          REPO="${{ steps.payload.outputs.source_repo }}"
          BRANCH="${{ steps.payload.outputs.default_branch }}"

          CONFIG=$(gh api "repos/${REPO}/contents/.deploy.yml?ref=${BRANCH}" --jq '.content' | base64 -d)

          CLUSTER=$(echo "$CONFIG" | yq '.cluster')
          SERVICE_PATH=$(echo "$CONFIG" | yq '.service_path')

          if [[ -z "$CLUSTER" || "$CLUSTER" == "null" ]]; then
            echo "::error::.deploy.yml is missing 'cluster' field"
            exit 1
          fi

          if [[ -z "$SERVICE_PATH" || "$SERVICE_PATH" == "null" ]]; then
            echo "::error::.deploy.yml is missing 'service_path' field"
            exit 1
          fi

          echo "cluster=${CLUSTER}" >> "$GITHUB_OUTPUT"
          echo "service_path=${SERVICE_PATH}" >> "$GITHUB_OUTPUT"

      - name: Validate cluster and resolve deployment repo
        id: cluster
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          CLUSTERS_CONFIG=$(gh api "repos/glg/deploy-config/contents/clusters.yml" --jq '.content' | base64 -d)
          CLUSTER="${{ steps.deploy-config.outputs.cluster }}"

          # Check cluster is in allowlist
          if ! echo "$CLUSTERS_CONFIG" | yq ".allowed_clusters[]" | grep -qxF "$CLUSTER"; then
            echo "::error::Cluster '${CLUSTER}' is not in the allowed clusters list"
            exit 1
          fi

          # Derive deployment repo from cluster ID (enforced naming convention)
          DEPLOY_REPO="glg/gds.clusterconfig.${CLUSTER}"

          echo "deploy_repo=${DEPLOY_REPO}" >> "$GITHUB_OUTPUT"

      - name: Generate deploy-bot token
        id: deploy-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_BOT_APP_ID }}
          private-key: ${{ secrets.DEPLOY_BOT_PRIVATE_KEY }}
          owner: glg
          # Repository name without the owner prefix
          repositories: gds.clusterconfig.${{ steps.deploy-config.outputs.cluster }}

      - name: Check PR deployment count
        env:
          # The dispatcher app is not installed on deployment repos, so the
          # folder listing is read with the deploy-bot token.
          GH_TOKEN: ${{ steps.deploy-token.outputs.token }}
          DEPLOY_REPO: ${{ steps.cluster.outputs.deploy_repo }}
          SERVICE_PATH: ${{ steps.deploy-config.outputs.service_path }}
          PR_NUMBER: ${{ steps.payload.outputs.pr_number }}
        run: |
          SERVICE_NAME=$(basename "$SERVICE_PATH")
          PR_FOLDER="${SERVICE_NAME}-pr-${PR_NUMBER}"

          # Fail closed: if the listing cannot be fetched, this step fails
          NAMES=$(gh api "repos/${DEPLOY_REPO}/contents/$(dirname "$SERVICE_PATH")" --jq '.[].name')

          # Count PR deployment folders for this service, excluding this PR's own
          # folder so that updating an existing deployment is not blocked
          EXISTING=$(echo "$NAMES" | grep "^${SERVICE_NAME}-pr-" | grep -cvxF -- "$PR_FOLDER" || true)

          if [[ "$EXISTING" -ge "$MAX_PR_DEPLOYMENTS" ]]; then
            echo "::error::Service '${SERVICE_NAME}' already has ${EXISTING} PR deployments (max: ${MAX_PR_DEPLOYMENTS})"
            exit 1
          fi

          echo "Current PR deployments for ${SERVICE_NAME}: ${EXISTING}"

      # --- DEPLOYMENT PHASE ---

      - name: Generate orders folder and deploy
        env:
          GH_TOKEN: ${{ steps.deploy-token.outputs.token }}
        run: |
          DEPLOY_REPO="${{ steps.cluster.outputs.deploy_repo }}"
          SERVICE_PATH="${{ steps.deploy-config.outputs.service_path }}"
          SERVICE_NAME=$(basename "$SERVICE_PATH")
          SERVICE_DIR=$(dirname "$SERVICE_PATH")
          PR_NUMBER="${{ steps.payload.outputs.pr_number }}"
          IMAGE_TAG="${{ steps.payload.outputs.image_tag }}"
          SOURCE_REPO="${{ steps.payload.outputs.source_repo }}"
          PR_FOLDER="${SERVICE_NAME}-pr-${PR_NUMBER}"

          # Clone deployment repo
          git clone "https://x-access-token:${GH_TOKEN}@github.com/${DEPLOY_REPO}.git" deploy-repo
          cd deploy-repo

          git config user.name "glg-deploy-bot[bot]"
          git config user.email "glg-deploy-bot[bot]@users.noreply.github.com"

          # Copy existing service folder as base (or fail if it doesn't exist)
          if [[ ! -d "${SERVICE_PATH}" ]]; then
            echo "::error::Service path '${SERVICE_PATH}' does not exist in ${DEPLOY_REPO}"
            exit 1
          fi

          # Remove existing PR folder if it exists (update scenario)
          rm -rf "${SERVICE_DIR}/${PR_FOLDER}"

          # Copy and modify
          cp -r "${SERVICE_PATH}" "${SERVICE_DIR}/${PR_FOLDER}"

          # Update the dockerdeploy line in the orders file
          ORDERS_FILE="${SERVICE_DIR}/${PR_FOLDER}/orders"
          if [[ ! -f "$ORDERS_FILE" ]]; then
            echo "::error::No orders file found at ${ORDERS_FILE}"
            exit 1
          fi

          # Replace the dockerdeploy line's tag portion
          # Original: dockerdeploy github/glg/apollo-admin/main:latest
          # Updated:  dockerdeploy github/glg/apollo-admin/main:pr-42-abc1234
          sed -i.bak -E "s|(dockerdeploy [^:]+):.*|\1:${IMAGE_TAG}|" "$ORDERS_FILE"
          rm -f "${ORDERS_FILE}.bak"

          # Commit and push with retry for concurrent pushes
          git add -A
          git commit -m "deploy: ${SERVICE_NAME} pr-${PR_NUMBER} from ${SOURCE_REPO}#${PR_NUMBER}

          Source: ${SOURCE_REPO}#${PR_NUMBER}
          Image tag: ${IMAGE_TAG}
          Automated by glg/deploy-automation"

          MAX_RETRIES=3
          for i in $(seq 1 $MAX_RETRIES); do
            if git push origin main; then
              echo "Successfully deployed ${PR_FOLDER}"
              break
            fi
            if [[ $i -eq $MAX_RETRIES ]]; then
              echo "::error::Failed to push after ${MAX_RETRIES} retries"
              exit 1
            fi
            echo "Push failed, retrying (attempt $((i+1))/${MAX_RETRIES})..."
            git pull --rebase origin main
          done
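
The orders rewrite at the heart of the deploy step can be tried in isolation. The sketch below reuses the same sed expression on a temporary file; the `set_orders_tag` helper is hypothetical, and the single-line orders content is taken from the comment in the workflow rather than from a real orders file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point the dockerdeploy line of an orders file at a new tag.
set_orders_tag() {
  local orders_file="$1" image_tag="$2"
  # Keep everything up to the first ':' after "dockerdeploy " and replace the rest.
  sed -i.bak -E "s|(dockerdeploy [^:]+):.*|\1:${image_tag}|" "$orders_file"
  rm -f "${orders_file}.bak"
}

ORDERS=$(mktemp)
printf 'dockerdeploy github/glg/apollo-admin/main:latest\n' > "$ORDERS"
set_orders_tag "$ORDERS" "pr-42-abc1234"
cat "$ORDERS"
```

Because the tag is spliced into the sed expression, this relies on the tag having passed the format check first; a tag containing `|` or `&` would change what sed does.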

Deploy-Automation: Cleanup Handler

# glg/deploy-automation/.github/workflows/cleanup-pr.yml
name: Cleanup PR Deployment

on:
  repository_dispatch:
    types: [cleanup-pr]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
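      - name: Extract payload
        id: payload
        env:
          # Payload fields are untrusted; pass them through env vars so they
          # are never interpolated into the script text.
          SOURCE_REPO: ${{ github.event.client_payload.source_repo }}
          PR_NUMBER: ${{ github.event.client_payload.pr_number }}
        run: |
          if [[ ! "$PR_NUMBER" =~ ^[0-9]+$ ]]; then
            echo "::error::Invalid PR number"
            exit 1
          fi
          echo "source_repo=${SOURCE_REPO}" >> "$GITHUB_OUTPUT"
          echo "pr_number=${PR_NUMBER}" >> "$GITHUB_OUTPUT"

      - name: Validate source repo is in org
        env:
          REPO: ${{ steps.payload.outputs.source_repo }}
        run: |
          if [[ ! "$REPO" =~ ^glg/[A-Za-z0-9._-]+$ ]]; then
            echo "::error::Source repo is not a valid glg repo name: $REPO"
            exit 1
          fi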

      - name: Generate dispatcher token
        id: dispatcher-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg

      - name: Fetch .deploy.yml from default branch
        id: deploy-config
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          REPO="${{ steps.payload.outputs.source_repo }}"

          # Get default branch
          DEFAULT_BRANCH=$(gh api "repos/${REPO}" --jq '.default_branch')

          CONFIG=$(gh api "repos/${REPO}/contents/.deploy.yml?ref=${DEFAULT_BRANCH}" --jq '.content' | base64 -d)
          CLUSTER=$(echo "$CONFIG" | yq '.cluster')
          SERVICE_PATH=$(echo "$CONFIG" | yq '.service_path')

          echo "cluster=${CLUSTER}" >> "$GITHUB_OUTPUT"
          echo "service_path=${SERVICE_PATH}" >> "$GITHUB_OUTPUT"

      - name: Resolve deployment repo
        id: cluster
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          CLUSTERS_CONFIG=$(gh api "repos/glg/deploy-config/contents/clusters.yml" --jq '.content' | base64 -d)
          CLUSTER="${{ steps.deploy-config.outputs.cluster }}"

          # Validate cluster is in allowlist
          if ! echo "$CLUSTERS_CONFIG" | yq ".allowed_clusters[]" | grep -qxF "$CLUSTER"; then
            echo "::error::Cluster '${CLUSTER}' is not in the allowed clusters list"
            exit 1
          fi

          # Derive deployment repo from cluster ID
          DEPLOY_REPO="glg/gds.clusterconfig.${CLUSTER}"

          echo "deploy_repo=${DEPLOY_REPO}" >> "$GITHUB_OUTPUT"

      - name: Generate deploy-bot token
        id: deploy-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_BOT_APP_ID }}
          private-key: ${{ secrets.DEPLOY_BOT_PRIVATE_KEY }}
          owner: glg
          # Repository name without the owner prefix
          repositories: gds.clusterconfig.${{ steps.deploy-config.outputs.cluster }}

      - name: Remove PR deployment folder
        env:
          GH_TOKEN: ${{ steps.deploy-token.outputs.token }}
        run: |
          DEPLOY_REPO="${{ steps.cluster.outputs.deploy_repo }}"
          SERVICE_PATH="${{ steps.deploy-config.outputs.service_path }}"
          SERVICE_NAME=$(basename "$SERVICE_PATH")
          SERVICE_DIR=$(dirname "$SERVICE_PATH")
          PR_NUMBER="${{ steps.payload.outputs.pr_number }}"
          SOURCE_REPO="${{ steps.payload.outputs.source_repo }}"
          PR_FOLDER="${SERVICE_NAME}-pr-${PR_NUMBER}"

          git clone "https://x-access-token:${GH_TOKEN}@github.com/${DEPLOY_REPO}.git" deploy-repo
          cd deploy-repo

          git config user.name "glg-deploy-bot[bot]"
          git config user.email "glg-deploy-bot[bot]@users.noreply.github.com"

          TARGET="${SERVICE_DIR}/${PR_FOLDER}"
          if [[ ! -d "$TARGET" ]]; then
            echo "PR deployment folder '${TARGET}' does not exist, nothing to clean up"
            exit 0
          fi

          rm -rf "$TARGET"
          git add -A
          git commit -m "cleanup: remove ${PR_FOLDER} (${SOURCE_REPO}#${PR_NUMBER} closed)

          Source: ${SOURCE_REPO}#${PR_NUMBER}
          Automated by glg/deploy-automation"

          MAX_RETRIES=3
          for i in $(seq 1 $MAX_RETRIES); do
            if git push origin main; then
              echo "Successfully cleaned up ${PR_FOLDER}"
              break
            fi
            if [[ $i -eq $MAX_RETRIES ]]; then
              echo "::error::Failed to push after ${MAX_RETRIES} retries"
              exit 1
            fi
            echo "Push failed, retrying..."
            git pull --rebase origin main
          done
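
The push-with-retry pattern used by both handlers can be factored into a small function. The sketch below is one possible shape, not the workflows' code: `retry_with_recovery`, `fake_push`, and `fake_rebase` are hypothetical names, and the fake commands stand in for `git push` and `git pull --rebase` so the example runs without a repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run PUSH_CMD up to MAX times, running RECOVER_CMD
# between failed attempts (in the workflows: git push / git pull --rebase).
retry_with_recovery() {
  local max="$1" push_cmd="$2" recover_cmd="$3" attempt
  for ((attempt = 1; attempt <= max; attempt++)); do
    if "$push_cmd"; then
      return 0
    fi
    if ((attempt == max)); then
      echo "failed after ${max} attempts" >&2
      return 1
    fi
    "$recover_cmd"
  done
}

# Stand-ins for git: the fake push fails twice, then succeeds.
ATTEMPTS=0
fake_push() { ATTEMPTS=$((ATTEMPTS + 1)); ((ATTEMPTS >= 3)); }
fake_rebase() { echo "rebasing before attempt $((ATTEMPTS + 1))"; }

retry_with_recovery 3 fake_push fake_rebase && echo "pushed on attempt ${ATTEMPTS}"
```

Returning a non-zero status on the final failed attempt keeps the calling step from reporting success when nothing was pushed.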

Deploy-Automation: Scheduled Garbage Collection

# glg/deploy-automation/.github/workflows/gc.yml
name: Garbage Collect Stale PR Deployments

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am UTC
  workflow_dispatch: {}    # Allow manual trigger

jobs:
  gc:
    runs-on: ubuntu-latest
    steps:
      - name: Generate dispatcher token
        id: dispatcher-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg

      - name: Generate deploy-bot token
        id: deploy-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_BOT_APP_ID }}
          private-key: ${{ secrets.DEPLOY_BOT_PRIVATE_KEY }}
          owner: glg

      - name: Fetch cluster config
        id: config
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          gh api "repos/glg/deploy-config/contents/clusters.yml" --jq '.content' | base64 -d > clusters.yml

      - name: Scan and clean stale deployments
        env:
          GH_TOKEN_READ: ${{ steps.dispatcher-token.outputs.token }}
          GH_TOKEN_WRITE: ${{ steps.deploy-token.outputs.token }}
        run: |
          ORPHANS_FOUND=0

          # Iterate over each allowed cluster and derive deployment repo
          for CLUSTER_ID in $(yq '.allowed_clusters[]' clusters.yml); do
            DEPLOY_REPO="glg/gds.clusterconfig.${CLUSTER_ID}"
            echo "Scanning ${DEPLOY_REPO}..."

            # List all directories that match the *-pr-* pattern
            # This is a simplified scan — adjust based on your actual directory structure
            # Deployment repos are only readable with the deploy-bot token
            DIRS=$(GH_TOKEN="$GH_TOKEN_WRITE" gh api "repos/${DEPLOY_REPO}/git/trees/main?recursive=1" \
              --jq '.tree[] | select(.type == "tree") | .path' \
              | grep -E -e '-pr-[0-9]+$' || true)

            for DIR in $DIRS; do
              # Extract service name and PR number from folder name
              FOLDER_NAME=$(basename "$DIR")
              PR_NUM=$(echo "$FOLDER_NAME" | grep -oE 'pr-[0-9]+$' | sed 's/pr-//')

              if [[ -z "$PR_NUM" ]]; then
                continue
              fi

              # We need to find which source repo this came from.
              # Check the last commit message on this folder for the source repo reference.
              COMMIT_MSG=$(GH_TOKEN="$GH_TOKEN_WRITE" gh api "repos/${DEPLOY_REPO}/commits?path=${DIR}&per_page=1" \
                --jq '.[0].commit.message' 2>/dev/null || true)

              SOURCE_REPO=$(echo "$COMMIT_MSG" | grep -oE 'glg/[^ #]+' | head -1 || true)

              if [[ -z "$SOURCE_REPO" ]]; then
                echo "  WARNING: Could not determine source repo for ${DIR}, skipping"
                continue
              fi

              # Check if the PR is still open
              PR_STATE=$(GH_TOKEN="$GH_TOKEN_READ" gh api "repos/${SOURCE_REPO}/pulls/${PR_NUM}" \
                --jq '.state' 2>/dev/null || echo "not_found")

              if [[ "$PR_STATE" == "open" ]]; then
                echo "  ${DIR}: PR #${PR_NUM} still open, keeping"
                continue
              fi

              echo "  ${DIR}: PR #${PR_NUM} is ${PR_STATE}, removing"
              ORPHANS_FOUND=$((ORPHANS_FOUND + 1))

              # Clone, remove, commit, push
              TEMP_DIR=$(mktemp -d)
              GH_TOKEN="$GH_TOKEN_WRITE" git clone "https://x-access-token:${GH_TOKEN_WRITE}@github.com/${DEPLOY_REPO}.git" "$TEMP_DIR"
              cd "$TEMP_DIR"
              git config user.name "glg-deploy-bot[bot]"
              git config user.email "glg-deploy-bot[bot]@users.noreply.github.com"

              rm -rf "$DIR"
              git add -A
              git commit -m "gc: remove stale deployment ${FOLDER_NAME} (${SOURCE_REPO}#${PR_NUM} ${PR_STATE})

          Automated garbage collection by glg/deploy-automation"

              for i in 1 2 3; do
                if git push origin main; then
                  break
                fi
                git pull --rebase origin main
              done

              cd -
              rm -rf "$TEMP_DIR"
            done
          done

          echo "Garbage collection complete. Orphans removed: ${ORPHANS_FOUND}"

          if [[ "$ORPHANS_FOUND" -gt 0 ]]; then
            echo "::warning::Removed ${ORPHANS_FOUND} stale PR deployment(s)"
          fi
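
The two parsing steps in the GC scan (PR number from the folder name, source repo from the commit message) can be checked without any API calls. The helpers below are hypothetical and use bash's built-in regex matching instead of the grep/sed pipeline above; the sample inputs follow the folder and commit-message formats used by the deploy handler.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the two parsing steps in the GC scan.

# Print the PR number encoded at the end of a folder name like "apollo-admin-pr-42".
pr_number_from_folder() {
  local folder="$1" re='-pr-([0-9]+)$'
  if [[ "$folder" =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Print the first "glg/<repo>" reference found in a commit message.
source_repo_from_message() {
  local message="$1" re='glg/[^[:space:]#]+'
  if [[ "$message" =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[0]}"
  else
    return 1
  fi
}

pr_number_from_folder "apollo-admin-pr-42"
source_repo_from_message "deploy: apollo-admin pr-42 from glg/apollo-admin#42"
```

Both helpers return a non-zero status when nothing matches, which corresponds to the "skip this folder" branches in the scan.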

Security Threat Analysis

Threats and Mitigations


| # | Threat | Severity | Mitigation |
|---|---|---|---|
| 1 | App key compromise | CRITICAL | Two-app architecture. Source repos only hold the dispatcher key (read-only). Deploy-bot key lives only in deploy-automation repo. Even if dispatcher key leaks, attacker cannot write to deployment repos. |
| 2 | Bot actor spoofing | HIGH | Double validation: user.type == "Bot" (GitHub-controlled field) AND exact-match against actor_allowlist.yml in locked-down config repo. |
| 3 | Malicious .deploy.yml | HIGH | Always read from the default branch, never the PR branch. Cluster validated against the allowlist in the config repo. Deployment repo derived from the validated cluster ID by naming convention, never read from .deploy.yml. |
| 4 | Deployment flooding / DoS | MEDIUM | Max 3 active PR deployments per service. Enforced in validation phase before any write occurs. |
| 5 | Command injection via PR content | MEDIUM | Image tag validated against strict regex ^pr-\d+-[a-f0-9]{7,40}$. PR-derived values should reach run steps through environment variables rather than being interpolated into the script text. |
| 6 | Race conditions in deployment repo | LOW-MED | Retry loop with git pull --rebase on push failure (up to 3 attempts). |
| 7 | GitHub App over-permissioning | MEDIUM | Deploy-bot installed only on gds.clusterconfig.* deployment repos. Dispatcher installed on source repos + config repos. Neither has more access than needed. |
| 8 | Stale deployments from cleanup failures | LOW-MED | Daily cron GC scans all deployment repos, cross-references with PR state, removes orphaned folders. Warns via GitHub Actions annotations. |
| 9 | Compromised shared workflow | HIGH | Mitigated by the dispatch pattern: there is no reusable workflow called by source repos. All logic lives in deploy-automation which is protected by branch protection and CODEOWNERS. Source repos only send a dispatch event. |


Security Properties


Source repos never hold deployment write credentials — they only have the dispatcher app key which can read and dispatch, never write
.deploy.yml is read from the default branch — PR authors cannot tamper with cluster targeting
Cluster allowlist is in a separate locked-down repo — only the platform team can modify what clusters are targetable
Actor allowlist is centrally managed — adding a new bot type requires platform team review
All validation happens in deploy-automation — source repos have no say in what gets deployed where beyond their merged .deploy.yml
Rate limited — max 3 concurrent PR deployments per service
Self-healing — scheduled GC catches any cleanup failures
No reusable workflow to compromise — the dispatch pattern means source repos never reference or run deploy-automation code directly


Validation Checklist

Every deployment must pass ALL of these checks:


| # | Check | Prevents |
|---|-------|----------|
| 1 | PR exists and is open | Stale/invalid dispatch payloads |
| 2 | `pr.user.type == "Bot"` | Human PRs triggering deploys |
| 3 | `pr.user.login` in `actor_allowlist.yml` | Unknown bots triggering deploys |
| 4 | `.deploy.yml` read from default branch | PR branch tampering with config |
| 5 | `cluster` in `allowed_clusters` | Deploying to production |
| 6 | `deployment_repo` derived from `glg/gds.clusterconfig.{cluster_id}` convention | Arbitrary repo targeting |
| 7 | `image_tag` matches `^pr-\d+-[a-f0-9]{7,40}$` | Command injection via tag |
| 8 | Active PR deployments for service < 3 | Resource exhaustion / flooding |
| 9 | Source repo belongs to the org (`^glg/`) | Cross-org abuse |
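
The checks that operate purely on strings (2, 3, 5, 7, 8, and 9) can be sketched as a single bash function. Everything named here is illustrative: the function names, the positional argument order, and the sample allowlist values are assumptions, and a real run would load the lists from `actor_allowlist.yml` and `clusters.yml`. One detail that is easy to miss: bash's `[[ =~ ]]` uses POSIX extended regular expressions, which do not define `\d`, so the sketch spells the digit class as `[0-9]`.

```shell
#!/usr/bin/env bash
# Sketch of the string-level checks from the validation checklist.
# Names and sample allowlists are hypothetical placeholders.

MAX_ACTIVE_PR_DEPLOYMENTS=3
ALLOWED_ACTORS=("example-agent[bot]")   # placeholder bot login
ALLOWED_CLUSTERS=("i22")                # testing cluster used in the diagram

# Exact-match membership test: in_list NEEDLE ITEM...
in_list() {
  local needle="$1"; shift
  local item
  for item in "$@"; do
    [[ "$item" == "$needle" ]] && return 0
  done
  return 1
}

# Check 7: pr-<number>-<7 to 40 lowercase hex chars>, nothing else.
is_valid_image_tag() {
  local re='^pr-[0-9]+-[a-f0-9]{7,40}$'
  [[ "$1" =~ $re ]]
}

# Args: repo user_type user_login cluster image_tag active_count
validate_request() {
  local repo="$1" user_type="$2" user_login="$3"
  local cluster="$4" image_tag="$5" active_count="$6"

  [[ "$repo" =~ ^glg/ ]] \
    || { echo "reject: repo outside org" >&2; return 1; }
  [[ "$user_type" == "Bot" ]] \
    || { echo "reject: author is not a bot" >&2; return 1; }
  in_list "$user_login" "${ALLOWED_ACTORS[@]}" \
    || { echo "reject: actor not allowlisted" >&2; return 1; }
  in_list "$cluster" "${ALLOWED_CLUSTERS[@]}" \
    || { echo "reject: cluster not allowed" >&2; return 1; }
  is_valid_image_tag "$image_tag" \
    || { echo "reject: malformed image tag" >&2; return 1; }
  # Require a plain integer before comparing, so the count is never
  # evaluated as an arithmetic expression.
  if ! [[ "$active_count" =~ ^[0-9]+$ ]] \
     || [ "$active_count" -ge "$MAX_ACTIVE_PR_DEPLOYMENTS" ]; then
    echo "reject: too many active PR deployments" >&2
    return 1
  fi
  return 0
}
```

For example, `validate_request "glg/apollo-admin" "Bot" "example-agent[bot]" "i22" "pr-42-abc1234" 0` succeeds, while the same call with a tag such as `pr-42-abc1234; rm -rf /` is rejected before anything is written. Checks 1, 4, and 6 need GitHub API calls and are left out of this sketch.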


Setup Checklist

One-Time Org Setup


 Create glg-deploy-dispatcher GitHub App

Permissions: contents: read, metadata: read
Install on: all source repos + deploy-automation + deploy-config


 Create glg-deploy-bot GitHub App

Permissions: contents: write, metadata: read
Install on: gds.clusterconfig.* deployment repos only


 Create glg/deploy-automation repo

Add repo secrets: DEPLOY_BOT_APP_ID, DEPLOY_BOT_PRIVATE_KEY, DISPATCHER_APP_ID, DISPATCHER_PRIVATE_KEY
Add the three workflows: deploy-pr.yml, cleanup-pr.yml, gc.yml
Enable branch protection on main


 Create glg/deploy-config repo

Add clusters.yml and actor_allowlist.yml
Enable branch protection: require 2 reviewers
Add CODEOWNERS: @glg/platform-team


 Add org secrets scoped to source repos:

DISPATCHER_APP_ID
DISPATCHER_PRIVATE_KEY


Per-Repo Setup (Done by Each Team)


 Add .deploy.yml to the repo's default branch
 Add trigger-deploy and trigger-cleanup jobs to existing Docker build workflow