@jchadwick
Last active February 25, 2026 21:00
Architecture: Auto-Deploy Copilot Agent PRs to Testing Clusters

Auto-Deploy Copilot Agent PRs to Testing Clusters

Overview

When a Copilot coding agent opens a PR in any org repo and the Docker image build succeeds, automatically deploy that image to the repo's designated testing cluster by generating a service orders folder in the corresponding deployment repo. When the PR is closed or merged, automatically clean up the deployment.

Key constraints:

  • Only testing clusters — never production
  • Org-level setup, minimal per-repo configuration
  • No personal access tokens — GitHub Apps only
  • Source repos never hold deployment write credentials

System Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        SOURCE REPO                                  │
│                   (e.g. glg/apollo-admin)                           │
│                                                                     │
│  .deploy.yml          Existing CI Workflow                          │
│  ┌────────────┐       ┌─────────────────────────────────────────┐   │
│  │ cluster:   │       │ 1. PR opened by Copilot agent           │   │
│  │   i22      │       │ 2. Build Docker image → pr-42-abc1234   │   │
│  │ service:   │       │ 3. Push to registry                     │   │
│  │   apollo-  │       │ 4. Generate glg-deploy-dispatcher token │   │
│  │   admin    │       │ 5. repository_dispatch → deploy-auto    │   │
│  └────────────┘       │    payload: {repo, pr#, tag, sha}       │   │
│                       └────────────────────┬────────────────────┘   │
│                                            │                        │
│  Secrets available:                        │                        │
│   DISPATCHER_APP_ID (org secret)           │                        │
│   DISPATCHER_PRIVATE_KEY (org secret)      │                        │
└────────────────────────────────────────────┼────────────────────────┘
                                             │ repository_dispatch
                                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    glg/deploy-automation                             │
│               (central orchestration repo)                          │
│                                                                     │
│  Workflow: on repository_dispatch                                   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ VALIDATION PHASE                                             │   │
│  │  a. Fetch PR from source repo API → verify user.type == Bot  │   │
│  │  b. Verify pr_author in strict actor allowlist               │   │
│  │  c. Fetch .deploy.yml from DEFAULT BRANCH of source repo     │   │
│  │  d. Fetch clusters.yml from glg/deploy-config default branch │   │
│  │  e. Verify cluster is in allowed_clusters list               │   │
│  │  f. Derive deployment_repo from cluster ID (by convention)   │   │
│  │  g. Validate image_tag matches ^pr-\d+-[a-f0-9]{7,40}$      │   │
│  │  h. Check active PR deployment count < threshold (e.g. 3)    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ DEPLOYMENT PHASE                                             │   │
│  │  i. Generate glg-deploy-bot token (contents:write on         │   │
│  │     deployment repos)                                        │   │
│  │  j. Clone deployment repo                                    │   │
│  │  k. Generate orders folder for {service}-pr-{number}         │   │
│  │     (copy + modify existing service, or from template)       │   │
│  │  l. Commit to main with message:                             │   │
│  │     "deploy: {service} pr-{number} from {source_repo}#{pr}" │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Secrets (repo-level only):                                         │
│   DEPLOY_BOT_APP_ID                                                 │
│   DEPLOY_BOT_PRIVATE_KEY                                            │
│                                                                     │
│  Also has:                                                          │
│   Scheduled GC workflow (cron)                                      │
│   Cleanup handler (on cleanup-pr dispatch)                          │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ git push (via deploy-bot token)
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│              DEPLOYMENT REPO                                        │
│     (e.g. glg/gds.clusterconfig.i22)                                │
│                                                                     │
│  services/                                                          │
│    apollo-admin/              ← existing production-like deploy     │
│      orders                                                         │
│      ...                                                            │
│    apollo-admin-pr-42/        ← created by automation               │
│      orders                   ← dockerdeploy .../apollo-admin/      │
│      ...                         pr-42-abc1234                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    glg/deploy-config                                 │
│              (locked-down config repo)                               │
│                                                                     │
│  clusters.yml                                                       │
│  ┌──────────────────────────────────────────┐                       │
│  │ allowed_clusters:                        │                       │
│  │   - i22                                  │                       │
│  │   - i25                                  │                       │
│  │                                          │                       │
│  │ # Repo is derived from cluster ID:       │                       │
│  │ # glg/gds.clusterconfig.{cluster_id}     │                       │
│  └──────────────────────────────────────────┘                       │
│                                                                     │
│  actor_allowlist.yml                                                │
│  ┌──────────────────────────────────────────┐                       │
│  │ allowed_actors:                          │                       │
│  │   - copilot-swe-agent[bot]              │                       │
│  │   - github-actions[bot]                  │                       │
│  └──────────────────────────────────────────┘                       │
│                                                                     │
│  Branch protection: require 2 reviewers                             │
│  CODEOWNERS: @glg/platform-team                                     │
└─────────────────────────────────────────────────────────────────────┘

Components

1. GitHub Apps

Two apps provide clean separation of privileges:

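| | glg-deploy-dispatcher | glg-deploy-bot |
| --- | --- | --- |
| Purpose | Source repos dispatch events to deploy-automation | deploy-automation writes to deployment repos |
| Permissions | contents: read, metadata: read | contents: write, metadata: read |
| Installed on | All source repos + deploy-automation + deploy-config | gds.clusterconfig.* deployment repos only |
| Secrets stored in | Org secrets, scoped to source repos only | Repo secrets on deploy-automation only |
| Blast radius if compromised | Can read source code and dispatch events. Cannot write to any repo. | Can write to gds.clusterconfig.* deployment repos. But key is only in deploy-automation, not exposed to source repos. |

Caveat: GitHub's "Create a repository dispatch event" REST endpoint requires Contents write permission on the target repo, so contents: read alone will not let the dispatcher send events to deploy-automation. Because permissions are declared on the app as a whole, granting write weakens the "cannot write to any repo" claim above; review the dispatcher's permission set before rollout.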

2. Config Repo — glg/deploy-config

A dedicated repo with strict access controls, owned by the platform/security team.

clusters.yml — allowlist of testing clusters:

allowed_clusters:
  - i22
  - i25
  - i30

# Deployment repo is derived from cluster ID: glg/gds.clusterconfig.{cluster_id}
# No explicit mapping needed — the naming convention is enforced by the workflow.

actor_allowlist.yml — strict list of bot actors allowed to trigger deployments:

allowed_actors:
  - copilot-swe-agent[bot]
  - github-actions[bot]

Access controls:

  • Branch protection on main, require 2 reviewers
  • CODEOWNERS: @glg/platform-team
  • No direct pushes
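
The allowlist lookup needs a literal, whole-line comparison, because bot logins such as copilot-swe-agent[bot] contain brackets that a regex would read as a character class. The following is a minimal sketch, assuming a hypothetical helper named is_allowed_actor and the flat list format shown above; the sed-based parsing is a simplification for illustration, not a YAML parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflows): succeeds only if ACTOR is listed
# in a flat "allowed_actors" list. The sed parsing is an assumption that holds for
# the simple file shown above; it is not a general YAML parser.
is_allowed_actor() {
  local actor="$1" allowlist_file="$2"
  # Strip the leading "- " from list items, then compare whole lines literally
  # (-x whole line, -F fixed string, -- ends option parsing).
  sed -n 's/^[[:space:]]*-[[:space:]]*//p' "$allowlist_file" | grep -qxF -- "$actor"
}

allowlist=$(mktemp)
cat > "$allowlist" <<'EOF'
allowed_actors:
  - copilot-swe-agent[bot]
  - github-actions[bot]
EOF

is_allowed_actor 'copilot-swe-agent[bot]' "$allowlist" && echo "allowed"
is_allowed_actor 'copilot-swe-agentb' "$allowlist" || echo "rejected"
```

Without -F, the pattern copilot-swe-agent[bot] would match the string copilot-swe-agentb and fail to match the real bot login, which is the opposite of what the allowlist intends.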

3. Per-Repo Config — .deploy.yml

Lives in each source repo's root on the default branch. The workflow always reads this from the default branch, never the PR branch.

cluster: i22
service_path: services/apollo-admin

Note: There is no deployment_repo field. The cluster ID is used to derive the deployment repo name via the convention glg/gds.clusterconfig.{cluster_id}. The cluster ID is validated against the allowlist in clusters.yml. This prevents a malicious .deploy.yml from targeting production clusters or arbitrary repos.
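
The derivation described in this note can be sketched as a small shell function. This is an illustration only: the function name derive_deploy_repo and the identifier pattern ^[a-z0-9]+$ are assumptions, and the allowlist is passed in as newline-separated cluster IDs rather than read from clusters.yml.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the deployment repo for a cluster, but only if the
# cluster ID is in the allowlist (newline-separated IDs in the second argument).
derive_deploy_repo() {
  local cluster="$1" allowed="$2"
  # Assumed shape of a cluster ID; rejects slashes, dots, and other path tricks.
  [[ "$cluster" =~ ^[a-z0-9]+$ ]] || return 1
  # Literal, whole-line match against the allowlist.
  grep -qxF -- "$cluster" <<<"$allowed" || return 1
  printf 'glg/gds.clusterconfig.%s\n' "$cluster"
}

ALLOWED=$'i22\ni25\ni30'
derive_deploy_repo i22 "$ALLOWED"                      # prints glg/gds.clusterconfig.i22
derive_deploy_repo prod "$ALLOWED" || echo "rejected: prod"
```

Because the repo name is computed rather than configured, the only input a source repo controls is the cluster ID, and that value is useless unless the platform team has listed it.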

4. Central Orchestration Repo — glg/deploy-automation

Contains all deployment logic:

  • deploy-pr dispatch handler workflow
  • cleanup-pr dispatch handler workflow
  • Scheduled garbage collection workflow

Workflow Details

Source Repo Workflow Addition

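Each source repo adds two small jobs (trigger-deploy and trigger-cleanup) to its existing Docker build workflow and includes closed in the pull_request trigger types so that cleanup fires. This is the only per-repo setup required beyond .deploy.yml: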

# Added to the existing docker-build.yml workflow

on:
  pull_request:
    types: [opened, synchronize, reopened, closed]

jobs:
  build:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.tag.outputs.image_tag }}
    steps:
      # ... existing Docker build steps ...
      - name: Set image tag
        id: tag
        run: |
          SHORT_SHA=$(echo "${{ github.sha }}" | cut -c1-7)
          echo "image_tag=pr-${{ github.event.pull_request.number }}-${SHORT_SHA}" >> "$GITHUB_OUTPUT"
      # ... push to registry ...

  trigger-deploy:
    needs: build
    if: |
      github.event.action != 'closed'
      && github.event.pull_request.user.type == 'Bot'
    runs-on: ubuntu-latest
    steps:
      - name: Generate dispatcher token
        id: app-token
        uses: actions/create-github-app-token@v1  # pin to SHA in practice
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg
          repositories: deploy-automation

      - name: Trigger deployment
        uses: peter-evans/repository-dispatch@v3  # pin to SHA in practice
        with:
          token: ${{ steps.app-token.outputs.token }}
          repository: glg/deploy-automation
          event-type: deploy-pr
          client-payload: >-
            {
              "source_repo": "${{ github.repository }}",
              "pr_number": ${{ github.event.pull_request.number }},
              "pr_author": "${{ github.event.pull_request.user.login }}",
              "image_tag": "${{ needs.build.outputs.image_tag }}",
              "sha": "${{ github.sha }}",
              "default_branch": "${{ github.event.repository.default_branch }}"
            }

  trigger-cleanup:
    if: |
      github.event.action == 'closed'
      && github.event.pull_request.user.type == 'Bot'
    runs-on: ubuntu-latest
    steps:
      - name: Generate dispatcher token
        id: app-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg
          repositories: deploy-automation

      - name: Trigger cleanup
        uses: peter-evans/repository-dispatch@v3
        with:
          token: ${{ steps.app-token.outputs.token }}
          repository: glg/deploy-automation
          event-type: cleanup-pr
          client-payload: >-
            {
              "source_repo": "${{ github.repository }}",
              "pr_number": ${{ github.event.pull_request.number }}
            }
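
The tag built in the "Set image tag" step can be reproduced locally to check that it satisfies the format the deploy handler enforces. A minimal sketch, assuming a hypothetical function name make_image_tag and an illustrative 40-character SHA:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the "Set image tag" step: pr-<number>-<short sha>.
make_image_tag() {
  local pr_number="$1" sha="$2"
  # First seven characters of the commit SHA, equivalent to `cut -c1-7`.
  printf 'pr-%s-%s\n' "$pr_number" "${sha:0:7}"
}

# Illustrative SHA, not a real commit.
make_image_tag 42 abc1234def5678900000000000000000000000000   # prints pr-42-abc1234
```

One detail worth knowing: on pull_request events, github.sha refers to the temporary merge commit GitHub creates for the PR, not the head commit of the PR branch, so the short SHA in the tag may not match any commit visible in the PR. github.event.pull_request.head.sha holds the head commit if that is the value you want in the tag.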

Deploy-Automation: Deploy Handler

# glg/deploy-automation/.github/workflows/deploy-pr.yml
name: Deploy PR to Testing Cluster

on:
  repository_dispatch:
    types: [deploy-pr]

env:
  MAX_PR_DEPLOYMENTS: 3

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Extract payload
        id: payload
        env:
          # Untrusted payload fields go through env vars, never straight into the script text
          SOURCE_REPO: ${{ github.event.client_payload.source_repo }}
          PR_NUMBER: ${{ github.event.client_payload.pr_number }}
          PR_AUTHOR: ${{ github.event.client_payload.pr_author }}
          IMAGE_TAG: ${{ github.event.client_payload.image_tag }}
          SHA: ${{ github.event.client_payload.sha }}
          DEFAULT_BRANCH: ${{ github.event.client_payload.default_branch }}
        run: |
          echo "source_repo=${SOURCE_REPO}" >> "$GITHUB_OUTPUT"
          echo "pr_number=${PR_NUMBER}" >> "$GITHUB_OUTPUT"
          echo "pr_author=${PR_AUTHOR}" >> "$GITHUB_OUTPUT"
          echo "image_tag=${IMAGE_TAG}" >> "$GITHUB_OUTPUT"
          echo "sha=${SHA}" >> "$GITHUB_OUTPUT"
          echo "default_branch=${DEFAULT_BRANCH}" >> "$GITHUB_OUTPUT"

      # --- VALIDATION PHASE ---

      - name: Validate image tag format
        env:
          TAG: ${{ steps.payload.outputs.image_tag }}
        run: |
          if [[ ! "$TAG" =~ ^pr-[0-9]+-[a-f0-9]{7,40}$ ]]; then
            echo "::error::Invalid image tag format: $TAG"
            exit 1
          fi

      - name: Validate source repo is in org
        env:
          REPO: ${{ steps.payload.outputs.source_repo }}
        run: |
          # Anchor both ends so extra path segments cannot ride along after the repo name
          if [[ ! "$REPO" =~ ^glg/[A-Za-z0-9._-]+$ ]]; then
            echo "::error::Source repo is not in glg org: $REPO"
            exit 1
          fi

      - name: Generate dispatcher token (for reading configs)
        id: dispatcher-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg

      - name: Validate PR author is a bot
        id: pr-check  # an id is required for later steps to read this step's outputs
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          REPO="${{ steps.payload.outputs.source_repo }}"
          PR_NUM="${{ steps.payload.outputs.pr_number }}"

          PR_DATA=$(gh api "repos/${REPO}/pulls/${PR_NUM}" --jq '{type: .user.type, login: .user.login, state: .state}')
          USER_TYPE=$(echo "$PR_DATA" | jq -r '.type')
          USER_LOGIN=$(echo "$PR_DATA" | jq -r '.login')
          PR_STATE=$(echo "$PR_DATA" | jq -r '.state')

          if [[ "$PR_STATE" != "open" ]]; then
            echo "::error::PR #${PR_NUM} is not open (state: ${PR_STATE})"
            exit 1
          fi

          if [[ "$USER_TYPE" != "Bot" ]]; then
            echo "::error::PR author is not a bot (type: ${USER_TYPE})"
            exit 1
          fi

          echo "pr_author_login=${USER_LOGIN}" >> "$GITHUB_OUTPUT"

      - name: Fetch actor allowlist
        id: allowlist
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          ALLOWLIST=$(gh api "repos/glg/deploy-config/contents/actor_allowlist.yml" --jq '.content' | base64 -d)
          # Check the login returned by the GitHub API, not the unverified payload value
          AUTHOR="${{ steps.pr-check.outputs.pr_author_login }}"

          # Parse the list with yq and compare whole lines literally ([bot] is not a regex here)
          if ! echo "$ALLOWLIST" | yq '.allowed_actors[]' | grep -qxF -- "$AUTHOR"; then
            echo "::error::Actor '${AUTHOR}' is not in the allowlist"
            exit 1
          fi

          echo "Actor '${AUTHOR}' is in the allowlist"

      - name: Fetch .deploy.yml from default branch
        id: deploy-config
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
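        run: |
          REPO="${{ steps.payload.outputs.source_repo }}"

          # Resolve the default branch from the API rather than trusting the payload,
          # so a forged dispatch cannot point this step at a PR branch
          BRANCH=$(gh api "repos/${REPO}" --jq '.default_branch')

          CONFIG=$(gh api "repos/${REPO}/contents/.deploy.yml?ref=${BRANCH}" --jq '.content' | base64 -d)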

          CLUSTER=$(echo "$CONFIG" | yq '.cluster')
          SERVICE_PATH=$(echo "$CONFIG" | yq '.service_path')

          if [[ -z "$CLUSTER" || "$CLUSTER" == "null" ]]; then
            echo "::error::.deploy.yml is missing 'cluster' field"
            exit 1
          fi

          if [[ -z "$SERVICE_PATH" || "$SERVICE_PATH" == "null" ]]; then
            echo "::error::.deploy.yml is missing 'service_path' field"
            exit 1
          fi

          echo "cluster=${CLUSTER}" >> "$GITHUB_OUTPUT"
          echo "service_path=${SERVICE_PATH}" >> "$GITHUB_OUTPUT"

      - name: Validate cluster and resolve deployment repo
        id: cluster
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          CLUSTERS_CONFIG=$(gh api "repos/glg/deploy-config/contents/clusters.yml" --jq '.content' | base64 -d)
          CLUSTER="${{ steps.deploy-config.outputs.cluster }}"

          # Check cluster is in allowlist
          if ! echo "$CLUSTERS_CONFIG" | yq ".allowed_clusters[]" | grep -qxF "$CLUSTER"; then
            echo "::error::Cluster '${CLUSTER}' is not in the allowed clusters list"
            exit 1
          fi

          # Derive deployment repo from cluster ID (enforced naming convention)
          DEPLOY_REPO="glg/gds.clusterconfig.${CLUSTER}"

          echo "deploy_repo=${DEPLOY_REPO}" >> "$GITHUB_OUTPUT"

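      - name: Generate deploy-bot token
        id: deploy-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_BOT_APP_ID }}
          private-key: ${{ secrets.DEPLOY_BOT_PRIVATE_KEY }}
          owner: glg
          # `repositories` takes bare repo names (no owner prefix) when `owner` is set
          repositories: gds.clusterconfig.${{ steps.deploy-config.outputs.cluster }}

      - name: Check PR deployment count
        env:
          # The dispatcher app is not installed on deployment repos, so read with the deploy-bot token
          GH_TOKEN: ${{ steps.deploy-token.outputs.token }}
        run: |
          DEPLOY_REPO="${{ steps.cluster.outputs.deploy_repo }}"
          SERVICE_PATH="${{ steps.deploy-config.outputs.service_path }}"
          SERVICE_NAME=$(basename "$SERVICE_PATH")
          SERVICE_DIR=$(dirname "$SERVICE_PATH")
          PR_FOLDER="${SERVICE_NAME}-pr-${{ steps.payload.outputs.pr_number }}"

          # List sibling folders; an API failure fails the step instead of counting as zero
          NAMES=$(gh api "repos/${DEPLOY_REPO}/contents/${SERVICE_DIR}" --jq '.[].name')

          # Count PR deployment folders for this service, excluding this PR's own folder
          # so that a redeploy of an existing PR is not blocked by the limit
          EXISTING=$(echo "$NAMES" | grep "^${SERVICE_NAME}-pr-" | grep -cvxF -- "$PR_FOLDER" || true)

          if [[ "$EXISTING" -ge "$MAX_PR_DEPLOYMENTS" ]]; then
            echo "::error::Service '${SERVICE_NAME}' already has ${EXISTING} PR deployments (max: ${MAX_PR_DEPLOYMENTS})"
            exit 1
          fi

          echo "Current PR deployments for ${SERVICE_NAME}: ${EXISTING}"

      # --- DEPLOYMENT PHASE ---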

      - name: Generate orders folder and deploy
        env:
          GH_TOKEN: ${{ steps.deploy-token.outputs.token }}
        run: |
          DEPLOY_REPO="${{ steps.cluster.outputs.deploy_repo }}"
          SERVICE_PATH="${{ steps.deploy-config.outputs.service_path }}"
          SERVICE_NAME=$(basename "$SERVICE_PATH")
          SERVICE_DIR=$(dirname "$SERVICE_PATH")
          PR_NUMBER="${{ steps.payload.outputs.pr_number }}"
          IMAGE_TAG="${{ steps.payload.outputs.image_tag }}"
          SOURCE_REPO="${{ steps.payload.outputs.source_repo }}"
          PR_FOLDER="${SERVICE_NAME}-pr-${PR_NUMBER}"

          # Clone deployment repo
          git clone "https://x-access-token:${GH_TOKEN}@github.com/${DEPLOY_REPO}.git" deploy-repo
          cd deploy-repo

          git config user.name "glg-deploy-bot[bot]"
          git config user.email "glg-deploy-bot[bot]@users.noreply.github.com"

          # Copy existing service folder as base (or fail if it doesn't exist)
          if [[ ! -d "${SERVICE_PATH}" ]]; then
            echo "::error::Service path '${SERVICE_PATH}' does not exist in ${DEPLOY_REPO}"
            exit 1
          fi

          # Remove existing PR folder if it exists (update scenario)
          rm -rf "${SERVICE_DIR}/${PR_FOLDER}"

          # Copy and modify
          cp -r "${SERVICE_PATH}" "${SERVICE_DIR}/${PR_FOLDER}"

          # Update the dockerdeploy line in the orders file
          ORDERS_FILE="${SERVICE_DIR}/${PR_FOLDER}/orders"
          if [[ ! -f "$ORDERS_FILE" ]]; then
            echo "::error::No orders file found at ${ORDERS_FILE}"
            exit 1
          fi

          # Replace the dockerdeploy line's tag portion
          # Original: dockerdeploy github/glg/apollo-admin/main:latest
          # Updated:  dockerdeploy github/glg/apollo-admin/main:pr-42-abc1234
          sed -i.bak -E "s|(dockerdeploy [^:]+):.*|\1:${IMAGE_TAG}|" "$ORDERS_FILE"
          rm -f "${ORDERS_FILE}.bak"

          # Commit and push with retry for concurrent pushes
          git add -A
          git commit -m "deploy: ${SERVICE_NAME} pr-${PR_NUMBER} from ${SOURCE_REPO}#${PR_NUMBER}

          Source: ${SOURCE_REPO}#${PR_NUMBER}
          Image tag: ${IMAGE_TAG}
          Automated by glg/deploy-automation"

          MAX_RETRIES=3
          for i in $(seq 1 $MAX_RETRIES); do
            if git push origin main; then
              echo "Successfully deployed ${PR_FOLDER}"
              break
            fi
            if [[ $i -eq $MAX_RETRIES ]]; then
              echo "::error::Failed to push after ${MAX_RETRIES} retries"
              exit 1
            fi
            echo "Push failed, retrying (attempt $((i+1))/${MAX_RETRIES})..."
            git pull --rebase origin main
          done
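
The tag rewrite at the heart of the deployment step can be exercised in isolation on a scratch file. A minimal sketch, assuming a hypothetical wrapper named rewrite_orders_tag and an orders file containing only the dockerdeploy line quoted in the comments above; real orders files may contain other lines, which the substitution leaves untouched.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed substitution used in the deploy handler.
rewrite_orders_tag() {
  local orders_file="$1" image_tag="$2"
  # Re-check the tag so nothing unexpected is spliced into the sed expression.
  [[ "$image_tag" =~ ^pr-[0-9]+-[a-f0-9]{7,40}$ ]] || return 1
  # Replace everything after the first ':' on the dockerdeploy line with the new tag.
  sed -i.bak -E "s|(dockerdeploy [^:]+):.*|\1:${image_tag}|" "$orders_file"
  rm -f "${orders_file}.bak"
}

tmp=$(mktemp)
printf '%s\n' 'dockerdeploy github/glg/apollo-admin/main:latest' > "$tmp"
rewrite_orders_tag "$tmp" pr-42-abc1234
cat "$tmp"   # dockerdeploy github/glg/apollo-admin/main:pr-42-abc1234
```

Validating the tag inside the helper matters because the value is interpolated into the sed program itself; a tag containing | or & would otherwise change what the substitution does.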

Deploy-Automation: Cleanup Handler

# glg/deploy-automation/.github/workflows/cleanup-pr.yml
name: Cleanup PR Deployment

on:
  repository_dispatch:
    types: [cleanup-pr]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
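      - name: Extract payload
        id: payload
        env:
          # Untrusted payload fields go through env vars, never straight into the script text
          SOURCE_REPO: ${{ github.event.client_payload.source_repo }}
          PR_NUMBER: ${{ github.event.client_payload.pr_number }}
        run: |
          echo "source_repo=${SOURCE_REPO}" >> "$GITHUB_OUTPUT"
          echo "pr_number=${PR_NUMBER}" >> "$GITHUB_OUTPUT"

      - name: Validate source repo is in org
        env:
          REPO: ${{ steps.payload.outputs.source_repo }}
        run: |
          # Anchor both ends so extra path segments cannot ride along after the repo name
          if [[ ! "$REPO" =~ ^glg/[A-Za-z0-9._-]+$ ]]; then
            echo "::error::Source repo is not in glg org: $REPO"
            exit 1
          fi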

      - name: Generate dispatcher token
        id: dispatcher-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg

      - name: Fetch .deploy.yml from default branch
        id: deploy-config
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          REPO="${{ steps.payload.outputs.source_repo }}"

          # Get default branch
          DEFAULT_BRANCH=$(gh api "repos/${REPO}" --jq '.default_branch')

          CONFIG=$(gh api "repos/${REPO}/contents/.deploy.yml?ref=${DEFAULT_BRANCH}" --jq '.content' | base64 -d)
          CLUSTER=$(echo "$CONFIG" | yq '.cluster')
          SERVICE_PATH=$(echo "$CONFIG" | yq '.service_path')

          echo "cluster=${CLUSTER}" >> "$GITHUB_OUTPUT"
          echo "service_path=${SERVICE_PATH}" >> "$GITHUB_OUTPUT"

      - name: Resolve deployment repo
        id: cluster
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          CLUSTERS_CONFIG=$(gh api "repos/glg/deploy-config/contents/clusters.yml" --jq '.content' | base64 -d)
          CLUSTER="${{ steps.deploy-config.outputs.cluster }}"

          # Validate cluster is in allowlist
          if ! echo "$CLUSTERS_CONFIG" | yq ".allowed_clusters[]" | grep -qxF "$CLUSTER"; then
            echo "::error::Cluster '${CLUSTER}' is not in the allowed clusters list"
            exit 1
          fi

          # Derive deployment repo from cluster ID
          DEPLOY_REPO="glg/gds.clusterconfig.${CLUSTER}"

          echo "deploy_repo=${DEPLOY_REPO}" >> "$GITHUB_OUTPUT"

      - name: Generate deploy-bot token
        id: deploy-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_BOT_APP_ID }}
          private-key: ${{ secrets.DEPLOY_BOT_PRIVATE_KEY }}
          owner: glg
          # `repositories` takes bare repo names (no owner prefix) when `owner` is set
          repositories: gds.clusterconfig.${{ steps.deploy-config.outputs.cluster }}

      - name: Remove PR deployment folder
        env:
          GH_TOKEN: ${{ steps.deploy-token.outputs.token }}
        run: |
          DEPLOY_REPO="${{ steps.cluster.outputs.deploy_repo }}"
          SERVICE_PATH="${{ steps.deploy-config.outputs.service_path }}"
          SERVICE_NAME=$(basename "$SERVICE_PATH")
          SERVICE_DIR=$(dirname "$SERVICE_PATH")
          PR_NUMBER="${{ steps.payload.outputs.pr_number }}"
          SOURCE_REPO="${{ steps.payload.outputs.source_repo }}"
          PR_FOLDER="${SERVICE_NAME}-pr-${PR_NUMBER}"

          git clone "https://x-access-token:${GH_TOKEN}@github.com/${DEPLOY_REPO}.git" deploy-repo
          cd deploy-repo

          git config user.name "glg-deploy-bot[bot]"
          git config user.email "glg-deploy-bot[bot]@users.noreply.github.com"

          TARGET="${SERVICE_DIR}/${PR_FOLDER}"
          if [[ ! -d "$TARGET" ]]; then
            echo "PR deployment folder '${TARGET}' does not exist, nothing to clean up"
            exit 0
          fi

          rm -rf "$TARGET"
          git add -A
          git commit -m "cleanup: remove ${PR_FOLDER} (${SOURCE_REPO}#${PR_NUMBER} closed)

          Source: ${SOURCE_REPO}#${PR_NUMBER}
          Automated by glg/deploy-automation"

          MAX_RETRIES=3
          for i in $(seq 1 $MAX_RETRIES); do
            if git push origin main; then
              echo "Successfully cleaned up ${PR_FOLDER}"
              break
            fi
            if [[ $i -eq $MAX_RETRIES ]]; then
              echo "::error::Failed to push after ${MAX_RETRIES} retries"
              exit 1
            fi
            echo "Push failed, retrying..."
            git pull --rebase origin main
          done
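
The push-and-rebase retry loop appears in all three workflows and could be factored into one function. The sketch below is an illustration under stated assumptions: retry_with_recovery, fake_push, and fake_rebase are hypothetical names, and the fakes stand in for git push origin main and git pull --rebase origin main so the logic can run without a repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run ACTION up to MAX times, running RECOVERY between attempts.
# ACTION and RECOVERY are names of shell functions (or single-word commands).
retry_with_recovery() {
  local max="$1" action="$2" recovery="$3" attempt
  for ((attempt = 1; attempt <= max; attempt++)); do
    if "$action"; then
      return 0
    fi
    # No point recovering after the final attempt.
    (( attempt == max )) && break
    "$recovery" || return 1
  done
  return 1
}

# Stand-ins for the git commands: the fake push fails twice, then succeeds.
attempts=0
fake_push() { attempts=$((attempts + 1)); (( attempts >= 3 )); }
fake_rebase() { echo "rebasing before retry"; }

retry_with_recovery 3 fake_push fake_rebase && echo "pushed after ${attempts} attempts"
```

Unlike the inline loops, this version also stops if the recovery step itself fails, for example when a rebase hits a conflict, instead of pushing again from a broken state.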

Deploy-Automation: Scheduled Garbage Collection

# glg/deploy-automation/.github/workflows/gc.yml
name: Garbage Collect Stale PR Deployments

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am UTC
  workflow_dispatch: {}    # Allow manual trigger

jobs:
  gc:
    runs-on: ubuntu-latest
    steps:
      - name: Generate dispatcher token
        id: dispatcher-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DISPATCHER_APP_ID }}
          private-key: ${{ secrets.DISPATCHER_PRIVATE_KEY }}
          owner: glg

      - name: Generate deploy-bot token
        id: deploy-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_BOT_APP_ID }}
          private-key: ${{ secrets.DEPLOY_BOT_PRIVATE_KEY }}
          owner: glg

      - name: Fetch cluster config
        id: config
        env:
          GH_TOKEN: ${{ steps.dispatcher-token.outputs.token }}
        run: |
          gh api "repos/glg/deploy-config/contents/clusters.yml" --jq '.content' | base64 -d > clusters.yml

      - name: Scan and clean stale deployments
        env:
          GH_TOKEN_READ: ${{ steps.dispatcher-token.outputs.token }}
          GH_TOKEN_WRITE: ${{ steps.deploy-token.outputs.token }}
        run: |
          ORPHANS_FOUND=0

          # Iterate over each allowed cluster and derive deployment repo
          for CLUSTER_ID in $(yq '.allowed_clusters[]' clusters.yml); do
            DEPLOY_REPO="glg/gds.clusterconfig.${CLUSTER_ID}"
            echo "Scanning ${DEPLOY_REPO}..."

            # List all directories that match the *-pr-<number> pattern.
            # Deployment repos are read with the deploy-bot token because the dispatcher
            # app is not installed on them. This is a simplified scan; adjust it to
            # your actual directory structure.
            # Note the `--`: without it, grep parses the leading '-pr-...' as options.
            DIRS=$(GH_TOKEN="$GH_TOKEN_WRITE" gh api "repos/${DEPLOY_REPO}/git/trees/main?recursive=1" \
              --jq '.tree[] | select(.type == "tree") | .path' \
              | grep -E -- '-pr-[0-9]+$' || true)

            for DIR in $DIRS; do
              # Extract service name and PR number from folder name
              FOLDER_NAME=$(basename "$DIR")
              PR_NUM=$(echo "$FOLDER_NAME" | grep -oE 'pr-[0-9]+$' | sed 's/pr-//')

              if [[ -z "$PR_NUM" ]]; then
                continue
              fi

              # We need to find which source repo this came from.
              # Check the last commit message on this folder for the source repo reference.
              COMMIT_MSG=$(GH_TOKEN="$GH_TOKEN_WRITE" gh api "repos/${DEPLOY_REPO}/commits?path=${DIR}&per_page=1" \
                --jq '.[0].commit.message' 2>/dev/null || true)

              SOURCE_REPO=$(echo "$COMMIT_MSG" | grep -oE 'glg/[^ #]+' | head -1 || true)

              if [[ -z "$SOURCE_REPO" ]]; then
                echo "  WARNING: Could not determine source repo for ${DIR}, skipping"
                continue
              fi

              # Check if the PR is still open
              PR_STATE=$(GH_TOKEN="$GH_TOKEN_READ" gh api "repos/${SOURCE_REPO}/pulls/${PR_NUM}" \
                --jq '.state' 2>/dev/null || echo "not_found")

              if [[ "$PR_STATE" == "open" ]]; then
                echo "  ${DIR}: PR #${PR_NUM} still open, keeping"
                continue
              fi

              echo "  ${DIR}: PR #${PR_NUM} is ${PR_STATE}, removing"
              ORPHANS_FOUND=$((ORPHANS_FOUND + 1))

              # Clone, remove, commit, push
              TEMP_DIR=$(mktemp -d)
              GH_TOKEN="$GH_TOKEN_WRITE" git clone "https://x-access-token:${GH_TOKEN_WRITE}@github.com/${DEPLOY_REPO}.git" "$TEMP_DIR"
              cd "$TEMP_DIR"
              git config user.name "glg-deploy-bot[bot]"
              git config user.email "glg-deploy-bot[bot]@users.noreply.github.com"

              rm -rf "$DIR"
              git add -A
              git commit -m "gc: remove stale deployment ${FOLDER_NAME} (${SOURCE_REPO}#${PR_NUM} ${PR_STATE})

          Automated garbage collection by glg/deploy-automation"

              for i in 1 2 3; do
                if git push origin main; then
                  break
                fi
                git pull --rebase origin main
              done

              cd -
              rm -rf "$TEMP_DIR"
            done
          done

          echo "Garbage collection complete. Orphans removed: ${ORPHANS_FOUND}"

          if [[ "$ORPHANS_FOUND" -gt 0 ]]; then
            echo "::warning::Removed ${ORPHANS_FOUND} stale PR deployment(s)"
          fi
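
The two parsing steps the GC job depends on, folder name to PR number and commit message to source repo, can be checked without touching the API. A minimal sketch, assuming hypothetical function names and the commit subject format written by the deploy handler ("deploy: {service} pr-{number} from {source_repo}#{pr}"):

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the PR number from a folder such as apollo-admin-pr-42.
pr_number_from_folder() {
  local re='-pr-([0-9]+)$'
  [[ "$1" =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

# Hypothetical helper: extract glg/<repo> from the first line of a deploy commit message.
source_repo_from_commit() {
  local subject="${1%%$'\n'*}"            # keep only the subject line
  local re='from (glg/[A-Za-z0-9._-]+)#[0-9]+'
  [[ "$subject" =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

pr_number_from_folder "apollo-admin-pr-42"                                      # prints 42
source_repo_from_commit "deploy: apollo-admin pr-42 from glg/apollo-admin#42"   # prints glg/apollo-admin
```

Matching only the subject line avoids picking up glg/deploy-automation from the "Automated by" trailer, which the looser grep over the whole message relies on head -1 to skip.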

Security Threat Analysis

Threats and Mitigations

| # | Threat | Severity | Mitigation |
| --- | --- | --- | --- |
| 1 | App key compromise | CRITICAL | Two-app architecture. Source repos only hold the dispatcher key (read-only). Deploy-bot key lives only in deploy-automation repo. Even if dispatcher key leaks, attacker cannot write to deployment repos. |
| 2 | Bot actor spoofing | HIGH | Double validation: user.type == "Bot" (GitHub-controlled field) AND exact-match against actor_allowlist.yml in locked-down config repo. |
| 3 | Malicious .deploy.yml | HIGH | Always read from the default branch, never the PR branch. Cluster validated against allowlist. Deployment repo derived from the validated cluster ID by naming convention, never read from .deploy.yml. |
| 4 | Deployment flooding / DoS | MEDIUM | Max 3 active PR deployments per service. Enforced in validation phase before any write occurs. |
| 5 | Command injection via PR content | MEDIUM | Image tag validated against strict regex ^pr-\d+-[a-f0-9]{7,40}$. All PR-derived values passed through environment variables, not string interpolation. |
| 6 | Race conditions in deployment repo | LOW-MED | Retry loop with git pull --rebase on push failure (up to 3 attempts). |
| 7 | GitHub App over-permissioning | MEDIUM | Deploy-bot installed only on gds.clusterconfig.* deployment repos. Dispatcher installed on source repos + config repos. Neither has more access than needed. |
| 8 | Stale deployments from cleanup failures | LOW-MED | Daily cron GC scans all deployment repos, cross-references with PR state, removes orphaned folders. Warns via GitHub Actions annotations. |
| 9 | Compromised shared workflow | HIGH | Mitigated by the dispatch pattern: there is no reusable workflow called by source repos. All logic lives in deploy-automation which is protected by branch protection and CODEOWNERS. Source repos only send a dispatch event. |

Security Properties

  • Source repos never hold deployment write credentials — they only have the dispatcher app key which can read and dispatch, never write
  • .deploy.yml is read from the default branch — PR authors cannot tamper with cluster targeting
  • Cluster allowlist is in a separate locked-down repo — only the platform team can modify what clusters are targetable
  • Actor allowlist is centrally managed — adding a new bot type requires platform team review
  • All validation happens in deploy-automation — source repos have no say in what gets deployed where beyond their merged .deploy.yml
  • Rate limited — max 3 concurrent PR deployments per service
  • Self-healing — scheduled GC catches any cleanup failures
  • No reusable workflow to compromise — the dispatch pattern means source repos never reference or run deploy-automation code directly

Validation Checklist

Every deployment must pass ALL of these checks:

| # | Check | Prevents |
| --- | --- | --- |
| 1 | PR exists and is open | Stale/invalid dispatch payloads |
| 2 | pr.user.type == "Bot" | Human PRs triggering deploys |
| 3 | pr.user.login in actor_allowlist.yml | Unknown bots triggering deploys |
| 4 | .deploy.yml read from default branch | PR branch tampering with config |
| 5 | cluster in allowed_clusters | Deploying to production |
| 6 | deployment_repo derived from glg/gds.clusterconfig.{cluster_id} convention | Arbitrary repo targeting |
| 7 | image_tag matches ^pr-\d+-[a-f0-9]{7,40}$ | Command injection via tag |
| 8 | Active PR deployments for service < 3 | Resource exhaustion / flooding |
| 9 | Source repo belongs to the org (^glg/) | Cross-org abuse |
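
The checks that need no API call (7 and 9, plus a numeric PR number) can be grouped into one gate that reads its inputs from environment variables. This is a sketch under stated assumptions: validate_payload is a hypothetical name, the repo-name character set is an assumed approximation of what GitHub permits, and the requirement that the PR number embedded in the tag equal the payload's PR number is an extra consistency check that the checklist above does not include.

```shell
#!/usr/bin/env bash
# Hypothetical gate: validates SOURCE_REPO, PR_NUMBER and IMAGE_TAG from the environment.
validate_payload() {
  local repo_re='^glg/[A-Za-z0-9._-]+$'          # assumed repo-name character set
  local tag_re='^pr-([0-9]+)-[a-f0-9]{7,40}$'
  [[ "${SOURCE_REPO:-}" =~ $repo_re ]] || { echo "bad source_repo" >&2; return 1; }
  [[ "${PR_NUMBER:-}" =~ ^[0-9]+$ ]]   || { echo "bad pr_number" >&2; return 1; }
  [[ "${IMAGE_TAG:-}" =~ $tag_re ]]    || { echo "bad image_tag" >&2; return 1; }
  # Extra check (not in the checklist): the tag must belong to this PR.
  [[ "${BASH_REMATCH[1]}" == "$PR_NUMBER" ]] || { echo "tag/PR mismatch" >&2; return 1; }
}

export SOURCE_REPO="glg/apollo-admin" PR_NUMBER="42" IMAGE_TAG="pr-42-abc1234"
validate_payload && echo "payload ok"
IMAGE_TAG='pr-42-abc1234; rm -rf /' validate_payload || echo "payload rejected"
```

Reading values from the environment keeps them as data: the shell never parses them as script text, which is the property threat 5 relies on.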

Setup Checklist

One-Time Org Setup

  • Create glg-deploy-dispatcher GitHub App
    • Permissions: contents: read, metadata: read
    • Install on: all source repos + deploy-automation + deploy-config
  • Create glg-deploy-bot GitHub App
    • Permissions: contents: write, metadata: read
    • Install on: gds.clusterconfig.* deployment repos only
  • Create glg/deploy-automation repo
    • Add repo secrets: DEPLOY_BOT_APP_ID, DEPLOY_BOT_PRIVATE_KEY, DISPATCHER_APP_ID, DISPATCHER_PRIVATE_KEY
    • Add the three workflows: deploy-pr.yml, cleanup-pr.yml, gc.yml
    • Enable branch protection on main
  • Create glg/deploy-config repo
    • Add clusters.yml and actor_allowlist.yml
    • Enable branch protection: require 2 reviewers
    • Add CODEOWNERS: @glg/platform-team
  • Add org secrets scoped to source repos:
    • DISPATCHER_APP_ID
    • DISPATCHER_PRIVATE_KEY

Per-Repo Setup (Done by Each Team)

  • Add .deploy.yml to the repo's default branch
  • Add trigger-deploy and trigger-cleanup jobs to existing Docker build workflow