
@oleander
Created January 10, 2026 15:31
Kubernetes Rails Deployment Specifications - OVH Migration

Design Document: Kubernetes Rails Deployment

Overview

This design describes the architecture for deploying a Rails application on OVH Kubernetes with separate containers for the web server, delayed job workers, and Kafka consumers. The solution uses Docker multi-stage builds to create separate container images for each process type from a single Dockerfile. Each build target contains the appropriate CMD to start its specific process in the foreground with proper signal handling, health checks, and JSON logging for Logz.io integration.

Migration from ECS Build Server Model

Previous Deployment Architecture (Deprecated)

The previous deployment model had a disconnected build and deploy process:

Build Process:

  1. Developer merges code to master
  2. .github/workflows/cd.yml builds playwright target → pushed to GHCR (never used)
  3. Operator manually SSHs to build server in storecove-app-docker directory
  4. Runs ./build-deploy -s true -a true -b master script which:
    • Clones fresh copy of datajust repo
    • Builds using storecove-app-docker/production/Dockerfile
    • Copies assets to S3 CDN
    • Pushes to AWS ECR

Deploy Process:

  1. Script runs aws ecs update-service --force-new-deployment for 3 services
  2. ECS pulls latest image from ECR
  3. Starts new tasks with monolithic container

Problems with this approach:

  • Manual intervention required
  • CI builds wasted (never deployed)
  • Different Dockerfile for production vs. CI
  • Build-from-scratch on every deploy (slow)
  • Assets managed separately on S3

New OVH Kubernetes Architecture

The new model unifies build and deploy into one automated workflow:

Build & Deploy Process:

  1. Developer merges code to master
  2. .github/workflows/deploy.yml automatically triggers
  3. Builds 5 Docker targets from datajust/Dockerfile
  4. Pushes all images to OVH Container Registry
  5. Runs migrations using rails target
  6. Applies Kubernetes manifests
  7. Kubernetes performs rolling update

Benefits:

  • ✅ Fully automated
  • ✅ Same Dockerfile for all environments
  • ✅ Images built once, used everywhere
  • ✅ Assets served from container
  • ✅ Faster builds (layer caching)
  • ✅ No manual SSH required

Deprecated Components

| Component | Status | Replacement |
| --- | --- | --- |
| storecove-app-docker/production/Dockerfile | Deprecated | datajust/Dockerfile with multiple targets |
| storecove-app-docker/production/build-deploy | Deprecated | .github/workflows/deploy.yml |
| .github/workflows/cd.yml GHCR images | Deprecated for prod | .github/workflows/deploy.yml OVH images |
| Manual ECS service restart | Deprecated | Automatic Kubernetes rolling update |
| S3 CDN for assets | Deprecated | Assets served from container |
| Container-level cron (whenever gem) | Deprecated | Kubernetes CronJobs |

Architecture

graph TB
    subgraph "OVH Kubernetes Cluster"
        subgraph "Web Tier"
            WEB1[Rails Server Pod 1]
            WEB2[Rails Server Pod 2]
            WEB3[Rails Server Pod N]
        end
        
        subgraph "Worker Tier"
            DJ1[Worker Primary Pod 1]
            DJ2[Worker Primary Pod N]
            DJ3[Worker Secondary Pod 1]
            DJ4[Worker Secondary Pod N]
        end
        
        subgraph "Kafka Consumer Tier"
            KS[Sending Status Consumer]
            KN[New Document Consumer]
            KR[Received Status Consumer]
        end
        
        subgraph "Scheduled Tasks"
            CRON[Kubernetes CronJobs]
        end
        
        subgraph "Logging"
            FB[Fluent Bit DaemonSet]
        end
        
        ING[Ingress Controller]
        SVC[Kubernetes Service]
    end
    
    DB[(Database)]
    KAFKA[Kafka Brokers]
    LOGZ[Logz.io]
    ROLLBAR[Rollbar]
    
    ING --> SVC
    SVC --> WEB1
    SVC --> WEB2
    SVC --> WEB3
    
    WEB1 --> DB
    DJ1 --> DB
    DJ2 --> DB
    DJ3 --> DB
    DJ4 --> DB
    CRON --> DB
    
    KS --> KAFKA
    KN --> KAFKA
    KR --> KAFKA
    
    FB --> LOGZ
    
    WEB1 -.-> FB
    DJ1 -.-> FB
    KS -.-> FB

Build and Deploy Flow

flowchart TD
    subgraph "Docker Build"
        BASE[Base Stage] --> APP[App Base Stage]
        APP --> RAILS[rails target]
        APP --> WORKER[worker target]
        APP --> KS[kafka-sending-status target]
        APP --> KN[kafka-new-document target]
        APP --> KR[kafka-received-status target]
    end
    
    subgraph "Container Registry"
        RAILS --> IMG_RAILS[storecove-app:rails-latest]
        WORKER --> IMG_WORKER[storecove-app:worker-latest]
        KS --> IMG_KS[storecove-app:kafka-sending-status-latest]
        KN --> IMG_KN[storecove-app:kafka-new-document-latest]
        KR --> IMG_KR[storecove-app:kafka-received-status-latest]
    end
    
    subgraph "Kubernetes Deployments"
        IMG_RAILS --> DEP_RAILS[rails-server Deployment]
        IMG_WORKER --> DEP_WORKER1[worker-primary Deployment]
        IMG_WORKER --> DEP_WORKER2[worker-secondary Deployment]
        IMG_KS --> DEP_KS[kafka-sending-status Deployment]
        IMG_KN --> DEP_KN[kafka-new-document Deployment]
        IMG_KR --> DEP_KR[kafka-received-status Deployment]
    end
    
    subgraph "Kubernetes CronJobs"
        IMG_RAILS --> CRON1[scheduled-task-1 CronJob]
        IMG_RAILS --> CRON2[scheduled-task-2 CronJob]
        IMG_RAILS --> CRONN[scheduled-task-N CronJob]
    end

Components and Interfaces

1. Dockerfile Multi-Stage Build Targets

The Dockerfile uses multi-stage builds to create optimized images for each component type from a shared base.

# syntax=docker/dockerfile:1-labs
# Base stage with all dependencies
FROM ubuntu:focal AS base

ARG BUNDLER_VERSION=2.6.8
ENV BUNDLER_VERSION=${BUNDLER_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ENV BUNDLE_PATH=/cache/bundle
ENV YARN_CACHE_FOLDER=/cache/yarn
ENV BUNDLE_SILENCE_ROOT_WARNING=1

SHELL ["/bin/bash", "-l", "-c"]

# ... (existing base setup: apt packages, RVM, Ruby, Node.js, etc.) ...

WORKDIR /app

# Ruby dependencies stage
FROM base AS ruby-deps
USER app
COPY --chown=app:sudo Gemfile Gemfile.lock ./
RUN bash -lc "bundle install"

# Node dependencies stage
FROM base AS node-deps
USER app
COPY --chown=app:sudo package.json yarn.lock ./
RUN bash -lc "yarn install --frozen-lockfile"

# Application base with all code and assets
FROM base AS app-base
USER app

COPY --chown=app:sudo Gemfile Gemfile.lock ./
COPY --from=ruby-deps /cache/bundle /cache/bundle

COPY --chown=app:sudo package.json yarn.lock ./
COPY --from=node-deps /cache/yarn /cache/yarn

COPY --chown=app:sudo . .

RUN yarn install --frozen-lockfile
# Scope the dummy secret to the precompile step only; a persistent ENV value
# (even "0") would still be present in the final image
RUN bash -lc "SECRET_KEY_BASE_DUMMY=1 bundle exec rails assets:precompile"

# Create log directory
RUN mkdir -p /app/log

# ===== Rails Server Target =====
FROM app-base AS rails
EXPOSE 3000
ENV RAILS_SERVE_STATIC_FILES=true
ENV RAILS_LOG_TO_STDOUT=true
ENV PROCESS_TARGET=server
CMD ["bash", "-lc", "bundle exec rails server -b 0.0.0.0 -p 3000"]

# ===== Delayed Job Worker Target =====
# Pool configuration passed via DELAYED_JOB_POOLS environment variable
# Example: DELAYED_JOB_POOLS="--pool=mail:1 --pool=slack:2"
FROM app-base AS worker
EXPOSE 3001
ENV PROCESS_TARGET=worker
ENV DELAYED_JOB_POOLS=""
ENV DELAYED_JOB_TIMEOUT=280
COPY --chown=app:sudo scripts/health_server.rb /scripts/health_server.rb
CMD ["bash", "-lc", "HEALTH_PORT=3001 ruby /scripts/health_server.rb & exec bundle exec bin/delayed_job run --timeout=${DELAYED_JOB_TIMEOUT} $DELAYED_JOB_POOLS"]

# ===== Kafka Sending Status Consumer =====
FROM app-base AS kafka-sending-status
EXPOSE 3002
ENV PROCESS_TARGET=kafka-sending-status
COPY --chown=app:sudo scripts/health_server.rb /scripts/health_server.rb
CMD ["bash", "-lc", "HEALTH_PORT=3002 ruby /scripts/health_server.rb & exec bundle exec racecar --group-id \"$KAFKA_SENDINGSTATUSUPDATE_CONSUMER_GROUP_ID\" --sasl-username \"$KAFKA_SENDINGSTATUSUPDATE_CONSUMER_USERNAME\" --sasl-password \"$KAFKA_SENDINGSTATUSUPDATE_CONSUMER_PASSWORD\" Kafka::Consumers::SendingActionStatusUpdateConsumer"]

# ===== Kafka New Document Consumer =====
FROM app-base AS kafka-new-document
EXPOSE 3003
ENV PROCESS_TARGET=kafka-new-document
COPY --chown=app:sudo scripts/health_server.rb /scripts/health_server.rb
CMD ["bash", "-lc", "HEALTH_PORT=3003 ruby /scripts/health_server.rb & exec bundle exec racecar --group-id \"$KAFKA_NEWDOCUMENTNOTIFICATION_CONSUMER_GROUP\" --sasl-username \"$KAFKA_NEWDOCUMENTNOTIFICATION_CONSUMER_USERNAME\" --sasl-password \"$KAFKA_NEWDOCUMENTNOTIFICATION_CONSUMER_PASSWORD\" Kafka::Consumers::NewDocumentNotificationConsumer"]

# ===== Kafka Received Status Consumer =====
FROM app-base AS kafka-received-status
EXPOSE 3004
ENV PROCESS_TARGET=kafka-received-status
COPY --chown=app:sudo scripts/health_server.rb /scripts/health_server.rb
CMD ["bash", "-lc", "HEALTH_PORT=3004 ruby /scripts/health_server.rb & exec bundle exec racecar --group-id \"$KAFKA_RECEIVEDDOCUMENTSTATUS_CONSUMER_GROUP\" --sasl-username \"$KAFKA_RECEIVEDDOCUMENTSTATUS_CONSUMER_USERNAME\" --sasl-password \"$KAFKA_RECEIVEDDOCUMENTSTATUS_CONSUMER_PASSWORD\" Kafka::Consumers::ReceivedDocumentStatusConsumer"]

Worker Pool Configuration

The worker Docker target uses a configurable DELAYED_JOB_POOLS environment variable, allowing different Kubernetes deployments to run different queue pools from the same image.

Pool Groups

DELAYED_JOB_POOLS value per deployment:

  • worker-primary: --pool=mail:1 --pool=inboundpeppol,inboundpeppolemail,inboundsftp,inboundublemail,inboundpartneremail:4 --pool=ses_notifications,ses_mail,sar_mail,edi_smtp,edi_as2,ses_mail_in_out:2 --pool=vatcalc_out_out_live,vatcalc_out_out_pilot:1 --pool=analyze_action,invoice_analyzer,slack,apply_action:1 --pool=document_submissions:2
  • worker-secondary: --pool=smp_phoss:8 --pool=aruba_out_out_prod,aruba_out_out_pilot,aruba_out_out_webhooks_pilot,aruba_out_out_webhooks_prod:1 --pool=chargebee_webhook_events,exactsales_webhook_events,storecove_webhook_events:1 --pool=outgoing_webhooks,outgoing_webhooks_sandbox:4 --pool=outgoing_webhooks_asia,outgoing_webhooks_sandbox_asia:4 --pool=exact_worker,snelstart_worker,sftp_worker,as2_worker:1 --pool=received_documents,aruba_in_in_webhooks:1 --pool=storecove_api_self:3 --pool=active_storage_analysis,active_storage_mirror,active_storage_preview,active_storage_purge:1 --pool=kafka_sending_actions_status_update,kafka_received_document_status,kafka_new_document_notification:12 --pool=meta_events,exceptions,aruba_admin:1 --pool=customer_reporting:1 --pool=my_lhdnm_poller:6

This allows:

  • Independent scaling of pool groups
  • Single Docker image for all workers
  • Easy adjustment of pool assignments via K8s manifests
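To make the `--pool=queues:count` convention concrete, here is a hypothetical parser (illustrative only; delayed_job itself consumes the real flags): each `--pool=queue1,queue2:N` token assigns N worker processes to a set of queues.

```ruby
# Illustrative parser for the "--pool=queues:count" convention used above.
# Hypothetical code for this document; delayed_job parses the real flags.
def parse_pools(spec)
  spec.scan(/--pool=([\w,]+):(\d+)/).map do |queues, count|
    { queues: queues.split(','), workers: Integer(count) }
  end
end

pools = parse_pools("--pool=mail:1 --pool=inboundpeppol,inboundsftp:4")
# pools == [{ queues: ["mail"], workers: 1 },
#           { queues: ["inboundpeppol", "inboundsftp"], workers: 4 }]
```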

Puma Configuration

The Rails application uses Puma in single-process, multi-threaded mode (workers commented out in config/puma.rb). This is intentional for the Kubernetes deployment:

  • Horizontal Scaling: Multiple pods provide process-level isolation and fault tolerance
  • Simpler Failure Mode: If a pod crashes, only one replica is affected
  • Resource Predictability: Each pod uses consistent resources (no worker forking)
  • Thread Pool: Each pod uses 5 threads (configurable via RAILS_MAX_THREADS)

Production Configuration:

# config/puma.rb
threads 5, 5  # Default: 5 threads per pod
# workers disabled - scaling via Kubernetes replicas instead

Environment Variables:

  • RAILS_MAX_THREADS - Max threads per pod (default: 5)
  • RAILS_MIN_THREADS - Min threads per pod (default: 5)
  • WEB_CONCURRENCY - Not used (workers disabled)
  • DB_POOL - ActiveRecord connection pool size (should match RAILS_MAX_THREADS)

ActiveRecord Connection Pooling: The ActiveRecord connection pool size should match the Puma thread count to avoid connection exhaustion. In config/database.yml:

production:
  primary:
    adapter: mysql2
    pool: <%= ENV.fetch("DB_POOL") { ENV.fetch("RAILS_MAX_THREADS") { 5 } } %>
    # ... other settings ...

For the Rails server with 5 threads per pod and 2-10 replicas, total connections = 5 threads × 10 pods = 50 connections maximum.
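The same arithmetic applies to any tier; a quick sketch (pod counts assumed from this document) of the cluster-wide worst case:

```ruby
# Each process opens at most DB_POOL connections (matched to thread count),
# so the worst case is pool size x process count, summed over tiers.
def max_db_connections(tiers, pool_size: 5)
  tiers.sum { |_, pods| pool_size * pods }
end

rails_only = max_db_connections({ "rails-server" => 10 })  # 5 threads x 10 pods
# rails_only == 50
```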

Racecar/Kafka Configuration

Racecar consumers must log to STDOUT for Fluent Bit collection:

# config/initializers/racecar.rb (update for Kubernetes)
Racecar.configure do |config|
  # ... existing config ...
  
  # Change from file logging to STDOUT
  config.logfile = STDOUT
  
  # Use Rails logger for consistent JSON formatting
  config.logger = Rails.logger if Rails.logger
  
  # Offset commit configuration for graceful shutdown
  config.offset_commit_interval = 10  # Commit every 10 seconds (default)
  config.offset_commit_threshold = 0  # Or commit after every message for max safety
  
  # ... rest of config ...
end

SIGTERM Handling: Racecar handles SIGTERM gracefully by default, committing offsets before shutdown.

Signal Handling and Process Management

CMD with bash -lc and exec: The design uses CMD ["bash", "-lc", "exec bundle exec <command>"] which:

  1. Starts bash as PID 1
  2. The exec keyword replaces bash with the actual process
  3. The actual process (Puma, delayed_job, racecar) receives SIGTERM directly
  4. All three processes handle SIGTERM gracefully by default:
    • Puma: Stops accepting new connections, completes in-flight requests
    • delayed_job: Completes current job within timeout, or leaves in queue
    • Racecar: Commits offsets and disconnects cleanly

Health Server Background Process: The health server runs as a background process (&), so it does not receive the SIGTERM delivered to the main process. This is acceptable because:

  • When the main process exits (delayed_job or racecar), the container exits
  • Kubernetes detects the container exit and restarts it
  • The health server is supplementary; container exit is the primary failure detection

Container Exit on Process Failure: If the main process crashes:

  1. Container exits with non-zero code
  2. Kubernetes detects exit via container state
  3. Liveness probe subsequently fails
  4. Kubernetes restarts the container per the restart policy
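The graceful-shutdown behavior described above can be demonstrated with a minimal Ruby sketch (a stand-in loop, not the real Puma/delayed_job/racecar handlers): the child traps SIGTERM, drains its work, and exits cleanly.

```ruby
# Minimal stand-in for a process that exits cleanly on SIGTERM.
child = fork do
  shutdown = false
  Signal.trap('TERM') { shutdown = true }   # graceful-shutdown flag
  sleep 0.05 until shutdown                 # stand-in for the work loop
  exit 0                                    # clean exit after draining
end

sleep 0.2                       # give the child time to install the trap
Process.kill('TERM', child)     # what Kubernetes sends on pod termination
_, status = Process.wait2(child)
status.exitstatus               # 0 on a clean shutdown
```

If the child ignored SIGTERM instead, Kubernetes would follow up with SIGKILL after terminationGracePeriodSeconds, which is why the grace periods in the resource table are sized per process type.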

2. Health Check Server (health_server.rb)

A lightweight WEBrick server for worker and Kafka consumer health checks:

#!/usr/bin/env ruby
require 'webrick'
require 'json'
require 'time' # Time#iso8601 is provided by the time stdlib

PROCESS_TARGET = ENV.fetch('PROCESS_TARGET', 'unknown')
HEALTH_PORT = ENV.fetch('HEALTH_PORT', 3001).to_i

# Only load Rails for workers that need DB checks
if PROCESS_TARGET.start_with?('worker')
  require_relative '/app/config/environment'
end

server = WEBrick::HTTPServer.new(Port: HEALTH_PORT, Logger: WEBrick::Log.new("/dev/null"), AccessLog: [])

server.mount_proc '/health' do |req, res|
  begin
    if PROCESS_TARGET.start_with?('worker')
      # Workers check database connectivity
      ActiveRecord::Base.connection.execute("SELECT 1")
    end
    # Kafka consumers just check process is alive (per requirements)
    
    res.status = 200
    res.content_type = 'application/json'
    res.body = { 
      status: 'healthy', 
      process_target: PROCESS_TARGET, 
      pod_name: ENV.fetch('POD_NAME', 'unknown'),
      namespace: ENV.fetch('POD_NAMESPACE', 'default'),
      timestamp: Time.now.iso8601 
    }.to_json
  rescue => e
    res.status = 503
    res.content_type = 'application/json'
    res.body = { status: 'unhealthy', process_target: PROCESS_TARGET, error: e.message, timestamp: Time.now.iso8601 }.to_json
  end
end

server.mount_proc '/ready' do |req, res|
  res.status = 200
  res.content_type = 'application/json'
  res.body = { status: 'ready', process_target: PROCESS_TARGET }.to_json
end

trap('INT') { server.shutdown }
trap('TERM') { server.shutdown }

server.start

3. Rails Health Check Controller

For the web server, health checks are handled by a Rails controller. Note: Liveness checks process health only (no DB), while readiness checks DB connectivity.

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  skip_before_action :authenticate_user!, raise: false
  
  # Liveness: Is the process alive? (Don't check DB - restarting won't help if DB is down)
  def liveness
    render json: { status: 'alive', process_target: ENV.fetch('PROCESS_TARGET', 'server'), timestamp: Time.current.iso8601 }, status: :ok
  end
  
  # Readiness: Can it serve traffic? (Check DB connectivity)
  def readiness
    ActiveRecord::Base.connection.execute("SELECT 1")
    render json: { 
      status: 'ready', 
      process_target: ENV.fetch('PROCESS_TARGET', 'server'), 
      pod_name: ENV.fetch('POD_NAME', 'unknown'),
      namespace: ENV.fetch('POD_NAMESPACE', 'default'),
      timestamp: Time.current.iso8601 
    }, status: :ok
  rescue => e
    render json: { status: 'not_ready', error: e.message }, status: :service_unavailable
  end
end

4. Rails Routes for Health Checks

# config/routes.rb (add these routes)
get '/health/liveness' => 'health#liveness'
get '/health/readiness' => 'health#readiness'

5. JSON Logging Configuration

# Gemfile (add)
gem 'lograge'
# config/environments/production.rb (add)
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new
config.lograge.custom_options = lambda do |event|
  {
    process_target: ENV.fetch('PROCESS_TARGET', 'server'),
    pod_name: ENV.fetch('POD_NAME', 'unknown'),
    namespace: ENV.fetch('POD_NAMESPACE', 'default')
  }
end
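With this configuration each request becomes one JSON object on stdout, which Fluent Bit ships to Logz.io. A sketch of the line shape (field values hypothetical), with the custom_options fields merged into the standard lograge payload:

```ruby
require 'json'

# Hypothetical example of a single lograge line after custom_options merge.
line = {
  method: 'GET', path: '/health/readiness', status: 200, duration: 12.3,
  process_target: 'server', pod_name: 'rails-server-6f9c-x7z', namespace: 'default'
}.to_json

JSON.parse(line)['process_target']  # "server"
```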

Data Models

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| PROCESS_TARGET | No | Set by target | Process type identifier (server, worker-primary, worker-secondary, kafka-*) |
| HEALTH_PORT | No | 3001-3004 | Port for health check server (workers/kafka) |
| DELAYED_JOB_POOLS | Conditional | "" | Pool arguments for delayed_job (required for worker target) |
| DELAYED_JOB_TIMEOUT | No | 280 | Seconds to wait for job completion on SIGTERM |
| RAILS_ENV | Yes | - | Rails environment |
| RAILS_LOG_TO_STDOUT | No | true | Enable logging to stdout |
| RAILS_SERVE_STATIC_FILES | No | true | Enable static file serving from Puma |
| RAILS_MAX_THREADS | No | 5 | Maximum Puma threads per pod |
| RAILS_MIN_THREADS | No | 5 | Minimum Puma threads per pod |
| DB_POOL | No | 5 | ActiveRecord connection pool size (should match RAILS_MAX_THREADS) |
| DATABASE_URL | Yes | - | Database connection string |
| KAFKA_* | Conditional | - | Kafka credentials (required for kafka-* targets) |
| LOGZIO_TOKEN | Yes | - | Logz.io shipping token (via Secret) |
| POD_NAME | No | unknown | Kubernetes pod name (from downward API) |
| POD_NAMESPACE | No | default | Kubernetes namespace (from downward API) |

Kubernetes Secrets

| Secret Name | Keys | Used By |
| --- | --- | --- |
| storecove-app-db-credentials | DATABASE_HOST, DATABASE_PORT, DATABASE_USERNAME, DATABASE_PASSWORD, DATABASE_NAME | All |
| storecove-app-master-key | RAILS_MASTER_KEY | All |
| storecove-app-aws-credentials | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_*_BUCKET | All |
| storecove-app-valkey-credentials | VALKEY_HOST, VALKEY_PORT, VALKEY_USERNAME, VALKEY_PASSWORD | Rails, Workers |
| storecove-app-queue-credentials | SQS_*_QUEUE URLs (bounces, complaints, deliveries, partner, peppol, receive, sftp) | Workers |
| storecove-app-kafka-credentials | KAFKA_CONSUMER, KAFKA_PRODUCER | Kafka consumers |
| storecove-app-logzio | LOGZIO_TOKEN | Fluent Bit |
| storecove-app-rollbar | ROLLBAR_ACCESS_TOKEN | GitHub Actions |
| storecove-app-email-credentials | EMAIL_PROVIDER_USERNAME, EMAIL_PROVIDER_PASSWORD | Rails, Workers |
| storecove-app-billing-credentials | CHARGEBEE_API_KEY, CHARGEBEE_SITE, STRIPE_SECRET_KEY | Rails, Workers |
| storecove-app-peppol-credentials | PEPPOL_SHOP_ID, DEFAULT_ACCESSPOINT_* | Rails, Workers |
| storecove-app-webhooks-credentials | WEBHOOKS_ENCRYPT_KEY, WEBHOOKS_ENCRYPT_IV | Rails, Workers |
| storecove-app-intercom-credentials | INTERCOM_APP_ID, INTERCOM_API_SECRET, INTERCOM_API_ACCESS_TOKEN | Rails |
| mysql-ca-cert | ca-cert.pem | All (mounted as volume) |

Health Check Ports

| Component | Health Port | Endpoint | Notes |
| --- | --- | --- | --- |
| Rails Server | 3000 | /health/liveness, /health/readiness | Via Rails controller |
| Worker Primary | 3001 | /health | Via WEBrick health_server.rb |
| Worker Secondary | 3001 | /health | Via WEBrick health_server.rb |
| Kafka Sending Status | 3002 | /health | Via WEBrick health_server.rb |
| Kafka New Document | 3003 | /health | Via WEBrick health_server.rb |
| Kafka Received Status | 3004 | /health | Via WEBrick health_server.rb |

Kubernetes Resource Specifications

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | Replicas | terminationGracePeriodSeconds |
| --- | --- | --- | --- | --- | --- | --- |
| Rails Server | 500m | 2000m | 1Gi | 4Gi | 2-10 (HPA) | 30 |
| Worker Primary | 250m | 1000m | 512Mi | 2Gi | 2-5 | 300 |
| Worker Secondary | 250m | 2000m | 512Mi | 4Gi | 2-5 | 300 |
| Kafka Consumer (each) | 100m | 500m | 256Mi | 1Gi | 1-3 | 60 |
| CronJob (each) | 100m | 500m | 256Mi | 1Gi | N/A | N/A |
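Assuming every tier runs at its maximum replica count (and counting the three Kafka consumers individually, at 1Gi = 1024Mi), the table implies the following worst-case scheduled requests; a rough capacity-planning sketch:

```ruby
# Worst-case resource requests, values taken from the table above.
specs = {
  "rails-server"     => { cpu_m: 500, mem_mi: 1024, max_pods: 10 },
  "worker-primary"   => { cpu_m: 250, mem_mi: 512,  max_pods: 5 },
  "worker-secondary" => { cpu_m: 250, mem_mi: 512,  max_pods: 5 },
  "kafka-consumers"  => { cpu_m: 100, mem_mi: 256,  max_pods: 9 }, # 3 consumers x 3 replicas
}

cpu_m  = specs.sum { |_, s| s[:cpu_m]  * s[:max_pods] }  # 8400 mCPU
mem_mi = specs.sum { |_, s| s[:mem_mi] * s[:max_pods] }  # 17664 Mi
```

The cluster's node pool must leave headroom above these request totals for CronJobs, Fluent Bit, and system pods.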

Build and Push Strategy (GitHub Actions)

Each Docker build target is built and pushed separately with appropriate tags:

# Build Rails server target
- name: Build and push Rails server
  uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    target: rails
    tags: |
      ${{ vars.OVH_REGISTRY_URL }}/storecove-app:rails-${{ github.sha }}
      ${{ vars.OVH_REGISTRY_URL }}/storecove-app:rails-latest
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Build Worker target
- name: Build and push Worker
  uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    target: worker
    tags: |
      ${{ vars.OVH_REGISTRY_URL }}/storecove-app:worker-${{ github.sha }}
      ${{ vars.OVH_REGISTRY_URL }}/storecove-app:worker-latest
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Build Kafka consumer targets
- name: Build and push Kafka Sending Status
  uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    target: kafka-sending-status
    tags: |
      ${{ vars.OVH_REGISTRY_URL }}/storecove-app:kafka-sending-status-${{ github.sha }}
      ${{ vars.OVH_REGISTRY_URL }}/storecove-app:kafka-sending-status-latest
    cache-from: type=gha
    cache-to: type=gha,mode=max

# Repeat for kafka-new-document and kafka-received-status

Docker BuildKit Optimization

The build strategy uses Docker BuildKit with GitHub Actions cache for faster builds:

Cache Strategy:

  • cache-from: type=gha - Pull cache layers from previous builds
  • cache-to: type=gha,mode=max - Store all layers for future builds
  • Shared layers between targets (base, ruby-deps, node-deps, app-base) are cached once

Build Performance:

  • First build: ~15-20 minutes (all layers)
  • Subsequent builds (code changes only): ~2-5 minutes (app-base rebuilt)
  • Subsequent builds (dependency changes): ~10-12 minutes (ruby-deps/node-deps rebuilt)

Parallel Builds: Consider building targets in parallel using GitHub Actions matrix strategy:

strategy:
  matrix:
    target: [rails, worker, kafka-sending-status, kafka-new-document, kafka-received-status]
steps:
  - uses: docker/build-push-action@v6
    with:
      target: ${{ matrix.target }}
      tags: ${{ vars.OVH_REGISTRY_URL }}/storecove-app:${{ matrix.target }}-${{ github.sha }}

This reduces total build time from ~15 minutes sequential to ~5 minutes parallel (limited by slowest target).

# Run database migrations
- name: Run database migrations
  run: |
    kubectl run migration-${{ github.sha }} \
      --image=${{ vars.OVH_REGISTRY_URL }}/storecove-app:rails-${{ github.sha }} \
      --restart=Never \
      --rm \
      --attach \
      --command -- bash -lc "bundle exec rails db:migrate"

# Apply Kubernetes manifests
- name: Apply Kubernetes manifests
  run: |
    export IMAGE_TAG=${{ github.sha }}
    envsubst < k8s/rails-server.yaml | kubectl apply -f -
    envsubst < k8s/worker-primary.yaml | kubectl apply -f -
    envsubst < k8s/worker-secondary.yaml | kubectl apply -f -
    envsubst < k8s/kafka-sending-status.yaml | kubectl apply -f -
    envsubst < k8s/kafka-new-document.yaml | kubectl apply -f -
    envsubst < k8s/kafka-received-status.yaml | kubectl apply -f -
    kubectl apply -f k8s/cronjobs/
    kubectl apply -f k8s/ingress.yaml

# Notify Rollbar of deployment
- name: Notify Rollbar
  if: success()
  run: |
    curl -X POST https://api.rollbar.com/api/1/deploy/ \
      -H "Content-Type: application/json" \
      -d '{
        "access_token": "${{ secrets.ROLLBAR_ACCESS_TOKEN }}",
        "environment": "production",
        "revision": "${{ github.sha }}",
        "local_username": "${{ github.actor }}",
        "comment": "Deployed via GitHub Actions"
      }'

Kubernetes Manifests

Rails Server Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rails-server
  labels:
    app: storecove
    component: server
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: storecove
      component: server
  template:
    metadata:
      labels:
        app: storecove
        component: server
    spec:
      terminationGracePeriodSeconds: 30
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: rails
        image: ${OVH_REGISTRY_URL}/storecove-app:rails-latest
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: false
          capabilities:
            drop:
              - ALL
        env:
        - name: RAILS_ENV
          value: "production"
        - name: RAILS_SERVE_STATIC_FILES
          value: "true"
        - name: PROCESS_TARGET
          value: "server"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MYSQL_SSL_CA
          value: "/etc/ssl/mysql/ca-cert.pem"
        envFrom:
        - secretRef:
            name: storecove-app-db-credentials
        - secretRef:
            name: storecove-app-master-key
        - secretRef:
            name: storecove-app-aws-credentials
        - secretRef:
            name: storecove-app-valkey-credentials
        - secretRef:
            name: storecove-app-email-credentials
        - secretRef:
            name: storecove-app-billing-credentials
        - secretRef:
            name: storecove-app-peppol-credentials
        - secretRef:
            name: storecove-app-webhooks-credentials
        - secretRef:
            name: storecove-app-intercom-credentials
        - secretRef:
            name: storecove-app-rollbar-credentials
        volumeMounts:
        - name: mysql-ca
          mountPath: /etc/ssl/mysql
          readOnly: true
        ports:
        - containerPort: 3000
          name: http
        livenessProbe:
          httpGet:
            path: /health/liveness
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 4Gi
      volumes:
      - name: mysql-ca
        secret:
          secretName: mysql-ca-cert
---
apiVersion: v1
kind: Service
metadata:
  name: rails-server
spec:
  selector:
    app: storecove
    component: server
  ports:
  - port: 80
    targetPort: 3000

Ingress Configuration

Important: The Ingress NGINX Controller is scheduled for retirement in March 2026.

  • Verify OVH's actual ingress controller type before production deployment
  • If OVH uses nginx-ingress, plan migration to Gateway API by Q2 2026
  • The annotations below assume nginx-ingress; update if OVH uses a different controller

OVH Production Subdomains:

  • app.fr.storecove.com - Main application (2M body size limit)
  • api.fr.storecove.com - API endpoint (100M body size limit)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: storecove-app-ingress
  annotations:
    # Body size limits
    nginx.ingress.kubernetes.io/proxy-body-size: "2m"
    # Timeouts for long-running requests
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    # Security headers
    nginx.ingress.kubernetes.io/server-snippet: |
      more_clear_headers "X-Powered-By";
      more_clear_headers "Server";
    # TLS
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.fr.storecove.com
    secretName: storecove-app-tls
  rules:
  - host: app.fr.storecove.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rails-server
            port:
              number: 80
---
# Separate ingress for API subdomain with larger body size
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: storecove-api-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.fr.storecove.com
    secretName: storecove-api-tls
  rules:
  - host: api.fr.storecove.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rails-server
            port:
              number: 80

Worker Primary Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-primary
  labels:
    app: storecove
    component: worker-primary
spec:
  replicas: 2
  selector:
    matchLabels:
      app: storecove
      component: worker-primary
  template:
    metadata:
      labels:
        app: storecove
        component: worker-primary
    spec:
      terminationGracePeriodSeconds: 300
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: worker
        image: ${OVH_REGISTRY_URL}/storecove-app:worker-latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: false
          capabilities:
            drop:
              - ALL
        env:
        - name: RAILS_ENV
          value: "production"
        - name: PROCESS_TARGET
          value: "worker-primary"
        - name: DELAYED_JOB_POOLS
          value: "--pool=mail:1 --pool=inboundpeppol,inboundpeppolemail,inboundsftp,inboundublemail,inboundpartneremail:4 --pool=ses_notifications,ses_mail,sar_mail,edi_smtp,edi_as2,ses_mail_in_out:2 --pool=vatcalc_out_out_live,vatcalc_out_out_pilot:1 --pool=analyze_action,invoice_analyzer,slack,apply_action:1 --pool=document_submissions:2"
        - name: DELAYED_JOB_TIMEOUT
          value: "280"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MYSQL_SSL_CA
          value: "/etc/ssl/mysql/ca-cert.pem"
        envFrom:
        - secretRef:
            name: storecove-app-db-credentials
        - secretRef:
            name: storecove-app-master-key
        - secretRef:
            name: storecove-app-aws-credentials
        - secretRef:
            name: storecove-app-valkey-credentials
        - secretRef:
            name: storecove-app-queue-credentials
        - secretRef:
            name: storecove-app-email-credentials
        - secretRef:
            name: storecove-app-billing-credentials
        - secretRef:
            name: storecove-app-peppol-credentials
        - secretRef:
            name: storecove-app-webhooks-credentials
        volumeMounts:
        - name: mysql-ca
          mountPath: /etc/ssl/mysql
          readOnly: true
        ports:
        - containerPort: 3001
          name: health
        livenessProbe:
          httpGet:
            path: /health
            port: 3001
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 2Gi
      volumes:
      - name: mysql-ca
        secret:
          secretName: mysql-ca-cert

Worker Secondary Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-secondary
  labels:
    app: storecove
    component: worker-secondary
spec:
  replicas: 2
  selector:
    matchLabels:
      app: storecove
      component: worker-secondary
  template:
    metadata:
      labels:
        app: storecove
        component: worker-secondary
    spec:
      terminationGracePeriodSeconds: 300
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: worker
        image: ${OVH_REGISTRY_URL}/storecove-app:worker-latest
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: false
          capabilities:
            drop:
              - ALL
        env:
        - name: RAILS_ENV
          value: "production"
        - name: PROCESS_TARGET
          value: "worker-secondary"
        - name: DELAYED_JOB_POOLS
          value: "--pool=smp_phoss:8 --pool=aruba_out_out_prod,aruba_out_out_pilot,aruba_out_out_webhooks_pilot,aruba_out_out_webhooks_prod:1 --pool=chargebee_webhook_events,exactsales_webhook_events,storecove_webhook_events:1 --pool=outgoing_webhooks,outgoing_webhooks_sandbox:4 --pool=outgoing_webhooks_asia,outgoing_webhooks_sandbox_asia:4 --pool=exact_worker,snelstart_worker,sftp_worker,as2_worker:1 --pool=received_documents,aruba_in_in_webhooks:1 --pool=storecove_api_self:3 --pool=active_storage_analysis,active_storage_mirror,active_storage_preview,active_storage_purge:1 --pool=kafka_sending_actions_status_update,kafka_received_document_status,kafka_new_document_notification:12 --pool=meta_events,exceptions,aruba_admin:1 --pool=customer_reporting:1 --pool=my_lhdnm_poller:6"
        - name: DELAYED_JOB_TIMEOUT
          value: "280"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MYSQL_SSL_CA
          value: "/etc/ssl/mysql/ca-cert.pem"
        envFrom:
        - secretRef:
            name: storecove-app-db-credentials
        - secretRef:
            name: storecove-app-master-key
        - secretRef:
            name: storecove-app-aws-credentials
        - secretRef:
            name: storecove-app-valkey-credentials
        - secretRef:
            name: storecove-app-queue-credentials
        - secretRef:
            name: storecove-app-email-credentials
        - secretRef:
            name: storecove-app-billing-credentials
        - secretRef:
            name: storecove-app-peppol-credentials
        - secretRef:
            name: storecove-app-webhooks-credentials
        volumeMounts:
        - name: mysql-ca
          mountPath: /etc/ssl/mysql
          readOnly: true
        ports:
        - containerPort: 3001
          name: health
        livenessProbe:
          httpGet:
            path: /health
            port: 3001
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 2000m
            memory: 4Gi
      volumes:
      - name: mysql-ca
        secret:
          secretName: mysql-ca-cert

Kafka Consumer Deployment (Example: Sending Status)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-sending-status
  labels:
    app: storecove
    component: kafka-sending-status
spec:
  replicas: 1
  selector:
    matchLabels:
      app: storecove
      component: kafka-sending-status
  template:
    metadata:
      labels:
        app: storecove
        component: kafka-sending-status
    spec:
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: consumer
        image: ${OVH_REGISTRY_URL}/storecove-app:kafka-sending-status-latest
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: false
          capabilities:
            drop:
              - ALL
        env:
        - name: RAILS_ENV
          value: "production"
        - name: PROCESS_TARGET
          value: "kafka-sending-status"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MYSQL_SSL_CA
          value: "/etc/ssl/mysql/ca-cert.pem"
        envFrom:
        - secretRef:
            name: storecove-app-db-credentials
        - secretRef:
            name: storecove-app-master-key
        - secretRef:
            name: storecove-app-kafka-credentials
        volumeMounts:
        - name: mysql-ca
          mountPath: /etc/ssl/mysql
          readOnly: true
        ports:
        - containerPort: 3002
          name: health
        livenessProbe:
          httpGet:
            path: /health
            port: 3002
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 1Gi
      volumes:
      - name: mysql-ca
        secret:
          secretName: mysql-ca-cert

Note: The kafka-new-document and kafka-received-status deployments follow the same pattern, with their respective ports (3003, 3004) and PROCESS_TARGET values.

Kubernetes CronJobs

Scheduled tasks are implemented as Kubernetes CronJobs using the rails Docker target. Each CronJob runs independently with the full Rails environment.

# Example: Daily reporting task
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
  labels:
    app: storecove
    component: cronjob
spec:
  schedule: "0 6 * * *"  # 6 AM UTC daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600  # Job must complete within 1 hour
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            fsGroup: 1000
          containers:
          - name: rails
            image: ${OVH_REGISTRY_URL}/storecove-app:rails-latest
            imagePullPolicy: Always
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: false
              capabilities:
                drop:
                  - ALL
            command: ["bash", "-lc", "bundle exec rake reports:daily"]
            env:
            - name: RAILS_ENV
              value: "production"
            - name: PROCESS_TARGET
              value: "cronjob-daily-report"
            - name: MYSQL_SSL_CA
              value: "/etc/ssl/mysql/ca-cert.pem"
            envFrom:
            - secretRef:
                name: storecove-app-db-credentials
            - secretRef:
                name: storecove-app-master-key
            - secretRef:
                name: storecove-app-aws-credentials
            - secretRef:
                name: storecove-app-valkey-credentials
            volumeMounts:
            - name: mysql-ca
              mountPath: /etc/ssl/mysql
              readOnly: true
            resources:
              requests:
                cpu: 100m
                memory: 256Mi
              limits:
                cpu: 500m
                memory: 1Gi
          volumes:
          - name: mysql-ca
            secret:
              secretName: mysql-ca-cert

CronJob Migration from Whenever

Tasks currently defined in config/schedule.rb (whenever gem) must be migrated to individual CronJob manifests:

| Task Description | CronJob Name | Schedule | Command |
|---|---|---|---|
| Customer reports | customer-reports | 0 6 * * * | rake customer_reporting:schedule_reports |
| SaaS org reporting (monthly) | saas-organizations | 30 8 1 * * | rake saas:organizations_global && rake saas:organizations_asia && rake saas:organizations_pacific |
| Peppol end users reporting | peppol-end-users | 0 23 2 * * | rake peppol_reporting:peppol_reporting_end_users |
| Peppol transactions reporting | peppol-transactions | 0 1 3 * * | rake peppol_reporting:peppol_reporting_transactions |
| Peppol SG/IRAS reporting | peppol-sg-monthly | 30 5 1 * * | rake peppol_reporting:identifiers_in_out_sg && rake peppol_reporting:reporting_sg_iras_sla_sandbox && rake peppol_reporting:reporting_sg_iras_sla_live |
| AWS SES bounce rates | aws-ses-bounce-rates | 30 4 * * 1 | rake aws_ses_reporting:bounce_rates_sending && rake aws_ses_reporting:bounce_rates_administrations |
| Kafka sending/clearing updates | kafka-sending-clearing | */10 * * * * | rake kafka:produce_invoice_submission_action_update_requests_sending && rake kafka:produce_invoice_submission_action_update_requests_clearing |
| Kafka new docs hourly | kafka-new-docs-hourly | 0 * * * * | rake kafka:produce_new_documents_request_hourly |
| Kafka new docs daily | kafka-new-docs-daily | 0 0 * * * | rake kafka:produce_new_documents_request_daily |
| Clean delayed jobs queue | clean-delayed-jobs | */5 * * * * | rake railsdb:clean_delayed_jobs_inboundpeppol |
| CorpPass/MyKYC detection | corppass-mykyc-detect | */5 * * * * | rake corppass:detect[sandbox] && rake corppass:detect[live] && rake mykyc:detect[sandbox] && rake mykyc:detect[live] |
| Reconcile Chargebee | reconcile-chargebee | 15 7 * * 6 | rake saas:reconcile_chargebee |
| Check invalid identifiers | identifiers-invalid | 30 7 * * 6 | rake identifiers:invalid |
| SMP reconciliation | smp-reconcile | 0 8 * * 6 | rake smp:reconcile && rake smp:reconcile_sg |
| Email worker | email-worker | 0 * * * * | rails runner "C5::EmailWorker.new.perform" |
| Invoice analyzer | invoice-analyzer | 0 * * * * | rails runner "InvoiceAnalyzerJob.perform_later" |

Total: 16 CronJobs replacing the container-level cron previously managed by the whenever gem.
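As a sketch of the translation, one whenever declaration maps to one CronJob schedule plus command. The every block below is illustrative only, not copied from the current config/schedule.rb:

```ruby
# config/schedule.rb (whenever DSL) -- illustrative entry, not verbatim
every :day, at: "6:00 am" do
  rake "customer_reporting:schedule_reports"
end

# becomes, in the corresponding CronJob manifest:
#   schedule: "0 6 * * *"
#   command: ["bash", "-lc", "bundle exec rake customer_reporting:schedule_reports"]
```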

Fluent Bit RBAC and DaemonSet for Logz.io

Fluent Bit requires RBAC permissions to access Kubernetes metadata for log enrichment.

# ServiceAccount for Fluent Bit
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
# ClusterRole with permissions to read pod metadata
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit
rules:
- apiGroups: [""]
  resources:
    - namespaces
    - pods
  verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding to bind the role to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
---
# DaemonSet for Fluent Bit
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:latest
        securityContext:
          runAsNonRoot: false
          privileged: false
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - ALL
        env:
        - name: LOGZIO_TOKEN
          valueFrom:
            secretKeyRef:
              name: storecove-app-logzio
              key: token
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/storecove*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     5MB

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On

    [OUTPUT]
        Name            http
        Match           *
        Host            listener.logz.io
        Port            8071
        URI             /?token=${LOGZIO_TOKEN}&type=kubernetes
        Format          json_lines
        tls             On
        tls.verify      On
  
  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system—essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Build Target Produces Correct Process

For any Docker build target (rails, worker, kafka-*), building and running the image SHALL start only the process specified by that target.

Validates: Requirements 1.2, 1.3, 1.4, 1.5, 1.6

Property 2: Foreground Process Execution

For any Docker build target, the started process SHALL be the main process (PID 1 or a direct child of PID 1) and SHALL NOT daemonize.

Validates: Requirement 1.7

Property 3: Health Server Starts Before Main Process

For any worker or Kafka consumer target, the health check server SHALL be listening on its port before the main process starts consuming work.

Validates: Requirement 1.9
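Property 3 can be prototyped with a small health server that binds its port before the worker loop begins. The real entrypoint uses WEBrick per Requirement 3.10; a bare TCPServer keeps this sketch dependency-free, and start_health_server is an illustrative name:

```ruby
require "socket"
require "json"

# Bind the health port BEFORE the main process starts consuming work,
# so Kubernetes probes never race the worker boot (Property 3).
def start_health_server(port)
  server = TCPServer.new("0.0.0.0", port) # raises immediately if the bind fails
  Thread.new do
    loop do
      client = server.accept
      client.gets # consume the request line; this sketch ignores headers
      body = JSON.generate(status: "ok")
      client.write("HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n" \
                   "Content-Length: #{body.bytesize}\r\n\r\n#{body}")
      client.close
    rescue IOError
      break # server socket closed during shutdown
    end
  end
  server
end

# start_health_server(3001)                        # health endpoint is live from here
# exec("bundle", "exec", "bin/delayed_job", "run") # then hand off to the worker
```

Because TCPServer.new raises on a failed bind, a misconfigured health port surfaces at boot rather than as a silent probe failure later.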

Property 4: JSON Log Format Validity

For any log entry output by any process type (server, worker, kafka-*), the log entry SHALL be valid JSON that can be parsed without error.

Validates: Requirements 2.1, 2.2, 2.3

Property 5: Required Log Fields Presence

For any JSON log entry, the entry SHALL contain the fields: timestamp, level, process_target, pod_name, namespace, and message.

Validates: Requirements 2.4, 2.5

Property 6: Sensitive Data Exclusion from Logs

For any log entry, the entry SHALL NOT contain values of environment variables whose names contain PASSWORD, SECRET, KEY, or TOKEN.

Validates: Requirements 2.8, 7.5
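One way to enforce Property 6 is to redact the values of matching environment variables before a line is written; scrub_sensitive is a hypothetical helper, not an existing app method:

```ruby
# Redact values of env vars whose NAMES look sensitive (Property 6).
SENSITIVE_NAME = /PASSWORD|SECRET|KEY|TOKEN/.freeze

def scrub_sensitive(message, env = ENV)
  env.each_with_object(message.dup) do |(name, value), scrubbed|
    next unless name.match?(SENSITIVE_NAME) && value && !value.empty?
    scrubbed.gsub!(value, "[FILTERED]")
  end
end
```

In a real formatter this would wrap the message field just before serialization; matching on variable names rather than value patterns keeps false negatives low, at the cost of occasionally filtering benign strings.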

Property 7: SIGTERM Graceful Shutdown

For any SIGTERM signal sent to a container, the main process SHALL begin graceful shutdown within 1 second.

Validates: Requirement 5.4
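The one-second shutdown latency in Property 7 follows from keeping the trap handler to a single flag write and checking that flag between units of work. A sketch (run_until_term is an illustrative name):

```ruby
# Trap SIGTERM, finish the current unit of work, then drain (Property 7).
def run_until_term
  shutdown = false
  Signal.trap("TERM") { shutdown = true } # handler stays tiny and async-safe

  until shutdown
    sleep 0.05 # stand-in for processing one job / one Kafka batch
  end
  :drained # a real worker finishes in-flight work here, then exits 0
end
```

Doing real work inside the trap handler is unsafe (most of Ruby is off-limits in signal context), which is why the handler only flips a flag and the main loop does the actual draining.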

Property 8: Health Check Response Time

For any health check request, the response SHALL be returned within 5 seconds.

Validates: Requirement 3.11

Property 9: Liveness vs Readiness Separation

For any Rails server, the liveness endpoint SHALL return 200 even when the database is unreachable, while the readiness endpoint SHALL return 503.

Validates: Requirements 3.1, 3.2, 3.3
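Property 9 falls out of keeping the two checks independent: liveness only asserts the process is alive, while readiness additionally exercises a database round-trip. A framework-free sketch, where the helper names and the db_check lambda are illustrative (in Rails the check would be an ActiveRecord connection ping):

```ruby
require "json"

# Liveness: the process is up; dependencies are deliberately ignored.
def liveness_status
  [200, JSON.generate(status: "alive")]
end

# Readiness: additionally require a successful database round-trip.
def readiness_status(db_check)
  db_check.call
  [200, JSON.generate(status: "ready")]
rescue StandardError => e
  [503, JSON.generate(status: "not_ready", error: e.message)]
end
```

Coupling liveness to the database would make Kubernetes restart healthy pods during a DB outage; separating the two means an outage only drains traffic via readiness.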

Error Handling

Build Target Errors

| Error Condition | Behavior | Exit Code |
|---|---|---|
| Missing required env var | Log error with variable name, exit | 1 |
| Process fails to start | Log error with details, exit | 1 |
| Health server fails to bind | Log error, continue (main process may still work) | - |

Health Check Error Responses

| Component | Error Condition | HTTP Status | Response Body |
|---|---|---|---|
| Rails Server (readiness) | DB connection failed | 503 | {"status":"not_ready","error":"..."} |
| Worker | DB connection failed | 503 | {"status":"unhealthy","error":"..."} |
| Worker | Process not running | 503 | {"status":"unhealthy","error":"process not found"} |
| Kafka Consumer | Process crashed | 503 | {"status":"unhealthy","error":"..."} |

Graceful Shutdown Timeouts

| Component | terminationGracePeriodSeconds | Rationale |
|---|---|---|
| Rails Server | 30 | Typical HTTP request timeout |
| Delayed Job Worker | 300 | Jobs may take several minutes |
| Kafka Consumer | 60 | Offset commit and disconnect |

Testing Strategy

Unit Tests

Unit tests verify specific examples and edge cases:

  1. Health Check Tests

    • Test liveness returns 200 when process running
    • Test readiness returns 200 when DB connected
    • Test readiness returns 503 when DB disconnected
  2. Logging Tests

    • Test log output is valid JSON
    • Test required fields are present
    • Test sensitive data is not logged

Property-Based Tests

Property-based tests verify universal properties across many inputs using a property-based testing library (e.g., Rantly or PropCheck for Ruby).

Each property test should run a minimum of 100 iterations.

Property Test 1: JSON Log Validity

  • Generate various log scenarios
  • Verify all output is parseable JSON
  • Feature: kubernetes-rails-deployment, Property 4: JSON Log Format Validity

Property Test 2: Required Fields Presence

  • Generate log entries
  • Verify all contain required fields
  • Feature: kubernetes-rails-deployment, Property 5: Required Log Fields Presence

Property Test 3: Sensitive Data Exclusion

  • Generate log entries with various env vars set
  • Verify no sensitive values appear in logs
  • Feature: kubernetes-rails-deployment, Property 6: Sensitive Data Exclusion from Logs
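Without committing to a specific gem, the pattern behind these tests is a generator loop with a fixed iteration count. The generator and formatter below are self-contained stand-ins for the app's real ones:

```ruby
require "json"

# Property 4, hand-rolled: every generated message must survive a
# format -> parse round-trip as valid JSON.
def json_validity_holds?(iterations: 100)
  iterations.times do
    # random printable message, including quote and backslash characters
    message = Array.new(rand(0..80)) { rand(32..126).chr }.join
    line = JSON.generate(level: "INFO", message: message)
    raise "unparseable log line" unless JSON.parse(line)["message"] == message
  end
  true
end
```

A dedicated PBT library adds shrinking (reducing a failing input to a minimal counterexample), which this hand-rolled loop lacks.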

Integration Tests

Integration tests verify components work together:

  1. Container Build Tests

    • Build each target
    • Verify correct process starts
    • Verify health endpoint responds
  2. Kubernetes Manifest Validation

    • Use kubectl --dry-run to validate manifests
    • Verify all required fields present
    • Verify probe configurations correct
  3. Log Shipping Tests

    • Start container with Fluent Bit
    • Generate logs
    • Verify logs appear in Logz.io (or mock endpoint)

Requirements Document

Introduction

This specification covers the modernization of a Rails application deployment from ECS with a monolithic container approach to a Kubernetes-native architecture. Currently, the application runs delayed_job workers and multiple Racecar Kafka consumers as daemonized background processes alongside the web server. The new architecture will run each process type in its own container with proper health monitoring, centralized logging to Logz.io, and automatic recovery via Kubernetes probes.

Historical Context: The previous deployment model used AWS ECS with a disconnected build process. When code was merged to master, a CD workflow (.github/workflows/cd.yml) built and pushed images to GitHub Container Registry, but these images were never used in production. Instead, production deployments required manually SSH-ing to a build server, which would clone the datajust repository and build a fresh image using a separate Dockerfile located in the storecove-app-docker repository (production/Dockerfile). This image was then pushed to AWS ECR and deployed to ECS.

New Approach: The OVH Kubernetes deployment eliminates this disconnection. Images are built automatically in GitHub Actions when code is merged to master, using the Dockerfile in the datajust repository itself. These same images are immediately deployed to OVH Kubernetes, eliminating manual steps and ensuring consistency between CI and production. The storecove-app-docker repository is deprecated for this deployment model.

Glossary

  • Docker_Build_Target: A named stage in a multi-stage Dockerfile that produces a specific container image
  • Rails_Server: Puma serving the Rails web application
  • Delayed_Jobs_Worker: Background job processor using delayed_job with multiple queue pools
  • Kafka_Consumer: Racecar-based consumer (SendingActionStatusUpdate, NewDocumentNotification, ReceivedDocumentStatus)
  • Health_Check_Endpoint: HTTP endpoint returning process health status for Kubernetes probes
  • Logz_io: External centralized logging service for log aggregation
  • Kubernetes_Deployment: K8s resource defining pod specifications and replica counts
  • Liveness_Probe: Kubernetes health check that restarts unhealthy containers
  • Readiness_Probe: Kubernetes health check that controls traffic routing to the Rails_Server
  • CronJob: Kubernetes resource for running scheduled tasks

Requirements

Requirement 1: Docker Multi-Stage Build Targets

User Story: As a DevOps engineer, I want separate Docker build targets for each process type, so that I can deploy and scale each component independently using the same codebase.

Acceptance Criteria

  1. THE Dockerfile SHALL define a base stage containing all shared dependencies and application code
  2. THE Dockerfile SHALL define a build target named "rails" that starts Puma serving the Rails application in the foreground
  3. THE Dockerfile SHALL define a build target named "worker" that starts delayed_job in the foreground using pools specified by the DELAYED_JOB_POOLS environment variable
  4. THE worker target CMD SHALL expand the DELAYED_JOB_POOLS variable to pass pool arguments to delayed_job (e.g., "--pool=mail:1 --pool=slack:2")
  5. THE Dockerfile SHALL define a build target named "kafka-sending-status" that starts the SendingActionStatusUpdateConsumer in the foreground
  6. THE Dockerfile SHALL define a build target named "kafka-new-document" that starts the NewDocumentNotificationConsumer in the foreground
  7. THE Dockerfile SHALL define a build target named "kafka-received-status" that starts the ReceivedDocumentStatusConsumer in the foreground
  8. EACH build target SHALL run its process in the foreground without daemonizing
  9. EACH build target SHALL trap SIGTERM for graceful shutdown
  10. THE worker and Kafka consumer targets SHALL start a health check server on their respective health ports before starting the main process
  11. THE rails build target SHALL expose port 3000 for the web server and health endpoints
  12. THE worker target SHALL expose port 3001 for health checks
  13. THE kafka-sending-status target SHALL expose port 3002 for health checks
  14. THE kafka-new-document target SHALL expose port 3003 for health checks
  15. THE kafka-received-status target SHALL expose port 3004 for health checks
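Criteria 1 through 5 can be sketched as a multi-stage Dockerfile skeleton. The stage names match the requirement; the base image, puma config path, and delayed_job invocation are assumptions, and the sh -c wrapper is what lets $DELAYED_JOB_POOLS expand into individual --pool arguments (criterion 4), since exec-form CMD performs no shell expansion:

```dockerfile
# Sketch only: stage names follow the requirement, everything else is assumed.
FROM ruby:3.2-slim AS base
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .

FROM base AS rails
EXPOSE 3000
CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]

FROM base AS worker
EXPOSE 3001
# sh -c word-splits DELAYED_JOB_POOLS into --pool=... arguments;
# exec keeps delayed_job as PID 1 so it receives SIGTERM directly
CMD ["sh", "-c", "exec bundle exec bin/delayed_job run $DELAYED_JOB_POOLS"]

FROM base AS kafka-sending-status
EXPOSE 3002
CMD ["bundle", "exec", "racecar", "SendingActionStatusUpdateConsumer"]
```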

Requirement 2: Centralized Logging to Logz.io

User Story: As an operations engineer, I want all application logs sent to Logz.io, so that I can monitor and debug issues across all components in one place.

Acceptance Criteria

  1. THE Rails_Server SHALL output logs to stdout in JSON format
  2. THE Delayed_Jobs_Worker SHALL output logs to stdout in JSON format
  3. THE Kafka_Consumer SHALL output logs to stdout in JSON format
  4. WHEN a log entry is generated, THE logging configuration SHALL include timestamp, log level, process_target, pod_name, namespace, and message fields
  5. THE Kubernetes_Deployment SHALL set a PROCESS_TARGET environment variable identifying the component type, which Rails SHALL include in all log entries
  6. THE Kubernetes_Deployment SHALL use a Fluent Bit DaemonSet to forward container stdout to Logz_io
  7. THE Kubernetes_Deployment SHALL provide the Logz.io token via a Kubernetes Secret named storecove-app-logzio
  8. NO Docker build target SHALL log any environment variables containing PASSWORD, SECRET, KEY, or TOKEN

Requirement 3: Health Check Endpoints

User Story: As a platform engineer, I want health check endpoints for each process type, so that Kubernetes can monitor and restart unhealthy containers.

Acceptance Criteria

  1. WHEN the Rails_Server process is running, THE Health_Check_Endpoint at /health/liveness SHALL return HTTP 200 status
  2. WHEN the Rails_Server is healthy AND can connect to the database, THE Health_Check_Endpoint at /health/readiness SHALL return HTTP 200 status
  3. IF the Rails_Server cannot connect to the database, THEN THE Health_Check_Endpoint at /health/readiness SHALL return HTTP 503 status
  4. WHEN the Delayed_Jobs_Worker process is running and can connect to the database, THE Health_Check_Endpoint SHALL return HTTP 200 status
  5. IF the Delayed_Jobs_Worker process is not running or cannot connect to the database, THEN THE Health_Check_Endpoint SHALL return HTTP 503 status
  6. WHEN the Kafka_Consumer process is running and has not crashed, THE Health_Check_Endpoint SHALL return HTTP 200 status
  7. IF the Kafka_Consumer process has crashed or is not running, THEN THE Health_Check_Endpoint SHALL return HTTP 503 status
  8. THE Health_Check_Endpoint for Delayed_Jobs_Worker SHALL listen on port 3001
  9. THE Health_Check_Endpoint for Kafka consumers SHALL listen on ports 3002 (sending-status), 3003 (new-document), and 3004 (received-status)
  10. THE Health_Check_Endpoint for workers and Kafka consumers SHALL be provided by a lightweight WEBrick HTTP server
  11. THE Health_Check_Endpoint SHALL respond within 5 seconds

Requirement 4: Kubernetes Deployment Configuration

User Story: As a DevOps engineer, I want Kubernetes deployment manifests for each component, so that I can deploy and scale them independently on OVH Kubernetes.

Acceptance Criteria

  1. THE Kubernetes_Deployment for Rails_Server SHALL define liveness probe at /health/liveness with initialDelaySeconds 30, periodSeconds 10, and failureThreshold 3
  2. THE Kubernetes_Deployment for Rails_Server SHALL define readiness probe at /health/readiness with initialDelaySeconds 30, periodSeconds 10, and failureThreshold 3
  3. THE Kubernetes_Deployment for Delayed_Jobs_Worker SHALL define liveness probe on port 3001 with initialDelaySeconds 30, periodSeconds 30, and failureThreshold 3
  4. THE Kubernetes_Deployment for each Kafka_Consumer SHALL define liveness probe on its respective health port with initialDelaySeconds 30, periodSeconds 30, and failureThreshold 3
  5. WHEN a Liveness_Probe fails three consecutive times, THE Kubernetes_Deployment SHALL restart the container
  6. THE Kubernetes_Deployment SHALL allow independent replica scaling for each component type
  7. THE Kubernetes_Deployment SHALL use images built from different Docker build targets (rails, worker, kafka-*) from the same Dockerfile
  8. THE Kubernetes_Deployment for Rails_Server SHALL define resource requests of 256Mi memory / 250m CPU, and limits of 2Gi memory / 1000m CPU
  9. THE Kubernetes_Deployment for Delayed_Jobs_Worker SHALL define resource requests of 512Mi memory / 250m CPU, and limits of 4Gi memory / 2000m CPU
  10. THE Kubernetes_Deployment for each Kafka_Consumer SHALL define resource requests of 256Mi memory / 100m CPU, and limits of 1Gi memory / 500m CPU
  11. THE Kubernetes_Deployment SHALL NOT define a Kubernetes Service for worker or Kafka deployments
  12. THE Kubernetes_Deployment for Rails_Server SHALL define a strategy.rollingUpdate with maxUnavailable 0 and maxSurge 1
  13. THE Kubernetes_Deployment SHALL provide Kafka credentials via a Kubernetes Secret named storecove-app-kafka-credentials
  14. THE Kubernetes cluster SHALL define separate Deployment resources for different worker pool groups, each using the same "worker" Docker target with different DELAYED_JOB_POOLS values
  15. THE worker-primary Deployment SHALL configure DELAYED_JOB_POOLS with: mail, inbound processing (peppol, sftp, ubl, partner email), SES/email queues, vatcalc, analyze/invoice/slack/apply actions, and document_submissions pools
  16. THE worker-secondary Deployment SHALL configure DELAYED_JOB_POOLS with: smp_phoss, aruba, webhooks (including asia), integrations (exact, snelstart, sftp, as2), received_documents, storecove_api_self, active_storage, kafka processing, meta_events, customer_reporting, and my_lhdnm_poller pools
  17. EACH worker Deployment SHALL use the same health check port (3001) since only one delayed_job process runs per container

Requirement 5: Graceful Shutdown Handling

User Story: As a platform engineer, I want processes to shut down gracefully when Kubernetes terminates them, so that in-flight work is not lost.

Acceptance Criteria

  1. WHEN the Rails_Server receives SIGTERM, THE Rails_Server SHALL stop accepting new connections and complete in-flight requests before exiting
  2. WHEN the Delayed_Jobs_Worker receives SIGTERM, THE Delayed_Jobs_Worker SHALL complete the current job if it finishes within terminationGracePeriodSeconds, otherwise the job SHALL be left in the queue for retry
  3. WHEN the Kafka_Consumer receives SIGTERM, THE Kafka_Consumer SHALL commit offsets and disconnect cleanly before exiting
  4. EACH Docker build target SHALL trap SIGTERM and handle graceful shutdown
  5. THE Kubernetes_Deployment SHALL configure terminationGracePeriodSeconds of 30 for Rails_Server
  6. THE Kubernetes_Deployment SHALL configure terminationGracePeriodSeconds of 300 for Delayed_Jobs_Worker
  7. THE Kubernetes_Deployment SHALL configure terminationGracePeriodSeconds of 60 for each Kafka_Consumer

Requirement 6: Database Migrations

User Story: As a DevOps engineer, I want database migrations to run safely during deployments, so that schema changes don't cause downtime or data corruption.

Acceptance Criteria

  1. THE deployment pipeline SHALL run rails db:migrate as a GitHub Actions step BEFORE applying Kubernetes deployment manifests
  2. IF the migration step fails, THE deployment pipeline SHALL abort and NOT apply new Kubernetes manifests
  3. THE migration step SHALL use the rails Docker build target image

Requirement 7: Secrets Management

User Story: As a security engineer, I want all sensitive configuration stored in Kubernetes Secrets, so that credentials are not exposed in manifests or logs.

Acceptance Criteria

  1. THE Kubernetes_Deployment SHALL reference database credentials from a Secret named storecove-app-db-credentials
  2. THE Kubernetes_Deployment SHALL reference Kafka credentials from a Secret named storecove-app-kafka-credentials
  3. THE Kubernetes_Deployment SHALL reference Logz.io token from a Secret named storecove-app-logzio
  4. THE Kubernetes_Deployment SHALL reference Rails master key from a Secret named storecove-app-master-key
  5. NO Docker build target SHALL log any environment variables containing PASSWORD, SECRET, KEY, or TOKEN
  6. THE Kubernetes_Deployment SHALL reference AWS credentials from a Secret named storecove-app-aws-credentials
  7. THE Kubernetes_Deployment SHALL reference Valkey credentials from a Secret named storecove-app-valkey-credentials
  8. THE Kubernetes_Deployment SHALL reference SQS/queue credentials from a Secret named storecove-app-queue-credentials
  9. THE Kubernetes_Deployment SHALL reference email provider credentials from a Secret named storecove-app-email-credentials
  10. THE Kubernetes_Deployment SHALL reference billing credentials (Chargebee, Stripe) from a Secret named storecove-app-billing-credentials
  11. THE Kubernetes_Deployment SHALL reference Peppol/access point configuration from a Secret named storecove-app-peppol-credentials
  12. THE Kubernetes_Deployment SHALL reference webhook encryption keys from a Secret named storecove-app-webhooks-credentials
  13. THE Kubernetes_Deployment SHALL reference Rollbar API key from a Secret named storecove-app-rollbar-credentials
  14. THE Kubernetes_Deployment SHALL reference Intercom credentials from a Secret named storecove-app-intercom-credentials

Requirement 8: Continuous Deployment on Master Merge

User Story: As a developer, I want the application to automatically deploy to OVH Kubernetes when changes are merged to master, so that new features reach production without manual build server intervention.

Acceptance Criteria

  1. THE GitHub Actions workflow SHALL trigger automatically on every push to the master branch
  2. THE GitHub Actions workflow SHALL build all Docker targets (rails, worker, kafka-sending-status, kafka-new-document, kafka-received-status) from the Dockerfile in the datajust repository
  3. EACH Docker target SHALL be tagged with both {target}-{git-sha} and {target}-latest tags
  4. THE GitHub Actions workflow SHALL push all built images to the OVH Container Registry
  5. THE images pushed to OVH Container Registry SHALL be the SAME images deployed to Kubernetes (no rebuilding in production)
  6. THE GitHub Actions workflow SHALL run database migrations using the rails target image BEFORE deploying new pods
  7. THE GitHub Actions workflow SHALL apply Kubernetes manifests for all components
  8. IF any build step fails, THE workflow SHALL abort and NOT deploy
  9. IF the migration step fails, THE workflow SHALL abort and NOT apply new manifests
  10. THE workflow SHALL use Docker BuildKit cache to optimize build times
  11. THE workflow SHALL NOT use the build process from storecove-app-docker repository (deprecated)
  12. THE GitHub Actions workflow SHALL notify Rollbar of successful deployments with git SHA, environment, and deployer information

Requirement 9: Repository and Build Process Deprecation

User Story: As a team member, I want clarity on which repositories and build processes are active vs. deprecated, so that I don't accidentally use outdated deployment methods.

Acceptance Criteria

  1. THE deployment workflow SHALL build images from the datajust repository only
  2. THE storecove-app-docker repository SHALL NOT be used for building production images
  3. THE storecove-app-docker/production/build-deploy script SHALL NOT be used for deployments
  4. THE .github/workflows/cd.yml workflow MAY continue to build images for CI/testing purposes, but these SHALL NOT be used for production deployments to OVH
  5. ALL production deployments SHALL use images built by .github/workflows/deploy.yml

Requirement 10: Scheduled Tasks via Kubernetes CronJobs

User Story: As a platform engineer, I want scheduled tasks to run reliably via Kubernetes CronJobs, so that periodic maintenance and reporting jobs execute on time.

Acceptance Criteria

  1. THE scheduled tasks SHALL be implemented using Kubernetes CronJob resources
  2. EACH CronJob SHALL use the "rails" Docker build target as its container image
  3. THE CronJob resources SHALL reference the same secrets as other deployments
  4. THE CronJob resources SHALL define appropriate schedule expressions matching the current whenever configuration
  5. THE CronJob resources SHALL set restartPolicy to "OnFailure"
  6. THE CronJob resources SHALL set concurrencyPolicy to "Forbid" to prevent overlapping runs
  7. THE CronJob container command SHALL execute rake tasks or rails runner commands as needed
  8. THE deployment workflow SHALL apply CronJob manifests alongside Deployment manifests

Requirement 11: Ingress Configuration

User Story: As a platform engineer, I want proper ingress configuration, so that external traffic reaches the application with appropriate limits and routing.

Acceptance Criteria

  1. THE Kubernetes Ingress SHALL route external traffic to the Rails_Server service
  2. THE Kubernetes Ingress SHALL configure routes for app.fr.storecove.com (application host)
  3. THE Kubernetes Ingress SHALL configure separate routes for api.fr.storecove.com (API host)
  4. THE Kubernetes Ingress SHALL configure TLS certificates for both subdomains using cert-manager
  5. THE Kubernetes Ingress for api.fr.storecove.com SHALL configure client-max-body-size of 100M
  6. THE Kubernetes Ingress for app.fr.storecove.com SHALL configure client-max-body-size of 2M
  7. THE Kubernetes Ingress SHALL configure appropriate proxy timeouts for long-running requests (300s)
  8. THE Kubernetes Ingress SHALL be configured via annotations appropriate to the OVH ingress controller

Requirement 12: Static Asset Serving

User Story: As a platform engineer, I want static assets served directly from the Rails container, so that the deployment does not depend on an external CDN.

Acceptance Criteria

  1. THE Rails_Server container SHALL serve static assets directly via Puma (RAILS_SERVE_STATIC_FILES=true)
  2. THE assets SHALL be precompiled during the Docker image build
  3. THE Rails configuration SHALL NOT configure an external asset_host CDN
  4. THE Kubernetes Ingress MAY configure caching headers for /assets paths

Requirement 13: Application Environment Configuration

User Story: As a DevOps engineer, I want all required environment variables configured, so that the application functions correctly.

Acceptance Criteria

  1. THE Kubernetes_Deployment SHALL set RAILS_ENV to "production" (or appropriate environment)
  2. THE Kubernetes_Deployment SHALL set RAILS_LOG_TO_STDOUT to "true"
  3. THE Kubernetes_Deployment SHALL set RAILS_SERVE_STATIC_FILES to "true" for Rails_Server
  4. THE Kubernetes_Deployment SHALL set PROCESS_TARGET to identify each component type (server, worker-primary, worker-secondary, kafka-sending-status, kafka-new-document, kafka-received-status)
  5. THE Kubernetes_Deployment SHALL use Kubernetes Downward API to inject POD_NAME and POD_NAMESPACE environment variables
  6. THE Kubernetes_Deployment for worker targets SHALL set DELAYED_JOB_POOLS with appropriate pool configuration

Implementation Plan: Kubernetes Rails Deployment

Overview

This plan implements the migration from ECS with daemonized background processes to a Kubernetes-native architecture using Docker multi-stage builds. Each component (web server, delayed job workers, Kafka consumers) has its own Docker build target that produces a separate container image. Each image runs its process in the foreground with proper health checks, signal handling, and JSON logging.

Tasks

  • 1. Create Docker multi-stage build targets

    • 1.1 Create base and app-base stages in Dockerfile
      • Define base stage with all shared dependencies (Ruby, Node.js, system packages)
      • Create ruby-deps and node-deps stages for dependency caching
      • Create app-base stage with application code and precompiled assets
      • Requirements: 1.1
    • 1.2 Create rails build target
      • Expose port 3000
      • Set PROCESS_TARGET=server environment variable
      • Set RAILS_SERVE_STATIC_FILES=true environment variable
      • Set RAILS_LOG_TO_STDOUT=true environment variable
      • CMD to start Puma in foreground: bundle exec rails server -b 0.0.0.0 -p 3000
      • Requirements: 1.2, 1.7, 1.11, 12.1, 12.2, 13.2
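A minimal sketch of what the rails target could look like, assuming the `app-base` stage name from task 1.1; the final Dockerfile may differ:

```dockerfile
# Sketch only: stage name and paths are assumptions from this plan.
FROM app-base AS rails
ENV PROCESS_TARGET=server \
    RAILS_SERVE_STATIC_FILES=true \
    RAILS_LOG_TO_STDOUT=true
EXPOSE 3000
# Exec-form CMD: Puma runs as PID 1 in the foreground and receives
# SIGTERM directly from Kubernetes for graceful shutdown.
CMD ["bundle", "exec", "rails", "server", "-b", "0.0.0.0", "-p", "3000"]
```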
    • 1.3 Create worker build target
      • Expose port 3001
      • Set PROCESS_TARGET=worker environment variable
      • Set DELAYED_JOB_POOLS="" environment variable (will be overridden by K8s deployment)
      • Set DELAYED_JOB_TIMEOUT=280 environment variable
      • Copy health_server.rb to /scripts/
      • CMD to start health server then delayed_job with --timeout and $DELAYED_JOB_POOLS variable expansion
      • Requirements: 1.3, 1.4, 1.7, 1.9, 1.12
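The worker target needs runtime expansion of `$DELAYED_JOB_POOLS`, so its CMD must use shell form. A sketch, assuming `delayed_job run` as the foreground invocation (the exact CLI flags should be checked against the app's `bin/delayed_job`):

```dockerfile
# Sketch only: stage name and the delayed_job CLI shape are assumptions.
FROM app-base AS worker
ENV PROCESS_TARGET=worker \
    DELAYED_JOB_POOLS="" \
    DELAYED_JOB_TIMEOUT=280
COPY scripts/health_server.rb /scripts/health_server.rb
EXPOSE 3001
# Shell form so the variables expand at runtime. The health server runs in
# the background; `exec` replaces the shell so delayed_job becomes PID 1
# and receives SIGTERM from Kubernetes.
CMD ruby /scripts/health_server.rb & \
    exec bundle exec delayed_job run --timeout $DELAYED_JOB_TIMEOUT $DELAYED_JOB_POOLS
```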
    • 1.4 Create Kafka consumer build targets
      • Create kafka-sending-status target (port 3002)
      • Create kafka-new-document target (port 3003)
      • Create kafka-received-status target (port 3004)
      • Each starts health server then racecar consumer in foreground
      • Verify Kafka broker URLs are configured via environment variables or Racecar config file
      • Requirements: 1.5, 1.6, 1.7, 1.8, 1.9, 1.12, 1.13, 1.14
    • 1.5 Remove or simplify entrypoint.sh
      • The entrypoint.sh is no longer needed for process selection (targets have their own CMD)
      • Either remove entirely or simplify to just RVM initialization if needed
      • Requirements: 1.7
    • [ ]* 1.6 Write property test for build target produces correct process
      • Property 1: Build Target Produces Correct Process
      • Validates: Requirements 1.2, 1.3, 1.5, 1.6, 1.7
    • [ ]* 1.7 Write property test for foreground process execution
      • Property 2: Foreground Process Execution
      • Validates: Requirements 1.7
      • Note: Puma, delayed_job, and racecar handle SIGTERM gracefully by default
  • 2. Implement health check infrastructure

    • 2.1 Create WEBrick health check server for workers and Kafka consumers
      • Create scripts/health_server.rb with /health and /ready endpoints
      • Workers: load Rails environment and check database connectivity
      • Kafka consumers: simple process-alive check (no DB)
      • Return JSON responses with status, process_target, pod_name, namespace, and timestamp
      • Configure port via HEALTH_PORT environment variable
      • Requirements: 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11
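A hypothetical sketch of `scripts/health_server.rb`. It uses only the stdlib `TCPServer` (WEBrick behaves the same but ships as a separate gem on Ruby 3+), serves the JSON payload described above, and leaves the worker-specific DB check out, since that requires loading Rails:

```ruby
require "socket"
require "json"
require "time"

# Health payload with the fields required by this spec; values fall back
# to "unknown" when the Downward API env vars are absent (e.g. locally).
def health_payload
  {
    status: "ok",
    process_target: ENV.fetch("PROCESS_TARGET", "unknown"),
    pod_name: ENV.fetch("POD_NAME", "unknown"),
    namespace: ENV.fetch("POD_NAMESPACE", "unknown"),
    timestamp: Time.now.utc.iso8601
  }
end

# Starts a tiny HTTP responder in a background thread and returns the
# server. Every path gets the same 200/JSON answer in this sketch.
def start_health_server(port: Integer(ENV.fetch("HEALTH_PORT", "3001")))
  server = TCPServer.new("0.0.0.0", port)
  Thread.new do
    loop do
      client = server.accept
      loop { break if client.gets.to_s.strip.empty? } # drain request headers
      body = JSON.generate(health_payload)
      client.print("HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n" \
                   "Content-Length: #{body.bytesize}\r\n\r\n#{body}")
      client.close
    rescue StandardError
      nil # keep serving even if one request errors out
    end
  end
  server
end
```

In the real script the main thread would start this server first, then `exec` or block on nothing, matching the "health server before main process" property in task 2.4.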
    • 2.2 Create Rails health controller for web server
      • Add HealthController with liveness and readiness actions
      • Liveness: check process is alive (no DB check)
      • Readiness: check DB connectivity
      • Skip authentication for health endpoints
      • Add routes for /health/liveness and /health/readiness to config/routes.rb
      • Include pod_name and namespace in responses
      • Requirements: 3.1, 3.2, 3.3, 3.11
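A sketch of the controller, assuming names like `HealthController` and a CSRF-only `skip_before_action` (the actual base controller and filters in this app may require different skips):

```ruby
# app/controllers/health_controller.rb (sketch; names are assumptions)
class HealthController < ActionController::Base
  skip_before_action :verify_authenticity_token

  # Liveness: process is up, no DB touch.
  def liveness
    render json: base_payload.merge(status: "alive")
  end

  # Readiness: prove DB connectivity with a trivial query.
  def readiness
    ActiveRecord::Base.connection.execute("SELECT 1")
    render json: base_payload.merge(status: "ready")
  rescue StandardError => e
    render json: base_payload.merge(status: "unready", error: e.class.name), status: 503
  end

  private

  def base_payload
    { pod_name: ENV["POD_NAME"], namespace: ENV["POD_NAMESPACE"] }
  end
end

# config/routes.rb
# get "/health/liveness",  to: "health#liveness"
# get "/health/readiness", to: "health#readiness"
```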
    • [ ]* 2.3 Write unit tests for health check endpoints
      • Test liveness returns 200 when process running
      • Test readiness returns 200 when DB connected
      • Test readiness returns 503 when DB disconnected
      • Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7
    • [ ]* 2.4 Write property test for health server starts before main process
      • Property 3: Health Server Starts Before Main Process
      • Validates: Requirements 1.9
  • 3. Configure JSON logging for all process types

    • 3.1 Configure Rails logger for JSON output
      • Add lograge gem to Gemfile
      • Configure lograge in config/environments/production.rb
      • Include timestamp, level, process_target, pod_name, namespace, and message fields
      • Configure for stdout output
      • Ensure sensitive data is not logged
      • Requirements: 2.1, 2.4, 2.5, 2.8
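Using lograge's documented hooks, the production config could look roughly like this; the exact `custom_options` keys beyond the required fields are assumptions:

```ruby
# config/environments/production.rb (sketch)
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new
config.lograge.custom_options = lambda do |_event|
  {
    timestamp: Time.now.utc.iso8601,
    process_target: ENV["PROCESS_TARGET"],
    pod_name: ENV["POD_NAME"],
    namespace: ENV["POD_NAMESPACE"]
  }
end
# RAILS_LOG_TO_STDOUT=true already routes Rails.logger to stdout;
# lograge reuses that logger, so no extra sink is needed.
```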
    • 3.2 Configure delayed_job for JSON logging
      • Set up JSON formatter for delayed_job output
      • Ensure logs go to stdout
      • Requirements: 2.2, 2.4
    • 3.3 Configure Racecar/Kafka consumers for JSON logging
      • Update config/initializers/racecar.rb to set config.logfile = STDOUT
      • Configure Racecar to use Rails.logger for consistent JSON formatting
      • Verify offset_commit_interval is set appropriately (default: 10 seconds)
      • Include required fields in log entries
      • Requirements: 2.3, 2.4
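A sketch of the initializer change (the existing file's shape is an assumption; `Racecar.configure` is the library's documented entry point):

```ruby
# config/initializers/racecar.rb (sketch)
Racecar.configure do |config|
  config.logfile = STDOUT              # never write log/racecar.log in a pod
  config.logger  = Rails.logger        # inherit the JSON (lograge) formatting
  config.offset_commit_interval = 10   # seconds; matches the library default
end
```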
    • [ ]* 3.4 Write property test for JSON log format validity
      • Property 4: JSON Log Format Validity
      • Validates: Requirements 2.1, 2.2, 2.3
    • [ ]* 3.5 Write property test for required log fields presence
      • Property 5: Required Log Fields Presence
      • Validates: Requirements 2.4, 2.5
    • [ ]* 3.6 Write property test for sensitive data exclusion
      • Property 6: Sensitive Data Exclusion from Logs
      • Validates: Requirements 2.8, 7.5
  • 4. Checkpoint - Verify Dockerfile and health checks

    • Ensure all Docker targets build successfully
    • Ensure health endpoints respond correctly
    • Ask the user if questions arise.
  • 5. Create Kubernetes deployment manifests

    • 5.1 Create Rails server deployment manifest
      • Use image storecove-app:rails-latest
      • Configure liveness probe on /health/liveness (no DB check) with initialDelaySeconds=30, periodSeconds=10, failureThreshold=3
      • Configure readiness probe on /health/readiness (with DB check) with initialDelaySeconds=30, periodSeconds=10, failureThreshold=3
      • Set resource requests/limits (500m-2000m CPU, 1Gi-4Gi memory)
      • Set terminationGracePeriodSeconds to 30
      • Configure rollingUpdate with maxUnavailable=0, maxSurge=1
      • Add POD_NAME and POD_NAMESPACE env vars from downward API
      • Set RAILS_ENV, RAILS_LOG_TO_STDOUT, RAILS_SERVE_STATIC_FILES, PROCESS_TARGET
      • Reference secrets:
        • storecove-app-db-credentials
        • storecove-app-master-key
        • storecove-app-aws-credentials
        • storecove-app-valkey-credentials
        • storecove-app-email-credentials
        • storecove-app-billing-credentials
        • storecove-app-peppol-credentials
        • storecove-app-webhooks-credentials
        • storecove-app-intercom-credentials
        • storecove-app-rollbar-credentials
      • Requirements: 4.1, 4.2, 4.6, 4.7, 4.8, 4.12, 5.5, 7.1, 7.4, 7.6, 7.7, 7.9, 7.10, 7.11, 7.12, 7.13, 7.14, 13.1, 13.2, 13.3, 13.4, 13.5
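An excerpt of the Deployment showing the probe split and Downward API wiring; registry path, metadata, selectors, and the full secret list are omitted or assumed:

```yaml
# Excerpt only (sketch); image path and secret wiring are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rails-server
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: rails
          image: registry.example/storecove-app:rails-latest
          ports: [{ containerPort: 3000 }]
          env:
            - name: PROCESS_TARGET
              value: server
            - name: POD_NAME
              valueFrom: { fieldRef: { fieldPath: metadata.name } }
            - name: POD_NAMESPACE
              valueFrom: { fieldRef: { fieldPath: metadata.namespace } }
          envFrom:
            - secretRef: { name: storecove-app-db-credentials }
            - secretRef: { name: storecove-app-master-key }
            # ...remaining storecove-app-* secrets from the list above
          livenessProbe:
            httpGet: { path: /health/liveness, port: 3000 }
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /health/readiness, port: 3000 }
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          resources:
            requests: { cpu: 500m, memory: 1Gi }
            limits: { cpu: "2", memory: 4Gi }
```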
    • 5.2 Create Kubernetes Service for Rails server
      • Create Service targeting port 3000
      • No Service for workers or Kafka consumers
      • Requirements: 4.11
    • 5.3 Create delayed job worker deployment manifests
      • Create worker-primary deployment with DELAYED_JOB_POOLS for: mail, inbound processing, SES/email, vatcalc, analyze/invoice/slack/apply, document_submissions
      • Create worker-secondary deployment with DELAYED_JOB_POOLS for: smp_phoss, aruba, webhooks, integrations, received_documents, storecove_api_self, active_storage, kafka, meta_events, customer_reporting, my_lhdnm_poller
      • Both use image storecove-app:worker-latest
      • Configure liveness probe on port 3001 with initialDelaySeconds=30, periodSeconds=30, failureThreshold=3
      • Set resource requests/limits: worker-primary (250m-1000m CPU, 512Mi-2Gi memory), worker-secondary (250m-2000m CPU, 512Mi-4Gi memory)
      • Set terminationGracePeriodSeconds to 300
      • Set PROCESS_TARGET to worker-primary or worker-secondary respectively
      • Reference secrets:
        • storecove-app-db-credentials
        • storecove-app-master-key
        • storecove-app-aws-credentials
        • storecove-app-valkey-credentials
        • storecove-app-queue-credentials
        • storecove-app-email-credentials
        • storecove-app-billing-credentials
        • storecove-app-peppol-credentials
        • storecove-app-webhooks-credentials
      • Requirements: 4.3, 4.6, 4.7, 4.9, 4.11, 4.14, 4.15, 4.16, 4.17, 5.6, 7.1, 7.4, 7.6, 7.7, 7.8, 7.9, 7.10, 7.11, 7.12, 13.4, 13.6
    • 5.4 Create Kafka consumer deployment manifests
      • Create deployment for kafka-sending-status consumer (image: kafka-sending-status-latest, port 3002)
      • Create deployment for kafka-new-document consumer (image: kafka-new-document-latest, port 3003)
      • Create deployment for kafka-received-status consumer (image: kafka-received-status-latest, port 3004)
      • Configure liveness probes with initialDelaySeconds=30, periodSeconds=30, failureThreshold=3
      • Set resource requests/limits (100m-500m CPU, 256Mi-1Gi memory)
      • Set terminationGracePeriodSeconds to 60
      • Reference secrets: storecove-app-db-credentials, storecove-app-master-key, storecove-app-kafka-credentials
      • Requirements: 4.4, 4.5, 4.6, 4.7, 4.10, 4.11, 5.7, 7.1, 7.2, 7.4
    • 5.5 Create Ingress manifests
      • Create main app Ingress for app.fr.storecove.com with 2M body size limit
      • Create API Ingress for api.fr.storecove.com with 100M body size limit
      • Configure TLS with cert-manager for both subdomains (2 certificates: storecove-app-tls, storecove-api-tls)
      • Configure proxy timeouts for long-running requests (300s)
      • Configure security headers (strip X-Powered-By, Server)
      • Add note about nginx-ingress retirement (March 2026) and Gateway API migration path
      • Verify ingress class matches OVH cluster configuration
      • Requirements: 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8
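A sketch of the API Ingress, assuming the ingress-nginx controller (annotation names differ on other controllers, so verify against the OVH cluster's ingress class); the app Ingress would be identical apart from its host and a `2m` body size:

```yaml
# Sketch; issuer name and service name are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: storecove-api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: 100m
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.fr.storecove.com]
      secretName: storecove-api-tls
  rules:
    - host: api.fr.storecove.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: rails-server, port: { number: 3000 } }
```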
    • 5.6 Create Kubernetes CronJob manifests
      • Create 16 CronJob manifests from config/schedule.rb (see design doc for complete mapping)
      • Each uses rails-latest image with imagePullPolicy: Always
      • Set concurrencyPolicy to "Forbid"
      • Set restartPolicy to "OnFailure"
      • Set activeDeadlineSeconds to 3600 (1 hour timeout per job)
      • Add security contexts matching Rails server deployment
      • Reference same secrets as Rails server deployment
      • Add MySQL CA volume mount to each CronJob
      • Define schedule expressions matching whenever configuration
      • Set PROCESS_TARGET to cronjob-{task-name} for each
      • Requirements: 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8
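One representative CronJob shape; the task name, schedule, and rake task below are placeholders to be filled in from config/schedule.rb:

```yaml
# Sketch; every concrete value here is a placeholder.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-example-task
spec:
  schedule: "0 * * * *"          # copy from the whenever configuration
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: registry.example/storecove-app:rails-latest
              imagePullPolicy: Always
              command: ["bundle", "exec", "rake", "example:task"]
              env:
                - name: PROCESS_TARGET
                  value: cronjob-example-task
              # plus the same envFrom secrets and MySQL CA volume mount
              # as the Rails server deployment
```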
    • [ ]* 5.7 Validate Kubernetes manifests with kubectl dry-run
      • Run kubectl apply --dry-run=client on all manifests (deployments, services, ingress, cronjobs)
      • Verify all required fields present
      • Requirements: 4.1, 4.2, 4.3, 4.4, 10.1, 11.1
  • 6. Configure Fluent Bit for Logz.io integration

    • 6.1 Create Fluent Bit DaemonSet manifest
      • Define DaemonSet in logging namespace
      • Mount container logs from host
      • Configure Logz.io output with token from secret storecove-app-logzio
      • Requirements: 2.6, 2.7, 7.3
    • 6.2 Create Fluent Bit ConfigMap
      • Configure tail input for storecove container logs
      • Add Kubernetes filter for metadata enrichment
      • Configure HTTP output to Logz.io
      • Add JSON parser configuration
      • Requirements: 2.6
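The pipeline could be sketched as below; the Logz.io listener host/port and the path pattern are assumptions to adapt per account and region:

```ini
[INPUT]
    Name        tail
    Path        /var/log/containers/storecove-*.log
    Parser      cri
    Tag         kube.*

[FILTER]
    Name        kubernetes
    Match       kube.*
    Merge_Log   On            ; lift the app's JSON log into top-level fields

[OUTPUT]
    Name        http
    Match       kube.*
    Host        listener.logz.io
    Port        8071
    URI         /?token=${LOGZIO_TOKEN}&type=kubernetes
    tls         On
    Format      json_lines
```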
  • 7. Checkpoint - Verify Kubernetes manifests

    • Ensure all manifests are valid
    • Ask the user if questions arise.
  • 8. Update GitHub Actions workflow

    • 8.1 Update deploy.yml to build multiple Docker targets
      • Build and push rails target with tag rails-$SHA and rails-latest
      • Build and push worker target with tag worker-$SHA and worker-latest
      • Build and push kafka-sending-status target
      • Build and push kafka-new-document target
      • Build and push kafka-received-status target
      • Use Docker BuildKit cache for faster builds
      • Consider parallel builds using matrix strategy for CI speed
      • Trigger automatically on push to master branch
      • Requirements: 4.7, 8.1, 8.2, 8.3, 8.4, 8.10
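The matrix build could be sketched as follows; the registry variable and secret wiring are assumptions:

```yaml
# Sketch of the build job in .github/workflows/deploy.yml.
on:
  push:
    branches: [master]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        target: [rails, worker, kafka-sending-status, kafka-new-document, kafka-received-status]
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # registry login step omitted; depends on OVH registry credentials
      - uses: docker/build-push-action@v6
        with:
          target: ${{ matrix.target }}
          push: true
          tags: |
            ${{ vars.REGISTRY }}/storecove-app:${{ matrix.target }}-${{ github.sha }}
            ${{ vars.REGISTRY }}/storecove-app:${{ matrix.target }}-latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
```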
    • 8.2 Update deploy.yml for multi-deployment strategy
      • Use existing pause/pin/unpause strategy for migrations (already implemented)
      • Run db:migrate using rails image BEFORE applying K8s manifests
      • Abort deployment if migration or build fails
      • Apply all deployment manifests:
        • rails-server
        • worker-primary
        • worker-secondary
        • kafka-sending-status
        • kafka-new-document
        • kafka-received-status
        • CronJobs (k8s/cronjobs/)
        • Ingress
• Create/update all required secrets (13 application secrets; the MySQL CA certificate makes 14 resources):
        • storecove-app-db-credentials
        • storecove-app-kafka-credentials
        • storecove-app-logzio
        • storecove-app-master-key
        • storecove-app-aws-credentials
        • storecove-app-valkey-credentials
        • storecove-app-queue-credentials
        • storecove-app-email-credentials
        • storecove-app-billing-credentials
        • storecove-app-peppol-credentials
        • storecove-app-webhooks-credentials
        • storecove-app-rollbar-credentials
        • storecove-app-intercom-credentials
      • Notify Rollbar of successful deployment with git SHA, environment, and deployer
      • Ensure same images are deployed (no rebuilding)
      • Requirements: 6.1, 6.2, 6.3, 7.1, 7.2, 7.3, 7.4, 7.6, 7.7, 7.8, 7.9, 7.10, 7.11, 7.12, 7.13, 7.14, 8.5, 8.6, 8.7, 8.8, 8.9, 8.12
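One minimal way to express the migrate-then-apply ordering is a Kubernetes Job gated by `kubectl wait`; the repo's existing pause/pin/unpause strategy may replace this job-based step, and the names and flags below are assumptions:

```yaml
# Sketch of the deploy job's ordering guarantee.
deploy:
  needs: build          # any build failure stops the pipeline here (Req 8.8)
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Run migrations with the rails image
      run: |
        kubectl create job migrate-${{ github.sha }} \
          --image=${{ vars.REGISTRY }}/storecove-app:rails-${{ github.sha }} \
          -- bundle exec rails db:migrate
        kubectl wait --for=condition=complete --timeout=600s \
          job/migrate-${{ github.sha }}
    - name: Apply manifests   # reached only if migration succeeded (Req 8.9)
      run: kubectl apply -f k8s/
```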
    • 8.3 Document deprecation of old build process
      • Add comments noting storecove-app-docker is deprecated
      • Ensure workflow uses datajust/Dockerfile only
      • Requirements: 8.11, 9.1, 9.2, 9.3, 9.4, 9.5
  • 9. Final checkpoint - Integration verification

    • Ensure all Docker targets build successfully
    • Ensure all tests pass
    • Verify all components can be deployed independently
    • Confirm health checks respond correctly
    • Ask the user if questions arise.
  • 10. Documentation and cleanup

    • 10.1 Update deployment documentation
      • Add README noting storecove-app-docker is deprecated for production
      • Document new deployment workflow
      • Document CronJob migration from whenever gem
      • Requirements: 9.1, 9.2, 9.3
    • 10.2 Archive old build scripts
      • Mark storecove-app-docker/production/build-deploy as deprecated
      • Add deprecation notice to old Dockerfile
      • Requirements: 9.2, 9.3, 9.4, 9.5

Notes

  • Tasks marked with * are optional and can be skipped for faster MVP
  • Each task references specific requirements for traceability
  • Checkpoints ensure incremental validation
  • Property tests validate universal correctness properties
  • Docker multi-stage builds allow building separate images from one Dockerfile
  • Each target has its own CMD and EXPOSE, no entrypoint script needed
  • Health check ports: 3000 (server), 3001 (worker), 3002-3004 (Kafka consumers)
  • Liveness probes check process health only; readiness probes check DB connectivity (for Rails server)
  • PROCESS_TARGET is set as ENV in each Dockerfile target, overridden by K8s deployment for workers
  • DELAYED_JOB_POOLS is set empty in Dockerfile, configured per-deployment in K8s manifests
  • DELAYED_JOB_TIMEOUT set to 280 seconds (slightly less than terminationGracePeriodSeconds)
  • The storecove-app-docker repository is deprecated for production builds
  • All production images are built from datajust/Dockerfile via deploy.yml workflow
  • 13 application secrets + 1 MySQL CA certificate required for full deployment
  • 16 CronJobs replace the whenever gem for scheduled tasks
  • Two separate Ingress resources for OVH production subdomains: app.fr.storecove.com (2M), api.fr.storecove.com (100M)
  • Puma runs in single-process mode (workers disabled) - scaling via K8s replicas
  • Racecar must log to STDOUT (update config/initializers/racecar.rb)