@carloswm85
Last active February 28, 2026 03:01
Week 8 - Teach One Another: CI/CD Quiz

Question 1

  • A large enterprise uses a CI/CD pipeline with multiple stages, including automated testing and deployment to production.
  • During a recent deployment, a critical bug slipped through despite all tests passing.
  • Explain how a misconfiguration in the pipeline’s environment variables across stages could have caused this, and propose a strategy to prevent it in the future.

Adam

Overall, from what I've researched, a misconfiguration in the pipeline's environment variables could mean that all of the testing is done with different settings than production uses, so the same errors keep slipping through. To prevent this, make sure all stages use consistent environment variables and verify them before every pipeline run. It's a little tedious but very simple.

Explanation: Environment variables are like small settings or secret codes that tell a program how to run. If the testing stage has different settings than the production stage, the program behaves differently in each. Tests may say everything is okay, but when the code goes live, the program can easily break because it's running with the real settings.

Carlos

  • Misconfiguration explanation:
    • Different variables across stages could trigger different responses from the application logic
    • Staging may rely on staging-only services or disabled features that do not behave the same way in production
    • More precisely, this could be caused by: different feature-flag values in configuration files, different connection strings or DB schemas, or missing secrets in production
  • Prevention strategy:
    • Keep configuration in version-controlled files such as JSON or YAML; this is called "configuration as code"
    • Build once, deploy multiple times: produce a single build, then deploy that same artifact to staging and production
    • Mirror production configuration in pre-production
    • Centralize secrets management
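
A tiny sketch of how "configuration as code" makes drift detectable: with per-stage settings kept in version-controlled files, a pipeline step can diff the key sets across stages and fail fast when, say, a secret is missing from production. All stage names, keys, and values below are illustrative.

```python
# Hypothetical per-stage configuration, as would be stored in
# version-controlled JSON/YAML files ("configuration as code").
STAGE_CONFIGS = {
    "test": {"DB_URL": "test-db", "FEATURE_X": "off", "API_KEY": "test-key"},
    "staging": {"DB_URL": "staging-db", "FEATURE_X": "off", "API_KEY": "stg-key"},
    "production": {"DB_URL": "prod-db", "FEATURE_X": "off"},  # API_KEY missing!
}

def find_config_drift(configs):
    """Return {stage: missing_keys} for keys absent from some stages."""
    all_keys = set().union(*(c.keys() for c in configs.values()))
    return {
        stage: sorted(all_keys - cfg.keys())
        for stage, cfg in configs.items()
        if all_keys - cfg.keys()
    }

print(find_config_drift(STAGE_CONFIGS))  # {'production': ['API_KEY']}
```

A check like this can run as an early pipeline stage so the run fails before any deployment happens.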

Question 2

Which of the following is the most significant risk when implementing a blue-green deployment strategy in a CI/CD pipeline with zero downtime requirements?

a) Increased memory usage due to running two environments simultaneously
b) Data inconsistency between the old and new database instances
c) Network latency spikes during traffic switching
d) Rollback complexity due to insufficient test coverage

Provide a brief justification for your choice.

Adam

The most significant risk is definitely b) Data inconsistency between the old and new database instances. Even if the code switch goes perfectly, the new environment's database may not be in sync with the old one, so users could end up with missing or even wrong data. To prevent this, the team should carefully sync the databases before switching traffic to the new environment.

Explanation: When you do a blue-green deployment, you have two versions of the app running at the same time. If the new version's database isn't consistent with the old one, users will see wrong or missing data. That's why keeping the databases in sync is so important.

Carlos

I researched that the most significant risk when implementing blue-green deployments is:

b) Data inconsistency between the old and new database instances

Blue-green deployments duplicate the compute layer, but databases are typically shared or migrated, not duplicated. If the new (green) version introduces schema changes, the old (blue) version may write data in a format the new version cannot read, or vice versa. During the traffic switch window, both versions may be writing concurrently. Honestly, this sounds like a nightmare to me!
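
The schema problem above is usually handled with backward-compatible ("expand/contract") changes, where readers tolerate both the old and new row shapes during the switch window. A minimal sketch, with hypothetical field names:

```python
# During a blue-green switch both app versions may touch the same rows,
# so readers must tolerate both schemas. Here the new (green) code reads
# either the old single "name" column or the new first/last name pair.

def read_user(row: dict) -> dict:
    if "first_name" in row:                       # new (green) schema
        full_name = f"{row['first_name']} {row['last_name']}"
    else:                                         # old (blue) schema
        full_name = row["name"]
    return {"id": row["id"], "full_name": full_name}

old_row = {"id": 1, "name": "Ada Lovelace"}
new_row = {"id": 2, "first_name": "Alan", "last_name": "Turing"}
assert read_user(old_row)["full_name"] == "Ada Lovelace"
assert read_user(new_row)["full_name"] == "Alan Turing"
```

Once all traffic is on green and old-format rows are migrated, the compatibility branch can be removed (the "contract" step).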


Question 3

True or False: Feature toggles in a CI/CD pipeline eliminate the need for branch-based development workflows. Justify your answer with a real-world example of when feature toggles might fail to replace branching entirely.

Adam

I believe the answer is false. Feature toggles let you turn features on or off without making separate branches, but they don't handle massive changes that can break the code completely. For example, say you rewrite a whole login system: even behind a toggle, it can still cause bugs in other parts of the app, so a separate branch is safer until the whole system is ready to be merged.

Explanation: Feature toggles are like light switches. You can turn new features on and off. If you change something really big though, flipping a switch can still break things, so you still need a separate branch to work safely.

Carlos

  • After reading about Feature Toggle, I think that the right answer is false.
  • Real-world example: a feature toggle may hide a feature from users, but at the codebase level, depending on the extent of the changes performed, those changes may still affect other working features. This development technique is complementary to branching, not mutually exclusive with it.
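
A minimal sketch of a feature toggle (all names are illustrative), showing the key point: both code paths still live together on trunk, so a large rewrite behind a toggle can still break shared code.

```python
# Simplest possible toggle: a flag consulted at runtime. In real systems
# this would come from a config service, but the principle is the same.
TOGGLES = {"new_checkout": False}

def checkout(cart_total: float) -> str:
    if TOGGLES["new_checkout"]:
        return f"new-flow:{cart_total:.2f}"    # rewritten path, hidden from users
    return f"legacy-flow:{cart_total:.2f}"     # existing path

assert checkout(10) == "legacy-flow:10.00"
TOGGLES["new_checkout"] = True                 # flip the switch, no redeploy
assert checkout(10) == "new-flow:10.00"
```

Flipping the flag changes behavior without a branch merge, but any helper both flows share is changed for everyone the moment it lands on trunk.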

Question 4

Arrange the following CI/CD pipeline stages in the optimal order for a microservices-based application with strict security requirements. Then, explain why one of these stages could become a bottleneck in a large-scale system:

  • Static code analysis
  • Artifact deployment
  • Unit testing
  • Security vulnerability scanning
  • Integration testing

Adam

We would order the stages like this to make the pipeline as optimal as possible:

  1. Static code analysis
  2. Unit testing
  3. Security vulnerability scanning
  4. Integration testing
  5. Artifact deployment

Why a stage could be a bottleneck: Security vulnerability scanning can become very slow in a massive system because it has to check every single service for security issues. If the project contains many small services, this step can take a long time and slow down the whole pipeline, which is frustrating, especially when the app is on a deadline for a customer.

Explanation: First, the pipeline checks the code for mistakes (static analysis) and runs small tests (unit tests). Then it looks for security problems before testing how all the parts work together. Finally, it deploys the artifact. The security check can take a long time if there are lots of services, which can hold everything up for everybody involved in the project.

Carlos

Optimal order in my opinion:

| Order | Stage | Rationale |
| --- | --- | --- |
| 1 | Static code analysis | Fastest feedback, no build needed, catches issues early during development |
| 2 | Unit testing | Fast, isolated, validates logic before integration. Can run during development (locally) and during CI (before deployment) |
| 3 | ⚠️ Security vulnerability scanning | Scans dependencies and SAST (Static Application Security Testing) results post-build |
| 4 | Integration testing | Requires running services; slower, runs after code is vetted |
| 5 | Artifact deployment | Only deploy what has passed all quality gates |

Bottleneck: Security Vulnerability Scanning

At scale, scanning every dependency tree and running SAST/DAST tools across dozens of microservices can take 10–20+ minutes per pipeline run. With hundreds of daily commits in a large organization, this becomes a throughput blocker. Mitigation includes incremental scanning (only changed packages), caching scan results, and running scans in parallel per service.
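
The parallel-scan mitigation can be sketched like this, with a stub standing in for a real SAST/dependency scanner invoked per microservice (service names and timings are illustrative):

```python
# Running one scan per service concurrently bounds wall-clock time by the
# slowest single scan instead of the sum of all scans.
from concurrent.futures import ThreadPoolExecutor
import time

SERVICES = ["auth", "billing", "catalog", "orders"]

def scan(service: str):
    """Stub scan: a real pipeline would shell out to a scanner here."""
    time.sleep(0.1)            # stand-in for a multi-minute scan
    return service, 0          # (service, vulnerabilities found)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
    results = dict(pool.map(scan, SERVICES))
elapsed = time.perf_counter() - start

# ~0.1s in parallel rather than ~0.4s serially
assert results == {"auth": 0, "billing": 0, "catalog": 0, "orders": 0}
```

Combined with incremental scanning and result caching, this keeps the security stage from gating every commit on a full fleet-wide scan.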


Question 5

Design a CI/CD pipeline strategy for a globally distributed team working on a machine learning application where model training takes several hours and must be validated before deployment. Discuss how you would balance speed, reliability, and resource cost.

Adam

For a globally distributed team, we would design the CI/CD pipeline so that quick checks run on every code commit to keep feedback fast. Model training would run in a separate stage, triggered only when model-related code or data changes, or on a schedule (for example, nightly). After training, the model would go through a validation step to check its accuracy, bias, and overall performance before it's approved for deployment. To balance speed, reliability, and cost, I would use cloud resources that automatically spin up for training and shut down when finished, and also cache data and reuse previous artifacts when possible, which would drastically decrease overall training time.

Explanation: Since training the model takes hours to do, you really don’t want to run it every single time someone changes a small amount of code. So the pipeline should run quick tests immediately, and only train the model when it actually needs to do so. After the model is trained completely, it should be tested to make sure it works well.

Carlos

Key Constraints:

  • Model training takes hours
  • Validation required
  • Distributed team
  • High compute cost

Proposed Architecture:

| Layer | Strategy |
| --- | --- |
| Code CI | Fast linting + unit tests |
| Model CI | Trigger async training pipeline |
| Validation | Automated model metrics evaluation |
| Registry | Store versioned models |
| Deployment | Canary release with performance monitoring |

Use separate pipelines:

  • App pipeline (fast)
  • Model pipeline (long-running, async)

Modern approach:

  • Store models in a registry (e.g., MLflow)
  • Use experiment tracking
  • Validate metrics against thresholds before promotion

Balancing Tradeoffs:

| Goal | Approach |
| --- | --- |
| Speed | Cache datasets + incremental retraining |
| Reliability | Automated statistical validation gates |
| Cost | Spot instances + scheduled training windows |
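
The "validate metrics against thresholds before promotion" gate can be sketched as follows; the metric names and threshold values are illustrative, and in practice the metrics would come from an experiment tracker such as MLflow:

```python
# Promotion gate: a freshly trained model is promoted to the registry
# only if every tracked metric clears its configured floor.
THRESHOLDS = {"accuracy": 0.90, "auc": 0.85}

def should_promote(metrics: dict) -> bool:
    """Missing metrics count as failures (get() defaults to 0.0)."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in THRESHOLDS.items())

assert should_promote({"accuracy": 0.93, "auc": 0.88}) is True
assert should_promote({"accuracy": 0.93, "auc": 0.80}) is False
```

Because the gate is automated, an hours-long training run that produces a weak model fails fast instead of reaching deployment review.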

Question 6

In a CI/CD system using a trunk-based development approach, what is the primary challenge when integrating third-party dependencies with frequent updates?

a) Version conflicts causing build failures
b) Lack of visibility into dependency security vulnerabilities
c) Increased build times due to dependency resolution
d) Difficulty in maintaining a consistent test suite

Explain why this challenge is unique to trunk-based development.

Adam

To us, the best and most correct answer here is a) Version conflicts causing build failures. In trunk-based development, everyone commits directly to the main branch very frequently, so when third-party dependencies update often, different changes can clash and break the build. This is really hard because there are no long-lived branches to isolate dependency updates, so conflicts affect the main codebase immediately.

Explanation: Since everyone is pushing code to the same main branch all the time, a dependency update that changes behavior can quickly break the build for everyone, which is obviously frustrating. In trunk-based development, you don't have separate branches to test big dependency changes safely, so problems show up very quickly.

Carlos

Because everyone is committing to the same branch (main), it is highly likely that changes in dependency versions cause lots of:

a) Version conflicts causing build failures

A dependency update merged into trunk can immediately break everyone's build as soon as team members pull the change into their local copies.
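
A common guard against this is pinning dependencies in a lock file and failing the build when the manifest drifts from it; a small sketch with hypothetical package versions:

```python
# Declared versions (manifest) vs. the pinned, reproducible set (lock
# file). On trunk, any mismatch should fail CI before it breaks
# everyone's build.
manifest = {"requests": "2.32.0", "numpy": "2.1.0"}
lockfile = {"requests": "2.31.0", "numpy": "2.1.0"}

def pin_drift(manifest: dict, lockfile: dict) -> list:
    """Return the packages whose declared version differs from the lock."""
    return sorted(pkg for pkg, ver in manifest.items()
                  if lockfile.get(pkg) != ver)

assert pin_drift(manifest, lockfile) == ["requests"]
```

Real ecosystems implement this check natively (e.g. `npm ci` or `pip`'s hash-checking mode), so dependency bumps land as deliberate, reviewable commits rather than silent trunk breakage.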


Question 7

A company recently adopted a CI/CD pipeline with automated rollbacks triggered by failed health checks. During a major release, the rollback mechanism activated unexpectedly, causing a 30-minute outage. Investigate potential causes of this failure and suggest improvements that leverage modern observability tools.

Adam

Looking at this, the rollback might have been triggered by health checks that were a little too sensitive, misconfigured, or reporting false positives. It could also be that the health checks didn't match the experience of real users, so normal behavior looked like a failure to the system. To improve this, the company should use modern observability tools like logs, metrics, and tracing to understand system behavior better. They could also add alerts so engineers working on the app can investigate before a rollback happens automatically.

Explanation: If the system thought something was wrong when it actually wasn't, the rollback kicked in and made users lose access for 30 minutes, which is a long outage. The health checks might have been checking the wrong things or were too strict. With more modern tools, everybody can see exactly what the system is doing; fix the health checks and make sure automatic rollbacks only happen when there's actually a problem.

Carlos

Well, we could assume that the automated rollback worked correctly in the first place; it did what it was supposed to do. That said, possible causes of the unexpected rollback may be:

| Cause | Explanation |
| --- | --- |
| Misconfigured health checks | Too aggressive timeout |
| Cold start latency | New version slower to initialize |
| Resource throttling | CPU/memory pressure |
| Incorrect readiness probe | Service marked unhealthy prematurely |
| Observability blind spots | Metrics misinterpreted |

Likely scenario for how the 30-minute outage happened:

  • Health checks triggered rollback
  • Rollback also failed health checks
  • System entered retry loop
  • No manual override guard

For improvements using modern observability, leverage tools like:

  • Datadog: commercial, full SaaS monitoring platform
  • Prometheus: open-source metrics collection & querying
  • Grafana Labs: open-source visualization (often paired with Prometheus)

Recommended improvements:

| Improvement | Benefit |
| --- | --- |
| Distributed tracing | Identify latency spikes |
| SLO-based rollback triggers | Avoid premature rollback |
| Progressive delivery (canary) | Limit blast radius |
| Automated anomaly detection | Reduce false positives |
| Circuit breakers | Prevent cascading failure |
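
An SLO-based rollback trigger like the one recommended above can be sketched as a sliding-window check, so a single transient probe failure doesn't fire a rollback; the window size and error-rate threshold below are illustrative:

```python
# Roll back only when the error rate over a sliding window of health
# probes breaches the SLO, not on the first failed check.
from collections import deque

class RollbackGuard:
    def __init__(self, window: int = 10, max_error_rate: float = 0.3):
        self.samples = deque(maxlen=window)   # 1 = failed probe, 0 = ok
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> bool:
        """Record one health probe; return True if rollback should fire."""
        self.samples.append(0 if ok else 1)
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough evidence yet
        rate = sum(self.samples) / len(self.samples)
        return rate > self.max_error_rate

guard = RollbackGuard()
# One transient failure among healthy probes does not trigger rollback:
blips = [guard.record(ok) for ok in [True] * 5 + [False] + [True] * 4]
assert not any(blips)
# Sustained failures do:
assert any(guard.record(False) for _ in range(10))
```

Pairing a guard like this with a manual-override escape hatch also prevents the retry loop described above, where a rollback that itself fails health checks keeps re-triggering.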