@carloswm85
Last active February 28, 2026 03:01
Week 8 - Teach One Another: CI/CD Quiz

Question 1

  • A large enterprise uses a CI/CD pipeline with multiple stages, including automated testing and deployment to production.
  • During a recent deployment, a critical bug slipped through despite all tests passing.
  • Explain how a misconfiguration in the pipeline’s environment variables across stages could have caused this, and propose a strategy to prevent it in the future.

Adam

Overall, from what I've researched, a misconfiguration in the pipeline's environment variables could mean that all of the testing is done with different settings than production uses, so the same errors keep slipping through. To prevent this, make sure all stages use consistent environment variables and verify them before every pipeline run. It's a little tedious but very simple.

Explanation: Environment variables are like small settings or secret codes that tell a program how to run. If the testing stage has different settings than the production stage, the program behaves differently in each. Tests may say everything is okay, but when the code goes live, the program can easily break because it's running with the real settings.

Carlos

  • Misconfiguration explanation:
    • Different variables across stages could trigger different responses from the application logic
    • Staging may rely on staging-only services or disabled features that do not behave the same way in production
    • More precisely, this could be caused by: different feature-flag values in configuration files, different connection strings or DB schemas, or missing secrets in production
  • Prevention strategy:
    • Keep configuration in version-controlled files such as JSON or YAML; this is called "configuration as code"
    • Build once, deploy multiple times: produce a single build, then deploy that same artifact to staging and production
    • Mirror production configuration in pre-production
    • Centralize secrets management
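
A tiny sketch of how "configuration as code" makes drift detectable: with per-stage settings kept in version-controlled files, a pipeline step can diff the key sets across stages and fail fast when, say, a secret is missing from production. All stage names, keys, and values below are illustrative.

```python
# Hypothetical per-stage configuration, as would be stored in
# version-controlled JSON/YAML files ("configuration as code").
STAGE_CONFIGS = {
    "test": {"DB_URL": "test-db", "FEATURE_X": "off", "API_KEY": "test-key"},
    "staging": {"DB_URL": "staging-db", "FEATURE_X": "off", "API_KEY": "stg-key"},
    "production": {"DB_URL": "prod-db", "FEATURE_X": "off"},  # API_KEY missing!
}

def find_config_drift(configs):
    """Return {stage: missing_keys} for keys absent from some stages."""
    all_keys = set().union(*(c.keys() for c in configs.values()))
    return {
        stage: sorted(all_keys - cfg.keys())
        for stage, cfg in configs.items()
        if all_keys - cfg.keys()
    }

print(find_config_drift(STAGE_CONFIGS))  # {'production': ['API_KEY']}
```

A check like this can run as an early pipeline stage so the run fails before any deployment happens.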

Question 2

Which of the following is the most significant risk when implementing a blue-green deployment strategy in a CI/CD pipeline with zero downtime requirements?

a) Increased memory usage due to running two environments simultaneously
b) Data inconsistency between the old and new database instances
c) Network latency spikes during traffic switching
d) Rollback complexity due to insufficient test coverage

Provide a brief justification for your choice.

Adam

The most significant risk is definitely b) Data inconsistency between the old and new database instances. Even if the code switch goes perfectly, the new environment's database may not be in sync with the old one, so users could end up with missing or even wrong data. To prevent this, the team should carefully sync the databases before switching traffic to the new environment.

Explanation: When you do a blue-green deployment, you have two versions of the app running at the same time. If the new version's database isn't consistent with the old one, users will see wrong or missing data. That's why keeping the databases in sync is so important.

Carlos

I researched that the most significant risk when implementing blue-green deployments is:

b) Data inconsistency between the old and new database instances

Blue-green deployments duplicate the compute layer, but databases are typically shared or migrated, not duplicated. If the new (green) version introduces schema changes, the old (blue) version may write data in a format the new version cannot read, or vice versa. During the traffic switch window, both versions may be writing concurrently. Honestly, this sounds like a nightmare to me!
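
The schema problem above is usually handled with backward-compatible ("expand/contract") changes, where readers tolerate both the old and new row shapes during the switch window. A minimal sketch, with hypothetical field names:

```python
# During a blue-green switch both app versions may touch the same rows,
# so readers must tolerate both schemas. Here the new (green) code reads
# either the old single "name" column or the new first/last name pair.

def read_user(row: dict) -> dict:
    if "first_name" in row:                       # new (green) schema
        full_name = f"{row['first_name']} {row['last_name']}"
    else:                                         # old (blue) schema
        full_name = row["name"]
    return {"id": row["id"], "full_name": full_name}

old_row = {"id": 1, "name": "Ada Lovelace"}
new_row = {"id": 2, "first_name": "Alan", "last_name": "Turing"}
assert read_user(old_row)["full_name"] == "Ada Lovelace"
assert read_user(new_row)["full_name"] == "Alan Turing"
```

Once all traffic is on green and old-format rows are migrated, the compatibility branch can be removed (the "contract" step).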


Question 3

True or False: Feature toggles in a CI/CD pipeline eliminate the need for branch-based development workflows. Justify your answer with a real-world example of when feature toggles might fail to replace branching entirely.

Adam

I believe the answer is false. Feature toggles let you turn features on or off without making separate branches, but they don't handle massive changes that can break the code completely. For example, say you rewrite a whole login system: even behind a toggle, it can still cause bugs in other parts of the app, so a separate branch is safer until the whole system is ready to be merged.

Explanation: Feature toggles are like light switches. You can turn new features on and off. If you change something really big though, flipping a switch can still break things, so you still need a separate branch to work safely.

Carlos

  • After reading about Feature Toggle, I think that the right answer is false.
  • Real-world example: a feature toggle may hide a feature from users, but at the codebase level, depending on the extent of the changes performed, those changes may still affect other working features. This development technique is complementary to branching, not mutually exclusive with it.
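
A minimal sketch of a feature toggle (all names are illustrative), showing the key point: both code paths still live together on trunk, so a large rewrite behind a toggle can still break shared code.

```python
# Simplest possible toggle: a flag consulted at runtime. In real systems
# this would come from a config service, but the principle is the same.
TOGGLES = {"new_checkout": False}

def checkout(cart_total: float) -> str:
    if TOGGLES["new_checkout"]:
        return f"new-flow:{cart_total:.2f}"    # rewritten path, hidden from users
    return f"legacy-flow:{cart_total:.2f}"     # existing path

assert checkout(10) == "legacy-flow:10.00"
TOGGLES["new_checkout"] = True                 # flip the switch, no redeploy
assert checkout(10) == "new-flow:10.00"
```

Flipping the flag changes behavior without a branch merge, but any helper both flows share is changed for everyone the moment it lands on trunk.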

Question 4

Arrange the following CI/CD pipeline stages in the optimal order for a microservices-based application with strict security requirements. Then, explain why one of these stages could become a bottleneck in a large-scale system:

  • Static code analysis
  • Artifact deployment
  • Unit testing
  • Security vulnerability scanning
  • Integration testing

Adam

We would order the stages like this to make the pipeline as optimal as possible:

  1. Static code analysis
  2. Unit testing
  3. Security vulnerability scanning
  4. Integration testing
  5. Artifact deployment

Why a stage could be a bottleneck: Security vulnerability scanning can become very slow in a massive system because it has to check every single service for security issues. If the project contains many small services, this step can take a long time and slow down the whole pipeline, which is frustrating, especially when the app is on a deadline for a customer.

Explanation: First, the pipeline checks the code for mistakes (static analysis) and runs small tests (unit tests). Then it looks for security problems before testing how all the parts work together. Finally, it deploys the artifact. The security check can take a long time if there are lots of services, which can hold everything up for everybody involved in the project.

Carlos

Optimal order in my opinion:

| Order | Stage | Rationale |
| --- | --- | --- |
| 1 | Static code analysis | Fastest feedback, no build needed, catches issues early during development |
| 2 | Unit testing | Fast, isolated, validates logic before integration. Can run during development (locally) and during CI (before deployment) |
| 3 | ⚠️ Security vulnerability scanning | Scans dependencies and SAST (Static Application Security Testing) results post-build |
| 4 | Integration testing | Requires running services; slower, runs after code is vetted |
| 5 | Artifact deployment | Only deploy what has passed all quality gates |

Bottleneck: Security Vulnerability Scanning

At scale, scanning every dependency tree and running SAST/DAST tools across dozens of microservices can take 10–20+ minutes per pipeline run. With hundreds of daily commits in a large organization, this becomes a throughput blocker. Mitigation includes incremental scanning (only changed packages), caching scan results, and running scans in parallel per service.
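
The parallel-scan mitigation can be sketched like this, with a stub standing in for a real SAST/dependency scanner invoked per microservice (service names and timings are illustrative):

```python
# Running one scan per service concurrently bounds wall-clock time by the
# slowest single scan instead of the sum of all scans.
from concurrent.futures import ThreadPoolExecutor
import time

SERVICES = ["auth", "billing", "catalog", "orders"]

def scan(service: str):
    """Stub scan: a real pipeline would shell out to a scanner here."""
    time.sleep(0.1)            # stand-in for a multi-minute scan
    return service, 0          # (service, vulnerabilities found)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
    results = dict(pool.map(scan, SERVICES))
elapsed = time.perf_counter() - start

# ~0.1s in parallel rather than ~0.4s serially
assert results == {"auth": 0, "billing": 0, "catalog": 0, "orders": 0}
```

Combined with incremental scanning and result caching, this keeps the security stage from gating every commit on a full fleet-wide scan.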


Question 5

Design a CI/CD pipeline strategy for a globally distributed team working on a machine learning application where model training takes several hours and must be validated before deployment. Discuss how you would balance speed, reliability, and resource cost.

Adam

For a globally distributed team, we would design the CI/CD pipeline so that quick checks run on every code commit to keep feedback fast. Model training would run in a separate stage, triggered only when model-related code or data changes, or on a schedule (for example, nightly). After training, the model would go through a validation step to check its accuracy, bias, and overall performance before it's approved for deployment. To balance speed, reliability, and cost, I would use cloud resources that automatically spin up for training and shut down when finished, and also cache data and reuse previous artifacts when possible, which would drastically decrease overall training time.

Explanation: Since training the model takes hours to do, you really don’t want to run it every single time someone changes a small amount of code. So the pipeline should run quick tests immediately, and only train the model when it actually needs to do so. After the model is trained completely, it should be tested to make sure it works well.

Carlos

Key Constraints:

  • Model training takes hours
  • Validation required
  • Distributed team
  • High compute cost

Proposed Architecture:

| Layer | Strategy |
| --- | --- |
| Code CI | Fast linting + unit tests |
| Model CI | Trigger async training pipeline |
| Validation | Automated model metrics evaluation |
| Registry | Store versioned models |
| Deployment | Canary release with performance monitoring |

Use separate pipelines:

  • App pipeline (fast)
  • Model pipeline (long-running, async)

Modern approach:

  • Store models in a registry (e.g., MLflow)
  • Use experiment tracking
  • Validate metrics against thresholds before promotion

Balancing Tradeoffs:

| Goal | Approach |
| --- | --- |
| Speed | Cache datasets + incremental retraining |
| Reliability | Automated statistical validation gates |
| Cost | Spot instances + scheduled training windows |
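
The "validate metrics against thresholds before promotion" gate can be sketched as follows; the metric names and threshold values are illustrative, and in practice the metrics would come from an experiment tracker such as MLflow:

```python
# Promotion gate: a freshly trained model is promoted to the registry
# only if every tracked metric clears its configured floor.
THRESHOLDS = {"accuracy": 0.90, "auc": 0.85}

def should_promote(metrics: dict) -> bool:
    """Missing metrics count as failures (get() defaults to 0.0)."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in THRESHOLDS.items())

assert should_promote({"accuracy": 0.93, "auc": 0.88}) is True
assert should_promote({"accuracy": 0.93, "auc": 0.80}) is False
```

Because the gate is automated, an hours-long training run that produces a weak model fails fast instead of reaching deployment review.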

Question 6

In a CI/CD system using a trunk-based development approach, what is the primary challenge when integrating third-party dependencies with frequent updates?

a) Version conflicts causing build failures
b) Lack of visibility into dependency security vulnerabilities
c) Increased build times due to dependency resolution
d) Difficulty in maintaining a consistent test suite

Explain why this challenge is unique to trunk-based development.

Adam

To us, the best and most correct answer here is a) Version conflicts causing build failures. In trunk-based development, everyone commits directly to the main branch very frequently, so when third-party dependencies update often, different changes can clash and break the build. This is really hard because there are no long-lived branches to isolate dependency updates, so conflicts affect the main codebase immediately.

Explanation: Since everyone is pushing code to the same main branch all the time, a dependency update that changes behavior can quickly break the build for everyone, which is obviously frustrating. In trunk-based development, you don't have separate branches to test big dependency changes safely, so problems show up very quickly.

Carlos

Because everyone is committing to the same branch (main), it is highly likely that changes in dependency versions cause lots of:

a) Version conflicts causing build failures

A dependency update merged into trunk can immediately break everyone's build as soon as team members pull the change into their local copies.
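
A common guard against this is pinning dependencies in a lock file and failing the build when the manifest drifts from it; a small sketch with hypothetical package versions:

```python
# Declared versions (manifest) vs. the pinned, reproducible set (lock
# file). On trunk, any mismatch should fail CI before it breaks
# everyone's build.
manifest = {"requests": "2.32.0", "numpy": "2.1.0"}
lockfile = {"requests": "2.31.0", "numpy": "2.1.0"}

def pin_drift(manifest: dict, lockfile: dict) -> list:
    """Return the packages whose declared version differs from the lock."""
    return sorted(pkg for pkg, ver in manifest.items()
                  if lockfile.get(pkg) != ver)

assert pin_drift(manifest, lockfile) == ["requests"]
```

Real ecosystems implement this check natively (e.g. `npm ci` or `pip`'s hash-checking mode), so dependency bumps land as deliberate, reviewable commits rather than silent trunk breakage.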


Question 7

A company recently adopted a CI/CD pipeline with automated rollbacks triggered by failed health checks. During a major release, the rollback mechanism activated unexpectedly, causing a 30-minute outage. Investigate potential causes of this failure and suggest improvements that leverage modern observability tools.

Adam

Looking at this, the rollback might have been triggered by health checks that were a little too sensitive, misconfigured, or reporting false positives. It could also be that the health checks didn't match the experience of real users, so normal behavior looked like a failure to the system. To improve this, the company should use modern observability tools like logs, metrics, and tracing to understand system behavior better. They could also add alerts so engineers working on the app can investigate before a rollback happens automatically.

Explanation: If the system thought something was wrong when it actually wasn't, the rollback kicked in and made users lose access for 30 minutes, which is a long outage. The health checks might have been checking the wrong things or were too strict. With more modern tools, everybody can see exactly what the system is doing; fix the health checks and make sure automatic rollbacks only happen when there's actually a problem.

Carlos

Well, we could assume that the automated rollback worked correctly in the first place; it did what it was supposed to do. That said, possible causes of the unexpected rollback may be:

| Cause | Explanation |
| --- | --- |
| Misconfigured health checks | Too aggressive timeout |
| Cold start latency | New version slower to initialize |
| Resource throttling | CPU/memory pressure |
| Incorrect readiness probe | Service marked unhealthy prematurely |
| Observability blind spots | Metrics misinterpreted |

Likely scenario for how the 30-minute outage happened:

  • Health checks triggered rollback
  • Rollback also failed health checks
  • System entered retry loop
  • No manual override guard

For improvements using modern observability, leverage tools like:

  • Datadog: commercial, full SaaS monitoring platform
  • Prometheus: open-source metrics collection & querying
  • Grafana Labs: open-source visualization (often paired with Prometheus)

Recommended improvements:

| Improvement | Benefit |
| --- | --- |
| Distributed tracing | Identify latency spikes |
| SLO-based rollback triggers | Avoid premature rollback |
| Progressive delivery (canary) | Limit blast radius |
| Automated anomaly detection | Reduce false positives |
| Circuit breakers | Prevent cascading failure |
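
An SLO-based rollback trigger like the one recommended above can be sketched as a sliding-window check, so a single transient probe failure doesn't fire a rollback; the window size and error-rate threshold below are illustrative:

```python
# Roll back only when the error rate over a sliding window of health
# probes breaches the SLO, not on the first failed check.
from collections import deque

class RollbackGuard:
    def __init__(self, window: int = 10, max_error_rate: float = 0.3):
        self.samples = deque(maxlen=window)   # 1 = failed probe, 0 = ok
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> bool:
        """Record one health probe; return True if rollback should fire."""
        self.samples.append(0 if ok else 1)
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough evidence yet
        rate = sum(self.samples) / len(self.samples)
        return rate > self.max_error_rate

guard = RollbackGuard()
# One transient failure among healthy probes does not trigger rollback:
blips = [guard.record(ok) for ok in [True] * 5 + [False] + [True] * 4]
assert not any(blips)
# Sustained failures do:
assert any(guard.record(False) for _ in range(10))
```

Pairing a guard like this with a manual-override escape hatch also prevents the retry loop described above, where a rollback that itself fails health checks keeps re-triggering.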