K8S Vulnerability Management

Diagram for a Trivy Container Image Scanner and CI/CD Pipeline

I've done Vulnerability Management as a Security Engineer for the past 8 years, across AWS EC2, Azure VM, GCP VM, and Docker / Kubernetes Container Images. Here is my best advice to any organization that wants to do it.

1. Don't Buy a Scanner Initially; Build It with Open-source

Build it, then buy it.

It's tempting to just buy a security vendor product, and that can be a good idea if there is budget and willingness within the organization to adopt a new tool. However, even with a vendor tool, the following steps still need to be implemented to be successful. I've seen multiple projects that adopted a vendor tool but failed because they did not follow all of these steps, and they faced a lot of pushback from developer teams and executives.

Container Scanners

  • Trivy
  • Clair
  • Grype / Anchore
  • Dagda
  • OpenSCAP
  • Dockle
  • Tern
  • Prisma Cloud
  • Wiz

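To make the open-source route concrete, here is a minimal sketch of driving a scan with Trivy, the first option in the list above. It assumes the trivy CLI is installed on the PATH, the image name is a placeholder, and the JSON fields follow Trivy's image-scan report layout.

```python
# Minimal sketch: run Trivy against one image and summarize findings by severity.
# Assumes the trivy CLI is installed; the image name below is a placeholder.
import json
import subprocess
from collections import Counter

IMAGE = "registry.example.com/base-images/python:20260101-abcd1234-dev"  # hypothetical

def scan_image(image: str) -> dict:
    """Run 'trivy image' in JSON mode and return the parsed report."""
    result = subprocess.run(
        ["trivy", "image", "--format", "json", image],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def severity_counts(report: dict) -> Counter:
    """Count vulnerabilities per severity across all scan targets in the report."""
    counts = Counter()
    for target in report.get("Results", []):
        for vuln in target.get("Vulnerabilities", []) or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts

if __name__ == "__main__":
    print(severity_counts(scan_image(IMAGE)))
```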

2. Create Dedicated Git Repos for Each Image

You cannot Git Version / Tag a folder in a Git Repo.

A common approach is to create a Git Monorepo containing all or many Container Images, but that prevents fine-grained Git Versioning / Tagging for each individual Container Image. If Image A and Image B live in the same Git Repo and you only change Image A, the new Git Version / Tag still applies to every folder and file in the Git Repo, which implies that Image B was also updated. It is better to put Image A and Image B into dedicated Git Repos, so that they can be updated, tested, and deployed independently. Ideally, all Container Images should be rebuilt every 7 days to fulfill compliance with security standards such as FedRAMP, with nightly builds for emergency security updates for Critical severity vulnerabilities.

3. Establish Image Ownership

Every Container Image should have one Owners Team.

I've heard countless times: "Everyone uses this image, so everyone owns it." That's not good enough! A specific, singular Developer Team needs to be responsible for maintaining, upgrading, fixing, and deploying each Container Image. The best way to do this is to require every Git Repo within the organization to have an Owners.txt file. You can automate verifying this with a Python script that scans every Git Repo, every morning, for the existence and contents of this file. Even better is to include specific Developer Team Manager names, emails, and Slack channels, which are also verified to still be active in case someone leaves the organization.
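A minimal sketch of that morning check, assuming GitHub-hosted repos, a token in the GITHUB_TOKEN environment variable, and a hypothetical organization name; it only verifies that Owners.txt exists and is non-empty, and leaves the manager / Slack validation as a follow-up.

```python
# Minimal sketch: flag every repo in a GitHub org that is missing a non-empty Owners.txt.
# Assumes a personal access token in GITHUB_TOKEN; "example-org" is a placeholder.
import os
import requests

ORG = "example-org"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def org_repos(org: str):
    """Yield repo names for the organization, following pagination."""
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            headers=HEADERS, params={"per_page": 100, "page": page}, timeout=30,
        )
        resp.raise_for_status()
        repos = resp.json()
        if not repos:
            return
        yield from (r["name"] for r in repos)
        page += 1

def has_owners_file(org: str, repo: str) -> bool:
    """True if Owners.txt exists in the default branch and is non-empty."""
    resp = requests.get(
        f"https://api.github.com/repos/{org}/{repo}/contents/Owners.txt",
        headers=HEADERS, timeout=30,
    )
    return resp.status_code == 200 and resp.json().get("size", 0) > 0

if __name__ == "__main__":
    for repo in org_repos(ORG):
        if not has_owners_file(ORG, repo):
            print(f"MISSING OWNER: {ORG}/{repo}")
```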

4. Establish Consistent Tagging / Versioning

Inconsistent Container Image Tagging makes accurate analysis impossible.

I've worked with multiple companies where each Developer Team has its own standard for tagging, e.g.: "1.0.1", "1.0.1-dev", "1.0.1-abcd1234", "1.0.1-abcd1234-dev", "2026.01.01-abcd1234", "2026.01.01-abcd1234-dev", "20260101-abcd1234", "20260101-abcd1234-dev". The organization needs every Developer Team to use the same Container Image Tagging / Versioning. I suggest the last example: "<date>-<git_hash>-<environment>".
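A small sketch of what generating and enforcing that format could look like; the allowed environment names and the 8-digit date / short-hash conventions are assumptions, not a standard.

```python
# Minimal sketch: build and validate "<date>-<git_hash>-<environment>" image tags.
# The allowed environments and the YYYYMMDD / short-hash conventions are assumptions.
import re
import subprocess
from datetime import date, datetime

TAG_PATTERN = re.compile(r"^\d{8}-[0-9a-f]{7,40}-(dev|stage|prod)$")

def build_tag(environment: str) -> str:
    """Produce a tag like 20260101-abcd1234-dev from today's date and the current commit."""
    git_hash = subprocess.run(
        ["git", "rev-parse", "--short=8", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return f"{date.today():%Y%m%d}-{git_hash}-{environment}"

def validate_tag(tag: str) -> bool:
    """True if the tag matches the agreed format; raises if the date part is invalid."""
    if not TAG_PATTERN.match(tag):
        return False
    datetime.strptime(tag.split("-", 1)[0], "%Y%m%d")  # raises on a bad date
    return True

if __name__ == "__main__":
    tag = build_tag("dev")
    print(tag, validate_tag(tag))
```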

5. Stop Using Publicly Available Images

Even the most popular Container Images, for things as common as Python, are only rebuilt every 3-4 weeks, which is not fast enough to integrate security fixes for Critical and High severity vulnerabilities. The latest FedRAMP standard requires remediation within 3 days for Critical, 7 days for High, 21 days for Moderate, and 180 days for Low.

6. Use Shared Secure Base Images

While migrating away from Public Images, it's important to establish organization-wide usage of Shared Secure Base Images, often Ubuntu / Debian, Amazon Linux 2 / RHEL, and optionally Windows Server, or some variation. Ideally it would be only RHEL or only Debian, but in practice the licensing costs for RHEL become "too expensive" at scale, which is a discussion to be had between the Sales, Security, Platform, Legal, and Finance Teams. If you don't want to build your own Shared Secure Base Images, investigate purchasing them from Chainguard.

7. Create a Standardized CI/CD Pipeline

Technically the Shared Secure Base Images can be built manually, but the process should be fully automated with a CI/CD Pipeline. I suggest using GitHub Actions, HashiCorp Packer, and Ansible. GitHub Actions has been replacing Jenkins and other CI/CD Pipeline Tools for the past few years. HashiCorp Packer is a CLI tool for easily building both Cloud VM and Container Images. Ansible can be used to automate hardening of the Container Images, and multiple open-source CIS Benchmark Ansible Playbooks are available.
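As a rough illustration of the build step those tools perform, here is a sketch of the kind of Python wrapper a GitHub Actions job might call; the Packer template path is a placeholder, and Ansible is assumed to run inside the template via Packer's ansible provisioner.

```python
# Minimal sketch of a build step a GitHub Actions job could call: Packer builds
# the image, invoking Ansible hardening via its provisioner inside the template.
# The template file name is a placeholder.
import subprocess
import sys

TEMPLATE = "base-image.pkr.hcl"  # hypothetical Packer template

def run(*cmd: str) -> None:
    """Run a command, echoing it first, and stop the build on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    try:
        run("packer", "init", ".")
        run("packer", "validate", TEMPLATE)
        run("packer", "build", TEMPLATE)
    except subprocess.CalledProcessError as exc:
        sys.exit(exc.returncode)
```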

8. Test Everything

There are no silver bullets in IT, except having a robust test suite with 90%+ code coverage. Any organization that is running more than 10 Container Images needs automated testing suites. Manual testing in 2026 is not good enough. See my gist below for CI/CD Pipeline Tool suggestions.
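A minimal pytest sketch of what image-level smoke tests can look like, assuming Docker is available on the build agent; the image tag, the expected base OS, and the expected Python version are all placeholders.

```python
# Minimal pytest sketch: smoke-test a freshly built image before it is pushed.
# Assumes Docker is available; the image tag and expected values are placeholders.
import subprocess

IMAGE = "base-images/ubuntu:20260101-abcd1234-dev"  # hypothetical tag

def run_in_image(*cmd: str) -> str:
    """Run a command inside the image and return its stdout."""
    result = subprocess.run(
        ["docker", "run", "--rm", IMAGE, *cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def test_expected_base_os():
    # Fail fast if someone swapped the base image without updating the tests.
    assert "Ubuntu" in run_in_image("cat", "/etc/os-release")

def test_runs_as_non_root_user():
    # Hardened base images should not default to uid 0.
    assert run_in_image("id", "-u").strip() != "0"

def test_expected_python_version():
    # Pin the runtime version so upgrades are deliberate, not accidental.
    assert run_in_image("python3", "--version").startswith("Python 3.12")
```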

9. Scan Everything

The targets to scan are:

  • AWS EKS / ArgoCD
  • AWS ECR
  • JFrog Artifactory
  • CI/CD Build Logs

What is running? What is stored? What are the results of the latest CI/CD Builds?

Ideally, both AWS ECR and JFrog Artifactory should be purged of any Container Image that is older than 6 months. After that initial change has been made, the Retention Policy can be shortened to 3 months, 1 month, or even just 7 days. The results of all these scans and logs should be stored in a database, whether SQL such as MySQL or Postgres, or NoSQL such as Elasticsearch, MongoDB, or similar.
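A minimal sketch of the ingestion side, assuming a Postgres database, the psycopg2 driver, and a Trivy JSON report on disk; the connection string and table layout are assumptions chosen to support the KPIs in the next step.

```python
# Minimal sketch: load one Trivy JSON report into Postgres so findings can be queried.
# Assumes psycopg2 and a reachable database; the DSN and table layout are assumptions.
import json
import sys
from datetime import datetime, timezone

import psycopg2

DSN = "postgresql://vulnmgmt:secret@localhost:5432/vulnmgmt"  # placeholder

SCHEMA = """
CREATE TABLE IF NOT EXISTS findings (
    scanned_at  TIMESTAMPTZ NOT NULL,
    image       TEXT        NOT NULL,
    cve_id      TEXT        NOT NULL,
    severity    TEXT        NOT NULL,
    package     TEXT        NOT NULL
);
"""

def load_report(path: str, conn) -> int:
    """Insert every vulnerability from a Trivy JSON report; return the row count."""
    report = json.load(open(path))
    now = datetime.now(timezone.utc)
    image = report.get("ArtifactName", "unknown")
    rows = [
        (now, image, v["VulnerabilityID"], v.get("Severity", "UNKNOWN"), v.get("PkgName", ""))
        for target in report.get("Results", [])
        for v in (target.get("Vulnerabilities") or [])
    ]
    with conn, conn.cursor() as cur:
        cur.execute(SCHEMA)
        cur.executemany(
            "INSERT INTO findings (scanned_at, image, cve_id, severity, package) "
            "VALUES (%s, %s, %s, %s, %s)", rows,
        )
    return len(rows)

if __name__ == "__main__":
    connection = psycopg2.connect(DSN)
    print(load_report(sys.argv[1], connection), "findings loaded")
```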

10. Build Reporting

Now that everything has been scanned and is stored in a database that can be queried, it's possible to create reporting. This can be as simple as a Python script that creates a CSV / Excel file and is run once a week on the latest findings, producing both point-in-time global summaries and changes over time. Generally, executives only care about the changes over time, and expect them to be improving. If you skipped any of the earlier steps in this document, then you will have a lot of difficulty explaining to executives: "What needs to be fixed?" and "Who is responsible for fixing it?"

Suggested KPIs (a query sketch follows this list):

  • Total Count, of All Vulnerabilities
  • Total Count, of All Vulnerabilities, Group By Severity
  • Total Count, of All Vulnerabilities, Group By Severity and Team (Most Vulnerable Team)
  • Total Count, of Top 5 Vulnerabilities, by Occurrences (Most Common Vulnerabilities)
  • Container Images, Order By Vulnerability Count (Most Vulnerable Images)
  • Container Images, Order By Age Desc (Oldest Images)
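A minimal reporting sketch against the findings table assumed in the previous step; it covers the first few KPIs above and writes one CSV per query, which is enough to seed a weekly report.

```python
# Minimal sketch: run a few of the KPI queries above against the findings table
# from the previous step and write each result to its own CSV file.
import csv

import psycopg2

DSN = "postgresql://vulnmgmt:secret@localhost:5432/vulnmgmt"  # placeholder

KPI_QUERIES = {
    "total_vulnerabilities":
        "SELECT COUNT(*) AS total FROM findings",
    "vulnerabilities_by_severity":
        "SELECT severity, COUNT(*) AS total FROM findings "
        "GROUP BY severity ORDER BY total DESC",
    "most_common_vulnerabilities":
        "SELECT cve_id, COUNT(*) AS occurrences FROM findings "
        "GROUP BY cve_id ORDER BY occurrences DESC LIMIT 5",
    "most_vulnerable_images":
        "SELECT image, COUNT(*) AS total FROM findings "
        "GROUP BY image ORDER BY total DESC",
}

def export_kpis(conn) -> None:
    """Write one CSV per KPI query, with a header row taken from the cursor."""
    with conn.cursor() as cur:
        for name, query in KPI_QUERIES.items():
            cur.execute(query)
            with open(f"{name}.csv", "w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow(col.name for col in cur.description)
                writer.writerows(cur.fetchall())

if __name__ == "__main__":
    export_kpis(psycopg2.connect(DSN))
```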

11. Build Dashboards

Executives like to visualize data, and a lot of Developer Team Managers don't know how to write SQL queries. I prefer to give everyone read-only SQL access so that they can self-serve analysis with the data. I've used countless visualization tools.

12. Block Insecure Builds

After a few weeks or months of fundamentally changing how the organization operates, you can start to block insecure builds within the CI/CD Pipelines. At this point of technical maturity, the Developer Teams will have all of the information and tools they need to be successful.
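A minimal sketch of that gate, assuming the Trivy CLI is available in the pipeline; Trivy's own --exit-code flag does the blocking, so the wrapper only decides which severities fail the build.

```python
# Minimal sketch of a CI/CD gate: fail the pipeline when Trivy finds
# CRITICAL or HIGH vulnerabilities in the freshly built image.
# Assumes the trivy CLI is installed; the image tag is passed in from the pipeline.
import subprocess
import sys

BLOCKING_SEVERITIES = "CRITICAL,HIGH"

def gate(image: str) -> int:
    """Return 0 if the image is clean at the blocking severities, non-zero otherwise."""
    result = subprocess.run([
        "trivy", "image",
        "--severity", BLOCKING_SEVERITIES,
        "--exit-code", "1",        # non-zero exit when findings exist
        "--ignore-unfixed",        # only block on vulnerabilities that have a fix
        image,
    ])
    return result.returncode

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```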

13. Buy an Enterprise Tool

After doing the earlier steps for 3-6 months, consider buying an enterprise vendor tool. If you chose a tool like Trivy, you can swap in the enterprise version from Aqua Security. The source data for Reporting and Dashboards will change, but the business process workflow will stay the same.
