For Kubernetes/Cloud Engineers transitioning into AI/ML Engineering. Updated with: Data Engineering foundations, Vector DBs, LLMOps, Distributed Training, Security & Governance, and ML Observability. Resources are listed in the order you should follow them.
You can't be an ML Platform Engineer without understanding the data layer. Start both tracks in parallel.
| # | Resource | URL | Type |
|---|---|---|---|
| 1 | Andrew Ng – ML Specialization (theory foundations first) | https://www.coursera.org/specializations/machine-learning-introduction | Course (Free audit) |
| 2 | fast.ai – Practical Deep Learning for Coders (top-down, code-first) | https://course.fast.ai/ | Course (Free) |
| 3 | fast.ai Book (companion, freely available online) | https://fastai.github.io/fastbook2e/ | Book (Free) |
Modern ML platforms = feature pipelines + batch + streaming + lakehouse. Without this, you look like a model-serving engineer, not a platform engineer.
| # | Resource | URL | Type |
|---|---|---|---|
| 4 | Apache Parquet & Arrow Fundamentals – columnar storage, the foundation | https://arrow.apache.org/docs/python/parquet.html | Docs |
| 5 | Delta Lake Docs – ACID ML data lakes (Databricks-native) | https://docs.delta.io/latest/index.html | Docs |
| 6 | Apache Iceberg Docs – open table format, broader engine support | https://iceberg.apache.org/docs/latest/ | Docs |
| 7 | Apache Kafka Intro – event-driven data for ML feature pipelines | https://kafka.apache.org/intro | Docs |
| 8 | Confluent Kafka Docs – deeper Kafka reference | https://docs.confluent.io/kafka/introduction.html | Docs |
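The core job of a streaming feature pipeline (Kafka consumer feeding a feature store) is windowed aggregation. A minimal sketch of a tumbling-window click counter, in plain Python with an illustrative event shape — real pipelines would do this in Flink, Spark, or a Feast materialization job:

```python
from collections import defaultdict

def aggregate_clicks(events, window_seconds=60):
    """Count clicks per user per tumbling window.

    Each event is assumed to look like {"user_id": ..., "ts": <epoch seconds>};
    the window is identified by its start timestamp.
    """
    features = defaultdict(int)
    for event in events:
        window_start = (event["ts"] // window_seconds) * window_seconds
        features[(event["user_id"], window_start)] += 1
    return dict(features)

events = [
    {"user_id": "u1", "ts": 5},
    {"user_id": "u1", "ts": 30},
    {"user_id": "u1", "ts": 65},   # falls into the next 60s window
    {"user_id": "u2", "ts": 10},
]
print(aggregate_clicks(events))
# {('u1', 0): 2, ('u1', 60): 1, ('u2', 0): 1}
```

The resulting `(entity, window)` keyed counts are exactly the rows you would write to an offline store (Iceberg/Delta) and sync to an online store for low-latency serving.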
- Write posts positioning your Kubernetes operators as ML infrastructure
- Your developer platforms = Internal ML Developer Portals
- Your API skills = ML serving gateways with rate limiting and queuing
| # | Resource | URL | Type |
|---|---|---|---|
| 9 | KServe Docs – Kubernetes-native model serving (operator patterns you'll recognize) | https://kserve.github.io/website/docs/intro | Docs |
| 10 | KServe GitHub β source, examples, issues | https://github.com/kserve/kserve | GitHub |
| 11 | KServe Quickstart | https://kserve.github.io/website/docs/getting-started/quickstart-guide | Quickstart |
| 12 | Ray Serve Docs – alternative serving framework | https://docs.ray.io/en/latest/serve/index.html | Docs |
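As a taste of why KServe will feel familiar: deploying a model is a single custom resource, just like any operator-managed workload. A minimal `InferenceService` manifest in the style of the KServe quickstart (the `storageUri` below points at KServe's public example model; swap in your own bucket):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

The KServe controller reconciles this into a Deployment, autoscaler, and HTTP endpoint — the same controller/CRD pattern you already operate every day.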
| # | Resource | URL | Type |
|---|---|---|---|
| 13 | MLflow Docs – experiment tracking and model registry | https://mlflow.org/docs/latest/index.html | Docs |
| 14 | Kubeflow – ML workflows on Kubernetes | https://www.kubeflow.org/ | Docs |
| 15 | Kubeflow Pipelines – portable, scalable ML workflows | https://www.kubeflow.org/docs/components/pipelines/overview/ | Docs |
| # | Resource | URL | Type |
|---|---|---|---|
| 16 | Feast Docs – open-source feature store | https://docs.feast.dev | Docs |
| 17 | Feast GitHub – source and examples | https://github.com/feast-dev/feast | GitHub |
Train a simple model → serve it via KServe on Kind → expose via a Go API you write → log experiments with MLflow.
This is the highest-demand zone in 2026. RAG systems are production reality. Interviewers will probe deeply here.
| # | Resource | URL | Type |
|---|---|---|---|
| 18 | vLLM Docs – high-throughput LLM serving with paged attention | https://docs.vllm.ai/en/stable/ | Docs |
| 19 | vLLM Quickstart | https://docs.vllm.ai/en/latest/getting_started/quickstart/ | Quickstart |
| 20 | Ollama – run open-source LLMs locally and on-cluster | https://ollama.com/ | Tool |
You need to understand: embedding pipelines, chunking strategy, vector indexing (HNSW, IVF), recall vs latency tradeoffs.
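Chunking strategy is the part of this list you can demonstrate in a few lines. A minimal fixed-size chunker with overlap (character-based here for simplicity; production systems usually chunk by tokens or by document structure), so that sentences straddling a boundary appear whole in at least one chunk:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    Overlap keeps context that straddles a chunk boundary retrievable;
    chunk_size/overlap values here are illustrative, not recommendations.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "a" * 450
print([len(c) for c in chunk_text(doc)])  # [200, 200, 150]
```

Bigger chunks mean fewer embeddings and cheaper indexing but blunter retrieval; smaller chunks sharpen retrieval but lose context — be ready to argue that tradeoff in interviews.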
| # | Resource | URL | Type |
|---|---|---|---|
| 21 | Milvus Docs – enterprise-scale open-source vector DB, Kubernetes-native | https://milvus.io/docs | Docs |
| 22 | Weaviate Docs – hybrid search + knowledge graph capabilities | https://weaviate.io/developers/weaviate | Docs |
| 23 | Pinecone Docs – managed vector DB, easiest to start with | https://docs.pinecone.io/home | Docs |
| 24 | Qdrant Docs – performance-focused OSS vector DB, good for self-hosting | https://qdrant.tech/documentation/ | Docs |
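The recall-vs-latency tradeoff is easiest to explain against the exact baseline: brute-force cosine search has 100% recall but O(N) latency per query, which is precisely what HNSW and IVF trade away a little recall to beat. A dependency-free sketch (2-dimensional toy vectors; real embeddings are hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=2):
    """Exact nearest-neighbour search: scan every vector, keep the best k."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc1": [1.0, 0.0],
    "doc2": [0.9, 0.1],
    "doc3": [0.0, 1.0],
}
print(top_k([1.0, 0.05], index, k=2))  # ['doc1', 'doc2']
```

HNSW replaces the linear scan with a navigable graph walk; IVF clusters vectors and only scans the nearest clusters — both are approximate, and tuning their parameters (ef/nprobe) is how you buy recall back at the cost of latency.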
| # | Resource | URL | Type |
|---|---|---|---|
| 25 | NVIDIA GPU Operator – GPU node management on Kubernetes | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html | Docs |
| 26 | KEDA Docs – event-driven autoscaling on GPU utilization metrics | https://keda.sh/docs/latest/ | Docs |
| 27 | KEDA GitHub | https://github.com/kedacore/keda | GitHub |
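To make "autoscaling on GPU utilization" concrete: KEDA's Prometheus scaler can drive replicas from any PromQL query. A sketch of a `ScaledObject` — the deployment name, Prometheus address, and the `DCGM_FI_DEV_GPU_UTIL` metric (exposed by NVIDIA's DCGM exporter) are assumptions about your cluster setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-server-scaler
spec:
  scaleTargetRef:
    name: llm-server          # your inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL)   # assumes DCGM exporter is installed
      threshold: "70"                    # scale out above ~70% GPU utilization
```

In practice you often combine a GPU-utilization trigger with a queue-depth trigger, since GPU utilization alone lags behind bursty request traffic.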
| # | Resource | URL | Type |
|---|---|---|---|
| 28 | HuggingFace Quantization Guide – GGUF, AWQ, GPTQ explained | https://huggingface.co/docs/transformers/main/en/quantization/overview | Docs |
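The headline reason quantization matters is simple arithmetic: weight memory scales linearly with bits per parameter. A back-of-the-envelope calculator (weights only — KV cache and activations add more on top):

```python
def model_memory_gb(params_billions, bits_per_param):
    """Approximate GPU memory for model weights alone, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for bits, fmt in [(16, "fp16"), (8, "int8"), (4, "4-bit (GPTQ/AWQ/GGUF)")]:
    print(f"7B @ {fmt}: ~{model_memory_gb(7, bits):.1f} GB")
# 7B @ fp16: ~14.0 GB
# 7B @ int8: ~7.0 GB
# 7B @ 4-bit (GPTQ/AWQ/GGUF): ~3.5 GB
```

This is why a 4-bit 7B model fits comfortably on a single consumer GPU while the fp16 version needs a datacenter card — exactly the tradeoff you should be able to recite when sizing node pools.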
Deploy a self-hosted LLM (Llama/Mistral) + a vector store (Milvus) + autoscaling via KEDA + a Go API with request queuing. This is the foundation of your capstone RAG platform.
Prompt versioning, evaluation frameworks, guardrails – these are now standard interview topics.
| # | Resource | URL | Type |
|---|---|---|---|
| 29 | LangChain Docs – build LLM chains, agents, and RAG pipelines | https://python.langchain.com/docs/introduction/ | Docs |
| 30 | LangSmith Docs – prompt versioning, tracing, and evaluation | https://docs.smith.langchain.com/ | Docs |
| 31 | LlamaIndex Docs – data framework for LLM retrieval (RAG-focused) | https://docs.llamaindex.ai/en/stable/ | Docs |
| 32 | Weights & Biases – LLM Evals | https://wandb.ai/site/solutions/llm | Docs |
| 33 | Guardrails AI – output validation and guardrails for LLMs | https://www.guardrailsai.com/docs | Docs |
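To have something concrete to say about guardrails: the basic pattern is validating model output against a schema and rejecting unsafe content before it reaches the user. A hand-rolled sketch of the kind of check Guardrails AI automates — this is not its API, just the underlying idea:

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def validate_llm_output(raw, required_keys=("answer", "sources")):
    """Minimal output guardrail: parse as JSON, enforce a schema,
    and reject responses leaking an email address (a stand-in PII check)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = [k for k in required_keys if k not in data]
    if missing:
        return False, f"missing keys: {missing}"
    if EMAIL_RE.search(data["answer"]):
        return False, "answer contains an email address"
    return True, "ok"

print(validate_llm_output('{"answer": "42", "sources": []}'))  # (True, 'ok')
print(validate_llm_output("free-form model rambling"))         # (False, 'not valid JSON')
```

Production frameworks add retry-with-feedback loops (re-prompting the model with the validation error), which is the part worth mentioning in interviews.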
You're inference-heavy so far. Add this to cover senior platform roles that own the full training → serving lifecycle.
| # | Resource | URL | Type |
|---|---|---|---|
| 34 | PyTorch Distributed Training Overview – data, tensor, pipeline parallelism explained | https://pytorch.org/tutorials/beginner/dist_overview.html | Docs |
| 35 | Ray Train Docs – distributed ML training (pairs with Ray Serve you already know) | https://docs.ray.io/en/latest/train/train.html | Docs |
| 36 | Ray Train + PyTorch Quickstart | https://docs.ray.io/en/latest/train/getting-started-pytorch.html | Quickstart |
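The kernel of data parallelism is easy to internalize without any GPUs: each worker computes gradients on its own data shard, then an all-reduce averages them so every replica applies the same update. A toy simulation of that collective (what PyTorch DDP's all-reduce computes, minus the networking):

```python
def allreduce_mean(worker_grads):
    """Average per-parameter gradients across workers.

    worker_grads: one gradient vector per worker, all the same length.
    Returns the element-wise mean - the update every replica applies.
    """
    n = len(worker_grads)
    width = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(width)]

# Each worker saw a different data shard, so gradients differ:
grads = [
    [0.25, -0.75],   # worker 0
    [0.75, -0.25],   # worker 1
]
print(allreduce_mean(grads))  # [0.5, -0.5]
```

Tensor and pipeline parallelism answer a different question — how to split a model that doesn't fit on one device — which is why interviewers ask you to contrast all three.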
| # | Resource | URL | Type |
|---|---|---|---|
| 37 | Uber – Michelangelo ML Platform | https://www.uber.com/blog/michelangelo-machine-learning-platform/ | Blog |
| 38 | Airbnb – ML Platform Architecture | https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d | Blog |
| 39 | LinkedIn – Scaling ML Productivity | https://engineering.linkedin.com/blog/2019/01/scaling-machine-learning-productivity-at-linkedin | Blog |
| 40 | Netflix – Metaflow Open Source | https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9 | Blog |
| # | Resource | URL | Type |
|---|---|---|---|
| 41 | Kubebuilder Book – build operators (review ML-specific patterns) | https://book.kubebuilder.io/ | Book (Free) |
| 42 | Operator SDK Docs | https://sdk.operatorframework.io/docs/ | Docs |
| # | Resource | URL | Type |
|---|---|---|---|
| 43 | AWS EC2 Spot Instances for ML | https://aws.amazon.com/blogs/machine-learning/run-machine-learning-workloads-with-amazon-ec2-spot-instances-and-amazon-ec2-auto-scaling/ | Blog |
| 44 | GCP Spot VMs for AI Workloads | https://cloud.google.com/compute/docs/instances/spot | Docs |
| 45 | NVIDIA MIG (Multi-Instance GPU) | https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ | Docs |
Build a Golang Kubernetes Operator for ModelServer that manages the model lifecycle (versioning, canary rollouts, shadow mode). Add LangSmith-based prompt evaluation to your RAG pipeline from Month 3.
Prometheus + Grafana alone is insufficient for AI roles. Interviewers will ask about: token throughput, latency percentiles, hallucination rate tracking, embedding drift.
| # | Resource | URL | Type |
|---|---|---|---|
| 46 | Arize AI – LLM observability with embedding drift, hallucination monitoring | https://docs.arize.com/arize | Docs |
| 47 | WhyLabs Docs – ML monitoring for data drift and model performance | https://docs.whylabs.ai/docs/ | Docs |
| 48 | Evidently AI – open-source ML monitoring and drift detection | https://docs.evidentlyai.com/ | Docs |
| 49 | Prometheus Docs – infrastructure metrics (token throughput, GPU utilization) | https://prometheus.io/docs/introduction/overview/ | Docs |
| 50 | Grafana Docs – dashboarding for ML-specific metrics | https://grafana.com/docs/grafana/latest/ | Docs |
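Two of the metrics named above — latency percentiles and token throughput — are worth being able to compute by hand. A dependency-free sketch using the nearest-rank percentile definition (Prometheus histograms estimate these for you in production; the sample values are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 210, 14, 13, 16, 12, 500, 15]
print("p50:", percentile(latencies_ms, 50))  # p50: 14
print("p95:", percentile(latencies_ms, 95))  # p95: 500

# Token throughput is just tokens generated over wall-clock time:
tokens_out, wall_seconds = 18_000, 12.0
print("throughput:", tokens_out / wall_seconds, "tokens/s")  # 1500.0 tokens/s
```

Note how the two outliers barely move p50 but dominate p95 — the reason LLM SLOs are written against tail percentiles, never averages.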
Model supply chain security, PII handling, data lineage, RBAC for model access – platform engineers who know this stand out.
| # | Resource | URL | Type |
|---|---|---|---|
| 51 | ML Security OWASP Top 10 for LLMs – the canonical security reference for LLM apps | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Docs |
| 52 | Sigstore – Model Signing – supply chain security for ML artifacts | https://www.sigstore.dev/ | Docs |
| 53 | OPA (Open Policy Agent) – RBAC and policy enforcement for model access | https://www.openpolicyagent.org/docs/latest/ | Docs |
| 54 | OpenLineage – data lineage standard for ML pipelines | https://openlineage.io/ | Docs |
| 55 | Marquez – metadata and lineage service built on OpenLineage | https://marquezproject.ai/ | Docs |
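OPA policies are written in Rego. A minimal, hypothetical policy gating model-endpoint access by team ownership might look like this — the package name and the `input.user`/`input.model` shape are assumptions your serving gateway would define, not an OPA standard:

```rego
package model_access

import rego.v1

# Deny by default; the serving gateway queries OPA with the caller's
# identity and the model being requested (input shape is illustrative).
default allow := false

# The team that owns the model may call it.
allow if input.user.team == input.model.owner_team

# Platform admins may call any model.
allow if "ml-platform-admin" in input.user.roles
```

Deny-by-default plus narrow allow rules is the pattern to lead with when an interviewer asks how you'd do multi-tenant model RBAC.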
| # | Resource | URL | Type |
|---|---|---|---|
| 56 | Vertex AI Documentation | https://cloud.google.com/vertex-ai/docs | Docs |
| 57 | GCP ML Engineer Learning Path | https://www.skills.google/paths/17 | Course (Free) |
| 58 | Preparing for GCP ML Eng Cert β Coursera | https://www.coursera.org/professional-certificates/preparing-for-google-cloud-machine-learning-engineer-professional-certificate | Course |
| # | Resource | URL | Type |
|---|---|---|---|
| 59 | Amazon SageMaker Developer Guide | https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html | Docs |
| 60 | Amazon Bedrock Docs | https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html | Docs |
| # | Resource | URL | Type |
|---|---|---|---|
| 61 | Azure Machine Learning Docs | https://learn.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-machine-learning | Docs |
Senior roles care more about OSS contributions and technical blog depth than certs. Treat the cert as a bonus, not the goal.
| # | Resource | URL | Notes |
|---|---|---|---|
| 62 | Google Professional ML Engineer (primary – most respected) | https://cloud.google.com/learn/certification/machine-learning-engineer | $200 |
| 63 | Exam Guide PDF | https://cloud.google.com/learn/certification/guides/machine-learning-engineer | Free |
| 64 | AWS ML Specialty (optional secondary) | https://aws.amazon.com/certification/certified-machine-learning-specialty/ | $300 |
One cohesive system beats four disconnected demos. This single project will outperform most ML engineer portfolios.
System components to build:
| Component | Technology |
|---|---|
| Data ingestion pipeline | Kafka + Iceberg/Delta Lake |
| Embedding generation service | HuggingFace models + vLLM |
| Vector database | Milvus (Kubernetes-native) |
| LLM inference | vLLM with quantized Llama/Mistral |
| Autoscaling | KEDA on GPU utilization + queue depth |
| Model lifecycle management | Your custom Golang Kubernetes Operator |
| Prompt evaluation & tracing | LangSmith or W&B |
| Observability dashboards | Prometheus + Grafana + Arize AI (embedding drift) |
| Canary model rollout | Istio traffic splitting via your Operator |
| Security & access control | OPA policies for model endpoint RBAC |
| Data lineage | OpenLineage integration |
| Cost monitoring | GPU utilization reports + spot instance optimization |
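For the canary rollout component: Istio splits traffic by weight in a `VirtualService`. A sketch your Operator could template — the host and the `stable`/`canary` subsets are assumptions, and the subsets must be defined in a matching `DestinationRule`:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: modelserver
spec:
  hosts:
  - modelserver
  http:
  - route:
    - destination:
        host: modelserver
        subset: stable      # current model version
      weight: 90
    - destination:
        host: modelserver
        subset: canary      # candidate model version
      weight: 10
```

Your Operator's reconcile loop can ratchet the canary weight up (10 → 25 → 50 → 100) as observability metrics stay green, and snap it back to 0 on regression — that automated loop is the part worth demoing.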
| # | Resource | URL | Type |
|---|---|---|---|
| 65 | ML Engineer jobs β LinkedIn | https://www.linkedin.com/jobs/search/?keywords=ML+Platform+Engineer | Job Board |
| 66 | ai-jobs.net β AI-specific job board | https://ai-jobs.net/ | Job Board |
| 67 | Levels.fyi β ML Engineer salaries | https://www.levels.fyi/t/machine-learning-engineer | Salary Data |
| Resource | URL | Why |
|---|---|---|
| HuggingFace – model hub, datasets, spaces | https://huggingface.co/ | Central hub for open-source models |
| Papers With Code – ML research + reproducible code | https://paperswithcode.com/ | Stay current on research |
| The Batch (Andrew Ng newsletter) | https://www.deeplearning.ai/the-batch/ | Weekly AI news digest |
| MLOps Community | https://mlops.community/ | Networking with practitioners |
| CNCF AI/ML Working Group | https://github.com/cncf/tag-runtime/blob/main/wg/artificial-intelligence.md | Kubernetes + ML community |
| Golang Kubernetes client-go | https://github.com/kubernetes/client-go | Operator development base |
| Apache Arrow Docs | https://arrow.apache.org/docs/ | Columnar data format fundamentals |
- Capstone RAG Platform – full production stack (see Month 6 table above)
- Golang Kubernetes Operator for ModelServer (versioning, canary, shadow mode)
- End-to-end ML pipeline – Kafka → Iceberg → training → model registry → serving
- Distributed training demo – PyTorch + Ray Train on multi-GPU Kubernetes node pool
- 3+ technical blog posts framing infra skills through the ML lens
- 1 OSS contribution to KServe, Kubeflow, or Milvus (beats any cert)
- LinkedIn posts on GPU autoscaling, LLM serving, or MLOps – this space has low-quality content, you'll stand out
- Can explain RAG architecture end-to-end (chunking → embedding → retrieval → generation)
- Can articulate data parallelism vs tensor parallelism vs pipeline parallelism
- Can discuss HNSW vs IVF vector indexing tradeoffs
- Can explain ML supply chain security threats (model poisoning, prompt injection)
- Can design RBAC for a multi-tenant model serving platform
Month 1: ML Fundamentals + Data Systems (Kafka, Iceberg, Delta Lake)
↓
Month 2: MLOps (KServe, Kubeflow, Feast, MLflow)
↓
Month 3: LLM Infra (vLLM, Vector DBs, GPU ops)
↓
Month 4: LLMOps + Distributed Training + Platform Architecture
↓
Month 5: Observability + Security + Cloud AI Services
↓
Month 6: Capstone Project + Portfolio + Job Search
Your unfair advantage: Most ML engineers are learning Kubernetes. You already own it. Add the ML domain layer on top and you become the rarest hire in the market – a platform engineer who also speaks ML.