🚀 AI/ML Platform Engineer: 6-Month Learning Roadmap

🚀 AI/ML Platform Engineer: 6-Month Senior-Ready Roadmap

For Kubernetes/Cloud Engineers transitioning into AI/ML Engineering.

Updated with: Data Engineering foundations, Vector DBs, LLMOps, Distributed Training, Security & Governance, ML Observability.

Resources are listed in the order you should follow them.


📅 MONTH 1: ML Foundations + Data Systems for ML

You can't be an ML Platform Engineer without understanding the data layer. Start both tracks in parallel.

Track A: ML Fundamentals

| # | Resource | URL | Type |
|---|---|---|---|
| 1 | Andrew Ng: ML Specialization (theory foundations first) | https://www.coursera.org/specializations/machine-learning-introduction | Course (Free audit) |
| 2 | fast.ai: Practical Deep Learning for Coders (top-down, code-first) | https://course.fast.ai/ | Course (Free) |
| 3 | fast.ai Book (companion, freely available online) | https://fastai.github.io/fastbook2e/ | Book (Free) |

Track B: Data Engineering for ML (Critical Layer, Don't Skip)

Modern ML platforms = feature pipelines + batch + streaming + lakehouse. Without this, you look like a model-serving engineer, not a platform engineer.

| # | Resource | URL | Type |
|---|---|---|---|
| 4 | Apache Parquet & Arrow Fundamentals: columnar storage, the foundation | https://arrow.apache.org/docs/python/parquet.html | Docs |
| 5 | Delta Lake Docs: ACID ML data lakes (Databricks-native) | https://docs.delta.io/latest/index.html | Docs |
| 6 | Apache Iceberg Docs: open table format, broader engine support | https://iceberg.apache.org/docs/latest/ | Docs |
| 7 | Apache Kafka Intro: event-driven data for ML feature pipelines | https://kafka.apache.org/intro | Docs |
| 8 | Confluent Kafka Docs: deeper Kafka reference | https://docs.confluent.io/kafka/introduction.html | Docs |

🔁 Reframe Your Existing Work

  • Write posts positioning your Kubernetes operators as ML infrastructure
  • Your developer platforms = Internal ML Developer Portals
  • Your API skills = ML serving gateways with rate limiting and queuing
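To make the last bullet concrete, here is a minimal token-bucket rate limiter of the kind a serving gateway might place in front of a model endpoint. It is a sketch in Python for brevity, not a production gateway; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for this example.

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For LLM endpoints, `cost` could plausibly be set to the number of requested tokens rather than 1, so that large generations consume more of the budget.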

📅 MONTH 2: MLOps Core + Tooling

Model Serving

| # | Resource | URL | Type |
|---|---|---|---|
| 9 | KServe Docs: Kubernetes-native model serving (operator patterns you'll recognize) | https://kserve.github.io/website/docs/intro | Docs |
| 10 | KServe GitHub: source, examples, issues | https://github.com/kserve/kserve | GitHub |
| 11 | KServe Quickstart | https://kserve.github.io/website/docs/getting-started/quickstart-guide | Quickstart |
| 12 | Ray Serve Docs: alternative serving framework | https://docs.ray.io/en/latest/serve/index.html | Docs |

Experiment Tracking & Pipelines

| # | Resource | URL | Type |
|---|---|---|---|
| 13 | MLflow Docs: experiment tracking and model registry | https://mlflow.org/docs/latest/index.html | Docs |
| 14 | Kubeflow: ML workflows on Kubernetes | https://www.kubeflow.org/ | Docs |
| 15 | Kubeflow Pipelines: portable, scalable ML workflows | https://www.kubeflow.org/docs/components/pipelines/overview/ | Docs |

Feature Stores

| # | Resource | URL | Type |
|---|---|---|---|
| 16 | Feast Docs: open-source feature store | https://docs.feast.dev | Docs |
| 17 | Feast GitHub: source and examples | https://github.com/feast-dev/feast | GitHub |

🔨 Month 2 Project

Train a simple model → serve it via KServe on Kind → expose via a Go API you write → log experiments with MLflow
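The "expose via an API" step comes down to forwarding a prediction request to the model server. The sketch below builds such a request, assuming the model is served with KServe's V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` array). It is written in Python to keep the example short, although the project calls for a Go gateway. The function name, the base URL, and the model name `sklearn-iris` are placeholders for illustration.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build (but do not send) a V1-protocol predict request for one model."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage: the host and model name are placeholders, not real endpoints.
request = build_predict_request(
    "http://localhost:8080", "sklearn-iris", [[6.8, 2.8, 4.8, 1.4]]
)
# Sending it would be: urllib.request.urlopen(request, timeout=5)
```

Building the request separately from sending it keeps the payload logic testable without a running cluster.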


📅 MONTH 3: LLM Infrastructure + Vector Databases

This is the highest-demand zone in 2026. RAG systems are production reality. Interviewers will probe deeply here.

LLM Serving Engines

| # | Resource | URL | Type |
|---|---|---|---|
| 18 | vLLM Docs: high-throughput LLM serving with paged attention | https://docs.vllm.ai/en/stable/ | Docs |
| 19 | vLLM Quickstart | https://docs.vllm.ai/en/latest/getting_started/quickstart/ | Quickstart |
| 20 | Ollama: run open-source LLMs locally and on-cluster | https://ollama.com/ | Tool |

Vector Databases (Don't Skip: Interviewers Always Ask)

You need to understand: embedding pipelines, chunking strategy, vector indexing (HNSW, IVF), recall vs latency tradeoffs.

| # | Resource | URL | Type |
|---|---|---|---|
| 21 | Milvus Docs: enterprise-scale open-source vector DB, Kubernetes-native | https://milvus.io/docs | Docs |
| 22 | Weaviate Docs: hybrid search + knowledge graph capabilities | https://weaviate.io/developers/weaviate | Docs |
| 23 | Pinecone Docs: managed vector DB, easiest to start with | https://docs.pinecone.io/home | Docs |
| 24 | Qdrant Docs: performance-focused OSS vector DB, good for self-hosting | https://qdrant.tech/documentation/ | Docs |
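The recall vs latency tradeoff mentioned above is easiest to see with a toy IVF (inverted file) index: vectors are grouped around centroids, and a query scans only the `n_probe` nearest groups. The sketch below is a simplification under stated assumptions: it picks random vectors as centroids instead of running k-means, and the names `ToyIVF`, `brute_force`, and `recall` are invented for this example. HNSW, which is graph-based, is not shown.

```python
import math
import random


def brute_force(vectors, query, k):
    """Exact nearest neighbours: scan every vector (best recall, highest cost)."""
    return sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))[:k]


class ToyIVF:
    """Toy inverted-file index: assign each vector to its nearest centroid."""

    def __init__(self, vectors, n_lists, seed=0):
        rng = random.Random(seed)
        self.vectors = vectors
        # Assumption: random vectors as centroids; real IVF indexes train them with k-means.
        self.centroids = rng.sample(vectors, n_lists)
        self.lists = [[] for _ in range(n_lists)]
        for i, v in enumerate(vectors):
            nearest = min(range(n_lists), key=lambda c: math.dist(v, self.centroids[c]))
            self.lists[nearest].append(i)

    def search(self, query, k, n_probe):
        """Scan only the n_probe closest lists; return (top-k ids, vectors scanned)."""
        order = sorted(range(len(self.centroids)),
                       key=lambda c: math.dist(query, self.centroids[c]))
        candidates = [i for c in order[:n_probe] for i in self.lists[c]]
        top = sorted(candidates, key=lambda i: math.dist(self.vectors[i], query))[:k]
        return top, len(candidates)


def recall(found, exact):
    """Fraction of the exact top-k that the approximate search returned."""
    return len(set(found) & set(exact)) / len(exact)
```

Raising `n_probe` scans more vectors, which costs latency but moves recall toward 1.0; probing every list reproduces the brute-force result.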

GPU Kubernetes Operations

| # | Resource | URL | Type |
|---|---|---|---|
| 25 | NVIDIA GPU Operator: GPU node management on Kubernetes | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html | Docs |
| 26 | KEDA Docs: event-driven autoscaling on GPU utilization metrics | https://keda.sh/docs/latest/ | Docs |
| 27 | KEDA GitHub | https://github.com/kedacore/keda | GitHub |

Model Quantization (Conceptual Depth)

| # | Resource | URL | Type |
|---|---|---|---|
| 28 | HuggingFace Quantization Guide: GGUF, AWQ, GPTQ explained | https://huggingface.co/docs/transformers/main/en/quantization/overview | Docs |
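For conceptual depth, it helps to see the simplest form of quantization before reading about the formats in the guide. The sketch below shows symmetric "absmax" int8 quantization of a list of weights with a single scale factor. It is deliberately not GGUF, AWQ, or GPTQ, which are considerably more sophisticated; the function names are assumptions for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (absmax)."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; the rounding error is at most about scale / 2."""
    return [q * scale for q in quantized]
```

The storage saving comes from keeping one byte per weight plus one scale; the price is the rounding error, which grows with the largest absolute weight in the group.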

🔨 Month 3 Project

Deploy a self-hosted LLM (Llama/Mistral) + a vector store (Milvus) + autoscaling via KEDA + Go API with request queuing.
This is the foundation of your capstone RAG platform.
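When autoscaling on queue depth, the replica count follows the Kubernetes Horizontal Pod Autoscaler arithmetic that KEDA feeds: with an average-value target, the desired replica count is roughly the queue depth divided by the per-replica target, rounded up and clamped. The sketch below shows only that arithmetic; it ignores the HPA tolerance band, stabilization windows, and KEDA's separate scale-to-zero activation, and the function and parameter names are assumptions.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Replicas needed so each one handles about `target_per_replica` queued requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 45 queued requests with a target of 10 per replica yields 5 replicas, while an empty queue falls back to `min_replicas`.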


📅 MONTH 4: LLMOps, Distributed Training + Platform Architecture

LLMOps & Prompt Infrastructure (2026 Hiring Signal)

Prompt versioning, evaluation frameworks, and guardrails are now standard interview topics.

| # | Resource | URL | Type |
|---|---|---|---|
| 29 | LangChain Docs: build LLM chains, agents, and RAG pipelines | https://python.langchain.com/docs/introduction/ | Docs |
| 30 | LangSmith Docs: prompt versioning, tracing, and evaluation | https://docs.smith.langchain.com/ | Docs |
| 31 | LlamaIndex Docs: data framework for LLM retrieval (RAG-focused) | https://docs.llamaindex.ai/en/stable/ | Docs |
| 32 | Weights & Biases: LLM Evals | https://wandb.ai/site/solutions/llm | Docs |
| 33 | Guardrails AI: output validation and guardrails for LLMs | https://www.guardrailsai.com/docs | Docs |

Distributed Training (Senior-Level Requirement)

You're inference-heavy so far. Add this to cover senior platform roles that own the full training → serving lifecycle.

| # | Resource | URL | Type |
|---|---|---|---|
| 34 | PyTorch Distributed Training Overview: data, tensor, pipeline parallelism explained | https://pytorch.org/tutorials/beginner/dist_overview.html | Docs |
| 35 | Ray Train Docs: distributed ML training (pairs with Ray Serve you already know) | https://docs.ray.io/en/latest/train/train.html | Docs |
| 36 | Ray Train + PyTorch Quickstart | https://docs.ray.io/en/latest/train/getting-started-pytorch.html | Quickstart |
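Data parallelism is the easiest of the three schemes to reason about: every worker holds a full copy of the model, computes a gradient on its own shard of the batch, and the gradients are averaged (an all-reduce) before the update. Tensor and pipeline parallelism instead split the model itself across devices. The sketch below simulates the data-parallel case for a one-parameter model `y = w * x`; it is a toy with no real communication, and all names are assumptions for this example.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)


def data_parallel_grad(w, batch, n_workers):
    """Simulate data parallelism: shard the batch, compute local grads, average them."""
    if len(batch) % n_workers != 0:
        raise ValueError("this sketch assumes equally sized shards")
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    local_grads = [grad_mse(w, shard) for shard in shards]  # one per simulated worker
    return sum(local_grads) / n_workers                     # stands in for an all-reduce mean
```

With equally sized shards, the averaged gradient matches the single-device full-batch gradient, which is why data parallelism can scale throughput without changing the optimization result.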

AI Platform Architecture (Engineering Blog Reading)

| # | Resource | URL | Type |
|---|---|---|---|
| 37 | Uber: Michelangelo ML Platform | https://www.uber.com/blog/michelangelo-machine-learning-platform/ | Blog |
| 38 | Airbnb: ML Platform Architecture | https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d | Blog |
| 39 | LinkedIn: Scaling ML Productivity | https://engineering.linkedin.com/blog/2019/01/scaling-machine-learning-productivity-at-linkedin | Blog |
| 40 | Netflix: Metaflow Open Source | https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9 | Blog |

Kubernetes Operator Development for ML

| # | Resource | URL | Type |
|---|---|---|---|
| 41 | Kubebuilder Book: build operators (review ML-specific patterns) | https://book.kubebuilder.io/ | Book (Free) |
| 42 | Operator SDK Docs | https://sdk.operatorframework.io/docs/ | Docs |

GPU Cost Optimization

| # | Resource | URL | Type |
|---|---|---|---|
| 43 | AWS EC2 Spot Instances for ML | https://aws.amazon.com/blogs/machine-learning/run-machine-learning-workloads-with-amazon-ec2-spot-instances-and-amazon-ec2-auto-scaling/ | Blog |
| 44 | GCP Spot VMs for AI Workloads | https://cloud.google.com/compute/docs/instances/spot | Docs |
| 45 | NVIDIA MIG (Multi-Instance GPU) | https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ | Docs |

🔨 Month 4 Project

Build a Golang Kubernetes Operator for ModelServer that manages the model lifecycle (versioning, canary rollouts, shadow mode).
Add LangSmith-based prompt evaluation to your RAG pipeline from Month 3.
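The canary part of such an operator comes down to a small decision made on each reconcile: shift more traffic to the new model version while it stays healthy, and roll back when it does not. The sketch below shows that decision in Python for brevity (the operator itself would be Go). The function names, the 20% step, and the 1% error threshold are assumptions for illustration, and shadow mode is not covered.

```python
def next_canary_weight(current_weight: int, canary_error_rate: float,
                       max_error_rate: float = 0.01, step: int = 20) -> int:
    """Return the canary's next traffic percentage: step up if healthy, else roll back to 0."""
    if not 0 <= current_weight <= 100:
        raise ValueError("current_weight must be a percentage between 0 and 100")
    if canary_error_rate > max_error_rate:
        return 0  # unhealthy canary: send all traffic back to the stable version
    return min(100, current_weight + step)


def split_traffic(canary_weight: int) -> dict:
    """Translate the canary weight into per-version traffic percentages."""
    return {"stable": 100 - canary_weight, "canary": canary_weight}
```

In a real operator the output of `split_traffic` would be written into whatever routing resource the cluster uses, and the error rate would come from the metrics stack rather than a function argument.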


📅 MONTH 5: Cloud AI Services + ML Observability + Security & Governance

LLM-Specific Observability (Senior-Level Signals)

Prometheus + Grafana alone is insufficient for AI roles. Interviewers will ask about: token throughput, latency percentiles, hallucination rate tracking, embedding drift.

| # | Resource | URL | Type |
|---|---|---|---|
| 46 | Arize AI: LLM observability with embedding drift, hallucination monitoring | https://docs.arize.com/arize | Docs |
| 47 | WhyLabs Docs: ML monitoring for data drift and model performance | https://docs.whylabs.ai/docs/ | Docs |
| 48 | Evidently AI: open-source ML monitoring and drift detection | https://docs.evidentlyai.com/ | Docs |
| 49 | Prometheus Docs: infrastructure metrics (token throughput, GPU utilization) | https://prometheus.io/docs/introduction/overview/ | Docs |
| 50 | Grafana Docs: dashboarding for ML-specific metrics | https://grafana.com/docs/grafana/latest/ | Docs |
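Three of the signals named above (latency percentiles, token throughput, embedding drift) reduce to small calculations, sketched below. These are illustrative definitions rather than what any listed tool computes: the percentile uses the nearest-rank method, drift is taken as the cosine distance between the mean embeddings of a baseline and a current window, and the request record fields (`start`, `end`, `tokens`) are assumptions for this example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def token_throughput(requests):
    """Generated tokens per second over the wall-clock window spanned by the requests."""
    window = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    return sum(r["tokens"] for r in requests) / window  # assumes a non-zero window


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # assumes neither vector is all zeros


def embedding_drift(baseline, current):
    """Drift score: 0 when the mean embeddings point the same way, larger as they diverge."""
    return cosine_distance(centroid(baseline), centroid(current))
```

Tail percentiles such as p95 and p99 matter more than the mean for LLM serving, because a few long generations can dominate the user-visible latency.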

ML Security & Governance (Big Gap Interviewers Now Probe)

Model supply chain security, PII handling, data lineage, and RBAC for model access: platform engineers who know these stand out.

| # | Resource | URL | Type |
|---|---|---|---|
| 51 | ML Security OWASP Top 10 for LLMs: the canonical security reference for LLM apps | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Docs |
| 52 | Sigstore Model Signing: supply chain security for ML artifacts | https://www.sigstore.dev/ | Docs |
| 53 | OPA (Open Policy Agent): RBAC and policy enforcement for model access | https://www.openpolicyagent.org/docs/latest/ | Docs |
| 54 | OpenLineage: data lineage standard for ML pipelines | https://openlineage.io/ | Docs |
| 55 | Marquez: metadata and lineage service built on OpenLineage | https://marquezproject.ai/ | Docs |
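RBAC for model access can be prototyped before any policy engine is involved. The sketch below is a deny-by-default check in plain Python, not OPA or Rego; the roles, actions, and the `(user, tenant)` binding shape are assumptions chosen for illustration.

```python
# Hypothetical role definitions: which actions each role may perform on a model endpoint.
ROLE_PERMISSIONS = {
    "viewer": {"predict"},
    "owner": {"predict", "deploy", "delete"},
}


def is_allowed(bindings: dict, user: str, tenant: str, action: str) -> bool:
    """Deny by default: allow only if the user holds a role in this tenant that grants the action."""
    role = bindings.get((user, tenant))
    if role is None:
        return False
    return action in ROLE_PERMISSIONS.get(role, set())
```

Keying bindings by tenant is what keeps one tenant's owner from acting on another tenant's models, which is the core requirement of a multi-tenant serving platform.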

Cloud AI Services

Google Cloud (Primary)

| # | Resource | URL | Type |
|---|---|---|---|
| 56 | Vertex AI Documentation | https://cloud.google.com/vertex-ai/docs | Docs |
| 57 | GCP ML Engineer Learning Path | https://www.skills.google/paths/17 | Course (Free) |
| 58 | Preparing for GCP ML Eng Cert (Coursera) | https://www.coursera.org/professional-certificates/preparing-for-google-cloud-machine-learning-engineer-professional-certificate | Course |

AWS (Secondary)

| # | Resource | URL | Type |
|---|---|---|---|
| 59 | Amazon SageMaker Developer Guide | https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html | Docs |
| 60 | Amazon Bedrock Docs | https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html | Docs |

Azure (Secondary)

| # | Resource | URL | Type |
|---|---|---|---|
| 61 | Azure Machine Learning Docs | https://learn.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-machine-learning | Docs |

🎯 Certification Strategy

Senior roles care more about OSS contributions and technical blog depth than certs. Treat the cert as a bonus, not the goal.

| # | Resource | URL | Notes |
|---|---|---|---|
| 62 | Google Professional ML Engineer (primary, most respected) | https://cloud.google.com/learn/certification/machine-learning-engineer | $200 |
| 63 | Exam Guide PDF | https://cloud.google.com/learn/certification/guides/machine-learning-engineer | Free |
| 64 | AWS ML Specialty (optional secondary) | https://aws.amazon.com/certification/certified-machine-learning-specialty/ | $300 |

📅 MONTH 6: Capstone Project + Portfolio + Job Targeting

🏆 Capstone: Production-Grade RAG Platform on Kubernetes

One cohesive system beats four disconnected demos. This single project will outperform most ML engineer portfolios.

System components to build:

| Component | Technology |
|---|---|
| Data ingestion pipeline | Kafka + Iceberg/Delta Lake |
| Embedding generation service | HuggingFace models + vLLM |
| Vector database | Milvus (Kubernetes-native) |
| LLM inference | vLLM with quantized Llama/Mistral |
| Autoscaling | KEDA on GPU utilization + queue depth |
| Model lifecycle management | Your custom Golang Kubernetes Operator |
| Prompt evaluation & tracing | LangSmith or W&B |
| Observability dashboards | Prometheus + Grafana + Arize AI (embedding drift) |
| Canary model rollout | Istio traffic splitting via your Operator |
| Security & access control | OPA policies for model endpoint RBAC |
| Data lineage | OpenLineage integration |
| Cost monitoring | GPU utilization reports + spot instance optimization |

Job Targeting Resources

| # | Resource | URL | Type |
|---|---|---|---|
| 65 | ML Engineer jobs on LinkedIn | https://www.linkedin.com/jobs/search/?keywords=ML+Platform+Engineer | Job Board |
| 66 | ai-jobs.net: AI-specific job board | https://ai-jobs.net/ | Job Board |
| 67 | Levels.fyi: ML Engineer salaries | https://www.levels.fyi/t/machine-learning-engineer | Salary Data |

🗂️ BONUS: Reference Resources (Use Throughout All 6 Months)

| Resource | URL | Why |
|---|---|---|
| HuggingFace: model hub, datasets, spaces | https://huggingface.co/ | Central hub for open-source models |
| Papers With Code: ML research + reproducible code | https://paperswithcode.com/ | Stay current on research |
| The Batch (Andrew Ng newsletter) | https://www.deeplearning.ai/the-batch/ | Weekly AI news digest |
| MLOps Community | https://mlops.community/ | Networking with practitioners |
| CNCF AI/ML Working Group | https://github.com/cncf/tag-runtime/blob/main/wg/artificial-intelligence.md | Kubernetes + ML community |
| Golang Kubernetes client-go | https://github.com/kubernetes/client-go | Operator development base |
| Apache Arrow Docs | https://arrow.apache.org/docs/ | Columnar data format fundamentals |

✅ Senior-Ready Portfolio Checklist (Month 6)

Projects

  • Capstone RAG Platform: full production stack (see Month 6 table above)
  • Golang Kubernetes Operator for ModelServer (versioning, canary, shadow mode)
  • End-to-end ML pipeline: Kafka → Iceberg → training → model registry → serving
  • Distributed training demo: PyTorch + Ray Train on a multi-GPU Kubernetes node pool

Content

  • 3+ technical blog posts framing infra skills through the ML lens
  • 1 OSS contribution to KServe, Kubeflow, or Milvus (beats any cert)
  • LinkedIn posts on GPU autoscaling, LLM serving, or MLOps; much of the existing content in this space is low quality, so careful posts stand out

Interview Readiness

  • Can explain RAG architecture end-to-end (chunking → embedding → retrieval → generation)
  • Can articulate data parallelism vs tensor parallelism vs pipeline parallelism
  • Can discuss HNSW vs IVF vector indexing tradeoffs
  • Can explain ML supply chain security threats (model poisoning, prompt injection)
  • Can design RBAC for a multi-tenant model serving platform
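The first item in the list above (chunking → embedding → retrieval → generation) can be rehearsed with a toy pipeline. The sketch below is a teaching aid under loud assumptions: the "embedding" is a bag-of-words count vector rather than a neural model, retrieval is a linear cosine-similarity scan rather than a vector database, and generation is stubbed out as prompt construction. All function names are invented for this example.

```python
import math
import re
from collections import Counter


def chunk(text, size=6, overlap=0):
    """Split text into windows of `size` words, with `overlap` words shared between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text):
    """Toy embedding: term counts over lowercase alphanumeric tokens (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks, question, k=1):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    return sorted(chunks, key=lambda c: cosine(embed(c), query), reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Generation is stubbed: this only assembles the prompt an LLM would receive."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Each function maps to one stage an interviewer is likely to probe: chunk size and overlap, the embedding model, the index behind `retrieve`, and how retrieved context is placed in the prompt.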

🗺️ Revised Learning Flow (Mirrors Real ML System Evolution)

Month 1: ML Fundamentals + Data Systems (Kafka, Iceberg, Delta Lake)
         ↓
Month 2: MLOps (KServe, Kubeflow, Feast, MLflow)
         ↓
Month 3: LLM Infra (vLLM, Vector DBs, GPU ops)
         ↓
Month 4: LLMOps + Distributed Training + Platform Architecture
         ↓
Month 5: Observability + Security + Cloud AI Services
         ↓
Month 6: Capstone Project + Portfolio + Job Search

Your unfair advantage: Most ML engineers are learning Kubernetes. You already own it. Add the ML domain layer on top and you become the rarest hire in the market: a platform engineer who also speaks ML.
