@yeahdongcn · Last active March 12, 2026
🍎 Running SGLang Natively on macOS: LLMs and Diffusion Models on Apple Silicon

Why macOS Matters for Local AI

Apple Silicon machines have quietly become some of the most interesting systems for local AI workloads:

  • Powerful GPUs
  • Large unified memory (up to 192GB)
  • High memory bandwidth
  • Metal compute acceleration

Yet most modern inference frameworks still prioritize Linux + CUDA GPUs.

That leaves macOS developers in an awkward position: powerful hardware, but limited tooling.

Recently, I worked on enabling native macOS execution for SGLang, and the result is something quite exciting:

You can now run both LLMs and diffusion models natively on macOS using SGLang.

To my knowledge, this is the first time SGLang has supported both model types on Apple Silicon in the same runtime environment.


What is SGLang?

SGLang is a high-performance inference framework designed for serving large language models and multimodal models efficiently (see the SGLang documentation).

Key capabilities include:

  • Continuous batching
  • Prefix caching (RadixAttention)
  • Quantization support
  • OpenAI-compatible APIs
  • Multi-GPU and distributed serving
  • Support for language models, embeddings, reward models, and diffusion models

The framework is widely used for production inference pipelines and research experimentation.

Historically, SGLang primarily targeted:

  • NVIDIA GPUs
  • AMD GPUs
  • Moore Threads GPUs
  • Ascend NPUs
  • CPUs

But macOS support has been missing until now.


Enabling macOS Support

I recently contributed a pull request that allows SGLang to run on macOS using Apple Silicon hardware.

πŸ‘‰ PR: sgl-project/sglang#19549

The goal was simple:

Allow developers to experiment with SGLang directly on their Mac, without requiring a Linux GPU server.

This required addressing a number of issues, including:

  • device backend detection
  • runtime compatibility
  • fallback implementations for unsupported GPU features
  • ensuring diffusion pipelines also function correctly

The result is a working runtime capable of executing:

  • LLM inference
  • diffusion generation

directly on macOS.


Running SGLang on macOS

Below is a simplified setup flow.


1. Clone the repository

git clone https://github.com/sgl-project/sglang.git
cd sglang

2. Install dependencies

SGLang can be installed using pip or uv.

brew install ffmpeg uv

# Create and activate a virtual environment
uv venv -p 3.11 sglang-diffusion
source sglang-diffusion/bin/activate

# Install the Python packages
uv pip install --upgrade pip
rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
uv pip install -e "python[all_mps]"

Using uv significantly speeds up dependency installation.


3. Launch an LLM server

Example:

uv run python -m sglang.launch_server \
  --model-path Qwen/Qwen3-0.6B --trust-remote-code \
  --disable-radix-cache --disable-cuda-graph --tp-size 1 \
  --host 0.0.0.0 --port 43436

Once the server is running, you can send requests through the OpenAI-compatible API.
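As a quick sanity check, here is a minimal sketch of such a request using only the Python standard library. It assumes the launch command above, so the model name and port (`Qwen/Qwen3-0.6B`, `43436`) are taken from that example; the actual network call is left commented out so the snippet can be read without a server running.

```python
# Build a request for SGLang's OpenAI-compatible chat completions endpoint.
# Model name and port match the launch_server example above.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:43436/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the official `openai` Python client can also be pointed at `http://localhost:43436/v1` instead of hand-rolling requests.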


4. Run a Diffusion Model

SGLang also supports image generation pipelines.

Example:

sglang generate \
  --model-path black-forest-labs/FLUX.1-dev \
  --prompt "A logo With Bold Large text: SGL Diffusion"



Monitoring GPU Usage on macOS

One problem when working with AI workloads on macOS is the lack of simple GPU monitoring tools.

On Linux we have:

nvidia-smi

But on Apple Silicon there is no direct equivalent.

To make debugging easier, I built a small tool:

👉 apple-smi: https://github.com/yeahdongcn/apple-smi

apple-smi is a lightweight CLI tool that provides GPU monitoring for Apple Silicon, similar to nvidia-smi.

It allows developers to inspect:

  • Metal GPU utilization
  • GPU memory usage
  • real-time inference load

Example usage:

> pip install apple-smi
> apple-smi

Example output:

> apple-smi
Thu Mar 05 10:07:03 2026
+-----------------------------------------------------------------------------------------+
| APPLE-SMI 0.1.2              macOS Version: 26.3 (25D125)              Metal Version: 4 |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                               |                 Disp.A |                      |
|      Temp                 Pwr:Usage/Cap |           Memory-Usage |             GPU-Util |
|=========================================+========================+======================|
|   0  Apple M1 (8-Core GPU)              |                     On |                      |
|       30C                    9.1W / 20W |    14181MiB / 16384MiB |                  45% |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
| GPU         PID  Type  Process name                                          GPU Memory |
|                                                                                   Usage |
|=========================================================================================|
|   0       58547   C    sglang::scheduler                                         248MiB |
|   0       58548   C    sglang::detokenizer                                       195MiB |
|   0       58408   C    python3                                                    39MiB |
+-----------------------------------------------------------------------------------------+

Why macOS Is Interesting for LLM Inference

Apple Silicon machines are unique because they use a unified memory architecture.

Instead of having separate CPU and GPU memory pools, both processors share the same memory space.

This enables scenarios like:

  • running large models locally
  • avoiding memory copies
  • experimenting with multi-stage pipelines

On machines such as:

  • Mac Studio
  • Mac Pro
  • high-end MacBook Pro

you can have 64GB–192GB unified memory, which is extremely useful for model experimentation.
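A back-of-envelope calculation shows why that memory range matters. At fp16, model weights alone take roughly 2 bytes per parameter (this sketch ignores KV cache, activations, and framework overhead, so real usage is higher):

```python
# Rough weight memory for common model sizes at fp16 (2 bytes per parameter).
# KV cache, activations, and runtime overhead are not included.
def weights_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

for n in (7, 13, 70):
    print(f"{n}B params @ fp16 ≈ {weights_gb(n):.0f} GB")
# 7B params @ fp16 ≈ 14 GB
# 13B params @ fp16 ≈ 26 GB
# 70B params @ fp16 ≈ 140 GB
```

A 70B model at fp16 needs about 140 GB for weights, which fits in 192GB of unified memory but not on any single consumer discrete GPU.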


Call for Contributors (MLX / Metal Developers)

This macOS backend is still at a very early stage, and there are many opportunities for improvement.

Areas where contributors could help:

  • MLX kernel optimizations
  • Metal compute kernels
  • attention implementations
  • memory scheduling improvements
  • batching optimization

If you are working on:

  • Apple MLX
  • Metal GPU programming
  • PyTorch MPS
  • macOS inference frameworks

your contributions could significantly accelerate macOS inference performance.


Help Test Large Models

If you have access to large-memory Apple Silicon machines, testing would be extremely valuable.

For example:

  • Mac Studio
  • Mac Pro
  • M-series Max / Ultra chips

Please try running different models and report:

  • which models run successfully
  • performance observations
  • memory usage patterns
  • bugs or runtime issues

Feedback from real hardware helps improve macOS support significantly.


The Bigger Picture

The local AI ecosystem is evolving rapidly.

A few years ago, serious AI workloads required:

Linux + NVIDIA GPUs

Today, things are changing.

Apple Silicon machines are powerful enough to run meaningful workloads locally.

With frameworks like SGLang, it becomes possible to build full AI pipelines locally on macOS:

Prompt → LLM → Image Prompt → Diffusion → Result

All running on your own machine.


Final Thoughts

This macOS support is only the beginning.

There is huge potential for improving performance and usability on Apple Silicon.

If you're interested in:

  • local AI development
  • MLX or Metal optimization
  • inference frameworks
  • running models locally

please consider contributing.

Together we can make macOS a serious platform for open-source AI inference.


Links

  • SGLang: https://github.com/sgl-project/sglang
  • macOS support PR: sgl-project/sglang#19549
  • apple-smi: https://github.com/yeahdongcn/apple-smi
