Apple Silicon machines have quietly become some of the most interesting systems for local AI workloads:
- Powerful GPUs
- Large unified memory (up to 192GB)
- High memory bandwidth
- Metal compute acceleration
Yet most modern inference frameworks still prioritize Linux + CUDA GPUs.
That leaves macOS developers in an awkward position: powerful hardware, but limited tooling.
Recently, I worked on enabling native macOS execution for SGLang, and the result is something quite exciting:
You can now run both LLMs and diffusion models natively on macOS using SGLang.
To my knowledge, this is the first time SGLang supports both model types on Apple Silicon in the same runtime environment.
SGLang is a high-performance inference framework designed for serving large language models and multimodal models efficiently.
Key capabilities include:
- Continuous batching
- Prefix caching (RadixAttention)
- Quantization support
- OpenAI-compatible APIs
- Multi-GPU and distributed serving
- Support for language models, embeddings, reward models, and diffusion models
The framework is widely used for production inference pipelines and research experimentation.
Historically, SGLang primarily targeted:
- NVIDIA GPUs
- AMD GPUs
- Moore Threads GPUs
- Ascend NPUs
- CPUs
But until now, macOS support has been missing.
I recently contributed a pull request that allows SGLang to run on macOS using Apple Silicon hardware.
PR: sgl-project/sglang#19549
The goal was simple:
Allow developers to experiment with SGLang directly on their Mac, without requiring a Linux GPU server.
This required addressing a number of issues, including:
- device backend detection
- runtime compatibility
- fallback implementations for unsupported GPU features
- ensuring diffusion pipelines also function correctly
The result is a working runtime capable of executing both LLM inference and diffusion generation directly on macOS.
Below is a simplified setup flow.
git clone https://github.com/sgl-project/sglang.git
cd sglang

SGLang can be installed using pip or uv.
brew install ffmpeg uv
# Create and activate a virtual environment
uv venv -p 3.11 sglang-diffusion
source sglang-diffusion/bin/activate
# Install the Python packages
uv pip install --upgrade pip
rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
uv pip install -e "python[all_mps]"

Using uv significantly speeds up dependency installation.
Example:
uv run python -m sglang.launch_server \
--model-path Qwen/Qwen3-0.6B --trust-remote-code \
--disable-radix-cache --disable-cuda-graph --tp-size 1 \
--host 0.0.0.0 --port 43436

Once the server is running, you can send requests through the OpenAI-compatible API.
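As a minimal sketch, here is a client using only the Python standard library. It assumes the host and port from the launch command above, and that the server speaks the standard OpenAI chat completions format (which SGLang's OpenAI-compatible API provides):

```python
import json
import urllib.request

# Endpoint matching the launch command above (port 43436).
BASE_URL = "http://localhost:43436"

def build_payload(prompt: str, model: str = "Qwen/Qwen3-0.6B") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the server launched above to be running:
# print(chat("Say hello from Apple Silicon"))
```

Any OpenAI-compatible client library works the same way; only the base URL changes.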
SGLang also supports image generation pipelines.
Example:
sglang generate \
--model-path black-forest-labs/FLUX.1-dev \
--prompt "A logo With Bold Large text: SGL Diffusion"One problem when working with AI workloads on macOS is lack of simple GPU monitoring tools.
On Linux we have:
nvidia-smi
But on Apple Silicon there is no direct equivalent.
To make debugging easier, I built a small tool:
apple-smi https://github.com/yeahdongcn/apple-smi
apple-smi is a lightweight CLI tool that provides GPU monitoring for Apple Silicon, similar to nvidia-smi.
It allows developers to inspect:
- Metal GPU utilization
- GPU memory usage
- real-time inference load
Example usage:
> pip install apple-smi
> apple-smi

Example output:
> apple-smi
Thu Mar 05 10:07:03 2026
+-----------------------------------------------------------------------------------------+
| APPLE-SMI 0.1.2 macOS Version: 26.3 (25D125) Metal Version: 4 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name | Disp.A | |
| Temp Pwr:Usage/Cap | Memory-Usage | GPU-Util |
|=========================================+========================+======================|
| 0 Apple M1 (8-Core GPU) | On | |
| 30C 9.1W / 20W | 14181MiB / 16384MiB | 45% |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU PID Type Process name GPU Memory |
| Usage |
|=========================================================================================|
| 0 58547 C sglang::scheduler 248MiB |
| 0 58548 C sglang::detokenizer 195MiB |
| 0 58408 C python3 39MiB |
+-----------------------------------------------------------------------------------------+

Apple Silicon machines are unique because they use a unified memory architecture.
Instead of having separate CPU and GPU memory pools, both processors share the same memory space.
This enables scenarios like:
- running large models locally
- avoiding memory copies
- experimenting with multi-stage pipelines
On machines such as:
- Mac Studio
- Mac Pro
- high-end MacBook Pro
you can have 64GB to 192GB of unified memory, which is extremely useful for model experimentation.
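As a quick sketch of what this looks like in practice (assuming PyTorch with the MPS backend, which the `all_mps` extra above pulls in), you can place tensors on the Metal device and fall back to CPU when MPS is unavailable:

```python
import torch

# On Apple Silicon, CPU and GPU share the same physical memory, so a
# tensor on the "mps" device is not copied across a discrete GPU bus.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x  # executes on the Metal GPU when MPS is available
print(y.shape)
```

Because memory is shared, even multi-gigabyte model weights fit on the "GPU" as long as they fit in system RAM.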
This macOS backend is still at a very early stage, and there are many opportunities for improvement.
Areas where contributors could help:
- MLX kernel optimizations
- Metal compute kernels
- attention implementations
- memory scheduling improvements
- batching optimization
If you are working on:
- Apple MLX
- Metal GPU programming
- PyTorch MPS
- macOS inference frameworks
your contributions could significantly accelerate macOS inference performance.
If you have access to large-memory Apple Silicon machines, testing would be extremely valuable.
For example:
- Mac Studio
- Mac Pro
- M-series Max / Ultra chips
Please try running different models and report:
- which models run successfully
- performance observations
- memory usage patterns
- bugs or runtime issues
Feedback from real hardware helps improve macOS support significantly.
The local AI ecosystem is evolving rapidly.
A few years ago, serious AI workloads required:
Linux + NVIDIA GPUs
Today, things are changing.
Apple Silicon machines are powerful enough to run meaningful workloads locally.
With frameworks like SGLang, it becomes possible to build full AI pipelines locally on macOS:
Prompt → LLM → Image Prompt → Diffusion → Result
All running on your own machine.
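A rough sketch of such a pipeline, reusing the chat endpoint and the diffusion CLI shown earlier (the endpoint URL and model names are the ones from those examples; everything here assumes both are already set up):

```python
import json
import subprocess
import urllib.request

def llm(prompt: str) -> str:
    """Query the OpenAI-compatible endpoint from the launch example above."""
    req = urllib.request.Request(
        "http://localhost:43436/v1/chat/completions",
        data=json.dumps({
            "model": "Qwen/Qwen3-0.6B",
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Step 1: ask the LLM to write an image prompt.
# Step 2: hand that prompt to the diffusion CLI.
# Both calls require the server and models set up earlier:
# image_prompt = llm("Write a one-line prompt for a logo with bold 'SGL' text")
# subprocess.run(["sglang", "generate",
#                 "--model-path", "black-forest-labs/FLUX.1-dev",
#                 "--prompt", image_prompt])
```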
This macOS support is only the beginning.
There is huge potential for improving performance and usability on Apple Silicon.
If you're interested in:
- local AI development
- MLX or Metal optimization
- inference frameworks
- running models locally
please consider contributing.
Together we can make macOS a serious platform for open-source AI inference.
SGLang https://github.com/sgl-project/sglang
macOS Support PR sgl-project/sglang#19549
apple-smi https://github.com/yeahdongcn/apple-smi