@yeahdongcn · Last active March 12, 2026
🍎 Running SGLang Natively on macOS: LLMs and Diffusion Models on Apple Silicon

Why macOS Matters for Local AI

Apple Silicon machines have quietly become some of the most interesting systems for local AI workloads:

  • Powerful GPUs
  • Large unified memory (up to 192GB)
  • High memory bandwidth
  • Metal compute acceleration

Yet most modern inference frameworks still prioritize Linux + CUDA GPUs.

That leaves macOS developers in an awkward position: powerful hardware, but limited tooling.

Recently, I worked on enabling native macOS execution for SGLang, and the result is something quite exciting:

You can now run both LLMs and diffusion models natively on macOS using SGLang.

To my knowledge, this is the first time SGLang has supported both model types on Apple Silicon in the same runtime environment.


What is SGLang?

SGLang is a high-performance inference framework designed for serving large language models and multimodal models efficiently (see the SGLang documentation).

Key capabilities include:

  • Continuous batching
  • Prefix caching (RadixAttention)
  • Quantization support
  • OpenAI-compatible APIs
  • Multi-GPU and distributed serving
  • Support for language models, embeddings, reward models, and diffusion models

The framework is widely used for production inference pipelines and research experimentation.

Historically, SGLang primarily targeted:

  • NVIDIA GPUs
  • AMD GPUs
  • Moore Threads GPUs
  • Ascend NPUs
  • CPUs

But macOS support has been missing until now.


Enabling macOS Support

I recently contributed a pull request that allows SGLang to run on macOS using Apple Silicon hardware.

πŸ‘‰ PR: sgl-project/sglang#19549

The goal was simple:

Allow developers to experiment with SGLang directly on their Mac, without requiring a Linux GPU server.

This required addressing a number of issues, including:

  • device backend detection
  • runtime compatibility
  • fallback implementations for unsupported GPU features
  • ensuring diffusion pipelines also function correctly

The result is a working runtime capable of executing:

  • LLM inference
  • diffusion generation

directly on macOS.


Running SGLang on macOS

Below is a simplified setup flow.


1. Clone the repository

git clone https://github.com/sgl-project/sglang.git
cd sglang

2. Install dependencies

SGLang can be installed using pip or uv.

brew install ffmpeg uv

# Create and activate a virtual environment
uv venv -p 3.11 sglang-diffusion
source sglang-diffusion/bin/activate

# Install the Python packages
uv pip install --upgrade pip
rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
uv pip install -e "python[all_mps]"

Using uv significantly speeds up dependency installation.


3. Launch an LLM server

Example:

uv run python -m sglang.launch_server \
  --model-path Qwen/Qwen3-0.6B --trust-remote-code \
  --disable-radix-cache --disable-cuda-graph --tp-size 1 \
  --host 0.0.0.0 --port 43436

Once the server is running, you can send requests through the OpenAI-compatible API.
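As a quick sanity check, here is a minimal sketch of such a request using only the Python standard library. It assumes the launch command above, so the model name and port (`Qwen/Qwen3-0.6B`, `43436`) are taken from that example; the actual network call is left commented out so the snippet can be read without a server running.

```python
# Build a request for SGLang's OpenAI-compatible chat completions endpoint.
# Model name and port match the launch_server example above.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:43436/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the official `openai` Python client can also be pointed at `http://localhost:43436/v1` instead of hand-rolling requests.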


4. Run a Diffusion Model

SGLang also supports image generation pipelines.

Example:

sglang generate \
  --model-path black-forest-labs/FLUX.1-dev \
  --prompt "A logo With Bold Large text: SGL Diffusion"



Monitoring GPU Usage on macOS

One problem when working with AI workloads on macOS is the lack of simple GPU monitoring tools.

On Linux we have:

nvidia-smi

But on Apple Silicon there is no direct equivalent.

To make debugging easier, I built a small tool:

👉 apple-smi: https://github.com/yeahdongcn/apple-smi

apple-smi is a lightweight CLI tool that provides GPU monitoring for Apple Silicon, similar to nvidia-smi.

It allows developers to inspect:

  • Metal GPU utilization
  • GPU memory usage
  • real-time inference load

Example usage:

> pip install apple-smi
> apple-smi

Example output:

> apple-smi
Thu Mar 05 10:07:03 2026
+-----------------------------------------------------------------------------------------+
| APPLE-SMI 0.1.2              macOS Version: 26.3 (25D125)              Metal Version: 4 |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                               |                 Disp.A |                      |
|      Temp                 Pwr:Usage/Cap |           Memory-Usage |             GPU-Util |
|=========================================+========================+======================|
|   0  Apple M1 (8-Core GPU)              |                     On |                      |
|       30C                    9.1W / 20W |    14181MiB / 16384MiB |                  45% |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
| GPU         PID  Type  Process name                                          GPU Memory |
|                                                                                   Usage |
|=========================================================================================|
|   0       58547   C    sglang::scheduler                                         248MiB |
|   0       58548   C    sglang::detokenizer                                       195MiB |
|   0       58408   C    python3                                                    39MiB |
+-----------------------------------------------------------------------------------------+

Why macOS Is Interesting for LLM Inference

Apple Silicon machines are unique because they use a unified memory architecture.

Instead of having separate CPU and GPU memory pools, both processors share the same memory space.

This enables scenarios like:

  • running large models locally
  • avoiding memory copies
  • experimenting with multi-stage pipelines

On machines such as:

  • Mac Studio
  • Mac Pro
  • high-end MacBook Pro

you can have 64GB–192GB unified memory, which is extremely useful for model experimentation.
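A back-of-envelope calculation shows why that memory range matters. At fp16, model weights alone take roughly 2 bytes per parameter (this sketch ignores KV cache, activations, and framework overhead, so real usage is higher):

```python
# Rough weight memory for common model sizes at fp16 (2 bytes per parameter).
# KV cache, activations, and runtime overhead are not included.
def weights_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

for n in (7, 13, 70):
    print(f"{n}B params @ fp16 ≈ {weights_gb(n):.0f} GB")
# 7B params @ fp16 ≈ 14 GB
# 13B params @ fp16 ≈ 26 GB
# 70B params @ fp16 ≈ 140 GB
```

A 70B model at fp16 needs about 140 GB for weights, which fits in 192GB of unified memory but not on any single consumer discrete GPU.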


Call for Contributors (MLX / Metal Developers)

This macOS backend is still at a very early stage, and there are many opportunities for improvement.

Areas where contributors could help:

  • MLX kernel optimizations
  • Metal compute kernels
  • attention implementations
  • memory scheduling improvements
  • batching optimization

If you are working on:

  • Apple MLX
  • Metal GPU programming
  • PyTorch MPS
  • macOS inference frameworks

your contributions could significantly accelerate macOS inference performance.


Help Test Large Models

If you have access to large-memory Apple Silicon machines, testing would be extremely valuable.

For example:

  • Mac Studio
  • Mac Pro
  • M-series Max / Ultra chips

Please try running different models and report:

  • which models run successfully
  • performance observations
  • memory usage patterns
  • bugs or runtime issues

Feedback from real hardware helps improve macOS support significantly.


The Bigger Picture

The local AI ecosystem is evolving rapidly.

A few years ago, serious AI workloads required:

Linux + NVIDIA GPUs

Today, things are changing.

Apple Silicon machines are powerful enough to run meaningful workloads locally.

With frameworks like SGLang, it becomes possible to build full AI pipelines locally on macOS:

Prompt → LLM → Image Prompt → Diffusion → Result

All running on your own machine.


Final Thoughts

This macOS support is only the beginning.

There is huge potential for improving performance and usability on Apple Silicon.

If you're interested in:

  • local AI development
  • MLX or Metal optimization
  • inference frameworks
  • running models locally

please consider contributing.

Together we can make macOS a serious platform for open-source AI inference.


Links

  • SGLang: https://github.com/sgl-project/sglang
  • macOS support PR: sgl-project/sglang#19549
  • apple-smi: https://github.com/yeahdongcn/apple-smi
