SGLang already runs on macOS via PyTorch's MPS (Metal Performance Shaders) backend — you can launch a server, send requests, and get responses. But performance on Apple Silicon has been underwhelming. In this post, we describe how we integrated a native MLX execution path into SGLang that delivers up to 5.3× higher throughput while using significantly less memory.
When SGLang runs on macOS with PyTorch MPS, every operation — matrix multiplications, attention, normalization — goes through PyTorch's MPS backend, which translates PyTorch ops into Metal Performance Shaders. This translation layer adds substantial overhead:
- Op dispatch overhead: Each PyTorch operation is individually dispatched to MPS, missing optimization opportunities that come from fusing operations together.
- Memory duplication: PyTorch loads model weights into MPS memory and allocates a large KV cache, leaving less room for actual inference workloads.
- No fused kernels: Operations that could be combined into a single Metal kernel (for example, the elementwise and reduction steps of RMSNorm) instead run as separate kernel launches, each paying its own dispatch and memory-traffic cost.
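To make the fusion point concrete, here is a sketch in plain NumPy (standing in for kernel-level code; the function names and shapes are illustrative, not SGLang's actual implementation). In the unfused version, each line corresponds to a separately dispatched op with its own temporary buffer, which is roughly how an eager MPS backend executes it; a fused kernel computes the same result in a single pass.

```python
import numpy as np

def rmsnorm_unfused(x, w, eps=1e-6):
    # Each step below maps to a separate op dispatch on an eager
    # backend, materializing an intermediate buffer every time.
    sq = x * x                                  # dispatch 1: elementwise square
    mean = sq.mean(axis=-1, keepdims=True)      # dispatch 2: reduction
    inv = 1.0 / np.sqrt(mean + eps)             # dispatches 3-4: add, rsqrt
    return x * inv * w                          # dispatches 5-6: scale, weight

def rmsnorm_fused(x, w, eps=1e-6):
    # A fused kernel evaluates the whole expression in one pass over x,
    # with no intermediate tensors written back to memory.
    return x * (1.0 / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)) * w

# Both produce identical results; the difference is in how many
# kernel launches and temporaries the backend pays for.
x = np.random.randn(2, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
assert np.allclose(rmsnorm_unfused(x, w), rmsnorm_fused(x, w), atol=1e-5)
```

On a GPU backend, each of those six dispatches also carries launch latency and a round trip through memory, which is why fusing them into one kernel matters far more there than this CPU sketch suggests.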