Dec 15th, 2025
Why run an LLM locally, you might ask. We all know that LLMs are non-deterministic stochastic parrots, "blurry JPEGs of the Web" or whatnot that the thief overlords found lying around. Still, they are useful on the proper leash.
Possible reasons to prefer a local AI:
- nothing leaves our computer: no telemetry, no data for a third party,
- we can ~retrain it with local and personal data, which also benefits from the previous point,
- the training cutoff of the LLM does not limit us, since it can still search the Web,
- we can freeze a version we like, test and control it, and make it production-ready.
You can run reasonable local LLM models on your Apple Silicon, be it a MacBook Air, a Pro, a Studio or an iPhone. Unified memory is the limiting factor: the more you have, the better off you are. Sure, the number of cores matters, but first you have to be able to load the model; cores can only make inference faster after that.
Expected specifications for a prosumer setup are:
- MacBook Air 24 GB
- MacBook Pro 32 GB
- Mac Studio 32 GB
- iPhone 8 GB
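To check how much unified memory your Mac actually has, a one-liner works; this is a macOS-specific sketch (the `hw.memsize` sysctl key does not exist on Linux):

```shell
# Print the machine's unified memory in GB (hw.memsize reports bytes on macOS)
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'
```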
A lot of work is going into making models run on MLX and quantizing them to smaller sizes so they fit in local memory. On the other end of the spectrum, people run dozens of parallel jobs on a single, really big Mac, or even on small clusters of Macs.
Benchmarks mentioned below were run on a MacBook Pro M4 with 32 GB, to give you a rule of thumb.
The novelty is this: a small Llama model will even start in your Chrome with WebGPU, but it's not smart.
Models that can run are limited, as current phones top out at 8 GB of memory; I had success with only 4 GB. The app also runs on iPad, where the Pros can have 16 GB of memory. The app will recommend models for the actual device.
Locally AI - Run AI models locally on your iPhone and iPad.
gpt-oss-20b is a reasonable generic model that eats 12 GB of RAM and is easy to try out. It gave me 31 tokens/sec. Feels like o3-mini, and has three reasoning efforts: low, medium and high.
$ brew install llama.cpp
$ llama-server -hf ggml-org/gpt-oss-20b-GGUF \
--jinja -ub 2048 -b 2048 -ngl 99 -c 2048
$ open http://127.0.0.1:8080

This is still the wild west, so don't expect the tools to use models interchangeably; you might end up having the same model downloaded twice.
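Besides the browser UI, llama-server exposes an OpenAI-compatible HTTP API. A minimal sketch of querying it with curl (for a single-model server the `model` field is mostly cosmetic, and the exact response fields can vary between llama.cpp versions):

```shell
# Ask the local llama-server a question via its OpenAI-compatible chat endpoint
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 64
      }'
```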
I use all three now, .gguf seems to be the de facto format for models for now.
| Tool | Default Model Location | Model File Formats | Notes |
|---|---|---|---|
| Ollama | ~/.ollama/models | .blob .manifest | Stores models in its own container format, blobs not directly usable by llama.cpp or LM Studio without conversion. |
| llama.cpp | ~/Library/Caches/llama.cpp or user-specified path | .bin .gguf | Uses raw model weights (older: .bin, newer: .gguf). You can point it to custom paths with CLI flags. |
| LM Studio | ~/.lmstudio/models/ | .gguf | LM Studio downloads GGUF-format models directly compatible with llama.cpp. |
| Draw Things | ~/Library/Containers/com.liuliu.draw-things/Data/Documents/Models | .ckpt | Can download models. |
Look at them as tools in a toolbox. You will not be able to solve all your problems with only one; for different modalities or functionalities you might pull out another one, especially on a local setup.
Models live on Hugging Face.
Generally, try to fly with LM Studio: it manages models, has a UI, and lets you set parameters. If you want to search the Web easily, you need Ollama.
LM Studio shows you how much of the context window is being used, so you may find it useful to ask it to summarize the conversation so far when the context window gets close to full. This way you help it keep the important information it would otherwise forget.
This list does not try to be comprehensive, just good enough:
- generic text-to-text — like a plain vanilla ChatGPT, you ask a question, set a task and the answer will be prose
gpt-oss-20b
- image-to-text — you want a (structured) visual understanding, descriptions, identification, transcription,
- audio-to-text — you want a (structured) auditory understanding, summarization, transcription,
- text-to-image — you want to turn your description to an image, a photo
qwen_image_edit-q4_k_s needs 32 GB. Install https://github.com/ivanfioravanti/qwen-image-mps and run
uv run python qwen-image-mps.py generate -p 'A vintage coffee shop full of racoons, in a neon cyberpunk city' -s 10 --quantization Q4_K_S
It took 18 minutes for me; adding -f brings it down to 8 minutes.
It stores models in ~/.cache/huggingface/hub/
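With several tools each keeping their own cache, disk fills up fast. A quick sketch for checking how much each cache takes, using the default paths from the table above (directories that don't exist on your machine are silently skipped):

```shell
# Rough disk usage of each tool's model cache; adjust paths if you
# configured custom locations
du -sh ~/.cache/huggingface/hub ~/.ollama/models ~/.lmstudio/models 2>/dev/null
```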
- text-edit-to-image — you want to change an image based on a textual description, would run with the previous setup but needs more memory,
- coding — the only major domain-specific modality, thanks to the enormous training data and the peculiar logic (run-and-test loops)
qwen3-coder-30b, 40 tokens/sec.
Again, not comprehensive, just a good enough list of what a local model can do:
- run limited inference locally — on a lossy compression of the training corpus,
- run searches on the Web — this is currently easiest with Ollama Web Search (you need to get an API key; they promise not to persist search data, more details on Hacker News; no SLA I've seen yet; example uses qwen3-4b),
- have memory — even a crude one, by expanding the context,
- be ~trainable locally — not fast, but at a reasonable speed.
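For the Web search bullet above, a sketch of calling Ollama Web Search directly with curl. The endpoint and payload shape follow Ollama's announcement and may change, so treat them as assumptions and check the current docs; `OLLAMA_API_KEY` is a placeholder for your own key:

```shell
# Query Ollama's hosted web search (requires an API key from ollama.com)
export OLLAMA_API_KEY="your-key-here"   # placeholder, not a real key
curl -s https://ollama.com/api/web_search \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "llama.cpp latest release notes"}'
```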
MCP, the biggest security hole after npm, is the way to connect models. Use it with moderation, trust no one.
MLX LM is a Python package for generating text and fine-tuning large language models on Apple silicon with MLX. LM Studio stores its models in the ~/.lmstudio/models directory, which means you can use the mlx-lm Python library to run prompts through the same model like this:

uv run --isolated --with mlx-lm mlx_lm.generate \
  --model ~/.lmstudio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit \
  --prompt "Write an HTML and JavaScript page implementing space invaders" \
  -m 8192 --top-k 20 --top-p 0.8 --temp 0.7
Be aware that this will load a duplicate copy of the model into memory, so you may want to quit LM Studio before running this command! It ran at 17 tokens per second.