Dec 15th, 2025
Why run an LLM locally, you might ask. We all know that LLMs are non-deterministic stochastic parrots, "blurry JPEGs of the Web" or whatnot that the thief overlords found lying around. Still, they are useful on the proper leash.
Possible reasons to prefer a local AI:
- nothing leaves our computer: no telemetry, no data for a third party,
- we can ~retrain it with local and personal data, which also benefits from the previous point,
- the training cutoff of the LLM does not limit us, since it can still search the Web,
- we can freeze a version we like, test and control it, and make it production-ready.
You can run reasonable local LLM models on your Apple Silicon, be it a MacBook Air, a Pro, a Studio or an iPhone. Unified memory is the limiting factor: the more you have, the better off you are. Sure, the number of cores matters, but first you have to be able to load the model; cores can only make inference faster after that.
Expected specifications for a prosumer setup are:
- MacBook Air 24 GB
- MacBook Pro 32 GB
- Mac Studio 32 GB
- iPhone 8 GB
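To check how much unified memory your Mac actually has, a one-liner works; this is a macOS-specific sketch (the `hw.memsize` sysctl key does not exist on Linux):

```shell
# Print the machine's unified memory in GB (hw.memsize reports bytes on macOS)
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'
```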
A lot of work is going into making models run on MLX and quantizing them to smaller sizes so they fit in local memory. On the other end of the spectrum, people run dozens of parallel jobs on a single, really big Mac, or even on small clusters of Macs.
Benchmarks mentioned below were run on a MacBook Pro M4 with 32 GB, to give you a rule of thumb.
The novelty is this: a small Llama model will even start in your Chrome with WebGPU, but it's not smart.
Models that can run are limited, as current phones top out at 8 GB of memory; I had success with only 4 GB. The app also runs on iPad, where the Pros can have 16 GB of memory. The app will recommend models for the actual device.
Locally AI - Run AI models locally on your iPhone and iPad.
gpt-oss-20b is a reasonable generic model that eats 12 GB of RAM and is easy to try out. It gave me 31 tokens/sec. Feels like o3-mini, and has three reasoning efforts: low, medium and high.
$ brew install llama.cpp
$ llama-server -hf ggml-org/gpt-oss-20b-GGUF \
--jinja -ub 2048 -b 2048 -ngl 99 -c 2048
$ open http://127.0.0.1:8080

This is still the wild west, so don't expect the tools to use models interchangeably; you might end up having the same model downloaded twice.
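Besides the browser UI, llama-server exposes an OpenAI-compatible HTTP API. A minimal sketch of querying it with curl (for a single-model server the `model` field is mostly cosmetic, and the exact response fields can vary between llama.cpp versions):

```shell
# Ask the local llama-server a question via its OpenAI-compatible chat endpoint
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 64
      }'
```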
I use all three now, .gguf seems to be the de facto format for models for now.
| Tool | Default Model Location | Model File Formats | Notes |
|---|---|---|---|
| Ollama | ~/.ollama/models | .blob .manifest | Stores models in its own container format, blobs not directly usable by llama.cpp or LM Studio without conversion. |
| llama.cpp | ~/Library/Caches/llama.cpp or user-specified path | .bin .gguf | Uses raw model weights (older: .bin, newer: .gguf). You can point it to custom paths with CLI flags. |
| LM Studio | ~/.lmstudio/models/ | .gguf | LM Studio downloads GGUF-format models directly compatible with llama.cpp. |
| Draw Things | ~/Library/Containers/com.liuliu.draw-things/Data/Documents/Models | .ckpt | Can download models. |
Look at them as tools in a toolbox. You will not be able to solve all your problems with only one; for different modalities or functionalities you might pull out another one, especially on a local setup.
Models live on Hugging Face.
Generally, try to fly with LM Studio: it manages models, has a UI, and lets you set parameters. If you want to search the Web easily, you need Ollama.
LM Studio shows you how much of the context window is being used, so you may find it useful to ask it to summarize the conversation so far when the context window gets close to full. This way you help it keep the important information it would otherwise forget.
This list does not try to be comprehensive, just good enough:
- generic text-to-text — like a plain vanilla ChatGPT, you ask a question, set a task and the answer will be prose
gpt-oss-20b
- image-to-text — you want a (structured) visual understanding, descriptions, identification, transcription,
- audio-to-text — you want a (structured) auditory understanding, summarization, transcription,
- text-to-image — you want to turn your description to an image, a photo
qwen_image_edit-q4_k_s needs 32 GB. Install https://github.com/ivanfioravanti/qwen-image-mps and run
uv run python qwen-image-mps.py generate -p 'A vintage coffee shop full of racoons, in a neon cyberpunk city' -s 10 --quantization Q4_K_S
It took 18 minutes for me; adding -f brings it down to 8 minutes.
It stores models in ~/.cache/huggingface/hub/
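With several tools each keeping their own cache, disk fills up fast. A quick sketch for checking how much each cache takes, using the default paths from the table above (directories that don't exist on your machine are silently skipped):

```shell
# Rough disk usage of each tool's model cache; adjust paths if you
# configured custom locations
du -sh ~/.cache/huggingface/hub ~/.ollama/models ~/.lmstudio/models 2>/dev/null
```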
- text-edit-to-image — you want to change an image based on a textual description, would run with the previous setup but needs more memory,
- coding — the only major domain-specific modality, thanks to the enormous training data and the peculiar logic (run-and-test loops)
qwen3-coder-30b, 40 tokens/sec.
Again, not comprehensive, just a good enough list of what a local model can do:
- run limited inference locally — on a lossy compression of the training corpus,
- run searches on the Web — this is currently easiest with Ollama Web Search (you need to get an API key; they promise not to persist search data, more details on Hacker News; no SLA I've seen yet; example uses qwen3-4b),
- have memory — even a crude one, by expanding the context,
- be ~trainable locally — not fast, but at a reasonable speed.
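For the Web search bullet above, a sketch of calling Ollama Web Search directly with curl. The endpoint and payload shape follow Ollama's announcement and may change, so treat them as assumptions and check the current docs; `OLLAMA_API_KEY` is a placeholder for your own key:

```shell
# Query Ollama's hosted web search (requires an API key from ollama.com)
export OLLAMA_API_KEY="your-key-here"   # placeholder, not a real key
curl -s https://ollama.com/api/web_search \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "llama.cpp latest release notes"}'
```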
MCP, the biggest security hole after npm, is the way to connect models. Use it with moderation, trust no one.
MLX LM is a Python package for generating text and fine-tuning large language models on Apple silicon with MLX. LM Studio stores its models in the ~/.lmstudio/models directory, which means you can use the mlx-lm Python library to run prompts through the same model like this:

uv run --isolated --with mlx-lm mlx_lm.generate \
  --model ~/.lmstudio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit \
  --prompt "Write an HTML and JavaScript page implementing space invaders" \
  -m 8192 --top-k 20 --top-p 0.8 --temp 0.7
Be aware that this will load a duplicate copy of the model into memory, so you may want to quit LM Studio before running this command! It ran at 17 tokens per second.