I wrote a post yesterday about why GPUs barely help small text embeddings at batch=1. Different workload, same machines. This time I ran a local LLM inference benchmark across the same three boxes. The result complicated my hardware mental model in a way I think is worth sharing.
The setup
Three machines.
A Mac M2 Pro with 16 GB of unified memory, running Metal through llama-cpp-python.
A Linux desktop with an Intel 13700K, 62 GB of RAM, and an RTX 2080 Ti with 11 GB of VRAM. CUDA 13.
A Windows desktop with an AMD 5800X, 64 GB of RAM, and an RX 6600 XT with 8 GB of VRAM. Vulkan through llama.cpp.
Four models, all Q4_K_M quantization except the last. Phi-3 mini 3.8B. Qwen 2.5 7B. Llama 3.1 8B. Llama 3.1 70B at the more aggressive Q3_K_S as a stretch test.
Ten-prompt suite, mixing short Q&A, code generation, summarization, and long context. Three runs per prompt. Median across the runs.
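The aggregation can be sketched in a few lines. This is a minimal illustration, not the actual harness: it assumes the overall number is a median of per-prompt medians, and `suite_median` and the sample values are hypothetical.

```python
from statistics import median

def suite_median(runs_by_prompt):
    """Collapse {prompt_id: [tok/s per run]} into one overall number.

    Step 1: median across the runs of each prompt.
    Step 2: median of those per-prompt medians across the suite.
    """
    per_prompt = {p: median(runs) for p, runs in runs_by_prompt.items()}
    return median(per_prompt.values()), per_prompt

# Made-up sample data, three runs per prompt. Not measurements from this post.
sample = {
    "short_qa": [20.0, 21.0, 19.0],
    "codegen": [18.0, 18.5, 17.5],
    "summarize": [22.0, 23.0, 21.0],
}
overall, per_prompt = suite_median(sample)
print(overall)
```

Taking the median at both levels keeps one slow run (a cold cache, a background process) from dragging a whole cell.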
The numbers
Generation tokens per second, overall median across the suite.
| Model | Mac M2 Pro (Metal) | Linux 2080 Ti (CUDA) | Windows 6600 XT (Vulkan) |
|---|---|---|---|
| Phi-3 mini 3.8B | 19.1 | 59.9 | 16.4 |
| Qwen 2.5 7B | 12.4 | 43.0 | 20.4 |
| Llama 3.1 8B | 11.6 | 40.1 | 20.9 |
| Llama 3.1 70B Q3 | won't fit | 1.3 | won't fit |
The 2080 Ti winning everything makes intuitive sense. The Mac-versus-AMD comparison is the part that surprised me.
The anomaly
The RX 6600 XT is a roughly $200 used consumer GPU. It beats my Mac M2 Pro on Llama 3.1 8B by 80 percent. 20.9 tokens per second versus 11.6.
The same RX 6600 XT loses to my Mac on Phi-3 mini. 16.4 versus 19.1. A 14 percent loss.
Same hardware. Same benchmark harness. Same prompts. Opposite winner.
The reflex answer is "noise." It is not noise. The numbers held up across three runs per cell and ten prompts per cell. They held up in the per-category breakdowns, the prompt-eval rates, and the time-to-first-token measurements. The Mac wins for small models and loses for medium models. That is the finding.
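The percentages above come straight from the table. A small sketch of the arithmetic, with `amd_vs_mac_percent` as a hypothetical helper name:

```python
# Generation tok/s copied from the table above.
TOKS = {
    "Phi-3 mini 3.8B": {"mac": 19.1, "amd": 16.4},
    "Qwen 2.5 7B": {"mac": 12.4, "amd": 20.4},
    "Llama 3.1 8B": {"mac": 11.6, "amd": 20.9},
}

def amd_vs_mac_percent(model):
    """Percent by which the AMD card is faster (+) or slower (-) than the Mac."""
    row = TOKS[model]
    return (row["amd"] / row["mac"] - 1.0) * 100.0

for model in TOKS:
    print(f"{model}: {amd_vs_mac_percent(model):+.0f}%")
```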
Why this happens
Phi-3 mini at Q4_K_M is about 2.2 GB of weights. That fits in the M2 Pro's cache hierarchy comfortably.
Apple Silicon's unified memory architecture means there is no host-to-device transfer. The CPU and GPU share the same physical memory pool with the same bandwidth. There is no PCIe bus to cross. Dispatch overhead is the only fixed cost.
The RX 6600 XT has more raw VRAM bandwidth than the M2 Pro's unified pool. About 256 GB/s versus 200. But for a 2.2 GB model running one token at a time, you cannot saturate that bandwidth. The compute work per dispatch is too small. The PCIe round-trip and the Vulkan driver overhead eat the win.
For Qwen 7B and Llama 8B at Q4, the model is around 5 GB. That exceeds the M2 Pro's cache. The Mac is now memory-bandwidth-bound at the SoC level, sharing 200 GB/s between CPU and GPU. The discrete card is bandwidth-bound at the VRAM level, with 256 GB/s dedicated to the GPU alone. The discrete card wins.
The threshold where this flips is roughly where the model exceeds the M2 Pro's effective cache. For Q4 quantization, that threshold lives somewhere between 3.8B and 7B parameters.
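One way to sanity-check the bandwidth argument is a back-of-envelope ceiling. It rests on a common approximation for dense models at batch size 1: every generated token reads all the weights once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below uses the bandwidth and size figures quoted above. The function names are hypothetical, and the model ignores the KV cache, compute time, and dispatch overhead.

```python
def bandwidth_ceiling_tps(bandwidth_gb_s, weights_gb):
    """Upper bound on tok/s if every token streams all weights once."""
    return bandwidth_gb_s / weights_gb

def utilization(bandwidth_gb_s, weights_gb, measured_tps):
    """Fraction of the bandwidth ceiling that the measured rate reaches."""
    return measured_tps / bandwidth_ceiling_tps(bandwidth_gb_s, weights_gb)

# (label, bandwidth GB/s, weights GB, measured tok/s)
# Bandwidth and sizes as quoted in the text; tok/s from the table.
CASES = [
    ("Mac, Phi-3 mini", 200.0, 2.2, 19.1),
    ("AMD, Phi-3 mini", 256.0, 2.2, 16.4),
    ("Mac, Llama 3.1 8B", 200.0, 5.0, 11.6),
    ("AMD, Llama 3.1 8B", 256.0, 5.0, 20.9),
]

for name, bw, size, tps in CASES:
    print(f"{name}: ceiling {bandwidth_ceiling_tps(bw, size):.0f} tok/s, "
          f"measured {tps} ({utilization(bw, size, tps):.0%} of ceiling)")
```

Under this rough model, the AMD card reaches a smaller share of its ceiling on Phi-3 mini than on Llama 3.1 8B, which is at least consistent with fixed per-token overhead mattering more when the model is small.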
What this means if you are buying hardware
The right question is not "which platform is faster for local AI." It is "which platform is faster for the model size I actually use."
If your loop is small specialized models (routing classifiers, lightweight rerankers, sentence embedders), the Mac wins. Buy more unified memory.
If your loop is 7B and 8B chat models, the midrange AMD card wins on price-per-token. Buy used.
If your loop is 13B and larger, I did not benchmark that tier. I would expect NVIDIA's mature CUDA dispatch and the higher-end VRAM to widen the gap, with the gap still roughly proportional to the cost.
If your loop is 70B and above, none of this hardware is enough.
The honest answer to "what should I buy for local AI" is "what is the model going to be."
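To compare options on price, one simple metric is dollars of hardware per token per second of generation speed. This is only one reading of "price-per-token", and `dollars_per_tps` is a hypothetical helper. The sketch uses the one price quoted in this post and leaves the other machines for you to fill in with what you would actually pay.

```python
def dollars_per_tps(price_usd, tokens_per_second):
    """Hardware cost divided by generation speed; lower is better."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return price_usd / tokens_per_second

# The only price quoted in this post: roughly $200 used for the RX 6600 XT,
# paired with its Llama 3.1 8B result from the table. GPU only, not the whole PC.
rx6600xt = dollars_per_tps(200.0, 20.9)
print(f"RX 6600 XT: ${rx6600xt:.2f} per tok/s")
```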
The hardware tier ceiling
The 70B result is the most useful data point in this benchmark, because it stops being about which platform wins.
Llama 3.1 70B at Q3 will not load on a 16 GB Mac. It will not load on an 8 GB AMD card. On the Linux box it runs at 1.3 tokens per second, with part of the model offloaded to the 2080 Ti and the rest held in the 62 GB of system RAM. Time to first token is 3.8 seconds. Technically possible. Unusable for chat.
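A rough fit check explains the pattern. The sketch below estimates weight size as parameters times bits per weight. The 3.5 bits per weight for Q3_K_S and the 1.5 GB overhead allowance are my assumptions, not figures from the benchmark, and real GGUF files vary. `weights_gb` and `fits` are hypothetical helper names.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8.0

def fits(weights, memory_gb, overhead_gb=1.5):
    """True if the weights plus a guessed overhead fit in memory_gb."""
    return weights + overhead_gb <= memory_gb

# ASSUMPTION: about 3.5 bits per weight for Q3_K_S. Approximate only.
llama70b = weights_gb(70.0, 3.5)

# Memory pools as described in the setup section.
MEMORY_GB = {
    "Mac M2 Pro unified": 16.0,
    "RX 6600 XT VRAM": 8.0,
    "RTX 2080 Ti VRAM": 11.0,
    "RTX 2080 Ti VRAM + Linux RAM": 11.0 + 62.0,
}
for name, mem in MEMORY_GB.items():
    print(f"{name}: {'fits' if fits(llama70b, mem) else 'does not fit'}")
```

Only the combined VRAM-plus-RAM pool on the Linux box clears the bar, which matches the one cell in the table that produced a number.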
Above that tier, you need a Mac Studio M3 Ultra with 512 GB of unified memory, or a 192 GB DDR5 workstation with a 24 GB GPU, or a multi-GPU rig. Those exist. They are not most developers' desks.
DeepSeek V3 and R1 sit higher still. At Unsloth's most aggressive 1.58-bit dynamic quant, they need around 131 GB of unified memory or 192 GB of system RAM. People do run them on consumer hardware. Just not on the kind of consumer hardware most developers own.
The "you need pooled compute" argument used to feel abstract to me. It does not anymore. There is a specific tier of model your current desk cannot run. Whichever model that is, that is where pooled compute starts to matter.
The summary
For local model inference, how well the hardware matches the model size matters more than the vendor. The Mac M2 Pro wins for small models that fit in cache. The discrete GPUs win once the model exceeds cache. The cheap AMD card is competitive with the more expensive NVIDIA card on price-per-token. None of this hardware runs 70B usably, and the larger 671B-class models need a hardware tier above any of it.
There is no universal winner. The right hardware depends on which model is in your loop.
If you are about to spend money on hardware for local AI, run the benchmark on the model you actually use before you commit. The vendor wars are not the answer.
Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.