Brute-Force Retrieval Holds Through 5,000 Memories. Then It Doesn't.

The last post I wrote measured how long it takes to query my Linux box's memory store from my Mac over a WireGuard mesh. The answer was about 20 milliseconds, plus or minus a few for Tailscale jitter, stable across store sizes from 10 to 500 entries. That post ended with a sentence I should not have left there without testing: "the linear scan over a few hundred vectors is negligible at this scale."

That is true at a few hundred. The honest follow-up question is the one I had skipped. At what scale does it stop being true?

I knew the rough shape of the answer. The retriever is a brute-force cosine scan over every embedding in the local store. No approximate-nearest-neighbor index. No HNSW, no IVF, no FAISS. Just a loop. At 500 entries the loop is essentially free; the embedder cost dominates the call. Somewhere above 500 the loop starts to cost real milliseconds. The question is where.
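For readers who want the shape of that loop, here is a minimal sketch of a brute-force cosine scan in numpy. It is not the Potluck retriever itself: the function name `top_k_cosine`, the array layout (one row per memory), and the choice of `k` are assumptions made for illustration.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, store: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Brute-force scan: score every stored vector against the query."""
    if store.shape[0] == 0:
        return []
    # Normalize both sides so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    s = store / np.linalg.norm(store, axis=1, keepdims=True)
    scores = s @ q                          # one score per stored vector, O(n * d)
    k = min(k, scores.shape[0])
    # argpartition selects the k best without sorting everything; sort only those k.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical store: 500 random vectors of dimension 384.
rng = np.random.default_rng(0)
store = rng.normal(size=(500, 384)).astype(np.float32)
hits = top_k_cosine(store[42], store, k=3)  # querying with a stored vector ranks it first
```

Every query touches every row, which is why the cost grows linearly with store size.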

I ran the bench again with a --sizes flag I added that morning, and pointed it at 1,000, 5,000, and 10,000 synthetic memories on the Linux peer. Same script as before. Same Tailscale mesh. Warm embedder, 30 query samples per store size.
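The sampling procedure can be sketched as follows. This is an illustration, not the published bench script: `summarize_latency` and `query_fn` are hypothetical names, and `query_fn` stands in for whatever issues one query against the peer.

```python
import time
import numpy as np

def summarize_latency(query_fn, n_samples: int = 30, n_warmup: int = 3) -> dict[str, float]:
    """Time repeated calls to query_fn and report p50, p95, and min in milliseconds."""
    for _ in range(n_warmup):               # warm-up calls are not recorded
        query_fn()
    samples_ms = []
    for _ in range(n_samples):
        start = time.perf_counter()
        query_fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50": float(np.percentile(samples_ms, 50)),
        "p95": float(np.percentile(samples_ms, 95)),
        "min": float(min(samples_ms)),
    }

# Stand-in workload; a real run would call the peer over HTTP instead.
stats = summarize_latency(lambda: sum(range(10_000)))
```

With 30 samples and numpy's default linear interpolation, the p95 falls between the second- and third-slowest calls, so one or two slow samples can move it a long way.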

The measurement

I ran each cell once. Cleanup between cells so the next size starts from zero. The Mac-local p50 baseline (14.8 ms, an essentially empty embedding-pass-through) is shown for reference.

Store size on peer	p50	p95	min
Mac local (baseline)	14.8 ms	18.3 ms	7.6 ms
10 memories on peer	35.9 ms	40.7 ms	18.1 ms
100 memories on peer	36.1 ms	41.2 ms	19.1 ms
500 memories on peer	36.3 ms	46.0 ms	20.9 ms
1,000 memories on peer	42.0 ms	51.1 ms	21.3 ms
5,000 memories on peer	40.6 ms	116.5 ms	32.0 ms
10,000 memories on peer	57.6 ms	131.9 ms	47.2 ms

Three observations the table makes obvious.

First, p50 is remarkably flat through 5,000. Forty-two milliseconds at 1K, forty-one milliseconds at 5K. The store grew five times between those two cells and the median did not rise. Even from 10 entries to 5,000, a 500-fold increase, it moved by less than five milliseconds. The cosine scan is contributing essentially nothing to the median at these sizes; the embedder cost on the peer is still the dominant term in a typical query.

Second, p95 starts to widen at 1K and breaks at 5K. Fifty-one milliseconds at 1K, then 116 milliseconds at 5K, then 132 at 10K. The 5K p95 more than doubled vs the 1K p95 even though the median barely moved. That is the linear scan showing up in the tail. The median samples are landing in a regime where the embedder dominates, but the slow samples are catching the scan doing real work.

Third, p50 starts moving at 10K. Fifty-eight milliseconds, sixteen above the 1K p50 and seventeen above the 5K p50. The scan is now contributing visibly to the median, not just the tail.

The threshold is somewhere between 5K and 10K. p95 has already broken by 5K; p50 breaks by 10K. The "fine" regime ends in that window.
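The arithmetic behind those three observations, worked from the table's own numbers. The helper name `growth` is only for illustration.

```python
# (p50_ms, p95_ms) per store size, copied from the table above.
results = {
    10: (35.9, 40.7),
    100: (36.1, 41.2),
    500: (36.3, 46.0),
    1_000: (42.0, 51.1),
    5_000: (40.6, 116.5),
    10_000: (57.6, 131.9),
}

def growth(metric_index: int, small: int, large: int) -> tuple[float, float]:
    """Return (absolute delta in ms, ratio) for one metric between two store sizes."""
    a, b = results[small][metric_index], results[large][metric_index]
    return round(b - a, 1), round(b / a, 2)

p50_1k_5k = growth(0, 1_000, 5_000)    # (-1.4, 0.97): median flat
p95_1k_5k = growth(1, 1_000, 5_000)    # (65.4, 2.28): tail more than doubles
p50_5k_10k = growth(0, 5_000, 10_000)  # (17.0, 1.42): median finally moves
```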

Full bench data and the script that produced it: trypotluck.ai/benchmarks.

What that means in practice

Most people who would use a local AI memory store do not have 10,000 entries in it. I have been dogfooding the system for a few months and my own store has 280 memories in it as of writing this. A heavy daily user storing every decision, preference, and project fact would maybe reach 2,000 in a year. The 5K threshold is something most users will never hit. The 10K threshold is firmly in power-user territory.

That is the result I wanted to be true, and it is. Brute-force cosine scan is the right default for personal AI memory at the scale most people will operate at. Adding an ANN index now would be a premature optimization that buys nothing for 95% of users and adds operational complexity for everyone.

It is also useful to know exactly when ANN starts paying off. If a user reaches around 5,000 memories, p95 starts wobbling. If they reach around 10,000, p50 starts wobbling. The right time to add HNSW or IVF-PQ to this codebase is when I see real users hitting those sizes, not before. Until then the scan is the right answer and the engineering effort is better spent elsewhere.

A note on what is in the embedder budget

p50 is about forty milliseconds through 5K. About fifteen of those are the WireGuard round-trip plus FastAPI middleware. The rest is mostly the embedder running on the peer. bge-small-en-v1.5 on a 2080 Ti via CUDA does a single query embedding in about twenty milliseconds when the model is warm. The cosine scan over 500 vectors of dimension 384 is about 0.2 milliseconds in numpy, which is below the precision of my measurement. The scan over 5,000 vectors is about 2 milliseconds, which is also below the noise floor of an HTTP-over-WireGuard probe. The scan over 10,000 vectors is about 4 milliseconds. That one is just barely visible, and on its own it accounts for only a small part of the 17-millisecond median delta between 5K and 10K.
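Anyone who wants to check the scan figures on their own hardware can time the scan in isolation, with no network or embedder in the path. The sketch below is an assumption-laden stand-in: random float32 vectors, pre-normalized rows, a top-10 selection, and a hypothetical helper name `time_scan_ms`. Absolute numbers will vary with CPU, BLAS build, and whether normalization happens at query time.

```python
import time
import numpy as np

def time_scan_ms(n_vectors: int, dim: int = 384, repeats: int = 20) -> float:
    """Median wall time in ms for one cosine scan over n_vectors unit vectors."""
    rng = np.random.default_rng(0)
    store = rng.normal(size=(n_vectors, dim)).astype(np.float32)
    store /= np.linalg.norm(store, axis=1, keepdims=True)  # assume rows normalized at write time
    query = store[0]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        scores = store @ query                              # cosine scores for every row
        np.argpartition(-scores, 10)[:10]                   # top-10 selection
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings))

for n in (500, 5_000, 10_000):
    print(f"{n:>6} vectors: {time_scan_ms(n):.3f} ms")
```

Whatever the absolute values, the cost should scale roughly linearly with the number of rows, which matches the 2-to-4 millisecond step between 5K and 10K quoted above.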

What is visible in the 5K p95 is not the average scan cost. It is the worst-case scan cost colliding with a worst-case scheduling stall, a worst-case GC pause, a worst-case context switch on the peer. The tail samples are catching the system at its slowest. As the store grows, more of the scan's wall time happens during a bad moment, so the tail widens disproportionately to the median.

The interesting engineering implication is that the first optimization that matters is not ANN. It is reducing the number of vectors the scan touches in the first place. Project-scoped filtering (only scan vectors tagged for the current project), confidence-threshold pruning (skip vectors below a minimum stored confidence), recency cutoff (skip vectors older than N days unless re-accessed) all reduce the scan size cheaply. Most realistic queries on a 10,000-vector store are not actually asking the retriever to consider all 10,000. They are asking it to consider the few hundred relevant to the current project, which puts the effective scan size right back in the "fine" regime.
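A sketch of how those three filters could shrink the scan before any cosine math runs. The metadata fields (`project`, `confidence`, `days_since_access`) and the default thresholds are hypothetical, chosen only to mirror the three ideas in the paragraph above; stored vectors and the query are assumed to be unit-normalized.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryMeta:
    project: str
    confidence: float
    days_since_access: float

def candidate_indices(meta: list[MemoryMeta], project: str,
                      min_confidence: float = 0.3,
                      max_idle_days: float = 180.0) -> np.ndarray:
    """Indices of memories that survive all three cheap filters."""
    keep = [
        i for i, m in enumerate(meta)
        if m.project == project                     # project-scoped filtering
        and m.confidence >= min_confidence          # confidence-threshold pruning
        and m.days_since_access <= max_idle_days    # recency cutoff
    ]
    return np.array(keep, dtype=np.int64)

def filtered_scan(query: np.ndarray, store: np.ndarray, meta: list[MemoryMeta],
                  project: str, k: int = 5) -> list[tuple[int, float]]:
    """Cosine scan restricted to the rows that pass the filters."""
    idx = candidate_indices(meta, project)
    if idx.size == 0:
        return []
    scores = store[idx] @ query                     # scan touches only surviving rows
    order = np.argsort(-scores)[:k]
    return [(int(idx[j]), float(scores[j])) for j in order]
```

The filter pass is a linear walk over metadata, but it compares a few scalars per entry instead of doing a 384-wide dot product, and the expensive scan then runs over only the rows that remain.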

ANN indexing is the right move when even the project-scoped scan crosses 5K. That is a real product moment, but it is not now.

Honest limits of the measurement

This was Linux peer only, with CUDA-accelerated embedding and a 2080 Ti. Mac local would look slightly different at scale because Metal embedder throughput is slightly higher than CUDA on this generation of GPU. Windows peer would look worse because the embedder runs on CPU there. I expect the same shape, the same threshold around 5K to 10K, slightly different absolute numbers. I have not yet measured the Mac local or Windows peer cases at 1K plus and that is the next bench.

The synthetic memories the bench script populates are deliberately diverse but they are not real user memories. Real memories are more semantically clustered (you ask about kubernetes more than you ask about cassandra) which means the cosine scores will have a different distribution, which means the top-k selection will land in slightly different cache regimes. I would expect this to make the tail samples noisier in production than in the bench. The 5K and 10K p95s are probably underestimates of what a real user with a power-user store would see.

I ran each size once. A statistically defensible characterization would run each size five to ten times and report distributions, not point estimates. The p50 numbers are stable enough run-to-run that I am confident in the 5K threshold within a few hundred memories either way. The p95 numbers have wider run-to-run variance, especially at 5K and 10K, so the exact p95 values are less trustworthy than the trend.

These results are for retrieval latency only. They do not address what happens to retrieval quality as the store grows. A 10K-memory store has different recall characteristics than a 500-memory store because there is more competition for the top-k slots. That is a separate measurement and one I have not run yet. The right place for it is on the next pass.

What I would build next

In order of what actually pays off for most users:

1. Project-scoped pre-filtering in the retriever, so the cosine scan only touches vectors with a matching project tag. Cheap to implement, immediately reduces effective scan size for anyone with more than one project.
2. A tiny in-process LRU cache for embeddings of recently-asked queries. Same query within a session skips the embedder entirely. Embedder is the dominant cost at small sizes, so at roughly twenty milliseconds per embedding a 50% cache hit rate cuts the average call by about ten milliseconds.
3. Recency-based cold-storage tiering for memories older than some threshold and unaccessed for some other threshold. Keeps the hot scan small even as total store grows past 10K.
4. ANN indexing (probably HNSW via hnswlib for the python bindings, or usearch for a smaller dependency) only when the hot scan size crosses 5K. That is the right moment, not before.

I will probably ship 1 this week, 2 next week, defer 3 until a user actually has the scale to need it, and defer 4 until item 3 is not enough. The order is from "fixes a thing I can measure today" to "fixes a thing I will not need to fix for months."
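The query-embedding cache from that list could look something like the sketch below. The class name `QueryEmbeddingCache`, the `embed_fn` callable, and the default capacity are all assumptions made for illustration. Keys are exact matches on stripped query text, so paraphrases still miss.

```python
from collections import OrderedDict
from typing import Callable
import numpy as np

class QueryEmbeddingCache:
    """Small in-process LRU cache keyed on the stripped query text."""

    def __init__(self, embed_fn: Callable[[str], np.ndarray], max_entries: int = 256):
        self._embed_fn = embed_fn
        self._max_entries = max_entries
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> np.ndarray:
        key = text.strip()
        if key in self._cache:
            self._cache.move_to_end(key)        # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        vector = self._embed_fn(text)           # the expensive call being avoided on hits
        self._cache[key] = vector
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)     # evict the least recently used entry
        return vector
```

The expected saving per call is the hit rate times the embedder cost, so the payoff depends entirely on how often a session repeats a query verbatim.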

What this changed

The previous post about cross-machine memory query ended with a hand-wave. "Negligible at this scale" without an upper bound is not a measurement, it is a vibe. The fix was a bench script that already existed plus one new flag carrying three store sizes. The cost was thirty seconds of typing and fifteen minutes of waiting for the populate step to finish at 10K entries.

The result is the same shape I expected with a sharper number than I expected. The cosine scan is fine through about 5,000 entries. By 10,000 it is starting to be a real cost. Most users will live their whole Potluck life inside the "fine" regime. For the ones that don't, the right next optimization is not ANN. It is the cheaper pre-filter step that puts most queries back in the fine regime even on a large store.

The architecture was right. The threshold I was working off was wrong. The measurement tightened it from "negligible at this scale" to "negligible through five thousand, real cost past ten thousand," which is a more honest sentence to put in a benchmark caption.

Rob writes the Local AI Engineering Notes series on strake.dev. He's also building Potluck AI, the local-first AI memory system measured in this post, and Strake, a GitHub Action deploy gate.
