Experimental Results from a Self-Improving Retrieval System for Conversational Memory

The biology-inspired mutation layer didn’t work. A learned MLP adapter and segmentation mutation both produced ~zero NDCG lift on LongMemEval. The control loop was sound; the perturbations weren’t load-bearing.
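For reference, here's a minimal sketch of the kind of adapter involved, assuming a small residual MLP applied on top of frozen bi-encoder embeddings (PyTorch; the dimensions, activation, and residual form are assumptions, not the exact configuration tested):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAdapter(nn.Module):
    """Residual MLP applied to frozen bi-encoder embeddings."""

    def __init__(self, dim: int = 384, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Residual form keeps the adapter near-identity at initialization,
        # so retrieval can only drift gradually from the frozen baseline.
        return F.normalize(emb + self.mlp(emb), dim=-1)
```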

A recall diagnostic reframed the project: 78% of relevant entries never reached the cross-encoder. Bi-encoder recall was the ceiling, not the mutation layer.
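The diagnostic itself is simple. Roughly, assuming gold relevance labels per query and the candidate lists handed to the reranker:

```python
def candidate_recall(candidates_per_query: dict, gold_per_query: dict, k: int = 50) -> float:
    """Fraction of gold-relevant entries that appear anywhere in the top-k
    candidate pool handed to the cross-encoder. If this is low, reranking
    quality is irrelevant: the answer never made the pool."""
    hits, total = 0, 0
    for qid, gold in gold_per_query.items():
        pool = set(candidates_per_query.get(qid, [])[:k])
        hits += sum(1 for doc_id in gold if doc_id in pool)
        total += len(gold)
    return hits / total if total else 0.0
```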

Standard IR wins compounded: dedup at 0.95 cosine similarity, BM25 alongside vector retrieval, and a cross-encoder rerank took NDCG@10 from 0.22 to 0.34. BM25 alone beat pretrained embeddings by 76% on this corpus.
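A sketch of the wiring, with illustrative model names (not necessarily the ones actually used) and rank_bm25 standing in for whatever BM25 implementation is in play:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

# Illustrative models, not necessarily the ones used in the experiments.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def dedup(entries: list[str], embs: np.ndarray, threshold: float = 0.95):
    """Drop entries whose (normalized) embedding has >= threshold cosine
    similarity to an entry already kept."""
    kept, kept_embs = [], []
    for entry, emb in zip(entries, embs):
        if kept_embs and max(float(emb @ e) for e in kept_embs) >= threshold:
            continue
        kept.append(entry)
        kept_embs.append(emb)
    return kept, np.vstack(kept_embs)

def retrieve(query: str, entries: list[str], embs: np.ndarray,
             bm25: BM25Okapi, k_each: int = 50, k_final: int = 10) -> list[int]:
    """Union BM25 and vector candidates, then let the cross-encoder rerank."""
    q_emb = encoder.encode(query, normalize_embeddings=True)
    vec_ids = np.argsort(-(embs @ q_emb))[:k_each]
    bm25_ids = np.argsort(-bm25.get_scores(query.lower().split()))[:k_each]
    pool = sorted(set(vec_ids.tolist()) | set(bm25_ids.tolist()))
    scores = reranker.predict([(query, entries[i]) for i in pool])
    order = np.argsort(-scores)
    return [pool[i] for i in order[:k_final]]
```

Here `embs` is `encoder.encode(entries, normalize_embeddings=True)` computed after dedup, and `bm25` is a `BM25Okapi` built over the tokenized entries.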

Clustered retrieval-induced forgetting (Anderson 1994; as far as I can tell, the first time the effect has been ported to retrieval) added +1.9pp NDCG (p=0.0001) on LongMemEval. It regresses on NFCorpus: the mechanism is scoped to single-user long-term conversational memory, not general IR.
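Roughly, the shape of the idea (a sketch only; the names, constants, and exact suppression rule here are illustrative, not the implementation that produced the numbers above): entries that share a cluster with something just retrieved, but weren't retrieved themselves, pick up a small persistent score penalty.

```python
from collections import defaultdict

class ForgettingIndex:
    """Sketch: entries that share a cluster with a retrieved entry, but were
    not themselves retrieved, accumulate a suppression penalty that
    down-weights their scores on later queries."""

    def __init__(self, cluster_of: dict, penalty: float = 0.05, floor: float = 0.7):
        self.cluster_of = cluster_of            # entry_id -> cluster_id
        self.suppression = defaultdict(float)   # entry_id -> accumulated penalty
        self.penalty = penalty
        self.floor = floor

    def record_retrieval(self, retrieved_ids: list) -> None:
        retrieved = set(retrieved_ids)
        touched = {self.cluster_of[i] for i in retrieved}
        for entry_id, cluster in self.cluster_of.items():
            if cluster in touched and entry_id not in retrieved:
                self.suppression[entry_id] += self.penalty

    def adjust(self, entry_id, score: float) -> float:
        # Clamp so no entry is ever suppressed below `floor` of its raw score.
        return score * max(self.floor, 1.0 - self.suppression[entry_id])
```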

Write-time LLM enrichment (gist plus anticipated queries via Haiku) was the biggest single lever: +8.3pp NDCG on covered queries.
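The enrichment step is one call per write. Roughly, using the Anthropic Python SDK (the model id and prompt are illustrative):

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "Summarize this conversational memory entry as a one-sentence gist, then "
    "list 3 questions a user might later ask that this entry answers. "
    'Reply as JSON: {"gist": ..., "anticipated_queries": [...]}.\n\n'
)

def enrich(entry_text: str) -> dict:
    """Write-time enrichment: ask a small model for a gist and anticipated
    queries, and index them alongside the raw entry."""
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any Haiku-class model
        max_tokens=300,
        messages=[{"role": "user", "content": PROMPT + entry_text}],
    )
    # Assumes the model returned bare JSON; production code should guard this.
    fields = json.loads(resp.content[0].text)
    return {"text": entry_text, **fields}
```

The gist and anticipated queries then become extra searchable fields on the stored entry, which is why the gain shows up at query time for free.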

Fixing the regex tokenizer that BM25 had been missing was worth +1.4pp NDCG on the headline benchmark.
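For flavour, a typical regex tokenizer of the kind BM25 wants (the exact pattern in the fix may differ):

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text: str) -> list[str]:
    """Lowercase, then emit alphanumeric runs. Plain whitespace splitting
    leaves punctuation glued to terms ("Smith's," != "smith"), which quietly
    starves BM25 of exact-match hits."""
    return TOKEN_RE.findall(text.lower())
```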

Six independent ablations (reranker swap, BGE bi-encoder, multi-field BM25, field-boosted BM25, late chunking on a GPU, k_deep sweep) all bounced off the same ceiling: BM25 supplies the candidates the reranker is already ranking well. Model-layer swaps are theatre when one component dominates.
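That claim is easy to check directly: for each query, look at which first-stage retriever supplied the entries that survived into the reranked top-k. A sketch of that attribution (names are illustrative):

```python
def candidate_source_share(final_topk_ids: list, bm25_ids: list, vec_ids: list) -> dict:
    """For entries that made the reranked top-k, report which first-stage
    retriever(s) supplied them. A high BM25 share means swapping the
    reranker or bi-encoder can't move the headline metric much."""
    counts = {"bm25_only": 0, "vector_only": 0, "both": 0}
    bm25_set, vec_set = set(bm25_ids), set(vec_ids)
    for doc_id in final_topk_ids:
        in_bm25, in_vec = doc_id in bm25_set, doc_id in vec_set
        if in_bm25 and in_vec:
            counts["both"] += 1
        elif in_bm25:
            counts["bm25_only"] += 1
        elif in_vec:
            counts["vector_only"] += 1
    return counts
```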

Ported the whole stack to Rust: single binary, ratatui TUI, PyO3 plus napi-rs bindings, Claude Code plus Codex CLI plugins. Cross-project search dropped from 6–7s to 1.7s.

Lesson: check the bottleneck before extending the mechanism.
