How I built a zero-token memory layer for LLMs (and why it outperforms vector store approaches)

If you've built an AI chatbot or agent, you've hit the same problem: the LLM forgets everything between sessions. The standard solution is to stuff your conversation history into a vector store and retrieve relevant chunks before each call. It works, but it has a hidden cost.

The token problem nobody talks about

Every popular memory solution (mem0, Zep, LangChain ConversationSummaryMemory) runs an LLM under the hood when you recall. That's anywhere from 500 to 7,000 tokens per recall call, on top of your actual LLM call.

For a chatbot with 1,000 daily active users sending 10 messages each, that's 10,000 recall calls × ~2,000 tokens = 20 million extra tokens per day. Before your LLM has said a single word.

The retrieval-only approach

I built BECOMER around a different idea: semantic retrieval using embeddings, no LLM inside the memory layer. Store → embed → index → retrieve. Your LLM receives the retrieved context and reasons over it, which is exactly what it's already doing.

    from becomer import Client

    mem = Client("bcm_your-api-key")

    # Before your LLM call
    context = mem.recall("what does this user prefer?", top_k=5)

    # Inject into your system prompt
    system_prompt = f"User context:\n{chr(10).join(context)}"

    # After your LLM call
    mem.store("User asked about Python decorators, found list comprehension more intuitive")

Benchmark results

Tested against LongMemEval (n=500), the academic standard for conversational memory:

    System      Score    Tokens/recall
    BECOMER     94.4%    0
    mem0        93.4%    ~6,787
    Hindsight   91.4%    ~6,787

The honest caveat: on LOCOMO's multi-hop reasoning questions, mem0 scores 91.6% vs our 69.5%. Their system adds an LLM reasoning pass over retrieved results. We return the context; your LLM reasons. For most agent use cases, where you control the final LLM call, this gap disappears because your own LLM performs that reasoning pass.
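To make the "store → embed → index → retrieve" flow above concrete, here is a minimal, self-contained sketch of a retrieval-only memory with no LLM in the loop. It is an illustration under loud assumptions, not BECOMER's implementation: `ToyMemory`, `embed`, and `cosine` are hypothetical names, and the "embedding" is a plain bag-of-words count vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a sparse word-count vector (a stand-in, NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """Hypothetical in-process memory: store embeds and indexes, recall ranks by similarity."""

    def __init__(self) -> None:
        self._index = []  # list of (vector, original text) pairs

    def store(self, text: str) -> None:
        # Embed once at write time, so recall needs no model call per stored item.
        self._index.append((embed(text), text))

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        q = embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self._index]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Drop items with no similarity at all rather than padding the result.
        return [text for score, text in scored[:top_k] if score > 0]


mem = ToyMemory()
mem.store("User prefers TypeScript and dark mode")
mem.store("Rate limit is 100 requests per minute")
context = mem.recall("what does the user prefer?", top_k=1)
```

Recall here is a similarity ranking over precomputed vectors and involves no generation, which is why the memory layer itself spends no LLM tokens. The toy vectors only match exact words ("prefer" does not match "prefers"); a real embedding model is what would make the retrieval semantic.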
Multi-tenant in two lines

For developers building apps with multiple end-users, pass a user_id:

    # Each user gets a fully isolated namespace
    mem_alice = Client("bcm_key", user_id="alice-123")
    mem_alice.store("Alice prefers TypeScript and dark mode")

    mem_bob = Client("bcm_key", user_id="bob-456")
    mem_bob.recall("preferences")  # → [] (completely isolated)

Isolation is enforced at the database layer, not just application code. One master key covers your entire user base.

Agent use cases

The pattern that makes BECOMER useful beyond chatbots is shared namespaces for multi-agent systems:

    # Research agent (GPT-4o) stores findings
    mem = Client("bcm_key", user_id="task-abc")
    mem.store("API endpoint: POST /v2/payments, OAuth2")
    mem.store("Rate limit: 100 req/min")

    # Executor agent (Claude): different process, same namespace
    ctx = Client("bcm_key", user_id="task-abc").recall("payment API details")
    # → gets exactly what the research agent found
    # No message passing. No state files. No coordination code.

Self-improving systems work the same way: store every attempt with its outcome, recall what worked before the next run.

What's available today

- REST API
- Python SDK: pip install becomer
- JS/Node SDK: npm install @becomerpackage/sdk (zero deps, TypeScript types)
- MCP: works with Claude Desktop and Cursor, set BECOMER_API_KEY and go
- Framework adapters: LangChain, LlamaIndex, LangGraph, CrewAI, AutoGen

Free tier: 1,000 calls/month. Pro: $12/month.

Full docs, benchmarks, and a free API key: https://becomer.net

I'm curious how others are handling the token cost problem for memory. What approaches have you found that work at scale?
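As a footnote to the self-improving pattern mentioned earlier (store every attempt with its outcome, recall what worked before the next run), here is one way it could be sketched. Everything in it is an assumption for illustration: `InMemoryStub` is a hypothetical stand-in that only mimics the `store`/`recall` shape used in this post, and the `task | approach | outcome` note format and the helper names are my own, not part of any SDK.

```python
class InMemoryStub:
    """Hypothetical stand-in with the store/recall shape used in this post (no network, no SDK)."""

    def __init__(self) -> None:
        self._notes = []

    def store(self, text: str) -> None:
        self._notes.append(text)

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        # Naive keyword overlap instead of semantic search, just to keep the sketch runnable.
        words = set(query.lower().split())
        hits = [note for note in self._notes if words & set(note.lower().split())]
        return hits[:top_k]


def record_attempt(mem, task: str, approach: str, succeeded: bool) -> None:
    """Store one attempt as a single structured note (format is an assumption)."""
    outcome = "success" if succeeded else "failure"
    mem.store(f"task: {task} | approach: {approach} | outcome: {outcome}")


def approaches_that_worked(mem, task: str, top_k: int = 5) -> list[str]:
    """Recall notes for a task and keep only the approaches marked as successful."""
    worked = []
    for note in mem.recall(task, top_k=top_k):
        fields = dict(part.split(": ", 1) for part in note.split(" | "))
        if fields.get("task") == task and fields.get("outcome") == "success":
            worked.append(fields["approach"])
    return worked


mem = InMemoryStub()
record_attempt(mem, "parse-invoice", "regex on raw text", succeeded=False)
record_attempt(mem, "parse-invoice", "layout-aware PDF parser", succeeded=True)
record_attempt(mem, "send-email", "SMTP relay", succeeded=True)
worked = approaches_that_worked(mem, "parse-invoice")
```

Filtering on the outcome happens after recall, so a small `top_k` can hide successful attempts behind failed ones. Storing successes and failures under separate namespaces would be another option.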
