April 5, 2026 · memory · mcp · sqlite · vector-search

Your AI Coding Assistant Forgets Everything. Here's How I Fixed It.

A problem/solution analysis of total-recall — multi-tiered persistent memory for TUI coding assistants.

The Problem

Every TUI coding assistant has the same gap. You spend an hour teaching Claude Code about your project's quirks — that the auth middleware was rewritten for compliance, that integration tests must hit a real database, that your team prefers bundled PRs for refactors — and then the session ends.

Next session: blank slate.

$ claude
╭───────────────────────────────────────────╮
 Claude Code v1.0.23                       
 Session started.                          
                                           
 ! No memory of previous sessions          
 ! No project context loaded               
 ! Previous corrections forgotten          
╰───────────────────────────────────────────╯

> refactor the auth middleware

  Analyzing codebase...
  Mocking the database layer for testability.
  Opening separate PRs: auth-core, auth-utils, auth-tests.
  Note: removed legacy compliance check — looked unused.

The built-in memory systems are flat files. No tiering — every memory is treated equally, leading to context bloat or information loss. No semantic search — retrieval is by filename, not meaning. Switch from Claude Code to Copilot CLI? Start from scratch. No observability to know if memory is even helping.

The Solution

total-recall introduces a three-tier memory architecture backed by local SQLite and vector embeddings. Memories aren't flat files — they're ranked, decayed, promoted, and searched semantically.

Hot (≤ 50 entries)

Auto-injected into every prompt. Your most critical context, always present. Token-budgeted to 4K tokens.

Warm (≤ 10K entries)

Retrieved semantically per query. Ask about auth — auth memories surface automatically.

Cold

Hierarchical knowledge base. Ingest docs, READMEs, architecture notes, entire source trees.
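The hot tier's behavior can be sketched as a greedy packer: take the highest-ranked memories and stop when the token budget is spent. This is a minimal sketch, assuming a rough 4-characters-per-token estimate and hypothetical helper names; only the 50-entry cap and 4K budget come from the article.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose (assumption).
    return max(1, len(text) // 4)

def assemble_hot_tier(memories, budget_tokens=4000, max_entries=50):
    """memories: list of (score, text); highest score = most critical."""
    chosen, used = [], 0
    for score, text in sorted(memories, key=lambda m: m[0], reverse=True)[:max_entries]:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip entries that would blow the budget
        chosen.append(text)
        used += cost
    return chosen, used
```

A greedy pass like this keeps the most critical context present even when one oversized memory would not fit.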

Here's what the same session looks like with total-recall loaded:

$ claude
╭──────────────────────────────────────────────────────╮
 Claude Code v1.0.23                                  
 total-recall loaded — 3 hot, 12 warm, 847 KB chunks  
 Last session: 2 hours ago                            
                                                      
 Hot context:                                         
 [correction]  auth rewrite was compliance-driven     
 [preference]  integration tests must use real DB     
 [preference]  bundled PRs for refactor changes       
╰──────────────────────────────────────────────────────╯

> refactor the auth middleware

  Recalling: auth rewrite was compliance-driven — preserving all
  compliance checks. Using real DB for integration tests as required.
  Bundling auth-core, auth-utils, and auth-tests into a single PR.

Architecture Deep Dive

Local-First Embeddings

Every memory is vectorized on write using all-MiniLM-L6-v2 — a sentence transformer that produces 384-dimensional embeddings. It runs locally via ONNX Runtime. No API keys, no network calls, no cloud dependency. The model ships bundled with the package.

Storage is SQLite with the sqlite-vec extension for vector similarity search. A single file at ~/.total-recall/total-recall.db holds everything — memories, embeddings, knowledge chunks, compaction logs, and eval metrics.
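A single-file store of this shape is easy to picture in a few tables. The sketch below uses Python's stdlib sqlite3; the table and column names are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical miniature of a single-file store like total-recall.db.
db = sqlite3.connect(":memory:")  # use a file path for persistence
db.executescript("""
CREATE TABLE memories (
  id         INTEGER PRIMARY KEY,
  tier       TEXT CHECK (tier IN ('hot', 'warm', 'cold')),
  kind       TEXT,              -- e.g. 'correction', 'preference'
  content    TEXT NOT NULL,
  embedding  BLOB,              -- 384 float32 values from the embedder
  score      REAL DEFAULT 1.0,  -- decay score used by compaction
  created_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE knowledge_chunks (
  id        INTEGER PRIMARY KEY,
  source    TEXT,               -- originating doc or source file
  content   TEXT NOT NULL,
  embedding BLOB
);
""")
db.execute("INSERT INTO memories (tier, kind, content) VALUES (?, ?, ?)",
           ("hot", "correction", "auth rewrite was compliance-driven"))
row = db.execute("SELECT tier, content FROM memories").fetchone()
print(row)
```

In the real system the embedding columns would be indexed through sqlite-vec for similarity search rather than scanned as raw blobs.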

Hybrid Search

Retrieval combines two signals: vector similarity (cosine distance in 384-dimensional space) and full-text search (BM25 via SQLite FTS5). The scores are fused with a configurable weight — by default, 70% semantic similarity + 30% keyword match. This catches both conceptual and exact-match queries.

Query: "how does authentication work?"

Vector search  ---  embed(query) -> cosine similarity -> top K
                |
FTS5 search    ---  tokenize(query) -> BM25 ranking -> top K
                |
Score fusion   ---  0.7 * vector + 0.3 * fts -> ranked results
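The fusion step above can be sketched in a few lines. Because vector similarity and BM25 live on different scales, each score list is min-max normalized to [0, 1] before the 0.7/0.3 weighted sum; the normalization step and function names are assumptions about the implementation.

```python
def normalize(scores):
    # Min-max normalize a {doc: score} dict to [0, 1] (assumed pre-fusion step).
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(vector_scores, fts_scores, w_vec=0.7, w_fts=0.3):
    vec, fts = normalize(vector_scores), normalize(fts_scores)
    docs = set(vec) | set(fts)
    fused = {d: w_vec * vec.get(d, 0.0) + w_fts * fts.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores for the query "how does authentication work?"
vector_scores = {"auth-notes": 0.92, "db-setup": 0.40, "pr-style": 0.10}
fts_scores    = {"auth-notes": 6.1,  "pr-style": 2.3,  "db-setup": 0.5}
print(fuse(vector_scores, fts_scores))
```

A document that both retrievers like rises to the top; one that only matches on keywords still surfaces, just with less weight.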

Decay and Compaction

Memories aren't permanent — they decay. Each memory carries a decay score, and compaction acts on that score.

At session end, compaction runs: hot entries scoring below 0.7 are demoted to warm, warm entries scoring below 0.3 are demoted to cold, and cold entries that become relevant again are promoted back up. The system self-tunes.
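The compaction pass described above can be sketched as follows. The decay factors here (recency with a half-life, access frequency, base importance) and their weights are assumptions chosen for illustration; only the 0.7 and 0.3 thresholds come from the text.

```python
import math

HOT_FLOOR, WARM_FLOOR = 0.7, 0.3  # demotion thresholds from the article

def decay_score(age_days, access_count, importance, half_life_days=14.0):
    # Assumed factors: exponential recency, capped frequency, base importance.
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    frequency = min(1.0, access_count / 10.0)
    return 0.5 * recency + 0.3 * frequency + 0.2 * importance

def compact(entry):
    """entry: dict with tier, age_days, access_count, importance."""
    s = decay_score(entry["age_days"], entry["access_count"], entry["importance"])
    if entry["tier"] == "hot" and s < HOT_FLOOR:
        entry["tier"] = "warm"   # stale hot entries fall out of every prompt
    elif entry["tier"] == "warm" and s < WARM_FLOOR:
        entry["tier"] = "cold"   # rarely-used warm entries become archive
    return entry
```

A frequently-touched, recent memory keeps a score near 1.0 and stays hot; an untouched two-month-old one decays well below 0.7 and drops to warm.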

Cross-Platform: One Memory, Every Tool

total-recall speaks MCP (Model Context Protocol) — the emerging standard for tool-to-model communication. Any MCP-compatible coding assistant can use it. But the real win is the import system.

On first session_start, total-recall scans for existing memories across six platforms and migrates them automatically:

Claude Code
Copilot CLI
Cursor
Cline
OpenCode
Hermes

Each importer knows where its host tool stores memories — ~/.claude/projects/*/memory/*.md for Claude Code, .cursorrules for Cursor, SQLite databases for Cline. Content hashes prevent duplicate imports. Switch tools freely; your memory follows.
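The hash-based dedup is straightforward to sketch. Collapsing whitespace before hashing is an assumption; the real importer may hash raw bytes.

```python
import hashlib

def content_hash(text: str) -> str:
    normalized = " ".join(text.split())  # collapse whitespace (assumption)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def import_memories(candidates, seen_hashes):
    new = []
    for text in candidates:
        h = content_hash(text)
        if h in seen_hashes:
            continue  # already imported, e.g. from another tool
        seen_hashes.add(h)
        new.append(text)
    return new

seen = {content_hash("integration tests must use real DB")}
imported = import_memories(
    ["integration tests must   use real DB",   # duplicate after normalization
     "bundled PRs for refactor changes"],
    seen)
print(imported)
```

The same preference stored by both Claude Code and Cursor hashes to one value, so switching tools does not multiply your memories.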

session_start

1. Initialize embedder         ............. ok
2. Import: Claude Code          ............. 3 new
3. Import: Copilot CLI          ............. 1 new
4. Import: Cursor               ............. skipped
5. Warm sweep                   ............. 2 demoted
6. Project docs ingest          ............. 12 chunks
7. Smoke test                   ............. 22/22 pass
8. Hot tier assembly            ............. 3 entries, 1.2K tokens

Observability: Is Memory Actually Helping?

Most memory systems are fire-and-forget. You store things, hope they come back, and have no way to measure if retrieval is working. total-recall ships a full eval framework.

A 139-query benchmark suite runs on version changes to validate retrieval quality. Each query has expected results — both what should surface and what shouldn't. The system tracks precision, hit rate, mean reciprocal rank, and per-tier routing accuracy.
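The tracked metrics are standard information-retrieval measures; here is a sketch on toy data, assuming the conventional definitions (the suite's exact formulas are not published in this article).

```python
def precision_at_k(ranked, relevant, k=5):
    # Fraction of the top-k results that are relevant.
    top = ranked[:k]
    return sum(1 for d in top if d in relevant) / len(top) if top else 0.0

def hit_rate(results):
    # Fraction of queries where at least one relevant result surfaced.
    return sum(1 for ranked, rel in results if set(ranked) & rel) / len(results)

def mrr(results):
    # Mean reciprocal rank of the first relevant hit per query.
    total = 0.0
    for ranked, rel in results:
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(results)

results = [
    (["auth-notes", "db-setup"], {"auth-notes"}),   # hit at rank 1
    (["pr-style", "auth-notes"], {"auth-notes"}),   # hit at rank 2
    (["db-setup"], {"ci-config"}),                  # miss
]
print(hit_rate(results), mrr(results))
```

Tracking all three catches different failures: precision falls when junk surfaces, hit rate falls when nothing relevant surfaces at all, and MRR falls when the right answer arrives too late in the ranking.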

/total-recall eval

Retrieval Quality (7-day rolling)
----------------------------------------------
  Precision          0.94     (target: 0.85)
  Hit rate           0.91     (target: 0.80)
  MRR                0.88     (target: 0.75)
  Avg latency        12ms
----------------------------------------------

Per-Tier Breakdown
----------------------------------------------
  Hot    3  entries   precision 1.00
  Warm   12 entries   precision 0.92
  Cold   847 chunks   precision 0.89
----------------------------------------------

Regression Detection
----------------------------------------------
  vs. previous config  no regressions

Config snapshots enable A/B comparison — change a threshold, run the benchmark, compare metrics side-by-side. Retrieval misses are captured as benchmark candidates, so the test suite grows organically from real-world failures. The eval system improves itself.
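The regression check itself reduces to comparing two metric snapshots. A minimal sketch, assuming a simple drop-beyond-tolerance rule; the tolerance value and metric names are illustrative.

```python
def detect_regressions(baseline, candidate, tolerance=0.02):
    # Flag any metric that dropped by more than the tolerance (assumption).
    regressions = {}
    for metric, old in baseline.items():
        new = candidate.get(metric, 0.0)
        if old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions

baseline  = {"precision": 0.94, "hit_rate": 0.91, "mrr": 0.88}
candidate = {"precision": 0.95, "hit_rate": 0.84, "mrr": 0.88}
print(detect_regressions(baseline, candidate))
```

Improvements pass silently; only drops are reported, which is what makes a threshold change safe to experiment with.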

Getting Started

For Claude Code users — one command:

/plugin install total-recall@strvmarv-total-recall-marketplace

For any MCP-compatible tool (Copilot CLI, Cursor, Cline, OpenCode, Hermes):

npm install -g @strvmarv/total-recall

Then add it to your tool's MCP config:

{
  "mcpServers": {
    "total-recall": {
      "command": "total-recall"
    }
  }
}

Or paste this into any AI coding assistant and it will install itself:

Install the total-recall memory plugin: fetch and follow the instructions at INSTALL.md

On first session, total-recall initializes its database, imports your existing memories, and starts working. No configuration required — the defaults are tuned from the eval benchmark.

Source: github.com/strvmarv/total-recall · npm: @strvmarv/total-recall