April 11, 2026 · 6 min read · by Noomachy Team

How Noomachy Uses Vector Search to Find Relevant Memories

Saving memories is the easy part. Finding the right ones at the right moment is where the engineering happens.

The Problem

Your agent has 500 stored memories about you. The user asks: "What was that book my colleague recommended?"

A keyword search for "book" might match nothing, because the original memory reads "Sarah suggested The Lean Startup". There's no literal "book" in there. Semantic search has to understand that "suggested" means "recommended" and that The Lean Startup is a book.

Vector embeddings solve this.

What Embeddings Actually Are

An embedding is a fixed-length list of numbers (typically 768 or 1536 dimensions) that captures the meaning of a piece of text. Two pieces of text with similar meaning produce vectors that are close together in this high-dimensional space.

So "Sarah suggested The Lean Startup" and "a book my colleague recommended" end up as nearby vectors, even though they share no actual words.

The Pipeline

Here's how Noomachy uses embeddings under the hood:

1. On Memory Creation

When a fact gets saved to semantic memory:

  • The fact's text is sent to Vertex AI's textembedding-gecko@003 model
  • The model returns a 768-dimensional vector
  • The vector is stored alongside the fact
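In code, that step looks roughly like this. The Vertex AI call is the real SDK; the `save_fact` helper and `store` object are hypothetical stand-ins, not Noomachy's actual internals:

```python
from vertexai.language_models import TextEmbeddingModel

# Assumes vertexai.init(...) has already been called with project credentials.
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

def save_fact(store, agent_id: str, text: str) -> None:
    # One embedding call; the result carries 768 float values.
    vector = model.get_embeddings([text])[0].values
    # Persist the fact and its vector side by side so retrieval
    # never has to re-embed stored memories.
    store.insert(agent_id=agent_id, text=text, embedding=vector)
```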

2. On Memory Retrieval

When a new conversation starts (or the user asks something):

  • The user's message is embedded
  • Cosine similarity is computed between the query vector and every memory vector for this agent
  • The top-K most similar memories are returned (default K = 10)
  • They're injected into the system prompt

This happens in milliseconds for thousands of memories.
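A brute-force sketch of that ranking step (the `retrieve_top_k` helper and the `(text, vector)` memory shape are illustrative assumptions):

```python
import numpy as np

def retrieve_top_k(query_vec, memories, k=10):
    """Rank (text, vector) memories by cosine similarity to the query.

    A linear scan like this is fine for thousands of memories; beyond
    that you'd reach for an approximate-nearest-neighbor index.
    """
    q = np.asarray(query_vec, dtype=float)
    q /= np.linalg.norm(q)
    scored = []
    for text, vec in memories:
        v = np.asarray(vec, dtype=float)
        scored.append((float(q @ (v / np.linalg.norm(v))), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```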

3. The Hybrid Approach

Pure vector search is great for semantic similarity but can miss exact matches. Noomachy uses a hybrid:

  • Vector similarity for semantic relevance
  • Tag filtering for explicit categorization
  • Recency boost for recently accessed memories
  • Confidence weighting for high-confidence facts

The combined score determines what gets injected.
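As a sketch of how those signals might blend, with illustrative weights and a hypothetical `Memory` shape (the real tuning isn't published):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    text: str
    similarity: float       # cosine similarity to the query vector
    tag_match: bool         # did an explicit tag filter hit?
    last_accessed: datetime
    confidence: float       # 0.0-1.0, set when the fact was extracted

def combined_score(m: Memory, now: datetime) -> float:
    # Illustrative weights only; the real blend is a tuning decision.
    days_stale = max((now - m.last_accessed).days, 0)
    recency = 1.0 / (1.0 + days_stale)        # decays as a memory goes unused
    tag_bonus = 0.15 if m.tag_match else 0.0  # exact category hits get a lift
    return 0.6 * m.similarity + tag_bonus + 0.15 * recency + 0.1 * m.confidence
```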

Why Top-K and Not All Memories

You could load all of a user's memories into the prompt every time. It's simple and complete. It's also expensive and confusing — irrelevant memories crowd the context window and dilute the model's attention.

Top-K (typically 5-20) gives you the best of both worlds: enough relevant context, no noise.

The Validation Gate Connection

The validation gate (described in Sovereign Memory) uses the same vector search to detect duplicates. When a new fact arrives:

  1. Embed the new fact
  2. Search existing memories for the closest match
  3. If cosine similarity > 0.92 → duplicate, reject
  4. If similarity is between 0.7 and 0.92 → similar but not identical, queue for review
  5. Below 0.7 → genuinely new, auto-approve if confidence is high enough

This is how you prevent your memory store from filling up with slight rephrasings of the same fact.
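The gate itself reduces to a few comparisons. Here's a sketch using the thresholds above; the `auto_approve_min` knob and the fallback-to-review branch are assumed details:

```python
DUPLICATE_THRESHOLD = 0.92
REVIEW_THRESHOLD = 0.70

def gate_new_fact(top_similarity: float, confidence: float,
                  auto_approve_min: float = 0.8) -> str:
    """Route a new fact based on similarity to its closest existing memory."""
    if top_similarity > DUPLICATE_THRESHOLD:
        return "reject"   # a rephrasing of something already stored
    if top_similarity >= REVIEW_THRESHOLD:
        return "review"   # similar but not identical
    if confidence >= auto_approve_min:
        return "approve"  # genuinely new, confidently extracted
    return "review"       # new but shaky: hold for review (assumed fallback)
```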

Performance Notes

  • Embedding generation: ~50ms per call
  • Vector search over 10K memories: < 10ms with proper indexing
  • Total memory hydration before each request: ~100-200ms

In production, the embedding step is the bottleneck. Caching embeddings aggressively (which Noomachy does) keeps it fast.
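A minimal sketch of that caching layer, assuming the same Vertex AI model as above, is just a memoized wrapper:

```python
from functools import lru_cache
from vertexai.language_models import TextEmbeddingModel

# Assumes vertexai.init(...) has already been called with project credentials.
_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

@lru_cache(maxsize=50_000)
def cached_embed(text: str) -> tuple[float, ...]:
    # Identical text produces an identical vector, so a repeated string
    # skips the ~50ms model call entirely. Tuples (unlike lists) are
    # hashable, which lru_cache needs for its keys.
    return tuple(_model.get_embeddings([text])[0].values)
```

Because identical text always embeds to the identical vector, the cache stays valid for as long as the model version doesn't change.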

What This Means For You

You don't need to think about any of this. As a user, you just chat. Your agent quietly embeds, validates, stores, and retrieves — and the result is that it remembers things in a way that feels intelligent because it actually understands meaning, not just keywords.

Try it →

#Vector Search · #Embeddings · #Architecture
