RAG · March 31, 2026 · 10 min read · By Olatunde Adedeji

Vector Store and Semantic Retrieval

A deep dive on how EviVault uses embeddings, ChromaDB, cosine similarity, and user-scoped retrieval to turn indexed chunks into grounded evidence.

Once documents are extracted and chunked, the next challenge is making those chunks findable by meaning rather than by exact wording.

That is where the vector store and retrieval layer come in.

This part of the system matters because document intelligence does not improve just because files have been uploaded and processed. The product becomes useful only when the right passages can be surfaced for the right question at the right time. If retrieval is weak, the rest of the stack inherits that weakness. The answer generation layer receives poor context. The evidence panel shows weaker support. Trust drops even when the interface still looks polished.

That is why retrieval sits near the center of EviVault Assistant.

The platform was built as a grounded internal Q&A system for documents such as policies, onboarding guides, operational procedures, and internal references. Once those files are chunked and embedded, the system needs a reliable way to match a user’s natural-language question to the most relevant indexed passages.

That matching problem is the job of semantic retrieval.

What semantic retrieval changes

Traditional document search often relies on keyword overlap. That approach works when the user’s wording closely matches the source text. It becomes less useful when people ask questions in everyday language while the underlying documents use more formal or domain-specific phrasing.

That mismatch shows up constantly in internal systems.

A user may ask:

How many vacation days do I get?

The policy document may instead say:

Annual paid time off entitlement is determined by employment class and tenure.

Keyword overlap is weak. Semantic similarity is strong.

That is the core advantage of dense retrieval. Instead of matching words directly, the system encodes both the query and the document chunks as vectors in a shared embedding space. Chunks that mean similar things end up near each other, even when their wording differs.
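"Nearness" in that shared space is usually measured with cosine similarity. As a standalone illustration (toy three-dimensional vectors, not real model output — production embeddings have hundreds of dimensions), here is the measure itself:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: similar meanings point in similar directions
vacation_query = [0.9, 0.1, 0.2]
pto_chunk = [0.8, 0.2, 0.3]      # "paid time off" policy text
parking_chunk = [0.1, 0.9, 0.1]  # unrelated content

print(cosine_similarity(vacation_query, pto_chunk))      # close to 1.0
print(cosine_similarity(vacation_query, parking_chunk))  # much lower
```

The vacation query scores far higher against the PTO chunk than the parking chunk, even though none of the values were compared word-by-word. That is the whole trick, scaled up to real embedding dimensions.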

At a high level, the retrieval flow looks like this:

text
Question → Embed the query → Search nearest vectors → Return top chunks

That pattern is simple, but it changes the quality of the product in important ways.

Why this layer matters so much

A grounded answer depends on grounded evidence. Grounded evidence depends on retrieving the right passages. That means retrieval is not just a backend convenience. It is part of the product’s trust model.

In EviVault, this layer needed to accomplish several things well:

  • surface semantically relevant chunks for natural-language questions
  • remain lightweight enough for practical internal deployments
  • preserve metadata needed for evidence display
  • work with per-user scoping rather than exposing every indexed chunk
  • return scores that can support both product logic and UI signals

Those needs shaped both the embedding choice and the vector store choice.

Choosing a local-first embedding model

EviVault uses all-MiniLM-L6-v2 as its embedding model.

That choice was pragmatic. The system was designed for realistic environments, including teams that may care about local processing, modest infrastructure, cost control, and predictable latency. A lightweight sentence-transformers model fits that context well.

The setting is straightforward:

python
EMBEDDING_MODEL: str = "all-MiniLM-L6-v2"

The model is small enough to run comfortably on CPU, fast enough for practical ingestion and query-time embedding, and strong enough for many internal semantic search workloads.

That balance mattered more than chasing the largest or newest embedding option.

A lot of architectural quality comes from choosing a tool that matches the deployment environment rather than choosing the most fashionable component.

Why ChromaDB fit the project

For the vector store, EviVault uses ChromaDB.

That was a good fit for several reasons. The system needed persistent storage, easy local deployment, a clear developer experience, and a retrieval layer that would not overwhelm the rest of the application architecture.

A simple client setup looks like this:

python
import chromadb

def get_chroma_client():
    return chromadb.PersistentClient(path=settings.CHROMA_PERSIST_DIR)

This choice keeps the vector layer operationally manageable. The index persists to disk, works cleanly in a compact deployment story, and does not require an external managed service just to support the first version of the product.

That simplicity was important because EviVault was built to be understandable as well as functional.

Initializing the collection

The retrieval collection is configured with cosine distance:

python
def get_collection():
    client = get_chroma_client()
    return client.get_or_create_collection(
        name="evivault_chunks",
        metadata={"hnsw:space": "cosine"},
    )

That small metadata setting matters. Retrieval quality depends not only on the model but also on how vector similarity is measured.

Cosine distance is the right default here because it is derived from cosine similarity, which measures the directional alignment between vectors rather than their magnitudes. In semantic embedding workflows, that usually aligns well with the kind of “meaning closeness” the product actually needs.

It is also a natural fit when embeddings are normalized before storage and retrieval.
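To see why normalization and cosine distance pair well: for unit-length vectors, cosine similarity reduces to a plain dot product, so the cosine distance ChromaDB reports is simply 1 minus that dot product. A small sketch with toy 2-D vectors (not model output):

```python
import math

def normalize(v: list[float]) -> list[float]:
    # Scale the vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = normalize([3.0, 4.0])  # → [0.6, 0.8]
b = normalize([4.0, 3.0])  # → [0.8, 0.6]

# For unit vectors, the dot product IS the cosine similarity
dot = sum(x * y for x, y in zip(a, b))
cosine_distance = 1.0 - dot

print(dot)              # ≈ 0.96
print(cosine_distance)  # ≈ 0.04 (the kind of value a "cosine" space returns)
```

Normalizing at embedding time keeps this relationship exact, which is why the later distance-to-similarity conversion stays trustworthy.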

The embedding function

The embedding model is wrapped in a small helper so the rest of the system can treat it as a reusable service:

python
from sentence_transformers import SentenceTransformer

def get_embedding_model():
    return SentenceTransformer(settings.EMBEDDING_MODEL)

def get_embedding_function():
    model = get_embedding_model()

    def embed(texts: list[str]) -> list[list[float]]:
        return model.encode(texts, normalize_embeddings=True).tolist()

    return embed

There are two useful details here.

First, the retrieval and ingestion layers both use the same embedding function. That keeps the embedding space consistent. If documents are embedded one way and user questions another way, the similarity search loses coherence.

Second, normalize_embeddings=True helps keep similarity behavior stable. Normalized vectors make cosine-based comparisons more reliable and easier to reason about.

This is another recurring theme in the project: small implementation details often support larger trust goals.

Turning chunks into searchable vectors

Once a document has been chunked, the system can embed and store those chunks. A simplified storage call looks like this:

python
collection.add(
    ids=ids,
    documents=chunk_texts,
    embeddings=embeddings,
    metadatas=[
        {
            "document_id": document.id,
            "chunk_index": i,
            "filename": document.filename,
        }
        for i, _ in enumerate(chunks)
    ],
)

This call does more than just write vectors.

It also stores metadata that matters later in the product:

  • document ID for scoping and traceability
  • chunk index for evidence display
  • filename for the evidence panel and citations

That metadata is what helps bridge the retrieval layer and the user experience. When a chunk comes back during search, the platform can do more than show text. It can explain where that text came from.

Query-time retrieval

When a user asks a question, the system first embeds the query, then asks the vector store for the nearest chunk vectors.

A representative retrieval function looks like this:

python
def retrieve_chunks(
    query: str,
    top_k: int = 5,
    user_id: str = None,
    db: Session = None,
) -> list[dict]:
    ef = get_embedding_function()
    query_embedding = ef([query])[0]

    collection = get_collection()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    chunks = []
    for i, chroma_id in enumerate(results["ids"][0]):
        meta = results["metadatas"][0][i]
        distance = results["distances"][0][i]
        similarity = 1.0 - distance
        chunks.append({
            "chroma_id": chroma_id,
            "content": results["documents"][0][i],
            "document_id": meta.get("document_id", ""),
            "filename": meta.get("filename", "unknown"),
            "chunk_index": meta.get("chunk_index", 0),
            "similarity": round(similarity, 4),
        })

    chunks.sort(key=lambda c: c["similarity"], reverse=True)
    return chunks

This flow is intentionally readable.

The query becomes an embedding. ChromaDB returns the nearest stored chunks along with their metadata and distances. The code then converts those distances into a more intuitive similarity score and sorts the results so the most relevant evidence sits first.

One line is especially important:

python
similarity = 1.0 - distance

ChromaDB returns cosine distance rather than a direct similarity score. Subtracting the distance from 1.0 gives the application a score that is easier to interpret, easier to display in the interface, and easier to use in confidence logic later.

Small translation steps like this often make the difference between a system that works and a system that is easy to reason about.

Why similarity scoring matters beyond retrieval

It would be easy to think of similarity as an internal retrieval metric only. In EviVault, it plays a broader role.

The similarity score feeds at least three product behaviors:

  • result ranking
  • confidence labeling
  • abstention thresholds

That means the retrieval layer is not isolated from the trust layer. It actively supports it.

If the top chunks are strong matches, the system can move forward with more confidence. If the best score is too weak, the generation layer can abstain rather than fabricate a polished answer from poor support.
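That abstention gate can be sketched as a simple threshold check. The helper and the 0.35 cutoff below are illustrative values, not EviVault's actual settings:

```python
SIMILARITY_THRESHOLD = 0.35  # hypothetical cutoff; tune per corpus and model

def should_abstain(chunks: list[dict]) -> bool:
    """Abstain when no retrieved chunk clears the similarity bar."""
    if not chunks:
        return True
    best = max(c["similarity"] for c in chunks)
    return best < SIMILARITY_THRESHOLD

# Weak retrieval → decline to answer rather than fabricate support
print(should_abstain([{"similarity": 0.21}, {"similarity": 0.18}]))  # True
# Strong retrieval → proceed to generation with confidence
print(should_abstain([{"similarity": 0.82}, {"similarity": 0.40}]))  # False
```

The exact threshold matters less than the existence of the gate: it gives the generation layer a principled reason to say "I don't have enough evidence."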

That is one of the reasons I treated retrieval as a core architectural choice rather than a generic search utility.

User scoping is part of retrieval quality

The system is not designed to search every indexed document for every user. It is designed to search the documents the user is actually allowed to access.

That matters for both security and product correctness.

A semantic retrieval system that surfaces relevant passages from the wrong user’s files is not just insecure. It is functionally broken.

That is why EviVault enforces user ownership in the application layer after vector search returns candidates:

python
doc = db.query(Document).filter(
    Document.id == doc_id,
    Document.owner_id == user_id,
    Document.status == "ready",
).first()

if doc is None:
    continue

This separation is deliberate. ChromaDB handles vector search. The relational layer handles ownership and document lifecycle semantics. Each layer does the work it is best suited to do.

That separation also keeps the system easier to maintain. Authorization logic stays explicit. Retrieval logic stays focused.
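In spirit, the post-retrieval step keeps only chunks whose documents pass the ownership check. A simplified stand-in for the relational lookup (the helper name, IDs, and allowed-set input here are hypothetical, replacing the database query for illustration):

```python
def filter_owned_chunks(chunks: list[dict], allowed_doc_ids: set[str]) -> list[dict]:
    """Drop any retrieved chunk whose document the user cannot access.

    `allowed_doc_ids` stands in for the relational check shown earlier
    (owner match plus status == "ready").
    """
    return [c for c in chunks if c["document_id"] in allowed_doc_ids]

candidates = [
    {"document_id": "doc-1", "similarity": 0.91},
    {"document_id": "doc-9", "similarity": 0.88},  # belongs to another user
]

kept = filter_owned_chunks(candidates, allowed_doc_ids={"doc-1"})
print(kept)  # only the doc-1 chunk survives
```

Even a highly similar chunk is discarded if the user does not own its document, which is exactly the behavior the trust model requires.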

Why dense retrieval fit this product

Dense retrieval was the right fit for EviVault because the product’s questions are usually semantic, not lexical.

Internal users do not always speak in the same language as the document authors. Operational policies are often written formally. Employee questions are usually more natural and direct. Semantic embeddings bridge that gap in a way keyword-only search often cannot.

This is where the vector layer becomes more than a technical implementation choice. It becomes a product enabler.

Without semantic retrieval, the assistant would often behave more like a keyword search box with better formatting. With semantic retrieval, it becomes much better at surfacing meaning rather than just matching terminology.

What this part of the project taught me

Building this layer reinforced a few practical lessons.

First, retrieval quality deserves first-class architectural attention. It is easy to over-focus on generation and under-invest in the step that supplies the evidence.

Second, local-first choices can still produce strong product outcomes. A lightweight embedding model and a compact vector store can be the right fit when privacy, simplicity, and deployment realism matter.

Third, metadata matters just as much as vectors. The retrieval layer is more valuable when it can return text along with the identifiers and labels the rest of the product needs.

Fourth, retrieval logic should remain understandable. Systems become easier to improve when developers can follow the full path from query embedding to ranked chunk list without unnecessary complexity.

Final Thoughts

The vector store and semantic retrieval layer are what turn processed documents into usable evidence.

In EviVault, that means embedding indexed chunks with a local-first model, storing them in ChromaDB, embedding user questions at query time, retrieving the nearest passages, converting distance into usable similarity scores, and filtering those results through user ownership rules before answer generation begins.

This layer matters because grounded answers depend on grounded retrieval.

If the ingestion pipeline is where reliability begins, the retrieval layer is where that reliability gets tested.

Tags: RAG, Vector Database, Semantic Retrieval, ChromaDB, SentenceTransformers, AI Engineering, Internal Tools