RAG · April 1, 2026 · 10 min read · By Olatunde Adedeji

Assembling the RAG Pipeline: The /ask Endpoint

A deep dive on how EviVault assembles retrieval, generation, validation, and logging into a clean FastAPI endpoint for grounded document Q&A.

By the time a user asks a question in EviVault Assistant, several important pieces are already in place.

Documents have been uploaded. Text has been extracted. Content has been chunked and embedded. Vectors have been indexed. Retrieval and confidence logic are ready to do their work. The user interface knows how to display answers, citations, and trust signals.

At that point, the system needs one clean path that pulls those pieces together.

That path is the /ask endpoint.

This route is where the product becomes interactive. It is where a natural-language question enters the platform and where the platform decides what evidence to retrieve, whether the evidence is strong enough, how to shape the response, and what to record for later visibility.

That makes the endpoint important. But it also creates a design risk.

If too much logic is pushed directly into the route, the system becomes harder to test, harder to debug, and harder to evolve. A grounded product should not only answer carefully. It should also be assembled carefully.

That is why the /ask route in EviVault is intentionally thin.

What this endpoint is responsible for

At a high level, the /ask endpoint coordinates the system’s main question-answering flow.

Its responsibilities are straightforward:

  • accept a validated user question
  • identify the authenticated user
  • retrieve relevant chunks from that user’s documents
  • generate a grounded answer or abstain
  • log the interaction for analytics and traceability
  • return a structured response the frontend can render

That sounds like a lot, but the route itself should not own the internal logic of each step. It should orchestrate. The service layer should do the work.

That separation is one of the strongest structural decisions in the project.

The route itself

A representative version of the route looks like this:

```python
@router.post("/ask", response_model=AskResponse)
def ask_question(
    payload: AskRequest,
    db: Session = Depends(get_db),
    user: User = Depends(get_current_user),
):
    chunks = retrieve_chunks(
        query=payload.question,
        top_k=5,
        user_id=user.id,
        db=db,
    )
    result = generate_answer(payload.question, chunks)
    log = QueryLog(
        user_id=user.id,
        question=payload.question,
        answer=result["answer"][:2000],
        grounded=result["grounded"],
        confidence=result["confidence"],
        retrieved_chunks=len(chunks),
    )
    db.add(log)
    db.commit()
    return AskResponse(**result)
```

This is a small block of code, but it says a lot about the architecture.

The endpoint does not know how semantic retrieval works internally. It does not know how confidence thresholds are applied. It does not know how the answer is generated or when the system abstains. It delegates those concerns to the relevant service functions and focuses on coordination.

That is exactly what a well-behaved API route should do.

Why a thin route matters

There is a temptation in AI application work to let the main endpoint become a dumping ground for product logic.

The route is where the “action” happens, so teams often let it absorb retrieval details, prompt construction, fallback logic, formatting rules, and persistence choices. The result is an endpoint that becomes harder to reason about every time a new feature is added.

EviVault avoids that pattern.

The /ask route stays thin because each major concern already has a better home:

  • retrieval belongs in the retrieval service
  • confidence and abstention belong in the generation logic
  • request validation belongs in schemas
  • authentication belongs in dependency injection
  • persistence belongs in the model and database layer

This keeps the architecture more legible. It also makes the system easier to extend without turning one route into a fragile tangle of responsibilities.

Request validation with Pydantic

Before the route does anything useful, the incoming request needs to be validated.

That begins with a simple request schema:

```python
class AskRequest(BaseModel):
    question: str
```

This may look minimal, but it is an important part of the endpoint design.

The schema makes the contract explicit. The route expects a question and nothing else. That keeps the API clean, reduces ambiguity, and gives FastAPI a clear validation layer before the business logic begins.
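If the product later needs guardrails on input, the same schema can carry them. Here is a hedged sketch, assuming Pydantic `Field` constraints; the actual EviVault schema is just `question: str`, and the length limits below are hypothetical:

```python
from pydantic import BaseModel, Field, ValidationError

class AskRequest(BaseModel):
    # Hypothetical constraints; the real schema accepts any string.
    question: str = Field(min_length=1, max_length=2000)

AskRequest(question="What does the contract say about renewal terms?")

try:
    AskRequest(question="")  # rejected before any business logic runs
except ValidationError:
    print("rejected")
```

Because FastAPI runs this validation before the route body executes, malformed requests never reach retrieval or generation at all.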

The response also uses a structured schema:

```python
class Citation(BaseModel):
    filename: str
    chunk_index: int
    similarity: float
    excerpt: str | None = None


class AskResponse(BaseModel):
    answer: str
    citations: list[Citation]
    grounded: bool
    confidence: str
```

This model matters because the endpoint is not returning raw model text. It is returning a structured product response. The frontend needs more than an answer string. It needs citations, trust signals, and enough metadata to build the evidence experience cleanly.

That structure is part of what makes the assistant feel like a product rather than a loose model wrapper.

Authentication enters through dependencies

The /ask route also depends on user identity. The system should only retrieve from documents the authenticated user is allowed to access.

That is why the route receives the user through dependency injection:

```python
def ask_question(
    payload: AskRequest,
    db: Session = Depends(get_db),
    user: User = Depends(get_current_user),
):
    ...
```

This is another strong design choice.

The route does not manually inspect headers or decode tokens inline. Authentication is handled upstream through get_current_user. By the time the route executes, it already has a validated user object available.

That keeps security logic out of the route body and makes the path easier to read. It also helps maintain predictable access boundaries across the product.

Retrieval as a separate concern

The next step is retrieving the most relevant chunks for the user’s question.

That happens through a dedicated service call:

```python
chunks = retrieve_chunks(
    query=payload.question,
    top_k=5,
    user_id=user.id,
    db=db,
)
```

This separation matters because retrieval is not trivial.

It involves query embedding, vector similarity search, metadata reconstruction, similarity scoring, and user scoping. That is already a full subsystem. It deserves to stay isolated from the endpoint orchestration layer.

This also makes the route easier to follow conceptually. A reader can see that retrieval happens here without having to parse every implementation detail immediately.

That is a quiet but valuable kind of clarity.
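To make the subsystem concrete, here is a hedged sketch of the core ranking step a function like `retrieve_chunks` performs: score each chunk against the query embedding by cosine similarity and keep the top `k`. The embedding model and database layer are assumed away; `chunks` is just a list of dicts, and `rank_chunks` is an illustrative name, not the project's API:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def rank_chunks(query_embedding: list[float], chunks: list[dict], top_k: int = 5) -> list[dict]:
    # Score every chunk against the query, then keep the strongest matches.
    scored = [
        {**chunk, "similarity": cosine_similarity(query_embedding, chunk["embedding"])}
        for chunk in chunks
    ]
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    return scored[:top_k]
```

In production the similarity search typically runs inside the vector index rather than in Python, but the contract stays the same: user-scoped chunks in, similarity-ranked chunks out.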

Generation is not just generation

Once the route has the retrieved chunks, it passes them to the answer generation service:

```python
result = generate_answer(payload.question, chunks)
```

This function does more than generate text.

It is also where the system decides whether it should answer at all. If no chunks are retrieved, or if the top similarity score is too weak, the system abstains. If the evidence is strong enough, the service builds context, generates a grounded answer, attaches citations, and labels the result with a confidence level.

A simplified version of the logic looks like this:

```python
if not chunks:
    return {
        "answer": "I don't have enough evidence to answer this question.",
        "citations": [],
        "grounded": False,
        "confidence": "none",
    }

top_similarity = chunks[0]["similarity"]
if top_similarity < 0.35:
    return {
        "answer": "The available documents do not contain sufficient evidence to answer this reliably.",
        "citations": chunks[:3],
        "grounded": False,
        "confidence": "insufficient",
    }
```

This is one of the reasons the /ask endpoint can remain thin. The most important trust decision in the product already lives where it belongs: inside the answer-generation layer that owns confidence and abstention behavior.

That is better than scattering those decisions across the route itself.
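The confident path can be sketched the same way. Only the 0.35 abstention cutoff appears in the source; the bands below are hypothetical, but they show how a confidence label stays a single, testable decision rather than logic scattered through the route:

```python
def label_confidence(top_similarity: float) -> str:
    # Hypothetical bands; only the 0.35 abstention cutoff is from the source.
    if top_similarity >= 0.7:
        return "high"
    if top_similarity >= 0.5:
        return "moderate"
    if top_similarity >= 0.35:
        return "low"
    return "insufficient"
```

Whatever the exact thresholds, keeping them in one function means the product's trust behavior can be tuned, and unit-tested, in one place.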

Logging matters too

After the result is generated, the system records the interaction:

```python
log = QueryLog(
    user_id=user.id,
    question=payload.question,
    answer=result["answer"][:2000],
    grounded=result["grounded"],
    confidence=result["confidence"],
    retrieved_chunks=len(chunks),
)
db.add(log)
db.commit()
```

This is easy to overlook, but it is an important part of the endpoint’s role.

The system is not just answering questions. It is building an operational trail that supports analytics, debugging, and product learning. Logging allows the platform to measure how often queries are grounded, how confidence is distributed, how many chunks are typically retrieved, and where the product may need improvement.

That kind of visibility becomes increasingly important as the assistant moves from prototype to production-like use.

A system shaped around trust benefits from being inspectable not only by users, but also by its builders.
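Once interactions are logged, those metrics fall out almost for free. A sketch, assuming log rows exposed as plain dicts with the fields recorded above (a real version would aggregate over `QueryLog` in SQL):

```python
from collections import Counter

def summarize_logs(logs: list[dict]) -> dict:
    # Turn raw QueryLog rows into the product metrics mentioned above:
    # grounded rate, confidence distribution, and typical retrieval depth.
    total = len(logs)
    if total == 0:
        return {"grounded_rate": 0.0, "confidence_distribution": {}, "avg_retrieved_chunks": 0.0}
    grounded = sum(1 for row in logs if row["grounded"])
    return {
        "grounded_rate": grounded / total,
        "confidence_distribution": dict(Counter(row["confidence"] for row in logs)),
        "avg_retrieved_chunks": sum(row["retrieved_chunks"] for row in logs) / total,
    }
```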

Why the response model matters

The route returns:

```python
return AskResponse(**result)
```

That line does more than wrap a dictionary.

It ensures that the final payload matches the product contract. The response model becomes a clean boundary between backend complexity and frontend usability.

The frontend does not have to guess what fields are available or what shape the answer will take. It can rely on a stable structure:

  • answer
  • citations
  • grounded
  • confidence

That stability matters. It lets the frontend render message content, confidence badges, grounded-versus-ungrounded states, and the evidence panel without extra negotiation or brittle parsing logic.

This is another way the endpoint supports trust. Predictable data contracts make the whole system easier to understand and evolve.

The full request lifecycle

When viewed as a sequence, the endpoint flow is clear:

```text
POST /api/ask
  → validate request
  → authenticate user
  → retrieve relevant chunks
  → evaluate evidence quality
  → generate grounded answer or abstain
  → log the interaction
  → return structured response
```

That clarity is worth preserving.

A lot of AI systems become harder to maintain because their core flows are conceptually simple but structurally messy. EviVault aims for the opposite. The flow remains straightforward, and the code structure reflects that simplicity as closely as possible.

Why this route reflects the larger product philosophy

The /ask endpoint is a good example of how the project thinks about AI product design more broadly.

It does not treat the model as the center of everything. It treats the full pipeline as the product:

  • validated input
  • authenticated access
  • user-scoped retrieval
  • evidence-aware answer logic
  • structured output
  • operational traceability

That orientation matters.

A grounded assistant is not just a prompt attached to a vector store. It is a system with explicit layers and responsibilities. The endpoint is where those layers meet, so its design reveals a lot about the seriousness of the architecture.

In EviVault, the route shows restraint. It coordinates rather than overreaches. That is a good sign in systems meant to be maintainable.

What this part of the project taught me

Building the /ask flow reinforced a few practical lessons.

First, orchestration is its own design problem. A strong product needs a clean path that connects its layers without collapsing them into one another.

Second, thin endpoints are usually better endpoints. When routes remain focused on coordination, the system becomes easier to test, understand, and extend.

Third, schemas and response models matter more than teams sometimes realize. They help turn a backend workflow into a stable product interface.

Fourth, logging should be built into the main interaction flow, not bolted on later. Trustworthy systems benefit from having a record of how they behave in practice.

Final Thoughts

The /ask endpoint is where EviVault’s document intelligence pipeline becomes a user-facing product.

It accepts a question, routes it through retrieval and answer logic, respects user access boundaries, records what happened, and returns a structured response that the interface can render with evidence and trust signals intact.

That may sound like a standard API responsibility, but the way the route is assembled matters.

In a grounded RAG system, reliability does not come only from good retrieval or careful prompting. It also comes from clean orchestration.

The endpoint should not try to do everything.

It should connect the right parts, clearly.

Tags: RAG · FastAPI · API Design · AI Engineering · Pydantic · Internal Tools