A retrieval system is only as useful as the text it indexes.
That sounds obvious, but it is one of the most important truths in document intelligence work. Teams often spend a lot of time discussing models, prompts, and answer quality, then treat ingestion like a preprocessing detail. In practice, the ingestion pipeline quietly shapes nearly every answer the system will ever produce.
If the text extraction is weak, important content disappears before retrieval even begins. If the chunks are poorly split, relevant passages become harder to retrieve. If metadata is thin or inconsistent, it becomes harder to show evidence clearly in the interface. Good retrieval starts long before the user asks a question.
That is why the ingestion layer matters so much in EviVault Assistant.
The platform was designed to help users search and question internal documents such as policy files, onboarding guides, procedures, operational notes, and other knowledge assets. Before any of those files can support a grounded answer, they need to move through a clean pipeline:
```text
Upload → Extract → Chunk → Embed → Store
```
This article focuses on that pipeline and on the product decisions behind it.
What the ingestion layer has to accomplish
At a high level, ingestion turns an uploaded file into searchable knowledge.
That sounds simple until you look at the actual responsibilities involved. The pipeline has to:
- accept common internal document formats
- extract usable text without crashing on messy inputs
- split long text into meaningful segments
- preserve enough context for retrieval to work
- capture metadata that helps with evidence display later
- generate embeddings efficiently
- store vectors and chunk records in a way that stays maintainable
Each of those steps affects the trustworthiness and usefulness of the final system.
A grounded assistant does not begin at answer generation. It begins when the file first enters the platform.
Starting with practical file support
EviVault begins with a focused set of file types:
- PDF
- DOCX
- TXT
- Markdown
That choice was practical. These formats cover a large share of internal operational knowledge without making the ingestion layer too broad too early. The goal was not to support every file format on day one. The goal was to support the formats most likely to contain real working knowledge inside teams.
The extraction function reflects that scope directly:
```python
def extract_text(file_path: str, content_type: str) -> str:
    if content_type == "application/pdf" or file_path.endswith(".pdf"):
        from pypdf import PdfReader
        reader = PdfReader(file_path)
        return "\n\n".join(page.extract_text() or "" for page in reader.pages)
    elif file_path.endswith(".docx"):
        from docx import Document as DocxDocument
        doc = DocxDocument(file_path)
        return "\n\n".join(p.text for p in doc.paragraphs if p.text.strip())
    else:
        with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
            return f.read()
```
This is a small block of code, but it carries a few meaningful design decisions.
PDF pages are joined with double newlines so the extracted text keeps some paragraph-level separation across pages. DOCX paragraphs filter out blank lines so the system does not fill the text stream with empty paragraph objects. The fallback path reads plain text with UTF-8 and errors="ignore" so ingestion is less likely to fail on imperfect encodings.
The function does not try to be clever. It tries to be reliable.
That is often the right instinct in ingestion work.
Why extraction quality matters more than it looks
When people think about a document intelligence product, they often imagine the moment the user asks a question. But by that point, part of the answer quality has already been decided.
If extraction drops a heading, merges unrelated sections, strips paragraph boundaries, or produces broken text from a source file, the damage carries downstream. Retrieval will search through those flawed chunks. The LLM, if one is used, will receive less coherent context. The evidence panel will reflect weaker excerpts. Trust degrades long before generation enters the picture.
That is why the ingestion layer has to be treated as a product concern, not just a plumbing concern.
Chunking is where retrieval becomes practical
Once text is extracted, the next task is splitting it into retrievable segments.
This step matters because neither vector retrieval nor downstream generation works best on entire documents. Long files need to be broken into smaller pieces that are semantically coherent, manageable to embed, and useful to surface as evidence.
EviVault uses chunking with overlap:
```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[dict]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk_content = text[start:end].strip()
        if chunk_content:
            chunks.append({
                "content": chunk_content,
                "char_start": start,
                "char_end": end,
            })
        start = end - overlap if end < len(text) else end
    return chunks
```
Even this shorter version shows the core idea clearly. Each chunk covers a window of text, and neighboring chunks share a small amount of overlap. That overlap matters because useful answers often sit near boundaries. Without overlap, a relevant sentence can be split awkwardly across two chunks and become harder to retrieve.
In the fuller project implementation, the logic also tries to align chunk boundaries with paragraph or sentence breaks when possible. That improves readability and semantic coherence. A chunk that ends at a natural boundary is usually more useful than a chunk that slices through the middle of a thought.
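The article does not show that fuller boundary-aligned version, but the idea can be sketched with a small illustrative helper (not the project's actual code) that pulls a chunk's end position back to the nearest paragraph or sentence break within a short search window:

```python
def align_to_boundary(text: str, end: int, window: int = 80) -> int:
    """Pull a chunk's proposed end back to the nearest natural break
    (paragraph break, then sentence end) within `window` characters.
    Falls back to the hard cut if no break is nearby."""
    if end >= len(text):
        return end
    segment = text[max(0, end - window):end]
    # Prefer paragraph breaks, then sentence-ending punctuation, then newlines.
    for marker in ("\n\n", ". ", "\n"):
        pos = segment.rfind(marker)
        if pos != -1:
            return max(0, end - window) + pos + len(marker)
    return end  # no natural break nearby; keep the hard cut

text = "First paragraph.\n\nSecond paragraph continues here."
print(align_to_boundary(text, 25))  # snaps back to the paragraph break
```

A chunker would call this on each proposed `end` before slicing, trading a slightly shorter chunk for a cleaner boundary.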
Why overlap matters
Overlap is one of those implementation details that seems small until you remove it.
Imagine a policy sentence that starts near the end of one chunk and finishes at the beginning of the next. If the chunks do not overlap, the system may store two incomplete fragments, neither of which fully captures the policy statement. Retrieval then has a harder time surfacing the right evidence, and the answer quality falls.
Overlap helps preserve continuity.
It is a practical trade-off. You spend a little more storage and embedding work to reduce the chance that an important passage gets broken into weak pieces. For a system built around grounded retrieval, that trade-off is worth it.
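The effect is easy to demonstrate. Using the `chunk_text` function shown earlier on a contrived document where one key sentence sits exactly on a chunk boundary, the overlapping version keeps a complete copy of the sentence while the non-overlapping version splits it:

```python
# chunk_text as defined earlier in the article.
def chunk_text(text, chunk_size=512, overlap=64):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        content = text[start:end].strip()
        if content:
            chunks.append({"content": content, "char_start": start, "char_end": end})
        start = end - overlap if end < len(text) else end
    return chunks

# Place a key sentence so it straddles the boundary at character 200.
key = "The refund window is 30 days."
text = "x" * 180 + key + "y" * 180

with_overlap = chunk_text(text, chunk_size=200, overlap=50)
without_overlap = chunk_text(text, chunk_size=200, overlap=0)

print(any(key in c["content"] for c in with_overlap))     # True
print(any(key in c["content"] for c in without_overlap))  # False
```

With overlap, some chunk always contains the full sentence; without it, retrieval can only ever see the two fragments.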
Metadata is not an afterthought
Each chunk in EviVault carries more than text. It also carries metadata that becomes useful later in the pipeline:
- chunk index
- character start offset
- character end offset
- document identifier
- filename
- vector store identifier
A representative chunk object looks like this:
A representative chunk record, with illustrative placeholder values, looks like this:

```python
{
    "content": "The extracted text segment...",
    "char_start": 1024,
    "char_end": 1536,
    "chunk_index": 2,
    "document_id": 42,
    "filename": "leave-policy.docx",
    "chroma_id": "42_chunk_2",
}
```
Those fields matter because retrieval is not the final step. Once a chunk is retrieved, the system still needs to present evidence in a usable way. Character offsets help map a chunk back to its position in the source text. Chunk indices help identify where the passage came from. Filenames support evidence display in the interface.
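As an illustration of why the offsets earn their keep, a hypothetical helper (not part of the EviVault codebase) can map a retrieved chunk back into the source text and show it with surrounding context, which is roughly what an evidence panel needs:

```python
def evidence_snippet(source_text: str, char_start: int, char_end: int,
                     context: int = 40) -> str:
    """Locate a chunk in its source text via its stored character offsets
    and frame it with a little surrounding context for display."""
    lead = source_text[max(0, char_start - context):char_start]
    excerpt = source_text[char_start:char_end]
    tail = source_text[char_end:char_end + context]
    return f"…{lead}[{excerpt}]{tail}…"

text = "Before text. The key passage lives here. After text."
print(evidence_snippet(text, 13, 40, context=5))
# → "…ext. [The key passage lives here.] Afte…"
```

Without stored offsets, the system would have to re-search the source document for each excerpt, which is slower and fragile against duplicate phrases.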
This is a recurring theme in EviVault: trust features often depend on metadata created much earlier in the pipeline.
Embedding and storing the chunks
After extraction and chunking, the system needs to turn chunk text into embeddings and store them.
That stage connects the ingestion pipeline to the retrieval layer. In EviVault, a batch of chunk texts is sent through the embedding model, then written to both the vector database and the relational database.
A simplified version looks like this:
```python
def ingest_document(db: Session, document: Document, file_path: str) -> None:
    text = extract_text(file_path, document.content_type)
    chunks = chunk_text(text)

    collection = get_collection()
    ef = get_embedding_function()
    chunk_texts = [c["content"] for c in chunks]
    embeddings = ef(chunk_texts)

    ids = []
    for i, chunk_data in enumerate(chunks):
        chroma_id = f"{document.id}_chunk_{i}"
        db.add(DocumentChunk(
            document_id=document.id,
            chunk_index=i,
            content=chunk_data["content"],
            char_start=chunk_data["char_start"],
            char_end=chunk_data["char_end"],
            chroma_id=chroma_id,
        ))
        ids.append(chroma_id)

    collection.add(
        ids=ids,
        documents=chunk_texts,
        embeddings=embeddings,
        metadatas=[
            {
                "document_id": document.id,
                "chunk_index": i,
                "filename": document.filename,
            }
            for i, _ in enumerate(chunks)
        ],
    )

    document.chunk_count = len(chunks)
    document.status = "ready"
    db.commit()
```
There are a few important ideas packed into this flow.
First, embeddings are generated in batch rather than one chunk at a time. That is more efficient and keeps ingestion practical on modest infrastructure.
Second, each chunk gets a deterministic ID like document_id_chunk_i. That makes later deletion and bookkeeping much simpler.
Third, the system stores information in two places for two different reasons. The vector database supports semantic search. The relational database supports application logic, metadata management, and auditability. The two layers stay connected through shared identifiers.
This split is one of the reasons the system remains understandable.
Why dual storage is worth it
It can be tempting to ask whether the vector store should hold everything. In practice, that usually leads to awkward trade-offs.
The vector store is excellent for nearest-neighbor search. It is not the right place to centralize all application logic. The relational database is much better suited for ownership, document lifecycle, chunk bookkeeping, processing status, and user-scoped queries.
Keeping both layers does add some implementation responsibility, but it also keeps each layer focused on what it does best.
That separation pays off later when the system needs to enforce security rules, display chunk metadata cleanly, or delete all chunk records for a removed document.
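Deletion is a good concrete example of both points at once. Because the IDs are deterministic, the full vector-ID list can be rebuilt from the relational record alone, without querying the vector store first. The sketch below is hypothetical wiring, assuming a Chroma-style collection and a SQLAlchemy-style session with the model names from the article's code, not a verified EviVault function:

```python
def chunk_ids_for(document_id, chunk_count: int) -> list[str]:
    """Reconstruct every deterministic vector ID (document_id_chunk_i)
    from the document record alone."""
    return [f"{document_id}_chunk_{i}" for i in range(chunk_count)]

def delete_document(db, collection, document) -> None:
    # Illustrative sketch: assumes the article's DocumentChunk model,
    # a Chroma-style `collection`, and a SQLAlchemy-style `db` session.
    ids = chunk_ids_for(document.id, document.chunk_count)
    if ids:
        collection.delete(ids=ids)              # vector store side
    db.query(DocumentChunk).filter(             # relational side
        DocumentChunk.document_id == document.id
    ).delete()
    db.delete(document)
    db.commit()
```

Both stores are cleaned with shared identifiers and no cross-store lookup, which is exactly the payoff of deterministic IDs.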
The upload step and document lifecycle
The ingestion pipeline begins as soon as a file is uploaded.
At the API level, the upload route validates the extension, saves the file, creates a document record, and then starts ingestion:
```python
@router.post("/upload", response_model=DocumentOut, status_code=201)
def upload_document(
    file: UploadFile = File(...),
    db=Depends(get_db),
    user=Depends(get_current_user),
):
    ext = os.path.splitext(file.filename)[1].lower()
    if ext not in [".pdf", ".txt", ".md", ".docx"]:
        raise HTTPException(status_code=400, detail=f"Unsupported file type: {ext}")

    file_id = str(uuid.uuid4())
    save_path = os.path.join(settings.UPLOAD_DIR, f"{file_id}{ext}")
    with open(save_path, "wb") as f:
        f.write(file.file.read())

    doc = Document(
        filename=file.filename,
        content_type=file.content_type,  # read later by extract_text
        owner_id=user.id,
        status="processing",
    )
    db.add(doc)
    db.commit()

    ingest_document(db, doc, save_path)
    db.refresh(doc)
    return DocumentOut.model_validate(doc)
```
This creates a simple lifecycle:
```text
processing → ready
processing → failed
```
That status model matters because the rest of the platform can then behave predictably. Retrieval only works against ready documents. Failed ingests do not quietly pollute the search experience. The platform has a cleaner operational boundary.
In a larger production deployment, ingestion would likely move to a background worker so the upload request can return immediately while processing happens asynchronously. But even in this simpler flow, the lifecycle is explicit and easy to reason about.
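The shape of that asynchronous variant can be sketched with nothing but the standard library: the upload handler enqueues a job and returns immediately, and a worker thread drains the queue. This is illustrative only; a real deployment would more likely reach for FastAPI's background tasks or a dedicated task queue such as Celery:

```python
import queue
import threading

jobs = queue.Queue()      # (document_id, file_path) work items
processed: list[str] = [] # stands in for status updates in the database

def worker() -> None:
    """Drain the queue; in the real pipeline, ingest_document(...) would
    run here and flip the document's status to 'ready' or 'failed'."""
    while True:
        doc_id, path = jobs.get()
        processed.append(doc_id)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# The upload handler's only remaining job is to enqueue and return.
jobs.put(("doc-1", "/tmp/doc-1.docx"))
jobs.join()  # block until processing finishes (for demonstration only)
print(processed)  # → ['doc-1']
```

The lifecycle stays the same; only the moment the status flips from `processing` moves off the request path.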
Configuration shapes retrieval quality
Ingestion also carries several tunable decisions.
A representative settings block looks like this:
```python
class Settings(BaseSettings):
    EMBEDDING_MODEL: str = "all-MiniLM-L6-v2"
    CHUNK_SIZE: int = 512
    CHUNK_OVERLAP: int = 64
    CHROMA_PERSIST_DIR: str = "./chroma_data"
    UPLOAD_DIR: str = "./uploads"
```
These values are not random implementation details. They influence how well the system retrieves answers later.
A larger chunk size may preserve more context, but it can also reduce precision. A smaller chunk size may improve focus, but it can fragment ideas too aggressively. Overlap can preserve continuity, though too much overlap increases duplication and storage load.
There is no single perfect value for every deployment. Good settings depend on the kinds of documents the system handles, how dense the writing is, how long the sections tend to be, and how much precision the product needs.
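One way to reason about these trade-offs concretely is to estimate how many chunks a given configuration produces. The helper below is a rough back-of-the-envelope calculation (an illustrative approximation, not project code): the chunker advances by `chunk_size - overlap` characters per step, so more overlap or smaller chunks means more chunks to embed and store:

```python
def approx_chunk_count(doc_len: int, chunk_size: int, overlap: int) -> int:
    """Roughly how many chunks a document of doc_len characters yields,
    given that each step advances by (chunk_size - overlap)."""
    step = chunk_size - overlap
    return max(1, -(-(doc_len - overlap) // step))  # ceiling division

for size, ov in [(512, 64), (256, 64), (512, 128)]:
    n = approx_chunk_count(100_000, size, ov)
    print(f"chunk_size={size}, overlap={ov}: ~{n} chunks")
```

For a 100,000-character document, the defaults (512/64) yield roughly 224 chunks; halving the chunk size more than doubles that, and doubling the overlap adds embedding and storage cost without changing what the text says.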
That is another reason ingestion deserves more attention than it usually gets. It quietly encodes some of the most important retrieval trade-offs in the product.
Why this stage matters to trust
It is tempting to think of trust as something the UI creates through citations and evidence panels. But trust begins much earlier.
The evidence panel is only as good as the retrieved chunks it can show. The retrieved chunks are only as good as the index. The index is only as good as the chunking. The chunking is only as good as the extracted text.
That chain starts at ingestion.
A grounded answer depends on a grounded pipeline.
For EviVault, that meant treating extraction, chunking, metadata, and storage as core parts of the product, not background utilities.
What this part of the project taught me
Working on the ingestion layer reinforced a few lessons.
First, retrieval quality is shaped upstream. By the time a user sees an answer, a large part of its quality has already been decided by how the source documents were processed.
Second, chunking is product design as much as engineering. The way text is split influences retrieval precision, evidence readability, and the overall trustworthiness of the system.
Third, metadata deserves deliberate attention. Evidence-heavy products depend on more than content. They depend on being able to trace, label, and present that content clearly.
Fourth, simple pipelines are often better starting points than overly ambitious ones. A focused file set, a clear chunking strategy, and clean persistence rules are more valuable than an ingestion layer that tries to do everything too early.
Final thoughts
The ingestion pipeline is where raw files become searchable knowledge.
In EviVault, that means moving from uploaded documents to extracted text, from extracted text to overlapping chunks, from chunks to embeddings, and from embeddings to a maintained index that can support grounded answers later.
It is easy to overlook this stage because it runs before the visible magic of retrieval and answer generation. But in a grounded document intelligence system, ingestion is not just preparation.
It is where reliability begins.
