A framework and leaderboard for Retrieval Pipelines evaluation on ViDoRe v3

Community Article Published February 27, 2026

Retrieval Augmented Generation (RAG) has become a popular method for improving the answers generated by a Large Language Model (LLM). Instead of retraining the LLM on a specific domain or on updated content, RAG employs a retrieval component to inject context relevant to the user's query into the prompt. The quality of retrieval directly impacts the quality of the LLM's generated answers.

To measure this, we recently launched ViDoRe v3 — a real-world benchmark for evaluating embedding models on visual retrieval tasks. However, while tracking the progress of individual models is crucial, modern industrial retrieval systems rarely rely on a single model; instead, they are built as complex, multi-stage pipelines. Such retrieval pipelines might be designed to optimize performance aspects (like indexing or serving throughput), or to deal with "messy" data (e.g., handwritten notes) or visually rich data (complex financial tables, charts, infographics). Building a state-of-the-art retrieval system today means choosing the right components for your business and system requirements.

In this blog post, we discuss the key components and design choices of retrieval pipelines and introduce the ViDoRe v3 Pipeline evaluation framework and leaderboard. The framework provides a simple interface that allows evaluation and comparison of retrieval pipelines on ViDoRe v3, from a simple dense or sparse retriever baseline to a complex multi-step agentic system with tool use.

The new ViDoRe v3 Pipeline Leaderboard is publicly available on Hugging Face and embedded in this blog post!



1. What to feed your retriever? OCR vs. Vision-Language Models

The first decision in any pipeline is how to "see" your data. Traditionally, we relied on Optical Character Recognition (OCR) to extract text from PDFs and images, which was then chunked and indexed. Advanced document processing pipelines (such as NV-Ingest) offer granular control, deploying dedicated components to meticulously extract structured information from tables, charts, and infographics.

Another recent approach is the use of Vision-Language Models (VLMs), exemplified by embedding architectures such as ColPali. Instead of converting a page to text first, these models embed the entire page image directly. By treating the layout, charts, and text as a single visual entity, you preserve visual information that eludes OCR.

ViDoRe was introduced as a benchmark for Visual Document Retrieval (VDR), with queries from different domains annotated with PDF page images that are relevant for answering them. The most recent version, ViDoRe v3, is focused on enterprise domains and was built with a high-quality human and synthetic annotation process.

2. The Retrieval Engine: Sparse, Dense, or Late Interaction?

Another important decision is the algorithm used to retrieve the data relevant to the user query.

Sparse Search (e.g., BM25): A traditional search algorithm based on exact word or prefix matching, which scores a term highly for a document when the term is frequent in that document but rare in the rest of the corpus. It’s unbeatable for finding specific product IDs or technical terms like "XYZ-700-Alpha."

Dense Embedding Models: These models allow for semantic search, capturing the meaning behind the words or images. They know that "how to fix a flat" and "tire repair guide" are related, even if they share no words.

Late Interaction (e.g., ColBERT/ColPali): These models offer a middle ground. Instead of squashing a whole document into one vector, they keep embeddings for every token (or image patch). This allows for "token-level" alignment between query and context, drastically improving accuracy for complex queries at the cost of higher storage.
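
To make the "token-level" alignment concrete, here is a minimal NumPy sketch of the MaxSim scoring used by late-interaction models. The function and variable names are ours, not part of any library, and the embeddings are toy random vectors standing in for real query-token and document-token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take the similarity of its
    best-matching document token, then sum over all query tokens. Embeddings
    are assumed L2-normalized, so the dot product is a cosine similarity."""
    # (num_query_tokens, num_doc_tokens) similarity matrix
    sim = query_tokens @ doc_tokens.T
    # best document token per query token, summed over query tokens
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens and 3 document tokens in a 4-dim space.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(3, 4)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

The storage cost follows directly from this sketch: the index must keep one vector per token (or image patch) rather than one per document.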

It is also possible to combine or ensemble the outputs of different retrievers using algorithms like Reciprocal Rank Fusion (RRF).
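
A minimal RRF sketch, assuming we already have ranked lists of document ids from each retriever (the ids here are made up; k = 60 is the constant from the original RRF formulation):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids with Reciprocal Rank Fusion.
    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k damps the dominance of any single top-ranked result."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_7", "doc_2", "doc_9"]   # lexical ranking
dense_hits = ["doc_2", "doc_5", "doc_7"]  # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents ranked well by both retrievers (here doc_2 and doc_7) accumulate contributions from both lists and rise to the top of the fused ranking.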

3. The Refinement Layer: To Re-rank or Not?

Initial retrieval is often a candidate-selection step: it identifies the top-k likely relevant candidates. However, sending 50 context pieces (e.g., chunks or page images) to an LLM is expensive and noisy. A re-ranker acts as a second, finer filter, although it adds some extra latency to the pipeline. It typically uses a cross-encoder that takes the query and each top candidate and performs a more thorough comparison, reordering the candidates to yield the top 3 or 5 most relevant pieces of information. At this time, a retrieval pipeline without a re-ranker is generally considered incomplete for production.
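
A minimal sketch of this second stage: score each (query, candidate) pair and keep the best few. The scoring function below is a trivial word-overlap stand-in, labeled as such, where a real pipeline would call a cross-encoder model:

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> List[Tuple[str, float]]:
    """Second-stage refinement: score each (query, candidate) pair with a
    cross-encoder-style scorer and keep only the top_k best documents."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Trivial stand-in scorer (word overlap); a real pipeline would replace this
# with a cross-encoder forward pass over the concatenated query + document.
def overlap_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

top = rerank("tire repair guide",
             ["how to fix a flat tire", "annual financial report",
              "tire repair manual"],
             overlap_score, top_k=2)
```

The extra latency mentioned above comes from the fact that, unlike an embedding lookup, the scorer runs once per (query, candidate) pair.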

4. The Rise of the Retrieval Agent

The most significant shift recently is the move from static pipelines to Retrieval Agents. Instead of following a fixed path, an agent can "reason" about the search. If the first search returns no or few relevant results, the agent might decide to rewrite the query, look in a different database, or even use a tool to look up a missing definition before trying again. It introduces a "self-correction" loop that static pipelines simply lack. Because this added reasoning takes time, retrieval agents are best suited for use cases where accuracy is more critical than raw speed.

5. The ingredients

Retrieval pipelines can vary widely in their complexity and the components they employ. Ultimately, they should be designed according to the retrieval system's requirements for indexing (e.g., storage, throughput) and searching (accuracy, latency). Here, we provide a summary of components that can be used in retrieval pipelines.

| Component | Function | Best for... |
| --- | --- | --- |
| OCR extraction | Converts images/PDFs to raw text. | Clean, text-heavy documents. |
| Embedding model | Embeds a piece of content into a vector so that it is “searchable” by dense retrieval methods. The model might support embedding text, images (VLM-based), or other modalities (omni-models). | Semantic search over text when lexical similarity between queries and content is low. If embedding page images (OCR-free), it captures visual aspects like complex layouts, charts, and tables. |
| Chunking strategy | Breaks text into smaller pieces that can be embedded. Chunks can be split by fixed length, recursive character splitting, or semantic chunking. | Reducing the amount of text per unit, avoiding "Lost in the Middle" syndrome and context dilution. |
| Query expansion | At inference time, rewrites a query in different ways to broaden the search, or generates "fake" documents for the query and searches for similar documents. | Searching over out-of-domain data, or when there is low lexical similarity between queries and the relevant context. |
| Hybrid search (RRF) | Combines candidates from multiple retrievers, like BM25 and dense embeddings. Reciprocal Rank Fusion (RRF) is a popular algorithm for merging candidates. | Balancing keyword accuracy with semantic meaning. |
| Late interaction | Token-to-token similarity (MaxSim). | High-precision requirements over a small corpus, as it requires much more storage for token embeddings. |
| Re-ranker | High-accuracy secondary sorting. | Filtering out noise before the LLM. |
| Retrieval agent | Autonomous, dynamic, multi-step search reasoning. Might include query rewriting and calling different retrieval “tools”. | Complex queries that require multiple lookups. |
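
To make one of the components above concrete, here is a minimal fixed-length chunker with overlap (a sketch; the function name and parameter values are illustrative, not from any library):

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-length chunking with overlap: slide a window of chunk_size
    characters, stepping by chunk_size - overlap so that adjacent chunks
    share some context at their boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 500-character toy document yields 4 overlapping chunks.
text = "".join(chr(97 + i % 26) for i in range(500))
chunks = chunk_fixed(text, chunk_size=200, overlap=40)
```

Recursive character splitting and semantic chunking follow the same contract (text in, list of pieces out) but choose the boundaries differently.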

6. Choosing the ingredients

To make it easy to compare different pipelines under a standardized evaluation, we introduce the ViDoRe v3 Retrieval Pipeline framework and leaderboard. The leaderboard showcases different pipeline implementations sorted by average accuracy over the ViDoRe v3 benchmark. It also includes "Seconds per doc/query" information, which gives an idea of indexing and search latency.

Each of the 8 public datasets from ViDoRe v3 provides both the page image and its text (as extracted by NV-Ingest). This enables comparing text-based and image-based pipelines under a single evaluation.

We also introduce a minimalist framework with a simple interface for evaluating retrieval pipelines. The pipeline under evaluation only needs to extend the BasePipeline class: implement __init__() to receive extra arguments, index() to embed the corpus, and search() to return the top-k relevant documents per query, as shown in the following code snippet. Full pipeline examples can be found here:

from vidore_benchmark import BasePipeline
from typing import List, Any, Dict

class MyPipeline(BasePipeline):
    def __init__(self, batch_size: int = 32, top_k: int = 100):
        # ... (initialization logic)
        pass

    def index(self, corpus_ids: List[str], corpus_images: List[Any], corpus_texts: List[Any]) -> None:
        """
        Indexes the provided corpus data. This method is typically used to
        pre-process or store the corpus data for later retrieval.
        """
        # ... (indexing logic)
        pass

    def search(self, query_ids: List[str], queries: List[str]) -> Dict[str, Dict[str, float]]:
        """
        Performs a search over the indexed corpus based on the queries.
        It should return the retrieved results without the full corpus data.
        """
        # ... (search logic)
        # Expected return format (query id -> {corpus id: score}), e.g.:
        # {"query_1": {"corpus_id_5": 0.89, "corpus_id_220": 0.87, ...},
        #  "query_2": {"corpus_id_19": 0.74, "corpus_id_125": 0.73, ...},
        #  ...}
        return results

Example of a retrieval pipeline implementing the interface for ViDoRe v3 pipeline evaluation.

After that, evaluation on ViDoRe v3 becomes as simple as:

vidore-benchmark pipeline evaluate-all \
    --module-path my_pipeline.py \
    --class-name MyPipeline \
    --output-dir results/ \
    --language english
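
As a concrete (toy) illustration of the interface shape, the sketch below implements a trivial word-overlap retriever. The scoring logic is purely illustrative, the class name is ours, and the BasePipeline import is guarded with a stand-in so the sketch also runs without the benchmark installed:

```python
from typing import Any, Dict, List

try:
    from vidore_benchmark import BasePipeline
except ImportError:           # allow running the sketch standalone
    class BasePipeline:       # minimal stand-in for illustration only
        pass

class ToyLexicalPipeline(BasePipeline):
    """Toy pipeline: 'indexes' corpus texts as word sets and scores
    documents by word overlap with the query."""

    def __init__(self, top_k: int = 100):
        self.top_k = top_k
        self._index: Dict[str, set] = {}

    def index(self, corpus_ids: List[str], corpus_images: List[Any],
              corpus_texts: List[Any]) -> None:
        # Images are ignored here; a visual pipeline would embed them instead.
        self._index = {cid: set(str(text).lower().split())
                       for cid, text in zip(corpus_ids, corpus_texts)}

    def search(self, query_ids: List[str],
               queries: List[str]) -> Dict[str, Dict[str, float]]:
        results: Dict[str, Dict[str, float]] = {}
        for qid, query in zip(query_ids, queries):
            q_words = set(query.lower().split())
            scores = {cid: float(len(q_words & words))
                      for cid, words in self._index.items()}
            top = sorted(scores.items(), key=lambda kv: kv[1],
                         reverse=True)[:self.top_k]
            results[qid] = dict(top)
        return results
```

Any of the components discussed earlier (OCR, hybrid search, re-ranking, an agent loop) can live inside index() and search() as long as the return format is respected.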

7. Comparing the pipelines from the leaderboard

The ViDoRe pipeline leaderboard allows comparing different retrieval pipeline results under the same evaluation framework and datasets. You can use it to compare, for example: dense vs. sparse vs. hybrid retrieval, pipelines with or without a re-ranker, single-vector vs. late-interaction embeddings, and visual document retrieval vs. text-based retrieval pipelines (as we provide both the page image and its text, extracted using the NeMo Retriever extraction service). You can also call your own OCR solution inside the index() method of your pipeline.

Some highlights from the results:

  • In general, pipelines with late-interaction embedding models provide the highest single-model retrieval accuracy (e.g., ColEmbed 8B v2), but pipelines with smaller embedding models plus a cross-encoder can be competitive (e.g., jina v4 + zerank2, and Llama Nemotron Embed 1B + Rerank 1B). On the other hand, search latency increases substantially with the addition of a re-ranker to the pipeline.
  • Using multiple modalities generally improves retrieval accuracy: for Llama-Nemotron-Embed-VL-1B, and for a pipeline with that embedding model followed by the Llama-Nemotron-Rerank-VL-1B cross-encoder, using page image + OCR text as input to the models leads to a ~6.5% improvement in accuracy compared to using only the image.
  • The most accurate pipeline at leaderboard release time uses only the text modality (Jina embedding v4 + ZeRank2). While single text retrievers fall behind their vision counterparts (see the ViDoRe v3 paper), there is still quite a gap for visual re-rankers to match the best text ones.
  • Models with a very low number of parameters (e.g., mxbai-edge-colbert-v0-32m or Llama-Nemotron-Embed-VL-1B) can perform surprisingly well while guaranteeing ultra-low latencies, especially at search time.

Jump in

We hope the ViDoRe v3 Pipeline Leaderboard is a useful resource for researchers and practitioners deciding which retrieval pipelines to use for academic or real-world applications. We encourage and welcome submissions of new retrieval pipelines exploring the diverse components available.
