AGC-Qwen3-VL-4B

This model applies Attention-Guided Clustering (AGC) to compress multi-vector video representations for efficient ColBERT-style late interaction retrieval. Model weights are initialized from Qwen3-VL-4B-Instruct and finetuned on MSR-VTT for text-to-video retrieval with bidirectional attention.

AGC compresses ~1300 video token vectors into a fixed budget of 32 vectors (97.6% compression), while matching or exceeding uncompressed retrieval performance.


Method Overview

AGC consists of three components:

  1. Attention-based Centroid Selection — Learned universal query tokens are appended to the document token sequence. The attention weights from these learned tokens to document tokens at the last transformer layer produce saliency scores, identifying the most semantically important regions. The top-m tokens by saliency are selected as cluster centroids.

  2. Hard Clustering — Every document token is assigned to its nearest centroid via cosine similarity, grouping related content into coherent clusters while preserving distinct semantic details.

  3. Weighted Aggregation — Tokens within each cluster are aggregated into a single vector by saliency-weighted averaging, which prioritizes informative tokens and keeps gradient flow stable during training.

The resulting m compressed vectors are used with ColBERT-style MaxSim scoring for retrieval.
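The three steps above can be sketched in a few lines. This is an illustrative reimplementation under assumed tensor shapes, not the repository's code; `agc_compress`, `doc_tokens`, and `saliency` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def agc_compress(doc_tokens: torch.Tensor, saliency: torch.Tensor, m: int = 32) -> torch.Tensor:
    """Compress (n, d) document tokens into (m, d) vectors.

    doc_tokens: (n, d) last-layer token embeddings
    saliency:   (n,) attention mass from the learned query tokens to each doc token
    """
    # 1. Attention-based centroid selection: top-m tokens by saliency.
    centroid_idx = saliency.topk(m).indices
    centroids = doc_tokens[centroid_idx]                                       # (m, d)

    # 2. Hard clustering: assign every token to its nearest centroid (cosine).
    sims = F.normalize(doc_tokens, dim=-1) @ F.normalize(centroids, dim=-1).T  # (n, m)
    assign = sims.argmax(dim=-1)                                               # (n,)

    # 3. Saliency-weighted aggregation within each cluster.
    compressed = torch.zeros(m, doc_tokens.size(-1))
    for k in range(m):
        mask = assign == k
        w = saliency[mask]
        compressed[k] = (w.unsqueeze(-1) * doc_tokens[mask]).sum(0) / w.sum().clamp(min=1e-6)
    return F.normalize(compressed, dim=-1)  # L2-normalized, as the model card specifies
```

Because each centroid token is its own nearest centroid, every cluster is non-empty, so no compressed vector degenerates to zero.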


Results on MSR-VTT

| Method | Tokens | R@1 | R@10 | nDCG@10 |
|---|---:|---:|---:|---:|
| OmniEmbed-7B | 1 | 51.5 | 83.2 | 67.1 |
| Video-ColBERT | 26 | 51.5 | 85.5 | 67.7 |
| Baseline (Ours, uncompressed) | 1318 | 55.7 | 88.3 | 71.9 |
| SeqResize | 32 | 53.3 | 86.9 | 69.9 |
| MemTok | 32 | 54.2 | 86.4 | 69.9 |
| H-Pool | 32 | 54.1 | 87.3 | 70.4 |
| **AGC-Qwen3-VL-4B (This model)** | 32 | **58.5** | **88.4** | **73.0** |

AGC at 32 tokens outperforms the uncompressed baseline at R@1 while using only 2.4% of the original index size, demonstrating that compression reduces redundancy in multimodal representations. This Qwen3-VL-4B variant achieves the best MSR-VTT performance among the reported model sizes (see paper Table 7).
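The 2.4% figure follows directly from the token budgets in the table, assuming the same bytes per vector before and after compression:

```python
# Back-of-envelope index-size comparison at bf16 (2 bytes per dimension),
# using the token counts and hidden dimension reported on this card.
dim, bytes_per_dim = 2048, 2
uncompressed = 1318 * dim * bytes_per_dim  # bytes per video, uncompressed
agc = 32 * dim * bytes_per_dim             # bytes per video with AGC
print(f"{agc / uncompressed:.1%}")         # → 2.4%
```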

Model Details

| Detail | Value |
|---|---|
| Initial weights | Qwen3-VL-4B-Instruct |
| Architecture | Qwen3-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Compression method | AGC (Attention-Guided Clustering) |
| Universal query tokens | 32 learned tokens (`<\|mem0\|>` – `<\|mem31\|>`) |
| Default budget | 32 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | `"Query: "` |
| Passage prefix | `"Passage: "` |
| Precision | bfloat16 |
| Training video frames | 24 |
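The MaxSim scoring listed above can be sketched as follows for a single query–document pair with a query padding mask. `maxsim_score` is a hypothetical helper, not the repository's `compute_similarity`:

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor, q_mask: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score.

    q:      (nq, dim) L2-normalized query vectors
    d:      (m, dim)  L2-normalized document vectors (e.g. 32 AGC vectors)
    q_mask: (nq,)     1.0 for real query tokens, 0.0 for padding
    """
    sims = q @ d.T                  # (nq, m) cosine similarities
    best = sims.max(dim=-1).values  # each query vector keeps its best document match
    return (best * q_mask).sum()    # sum over valid query positions
```

Since similarities are cosine on unit vectors, each query token contributes at most 1.0, and padding tokens contribute nothing.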

Usage

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

from src.arguments import ModelArguments
from src.encoder.select_encoder import AttentionSelectEncoder
from src.models.qwen3_vl_embed.qwen3_vl_embed import Qwen3ForEmbedding
from src.utils import get_appending_token_strings

MODEL_ID = "hltcoe/AGC_qwen3-vl_msrvtt"
VIDEO_PATH = "PLACEHOLDER"
NUM_PROXY_TOKENS = 32
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_PROXY_TOKENS))

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="select",
    normalize=True,
    num_appending_token=NUM_PROXY_TOKENS,
    use_cluster_pooling=True,
    use_attn_weight_cluster_pooling=True,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AttentionSelectEncoder.load(
    Qwen3ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode a video document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 110592, "min_pixels": 98304},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX
image_inputs, video_inputs, _ = process_vision_info(
    passage_messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)
video_inputs, video_metadatas = zip(*video_inputs)
video_inputs, video_metadatas = list(video_inputs), list(video_metadatas)
passage_inputs = processor(
    text=[text],
    videos=video_inputs,
    video_metadata=video_metadatas,
    do_resize=False,
    do_sample_frames=False,
    padding=True,
    return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
        print(doc_embeddings.shape)
        # doc_embeddings: (1, 32, 2048) — 32 compressed AGC vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
        print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```

Command line usage

For running inference and evaluation from the command line, see the Quick Start section.

Citation

```bibtex
@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality},
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202},
}
```