Add pipeline tag, library name, and improve model card

#1 opened by nielsr (HF Staff)
Files changed (1): README.md (+18, -10)
README.md CHANGED
@@ -1,30 +1,38 @@
 ---
+base_model:
+- OpenGVLab/InternVL3_5-1B-Instruct
 language:
 - en
+license: mit
 metrics:
 - accuracy
-base_model:
-- OpenGVLab/InternVL3_5-1B-Instruct
 tags:
 - visual-reasoning
 - fine-grained-vqa
 - fine-grained-recognition
-license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
-# Model Card for TWIN-Qwen2.5-VL-3B
 
-<!-- Provide a quick summary of what the model is/does. -->
+# Model Card for TWIN-InternVL3_5-1B
 
-This is the InternVL3_5-1B model post-trained on the TWIN dataset from the paper: [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://glab-caltech.github.io/twin/)
+This repository contains the InternVL3.5-1B model post-trained on the TWIN dataset, as introduced in the paper [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592).
 
-For further information please refer to the [project webpage](https://glab-caltech.github.io/twin/), [paper](https://arxiv.org/abs/2512.23592), and [repository](https://github.com/damianomarsili/TWIN).
+TWIN is a large-scale dataset of 561,000 image-pair queries designed to enhance the perceptual abilities of Vision-Language Models (VLMs). It tasks models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. Fine-tuning on TWIN yields significant gains in fine-grained recognition across various domains like art, animals, plants, and landmarks.
+
+## Resources
+
+- **Project Page:** [https://glab-caltech.github.io/twin/](https://glab-caltech.github.io/twin/)
+- **Paper:** [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592)
+- **Code Repository:** [https://github.com/damianomarsili/TWIN](https://github.com/damianomarsili/TWIN)
+- **Dataset:** [glab-caltech/TWIN](https://huggingface.co/datasets/glab-caltech/TWIN)
+- **Benchmark Suite:** [glab-caltech/FGVQA](https://huggingface.co/datasets/glab-caltech/FGVQA)
 
 ## Citation
 
-If you use TWIN in your research, please consider citing our work:
+If you use TWIN in your research, please consider citing the work:
 
-**BibTeX:**
-```
+```bibtex
 @misc{marsili2025notenhancingvisualperception,
 title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
 author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
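
A note on the added metadata: `pipeline_tag: image-text-to-text` together with `library_name: transformers` suggests the checkpoint is meant to be loaded through Transformers. Below is a minimal, hypothetical usage sketch; the repository ID `glab-caltech/TWIN-InternVL3_5-1B`, the use of the generic `image-text-to-text` pipeline, and `trust_remote_code=True` are assumptions, not details taken from this PR.

```python
# Hypothetical usage sketch inferred from the new pipeline_tag/library_name metadata.
# The repo ID and pipeline compatibility are assumptions, not confirmed by this PR.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",                      # matches the added pipeline_tag
    model="glab-caltech/TWIN-InternVL3_5-1B",  # assumed repository ID
    trust_remote_code=True,                    # InternVL-style checkpoints often ship custom code
)

# A TWIN-style "same or not?" query over two visually similar images.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/object_a.jpg"},
            {"type": "image", "url": "https://example.com/object_b.jpg"},
            {"type": "text", "text": "Do these two images show the same object? Answer yes or no."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=32)
print(result[0]["generated_text"])
```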