kawhiiiileo committed
Commit eef52a8 · verified · 1 Parent(s): a222f68

Upload 2 files

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/innovator_vl_architecture.png filter=lfs diff=lfs merge=lfs -text
assets/README_Innovator_VL_8B_Thinking.md ADDED
@@ -0,0 +1,137 @@
+ ---
+ language:
+ - en
+ - zh
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ tags:
+ - pretrained
+ - multimodal
+ - vision-language
+ - scientific
+ - reasoning
+ - thinking
+ ---
+
+ # Innovator-VL-8B-Thinking
+
+ ## Introduction
+
+ **Innovator-VL-8B-Thinking** is a multimodal, reasoning-oriented large
+ language model designed for complex scientific problem solving. Built
+ upon Innovator-VL-8B-Instruct, this model is further optimized for
+ explicit multi-step reasoning, long-horizon chain-of-thought generation,
+ and token-efficient scientific analysis.
+
+ The model is particularly suitable for scientific tasks that require
+ structured reasoning over visual and textual evidence, such as
+ mathematics, chemistry, materials science, and multimodal scientific
+ benchmarks.
+
+ ------------------------------------------------------------------------
+
+ ## Model Overview
+
+ - **Model Type**: Vision-Language Reasoning Model
+ - **Parameter Size**: 8B
+ - **Base Language Model**: Qwen3-8B-Base
+ - **Vision Encoder**: RICE-ViT
+ - **Projector**: PatchMerger
+
+ The model supports native-resolution multi-image inputs and is optimized
+ for reasoning-intensive multimodal scenarios.
+
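+ As a rough illustration, the snippet below sketches how such a checkpoint
+ would typically be called through the standard Hugging Face
+ `AutoProcessor`/`AutoModelForImageTextToText` interface. The repo id, the
+ example image URL, and the generation settings are assumptions made for
+ illustration; consult the released repository for confirmed usage.
+
+ ```python
+ # Hypothetical quickstart; the repo id and model class are assumptions,
+ # not confirmed by this card.
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ model_id = "kawhiiiileo/Innovator-VL-8B-Thinking"  # assumed repo id
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+
+ # Native-resolution multi-image input: pass images alongside the question.
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "url": "https://example.com/figure.png"},  # placeholder
+         {"type": "text", "text": "What trend does this plot show? Think step by step."},
+     ],
+ }]
+ inputs = processor.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=True,
+     return_dict=True, return_tensors="pt",
+ ).to(model.device)
+
+ # Thinking models emit long reasoning traces, so leave room for new tokens.
+ output = model.generate(**inputs, max_new_tokens=2048)
+ print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
+                        skip_special_tokens=True))
+ ```
+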
+ ------------------------------------------------------------------------
+
+ ## Key Characteristics
+
+ ### Explicit Multimodal Reasoning
+
+ Innovator-VL-8B-Thinking is trained to explicitly generate structured
+ reasoning traces, enabling the model to:
+
+ - Perform multi-step logical deduction grounded in visual evidence
+ - Solve complex mathematical and scientific problems
+ - Maintain reasoning consistency across long contexts
+
+ ### Reinforcement Learning for Long-Horizon Reasoning
+
+ The model is further optimized using reinforcement learning to improve:
+
+ - Reasoning correctness
+ - Output consistency
+ - Token efficiency in long chain-of-thought generation
+
+ Sequence-level optimization enables strong accuracy while significantly
+ reducing unnecessary reasoning tokens.
+
+ ### Scientific Reasoning Performance
+
+ Compared to instruction-only models, Innovator-VL-8B-Thinking
+ demonstrates substantial gains on:
+
+ - Multimodal mathematical reasoning benchmarks
+ - Scientific reasoning and domain-specific QA
+ - Tasks requiring precise step-by-step analysis
+
+ ------------------------------------------------------------------------
+
+ ## Model Architecture
+
+ <img src="assets/innovator_vl_architecture.png" width="600"/>
+
+ - **Vision Encoder**: RICE-ViT (region-aware visual representation)
+ - **Projector**: PatchMerger for visual token compression
+ - **Language Model**: Qwen3-8B-Base
+ - **Model Size**: 8B parameters
+
+ The architecture is shared with the Instruct variant, while the
+ optimization objective and training strategy differ at the
+ post-training stage.
+
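+ To make the projector's token-compression role concrete, here is a
+ minimal PyTorch sketch of a PatchMerger-style module that fuses each 2x2
+ neighborhood of visual tokens into one LM-space token. The merge size and
+ dimensions are illustrative assumptions, not the released configuration.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class PatchMergerSketch(nn.Module):
+     """Illustrative 2x2 patch merger: concatenates each 2x2 neighborhood
+     of visual tokens and projects it to the LM hidden size, cutting the
+     visual token count by 4x. All dimensions here are assumptions."""
+
+     def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096, merge: int = 2):
+         super().__init__()
+         self.merge = merge
+         in_dim = vision_dim * merge * merge
+         self.proj = nn.Sequential(
+             nn.LayerNorm(in_dim),
+             nn.Linear(in_dim, lm_dim),
+             nn.GELU(),
+             nn.Linear(lm_dim, lm_dim),
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         # x: (batch, height, width, vision_dim) grid of patch embeddings
+         b, h, w, c = x.shape
+         m = self.merge
+         # Group each m x m spatial neighborhood into a single token.
+         x = x.reshape(b, h // m, m, w // m, m, c)
+         x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // m) * (w // m), m * m * c)
+         return self.proj(x)  # (batch, h*w / m^2, lm_dim)
+
+ tokens = PatchMergerSketch()(torch.randn(1, 32, 32, 1024))
+ print(tokens.shape)  # torch.Size([1, 256, 4096]) — 1024 patches become 256 tokens
+ ```
+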
+ ------------------------------------------------------------------------
+
+ ## Training Pipeline
+
+ ### Multimodal Pre-training
+
+ - Vision-language alignment with LLaVA-1.5 (558K)
+ - Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)
+
+ ### Instruction Initialization
+
+ - Initialized from Innovator-VL-8B-Instruct
+ - Supervised fine-tuning with multimodal instruction and reasoning data
+
+ ### Reinforcement Learning
+
+ - Trained with Innovator-VL-RL-172K
+ - Optimized using Group Sequence Policy Optimization (GSPO); see the
+   sketch after this list
+ - Reward design jointly considers reasoning structure and answer
+   correctness
+
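+ As a rough, paper-level sketch (not the authors' training code), GSPO
+ replaces per-token importance ratios with a single length-normalized,
+ sequence-level ratio per response, and clips whole responses against
+ group-normalized advantages. Shapes, padding handling, and the clipping
+ range below are illustrative assumptions.
+
+ ```python
+ import torch
+
+ def gspo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
+               rewards: torch.Tensor, eps: float = 3e-4) -> torch.Tensor:
+     """Sequence-level clipped objective in the spirit of GSPO.
+     logp_new / logp_old: (G, T) per-token log-probs of G sampled responses
+     to one prompt under the current / behavior policy (equal length T
+     assumed; padding omitted). rewards: (G,) scalar rewards."""
+     # Group-normalized advantage: one scalar per full response.
+     adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
+     # Length-normalized sequence-level importance ratio:
+     # (pi_new(y|x) / pi_old(y|x)) ** (1 / T)
+     seq_len = logp_new.shape[1]
+     ratio = torch.exp((logp_new - logp_old.detach()).sum(dim=1) / seq_len)
+     # Clipping acts on whole responses, not individual tokens; the range
+     # is far tighter than PPO's because the ratio is length-normalized.
+     clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
+     return -torch.min(ratio * adv, clipped * adv).mean()
+ ```
+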
+ ------------------------------------------------------------------------
+
+ ## Output Format
+
+ During reasoning tasks, the model may produce structured outputs:
+
+ `<think>` Step-by-step reasoning process `</think>`
+ `<answer>` Final answer `</answer>`
+
+ This format is enforced during training to improve reasoning stability
+ and evaluation consistency.
+
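+ For evaluation pipelines, a tiny helper like the following (a
+ hypothetical utility, not part of any released tooling) can pull the
+ final answer out of a response, assuming the tags appear as above:
+
+ ```python
+ import re
+
+ def extract_answer(text: str) -> str | None:
+     """Return the content of <answer>...</answer>, or None if absent."""
+     match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
+     return match.group(1).strip() if match else None
+
+ print(extract_answer("<think> 2+2=4 </think> <answer> 4 </answer>"))  # -> 4
+ ```
+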
+ ------------------------------------------------------------------------
+
+ ## Usage Recommendations
+
+ This model is recommended for:
+
+ - Multimodal mathematical reasoning
+ - Scientific problem solving requiring explicit reasoning
+ - Evaluation settings emphasizing chain-of-thought quality
+
+ For general instruction-following or latency-sensitive applications, the
+ Instruct version is recommended.
+
+ ------------------------------------------------------------------------
+
+ ## Citation
+
+ ```bibtex
+ @article{innovator-vl,
+   title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
+   year={2025}
+ }
+ ```
assets/innovator_vl_architecture.png ADDED

Git LFS Details

  • SHA256: a10c31adecda1ead8df899d9f3e9e307811172cff74b8a77fec42f8c3363e838
  • Pointer size: 131 Bytes
  • Size of remote file: 602 kB