kawhiiiileo committed
Commit eef52a8 · verified · 1 Parent(s): a222f68

Upload 2 files

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/innovator_vl_architecture.png filter=lfs diff=lfs merge=lfs -text
assets/README_Innovator_VL_8B_Thinking.md ADDED
@@ -0,0 +1,137 @@
+ ---
+ language:
+ - en
+ - zh
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ tags:
+ - pretrained
+ - multimodal
+ - vision-language
+ - scientific
+ - reasoning
+ - thinking
+ ---
+
+ # Innovator-VL-8B-Thinking
+
+ ## Introduction
+
+ **Innovator-VL-8B-Thinking** is a multimodal, reasoning-oriented large
+ language model designed for complex scientific problem solving. Built
+ upon Innovator-VL-8B-Instruct, this model is further optimized for
+ explicit multi-step reasoning, long-horizon chain-of-thought generation,
+ and token-efficient scientific analysis.
+
+ The model is particularly suitable for scientific tasks that require
+ structured reasoning over visual and textual evidence, such as
+ mathematics, chemistry, materials science, and multimodal scientific
+ benchmarks.
+
+ ------------------------------------------------------------------------
+
+ ## Model Overview
+
+ - **Model Type**: Vision-Language Reasoning Model
+ - **Parameter Size**: 8B
+ - **Base Language Model**: Qwen3-8B-Base
+ - **Vision Encoder**: RICE-ViT
+ - **Projector**: PatchMerger
+
+ The model supports native-resolution multi-image inputs and is optimized
+ for reasoning-intensive multimodal scenarios.
+
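+ As a rough illustration, the snippet below sketches how such a checkpoint
+ would typically be called through the standard Hugging Face
+ `AutoProcessor`/`AutoModelForImageTextToText` interface. The repo id, the
+ example image URL, and the generation settings are assumptions made for
+ illustration; consult the released repository for confirmed usage.
+
+ ```python
+ # Hypothetical quickstart; the repo id and model class are assumptions,
+ # not confirmed by this card.
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ model_id = "kawhiiiileo/Innovator-VL-8B-Thinking"  # assumed repo id
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+
+ # Native-resolution multi-image input: pass images alongside the question.
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "url": "https://example.com/figure.png"},  # placeholder
+         {"type": "text", "text": "What trend does this plot show? Think step by step."},
+     ],
+ }]
+ inputs = processor.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=True,
+     return_dict=True, return_tensors="pt",
+ ).to(model.device)
+
+ # Thinking models emit long reasoning traces, so leave room for new tokens.
+ output = model.generate(**inputs, max_new_tokens=2048)
+ print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
+                        skip_special_tokens=True))
+ ```
+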
+ ------------------------------------------------------------------------
+
+ ## Key Characteristics
+
+ ### Explicit Multimodal Reasoning
+
+ Innovator-VL-8B-Thinking is trained to explicitly generate structured
+ reasoning traces, enabling the model to:
+
+ - Perform multi-step logical deduction grounded in visual evidence
+ - Solve complex mathematical and scientific problems
+ - Maintain reasoning consistency across long contexts
+
+ ### Reinforcement Learning for Long-Horizon Reasoning
+
+ The model is further optimized using reinforcement learning to improve:
+
+ - Reasoning correctness
+ - Output consistency
+ - Token efficiency in long chain-of-thought generation
+
+ Sequence-level optimization enables strong accuracy while significantly
+ reducing unnecessary reasoning tokens.
+
+ ### Scientific Reasoning Performance
+
+ Compared to instruction-only models, Innovator-VL-8B-Thinking
+ demonstrates substantial gains on:
+
+ - Multimodal mathematical reasoning benchmarks
+ - Scientific reasoning and domain-specific QA
+ - Tasks requiring precise step-by-step analysis
+
+ ------------------------------------------------------------------------
+
+ ## Model Architecture
+
+ <img src="assets/innovator_vl_architecture.png" width="600"/>
+
+ - **Vision Encoder**: RICE-ViT (region-aware visual representation)
+ - **Projector**: PatchMerger for visual token compression
+ - **Language Model**: Qwen3-8B-Base
+ - **Model Size**: 8B parameters
+
+ The architecture is shared with the Instruct variant, while the
+ optimization objective and training strategy differ at the
+ post-training stage.
+
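+ To make the projector's token-compression role concrete, here is a
+ minimal PyTorch sketch of a PatchMerger-style module that fuses each 2x2
+ neighborhood of visual tokens into one LM-space token. The merge size and
+ dimensions are illustrative assumptions, not the released configuration.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class PatchMergerSketch(nn.Module):
+     """Illustrative 2x2 patch merger: concatenates each 2x2 neighborhood
+     of visual tokens and projects it to the LM hidden size, cutting the
+     visual token count by 4x. All dimensions here are assumptions."""
+
+     def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096, merge: int = 2):
+         super().__init__()
+         self.merge = merge
+         in_dim = vision_dim * merge * merge
+         self.proj = nn.Sequential(
+             nn.LayerNorm(in_dim),
+             nn.Linear(in_dim, lm_dim),
+             nn.GELU(),
+             nn.Linear(lm_dim, lm_dim),
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         # x: (batch, height, width, vision_dim) grid of patch embeddings
+         b, h, w, c = x.shape
+         m = self.merge
+         # Group each m x m spatial neighborhood into a single token.
+         x = x.reshape(b, h // m, m, w // m, m, c)
+         x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // m) * (w // m), m * m * c)
+         return self.proj(x)  # (batch, h*w / m^2, lm_dim)
+
+ tokens = PatchMergerSketch()(torch.randn(1, 32, 32, 1024))
+ print(tokens.shape)  # torch.Size([1, 256, 4096]) — 1024 patches become 256 tokens
+ ```
+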
+ ------------------------------------------------------------------------
+
+ ## Training Pipeline
+
+ ### Multimodal Pre-training
+
+ - Vision-language alignment with LLaVA-1.5 (558K)
+ - Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)
+
+ ### Instruction Initialization
+
+ - Initialized from Innovator-VL-8B-Instruct
+ - Supervised fine-tuning with multimodal instruction and reasoning data
+
+ ### Reinforcement Learning
+
+ - Trained with Innovator-VL-RL-172K
+ - Optimized using Group Sequence Policy Optimization (GSPO); see the
+   sketch after this list
+ - Reward design jointly considers reasoning structure and answer
+   correctness
+
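+ As a rough, paper-level sketch (not the authors' training code), GSPO
+ replaces per-token importance ratios with a single length-normalized,
+ sequence-level ratio per response, and clips whole responses against
+ group-normalized advantages. Shapes, padding handling, and the clipping
+ range below are illustrative assumptions.
+
+ ```python
+ import torch
+
+ def gspo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
+               rewards: torch.Tensor, eps: float = 3e-4) -> torch.Tensor:
+     """Sequence-level clipped objective in the spirit of GSPO.
+     logp_new / logp_old: (G, T) per-token log-probs of G sampled responses
+     to one prompt under the current / behavior policy (equal length T
+     assumed; padding omitted). rewards: (G,) scalar rewards."""
+     # Group-normalized advantage: one scalar per full response.
+     adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
+     # Length-normalized sequence-level importance ratio:
+     # (pi_new(y|x) / pi_old(y|x)) ** (1 / T)
+     seq_len = logp_new.shape[1]
+     ratio = torch.exp((logp_new - logp_old.detach()).sum(dim=1) / seq_len)
+     # Clipping acts on whole responses, not individual tokens; the range
+     # is far tighter than PPO's because the ratio is length-normalized.
+     clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
+     return -torch.min(ratio * adv, clipped * adv).mean()
+ ```
+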
+ ------------------------------------------------------------------------
+
+ ## Output Format
+
+ During reasoning tasks, the model may produce structured outputs:
+
+ `<think>` Step-by-step reasoning process `</think>`
+ `<answer>` Final answer `</answer>`
+
+ This format is enforced during training to improve reasoning stability
+ and evaluation consistency.
+
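+ For evaluation pipelines, a tiny helper like the following (a
+ hypothetical utility, not part of any released tooling) can pull the
+ final answer out of a response, assuming the tags appear as above:
+
+ ```python
+ import re
+
+ def extract_answer(text: str) -> str | None:
+     """Return the content of <answer>...</answer>, or None if absent."""
+     match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
+     return match.group(1).strip() if match else None
+
+ print(extract_answer("<think> 2+2=4 </think> <answer> 4 </answer>"))  # -> 4
+ ```
+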
+ ------------------------------------------------------------------------
+
+ ## Usage Recommendations
+
+ This model is recommended for:
+
+ - Multimodal mathematical reasoning
+ - Scientific problem solving requiring explicit reasoning
+ - Evaluation settings emphasizing chain-of-thought quality
+
+ For general instruction-following or latency-sensitive applications, the
+ Instruct version is recommended.
+
+ ------------------------------------------------------------------------
+
+ ## Citation
+
+ ```bibtex
+ @article{innovator-vl,
+   title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
+   year={2025}
+ }
+ ```
assets/innovator_vl_architecture.png ADDED

Git LFS Details

  • SHA256: a10c31adecda1ead8df899d9f3e9e307811172cff74b8a77fec42f8c3363e838
  • Pointer size: 131 Bytes
  • Size of remote file: 602 kB