| --- |
| language: |
| - en |
| license: mit |
| library_name: same |
| tags: |
| - vision-language |
| - navigation |
| - embodied-ai |
| - visual-navigation |
| - mixture-of-experts |
| - multimodal |
| - pytorch |
| datasets: |
| - R2R |
| - REVERIE |
| - RXR |
| - CVDN |
| - SOON |
| - ObjectNav-MP3D |
| metrics: |
| - success_rate |
| - spl |
| pipeline_tag: visual-question-answering |
| model-index: |
| - name: SAME |
| results: |
| - task: |
| type: visual-navigation |
| name: Vision-and-Language Navigation |
| dataset: |
| type: R2R |
| name: Room-to-Room (R2R) |
| metrics: |
| - type: success_rate |
| value: 76 |
| name: SR (val_unseen) |
| - type: spl |
| value: 66 |
| name: SPL (val_unseen) |
| - type: success_rate |
| value: 74 |
| name: SR (test_unseen) |
| - type: spl |
| value: 64 |
| name: SPL (test_unseen) |
| - task: |
| type: visual-navigation |
| name: Vision-and-Language Navigation |
| dataset: |
| type: REVERIE |
| name: REVERIE |
| metrics: |
| - type: success_rate |
| value: 46.4 |
| name: SR (val_unseen) |
| - type: spl |
| value: 36.1 |
| name: SPL (val_unseen) |
| - type: success_rate |
| value: 48.6 |
| name: SR (test_unseen) |
| - type: spl |
| value: 37.1 |
| name: SPL (test_unseen) |
| - task: |
| type: visual-navigation |
| name: Multilingual VLN |
| dataset: |
| type: RXR |
| name: RxR-EN |
| metrics: |
| - type: success_rate |
| value: 50.5 |
| name: SR (val_unseen) |
| - type: ndtw |
| value: 51.2 |
| name: nDTW (val_unseen) |
| - task: |
| type: visual-navigation |
| name: Dialog Navigation |
| dataset: |
| type: CVDN |
| name: CVDN |
| metrics: |
| - type: goal_progress |
| value: 6.94 |
| name: GP (val) |
| - type: goal_progress |
| value: 7.07 |
| name: GP (test) |
| - task: |
| type: visual-navigation |
| name: Object-Oriented Navigation |
| dataset: |
| type: SOON |
| name: SOON |
| metrics: |
| - type: success_rate |
| value: 36.1 |
| name: SR (val_unseen) |
| - type: spl |
| value: 25.4 |
| name: SPL (val_unseen) |
| - type: success_rate |
| value: 38.2 |
| name: SR (test_unseen) |
| - type: spl |
| value: 27.1 |
| name: SPL (test_unseen) |
| - task: |
| type: object-navigation |
| name: Object Navigation |
| dataset: |
| type: ObjectNav-MP3D |
| name: ObjectNav-MP3D |
| metrics: |
| - type: success_rate |
| value: 76.3 |
| name: SR (val) |
| - type: spl |
| value: 42.7 |
| name: SPL (val) |
| --- |
| |
| <div align="center"> |
|
|
| <h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent;font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1> |
|
|
| <div> |
| <a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>; |
| <a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>; |
| <a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>; |
| <a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>; |
| <a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>; |
| <a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a> |
| </div> |
| <sup>🍕</sup>AIML, University of Adelaide |
| <sup>🌭</sup>Adobe Research |
| <sup>🍔</sup>UNC, Chapel Hill |
| <sup>🌮</sup>UNSW Sydney |
| |
| <br> |
|
|
| <div> |
| <a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a> |
| <a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> |
| <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a> |
| </div> |
| |
| </div> |
|
|
| ## Model Description |
|
|
| **SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture. |
|
|
| ### Key Features |
|
|
| - **Multi-Task Capability**: Single model handles 9 different navigation datasets simultaneously |
| - **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations) |
- **Simulator-Free**: Works entirely with pre-computed CLIP ViT-B/16 features; no simulator installation is required
- **Flexible Architecture**: The MoE layer can be placed at the attention query, key-value, or feed-forward network position
|
|
| ## Model Architecture |
|
|
| SAME is built on a transformer-based architecture with the following key components: |
|
|
| | Component | Description | |
| |-----------|-------------| |
| | **Language Encoder** | 9-layer BERT-based transformer encoder | |
| | **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features | |
| | **Local VP Encoder** | Viewport-level information with crossmodal fusion | |
| | **Global Map Encoder** | Global spatial graph with dynamic routing | |
| | **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing | |
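
To show how these pieces fit together, here is a skeletal forward pass. Every class name, dimension, and wiring choice below is an assumption made for orientation only; the actual model lives in `src/` of the repository.

```python
import torch
import torch.nn as nn

class SAMESkeleton(nn.Module):
    """Schematic wiring of the components in the table above (illustrative)."""

    def __init__(self, hidden_dim=768, clip_dim=512, num_layers=9):
        super().__init__()
        # 9-layer BERT-style language encoder
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=12, batch_first=True),
            num_layers=num_layers)
        # Project 512-dim CLIP ViT-B/16 view features into the model width
        self.image_embed = nn.Linear(clip_dim, hidden_dim)
        # Cross-modal fusion for the local viewpoint (VP) encoder
        self.local_vp = nn.MultiheadAttention(hidden_dim, num_heads=12,
                                              batch_first=True)
        # Global map encoder over the spatial graph (placeholder layer)
        self.global_map = nn.TransformerEncoderLayer(hidden_dim, nhead=12,
                                                     batch_first=True)

    def forward(self, text_emb, view_feats):
        # text_emb:   (batch, text_len, hidden)
        # view_feats: (batch, num_views, 512) pre-computed CLIP features
        text = self.language_encoder(text_emb)
        views = self.image_embed(view_feats)
        local, _ = self.local_vp(views, text, text)  # vision attends to text
        return self.global_map(local)                # per-candidate features
```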
|
|
| ### MoE Routing |
|
|
The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This lets the model adapt its behavior based on the following (a minimal routing sketch follows the list):
| - The granularity of language instructions |
| - Current visual observations |
| - Navigation task requirements |
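
To make the routing concrete, the sketch below implements top-2 routing over 8 experts gated by a fused multimodal state vector. The class name, tensor shapes, and the choice of linear experts are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveMoE(nn.Module):
    """Illustrative top-2 Mixture-of-Experts gated by a multimodal state."""

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts)  # router
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )

    def forward(self, tokens, state):
        # tokens: (batch, seq, hidden) -- token sequence to transform
        # state:  (batch, hidden)      -- fused text + visual routing feature
        logits = self.gate(state)                       # (batch, num_experts)
        top_w, top_i = logits.topk(self.top_k, dim=-1)  # choose top-2 experts
        top_w = F.softmax(top_w, dim=-1)                # renormalize weights
        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):  # one routing decision per sample
            for k in range(self.top_k):
                expert = self.experts[int(top_i[b, k])]
                out[b] = out[b] + top_w[b, k] * expert(tokens[b])
        return out
```

One reading of `task_routing_feature: "multi"` (see the configuration below) is that the router conditions on this fused text + visual state, so the expert mix shifts as instructions and observations change, rather than being fixed per task.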
|
|
| ## Intended Uses |
|
|
| ### Primary Use Cases |
|
|
| - **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments |
| - **Object Navigation**: Finding target objects given category names |
| - **Dialog-based Navigation**: Multi-turn conversational navigation |
| - **Remote Object Grounding**: Navigating to and identifying remote objects |
|
|
| ### Supported Tasks |
|
|
| | Task | Dataset | Description | |
| |------|---------|-------------| |
| | Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following | |
| | Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects | |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| | Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation | |
| | Object Search | SOON | Semantic object-oriented navigation | |
| | Object Navigation | ObjectNav-MP3D | Category-based object finding | |
|
|
| ## How to Use |
|
|
| ### Installation |
|
|
| ```bash |
| git clone https://github.com/GengzeZhou/SAME.git |
| cd SAME |
| conda create --name SAME python=3.10 |
| conda activate SAME |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Download Data and Models |
|
|
| ```bash |
| # Download all datasets and features |
| python download.py --data |
| |
| # Download pretrained models |
| python download.py --pretrain |
| |
| # Download trained checkpoints (optional) |
| python download.py --checkpoints |
| ``` |
|
|
| ### Training |
|
|
| ```bash |
| cd src |
| |
| # Single GPU training |
| python run.py --config_dir configs/main_multi_q.yaml |
| |
| # Multi-GPU distributed training |
| torchrun --nproc_per_node=4 --master_port=29500 \ |
| run.py --config_dir configs/main_multi_q.yaml |
| ``` |
|
|
| ### Evaluation |
|
|
| ```bash |
| cd src |
| python run.py --config_dir configs/test.yaml \ |
| --options experiment.resume_file=/path/to/checkpoint.pt |
| ``` |
|
|
| ### Configuration Options |
|
|
| ```yaml |
| model: |
| use_moe_layer: true |
| moe_type: "Task" # Task-based MoE |
| moe_position: "Attn_q" # Attn_q, Attn_kv, or FFN |
| task_routing_feature: "multi" # Multimodal routing (recommended) |
| num_experts: 8 |
| num_experts_per_tok: 2 # Top-2 expert selection |
| ``` |
## Training Details

### Training Data

SAME is trained on 9 navigation datasets with weighted sampling:

| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |

### Training Hyperparameters

- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation); a schematic loop is sketched below
- **MoE Auxiliary Loss Coefficient**: 0.8
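
For readers unfamiliar with DAgger, the schematic loop below captures the core idea: supervise every step with the expert (shortest-path) action while letting the agent sometimes act on its own predictions. The `agent`, `env`, and `oracle` interfaces are hypothetical; SAME's actual training code in `src/` may differ in detail.

```python
import random

import torch
import torch.nn.functional as F

def dagger_rollout(agent, env, oracle, expert_prob=0.5, max_steps=15):
    """Roll out one episode under a mixed policy, supervising every
    step with the oracle's shortest-path action (the core DAgger idea)."""
    obs = env.reset()
    losses = []
    for _ in range(max_steps):
        logits = agent(obs)                      # (num_actions,) action scores
        expert_action = oracle.next_action(env)  # shortest-path label
        losses.append(F.cross_entropy(
            logits.unsqueeze(0), torch.tensor([expert_action])))
        # Follow the agent's own prediction part of the time, so training
        # states include those induced by the agent's mistakes.
        if random.random() < expert_prob:
            action = expert_action
        else:
            action = int(logits.argmax())
        obs, done = env.step(action)
        if done:
            break
    return torch.stack(losses).mean()
```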

### Visual Features

- **Feature Extractor**: CLIP ViT-B/16 (see the extraction sketch below)
- **Feature Dimension**: 512
- **Storage Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
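
The ready-made features ship via `python download.py --data`, but for intuition, here is a hedged sketch of how 512-dim view features could be pre-extracted with the openai/CLIP package; the file paths and HDF5 key layout are illustrative assumptions.

```python
import h5py
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def encode_views(image_paths):
    """Encode one panorama (e.g., 36 discretized views) into (num_views, 512)."""
    batch = torch.stack(
        [preprocess(Image.open(p)) for p in image_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)  # (num_views, 512) for ViT-B/16
    return feats.float().cpu().numpy()

# Hypothetical layout: one dataset per viewpoint, keyed "<scan>_<viewpoint>"
view_paths = [f"views/scan0001_vp0001_{i:02d}.jpg" for i in range(36)]
with h5py.File("clip_vit_b16_features.h5", "w") as f:
    f.create_dataset("scan0001_vp0001", data=encode_views(view_paths))
```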

## Evaluation Results

SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.

### Main Results (Unified Model)

#### Room-to-Room (R2R)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |

#### REVERIE

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |

#### RxR-EN (English split of RxR)

| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |

#### CVDN (Dialog Navigation)

| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |

#### SOON (Object-Oriented Navigation)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |

#### ObjectNav-MP3D

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |

### Evaluation Metrics

- **SR (Success Rate)**: Percentage of episodes that end within 3 m of the goal
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency; a minimal SR/SPL computation is sketched below
- **nDTW (normalized Dynamic Time Warping)**: Similarity of the agent's path to the ground-truth path
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance from the agent's final position to the goal
- **OSR (Oracle Success Rate)**: Success rate given an oracle stop action
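
As a concrete reference for the two headline metrics, SR and SPL can be computed from per-episode statistics as follows (standard definitions; the dictionary field names are assumptions):

```python
def success_rate(episodes, threshold=3.0):
    """Fraction of episodes ending within `threshold` meters of the goal."""
    return sum(ep["nav_error"] <= threshold for ep in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by shortest-path length over actual path length."""
    total = 0.0
    for ep in episodes:
        success = ep["nav_error"] <= threshold
        total += success * ep["shortest_path"] / max(ep["path_length"],
                                                     ep["shortest_path"])
    return total / len(episodes)
```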

## Model Variants

| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention Key/Value | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward Network | Multimodal | `FFN_pretrained_ckpt.pt` |
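
A quick way to sanity-check a downloaded checkpoint before training or evaluation (the top-level key layout is an assumption and may vary between releases):

```python
import torch

ckpt = torch.load("Attnq_pretrained_ckpt.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g., model weights, optimizer state, step
```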
|
|
| ## Limitations |
|
|
| - **Indoor Environments Only**: Trained and evaluated on indoor navigation datasets |
| - **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly |
- **English Language**: Primary support for English instructions (though the full RxR benchmark provides multilingual data)
| - **Static Environments**: Assumes static environments without dynamic obstacles or agents |
|
|
| ## Environmental Impact |
|
|
| - **Hardware**: Training conducted on NVIDIA A100 GPUs |
| - **Training Time**: Approximately 2-3 days on 4x A100 GPUs |
|
|
| ## Citation |
|
|
| If you find this work helpful, please cite: |
|
|
| ```bibtex |
| @article{zhou2024same, |
| title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts}, |
| author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu}, |
| journal={arXiv preprint arXiv:2412.05552}, |
| year={2024}, |
| } |
| ``` |
|
|
| ## Authors |
|
|
| - **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io)) |
| - **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me)) |
| - **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io)) |
| - **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5)) |
| - **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/)) |
| - **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me)) |
|
|
| ## Acknowledgements |
|
|
| We extend our gratitude to: |
- [Matterport3D](https://niessner.github.io/Matterport/) for the open-source platform
| - [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture |
| - [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data |
| - [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights |
|
|
| ## License |
|
|
| This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. |
|
|
| ## Contact |
|
|
| For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors. |