Irodori-TTS-500M-v2-Character-Voice-SigLIP

Project Page arXiv GitHub Demo

Model Description

Irodori-TTS-500M-v2-Character-Voice-SigLIP is a Japanese TTS model based on Aratako/Irodori-TTS-500M-v2. This model synthesizes speech in a specific character's voice by using the character's image as a condition. By using encoded features from a character image as the conditioning signal instead of reference audio or voice captions, it enables zero-shot speech synthesis with a voice that matches the character's atmosphere.

This SigLIP variant uses SigLIP-v2-B/16-512 as the image encoder. Another model in the same Character Voice family, Irodori-TTS-500M-v2-Character-Voice-Tagger, uses a wd-tagger-based image encoder instead.

Samples

遠くで鳴る夕暮れの鐘が、一日の終わりを告げている。家々の窓には、ぽつりぽつりと暖かな灯りがともり始めた。

Character Image Generated Audio

Usage

For inference code, installation instructions, the Gradio demo, and CLI examples, please refer to the GitHub repository.

For CLI inference, use --hf-checkpoint p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP together with --character-image. See the GitHub README for complete command examples.

License

This model is released under the MIT License.

Acknowledgments

This model builds on the following projects and resources:

We also thank the authors and contributors of the original Irodori-TTS project and related open-source projects.

Citation

If you use this model in research or a project, please cite:

@misc{character-voice-control,
  author = {Tingrui Zhou and Keiji Yanai},
  title = {A Character's Look Speaks Volumes: Character Image-Conditioned Speaker Style Control for Japanese Text-to-Speech},
  year = {2026},
  eprint = {TODO},
  archivePrefix = {arXiv},
  primaryClass = {cs.SD},
  url = {TODO}
}

Please also cite the original Irodori-TTS model:

@misc{irodori-tts-v2,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Safetensors
Model size
0.5B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP

Finetuned
(9)
this model

Spaces using p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP 2

Collection including p1atdev/Irodori-TTS-500M-v2-Character-Voice-SigLIP