laion/VoiceCLAP-Large
Voice-text contrastive embedding model — the larger of the two anchors released with VoiceNet.

VoiceCLAP-Large is a single-tower model: a rank-16 LoRA finetune of LCO-Embedding-Omni-7B (Qwen2.5-Omni-Thinker-7B backbone with a sentence-transformer last-token-pooling head) trained with the symmetric InfoNCE loss. The audio and text embeddings are produced by the same backbone — the modality is determined by what is fed in via the multimodal chat template.

| | |
|---|---|
| Architecture | single-tower Omni-Embedding (Qwen2.5-Omni-Thinker-7B + sentence-transformer last-token pooling) |
| Adaptation | rank-16 LoRA (alpha 32, dropout 0.05), merged into the released weights |
| Joint embedding | 3 584-d, L2-normalised |
| Loss | symmetric InfoNCE (all-gather negatives) |
| Total parameters | ~7 B (full merged model) |
| Epochs | 1 |
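The symmetric InfoNCE objective pairs each audio clip with its caption and treats the rest of the (gathered) batch as negatives, averaging the audio-to-text and text-to-audio cross-entropies. A minimal NumPy sketch of the loss on one local batch (the temperature value and function name here are illustrative, not taken from the training code):

```python
import numpy as np

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (N, D) L2-normalised embeddings of N matched pairs."""
    logits = audio_emb @ text_emb.T / temperature  # (N, N) scaled cosine similarities
    labels = np.arange(len(logits))                # i-th audio matches i-th caption

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()   # mean -log p(correct pair)

    # Averaging both directions makes the loss symmetric in audio and text.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the all-gather variant, embeddings from every GPU are concatenated before the similarity matrix is formed, enlarging the negative pool per step.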

Training data

Trained for 1 epoch on the open mixture (9 datasets) used in the VoiceNet paper:

  • emolia-balanced-5M-subset (annotated subset of Emilia)
  • laions_got_talent_clean_with_captions
  • majestrino-data
  • synthetic_vocal_bursts
  • improved_synthetic_vocal_bursts
  • ears
  • expresso
  • voxceleb1
  • voxceleb2

All clips are captioned with MOSS-Audio-8B-Thinking-derived dense vocal-style captions covering emotions, talking-style attributes, and demographics.

Standalone load example

The model uses the SentenceTransformer multimodal API; both sentence-transformers and transformers are on PyPI. The audio example below additionally uses soundfile to read WAV files.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("laion/voiceclap-large", trust_remote_code=True)

# Text embedding (3 584-d, L2-normalised)
text_emb = model.encode(["a calm and steady voice"])

# Audio embedding — pass a dict with raw samples + sampling rate.
import soundfile as sf
arr, sr = sf.read("clip.wav")
audio_emb = model.encode([{"array": arr, "sampling_rate": sr}])

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
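Because both towers share one backbone and the outputs are already L2-normalised, zero-shot retrieval reduces to a matrix product followed by a sort. A self-contained sketch using random stand-ins for real model outputs (the 3 584-d shapes match the model; the embeddings themselves are placeholders):

```python
import numpy as np

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for model.encode(...) outputs: 3 caption embeddings, 1 audio embedding.
rng = np.random.default_rng(0)
text_embs = l2_normalise(rng.standard_normal((3, 3584)))
audio_emb = l2_normalise(rng.standard_normal((1, 3584)))

# Unit-length vectors: the dot product IS the cosine similarity.
scores = (audio_emb @ text_embs.T).ravel()   # one score per caption
ranking = np.argsort(-scores)                # best-matching caption first
```

With real embeddings, `text_embs` would come from encoding a list of candidate style captions and `ranking[0]` would index the caption closest to the clip.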

Citation

If you use this model, please cite the VoiceNet paper.

