wav2vec2-xls-r-1b-cantonese

A fine-tuned version of facebook/wav2vec2-xls-r-1b for Cantonese (yue) automatic speech recognition, trained on the Cantonese subset of Common Voice 17.0.

Evaluation Results

Metric	Value
CER (no punctuation)	20.57%
CER (raw)	20.85%
Eval Loss	0.0328
Best Step	76000
Best Epoch	13.07
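
The two CER figures differ only in whether punctuation is stripped before scoring. As a rough illustration, the sketch below computes character error rate as Levenshtein edit distance divided by reference length, once on the raw strings and once after removing punctuation. The exact normalization used for this model is not documented here, so the `strip_punct` helper (which drops every Unicode punctuation character and all whitespace) is an assumption, as are the function names and the sample sentences.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def strip_punct(text: str) -> str:
    """Assumed normalization: drop Unicode punctuation (category P*) and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


ref = "你好，世界。"  # hypothetical reference transcript
hyp = "你好世界"      # hypothetical model output without punctuation

cer_raw = cer(ref, hyp)
cer_nopunct = cer(strip_punct(ref), strip_punct(hyp))
print(f"CER (raw): {cer_raw:.2%}, CER (no punctuation): {cer_nopunct:.2%}")
```

In this toy case the raw score is penalized for the two missing punctuation marks, while the no-punctuation score is zero. The same effect explains why the raw CER in the table is never far below the no-punctuation CER.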

Training History

Step	Epoch	Eval Loss	CER (nopunct)	CER (raw)
1000	0.01	6.2552	100.00%	100.00%
2000	0.02	5.7134	100.00%	100.00%
3000	0.04	3.6000	77.21%	77.30%
4000	0.05	2.1981	60.83%	61.40%
5000	0.06	1.5810	51.66%	51.91%
6000	1.01	1.2162	46.42%	46.65%
7000	1.02	0.9619	42.77%	42.95%
8000	1.03	0.8133	40.52%	40.69%
9000	1.04	0.7011	38.55%	38.66%
10000	1.06	0.6233	39.21%	39.38%
11000	2.00	0.5601	36.76%	37.02%
12000	2.01	0.5020	34.19%	36.47%
13000	2.03	0.4461	33.06%	34.10%
14000	2.04	0.4118	32.24%	32.40%
15000	2.05	0.3762	32.04%	32.08%
16000	2.06	0.3530	31.14%	31.15%
17000	3.01	0.3313	29.82%	29.86%
18000	3.02	0.2990	28.93%	28.94%
19000	3.03	0.2784	28.18%	28.23%
20000	3.05	0.2498	27.20%	28.12%
21000	3.06	0.2302	26.85%	27.22%
22000	4.00	0.2149	26.30%	26.57%
23000	4.02	0.1964	25.74%	26.10%
24000	4.03	0.1865	25.42%	26.37%
25000	4.04	0.1725	24.88%	25.10%
26000	4.05	0.1585	24.54%	24.57%
27000	4.06	0.1444	24.05%	24.16%
28000	5.01	0.1598	24.70%	25.07%
29000	5.02	0.1485	24.73%	25.41%
30000	5.03	0.1385	24.49%	25.39%
31000	5.05	0.1337	23.35%	23.96%
32000	5.06	0.1239	23.45%	23.60%
33000	6.00	0.1136	23.13%	23.22%
34000	6.02	0.1122	23.82%	25.76%
35000	6.03	0.1258	23.44%	23.93%
36000	6.04	0.1071	22.83%	23.13%
37000	6.05	0.1087	22.78%	23.22%
38000	6.07	0.0987	22.70%	22.83%
39000	7.01	0.0961	22.52%	24.59%
40000	7.02	0.0850	22.20%	22.33%
41000	7.04	0.0839	22.04%	22.22%
42000	7.05	0.0873	22.25%	22.74%
43000	7.06	0.0769	22.02%	23.37%
44000	8.01	0.0777	22.12%	27.00%
45000	8.02	0.0663	21.65%	24.92%
46000	8.03	0.0683	21.76%	21.81%
47000	8.04	0.0654	21.50%	21.55%
48000	8.06	0.0619	21.48%	21.52%
49000	9.00	0.0640	21.36%	22.33%
50000	9.01	0.0593	22.24%	24.59%
51000	9.03	0.0588	21.34%	21.36%
52000	9.04	0.0579	21.25%	22.04%
53000	9.05	0.0614	22.27%	24.85%
54000	9.06	0.0544	21.07%	21.08%
55000	10.01	0.0525	21.02%	22.75%
56000	10.02	0.0524	21.06%	21.13%
57000	10.03	0.0497	20.92%	20.97%
58000	10.04	0.0468	20.84%	20.84%
59000	10.06	0.0449	20.78%	20.80%
60000	11.00	0.0488	20.94%	20.93%
61000	11.01	0.0501	20.87%	21.45%
62000	11.03	0.0504	21.02%	21.54%
63000	11.04	0.0452	20.87%	21.00%
64000	11.05	0.0440	20.83%	20.96%
65000	11.06	0.0407	20.70%	20.79%
66000	12.01	0.0443	20.88%	21.01%
67000	12.02	0.0417	20.85%	21.02%
68000	12.03	0.0434	21.03%	21.10%
69000	12.05	0.0420	20.88%	21.01%
70000	12.06	0.0425	21.88%	21.99%
71000	13.00	0.0390	21.99%	22.28%
72000	13.02	0.0379	20.65%	20.83%
73000	13.03	0.0353	21.02%	21.24%
74000	13.04	0.0397	21.25%	21.55%
75000	13.05	0.0332	20.61%	20.85%
76000	13.07	0.0328	20.57%	20.85%
77000	14.01	0.0316	20.69%	20.92%
78000	14.02	0.0331	20.68%	20.95%
79000	14.04	0.0329	20.66%	20.97%
80000	14.05	0.0321	20.57%	20.80%
81000	14.06	0.0322	20.58%	20.82%
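
The Best Step reported under Evaluation Results can be cross-checked against this table. The sketch below copies a few of the final rows and picks the checkpoint with the lowest no-punctuation CER, breaking ties by the earlier step. That selection rule is an assumption (the training configuration is not shown here), and the `Checkpoint` type and `pick_best` name are illustrative.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    step: int
    eval_loss: float
    cer_nopunct: float  # percent
    cer_raw: float      # percent


# Rows copied from the training history table
history = [
    Checkpoint(75000, 0.0332, 20.61, 20.85),
    Checkpoint(76000, 0.0328, 20.57, 20.85),
    Checkpoint(77000, 0.0316, 20.69, 20.92),
    Checkpoint(80000, 0.0321, 20.57, 20.80),
    Checkpoint(81000, 0.0322, 20.58, 20.82),
]


def pick_best(rows):
    """Assumed rule: lowest no-punctuation CER, earliest step on ties."""
    return min(rows, key=lambda r: (r.cer_nopunct, r.step))


best = pick_best(history)
print(best.step, best.cer_nopunct)
```

Under this rule steps 76000 and 80000 tie at 20.57% and the earlier one is kept, which matches the reported Best Step. Ranking by eval loss instead would select step 77000.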

Training Details

Base model: facebook/wav2vec2-xls-r-1b
Dataset: mozilla-foundation/common_voice_17_0 (yue)
Language: Cantonese (yue)
Task: Automatic Speech Recognition (ASR)
Architecture: CTC (Connectionist Temporal Classification)
Metric: Character Error Rate (CER)
Total training steps: 81540
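
Because the model is trained with CTC, it emits one token prediction per audio frame, and greedy decoding collapses consecutive repeats and then removes the blank token. A minimal sketch of that collapse follows. The toy vocabulary and the blank index of 0 are assumptions made for illustration and do not reflect this model's actual tokenizer.

```python
from itertools import groupby

BLANK_ID = 0  # assumed blank index for this toy example
VOCAB = {1: "你", 2: "好", 3: "嗎"}  # hypothetical vocabulary


def ctc_greedy_decode(frame_ids, blank_id=BLANK_ID):
    """Collapse consecutive duplicate ids, then drop blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Per-frame argmax ids; the blank between the two runs of 2 keeps them as separate characters
frames = [0, 1, 1, 0, 2, 2, 0, 2, 3, 3, 0]
ids = ctc_greedy_decode(frames)
text = "".join(VOCAB[i] for i in ids)
print(ids, text)
```

The order matters: removing blanks before collapsing repeats would merge genuinely repeated characters into one.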

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/wav2vec2-xls-r-1b-cantonese
tensorboard --logdir wav2vec2-xls-r-1b-cantonese/runs

Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("awong-dev/wav2vec2-xls-r-1b-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("awong-dev/wav2vec2-xls-r-1b-cantonese")

# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Downmix to mono so that stereo files also yield a 1-D waveform
audio = audio.mean(dim=0)
# The model expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Greedy CTC decoding: most likely token per frame, then collapse repeats and blanks
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
