wav2vec2-xls-r-1b-cantonese
Fine-tuned facebook/wav2vec2-xls-r-1b for Cantonese (yue) speech recognition on Common Voice.
Evaluation Results
| Metric | Value |
| --- | --- |
| CER (no punctuation) | 20.57% |
| CER (raw) | 20.85% |
| Eval Loss | 0.0328 |
| Best Step | 76000 |
| Best Epoch | 13.07 |
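The two CER rows differ only in whether punctuation is counted. The sketch below shows one way such a pair of numbers could be computed. It is not the evaluation script used for this model: the helper names (`edit_distance`, `strip_punctuation`, `cer`) and the normalization rule (drop Unicode punctuation and whitespace) are assumptions made for illustration.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def strip_punctuation(text: str) -> str:
    """Assumed normalization: remove Unicode punctuation and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好,世界。"
hypothesis = "你好世界"
print(f"raw CER:     {cer(reference, hypothesis):.2%}")
print(f"nopunct CER: {cer(reference, hypothesis, ignore_punctuation=True):.2%}")
```

In this toy pair the hypothesis misses only the two punctuation marks, so the raw score is penalized while the no-punctuation score is not. A corpus-level figure is usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.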
Training History
| Step | Epoch | Eval Loss | CER (nopunct) | CER (raw) |
| --- | --- | --- | --- | --- |
| 1000 | 0.01 | 6.2552 | 100.00% | 100.00% |
| 2000 | 0.02 | 5.7134 | 100.00% | 100.00% |
| 3000 | 0.04 | 3.6000 | 77.21% | 77.30% |
| 4000 | 0.05 | 2.1981 | 60.83% | 61.40% |
| 5000 | 0.06 | 1.5810 | 51.66% | 51.91% |
| 6000 | 1.01 | 1.2162 | 46.42% | 46.65% |
| 7000 | 1.02 | 0.9619 | 42.77% | 42.95% |
| 8000 | 1.03 | 0.8133 | 40.52% | 40.69% |
| 9000 | 1.04 | 0.7011 | 38.55% | 38.66% |
| 10000 | 1.06 | 0.6233 | 39.21% | 39.38% |
| 11000 | 2.00 | 0.5601 | 36.76% | 37.02% |
| 12000 | 2.01 | 0.5020 | 34.19% | 36.47% |
| 13000 | 2.03 | 0.4461 | 33.06% | 34.10% |
| 14000 | 2.04 | 0.4118 | 32.24% | 32.40% |
| 15000 | 2.05 | 0.3762 | 32.04% | 32.08% |
| 16000 | 2.06 | 0.3530 | 31.14% | 31.15% |
| 17000 | 3.01 | 0.3313 | 29.82% | 29.86% |
| 18000 | 3.02 | 0.2990 | 28.93% | 28.94% |
| 19000 | 3.03 | 0.2784 | 28.18% | 28.23% |
| 20000 | 3.05 | 0.2498 | 27.20% | 28.12% |
| 21000 | 3.06 | 0.2302 | 26.85% | 27.22% |
| 22000 | 4.00 | 0.2149 | 26.30% | 26.57% |
| 23000 | 4.02 | 0.1964 | 25.74% | 26.10% |
| 24000 | 4.03 | 0.1865 | 25.42% | 26.37% |
| 25000 | 4.04 | 0.1725 | 24.88% | 25.10% |
| 26000 | 4.05 | 0.1585 | 24.54% | 24.57% |
| 27000 | 4.06 | 0.1444 | 24.05% | 24.16% |
| 28000 | 5.01 | 0.1598 | 24.70% | 25.07% |
| 29000 | 5.02 | 0.1485 | 24.73% | 25.41% |
| 30000 | 5.03 | 0.1385 | 24.49% | 25.39% |
| 31000 | 5.05 | 0.1337 | 23.35% | 23.96% |
| 32000 | 5.06 | 0.1239 | 23.45% | 23.60% |
| 33000 | 6.00 | 0.1136 | 23.13% | 23.22% |
| 34000 | 6.02 | 0.1122 | 23.82% | 25.76% |
| 35000 | 6.03 | 0.1258 | 23.44% | 23.93% |
| 36000 | 6.04 | 0.1071 | 22.83% | 23.13% |
| 37000 | 6.05 | 0.1087 | 22.78% | 23.22% |
| 38000 | 6.07 | 0.0987 | 22.70% | 22.83% |
| 39000 | 7.01 | 0.0961 | 22.52% | 24.59% |
| 40000 | 7.02 | 0.0850 | 22.20% | 22.33% |
| 41000 | 7.04 | 0.0839 | 22.04% | 22.22% |
| 42000 | 7.05 | 0.0873 | 22.25% | 22.74% |
| 43000 | 7.06 | 0.0769 | 22.02% | 23.37% |
| 44000 | 8.01 | 0.0777 | 22.12% | 27.00% |
| 45000 | 8.02 | 0.0663 | 21.65% | 24.92% |
| 46000 | 8.03 | 0.0683 | 21.76% | 21.81% |
| 47000 | 8.04 | 0.0654 | 21.50% | 21.55% |
| 48000 | 8.06 | 0.0619 | 21.48% | 21.52% |
| 49000 | 9.00 | 0.0640 | 21.36% | 22.33% |
| 50000 | 9.01 | 0.0593 | 22.24% | 24.59% |
| 51000 | 9.03 | 0.0588 | 21.34% | 21.36% |
| 52000 | 9.04 | 0.0579 | 21.25% | 22.04% |
| 53000 | 9.05 | 0.0614 | 22.27% | 24.85% |
| 54000 | 9.06 | 0.0544 | 21.07% | 21.08% |
| 55000 | 10.01 | 0.0525 | 21.02% | 22.75% |
| 56000 | 10.02 | 0.0524 | 21.06% | 21.13% |
| 57000 | 10.03 | 0.0497 | 20.92% | 20.97% |
| 58000 | 10.04 | 0.0468 | 20.84% | 20.84% |
| 59000 | 10.06 | 0.0449 | 20.78% | 20.80% |
| 60000 | 11.00 | 0.0488 | 20.94% | 20.93% |
| 61000 | 11.01 | 0.0501 | 20.87% | 21.45% |
| 62000 | 11.03 | 0.0504 | 21.02% | 21.54% |
| 63000 | 11.04 | 0.0452 | 20.87% | 21.00% |
| 64000 | 11.05 | 0.0440 | 20.83% | 20.96% |
| 65000 | 11.06 | 0.0407 | 20.70% | 20.79% |
| 66000 | 12.01 | 0.0443 | 20.88% | 21.01% |
| 67000 | 12.02 | 0.0417 | 20.85% | 21.02% |
| 68000 | 12.03 | 0.0434 | 21.03% | 21.10% |
| 69000 | 12.05 | 0.0420 | 20.88% | 21.01% |
| 70000 | 12.06 | 0.0425 | 21.88% | 21.99% |
| 71000 | 13.00 | 0.0390 | 21.99% | 22.28% |
| 72000 | 13.02 | 0.0379 | 20.65% | 20.83% |
| 73000 | 13.03 | 0.0353 | 21.02% | 21.24% |
| 74000 | 13.04 | 0.0397 | 21.25% | 21.55% |
| 75000 | 13.05 | 0.0332 | 20.61% | 20.85% |
| 76000 | 13.07 | 0.0328 | 20.57% | 20.85% |
| 77000 | 14.01 | 0.0316 | 20.69% | 20.92% |
| 78000 | 14.02 | 0.0331 | 20.68% | 20.95% |
| 79000 | 14.04 | 0.0329 | 20.66% | 20.97% |
| 80000 | 14.05 | 0.0321 | 20.57% | 20.80% |
| 81000 | 14.06 | 0.0322 | 20.58% | 20.82% |
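The reported best step (76000) matches the lowest CER (nopunct) in the history, not the lowest eval loss, which occurs at step 77000. The sketch below reproduces that reading on the last seven rows of the table. The selection rule (minimum no-punctuation CER, with ties going to the earlier step) is an assumption inferred from the numbers; this card does not state how the checkpoint was actually chosen.

```python
# (step, eval_loss, cer_nopunct_percent), copied from the final rows of the table.
history = [
    (75000, 0.0332, 20.61),
    (76000, 0.0328, 20.57),
    (77000, 0.0316, 20.69),
    (78000, 0.0331, 20.68),
    (79000, 0.0329, 20.66),
    (80000, 0.0321, 20.57),
    (81000, 0.0322, 20.58),
]

# min() returns the first minimal element, so the tie between
# steps 76000 and 80000 resolves to the earlier checkpoint.
best_by_cer = min(history, key=lambda row: row[2])
best_by_loss = min(history, key=lambda row: row[1])

print("best by CER (nopunct):", best_by_cer[0])   # 76000
print("best by eval loss:    ", best_by_loss[0])  # 77000
```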
Training Details
Training Metrics
TensorBoard logs are included in the runs/ directory of this repository.
```bash
git clone https://huggingface.co/awong-dev/wav2vec2-xls-r-1b-cantonese
tensorboard --logdir wav2vec2-xls-r-1b-cantonese/runs
```
Usage
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torchaudio
import torch

processor = Wav2Vec2Processor.from_pretrained("awong-dev/wav2vec2-xls-r-1b-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("awong-dev/wav2vec2-xls-r-1b-cantonese")

# Load the audio, resample to 16 kHz if needed, and downmix to one channel
# so stereo files are handled as well as mono ones.
audio, sr = torchaudio.load("audio.mp3")
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)
audio = audio.mean(dim=0)

inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

# Greedy decoding: pick the most likely token per frame, then decode to text.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
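The `argmax` followed by `batch_decode` in the snippet above amounts to greedy CTC decoding: consecutive repeated ids are merged and blank tokens are dropped. The sketch below illustrates that collapse rule in plain Python. The function name `ctc_greedy_collapse`, the three-entry vocabulary, and the choice of id 0 as the blank are hypothetical and do not reflect this model's actual vocabulary. In Hugging Face Wav2Vec2 CTC tokenizers the pad token typically serves as the blank, and the real tokenizer also handles details such as word delimiters that are omitted here.

```python
from typing import Mapping, Sequence


def ctc_greedy_collapse(
    ids: Sequence[int],
    id_to_token: Mapping[int, str],
    blank_id: int = 0,  # hypothetical blank id for this toy example
) -> str:
    """Merge consecutive duplicate ids, then drop blanks."""
    tokens = []
    prev = None
    for i in ids:
        # Emit a token only when the id changes and is not the blank.
        if i != prev and i != blank_id:
            tokens.append(id_to_token[i])
        prev = i
    return "".join(tokens)


# Toy vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "你", 2: "好"}

# Frame-level argmax ids: repeats collapse to one character each.
print(ctc_greedy_collapse([0, 1, 1, 0, 0, 2, 2, 2, 0], toy_vocab))

# A blank between two identical ids keeps them as separate characters.
print(ctc_greedy_collapse([1, 0, 1], toy_vocab))
```

The second call shows why the blank exists: without it, a genuinely doubled character could not be told apart from one character spanning several frames.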