# ASL Recognition Model - Improved ✨
Improved ASL Recognition using 1D CNN + Bidirectional LSTM with Attention on MediaPipe landmarks.
## Performance

- Validation Accuracy: 64.61%
- Improvement: +438% relative to baseline (12% → 64.61%)
- Parameters: 3,258,574
- Best Epoch: 27
## Model Architecture

This improved model uses a hybrid architecture:

- **1D Convolutional Layers**: extract spatial features from hand landmarks
  - Conv1D (126 → 128 → 256) with BatchNorm and Dropout
- **Bidirectional LSTM**: models temporal dependencies across frames
  - 2 layers, 256 hidden units, bidirectional
- **Attention Mechanism**: focuses on the most informative frames in the sequence
- **Classification Head**: 512 → 256 → 77 classes with BatchNorm + Dropout
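The architecture above can be sketched in PyTorch. This is an illustrative reconstruction from the layer sizes listed here; kernel sizes, dropout placement, and the exact attention formulation are assumptions, and the class name `ASLHybridSketch` is hypothetical (it is not the released `ImprovedASLModel`):

```python
import torch
import torch.nn as nn

class ASLHybridSketch(nn.Module):
    """Sketch of the 1D CNN + BiLSTM + attention hybrid described above."""

    def __init__(self, num_classes=77, dropout=0.3):
        super().__init__()
        # 2 hands x 21 landmarks x 3 coords = 126 features per frame
        self.conv = nn.Sequential(
            nn.Conv1d(126, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(256, 256, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Additive attention over frames: score -> softmax -> weighted sum
        self.attn = nn.Linear(512, 1)
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(256, num_classes),
        )

    def forward(self, x, mask=None):
        b, t = x.shape[:2]
        x = x.reshape(b, t, -1).transpose(1, 2)  # (b, 126, t)
        x = self.conv(x).transpose(1, 2)         # (b, t, 256)
        x, _ = self.lstm(x)                      # (b, t, 512)
        scores = self.attn(x).squeeze(-1)        # (b, t)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        w = torch.softmax(scores, dim=1).unsqueeze(-1)
        pooled = (x * w).sum(dim=1)              # attention-weighted frame pool
        return self.head(pooled)

logits = ASLHybridSketch()(torch.randn(4, 30, 2, 21, 3), torch.ones(4, 30))
```

The attention pool replaces a plain last-hidden-state readout, so frames the model scores as uninformative (e.g. padding or transition frames) contribute less to the final classification.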
## Key Improvements

### Phase 1 - Data

- ✅ Stratified train/val split (80/20) preserving class distribution
- ✅ Data augmentation: Gaussian noise on landmarks
- ✅ Class-weighted loss to handle class imbalance
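The Phase 1 steps can be sketched with scikit-learn and NumPy. The arrays here are toy stand-ins (the real dataset has 7,700 sequences), and the noise scale `sigma` is an assumed value:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-ins: 10 sequences per class instead of the real 100
X = rng.standard_normal((770, 30, 2, 21, 3)).astype(np.float32)
y = np.repeat(np.arange(77), 10)

# Stratified 80/20 split keeps per-class proportions identical in both sets
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Gaussian-noise augmentation on landmark coordinates (train set only)
def augment(batch, sigma=0.01):
    return batch + rng.normal(0.0, sigma, batch.shape).astype(np.float32)

# Inverse-frequency class weights for the weighted loss
counts = np.bincount(y_tr, minlength=77)
weights = counts.sum() / (77 * counts)
```

With a balanced dataset the weights all come out to 1.0; they only start to matter if some classes are under-represented.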
### Phase 2 - Model

- ✅ 1D CNN for better spatial feature extraction
- ✅ Bidirectional LSTM for temporal modeling
- ✅ Attention mechanism for frame importance
- ✅ BatchNorm for training stability
### Phase 3 - Training

- ✅ AdamW optimizer with weight decay (1e-4)
- ✅ OneCycleLR scheduler for better convergence
- ✅ Gradient clipping (max_norm=1.0)
- ✅ Early stopping (patience=15)
- ✅ Class-weighted CrossEntropyLoss
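The Phase 3 setup can be sketched as follows. The model is a trivial stand-in, the validation accuracy is a placeholder, and one batch stands for a full epoch; the point is where clipping, the scheduler step, and early stopping sit relative to each other:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(126 * 30, 77))  # stand-in model
class_weights = torch.ones(77)  # computed from training-set class frequencies

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
steps_per_epoch, epochs = 193, 100  # 6,160 samples / batch 32 ≈ 193 steps
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)

best_acc, patience, bad_epochs = 0.0, 15, 0
for epoch in range(epochs):
    # One illustrative batch; the real loop iterates a DataLoader
    x = torch.randn(32, 30, 126)
    loss = criterion(model(x), torch.randint(0, 77, (32,)))
    optimizer.zero_grad()
    loss.backward()
    # Clip AFTER backward, BEFORE the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # OneCycleLR steps per batch, not per epoch

    val_acc = 0.0  # placeholder: evaluate on the validation split here
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0  # checkpoint best model here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping
```

Because the placeholder validation accuracy never improves, this sketch stops exactly when `bad_epochs` reaches the patience of 15.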
### Phase 4 - Validation

- ✅ Proper stratified validation split
- ✅ No augmentation during validation
- ✅ Consistent preprocessing pipeline
## Vocabulary
- 77 classes: 26 letters (A-Z) + 51 common words
- Words: hello, goodbye, please, thank_you, yes, no, help, sorry, good, bad, friend, family, love, eat, drink, water, food, home, work, school, teacher, student, book, read, write, learn, understand, know, think, feel, want, need, have, go, come, see, hear, speak, sign, time, today, tomorrow, yesterday, morning, afternoon, evening, night, happy, sad, angry, tired
## Training Details

- Batch Size: 32
- Epochs: 100 (early stopped at 42)
- Learning Rate: 1e-3 (OneCycleLR)
- Weight Decay: 1e-4
- Dropout: 0.3
- Train Samples: 6,160
- Val Samples: 1,540
## Usage

```python
import torch
import numpy as np
from train_asl_improved import ImprovedASLModel

# Load model
model = ImprovedASLModel(num_classes=77)
checkpoint = torch.load('best_model.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load label encoder
classes = np.load('label_encoder_classes.npy', allow_pickle=True)

# Inference
landmarks = torch.randn(1, 30, 2, 21, 3)  # (batch, seq_len, hands, landmarks, coords)
mask = torch.ones(1, 30)  # valid-frames mask
with torch.no_grad():
    logits = model(landmarks, mask)
    pred = torch.argmax(logits, dim=-1)
print(f"Predicted: {classes[pred.item()]}")
```
## Input Format

- Shape: (batch, 30, 2, 21, 3)
  - 30 frames (sequence length)
  - 2 hands (left + right)
  - 21 landmarks per hand (MediaPipe format)
  - 3 coordinates (x, y, z)
- Normalization: landmarks normalized to the [-1, 1] range
- Padding: zero-padded sequences with mask support
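Padding and mask construction for this input format can be sketched as below. The `preprocess` helper is hypothetical, and coordinate normalization is assumed to have happened upstream:

```python
import numpy as np

def preprocess(frames, seq_len=30):
    """Pad or truncate a variable-length landmark sequence to `seq_len`
    frames and build the valid-frames mask. `frames` has shape
    (T, 2, 21, 3) with coordinates already in [-1, 1]; padding frames
    are all zeros, and the mask is 1 for real frames, 0 for padding."""
    T = frames.shape[0]
    out = np.zeros((seq_len, 2, 21, 3), dtype=np.float32)
    mask = np.zeros(seq_len, dtype=np.float32)
    n = min(T, seq_len)
    out[:n] = frames[:n]
    mask[:n] = 1.0
    return out, mask

seq = np.random.uniform(-1, 1, (18, 2, 21, 3)).astype(np.float32)
padded, mask = preprocess(seq)  # 18 real frames, 12 zero-padded
```

The mask is what lets the attention mechanism ignore the zero-padded tail rather than treating it as real signal.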
## Training Results

| Metric | Baseline | Improved | Change |
|---|---|---|---|
| Val Accuracy | 12.01% | 64.61% | +52.6 pp |
| Train Accuracy | ~20% | 72.58% | ~+52.6 pp |
| Parameters | 323K | 3.26M | 10x |
### Validation Accuracy Timeline

- Epoch 1: 7.01%
- Epoch 5: 25.13%
- Epoch 10: 48.77%
- Epoch 15: 57.79%
- Epoch 20: 61.17%
- Epoch 27: 64.61% ← best
- Epoch 42: 62.40% (early stop)
## Next Steps to Reach 80%+
- More Data: Current 100 samples/class is minimal; collect 300-500 per class
- Better Augmentation: Time warping, speed perturbation, mixup
- Ensemble: Combine multiple models or use test-time augmentation
- Architecture: Try Transformer encoders instead of LSTM
- Signer Independence: Ensure train/val split by signer ID (not done in current dataset)
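As one concrete example of the augmentation ideas above, speed perturbation can be implemented by linearly resampling the frame axis. The `speed_perturb` helper is hypothetical and not part of the released code:

```python
import numpy as np

def speed_perturb(frames, rate, ):
    """Resample a (T, ...) landmark sequence in time by `rate` using
    linear interpolation between neighboring frames: rate > 1 shortens
    the sequence (signer appears faster), rate < 1 lengthens it."""
    T = frames.shape[0]
    new_T = max(2, int(round(T / rate)))
    src = np.linspace(0, T - 1, new_T)        # fractional source indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (src - lo).reshape(-1, *([1] * (frames.ndim - 1)))
    return frames[lo] * (1 - frac) + frames[hi] * frac

seq = np.random.randn(30, 2, 21, 3)
fast = speed_perturb(seq, rate=1.5)   # 20 frames
slow = speed_perturb(seq, rate=0.75)  # 40 frames
```

Resampled sequences would still be padded/truncated back to 30 frames before batching, so this varies signing speed without changing the model's input shape.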
## Files

- `best_model.pth`: model checkpoint (37 MB)
- `label_encoder_classes.npy`: class-labels mapping
- `model_config.json`: configuration and metadata
- `training_history.png`: loss/accuracy plots
- `training_history.npy`: training metrics
## Citation

If you use this model, please cite:

```bibtex
@misc{asl-recognition-improved,
  title={Improved ASL Recognition with CNN+LSTM},
  author={namratha2412},
  year={2025},
  howpublished={\url{https://huggingface.co/namratha2412/asl-recognition}}
}
```
- **License:** MIT
- **Dataset:** 7,700 sequences (100 per class)
- **Framework:** PyTorch 2.0+