ASL Recognition Model - Improved ✨

Improved ASL Recognition using 1D CNN + Bidirectional LSTM with Attention on MediaPipe landmarks.

Performance

  • Validation Accuracy: 64.61%
  • Improvement: +438% relative to baseline (12.01% → 64.61%)
  • Parameters: 3,258,574
  • Best Epoch: 27

Model Architecture

This improved model uses a hybrid architecture:

  1. 1D Convolutional Layers: Extract spatial features from hand landmarks
    • Conv1D (126 → 128 → 256) with BatchNorm and Dropout
  2. Bidirectional LSTM: Model temporal dependencies across frames
    • 2 layers, 256 hidden units, bidirectional
  3. Attention Mechanism: Focus on important frames in the sequence
  4. Classification Head: 512 → 256 → 77 classes with BatchNorm + Dropout
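The four stages above can be sketched in PyTorch. This is a hypothetical re-implementation based only on the description; the real ImprovedASLModel may differ in kernel sizes, attention formulation, and exact parameter count:

```python
import torch
import torch.nn as nn

class ASLModelSketch(nn.Module):
    """Hypothetical sketch of the described CNN + BiLSTM + attention model."""
    def __init__(self, num_classes=77, feat_dim=126, dropout=0.3):
        super().__init__()
        # 1. Conv1D stack: 126 -> 128 -> 256 with BatchNorm + Dropout
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
        )
        # 2. Bidirectional LSTM: 2 layers, 256 hidden units per direction
        self.lstm = nn.LSTM(256, 256, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # 3. Additive attention over frames (assumed formulation)
        self.attn = nn.Linear(512, 1)
        # 4. Classification head: 512 -> 256 -> num_classes
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(256, num_classes),
        )

    def forward(self, x, mask):
        # x: (batch, seq, 2, 21, 3) -> flatten landmarks to (batch, seq, 126)
        b, t = x.shape[:2]
        x = x.reshape(b, t, -1)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (b, t, 256)
        x, _ = self.lstm(x)                               # (b, t, 512)
        scores = self.attn(x).squeeze(-1)                 # (b, t)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)
        pooled = (weights * x).sum(dim=1)                 # attention pooling
        return self.head(pooled)

model = ASLModelSketch()
out = model(torch.randn(4, 30, 2, 21, 3), torch.ones(4, 30))
print(out.shape)  # torch.Size([4, 77])
```

The padding mask is applied inside the attention so that zero-padded frames receive zero weight.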

Key Improvements

Phase 1 - Data

✅ Stratified train/val split (80/20) preserving class distribution
✅ Data augmentation: Gaussian noise on landmarks
✅ Class-weighted loss to handle any imbalance
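The three data-side steps can be sketched as follows. The noise scale (sigma=0.01) and the toy array sizes are assumptions for illustration, not values from the training run:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset: sequences of shape (30, 2, 21, 3), 77 classes
X = np.random.randn(770, 30, 2, 21, 3).astype(np.float32)
y = np.repeat(np.arange(77), 10)

# Stratified 80/20 split: each class keeps the same train/val proportion
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

def augment(batch, sigma=0.01):
    """Gaussian noise on landmarks (train split only); sigma is assumed."""
    noise = np.random.normal(0.0, sigma, batch.shape).astype(batch.dtype)
    return batch + noise

# Inverse-frequency class weights for the weighted loss
counts = Counter(y_tr)
class_weights = np.array(
    [len(y_tr) / (len(counts) * counts[c]) for c in sorted(counts)],
    dtype=np.float32)
```

With a perfectly balanced dataset the weights all come out to 1.0; they only matter if some classes are under-represented.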

Phase 2 - Model

✅ 1D CNN for better spatial feature extraction
✅ Bidirectional LSTM for temporal modeling
✅ Attention mechanism for frame importance
✅ BatchNorm for training stability

Phase 3 - Training

✅ AdamW optimizer with weight decay (1e-4)
✅ OneCycleLR scheduler for better convergence
✅ Gradient clipping (max_norm=1.0)
✅ Early stopping (patience=15)
✅ Class-weighted CrossEntropyLoss
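The training recipe above can be sketched as a minimal PyTorch loop. The stand-in model and dummy batch are placeholders; the listed hyperparameters (lr, weight decay, clipping norm, patience) are the ones reported in this card:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this is the ImprovedASLModel
model = nn.Sequential(nn.Flatten(), nn.Linear(30 * 126, 77))

steps_per_epoch, epochs = 193, 100  # ceil(6,160 samples / batch size 32)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)
criterion = nn.CrossEntropyLoss()  # pass weight=class_weights in practice

xb = torch.randn(32, 30, 126)         # one dummy batch of landmark features
yb = torch.randint(0, 77, (32,))
for step in range(3):                 # per-batch training step
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                  # OneCycleLR steps once per batch

# Early stopping on validation accuracy (checked once per epoch)
best_acc, bad_epochs, patience = 0.0, 0, 15
val_acc = 0.5  # placeholder; compute on the held-out split in practice
if val_acc > best_acc:
    best_acc, bad_epochs = val_acc, 0  # save checkpoint here
else:
    bad_epochs += 1                    # stop once bad_epochs >= patience
```

Note that OneCycleLR is stepped per batch, not per epoch, so it needs the total step count up front.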

Phase 4 - Validation

✅ Proper stratified validation split
✅ No augmentation during validation
✅ Consistent preprocessing pipeline

Vocabulary

  • 77 classes: 26 letters (A-Z) + 51 common words
  • Words: hello, goodbye, please, thank_you, yes, no, help, sorry, good, bad, friend, family, love, eat, drink, water, food, home, work, school, teacher, student, book, read, write, learn, understand, know, think, feel, want, need, have, go, come, see, hear, speak, sign, time, today, tomorrow, yesterday, morning, afternoon, evening, night, happy, sad, angry, tired

Training Details

Batch Size: 32
Epochs: 100 (early stopped at 42)
Learning Rate: 1e-3 (OneCycleLR)
Weight Decay: 1e-4
Dropout: 0.3
Train Samples: 6,160
Val Samples: 1,540

Usage

import torch
import numpy as np
from train_asl_improved import ImprovedASLModel

# Load model
model = ImprovedASLModel(num_classes=77)
checkpoint = torch.load('best_model.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load label encoder
classes = np.load('label_encoder_classes.npy', allow_pickle=True)

# Inference
landmarks = torch.randn(1, 30, 2, 21, 3)  # (batch, seq_len, hands, landmarks, coords)
mask = torch.ones(1, 30)  # Valid frames mask

with torch.no_grad():
    logits = model(landmarks, mask)
    pred = torch.argmax(logits, dim=-1)
    print(f"Predicted: {classes[pred.item()]}")

Input Format

  • Shape: (batch, 30, 2, 21, 3)
    • 30 frames (sequence length)
    • 2 hands (left + right)
    • 21 landmarks per hand (MediaPipe format)
    • 3 coordinates (x, y, z)
  • Normalization: Landmarks normalized to [-1, 1] range
  • Padding: Zero-padded sequences with mask support
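A preprocessing helper matching this input contract might look like the sketch below. The exact normalization scheme used in training is not specified here, so per-clip min-max scaling to [-1, 1] is an assumption:

```python
import numpy as np

SEQ_LEN = 30  # fixed sequence length expected by the model

def preprocess(frames):
    """frames: (T, 2, 21, 3) raw MediaPipe hand landmarks, T variable.
    Returns a (30, 2, 21, 3) array in [-1, 1] and a (30,) validity mask."""
    frames = np.asarray(frames, dtype=np.float32)
    # Normalize to [-1, 1]; assumed per-clip min-max scaling
    lo, hi = frames.min(), frames.max()
    if hi > lo:
        frames = 2.0 * (frames - lo) / (hi - lo) - 1.0
    t = min(len(frames), SEQ_LEN)
    out = np.zeros((SEQ_LEN, 2, 21, 3), dtype=np.float32)
    mask = np.zeros(SEQ_LEN, dtype=np.float32)
    out[:t] = frames[:t]   # zero-pad clips shorter than 30 frames
    mask[:t] = 1.0         # 1 = valid frame, 0 = padding
    return out, mask

x, m = preprocess(np.random.rand(18, 2, 21, 3))  # an 18-frame clip
```

The mask lets the model's attention ignore the padded tail rather than attend to all-zero frames.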

Training Results

Metric           Baseline   Improved   Change
Val Accuracy     12.01%     64.61%     +52.6 pp
Train Accuracy   ~20%       72.58%     +52.6 pp
Parameters       323K       3.26M      ~10x

Validation Accuracy Timeline

  • Epoch 1: 7.01%
  • Epoch 5: 25.13%
  • Epoch 10: 48.77%
  • Epoch 15: 57.79%
  • Epoch 20: 61.17%
  • Epoch 27: 64.61% ← Best
  • Epoch 42: 62.40% (early stop)

Next Steps to Reach 80%+

  1. More Data: Current 100 samples/class is minimal; collect 300-500 per class
  2. Better Augmentation: Time warping, speed perturbation, mixup
  3. Ensemble: Combine multiple models or use test-time augmentation
  4. Architecture: Try Transformer encoders instead of LSTM
  5. Signer Independence: Ensure train/val split by signer ID (not done in current dataset)
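As one concrete example of item 2, speed perturbation can be done by resampling the landmark sequence along the time axis with linear interpolation. This is a generic sketch, not part of the released training code:

```python
import numpy as np

def speed_perturb(seq, factor):
    """Resample a (T, ...) landmark sequence to roughly T/factor frames
    via linear interpolation along time (simple speed perturbation)."""
    T = seq.shape[0]
    new_T = max(2, int(round(T / factor)))
    src = np.linspace(0, T - 1, new_T)        # fractional source indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (src - lo).reshape(-1, *([1] * (seq.ndim - 1)))
    return (1 - frac) * seq[lo] + frac * seq[hi]

fast = speed_perturb(np.random.rand(30, 2, 21, 3), factor=1.25)
```

Random per-frame warping (non-uniform `src` indices) and mixup on the flattened features follow the same pattern; the resampled clip would then be re-padded to 30 frames as usual.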

Files

  • best_model.pth: Model checkpoint (37MB)
  • label_encoder_classes.npy: Class labels mapping
  • model_config.json: Configuration and metadata
  • training_history.png: Loss/accuracy plots
  • training_history.npy: Training metrics

Citation

If you use this model, please cite:

@misc{asl-recognition-improved,
  title={Improved ASL Recognition with CNN+LSTM},
  author={namratha2412},
  year={2025},
  howpublished={\url{https://huggingface.co/namratha2412/asl-recognition}}
}

License: MIT
Dataset: 7,700 sequences (100 per class)
Framework: PyTorch 2.0+
