HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled Viewer • Updated 23 days ago • 62.1M • 1.12k • 3
HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT Viewer • Updated 23 days ago • 62.1M • 32.7k • 1
🤏 Smol-Data Collection • Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated 23 days ago • 12
pplx-embed Collection • Diffusion-Pretrained Dense and Contextual Embeddings • 7 items • Updated 28 days ago • 94
mistralai/Voxtral-Mini-4B-Realtime-2602 Automatic Speech Recognition • 4B • Updated 14 days ago • 737k • 725