Choice of pretrained model and fine-tuning.
#3
by Avditvs - opened
Hi!
The technical paper does not really elaborate on the choice of backbone (the snowflake model) for training the classifier, or on why it was frozen. Could you give more details about the implementation choices?
Hi @Avditvs! We experimented with RoBERTa, mixedbread-ai/mxbai-embed-large-v1, and the snowflake models. Because the Llama annotations are quite noisy (+/- 1 point), freezing the encoder helped prevent overfitting, and (counter-intuitively) a retrieval-focused snowflake model worked best. snowflake-arctic-embed-m also performed just as well as snowflake-arctic-embed-l, so we went with the m variant to save on compute.
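For readers who want to see what "frozen encoder" means in practice, here is a rough PyTorch sketch. It is not the actual training code: the encoder is a tiny randomly initialised stand-in rather than a snowflake checkpoint, and the class name `FrozenEncoderRegressor`, the mean pooling, the single-output regression head, the learning rate, and the random "noisy" scores are all assumptions made up for illustration. The only point is that the optimizer updates the head alone, so noisy labels cannot reshape the encoder weights.

```python
import torch
from torch import nn


class FrozenEncoderRegressor(nn.Module):
    """Toy model: frozen encoder plus a small trainable head (hypothetical name)."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 32):
        super().__init__()
        # Stand-in for a pretrained embedding model; NOT the real backbone.
        self.encoder = nn.Sequential(
            nn.Embedding(vocab_size, hidden),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
        )
        # Single-output head that predicts a score (an assumption of this sketch).
        self.head = nn.Linear(hidden, 1)
        # Freeze every encoder parameter so gradient updates never touch it.
        for param in self.encoder.parameters():
            param.requires_grad = False

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # no_grad skips building an autograd graph for the frozen part.
        with torch.no_grad():
            hidden = self.encoder(input_ids)  # (batch, seq_len, hidden)
        pooled = hidden.mean(dim=1)           # mean pooling (assumed)
        return self.head(pooled).squeeze(-1)  # (batch,)


torch.manual_seed(0)
model = FrozenEncoderRegressor()

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # arbitrary learning rate

# Fake batch: random token ids and made-up scores with +/- 1 label noise.
input_ids = torch.randint(0, 1000, (8, 16))
clean_scores = torch.randint(1, 5, (8,)).float()
noisy_scores = clean_scores + torch.randint(-1, 2, (8,)).float()

# One training step: the loss gradient reaches the head only.
loss = nn.functional.mse_loss(model(input_ids), noisy_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With a real checkpoint, the same pattern would apply to the pretrained encoder's parameters in place of the toy `nn.Sequential`.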