tanaos-spam-detection-v1 (ONNX)
This is an ONNX version of tanaos/tanaos-spam-detection-v1. It was automatically converted and uploaded using this Hugging Face Space.
Usage with Transformers.js
See the pipeline documentation for text-classification: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TextClassificationPipeline
tanaos-spam-detection-v1: A small but performant base spam detection model
This model was created by Tanaos with the Artifex Python library.
This is a multilingual spam detection model (it supports 15+ languages) based on distilbert-base-multilingual-cased and fine-tuned on a synthetic dataset to classify text as spam or not_spam. It is intended to be used as a first-layer spam filter for email systems, messaging applications or any other text-based communication platform.
The following categories are considered spam:
- Unsolicited commercial advertisement or non-commercial proselytizing.
- Fraudulent schemes, including get-rich-quick and pyramid schemes.
- Phishing attempts, unrealistic offers or announcements.
- Content with deceptive or misleading information.
- Malware or harmful links.
- Adult content or explicit material.
- Excessive use of capitalization or punctuation to grab attention.
How to Use
Via the Artifex library (pip install artifex)
from artifex import Artifex
spam_detection = Artifex().spam_detection
print(spam_detection("You won an IPhone 16! Click here to claim your prize."))
# >>> [{'label': 'spam', 'score': 0.9945}]
Via the Transformers library
from transformers import pipeline
clf = pipeline("text-classification", model="tanaos/tanaos-spam-detection-v1")
print(clf("You won an IPhone 16! Click here to claim your prize."))
# >>> [{'label': 'spam', 'score': 0.9945}]
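Both calls above return a list of dicts with a `label` and a `score`. The following is a minimal sketch of turning that output into a yes/no decision, assuming the label strings are `spam` and `not_spam` as described earlier. The helper name `is_spam` and the 0.9 threshold are illustrative choices, not part of Artifex or Transformers.

```python
from typing import Any, Dict, List


def is_spam(predictions: List[Dict[str, Any]], threshold: float = 0.9) -> bool:
    """Return True when the first prediction is 'spam' with score >= threshold.

    `predictions` is expected to look like the output shown above,
    e.g. [{'label': 'spam', 'score': 0.9945}]. The threshold is an
    assumption; tune it on your own traffic.
    """
    if not predictions:
        return False
    top = predictions[0]
    return top["label"] == "spam" and float(top["score"]) >= threshold


# Output copied from the example above; no model is loaded here.
sample = [{"label": "spam", "score": 0.9945}]
print(is_spam(sample))  # True
```

Raising the threshold trades missed spam for fewer false positives, which may suit a first-layer filter that hands uncertain cases to a later stage.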
How to fine-tune (without training data)
Use the Artifex library to fine-tune the model for any language other than English, or for custom definitions of spam, by generating synthetic training data on the fly. Install Artifex with:
pip install artifex
Fine-tune to any language
from artifex import Artifex
spam_detection = Artifex().spam_detection
model_output_path = "./output_model/"
spam_detection.train(
    spam_content=[
        "Publicidad comercial no solicitada o proselitismo no comercial",
        "Esquemas fraudulentos, incluidos los de enriquecimiento rápido y los esquemas piramidales",
        "Intentos de phishing, ofertas irreales o anuncios engañosos",
        "Contenido con información engañosa o fraudulenta",
        "Malware o enlaces dañinos",
        "Contenido para adultos o material explícito",
        "Uso excesivo de mayúsculas o signos de puntuación para llamar la atención",
    ],
    language="spanish",
    output_model_path=model_output_path,
)
spam_detection.load(model_output_path)
print(spam_detection("¡Has ganado un IPhone 16! Haz clic aquí para reclamar tu premio."))
# >>> [{'label': 'spam', 'score': 0.9970}]
Fine-tune to custom definitions of spam
from artifex import Artifex
spam_detection = Artifex().spam_detection
model_output_path = "./output_model/"
spam_detection.train(
    spam_content=[
        "Messages that contain excessive emojis or special characters",
        "Messages that promote gambling or betting activities",
        "Messages that include unsolicited attachments or files",
        "Messages that impersonate legitimate organizations or individuals",
        "Messages that use sensationalist language to provoke reactions",
        "Messages that contain misleading links or URLs",
        "Messages that encourage illegal activities or behavior",
    ],
    output_model_path=model_output_path,
)
spam_detection.load(model_output_path)
print(spam_detection("🎉🏖️✈️ Congratulations! You've won a vacation. Click the link below to claim your prize!!!"))
# >>> [{'label': 'spam', 'score': 0.9979}]
Model Description
- Base model: distilbert/distilbert-base-multilingual-cased
- Task: Text classification (spam detection)
- Languages: Multilingual (15+ languages)
- Fine-tuning data: A synthetic, custom dataset of spam and not spam examples.
Training Details
This model was trained using the Artifex Python library
pip install artifex
by providing the following instructions and generating 10,000 synthetic training samples:
from artifex import Artifex
spam_detection = Artifex().spam_detection
spam_detection.train(
    spam_content=[
        "Unsolicited commercial advertisement or non-commercial proselytizing",
        "Fraudulent schemes, including get-rich-quick and pyramid schemes",
        "Phishing attempts, unrealistic offers or announcements",
        "Content with deceptive or misleading information",
        "Malware or harmful links",
        "Adult content or explicit material",
        "Excessive use of capitalization or punctuation to grab attention",
    ],
    num_samples=10000
)
Intended Uses
This model is intended to:
- Serve as a first-layer spam filter for email systems, messaging applications, or any other text-based communication platform.
- Help reduce unwanted or harmful messages by classifying text as spam or not spam.
Not intended for:
- Use in high-stakes scenarios where misclassification could lead to significant consequences without further human review.
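One way to keep a human in the loop, as the note above suggests, is to act automatically only on confident predictions and queue the rest for review. This is a rough sketch under stated assumptions: the `triage` function, the `fake_classifier` stand-in, the action names, and the 0.95 cut-offs are all hypothetical. In practice `classify` would be the `spam_detection` object or the Transformers pipeline shown earlier.

```python
from typing import Any, Callable, Dict, List

# Any callable that maps a string to [{'label': ..., 'score': ...}].
Classifier = Callable[[str], List[Dict[str, Any]]]


def triage(text: str, classify: Classifier,
           block_at: float = 0.95, pass_at: float = 0.95) -> str:
    """Return 'block', 'deliver', or 'review' for a single message."""
    top = classify(text)[0]
    label, score = top["label"], float(top["score"])
    if label == "spam" and score >= block_at:
        return "block"
    if label == "not_spam" and score >= pass_at:
        return "deliver"
    # Anything the model is unsure about goes to a person.
    return "review"


def fake_classifier(text: str) -> List[Dict[str, Any]]:
    """Offline stand-in for the real model (hypothetical scores)."""
    if "prize" in text.lower():
        return [{"label": "spam", "score": 0.99}]
    return [{"label": "not_spam", "score": 0.60}]


print(triage("Click here to claim your prize.", fake_classifier))  # block
print(triage("See you at 5?", fake_classifier))                    # review
```

The two cut-offs are kept separate so that blocking and delivering can be tuned independently, depending on which kind of mistake is more costly for the platform.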