This repository contains a WordPiece tokenizer fine-tuned on the Finance-Instruct-500k dataset, starting from the base model yakul259/english-wordpiece-tokenizer-60k.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.
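As a rough illustration of how a WordPiece vocabulary picks up domain terms, the sketch below trains a tiny tokenizer from scratch with `WordPieceTrainer`. The three-sentence corpus and the vocabulary size of 200 are hypothetical stand-ins, not the Finance-Instruct-500k data or the author's settings, and the sketch does not reproduce how this tokenizer was continued from the base vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Hypothetical toy corpus standing in for the real training data.
corpus = [
    "The bond yield rose after the central bank raised rates.",
    "Dividend yield and earnings per share drive equity valuation.",
    "The portfolio hedged interest rate risk with bond futures.",
]

# The five special tokens named in this card.
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"]

# Empty WordPiece model; unknown pieces map to <unk>.
toy_tokenizer = Tokenizer(WordPiece(unk_token="<unk>"))
toy_tokenizer.pre_tokenizer = Whitespace()

# vocab_size=200 is an assumption chosen to suit the toy corpus.
trainer = WordPieceTrainer(vocab_size=200, special_tokens=special_tokens)
toy_tokenizer.train_from_iterator(corpus, trainer=trainer)

# Segment a short financial phrase with the freshly trained vocabulary.
toy_encoding = toy_tokenizer.encode("bond yield")
print(toy_encoding.tokens)
```

Terms that recur in the corpus tend to be merged into whole-word tokens, while rarer words fall back to `##`-prefixed subword pieces.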
Key Features:
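- Post-processing built around the <cls> and <sep> special tokens.

Dataset split: train

Special tokens:
- <cls>: Classification token
- <sep>: Separator token
- <unk>: Unknown token
- <pad>: Padding token
- <mask>: Masking token (MLM tasks)

Post-processing templates:
- Single sequence: $A:0 <sep>:0 <cls>:2
- Sequence pair: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2

Training:
- Trainer: WordPieceTrainer from the Hugging Face tokenizers library
- Special tokens: <cls>, <sep>, <unk>, <pad>, <mask>

License:
This tokenizer is released under the MIT License.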
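The two post-processing templates given in this card can be tried out with `TemplateProcessing` from the tokenizers library. The following is a minimal sketch, not the author's build script: the eight-entry vocabulary and its token IDs are made up for the demonstration, and only the template strings come from the card.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Hypothetical miniature vocabulary; the real tokenizer uses different IDs.
demo_vocab = {
    "<cls>": 0, "<sep>": 1, "<unk>": 2, "<pad>": 3, "<mask>": 4,
    "bond": 5, "yield": 6, "rises": 7,
}

demo_tokenizer = Tokenizer(WordPiece(vocab=demo_vocab, unk_token="<unk>"))
demo_tokenizer.pre_tokenizer = Whitespace()

# Templates copied from the card: <sep> closes each segment and <cls>
# comes last with its own type ID of 2.
demo_tokenizer.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<cls>", demo_vocab["<cls>"]), ("<sep>", demo_vocab["<sep>"])],
)

single = demo_tokenizer.encode("bond yield")
pair = demo_tokenizer.encode("bond yield", "yield rises")

print(single.tokens, single.type_ids)
# ['bond', 'yield', '<sep>', '<cls>'] [0, 0, 0, 2]
print(pair.tokens, pair.type_ids)
# ['bond', 'yield', '<sep>', 'yield', 'rises', '<sep>', '<cls>'] [0, 0, 0, 1, 1, 1, 2]
```

Unlike BERT-style templates, which open with the classification token, these templates place `<cls>` at the end of the sequence, so downstream code that pools on the classification token should read the last position rather than the first.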
If you use this tokenizer, please cite:
title = Finance WordPiece Tokenizer Fine-tuned on Finance-Instruct-500k
author = yakul259
year = 2025
publisher = Hugging Face
Base model: yakul259/english-wordpiece-tokenizer-60k