Converting kanji to hiragana is usually simpler and faster as a post-processing step with standard Python libraries than trying to force hiragana output during generation in Transformers.
Why “hiragana-only translation” is usually a post-processing problem
Most English→Japanese models are trained to output standard Japanese orthography (kanji + kana). Forcing the translator to emit only hiragana during decoding is possible but fragile and often harms translation quality. A more robust pattern is:
- Translate EN→JA normally (best quality)
- Convert the Japanese output to its reading (よみ)
- Convert the reading to hiragana
- Validate + handle edge cases (symbols, OOV, numbers, names)
This separation keeps translation quality high and makes the “hiragana-only” requirement deterministic.
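Wired together, the separation is just function composition. A minimal sketch with stand-in lambdas (real code would plug in an MT model for translate, Sudachi for to_reading, and jaconv.kata2hira; all names here are illustrative):

```python
def hiragana_pipeline(text_en, translate, to_reading, kata2hira):
    """Steps 1-3 of the pattern; step 4 (validation) is left to the caller."""
    ja = translate(text_en)   # 1) best-quality EN -> JA
    yomi = to_reading(ja)     # 2) katakana reading of the output
    return kata2hira(yomi)    # 3) fold the reading into hiragana

# Trivial stand-ins, purely for illustration:
out = hiragana_pipeline(
    "symbol",
    translate=lambda s: "記号",
    to_reading=lambda s: "キゴウ",
    kata2hira=lambda s: "".join(
        chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in s
    ),
)
print(out)  # きごう
```

Because each stage is a plain function, you can swap the translator or the reading source without touching the rest of the pipeline.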
Recommended approach: translate → Sudachi reading → hiragana
Background: what Sudachi gives you
Sudachi is a Japanese morphological analyzer that tokenizes text and can return each token’s reading form via reading_form(). (Works Applications)
Important detail: Sudachi’s reading form is “furigana” style in katakana, and for unknown words it can return an empty string. (Javadoc)
So you typically do:
- reading_form() → katakana reading
- katakana → hiragana using jaconv.kata2hira()
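For intuition (or to drop the jaconv dependency in a pinch): the katakana → hiragana fold is a fixed Unicode offset. A minimal stand-in for kata2hira, assuming plain full-width kana:

```python
def kata2hira(s: str) -> str:
    # The katakana block (ァ U+30A1 .. ヶ U+30F6) sits exactly 0x60
    # code points above the corresponding hiragana block.
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in s)

print(kata2hira("ワタシハガクセイデス"))  # わたしはがくせいです
```

Prefer jaconv.kata2hira() in real code; it also exposes options (e.g. ignore lists) this sketch omits.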
The pitfalls (and the standard fixes)
1) “きごう” appears (punctuation became a word)
Sudachi can produce a reading for “symbol” tokens; converting that reading yields “きごう”. A common rule is: do not convert readings for tokens whose POS is “補助記号”; keep the original punctuation surface. (Qiita)
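A sketch of that rule; pos0 stands for the first field of Sudachi's part_of_speech() tuple, and the Unicode-category check is a dictionary-free backstop (both names are illustrative):

```python
import unicodedata

def keep_surface(surface: str, pos0: str) -> bool:
    # Skip reading conversion for Sudachi 補助記号 tokens, and for any
    # token made purely of punctuation (P*), symbols (S*), or spaces (Z*).
    if pos0 == "補助記号":
        return True
    return all(unicodedata.category(c)[0] in "PSZ" for c in surface)

print(keep_surface("。", "補助記号"))  # True  -> emit "。", not "きごう"
print(keep_surface("猫", "名詞"))      # False -> convert its reading
```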
2) Residual kanji remains (unknown words / missing reading)
Because OOV tokens can have no reading, you need either:
- a user dictionary / better Sudachi dictionary coverage, or
- a fallback converter that can still transliterate leftover kanji.
A practical fallback is pykakasi; current docs recommend using convert() (new API). (Pykakasi)
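A guarded sketch of that fallback using the new convert() API (the function name is mine; convert() returns per-chunk dicts whose "hira" key holds the hiragana form):

```python
try:
    import pykakasi
    _kks = pykakasi.kakasi()

    def kanji_fallback(s: str) -> str:
        # New-style API: no setMode()/getConverter(), just convert().
        return "".join(item["hira"] for item in _kks.convert(s))
except ImportError:
    def kanji_fallback(s: str) -> str:
        return s  # pass through unchanged when pykakasi is absent
```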
“Production-ish” checklist for the Sudachi pipeline
- Use a Sudachi dictionary with broad coverage (often full)
- For each token:
  - if POS is 補助記号 → keep the surface (avoids “きごう”) (Qiita)
  - else if it has a reading → kata2hira(reading)
  - else keep the surface, or fall back to transliteration (pykakasi)
- Post-validate: reject/flag if any kanji remain (and rerun the fallback)
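The post-validate step can be a one-line regex check; a minimal sketch using the common CJK ranges (Ext-A plus Unified Ideographs):

```python
import re

KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")

def leftover_kanji(s: str):
    # Any hit means a token slipped through without a usable reading.
    return KANJI_RE.findall(s)

print(leftover_kanji("とうきょうは日本のしゅとです。"))  # ['日', '本']
```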
Better approaches (when/why you’d choose them)
1) Use a better translator, then keep the same hiragana conversion step
If translation quality is the bottleneck, fix that first. Your LFM2 results look strong because the model is explicitly MT-tuned and requires the system prompt "Translate to Japanese." plus recommended sampling parameters. (Hugging Face)
This is usually the best “quality per complexity” path: translator improvements upstream, same deterministic downstream conversion.
2) Use pyopenjtalk (OpenJTalk frontend) for pronunciation-driven kana
If your end goal is TTS / pronunciation rather than “reading as written,” pyopenjtalk.g2p(text, kana=True) returns pronunciation in katakana, which you can convert to hiragana. (LESS IS MORE)
Pros:
- Strong pronunciation-oriented processing (good for speech pipelines)
Cons:
- “Pronunciation kana” can differ from the orthographic reading (particles, rendaku, etc.), depending on what you want.
3) Use an off-the-shelf “kanji→kana” wrapper library
If you want a simpler interface than writing Sudachi token loops:
- kanjiconv: converts kanji→hiragana/katakana/romaji and is built on SudachiDict/SudachiPy; it aims to handle proper nouns better as dictionaries update. (GitHub)
Pros:
- Less glue code
- Still Sudachi-based
Cons:
- You inherit its design choices; you may still need custom handling for punctuation/OOV.
4) JS/Frontend: use Kuroshiro
If you need hiragana conversion in a browser/Node environment, Kuroshiro is a standard option for converting Japanese text to hiragana/katakana/romaji (with furigana/okurigana modes). (kuroshiro)
Approaches that look tempting but usually aren’t worth it
Constrained decoding to “ban kanji” during generation
In Transformers you can block tokens using bad_words_ids (applied via the NoBadWordsLogitsProcessor), but to “ban all kanji” you’d need an enormous, tokenizer-dependent list, and the approach can be slow and leaky:
- Large bad_words_ids lists can severely slow generation. (Hugging Face Forums)
- Blocking can be bypassed when words split across tokens (a classic issue). (GitHub)
Regex/grammar constrained generation frameworks (e.g., Outlines) can enforce character-level constraints. (DotTxt AI)
But for Japanese with subword tokenization, “hiragana-only” constraints often become:
- slower,
- harder to integrate,
- and still not guaranteed to preserve translation quality as well as “translate then convert.”
The unavoidable limitation: hiragana-only loses disambiguation
Kanji disambiguates meaning. Hiragana-only output can be ambiguous (homophones), and “best reading” depends on context. Even high-quality reading tools can be imperfect in edge cases (numbers, special terms, rare names), so you should expect:
- occasional reading choices you want to override (e.g., 私 → わたし vs わたくし)
- special handling for numbers/units if needed (common in TTS/learning tooling)
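Reading overrides are easiest as a small surface → reading map consulted before the kana fold; a sketch (the table contents are illustrative defaults, not Sudachi behavior guarantees):

```python
# Preferred readings by surface form, checked before kata2hira().
READING_OVERRIDES = {
    "私": "ワタシ",  # Sudachi may emit ワタクシ; many prefer the casual form
}

def pick_reading(surface: str, reading: str) -> str:
    return READING_OVERRIDES.get(surface, reading)

print(pick_reading("私", "ワタクシ"))  # ワタシ
print(pick_reading("猫", "ネコ"))      # ネコ
```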
Demo
Uses LiquidAI/LFM2-350M-ENJP-MT (requires the system prompt "Translate to Japanese." and the recommended sampling params). (Hugging Face)
Uses tokenizer.apply_chat_template(..., add_generation_prompt=True) for chat formatting. (Hugging Face)
Uses Sudachi reading_form() + part_of_speech() and jaconv.kata2hira() for kana conversion. (Works Applications)
"""
Simple demo (single file, no argparse):
EN -> JA (LiquidAI/LFM2-350M-ENJP-MT) -> hiragana-only (Sudachi reading -> hiragana)
pip deps:
pip install -U torch transformers sentencepiece sudachipy sudachidict_full jaconv
Optional (hard guarantee: remove any leftover kanji if Sudachi can't read an OOV token):
pip install -U pykakasi
Key references:
- LFM2-350M-ENJP-MT model card (required system prompt + recommended sampling):
https://huggingface.co/LiquidAI/LFM2-350M-ENJP-MT
- Transformers chat templating (apply_chat_template / add_generation_prompt):
https://huggingface.co/docs/transformers/en/chat_templating
- SudachiPy Morpheme API (reading_form / part_of_speech):
https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.morpheme.html
- jaconv kata2hira:
https://ikegami-yukino.github.io/jaconv/jaconv.html
"""
from __future__ import annotations
import re
import unicodedata
from typing import List
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sudachipy import Dictionary, SplitMode
import jaconv
# Optional fallback (uncomment if you install pykakasi).
# Uses the new-style API: kakasi().convert() returns per-chunk dicts
# whose "hira" key holds the hiragana form (setMode/getConverter are deprecated).
# import pykakasi
# _kks = pykakasi.kakasi()
# def _kakasi_hira(s: str) -> str:
#     return "".join(item["hira"] for item in _kks.convert(s))
KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]") # CJK Ext-A + Unified
def contains_kanji(s: str) -> bool:
return KANJI_RE.search(s) is not None
def is_punct_or_symbol_only(s: str) -> bool:
# Unicode categories: P* punctuation, S* symbols
for ch in s:
cat = unicodedata.category(ch)
if not (cat.startswith("P") or cat.startswith("S") or ch.isspace()):
return False
return True
class LFM2EnJaTranslator:
"""
LFM2-350M-ENJP-MT requires a system prompt:
"Translate to Japanese." (EN -> JA)
and recommends:
temperature=0.5, top_p=1.0, min_p=0.1, repetition_penalty=1.05
"""
def __init__(self, model_id: str = "LiquidAI/LFM2-350M-ENJP-MT"):
self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
self.dtype = torch.float16 if self.device.type == "cuda" else torch.float32 # CPU float32
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
# Ensure we have a pad token for generation
if self.tokenizer.pad_token_id is None and self.tokenizer.eos_token_id is not None:
self.tokenizer.pad_token = self.tokenizer.eos_token
# transformers recently introduced dtype=... in places; keep a compatibility fallback.
try:
self.model = AutoModelForCausalLM.from_pretrained(model_id, dtype=self.dtype)
except TypeError:
self.model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=self.dtype)
self.model.to(self.device).eval()
@torch.inference_mode()
def translate(self, text_en: str, max_new_tokens: int = 192) -> str:
messages = [
{"role": "system", "content": "Translate to Japanese."},
{"role": "user", "content": text_en},
]
# Chat template -> token IDs (adds the assistant generation header)
input_ids = self.tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to(self.device)
gen_kwargs = dict(
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.5,
top_p=1.0,
min_p=0.1, # may not exist in older transformers; handled below
repetition_penalty=1.05,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
)
# Back-compat: if min_p isn't supported, retry without it.
try:
out = self.model.generate(input_ids, **gen_kwargs)
except TypeError:
gen_kwargs.pop("min_p", None)
out = self.model.generate(input_ids, **gen_kwargs)
# Decode only newly generated tokens (after the prompt)
new_tokens = out[0, input_ids.shape[-1]:]
ja = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
return ja
class HiraganaOnlyConverter:
"""
Japanese -> hiragana-only (reading-based):
- Use Sudachi reading_form() per token (usually katakana reading).
- Convert katakana -> hiragana via jaconv.kata2hira().
- Keep punctuation as punctuation to avoid "キゴウ" -> "きごう".
- Optional: customize a couple of common reading preferences.
"""
def __init__(self):
# Use sudachidict_full for better coverage; SplitMode.C is a stable default.
self.tok = Dictionary(dict="full").create(mode=SplitMode.C)
def to_hiragana(self, ja: str) -> str:
parts: List[str] = []
for m in self.tok.tokenize(ja):
surface = m.surface()
reading = (m.reading_form() or "")
pos = m.part_of_speech() or ()
# 1) Keep symbols/punctuation as-is (prevents "きごう" artifacts)
if (pos and pos[0] == "補助記号") or is_punct_or_symbol_only(surface) or reading == "キゴウ":
parts.append(surface)
continue
# 2) Optional style overrides (comment out if you don't want them)
# Sudachi may output "わたくし" for 私; many prefer casual "わたし".
if surface == "私":
parts.append("わたし")
continue
# 3) Normal path: reading (katakana) -> hiragana
if reading and reading != "*":
parts.append(jaconv.kata2hira(reading))
else:
                # No reading: keep the surface (or use the pykakasi fallback if enabled)
                # if contains_kanji(surface):
                #     parts.append(_kakasi_hira(surface))
                # else:
parts.append(surface)
out = "".join(parts)
        # Optional final hard guarantee if you enabled pykakasi:
        # if contains_kanji(out):
        #     out = _kakasi_hira(out)
return out
def main():
translator = LFM2EnJaTranslator()
hira_conv = HiraganaOnlyConverter()
examples = [
"I love natural language processing.",
"Tokyo is the capital of Japan.",
"Please translate this into Japanese.",
"The quick brown fox jumps over the lazy dog.",
]
for en in examples:
ja = translator.translate(en)
hira = hira_conv.to_hiragana(ja)
# If you didn't enable pykakasi fallback, this warning tells you where Sudachi had no reading.
if contains_kanji(hira):
print("[warn] Kanji remained (likely OOV token with no reading). Enable pykakasi or add a dictionary entry.")
print("EN :", en)
print("JA :", ja)
print("HIRA:", hira)
print("-" * 70)
if __name__ == "__main__":
main()