Best model for translating English to Japanese

Converting kanji to hiragana is often simpler and faster when done as a post-processing step using a standard Python library rather than relying on Transformers.


Why “hiragana-only translation” is usually a post-processing problem

Most English→Japanese models are trained to output standard Japanese orthography (kanji + kana). Forcing the translator to emit only hiragana during decoding is possible but fragile and often harms translation quality. A more robust pattern is:

  1. Translate EN→JA normally (best quality)
  2. Convert the Japanese output to its reading (よみ)
  3. Convert the reading to hiragana
  4. Validate + handle edge cases (symbols, OOV, numbers, names)

This separation keeps translation quality high and makes the “hiragana-only” requirement deterministic.
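The four steps compose into one small function. The sketch below uses placeholder callables for the translator and reading tools (none of these names are real library APIs, you supply your own implementations); only the validation step is concrete:

```python
import re

# Basic CJK Unified Ideographs block; good enough for a residual-kanji check.
KANJI_RE = re.compile(r"[\u4E00-\u9FFF]")

def to_hiragana_pipeline(text_en, translate, reading, kata2hira):
    """Steps 1-4 above; the three callables are whatever tools you plug in."""
    ja = translate(text_en)        # 1) EN -> JA with any MT model
    yomi = reading(ja)             # 2) katakana reading (e.g., via Sudachi)
    hira = kata2hira(yomi)         # 3) katakana -> hiragana (e.g., jaconv)
    if KANJI_RE.search(hira):      # 4) validate: flag residual kanji
        raise ValueError("kanji remained; add a fallback or dictionary entry")
    return hira
```

Keeping step 4 as a hard check (raise or flag) is what makes the hiragana-only guarantee enforceable rather than hoped-for.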


Recommended approach: translate → Sudachi reading → hiragana

Background: what Sudachi gives you

Sudachi is a Japanese morphological analyzer that tokenizes text and can return each token’s reading form via reading_form(). (Works Applications)
Important detail: Sudachi’s reading form is “furigana” style in katakana, and for unknown words it can return an empty string. (Javadoc)
So you typically do:

  • reading_form() → katakana reading
  • katakana → hiragana using jaconv.kata2hira() (jaconv)
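Under the hood, katakana→hiragana is a fixed Unicode codepoint shift. A rough stdlib-only equivalent of jaconv.kata2hira() (jaconv also handles a few edge characters, so prefer the library in real code):

```python
def kata2hira(text: str) -> str:
    """Shift full-width katakana (U+30A1-U+30F6) down by 0x60 to hiragana.

    Characters outside that range (punctuation, the long-vowel mark ー,
    ASCII) pass through unchanged.
    """
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )
```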

The pitfalls (and the standard fixes)

1) “きごう” appears (punctuation became a word)

Sudachi can produce a reading for “symbol” tokens; converting that reading yields “きごう”. A common rule is: do not convert readings for tokens whose POS is “補助記号”; keep the original punctuation surface. (Qiita)
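A dictionary-independent way to implement that rule is a Unicode-category check (the demo at the bottom uses the same helper): tokens made entirely of punctuation/symbols are passed through unconverted.

```python
import unicodedata

def is_punct_or_symbol_only(token: str) -> bool:
    """True if every char is punctuation (P*), a symbol (S*), or whitespace."""
    return all(
        unicodedata.category(ch).startswith(("P", "S")) or ch.isspace()
        for ch in token
    )
```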

2) Residual kanji remains (unknown words / missing reading)

Because OOV tokens can have no reading, you need either:

  • a user dictionary / better Sudachi dictionary coverage, or
  • a fallback converter that can still transliterate leftover kanji.

A practical fallback is pykakasi; current docs recommend using convert() (new API). (Pykakasi)
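With the current pykakasi API the fallback is a few lines. A sketch with a guarded import, so it degrades to a pass-through when pykakasi is not installed:

```python
def kakasi_fallback(text: str) -> str:
    """Transliterate leftover kanji to hiragana via pykakasi's convert() API.

    Returns the input unchanged if pykakasi is unavailable.
    """
    try:
        import pykakasi
    except ImportError:
        return text
    kks = pykakasi.kakasi()
    # convert() returns a list of dicts; "hira" holds the hiragana reading.
    return "".join(item["hira"] for item in kks.convert(text))
```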

“Production-ish” checklist for the Sudachi pipeline

  • Use Sudachi dictionary with broad coverage (often full)

  • For each token:

    • if POS is 補助記号 → keep surface (avoid “きごう”) (Qiita)
    • else if has reading → kata2hira(reading) (jaconv)
    • else keep surface or fallback transliteration (pykakasi)
  • Post-validate: reject/flag if any kanji remain (and rerun fallback)


Better approaches (when/why you’d choose them)

1) Use a better translator, then keep the same hiragana conversion step

If translation quality is the bottleneck, fix that first. Your LFM2 results look strong because the model is explicitly MT-tuned and requires the system prompt "Translate to Japanese." plus recommended sampling parameters. (Hugging Face)
This is usually the best “quality per complexity” path: translator improvements upstream, same deterministic downstream conversion.

2) Use pyopenjtalk (OpenJTalk frontend) for pronunciation-driven kana

If your end goal is TTS / pronunciation rather than “reading as written,” pyopenjtalk.g2p(text, kana=True) returns pronunciation in katakana, which you can convert to hiragana. (LESS IS MORE)
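A minimal sketch of that path, guarded so it degrades gracefully (pyopenjtalk needs a compiled runtime plus a dictionary download, so it may not be installed):

```python
def pronunciation_hiragana(text: str):
    """Pronunciation-oriented kana via pyopenjtalk, then katakana -> hiragana.

    Returns None when pyopenjtalk is not available.
    """
    try:
        import pyopenjtalk
    except ImportError:
        return None
    kana = pyopenjtalk.g2p(text, kana=True)  # katakana pronunciation string
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in kana
    )
```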
Pros:
  • Strong pronunciation-oriented processing (good for speech pipelines)
Cons:
  • “Pronunciation kana” can differ from orthographic reading (particles, rendaku, etc.), depending on what you want.

3) Use an off-the-shelf “kanji→kana” wrapper library

If you want a simpler interface than writing Sudachi token loops:

  • kanjiconv: converts kanji→hiragana/katakana/romaji and is built on SudachiDict/SudachiPy; it aims to handle proper nouns better as dictionaries update. (GitHub)

Pros:
  • Less glue code
  • Still Sudachi-based
Cons:
  • You inherit its design choices; you may still need custom handling for punctuation/OOV.

4) JS/Frontend: use Kuroshiro

If you need hiragana conversion in a browser/Node environment, Kuroshiro is a standard option for converting Japanese text to hiragana/katakana/romaji (with furigana/okurigana modes). (kuroshiro)


Approaches that look tempting but usually aren’t worth it

Constrained decoding to “ban kanji” during generation

In Transformers you can block token sequences using bad_words_ids (implemented by the NoBadWordsLogitsProcessor), but for “ban all kanji” you’d need an enormous list (tokenization-dependent), and it can be slow and leaky:

  • Large bad_words_ids lists can severely slow generation. (Hugging Face Forums)
  • Blocking can be bypassed when words split across tokens (classic issue). (GitHub)
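A quick count shows why the ban list explodes: the basic CJK Unified Ideographs block alone, before extensions and before multiplying by every subword the tokenizer forms from those characters, is over 20,000 codepoints.

```python
# Codepoints in the basic CJK Unified Ideographs block (U+4E00-U+9FFF).
kanji_basic = range(0x4E00, 0x9FFF + 1)
print(len(kanji_basic))  # 20992
```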

Regex/grammar constrained generation frameworks (e.g., Outlines) can enforce character-level constraints. (DotTxt AI)
But for Japanese with subword tokenization, “hiragana-only” constraints often become:

  • slower,
  • harder to integrate,
  • and still not guaranteed to preserve translation quality as well as “translate then convert.”

The unavoidable limitation: hiragana-only loses disambiguation

Kanji disambiguates meaning. Hiragana-only output can be ambiguous (homophones), and “best reading” depends on context. Even high-quality reading tools can be imperfect in edge cases (numbers, special terms, rare names), so you should expect:

  • occasional reading choices you want to override (e.g., 私 → わたし vs わたくし)
  • special handling for numbers/units if needed (common in TTS/learning tooling)
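As a concrete illustration of the homophone problem, distinct common words collapse to the same string once kanji are removed (the glosses below are illustrative data, not a library):

```python
# Distinct words that become identical in hiragana-only output.
homophones = {
    "はし": ["橋 (bridge)", "箸 (chopsticks)", "端 (edge)"],
    "かみ": ["神 (god)", "紙 (paper)", "髪 (hair)"],
}
# Readers (and downstream systems) must recover the meaning from context alone.
```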

Demo

Uses LiquidAI/LFM2-350M-ENJP-MT (requires the system prompt "Translate to Japanese." and the recommended sampling params). (Hugging Face)
Uses tokenizer.apply_chat_template(..., add_generation_prompt=True) for chat formatting. (Hugging Face)
Uses Sudachi reading_form() + part_of_speech() and jaconv.kata2hira() for kana conversion. (Works Applications)

"""
Simple demo (single file, no argparse):
EN -> JA (LiquidAI/LFM2-350M-ENJP-MT) -> hiragana-only (Sudachi reading -> hiragana)

pip deps:
  pip install -U torch transformers sentencepiece sudachipy sudachidict_full jaconv

Optional (hard guarantee: remove any leftover kanji if Sudachi can't read an OOV token):
  pip install -U pykakasi

Key references:
  - LFM2-350M-ENJP-MT model card (required system prompt + recommended sampling):
    https://huggingface.co/LiquidAI/LFM2-350M-ENJP-MT
  - Transformers chat templating (apply_chat_template / add_generation_prompt):
    https://huggingface.co/docs/transformers/en/chat_templating
  - SudachiPy Morpheme API (reading_form / part_of_speech):
    https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.morpheme.html
  - jaconv kata2hira:
    https://ikegami-yukino.github.io/jaconv/jaconv.html
"""

from __future__ import annotations

import re
import unicodedata
from typing import List

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from sudachipy import Dictionary, SplitMode
import jaconv

# Optional fallback (uncomment if you install pykakasi; uses the current convert() API)
# from pykakasi import kakasi
# _kks = kakasi()
#
# class _KakasiFallback:
#     """Expose the old converter .do() interface on top of convert()."""
#     def do(self, text: str) -> str:
#         return "".join(item["hira"] for item in _kks.convert(text))
#
# _kakasi = _KakasiFallback()

KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")  # CJK Ext-A + Unified


def contains_kanji(s: str) -> bool:
    return KANJI_RE.search(s) is not None


def is_punct_or_symbol_only(s: str) -> bool:
    # Unicode categories: P* punctuation, S* symbols
    for ch in s:
        cat = unicodedata.category(ch)
        if not (cat.startswith("P") or cat.startswith("S") or ch.isspace()):
            return False
    return True


class LFM2EnJaTranslator:
    """
    LFM2-350M-ENJP-MT requires a system prompt:
      "Translate to Japanese."  (EN -> JA)
    and recommends:
      temperature=0.5, top_p=1.0, min_p=0.1, repetition_penalty=1.05
    """

    def __init__(self, model_id: str = "LiquidAI/LFM2-350M-ENJP-MT"):
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.dtype = torch.float16 if self.device.type == "cuda" else torch.float32  # CPU float32

        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Ensure we have a pad token for generation
        if self.tokenizer.pad_token_id is None and self.tokenizer.eos_token_id is not None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # Newer transformers releases accept dtype=...; older ones expect torch_dtype=..., so keep a fallback.
        try:
            self.model = AutoModelForCausalLM.from_pretrained(model_id, dtype=self.dtype)
        except TypeError:
            self.model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=self.dtype)

        self.model.to(self.device).eval()

    @torch.inference_mode()
    def translate(self, text_en: str, max_new_tokens: int = 192) -> str:
        messages = [
            {"role": "system", "content": "Translate to Japanese."},
            {"role": "user", "content": text_en},
        ]

        # Chat template -> token IDs (adds the assistant generation header)
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(self.device)

        gen_kwargs = dict(
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.5,
            top_p=1.0,
            min_p=0.1,               # may not exist in older transformers; handled below
            repetition_penalty=1.05,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
        )

        # Back-compat: if min_p isn't supported, retry without it.
        try:
            out = self.model.generate(input_ids, **gen_kwargs)
        except TypeError:
            gen_kwargs.pop("min_p", None)
            out = self.model.generate(input_ids, **gen_kwargs)

        # Decode only newly generated tokens (after the prompt)
        new_tokens = out[0, input_ids.shape[-1]:]
        ja = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        return ja


class HiraganaOnlyConverter:
    """
    Japanese -> hiragana-only (reading-based):
      - Use Sudachi reading_form() per token (usually katakana reading).
      - Convert katakana -> hiragana via jaconv.kata2hira().
      - Keep punctuation as punctuation to avoid "キゴウ" -> "きごう".
      - Optional: customize a couple of common reading preferences.
    """

    def __init__(self):
        # Use sudachidict_full for better coverage; SplitMode.C is a stable default.
        self.tok = Dictionary(dict="full").create(mode=SplitMode.C)

    def to_hiragana(self, ja: str) -> str:
        parts: List[str] = []
        for m in self.tok.tokenize(ja):
            surface = m.surface()
            reading = (m.reading_form() or "")
            pos = m.part_of_speech() or ()

            # 1) Keep symbols/punctuation as-is (prevents "きごう" artifacts)
            if (pos and pos[0] == "補助記号") or is_punct_or_symbol_only(surface) or reading == "キゴウ":
                parts.append(surface)
                continue

            # 2) Optional style overrides (comment out if you don't want them)
            # Sudachi may output "わたくし" for 私; many prefer casual "わたし".
            if surface == "私":
                parts.append("わたし")
                continue

            # 3) Normal path: reading (katakana) -> hiragana
            if reading and reading != "*":
                parts.append(jaconv.kata2hira(reading))
            else:
                # No reading: keep surface (or use pykakasi fallback if enabled)
                # if contains_kanji(surface):
                #     parts.append(_kakasi.do(surface))
                # else:
                parts.append(surface)

        out = "".join(parts)

        # Optional final hard guarantee if you enabled pykakasi:
        # if contains_kanji(out):
        #     out = _kakasi.do(out)

        return out


def main():
    translator = LFM2EnJaTranslator()
    hira_conv = HiraganaOnlyConverter()

    examples = [
        "I love natural language processing.",
        "Tokyo is the capital of Japan.",
        "Please translate this into Japanese.",
        "The quick brown fox jumps over the lazy dog.",
    ]

    for en in examples:
        ja = translator.translate(en)
        hira = hira_conv.to_hiragana(ja)

        # If you didn't enable pykakasi fallback, this warning tells you where Sudachi had no reading.
        if contains_kanji(hira):
            print("[warn] Kanji remained (likely OOV token with no reading). Enable pykakasi or add a dictionary entry.")

        print("EN  :", en)
        print("JA  :", ja)
        print("HIRA:", hira)
        print("-" * 70)


if __name__ == "__main__":
    main()