I am working on a project to translate text from English to Japanese.
I’ve read about llama being trained on Japanese text but I am not sure if there are any better models out there. Any suggestions please?
I tried the Helsinki-NLP model: Helsinki-NLP/opus-mt-en-jap on Hugging Face.
I tried translating a few simple words and numbers from English to Japanese and didn’t get good results. Google Translate and DeepL gave me correct results. I don’t know why — the inputs were pretty simple, and it doesn’t seem like the model would need domain knowledge for such a basic task. If you have come across a better model for English-to-Japanese translation, please reply to this thread.
LLMs in the 7B to 14B parameter range are the easiest to use in my experience. Gemma 2, Qwen 2.5 Instruct, and Mistral NeMo are all good for Japanese and English.
Of these, Gemma 2 has a small (2B) variant that has been trained specifically for Japanese, so you might want to try that.
I think that if there were a model specifically for translation, it would be even smaller, but I don’t know much about specialized models.
Edit:
I am a Japanese speaker who works with LLMs, so please feel free to ask me anything related to Japanese. There are many (presumably) Japanese people on the Hub, but they are often working independently, so it is difficult to find them.
I am also working independently.
Hi John, I am just learning Japanese for fun, and I thought about building a pipeline I could use to practice my speaking. Currently I have a microphone module (from my telephone) feeding audio data into a Whisper ASR. I was particularly surprised when it spat out kanji and hiragana, so I wanted to take it a bit further. I figured WhisperSpeech might be a suitable TTS to train in Japanese, but I know absolutely zero about doing that. I also like that this LLM understands Japanese and is not limited to Latin-based tokens. Might you have any tips for me?
I don’t often get the chance to work with speech models, but from a quick search, it seems that WhisperSpeech has a good reputation in Japan.
However, I’ve found something even better. It’s as follows. Even among native Japanese speakers, there are not many people who can pronounce it this beautifully.
It’s in the realm of voice actors or announcers… it may even be overkill for learning.
Hmm… Not only is it a mistranslation, but it’s also a strange mistranslation…
That’s a great project! While LLaMA models have some multilingual capabilities (especially in the more recent versions like LLaMA 3), they’re not specifically optimized for high-quality English–Japanese translation.
I am planning to build software for real-time translation from Japanese to English and English to Japanese. Can you suggest some good models for it? Mainly at the interview or business level.
For real-time performance, you can combine a responsive model with multiple larger translation models of varying sizes, or use compact, high-performance LLMs (such as Liquid LFM2)…
The key will likely lie in application-layer optimizations rather than in the choice of translation model itself. For general resources, refer to this.
what is to be done to eliminate the kanji and have only hiragana as the translation output?
Converting kanji to hiragana is often simpler and faster when done as a post-processing step using a standard Python library rather than relying on Transformers.
Why “hiragana-only translation” is usually a post-processing problem
Most English→Japanese models are trained to output standard Japanese orthography (kanji + kana). Forcing the translator to emit only hiragana during decoding is possible but fragile and often harms translation quality. A more robust pattern is:
- Translate EN→JA normally (best quality)
- Convert the Japanese output to its reading (よみ)
- Convert the reading to hiragana
- Validate + handle edge cases (symbols, OOV, numbers, names)
This separation keeps translation quality high and makes the “hiragana-only” requirement deterministic.
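The final validation step can be a simple stdlib check for leftover kanji. A minimal sketch — the ranges below cover CJK Unified Ideographs plus Extension A, which is enough for typical MT output:

```python
import re

# CJK Unified Ideographs (U+4E00-U+9FFF) + Extension A (U+3400-U+4DBF)
KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")

def contains_kanji(s: str) -> bool:
    return KANJI_RE.search(s) is not None

print(contains_kanji("わたしは漢字"))  # → True
print(contains_kanji("ひらがなだけ"))  # → False
```

If this fires after conversion, you know exactly which outputs need a fallback pass or a dictionary entry.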
Recommended approach: translate → Sudachi reading → hiragana
Background: what Sudachi gives you
Sudachi is a Japanese morphological analyzer that tokenizes text and can return each token’s reading form via reading_form(). (Works Applications)
Important detail: Sudachi’s reading form is “furigana” style in katakana, and for unknown words it can return an empty string. (Javadoc)
So you typically do:
- reading_form() → katakana reading
- katakana → hiragana using jaconv.kata2hira() (jaconv docs)
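If you want to see what that second step actually does (or avoid the jaconv dependency), plain katakana→hiragana is a fixed Unicode offset. A minimal stdlib sketch — note that jaconv also handles edge cases (e.g. half-width kana) that this does not:

```python
def kata2hira(text: str) -> str:
    # Katakana U+30A1..U+30F6 sits exactly 0x60 above hiragana U+3041..U+3096
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

print(kata2hira("キゴウ"))    # → きごう
print(kata2hira("abc。"))    # non-katakana passes through unchanged
```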
The pitfalls (and the standard fixes)
1) “きごう” appears (punctuation became a word)
Sudachi can produce a reading for “symbol” tokens; converting that reading yields “きごう”. A common rule is: do not convert readings for tokens whose POS is “補助記号”; keep the original punctuation surface. (Qiita)
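One way to implement that rule without hard-coding punctuation lists is a Unicode-category check. A sketch — in practice you would combine it with the POS test, since Sudachi’s 補助記号 tag is the authoritative signal:

```python
import unicodedata

def is_punct_or_symbol_only(s: str) -> bool:
    # True when every char is punctuation (P*), a symbol (S*), or whitespace
    return bool(s) and all(
        unicodedata.category(ch)[0] in "PS" or ch.isspace()
        for ch in s
    )

print(is_punct_or_symbol_only("、。！"))  # → True
print(is_punct_or_symbol_only("記号"))   # → False
```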
2) Residual kanji remains (unknown words / missing reading)
Because OOV tokens can have no reading, you need either:
- a user dictionary / better Sudachi dictionary coverage, or
- a fallback converter that can still transliterate leftover kanji.
A practical fallback is pykakasi; current docs recommend using convert() (new API). (Pykakasi)
“Production-ish” checklist for the Sudachi pipeline
- Use a Sudachi dictionary with broad coverage (often full)
- For each token:
  - if POS is 補助記号 → keep the surface form (avoids “きごう”) (Qiita)
  - else if it has a reading → kata2hira(reading) (jaconv docs)
  - else keep the surface, or fall back to transliteration (pykakasi)
- Post-validate: reject/flag the output if any kanji remain (and rerun the fallback)
Better approaches (when/why you’d choose them)
1) Use a better translator, then keep the same hiragana conversion step
If translation quality is the bottleneck, fix that first. Your LFM2 results look strong because the model is explicitly MT-tuned and requires the system prompt "Translate to Japanese." plus recommended sampling parameters. (Hugging Face)
This is usually the best “quality per complexity” path: translator improvements upstream, same deterministic downstream conversion.
2) Use pyopenjtalk (OpenJTalk frontend) for pronunciation-driven kana
If your end goal is TTS / pronunciation rather than “reading as written,” pyopenjtalk.g2p(text, kana=True) returns pronunciation in katakana, which you can convert to hiragana. (LESS IS MORE)
Pros:
- Strong pronunciation-oriented processing (good for speech pipelines)
Cons:
- “Pronunciation kana” can differ from the orthographic reading (particles, rendaku, etc.), depending on what you want.
3) Use an off-the-shelf “kanji→kana” wrapper library
If you want a simpler interface than writing Sudachi token loops:
- kanjiconv: converts kanji→hiragana/katakana/romaji and is built on SudachiDict/SudachiPy; it aims to handle proper nouns better as dictionaries update. (GitHub)
Pros:
- Less glue code
- Still Sudachi-based
Cons:
- You inherit its design choices; you may still need custom handling for punctuation/OOV.
4) JS/Frontend: use Kuroshiro
If you need hiragana conversion in a browser/Node environment, Kuroshiro is a standard option for converting Japanese text to hiragana/katakana/romaji (with furigana/okurigana modes). (kuroshiro)
Approaches that look tempting but usually aren’t worth it
Constrained decoding to “ban kanji” during generation
In Transformers you can block tokens using bad_words_ids / “NoBadWords” processors, but “ban all kanji” would need an enormous, tokenization-dependent list, and it can be slow and leaky:
- Large bad_words_ids lists can severely slow generation. (Hugging Face Forums)
- Blocking can be bypassed when words split across tokens (a classic issue). (GitHub)
Regex/grammar constrained generation frameworks (e.g., Outlines) can enforce character-level constraints. (DotTxt AI)
But for Japanese with subword tokenization, “hiragana-only” constraints often become:
- slower,
- harder to integrate,
- and still not guaranteed to preserve translation quality as well as “translate then convert.”
The unavoidable limitation: hiragana-only loses disambiguation
Kanji disambiguates meaning. Hiragana-only output can be ambiguous (homophones), and “best reading” depends on context. Even high-quality reading tools can be imperfect in edge cases (numbers, special terms, rare names), so you should expect:
- occasional reading choices you want to override (e.g., 私 → わたし vs わたくし)
- special handling for numbers/units if needed (common in TTS/learning tooling)
Demo
Uses LiquidAI/LFM2-350M-ENJP-MT (requires the system prompt "Translate to Japanese." and the recommended sampling params). (Hugging Face)
Uses tokenizer.apply_chat_template(..., add_generation_prompt=True) for chat formatting. (Hugging Face)
Uses Sudachi reading_form() + part_of_speech() and jaconv.kata2hira() for kana conversion. (Works Applications)
"""
Simple demo (single file, no argparse):
EN -> JA (LiquidAI/LFM2-350M-ENJP-MT) -> hiragana-only (Sudachi reading -> hiragana)
pip deps:
pip install -U torch transformers sentencepiece sudachipy sudachidict_full jaconv
Optional (hard guarantee: remove any leftover kanji if Sudachi can't read an OOV token):
pip install -U pykakasi
Key references:
- LFM2-350M-ENJP-MT model card (required system prompt + recommended sampling):
https://huggingface.co/LiquidAI/LFM2-350M-ENJP-MT
- Transformers chat templating (apply_chat_template / add_generation_prompt):
https://huggingface.co/docs/transformers/en/chat_templating
- SudachiPy Morpheme API (reading_form / part_of_speech):
https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.morpheme.html
- jaconv kata2hira:
https://ikegami-yukino.github.io/jaconv/jaconv.html
"""
from __future__ import annotations

import re
import unicodedata
from typing import List

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sudachipy import Dictionary, SplitMode
import jaconv

# Optional fallback (uncomment if you install pykakasi)
# Note: the old setMode()/getConverter() API is deprecated; use convert().
# import pykakasi
# _kks = pykakasi.kakasi()
# def _kakasi_to_hira(s: str) -> str:
#     # convert() returns per-chunk dicts; "hira" holds the hiragana reading
#     return "".join(item["hira"] for item in _kks.convert(s))

KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")  # CJK Ext-A + Unified


def contains_kanji(s: str) -> bool:
    return KANJI_RE.search(s) is not None


def is_punct_or_symbol_only(s: str) -> bool:
    # Unicode categories: P* punctuation, S* symbols
    for ch in s:
        cat = unicodedata.category(ch)
        if not (cat.startswith("P") or cat.startswith("S") or ch.isspace()):
            return False
    return True


class LFM2EnJaTranslator:
    """
    LFM2-350M-ENJP-MT requires a system prompt:
        "Translate to Japanese."   (EN -> JA)
    and recommends:
        temperature=0.5, top_p=1.0, min_p=0.1, repetition_penalty=1.05
    """

    def __init__(self, model_id: str = "LiquidAI/LFM2-350M-ENJP-MT"):
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.dtype = torch.float16 if self.device.type == "cuda" else torch.float32  # CPU: float32
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Ensure we have a pad token for generation
        if self.tokenizer.pad_token_id is None and self.tokenizer.eos_token_id is not None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # transformers recently introduced dtype=... in places; keep a compatibility fallback.
        try:
            self.model = AutoModelForCausalLM.from_pretrained(model_id, dtype=self.dtype)
        except TypeError:
            self.model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=self.dtype)
        self.model.to(self.device).eval()

    @torch.inference_mode()
    def translate(self, text_en: str, max_new_tokens: int = 192) -> str:
        messages = [
            {"role": "system", "content": "Translate to Japanese."},
            {"role": "user", "content": text_en},
        ]
        # Chat template -> token IDs (adds the assistant generation header)
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(self.device)
        gen_kwargs = dict(
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.5,
            top_p=1.0,
            min_p=0.1,  # may not exist in older transformers; handled below
            repetition_penalty=1.05,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
        )
        # Back-compat: if min_p isn't supported, retry without it.
        try:
            out = self.model.generate(input_ids, **gen_kwargs)
        except TypeError:
            gen_kwargs.pop("min_p", None)
            out = self.model.generate(input_ids, **gen_kwargs)
        # Decode only newly generated tokens (after the prompt)
        new_tokens = out[0, input_ids.shape[-1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


class HiraganaOnlyConverter:
    """
    Japanese -> hiragana-only (reading-based):
      - Use Sudachi reading_form() per token (usually a katakana reading).
      - Convert katakana -> hiragana via jaconv.kata2hira().
      - Keep punctuation as punctuation to avoid "キゴウ" -> "きごう".
      - Optional: customize a couple of common reading preferences.
    """

    def __init__(self):
        # Use sudachidict_full for better coverage; SplitMode.C is a stable default.
        self.tok = Dictionary(dict="full").create(mode=SplitMode.C)

    def to_hiragana(self, ja: str) -> str:
        parts: List[str] = []
        for m in self.tok.tokenize(ja):
            surface = m.surface()
            reading = m.reading_form() or ""
            pos = m.part_of_speech() or ()
            # 1) Keep symbols/punctuation as-is (prevents "きごう" artifacts)
            if (pos and pos[0] == "補助記号") or is_punct_or_symbol_only(surface) or reading == "キゴウ":
                parts.append(surface)
                continue
            # 2) Optional style overrides (comment out if you don't want them)
            #    Sudachi may output "わたくし" for 私; many prefer casual "わたし".
            if surface == "私":
                parts.append("わたし")
                continue
            # 3) Normal path: reading (katakana) -> hiragana
            if reading and reading != "*":
                parts.append(jaconv.kata2hira(reading))
            else:
                # No reading: keep the surface (or use the pykakasi fallback if enabled)
                # if contains_kanji(surface):
                #     parts.append(_kakasi_to_hira(surface))
                # else:
                parts.append(surface)
        out = "".join(parts)
        # Optional final hard guarantee if you enabled pykakasi:
        # if contains_kanji(out):
        #     out = _kakasi_to_hira(out)
        return out


def main():
    translator = LFM2EnJaTranslator()
    hira_conv = HiraganaOnlyConverter()
    examples = [
        "I love natural language processing.",
        "Tokyo is the capital of Japan.",
        "Please translate this into Japanese.",
        "The quick brown fox jumps over the lazy dog.",
    ]
    for en in examples:
        ja = translator.translate(en)
        hira = hira_conv.to_hiragana(ja)
        # Without the pykakasi fallback, this warning flags where Sudachi had no reading.
        if contains_kanji(hira):
            print("[warn] Kanji remained (likely an OOV token with no reading). "
                  "Enable pykakasi or add a dictionary entry.")
        print("EN  :", en)
        print("JA  :", ja)
        print("HIRA:", hira)
        print("-" * 70)


if __name__ == "__main__":
    main()
