I am working on a project to translate text from English to Japanese.
I’ve read about llama being trained on Japanese text but I am not sure if there are any better models out there. Any suggestions please?
I tried the Helsinki-NLP model: Helsinki-NLP/opus-mt-en-jap on Hugging Face.
I tried translating a few simple words and numbers from English to Japanese and didn’t get good results. Google Translate and DeepL gave me correct results. I don’t know why — the inputs were pretty simple, and it doesn’t seem like the model would need domain knowledge for such a basic task. If you have come across a better model for English-to-Japanese translation, please reply to this thread.
LLMs in the 7B to 14B parameter range are the easiest to use in my experience. Gemma 2, Qwen 2.5 Instruct, and Mistral NeMo are all good for Japanese and English.
Of these, Gemma 2 has a small (2B) variant that has been trained specifically for Japanese, so you might want to try that.
I think that if there were a model specifically for translation, it would be even smaller, but I don’t know much about specialized models.
Edit:
I am a Japanese speaker who works with LLMs, so please feel free to ask me anything related to Japanese. There are many (presumably) Japanese people on the Hub, but they are often working independently, so it is difficult to find them.
I am also working independently.
Hi John, I am just learning Japanese for fun, and I thought about building a pipeline I could use to practice my speaking. Currently I have a microphone module (from my telephone) feeding audio data into a Whisper ASR. I was particularly surprised when it spat out kanji and hiragana, so I wanted to take it a bit further. I figured WhisperSpeech might be a suitable TTS to train in Japanese, but I know absolutely zero about doing that. I also like that this LLM understands Japanese and is not limited to Latin-based tokens. Might you have any tips for me?
I don’t often get the chance to work with speech models, but from a quick search, it seems that WhisperSpeech has a good reputation in Japan.
However, I’ve found something even better. It’s as follows. Even among native Japanese speakers, there are not many people who can pronounce it this beautifully.
It’s in the realm of voice actors or announcers… it may even be overkill for learning.
Hmm… Not only is it a mistranslation, but it’s also a strange mistranslation…
That’s a great project! While LLaMA models have some multilingual capabilities (especially in the more recent versions like LLaMA 3), they’re not specifically optimized for high-quality English–Japanese translation.
I am planning to build software for real-time translation from Japanese to English and English to Japanese. Can you suggest some good models for it? Mainly at the interview or business level.
For real-time performance, you can combine a responsive model with multiple larger translation models of varying sizes, or use compact, high-performance LLMs (such as Liquid LFM2)…
The key will likely lie in application-layer optimizations rather than in the choice of translation model itself. For general resources, refer to this.
what is to be done to eliminate the kanji and have only hiragana as the translation output?
Converting kanji to hiragana is often simpler and faster when done as a post-processing step using a standard Python library rather than relying on Transformers.
Why “hiragana-only translation” is usually a post-processing problem
Most English→Japanese models are trained to output standard Japanese orthography (kanji + kana). Forcing the translator to emit only hiragana during decoding is possible but fragile and often harms translation quality. A more robust pattern is:
- Translate EN→JA normally (best quality)
- Convert the Japanese output to its reading (よみ)
- Convert the reading to hiragana
- Validate + handle edge cases (symbols, OOV, numbers, names)
This separation keeps translation quality high and makes the “hiragana-only” requirement deterministic.
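The final validation step can be a simple stdlib check for leftover kanji. A minimal sketch — the ranges below cover CJK Unified Ideographs plus Extension A, which is enough for typical MT output:

```python
import re

# CJK Unified Ideographs (U+4E00-U+9FFF) + Extension A (U+3400-U+4DBF)
KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")

def contains_kanji(s: str) -> bool:
    return KANJI_RE.search(s) is not None

print(contains_kanji("わたしは漢字"))  # → True
print(contains_kanji("ひらがなだけ"))  # → False
```

If this fires after conversion, you know exactly which outputs need a fallback pass or a dictionary entry.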
Recommended approach: translate → Sudachi reading → hiragana
Background: what Sudachi gives you
Sudachi is a Japanese morphological analyzer that tokenizes text and can return each token’s reading form via reading_form(). (Works Applications)
Important detail: Sudachi’s reading form is “furigana” style in katakana, and for unknown words it can return an empty string. (Javadoc)
So you typically do:
- reading_form() → katakana reading
- katakana → hiragana using jaconv.kata2hira() (jaconv docs)
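If you want to see what that second step actually does (or avoid the jaconv dependency), plain katakana→hiragana is a fixed Unicode offset. A minimal stdlib sketch — note that jaconv also handles edge cases (e.g. half-width kana) that this does not:

```python
def kata2hira(text: str) -> str:
    # Katakana U+30A1..U+30F6 sits exactly 0x60 above hiragana U+3041..U+3096
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

print(kata2hira("キゴウ"))    # → きごう
print(kata2hira("abc。"))    # non-katakana passes through unchanged
```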
The pitfalls (and the standard fixes)
1) “きごう” appears (punctuation became a word)
Sudachi can produce a reading for “symbol” tokens; converting that reading yields “きごう”. A common rule is: do not convert readings for tokens whose POS is “補助記号”; keep the original punctuation surface. (Qiita)
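One way to implement that rule without hard-coding punctuation lists is a Unicode-category check. A sketch — in practice you would combine it with the POS test, since Sudachi’s 補助記号 tag is the authoritative signal:

```python
import unicodedata

def is_punct_or_symbol_only(s: str) -> bool:
    # True when every char is punctuation (P*), a symbol (S*), or whitespace
    return bool(s) and all(
        unicodedata.category(ch)[0] in "PS" or ch.isspace()
        for ch in s
    )

print(is_punct_or_symbol_only("、。！"))  # → True
print(is_punct_or_symbol_only("記号"))   # → False
```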
2) Residual kanji remains (unknown words / missing reading)
Because OOV tokens can have no reading, you need either:
- a user dictionary / better Sudachi dictionary coverage, or
- a fallback converter that can still transliterate leftover kanji.
A practical fallback is pykakasi; current docs recommend using convert() (new API). (Pykakasi)
“Production-ish” checklist for the Sudachi pipeline
- Use a Sudachi dictionary with broad coverage (often full)
- For each token:
  - if POS is 補助記号 → keep the surface form (avoids “きごう”) (Qiita)
  - else if it has a reading → kata2hira(reading) (jaconv docs)
  - else keep the surface, or fall back to transliteration (pykakasi)
- Post-validate: reject/flag the output if any kanji remain (and rerun the fallback)
Better approaches (when/why you’d choose them)
1) Use a better translator, then keep the same hiragana conversion step
If translation quality is the bottleneck, fix that first. Your LFM2 results look strong because the model is explicitly MT-tuned and requires the system prompt "Translate to Japanese." plus recommended sampling parameters. (Hugging Face)
This is usually the best “quality per complexity” path: translator improvements upstream, same deterministic downstream conversion.
2) Use pyopenjtalk (OpenJTalk frontend) for pronunciation-driven kana
If your end goal is TTS / pronunciation rather than “reading as written,” pyopenjtalk.g2p(text, kana=True) returns pronunciation in katakana, which you can convert to hiragana. (LESS IS MORE)
Pros:
- Strong pronunciation-oriented processing (good for speech pipelines)
Cons:
- “Pronunciation kana” can differ from the orthographic reading (particles, rendaku, etc.), depending on what you want.
3) Use an off-the-shelf “kanji→kana” wrapper library
If you want a simpler interface than writing Sudachi token loops:
- kanjiconv: converts kanji→hiragana/katakana/romaji and is built on SudachiDict/SudachiPy; it aims to handle proper nouns better as dictionaries update. (GitHub)
Pros:
- Less glue code
- Still Sudachi-based
Cons:
- You inherit its design choices; you may still need custom handling for punctuation/OOV.
4) JS/Frontend: use Kuroshiro
If you need hiragana conversion in a browser/Node environment, Kuroshiro is a standard option for converting Japanese text to hiragana/katakana/romaji (with furigana/okurigana modes). (kuroshiro)
Approaches that look tempting but usually aren’t worth it
Constrained decoding to “ban kanji” during generation
In Transformers you can block tokens using bad_words_ids / “NoBadWords” processors, but “ban all kanji” would need an enormous, tokenization-dependent list, and it can be slow and leaky:
- Large bad_words_ids lists can severely slow generation. (Hugging Face Forums)
- Blocking can be bypassed when words split across tokens (a classic issue). (GitHub)
Regex/grammar constrained generation frameworks (e.g., Outlines) can enforce character-level constraints. (DotTxt AI)
But for Japanese with subword tokenization, “hiragana-only” constraints often become:
- slower,
- harder to integrate,
- and still not guaranteed to preserve translation quality as well as “translate then convert.”
The unavoidable limitation: hiragana-only loses disambiguation
Kanji disambiguates meaning. Hiragana-only output can be ambiguous (homophones), and “best reading” depends on context. Even high-quality reading tools can be imperfect in edge cases (numbers, special terms, rare names), so you should expect:
- occasional reading choices you want to override (e.g., 私 → わたし vs わたくし)
- special handling for numbers/units if needed (common in TTS/learning tooling)
Demo
Uses LiquidAI/LFM2-350M-ENJP-MT (requires the system prompt "Translate to Japanese." and the recommended sampling params). (Hugging Face)
Uses tokenizer.apply_chat_template(..., add_generation_prompt=True) for chat formatting. (Hugging Face)
Uses Sudachi reading_form() + part_of_speech() and jaconv.kata2hira() for kana conversion. (Works Applications)
"""
Simple demo (single file, no argparse):
EN -> JA (LiquidAI/LFM2-350M-ENJP-MT) -> hiragana-only (Sudachi reading -> hiragana)
pip deps:
pip install -U torch transformers sentencepiece sudachipy sudachidict_full jaconv
Optional (hard guarantee: remove any leftover kanji if Sudachi can't read an OOV token):
pip install -U pykakasi
Key references:
- LFM2-350M-ENJP-MT model card (required system prompt + recommended sampling):
https://huggingface.co/LiquidAI/LFM2-350M-ENJP-MT
- Transformers chat templating (apply_chat_template / add_generation_prompt):
https://huggingface.co/docs/transformers/en/chat_templating
- SudachiPy Morpheme API (reading_form / part_of_speech):
https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.morpheme.html
- jaconv kata2hira:
https://ikegami-yukino.github.io/jaconv/jaconv.html
"""
from __future__ import annotations

import re
import unicodedata
from typing import List

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sudachipy import Dictionary, SplitMode
import jaconv

# Optional fallback (uncomment if you install pykakasi)
# Note: the old setMode()/getConverter() API is deprecated; use convert().
# import pykakasi
# _kks = pykakasi.kakasi()
# def _kakasi_to_hira(s: str) -> str:
#     # convert() returns per-chunk dicts; "hira" holds the hiragana reading
#     return "".join(item["hira"] for item in _kks.convert(s))

KANJI_RE = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")  # CJK Ext-A + Unified


def contains_kanji(s: str) -> bool:
    return KANJI_RE.search(s) is not None


def is_punct_or_symbol_only(s: str) -> bool:
    # Unicode categories: P* punctuation, S* symbols
    for ch in s:
        cat = unicodedata.category(ch)
        if not (cat.startswith("P") or cat.startswith("S") or ch.isspace()):
            return False
    return True


class LFM2EnJaTranslator:
    """
    LFM2-350M-ENJP-MT requires a system prompt:
        "Translate to Japanese."   (EN -> JA)
    and recommends:
        temperature=0.5, top_p=1.0, min_p=0.1, repetition_penalty=1.05
    """

    def __init__(self, model_id: str = "LiquidAI/LFM2-350M-ENJP-MT"):
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.dtype = torch.float16 if self.device.type == "cuda" else torch.float32  # CPU: float32
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Ensure we have a pad token for generation
        if self.tokenizer.pad_token_id is None and self.tokenizer.eos_token_id is not None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # transformers recently introduced dtype=... in places; keep a compatibility fallback.
        try:
            self.model = AutoModelForCausalLM.from_pretrained(model_id, dtype=self.dtype)
        except TypeError:
            self.model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=self.dtype)
        self.model.to(self.device).eval()

    @torch.inference_mode()
    def translate(self, text_en: str, max_new_tokens: int = 192) -> str:
        messages = [
            {"role": "system", "content": "Translate to Japanese."},
            {"role": "user", "content": text_en},
        ]
        # Chat template -> token IDs (adds the assistant generation header)
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(self.device)
        gen_kwargs = dict(
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.5,
            top_p=1.0,
            min_p=0.1,  # may not exist in older transformers; handled below
            repetition_penalty=1.05,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
        )
        # Back-compat: if min_p isn't supported, retry without it.
        try:
            out = self.model.generate(input_ids, **gen_kwargs)
        except TypeError:
            gen_kwargs.pop("min_p", None)
            out = self.model.generate(input_ids, **gen_kwargs)
        # Decode only newly generated tokens (after the prompt)
        new_tokens = out[0, input_ids.shape[-1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


class HiraganaOnlyConverter:
    """
    Japanese -> hiragana-only (reading-based):
      - Use Sudachi reading_form() per token (usually a katakana reading).
      - Convert katakana -> hiragana via jaconv.kata2hira().
      - Keep punctuation as punctuation to avoid "キゴウ" -> "きごう".
      - Optional: customize a couple of common reading preferences.
    """

    def __init__(self):
        # Use sudachidict_full for better coverage; SplitMode.C is a stable default.
        self.tok = Dictionary(dict="full").create(mode=SplitMode.C)

    def to_hiragana(self, ja: str) -> str:
        parts: List[str] = []
        for m in self.tok.tokenize(ja):
            surface = m.surface()
            reading = m.reading_form() or ""
            pos = m.part_of_speech() or ()
            # 1) Keep symbols/punctuation as-is (prevents "きごう" artifacts)
            if (pos and pos[0] == "補助記号") or is_punct_or_symbol_only(surface) or reading == "キゴウ":
                parts.append(surface)
                continue
            # 2) Optional style overrides (comment out if you don't want them)
            #    Sudachi may output "わたくし" for 私; many prefer casual "わたし".
            if surface == "私":
                parts.append("わたし")
                continue
            # 3) Normal path: reading (katakana) -> hiragana
            if reading and reading != "*":
                parts.append(jaconv.kata2hira(reading))
            else:
                # No reading: keep the surface (or use the pykakasi fallback if enabled)
                # if contains_kanji(surface):
                #     parts.append(_kakasi_to_hira(surface))
                # else:
                parts.append(surface)
        out = "".join(parts)
        # Optional final hard guarantee if you enabled pykakasi:
        # if contains_kanji(out):
        #     out = _kakasi_to_hira(out)
        return out


def main():
    translator = LFM2EnJaTranslator()
    hira_conv = HiraganaOnlyConverter()
    examples = [
        "I love natural language processing.",
        "Tokyo is the capital of Japan.",
        "Please translate this into Japanese.",
        "The quick brown fox jumps over the lazy dog.",
    ]
    for en in examples:
        ja = translator.translate(en)
        hira = hira_conv.to_hiragana(ja)
        # Without the pykakasi fallback, this warning flags where Sudachi had no reading.
        if contains_kanji(hira):
            print("[warn] Kanji remained (likely an OOV token with no reading). "
                  "Enable pykakasi or add a dictionary entry.")
        print("EN  :", en)
        print("JA  :", ja)
        print("HIRA:", hira)
        print("-" * 70)


if __name__ == "__main__":
    main()
