Themis-RM-8B

Overview

Themis-RM-8B is an 8B-parameter multilingual code reward model for flexible multi-criteria scoring. It is part of the Themis-RM model suite, trained using the Bradley-Terry preference framework on Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs).

Themis-RM models evaluate code across five quality dimensions — Functional Correctness, Runtime Efficiency, Memory Efficiency, Security Hardness, and Readability & Maintainability — and support eight programming languages. Our experiments demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modelling.

Model Family

The Themis-RM suite ranges from 600M to 32B parameters, all built on the Qwen3 backbone.

Model	Model Architecture	HuggingFace Model Page
Themis-RM-0.6B	Qwen/Qwen3-0.6B	Themis-RM-0.6B
Themis-RM-1.7B	Qwen/Qwen3-1.7B	Themis-RM-1.7B
Themis-RM-4B	Qwen/Qwen3-4B	Themis-RM-4B
Themis-RM-8B (this model)	Qwen/Qwen3-8B	—
Themis-RM-14B	Qwen/Qwen3-14B	Themis-RM-14B
Themis-RM-32B	Qwen/Qwen3-32B	Themis-RM-32B

Results

Themis-RM models achieve best-in-class accuracy on Themis-CodeRewardBench, a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; bold marks the best in each group.

Model	Themis-CodeRewardBench	RewardBench V1	RewardBench V2	JudgeBench

32B - 72B Class
WorldPM-72B	76.96	90.88	67.92	55.21
Athene-RM-70B	78.39	91.22	68.76	63.45
Nemotron-70B-Reward	81.19	93.88	70.49	73.47
Themis-RM-32B	91.82	94.89	72.34	71.65
AceCodeRM-32B	62.95	23.58	67.98	66.77

7B - 14B Class
Themis-RM-14B	91.19	94.11	71.44	70.85
Themis-RM-8B (this model)	89.78	93.69	65.87	69.97
Athene-RM-8B	76.58	87.48	62.96	61.12
CodeScaler-8B	79.12	94.66	76.51	70.05
Skywork-Reward-V2-8B	79.97	94.76	76.93	67.90
AceCodeRM-7B	71.11	22.74	63.16	61.09

0.6B - 4B Class
Themis-RM-4B	88.39	92.46	63.81	68.02
CodeScaler-4B	77.97	94.32	75.13	68.44
Skywork-Reward-V2-4B	79.27	94.06	74.26	65.43
Themis-RM-1.7B	83.04	89.17	56.22	63.29
CodeScaler-1.7B	73.75	91.13	68.44	66.17
Skywork-Reward-V2-1.7B	75.60	91.64	67.71	66.48
Themis-RM-0.6B	79.26	83.41	49.61	63.84
Skywork-Reward-V2-0.6B	72.77	86.32	60.83	63.65

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "project-themis/Themis-RM-8B"
device = "cuda:0"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Write a Python function that checks if a string is a palindrome."

response_chosen = """def is_palindrome(s: str) -> bool:
    s = s.lower().strip()
    return s == s[::-1]"""

response_rejected = """def is_palindrome(s: str) -> bool:
    for i in range(len(s)):
        if s[i] != s[len(s) - i]:
            return False
    return True"""

conv_chosen = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response_chosen},
]
conv_rejected = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response_rejected},
]

chosen_text = tokenizer.apply_chat_template(conv_chosen, tokenize=False)
rejected_text = tokenizer.apply_chat_template(conv_rejected, tokenize=False)

inputs_chosen = tokenizer(chosen_text, return_tensors="pt", truncation=True, max_length=4096).to(device)
inputs_rejected = tokenizer(rejected_text, return_tensors="pt", truncation=True, max_length=4096).to(device)

with torch.no_grad():
    score_chosen = model(**inputs_chosen).logits[0][0].item()
    score_rejected = model(**inputs_rejected).logits[0][0].item()

print(f"Chosen response score:   {score_chosen}")
print(f"Rejected response score: {score_rejected}")

Multi-Criteria Scoring with System Prompts

Themis-RM models are trained with stochastic criteria-conditioned system prompts, allowing you to steer scoring toward a specific quality dimension at inference time. Prepend a system message that specifies the evaluation criteria before the user/assistant turns. The model supports the following criteria:

Criterion	Key
Functional Correctness	`Functional_Correctness`
Runtime Efficiency	`Runtime_Efficiency`
Memory Efficiency	`Memory_Efficiency`
Security Hardness	`Security_Hardness`
Readability & Maintainability	`Readability_Maintainability`
All criteria (multi-criteria)	`Full`

Each criterion-specific system prompt includes a base preamble (Helpfulness + Harmlessness) followed by the targeted criterion. The Full prompt includes all five code criteria together. When no system prompt is provided, the model scores based on general quality — the model was trained to handle all three modes (no system prompt, single-criterion, and full multi-criteria). For the full set of system prompts used during training and evaluation, see the Evaluation folder in the GitHub repository.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "project-themis/Themis-RM-8B"
device = "cuda:0"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Criterion-specific system prompt (example: Runtime Efficiency)
# See https://github.com/iNeil77/Themis/tree/main/Evaluation for all prompts
system_prompt = (
    "You are an experienced AI judge and evaluator, specifically "
    "designed and trained to assess the overall quality of responses "
    "to human prompts. The prompts can be of various types, including "
    "but not limited to open-ended questions, coding tasks, creative "
    "writing, requests for information, and more. The responses may be "
    "generated by a language model or a human and may include text, "
    "code, or both. You will receive the prompt and the generated "
    "response for each judgement session. Your exact mandate is to "
    "score the response based on a pre-specified but flexible set of "
    "criteria.\n\n"
    "The following criteria must govern your scoring for the current "
    "judgment session:\n\n"
    "1. **Helpfulness**: How well does the response address the "
    "prompt? Does it provide a complete and thorough answer, or is it "
    "lacking in essential details? This is a tenet that applies to "
    "evaluating all types of responses. You must score helpful "
    "responses higher than less unhelpful ones.\n\n"
    "2. **Harmlessness**: Does the response contain any harmful or "
    "inappropriate content? This includes hate speech, violence, "
    "self-harm, and other forms of harmful content. This is a tenet "
    "that applies to evaluating all types of responses. You must "
    "score harmless responses higher than harmful ones.\n\n"
    "3. **Runtime Efficiency**: Does the response follow best "
    "practices for runtime efficiency? Examples include using "
    "efficient algorithms and data structures, minimizing time "
    "complexity, avoiding unnecessary computations, caching results, "
    "and leveraging parallel processing or asynchronous programming "
    "techniques where appropriate, among others. This is a tenet "
    "that applies to evaluating code responses. You must score more "
    "runtime-efficient responses higher than less runtime-efficient "
    "ones."
)

prompt = "Write a Python function that returns the n-th Fibonacci number."

response = """def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b"""

conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]

text = tokenizer.apply_chat_template(conversation, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to(device)

with torch.no_grad():
    score = model(**inputs).logits[0][0].item()

print(f"Runtime Efficiency score: {score}")

License

This model is released under the Apache 2.0 License. The base model, Qwen3-8B, is also licensed under Apache 2.0.

Citation

@article{themis2025,
  title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring},
  author={},
  journal={arXiv preprint arXiv:2605.00754},
  year={2025}
}