---
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen2.5-32B-Instruct
tags:
- evaluation
- sql
- reward-model
---
# Contextual-SQL Reward Model
[](https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system)
[](https://github.com/ContextualAI/bird-sql)
[](https://huggingface.co/collections/ContextualAI/contextual-sql)
**Contextual-SQL Reward Model** is the scoring component of the Contextual-SQL system, which achieved #1 on the BIRD benchmark leaderboard in February 2025. This model is finetuned from Qwen-2.5-32B-Instruct to rank SQL query candidates by scoring their execution correctness given a database schema and natural language query.
This reward model is one component of the full Contextual-SQL pipeline. For the complete text-to-SQL system (including SQL generation with Qwen2.5-Coder-32B-Instruct), please see the [Github repository](https://github.com/ContextualAI/bird-sql).
For more details about the complete system and methodology, check out our [blog post](https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system).
## Model Details
The Contextual-SQL system achieves state-of-the-art performance through a multi-stage pipeline where this reward model plays a critical role:
- **Inference-Time Scaling:** The system generates multiple diverse SQL candidates through parallel sampling with varied prompts and temperatures from Qwen2.5-Coder-32B-Instruct, then selects the best candidate using execution validation, consistency scoring, and this learned reward model.
- **Reward Model Training:** This Qwen-2.5-32B-Instruct base model was finetuned on the BIRD training set to rank candidate SQL queries. Training uses a classification objective with hard negative mining, where each question is paired with one correct SQL candidate and 15 high-likelihood incorrect candidates as hard negatives.
- **Multi-Signal Candidate Selection:** The final ranking combines this reward model's probability scores P(Y|X,Q) with the generator's log-probabilities P(X|Q) (weighted at α=0.4), creating a joint likelihood that captures both the quality of the SQL candidate and the correctness of its execution output.
### Model Description
- **Developed by:** Contextual AI
- **Language(s) (NLP):** English
- **Finetuned from model:** Qwen-2.5-32B-Instruct
- **Model type:** Reward model for SQL query scoring
- **License:** Same as base model (Apache 2.0)
### Model Sources
- **Repository:** https://github.com/ContextualAI/bird-sql
- **Blog Post:** https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system
## Usage
This model is designed to be used as part of the full Contextual-SQL pipeline for scoring SQL candidates. It is **not** a standalone text-to-SQL generator.
### Quick Start (Scoring a Single SQL Candidate)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("ContextualAI/ctx-bird-reward-250121")
model = AutoModelForCausalLM.from_pretrained("ContextualAI/ctx-bird-reward-250121")
# Example inputs
db_schema = """CREATE TABLE customers (
id INT PRIMARY KEY,
name VARCHAR(100),
region VARCHAR(50),
revenue FLOAT
);"""
question = "Show me top 5 highest revenue customers by region"
evidence = "Revenue is stored in the revenue column"
sql_candidate = "SELECT region, name, revenue FROM customers ORDER BY revenue DESC LIMIT 5"
execution_result = "5 rows returned"
num_rows = 5
# Format the prompt
messages = [
{
"role": "system",
"content": "You are a judge that can check whether a given SQL correctly answers a given natural language user query. You'll be given Database Schema, Question, External Knowledge, SQL, logprob Score and its Execution Result.",
},
{
"role": "user",
"content": (
f"-- Database Schema: \n{db_schema}\n"
f"-- Question: {question}\n"
f"-- External Knowledge: {evidence}\n"
f"-- SQL: {sql_candidate}\n"
f"-- Execution Result #rows: {num_rows}\n"
f"-- Execution Result START\n{execution_result}\n"
f"-- END Execution Result\n"
f"-- Does SQL correctly answer Question?\n"
),
}
]
# Generate reward score
inputs = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
score = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Reward score: {score}")
```
### Complete Pipeline
For production use, we recommend using the full Contextual-SQL pipeline from the [Github repository](https://github.com/ContextualAI/bird-sql). The pipeline consists of:
1. **Candidate Generation**: Generate multiple SQL query candidates using Qwen2.5-Coder-32B-Instruct
2. **SQL Execution**: Execute candidates against the database
3. **Reward Scoring**: Score candidates with this reward model
4. **Final Selection**: Select the best candidate based on combined scoring
#### Installation
```bash
# Clone the repository
git clone https://github.com/ContextualAI/bird-sql.git
cd bird-sql
# Install dependencies
pip install -r requirements.txt
# Download the reward model
mkdir -p models/reward
huggingface-cli download ContextualAI/ctx-bird-reward-250121 \
--local-dir models/reward
```
#### Running the Pipeline
```bash
# Step 1: Generate SQL candidates
python src/generate.py \
--input_file data/test_all.jsonl \
--output_dir output/generations/ \
--num_gpus 2
# Step 2: Execute SQL candidates
python src/process_sqls.py \
--input_file data/test_all.jsonl \
--generations_dir output/generations/ \
--output_dir output/with_results/ \
--compare_against_gt \
--sql_timeout 30.0
# Step 3: Score with reward model
VLLM_USE_V1=0 python src/reward.py \
--input_file output/with_results/data_with_results.jsonl \
--output_dir output/with_rewards \
--num_gpus 2
# Step 4: Select best candidates
python src/analysis.py \
--rewards_dir output/with_rewards \
--gt_sql_file data/test_gold_sqls.txt \
--output_dir output/analysis \
--num_cpus 100
```
## System Requirements
- **GPU:** 2+ GPUs with 80GB RAM each (for full pipeline)
- **Python:** 3.10+
- See the [repository](https://github.com/ContextualAI/bird-sql) for complete requirements
## Evaluation
The complete Contextual-SQL system (using this reward model) achieves state-of-the-art results on the BIRD benchmark:
| Model | BIRD Dev Set | BIRD Test Set |
|:------|-------------:|--------------:|
| **Contextual-SQL** | 73.50 | 75.63 |
To reproduce these evaluation results, follow the instructions in the [Github repository](https://github.com/ContextualAI/bird-sql).
## Citation
If you find our work helpful, please cite:
```bibtex
@misc{agrawal2025text2sql,
author = {Sheshansh Agrawal and Thien Nguyen},
title = {Open-Sourcing the Best Local Text-to-SQL System},
year = {2025},
url = {https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system/}
}
```