---
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen2.5-32B-Instruct
tags:
- evaluation
- sql
- reward-model
---
# Contextual-SQL Reward Model

[![Blog Post](https://img.shields.io/badge/📝%20Blog-ContextualSQL-green)](https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system) [![GitHub](https://img.shields.io/badge/GitHub-ContextualSQL-black?logo=github)](https://github.com/ContextualAI/bird-sql) [![Hugging Face Collection](https://img.shields.io/badge/🤗%20Hugging%20Face-Model%20Collection-yellow)](https://huggingface.co/collections/ContextualAI/contextual-sql)
**Contextual-SQL Reward Model** is the scoring component of the Contextual-SQL system, which reached #1 on the BIRD benchmark leaderboard in February 2025. The model is finetuned from Qwen-2.5-32B-Instruct to rank SQL query candidates by scoring their execution correctness given a database schema and a natural language question.

This reward model is one component of the full Contextual-SQL pipeline. For the complete text-to-SQL system (including SQL generation with Qwen2.5-Coder-32B-Instruct), see the [Github repository](https://github.com/ContextualAI/bird-sql). For more details about the complete system and methodology, check out our [blog post](https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system).

## Model Details

The Contextual-SQL system achieves state-of-the-art performance through a multi-stage pipeline in which this reward model plays a critical role:

- **Inference-Time Scaling:** The system generates multiple diverse SQL candidates through parallel sampling with varied prompts and temperatures from Qwen2.5-Coder-32B-Instruct, then selects the best candidate using execution validation, consistency scoring, and this learned reward model.
- **Reward Model Training:** This Qwen-2.5-32B-Instruct base model was finetuned on the BIRD training set to rank candidate SQL queries. Training uses a classification objective with hard negative mining: each question is paired with one correct SQL candidate and 15 high-likelihood incorrect candidates as hard negatives.
- **Multi-Signal Candidate Selection:** The final ranking combines this reward model's probability scores P(Y|X,Q) with the generator's log-probabilities P(X|Q) (weighted at α=0.4), creating a joint likelihood that captures both the quality of the SQL candidate and the correctness of its execution output.
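The multi-signal combination above can be sketched as a weighted sum of the two log-likelihoods. This is an illustrative sketch, not the repository's implementation: the function name, the candidate dicts, and the numeric values below are made up for demonstration; only α=0.4 comes from the description above.

```python
import math

ALPHA = 0.4  # weight on the generator log-probability, per the description above

def joint_score(gen_logprob: float, reward_prob: float, alpha: float = ALPHA) -> float:
    """Combine the generator's log P(X|Q) with the reward model's P(Y|X,Q)."""
    return alpha * gen_logprob + (1 - alpha) * math.log(reward_prob)

# Hypothetical candidates: a slightly less likely generation with a high
# reward score can outrank a more likely one the reward model distrusts.
candidates = [
    {"sql": "SELECT ...", "gen_logprob": -2.1, "reward_prob": 0.95},
    {"sql": "SELECT ...", "gen_logprob": -1.8, "reward_prob": 0.40},
]
best = max(candidates, key=lambda c: joint_score(c["gen_logprob"], c["reward_prob"]))
```

Here the first candidate wins despite its lower generator likelihood, because the reward model's confidence dominates at α=0.4.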
### Model Description

- **Developed by:** Contextual AI
- **Language(s) (NLP):** English
- **Finetuned from model:** Qwen-2.5-32B-Instruct
- **Model type:** Reward model for SQL query scoring
- **License:** Same as base model (Apache 2.0)

### Model Sources

- **Repository:** https://github.com/ContextualAI/bird-sql
- **Blog Post:** https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system

## Usage

This model is designed to be used as part of the full Contextual-SQL pipeline for scoring SQL candidates. It is **not** a standalone text-to-SQL generator.

### Quick Start (Scoring a Single SQL Candidate)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ContextualAI/ctx-bird-reward-250121")
# A 32B model: half precision and automatic device placement are recommended
model = AutoModelForCausalLM.from_pretrained(
    "ContextualAI/ctx-bird-reward-250121",
    torch_dtype="auto",
    device_map="auto",
)

# Example inputs
db_schema = """CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    region VARCHAR(50),
    revenue FLOAT
);"""
question = "Show me top 5 highest revenue customers by region"
evidence = "Revenue is stored in the revenue column"
sql_candidate = "SELECT region, name, revenue FROM customers ORDER BY revenue DESC LIMIT 5"
execution_result = "5 rows returned"
num_rows = 5

# Format the prompt
messages = [
    {
        "role": "system",
        "content": (
            "You are a judge that can check whether a given SQL correctly "
            "answers a given natural language user query. You'll be given "
            "Database Schema, Question, External Knowledge, SQL, logprob "
            "Score and its Execution Result."
        ),
    },
    {
        "role": "user",
        "content": (
            f"-- Database Schema: \n{db_schema}\n"
            f"-- Question: {question}\n"
            f"-- External Knowledge: {evidence}\n"
            f"-- SQL: {sql_candidate}\n"
            f"-- Execution Result #rows: {num_rows}\n"
            f"-- Execution Result START\n{execution_result}\n"
            f"-- END Execution Result\n"
            f"-- Does SQL correctly answer Question?\n"
        ),
    },
]

# Generate the reward judgment
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a dict so that **inputs works below
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
score = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Reward score: {score}")
```

### Complete Pipeline

For production use, we recommend the full Contextual-SQL pipeline from the [Github repository](https://github.com/ContextualAI/bird-sql). The pipeline consists of:

1. **Candidate Generation**: Generate multiple SQL query candidates using Qwen2.5-Coder-32B-Instruct
2. **SQL Execution**: Execute candidates against the database
3. **Reward Scoring**: Score candidates with this reward model
4. **Final Selection**: Select the best candidate based on combined scoring

#### Installation

```bash
# Clone the repository
git clone https://github.com/ContextualAI/bird-sql.git
cd bird-sql

# Install dependencies
pip install -r requirements.txt

# Download the reward model
mkdir -p models/reward
huggingface-cli download ContextualAI/ctx-bird-reward-250121 \
    --local-dir models/reward
```

#### Running the Pipeline

```bash
# Step 1: Generate SQL candidates
python src/generate.py \
    --input_file data/test_all.jsonl \
    --output_dir output/generations/ \
    --num_gpus 2

# Step 2: Execute SQL candidates
python src/process_sqls.py \
    --input_file data/test_all.jsonl \
    --generations_dir output/generations/ \
    --output_dir output/with_results/ \
    --compare_against_gt \
    --sql_timeout 30.0

# Step 3: Score with the reward model
VLLM_USE_V1=0 python src/reward.py \
    --input_file output/with_results/data_with_results.jsonl \
    --output_dir output/with_rewards \
    --num_gpus 2

# Step 4: Select the best candidates
python src/analysis.py \
    --rewards_dir output/with_rewards \
    --gt_sql_file data/test_gold_sqls.txt \
    --output_dir output/analysis \
    --num_cpus 100
```

## System Requirements

- **GPU:** 2+ GPUs with 80 GB of memory each (for the full pipeline)
- **Python:** 3.10+
- See the [repository](https://github.com/ContextualAI/bird-sql) for complete requirements

## Evaluation

The complete Contextual-SQL system (using this reward model) achieves state-of-the-art results on the BIRD benchmark:

| Model | BIRD Dev Set | BIRD Test Set |
|:------|-------------:|--------------:|
| **Contextual-SQL** | 73.50 | 75.63 |

To reproduce these evaluation results, follow the instructions in the [Github repository](https://github.com/ContextualAI/bird-sql).
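Alongside the reward score, candidate selection uses execution-consistency scoring (mentioned under Model Details): candidates whose execution output agrees with many other candidates are preferred. A minimal sketch, assuming execution results have been serialized to comparable strings; the function name is illustrative and not part of the repository's API.

```python
from collections import Counter

def consistency_scores(execution_results: list[str]) -> list[int]:
    """Score each candidate by how many candidates share its execution output.

    Illustrative sketch of self-consistency over execution results; the
    pipeline's actual selection logic lives in the repository's source.
    """
    counts = Counter(execution_results)
    return [counts[result] for result in execution_results]

# Three of four hypothetical candidates agree, so they each score 3.
scores = consistency_scores(["A", "B", "A", "A"])  # -> [3, 1, 3, 3]
```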
## Citation

If you find our work helpful, please cite:

```bibtex
@misc{agrawal2025text2sql,
  author = {Sheshansh Agrawal and Thien Nguyen},
  title  = {Open-Sourcing the Best Local Text-to-SQL System},
  year   = {2025},
  url    = {https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system/}
}
```