---
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen2.5-32B-Instruct
tags:
- evaluation
- sql
- reward-model
---
# Contextual-SQL Reward Model

[![Blog Post](https://img.shields.io/badge/📝%20Blog-ContextualSQL-green)](https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system) [![GitHub](https://img.shields.io/badge/GitHub-ContextualSQL-black?logo=github)](https://github.com/ContextualAI/bird-sql) [![Hugging Face Collection](https://img.shields.io/badge/🤗%20Hugging%20Face-Model%20Collection-yellow)](https://huggingface.co/collections/ContextualAI/contextual-sql)
**Contextual-SQL Reward Model** is the scoring component of the Contextual-SQL system, which reached #1 on the BIRD benchmark leaderboard in February 2025. The model is finetuned from Qwen-2.5-32B-Instruct to rank SQL query candidates by scoring their execution correctness given a database schema and a natural language question.

This reward model is one component of the full Contextual-SQL pipeline. For the complete text-to-SQL system (including SQL generation with Qwen2.5-Coder-32B-Instruct), see the [Github repository](https://github.com/ContextualAI/bird-sql). For more details about the complete system and methodology, check out our [blog post](https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system).

## Model Details

The Contextual-SQL system achieves state-of-the-art performance through a multi-stage pipeline in which this reward model plays a critical role:

- **Inference-Time Scaling:** The system generates multiple diverse SQL candidates through parallel sampling with varied prompts and temperatures from Qwen2.5-Coder-32B-Instruct, then selects the best candidate using execution validation, consistency scoring, and this learned reward model.
- **Reward Model Training:** This Qwen-2.5-32B-Instruct base model was finetuned on the BIRD training set to rank candidate SQL queries. Training uses a classification objective with hard negative mining: each question is paired with one correct SQL candidate and 15 high-likelihood incorrect candidates as hard negatives.
- **Multi-Signal Candidate Selection:** The final ranking combines this reward model's probability scores P(Y|X,Q) with the generator's log-probabilities P(X|Q) (weighted at α=0.4), creating a joint likelihood that captures both the quality of the SQL candidate and the correctness of its execution output.
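The multi-signal combination above can be sketched as a weighted sum of the two log-likelihoods. This is an illustrative sketch, not the repository's implementation: the function name, the candidate dicts, and the numeric values below are made up for demonstration; only α=0.4 comes from the description above.

```python
import math

ALPHA = 0.4  # weight on the generator log-probability, per the description above

def joint_score(gen_logprob: float, reward_prob: float, alpha: float = ALPHA) -> float:
    """Combine the generator's log P(X|Q) with the reward model's P(Y|X,Q)."""
    return alpha * gen_logprob + (1 - alpha) * math.log(reward_prob)

# Hypothetical candidates: a slightly less likely generation with a high
# reward score can outrank a more likely one the reward model distrusts.
candidates = [
    {"sql": "SELECT ...", "gen_logprob": -2.1, "reward_prob": 0.95},
    {"sql": "SELECT ...", "gen_logprob": -1.8, "reward_prob": 0.40},
]
best = max(candidates, key=lambda c: joint_score(c["gen_logprob"], c["reward_prob"]))
```

Here the first candidate wins despite its lower generator likelihood, because the reward model's confidence dominates at α=0.4.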
### Model Description

- **Developed by:** Contextual AI
- **Language(s) (NLP):** English
- **Finetuned from model:** Qwen-2.5-32B-Instruct
- **Model type:** Reward model for SQL query scoring
- **License:** Same as base model (Apache 2.0)

### Model Sources

- **Repository:** https://github.com/ContextualAI/bird-sql
- **Blog Post:** https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system

## Usage

This model is designed to be used as part of the full Contextual-SQL pipeline for scoring SQL candidates. It is **not** a standalone text-to-SQL generator.

### Quick Start (Scoring a Single SQL Candidate)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ContextualAI/ctx-bird-reward-250121")
# A 32B model: half precision and automatic device placement are recommended
model = AutoModelForCausalLM.from_pretrained(
    "ContextualAI/ctx-bird-reward-250121",
    torch_dtype="auto",
    device_map="auto",
)

# Example inputs
db_schema = """CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    region VARCHAR(50),
    revenue FLOAT
);"""
question = "Show me top 5 highest revenue customers by region"
evidence = "Revenue is stored in the revenue column"
sql_candidate = "SELECT region, name, revenue FROM customers ORDER BY revenue DESC LIMIT 5"
execution_result = "5 rows returned"
num_rows = 5

# Format the prompt
messages = [
    {
        "role": "system",
        "content": (
            "You are a judge that can check whether a given SQL correctly "
            "answers a given natural language user query. You'll be given "
            "Database Schema, Question, External Knowledge, SQL, logprob "
            "Score and its Execution Result."
        ),
    },
    {
        "role": "user",
        "content": (
            f"-- Database Schema: \n{db_schema}\n"
            f"-- Question: {question}\n"
            f"-- External Knowledge: {evidence}\n"
            f"-- SQL: {sql_candidate}\n"
            f"-- Execution Result #rows: {num_rows}\n"
            f"-- Execution Result START\n{execution_result}\n"
            f"-- END Execution Result\n"
            f"-- Does SQL correctly answer Question?\n"
        ),
    },
]

# Generate the reward judgment
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a dict so that **inputs works below
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
score = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Reward score: {score}")
```

### Complete Pipeline

For production use, we recommend the full Contextual-SQL pipeline from the [Github repository](https://github.com/ContextualAI/bird-sql). The pipeline consists of:

1. **Candidate Generation**: Generate multiple SQL query candidates using Qwen2.5-Coder-32B-Instruct
2. **SQL Execution**: Execute candidates against the database
3. **Reward Scoring**: Score candidates with this reward model
4. **Final Selection**: Select the best candidate based on combined scoring

#### Installation

```bash
# Clone the repository
git clone https://github.com/ContextualAI/bird-sql.git
cd bird-sql

# Install dependencies
pip install -r requirements.txt

# Download the reward model
mkdir -p models/reward
huggingface-cli download ContextualAI/ctx-bird-reward-250121 \
    --local-dir models/reward
```

#### Running the Pipeline

```bash
# Step 1: Generate SQL candidates
python src/generate.py \
    --input_file data/test_all.jsonl \
    --output_dir output/generations/ \
    --num_gpus 2

# Step 2: Execute SQL candidates
python src/process_sqls.py \
    --input_file data/test_all.jsonl \
    --generations_dir output/generations/ \
    --output_dir output/with_results/ \
    --compare_against_gt \
    --sql_timeout 30.0

# Step 3: Score with the reward model
VLLM_USE_V1=0 python src/reward.py \
    --input_file output/with_results/data_with_results.jsonl \
    --output_dir output/with_rewards \
    --num_gpus 2

# Step 4: Select the best candidates
python src/analysis.py \
    --rewards_dir output/with_rewards \
    --gt_sql_file data/test_gold_sqls.txt \
    --output_dir output/analysis \
    --num_cpus 100
```

## System Requirements

- **GPU:** 2+ GPUs with 80 GB of memory each (for the full pipeline)
- **Python:** 3.10+
- See the [repository](https://github.com/ContextualAI/bird-sql) for complete requirements

## Evaluation

The complete Contextual-SQL system (using this reward model) achieves state-of-the-art results on the BIRD benchmark:

| Model | BIRD Dev Set | BIRD Test Set |
|:------|-------------:|--------------:|
| **Contextual-SQL** | 73.50 | 75.63 |

To reproduce these evaluation results, follow the instructions in the [Github repository](https://github.com/ContextualAI/bird-sql).
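Alongside the reward score, candidate selection uses execution-consistency scoring (mentioned under Model Details): candidates whose execution output agrees with many other candidates are preferred. A minimal sketch, assuming execution results have been serialized to comparable strings; the function name is illustrative and not part of the repository's API.

```python
from collections import Counter

def consistency_scores(execution_results: list[str]) -> list[int]:
    """Score each candidate by how many candidates share its execution output.

    Illustrative sketch of self-consistency over execution results; the
    pipeline's actual selection logic lives in the repository's source.
    """
    counts = Counter(execution_results)
    return [counts[result] for result in execution_results]

# Three of four hypothetical candidates agree, so they each score 3.
scores = consistency_scores(["A", "B", "A", "A"])  # -> [3, 1, 3, 3]
```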
## Citation

If you find our work helpful, please cite:

```bibtex
@misc{agrawal2025text2sql,
  author = {Sheshansh Agrawal and Thien Nguyen},
  title  = {Open-Sourcing the Best Local Text-to-SQL System},
  year   = {2025},
  url    = {https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system/}
}
```