Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images Paper • 2604.07338 • Published Apr 8 • 5
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation Paper • 2602.16990 • Published Feb 19 • 11
Introducing the Open FinLLM Leaderboard Article • QianqianXie1994, jiminHuang, Effoula, yanglet, alejandroll10, Benyou, ldruth, xiangr, Me1oy, ShirleyY, mirageco, blitzionic, clefourrier • Oct 4, 2024 • 80
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment Paper • 2512.09636 • Published Dec 10, 2025 • 26
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation Paper • 2510.09116 • Published Oct 10, 2025 • 97
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs Paper • 2510.08886 • Published Oct 10, 2025 • 20
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models Paper • 2508.13491 • Published Aug 19, 2025 • 59
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation Paper • 2506.14028 • Published Jun 16, 2025 • 94
Open Financial LLM Leaderboard 🏆 Evaluating LLMs on Multilingual Multimodal Financial Tasks • Running • 7