General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
Paper • 2604.11778
DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?