new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 29

Referring Expression Comprehension: A Survey of Methods and Datasets

Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language. Different from the object detection task that queried object labels have been pre-defined, the REC problem only can observe the queries during the test. It thus more challenging than a conventional computer vision problem. This task has attracted a lot of attention from both computer vision and natural language processing community, and several lines of work have been proposed, from CNN-RNN model, modular network to complex graph-based model. In this survey, we first examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to encode the visual and textual modalities. In particular, we examine the common approach of joint embedding images and expressions to a common feature space. We also discuss modular architectures and graph-based models that interface with structured graph representation. In the second part of this survey, we review the datasets available for training and evaluating REC systems. We then group results according to the datasets, backbone models, settings so that they can be fairly compared. Finally, we discuss promising future directions for the field, in particular the compositional referring expression comprehension that requires longer reasoning chain to address.

  • 3 authors
·
Jul 18, 2020

GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions

AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. To address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Notably, even state-of-the-art Multimodal Large Language Models (MLLMs) struggle with this task, underscoring the necessity of explicitly evaluating and strengthening geometric grounding as a prerequisite for robust geometric problem solving. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding.

  • 9 authors
·
Sep 25, 2025

Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modelling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and checkpoints will be released soon.

  • 4 authors
·
Jul 19, 2025

DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results. DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512*512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.

  • 8 authors
·
Apr 20, 2025 8

Empathy Detection from Text, Audiovisual, Audio or Physiological Signals: A Systematic Review of Task Formulations and Machine Learning Methods

Empathy indicates an individual's ability to understand others. Over the past few years, empathy has drawn attention from various disciplines, including but not limited to Affective Computing, Cognitive Science, and Psychology. Detecting empathy has potential applications in society, healthcare and education. Despite being a broad and overlapping topic, the avenue of empathy detection leveraging Machine Learning remains underexplored from a systematic literature review perspective. We collected 849 papers from 10 well-known academic databases, systematically screened them and analysed the final 82 papers. Our analyses reveal several prominent task formulations - including empathy on localised utterances or overall expressions, unidirectional or parallel empathy, and emotional contagion - in monadic, dyadic and group interactions. Empathy detection methods are summarised based on four input modalities - text, audiovisual, audio and physiological signals - thereby presenting modality-specific network architecture design protocols. We discuss challenges, research gaps and potential applications in the Affective Computing-based empathy domain, which can facilitate new avenues of exploration. We further enlist the public availability of datasets and codes. This paper, therefore, provides a structured overview of recent advancements and remaining challenges towards developing a robust empathy detection system that could meaningfully contribute to enhancing human well-being.

  • 5 authors
·
Oct 30, 2023

All you need is a second look: Towards Tighter Arbitrary shape text detection

Deep learning-based scene text detection methods have progressed substantially over the past years. However, there remain several problems to be solved. Generally, long curve text instances tend to be fragmented because of the limited receptive field size of CNN. Besides, simple representations using rectangle or quadrangle bounding boxes fall short when dealing with more challenging arbitrary-shaped texts. In addition, the scale of text instances varies greatly which leads to the difficulty of accurate prediction through a single segmentation network. To address these problems, we innovatively propose a two-stage segmentation based arbitrary text detector named NASK (Need A Second looK). Specifically, NASK consists of a Text Instance Segmentation network namely TIS (\(1^{st}\) stage), a Text RoI Pooling module and a Fiducial pOint eXpression module termed as FOX (\(2^{nd}\) stage). Firstly, TIS conducts instance segmentation to obtain rectangle text proposals with a proposed Group Spatial and Channel Attention module (GSCA) to augment the feature expression. Then, Text RoI Pooling transforms these rectangles to the fixed size. Finally, FOX is introduced to reconstruct text instances with a more tighter representation using the predicted geometrical attributes including text center line, text line orientation, character scale and character orientation. Experimental results on two public benchmarks including Total-Text and SCUT-CTW1500 have demonstrated that the proposed NASK achieves state-of-the-art results.

  • 2 authors
·
Apr 26, 2020

All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection

Arbitrary-shaped text detection is a challenging task since curved texts in the wild are of the complex geometric layouts. Existing mainstream methods follow the instance segmentation pipeline to obtain the text regions. However, arbitraryshaped texts are difficult to be depicted through one single segmentation network because of the varying scales. In this paper, we propose a two-stage segmentation-based detector, termed as NASK (Need A Second looK), for arbitrary-shaped text detection. Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner with the first stage segmentation spotting the rectangle text proposals and the second one retrieving compact representations. Specifically, NASK is composed of a Text Instance Segmentation (TIS) network (1st stage), a Geometry-aware Text RoI Alignment (GeoAlign) module, and a Fiducial pOint eXpression (FOX) module (2nd stage). Firstly, TIS extracts the augmented features with a novel Group Spatial and Channel Attention (GSCA) module and conducts instance segmentation to obtain rectangle proposals. Then, GeoAlign converts these rectangles into the fixed size and encodes RoI-wise feature representation. Finally, FOX disintegrates the text instance into serval pivotal geometrical attributes to refine the detection results. Extensive experimental results on three public benchmarks including Total-Text, SCUTCTW1500, and ICDAR 2015 verify that our NASK outperforms recent state-of-the-art methods.

  • 4 authors
·
Jun 23, 2021

GRES: Generalized Referring Expression Segmentation

Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object. Multi-target and no-target expressions are not considered. This limits the usage of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends the classic RES to allow expressions to refer to an arbitrary number of target objects. Towards this, we construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be well-compatible with RES, facilitating extensive experiments to study the performance gap of the existing RES methods on the GRES task. In the experimental study, we find that one of the big challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline ReLA that adaptively divides the image into regions with sub-instance clues, and explicitly models the region-region and region-language dependencies. The proposed approach ReLA achieves new state-of-the-art performance on the both newly proposed GRES and classic RES tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GRES.

  • 3 authors
·
Jun 1, 2023

Oracle Efficient Algorithms for Groupwise Regret

We study the problem of online prediction, in which at each time step t, an individual x_t arrives, whose label we must predict. Each individual is associated with various groups, defined based on their features such as age, sex, race etc., which may intersect. Our goal is to make predictions that have regret guarantees not just overall but also simultaneously on each sub-sequence comprised of the members of any single group. Previous work such as [Blum & Lykouris] and [Lee et al] provide attractive regret guarantees for these problems; however, these are computationally intractable on large model classes. We show that a simple modification of the sleeping experts technique of [Blum & Lykouris] yields an efficient reduction to the well-understood problem of obtaining diminishing external regret absent group considerations. Our approach gives similar regret guarantees compared to [Blum & Lykouris]; however, we run in time linear in the number of groups, and are oracle-efficient in the hypothesis class. This in particular implies that our algorithm is efficient whenever the number of groups is polynomially bounded and the external-regret problem can be solved efficiently, an improvement on [Blum & Lykouris]'s stronger condition that the model class must be small. Our approach can handle online linear regression and online combinatorial optimization problems like online shortest paths. Beyond providing theoretical regret bounds, we evaluate this algorithm with an extensive set of experiments on synthetic data and on two real data sets -- Medical costs and the Adult income dataset, both instantiated with intersecting groups defined in terms of race, sex, and other demographic characteristics. We find that uniformly across groups, our algorithm gives substantial error improvements compared to running a standard online linear regression algorithm with no groupwise regret guarantees.

  • 5 authors
·
Oct 6, 2023

Knowledge Graph Embedding by Normalizing Flows

A key to knowledge graph embedding (KGE) is to choose a proper representation space, e.g., point-wise Euclidean space and complex vector space. In this paper, we propose a unified perspective of embedding and introduce uncertainty into KGE from the view of group theory. Our model can incorporate existing models (i.e., generality), ensure the computation is tractable (i.e., efficiency) and enjoy the expressive power of complex random variables (i.e., expressiveness). The core idea is that we embed entities/relations as elements of a symmetric group, i.e., permutations of a set. Permutations of different sets can reflect different properties of embedding. And the group operation of symmetric groups is easy to compute. In specific, we show that the embedding of many existing models, point vectors, can be seen as elements of a symmetric group. To reflect uncertainty, we first embed entities/relations as permutations of a set of random variables. A permutation can transform a simple random variable into a complex random variable for greater expressiveness, called a normalizing flow. We then define scoring functions by measuring the similarity of two normalizing flows, namely NFE. We construct several instantiating models and prove that they are able to learn logical rules. Experimental results demonstrate the effectiveness of introducing uncertainty and our model. The code is available at https://github.com/changyi7231/NFE.

  • 3 authors
·
Sep 30, 2024

Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline

Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.

  • 5 authors
·
Dec 14, 2025