Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by injecting visual information into the language tokens. To exploit visual contexts more effectively for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which effectively provides structural and semantic information about the target to the network. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens to suppress noisy information and share informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture the semantic visual contexts of informative regions. In this way, our framework enables the network's attention to align robustly with fine-grained regions of interest. Extensive experiments and visual analyses demonstrate the effectiveness of our approach. Our VIPA outperforms existing state-of-the-art methods on four public RIS benchmarks.
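Conceptually, the visual expression serves as a compact key-value set that the decoder's vision query features attend to. A minimal PyTorch-style sketch of this cross-attention is given below; the module name, tensor shapes, and residual design are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class VisualExpressionAttention(nn.Module):
    """Illustrative cross-attention layer: the visual expression tokens act as
    the key-value set for the vision query features (names are hypothetical)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, vision_queries: torch.Tensor, visual_expression: torch.Tensor) -> torch.Tensor:
        # vision_queries:    (B, H*W, C) pixel-level query features
        # visual_expression: (B, K, C)   retrieved informative visual tokens
        q = self.norm_q(vision_queries)
        kv = self.norm_kv(visual_expression)
        attended, _ = self.attn(query=q, key=kv, value=kv)
        return vision_queries + attended  # residual connection (assumed)
```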
1. We propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation, which leverages the informative parts of visual contexts as a key-value set for the vision query features in the Transformer-based segmentation decoder. Our approach is the first to explore the potential of visual expression in the attention mechanism of referring segmentation.
2. We present a visual expression generator module, which retrieves informative visual tokens via local-global linguistic cues and refines them to mitigate distraction from noisy information and to share visual attributes (see the sketch after this list). This module enables the visual expression to consider comprehensive contexts and capture semantic visual contexts for fine-grained segmentation.
3. Our method consistently shows strong performance on four public RIS benchmarks. The visual analysis of attention results clearly demonstrates the effectiveness of our framework for referring image segmentation.
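The sketch below illustrates the retrieval-and-refinement step described in contribution 2. The cue construction (sentence feature plus mean-pooled word features), the cosine-similarity scoring, the top-k selection, and the Transformer-encoder refinement layer are illustrative assumptions, not necessarily our exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpressionGenerator(nn.Module):
    """Sketch of the VEG idea: score visual tokens against a local-global
    linguistic cue, keep the top-k as the visual expression, and refine them.
    Module and variable names here are illustrative, not the paper's."""
    def __init__(self, dim: int, num_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        self.num_tokens = num_tokens
        # Refinement among the retrieved tokens (assumed to be a standard
        # Transformer encoder layer in this sketch).
        self.refine = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor,
                word_feats: torch.Tensor,
                sentence_feat: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, C) image tokens from the vision encoder
        # word_feats:    (B, L, C) word-level (local) linguistic features
        # sentence_feat: (B, C)    sentence-level (global) linguistic feature
        cue = sentence_feat.unsqueeze(1) + word_feats.mean(dim=1, keepdim=True)  # (B, 1, C)
        scores = F.cosine_similarity(visual_tokens, cue, dim=-1)                 # (B, N)
        topk = scores.topk(self.num_tokens, dim=1).indices                       # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        retrieved = visual_tokens.gather(1, idx)                                 # (B, k, C)
        # Self-attention among the retrieved tokens to suppress noisy tokens
        # and share informative visual attributes.
        return self.refine(retrieved)
```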
Table 1. Performance comparison with state-of-the-art methods on four public referring image segmentation datasets. LLM-based models are marked in gray. † indicates models trained on multiple RefCOCO-series datasets from which validation and test images are removed to prevent data leakage. For the ReferIt dataset, only the ReferIt training set is used.
Table 2. Main ablation study on the effectiveness of our method. LE: Linguistic Expression tokens. VE: Visual Expression tokens (Ours).
Table 3. Ablation studies for the design of the visual expression generator module.
Table 4. Ablation for different encoder fusion types to verify the versatility of our method.
Figure 3. Ablation study on the number of retrieved informative visual tokens.