Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods typically adopt a Transformer-based decoder in which language-based tokens serve as the key-value set (i.e., an information provider) that assigns attention to target regions for the vision query features. However, frameworks guided by language-based tokens have an inherent limitation: language information alone is insufficient to direct the network's attention toward fine-grained target regions, even with advanced language models. To address this issue, we propose a novel Visual Informative Part Attention (VIPA) framework that leverages a visual expression, generated from the informative parts of the visual context, as the key-value set in the decoder for Transformer-based referring image segmentation. The visual expression effectively provides structural and semantic visual information to the network. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens to reduce noisy information and share informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture the semantic visual context of informative regions. In this way, our framework enables the network's attention to align robustly with fine-grained regions of interest. Extensive experiments and visual analyses demonstrate the effectiveness of our approach: VIPA outperforms existing state-of-the-art methods on four public RIS benchmarks.
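To make the decoder idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a decoder layer in which pixel-level vision queries cross-attend to visual expression tokens used as the key-value set; all module names, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions.

```python
# Minimal sketch: vision query features attend to visual expression tokens
# (the key-value set) instead of language tokens. Names and sizes are assumptions.
import torch
import torch.nn as nn


class VisualExpressionDecoderLayer(nn.Module):
    """One decoder layer whose key-value set is the visual expression tokens."""

    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, vision_queries: torch.Tensor, visual_expression: torch.Tensor):
        # vision_queries:    (B, H*W, d_model) pixel/patch features acting as queries
        # visual_expression: (B, K, d_model)   informative visual tokens (key-value set)
        attended, _ = self.cross_attn(
            query=vision_queries, key=visual_expression, value=visual_expression
        )
        x = self.norm(vision_queries + attended)
        return self.ffn_norm(x + self.ffn(x))
```

Replacing the linguistic key-value set with these retrieved visual tokens is what lets the attention weights be computed against region-level visual evidence rather than language features alone.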
1. We propose a novel Visual Informative Part Attention (VIPA) framework for Transformer-based referring image segmentation, which leverages the informative parts of the visual context (i.e., a visual expression) as the key-value set in the Transformer decoder. The visual expression robustly guides the network's attention to the regions of interest. Our approach is the first to explore the potential of visual expression in the attention mechanism for referring image segmentation.
2. We present a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic cues and refines them to mitigate distraction from noisy information and to share informative visual attributes (a minimal sketch follows this list). This module enables the visual expression to consider comprehensive contexts and capture semantic visual contexts for fine-grained segmentation.
3. Our method consistently shows strong performance and surpasses state-of-the-art methods on four public RIS benchmarks. Visual analysis of the attention results clearly demonstrates the effectiveness of our framework for Transformer-based referring image segmentation.
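The sketch below illustrates one plausible realization of the VEG module described in contributions 1-2: visual tokens are scored by local (word-level) and global (sentence-level) linguistic cues, the top-K most informative tokens are retrieved, and a self-attention layer refines them. The scoring rule, the value of K, and the refinement layer are our assumptions, not the paper's exact design.

```python
# Hedged sketch of a visual expression generator: retrieve informative visual
# tokens with local-global linguistic cues, then refine them. All details are
# illustrative assumptions.
import torch
import torch.nn as nn


class VisualExpressionGenerator(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 8, k: int = 16):
        super().__init__()
        self.k = k
        # Refinement step: lets retrieved tokens exchange attributes and suppress noise.
        self.refine = nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=4 * d_model, batch_first=True
        )

    def forward(self, visual_tokens, word_feats, sent_feat):
        # visual_tokens: (B, N, d)  flattened image features
        # word_feats:    (B, L, d)  local linguistic cues (per word)
        # sent_feat:     (B, d)     global linguistic cue (whole sentence)
        local_score = torch.einsum("bnd,bld->bn", visual_tokens, word_feats) / word_feats.size(1)
        global_score = torch.einsum("bnd,bd->bn", visual_tokens, sent_feat)
        score = local_score + global_score                       # local-global fusion
        topk = score.topk(self.k, dim=1).indices                  # most informative tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        retrieved = visual_tokens.gather(1, idx)                  # (B, K, d)
        return self.refine(retrieved)                             # refined visual expression
```

The output tokens would then serve as the key-value set of the decoder layer sketched earlier.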
Table 1. Performance comparison with state-of-the-art methods using the oIoU (%) metric on three public referring image segmentation datasets. LLM-based models are marked in gray. † indicates models trained on multiple RefCOCO-series datasets with validation and test images removed to prevent data leakage. For the ReferIt dataset, only the ReferIt training set is used.
Table 2. Performance comparison of recent RIS methods using the mIoU (%) metric. † indicates models trained on multiple RefCOCO-series datasets.
Table 3. Main ablation study on the effectiveness of our method. LE: Linguistic Expression tokens. VE: Visual Expression tokens (ours).
Table 4. Ablation studies on the design of the visual expression generator (VEG) module.
Figure 3. Ablation study on the number of retrieved informative visual tokens.