Abstract:
Cross-modal image-text retrieval aims to retrieve the corresponding images or texts given a query from the other modality. Traditional retrieval paradigms rely on deep learning to extract feature representations from images and texts and map them into a common semantic space for semantic matching. However, such methods rely on correlations in the data rather than the true causal relationships behind it, and they face challenges in representing and interpreting high-level semantic information. We therefore introduce causal inference and consensus knowledge on top of deep learning and propose a causal image-text retrieval method with embedded consensus knowledge. Specifically, we incorporate causal interventions into the visual feature extraction module, replacing correlational relationships with causal relationships to learn causal visual features that capture underlying knowledge. These causal visual features are then concatenated with the original visual features to obtain the final visual feature representation. To address the insufficient text feature representation in this setting, we adopt the more powerful text encoder BERT and embed consensus knowledge shared between the two modalities for consensus-level representation learning of image-text features. Experimental results on the MS-COCO dataset demonstrate that our approach achieves consistent improvements in Recall@K and mR on bidirectional image-text retrieval tasks.
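The following is a minimal sketch, not the authors' implementation, of the pipeline the abstract outlines: a causal-intervention branch over region-level visual features whose output is concatenated with the original features, a BERT text encoder, and a shared embedding space scored by cosine similarity for retrieval. All module names, dimensions, and the pooling choice (e.g., `CausalIntervention`, `visual_dim=2048`, mean pooling over regions) are illustrative assumptions.

```python
# Hypothetical sketch of the abstract's pipeline; names and dimensions are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class CausalIntervention(nn.Module):
    """Placeholder for the causal-intervention branch over region features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_feats):  # (batch, regions, dim)
        return torch.relu(self.proj(visual_feats))


class ImageTextRetrievalModel(nn.Module):
    def __init__(self, visual_dim=2048, embed_dim=1024):
        super().__init__()
        self.causal_branch = CausalIntervention(visual_dim)
        # Causal visual features are concatenated with the original ones, then projected.
        self.visual_proj = nn.Linear(visual_dim * 2, embed_dim)
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, visual_feats):
        causal_feats = self.causal_branch(visual_feats)
        fused = torch.cat([visual_feats, causal_feats], dim=-1)
        pooled = fused.mean(dim=1)  # pool over regions
        return nn.functional.normalize(self.visual_proj(pooled), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation
        return nn.functional.normalize(self.text_proj(cls), dim=-1)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ImageTextRetrievalModel()
regions = torch.randn(2, 36, 2048)  # e.g., 36 detected regions per image
batch = tokenizer(["a dog on a beach", "two people riding bikes"],
                  return_tensors="pt", padding=True)
img_emb = model.encode_image(regions)
txt_emb = model.encode_text(batch["input_ids"], batch["attention_mask"])
similarity = img_emb @ txt_emb.t()  # cosine similarities used for ranking in retrieval
print(similarity.shape)  # torch.Size([2, 2])
```

Recall@K and mR would then be computed by ranking these similarity scores over the full candidate set in each retrieval direction.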