Abstract:
Cross-modal image-text retrieval aims to retrieve the corresponding images or texts given a query from the other modality. Traditional retrieval paradigms rely on deep learning to extract feature representations from images and texts and map them into a common semantic space for semantic matching. However, such methods rely on correlations in the data rather than the true causal relationships behind it, and they face challenges in representing and interpreting high-level semantic information. We therefore introduce causal inference and consensus knowledge on top of deep learning and propose a causal image-text retrieval method with embedded consensus knowledge. Specifically, we incorporate causal interventions into the visual feature extraction module, replacing correlational relationships with causal relationships to learn causal visual features that capture underlying knowledge. These causal visual features are then concatenated with the original visual features to obtain the final visual feature representation. To address the insufficient text feature representation in this setting, we adopt the more powerful text encoder BERT and embed consensus knowledge shared between the two modalities for consensus-level representation learning of image-text features. Experimental results on the MS-COCO dataset demonstrate that our approach achieves consistent improvements in Recall@K and mR on bidirectional image-text retrieval tasks.
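The following is a minimal sketch, not the authors' implementation, of the pipeline the abstract outlines: a causal-intervention branch over region-level visual features whose output is concatenated with the original features, a BERT text encoder, and a shared embedding space scored by cosine similarity for retrieval. All module names, dimensions, and the pooling choice (e.g., `CausalIntervention`, `visual_dim=2048`, mean pooling over regions) are illustrative assumptions.

```python
# Hypothetical sketch of the abstract's pipeline; names and dimensions are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class CausalIntervention(nn.Module):
    """Placeholder for the causal-intervention branch over region features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_feats):  # (batch, regions, dim)
        return torch.relu(self.proj(visual_feats))


class ImageTextRetrievalModel(nn.Module):
    def __init__(self, visual_dim=2048, embed_dim=1024):
        super().__init__()
        self.causal_branch = CausalIntervention(visual_dim)
        # Causal visual features are concatenated with the original ones, then projected.
        self.visual_proj = nn.Linear(visual_dim * 2, embed_dim)
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, visual_feats):
        causal_feats = self.causal_branch(visual_feats)
        fused = torch.cat([visual_feats, causal_feats], dim=-1)
        pooled = fused.mean(dim=1)  # pool over regions
        return nn.functional.normalize(self.visual_proj(pooled), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation
        return nn.functional.normalize(self.text_proj(cls), dim=-1)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ImageTextRetrievalModel()
regions = torch.randn(2, 36, 2048)  # e.g., 36 detected regions per image
batch = tokenizer(["a dog on a beach", "two people riding bikes"],
                  return_tensors="pt", padding=True)
img_emb = model.encode_image(regions)
txt_emb = model.encode_text(batch["input_ids"], batch["attention_mask"])
similarity = img_emb @ txt_emb.t()  # cosine similarities used for ranking in retrieval
print(similarity.shape)  # torch.Size([2, 2])
```

Recall@K and mR would then be computed by ranking these similarity scores over the full candidate set in each retrieval direction.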