Research on Multimodal Retrieval Methods for Historical Buildings

Yuan Jiameng (袁嘉梦); Chen Lang (陈浪); Chen Weiya (陈维亚); Luo Hanbin (骆汉宾)

doi:10.16670/j.cnki.cn11-5823/tu.2024.04.02

Abstract: Querying information in an HBIM (Historic Building Information Modeling) database faces three problems: (1) there is no universal rule for judging the similarity between buildings; (2) the historical and cultural information carried by the buildings themselves is not considered; and (3) queries are mostly keyword-based, which makes it hard to retrieve information that the keywords do not cover. To address these problems, this paper proposes a multimodal retrieval method for historical buildings: users input an image or natural-language text and retrieve the buildings that match the input features, returned as a ranked list. For image-based retrieval, the "dino_vit16" model is used to extract image features, and the proposed image-building retrieval method reaches a retrieval accuracy of 90.08%. For text-based retrieval, the CLIP (Contrastive Language-Image Pre-training) model is used to associate images with text; the study examines the weights of the image-text similarity and the text similarity and selects m = 0.6 and n = 0.4 as the best configuration. Experiments show that the proposed text-building retrieval algorithm performs best on queries that contain an appearance feature and worst on queries that describe a function and an architectural style. When a query contains four or more mixed features and describes the basic appearance of a building, the algorithm accurately retrieves the buildings that meet the criteria.
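The abstract names the two retrieval paths but does not give their scoring formulas. The following is a minimal sketch under stated assumptions: image-based retrieval is taken to rank buildings by cosine similarity between feature vectors, and text-based retrieval is taken to combine the two similarities linearly, with m weighting the image-text similarity and n weighting the text similarity. The function names, the linear form, and the toy data are illustrative and are not taken from the paper; real feature vectors would come from the DINO and CLIP models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_image(query_vec, building_vecs):
    """Rank building ids by cosine similarity to the query image features.

    Assumption: each building is represented by a single feature vector.
    """
    scores = {bid: cosine(query_vec, vec) for bid, vec in building_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rank_by_text(image_text_sim, text_sim, m=0.6, n=0.4):
    """Rank building ids by an assumed linear fusion of two similarity scores.

    Assumption: score = m * image-text similarity + n * text similarity.
    The defaults are the weights reported in the abstract.
    """
    fused = {bid: m * image_text_sim[bid] + n * text_sim[bid] for bid in image_text_sim}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, purely illustrative.
image_ranking = rank_by_image(
    [1.0, 0.0],
    {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]},
)
text_ranking = rank_by_text(
    {"A": 0.30, "B": 0.25},  # hypothetical image-text similarities
    {"A": 0.10, "B": 0.60},  # hypothetical text similarities
)
```

With these toy values, building B overtakes A in the text ranking even though A has the higher image-text similarity, which shows how the text-similarity term can change the order under the assumed fusion.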
