Research on Multimodal Retrieval Methods for Historical Buildings

Jiameng Yuan ^1; Lang Chen ^1; Weiya Chen ^1,2; Hanbin Luo ^1,2

2024, 16(4): 7-13. doi: 10.16670/j.cnki.cn11-5823/tu.2024.04.02

1.	School of Civil and Hydraulic Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
2.	National Digital Construction Technology Innovation Center, Wuhan 430074, China

Corresponding author: Weiya Chen

Available online: 2024-08-20

Author biography: Jiameng Yuan (born 1999), female, master's degree; main research interests: computer vision and historical building conservation

Funding: National Natural Science Foundation of China (Grant No. 72001086)

Citation: Jiameng Yuan, Lang Chen, Weiya Chen, Hanbin Luo. Research on Multimodal Retrieval Methods for Historical Buildings[J]. Journal of Information Technology in Civil Engineering and Architecture, 2024, 16(4): 7-13. doi: 10.16670/j.cnki.cn11-5823/tu.2024.04.02

Abstract: Information retrieval in HBIM (Historic Building Information Modeling) databases faces three problems: (1) there are no universal rules for judging the similarity between buildings; (2) the historical and cultural information carried by the buildings themselves is not taken into account; and (3) queries are mostly keyword-based, so information not covered by the keywords is hard to retrieve. To address these problems, this paper proposes a multimodal retrieval method for historical buildings: the user supplies an image or natural-language text, and the method returns the buildings that match the input features as a ranked list. For image-based retrieval, the "dino_vit16" model is used to extract image features, and the proposed image-building retrieval method reaches a retrieval accuracy of 90.08%. For text-based retrieval, the CLIP (Contrastive Language-Image Pre-training) model is used to link images and text; the study examines the weights given to the image-text similarity and the text similarity and selects m = 0.6 and n = 0.4 as the best configuration. Experiments show that the proposed text-building retrieval algorithm performs best on queries that contain a specific appearance feature and worst on queries that describe a particular function and architectural style. When a query contains four or more mixed features and can describe the basic appearance of a building, the method accurately retrieves the buildings that meet the criteria.

Keywords: Historical Building, Historic Building Information Modeling (HBIM), Vision Transformer (ViT), Similarity Measurement, Multimodal Retrieval
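The two retrieval modes summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that features have already been extracted (the paper uses "dino_vit16" for images and CLIP for image-text matching), and it assumes that buildings are compared by cosine similarity and that the two text-retrieval scores are combined linearly, with m weighting the image-text similarity and n weighting the text similarity. The abstract states only the weight values, so the measure, the linear form and the assignment of m and n are all assumptions here. All function names, building names and numbers below are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_image(query_vec, building_vecs):
    """Image-to-building retrieval: sort buildings by similarity to the query feature."""
    scores = {name: cosine(query_vec, vec) for name, vec in building_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rank_by_text(image_text_sim, text_text_sim, m=0.6, n=0.4):
    """Text-to-building retrieval: fuse two per-building similarity scores.

    Assumption: m weights the image-text similarity and n the text similarity.
    """
    fused = {
        name: m * image_text_sim[name] + n * text_text_sim[name]
        for name in image_text_sim
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy 3-dimensional "features"; real ViT/CLIP embeddings have hundreds of dimensions.
buildings = {
    "villa_a": [0.9, 0.1, 0.0],
    "church_b": [0.1, 0.8, 0.3],
    "bank_c": [0.0, 0.2, 0.9],
}
query_image_feature = [0.8, 0.2, 0.1]
image_ranking = rank_by_image(query_image_feature, buildings)

# Toy similarity scores for one text query against each building.
image_text_sim = {"villa_a": 0.30, "church_b": 0.50, "bank_c": 0.20}
text_text_sim = {"villa_a": 0.90, "church_b": 0.40, "bank_c": 0.10}
text_ranking = rank_by_text(image_text_sim, text_text_sim)

for name, score in text_ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data the image-text score alone would place church_b first, while the fused score places villa_a first, which shows why the relative size of m and n matters for the final list.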

