Abstract:
The retrieval of historical buildings in HBIM database faces three main issues: 1) the absence of universal rules for determining the similarity between buildings; 2) the neglect of historical and cultural information inherent to the buildings themselves; 3) most queries rely on keywords, which imposes limitations of available information. Addressing these challenges, this paper introduces a multimodal retrieval approach for historical buildings. Users can retrieve a list of buildings matching their input features, whether through images or natural language text data. For image-based building retrieval, the "dino_vit16" model is employed for feature extraction, achieving a retrieval accuracy of 90.08% with the proposed image-building retrieval method. For text-based building retrieval, a connection between images and text is established through the CLIP model. The study explores the values of image-text similarity and text similarity weights, selecting m=0.6 and n=0.4 as the optimal configuration for these weights. Experimental results have shown that the proposed text-based architectural retrieval algorithm performs best when the query statement contains a specific visual feature, and it performs worst when the query statement describes a particular function and architectural style. However, when the query statement includes four or more mixed features that accurately describe the fundamental appearance of a building, it can accurately retrieve buildings that meet the criteria.