Abstract
Zero-shot learning with vision-language pretrained (VLP) models is expected to be an alternative to existing deep learning models for defect detection when datasets are insufficient. However, VLP models, including contrastive language-image pretraining (CLIP), show performance that fluctuates with the prompts (inputs), which has motivated research on prompt engineering, that is, the optimization of prompts to improve performance. This study therefore aims to identify the features of a prompt that yield the best performance in classifying and detecting building defects using the zero-shot and few-shot capabilities of CLIP. The results reveal the following: (1) domain-specific definitions are better than general definitions and images; (2) a complete sentence is better than a set of core terms; and (3) multimodal information is better than single-modal information. The detection performance achieved with the proposed prompting method outperformed that of existing supervised models.
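The zero-shot setup the abstract describes can be sketched as follows: CLIP embeds the image and each candidate text prompt into a shared space, and the prompt with the highest cosine similarity to the image wins. The sketch below is a minimal, self-contained illustration with made-up toy vectors; a real pipeline would use CLIP's image and text encoders (e.g. 512-dimensional embeddings), and the prompt strings follow the paper's finding that full-sentence, domain-specific prompts work best, but the exact wording here is hypothetical.

```python
# Toy sketch of CLIP-style zero-shot classification: compare an image
# embedding against text-prompt embeddings and take the best match.
# All vectors below are invented for illustration, not real CLIP outputs.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, prompt_embs):
    """Return (best prompt label, all similarity scores)."""
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical full-sentence, domain-specific prompts (finding 1 and 2).
prompts = {
    "a photo of a concrete wall with a crack defect": [0.9, 0.1, 0.2],
    "a photo of a concrete wall with no defect":      [0.1, 0.9, 0.3],
}
image = [0.85, 0.15, 0.25]  # toy "image embedding"

label, scores = zero_shot_classify(image, prompts)
print(label)  # the crack-defect prompt scores highest for this toy vector
```

Because no labeled training data enters this step, only the prompt wording and the pretrained embeddings determine the prediction, which is why the prompt features studied in the paper matter so much.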
Original language | English |
---|---|
Journal | Computer-Aided Civil and Infrastructure Engineering |
Publication status | Accepted/In press - 2022 |
Bibliographical note
Funding Information: National Research Foundation of Korea (NRF), Grant/Award Number: 2021RIA2C300820969
Publisher Copyright:
© 2022 Computer-Aided Civil and Infrastructure Engineering.
All Science Journal Classification (ASJC) codes
- Civil and Structural Engineering
- Building and Construction
- Computer Science Applications
- Computer Graphics and Computer-Aided Design
- Computational Theory and Mathematics