Named entity recognition of building construction defect information from text with linguistic noise

Kahyun Jeon, Ghang Lee, Seongmin Yang, H. David Jeong

Research output: Contribution to journal › Article › peer-review


Neither traditional rule-based named entity recognition (NER) nor the latest language models perform well in information extraction from noisy text, that is, text that contains linguistic errors, slang, loanwords, and jargon. Building defect complaints filed by residents via online systems are a representative example of such noisy text. This paper proposes an NER method for automatically extracting defect information from noisy text using a defect thesaurus and transfer learning. The thesaurus built herein included 1097 defect named entities in 23 categories. The NER performance was tested using 69,750 defect complaints through transfer learning of three representative pre-trained language models: Multilingual Bidirectional Encoder Representations from Transformers (BERT), Korean BERT (KoBERT), and Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA). The proposed method achieved an average F1 score of 91.0% using KoBERT. This score was higher than the open benchmark NER performance for clean text (86.1%).
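As a reading aid for the F1 figures quoted above, the following is a minimal sketch of how an entity-level F1 score can be computed from BIO-tagged sequences. It is not the authors' evaluation code. The function names (`bio_to_spans`, `entity_f1`) and the labels `LOCATION` and `DEFECT` are hypothetical, and the abstract does not state how the reported average was taken across the 23 categories, so the sketch simply pools all entities (micro-averaging) as one plausible convention.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, label) spans from one BIO tag sequence.

    Assumption: an I- tag that does not continue an open span of the
    same label is ignored rather than treated as a new entity.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)   # spans predicted with correct boundaries and label
        fp += len(p - g)   # predicted spans absent from the gold annotation
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels: one of two gold entities is found.
gold = [["B-LOCATION", "I-LOCATION", "O", "B-DEFECT"]]
pred = [["B-LOCATION", "I-LOCATION", "O", "O"]]
score = entity_f1(gold, pred)
```

In the toy example, precision is 1.0 and recall is 0.5, so the score is 2/3. A macro-average over categories would instead compute F1 per label and take the mean, which can differ noticeably when categories are imbalanced.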

Original language: English
Article number: 104543
Journal: Automation in Construction
Publication status: Published - 2022 Nov

Bibliographical note

Funding Information:
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (No. 2021R1A2C3008209) and an Institute for Information and Communications Technology Planning and Evaluation (IITP) grant (No. 2019-0-01559-001), both funded by the Ministry of Science and ICT (MSIT) of Korea.

Publisher Copyright:
© 2022 Elsevier B.V.

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Civil and Structural Engineering
  • Building and Construction

