Abstract
Neither traditional rule-based named entity recognition (NER) nor the latest language models perform well at extracting information from noisy text, that is, text containing linguistic errors, slang, loanwords, and jargon. Building defect complaints filed by residents through online systems are a representative example of such noisy text. This paper proposes an NER method for automatically extracting defect information from noisy text using a defect thesaurus and transfer learning. The thesaurus built herein included 1,097 defect named entities in 23 categories. NER performance was tested on 69,750 defect complaints through transfer learning of three representative pre-trained language models: Multilingual Bidirectional Encoder Representations from Transformers (BERT), Korean BERT (KoBERT), and Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA). The proposed method achieved an average F1 score of 91.0% using KoBERT, which is higher than the open benchmark NER performance for clean text (86.1%).
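As a reading aid for the F1 figures quoted in the abstract, the following is a minimal sketch of how an entity-level F1 score can be computed from BIO-tagged sequences. It assumes exact span matching and a BIO tagging scheme; the abstract states neither, and the category names `CRACK` and `LEAK` are hypothetical placeholders rather than entries from the paper's thesaurus.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, category); an assumed representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            # "O" or a stray "I-" tag with no matching open span
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over exactly matching (start, end, category) spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two gold entities, one of them predicted.
gold_tags = ["B-CRACK", "I-CRACK", "O", "B-LEAK"]
pred_tags = ["B-CRACK", "I-CRACK", "O", "O"]
score = entity_f1(gold_tags, pred_tags)
```

In this example precision is 1.0 and recall is 0.5, so the F1 score is 2/3. An average over categories, as reported in the abstract, would require computing the score per category and averaging; how the paper averaged is not stated here.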
Original language | English |
---|---|
Article number | 104543 |
Journal | Automation in Construction |
Volume | 143 |
Publication status | Published - 2022 Nov |
Bibliographical note
Funding Information: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (No. 2021R1A2C3008209) and an Institute for Information and Communications Technology Planning and Evaluation (IITP) grant (No. 2019-0-01559-001), both funded by the Ministry of Science and ICT (MSIT) of Korea.
Publisher Copyright:
© 2022 Elsevier B.V.
All Science Journal Classification (ASJC) codes
- Control and Systems Engineering
- Civil and Structural Engineering
- Building and Construction