Self-optimization of training dataset improves forecasting of cyanobacterial bloom by machine learning

Jayun Kim, Woosik Jung, Jusuk An, Hyun Je Oh, Joonhong Park

Research output: Contribution to journal › Article › peer-review

Abstract

Data-driven model (DDM) prediction of aquatic ecological responses, such as cyanobacterial harmful algal blooms (CyanoHABs), is critically influenced by the choice of training dataset. However, no systematic method has yet been developed for choosing the optimal training dataset while accounting for data history. A comprehensive procedure built around an algorithm that selects the optimal training dataset on its own would allow DDM performance to self-improve. In this study, a novel algorithm was developed to self-generate candidate training datasets from the available input and output variable data and to self-choose the optimal training dataset that maximizes CyanoHAB forecasting performance. Nine years of meteorological and water quality data (input) and CyanoHAB data (output) from a site on the Nakdong River, South Korea, were acquired and pretreated via an automated process. An artificial neural network (ANN) was chosen from among the DDM candidates by first-cut training and validation using the entire collected dataset. Optimal training datasets for the ANN were self-selected from among the self-generated candidates by systematically simulating performance across 46 periods and 40 sizes (number of data elements) of the generated training datasets. The best-performing models were then screened as candidate models. The best performance corresponded to 6–7 years of training data (∼18 % lower error) for forecasting lead times (FLTs) of 1–28 d. After the hyperparameters of the screened model candidates were fine-tuned, the best-performing model (7 years of data with a 14 d FLT) was self-determined by comparing its forecasts with unseen CyanoHAB events. The self-determined model could reasonably predict CyanoHABs occurring in Korean waters (cyanobacteria cells/mL ≥ 1000). Thus, the proposed method of self-optimizing the training dataset effectively improved the predictive accuracy and operational efficiency of DDM prediction of CyanoHABs.
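The idea of generating candidate training datasets by period and size and keeping the one that forecasts best can be illustrated with a minimal sketch. This is not the authors' algorithm: a least-squares linear model stands in for the ANN, the data are synthetic, and every name and setting here (`select_training_window`, `sizes`, `step`, `n_val`, the regime shift at sample 300) is an assumption made only for illustration.

```python
import numpy as np

def make_supervised(X, y, lead):
    """Pair inputs at time t with the target observed `lead` steps later."""
    return X[:-lead], y[lead:]

def fit_linear(X, y):
    """Least-squares linear model with intercept (stand-in for the ANN)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def select_training_window(X, y, lead, sizes, step, n_val):
    """Try every (start, size) training window and keep the one with the
    lowest validation RMSE on the most recent `n_val` samples."""
    Xs, ys = make_supervised(X, y, lead)
    X_val, y_val = Xs[-n_val:], ys[-n_val:]
    X_pool, y_pool = Xs[:-n_val], ys[:-n_val]
    best = None
    for size in sizes:
        # Slide a window of this size across the pool of older samples.
        for start in range(0, len(X_pool) - size + 1, step):
            coef = fit_linear(X_pool[start:start + size],
                              y_pool[start:start + size])
            err = rmse(predict(coef, X_val), y_val)
            if best is None or err < best[0]:
                best = (err, start, size)
    return best

# Synthetic series (assumption): the input-output relationship changes at
# t = 300, so older data misleads a model validated on recent samples.
rng = np.random.default_rng(0)
n, lead = 600, 7
X = rng.normal(size=(n, 2))
t = np.arange(n)
signal = np.where(t < 300, X @ np.array([1.0, -1.0]), X @ np.array([-1.0, 2.0]))
y = np.zeros(n)
y[lead:] = signal[:-lead] + 0.1 * rng.normal(size=n - lead)

best_err, best_start, best_size = select_training_window(
    X, y, lead=lead, sizes=(50, 100, 150, 200), step=50, n_val=100
)
print(f"best window: start={best_start}, size={best_size}, RMSE={best_err:.3f}")
```

In this toy setting, windows that reach back into the older regime fit the wrong relationship and score worse on recent data, which is one reason a training-set search might prefer a particular span of history over using everything available.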

Original language: English
Article number: 161398
Journal: Science of the Total Environment
Volume: 866
Publication status: Published - 2023 Mar 25

Bibliographical note

Funding Information:
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (No. 2018R1A6A1A08025348) and the Korea Environment Industry & Technology Institute through the project for developing innovative drinking water and wastewater technologies program funded by the Korean Ministry of the Environment (2020002700003). We are grateful to the National Institute of Environmental Research for sharing the data.

Publisher Copyright:
© 2023

All Science Journal Classification (ASJC) codes

  • Environmental Engineering
  • Environmental Chemistry
  • Waste Management and Disposal
  • Pollution
