Deep similarity analysis and forecasting of actual outbreak of major infectious diseases using Internet-Sourced data

Beakcheol Jang, Yeongha Kim, Gun Il Kim, Jong Wook Kim

Research output: Contribution to journal › Article › peer-review


Perhaps no other generation in the span of recorded human history has endured the risks of infectious diseases as the current generation has. The prevalence of infectious diseases is caused mainly by unlimited contact between people in a highly globalized world. Centers for disease control and prevention (CDCs) promptly collect and produce disease outbreak statistics, but they rely on a curated, centralized collection system and require up to two weeks of lead time. Consequently, the quick release of disease outbreak information has become a great challenge. Infectious disease outbreak information is recorded and spread on the Internet much faster than CDC announcements, and Internet-sourced data have shown irreplaceable potential for monitoring and predicting infectious disease outbreaks in advance. In this study, we performed a thorough analysis to show the similarity, in terms of time-series volume, between the Korean Center of Disease Control (KCDC) infectious disease datasets and three kinds of Internet-sourced data for nine major infectious diseases. The results show that many infectious disease outbreaks are strongly related to Internet-sourced data. We analyzed several factors that affect the similarity. Our analysis shows that an increase in the volume of Internet-sourced data correlates with an increase in the number of infected people, and thus the two show positive similarity. We also found that a greater number of infectious disease outbreaks corresponds to a wider spread of outbreak regions, which in turn is associated with higher similarity. We present infectious disease outbreak prediction results obtained using various Internet-sourced data and an effective deep learning algorithm. The results show that prediction accuracy correlates positively with the number of infected people and with the volume of related web data.
We developed and currently operate a web-based system that shows the similarity between KCDC data and related Internet-sourced data for infectious diseases. This paper helps readers identify which kind of Internet-sourced data to use to predict and track a specific infectious disease outbreak. We considered as many as nine major diseases and three kinds of Internet-sourced data together, so our findings do not depend on any specific infectious disease or any specific kind of Internet-sourced data.
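As an illustration of what comparing two series "in terms of time-series volume" can look like, here is a minimal Python sketch. The abstract does not state which similarity measure the authors used, so Pearson correlation with a small lead-time search is an assumption made only for this example. The function names (`pearson`, `lagged_similarity`) and the weekly numbers are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_similarity(cases, mentions, max_lag):
    """Correlate mentions in week t with cases in week t + lag, for lag = 0..max_lag.

    A high score at a positive lag suggests the web signal leads the
    officially reported counts by that many weeks.
    """
    scores = {}
    for lag in range(max_lag + 1):
        shifted_cases = cases[lag:]
        aligned_mentions = mentions[:len(mentions) - lag]
        scores[lag] = pearson(aligned_mentions, shifted_cases)
    return scores


# Made-up illustrative numbers, NOT data from the study.
weekly_cases = [2, 3, 5, 9, 14, 11, 6, 3]      # hypothetical reported cases
weekly_mentions = [3, 5, 9, 14, 11, 6, 3, 2]   # hypothetical web mention counts

scores = lagged_similarity(weekly_cases, weekly_mentions, max_lag=2)
best_lag = max(scores, key=scores.get)
print(f"best lag: {best_lag} week(s), similarity: {scores[best_lag]:.3f}")
```

In this toy input the mention series is the case series shifted one week earlier, so the highest score appears at a lag of one week. Real surveillance data would be noisier, and a different similarity measure might be more appropriate.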

Original language: English
Article number: 104148
Journal: Journal of Biomedical Informatics
Publication status: Published - 2022 Sept

Bibliographical note

Funding Information:
This work was supported by the National Research Foundation of Korea Fund of NRF-2022R1F1A1063961.

Publisher Copyright:
© 2022 Elsevier Inc.

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Health Informatics

