Popular news articles attract thousands of online comments, making manual review tedious and time-consuming. Automatically clustering similar comments can help reduce the burden of manual analysis, but appropriate feature words must be selected for successful clustering. In this paper, we present a data-driven feature word selection method that realizes structurally superior clustering of online comments. The top 1,000 most frequent nouns appearing across the entire corpus of 7.44 million Korean online comments are selected to construct an overall noun set. Frequent nouns in the online comments of each news article are selected to construct the local noun set. The intersection of the local and overall noun sets is taken to construct the global noun set. The global noun set is removed from the corresponding local noun set to construct the distinct noun set. The top 250 most frequent nouns are selected from each of the local, global, and distinct noun sets for K-means clustering. The clustered results are evaluated using three internal cluster validation indices: Dunn, PBM, and Silhouette. Online comments clustered using distinct nouns produced structurally superior clusters compared with those clustered using local or global nouns.
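The noun-set construction described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes nouns have already been extracted and counted (for example, by a Korean morphological analyzer, which is not shown), and it treats every noun that occurs in an article's comments as a candidate for the local set, since the abstract does not state a frequency threshold. The names `top_nouns` and `build_noun_sets` and the toy counts are hypothetical.

```python
from collections import Counter


def top_nouns(counts, k):
    """Return the k most frequent nouns, most frequent first."""
    return [noun for noun, _ in counts.most_common(k)]


def build_noun_sets(corpus_counts, article_counts,
                    overall_size=1000, feature_size=250):
    """Build local, global, and distinct noun lists for one article.

    corpus_counts:  noun frequencies over the whole comment corpus.
    article_counts: noun frequencies over one article's comments.
    """
    # Overall noun set: most frequent nouns across the entire corpus.
    overall = set(top_nouns(corpus_counts, overall_size))

    # Local nouns, ranked by frequency within this article
    # (assumption: no minimum-frequency cutoff is applied).
    ranked_local = top_nouns(article_counts, len(article_counts))

    # Global nouns: local nouns that also appear in the overall set.
    global_nouns = [n for n in ranked_local if n in overall]

    # Distinct nouns: local nouns left after removing the global ones.
    distinct_nouns = [n for n in ranked_local if n not in overall]

    # Keep the most frequent nouns of each type as clustering features.
    return {
        "local": ranked_local[:feature_size],
        "global": global_nouns[:feature_size],
        "distinct": distinct_nouns[:feature_size],
    }


# Toy example with made-up counts and small set sizes.
corpus_counts = Counter(
    {"government": 50, "people": 40, "money": 30, "ferry": 2, "captain": 1}
)
article_counts = Counter(
    {"ferry": 9, "people": 7, "captain": 5, "government": 3}
)
noun_sets = build_noun_sets(
    corpus_counts, article_counts, overall_size=3, feature_size=2
)
print(noun_sets)
```

Each resulting list would then define the vocabulary for a per-comment feature vector, which is what the K-means step clusters. How those vectors are weighted is not specified in the abstract, so it is left out here.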
|Title of host publication||2016 International Conference on Big Data and Smart Computing, BigComp 2016|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||4|
|Publication status||Published - 2016 Mar 3|
|Event||International Conference on Big Data and Smart Computing, BigComp 2016 - Hong Kong, China (Duration: 2016 Jan 18 → 2016 Jan 20)|
Bibliographical note
Funding Information:
This work was in part supported by the Convergence Research Center (CRC) Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (MSIP), Korea (NRF-2015R1A5A7037615), and in part by the Ministry of Science, ICT and Future Planning (MSIP), Korea, under the "IT Consilience Creative Program" (IITP-2015-R0346-15-1008) supervised by the Institute for Information & Communications Technology Promotion (IITP).
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems
- Information Systems and Management