Data-driven feature word selection for clustering online news comments

Heeryon Cho, Jong-Seok Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Popular news articles attract thousands of online comments, making manual review tedious and time-consuming. Automatically clustering similar comments can reduce the burden of manual analysis, but successful clustering requires appropriate feature words. In this paper, we present a data-driven feature word selection method that yields structurally superior clusters of online comments. The 1,000 most frequent nouns across the entire corpus of 7.44 million Korean online comments are selected to construct an overall noun set. Frequent nouns in the online comments of each news article are selected to construct that article's local noun set. The intersection of the local and overall noun sets forms the global noun set. Removing the global noun set from the corresponding local noun set yields the distinct noun set. The 250 most frequent nouns from each of the local, global, and distinct noun sets are used as features for K-means clustering. The clustering results are evaluated using three internal cluster validation indices: Dunn, PBM, and Silhouette. Comments clustered using distinct nouns produced structurally superior clusters compared with those clustered using local or global nouns.
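
The noun set construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes nouns have already been extracted from each comment (the Korean morphological analysis is not shown), it treats the local noun set as all nouns of one article ranked by total frequency (the abstract does not state a threshold), and the names `rank_nouns` and `build_noun_sets` as well as the toy data are hypothetical.

```python
from collections import Counter


def rank_nouns(comments):
    """Rank nouns by total frequency across tokenized comments, most frequent first."""
    counts = Counter(noun for nouns in comments for noun in nouns)
    return [noun for noun, _ in counts.most_common()]


def build_noun_sets(corpus_comments, article_comments, overall_k=1000, feature_k=250):
    """Build local, global, and distinct feature noun lists for one article.

    corpus_comments  -- noun lists for every comment in the whole corpus
    article_comments -- noun lists for the comments of a single news article
    """
    # Overall noun set: the overall_k most frequent nouns in the whole corpus.
    overall = set(rank_nouns(corpus_comments)[:overall_k])

    # Local nouns: nouns of this article, ranked by frequency
    # (assumption: no minimum-frequency cutoff is applied).
    local_ranked = rank_nouns(article_comments)

    # Global = local intersected with overall; distinct = local minus global.
    global_ranked = [n for n in local_ranked if n in overall]
    distinct_ranked = [n for n in local_ranked if n not in overall]

    # Keep the feature_k most frequent nouns of each type as feature words.
    return {
        "local": local_ranked[:feature_k],
        "global": global_ranked[:feature_k],
        "distinct": distinct_ranked[:feature_k],
    }


# Toy English data standing in for extracted Korean nouns.
corpus = [["government", "tax"], ["government", "economy"],
          ["tax", "player"], ["government"]]
article = [["player", "goal", "government"], ["goal", "referee"],
           ["goal", "player"]]

sets = build_noun_sets(corpus, article, overall_k=2, feature_k=2)
```

Each resulting list could then serve as the vocabulary for vectorizing an article's comments before K-means clustering; the clustering and the Dunn, PBM, and Silhouette evaluation are outside the scope of this sketch.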

Original language: English
Title of host publication: 2016 International Conference on Big Data and Smart Computing, BigComp 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 494-497
Number of pages: 4
ISBN (Electronic): 9781467387965
DOIs: https://doi.org/10.1109/BIGCOMP.2016.7425977
Publication status: Published - 2016 Mar 3
Event: International Conference on Big Data and Smart Computing, BigComp 2016 - Hong Kong, China
Duration: 2016 Jan 18 to 2016 Jan 20

Publication series

Name: 2016 International Conference on Big Data and Smart Computing, BigComp 2016

Other

Other: International Conference on Big Data and Smart Computing, BigComp 2016
Country: China
City: Hong Kong
Period: 16/1/18 to 16/1/20

Fingerprint

Clustering
News
K-means clustering
Burden

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management

Cite this

Cho, H., & Lee, J-S. (2016). Data-driven feature word selection for clustering online news comments. In 2016 International Conference on Big Data and Smart Computing, BigComp 2016 (pp. 494-497). [7425977] (2016 International Conference on Big Data and Smart Computing, BigComp 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BIGCOMP.2016.7425977
Cho, Heeryon ; Lee, Jong-Seok. / Data-driven feature word selection for clustering online news comments. 2016 International Conference on Big Data and Smart Computing, BigComp 2016. Institute of Electrical and Electronics Engineers Inc., 2016. pp. 494-497 (2016 International Conference on Big Data and Smart Computing, BigComp 2016).
@inproceedings{c276181189294ef2b96224b3aa4f0452,
title = "Data-driven feature word selection for clustering online news comments",
author = "Heeryon Cho and Jong-Seok Lee",
year = "2016",
month = "3",
day = "3",
doi = "10.1109/BIGCOMP.2016.7425977",
language = "English",
series = "2016 International Conference on Big Data and Smart Computing, BigComp 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "494--497",
booktitle = "2016 International Conference on Big Data and Smart Computing, BigComp 2016",
address = "United States",

}

Cho, H & Lee, J-S 2016, Data-driven feature word selection for clustering online news comments. in 2016 International Conference on Big Data and Smart Computing, BigComp 2016., 7425977, 2016 International Conference on Big Data and Smart Computing, BigComp 2016, Institute of Electrical and Electronics Engineers Inc., pp. 494-497, International Conference on Big Data and Smart Computing, BigComp 2016, Hong Kong, China, 16/1/18. https://doi.org/10.1109/BIGCOMP.2016.7425977

Data-driven feature word selection for clustering online news comments. / Cho, Heeryon; Lee, Jong-Seok.

2016 International Conference on Big Data and Smart Computing, BigComp 2016. Institute of Electrical and Electronics Engineers Inc., 2016. p. 494-497 7425977 (2016 International Conference on Big Data and Smart Computing, BigComp 2016).

TY - GEN

T1 - Data-driven feature word selection for clustering online news comments

AU - Cho, Heeryon

AU - Lee, Jong-Seok

PY - 2016/3/3

Y1 - 2016/3/3

UR - http://www.scopus.com/inward/record.url?scp=84964678923&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84964678923&partnerID=8YFLogxK

U2 - 10.1109/BIGCOMP.2016.7425977

DO - 10.1109/BIGCOMP.2016.7425977

M3 - Conference contribution

T3 - 2016 International Conference on Big Data and Smart Computing, BigComp 2016

SP - 494

EP - 497

BT - 2016 International Conference on Big Data and Smart Computing, BigComp 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Cho H, Lee J-S. Data-driven feature word selection for clustering online news comments. In 2016 International Conference on Big Data and Smart Computing, BigComp 2016. Institute of Electrical and Electronics Engineers Inc. 2016. p. 494-497. 7425977. (2016 International Conference on Big Data and Smart Computing, BigComp 2016). https://doi.org/10.1109/BIGCOMP.2016.7425977