DBSTexC: Density-Based spatio-Textual clustering on twitter

Minh D. Nguyen, Won-Yong Shin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm and can discover multiple clusters of arbitrary shape. DBSCAN works well when the input data are homogeneous, but it may perform poorly when the input dataset has textual heterogeneity (e.g., when we intend to find clusters among geo-tagged social media posts that are relevant to a certain point-of-interest (POI)). In this paper, we present DBSTexC, a new density-based clustering algorithm that uses spatio-textual information on Twitter. We first define POI-relevant and POI-irrelevant tweets as the records that do and do not contain a POI name or its coherent variations, respectively. By taking into account the fractions of POI-relevant and POI-irrelevant tweets, DBSTexC achieves much higher clustering quality than DBSCAN in terms of the F1 score and its variants. DBSTexC can be regarded as a generalization of DBSCAN: it performs identically to DBSCAN when the inputs are homogeneous and far outperforms DBSCAN when the input data are heterogeneous.
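The abstract does not give the algorithm's exact rules, so the following is only a minimal Python sketch of the two ideas it names: labeling tweets as POI-relevant or POI-irrelevant by substring match, and a density test that looks at both groups inside a neighborhood. The `Tweet` class, the helper names, and the parameters `eps_km`, `min_relevant`, and `max_irrelevant` are assumptions made for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    lat: float
    lon: float
    text: str


def is_poi_relevant(tweet, poi_variants):
    """Return True if the text contains the POI name or one of its variants."""
    text = tweet.text.lower()
    return any(variant.lower() in text for variant in poi_variants)


def haversine_km(a, b):
    """Great-circle distance between two tweets in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def is_core(center, tweets, poi_variants, eps_km, min_relevant, max_irrelevant):
    """Hypothetical core-point test (an assumption, not the paper's definition).

    The neighborhood of `center` must hold at least `min_relevant`
    POI-relevant tweets and at most `max_irrelevant` POI-irrelevant ones.
    """
    neighbors = [t for t in tweets if haversine_km(center, t) <= eps_km]
    relevant = sum(is_poi_relevant(t, poi_variants) for t in neighbors)
    irrelevant = len(neighbors) - relevant
    return relevant >= min_relevant and irrelevant <= max_irrelevant
```

Under these assumptions, if every tweet is POI-relevant, the irrelevant count is always zero and the test reduces to DBSCAN's usual minimum-points check, which is in line with the abstract's remark that the two algorithms coincide on homogeneous input.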

Original language: English
Title of host publication: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017
Editors: Jana Diesner, Elena Ferrari, Guandong Xu
Publisher: Association for Computing Machinery, Inc
Pages: 23-26
Number of pages: 4
ISBN (Electronic): 9781450349932
DOIs: https://doi.org/10.1145/3110025.3110096
Publication status: Published - 2017 Jul 31
Event: 9th IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017 - Sydney, Australia
Duration: 2017 Jul 31 to 2017 Aug 3

Publication series

Name: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017

Other

Other: 9th IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017
Country: Australia
City: Sydney
Period: 17/7/31 to 17/8/3

Fingerprint

Clustering algorithms

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems

Cite this

Nguyen, M. D., & Shin, W-Y. (2017). DBSTexC: Density-Based spatio-Textual clustering on twitter. In J. Diesner, E. Ferrari, & G. Xu (Eds.), Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017 (pp. 23-26). (Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017). Association for Computing Machinery, Inc. https://doi.org/10.1145/3110025.3110096
Nguyen, Minh D. ; Shin, Won-Yong. / DBSTexC : Density-Based spatio-Textual clustering on twitter. Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017. editor / Jana Diesner ; Elena Ferrari ; Guandong Xu. Association for Computing Machinery, Inc, 2017. pp. 23-26 (Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017).
@inproceedings{518124456ce743db94afbbe3ca773d95,
title = "DBSTexC: Density-Based spatio-Textual clustering on twitter",
author = "Nguyen, {Minh D.} and Won-Yong Shin",
year = "2017",
month = "7",
day = "31",
doi = "10.1145/3110025.3110096",
language = "English",
series = "Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017",
publisher = "Association for Computing Machinery, Inc",
pages = "23--26",
editor = "Jana Diesner and Elena Ferrari and Guandong Xu",
booktitle = "Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017",

}

Nguyen, MD & Shin, W-Y 2017, DBSTexC: Density-Based spatio-Textual clustering on twitter. in J Diesner, E Ferrari & G Xu (eds), Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017. Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017, Association for Computing Machinery, Inc, pp. 23-26, 9th IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017, Sydney, Australia, 17/7/31. https://doi.org/10.1145/3110025.3110096

DBSTexC : Density-Based spatio-Textual clustering on twitter. / Nguyen, Minh D.; Shin, Won-Yong.

Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017. ed. / Jana Diesner; Elena Ferrari; Guandong Xu. Association for Computing Machinery, Inc, 2017. p. 23-26 (Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017).

TY - GEN

T1 - DBSTexC

T2 - Density-Based spatio-Textual clustering on twitter

AU - Nguyen, Minh D.

AU - Shin, Won-Yong

PY - 2017/7/31

Y1 - 2017/7/31

UR - http://www.scopus.com/inward/record.url?scp=85040219035&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040219035&partnerID=8YFLogxK

U2 - 10.1145/3110025.3110096

DO - 10.1145/3110025.3110096

M3 - Conference contribution

AN - SCOPUS:85040219035

T3 - Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017

SP - 23

EP - 26

BT - Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017

A2 - Diesner, Jana

A2 - Ferrari, Elena

A2 - Xu, Guandong

PB - Association for Computing Machinery, Inc

ER -

Nguyen MD, Shin W-Y. DBSTexC: Density-Based spatio-Textual clustering on twitter. In Diesner J, Ferrari E, Xu G, editors, Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017. Association for Computing Machinery, Inc. 2017. p. 23-26. (Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017). https://doi.org/10.1145/3110025.3110096