An abusive text detection system based on enhanced abusive and non-abusive word lists

Ho Suk Lee, Hong Rae Lee, Jun U. Park, Yo-Sub Han

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Abusive text (indiscriminate slang, abusive language, and profanity) on the Internet is not just a message but rather a tool for very serious and brutal cyber violence. It has become an important problem to devise a method for detecting and preventing abusive text online. However, the intentional obfuscation of words and phrases makes this task very difficult and challenging. We design a decision system that successfully detects (obfuscated) abusive text using unsupervised learning of abusive words based on word2vec's skip-gram and cosine similarity. The system also deploys several efficient gadgets for filtering abusive text such as blacklists, n-grams, edit-distance metrics, mixed languages, abbreviations, punctuation, and words with special characters to detect the intentional obfuscation of abusive words. We integrate both an unsupervised learning method and efficient gadgets into a single system that enhances abusive and non-abusive word lists. The integrated decision system based on the enhanced word lists shows a precision of 94.08%, a recall of 80.79%, and an f-score of 86.93% in malicious word detection for news article comments, a precision of 89.97%, a recall of 80.55%, and an f-score of 85.00% for online community comments, and a precision of 90.65%, a recall of 93.57%, and an f-score of 92.09% for Twitter tweets. We expect that our approach can help to improve the current abusive word detection system, which is crucial for several web-based services including social networking services and online games.
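The three reported f-scores can be cross-checked against the stated precision and recall values. The sketch below assumes the abstract's "f-score" is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_score` and the `reported` table are illustrative and do not come from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "f-score" is F1; the source does not say so.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported f-score %) copied from the abstract.
reported = {
    "news article comments": (94.08, 80.79, 86.93),
    "online community comments": (89.97, 80.55, 85.00),
    "Twitter tweets": (90.65, 93.57, 92.09),
}

for dataset, (precision, recall, stated) in reported.items():
    computed = f_score(precision, recall)
    print(f"{dataset}: computed {computed:.2f}%, reported {stated:.2f}%")
```

Under the F1 assumption, each computed value agrees with the reported figure to two decimal places.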

Original language: English
Pages (from-to): 22-31
Number of pages: 10
Journal: Decision Support Systems
Volume: 113
DOI: 10.1016/j.dss.2018.06.009
Publication status: Published - 2018 Sep 1

Fingerprint

Unsupervised learning
Language
Learning
Social Networking
Social Work
Violence
Internet
Word Lists

All Science Journal Classification (ASJC) codes

  • Management Information Systems
  • Information Systems
  • Developmental and Educational Psychology
  • Arts and Humanities (miscellaneous)
  • Information Systems and Management

Cite this

@article{abd039ba7373451e837dc3b16ce7afc7,
title = "An abusive text detection system based on enhanced abusive and non-abusive word lists",
abstract = "Abusive text (indiscriminate slang, abusive language, and profanity) on the Internet is not just a message but rather a tool for very serious and brutal cyber violence. It has become an important problem to devise a method for detecting and preventing abusive text online. However, the intentional obfuscation of words and phrases makes this task very difficult and challenging. We design a decision system that successfully detects (obfuscated) abusive text using unsupervised learning of abusive words based on word2vec's skip-gram and cosine similarity. The system also deploys several efficient gadgets for filtering abusive text such as blacklists, n-grams, edit-distance metrics, mixed languages, abbreviations, punctuation, and words with special characters to detect the intentional obfuscation of abusive words. We integrate both an unsupervised learning method and efficient gadgets into a single system that enhances abusive and non-abusive word lists. The integrated decision system based on the enhanced word lists shows a precision of 94.08{\%}, a recall of 80.79{\%}, and an f-score of 86.93{\%} in malicious word detection for news article comments, a precision of 89.97{\%}, a recall of 80.55{\%}, and an f-score of 85.00{\%} for online community comments, and a precision of 90.65{\%}, a recall of 93.57{\%}, and an f-score of 92.09{\%} for Twitter tweets. We expect that our approach can help to improve the current abusive word detection system, which is crucial for several web-based services including social networking services and online games.",
author = "Lee, {Ho Suk} and Lee, {Hong Rae} and Park, {Jun U.} and Han, Yo-Sub",
year = "2018",
month = "9",
day = "1",
doi = "10.1016/j.dss.2018.06.009",
language = "English",
volume = "113",
pages = "22--31",
journal = "Decision Support Systems",
issn = "0167-9236",
publisher = "Elsevier",

}

An abusive text detection system based on enhanced abusive and non-abusive word lists. / Lee, Ho Suk; Lee, Hong Rae; Park, Jun U.; Han, Yo-Sub.

In: Decision Support Systems, Vol. 113, 01.09.2018, p. 22-31.

TY - JOUR

T1 - An abusive text detection system based on enhanced abusive and non-abusive word lists

AU - Lee, Ho Suk

AU - Lee, Hong Rae

AU - Park, Jun U.

AU - Han, Yo-Sub

PY - 2018/9/1

Y1 - 2018/9/1

AB - Abusive text (indiscriminate slang, abusive language, and profanity) on the Internet is not just a message but rather a tool for very serious and brutal cyber violence. It has become an important problem to devise a method for detecting and preventing abusive text online. However, the intentional obfuscation of words and phrases makes this task very difficult and challenging. We design a decision system that successfully detects (obfuscated) abusive text using unsupervised learning of abusive words based on word2vec's skip-gram and cosine similarity. The system also deploys several efficient gadgets for filtering abusive text such as blacklists, n-grams, edit-distance metrics, mixed languages, abbreviations, punctuation, and words with special characters to detect the intentional obfuscation of abusive words. We integrate both an unsupervised learning method and efficient gadgets into a single system that enhances abusive and non-abusive word lists. The integrated decision system based on the enhanced word lists shows a precision of 94.08%, a recall of 80.79%, and an f-score of 86.93% in malicious word detection for news article comments, a precision of 89.97%, a recall of 80.55%, and an f-score of 85.00% for online community comments, and a precision of 90.65%, a recall of 93.57%, and an f-score of 92.09% for Twitter tweets. We expect that our approach can help to improve the current abusive word detection system, which is crucial for several web-based services including social networking services and online games.

UR - http://www.scopus.com/inward/record.url?scp=85049342985&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049342985&partnerID=8YFLogxK

U2 - 10.1016/j.dss.2018.06.009

DO - 10.1016/j.dss.2018.06.009

M3 - Article

VL - 113

SP - 22

EP - 31

JO - Decision Support Systems

JF - Decision Support Systems

SN - 0167-9236

ER -