KPSpotter

A flexible information gain-based keyphrase extraction system

Min Song, Il Yeol Song, Xiaohua Hu

Research output: Contribution to conference › Paper

38 Citations (Scopus)

Abstract

To tackle the issue of information overload, we present an Information Gain-based KeyPhrase Extraction System called KPSpotter. KPSpotter is a flexible, web-enabled keyphrase extraction system that can process input data in various formats, including web data, and generates both the extraction model and the list of keyphrases in XML. Two features were selected for training and extracting keyphrases: 1) TF*IDF and 2) Distance from First Occurrence. Input training and testing collections were processed in three stages: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. To measure system performance, the keyphrases extracted by KPSpotter are compared with those assigned by the authors. Our experiments show that KPSpotter performs on par with KEA, a well-known keyphrase extraction system. KPSpotter, however, differs from other extraction systems in the following ways. First, KPSpotter employs a new keyphrase extraction technique that combines the Information Gain data mining measure with several Natural Language Processing techniques such as stemming and case-folding. Second, KPSpotter can process various types of input data, such as XML, HTML, and unstructured text, and generates XML output. Third, the user can provide input data and execute KPSpotter over the Internet. Fourth, for efficiency and performance reasons, KPSpotter stores candidate keyphrases and their related information, such as frequency and stemmed form, in an embedded database management system.
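The abstract names the two features and the Information Gain measure but does not give their formulas. The sketch below is an illustration only, assuming the definitions commonly used in KEA-style extractors: TF*IDF as relative term frequency times log2(N / document frequency), and distance from first occurrence as the fraction of the document that precedes the phrase. The function names, the whitespace tokenizer, the single-word candidates, and the bin labels are all hypothetical and are not taken from KPSpotter.

```python
import math
from collections import Counter


def tokenize(text):
    """Case-fold and split on whitespace (a stand-in for cleaning and tokenizing)."""
    return text.lower().split()


def tf_idf(phrase, doc_tokens, corpus):
    """Relative frequency of the phrase in the document times log2(N / document frequency)."""
    tf = doc_tokens.count(phrase) / len(doc_tokens)
    df = sum(1 for tokens in corpus if phrase in tokens)
    return tf * math.log2(len(corpus) / df) if df else 0.0


def first_occurrence(phrase, doc_tokens):
    """Fraction of the document that precedes the first appearance of the phrase."""
    return doc_tokens.index(phrase) / len(doc_tokens)


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_bins, labels):
    """Entropy of the labels minus the expected entropy after splitting on a discretized feature."""
    n = len(labels)
    remainder = 0.0
    for b in set(feature_bins):
        subset = [y for x, y in zip(feature_bins, labels) if x == b]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy corpus of four "documents"; real input would come from XML, HTML, or plain text.
corpus = [
    tokenize("Keyphrase extraction with information gain"),
    tokenize("Web data management"),
    tokenize("Embedded database systems"),
    tokenize("Stemming and case folding"),
]
doc = corpus[0]
print(tf_idf("keyphrase", doc, corpus), first_occurrence("information", doc))

# Discretized feature values for four candidates, and whether each is a true keyphrase.
print(information_gain(["low", "low", "high", "high"], [True, True, False, False]))
```

A feature whose discretized bins separate keyphrases from non-keyphrases perfectly yields the maximum gain, while a feature unrelated to the labels yields zero, which is why the measure can be used to judge how much each feature contributes to the extraction model.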

Original language: English
Pages: 50-53
Number of pages: 4
Publication status: Published - 2003 Dec 1
Event: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management - New Orleans, LA, United States
Duration: 2003 Nov 7 to 2003 Nov 8

Fingerprint

XML
HTML
Processing
Data mining
Cleaning
Internet
Testing
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems

Cite this

Song, M., Song, I. Y., & Hu, X. (2003). KPSpotter: A flexible information gain-based keyphrase extraction system. 50-53. Paper presented at WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, New Orleans, LA, United States.
Song, Min; Song, Il Yeol; Hu, Xiaohua. / KPSpotter: A flexible information gain-based keyphrase extraction system. Paper presented at WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, New Orleans, LA, United States. 4 p.
@conference{0623bab527bb4ba3bfc6599afe5f6656,
title = "KPSpotter: A flexible information gain-based keyphrase extraction system",
author = "Min Song and Song, {Il Yeol} and Xiaohua Hu",
year = "2003",
month = "12",
day = "1",
language = "English",
pages = "50--53",
note = "WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management ; Conference date: 07-11-2003 Through 08-11-2003",

}

Song, M, Song, IY & Hu, X 2003, 'KPSpotter: A flexible information gain-based keyphrase extraction system' Paper presented at WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, New Orleans, LA, United States, 03/11/7 - 03/11/8, pp. 50-53.

TY - CONF

T1 - KPSpotter

T2 - A flexible information gain-based keyphrase extraction system

AU - Song, Min

AU - Song, Il Yeol

AU - Hu, Xiaohua

PY - 2003/12/1

Y1 - 2003/12/1

UR - http://www.scopus.com/inward/record.url?scp=18744367543&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=18744367543&partnerID=8YFLogxK

M3 - Paper

SP - 50

EP - 53

ER -

Song M, Song IY, Hu X. KPSpotter: A flexible information gain-based keyphrase extraction system. 2003. Paper presented at WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, New Orleans, LA, United States.