Web scale taxonomy cleansing

Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung won Hwang

Research output: Contribution to journalArticle

34 Citations (Scopus)

Abstract

Large ontologies and taxonomies are automatically harvested from web-scale data. These taxonomies tend to be huge, noisy, and con-tains little context. As a result, cleansing and enriching those large-scale taxonomies becomes a great challenge. A natural way to en-rich a taxonomy is to map the taxonomy to existing datasets that contain rich information. In this paper, we study the problem of matching two web scale taxonomies. Besides the scale of the prob-lem, we address the challenge that the taxonomies may not contain enough context (such as attribute values). As existing entity reso-lution techniques are based directly or indirectly on attribute values as context, we must explore external evidence for entity resolution. Specifically, we explore positive and negative evidence in external data sources such as the web and in other taxonomies. To integrate positive and negative evidence, we formulate the entity resolution problem as a problem of finding optimal multi-way cuts in a graph. We analyze the complexity of the problem, and propose a Monte Carlo algorithm for finding greedy cuts. We conduct extensive ex-periments and compare our approach with three existing methods to demonstrate the advantage of our approach.

Original languageEnglish
Pages (from-to)1295-1306
Number of pages12
JournalProceedings of the VLDB Endowment
Volume4
Issue number12
Publication statusPublished - 2011 Aug

Bibliographical note

Funding Information:
This research was supported by Microsoft Research and the MKE (The Ministry of Knowledge Economy), Korea, under IT/SW Cre-ative research program supervised by the NIPA (National IT Indus-try Promotion Agency) (NIPA-2010-C1810-1002-0003).

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science(all)

Fingerprint Dive into the research topics of 'Web scale taxonomy cleansing'. Together they form a unique fingerprint.

Cite this