This paper introduces the problem of searching for social network accounts, e.g., Twitter accounts, using the rich information available on the Web, e.g., people's names, attributes, and relationships to other people. For this purpose, we need to map Twitter accounts to Web entities. However, existing solutions that build upon naive textual matching inevitably suffer from low precision due to false positives (e.g., fake impersonator accounts) and false negatives (e.g., accounts using nicknames). To overcome these limitations, we leverage "relational" evidence extracted from the Web corpus. We consider two types of evidence resources. The first is web-scale entity relationship graphs, extracted from name co-occurrences crawled from the Web. This co-occurrence relationship can be interpreted as an "implicit" counterpart of Twitter follower relationships. The second is web-scale relational repositories, such as Freebase, which offer complementary strengths. Using both textual and relational features obtained from these resources, we learn a ranking function that aggregates these features to accurately order candidate matches. Another key contribution of this paper is to formulate confidence scoring as a problem separate from relevance ranking. A baseline approach is to use the relevance score of the top match itself as the confidence score. In contrast, we train a separate classifier, using not only the top relevance score but also various statistical features extracted from the relevance scores of all candidates, and empirically validate that our approach outperforms the baseline. We evaluate our proposed system using real-life, internet-scale entity-relationship and social network graphs.
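The difference between the baseline confidence score and a feature-based one can be illustrated with a minimal sketch. The abstract does not list the statistical features actually used, so the feature names below (`gap`, `entropy`, and so on) and the function names are assumptions chosen for illustration, and the classifier that would consume them is not shown. Scores are assumed to be non-negative.

```python
import math
from statistics import mean, pstdev


def baseline_confidence(scores):
    """Baseline: the relevance score of the top match is the confidence."""
    return max(scores)


def confidence_features(scores):
    """Summarize the relevance scores of ALL candidates for one entity.

    The chosen statistics are hypothetical examples, not the paper's features.
    """
    if not scores:
        raise ValueError("need at least one candidate score")
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    total = sum(ranked)
    # Treat normalized scores as a distribution: entropy is low when one
    # candidate dominates and high when several candidates score alike.
    if total > 0:
        probs = [s / total for s in ranked if s > 0]
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = 0.0
    return {
        "top": top,
        "gap": top - second,      # margin over the runner-up
        "mean": mean(ranked),
        "std": pstdev(ranked),
        "entropy": entropy,
        "n_candidates": len(ranked),
    }


# Two candidate lists with the same top score but different ambiguity.
clear_match = [0.9, 0.1, 0.1]
ambiguous_match = [0.9, 0.85, 0.8]
clear_features = confidence_features(clear_match)
ambiguous_features = confidence_features(ambiguous_match)
```

Both lists receive the same baseline confidence because their top scores are equal, while the margin and entropy features separate the clear match from the ambiguous one. A classifier trained on such features would therefore have access to information that the baseline discards.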
Bibliographical note
Funding Information:
Acknowledgements: This research was supported by the MKE (Ministry of Knowledge Economy), Korea, and Microsoft Research, under the IT/SW Creative Research Program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-C1810-1102-0008), and partially under the NIPA program of Software Engineering Technologies Development and Experts Education.
All Science Journal Classification (ASJC) codes
- Hardware and Architecture
- Computer Networks and Communications