List intersection for web search: Algorithms, cost models, and optimizations

Sunghwan Kim, Taesung Lee, Seung Won Hwang, Sameh Elnikety

Research output: Contribution to journalConference article

Abstract

This paper studies the optimization of list intersection, especially in the context of the matching phase of search engines. Given a user query, we intersect the postings lists corresponding to the query keywords to generate the list of documents matching all keywords. Since the speed of list intersection depends the algorithm, hardware, and list lengths and their correlations, none the existing intersection algorithms outperforms the others in every scenario. Therefore, we develop a cost-based approach in which we identify a search space, spanning existing algorithms and their combinations. We propose a cost model to estimate the cost of the algorithms with their combinations, and use the cost model to search for the lowest-cost algorithm. The resulting plan is usually a combination of 2-way algorithms, outperforming conventional 2-way and k-way algorithms. The proposed approach is more general than designing a specific algorithm, as the cost models can be adapted to different hardware. We validate the cost model experimentally on two different CPUs, and show that the cost model closely estimates the actual cost. Using both real and synthetic datasets, we show that the proposed cost-based optimizer outperforms the state-of-the-art alternatives.

Original languageEnglish
Pages (from-to)1-13
Number of pages13
JournalProceedings of the VLDB Endowment
Volume12
Issue number1
DOIs
Publication statusPublished - 2018
Event45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States
Duration: 2017 Aug 262017 Aug 30

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science(all)

Fingerprint Dive into the research topics of 'List intersection for web search: Algorithms, cost models, and optimizations'. Together they form a unique fingerprint.

  • Cite this