This paper studies the optimization of list intersection, especially in the context of the matching phase of search engines. Given a user query, we intersect the postings lists corresponding to the query keywords to generate the list of documents matching all keywords. Since the speed of list intersection depends on the algorithm, the hardware, and the list lengths and their correlations, none of the existing intersection algorithms outperforms the others in every scenario. Therefore, we develop a cost-based approach in which we identify a search space spanning existing algorithms and their combinations. We propose a cost model to estimate the cost of the algorithms and their combinations, and use the cost model to search for the lowest-cost algorithm. The resulting plan is usually a combination of 2-way algorithms, outperforming conventional 2-way and k-way algorithms. The proposed approach is more general than designing a specific algorithm, as the cost models can be adapted to different hardware. We validate the cost model experimentally on two different CPUs, and show that the cost model closely estimates the actual cost. Using both real and synthetic datasets, we show that the proposed cost-based optimizer outperforms the state-of-the-art alternatives.
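To make the idea of cost-based selection concrete, the following is a minimal Python sketch, not the paper's optimizer. It intersects sorted postings lists pairwise, shortest first, and at each step picks the cheaper of two well-known 2-way algorithms (linear merge and binary search). The function names and the cost formulas are assumptions made for illustration: they only count approximate comparisons, whereas the cost model described above also accounts for hardware and list correlations.

```python
from bisect import bisect_left
from math import log2


def merge_intersect(a, b):
    """Linear merge of two sorted lists: roughly len(a) + len(b) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_intersect(short, long):
    """Binary-search each element of the shorter sorted list in the longer one."""
    out = []
    lo = 0
    for x in short:
        # Resume the search from the previous position, since both lists are sorted.
        lo = bisect_left(long, x, lo)
        if lo == len(long):
            break
        if long[lo] == x:
            out.append(x)
    return out


def estimated_cost(algo, n_short, n_long):
    """Toy cost estimate (assumed): an approximate comparison count only."""
    if algo == "merge":
        return n_short + n_long
    return n_short * log2(n_long + 1)


def intersect(lists):
    """Intersect one or more sorted lists; return (result, chosen 2-way plan)."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    plan = []
    for nxt in ordered[1:]:
        # The running result is never longer than the next list in this order.
        costs = {
            name: estimated_cost(name, len(result), len(nxt))
            for name in ("merge", "search")
        }
        choice = min(costs, key=costs.get)
        plan.append(choice)
        if choice == "merge":
            result = merge_intersect(result, nxt)
        else:
            result = search_intersect(result, nxt)
    return result, plan
```

With lists of similar length the toy estimate favors the merge, while a short list against a much longer one favors binary search, which is why a plan mixing 2-way algorithms can be cheaper than applying one algorithm throughout.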
|Number of pages||13|
|Journal||Proceedings of the VLDB Endowment|
|Publication status||Published - 2018|
|Event||45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States|
Duration: 2019 Aug 26 → 2019 Aug 30
Bibliographical note
Publisher Copyright:
© 2018 VLDB Endowment 21508097/18/07.
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- Computer Science (all)