The Library
Scaling k-nearest neighbours queries (the right way)
Tools
Cahsai, A., Ntarmos, N., Anagnostopoulos, C. and Triantafillou, Peter (2017) Scaling k-nearest neighbours queries (the right way). In: 2017 IEEE 37th International Conference on : Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5-8 Jun 2017. Published in: 2017 IEEE 37th International Conference on : Distributed Computing Systems (ICDCS) pp. 1419-1430. ISSN 1063-6927.
Research output not available from this repository.
Request-a-Copy directly from author or use local Library Get it For Me service.
Official URL: http://doi.org/10.1109/ICDCS.2017.267
Abstract
Recently parallel / distributed processing approaches have been proposed for processing k-Nearest Neighbours (kNN) queries over very large (multidimensional) datasets aiming to ensure scalability. However, this is typically achieved at the expense of efficiency. With this paper we offer a novel approach that alleviates the performance problems associated with state of the art methods. The essence of our approach, which differentiates it from related research, rests on (i) adopting a coordinator-based distributed processing algorithm, instead of those employed over data-parallel executionengines (such as Hadoop/MapReduce or Spark), and (ii) on a way to organize data, to structure computation, and to index the stored datasets that ensures that only a very small number of data items are retrieved from the underlying data store, communicated over the network, and processed by the coordinatorfor every kNN query. Our approach also pays special attention to ensuring scalability in addition to low query processing times. Overall, kNN queries can be processed in just tens of milliseconds (as opposed to the (tens of) seconds required by state of the art. We have implemented our approach, usinga NoSQL DB (HBase) as the data store, and we compare it against the state-of-the-art: the Hadoop-based Spatial Hadoop (SHadoop) and the Spark-based Simba methods. We employ different datasets of various sizes, showcasing the contributed performance advantages. Our approach outperforms the stateof the art, by 2-3 orders of magnitude, and consistently for dataset sizes ranging from hundreds of millions to hundreds of billions of data points. We also show that the key constituent performance overheads incurred during query processing (such as the number of data items retrieved from the data store, the required network bandwidth, and the processing time at the coordinator) scale very well, ensuring the overall scalability of the approach. © 2017 IEEE.
Item Type: | Conference Item (Paper) | ||||
---|---|---|---|---|---|
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science | ||||
Journal or Publication Title: | 2017 IEEE 37th International Conference on : Distributed Computing Systems (ICDCS) | ||||
Publisher: | IEEE | ||||
ISSN: | 1063-6927 | ||||
Official Date: | 2017 | ||||
Dates: |
|
||||
Page Range: | pp. 1419-1430 | ||||
Status: | Peer Reviewed | ||||
Publication Status: | Published | ||||
Reuse Statement (publisher, data, author rights): | cited By 0 | ||||
Conference Paper Type: | Paper | ||||
Title of Event: | 2017 IEEE 37th International Conference on : Distributed Computing Systems (ICDCS) | ||||
Type of Event: | Conference | ||||
Location of Event: | Atlanta, GA, USA | ||||
Date(s) of Event: | 5-8 Jun 2017 |
Request changes or add full text files to a record
Repository staff actions (login required)
View Item |