Streaming weighted sampling over join queries
Shekelyan, Michael, Cormode, Graham, Ma, Qingzhi, Shanghooshabad, A. M. and Triantafillou, Peter (2023) Streaming weighted sampling over join queries. In: 26th International Conference on Extending Database Technology (EDBT) 2023, Ioannina, Greece, 28-31 Mar 2023. Published in: Proceedings of the 26th International Conference on Extending Database Technology (EDBT) 2023, March 2023. ISBN 9783893180936. doi:10.48786/edbt.2023.24 ISSN 2367-2005. (In Press)
PDF: WRAP-Streaming-weighted-sampling-join-queries-22.pdf - Published Version (799Kb). Available under License Creative Commons Attribution Non-commercial No Derivatives 4.0.
Official URL: https://doi.org/10.48786/edbt.2023.24
Abstract
Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or a few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but it turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data residing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes. All pertinent code and data can be found at: https://github.com/shekelyan/weightedjoinsampling
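To make the problem setting in the abstract concrete, the following is a minimal Python sketch of drawing one uniformly random result from a two-table equi-join using only streaming passes, without materialising the join. It is a textbook-style illustration and not the algorithm from the paper or its repository: the function name `sample_join_pair`, the callable-per-pass convention, and the three-pass structure are all assumptions made for this example. It covers only the unweighted special case, and it keeps one counter per distinct join key in memory.

```python
import random
from collections import Counter


def sample_join_pair(stream_r, stream_s, key_r, key_s, rng=random):
    """Return one uniformly random pair from the equi-join, or None if it is empty.

    Hypothetical helper for illustration only. stream_r and stream_s are
    zero-argument callables returning a fresh iterator; each call stands in
    for one streaming pass over a table.
    """
    # Pass 1 over S: count how many join partners each key has.
    partners = Counter(key_s(s) for s in stream_s())

    # Pass 2 over R: keep a single row, chosen with probability
    # proportional to its number of join partners.
    total, chosen_r = 0, None
    for r in stream_r():
        w = partners.get(key_r(r), 0)
        if w == 0:
            continue  # row contributes nothing to the join
        total += w
        if rng.random() * total < w:
            chosen_r = r
    if chosen_r is None:
        return None

    # Pass 3 over S: size-one uniform reservoir among the partners of chosen_r.
    seen, chosen_s = 0, None
    for s in stream_s():
        if key_s(s) == key_r(chosen_r):
            seen += 1
            if rng.randrange(seen) == 0:
                chosen_s = s
    return chosen_r, chosen_s


if __name__ == "__main__":
    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(1, "x"), (1, "y"), (2, "z")]
    pair = sample_join_pair(lambda: iter(R), lambda: iter(S),
                            key_r=lambda r: r[0], key_s=lambda s: s[0])
    print(pair)
```

Each join result is returned with equal probability because a row of the first table is chosen in proportion to its number of partners, and one of those partners is then chosen uniformly. The per-key counters are the kind of memory cost that becomes a concern for very large tables.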
Item Type: | Conference Item (Paper)
---|---
Subjects: | Q Science > QA Mathematics; Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): | SQL (Computer program language), Sampling (Statistics)
Journal or Publication Title: | Proceedings of the 26th International Conference on Extending Database Technology (EDBT) 2023, March 2023.
ISBN: | 9783893180936
ISSN: | 2367-2005
Official Date: | March 2023
DOI: | 10.48786/edbt.2023.24
Status: | Peer Reviewed
Publication Status: | In Press
Reuse Statement (publisher, data, author rights): | © 2023 Copyright held by the owner/author(s). Published in Proceedings of the 26th International Conference on Extending Database Technology (EDBT), 28th March-31st March, 2023, ISBN 978-3-89318-093-6 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0
Access rights to Published version: | Open Access (Creative Commons)
Date of first compliant deposit: | 19 December 2022
Date of first compliant Open Access: | 19 December 2022
Conference Paper Type: | Paper
Title of Event: | 26th International Conference on Extending Database Technology (EDBT) 2023
Type of Event: | Conference
Location of Event: | Ioannina, Greece
Date(s) of Event: | 28-31 Mar 2023