Scalable data quality for big data : the Pythia framework for handling missing values
Cahsai, A., Anagnostopoulos, C. and Triantafillou, Peter (2015) Scalable data quality for big data : the Pythia framework for handling missing values. Big data, 3 (3). pp. 159-172. doi:10.1089/big.2015.0002 ISSN 0302-9743.
Research output not available from this repository.
Official URL: https://doi.org/10.1089/big.2015.0002
Abstract
Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they explicitly depend on the volume and the dimension of the data. Moreover, as datasets and their user communities continuously grow, the problem can only be exacerbated. To deal with this problem, in our previous work we introduced a novel framework, coined Pythia, which employs a number of distributed data nodes (cohorts), each of which contains a partition of the original dataset. To perform MV imputation, Pythia, based on specific machine and statistical learning structures (signatures), selects the most appropriate subset of cohorts to locally perform a missing value substitution algorithm (MVA). This selection relies on the principle that a particular subset of cohorts maintains the most relevant partition of the dataset. In addition, because Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared to a single machine (coined Godzilla), which uses the entire massive dataset to compute imputation requests. Although this article is an extension of our previous work, we particularly investigate the robustness of the Pythia framework and show that Pythia is independent of any MVA and signature construction algorithms. To facilitate our research, we considered two well-known MVAs (namely, K-nearest neighbor and expectation-maximization imputation algorithms), as well as two machine and neural computational learning signature construction algorithms based on adaptive vector quantization and competitive learning. We provide comprehensive experiments to assess the performance of Pythia against Godzilla and showcase the benefits stemming from this framework. © Mary Ann Liebert, Inc. 2015.
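The routing idea described in the abstract (summarise each cohort with a signature, send an imputation request only to the most relevant cohorts, and run an MVA locally there) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of a single mean vector as the signature (standing in for the adaptive vector quantization and competitive learning structures the abstract mentions), the averaging of per-cohort estimates, and the toy data are all assumptions made for illustration only.

```python
import math


def observed_distance(query, row):
    """Euclidean distance over the dimensions where `query` is observed (not None)."""
    return math.sqrt(sum((q - r) ** 2 for q, r in zip(query, row) if q is not None))


def build_signature(rows):
    """Hypothetical signature: one prototype per cohort, the per-dimension mean."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def knn_impute(query, rows, k):
    """Local MVA: fill each missing entry with the mean of the k nearest complete rows."""
    nearest = sorted(rows, key=lambda r: observed_distance(query, r))[:k]
    return [
        v if v is not None else sum(r[i] for r in nearest) / len(nearest)
        for i, v in enumerate(query)
    ]


def pythia_style_impute(query, cohorts, signatures, top=1, k=2):
    """Pick the `top` cohorts with the closest signature, impute in each, average results.

    Averaging the per-cohort estimates is an assumption; the abstract does not
    say how local results are combined.
    """
    ranked = sorted(
        range(len(cohorts)),
        key=lambda c: observed_distance(query, signatures[c]),
    )
    selected = ranked[:top]
    local = [knn_impute(query, cohorts[c], k) for c in selected]
    merged = [
        v if v is not None else sum(est[i] for est in local) / len(local)
        for i, v in enumerate(query)
    ]
    return merged, selected


# Toy data: two cohorts holding clearly separated partitions of a dataset.
cohorts = [
    [[1.0, 2.0, 3.0], [1.2, 2.1, 3.1], [0.9, 1.9, 2.9]],
    [[10.0, 20.0, 30.0], [10.5, 19.5, 30.5], [9.5, 20.5, 29.5]],
]
signatures = [build_signature(rows) for rows in cohorts]

query = [10.1, 20.2, None]  # third attribute is missing
result, selected = pythia_style_impute(query, cohorts, signatures)
print(selected, result)
```

In this sketch only the cohort at index 1 is consulted, so the request touches half of the toy dataset. A "Godzilla"-style baseline in the abstract's terminology would instead call `knn_impute` once over all rows of all cohorts.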
Item Type: | Journal Article |
---|---|
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science |
Journal or Publication Title: | Big data |
Publisher: | Springer Berlin Heidelberg |
ISSN: | 0302-9743 |
Official Date: | 16 September 2015 |
Volume: | 3 |
Number: | 3 |
Page Range: | pp. 159-172 |
DOI: | 10.1089/big.2015.0002 |
Status: | Peer Reviewed |
Publication Status: | Published |
Reuse Statement (publisher, data, author rights): | cited By 1 |
Access rights to Published version: | Restricted or Subscription Access |