University of Warwick Publications service & WRAP
Scalable data quality for big data : the Pythia framework for handling missing values


Cahsai, A., Anagnostopoulos, C. and Triantafillou, P. (2015) Scalable data quality for big data: the Pythia framework for handling missing values. Big Data, 3 (3). pp. 159-172. doi:10.1089/big.2015.0002

Research output not available from this repository; contact the author.
Official URL: https://doi.org/10.1089/big.2015.0002


Abstract

Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they explicitly depend on the volume and the dimension of the data. Moreover, as datasets and their user communities continuously grow, the problem can only be exacerbated. To deal with this problem, in our previous work we introduced a novel framework, coined Pythia, which employs a number of distributed data nodes (cohorts), each of which contains a partition of the original dataset. To perform MV imputation, Pythia, based on specific machine and statistical learning structures (signatures), selects the most appropriate subset of cohorts on which to run a missing-value substitution algorithm (MVA) locally. This selection relies on the principle that a particular subset of cohorts maintains the most relevant partition of the dataset. In addition, because Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared to a single machine (coined Godzilla) that uses the entire massive dataset to compute imputation requests. This article extends our previous work: we specifically investigate the robustness of the Pythia framework and show that Pythia is independent of any particular MVA and signature construction algorithm. To facilitate our research, we considered two well-known MVAs (the K-nearest neighbor and expectation-maximization imputation algorithms), as well as two machine and neural computational learning signature construction algorithms based on adaptive vector quantization and competitive learning. We provide comprehensive experiments to assess the performance of Pythia against Godzilla and showcase the benefits stemming from this framework. © Mary Ann Liebert, Inc. 2015.
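The routing idea described in the abstract can be illustrated with a brief, hypothetical sketch: each cohort holds one partition of the data plus a compact signature (simplified here to a per-feature centroid, rather than the learned adaptive-vector-quantization or competitive-learning signatures the paper uses), an imputation request is routed to the cohorts whose signatures best match the query, and each selected cohort runs a local K-nearest-neighbor imputation in parallel. The names Cohort and pythia_impute are illustrative assumptions, not part of the published framework.

# Illustrative sketch only: signature-based cohort selection followed by
# parallel local KNN imputation. Names and the centroid signature are
# assumptions for illustration, not the authors' implementation.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class Cohort:
    """Holds one partition of the dataset and a compact signature of it."""
    def __init__(self, partition: np.ndarray):
        self.data = partition
        # Signature: per-feature centroid of this partition (a stand-in for
        # the paper's learned signatures).
        self.signature = partition.mean(axis=0)

    def knn_impute(self, query: np.ndarray, missing_idx: int, k: int = 5) -> float:
        """Estimate one missing value from this cohort's local partition."""
        observed = [i for i in range(query.size) if i != missing_idx]
        dists = np.linalg.norm(self.data[:, observed] - query[observed], axis=1)
        neighbours = np.argsort(dists)[:k]
        return float(self.data[neighbours, missing_idx].mean())

def pythia_impute(cohorts, query, missing_idx, n_select=2, k=5):
    """Select the most relevant cohorts via their signatures, then impute in parallel."""
    observed = [i for i in range(query.size) if i != missing_idx]
    relevance = [np.linalg.norm(c.signature[observed] - query[observed]) for c in cohorts]
    selected = [cohorts[i] for i in np.argsort(relevance)[:n_select]]
    with ThreadPoolExecutor() as pool:
        estimates = list(pool.map(lambda c: c.knn_impute(query, missing_idx, k), selected))
    return float(np.mean(estimates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1200, 4))
    cohorts = [Cohort(part) for part in np.array_split(data, 6)]  # distributed partitions
    query = data[0].copy()
    query[2] = np.nan                                             # value to impute
    print(pythia_impute(cohorts, query, missing_idx=2))

Because only the selected cohorts are consulted and they work in parallel, the sketch mirrors the efficiency argument made above: imputation cost depends on the relevant partitions rather than on the full dataset, as it would for the single-machine (Godzilla) baseline.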

Item Type: Journal Article
Divisions: Faculty of Science > Computer Science
Journal or Publication Title: Big Data
Publisher: Mary Ann Liebert, Inc.
ISSN: 2167-6461
Official Date: 16 September 2015
Dates: 16 September 2015 (Published)
Volume: 3
Number: 3
Page Range: pp. 159-172
DOI: 10.1089/big.2015.0002
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Restricted or Subscription Access
Related URLs:
  • Other


Email us: wrap@warwick.ac.uk