University of Warwick Publications service & WRAP
Scalable data quality for big data : the Pythia framework for handling missing values


Cahsai, A., Anagnostopoulos, C. and Triantafillou, P. (2015) Scalable data quality for big data: the Pythia framework for handling missing values. Big Data, 3 (3). pp. 159-172. doi:10.1089/big.2015.0002

Research output not available from this repository; contact the author.
Official URL: https://doi.org/10.1089/big.2015.0002


Abstract

Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they explicitly depend on the volume and the dimension of the data. Moreover, as datasets and their user communities continuously grow, the problem can only be exacerbated. To deal with this problem, in our previous work we introduced a novel framework, coined Pythia, which employs a number of distributed data nodes (cohorts), each of which contains a partition of the original dataset. To perform MV imputation, Pythia, based on specific machine and statistical learning structures (signatures), selects the most appropriate subset of cohorts on which to run a missing-value substitution algorithm (MVA) locally. This selection relies on the principle that a particular subset of cohorts maintains the most relevant partition of the dataset. In addition, because Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared to a single machine (coined Godzilla) that uses the entire massive dataset to compute imputation requests. This article extends our previous work: we specifically investigate the robustness of the Pythia framework and show that Pythia is independent of any particular MVA and signature construction algorithm. To facilitate our research, we considered two well-known MVAs (the K-nearest neighbor and expectation-maximization imputation algorithms), as well as two machine and neural computational learning signature construction algorithms based on adaptive vector quantization and competitive learning. We provide comprehensive experiments to assess the performance of Pythia against Godzilla and showcase the benefits stemming from this framework. © Mary Ann Liebert, Inc. 2015.
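The routing idea described in the abstract can be illustrated with a brief, hypothetical sketch: each cohort holds one partition of the data plus a compact signature (simplified here to a per-feature centroid, rather than the learned adaptive-vector-quantization or competitive-learning signatures the paper uses), an imputation request is routed to the cohorts whose signatures best match the query, and each selected cohort runs a local K-nearest-neighbor imputation in parallel. The names Cohort and pythia_impute are illustrative assumptions, not part of the published framework.

# Illustrative sketch only: signature-based cohort selection followed by
# parallel local KNN imputation. Names and the centroid signature are
# assumptions for illustration, not the authors' implementation.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class Cohort:
    """Holds one partition of the dataset and a compact signature of it."""
    def __init__(self, partition: np.ndarray):
        self.data = partition
        # Signature: per-feature centroid of this partition (a stand-in for
        # the paper's learned signatures).
        self.signature = partition.mean(axis=0)

    def knn_impute(self, query: np.ndarray, missing_idx: int, k: int = 5) -> float:
        """Estimate one missing value from this cohort's local partition."""
        observed = [i for i in range(query.size) if i != missing_idx]
        dists = np.linalg.norm(self.data[:, observed] - query[observed], axis=1)
        neighbours = np.argsort(dists)[:k]
        return float(self.data[neighbours, missing_idx].mean())

def pythia_impute(cohorts, query, missing_idx, n_select=2, k=5):
    """Select the most relevant cohorts via their signatures, then impute in parallel."""
    observed = [i for i in range(query.size) if i != missing_idx]
    relevance = [np.linalg.norm(c.signature[observed] - query[observed]) for c in cohorts]
    selected = [cohorts[i] for i in np.argsort(relevance)[:n_select]]
    with ThreadPoolExecutor() as pool:
        estimates = list(pool.map(lambda c: c.knn_impute(query, missing_idx, k), selected))
    return float(np.mean(estimates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1200, 4))
    cohorts = [Cohort(part) for part in np.array_split(data, 6)]  # distributed partitions
    query = data[0].copy()
    query[2] = np.nan                                             # value to impute
    print(pythia_impute(cohorts, query, missing_idx=2))

Because only the selected cohorts are consulted and they work in parallel, the sketch mirrors the efficiency argument made above: imputation cost depends on the relevant partitions rather than on the full dataset, as it would for the single-machine (Godzilla) baseline.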

Item Type: Journal Article
Divisions: Faculty of Science > Computer Science
Journal or Publication Title: Big Data
Publisher: Mary Ann Liebert, Inc.
ISSN: 2167-6461
Official Date: 16 September 2015
Dates: 16 September 2015 (Published)
Volume: 3
Number: 3
Page Range: pp. 159-172
DOI: 10.1089/big.2015.0002
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Restricted or Subscription Access
Related URLs:
  • Other


Email us: wrap@warwick.ac.uk