University of Warwick
Publications service & WRAP

Scaling out big data missing value imputations : Pythia vs. Godzilla

Anagnostopoulos, C. and Triantafillou, Peter (2014) Scaling out big data missing value imputations : Pythia vs. Godzilla. In: 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), New York, N.Y., U.S.A, 24-27 Aug 2014. Published in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 651-660. ISBN 9781450329569. doi:10.1145/2623330.2623615

Research output not available from this repository.


Official URL: http://doi.org/10.1145/2623330.2623615


Abstract

Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user communities continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine ('Godzilla') that can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla with a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses a much smaller partition of the original dataset? If so, it would be preferable, for obvious performance reasons, to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which subset of cohorts to engage per imputation? But efficiency and scalability are just one key concern! Is it possible to do the above while ensuring imputation estimation errors comparable to, or even better than, Godzilla's? In this paper we derive answers to these fundamental questions and develop principled methods and a framework that offer large performance speed-ups and errors better than, or comparable to, those of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia's functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcases our efficiency, scalability, and accuracy claims. © 2014 ACM.
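
The two pillars named in the abstract (one signature per cohort, plus a similarity measure used to pick the cohorts to engage) can be illustrated with a minimal sketch. The abstract does not say what the signatures or similarity notions actually are, so everything concrete here is an assumption made for illustration: the signature is taken to be a partition's per-attribute mean, similarity is Euclidean distance over the query's observed attributes, and the imputation step is a simple nearest-neighbour average standing in for "whichever imputation algorithm is used". The function names and the toy data are hypothetical and do not come from the paper.

```python
from math import dist
from statistics import fmean


def centroid_signature(partition):
    # Assumed signature: the per-attribute mean of a cohort's partition.
    return tuple(fmean(column) for column in zip(*partition))


def observed_indices(query):
    # Positions of the attributes that are present in the query record.
    return [i for i, value in enumerate(query) if value is not None]


def select_cohorts(query, signatures, k):
    # Rank cohorts by distance between the query's observed attributes
    # and the matching attributes of each signature; keep the k closest.
    obs = observed_indices(query)
    q = [query[i] for i in obs]

    def distance(cohort_id):
        sig = signatures[cohort_id]
        return dist(q, [sig[i] for i in obs])

    return sorted(signatures, key=distance)[:k]


def impute(query, partitions, cohort_ids, neighbours=3):
    # Stand-in imputation: average the missing attribute over the nearest
    # records found only in the selected cohorts' partitions.
    missing = query.index(None)
    obs = observed_indices(query)
    q = [query[i] for i in obs]
    candidates = [r for cid in cohort_ids for r in partitions[cid]]
    nearest = sorted(candidates, key=lambda r: dist(q, [r[i] for i in obs]))
    return fmean(r[missing] for r in nearest[:neighbours])


# Toy data: three cohorts, each holding a small partition of the dataset.
partitions = {
    "cohort_a": [(1.0, 1.0, 10.0), (1.2, 0.9, 11.0), (0.8, 1.1, 9.0)],
    "cohort_b": [(5.0, 5.0, 50.0), (5.1, 4.9, 52.0), (4.9, 5.2, 48.0)],
    "cohort_c": [(9.0, 9.0, 90.0), (9.2, 8.8, 91.0), (8.9, 9.1, 89.0)],
}
signatures = {cid: centroid_signature(p) for cid, p in partitions.items()}

query = (5.05, 5.0, None)  # third attribute is the missing value
chosen = select_cohorts(query, signatures, k=1)
estimate = impute(query, partitions, chosen)
```

In this sketch only the signatures are consulted when choosing cohorts, so the cost of selection depends on the number of cohorts rather than the size of the dataset; whether that matches the paper's actual mechanism cannot be confirmed from the abstract alone.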

Item Type: Conference Item (Paper)
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Journal or Publication Title: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher: ACM
ISBN: 9781450329569
Official Date: 2014
Dates: 2014 (Published)
Page Range: pp. 651-660
DOI: 10.1145/2623330.2623615
Status: Peer Reviewed
Publication Status: Published
Reuse Statement (publisher, data, author rights): cited By 5
Access rights to Published version: Open Access (Creative Commons)
Embodied As: 1
Conference Paper Type: Paper
Title of Event: 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14)
Type of Event: Conference
Location of Event: New York, N.Y., U.S.A
Date(s) of Event: 24-27 Aug 2014
