Skip to content Skip to navigation
University of Warwick
  • Study
  • |
  • Research
  • |
  • Business
  • |
  • Alumni
  • |
  • News
  • |
  • About

University of Warwick
Publications service & WRAP

Highlight your research

  • WRAP
    • Home
    • Search WRAP
    • Browse by Warwick Author
    • Browse WRAP by Year
    • Browse WRAP by Subject
    • Browse WRAP by Department
    • Browse WRAP by Funder
    • Browse Theses by Department
  • Publications Service
    • Home
    • Search Publications Service
    • Browse by Warwick Author
    • Browse Publications service by Year
    • Browse Publications service by Subject
    • Browse Publications service by Department
    • Browse Publications service by Funder
  • Help & Advice
University of Warwick

The Library

  • Login
  • Admin

On the efficiency of finding and using tabular data summaries : scalability, accuracy, and hardness

Tools
- Tools
+ Tools

Dickens, Charlie (2021) On the efficiency of finding and using tabular data summaries : scalability, accuracy, and hardness. PhD thesis, University of Warwick.

[img]
Preview
PDF
WRAP_Theses_Dickens_2021.pdf - Submitted Version - Requires a PDF viewer.

Download (1868Kb) | Preview
Official URL: http://webcat.warwick.ac.uk/record=b3719073

Request Changes to record.

Abstract

Tabular data is ubiquitous in modern computer science. However, the size of these tables can be large so computing statistics over them is inefficient in both time and space. This thesis is concerned with finding and using small summaries of large tables for scalable and accurate approximation of the data's properties; or showing such a summary is hard to obtain in small space. This perspective yields the following results:

• We introduce projected frequency analysis over an n x d binary table. If the query columns are revealed after observing the data, then we show that space exponential in d is required for constant-factor approximation to statistics such as the number of distinct elements on columns S. We present algorithms that use smaller space than a brute-force approach, while tolerating some super constant error for the frequency estimation.

• We find small-space deterministic summaries for a variety of linear algebraic problems in all p-norms for p≥ 1. These include finding rows of high leverage, subspace embedding, regression, and low rank approximation.

• We implement and compare various summary techniques for efficient training of large-scale regression models. We show that a sparse random projection can lead to fast model training despite suboptimal theoretical guarantees than dense competitors. For ridge regression we show that a deterministic summary can reduce the number of gradient steps needed to train the model compared to random projections.

We demonstrate the practicality of our approaches through various experiments by showing that small space summaries can lead to close to optimal solutions.

Item Type: Thesis or Dissertation (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Mathematical statistics -- Computer programs, Statistics -- Data processing, Mathematical statistics -- Computer programs, Regression analysis -- Computer programs, Regression analysis -- Data processing, Iterative methods (Mathematics), Algorithms
Official Date: June 2021
Dates:
DateEvent
June 2021UNSPECIFIED
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Cormode, Graham, 1977-
Format of File: pdf
Extent: iii, 190 leaves : illustrations
Language: eng

Request changes or add full text files to a record

Repository staff actions (login required)

View Item View Item

Downloads

Downloads per month over past year

View more statistics

twitter

Email us: wrap@warwick.ac.uk
Contact Details
About Us