University of Warwick
Publications service & WRAP

Fast mining of massive tabular data via approximate distance computations

UNSPECIFIED (2002) Fast mining of massive tabular data via approximate distance computations. In: 18th International Conference on Data Engineering, FEB 26-MAR 01, 2002, SAN JOSE, CA.

Full text not available from this repository.

Abstract

Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. For example, consider the geographic distribution of cell phone traffic at different base stations across the country or the evolution of traffic at Internet routers over time. Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distribution, which IP subnet traffic distributions over time intervals are similar, etc.) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two "regions"?) as well as technical challenges (how to perform similarity computations efficiently as massive tables get accumulated over time) that we address. We present methods for determining similar regions in massive tabular data. Our methods are for computing the "distance" between any two subregions of tabular data: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general since these distance computations can be applied to any mining or similarity algorithms that use L-p norms. A novelty of our distance computation procedures is that they work for any L-p norms: not only the traditional p = 2 or p = 1, but for all p less than or equal to 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior. We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T's data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.
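
To make the role of p concrete, the following minimal sketch computes the exact L-p distance between equal-sized subregions of a small table. It illustrates only the straightforward computation that the abstract contrasts with, not the approximate method of the paper, which is not described in this record. The helper names `subregion` and `lp_distance` and the toy table are hypothetical and chosen purely for illustration.

```python
def subregion(table, top, left, height, width):
    """Return the height x width block of `table` starting at (top, left)."""
    return [row[left:left + width] for row in table[top:top + height]]


def lp_distance(a, b, p):
    """Exact L-p distance between two equal-shaped regions (lists of rows).

    For p < 1 this is not a true norm, but it can still be used as a
    dissimilarity measure, which is the "fractional p" case.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("regions must have the same shape")
    total = sum(abs(x - y) ** p
                for ra, rb in zip(a, b)
                for x, y in zip(ra, rb))
    return total ** (1.0 / p)


# Hypothetical toy table (illustrative values only, not data from the paper).
# Columns 0-1: reference region.
# Columns 2-3: identical to the reference except for one large spike.
# Columns 4-5: differs from the reference by a small amount in every cell.
table = [
    [10, 10, 10, 10, 14, 14],
    [10, 10, 10, 26, 14, 14],
]

reference = subregion(table, 0, 0, 2, 2)
spike = subregion(table, 0, 2, 2, 2)
spread = subregion(table, 0, 4, 2, 2)

for p in (2, 1, 0.5):
    d_spike = lp_distance(reference, spike, p)
    d_spread = lp_distance(reference, spread, p)
    print(f"p={p}: spike={d_spike:.1f}, spread={d_spread:.1f}")
```

In this toy example the spike region is at distance 16 from the reference for every p, while the spread region is at distance 8 for p = 2, 16 for p = 1, and 64 for p = 0.5. With p = 2 the spread region looks closer, with p = 1 the two tie, and with p = 0.5 the spike region looks closer, so the choice of p changes which regions count as similar.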

Item Type: Conference Item (UNSPECIFIED)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
T Technology > TK Electrical engineering. Electronics Nuclear engineering
Series Name: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (SERIES)
Journal or Publication Title: 18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS
Publisher: IEEE COMPUTER SOC
ISBN: 0-7695-1531-2
ISSN: 1063-6382
Editor: Agrawal, R and Dittrich, K and Ngu, AHH
Date: 2002
Number of Pages: 10
Page Range: pp. 605-614
Publication Status: Published
Title of Event: 18th International Conference on Data Engineering
Location of Event: SAN JOSE, CA
Date(s) of Event: FEB 26-MAR 01, 2002
URI: http://wrap.warwick.ac.uk/id/eprint/10244

Data sourced from Thomson Reuters' Web of Knowledge
