Sequence distance embeddings

[thumbnail of WRAP_THESIS_Cormode_2003.pdf]
Preview
PDF
WRAP_THESIS_Cormode_2003.pdf - Submitted Version - Requires a PDF viewer.

Download (1MB) | Preview

Request Changes to record.

Abstract

Sequences represent a large class of fundamental objects in Computer Science sets, strings, vectors and permutations are considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems including: Fast computation of short sketches in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences. Approximate nearest neighbor and clustering problems, significantly faster than the naive exact solutions. Algorithms to find approximate occurrences of pattern sequences in long text sequences in near linear time. Efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication. </li> </ul> Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

Item Type: Thesis [via Doctoral College] (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): Sequences (Mathematics)
Publisher: Department of Computer Science
Place of Publication: Coventry, UK
Official Date: January 2003
Dates:
Date
Event
January 2003
Submitted
DOI: CS-RR-393
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Published
Supervisor(s)/Advisor: Paterson, Michael S.
Extent: xvii, 275 p.
Language: eng
URI: https://wrap.warwick.ac.uk/61310/

Export / Share Citation


Request changes or add full text files to a record

Repository staff actions (login required)

View Item View Item