University of Warwick
Publications service & WRAP


Supervised sampling for clustering large data sets

Kosmidis, Ioannis and Karlis, Dimitris (2010) Supervised sampling for clustering large data sets. Working Paper. University of Warwick. Centre for Research in Statistical Methodology, Coventry.

PDF
WRAP_Kosmidis_10-10w.pdf - Published Version - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader

Download (1226Kb)
Official URL: http://www2.warwick.ac.uk/fac/sci/statistics/crism...

Abstract

Clustering large data sets has attracted considerable recent research. Existing approaches are mainly based on more efficient implementations or modifications of existing methods, and/or on constructing clusters from a small sub-sample of the data and then assigning all observations to those clusters. This paper focuses on the latter direction and proposes an alternative, supervised procedure for creating the clusters. To learn the clusters, the procedure uses subsets of the data that are still constructed via sub-sampling, but within partitions of the observation space. The general applicability of the approach is discussed, together with the tuning of the parameters it depends on to improve its performance. The procedure is applied to clustering the navigation patterns in the msnbc.com database.
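
As a rough illustration of the general idea in the abstract (sub-sample within partitions of the observation space, learn clusters from the sub-sample, then assign every observation to a cluster), a minimal Python sketch follows. It is not the authors' procedure, and the supervised element of the paper is not reproduced. The grid partition, the plain k-means step, the synthetic data, and all names (`stratified_subsample`, `kmeans`, `cluster_large`, `cell_width`, `per_cell`) are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(points, cell_width, per_cell, rng):
    """Split the space into grid cells and draw up to per_cell points from each cell."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(p)
    sample = []
    for key in sorted(cells):  # sorted keys keep the draw reproducible
        members = cells[key]
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(sample, k, rng, iters=20):
    """Plain Lloyd iterations on the sub-sample (assumed stand-in for the learning step)."""
    centers = rng.sample(sample, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def cluster_large(points, k, cell_width=1.0, per_cell=5, seed=0):
    """Learn centers from a partition-wise sub-sample, then label all points."""
    rng = random.Random(seed)
    sample = stratified_subsample(points, cell_width, per_cell, rng)
    centers = kmeans(sample, k, rng)
    labels = [nearest(p, centers) for p in points]
    return centers, labels, len(sample)


# Synthetic data: two well-separated groups of 500 points each (illustrative only).
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(500)]
points += [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(500)]

centers, labels, n_sample = cluster_large(points, k=2)
print(f"clustered {len(points)} points using a sub-sample of {n_sample}")
```

Sampling per cell rather than from the whole data set means sparse regions still contribute points to the learning step. The cell width and the per-cell quota play the role of tuning parameters in this sketch.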

Item Type: Working or Discussion Paper (Working Paper)
Subjects: Q Science > QA Mathematics
Divisions: Faculty of Science > Statistics
Library of Congress Subject Headings (LCSH): Cluster analysis, Sampling (Statistics)
Series Name: Working papers
Publisher: University of Warwick. Centre for Research in Statistical Methodology
Place of Publication: Coventry
Date: June 2010
Volume: Vol.2010
Number: No.10
Number of Pages: 17
Status: Not Peer Reviewed
Access rights to Published version: Open Access
URI: http://wrap.warwick.ac.uk/id/eprint/35074
