Supervised sampling for clustering large data sets
Kosmidis, Ioannis and Karlis, Dimitris (2010) Supervised sampling for clustering large data sets. Working Paper. University of Warwick. Centre for Research in Statistical Methodology, Coventry.
PDF: WRAP_Kosmidis_10-10w.pdf - Published Version (1226Kb)
Official URL: http://www2.warwick.ac.uk/fac/sci/statistics/crism...
Abstract
The problem of clustering large data sets has attracted considerable recent research. Existing approaches are mainly based on more efficient implementations or modifications of existing methods, and/or on constructing clusters from a small sub-sample of the data and then assigning all observations to those clusters. This paper focuses on the latter direction. An alternative, supervised procedure for creating the clusters is proposed. To learn the clusters, the procedure uses subsets of the data that are still constructed via sub-sampling, but within partitions of the observation space. The general applicability of the approach is discussed, together with the tuning of the parameters on which it depends to increase its effectiveness. The procedure is applied to clustering the navigation patterns in the msnbc.com database.
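The two-stage idea described in the abstract (learn the clusters from a sub-sample drawn within partitions of the observation space, then assign every observation to those clusters) can be illustrated with a minimal sketch. This is not the authors' procedure: the partition (equal-width bins on the first coordinate), the clustering method (plain k-means), and every function and parameter name below are assumptions chosen only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centres):
    """Index of the centre closest to point p."""
    return min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))


def stratified_subsample(points, n_bins, per_bin, rng):
    """Sample up to per_bin points from each partition of the space.

    Assumption: the partition is a set of equal-width bins on the
    first coordinate; the paper may partition the space differently.
    """
    lo = min(p[0] for p in points)
    hi = max(p[0] for p in points)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    bins = [[] for _ in range(n_bins)]
    for p in points:
        idx = min(int((p[0] - lo) / width), n_bins - 1)
        bins[idx].append(p)
    sample = []
    for b in bins:
        sample.extend(rng.sample(b, min(per_bin, len(b))))
    return sample


def kmeans(sample, k, rng, n_iter=20):
    """Plain Lloyd iterations on the sub-sample only."""
    centres = rng.sample(sample, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[nearest(p, centres)].append(p)
        # Keep the old centre if a group happens to be empty.
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
    return centres


def cluster_large(points, k, n_bins=10, per_bin=20, seed=0):
    """Learn k centres from a partitioned sub-sample, then label all points."""
    rng = random.Random(seed)
    sample = stratified_subsample(points, n_bins, per_bin, rng)
    centres = kmeans(sample, k, rng)
    labels = [nearest(p, centres) for p in points]
    return centres, labels
```

Calling `cluster_large(points, k=2)` on a list of coordinate tuples returns the learned centres and one label per observation. Only the sub-sample enters the iterative fitting step, so the full data set is scanned once for binning and once for assignment.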
| Item Type: | Working or Discussion Paper (Working Paper) |
|---|---|
| Subjects: | Q Science > QA Mathematics |
| Divisions: | Faculty of Science > Statistics |
| Library of Congress Subject Headings (LCSH): | Cluster analysis, Sampling (Statistics) |
| Series Name: | Working papers |
| Publisher: | University of Warwick. Centre for Research in Statistical Methodology |
| Place of Publication: | Coventry |
| Date: | June 2010 |
| Volume: | Vol.2010 |
| Number: | No.10 |
| Number of Pages: | 17 |
| Status: | Not Peer Reviewed |
| Access rights to Published version: | Open Access |
| URI: | http://wrap.warwick.ac.uk/id/eprint/35074 |