University of Warwick
Publications service & WRAP


Bayesian correlated clustering to integrate multiple datasets


Kirk, Paul, Griffin, Jim E., Savage, Richard S., Ghahramani, Zoubin and Wild, David L. (2012) Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, Volume 28 (Number 24). pp. 3290-3297. ISSN 1367-4803

WRAP_Kirk_Bioinformatics-2012-Kirk-bioinformatics_bts595.pdf - Published Version

Official URL: http://dx.doi.org/10.1093/bioinformatics/bts595

Abstract

Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.

Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques, as well as to non-integrative approaches, demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.
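
The abstract says that dependencies between the per-dataset mixture models are captured via parameters describing agreement among the datasets. The sketch below illustrates one way such parameters could act, under a stated assumption: each pair of datasets (k, l) has a non-negative parameter phi[k, l], and the joint prior for one item's component labels is the product of the per-dataset mixture weights, multiplied by (1 + phi[k, l]) whenever datasets k and l assign the same label. This functional form, and the names `joint_allocation_prior`, `agreement_probability`, `weights` and `phi`, are illustrative assumptions and are not taken from this record.

```python
import itertools

import numpy as np


def joint_allocation_prior(weights, phi):
    """Joint prior over one item's component labels across K datasets.

    weights: list of K 1-D arrays; weights[k][j] is the mixture weight of
             component j in dataset k (all arrays have the same length N).
    phi:     K x K array of non-negative pairwise agreement parameters
             (hypothetical parameterisation; only entries with k < l are used).
    Returns an array of shape (N,) * K that sums to 1.
    """
    K = len(weights)
    N = len(weights[0])
    prior = np.zeros((N,) * K)
    for labels in itertools.product(range(N), repeat=K):
        # Independent part: product of per-dataset mixture weights.
        p = 1.0
        for k in range(K):
            p *= weights[k][labels[k]]
        # Coupling part: up-weight label tuples on which pairs of datasets agree.
        for k, l in itertools.combinations(range(K), 2):
            if labels[k] == labels[l]:
                p *= 1.0 + phi[k, l]
        prior[labels] = p
    return prior / prior.sum()


def agreement_probability(prior):
    """Prior probability that the first two datasets assign the same label."""
    K = prior.ndim
    # Marginalise over any datasets beyond the first two.
    marginal = prior.sum(axis=tuple(range(2, K))) if K > 2 else prior
    return float(np.trace(marginal))


# Two datasets, three equally weighted components each.
uniform = [np.full(3, 1.0 / 3.0), np.full(3, 1.0 / 3.0)]
independent = joint_allocation_prior(uniform, np.zeros((2, 2)))
coupled = joint_allocation_prior(uniform, np.array([[0.0, 2.0], [2.0, 0.0]]))
print(agreement_probability(independent), agreement_probability(coupled))
```

With all agreement parameters at zero the datasets are clustered independently, and in this toy setting the probability of matching labels is 1/3. Raising phi[0, 1] to 2 lifts it to 0.6, which is the sense in which a larger parameter expresses stronger similarity between two datasets' clustering structures.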

Item Type: Journal Article
Subjects: Q Science > QA Mathematics
Q Science > QH Natural history > QH301 Biology
Divisions: Faculty of Science > Centre for Systems Biology
Library of Congress Subject Headings (LCSH): Biology -- Data processing, Data integration (Computer science), Cluster analysis
Journal or Publication Title: Bioinformatics
Publisher: Oxford University Press
ISSN: 1367-4803
Date: 2012
Volume: 28
Number: 24
Page Range: pp. 3290-3297
Identification Number: 10.1093/bioinformatics/bts595
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Open Access
Funder: Engineering and Physical Sciences Research Council (EPSRC), Medical Research Council (Great Britain) (MRC)
Grant number: EP/I036575/1 (EPSRC)
URI: http://wrap.warwick.ac.uk/id/eprint/50805
