Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul, Griffin, Jim E., Savage, Richard S., Ghahramani, Zoubin and Wild, David L. (2012) Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. ISSN 1367-4803 (In Press)
Official URL: http://dx.doi.org/10.1093/bioinformatics/bts595
Abstract
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.

Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.
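The abstract describes the coupling between the per-dataset mixture models only in words. As a rough illustration, the sketch below assumes a pairwise form in which one item's prior over its labels in K datasets is the product of the per-dataset mixture weights, multiplied by (1 + phi[k][l]) whenever datasets k and l give the item the same label. This functional form, the function names and the toy numbers are assumptions made for illustration and are not stated on this page; the published article is the authority on the actual model.

```python
import itertools


def joint_allocation_prior(weights, phi):
    """Prior over one item's component labels across K datasets.

    weights[k][c] is the mixture weight of component c in dataset k.
    phi[k][l] >= 0 is a hypothetical agreement parameter for datasets k and l.
    Assumed form: product of weights, times (1 + phi[k][l]) for every pair
    of datasets that assign the item the same label, then normalised.
    """
    n_datasets = len(weights)
    n_components = len(weights[0])
    scores = {}
    # Enumerate every joint labelling (feasible only for small toy problems).
    for labels in itertools.product(range(n_components), repeat=n_datasets):
        score = 1.0
        for k, c in enumerate(labels):
            score *= weights[k][c]
        for k, l in itertools.combinations(range(n_datasets), 2):
            if labels[k] == labels[l]:
                score *= 1.0 + phi[k][l]
        scores[labels] = score
    total = sum(scores.values())
    return {labels: s / total for labels, s in scores.items()}


def agreement_probability(prior, k, l):
    """Prior probability that datasets k and l give the item the same label."""
    return sum(p for labels, p in prior.items() if labels[k] == labels[l])


if __name__ == "__main__":
    weights = [[0.5, 0.5], [0.5, 0.5]]  # two datasets, two components each
    independent = joint_allocation_prior(weights, [[0.0, 0.0], [0.0, 0.0]])
    coupled = joint_allocation_prior(weights, [[0.0, 3.0], [3.0, 0.0]])
    print(agreement_probability(independent, 0, 1))
    print(agreement_probability(coupled, 0, 1))
```

With all agreement parameters at zero the datasets are clustered independently under this sketch; raising phi[k][l] shifts prior mass toward labellings on which datasets k and l agree, which is the sense in which such parameters "describe the agreement among the datasets".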
| Item Type: | Submitted Journal Article |
|---|---|
| Subjects: | Q Science > QA Mathematics; Q Science > QH Natural history > QH301 Biology |
| Divisions: | Faculty of Science > Centre for Systems Biology |
| Library of Congress Subject Headings (LCSH): | Biology -- Data processing, Data integration (Computer science), Cluster analysis |
| Journal or Publication Title: | Bioinformatics |
| Publisher: | Oxford University Press |
| ISSN: | 1367-4803 |
| Date: | 2012 |
| Identification Number: | 10.1093/bioinformatics/bts595 |
| Status: | Peer Reviewed |
| Publication Status: | In Press |
| Access rights to Published version: | Open Access |
| Funder: | Engineering and Physical Sciences Research Council (EPSRC), Medical Research Council (Great Britain) (MRC) |
| Grant number: | EP/I036575/1 (EPSRC) |
| URI: | http://wrap.warwick.ac.uk/id/eprint/50805 |

