Bayesian and machine learning approaches in metagenomics
Souliotis, Leonidas (2020) Bayesian and machine learning approaches in metagenomics. PhD thesis, University of Warwick.
PDF: WRA_Theses_Souliotis_2020.pdf - Submitted Version (6Mb)
Official URL: http://webcat.warwick.ac.uk/record=b3736722~S15
Abstract
In this doctoral thesis, we present a novel set of bioinformatics tools to address key problems in the field of metagenomics. This set includes a fully probabilistic framework for estimating the number of genomes present at the species level in a metagenomic sample, the use of variational autoencoders as an alternative method for dimensionality reduction of the coverage and tetramer composition of metagenomic samples, and a natural language processing method for compressing gene frequencies in metagenomes to better predict their phenotypic traits.
The first tool tackles the problem of metagenomic binning. A Bayesian non-parametric method is used in conjunction with a Gaussian mixture model to estimate the number of genomes present more accurately and to cluster the contigs into the appropriate bins. We call this method the DP (Dirichlet process) algorithm. An attempt was made to improve the accuracy of the algorithm by incorporating extra information from the edges of the assembly graph, but this addition was not included in the final model because the signal from the data used was too weak. The method is validated on a simulated mock community of 20 genomes and compared against state-of-the-art binners on a simulated community of 100 genomes, in scenarios that use different numbers of samples. The results show that the method performs at least as well as the state-of-the-art methods and outperforms them in some scenarios. The method is also applied to a real infant gut dataset of 11 samples.
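To illustrate why a Dirichlet process prior lets the number of bins be inferred rather than fixed in advance, the following minimal sketch draws one partition from a Chinese restaurant process, the clustering prior implied by a Dirichlet process. It is not the thesis's DP algorithm: there is no Gaussian likelihood and no inference, and the function name, the concentration value `alpha` and the seed are assumptions made purely for illustration.

```python
import random


def sample_crp_partition(n_items, alpha, seed=0):
    """Draw one partition of n_items from a Chinese restaurant process.

    alpha is the concentration parameter: larger values favour more clusters.
    This samples from the prior only; a binner would combine it with a
    likelihood (for example Gaussian) over contig features.
    """
    rng = random.Random(seed)
    assignments = []    # cluster index assigned to each item
    cluster_sizes = []  # current number of items in each cluster
    for _ in range(n_items):
        # An existing cluster is chosen with weight equal to its size;
        # a brand-new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes


assignments, sizes = sample_crp_partition(n_items=200, alpha=1.0, seed=42)
print(len(sizes), "clusters with sizes", sizes)
```

The number of clusters is not an input: it emerges from the draws and tends to grow slowly as more items (contigs, in the binning setting) are added.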
The second tool concerns the prediction of phenotypic traits in metagenomes. Here we build on the idea of using the frequencies of genes, annotated against the Kyoto Encyclopedia of Genes and Genomes (KEGG), to predict the presence or absence of 83 functional and metabolic traits. We apply the doc2vec algorithm as a dimensionality reduction method to 9407 prokaryotic genomes, experimenting with different compression dimensions and training various machine learning algorithms for trait prediction. We conclude that dimensionality reduction improves the performance of the classifiers, with the best results obtained by combining a 100-dimensional embedding with L1-regularized logistic regression. In addition, we train the classifiers on the uncompressed KEGG Orthology (KO) frequencies and identify the traits for which compression offers no improvement, comparing the number of KOs present in each case.
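The abstract does not say how genomes were encoded as documents for doc2vec. One plausible encoding, assumed here only for illustration, treats each genome as a document whose tokens are KO identifiers repeated according to their frequency. The function name and the example counts below are hypothetical.

```python
def ko_counts_to_document(ko_counts):
    """Turn a {KO identifier: count} mapping into a list of tokens.

    Each KO identifier is repeated as many times as it occurs in the genome,
    so that token frequencies in the document mirror the KO frequencies.
    The sorted token order is an arbitrary choice made for reproducibility.
    """
    tokens = []
    for ko in sorted(ko_counts):
        tokens.extend([ko] * ko_counts[ko])
    return tokens


# Hypothetical KO counts for a single genome (illustrative values only).
genome = {"K00001": 2, "K00845": 1, "K01810": 3}
doc = ko_counts_to_document(genome)
print(doc)
```

A list of tokens like this is the kind of input a document-embedding model consumes; the embedding dimension (100 in the best-performing setting reported above) would be chosen when the model is trained.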
The third tool uses variational autoencoders to compress the coverage and tetramer composition before binning in metagenomic samples. We combine the variational autoencoder architecture used in the VAMB binner for dimensionality reduction with the Bayesian non-parametric binning approach presented above. We tested this novel combination on the same simulated mock community of 20 genomes used previously and concluded that it clusters the contigs correctly at the species level more often than the DP algorithm does. We also concluded that this combination does not perform well on real datasets: it was unable to identify any 'good' bins, as assessed by the percentage of single-copy core genes present.
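As a reminder of what the tetramer composition of a contig is, the sketch below counts overlapping 4-mers and normalizes them to frequencies. It is a simplified illustration, not the feature extraction used in the thesis or in VAMB: it keeps all 256 tetramers, does not merge reverse complements, and adds no pseudocounts. The function name is an assumption.

```python
from collections import Counter
from itertools import product

# All 256 tetramers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(sequence):
    """Return the 256-dimensional tetramer frequency vector of a sequence.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored. A sequence with no valid window yields an all-zero vector.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


freqs = tetramer_frequencies("ACGTACGT")
print(freqs[KMERS.index("ACGT")])
```

Vectors like this, together with per-sample coverage, are the high-dimensional inputs that an autoencoder would compress before clustering.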
The last part of this work is a case study of the oral microbiome. It is estimated that the oral cavity hosts over 700 species of bacteria. In this study, we analyze 131 oral metagenomic samples from 68 individuals. We follow an assembly-based approach and then split the analysis in two directions. In the first approach, the contigs are binned and the abundance of each bin in each sample is calculated. In the second approach, the contigs are not binned; open reading frames are called and mapped to KEGG genes, and the coverage of each gene in every sample is calculated. We associate these coverages with various metadata and attribute their variation to the presence of different species or KOs.
Item Type: | Thesis (PhD)
---|---
Subjects: | Q Science > QA Mathematics; Q Science > QH Natural history
Library of Congress Subject Headings (LCSH): | Metagenomics, Bayesian statistical decision theory, Machine learning, Bioinformatics
Official Date: | September 2020
Institution: | University of Warwick
Theses Department: | Warwick Medical School
Thesis Type: | PhD
Publication Status: | Unpublished
Supervisor(s)/Advisor: | Quince, Christopher
Sponsors: | Engineering and Physical Sciences Research Council
Extent: | xviii, 135 leaves : illustrations, charts
Language: | eng