Gaussian latent tree model constraints for linguistics and other applications

PDF: WRAP_THESIS_Shiers_2016.pdf (Submitted Version)

Abstract

The relationships between languages are often modelled as phylogenetic trees, in which a single shared ancestral language sits at the root and contemporary languages appear as leaves. These can be thought of as directed acyclic graphs with hidden variables, specifically Bayesian networks. However, from a statistical perspective there is often no formal assessment of the suitability of these latent tree models. Much of the work that seeks to address this has focused on discrete variable models. When observations are instead considered as functional data, the high-dimensional approximations are often better treated in a Gaussian context. High-dimensional data are often stored inefficiently, so the first challenge is to project the data to a low dimension while retaining the information of interest. One approach is to form a basis using the newly developed tool named separable-canonical variate analysis.

The techniques for assessing latent tree model compatibility are extended beyond discrete variables, and the complete set of Gaussian tree constraints is derived for the first time. This set comprises equations and inequalities in terms of the correlations of observed variables. In theory, these statements must hold for a Gaussian latent tree model to be appropriate for a given data set. Using the separable-canonical variate analysis basis to obtain a truncated representation, the suitability of a phylogenetic tree can then be assessed directly. In practice, however, it is desirable to allow for some sampling error, and so probabilistic tools are developed alongside the theoretical derivation of the Gaussian tree constraints.
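
As a rough illustration of what constraints "in terms of correlations of observed variables" can look like, the sketch below checks the simplest case: three observed Gaussian variables that share one hidden parent (a star tree). In that case each pairwise correlation factorises as a product of two leaf-to-parent correlations, which implies a non-negative triple product and three magnitude bounds. This is a minimal example under that assumption, not the full constraint set derived in the thesis; the function name `star_tree_triple_ok` and the tolerance `tol` are hypothetical, and the check ignores sampling error.

```python
def star_tree_triple_ok(r12, r13, r23, tol=1e-9):
    """Check necessary conditions for three pairwise correlations to arise
    from a Gaussian star tree with one hidden parent.

    Assumption (illustrative): r_ij = a_i * a_j, where a_i is the correlation
    between leaf i and the hidden parent and |a_i| <= 1. This implies
      r12 * r13 * r23 = (a1 * a2 * a3) ** 2 >= 0, and
      |r12 * r13| = a1 ** 2 * |r23| <= |r23|  (similarly for the other pairs).
    """
    product_ok = r12 * r13 * r23 >= -tol
    bounds_ok = (
        abs(r12 * r13) <= abs(r23) + tol
        and abs(r12 * r23) <= abs(r13) + tol
        and abs(r13 * r23) <= abs(r12) + tol
    )
    return product_ok and bounds_ok


if __name__ == "__main__":
    # Correlations generated from hypothetical leaf-to-parent values 0.9, 0.8, 0.7.
    a1, a2, a3 = 0.9, 0.8, 0.7
    print(star_tree_triple_ok(a1 * a2, a1 * a3, a2 * a3))  # True

    # Negative triple product: no single hidden parent can produce this.
    print(star_tree_triple_ok(0.5, 0.5, -0.2))  # False

    # Valid correlation matrix, but |r12 * r13| = 0.64 exceeds |r23| = 0.3.
    print(star_tree_triple_ok(0.8, 0.8, 0.3))  # False
```

With estimated correlations, a hard pass/fail check like this would be too strict, which is why the abstract notes that probabilistic tools are developed to allow for sampling error.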

The proposed methodology is applied in an in-depth study of a real linguistic data set to assess the phylogenies of five Romance languages. This application is distinctive in that the data set consists of acoustic recordings, which are treated as functional data and then used to compare languages in a phylogenetic context. As a consequence, a wide range of theory and tools is called upon from the multivariate and functional domains, and the powerful new separable-canonical function analysis and separable-canonical variate analysis are used. Utilising the newly derived Gaussian tree constraints for hidden variable models provides a first insight into features of spoken languages that appear to be tree-compatible.

Item Type: Thesis [via Doctoral College] (PhD)
Subjects: Q Science > QA Mathematics
Library of Congress Subject Headings (LCSH): Linguistics -- Statistical methods, Directed graphs
Official Date: June 2016
Dates: June 2016 (Submitted)
Institution: University of Warwick
Theses Department: Department of Statistics
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Smith, J. Q. ; Aston, John A. D.
Sponsors: Economic and Social Research Council (Great Britain) [ES/I90427/1]
Extent: xiv, 190 leaves : illustrations, charts
Language: eng
URI: https://wrap.warwick.ac.uk/80590/
