Probabilistic neural topic models for text understanding

WRAP_Theses_Pergola_2020.pdf - Submitted Version (2MB)

Abstract

Making sense of text remains one of the most fascinating and open challenges, both thanks to and despite the vast amount of information continuously produced by recent technologies. Along with the growing size of textual data, automatic approaches have to deal with the wide variety of linguistic features across different domains and contexts: for example, user reviews might be characterised by colloquial idioms, slang or contractions, while clinical notes often contain technical jargon, with typical medical abbreviations and polysemous words whose meaning strictly depends on the particular context in which they were used.

We propose to address these issues by combining topic modelling principles and models with distributional word representations. Topic models generate concise and expressive representations for high volumes of documents by clustering words into “topics”, which can be interpreted as document decompositions. They focus on analysing the global context of words and their co-occurrences within the whole corpus. Distributional language representations, instead, encode the syntactic and semantic properties of words by leveraging their local contexts, and can be conveniently pre-trained to facilitate model training and the simultaneous encoding of external knowledge. Our work represents one step towards bridging the gap between recent advances in topic modelling and increasingly rich distributional word representations, with the aim of addressing the aforementioned issues related to different linguistic features within different domains.

In this thesis, we first propose a hierarchical neural model inspired by topic modelling, which leverages an attention mechanism along with a novel neural cell for fine-grained detection of sentiments and themes discussed in user reviews. Next, we present a neural topic model with adversarial training to distinguish topics based on their high-level semantics (e.g. opinions or factual descriptions). Then, we design a probabilistic topic model specialised for the extraction of biomedical phrases, whose inference process goes beyond the limitations of traditional topic models by seamlessly combining word co-occurrence statistics with information from word embeddings. Finally, inspired by the usage of entities in topic modelling [85], we design a novel masking strategy to fine-tune language models for biomedical question-answering. For each of the above models, we report experimental assessments supporting their efficacy across a wide variety of tasks and domains.

Item Type: Thesis [via Doctoral College] (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Computational linguistics, Natural language processing (Computer science), Neural networks (Computer science), Sentiment analysis, Machine learning
Official Date: December 2020
Dates: December 2020 (Event: UNSPECIFIED)
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: He, Yulan
Format of File: pdf
Extent: v, viii, 126 leaves : illustrations
Language: eng
URI: https://wrap.warwick.ac.uk/162616/
