Detecting semantic similarity: biases, evaluation and models
Peinelt, Nicole (2021) Detecting semantic similarity: biases, evaluation and models. PhD thesis, University of Warwick.
PDF: WRAP_Theses_Peinelt_2021.pdf (Submitted Version, 4 MB)
Official URL: http://webcat.warwick.ac.uk/record=b3599859~S15
Abstract
Semantic Similarity Detection refers to a collection of binary text pair classification tasks which aim to indicate certain semantic relations between two short texts. Modelling the semantic similarity between texts is a fundamental task in natural language processing with many applications. This is a challenging problem because the same meaning can be conveyed through a variety of expressions, which requires a deep understanding of language beyond mere surface structure and heuristics. This thesis addresses Semantic Similarity Detection in a generic framework and places importance on its application to Community Question Answering (CQA). CQA aims to harness user-generated texts from question-answering websites and online forums to answer complex new questions. This involves two crucial Semantic Similarity Detection tasks: question paraphrase detection and answer identification. CQA content is characterised by informal language and domain-specific vocabulary, making it a demanding research area with practical applications.
This thesis sets out to examine existing dataset biases and to propose new methods for evaluating and modelling semantic similarity between text pairs. We shed light on lexical overlap bias in existing datasets and introduce alternative evaluation metrics which take direct word overlap between text pairs into account. Our metrics highlight model performance on difficult dataset instances, resulting in a more rigorous evaluation setup. We then investigate whether alternative information sources can be leveraged for more resilient and effective Semantic Similarity Detection models. Our first approach incorporates topic models into successful neural architectures. We experiment with a topic-enriched CNN-LSTM model, which subsequently leads to the development of a framework for combining topic models with a pre-trained Transformer model. Our second approach combines linguistically enriched word embeddings with pre-trained Transformers by introducing a versatile and lightweight method for injecting dependency-based and counter-fitted embeddings into BERT. Finally, we summarise our main findings and discuss directions for future work.
Item Type: Thesis (PhD)
Subjects: P Language and Literature > P Philology. Linguistics; Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Semantics -- Data processing; Semantics -- Mathematical models; Natural language processing -- Computer science
Official Date: January 2021
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Liakata, M.
Sponsors: Alan Turing Institute
Extent: 1 volume (various pagings) : illustrations (some colour)
Language: eng