Detecting semantic similarity: biases, evaluation and models
Peinelt, Nicole (2021) Detecting semantic similarity: biases, evaluation and models. PhD thesis, University of Warwick.
PDF: WRAP_Theses_Peinelt_2021.pdf (Submitted Version, 4 MB)
Official URL: http://webcat.warwick.ac.uk/record=b3599859~S15
Abstract
Semantic Similarity Detection refers to a collection of binary text pair classification tasks which aim to indicate certain semantic relations between two short texts. Modelling the semantic similarity between texts is a fundamental task in natural language processing with many applications. This is a challenging problem because the same meaning can be conveyed through a variety of expressions, which requires a deep understanding of language beyond mere surface structure and heuristics. This thesis addresses Semantic Similarity Detection in a generic framework and places importance on its application to Community Question Answering (CQA). CQA aims to harness user-generated texts from question-answering websites and online forums to answer complex new questions. This involves two crucial Semantic Similarity Detection tasks: question paraphrase detection and answer identification. CQA content is characterised by informal language and domain-specific vocabulary, making it a demanding research area with practical applications.
This thesis sets out to examine existing dataset biases and to propose new methods for evaluating and modelling semantic similarity between text pairs. We shed light on lexical overlap bias in existing datasets and introduce alternative evaluation metrics which take direct word overlap between text pairs into account. Our metrics highlight model performance on difficult dataset instances, resulting in a more rigorous evaluation setup. We then investigate whether alternative information sources can be leveraged for more resilient and effective Semantic Similarity Detection models. Our first approach incorporates topic models into successful neural architectures. We experiment with a topic-enriched CNN-LSTM model, which subsequently leads to the development of a framework for combining topic models with a pre-trained Transformer model. Our second approach combines linguistically enriched word embeddings with pre-trained Transformers by introducing a versatile and lightweight method for injecting dependency-based and counter-fitted embeddings into BERT. Finally, we summarise our main findings and discuss directions for future work.
Item Type: Thesis (PhD)
Subjects: P Language and Literature > P Philology. Linguistics; Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Semantics -- Data processing; Semantics -- Mathematical models; Natural language processing -- Computer science
Official Date: January 2021
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Liakata, M.
Sponsors: Alan Turing Institute
Extent: 1 volume (various pagings) : illustrations (some colour)
Language: eng