The Library

English-Arabic cross-language plagiarism detection

Alotaibi, Naif (2022) English-Arabic cross-language plagiarism detection. PhD thesis, University of Warwick.

PDF: WRAP_Theses_Alotaibi_2022.pdf - Submitted Version (2883 KB)
Embargoed item. Access is restricted to repository staff until 30 May 2025. Contact the author directly, specifying your needs.

Official URL: http://webcat.warwick.ac.uk/record=b3941316

Abstract

Advances in information technology have driven the rapid growth of digital text libraries and automatic machine translation systems, which make it easy to translate text from one language into another. As a result, more content is accessible across languages, and it has become easy to commit translated plagiarism, which is referred to as “cross-language plagiarism”. Identifying plagiarism between texts in different languages is more challenging than recognising it within a corpus written in a single language.

This research proposes a new framework for enhancing English-Arabic cross-language plagiarism detection at the sentence level. The framework comprises two phases: feature extraction, followed by plagiarism detection based on a supervised machine learning classification model. Phase one extracts features from English-Arabic cross-language sentence pairs, for which we propose approaches to extracting sets of features at the lexical, semantic and syntactic levels. This phase involves two components. The first relies on translation plus a monolingual, pre-trained word embedding model, integrated with term frequency-inverse document frequency (TF-IDF) weighting and a part-of-speech (POS) scheme, as well as word order information. The second component employs a pre-trained multilingual model to determine the semantic relatedness of cross-language sentence pairs. In phase two, we apply and examine various supervised machine learning classifiers, using the extracted features both individually and in combination, to classify sentences as either plagiarized or non-plagiarized.
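
As an illustration only, the idea behind the first component of phase one (translation, then a TF-IDF-weighted monolingual word embedding comparison) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used in the thesis: the tiny three-dimensional vectors in TOY_EMBEDDINGS, the smoothed IDF formula, and the function names idf_weights, sentence_vector and cosine are all hypothetical, and the Arabic sentence is assumed to have already been machine-translated into English and tokenised.

```python
import math
from collections import Counter

# Hypothetical toy vectors; a real system would load a pre-trained
# monolingual word embedding model instead.
TOY_EMBEDDINGS = {
    "students": [0.9, 0.1, 0.0], "pupils": [0.8, 0.2, 0.1],
    "copy": [0.1, 0.9, 0.0], "duplicate": [0.2, 0.8, 0.1],
    "essays": [0.0, 0.2, 0.9], "papers": [0.1, 0.1, 0.8],
    "rain": [-0.7, 0.1, -0.2], "falls": [-0.1, -0.8, 0.3],
    "today": [0.2, -0.3, -0.9],
}


def idf_weights(corpus):
    """Smoothed inverse document frequency per token (one common variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for sent in corpus for tok in set(sent))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def sentence_vector(tokens, idf, embeddings):
    """TF-IDF-weighted average of the word vectors of known tokens."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total_weight = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in embeddings:
            continue  # out-of-vocabulary tokens are skipped
        weight = count * idf.get(tok, 1.0)
        for i, value in enumerate(embeddings[tok]):
            vec[i] += weight * value
        total_weight += weight
    return [v / total_weight for v in vec] if total_weight else vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# English source sentence, an assumed machine translation of a suspicious
# Arabic sentence, and an unrelated sentence (all already tokenised).
source = ["students", "copy", "essays"]
translated = ["pupils", "duplicate", "papers"]
unrelated = ["rain", "falls", "today"]

idf = idf_weights([source, translated, unrelated])
src_vec = sentence_vector(source, idf, TOY_EMBEDDINGS)
paraphrase_score = cosine(src_vec, sentence_vector(translated, idf, TOY_EMBEDDINGS))
unrelated_score = cosine(src_vec, sentence_vector(unrelated, idf, TOY_EMBEDDINGS))
print(paraphrase_score > unrelated_score)
```

A score of this kind would be only one feature among several; in the framework described above, it would be passed to the phase-two classifier together with the POS, word order and multilingual-model features rather than thresholded on its own.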

Each phase was assessed using different datasets. The experimental results for phase one on benchmark datasets such as SemEval-2017 show that the proposed feature extraction methods improve on the baselines and other methods. Analysis of the experimental data for phase two demonstrates that using the extracted features and their combinations with various supervised machine learning classification methods achieves promising results. Overall, combining the extracted features with a supervised ensemble machine learning classifier achieves the best classification results.

Item Type:

Thesis (PhD)

Subjects:

P Language and Literature > P Philology. Linguistics
P Language and Literature > PN Literature (General)
Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software

Library of Congress Subject Headings (LCSH):

Plagiarism -- Prevention -- Software, Plagiarism in literature, Translating and interpreting, Machine translating, Machine learning, Computational linguistics

Official Date:

October 2022

Dates:

Date	Event
October 2022	UNSPECIFIED

Institution:

University of Warwick

Theses Department:

Department of Computer Science

Thesis Type:

PhD

Publication Status:

Unpublished

Supervisor(s)/Advisor:

Joy, Mike

Sponsors:

Saudi Arabia. Wizārat al-Maʿārif ; Shaqra University

Format of File:

pdf

Extent:

xiv, 167 pages : illustrations

Language:

eng

University of Warwick
Publications service & WRAP
