English-Arabic cross-language plagiarism detection
Alotaibi, Naif (2022) English-Arabic cross-language plagiarism detection. PhD thesis, University of Warwick.
PDF
WRAP_Theses_Alotaibi_2022.pdf - Submitted Version (2883Kb). Embargoed item. Restricted access to Repository staff only until 30 May 2025. Contact author directly, specifying your specific needs.
Official URL: http://webcat.warwick.ac.uk/record=b3941316
Abstract
The advancement of the information era and technology has contributed to the rapid growth of digital text libraries and automatic machine translation systems. Machine translation tools facilitate translating texts from one language into another. Together, these developments have increased the content accessible in different languages, which makes it easy to commit translated plagiarism, referred to as “cross-language plagiarism”. Identifying plagiarism amongst texts in different languages is more challenging than recognising plagiarism within a corpus written in a single language.
This research proposes a new framework for enhancing English-Arabic cross-language plagiarism detection at the sentence level. The framework comprises two phases: the first is feature extraction, and the second is plagiarism detection based on a supervised machine learning classification model. Phase one is concerned with extracting features from English-Arabic cross-language sentence pairs, where we propose approaches to extracting sets of features at the lexical, semantic and syntactic levels. This phase involves two components. The first relies on translation plus a monolingual, pre-trained word embedding model, integrated with term frequency-inverse document frequency (TF-IDF) weighting and a part-of-speech (POS) scheme, as well as word order information. The second component employs a pre-trained multilingual model for determining semantic relatedness between cross-language sentence pairs. In the second phase, we apply and examine various supervised machine learning classifiers, using the extracted features both individually and in combination, to classify sentences as either plagiarized or non-plagiarized.
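The first component of phase one can be illustrated with a minimal sketch, assuming the Arabic sentence has already been machine-translated into English. It weights word vectors by TF-IDF, averages them into a sentence vector, and compares two sentences by cosine similarity. The toy embeddings, the smoothed IDF formula and every function name below are assumptions made for illustration; the abstract does not specify the thesis's actual models, weighting scheme, or how POS and word order information are integrated, and those parts are omitted here.

```python
import math
from collections import Counter

# Toy 3-dimensional word embeddings (hypothetical values, illustration only).
# A real system would load a pre-trained monolingual embedding model instead.
EMBEDDINGS = {
    "students": [0.9, 0.1, 0.0],
    "pupils": [0.8, 0.2, 0.1],
    "copy": [0.1, 0.9, 0.0],
    "duplicate": [0.2, 0.8, 0.1],
    "essays": [0.0, 0.2, 0.9],
    "weather": [0.1, -0.8, 0.1],
    "cold": [-0.7, 0.0, 0.2],
}


def inverse_document_frequency(corpus):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def sentence_vector(tokens, idf):
    """TF-IDF-weighted average of the embeddings of the known tokens."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for token, tf in Counter(tokens).items():
        if token not in EMBEDDINGS:
            continue  # out-of-vocabulary tokens are skipped
        weight = tf * idf.get(token, 1.0)
        total = [acc + weight * x for acc, x in zip(total, EMBEDDINGS[token])]
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# English source sentence, a translated suspicious sentence, and an unrelated one.
source = "students copy essays".split()
translated = "pupils duplicate essays".split()
unrelated = "weather is cold".split()

idf = inverse_document_frequency([source, translated, unrelated])
sim_paraphrase = cosine(sentence_vector(source, idf), sentence_vector(translated, idf))
sim_unrelated = cosine(sentence_vector(source, idf), sentence_vector(unrelated, idf))
```

With these toy vectors the paraphrased pair scores far higher than the unrelated pair. A similarity score of this kind would be one feature among several passed to the phase-two classifier.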
Each phase was assessed using different datasets. The experimental results for phase one on different benchmark datasets, such as SemEval-2017, show that the proposed feature extraction methods improve on the baselines and other methods. Analysis of the experimental data for phase two demonstrates that using the extracted features and their combinations with various supervised machine learning classification methods achieves promising results. Ultimately, combining the extracted features with a supervised ensemble machine learning classifier achieves the best classification results.
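The idea of combining features through an ensemble can be sketched as follows. This is a minimal illustration, assuming each sentence pair is described by three similarity scores (lexical, semantic, syntactic). It uses hard majority voting over one-feature decision stumps as a stand-in, because the abstract does not name the ensemble method or base classifiers used in the thesis. The data values and function names are hypothetical.

```python
def train_stump(X, y, feature):
    """Choose the threshold on one feature that minimises training errors.

    The stump predicts 1 (plagiarized) when the feature value >= threshold.
    """
    best_threshold, best_errors = 0.0, len(y) + 1
    for threshold in sorted({row[feature] for row in X}):
        errors = sum(
            int(row[feature] >= threshold) != label for row, label in zip(X, y)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return feature, best_threshold


def predict_stump(stump, row):
    """Apply a single (feature, threshold) stump to one feature row."""
    feature, threshold = stump
    return int(row[feature] >= threshold)


def predict_ensemble(stumps, row):
    """Hard voting: label 1 only if a strict majority of stumps vote 1."""
    votes = sum(predict_stump(stump, row) for stump in stumps)
    return int(2 * votes > len(stumps))


# Hypothetical feature rows: [lexical, semantic, syntactic] similarity scores.
X = [
    [0.82, 0.91, 0.75],
    [0.70, 0.88, 0.40],
    [0.35, 0.93, 0.66],
    [0.20, 0.30, 0.25],
    [0.30, 0.35, 0.72],
    [0.74, 0.25, 0.30],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = plagiarized, 0 = non-plagiarized

# One stump per feature; the ensemble combines them by majority vote.
stumps = [train_stump(X, y, feature) for feature in range(3)]
predictions = [predict_ensemble(stumps, row) for row in X]
```

On this toy data the lexical and syntactic stumps each misclassify one training pair, while the majority vote labels all six correctly. That shows, on a very small scale, why combining features can outperform any single feature.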
Item Type: | Thesis (PhD)
---|---
Subjects: | P Language and Literature > P Philology. Linguistics; P Language and Literature > PN Literature (General); Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): | Plagiarism -- Prevention -- Software, Plagiarism in literature, Translating and interpreting, Machine translating, Machine learning, Computational linguistics
Official Date: | October 2022
Institution: | University of Warwick
Theses Department: | Department of Computer Science
Thesis Type: | PhD
Publication Status: | Unpublished
Supervisor(s)/Advisor: | Joy, Mike
Sponsors: | Saudi Arabia. Wizārat al-Maʿārif ; Shaqra University
Extent: | xiv, 167 pages : illustrations
Language: | eng