Style analysis for source code plagiarism detection

WRAP_Theses_Mirza_2018.pdf - Submitted Version

Abstract

The enormous growth in code resources available online has created new challenges for detecting plagiarism in program source code. Several software applications can detect source code similarity using different detection methods. However, few current detection tools detect every kind of plagiarism attack. The aim of this thesis is, therefore, to enhance methods for plagiarism detection in source code using a style analysis approach that has previously been used to detect authorship.

There are very few large source-code datasets which are suitable for research purposes; two such datasets are the BlackBox dataset and the SOCO (Detection of SOurce COde) dataset. SOCO is a benchmark dataset that contains groups of similar source-code files that can be considered plagiarised, and it has been used in authorship and plagiarism detection competitions.

In the first part of the thesis, the suitability of BlackBox as a source of datasets for testing plagiarism detection is explored. The files in BlackBox were analysed and visualised in order to evaluate its suitability as a dataset for this research. The analysis aimed to identify similar source-code files, and therefore to detect groups of Java files within BlackBox that can be used for evaluating the performance of source-code plagiarism detection methods.

In the second part of the thesis, a plagiarism detection framework, the "Metric-File Matrix Framework" (MFM), is proposed. The MFM framework is designed to overcome some of the limitations of existing plagiarism detection methods by 1) proposing a new set of metrics which consider structural and stylistic similarities; and 2) using Singular Value Decomposition as a technique to remove noise and to reduce the dimensionality of the data, in order to enhance the similarity detection.
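
The general idea of combining a metric-file matrix with Singular Value Decomposition can be illustrated with a minimal sketch. This is not the thesis implementation: the four metrics, the file names, the matrix layout (rows are metrics, columns are files), the row normalisation, the chosen rank and the use of cosine similarity are all assumptions made for illustration only.

```python
import numpy as np


def reduced_file_vectors(metric_file_matrix, rank):
    """Return one rank-reduced vector per file using truncated SVD.

    Assumed layout: rows are metrics, columns are files.
    """
    a = np.asarray(metric_file_matrix, dtype=float)

    # Assumption: scale each metric row to unit length so that metrics
    # with large raw values (e.g. line counts) do not dominate.
    row_norms = np.linalg.norm(a, axis=1, keepdims=True)
    row_norms[row_norms == 0] = 1.0
    a = a / row_norms

    # Thin SVD: a = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    k = min(rank, len(s))

    # Keep the k largest singular values; each row of the result
    # represents one file in the reduced space.
    return (s[:k, None] * vt[:k]).T


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = vectors / norms
    return unit @ unit.T


# Hypothetical metrics (rows): line count, mean identifier length,
# comment ratio, maximum nesting depth.
# Hypothetical files (columns): A.java, B.java (near-copy of A), C.java.
matrix = [
    [120.0, 118.0, 40.0],
    [7.5, 7.4, 12.0],
    [0.20, 0.21, 0.02],
    [4.0, 4.0, 7.0],
]

sim = cosine_similarity_matrix(reduced_file_vectors(matrix, rank=2))
print(np.round(sim, 3))
```

In this toy example the pair A.java and B.java scores higher than the pair A.java and C.java, because the discarded smallest singular component mostly carries the minor differences between the two near-identical files. How the rank and the metrics should be chosen in practice is a question the thesis itself addresses.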

The MFM framework was implemented and its performance was evaluated using the proposed metrics. For the evaluations, the SOCO dataset was adopted and the performance of the proposed framework was compared against other state-of-the-art plagiarism detection tools including JPlag.

Item Type: Thesis [via Doctoral College] (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Source code (Computer science), Plagiarism -- Software, Data sets, Java (Computer program language)
Official Date: November 2018
Dates:
Date: November 2018
Event: UNSPECIFIED
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Joy, Mike ; Cosma, Georgina
Format of File: pdf
Extent: xv, 160 leaves : illustrations, charts
Language: eng
URI: https://wrap.warwick.ac.uk/133218/
