University of Warwick
Publications service & WRAP

Source code plagiarism detection in academia with information retrieval : dataset and the observation


Karnalim, Oscar, Budi, Setia, Toba, H and Joy, Mike (2019) Source code plagiarism detection in academia with information retrieval : dataset and the observation. Informatics in Education, 18 (2). pp. 321-344. doi:10.15388/infedu.2019.15

WRAP-source-code-plagiarism-detection-academia-information-retrieval-Joy-2019.pdf - Published Version - Requires a PDF viewer.
Available under License Creative Commons: Attribution-Share Alike 4.0.

WRAP-source-code-plagiarism-detection-academia-information-retrieval-dataset-observation-Joy-2019.pdf - Accepted Version
Embargoed item. Restricted access to Repository staff only - Requires a PDF viewer.

Official URL: http://www.doi.org/10.15388/infedu.2019.15


Abstract

Source code plagiarism is an emerging issue in computer science education, and a number of techniques have been proposed to handle it. However, comparing these techniques can be challenging, since each is evaluated on its own private dataset(s). This paper contributes a public dataset for comparing them. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. It consists of 467 source code files covering seven introductory programming assessment tasks. Unique to this dataset, both the intention to plagiarise and advanced plagiarism attacks are considered in its construction. The dataset's characteristics were observed by comparing three IR-based detection techniques, and it is clear that most IR-based techniques are less effective than a baseline technique that relies on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
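To make the IR perspective concrete, the following is a minimal sketch of ranking candidate source files against a query file by cosine similarity over token bigram counts. It is a generic illustration only: it is not one of the three techniques evaluated in the paper, and the tokenizer, the bigram size, and the names `tokenize`, `ngram_counts`, `cosine` and `rank` are assumptions introduced here.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Crude lexer (an assumption, not the paper's): identifiers/keywords,
    # integer literals, or any single non-space symbol.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def ngram_counts(tokens, n=2):
    # Count overlapping token n-grams; these act as the "index terms".
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, corpus, n=2):
    # Return (name, score) pairs, most similar to the query first.
    q = ngram_counts(tokenize(query), n)
    scored = [
        (name, cosine(q, ngram_counts(tokenize(src), n)))
        for name, src in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical file names and contents, for illustration only.
    corpus = {
        "copy.java": "int sum = a + b;",
        "other.java": "while (true) { print(x); }",
    }
    for name, score in rank("int sum = a + b;", corpus):
        print(name, round(score, 3))
```

A vector-space comparison like this ignores token order beyond the n-gram window, which is one reason such approaches can be faster than string-matching baselines such as Running-Karp-Rabin Greedy-String-Tiling while also being less sensitive to rearranged code.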

Item Type: Journal Article
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science > Computer Science
Library of Congress Subject Headings (LCSH): Data sets, Source code (Computer science), Plagiarism -- Software
Journal or Publication Title: Informatics in Education
Publisher: Vilnius University Institute of Data Science and Digital Technologies
ISSN: 1648-5831
Official Date: 2019
Dates:
Date             Event
2019             Published
17 October 2019  Accepted
Volume: 18
Number: 2
Page Range: pp. 321-344
DOI: 10.15388/infedu.2019.15
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Open Access
Related URLs:
  • Publisher
Open Access Version:
  • Publisher
