An approach to source-code plagiarism detection investigation using latent semantic analysis
Cosma, Georgina (2008) An approach to source-code plagiarism detection investigation using latent semantic analysis. PhD thesis, University of Warwick.
WRAP_THESIS_Cosma_2008.pdf - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Other (Permission e-mail)
Restricted to Repository staff only
Official URL: http://webcat.warwick.ac.uk/record=b2248352~S15
This thesis looks at three aspects of source-code plagiarism. The first aspect of the thesis is concerned with creating a definition of source-code plagiarism; the second aspect is concerned with describing the findings gathered from investigating the Latent Semantic Analysis information retrieval algorithm for source-code similarity detection; and the final aspect of the thesis is concerned with the proposal and evaluation of a new algorithm that combines Latent Semantic Analysis with plagiarism detection tools. A recent review of the literature revealed that there is no commonly agreed definition of what constitutes source-code plagiarism in the context of student assignments. This thesis first analyses the findings from a survey carried out to gather an insight into the perspectives of UK Higher Education academics who teach programming on computing courses. Based on the survey findings, a detailed definition of source-code plagiarism is proposed. Secondly, the thesis investigates the application of an information retrieval technique, Latent Semantic Analysis, to derive semantic information from source-code files. Various parameters drive the effectiveness of Latent Semantic Analysis. The performance of Latent Semantic Analysis using various parameter settings and its effectiveness in retrieving similar source-code files when optimising those parameters are evaluated. Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility for investigating the importance of source-code fragments with regards to their contribution towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of suspicious files and source-code fragments.
|Item Type:||Thesis or Dissertation (PhD)|
|Subjects:||Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software|
|Library of Congress Subject Headings (LCSH):||Plagiarism -- Computer programs, Source code (Computer science), Latent semantic indexing|
|Institution:||University of Warwick|
|Theses Department:||Department of Computer Science|
|Supervisor(s)/Advisor:||Joy, Mike ; Russ, Steve|
|Extent:||xix, 289 leaves : ill., charts|
Actions (login required)