An approach to source-code plagiarism detection investigation using latent semantic analysis
Cosma, Georgina (2008) An approach to source-code plagiarism detection investigation using latent semantic analysis. PhD thesis, University of Warwick.
Official URL: http://webcat.warwick.ac.uk/record=b2248352~S15
This thesis examines three aspects of source-code plagiarism. The first is the creation
of a definition of source-code plagiarism; the second is an investigation of the Latent
Semantic Analysis information retrieval algorithm for source-code similarity detection;
and the third is the proposal and evaluation of a new algorithm that combines Latent
Semantic Analysis with plagiarism detection tools.
A recent review of the literature revealed that there is no commonly agreed definition of
what constitutes source-code plagiarism in the context of student assignments. This thesis
first analyses the findings of a survey carried out to gain insight into the perspectives
of UK Higher Education academics who teach programming on computing courses. Based
on the survey findings, a detailed definition of source-code plagiarism is proposed.
Secondly, the thesis investigates the application of an information retrieval technique,
Latent Semantic Analysis, to derive semantic information from source-code files. The
effectiveness of Latent Semantic Analysis depends on several parameters. The thesis
evaluates its performance under various parameter settings and its effectiveness in
retrieving similar source-code files when those parameters are optimised.
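As a rough illustration of the general technique, the sketch below applies Latent Semantic Analysis to a handful of source-code strings. It is not the procedure used in the thesis: the tokeniser, the raw term-frequency weighting, the number of retained dimensions k and the function names tokenize and lsa_similarity are all assumptions made for this example, since the abstract does not specify any parameter settings.

```python
import re
import numpy as np

def tokenize(source):
    # Assumed, simplistic preprocessing: keep identifier-like tokens, lowercased.
    return [t.lower() for t in re.findall(r"[A-Za-z_]\w*", source)]

def lsa_similarity(files, k=2):
    """Return a file-by-file cosine similarity matrix in a k-dimensional LSA space."""
    docs = [tokenize(f) for f in files]
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}

    # Term-by-file matrix of raw term frequencies (the weighting is an assumption).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[index[t], j] += 1

    # Singular value decomposition, truncated to the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    file_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per file

    # Cosine similarity; files with a (near-)zero reduced vector are left as
    # zero vectors, so their similarity to every file, including themselves, is 0.
    norms = np.linalg.norm(file_vecs, axis=1, keepdims=True)
    norms[norms < 1e-12] = 1.0
    unit = file_vecs / norms
    return unit @ unit.T

files = [
    "int sum(int a, int b) { return a + b; }",
    "int total(int x, int y) { return x + y; }",
    "void print(char *msg) { puts(msg); }",
]
sim = lsa_similarity(files, k=2)
print(np.round(sim, 3))
```

Here the first two files differ only in their identifier names and so should score higher against each other than either does against the third. Changing k, the weighting or the tokeniser changes the scores, which is the kind of parameter sensitivity the paragraph above describes.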
Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection
tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is
a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism
detection tools in order to enhance plagiarism detection. PlaGate also provides a facility
for investigating how much individual source-code fragments contribute towards proving
plagiarism, and it produces graphical output that indicates the clusters of suspicious
files and source-code fragments.
Item Type: Thesis or Dissertation (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Plagiarism -- Computer programs, Source code (Computer science), Latent semantic indexing
Official Date: July 2008
Institution: University of Warwick
Theses Department: Department of Computer Science
Supervisor(s)/Advisor: Joy, Mike ; Russ, Steve
Extent: xix, 289 leaves : ill., charts