How would you come up with a solution to identify plagiarism?
-Vector space model approach
-Represent documents (the suspect and original ones) as vectors of terms
-Terms: n-grams, with n ranging from 1 up to as large as is practical (longer n-grams help detect passage-level plagiarism)
-Measure the similarity between both documents
-Similarity measure: cosine similarity, Jaro-Winkler, Jaccard
-Declare plagiarism when the similarity reaches a chosen threshold
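
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function names, the cap of `max_n=3` on n-gram length, and the `threshold=0.8` cut-off are all assumptions chosen for the example, and a real system would tune them on labelled data. It covers the cosine and Jaccard measures only.

```python
import math
import re
from collections import Counter


def ngram_counts(text, max_n=3):
    """Represent a document as counts of word n-grams for n = 1..max_n.

    max_n is an assumed cap; larger values favour long copied passages.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Shared distinct terms divided by all distinct terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def is_plagiarised(suspect, original, threshold=0.8, max_n=3):
    """Flag the suspect text when cosine similarity reaches the threshold.

    threshold=0.8 is an arbitrary illustrative value, not a recommendation.
    """
    score = cosine_similarity(
        ngram_counts(suspect, max_n), ngram_counts(original, max_n)
    )
    return score >= threshold, score


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog"
    suspect = "The quick brown fox leaps over the lazy dog"
    print(is_plagiarised(suspect, original))
```

Using raw counts keeps the sketch short; weighting terms (for example with TF-IDF) would reduce the influence of very common words.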