What is Code Similarity?
Code Similarity is an indicator that helps you judge whether a candidate’s code submission may have been plagiarized. It is not a definitive indicator of plagiarism. It suggests when more investigation should be considered before moving the candidate forward. Code Similarity detection is supported for the following programming languages: C, C#, C++, Java, JavaScript, Objective-C, PHP, Python 3, Python.
Four Potential Values for Code Similarity
Need More Data
Need More Data means there are not enough submissions to calculate Code Similarity. More than 10 submissions per challenge in each programming language are needed before Code Similarity can be generated for each candidate. As soon as 11 submissions are in the system, all submissions in that language will be evaluated for Code Similarity.
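The threshold described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (more than 10 submissions per challenge and language); the field names `challenge_id` and `language` and the helper `needs_more_data` are hypothetical and are not part of CodeVue.

```python
from collections import Counter

# "More than 10" submissions per challenge and language, as stated above.
MIN_SUBMISSIONS = 11

def needs_more_data(submissions, challenge_id, language):
    """Return True when the (challenge, language) group is too small to score."""
    counts = Counter((s["challenge_id"], s["language"]) for s in submissions)
    return counts[(challenge_id, language)] < MIN_SUBMISSIONS

# Ten submissions are not enough; the eleventh makes the whole group scorable.
subs = [{"challenge_id": "c1", "language": "python"} for _ in range(10)]
print(needs_more_data(subs, "c1", "python"))  # True
subs.append({"challenge_id": "c1", "language": "python"})
print(needs_more_data(subs, "c1", "python"))  # False
```

Submissions in a different language for the same challenge are counted separately, so one language can show Need More Data while another is already being evaluated.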
Extremely Similar
Extremely Similar means that the code submitted is very similar to other submissions in the same language and may have been plagiarized. To investigate, ask a programmer proficient in that language to review the submission.
Expected Range
Expected Range means that the code submitted is unique enough when compared to the submissions of other candidates. It is probably NOT plagiarized.
Extremely Unique
Extremely Unique means that the submission is highly unique when compared to the submissions of other candidates. Because it differs so much from what other candidates submitted, it should be reviewed. To investigate, ask a programmer proficient in that language to review the submission.
Best Practice
How do I know if a candidate cheated?
Follow Up Video Question
The best way to know if a candidate cheated on a CodeVue challenge is to ask a follow-up video question that makes them discuss their code. For example, “Explain your solution and why you chose that approach.” Another good follow-up question might be, “Describe another approach that you could have taken to solve this challenge.”
After reviewing the candidate’s code and test scores, watch their video response to the follow-up question. Be sure that they are accurately describing what they submitted and are clearly conveying an understanding of the problem and solution. The ability of the candidate to discuss the details of both problems and solutions is perhaps their most valuable skill.
Live Interview: Live Coding
If you think the candidate is capable and want to dig deeper into their problem-solving skills through the use of code, schedule a Live interview with them and use Live Coding.
Live Coding in a Live interview allows the candidate and interviewer to collaborate in a shared code editor during the interview. Both the candidate and interviewer can write and execute code in the editor. This is a great way to discuss coding topics while in a live audio/video session, and it would be extremely difficult to cheat at coding during a Live Coding session.
Internally, we calculate a numeric score based on the entire set of code submissions in the system. The lower the numeric score, the more similar the submission is to other code submissions for that same challenge and language. Conversely, a higher numeric score indicates that the code submission is very different from the others that HireVue has received for that challenge. Once enough submissions are available, we categorize Code Similarity into three values:
- Extremely Similar
- Expected Range
- Extremely Unique
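As a rough illustration of how a numeric score could map onto these three values, here is a minimal Python sketch. The 0 to 100 scale and both cut-off constants are hypothetical assumptions made for the example; the actual scale and thresholds used by CodeVue are not described here.

```python
# Hypothetical cut-offs on an assumed 0-100 scale (not CodeVue's real values).
SIMILAR_BELOW = 10.0
UNIQUE_ABOVE = 90.0

def categorize(score):
    """Map a numeric score to a category; lower scores mean more similar code."""
    if score < SIMILAR_BELOW:
        return "Extremely Similar"
    if score > UNIQUE_ABOVE:
        return "Extremely Unique"
    return "Expected Range"

for score in (3.5, 50.0, 97.2):
    print(score, categorize(score))
```

The only property taken from the description above is the direction of the scale: low scores fall toward Extremely Similar and high scores toward Extremely Unique.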
Because developers use both creativity and existing solutions to produce real-world code, Code Similarity may or may not indicate plagiarism, for the following reasons:
- A simple challenge may have few possible solutions:
Imagine a simple problem like “Count to ten.” With this problem, there are only a few common ways to produce a valid result and pass the test. You could have a hundred coders all submit essentially the same answer for this problem yet none of them would have copied someone else’s code and cheated along the way. The Code Similarity will end up reflecting the invariability of the problem, as opposed to plagiarism.
- A good solution may be clear to a good coder:
Imagine a classic “Sorting” problem where you are given a random set of numbers, and you have to return them in numerical order. If a candidate is given this challenge, they may use the standard “Quick Sort” algorithm for their solution. Since sorting algorithms are a standard topic in computer science, most programmers will know the standard algorithms. Their solution may look similar to other candidates’ submissions. A low Code Similarity score (that is, a highly similar submission) does not indicate plagiarism in this case. Quite the contrary, it indicates they know the topic!
CodeVue’s Code Similarity is based on a well-recognized algorithm known as “Measure of Software Similarity” (aka “MOSS”). MOSS builds on a paper about detecting document duplication on the web, co-authored by Dr. Alex Aiken of Stanford University. That paper, which describes the “Winnowing Algorithm,” has been widely regarded as an effective basis for detecting duplicated digital content in many arenas.
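To give a feel for the winnowing idea, here is a simplified Python sketch: hash every k-character substring of the normalized code, keep the smallest hash in each sliding window as a fingerprint, and compare the fingerprint sets of two submissions. It is not MOSS or CodeVue's implementation. The function names, the values of `k` and `window`, the use of CRC32 as the hash, and the Jaccard comparison are all assumptions chosen for the example. Note also that this sketch returns a higher number for more similar code, which is the opposite direction from the numeric score described above.

```python
import zlib

def kgram_hashes(code, k):
    """Hash every k-character substring after removing whitespace and case."""
    cleaned = "".join(code.split()).lower()
    return [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
            for i in range(len(cleaned) - k + 1)]

def winnow(hashes, window):
    """Keep the smallest hash from each sliding window of `window` hashes."""
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    # Positions are dropped here; only the selected hash values are compared.
    return {min(hashes[start:start + window])
            for start in range(len(hashes) - window + 1)}

def fingerprint_overlap(code_a, code_b, k=5, window=4):
    """Jaccard overlap of the two fingerprint sets (1.0 = identical sets)."""
    fa = winnow(kgram_hashes(code_a, k), window)
    fb = winnow(kgram_hashes(code_b, k), window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

a = "for i in range(1, 11):\n    print(i)"
b = "for i in range(1,11): print(i)"   # same code, different spacing
print(fingerprint_overlap(a, b))        # 1.0
```

Because whitespace and case are stripped before hashing, purely cosmetic edits do not change the fingerprints in this sketch. Real tools typically normalize much more aggressively, for example by tokenizing the code so that renamed variables still match.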
Plagiarism isn’t necessarily a clear-cut issue. Coders can be automatons or artists. An automaton simply executes what they have been taught. An artist will customize the code and incorporate what they need into their solutions.
Both artists and automatons bring value that can be applied in different ways. The reality is that most working software developers are a combination of both artist and automaton. It is normal for a developer to look up existing solutions on the web, review them, and copy the core thinking or even complete blocks of code. They apply these assets to the particular problem they have at hand.
Reusing existing solutions in this way is good practice because the developer ends up looking at the work of others and integrating the lessons and ideas learned from the same or similar problems. The best developers both copy and create, which allows them to apply the lessons learned by other developers.