A Microsoft Research project that started a few years ago is now integrated into Visual Studio 2012 as a menu item, Analyze Solution for Code Clones, as shown below.
Reusing code fragments by copying and pasting is a common activity in the development and bug-fixing phases. In general, the number of code clones is likely to increase as the scale of a code base increases. Based on textual and functional similarity, clones can be classified into the following types:
- Type-1: Identical code fragments except for variations in whitespace, layout and comments.
- Type-2: Syntactically identical fragments except for variations in identifiers, literals, types, whitespace, layout and comments.
- Type-3: Copied fragments with further modifications such as changed, added or removed statements, in addition to variations in identifiers, literals, types, whitespace, layout and comments.
- Type-4: Two or more code fragments that perform the same computation but are implemented by various syntactic variants.
Clone Detection Process
This is a parser-based search to detect the types of code duplication described above. It is a four-step analysis process, as shown below:
At the beginning of any clone detection approach, the source code is partitioned and the domain of the comparison is determined. There are three main objectives in this phase:
- Remove uninteresting parts: All the source code uninteresting to the comparison phase is filtered out in this phase. For example, partitioning is applied to embedded code to separate out different languages.
- Determine source units: After removing the uninteresting code, the remaining source code is partitioned into a set of disjoint fragments called source units.
- Determine comparison units / granularity: Source units may need to be further partitioned into smaller units depending on the comparison technique used by the tool.
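The partitioning described above can be sketched in a few lines of Python. This is only an illustration, not the Visual Studio implementation: the function names `source_units` and `comparison_units` are hypothetical, comments are assumed to start with `//`, and each remaining line is assumed to hold one statement.

```python
def source_units(source):
    """Remove uninteresting parts: blank lines and comment-only lines.

    Assumption: one statement per line, comments start with '//'.
    """
    units = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            units.append(stripped)
    return units


def comparison_units(statements, window=3):
    """Split the statements into overlapping windows of `window` statements.

    Each unit is a (start_index, tuple_of_statements) pair, so a match can
    be traced back to its position in the original source.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        (start, tuple(statements[start:start + window]))
        for start in range(len(statements) - window + 1)
    ]
```

A smaller `window` gives finer granularity at the cost of more units to compare. A real tool would typically pick the granularity (method, block or statement sequence) to suit its comparison technique.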
Once the units of comparison are determined, if the comparison technique is other than textual, the source code of the comparison units is transformed to an appropriate intermediate representation for comparison.
This step is intended to eliminate superficial differences such as differences in whitespace, commenting, formatting or identifier names.
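To make the transformation step concrete, here is a minimal sketch that uses Python source and the standard `tokenize` module as a stand-in for the languages Visual Studio actually analyzes. The `normalize` function and the placeholder tokens `ID` and `LIT` are assumptions made for illustration. Comments and layout are dropped, and every identifier and literal is replaced by a placeholder, so Type-1 and Type-2 clones map to the same token sequence.

```python
import io
import keyword
import tokenize

# Tokens that only carry layout or comments and are dropped entirely.
_SKIPPED = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.ENDMARKER,
}


def normalize(source):
    """Return a token list with superficial differences removed."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.INDENT:
            tokens.append("INDENT")   # keep block structure, ignore width
        elif tok.type == tokenize.DEDENT:
            tokens.append("DEDENT")
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")       # identifiers become one placeholder
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")      # literals become one placeholder
        else:
            tokens.append(tok.string)  # keywords and operators stay as-is
    return tokens
```

Two functions that differ only in names, literal values, indentation width and comments then produce identical output from `normalize`, while a structurally different function does not.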
The transformed code is then fed into a comparison algorithm where transformed comparison units are compared to each other to find matches. Clones are ranked or filtered using textual analysis or automated heuristics. Often heuristics can be defined based on length, diversity, frequency, or other characteristics of clones in order to rank or filter out clone candidates automatically.
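The comparison and filtering steps can be illustrated with a simple hash-based matcher. This sketch is an assumption about one possible approach, not the algorithm Visual Studio uses. It groups units whose normalized token sequences are identical, filters out short units, and ranks longer clones first. The name `find_clone_groups` and the `min_length` parameter are hypothetical.

```python
from collections import defaultdict


def find_clone_groups(units, min_length=3):
    """Group units that share an identical normalized token sequence.

    `units` maps a unit name to its list of normalized tokens.
    Units with fewer than `min_length` tokens are filtered out.
    Returns lists of unit names, longest clones first.
    """
    buckets = defaultdict(list)
    for name, tokens in units.items():
        if len(tokens) >= min_length:
            # A tuple is hashable, so identical sequences share a bucket.
            buckets[tuple(tokens)].append(name)

    groups = [
        (len(key), sorted(names))
        for key, names in buckets.items()
        if len(names) > 1          # a clone needs at least two occurrences
    ]
    # Rank by clone length, then by number of occurrences, then by name.
    groups.sort(key=lambda g: (-g[0], -len(g[1]), g[1]))
    return [names for _, names in groups]
```

Exact matching on normalized tokens only finds Type-1 and Type-2 clones. Detecting Type-3 clones requires a similarity measure that tolerates inserted, removed or changed tokens.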
Visual Studio groups the results into Exact Match, Strong Match, Medium Match and Weak Match. Developers can then compare the reported fragments manually and take appropriate action.
Code Clone Settings and Exclusions
A settings file is available to configure this feature at the project level. The file is XML with a .codeclonesettings extension and should exist in the top-level directory of the project.
The tool scales well: because the source code is partitioned into independent units, the analysis is easy to parallelize across partitions.
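As an illustration, a settings file might look like the fragment below. The element names follow the schema as documented for Visual Studio, to the best of my knowledge. The file, namespace, type and method names are placeholders, so check the current documentation before relying on it.

```xml
<CodeCloneSettings>
  <Exclusions>
    <!-- Absolute or relative file paths; wildcards are allowed. -->
    <File>MyFile.cs</File>
    <File>GeneratedFiles\*.cs</File>
    <!-- Namespace, Type and FunctionName entries are fully qualified. -->
    <Namespace>MyCompany.MyProject</Namespace>
    <Type>MyCompany.MyProject.MyClass1</Type>
    <FunctionName>MyCompany.MyProject.MyClass2.MyMethod</FunctionName>
  </Exclusions>
</CodeCloneSettings>
```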
Duplicates Not Found
- Type declarations are not compared.
- Only statements in methods and property definitions are compared.
- Analyze Solution for Code Clones will not find clones that are less than 10 statements long. For these, you can apply Find matching clones in solution to shorter fragments.
- Fragments that differ in more than 40% of their tokens are not reported.
- If a project contains a .codeclonesettings file, code elements that are defined in that project will not be searched if they are named in the Exclusions section of the .codeclonesettings file.
- Some kinds of generated code are excluded automatically: *.designer.cs, *.designer.vb and InitializeComponent methods. However, this does not automatically apply to all generated code. For example, if you use text templates, you might want to exclude the generated files by naming them in a .codeclonesettings file.
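The 40% limit on changed tokens can be illustrated with a rough calculation. How Visual Studio counts changed tokens is not specified here, so the sketch below makes an assumption: tokens that the standard library's `difflib.SequenceMatcher` cannot align are treated as changed, relative to the longer fragment. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def changed_token_fraction(tokens_a, tokens_b):
    """Estimate the fraction of tokens that differ between two fragments.

    Assumption: tokens not covered by a matching block count as changed,
    measured against the longer of the two fragments.
    """
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 0.0
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / longest


def within_threshold(tokens_a, tokens_b, max_changed=0.40):
    """True if the fragments are similar enough to be reported as clones."""
    return changed_token_fraction(tokens_a, tokens_b) <= max_changed
```

Under this definition, two ten-token fragments that differ in two tokens have a changed fraction of about 0.2 and pass the check, while fragments that differ in half of their tokens do not.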
Running this tool during development and testing, at quality milestones and in post-release maintenance supports the following code-quality activities:
- Checking for similar issues before check-in
- Reference information for code review
- Architecture refactoring
- Code clean up
- Bug fixing
- Bug investigation