Method-level code clone detection through LWH (Light Weight Hybrid) approach

Kodhai, Egambaram; Kanmani, Selvadurai

doi:10.1186/s40411-014-0012-8

Research
Open access
Published: 22 October 2014


Journal of Software Engineering Research and Development volume 2, Article number: 12 (2014)


Abstract

Background

Many researchers have investigated different techniques to automatically detect duplicate code in programs exceeding thousands of lines of code. These techniques have limitations in finding either the structural or the functional clones.

Methods

We propose an LWH (Light Weight Hybrid) approach combining textual analysis and metrics for the detection of method-level syntactic and semantic clones in C and Java projects. The approach has been evaluated experimentally for the detection of all four types of clones using a specific set of metrics and textual comparison. A tool named CloneManager has been developed in Java to support the experiments and to validate the proposed approach.

Results

A benchmark dataset widely referred to in the literature and medium- to large-size open-source projects developed in C or Java are used for the experiments.

Conclusions

The results show that the proposed approach is able to detect all four types of clones accurately, with precision and recall values ranging from 88% to 100%.

1 Introduction

Copying a code fragment and reusing it by pasting, with or without minor modification or adaptation, is called “code cloning”, and the pasted code fragment is called a “clone”. Most software systems contain a substantial quantity of code clones; typically 10–15% of the source code in large software systems is part of one or more code clones (Kapser and Godfrey [2006]).

In the literature, Bellon et al. ([2007]) classified and defined four types of clones. A number of techniques have been proposed for the detection of type-1, type-2, and type-3 clones as defined in the clone literature. However, for type-4 clones, called semantic clones, only a few attempts have been made, and these have limitations (Marcus and Maletic [2001]; Komondoor and Horwitz [2001]; Krinke [2001]; Gabel et al. [2008]; Liu et al. [2006]). So far, the literature lacks a technique that detects all four types of clones.

Clones may be useful from different points of view (Kapser and Godfrey [2008]). Clones carry important domain knowledge, and thus studying clones may assist in understanding it (Pate et al. [2011]). Moreover, software clone research has promoted academic-industrial collaboration. Software practitioners frequently copy and modify clones from existing projects to meet the needs of clients and users in their new projects (Petersen [2012]).

A number of clone detection techniques have been proposed in the literature. Among them, text-based techniques are lightweight and detect clones accurately with high recall, where recall refers to the percentage of the clones existing in the source code that are detected by the clone detector. However, they fail to detect suitable syntactic units (Bellon et al. [2007]). Token-based techniques are fast with high recall, but fall short in precision, where precision refers to the quality of the clones returned by the clone detector, i.e. the proportion of reported clones that are actual clones. Parser-based techniques are good at detecting syntactic clones. However, they give low recall values (Bellon et al. [2007]). Metric-based techniques are able to detect syntactic as well as semantic clones with high precision, and they are very fast. However, they fail to detect some of the actual clones (Bellon et al. [2007]). PDG (Program Dependency Graph) based techniques, where a PDG is a directed graph representing the dependencies among the elements of a program, are able to find more semantic clones. However, sub-graph comparisons are very costly (Koschke et al. [2006]). These limitations of existing methods motivate the investigation of hybrid or combined techniques to overcome them.
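The two quality measures used above can be made concrete with a small sketch. It assumes that clone pairs are represented as strings such as "m1-m2" and that a manually validated reference set is available; the class name, method names and sample data are hypothetical and are not taken from CloneManager.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: precision and recall of a clone detector against reference data. */
final class CloneDetectionScores {

    /** Fraction of the reported clone pairs that also appear in the reference set. */
    static double precision(Set<String> detected, Set<String> reference) {
        if (detected.isEmpty()) {
            return 0.0; // convention chosen for this sketch: nothing reported, nothing correct
        }
        return (double) truePositives(detected, reference) / detected.size();
    }

    /** Fraction of the reference clone pairs that the detector reported. */
    static double recall(Set<String> detected, Set<String> reference) {
        if (reference.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) truePositives(detected, reference) / reference.size();
    }

    /** Number of clone pairs present in both sets. */
    private static int truePositives(Set<String> detected, Set<String> reference) {
        Set<String> common = new HashSet<>(detected);
        common.retainAll(reference);
        return common.size();
    }

    public static void main(String[] args) {
        // Made-up sample data: four reference pairs, three reported pairs, two in common.
        Set<String> reference = Set.of("m1-m2", "m1-m3", "m4-m5", "m6-m7");
        Set<String> detected = Set.of("m1-m2", "m4-m5", "m8-m9");
        System.out.println("precision = " + precision(detected, reference));
        System.out.println("recall    = " + recall(detected, reference));
    }
}
```

In this made-up example two of the three reported pairs are correct, and two of the four reference pairs are found, which shows how a detector can trade one measure against the other.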

Although numerous techniques and tools have been proposed for code clone detection (Kamiya et al. [2002]), little is known about which detected code clones are appropriate for refactoring and how to extract code clones for refactoring. Refactoring is a technique for processing code clones; it is defined as “restructuring an existing body of code, altering its internal structure without changing its external behaviour” (Fowler [1999]). By refactoring the detected clones, one can potentially improve understandability, maintainability and extensibility, and reduce the complexity of the system (Fowler [1999]).

The granularity of clones can be free with no syntactic boundaries or fixed within predefined syntactic boundaries such as method or block (Roy and Cordy [2007]). Clone granularity is fixed at different levels, such as files, classes, functions/methods, begin-end blocks, statements or sequences of source lines.

Clone detection techniques have been proposed with free granularity, mostly for clones of more than six lines of code (Kamiya et al. [2002]; Koschke et al. [2006]). An analysis of different clone detection techniques shows that most of the matches tend to be methods/functions of 1-5 lines of code. Most of these methods are setter/getter functions, which form a valid set of clones. Only a limited number of detectors use function clones as the granularity. Function/method clones are simply clones that are restricted to an entire function or method. They appear to be the most promising points of refactoring for all clone types: they are larger and tend to have a significant amount of code in common.

Techniques that return only function/method-level clones are suitable for architectural refactoring, as such clones represent meaningful code segments. This is not the case when clones are detected as a fixed number of lines in a continuous, unsegmented file of code. Tools have been proposed in the literature that analyse such clones further to extract meaningful code for refactoring support (Kapser and Godfrey [2006]; Ueda et al. [2002]; Zibran and Roy [2013]). Function/method clones are meaningful clones that are also useful in the software maintenance and evolution phases. This motivates researchers to fix the granularity at the function/method level (Mayland et al. [1996]; Roy and Cordy [2008]).

In this paper, an LWH (Light Weight Hybrid) approach is proposed that combines textual comparison and metrics computation. As there is no need for external parsing, the approach is lightweight. Moreover, a model has been devised to detect syntactic and semantic clones, covering all four types of clones. For experimental validation, a tool named CloneManager has been developed in Java using the proposed LWH approach to detect method/function-level clones in both C and Java projects. Experimental results show that CloneManager is efficient and accurate in detecting all types of clones.

This paper is presented in five major sections. Section 2 discusses the literature review for clone detection. Section 3 introduces the basic definitions and background details of code clone detection. The detailed implementation of the proposed method as a tool is elaborated in Section 4. Section 5 summarizes the experimental results. Section 6 concludes the paper.

2 Literature review

Software clones have been researched for more than a decade. To understand the growth and trends in different dimensions of cloning research, we carried out a quantitative review of related publications. Clone detection research has shown that software systems contain 9%-17% duplicated code (Zibran et al. [2011]). Thummalapenta et al. ([2009]) indicated that in most cases clones are changed consistently, and that in the remaining, inconsistently changed cases clones undergo independent evolution. Effective code clone detection supports perfective maintenance. To date, several code clone detection methods have been proposed (Petersen [2012]; Al-Batran [2011]; Leitner et al. [2013]). Comparisons and evaluations of code clone detection techniques and tools have been carried out by Bellon and Koschke ([2014]), Bellon et al. ([2007]), Roy and Cordy ([2007]) and Roy et al. ([2009]).

Clone detection is usually done by converting the source code into another form that is then processed by an algorithm to detect the clones. A rough classification is then carried out depending on the level of matches found. Token-based techniques (Li et al. [2006]; Leitao [2004]; Basit et al. [2007]) use a similar sequence matching algorithm. However, their accuracy is not adequate, as the normalization and token conversion processes may introduce false-positive clones into the result set. Many clone detection approaches use Abstract Syntax Tree (AST) and suffix tree representations of a program to find clones (Evans et al. [2009]; Evans and Fraser [2005]; Greenan [2005]; Pate et al. [2011]; Koschke [2012]). Some clone detection techniques use an AST generated by a pre-existing parser. Baker ([1997]) describes one of the earliest applications of suffix trees to clone detection. An algorithm based on feature-vector computation over the AST was applied by Lee et al. ([2010]) to detect similar clones. However, all of them use parsing, which makes the approaches heavyweight.

Lighter-weight techniques that do not use parsing have been proposed in the literature, namely text-based techniques and metrics-based techniques. Text-based techniques (Wettel and Marinescu [2005]; Ducasse et al. [1999]) compare two code fragments with each other to find the longest common subsequences of identical text/strings. Though these techniques detect clones, their precision is low. Metric-based techniques identify a set of suitable metrics to detect a particular type of clone; clone detection is then done by a quantitative assessment of the metric values in the source code. Kapser and Godfrey ([2004]) chose cyclomatic complexity as the corroboration metric. However, they only showed, using a very small test set, that their technique works well for locating clone segments across several versions of a software system.
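The text-based idea described above can be sketched as a line-level longest common subsequence (LCS) comparison. This is only an illustration of the general principle, not the algorithm of any of the cited tools; the class name, the normalization (trimming lines and dropping blank ones) and the similarity ratio are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of a text-based comparison using a line-level LCS. */
final class LineLcsSimilarity {

    /** Trims each line and drops blank lines so that layout differences are ignored. */
    static List<String> normalize(String fragment) {
        return Arrays.stream(fragment.split("\n"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    /** Length of the longest common subsequence of lines (classic dynamic programming). */
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Similarity in [0, 1]: shared lines divided by the line count of the longer fragment. */
    static double similarity(String first, String second) {
        List<String> a = normalize(first);
        List<String> b = normalize(second);
        int longer = Math.max(a.size(), b.size());
        return longer == 0 ? 1.0 : (double) lcsLength(a, b) / longer;
    }
}
```

Because the comparison works on raw lines, a renamed variable makes a line count as different, which is one reason purely textual matching is usually combined with some form of normalization.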

Hybrid techniques have also been proposed in the literature. Marco Funaro et al. ([2010]) proposed a hybrid technique using an Abstract Syntax Tree to identify clone candidates and textual methods to discard false positives. Leitao ([2004]) also proposed a hybrid approach combining AST and PDG. Both approaches use parsing, which makes them heavyweight. As text-based techniques preserve higher recall, metrics-based techniques preserve higher precision, and both are lightweight, a hybrid technique combining textual analysis and metrics is investigated in this paper for the detection of all four types of clones.
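The general shape of such a metrics-plus-text combination can be sketched as follows: a cheap metric vector filters out method pairs that cannot match, and a textual comparison confirms the remaining candidates. The four metrics used here (statement terminators, loops, branches, returns), the regular expressions and the whitespace-only normalization are assumptions chosen for illustration; they are not the metric set or the template conversion used by CloneManager, and comments are not stripped in this sketch.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: metric filter followed by textual confirmation, without a parser. */
final class HybridCloneSketch {

    private static final Pattern SEMICOLONS = Pattern.compile(";");
    private static final Pattern LOOPS = Pattern.compile("\\b(for|while|do)\\b");
    private static final Pattern BRANCHES = Pattern.compile("\\b(if|switch|case)\\b");
    private static final Pattern RETURNS = Pattern.compile("\\breturn\\b");

    /** Illustrative metric vector: semicolons, loop keywords, branch keywords, returns. */
    static int[] metrics(String methodBody) {
        return new int[] {
            count(SEMICOLONS, methodBody),
            count(LOOPS, methodBody),
            count(BRANCHES, methodBody),
            count(RETURNS, methodBody)
        };
    }

    /** Counts the non-overlapping matches of a pattern in the text. */
    private static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    /** Removes all whitespace so that layout changes do not affect the comparison (crude). */
    static String normalizeText(String methodBody) {
        return methodBody.replaceAll("\\s+", "");
    }

    /** True when both the metric vectors and the normalized texts of two methods agree. */
    static boolean isExactCloneCandidate(String a, String b) {
        if (!Arrays.equals(metrics(a), metrics(b))) {
            return false; // cheap metric filter rejects the pair early
        }
        return normalizeText(a).equals(normalizeText(b)); // textual confirmation
    }
}
```

The design point is the ordering: comparing small integer vectors is much cheaper than comparing text, so the expensive step runs only on pairs that survive the filter.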

3 Background

Clones may be compared on the basis of the program text that has been copied. A related definition of cloning was given by Bellon et al. ([2007]), who defined the types of code clones based on the degree and type of similarity.

3.1 Textual similarity

Type-1 is an exact copy without modifications (except for whitespace and comments).

Type-2 is a syntactically identical copy, except for changes in variable names, data types, identifier names, etc.

Type-3 is a copied fragment with further modifications. Statements can be changed, added or removed in addition to variations in identifiers, literals, types, layout and comments.

3.2 Functional similarity

Type-4 clones are two or more code fragments that perform the same computation but are implemented through different syntactic variants.
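As a hypothetical illustration of these four definitions (the methods below are invented for this purpose and are not the contents of Table 1), consider an original method together with one clone of each type:

```java
/** Invented examples of the four clone types, relative to the method sum. */
final class CloneTypeExamples {

    // Original method.
    static int sum(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Type-1: identical code, only layout and comments differ.
    static int sumCopy(int[] values) {
        int total = 0; // running total
        for (int i = 0; i < values.length; i++)
        {
            total += values[i];
        }
        return total;
    }

    // Type-2: same structure, but identifiers and the data type are changed.
    static long addAll(long[] numbers) {
        long result = 0;
        for (int k = 0; k < numbers.length; k++) {
            result += numbers[k];
        }
        return result;
    }

    // Type-3: a statement is added, so the behaviour is no longer the same.
    static int sumNonNegative(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0) {
                continue; // added statement: negative values are skipped
            }
            total += values[i];
        }
        return total;
    }

    // Type-4: same computation as sum, written with recursion instead of a loop.
    static int sumRecursive(int[] values) {
        return sumFrom(values, 0);
    }

    private static int sumFrom(int[] values, int index) {
        return index == values.length ? 0 : values[index] + sumFrom(values, index + 1);
    }
}
```

The type-3 example returns a different result from the original whenever the input contains negative numbers, whereas the type-4 example always returns the same result despite sharing almost no text with it.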

Table 1 illustrates the four types of clones. The clone pair (a, b) is of type-1: the two fragments have exactly the same code except for alignment, spacing and comments. The clone pair (a, c) is of type-2, with minor differences in function names and parameters. The clone pair (a, d) is of type-3, with additional statements in the code; such fragments need not be functionally similar. The clone pair (a, e) is of type-4, with no similarity in the code, but the outputs of the functions are the same.

Table 1 Illustration of four types of clones
