

Correlating automatic static analysis and mutation testing: towards incremental strategies



Traditionally, mutation testing is used as a test set generation and/or test evaluation criterion, since it is considered a good fault model. This paper uses mutation testing to evaluate an automated static analyzer. Since static analyzers, in general, report a substantial number of false positive warnings, the intention of this study is to define an approach for prioritizing static warnings based on their correspondence with mutations. On the other hand, knowing that mutation testing has a high application cost, another possibility is to identify the mutations of specific mutation operators that an automatic static analyzer is not adequate to detect. This information can then be used to prioritize the order in which mutation operators are incrementally applied, considering first those with no correspondence with static warnings. Both cases contribute to the establishment of incremental strategies for using automatic static analysis, mutation testing, or a combination of them.


We used mutation operators as a fault model to evaluate the direct correspondence between mutations and static warnings. The main advantage of using mutation operators is that they generate a large number of programs containing faults of different types, which can be used to decide which faults are most likely to be detected by static analyzers.


We provide evidence of the correspondence between mutations and some types of static warnings. The results obtained for a set of 19 open-source programs indicate that: 1) static warnings may be prioritized based on their correspondence level with mutations; 2) specific sets of mutation operators and their mutations may be prioritized based on their correspondence level with warnings.


It is possible to provide an incremental testing strategy aiming at reducing the cost of both static analysis and mutation testing using the correspondence information between these activities/artifacts.


In software development environments, static analysis tools are used to support the detection of violations of coding standards. Examples of violations detected by these tools are accesses to invalid (uninitialized) objects, the use of deprecated methods, and code written in disagreement with an established standard, among others.

In these environments, maintenance and development activities aimed at finding (and correcting) software faults are also common. Software faults are introduced in the source code due to mistakes made by developers. A wrong command or instruction in the source code is an example of a software fault (IEEE 1990).

Mutation Testing is a very effective testing criterion, since the mutation operators (or a subset of them), which are responsible for performing the syntactic changes in the original program for mutant generation, represent a plausible fault model. In general, such a fault model is used as a test set generation and/or test evaluation criterion, which makes Mutation Testing a good tool for experimentation (Andrews et al. 2005).

Besides being effective, Mutation Testing has some drawbacks, mainly related to the high number of generated mutants. One way to reduce its cost is to decrease the number of mutation operators that need to be applied. This is called selective mutation, and there are several alternatives for identifying this subset of mutation operators (Acree et al. 1979; Mathur 1991; Mresa and Bottaci 1999; Offutt et al. 1993).

Unlike Mutation Testing, Automatic Static Analysis does not require software execution. Instead, it uses a set of well-defined bug patterns to statically analyze the source code and issue warnings about possible problems in specific source code lines. The main disadvantage of automatic static analyzers is the high number of warnings that do not correspond to a fault (false positives) but still demand time to be analyzed.

Despite the interest in the use of static analysis tools (Ayewah et al. 2007b; Hovemeyer and Pugh 2004; Louridas 2006), there is no agreement on their real benefits, since the relation between static warnings and faults is not clear (Ayewah et al. 2007a).

Inspired by the work of Araújo Filho et al. (2010) and Couto et al. (2013), and trying to overcome the problem they faced in evaluating the direct correspondence between warnings and faults due to the small number of real faults, we revisited their work by evaluating the correspondence between static warnings and mutations. We decided to use mutation testing for the following reasons: 1) it is considered a good fault model for experimentation and has been successfully used for test set evaluation (Andrews et al. 2005); 2) mutants are generated by mutation operators, which can be seen as fault categories, so we can try to correlate warnings with specific types of faults; 3) it allows us to increase the fault concentration per Kilo Lines of Code (KLOC) by summing up the number of mutants derived from each source code line; and 4) it eases experimentation with a large number of software products. These reasons help to overcome limitations of previous works on these subjects. Moreover, from this study, we intend to define a strategy for prioritizing static warnings based on their correspondence with mutations. The results can be used to evolve static analyzers, mutation testing, or both. For the former, it becomes possible to compare different static analyzers against the same fault model to decide which kinds of mutations they are really adequate to detect. This information can be used to prioritize the resolution of warnings, starting from the ones most correlated with some mutation operator, and also to guide the evolution of static analyzers by creating additional static verification rules to detect uncovered mutations. For the latter, it becomes possible to avoid generating mutants for those mutation operators whose faults static analyzers are adequate to detect statically.

In this paper, we report an approach that makes use of information obtained by applying FindBugs (Hovemeyer and Pugh 2004) to detect mutations generated by μJava (Ma et al. 2005). The objective is to identify the correspondence between bug kinds and mutation operators. Based on this information, we can establish strategies for using bug kinds and mutation operators in an incremental way, aiming at mitigating the cost problems of both automated static analysis and mutation testing.

Given this scenario, the research questions of interest are:

  • Question $Q_1$ (direct correspondence): is there any correspondence between the static location of warnings and program elements with a concentration of mutations?

  • Question $Q_2$ (direct correspondence at source code line level): when the analysis is performed at a lower level, considering each source code line individually, is there any correspondence between the static location of warnings and program elements with a concentration of mutations?

Observe that a positive answer to these questions suggests that FindBugs is adequate to detect specific types of mutations. By analyzing which static warnings are better at detecting specific types of mutations, one may prioritize warnings so that likely false positives are not analyzed first. Moreover, we can also learn which mutations are or are not detectable by static analysis tools.

We can summarize the contributions of this work as:

  • the identification of existence of direct correspondence between warnings and some mutation operators;

  • the identification of specific mutation operators which FindBugs is more prone to detect;

  • the identification of specific mutation operators which FindBugs is not adequate to detect;

  • the establishment of prioritization strategies for incremental use of bug kinds based on their correspondence level with mutations at the line level; and

  • the establishment of prioritization strategies for incremental use of mutation operators based on their inverse correspondence with static warnings at the line level.

The remainder of this paper is organized as follows: Section 2 presents basic information with respect to static analysis and mutation testing. Section 3 defines Direct Correspondence per Line (DCL) and describes the experimental study and the data collection process. Section 4.1 presents how DCL is used on the establishment of an incremental strategy for applying FindBugs bug kinds. Section 4.2 performs the same analysis but considering how DCL is used on the establishment of an incremental strategy for applying μJava mutation operators. Section 5 illustrates how the incremental testing strategies can be employed to reduce the cost of static analysis and mutation testing. Section 6 describes the lessons learned and threats to validity of this study. Section 7 describes related works and the main points and contributions of this work. In Section 8, we draw the conclusions and present future work. Finally, Appendix A provides complementary information about μJava mutation operators and FindBugs bug kinds.


Mutation Testing (MT) enables the generation of a high number of versions of a given system. This generation is performed through small syntactic changes made to the original system by mutation operators that simulate the mistakes most commonly made by developers (DeMillo et al. 1978). For each change made by a mutation operator, a new version of the system, called a mutant, is generated. From a theoretical perspective, each mutant represents a possible fault that could be present in the original system (Copeland 2004).

There are several mutation tools available for Java programs (Coles 2015; Ferrari et al. 2011; Just et al. 2011; Ma et al. 2005). We use the set of mutation operators implemented by the μJava system, which supports MT for Java programs (Ma et al. 2005). It creates object-oriented mutants for Java according to 47 mutation operators specialized to object-oriented faults: 19 Traditional Operators, responsible for modeling faults at the method level (Ma and Offutt 2005), and 28 Class Operators, responsible for modeling faults at the class level (Offutt et al. 2006). Besides providing a well-defined set of program faults, another advantage of mutation operators is that the faults they introduce are not detectable by the Eclipse IDE, unlike the injected faults used in the work of Daimi et al. (2013).

Each μJava mutation operator has an acronym for identification. The first letter of this acronym determines the language feature the operator is related to. For instance, the method-level operator AORB stands for “Arithmetic Operator Replacement (Binary)” and is one of the operators of the Arithmetic (A) group. The reduced number of method-level operators implemented by μJava is due to the selective approach adopted to create such a set of mutants (Offutt et al. 1996). Appendix A has additional information about the μJava mutation operator set.

An automated static analyzer is a tool that reads the source code of a given system and issues a set of warnings based on rules which look for deviations with respect to a given code standard. In general, warnings are issued with respect to a given source code line in which the tool detects any possible fault with respect to its supported rules.

Automated static analysis vocabulary includes the following terms: false positives, true positives and false negatives. A false positive occurs when a tool alerts to the presence of a non-existent fault. A false negative occurs when a fault exists, but it is not detected due to the fact that static analysis tools are not perfectly accurate and may not detect all faults. Finally, a true positive occurs when a tool produces a warning to indicate the presence of a real fault in the system under analysis.

There are several static analysis tools available for different programming languages (Burn 2014; Copeland 2005; Daimi et al. 2013; Evans and Larochelle 2002; Hovemeyer and Pugh 2004; Microsoft 2014; Pohl 2001). We use FindBugs (Hovemeyer and Pugh 2004) in this study for two reasons. First, the same tool was used by Araújo Filho et al. (2010) and Couto et al. (2013), which inspired this work. They used it to evaluate the correspondence between warnings and real faults. In our work, the actual faults were replaced by mutants. To make future comparisons feasible, we use the same static analysis tool. Second, several experiments involving static analysis use FindBugs, and Tomas et al. (2013) pointed out that its false positive rate is less than 50 %.

FindBugs is one of the most popular static analysis tools and is widely used in the Java community. It implements a set of bug detectors for a variety of common bug patterns. According to the structure of FindBugs, each bug category includes many bug kinds and each bug kind consists of several bug patterns. Figure 1 shows a sample structure of FindBugs categories, subcategories, patterns and bug patterns. Bug patterns in FindBugs are divided into categories: Bad Practice, Correctness, Malicious code vulnerability, Multithreaded correctness, Internationalization, Performance, Security, and Dodgy.

Fig. 1

Partial structure of FindBugs: correctness category (Shen et al. 2011)

In Fig. 1, bug kinds such as BC, NP, DLS, and DMI belong to the correctness category. The bug kind BC (the abbreviation for bad casts of object references) contains four bug patterns: Impossible cast, Impossible downcast, Impossible downcast of toArray() result, and instanceof will always return false.

Experimental study


Let $S=\{s_1,s_2,\ldots,s_t\}$ be a set of $t$ systems and $s_x \in S$ a system under analysis. $s_x$ is composed of a set of source code files $F=\{f_1,f_2,\ldots,f_n\}$, where $n$ is the number of source files of $s_x$. Consider $f_j \in F$ and $M_{f_j}$ a set of mutants generated from $f_j$ by applying a set of mutation operators $MO$. Therefore, $M_{s_x}=\{M_{f_1} \cup M_{f_2} \cup \ldots \cup M_{f_n}\}$ is the set of all mutants generated from $s_x$.

Consider $m_{i,o_k}^{f_j} \in M_{f_j}$ the $i$-th mutant of operator $o_k \in MO$ on the file $f_j$. $f_j$ and $m_{i,o_k}^{f_j}$ differ from each other on at least one line of source code. Let $W_{f_j}$ be the set of warnings resulting from applying FindBugs to $f_j$, and let $W_{m_{i,o_k}^{f_j}}$ be the set of warnings resulting from applying FindBugs to $m_{i,o_k}^{f_j}$.

Figures 2, 3 and 4 illustrate the situations that occur when a warning $w_i$ is reported by FindBugs on the original file $f_j$ ($W_{f_j}$) and/or on the mutant $m_{i,o_k}^{f_j}$ ($W_{m_{i,o_k}^{f_j}}$). Consider $w_1$, $w_2$, and $w_3$ as warnings, $W_{m_{i,o_k}^{f_j}}$ the set of warnings generated in a specific mutant of file $f_j$, and $W_{f_j}$ the set of warnings generated in the original file $f_j$:

  1. Case 1: $w_1 \in \{W_{m_{i,o_k}^{f_j}} \setminus W_{f_j}\}$ is a warning that was reported in the mutant $m_{i,o_k}^{f_j}$ and that was not reported in the original file $f_j$, considering the same line where the mutation occurs. In other words, $w_1$ represents a true positive, since it indicates that FindBugs only generates the warning due to the mutation;

  2. Case 2: $w_2 \in \{W_{m_{i,o_k}^{f_j}} \cap W_{f_j}\}$ is a warning that was reported in the mutant $m_{i,o_k}^{f_j}$, in the same line in which the mutation happened, but $w_2$ was also reported in the original file $f_j$. In this way, $w_2$ does not depend on the mutation;

  3. Case 3: $w_3 \in \{W_{f_j} \setminus W_{m_{i,o_k}^{f_j}}\}$ is a warning which was reported in the original file $f_j$, and that was not reported in the mutant $m_{i,o_k}^{f_j}$, considering the line in which the mutation occurs. In other words, the mutation actually corrects the warning reported in $f_j$.

Fig. 2

Warning $w_1$ reported in the mutant $m_{i,o_k}^{f_j}$ at the mutation point

Fig. 3

Warning $w_2$ reported in both the mutant and the original file, in the same line as the mutation

Fig. 4

Warning $w_3$ reported in the original file but not in the mutant

Figures 2, 3 and 4 illustrate Cases 1, 2 and 3, respectively. As an example of Case 1, in the mutant shown in Code 2 (Additional file 1: Figure S2), $w_1$ was reported in the mutated line 4, and $w_1$ was not reported in the corresponding line 4 of the original file (Code 1 - Additional file 1: Figure S1). As an example of Case 2, in the mutant shown in Code 4 (Additional file 1: Figure S4), $w_2$ is associated with both the original (Code 3 - Additional file 1: Figure S3) and the mutated files. Finally, as an example of Case 3, $w_3$ is reported in the original file (Code 5 - Additional file 1: Figure S5) but not in the mutant (Code 6 - Additional file 1: Figure S6).

In this study, we are especially interested in the situation represented by Case 1, where $w_1 \in W_{m_{i,o_k}^{f_j}}$ and $w_1 \notin W_{f_j}$. Thus, if we can find a set of warnings adequate to detect specific kinds of mutations, we may prioritize the warning analysis starting from them since, in general, they are sensitive to specific types of mutations, i.e., they are probably not false positives. Cases 2 and 3 are also interesting to investigate but are out of the scope of this work.
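The three cases can be sketched with plain set operations. The snippet below is a minimal illustration and not the tooling used in the study: it assumes each warning is represented as a `(line_number, bug_kind)` pair, that line numbers of the original and the mutant are aligned, and the names `Case` and `classify_at_mutation_line` are hypothetical.

```python
from enum import Enum


class Case(Enum):
    INTRODUCED = 1  # Case 1: reported only in the mutant
    UNCHANGED = 2   # Case 2: reported in both mutant and original
    REMOVED = 3     # Case 3: reported only in the original


def classify_at_mutation_line(original_warnings, mutant_warnings, line):
    """Map each bug kind reported on `line` to Case 1, 2 or 3.

    Both warning arguments are sets of (line_number, bug_kind) pairs;
    this representation is an assumption made for the example.
    """
    in_original = {kind for (ln, kind) in original_warnings if ln == line}
    in_mutant = {kind for (ln, kind) in mutant_warnings if ln == line}
    result = {}
    for kind in in_mutant - in_original:   # set difference: Case 1
        result[kind] = Case.INTRODUCED
    for kind in in_mutant & in_original:   # intersection: Case 2
        result[kind] = Case.UNCHANGED
    for kind in in_original - in_mutant:   # reverse difference: Case 3
        result[kind] = Case.REMOVED
    return result


# Illustrative data only: the mutation point is line 4.
original = {(4, "DLS"), (7, "NP")}
mutant = {(4, "DLS"), (4, "UR"), (9, "BC")}
cases = classify_at_mutation_line(original, mutant, line=4)
```

Only the entries classified as `Case.INTRODUCED` would count toward the direct correspondence studied here; the other two cases are kept in the sketch just to make the partition explicit.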

All the analysis is based on the concept of direct correspondence. As defined by Couto et al. (2013), direct correspondence occurs when the warning and the fault are relatively close. In their work, relatively close means at the method level, i.e., if a warning and a fault exist in the same method, they considered that there is a correspondence between the warning and the fault.

In our work, we evaluated the direct correspondence at the source code line level. In this sense, relatively close means at the same source code line instead of at the same method. Therefore, we adopt a finer granularity to identify the correspondence between mutations and warnings more precisely, considering the so-called Direct Correspondence per Line (DCL) defined below.

Since each mutant is generated by a specific mutation operator, we can identify classes of mutations which static analyzers are or are not adequate to detect.

Direct Correspondence per Line (DCL)

To calculate the DCL of each mutation operator and of each warning category, the following functions are defined:

  • $TW(w)$ = total number of warnings of the type $w$ reported in all mutants.

  • $DCL_A(w)$ = absolute number of warnings of the type $w$ which are reported exactly at a mutation point, while the warning $w$ does not exist on the same line in the original file.

$${DCL}_{R}(w) = \left\{ \begin{array}{ll} {DCL}_{A}(w) / TW(w) & \text{if}\ TW(w) > 0\\ 0 & \text{if}\ TW(w) = 0 \end{array}\right. $$

The aim of the functions $DCL_A(w)$ and $DCL_R(w)$ is to classify each bug kind according to the capability of its warnings $w$ to detect faults represented by mutants. The bug kinds with higher $DCL_R(w)$ rates should be prioritized over the bug kinds with lower $DCL_R(w)$ rates, since warnings of bug kinds with higher $DCL_R(w)$ are more likely to be true positives and should be analyzed first.
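As a rough sketch of how $TW(w)$, $DCL_A(w)$ and $DCL_R(w)$ could be computed, the snippet below assumes each warning reported in a mutant has already been reduced to a record `(bug_kind, at_mutation_point, in_original_same_line)`. That record layout and the function name `dcl_by_warning` are assumptions for illustration, not the database schema used in the study.

```python
from collections import Counter


def dcl_by_warning(records):
    """Return {bug_kind: (TW, DCL_A, DCL_R)}.

    Each record is (bug_kind, at_mutation_point, in_original_same_line),
    one per warning reported in some mutant (hypothetical layout).
    """
    total = Counter()   # TW(w)
    direct = Counter()  # DCL_A(w)
    for kind, at_mutation_point, in_original in records:
        total[kind] += 1
        if at_mutation_point and not in_original:
            direct[kind] += 1
    # DCL_R(w) follows the piecewise definition: 0 when TW(w) = 0.
    return {
        kind: (tw, direct[kind], direct[kind] / tw if tw > 0 else 0.0)
        for kind, tw in total.items()
    }


# Illustrative records only.
records = [
    ("UR", True, False),   # new warning at the mutation point
    ("UR", True, False),
    ("UR", False, False),  # reported elsewhere in the mutant
    ("UR", True, True),    # also present in the original on that line
    ("Bx", False, False),
]
table = dcl_by_warning(records)
```

In this toy input, two of the four UR warnings satisfy the $DCL_A$ condition, so `table["UR"]` holds a relative rate of 0.5, while Bx has no warning at a mutation point and gets 0.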

  • $TM(o_k)$ = total number of mutants generated by operator $o_k$.

  • $DCL_A(o_k)$ = absolute number of mutants of operator $o_k$ with at least one warning of any type reported at the mutation point, while the same warning does not exist on the same line in the original file.

$${DCL}_{R}(o_{k}) = \left\{ \begin{array}{ll} {DCL}_{A}(o_{k}) / TM(o_{k}) & \text{if}\ TM(o_{k}) > 0\\ 0 & \text{if}\ TM(o_{k}) = 0 \end{array}\right. $$

The aim of the functions $DCL_A(o_k)$ and $DCL_R(o_k)$ is to classify the types of faults represented by mutants of mutation operators according to how well these faults are detected by FindBugs. Operators with higher $DCL_R(o_k)$ correspondence rates represent types of faults that are more easily detected by FindBugs. On the other hand, operators with lower $DCL_R(o_k)$ represent types of faults rarely detected by FindBugs, and may suggest new bug patterns that can be added to the static analyzer to improve its capability.
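The operator-side functions count mutants rather than warnings: a mutant contributes to $DCL_A(o_k)$ once, no matter how many new warnings appear at its mutation point. A minimal sketch, assuming each mutant is summarized as `(operator, new_kinds_at_mutation_point)` (a hypothetical layout, as is the name `dcl_by_operator`):

```python
def dcl_by_operator(mutants):
    """Return {operator: (TM, DCL_A, DCL_R)}.

    `mutants` is an iterable of (operator, new_kinds) pairs, where
    new_kinds is the set of bug kinds reported at the mutation point of
    the mutant but not on the same line of the original (assumed input).
    """
    generated = {}  # TM(o_k)
    detected = {}   # DCL_A(o_k)
    for operator, new_kinds in mutants:
        generated[operator] = generated.get(operator, 0) + 1
        if new_kinds:  # at least one new warning at the mutation point
            detected[operator] = detected.get(operator, 0) + 1
    return {
        op: (tm, detected.get(op, 0),
             detected.get(op, 0) / tm if tm > 0 else 0.0)
        for op, tm in generated.items()
    }


# Illustrative mutants only.
mutants = [
    ("JTD", {"UR"}),
    ("JTD", {"UR", "NP"}),  # two new warnings, still one detected mutant
    ("JTD", set()),
    ("AORB", set()),
]
by_operator = dcl_by_operator(mutants)
```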

Experimental process

The process used to collect the data in the study is described in the steps below:

  1. Let $S$ be a set of systems, i.e., $S=\{s_1,s_2,\ldots,s_t\}$

  2. For each system $s_x \in S$, $1 \le x \le t$

  3. For each source file $f_j$ in $s_x$, $1 \le j \le n$, where $n$ is the number of source files of $s_x$

    3.1 Execution of FindBugs on $f_j$ and generation of an XML file with the set of warnings $W_{f_j}$

    3.2 Execution of ParserXMLFindBugs to read the XML file and to include $W_{f_j}$ in the database (DB)

    3.3 Execution of the μJava tool on $f_j$ and generation of a set of mutants $M_{f_j}$

    3.4 For each mutant $m_{i,o_k}^{f_j} \in M_{f_j}$

      (a) Execution of FindBugs on $m_{i,o_k}^{f_j}$ and generation of an XML file with the set of warnings $W_{m_{i,o_k}^{f_j}}$

      (b) Execution of ParserXMLFindBugs to read the XML file and to include $W_{m_{i,o_k}^{f_j}}$ in the DB

      (c) Execution of diff between $f_j$ and $m_{i,o_k}^{f_j}$ to include the line numbers and the textual difference in the DB

  4. Execution of SQL scripts to extract the DCL data of $S$ from the DB
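The control flow of these steps can be sketched as a nested loop. The snippet below is only a skeleton under stated assumptions: it does not invoke FindBugs or μJava, and instead takes three caller-supplied functions (`run_analyzer`, `generate_mutants`, `changed_lines`) whose names and signatures are hypothetical. An in-memory list stands in for the database, and toy stand-ins are used to exercise the loop.

```python
def collect(systems, run_analyzer, generate_mutants, changed_lines):
    """Walk steps 1 to 3.4 and return rows ready for step 4.

    systems: {system_name: {file_name: source_text}}
    run_analyzer(source) -> set of (line, bug_kind)          (hypothetical)
    generate_mutants(source) -> list of (operator, mutant)   (hypothetical)
    changed_lines(original, mutant) -> set of line numbers   (hypothetical)
    """
    rows = []
    for system, files in systems.items():                       # step 2
        for file_name, source in files.items():                 # step 3
            original_warnings = run_analyzer(source)            # 3.1, 3.2
            for operator, mutant in generate_mutants(source):   # 3.3, 3.4
                mutant_warnings = run_analyzer(mutant)          # (a), (b)
                lines = changed_lines(source, mutant)           # (c)
                # Keep warnings that are new in the mutant and sit
                # exactly on a mutated line.
                new_here = {
                    (ln, kind)
                    for (ln, kind) in mutant_warnings - original_warnings
                    if ln in lines
                }
                rows.append((system, file_name, operator, sorted(new_here)))
    return rows


# Toy stand-ins so the skeleton can run; they model no real tool.
def toy_analyzer(source):
    return {(i, "XX") for i, text in enumerate(source.splitlines(), 1)
            if "BAD" in text}


def toy_mutants(source):
    return [("OP1", source.replace("ok", "BAD", 1))]


def toy_changed(a, b):
    pairs = zip(a.splitlines(), b.splitlines())
    return {i for i, (x, y) in enumerate(pairs, 1) if x != y}


rows = collect({"s1": {"A.java": "ok\nfine\n"}},
               toy_analyzer, toy_mutants, toy_changed)
```

Passing the tools in as functions keeps the loop independent of how the analyzer and the mutation tool are actually launched, which is left unspecified here on purpose.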

Figure 5 depicts the way data is collected and processed to support the experimental process. In the case of the original system, observe that both FindBugs and μJava are executed considering the entire system. On the other hand, in the case of mutants, since μJava generates mutants per file, we executed FindBugs only on each mutated file. In the latter case, it is not necessary to run FindBugs on the entire system, since the mutant differs from the original only in the mutated file. This point is taken into account in our cost analysis.

Fig. 5

Data collection and processing for supporting experimental process

FindBugs is executed on each file $f_j$ of a system $s_x$. The execution of FindBugs on $f_j$ produces the set of warnings $W_{f_j}$, which is stored in an XML file. ParserXMLFindBugs is an application we developed to read the XML file and store the collected information in a database. In the next step, the mutants of file $f_j$ are generated through the application of all μJava mutation operators. The mutation operator $o_k$, applied to each file $f_j$, generates a set of mutants $M_{o_k}^{f_j}$, each of which can be identified as $m_{i,o_k}^{f_j}$ (the $i$-th mutant of operator $o_k$ from file $f_j$). FindBugs is then executed on each mutant $m_{i,o_k}^{f_j}$, generating a set of warnings $W_{m_{i,o_k}^{f_j}}$ for that mutant in an XML file. ParserXMLFindBugs is used again to read the XML file and to store the warning information with respect to mutant $m_{i,o_k}^{f_j}$ in the database. ParserDiff is another application we developed; it calculates the syntactic difference between the original file $f_j$ and the mutant $m_{i,o_k}^{f_j}$, identifies the source code line on which the mutation occurred, and stores this information in the database to identify the mutation produced. At the end of the process, SQL scripts are executed to extract matching information between mutations and static warnings.
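The line-level comparison between an original file and a mutant can be approximated with Python's standard `difflib` module. This is a stand-in for illustration, not the authors' ParserDiff; the function name `mutated_lines` and the sample method are made up for the example.

```python
import difflib


def mutated_lines(original_source, mutant_source):
    """Return 1-based line numbers of the original changed by the mutant."""
    original = original_source.splitlines()
    mutant = mutant_source.splitlines()
    matcher = difflib.SequenceMatcher(a=original, b=mutant, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.extend(range(i1 + 1, i2 + 1))
        elif tag == "insert":
            # Pure insertion: attribute it to the original line before
            # which the new text lands (len + 1 if appended at the end).
            changed.append(i1 + 1)
    return changed


# An AORB-style change: '+' replaced by '-' on the second line.
original_code = (
    "int f(int a, int b) {\n"
    "  int c = a + b;\n"
    "  return c;\n"
    "}"
)
mutant_code = original_code.replace("a + b", "a - b")
lines = mutated_lines(original_code, mutant_code)
```

The resulting line numbers are what a warning's location would be compared against when deciding whether it sits at the mutation point.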

In this study, 19 systems were used ($t=19$), as shown in Table 1. These systems, which are available from the Apache Foundation, are a subset of the systems used by Couto et al. (2013). For instance, the entire system $s_{11}$, Open JPA, version 1.0.0, has 102,682 LOC, for which FindBugs reported 359 warnings and μJava generated 65,529 mutants, which gives an average of 3.5 warnings/KLOC and an average of 638.2 mutants/KLOC.
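The two densities quoted for Open JPA follow directly from the reported counts. A short check, where the helper name `per_kloc` is introduced only for this example:

```python
def per_kloc(count, loc):
    """Density of `count` items per thousand lines of code."""
    return count / (loc / 1000)


# Figures reported in the text for Open JPA 1.0.0.
loc, warnings, mutants = 102_682, 359, 65_529
warnings_per_kloc = round(per_kloc(warnings, loc), 1)
mutants_per_kloc = round(per_kloc(mutants, loc), 1)
```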

Table 1 Set of systems used in the experimental study

Together, the 19 systems sum up to 749,966 LOC, with 3709 static warnings and 493,522 mutants. The per-system averages are 6.4 warnings/KLOC and 702.2 mutants/KLOC, and the medians are 5.6 and 638.2, respectively.

For each one of the selected systems, mutants were generated using all the 47 mutation operators supported by μJava tool (Ma et al. 2005). To collect the amount of warnings reported by FindBugs, it was executed in each one of the selected systems using its default configuration, i.e., employing all the warning categories available. To collect the warnings present in each one of the mutants, FindBugs was executed in each one of them with the same configuration. All data reported by FindBugs and the data from mutants were stored in a database to verify the direct correspondence by line.

Data of FindBugs execution on each mutant are presented in Table 2. Information related to the warnings (grouped by bug kinds) reported by FindBugs and the mutation operator used to generate each mutant is registered.

Table 2 Bug kinds versus mutation operator

In Table 2, the following information is presented: the first column is a line identifier; column W (Warning) represents the types of warnings reported by FindBugs in the mutants; columns ISD, JTI, JTD, JSD, …, represent each one of the 47 μJava operators; column TW(w) is the result of function TW(w) defined in Section 3.1.1.

Warnings were reported for 86 different bug kinds; however, Table 2 presents only a subset of them. As an example of a warning reported in mutants, the Bx bug kind (line 5) was reported twice in mutants of the ISD operator, 392 times in mutants of the JTI operator, 125 times in mutants of the JTD operator, 368 times in mutants of the JSD operator, and so on. Considering the mutants of all 47 mutation operators, the total number of warnings of the Bx bug kind is 107,060 (column $TW(w)$ of line 5). Column ISD shows that 88 warnings of different types were reported in mutants of the ISD operator, and column JTI shows that 14,303 warnings were reported in mutants of the JTI operator. The total number of warnings obtained in all the mutants is 980,533 (last column of last line).

To the data collected and presented in Table 2, we applied the functions $DCL_A(w)$, $DCL_R(w)$, $DCL_A(o_k)$, and $DCL_R(o_k)$ defined in Section 3.1.1, which produced Tables 3 and 5. These tables are employed in the definition of the two incremental strategies presented in Sections 4.1 and 4.2, respectively.

Table 3 DCL by Warning: DCL R (w)

Results and Discussion

Based on DCL we defined two incremental strategies. One, described in Section 4.1, intends to prioritize bug kinds. The other, described in Section 4.2, prioritizes mutation operators.

Using DCL for bug kinds prioritization

In this strategy, we used the correspondence between warnings and mutants to identify the bug kinds most likely to produce true positive warnings. Table 3 presents the direct correspondence by line between bug kinds and mutations. It contains information about $DCL_A(w)$ and $DCL_R(w)$ for all warnings (grouped by bug kind) reported in mutants.

Table 3 presents the following information: the first column is a line identifier; column W (warnings) represents all bug kinds that have at least one warning related to some mutation point (at line level); columns ISD, JTI, JTD, JSD, …, represent each one of the 47 operators of μJava; columns DCL A (w) and DCL R (w) contain the values of each one of these functions defined in Section 3.1.1. Column TW(w) is the same information presented in Table 2, replicated here to ease the analysis. In the first lines of Table 3 are the bug kinds that were more adequate to detect faults modeled by some μJava mutation operators, and, in the last lines, are the bug kinds with lower correspondence with mutations.

Note that Table 3 presents only the 43 bug kinds which reported at least one warning in the same line as a mutation (column $DCL_A(w) \geq 1$), whereas Table 2 presents all 86 bug kinds reported in the experiment. Therefore, there are 43 bug kinds that had $DCL_A(w)=0$. For these bug kinds, no warning $w$ is reported in any mutant or, when reported, $w$ is also reported in the original file in the same line where the mutation occurs. The bug kinds with no warning at the mutation point are identified with symbol ’ ’ in Table 2.

Overall, 50,768 warnings were reported exclusively on the mutated lines, which gives a DCL of 5.18 % (50,768/980,533). As can be seen in Table 3, there are 4 bug kinds (lines 1 to 4) with $DCL_R(w)=100.0$ % and 5 bug kinds (last lines) with $DCL_R(w) \leq 0.01$ %.

As shown in Table 3, $DCL_R(w)$ varies considerably. Bug kinds with $DCL_R(w)>10.0$ % are presented in the first 15 rows of Table 3, of which 12 have $DCL_R(w)>33$ %.

Bug kinds with $DCL_R(w)>88$ % are: QF, RE, UM and QBA with 100.0 %, INT with 96.64 %, BIT with 92.12 %, and UR with 88.39 %. This provides evidence that there is a direct correspondence between certain types of warnings and certain types of faults (mutations), motivating further work in this direction, for instance, analyzing the subcategories and priorities of the warnings.

Each bug kind of FindBugs has one or more bug patterns, and each bug pattern has a priority (the greater the priority, the greater the criticality of such a warning for FindBugs) and belongs to a specific category indicating which kind of fault such a warning subtype is related to (Hovemeyer and Pugh 2004). More information about FindBugs warnings is available in Appendix A. Table 4 presents additional information related to bug kinds with $DCL_R(w)>88$ % (QF, RE, UM, QBA, INT, BIT, UR).

Table 4 Details of the bug kinds

The top 6 bug kinds have 14 bug patterns classified in the following categories (see Table 4): CORRECTNESS(8), STYLE(4), PERFORMANCE(1) and BAD_PRACTICE(1). Moreover, we can analyze the priority of these 14 bug patterns according to the FindBugs classification. In this case, we observe that 50 % are of priority 1 (most important for FindBugs), 28.6 % are of priority 2, and 21.4 % are of priority 3. This suggests that even bug patterns which FindBugs classifies as priority 2 or 3 may have a good detection capability for specific types of faults.

On the other hand, several other bug kinds presented $DCL_R(w) \approx 0.00$ %. These bug kinds are not capable of detecting the difference between the original and the mutated source code. Examples of these bug kinds are: Dm, ES and Bx with $DCL_R(w)=0.01$ %, and EI2 and IIO with $DCL_R(w)=0.00$ % (lines 39 to 43 in Table 3).

By analyzing the 5 bug kinds with $DCL_R(w) \approx 0.00$ %, we observe that they group 23 bug patterns belonging to: PERFORMANCE(17), BAD_PRACTICE(3), I18N(2) and MALICIOUS_CODE(1). In terms of priority, the distribution of these 23 bug patterns is: 26.1 % of priority 1, 47.8 % of priority 2, and 26.1 % of priority 3. Observe that, although bug patterns with medium/low priority correspond to more than 73 % in this case, there are bug patterns with priority 1 which have no correspondence with faults modeled by the μJava mutation operators.

Observe that this does not mean these bug kinds are good/bad predictors of faults in general or generate only true/false positive warnings. It only indicates that they are good/bad predictors for detecting faults modeled by this set of mutation operators. Nevertheless, since mutation testing has been confirmed as an effective criterion for test set evaluation (Andrews et al. 2005), we consider the set of faults modeled by its mutation operators a good starting point for FindBugs bug kind prioritization.

This variation in the correspondence between bug kinds and mutation operators illustrates that warnings of some bug kinds are more sensitive in identifying certain types of faults, represented by specific mutation operators.
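The prioritization of bug kinds amounts to sorting them by decreasing $DCL_R(w)$ and, optionally, cutting the list at a threshold. The sketch below uses a few of the rates quoted in this section; the function name `prioritize_bug_kinds` and the `minimum` parameter are assumptions for illustration, and the 88 % cut simply mirrors the threshold mentioned in the text.

```python
def prioritize_bug_kinds(dcl_r_by_kind, minimum=0.0):
    """Order bug kinds by decreasing DCL_R, ties broken by name."""
    kept = [k for k, rate in dcl_r_by_kind.items() if rate >= minimum]
    return sorted(kept, key=lambda k: (-dcl_r_by_kind[k], k))


# DCL_R(w) in percent, as quoted in the text for a few bug kinds.
rates = {
    "Bx": 0.01, "QF": 100.0, "UR": 88.39, "INT": 96.64,
    "IIO": 0.00, "BIT": 92.12, "RE": 100.0,
}
full_order = prioritize_bug_kinds(rates)
first_increment = prioritize_bug_kinds(rates, minimum=88.0)
```

A team with limited time could review the warnings in `first_increment` first and move down `full_order` only as resources allow.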

Using DCL for mutation operators prioritization

In the same way, correspondence information can be used to prioritize mutation operators. In this case, the idea is to apply first the mutation operators that have a lower correspondence with warnings, i.e., mutation operators which represent faults that are difficult for FindBugs to detect.

Table 5 presents information about $DCL_A(o_k)$ and $DCL_R(o_k)$ for each mutation operator of μJava. Operators are sorted in decreasing order of $DCL_R(o_k)$.

Table 5 DCL by Mutation Operator: DCL R (o k )

In Table 5, the first column is a line identifier; the second and third columns present the operator name; the column Type is the category of each operator ((C) class and (T) traditional); column TM(o k ) is the result of the function of the same name defined in Section 3.1.1; column “ TM(o k ) with Warning” contains, for each operator, the number of mutants that had warnings (column total1) and the share of these mutants in the total (column total 1/TM(o k )); column “ TM(o k ) without Warning” contains, for each operator, the number of mutants that did not have warnings (column total2) and the share of these mutants in the total (column total 2/TM(o k )); finally, columns DCL A (o k ) and DCL R (o k ) are the absolute and relative correspondence of warning and mutant, as defined in Section 3.1.1.

The Total line of Table 5 shows that 493,522 mutants were generated (considering all μJava mutation operators). No warning was reported on 187,892 of these mutants (38.07 %). On the remaining 305,630 mutants (61.93 %), FindBugs reported at least one warning different from the ones reported in the original file.

DCL R (o k ) of each operator o k is presented in the last column of Table 5. The first lines of this table contain the mutation operators that generate faults easier for FindBugs to detect. The last lines contain the mutation operators that represent fault categories FindBugs has difficulty identifying. Overall, considering all the mutation operators, DCL R (o k ) was 8.54 % (42,138/493,522) (last column of the Total line).
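The percentages quoted from the Total line of Table 5 can be re-derived with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable and function names are ours, and reading the overall DCL R as DCL A divided by the total number of mutants is an assumption based on the ratio 42,138/493,522 given in the text.

```python
# Counts quoted in the text for the Total line of Table 5.
total_mutants = 493_522
mutants_without_warning = 187_892
mutants_with_new_warning = 305_630
dcl_absolute = 42_138  # overall DCL_A as quoted in the text


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# The two groups should partition the full set of mutants.
assert mutants_without_warning + mutants_with_new_warning == total_mutants

print(percent(mutants_without_warning, total_mutants))   # share without warnings
print(percent(mutants_with_new_warning, total_mutants))  # share with a new warning
print(percent(dcl_absolute, total_mutants))              # overall DCL_R
```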

DCL R (o k ) varies among the several kinds of mutation operators. There are 4 class mutation operators (JTD, JTI, JSD and ISD) that have DCL R (o k ) above 49 %, with JTD reaching 87.88 %. One may therefore prioritize the analysis of warnings associated with these types of faults since, according to these data, they have a higher chance of being true positive warnings. In 49 to 87.88 % of the cases they are able to point to the specific source code lines containing mutants of these mutation operators.

The operator JTD, from the Java Specific feature category and responsible for simulating faults by removing the this keyword, has a direct correspondence by line of 87.88 %. Operators JTI and JSD, also from the Java Specific feature category, model faults related to this keyword insertion and static modifier deletion, and have correspondence rates of 72.55 and 65.59 %, respectively. Finally, the fourth mutation operator is ISD, from the Inheritance category. It has a correspondence rate of 49.33 % and models faults related to super keyword deletion.

This does not mean that by correcting warnings of a warning category with higher correspondence to mutations, faults will be removed. But, if we do not have enough resources to deal with all the warning categories, this strategy, at least, provides information about which bug kinds generate warnings with some correspondence to specific fault categories, represented by the mutants.

On the other hand, there are mutation operators whose faults are not detected by FindBugs at all. Among the method mutation operators, only one generates such warning-insensitive mutations: LOD, responsible for removing logical operators (see the second line of Table 6).

Table 6 Warning insensitive mutation operator

There are also 15 class mutation operators that generate only warning-insensitive mutants. They model different fault categories: 1 related to a common mistake (EOA), using reference assignment instead of cloning the object content; 7 related to the Inheritance feature, namely deleting or inserting an attribute in a subclass with the same name as an attribute in the parent class (IHD and IHI), deleting or renaming methods in a subclass that have the same name as methods in the parent class (IOD and IOR), removing the call to super() in the subclass constructor (IPC), inserting the super keyword (ISI), and moving the calling position of overriding methods (IOP); 2 Java specific features, related to removing the default constructor if it exists (JDC) and removing the initialization of instance variables in their declaration (JID); 1 related to the overloading feature, removing the overloading method of a subclass (OMD); and 4 related to the polymorphism feature, namely changing the type of a variable declaration to a parent type (PMD), changing the cast type of a parent class to one of its subclasses (PCC), changing the type of a parameter variable to a child class type (PPD), and inserting a type cast before reference variables (PCI). In static analyzer terminology these faults are called false negatives, i.e., faults that exist in the source code but that the static analyzer is not adequate to detect.

As stated previously, such information is useful to improve FindBugs so that it detects additional kinds of faults, for instance by writing a new rule to check the existence of the default constructor (JDC) or to check the need to overload parent class methods (OMD). Moreover, assuming an incremental testing strategy that combines static and dynamic analysis, this set of mutation operators, which generate warning-insensitive mutants, should be used during testing, since the chance that the faults they model are detected during automatic static analysis is low.

We consider the results obtained so far very promising. We expect that, as the number of evaluated systems increases, it will be possible to identify other fault categories, represented by mutation operators, that can help optimize incremental strategies combining static and dynamic analysis in a more efficient and effective way.

Considering these results, we may suggest from Tables 3 and 5 incremental strategies for applying warning categories of FindBugs and mutation operators of μJava.

In the case of warning categories, the order goes from the ones with higher direct correspondence rates to the ones with lower rates. This strategy is illustrated in Section 5.1. In the case of mutation operators, we use the inverse order of correspondence: the lower the correspondence, the more difficult it is for an automatic static analyzer to detect the specific fault types, and the more the mutation operator should be considered during testing. This strategy is illustrated in Section 5.2.

Observe that this knowledge database can always be improved. As more information about warnings and mutations is collected, the incremental strategy can be updated to improve its capability, contributing to reduce the cost of static analysis and mutation testing, and guiding the reviewer to first analyze the warnings more likely to lead to fault detection and quality improvement before conducting mutation testing.

Incremental strategies: example of application

In this section we illustrate how the incremental strategies defined in Sections 4.1 and 4.2 can be used to reduce the cost of applying either static analysis or mutation testing.

Prioritization of bug kinds based on DCL R (w)

To illustrate how the direct correspondence can be employed to prioritize the analysis of warnings reported by FindBugs, we applied the bug kinds as described above to three additional systems: Cassandra, Hibernate and Apache POI. Table 7 presents the complexity of these systems based on their size and on the number of warnings generated by FindBugs in its default configuration.

Table 7 Metrics of example systems

The suggested approach is to analyze bug kinds with higher DCL R (w) rates before the ones with lower DCL R (w) rates, respecting the bug kind order defined in Table 3.

Table 8 presents the data of applying the bug kinds incrementally. For instance, DLS warning rules (line 11) reported 5 warnings in Cassandra system (W C column), 26 in Apache POI (W P column), 392 in Hibernate (W H column), with the total of 423 (W total column) warnings reported in the three systems.

Table 8 Incremental strategy for applying bug kinds

Columns CC C , CC P and CC H of Table 8 store the cumulative cost, in number of warnings, for each system. The column CC total stores the cumulative cost over the three systems. The value CC(i) (for each system and for the total) on line i is obtained by Eq. 3. Columns CR C , CR P , CR H and CR total show the cost reduction obtained by not considering the remaining bug kinds. The value CR(i) (for each system and for the total) on line i is obtained by Eq. 4. Cumulative Cost of Warning (CC(i)):

$$ CC(i) = W_{1} + W_{2} + \cdots + W_{i-1} + W_{i} \qquad (3) $$

Cost Reduction of Warning (CR(i)):

$$ CR(i) = 1 - (CC(i) / Total) \qquad (4) $$

In Fig. 6, we illustrate the data for CC total and CR total , when the warnings are analyzed according to the order suggested in Table 8. Note, in Fig. 6, that initially CR total =100 % because no warning was analyzed (CC total =0). We suggest the warnings with higher DCL R (w) should be analyzed first, since these warnings are more likely to correspond to a fault according to our study.
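As a minimal sketch of how the cumulative cost and cost reduction columns can be computed, the functions below take a list of per-row warning counts that is already sorted by decreasing DCL R (w). The function names and the sample counts are hypothetical and are not taken from Table 8.

```python
from itertools import accumulate


def cumulative_cost(counts):
    """CC(i) = W_1 + ... + W_i for each row i, in priority order."""
    return list(accumulate(counts))


def cost_reduction(counts):
    """CR(i) = 1 - CC(i) / Total, where Total is the sum over all rows."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one warning is needed to compute CR")
    return [1 - cc / total for cc in cumulative_cost(counts)]


# Hypothetical warning counts per bug kind (illustrative only).
warnings_per_bug_kind = [2, 3, 5, 10, 30]

print(cumulative_cost(warnings_per_bug_kind))
print([round(cr, 2) for cr in cost_reduction(warnings_per_bug_kind)])
```

With every row included, the cumulative cost equals the total and the cost reduction drops to zero, which is the right end of the curves in Fig. 6.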

Fig. 6

Cumulative Cost versus Cost Reduction of warning category incremental strategy

From Table 8, by summing up the number of warnings reported for the bug kinds of lines 1 to 11, the cumulative costs were 8, 42, and 413 for each system individually (columns CC C , CC P and CC H , respectively). This means that, if only these warning rules were analyzed, the cost reductions with respect to all 79 bug kinds would be 99.15, 94.59, and 80.62 % (columns CR C , CR P and CR H , respectively). Overall, 463 warnings were generated (CC total column), which means a cost reduction of 87.98 % (CR total column).

If the top 18 warning rules were used, 694 warnings were generated in total, representing a cost reduction of around 82 % (1 − 694/3,853) with respect to the total number of warnings. Moreover, observe that Cassandra and Apache POI both present cost reductions above 90 % for this set of bug kinds, which may indicate that the Apache community handled the problems reported by these bug kinds during its development process.

The last line of Table 8 shows the total number of warnings that the warning rules composing the incremental strategy generate on each system: 946 for Cassandra, 776 for Apache POI, and 2,131 for Hibernate.
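The cost reductions quoted for Table 8 can be checked from the figures given in the text alone. The sketch below is such a check; the dictionary and function names are ours, and the grand total is assumed to be the sum of the three per-system totals.

```python
# Figures quoted in the text for Table 8.
totals = {"Cassandra": 946, "Apache POI": 776, "Hibernate": 2131}
cc_lines_1_to_11 = {"Cassandra": 8, "Apache POI": 42, "Hibernate": 413}
cc_top_18_all_systems = 694


def cost_reduction_pct(cc: int, total: int) -> float:
    """Cost reduction 1 - CC/Total as a percentage, rounded to two decimals."""
    return round(100.0 * (1 - cc / total), 2)


grand_total = sum(totals.values())  # assumed Total for the CR total column

for system, cc in cc_lines_1_to_11.items():
    print(system, cost_reduction_pct(cc, totals[system]))

print("all systems", cost_reduction_pct(sum(cc_lines_1_to_11.values()), grand_total))
print("top 18 rules", cost_reduction_pct(cc_top_18_all_systems, grand_total))
```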

Prioritization of mutation operators based on DCL R (o k )

Considering the prioritization of mutation operators based on DCL R (o k ), presented in Table 5, we applied the mutation operators incrementally. The idea is to use the historical data previously collected about the correspondence rate between warnings and mutants. The incremental strategy prioritizes, initially, the operators that generate mutants not detected by FindBugs and that have a lower cost in terms of the number of generated mutants.

Table 9 shows the number of mutants generated for each type of μJava mutation operator in Cassandra (column M C ), Apache POI (column M P ) and Hibernate (column M H ). The M total column equals M C +M P +M H . For example, in line 15 we find the number of mutants generated regarding the IOD operator: 47 mutants in Cassandra (column M C ), 70 mutants in the Apache POI (column M P ), and 1,389 mutants in Hibernate (column M H ). On all the three systems, IOD generated 1,506 mutants (column M total ).

Table 9 Incremental strategy for applying mutation operator

In order to reduce the cost of Mutation Testing, Table 9 shows the mutation operators from μJava applied incrementally. The goal is to apply the mutation operators in increasing order of DCL R (o k ). When two mutation operators have the same DCL R (o k ), we apply first the one that generates fewer mutants. The rationale is that this order favors lower-cost mutation operators that represent fault categories difficult for FindBugs to detect. In other words, considering the collected data, operators with smaller DCL R (o k ) indicate the changes for which FindBugs was less effective at, or even incapable of, generating warnings that could detect them.

Table 9 presents mutation operators in increasing order of DCL R (o k ) and of cost in terms of number of mutants. The equations below are defined to evaluate the cost of the incremental strategy. Cumulative Cost of Operator (CC(o k )):

$$ CC({o_{k}}) = M_{1} + M_{2} + \cdots + M_{k-1} + M_{k} \qquad (5) $$

Cost Reduction of Operator (CR(o k )):

$$ CR({o_{k}}) = 1 - (CC({o_{k}}) / Total) \qquad (6) $$
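The ordering rule described above (ascending DCL R (o k ), ties broken by the smaller number of mutants) maps directly onto a two-part sort key. In the sketch below, the DCL R values and the IOD mutant count are the ones quoted in the text; the remaining mutant counts and the `Operator` record itself are hypothetical and serve only to illustrate the ordering.

```python
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    dcl_r: float  # relative direct correspondence, in percent
    mutants: int  # number of mutants generated by the operator


operators = [
    Operator("JTD", 87.88, 120),   # mutant count is a hypothetical placeholder
    Operator("IOD", 0.0, 1506),    # mutant count quoted in the text
    Operator("ISD", 49.33, 300),   # mutant count is a hypothetical placeholder
    Operator("LOD", 0.0, 4),       # mutant count is a hypothetical placeholder
]

# Ascending DCL_R first; operators with equal DCL_R are ordered by fewer mutants.
prioritized = sorted(operators, key=lambda op: (op.dcl_r, op.mutants))

print([op.name for op in prioritized])
```

Sorting on a tuple keeps the tie-break explicit, so operators with no correspondence and few mutants are always applied first.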

Equation 5 is the cumulative cost in terms of the number of mutants to be analyzed, following the priority order established in Table 9. It represents the number of mutants generated from the operator in the first line of Table 9, IHD, until the operator at the k-th line.

The CC C , CC P and CC H columns store, respectively, the result of Eq. 5 for the mutants generated in the Cassandra, Apache POI and Hibernate systems. The column CC total is the result of Eq. 5 applied to the three systems together. The accumulated costs of analyzing the mutants of the first 10 mutation operators (IHD, LOD, OMD, JDC, EOA, ISI, PPD, PCC, PMD and IOR) are 7 mutants in Cassandra, 41 mutants in Apache POI, 238 mutants in Hibernate, and 286 mutants considering the three systems together.

Considering the first 16 mutation operators, all with DCL R (o k )=0 % (IHD, LOD, OMD, JDC, EOA, ISI, PPD, PCC, PMD, IOR, IOP, IPC, IHI, JID, IOD and PCI), the cumulative cost is 174 mutants in Cassandra, 1,043 mutants in Apache POI, and 10,033 mutants in Hibernate, summing up to 11,250 mutants for the three systems together. Observe that this represents only 8.42 % (11,250/133,683) of all the mutants μJava is able to generate considering the entire mutation operator set. Moreover, faults modeled by such mutation operators are not easily detected by FindBugs.

On the other hand, Eq. 6 represents the cost reduction (in %) relative to the total of mutants if only the operators until the k-th line are used.

Columns CR C ,CR P ,CR H and CR total store, respectively, the result of Eq. 6 in Cassandra, Apache POI, Hibernate, and the three systems together. The cost reductions provided for analyzing only the top 10 mutation operators (IHD, LOD, OMD, JDC, EOA, ISI, PPD, PCC, PMD and IOR) are 99.79 % for Cassandra, 99.79 % for Apache POI, 99.78 % for Hibernate, and 99.79 % considering all the three systems.

Note that the cost reduction remains above 90.0 % when all 16 operators with DCL R (o k )=0.0 % are considered. The cost reductions for analyzing the mutants of these 16 mutation operators (IHD, LOD, OMD, JDC, EOA, ISI, PPD, PCC, PMD, IOR, IOP, IPC, IHI, JID, IOD and PCI) are 94.71 % for Cassandra, 94.71 % for Apache POI, 90.93 % for Hibernate, and 91.58 % considering all the three systems together.
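The aggregate figures for the three systems together can be reproduced from the counts quoted in the text. The sketch below does so; the names are ours, and the per-system reductions are not recomputed because the per-system mutant totals are not given in this section.

```python
# Counts quoted in the text for Table 9 (Cassandra + Apache POI + Hibernate).
total_mutants = 133_683
cc_first_10 = 7 + 41 + 238
cc_first_16 = 174 + 1_043 + 10_033


def cost_reduction_pct(cc: int, total: int) -> float:
    """Cost reduction 1 - CC/Total as a percentage, rounded to two decimals."""
    return round(100.0 * (1 - cc / total), 2)


def share_pct(cc: int, total: int) -> float:
    """Share of all mutants covered by the selected operators, in percent."""
    return round(100.0 * cc / total, 2)


print(cc_first_10, cost_reduction_pct(cc_first_10, total_mutants))
print(cc_first_16, cost_reduction_pct(cc_first_16, total_mutants))
print(share_pct(cc_first_16, total_mutants))
```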

Figure 7 shows the curves of cumulative cost (CC(o k )) and cost reduction (CR(o k )), respecting the order defined in the prioritization strategy suggested by DCL R (o k ).

Fig. 7

Cumulative Cost versus Cost Reduction of mutation operator incremental strategy

Didactically, Figs. 6 and 7 can be understood by using three complementary scenarios:

  • Scenario 1: no warning category/mutation operator. This case illustrates the scenario in which Cumulative Cost is zero (0 %) and Cost Reduction is 100 %;

  • Scenario 2: all warning categories/mutation operators are used. This case illustrates the worst scenario where Cumulative Cost is 100 % and Cost Reduction is zero (0 %);

  • Scenario 3: any stage between Scenarios 1 and 2. This case illustrates the situation where only a subset of warning categories or mutation operators is used. In this case, the reviewer/tester can combine both strategies in a complementary way, considering the trade-off between a Cost Reduction in the range 0 % ≤CR≤100 % and a Cumulative Cost in the range 0 % ≤(100−CR)≤100 %. For instance, considering the warning categories in Fig. 6, if we choose a CR of 60 % we would consider the warning categories from QF to DE, implying a CC of 40 %.

    In the same way, when applying mutation operators as illustrated in Fig. 7, the tester may want to obtain a cost reduction of around 80 %. In this case, he/she may consider using operators from IHD to LOI, implying a CC of around 20 %.
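Scenario 3 amounts to choosing how far down the prioritized list to go for a desired cost reduction. A minimal sketch of that choice follows; the function name and the sample counts are hypothetical and do not come from Tables 8 or 9.

```python
from itertools import accumulate


def rows_within_target(costs, target_cr):
    """Return how many leading rows can be applied while CR stays >= target_cr.

    costs: per-row cost (warnings or mutants), already in priority order.
    target_cr: desired cost reduction as a fraction between 0 and 1.
    """
    total = sum(costs)
    if total == 0:
        return len(costs)
    kept = 0
    for cc in accumulate(costs):
        if 1 - cc / total < target_cr:
            break
        kept += 1
    return kept


# Hypothetical mutant counts per operator, in the prioritized order.
mutants_per_operator = [5, 15, 30, 50, 100, 300]

print(rows_within_target(mutants_per_operator, 0.75))
```

A target of 0 selects every row (Scenario 2), while a target of 1 selects none (Scenario 1).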

Lessons learned and threats to validity

For direct correspondence, mutation testing showed (Table 5) a variation of direct correspondence among the operators, that is, among the fault categories simulated by the mutants. This result adds to the one obtained by Couto et al. (2013), since in that study it was not possible to establish direct correspondence per line. The results presented here show strong evidence that a direct correspondence exists and that it is established for specific fault kinds, enabling the establishment of complementary test strategies integrating static and dynamic analysis.

Observe that our intention is to restrict the types of mutation operators we should use to prioritize true positive static warnings, and to identify possible types of faults that FindBugs is not adequate to detect, so that we can combine static and dynamic analysis in a coordinated way and take advantage of both.

Cost and benefit of the suggested approach:

Benefit: the DCL(w) value of each type of warning w is used to establish a prioritization order for analyzing the warnings. With the suggested priority order, it is expected that the true positive warnings are analyzed as soon as possible, leaving the “less important” alarms (with smaller DCL) to be analyzed later if there are still resources available. On the other hand, the incremental strategy for applying mutation operators makes it possible to consider first the faults that FindBugs was not adequate to detect.

Cost: There is some cost to run FindBugs on the original program and on its mutants for the database generation and DCL computation. However, as the mutants do not need to be executed, the generation time is lower than the time demanded by mutation test.

To analyze the cost of database generation and DCL computation, first we need to run FindBugs on all Java files of each system. On a notebook computer with an Intel Core i5 processor and 8 GB of RAM, FindBugs took 6.5 seconds, on average, to run on each file. We compute the average runtime per file because, when running FindBugs on mutants, we need to run it only on the mutated file and not on the entire system. The most expensive system in terms of generated mutants is s17, which is composed of 438 Java files and for which μJava generated 79,968 mutants. Running FindBugs on s17 took 48 minutes and 32 seconds and reported 219 warnings in the original system. This gives an average analysis time of around 6.6 seconds per file for s17 (2,912 s / 438 files ≈ 6.648 s). The time to compute the mutants’ warnings was therefore about 531,659.4 seconds (6.648 × 79,968, using the unrounded average), which corresponds to 6.15 days to finish the data collection.

Observe that the biggest system we analyzed in terms of lines of code, s11, with 102,682 lines of code, has 519 Java files, and FindBugs reported 359 warnings in 49 minutes and 50 seconds, giving an average time of 5.8 seconds per file. This was the lowest runtime per file FindBugs obtained in our experiment. Considering that μJava generated 65,529 mutants for s11, we finished the data collection for this system in 4.37 days. The smallest system in terms of lines of code, s9, was also the one that took the least time to finish data collection, around 2 hours.

The cost of mutant generation and the cost of loading XML data into the database are negligible, taking only a few seconds, and are not considered in this analysis.

Observe that, if we have seven cloud computers, similar to the one we have used, we can collect all the data for the prioritization strategy in one day, considering the most expensive program. Therefore, it is evident that our strategy has a manageable cost only dependent on available hardware resources which, in general, are cheaper than human resources.
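The time estimates above follow from one simple model: one FindBugs run per mutant, each costing the average per-file time measured on the original system. The sketch below reproduces the figures for s17 and s11 under that model; the function name is ours, and the even split over seven machines is an idealization that ignores scheduling overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def collection_days(run_seconds: int, java_files: int, mutants: int) -> float:
    """Estimate data collection time in days for one system.

    run_seconds: FindBugs run time on the original system.
    java_files: number of Java files in the system.
    mutants: number of mutants generated for the system.
    """
    seconds_per_file = run_seconds / java_files
    return seconds_per_file * mutants / SECONDS_PER_DAY


s17_days = collection_days(48 * 60 + 32, 438, 79_968)
s11_days = collection_days(49 * 60 + 50, 519, 65_529)

print(round(s17_days, 2), round(s11_days, 2))
print(round(s17_days / 7, 2))  # idealized even split over seven machines
```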

Once the prioritization order is determined, as presented in Sections 4.1 and 4.2, the only cost of using it, as shown in Section 5.1, is to apply FindBugs and analyze the reported warnings in the suggested order. In the same way, when applying mutation testing, the tester may decide how many mutation operators he/she wants to apply, respecting the available time and cost constraints. The suggested order prioritizes faults that are more difficult for FindBugs to detect.

As any other study, this one also presents threats to validity and limitations as presented below.

External Validity: In this study, we used 19 of the 30 systems used by Couto et al. (2013). One of the reasons for the absence of the other 11 systems is the difficulty in obtaining their dependencies (jar files), without which the μJava tool could not generate the corresponding mutants.

Other limitations observed in the experiment are related to the programming language and the static analysis tool used. The experiment considered only the Java language and, therefore, the results cannot be extended to other languages, especially dynamically typed ones. Regarding static analysis tools, only FindBugs was used, and the results cannot be automatically extended to other static analysis tools either.

About equivalent mutants: one of the objectives of the study is to verify if a given mutation produced is detected by FindBugs Tool, and this is done statically, without the necessity of the mutant execution. Thus, in this first stage of the study, as the mutants were not executed, there is no data about live, dead or equivalent mutants. However, we shall explore the generation of test sets in automated and/or manual ways executing the mutants, evaluating the effectiveness of such test suites, analysing live mutants to determine equivalence and to construct a historical data base for Java mutation operators, similar to the one built to mutation operators for C language, enabling the automatic treatment of equivalent mutants based on Bayesian learning technique (Vincenzi et al. 2002).

Internal Validity: We did not identify internal validity threats. Initially, mutants were generated with the μJava tool (version 4) from the systems shown in Table 1. For this, all operators available in μJava were selected, with the aim of generating as many mutants (faults) as possible.

After that, FindBugs (version 3.0.0) was used with the option -low so that warnings of any priority could be reported. At this stage, FindBugs reports all possible warnings. To determine the place where each mutation occurred, it was necessary to store in the database the textual difference (diff) between each mutant and the original file.
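To make the role of the diff concrete, the sketch below shows one possible way of locating the mutated lines and checking whether a new warning falls on one of them. It uses Python's difflib rather than whatever diff tooling was used in the experiment, and the function names, the sample Java snippet and the warning line numbers are all hypothetical.

```python
import difflib


def mutated_lines(original: str, mutant: str) -> set:
    """Return 1-based line numbers in the mutant that differ from the original."""
    a = original.splitlines()
    b = mutant.splitlines()
    changed = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.update(range(j1 + 1, j2 + 1))
        elif tag == "delete" and b:
            # A pure deletion leaves no line behind; record the nearest mutant line.
            changed.add(min(j1 + 1, len(b)))
    return changed


def has_direct_correspondence(warning_lines, original: str, mutant: str) -> bool:
    """True if at least one new warning is reported on a mutated line."""
    return bool(set(warning_lines) & mutated_lines(original, mutant))


# Hypothetical JTD-style mutation: the 'this' keyword is removed on line 2.
original_src = "int f(int x) {\n  return this.x + x;\n}\n"
mutant_src = "int f(int x) {\n  return x + x;\n}\n"

print(mutated_lines(original_src, mutant_src))
print(has_direct_correspondence([2], original_src, mutant_src))
```

Working on whole lines matches the line-level granularity of the direct correspondence measure.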

Construction Risk: To control the experimentation process and reduce the risk of an analysis based on incorrect data for the systems shown in Table 1, the data collection process and the effective usage of the fault-based technique were validated on simpler systems used in previous experiments (Polo et al. 2009).

Related works

Ayewah et al. (2007a) examined different types of warnings reported by FindBugs classifying them into false positives, trivial bugs and serious bugs. They concluded that, besides the high false positive rates, static analysis tools often uncover true but trivial bugs.

Nagappan and Ball (2005) developed an empirical methodology for the early projection of pre-release defect density based on the outcomes of two different static analysis tools. They were able to establish a strong correspondence between the number of warnings reported by the static analysis tools and the actual pre-release fault density for Windows Server 2003 obtained through testing.

Daimi et al. (2013) also carried out a performance evaluation of five Java static analyzers. They evaluated the five tools using Eclipse according to three different criteria: the total number of violations (warnings) found, run time, and memory usage. Based on these criteria, each tool was evaluated against six fault categories: data faults, control faults, interface faults, measurement faults, duplicated code, and code convention violations. Based on these categories, faulty code was temporarily injected into three different programs and each tool was evaluated against each faulty version. Although such fault categories were plausible, they reported that the Eclipse IDE for Java was able to detect almost all the injected faults immediately, since they were syntax errors.

To investigate the direct correspondence, Couto et al. (2013) used three systems that together add up to around 118 thousand lines of code (LOC) and for which 277 corrective faults were reported through the Bugzilla and Jira tools. The approach used in the studies presented in Couto et al. (2013) found no direct correspondence between faults and warnings. These works used the change history of different versions stored in the iBugs repository (Dallmeier and Zimmermann 2007) and reports of software faults registered in the Bugzilla (Bugzilla 2014) and Jira (Atlassian 2014) tools.

To the best of our knowledge we did not identify any research trying to combine mutation and static analyzers as proposed in this paper. Our work is complementary to the others described above in the sense it uses mutations to evaluate the capability of static analyzers in detecting such mutations. The results obtained so far indicate we can use the collected information to prioritize warning categories in an incremental strategy, allowing the reviewer to apply specific warning categories more adequate to detect mutations first by increasing the chance to analyze a true positive warning. On the other hand, it was also possible to identify warning categories which are not adequate to detect any kind of mutation suggesting these warning categories should be used only if there are time and resources available.

Mutation operators not detected by static analyzers are also a source of information for improving the capability of static analyzers to detect new kinds of faults. Such a set of mutation operators was identified and can be further investigated.

Conclusion and future work

In this study we provided evidence of direct correspondence by line between warnings issued by FindBugs and mutations generated by μJava mutation operators.

Based on this correspondence we defined two incremental strategies for using FindBugs bug kinds and μJava mutation operators in a complementary way.

In the case of the incremental strategy for using FindBugs, we observed that for four specific bug kinds it is possible to have 100 % of direct correspondence with mutations, i.e., bug patterns of these bug kinds are adequate to detect the difference between the mutants and the original program. Even if we extend the analysis for the top 12 bug kinds, the correspondence with mutations is above 33 %.

On the other hand, we also identified that for a group of 16 mutation operators, FindBugs was not adequate to issue warnings on any of the mutants generated by such operators. Again, we used this information to define another incremental strategy for applying mutation operators considering faults modeled by these operators are very difficult to be statically detected by FindBugs. Moreover, these 16 mutation operators are responsible for less than 10 % of the total number of mutants and should be considered a good starting point for selective Mutation Testing.

As more resources become available, additional bug kinds or mutation operators can be included to improve the fault detection capability of the desired strategy.

A side effect of the direct correspondence by line is the possibility of knowing which fault categories FindBugs is not good enough to detect, and of writing additional bug patterns to improve its capability, since static analysis is, in general, cheaper than dynamic analysis.

As future work we intend: 1) to incorporate other static analysis tools for Java (e.g. PMD and JLint); 2) to extend this study to other programming languages; 3) to evaluate the use of mutation testing as a fault model for static analyzer comparison; 4) to investigate Cases 2 and 3 shown in Figs. 3 and 4; and 5) to create a knowledge database which can be evolved automatically, based on the continuous collection of correspondence data, or manually, based on experts’ recommendations.


1 Bad casts of object references

2 Null pointer dereference

3 Dead local store

4 Dubious method invocation

5 The lines of code of each one of the systems presented in Table 1 were obtained through JavaNCSS Tool.

6 KLOC is an acronym for Kilo Lines of Code (LOC/1,000).

Appendix A: Descriptions of mutation operators and warnings

A.1 μJava mutation operators

With respect to mutation operators, μJava has 19 method mutation operators and 28 class mutation operators, as illustrated in Tables 10 and 11, respectively.

Table 10 μJava traditional mutation operators (adapted from Ma and Offutt (2005))

Table 11 μJava class mutation operators (adapted from Ma et al. (2005))

The reduced number of method operators implemented by μJava is due to the selective approach adopted to create such a set of mutants (Offutt et al. 1996). As illustrated in Table 10, fault categories are related to Arithmetic, Relational, Conditional, Shift, Logical and Assignment, Variable, Constant, Operator and Statement operators.

Considering the class operators implemented by μJava, they are developed based on object-oriented language features and Java specific features, as illustrated in Table 11. The considered object-oriented features which have mutation operator implemented are: Inheritance, Polymorphism and Overloading, with 8, 7 and 3 mutation operators per feature, respectively. Additionally, it has 6 Java specific mutation operators and 4 related to common programming mistakes.

For more information about this set of mutation operators and examples of the mutations they perform, the interested reader may refer to Ma and Offutt (2005) and Ma et al. (2005).

A.2 FindBugs warning categories

In version 3.0 of FindBugs there are 9 different warning categories. Each warning category has a set of bug kinds. In total, there are 121 different bug kinds, and each bug kind has a set of bug patterns, resulting in a set of 408 possible warnings. Table 12 presents all warning categories supported by the FindBugs version used in our experiment, the total number of bug patterns in each category, and a sample of a specific bug kind of each category.

Table 12 FindBugs warning categories

FindBugs also assigns a priority level to each warning pattern, from 1 to 3, from higher to lower priority. In general, priority 1 means warnings that should be corrected and that probably represent a fault the reviewer would like to fix; priority 2 represents less critical warnings that may still be worth analyzing after the ones with priority 1; and priority 3 means warnings related to coding style, which are considered informational.





Acknowledgements

The authors would like to thank the Brazilian funding agencies CAPES, CNPq, and FAPESP. The authors would also like to thank the anonymous referees for their valuable comments.

Authors’ contributions

CAA carried out the experiment and data collection. AMRV participated in the organization of the experimental study and proposed the use of mutation testing to evaluate static analysis tools. JCM and MED contributed suggestions to the work and helped revise the manuscript. All authors contributed to the data analysis, conclusions, and future work sections. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information

Correspondence to Auri M. R. Vincenzi.

Additional file

Additional file 1

Original versus mutated program samples. (ZIP 2 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


Keywords


  • Software testing
  • Warnings
  • Mutants
  • Static analysis
  • Mutation testing
  • Static analyzer evaluation