A multi-objective test data generation approach for mutation testing of feature models

Matnei Filho, Rui A.; Vergilio, Silvia R.

doi:10.1186/s40411-016-0030-9

Research
Open access
Published: 26 July 2016

A multi-objective test data generation approach for mutation testing of feature models

Rui A. Matnei Filho¹ &
Silvia R. Vergilio¹

Journal of Software Engineering Research and Development volume 4, Article number: 4 (2016) Cite this article

4931 Accesses
11 Citations
Metrics details

Abstract

Background

Mutation approaches have been recently applied for feature testing of Software Product Lines (SPLs). The idea is to select products, associated to mutation operators that describe possible faults in the Feature Model (FM). In this way, the operators and mutation score can be used to evaluate and generate a test set, that is a set of SPL products to be tested. However, the generation of test sets to kill all the mutants with a reduced, possible minimum, number of products is a complex task.

Methods

To help in this task, in a previous work, we introduced a multi-objective approach that includes a representation to the problem, search operators, and two objectives related to the number of test cases and dead mutants. The proposed approach was implemented and evaluated with three representative multi-objective and evolutionary algorithms: NSGA-II, SPEA2 and IBEA, and obtained promising results. Now in the present paper we extend such an approach to include a third objective: the pairwise coverage. The goal 4 is to reveal other kind of faults not revealed by mutation testing and to improve the efficacy of the generated test sets.

Results

Results of new studies are reported, showing that both criteria can be satisfied with a reduced number of products. The approach produces diverse good solutions and different sets of impacting factors can be considered.

Conclusions

At the end, the tester can either prioritize one objective, by choosing solutions in the extreme points of the fronts or choose solutions with smaller ED values, according to the testing goals and resources.

1 Background

A Software Product Line (SPL) can be defined as a set of products that share common features (Pohl et al. 2005). A feature is related to the software functionalities or system attributes that are visible to the user. Features allow distinguishing products and are important to represent variability. In this sense, different products can be generated by selecting different features. The features are generally expressed in a Feature Model (FM), which allows a hierarchical arrangement of features represented by a tree.

The adoption of the SPL approach in industries is crescent (SEI 2016), due to the associated advantages. With this increasing usage, the demand for specific SPL testing techniques has also been growing (da Mota Silveira Neto et al. 2011). An important activity in this context is the feature testing, which tests if the products derived from the FM match their requirements. To ensure correctness, ideally, all the products derived from the FM should be tested. However, this is impractical in terms of resources and execution time (Cohen et al. 2006). Then, to select only the most representative products, testing criteria are needed. A criterion provides a way to select and evaluate test data sets that, in the FM context, are sets of products to be tested.

The main criteria used for feature testing of SPL are based on combinatorial testing (Cohen et al. 2006; Henard et al. 2014; Lamancha and Usaola 2010; Perrouin et al. 2010). Such kind of test derives SPL products to test combination of features. For example, the pairwise testing (Cohen et al. 2006) requires that each possible pair of features from the FM is included in at least one product derived for the test.

Recently, fault-based criteria, such as that ones based on mutation testing, have been investigated for variability test using the FM (Ferreira et al. 2013; Henard et al. 2013a). As it happens in the traditional test of programs (Wong et al. 1995), studies show that this kind of criterion is more efficacious, in terms of revealed faults, but also more expensive, in terms of required test cases (Ferreira et al. 2013).

In addition to this, a hard task, associated to the application of a test criterion, is the generation of test data to reach the desired coverage. This task has been successfully solved in the Search Based Software Engineering (SBSE) field (Harman et al. 2012). Search based algorithms, such as the genetic ones, are capable to search, in a huge space, the solution that solves the problem and satisfies some constraints.

Recent surveys (Harman et al. 2014; Matnei Filho and Vergilio 2014) show that the use of search-based algorithms for SPL engineering has raised interest in the last two years, mainly for configuration of products, evolution and adaptation of the FM, and also selection of products for testing. Works on this last subject address minimization (Wang et al. 2013), prioritization (Wang et al. 2014) and generation (Ensan et al. 2012; Henard et al. 2013b) of products (test cases), taking into account different factors: number of test cases, pairwise coverage, number of revealed faults, and other ones related to costs. The mutation score is used only in the work of Henard et al. (2014), which implements the (1+1) Evolutionary Algorithm (EA) algorithm, and considers the operators defined in (Henard et al. 2013a).

Most existing works have some limitations. The main one is that they deal with the problem by using an aggregation function and a single-objective algorithm, generally evolutionary one. However the problem is in fact multi-objective, impacted by many factors. Due to this, multi-objective algorithms are more suitable. Such algorithms are based on the Pareto dominance concept (Pareto 1927) and produce a set of good solutions, that represent the best trade-offs between the objectives.

We can find in the literature works that propose multi-objective approaches (Harman et al. 2014). Among them we can mention the work of Lopez-Herrejon et al. (2013) that uses a multi-objective algorithm and two objectives: number of products and pairwise-coverage but, in general, multi-objective approaches do not address mutation testing.

To overcome such limitation, in a previous work (Matnei Filho and Vergilio 2015), we introduced a multi-objective and mutation based approach to generate sets of products for the variability test of FMs. The approach reached good results considering two objectives: the size and mutation score of the derived sets. However, there are other factors that impact on the testing cost and efficacy, and need to be considered. For instance, the works found in the literature does not generate test sets to satisfy both mutation and pairwise testing. Such goal is very important because studies reported by Ferreira et al. (2013) show that both criteria should be used in a complementary way, since they can reveal different kind of faults.

Motivated by this fact and to improve the efficacy of the generated test sets, this paper now extends the previous work by instantiating and evaluating the approach with a new objective: the pairwise coverage. The idea is to obtain a set of products to the feature testing with a minimal number of test cases, factor related to the cost, and high mutation score and pairwise coverage, factors related to the quality and efficacious to reveal faults.

The work uses the mutation operators and tool introduced in (Ferreira et al. 2013). The approach encompasses two main characteristics: i) introduces a new representation for the population, where an individual (solution) is a set of products, differently of most existing works, where an individual represents a product; and ii) allows the use of different objectives. The approach is implemented with three different algorithms, traditionally used by SBSE works: NSGA-II (Deb et al. 2002), SPEA2 (Zitzler et al. 2001), and IBEA (Zitzler and Künzli 2004). The performance of such algorithms are compared, according to quality indicators of the optimization field. The obtained solutions are also evaluated with respect to the three objectives. In all cases, solutions that kill all the non-equivalent mutants are obtained. We observe a reduced number of products required to mutation testing, mainly for FMs that derive the greatest number of products.

This work is organized as follows. First we present concepts from the multi-objective optimization field and we review mutation testing of SPLs, the adopted mutation approach and tool. After this, we introduce the test data generation approach: population, search operators, fitness, and implementation aspects, and we also present a use example of the introduced approach. Following, we describe how the evaluation was conducted, including the research questions, FMs used, algorithms configuration and, in the sequence, we also present and analyze the results. Finally at the end, we present related work and conclusions of the work, showing future research directions.

1.1 Multi-objective optimization

Optimization problems that are impacted by many factors are called muti-objective. For them, it is not always possible to find only a solution that optimizes all objectives simultaneously. This is because the objective functions, associated to diverse metrics, are usually in conflict, thus, a set of good solutions is generated, usually following Pareto dominance concepts (Pareto 1927). This set forms the approximation to the Pareto Front (P F a p p r o x), which is composed by different non-dominated solutions. Given a set of possible solutions, the solution A dominates B if, the value of at least one objective in A is better than the corresponding objective value in B, and the values of the remaining objectives in A are at least equal to the corresponding values in B.

We observe that the generation of test data sets is a multi-objective problem, impacted by many factors. For example, in our case, to reach a high score Sc implies in a higher cost, in terms of number of test cases t. A solution A with (t=15,S c=38 %) dominates the solution B with (t=16,S c=38 %), since the same score was reached with a lower number t. However, A does not dominate C with (t=18,S c=56 %), since C is better considering the score. A solution is said non-dominated if it is not dominated by any other solution.

1.1.1 Multi-objective algorithms

In order to solve a multi-objective problem, multi-objective algorithms have been successfully applied. Variants of Genetic Algorithms (GAs) are widely used in SBSE (Harman et al. 2012). A GA is a heuristic inspired by the theory of natural selection and genetic evolution. The search is started with an initial population composed by some solutions of the search space. From this population, search operators are applied consisting of selection, crossover and mutation. Such operators are specific for the representation adopted for the problem. They iteratively generate new solutions from existing ones, until some stopping condition is reached. Through the selection operator, copies of those individuals with the best values of the objective function are selected to be parent. So, the best individuals (candidate solutions) will survive in the next population. The crossover operator combines parts of two parent solutions to create a new one. The mutation operator randomly modifies a solution. The descendant population, created from the selection, crossover and mutation, replaces the parent population. At the end, the best solution found is returned.

In our work we use three most representative algorithms (Coello et al. 2006): NSGA-II (Non-dominated Sorting Genetic Algorithm) (Deb et al. 2002), SPEA2 (Strength Pareto Evolutionary Algorithm) (Zitzler et al. 2001), and IBEA (Indicator-Based Evolutionary Algorithm) (Zitzler and Künzli 2004). Each algorithm adopts different evolution and diversification strategies and were chosen in our study because they are well known Multiobjective Evolutionary Algorithms (MOEAs) (Coello et al. 2006) and largely used in the SBSE field (Harman et al. 2012). Next, they are described briefly.

1.1.1.1 Non-dominated sorting genetic algorithm (NSGA-II)

The algorithm NSGA-II (Deb et al. 2002) (see Fig. 1) is a MOEA based in GA with a strong elitism strategy. For each generation, NSGA-II sorts the individuals from parent and offspring populations, considering the non-dominance, creating several fronts (Lines 10 and 11 of Fig. 1). The first front is composed by all non-dominated solutions. The second one has the solutions dominated by only one solution. The third front has solutions dominated by two other solutions, and the fronts are created until all solutions are classified.

For the solutions of the same front, another sort is performed using the crowding distance to maintain the diversity of solutions (Line 12 of Fig. 1). The crowding distance calculates how far away the neighbors of a given solution are and, after calculation, the solutions are decreasingly sorted. The solutions in the boundary of the search space are benefited with high values of crowding distance, since the solutions are more diversified but with fewer neighbors.

Both sorting procedures, front and crowding distance, are used by the selection operator (Line 17 of Fig. 1). The binary tournament selects individuals of lower front; in case of same fronts, the solution with greater crowding distance is chosen. New populations are generated with recombination and mutation (Line 18 of Fig. 1). The computational complexity of NSGA-II is O(M N ^′ ²), where M is the number of objectives and N ^′ the population size (Deb et al. 2002).

1.1.1.2 Strength Pareto evolutionary algorithm (SPEA2)

SPEA2 (Zitzler et al. 2001) is also a multi-objective algorithm based on GA (Fig. 2). In addition to its regular population, SPEA2 uses an external archive that stores non-dominated solutions found at each generation.

In each generation a strength value for each solution is calculated and used by the selection operator. The strength value of a solution i corresponds to the number of j individuals, belonging to the archive and the population, dominated by i. The fitness of a solution is the sum of the strength values of all its dominators, from archive and population (Line 4 of Fig. 1); 0 indicates a non-dominated individual, whereas a high value points out that the individual is dominated by many other ones. After the selection of individuals (Line 8), new populations are generated by recombination and mutation (Line 9).

During the evolutionary process, the external archive, which is used in the next generation, is filled with non-dominated solutions of the current archive and population (Line 5). When the non-dominated front does not correspond exactly to the size of the archive, two cases are possible: a too large new archive or a too small one. In the former case, a truncation procedure is performed (Line 6); first the distances from the solutions to their neighbors are calculated, then, the nearest neighbors are removed. In the second case, the dominated individuals from the current archive and population are copied to the new archive (Line 7). The worst run-time of the truncation procedure is O(M ³) (Zitzler et al. 2001), where M is given by the sum of the population and external archive sizes, but SPEA can have different implementations and, in most cases, behavior similar to NSGA-II (Deb et al. 2002).

1.1.1.3 Indicator-based evolutionary algorithm (IBEA)

IBEA (Zitzler and Künzli 2004) is a multi-objective algorithm based on indicators (Fig. 3). Basically, a weight is assigned to each solution found, according to the quality indicators, favoring the user optimization objectives. IBEA performs binary tournaments for mating selection and implements environmental selection by iteratively removing the worst individual from the population and updating the fitness values of the remaining individuals.

The algorithm has as input values α that represents the population size, N maximum number of generations and k fitness scaling factor. To each population individual, a fitness value is calculated based on the k factor (Line 5 of Fig. 3). That individuals with worst values of fitness are removed from population (Lines 7 and 8). The other individual fitness are updated. If the generation counter is bigger than the maximum number of generations the decision vector is returned with the non-dominated individuals of population P, and the execution is finished (Line 11). Else, a binary tournament selection is performed with replacement on P, in order to fill the temporary mating pool P ^′ (Line 14). The individuals P ^′ suffer mutations and crossover, and resulting offspring is added to population P (Line 15). After that, the generation counter is incremented and the algorithm returns to Line 2. The complexity of the algorithm is O(α ²), with regard to the populatio size α (Zitzler and Künzli 2004).

1.1.2 Quality indicators

Quality indicators are used in the optimization field to evaluate the obtained solutions and to compare the algorithms performance. To calculate these indicators, generally three sets of solutions are used: 1) P F _approx: formed by non-dominated solutions returned by one execution of the algorithm; 2) P F _known: combination of all P F _approx obtained through different executions of an algorithm, removing dominated and repeated solutions; 3) P F _true: represents the Optimal Pareto Front to the problem. In our case this set is unknown. Due to this, and following the literature (Zitzler et al. 2003), this set was formed by all sets P F _known obtained with different algorithms by removing dominated and repeated solutions. The set P F _true is in fact an approximation to the real front.

In our work, we assessed the number of solutions and performed an analysis of the Pareto fronts, in order to evaluate the capability of the algorithms in finding a great number and diversified solutions. In this sense, we calculated the Error Ratio (ER) (Van Veldhuizen 1999) quality indicator. It corresponds to the ratio between the P F _known elements that are present and those that are not present in P F _true.

We also used the hypervolume (Zitzler et al. 2003) as the main quality indicator and the Kruskal Wallis (significance level of 95 %) (Derrac et al. 2011) as non-parametric statistic test. Hypervolume measures the coverage area of a known Pareto front (P F _known) on the objective space regarding a nadir (reference) point. The higher the hypervolume, the better the Pareto front. Hypervolume was used because it evaluates a set of solutions generated by multi-objective algorithms regarding convergence and diversity, besides being one of the most used indicators in the literature (Bringmann et al. 2014). In addition, hypervolume is $\vartriangleright $-completeness compliant (Zitzler et al. 2003), which means that, hypervolume can measure if a Pareto front is better than another in terms of dominance relation (Zitzler et al. 2003). It is a useful indicator for what we are trying to assess, since we want to find which resulting Pareto Front is the best one by comparing fronts generated by different algorithms. Furthermore, for calculating the hypervolume, only the objective values of the solutions and the nadir point is needed as entry.

Taking into account that the tester will choose only one solution from the set of non-dominated ones, we analyzed the solutions with the lowest Euclidean Distance (ED) from the ideal solution. Equation 1 shows how ED is calculated. ED represents the distance between two points P= (p ₁,p ₂,…,p _n) and Q= (q ₁,q ₂,…,q _n) in an Euclidean Space n-dimensional.

$$ ED = \sum\limits_{i = 1}^{n}{(p_{i} - q_{i})^{2}} $$

(1)

ED is used here as preference criterion to help the tester in the selection of a solution. The lower the ED value, the better the trade-off between the objectives.

1.2 Feature testing of SPL

As mentioned before the demand for specific testing techniques and tools for SPL is crescent. In the feature testing, the goal is to derive the most representative set of products from the FM. In this section we describe two criteria that can be used for this task.

1.2.1 Pairwise testing

We can find many works in the literature addressing combinatorial testing in the SPL context (Cohen et al. 2006; 2008; Lamancha and Usaola 2010; McGregor 2001; Oster et al. 2011; Perrouin et al. 2010; Uzuncaova et al. 2010). The pairwise testing is a well known and used kind of combinatorial test, and due to this was adopted in our work. To derive the pairs, we use the the Combinatorial tool¹ that implements, among other algorithms, the AETG algorithm, introduced by Cohen et al. (1997). The size of an AETG test set grows logarithmically in the number of test parameters. To illustrate how the pairs are derived, consider the FM of Fig. 4, for the SPL CAS (Car Audio System (Weißleder et al. 2008)), and Table 1, which presents the set of products generated by pairwise testing using such tool. In such table the valid products are represented only in terms of their variabilities. In this sense, Traffic Message Channel that is a obligatory characteristic does not appear, as well as Control. Product 1 includes the 14 pairs (Wheel Control,Map Data via CD), (Wheel Control,CD), (Wheel Control,AAC), (Wheel Control,USB), (Wheel Control,MP3), (Map Data via CD,CD), (Map Data via CD,AAC) and so on.

Table 1 Products required by pairwise testing to the FM of Fig. 4

A multi-objective test data generation approach for mutation testing of feature models

Abstract

Background

Methods

Results

Conclusions

1 Background

1.1 Multi-objective optimization

1.1.1 Multi-objective algorithms

1.1.1.1 Non-dominated sorting genetic algorithm (NSGA-II)

1.1.1.2 Strength Pareto evolutionary algorithm (SPEA2)

1.1.1.3 Indicator-based evolutionary algorithm (IBEA)

1.1.2 Quality indicators

1.2 Feature testing of SPL

1.2.1 Pairwise testing

1.2.2 Mutation testing in the SPL context

2 Methods

2.1 The search based approach

2.1.1 Population representation

2.1.2 Search operators

2.1.3 Fitness

2.1.4 Implementation aspects

2.2 Using the approach

2.2.1 Two-objective use

2.2.2 Three-objective use

2.3 Evaluation description

2.3.1 Research questions

2.3.2 Feature models used

2.3.3 Experiment organization

2.3.4 Parameters setting

2.3.5 Threats to validity

3 Results and discussion

3.1 Discussion

4 Conclusions

4.1 Related work

4.2 Concluding remarks

5 Endnotes

6 Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Rights and permissions

About this article

Cite this article

Share this article

Keywords