
An algorithm for combinatorial interaction testing: definitions and rigorous evaluations

Abstract

Background

Combinatorial Interaction Testing (CIT) approaches have drawn the attention of the software testing community because they generate smaller, efficient, and effective sets of test cases, and they have been successful in detecting faults caused by the interaction of several input parameters. Recent empirical studies show that greedy algorithms are still competitive for CIT. It is thus interesting to investigate new greedy approaches to CIT test case generation and to perform rigorous evaluations within the greedy context.

Methods

We present a new greedy algorithm for unconstrained CIT, T-Tuple Reallocation (TTR), to generate CIT test suites specifically via the Mixed-value Covering Array (MCA) technique. The main reasoning behind TTR is to generate an MCA M by creating and reallocating t-tuples into this matrix M, considering a variable called goal (ζ). We performed two controlled experiments in which we addressed cost-efficiency and cost alone. Considering both experiments, we performed 3,200 executions related to 8 solutions. In the first controlled experiment, we compared versions 1.1 and 1.2 of TTR in order to check whether there is a significant difference between both versions of our algorithm. In this experiment, we jointly considered cost (size of test suites) and efficiency (time to generate the test suites) in a multi-objective perspective. In the second controlled experiment, we confronted TTR 1.2 with five other greedy algorithms/tools for unconstrained CIT: IPOG-F, jenny, IPO-TConfig, PICT, and ACTS. We performed two different evaluations within this second experiment: the first addressed cost-efficiency (multi-objective) and the second only cost (single objective).

Results

Results of the first controlled experiment indicate that TTR 1.2 is more adequate than TTR 1.1, especially for higher strengths (5 and 6). In the second controlled experiment, TTR 1.2 also presents better performance for higher strengths (5 and 6), where only in one case (the comparison with IPOG-F) it is not superior. We attribute this better performance of TTR 1.2 to the fact that it no longer generates, at the beginning, the matrix of t-tuples; instead, the algorithm creates and reallocates t-tuples into M one at a time.

Conclusion

Considering the metrics we defined in this work and based on both controlled experiments, TTR 1.2 is a better option if we need to consider higher strengths (5, 6). For lower strengths, other solutions, like IPOG-F, may be better alternatives.

1 Introduction

The academic community has been making efforts to reduce the cost of the software testing process by decreasing the size of test suites while at the same time aiming at maintaining the effectiveness (ability to detect defects) of such sets of test cases. Hence, several contributions exist for test suite/case minimization (Yoo and Harman 2012; Ahmed 2016; Huang et al. 2016; Khan et al. 2016) where the goal is to decrease the size of a test suite by eliminating redundant test cases, and hence demanding less effort to execute the test cases (Yoo and Harman 2012). One of the approaches to reduce the number of test cases is Combinatorial Interaction Testing (CIT) (Petke et al. 2015), also known as Combinatorial Testing (CT) (Kuhn et al. 2013; Schroeder and Korel 2000), Combinatorial Test Design (CTD) (Tzoref-Brill et al. 2016), or Combinatorial Designs (CD) (Mathur 2008). CIT relates to combinatorial analysis whose objective is to answer whether it is possible to organize elements of a finite set into subsets so that certain balance or symmetry properties are satisfied (Stinson 2004).

There are reports which claim the success of CIT (Dalal et al. 1999; Tai and Lei 2002; Kuhn et al. 2004; Yilmaz et al. 2014; Qu et al. 2007; Petke et al. 2015). Such approaches have drawn the attention of the software testing community because they generate smaller test suites (lower cost to run) that remain effective (greater ability to find faults in the software), and they have been successful in detecting faults due to the interaction of several input parameters (factors).

CIT approaches to generate test cases can be divided into four main classes: Binary Decision Diagrams (BDDs) (Segall et al. 2011), Satisfiability (SAT) solving (Cohen et al. 1997; Yamada et al. 2015; Yamada et al. 2016), meta-heuristics (Garvin et al. 2011; Shiba et al. 2004; Hernandez et al. 2010), and greedy algorithms (Lei and Tai 1998; Lei et al. 2007)Footnote 1. Recent CIT test case generation methods based on BDD and SAT are interesting for constrained problems (where there are restrictions related to parameter interactions) but perform worse than greedy algorithms/tools for unconstrained problems (where there are no restrictions at all).

To corroborate this claim, in (Segall et al. 2011) a BDD-based approach, implemented in the Focus tool, was better in terms of cost than the greedy solutions Advanced Combinatorial Testing System (ACTS) (Yu et al. 2013), Pairwise Independent Combinatorial Testing (PICT) (Czerwonka 2006), and jenny (Jenkins 2016) in the constrained domain. However, their method was worse than such greedy solutions for unconstrained problems.

A recent SAT-based approach (Yamada et al. 2016), implemented in the Calot tool, performed well in terms of efficiency (time to generate the test suites) and cost (test suite sizes), again compared with the greedy tools ACTS (Yu et al. 2013) and PICT (Czerwonka 2006). Despite the advantages of the SAT-based approach, ACTS was much faster than Calot for many 3-way test case examples. Moreover, if unconstrained CIT is considered, ACTS was again remarkably faster than Calot for large SUT models and higher-strength test case generation.

In the context of CIT, meta-heuristics such as simulated annealing (Garvin et al. 2011), genetic algorithms (Shiba et al. 2004), and the Tabu Search Approach (TSA) (Hernandez et al. 2010) have been used. Recent empirical studies show that meta-heuristic and greedy algorithms have similar performance (Petke et al. 2015). Hence, early fault detection via a greedy algorithm with constraint handling (implemented in the ACTS tool (Yu et al. 2013)) was no worse than a simulated annealing algorithm (implemented in the CASA tool (Garvin et al. 2011)). Moreover, there was not enough difference between test suites generated by ACTS and CASA in terms of efficiency (runtime) and t-way coverage. All such previous remarks, some of them based on strong empirical evidence, emphasize that greedy algorithms are still very competitive for CIT.

Even though some authors have argued that real-world CIT applications reside in the constrained domain (Bryce and Colbourn 2006; Cohen et al. 2008; Petke et al. 2015), it is important to mention that unconstrained CIT may be interesting from a practical point of view, especially for critical applications such as satellites, rockets, airplanes, controllers of unmanned metro train systems, etc. For such types of applications, robustness testing is very important. In the context of software systems, robustness testing aims to verify whether the Software Under Test (SUT) behaves correctly in the presence of invalid inputs. Therefore, even though an unconstrained CIT-derived test case may seem pointless or even somewhat difficult to execute, it may still be interesting to see how the software will behave in the presence of inconsistent inputs.

Let us consider that we need to test a communication protocol implemented in several critical embedded systems. If each field of such a protocol is a parameter, it is interesting to impose no restriction (no constraint) on the parameter interactions, so that a certain Protocol Data Unit (PDU) sent from system A to system B may have values not allowed in the combination of the fields (parameters) of the PDU. In other words, if the specification says that when field f_i = 1, possible values of field f_j are between 20 and 70 (20 ≀ f_j ≀ 70), and another field f_k < 5, then a test case where f_i = 1, 1 ≀ f_j ≀ 4, and f_k < 5 is clearly inconsistent because of the value of f_j. But this can be precisely the goal of the test designer, who wants to check how the receiving system (B) will act upon receiving such a PDU from A. This is an example where unconstrained CIT is relevant. It is important to mention that the argument is not that constraints cannot be used for testing critical systems but rather that, for certain types of tests (robustness), constraints are not as relevant.

Based on the context and motivation previously presented, this research relates to greedy algorithms for unconstrained CIT. In (Pairwise 2017), 43 algorithms/tools are presented for CIT, and many others exist that are not shown there. Some of these solutions are variations of the In-Parameter-Order (IPO) algorithm (Lei and Tai 1998) such as IPOG, IPOG-D (Lei et al. 2007), IPOG-F, IPOG-F2 (Forbes et al. 2008), IPOG-C (Yu et al. 2013), IPO-TConfig (Williams 2000), ACTS (where IPOG, IPOG-D, IPOG-F, IPOG-F2 are implemented) (Yu et al. 2013), and CitLab (Cavalgna et al. 2013). All IPO-based proposals have in common the fact that they perform horizontal and vertical growths to construct the final test suite. Moreover, some need two auxiliary matrices, which may decrease their performance by demanding more computer memory. Such algorithms accomplish exhaustive comparisons within each horizontal extension, which may penalize efficiency.

PICT can be regarded as a baseline tool on which other approaches have been built (PictMaster 2017). The algorithm implemented in this tool works in two phases, the first being the construction of all t-tuples to be covered. This can be a drawback, since storing many t-tuples may require a large amount of memory or disk space.

Thus, it is interesting to think about a new greedy solution for CIT that does not need, at the beginning, to enumerate all t-tuples (as PICT does) and does not demand many auxiliary matrices to operate (as some IPO-based approaches do). Although we have some recent rigorous empirical evaluations comparing greedy algorithms with meta-heuristic solutions (Petke et al. 2015) and greedy approaches against SAT-based methods (Yamada et al. 2016), there are no rigorous empirical assessments comparing greedy algorithms/tools, representative of the unconstrained CIT domain, among each other.

In this paper, we present a new algorithm, called T-Tuple Reallocation (TTR), to generate CIT test suites specifically via the Mixed-value Covering Array (MCA) technique. The main reasoning behind TTR is to generate an MCA M by creating and reallocating t-tuples into this matrix M, considering a variable called goal (ζ). TTR is a greedy algorithm for unconstrained CIT.

Three versions of the TTR algorithm were developed and implemented in Java. Version 1.0 is the original version of TTR (Balera and Santiago JĂșnior 2015). In version 1.1 (Balera and Santiago JĂșnior 2016), we made a change where we do not order the input parameters. In the last version, 1.2, the algorithm no longer generates the matrix of t-tuples (Θ) but rather creates and reallocates t-tuples into M one at a time. Moreover, version 1.2 was also implemented in C.

We performed two controlled experiments addressing cost-efficiency and cost alone. Considering both experiments, we performed 3,200 executions related to 8 solutions. In the first controlled experiment, our goal was to compare versions 1.1 and 1.2 of TTR (in Java) in order to check whether there is a significant difference between both versions of our algorithm. In this experiment, we jointly considered cost (size of test suites) and efficiency (time to generate the test suites) in a multi-objective perspective. We conclude that TTR 1.2 is more adequate than TTR 1.1, especially for higher strengths (5 and 6).

We then carried out a second controlled experiment where we confronted TTR 1.2 with five other greedy algorithms/tools for unconstrained CIT: IPOG-F (Forbes et al. 2008), jenny (Jenkins 2016), IPO-TConfig (Williams 2000), PICT (Czerwonka 2006), and ACTS (Yu et al. 2013). We performed two evaluations. In the first one we compared TTR 1.2 with IPOG-F and jenny, since these were the solutions for which we had the source code (needed to precisely measure the time). Hence, a cost-efficiency (multi-objective) assessment was accomplished. In order to address a possible evaluation bias in the time measures due to different programming languages, we compared the implementation of TTR 1.2 (in Java) with IPOG-F (in Java), and the implementation of TTR 1.2 (in C) with jenny (in C). In the second assessment, we did a cost (single objective) evaluation where TTR 1.2 (Java) was compared with PICT, IPO-TConfig, and ACTS. The conclusion is the same as before: TTR 1.2 is better for higher strengths (5 and 6).

In this paper, we extend our previous works where we presented version 1.0 of TTR (Balera and Santiago JĂșnior 2015), and version 1.1 together with another controlled experiment (Balera and Santiago JĂșnior 2016). The contributions of this work are:

  ‱ Even though we considered version 1.1 of TTR in (Balera and Santiago JĂșnior 2016), we did not detail this version there, since the focus of that paper was on the controlled experiment itself. Thus, we highlight the key features of TTR 1.1 here;

  ‱ We created another version of our algorithm, 1.2, where, at the beginning, TTR does not generate the matrix of t-tuples. Our goal here is to avoid an exhaustive combination of t-tuples, as might happen with other classical greedy approaches. Moreover, we rely on just one auxiliary matrix, unlike other greedy solutions that require two auxiliary matrices;

  ‱ We performed two controlled experiments in the unconstrained CIT domain (TTR 1.1 × TTR 1.2; TTR 1.2 × IPOG-F, jenny, IPO-TConfig, PICT, ACTS) with almost three times more participants, in each experiment, than in the previous one (Balera and Santiago JĂșnior 2016). In addition, we ran each participant (instance) 5 times with different input orders of parameters and values to address the nondeterminism of the solutions. To the best of our knowledge, no previous research presented rigorous empirical evaluations for greedy solutions within the unconstrained CIT domain;

  ‱ We accomplished a truly multi-objective (cost-efficiency) evaluation in both controlled experiments (in the second experiment, in its first assessment). Previously (Balera and Santiago JĂșnior 2016), we analyzed cost and efficiency in isolation.

This paper is structured as follows. Section 2 presents an overview of the main concepts related to CIT. In Section 3, we show the main definitions and procedures of versions 1.1 and 1.2 of our algorithm. Section 4 shows all the details of the first controlled experiment, in which we compare TTR 1.1 against TTR 1.2, and Section 5 presents its results and discussion. In Section 6, the second controlled experiment is presented, where TTR 1.2 is confronted with the other five greedy tools. Section 7 presents related work. In Section 8, we show the conclusions and future directions of our research.

2 Background

In this section we present some basic concepts and definitions (Kuhn et al. 2013; Petke et al. 2015; Cohen et al. 2003) related to CIT. A CIT algorithm receives as input a number of parameters (also known as factors), p, which refer to the input variables. Each parameter can assume a number of values (also known as levels), v. Moreover, t is the strength of the coverage of interactions. For example, in pairwise testing, the degree of interaction is two, so the value of strength is 2. In t-way testing, a t-tuple is an interaction of parameter values of size equal to the strength. Thus, a t-tuple is a finite ordered list of elements, i.e. a sequence rather than a mere set of elements.

A Fixed-value Covering Array (CA), denoted by CA(N,p,v,t), is an N×p matrix of entries from the set {0,1,⋯,(v−1)} such that every set of t columns contains each possible t-tuple of entries at least a certain number of times (e.g. once). N is the number of rows of the array (matrix). Note that in a CA, entries are from the same set of v values.

A Mixed-value Covering Array (MCA)Footnote 2 is an extension of a CA and is more flexible because it allows parameters to assume values from different sets. Hence, it is represented as MCA\(\left(N, v^{p_{1}}_{1} v^{p_{2}}_{2} \ldots v^{p_{m}}_{m}, t\right)\), where N is the number of rows of the matrix, \(\sum\limits_{i=1}^{m} p_{i}\) is the number of parameters, each \(v_i\) is the number of values for each parameter \(p_i\), and t is the strength.

Therefore, in CIT a CA or MCA is a test suite and each row of such matrices is a test case. Suppose that we need to generate a pairwise unconstrained CIT test suite considering the following parameters and their respective values:

$$\begin{array}{*{20}l} OS &= \{macOS, Linux, Windows\},\\ Protocol &= \{IPv4, IPv6\},\\ DBMS &= \{MySQL, PostgreSQL, Oracle\}. \end{array} $$

We can formulate this problem as MCA(N, 2^1 3^2, 2), which is denoted as a model for the CIT problem. In other words, we have one parameter (Protocol) which can assume two values, two parameters (OS, DBMS) which can assume three values, and t=2.
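To make the pairwise coverage criterion concrete, the following Java sketch (ours, not part of TTR or any of the cited tools; names such as coversAllPairs are illustrative) checks whether a candidate test suite covers every 2-way combination of values for the OS/Protocol/DBMS model above.

```java
// Illustrative only: checks 2-way (pairwise) coverage of a test suite.
public class PairwiseCheck {

    // Returns true if every pair of values of every pair of parameters
    // appears in at least one test case (row) of the suite.
    static boolean coversAllPairs(String[][] suite, String[][] domains) {
        int p = domains.length;
        for (int i = 0; i < p; i++) {
            for (int j = i + 1; j < p; j++) {
                for (String vi : domains[i]) {
                    for (String vj : domains[j]) {
                        boolean covered = false;
                        for (String[] test : suite) {
                            if (test[i].equals(vi) && test[j].equals(vj)) {
                                covered = true;
                                break;
                            }
                        }
                        if (!covered) return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[][] domains = {
            {"macOS", "Linux", "Windows"},    // OS
            {"IPv4", "IPv6"},                 // Protocol
            {"MySQL", "PostgreSQL", "Oracle"} // DBMS
        };
        // One possible pairwise-covering suite (9 rows) for MCA(N, 2^1 3^2, 2).
        String[][] suite = {
            {"macOS", "IPv4", "MySQL"},      {"macOS", "IPv6", "PostgreSQL"},
            {"macOS", "IPv4", "Oracle"},     {"Linux", "IPv6", "MySQL"},
            {"Linux", "IPv4", "PostgreSQL"}, {"Linux", "IPv6", "Oracle"},
            {"Windows", "IPv4", "MySQL"},    {"Windows", "IPv6", "PostgreSQL"},
            {"Windows", "IPv4", "Oracle"}
        };
        System.out.println("All pairs covered: " + coversAllPairs(suite, domains));
    }
}
```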

As we have mentioned in Section 1, CIT is an interesting solution for the test suite minimization problem. As a matter of perspective, let us consider that there are 10 parameters (A, B, ⋯, J) and that each parameter has 5 values, i.e. A = {a_1, a_2, ⋯, a_5}, B = {b_1, b_2, ⋯, b_5}, ..., J = {j_1, j_2, ⋯, j_5}. If we performed an exhaustive combination, there would be 5^10 = 9,765,625 generated test cases, where each test case is tc_i = {a_k, b_k, ⋯, j_k}. By using version 1.2 of TTR with t=2, even in an unconstrained context, the test suite is reduced to 45 test cases. This gives an idea of the strength of CIT for test suite minimization.

Note that the concepts and definitions we provided in this section are related to the context in which our work is inserted: unconstrained CIT. In case of constrained CIT, constraints must be considered and other definitions can be used (see e.g. (Yamada et al. 2016)).

3 TTR: a new algorithm for combinatorial interaction testing

In this section we detail versions 1.1 and 1.2 of our algorithm. The three versions (1.0 (Balera and Santiago JĂșnior 2015), 1.1, and 1.2) of TTR were implemented in Java.

3.1 TTR: Version 1.1

Version 1.0 of TTR (Balera and Santiago JĂșnior 2015) can be summarized as follows: (i) it generates all possible t-tuples that have not yet been covered. The Constructor procedure constructs the matrix Θ; (ii) it generates an initial solution, the matrix M; and (iii) it reallocates the t-tuples from Θ in order to achieve the best final solution (M) via the Main procedure. Then, the final set of test cases is updated in the matrix M. An important point here is that we order the parameters and values that are submitted to the algorithm. In other words, if we submit five parameters A, B, C, D, E with 10, 4, 3, 8, 5 values respectively, TTR orders these five parameters in descending order of their number of values: A, D, E, B, C. The goal is to make the algorithm insensitive to the input order of parameters and values.
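For illustration only, the ordering step of TTR 1.0 could be sketched in Java as below (the class and method names are ours; TTR 1.1 simply omits this step):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the ordering step of TTR 1.0: parameters are sorted
// by the number of values they have, from the largest to the smallest domain.
class ParameterOrdering {
    static String[][] orderByDomainSize(String[][] values) {
        String[][] ordered = values.clone();
        ordered = Arrays.copyOf(ordered, ordered.length);
        Arrays.sort(ordered, Comparator.comparingInt((String[] v) -> v.length).reversed());
        return ordered;
    }
    // For domains of sizes 10, 4, 3, 8, 5 (parameters A..E), the resulting
    // order corresponds to A, D, E, B, C.
}
```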

The same steps described above also exist in TTR 1.1. However, compared with version 1.0 (Balera and Santiago JĂșnior 2015), in version 1.1 we do not order the parameters and values submitted to our algorithm. The result is that test suites of different sizes may be derived if we submit a different order of parameters and values. The motivation for such a change is that we realized that, in some cases, fewer test cases were created when parameters and values were not ordered.

Let us consider the running example in Fig. 1 with the strength, t, equal to 2. It is important to note that this is at the unit testing level and hence each parameter of the register method is an input parameter submitted to TTR. Thus, there are 3 parameters: bank, function, and card. We assume that there are two banks (bankA, bankB), two functions (debit, credit), and three types of cards (cardA, cardB, cardC) to deal with. Therefore, there are 2, 2, and 3 values of bank, function, and card, respectively, as shown in Table 1.

Fig. 1

A running example: register method

Table 1 Example of parameters and values: Fig. 1

A high-level view of version 1.1 of TTR is in Algorithm 1. The main reasoning of TTR 1.1 is to build an MCA M through the reallocation of t-tuples from a matrix Θ to this matrix M, such that each reallocated t-tuple covers the greatest number of t-tuples not yet covered, considering a variable called goal (ζ). Also note that P is the submitted set of parameters, V is the set of values of the parameters, and t is the strength. As we have just pointed out, TTR 1.1 follows the same general 3 steps as TTR 1.0.

Before going on with the descriptions of the procedures of our algorithm, we need to define the following operators applied to the structures (set, sequence, matrix) we handle. We also present some examples to better illustrate how such operators work.

Definition 1

Let A be a sequence and B be a set. The addition sequence-set operator, ⊙, is such that A⊙B is a sequence where the elements of B are added after the last position of A. Thus, if |A| is the length of sequence A and |B| is the cardinality of set B, |A⊙B|=|A|+|B|.

Example: Let us consider sequence A={1,2,3} and set B={4,5}. Then, A⊙B={1,2,3,4,5}.

Definition 2

Let A and B be two sequences with the same length, i.e. |A| = |B|. The addition sequence-sequence operator, ⊕, is such that A⊕B is a sequence where the element in position i of A⊕B, ab_i, is a_i, the element of A in position i, or b_i, the element of B in position i. Also note the definition of an "empty" element, λ, within a sequence, which is an element with no value. This operator then assumes that if a_i ≠ λ and b_i ≠ λ then ab_i = a_i = b_i. However, if a_i = λ and b_i ≠ λ then ab_i = b_i. On the other hand, if a_i ≠ λ and b_i = λ then ab_i = a_i. Note that |A⊕B| = |A| = |B|.

Example: Let us consider sequences A={1,2,λ} and B={λ,2,3}. Then, A⊕B={1,2,3}.
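A minimal Java sketch of the ⊕ operator, assuming λ is represented as null (the method name is ours):

```java
// Illustrative sketch of the addition sequence-sequence operator (⊕);
// the empty element λ is represented here as null.
static String[] oplus(String[] a, String[] b) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("sequences must have the same length");
    }
    String[] result = new String[a.length];
    for (int i = 0; i < a.length; i++) {
        // The definition assumes a[i] and b[i] agree whenever both are non-λ.
        result[i] = (a[i] != null) ? a[i] : b[i];
    }
    return result;
}
// Example: oplus({"1", "2", null}, {null, "2", "3"}) yields {"1", "2", "3"}.
```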

Definition 3

Let A and B be two sequences. The removal operator, ⊖, is such that A⊖B is a sequence obtained by "removing" each element of B, b_i, from A. This operator assumes that the original sequences A and B are known so that A⊖B = A.

Example: Let us consider that originally we have sequences A={1,2,λ}, B={λ,2,3}, and A⊕B={1,2,3}. Then A⊖B=A={1,2,λ}.

Definition 4

Let A and B be two sets. The set difference operator, ∖, is as defined in set theory.

Example: Let us consider we have sets A={1,2,3} and B={2,3}. Then A∖B={1}.

Definition 5

Let A be a matrix and B be a sequence. The concatenation operator, ∙, is such that A∙B is a matrix where a new row (sequence) B is added after the last row of A.

Example: Let us consider the matrix A below and sequence B={10,11,12}. The matrix A∙B is shown below.

$${\kern14pt} A = \left[ \begin{array}{lll} 1 & 2 & 3 \\[0.3em] 4 & 5 & 6 \\[0.3em] 7 & 8 & 9 \end{array}\right] $$
$$A \bullet B = \left[ \begin{array}{lll} 1 & 2 & 3 \\[0.3em] 4 & 5 & 6 \\[0.3em] 7 & 8 & 9 \\[0.3em] 10 & 11 & 12 \end{array}\right] $$

Definition 6

Let A be a matrix and B be a sequence. The removal from matrix operator, ∘, is such that A∘B is a matrix obtained by removing row (sequence) B, which is the last row of matrix A. This operator assumes that the original matrix A and sequence B are known so that A∘B = A.

Example: Let us consider we have matrix A and sequence B presented in the previous example. Then A∘B=A as shown below.

$$A \circ B = A = \left[ \begin{array}{lll} 1 & 2 & 3 \\[0.3em] 4 & 5 & 6 \\[0.3em] 7 & 8 & 9 \end{array}\right] $$

3.1.1 The constructor procedure

According to the specified input (parameters and values), the Constructor procedure aims to generate all t-tuples that need to be covered. Each t-tuple is a row of the matrix Θ_{|C|×|P|}Footnote 3, where |C| represents the number of t-tuples, t is the strength, and |P| is the number of parameters.

Each row, Ξ_i, of Θ is a t-tuple that has not yet been covered, and it has a variable, flag, associated with it whose purpose is to aid in the reallocation process of the t-tuple into the final solution. Note that since the order matters, each t-tuple Ξ_i is indeed a sequence and not a set. Moreover, flag does not belong to Θ. Table 2 shows the matrix Θ for the example shown in Fig. 1 and t=2. Note that interactions are made for the values of bank∖function, bank∖card, and function∖card. Then, a t-tuple corresponding to the interaction of factors bank∖function can be written in the form Ξ_i = {bankA, debit, λ}. Initially, all values of flag are false. Algorithm 2 shows the Constructor procedure.

Table 2 Matrix Θ for the example in Fig. 1

Constructor operates as follows: based on the set of parameters (domain), P, and the strength (t), interactions between the parameters are generated through the enumeration procedure and stored in a set named E (line 1). For example, we have 3 parameters (bank, function, and card) and t = 2, thus the enumerator will generate the interactions 2 by 2 (t=2) between these 3 parameters. Thus E = {I_1, I_2, I_3}, where we have the sets I_1 = {bank, function, λ}, I_2 = {bank, λ, card}, and I_3 = {λ, function, card}. For better understanding, we denote the elements of I_l in this way: bank∖function, bank∖card, and function∖card. Then, the interactions (I_l) are selected one at a time (line 2), and during this selection, t-tuples are constructed based on each parameter of that interaction: in line 5, the first parameter of the first interaction, p_1, is selected. Note that each parameter, p_j, is indeed another set composed of values, v_k. Thus, p_1 = bank = {bankA, bankB}, p_2 = function = {debit, credit}, and p_3 = card = {cardA, cardB, cardC}. Therefore, each of the values (v_k) is added to t-tuples (Ξ_i) (line 6) and also to Θ (line 7). Recall that Ξ_i is indeed a sequence. From then on, subsequent parameters are selected one by one, and a new t-tuple is generated from the combination of each of the values (v_k) with each of the preexisting t-tuples (Ξ_i) in Θ (line 16). For example, the algorithm selects the first generated interaction, I_1 = bank∖function, and constructs all t-tuples between these two parameters. After processing each interaction, I_l, the Constructor procedure removes it from the set E (line 21).
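As a rough illustration of this combination step (not the actual TTR code), the following Java sketch enumerates the t-tuples of a single parameter interaction; λ is again represented as null and all names are ours.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: enumerates all t-tuples of one parameter interaction,
// leaving null (λ) in the positions of parameters outside the interaction.
class TupleEnumeration {

    // values[j] holds the values of parameter j; 'interaction' lists the
    // indices of the t parameters taking part in this interaction.
    static List<String[]> tuplesOf(int[] interaction, String[][] values) {
        List<String[]> tuples = new ArrayList<>();
        tuples.add(new String[values.length]);        // start from an all-λ tuple
        for (int j : interaction) {
            List<String[]> extended = new ArrayList<>();
            for (String[] partial : tuples) {
                for (String v : values[j]) {
                    String[] copy = partial.clone();
                    copy[j] = v;                      // combine value with existing tuple
                    extended.add(copy);
                }
            }
            tuples = extended;
        }
        return tuples;
    }
}
```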

Note that the main difference between TTR 1.0 and 1.1 is that TTR 1.0 performs the ordering of the domain, P, that is, the parameters are ordered according to the number of values they have, from the highest to the lowest quantity. For example, consider Fig. 1 and this input order: bank, function, and card. In version 1.0, parameters are stored in an ordered way: the first parameter becomes card (3 values), the second parameter is bank (2 values), and the last parameter is function (2 values). In version 1.1, there is no such ordering, and this explains why bank and function generate the first rows (t-tuples) of Θ (see Table 2).

3.1.2 The initial solution and addition of test cases

The matrix M_{N×(|P|+1)} is the MCA we need to construct, where there are N rows (i.e. test cases) and |P| parameters. The (|P|+1)-th column is not used to represent any parameter but rather to hold the value of the goal (ζ) associated with that test case. There exists an initial solution for the matrix M that is obtained by selecting the parameter interaction I_l that has the largest amount of uncovered t-tuples (line 3 in Algorithm 1). Considering the input order bank, function, card, I_2 = bank∖card is chosen because it has 6 t-tuples and it appears before I_3 = function∖card. All t-tuples derived via I_2 in the initial solution are combined with empty test cases, respecting the order of input of the parameters/values submitted to TTR 1.1, as shown in Table 3 (see t-tuples Ξ_5 = {bankA, λ, cardA}, Ξ_6 = {bankA, λ, cardB}, ⋯ from Θ (Table 2) in the initial M).

Table 3 Initial M: example of Fig. 1

In the same way, to the extent that existing test cases are no longer sufficient to allocate the remaining t-tuples of the Θ matrix, the same procedure is used to include new test cases in matrix M. In other words, when the reallocation of t-tuples becomes inefficient, it is necessary to include new test cases. Thus, as in the construction of the initial solution, the parameter interaction I_l that has the largest amount of uncovered t-tuples is selected, so that its t-tuples become new test cases. This strategy is performed on line 3 of Algorithm 1.

3.1.3 Goals

In order to modify the current solution to obtain the final solution, the test suite M, we rely on the variable goal (ζ). For each row of M, i.e. for each test case, there is an associated goal.

As the objective is to address the largest number of uncovered t-tuples, the goal is calculated according to the maximum number of uncovered t-tuples which may potentially be covered when a t-tuple Ξ_i is moved from Θ to M. This results in a temporary test case τ_r. In order to find ζ, it is necessary to take into account: (i) the disjoint parameters, P_d, covered by the union of t-tuple Ξ_i and a test case from M; (ii) the number of parameter interactions, y, which τ_r has already covered; and (iii) the strength t. Therefore:

$$\zeta = \binom{P_{d}}{t} - y. $$

Let us consider again Fig. 1 and t = 2. According to Θ (see Table 2), the initial solution, M, is composed of the t-tuples due to the parameter interaction bank∖card. This is because I_2 = bank∖card has 6 t-tuples, I_3 = function∖card has 6 t-tuples, and I_1 = bank∖function has 4 t-tuples. As bank∖card appears before function∖card and both have 6 t-tuples, the algorithm selects it for reallocation into M.

The number of disjoint parameters, P_d, is equal to 3. As the interaction bank∖card is already contemplated in matrix M, the next parameter interaction providing the largest number of non-addressed t-tuples is function∖card. Then all 3 parameters are involved when we consider bank∖function and function∖card, which explains P_d = 3. As t = 2, we have \(\binom{3}{2} = 3\). However, one of the 3 parameter interactions has already been covered during the initial solution (bank∖card), so we need to cover only 2 parameter interactions. Thus, for each t-tuple in the initial solution M, there remains to be covered:

$$\zeta = \binom{3}{2} - 1 = 2. $$

This explains the goal (ζ) in Table 3. It is very important that y is subtracted in order to find ζ. If this is not done, the final goal will never be matched, since there are no uncovered t-tuples that correspond to this interaction.
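For illustration only, this calculation can be transcribed directly into Java as in the sketch below (method names are ours):

```java
// Illustrative sketch of the goal calculation: ζ = C(P_d, t) - y, where P_d is
// the number of disjoint parameters, t is the strength, and y is the number of
// parameter interactions already covered.
static int goal(int disjointParams, int strength, int alreadyCovered) {
    return binomial(disjointParams, strength) - alreadyCovered;
}

// Binomial coefficient C(n, k) via the multiplicative formula.
static int binomial(int n, int k) {
    int result = 1;
    for (int i = 1; i <= k; i++) {
        result = result * (n - k + i) / i;
    }
    return result;
}
// For the running example: goal(3, 2, 1) == 2, matching ζ in Table 3.
```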

Even considering y, it is also important to note that the expected targets will not always be reached with the current configurations of the M and Θ matrices. In other words, in certain cases, there will be times when no existing t-tuple will allow the test cases of the M matrix to reach their goals. It is at this point that it becomes necessary to insert new test cases in M. This insertion is done in the same way as the initial solution for M is constructed, as described in the section above.

3.1.4 The Main Procedure

The Main procedure is presented in Algorithm 3. After the construction of the matrix Θ, the initial solution, and the calculation of the goals of all t-tuples, Main sorts Θ so that the elements belonging to the parameter interaction with the greatest amount of t-tuples come first (line 1). However, these t-tuples will not be reallocated from Θ to M all at once. This is done gradually, one by one, as goals are reached (lines 7 to 11). Since the matrix M is being traversed in the loop (line 4), it will be updated every time a t-tuple is combined with some of its test cases (note ⊕ in line 5).

Let us consider Fig. 2. All matrices in this figure represent snapshots of M. The upper left matrix (a) is the initial solution. As long as there exist t-tuples (Ξ_i) in Θ, the Main procedure keeps working. Thus, Main selects from Θ the parameter interaction with the largest amount of uncovered t-tuples. In Table 2, t-tuples were selected from the parameter interaction I_3 = function∖card. Every t-tuple of the function∖card interaction is combined with each test case in M until the t-tuple matches some goal (line 7).

Fig. 2

Snapshots of M: a initial solution; b and c intermediate matrices; d final test suite

When an uncovered t-tuple fits into a row of M to complete a test case and this t-tuple is not removed on line 9 of Algorithm 3, it means that the goal for that row of M is reached. Take the first row of the initial M (Table 3), which is a test case (τ_r) originated from Ξ_5 = {bankA, λ, cardA}, and the first t-tuple of the function∖card interaction not yet covered in Θ, Ξ_11 = {λ, debit, cardA}. The addition of Ξ_11 = {λ, debit, cardA} to M is accepted because ζ = 2 is reached. Note that the initial M, with test cases τ_r, is also an input parameter of this procedure. Hence, in line 5, M is updated due to the addition sequence-sequence operator (⊕). In addition, note that τ_r is also a sequence, as Ξ_i is. In other words, by inserting Ξ_11 = {λ, debit, cardA}, we have a complete test case τ_r = {bankA, debit, cardA}. In this way, the other two interactions, bank∖function (Ξ_1 = {bankA, debit, λ}) and function∖card (Ξ_11 = {λ, debit, cardA}), are covered, and the goal is achieved. The upper right matrix (b) in Fig. 2 shows the result of this first addition.

After all combinations between t-tuples and test cases are made, that is, when the procedure ends, the new ζ is calculated. The bottom left matrix (c) shows the new values of ζ (see rows 3 and 6). Then the steps described above are repeated with the insertion/reallocation of t-tuples into the matrix M. Once an uncovered t-tuple of Θ is included in M and meets the goal, that t-tuple is excluded from Θ (line 7). Note that if a t-tuple does not allow the test case with which it was combined to reach the goal, it is "unbound" (line 9) from this test case so that it can be combined with the next test case. The final test suite is the matrix M shown at the bottom right (d).

It is possible that a certain uncovered t-tuple does not fit into M. Consequently, the flag variable associated with this t-tuple in Θ is set to true so that the Main procedure knows that such a t-tuple can no longer be compared with rows of M. Main continues as long as there are uncovered t-tuples. Table 4 shows part of Θ after the first iteration. Note that the t-tuples Ξ_13 = {λ, debit, cardC} and Ξ_16 = {λ, credit, cardC} of the function∖card interaction are not inserted into M (see the values true).

Table 4 Part of Θ: unfitness

This exception, illustrated in Table 4 with Ξ_13 = {λ, debit, cardC} and Ξ_16 = {λ, credit, cardC}, happens because the test cases generated by these t-tuples and the available rows of the matrix M address t-tuples already covered in Θ. Assuming that the test case consists of the combination of a t-tuple and row 3 of M, only one t-tuple is covered, since there are no more t-tuples to be covered in bank∖card and bank∖function, as illustrated in Table 4. However, ζ = 2 is not satisfied and these t-tuples cannot be removed from Θ. Then it is necessary to recalculate the goals according to the parameter interactions that have already been addressed.

3.2 TTR: version 1.2

The high-level view of the new version of TTR, 1.2, is in Algorithm 4. This new version no longer uses the Constructor procedure, since t-tuples are generated one at a time as they are reallocated. In other words, there is no longer a Θ, the matrix of t-tuples. What we have now is only φ, which is a matrix of parameter interactions. TTR 1.2 works as follows: (i) it generates only the parameter interactions (it does not generate the t-tuples yet); (ii) it generates an initial solution, the matrix M; and (iii) the t-tuples are generated from φ in order to obtain the final solution (M) via the Main procedure.

Let us consider the code in Fig. 3, where parameters and values are given in Table 5 and t=3. It is a method to update information in the database of a company. TTR 1.2 constructs only the parameter interactions according to the strength and stores the number of corresponding t-tuples (Ί) in a matrix φ. These parameter interactions are I_1 = {status, education, regime, λ, 8}, I_2 = {status, education, λ, working_hours, 8}, I_3 = {status, λ, regime, working_hours, 8}, and I_4 = {λ, education, regime, working_hours, 8}, where the last element of each I_l is the number of t-tuples Ί (in all these cases Ί = 8). Here, each interaction I_l is indeed a sequence because the algorithm needs to know the exact number of t-tuples and hence position matters. Note that λ is the empty element. No t-tuple corresponding to any parameter/value interaction is constructed, as shown in Table 6. The calculation of Ί is simply done by multiplying the number of values of each parameter in the corresponding interaction.
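A minimal Java sketch of this calculation (ours, for illustration):

```java
// Illustrative sketch: Ί (the number of t-tuples of a parameter interaction)
// is the product of the domain sizes of the parameters in that interaction.
// values[j] holds the values of parameter j; 'interaction' lists the indices
// of the parameters taking part in the interaction.
static int phi(int[] interaction, String[][] values) {
    int count = 1;
    for (int j : interaction) {
        count *= values[j].length;
    }
    return count;
}
// For the example of Fig. 3 (t = 3), each interaction involves three
// parameters with 2 values each, hence Ί = 2 * 2 * 2 = 8.
```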

Fig. 3

A second running example: update method

Table 5 Example of parameters and values: Fig. 3
Table 6 Matrix φ for the example of Fig. 3

3.2.1 Initial solution

In this case, the initial solution is nothing more than the construction of the t-tuples due to the parameter interaction with the greatest Ί, and their transformation into test cases. In Table 7, the t-tuples of the parameter interaction I_1 = {status, education, regime, λ, 8} were all transformed into test cases and therefore, for this parameter interaction, Ί becomes 0 and it is no longer considered in the goal (ζ) calculation (Table 8). In fact, we have 4 parameters and t = 3, thus 4 possible parameter interactions are generated: one is already covered, leaving 3 parameter interactions (I_2, I_3, I_4) to be addressed. This justifies ζ=3 (Table 7).

Table 7 Initial M for the example of Fig. 3
Table 8 Matrix φ for the example of Fig. 3: after the initial solution

3.2.2 The main procedure

The new Main procedure is presented in Algorithm 5. After calculating the parameter interactions, Ί, the initial solution, and the goals of all test cases of M, Main selects the parameter interaction that has the highest amount of uncovered t-tuples (line 2) and constructs its t-tuples so that they can be reallocated. However, they are reallocated gradually, one by one, as goals are reached (lines 4 to 13). The procedure combines the t-tuples with the test cases of M in order to match them.

Let us take the second running example (Fig. 3). The parameter interaction with the highest amount of non-addressed t-tuples is I_2 = {status, education, λ, working_hours, 8} (Ί = 8; Table 8 after the initial solution): all t-tuples of this interaction are generated and stored in a sequence S (line 3). The first t-tuple, Ξ_1 = {active, undergraduate, λ, afternoon}, is combined with each test case τ_r in M (line 7). The t-tuple in question fits test case 1, τ_1. At that moment, it is verified whether the t-tuple Ξ_i makes the test case τ_r reach its goal. This control is done through the goal() function, which receives the test case τ_r, which is then broken into t-tuples (line 8) according to the parameter interactions that have Ί other than 0. For example, the test case τ_1 = {active, undergraduate, partial, afternoon} is broken into the t-tuples {{active, undergraduate, partial, λ}, {active, undergraduate, λ, afternoon}, {active, λ, partial, afternoon}, {λ, undergraduate, partial, afternoon}}. It is then verified how many of these t-tuples do not exist in M and, if this amount equals the respective ζ, Ξ_i is permanently stored in M and one unit is subtracted from the value of Ί of each of the parameter interactions that have t-tuples covered by this test case (line 12), because this keeps track of the quantity of t-tuples that still have to be covered. Since the matrix M is being traversed in the loop (line 6), it will be updated every time a t-tuple is combined with some of its test cases (line 7).
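The following Java sketch illustrates how a complete test case can be broken into its t-tuples (it is our illustration of the idea behind the goal() function, not the actual TTR 1.2 code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: breaks a complete test case into its t-tuples, one per
// parameter interaction of size t; positions outside the interaction are set
// to null (λ).
class TestCaseBreaker {

    static List<String[]> breakIntoTuples(String[] testCase, int t) {
        List<String[]> tuples = new ArrayList<>();
        combine(testCase, t, 0, new int[t], 0, tuples);
        return tuples;
    }

    // Recursively chooses t parameter indices and projects the test case onto them.
    private static void combine(String[] tc, int t, int start, int[] chosen,
                                int depth, List<String[]> out) {
        if (depth == t) {
            String[] tuple = new String[tc.length];
            for (int idx : chosen) {
                tuple[idx] = tc[idx];
            }
            out.add(tuple);
            return;
        }
        for (int i = start; i <= tc.length - (t - depth); i++) {
            chosen[depth] = i;
            combine(tc, t, i + 1, chosen, depth + 1, out);
        }
    }
    // For τ_1 = {active, undergraduate, partial, afternoon} and t = 3, this
    // yields the four 3-tuples listed above.
}
```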

This step is repeated for all t-tuples. Each time a t-tuple is reallocated from S into M, the goals are recalculated. For example, when the matrix M permanently receives the 4th t-tuple, the test cases that become complete (with a value for each parameter) have ζ = 0 while the others still have ζ = 3 (Table 9).

Table 9 Intermediate matrix M for the example of Fig. 3

All I_2 t-tuples are reallocated from S in order to achieve the goal of all M test cases, resulting in the final test suite presented in Table 10. In fact, the Main procedure does not construct new t-tuples from another parameter interaction while the current one still has uncovered t-tuples: if the parameter interaction I_2 (selected due to the greatest Ί) still has t-tuples, Main will not select another parameter interaction. To do this, the goal of the test cases is decreased by one unit until all t-tuples of the parameter interaction I_2 make the test cases match ζ.

Table 10 Final matrix M for the example of Fig. 3

4 Controlled experiment 1: TTR 1.1 × TTR 1.2

This section presents a controlled experiment where we compare versions 1.1 and 1.2 of TTR in order to determine whether there is a significant difference between both versions of our algorithm. We accomplished such an experiment by jointly considering cost and efficiency in a multi-objective perspective.

4.1 Definition and context

The primary aim of this study is to evaluate cost and efficiency related to CIT test case generation via versions 1.1 and 1.2 of the TTR algorithm (both implemented in Java). The rationale is to perceive whether we have significant differences between the two versions of our algorithm.

Regarding the metrics, cost refers to the size of the test suites while efficiency refers to the time to generate the test suites. Although the size of the test suite is used as an indicator of cost, it does not necessarily mean that test execution cost is always less for smaller test suites. However, we assume that this relationship (higher size of test suite means higher execution cost) is generally valid. We should also emphasize that the time we addressed is not the time to run the test suites derived from each algorithm but rather the time to generate them. We jointly analyzed cost and efficiency in a multi-objective way.

The set of samples, i.e. the subjects, are formed by instances that were submitted to both versions of TTR to generate the test suites. We randomly chose 80 test instances/samples (composed of parameters and values) with the strength, t, ranging from 2 to 6. Table 11 shows part of the 80 instances/samples used in this study. Full data obtained in this experiment are presented in (Balera and Santiago JĂșnior 2017).

Table 11 Samples for the controlled experiment: Instances. Caption: val = value; par = parameter

It is important to mention how each instance/sample can be interpreted. Let us consider instance i=1 in Table 11:

$$2^{1} 4^{1} 5^{1} 3^{1} 6^{1}, \quad t=2. $$

In the context of unit test case generation for programs developed according to the Object-Oriented Programming (OOP) paradigm, this instance can be used to generate test cases for a class that has one attribute (parameter) which can take 2 values (2^1), one attribute that can take 4 values (4^1), another attribute that can take 5 values (5^1), ⋯, and one attribute that can take 6 values (6^1). In the system and acceptance testing context, this same sample can be used to identify test scenarios (test objectives) in a model-based test case generation approach (Santiago JĂșnior 2011; Santiago JĂșnior and Vijaykumar 2012). In both cases, the test suites must meet the criteria of pairwise testing (t=2) where each combination of 2 values of all parameters must be covered. Note that these samples were randomly selected and they cover a wide range of combinations of parameters, values, and strengths to be selected for very simple but also more complex case studies with different testing levels (unit, system, acceptance, etc.).

4.2 Hypotheses and variables

We defined two hypotheses as shown below:

  • Null Hypothesis, H 0.1 - There is no difference regarding cost-efficiency between TTR 1.1 and TTR 1.2;

  ‱ Alternative Hypothesis, H 1.1 - There is a difference regarding cost-efficiency between TTR 1.1 and TTR 1.2.

Regarding the variables involved in this experiment, we can highlight the independent and dependent variables (Wohlin et al. 2012). The first type are those that can be manipulated or controlled during the trial process and define the causes of the hypotheses. For this experiment, we identified the algorithm/tool for CIT test case generation. The dependent variables allow us to observe the result of the manipulation of the independent ones. For this study, we identified the number of generated test cases and the time to generate each set of test cases, and we considered them jointly.

4.3 Description of the experiment

The experiment was conducted by the researchers who defined it. We relied on the experimentation process proposed in (Wohlin et al. 2012), using the R programming language version 3.2.2 (Kohl 2015). Both algorithms/tools (TTR 1.1, TTR 1.2) were subjected to each one of the 80 test instances (see Table 11), one at a time. The output of each algorithm/tool, with the number of test cases and the time to generate them, was recorded.

To measure cost, we simply verified the number of generated test cases, i.e. the number of rows of the final matrix M, for each instance/sample. The efficiency measurement required us to instrument each one of the implemented versions of TTR and measure the current system time before and after the execution of each algorithm. In all cases, we used a computer with an Intel Core(TM) i7-4790 CPU @ 3.60 GHz processor, 8 GB of RAM, running the Ubuntu 14.04 LTS (Trusty Tahr) 64-bit operating system. The goal of this second analysis is to provide an empirical evaluation of the time performance of the algorithms.
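For illustration, the instrumentation can be as simple as the sketch below (TTRGenerator and generate are hypothetical names, not the actual API of our implementations):

```java
// Illustrative sketch of the timing instrumentation around a generator call.
long start = System.nanoTime();
int[][] suite = TTRGenerator.generate(parameters, values, strength); // hypothetical call
long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
System.out.println("Test cases: " + suite.length + ", generation time (ms): " + elapsedMillis);
```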

To perform the multi-objective cost-efficiency evaluation, we followed two steps. First, we transformed the cost-efficiency (two-dimensional) representation into a one-dimensional one. Then, in a second step, we used statistical tests, such as the t-test or the nonparametric Wilcoxon test (Signed Rank) (Kohl 2015), to compare the two solutions (TTR 1.1 and TTR 1.2). To address the nondeterminism of the algorithms/tools, related to the input ordering of parameters and values, we generated test cases with 5 variations in the order of parameters and values, and took into account the average of these 5 assessments for the statistical tests. We then obtained points (c_{A_i}, t_{A_i}) that represent the average cost (c_{A_i}) and average time (t_{A_i}) of algorithm A (TTR 1.1, TTR 1.2) for each instance i (1 ≀ i ≀ 80).

We then determined an optimal point in the two-dimensional space, the point (0,0), which represents a cost close to 0 and a time close to 0. We say close rather than exactly 0 because an algorithm is not expected to generate a test suite with exactly 0 test cases, nor to require 0 units of time to generate the set of test cases. We then used a distance measure, the Euclidean distance, to measure the distance from the optimal point (0,0) to (c_{A_i}, t_{A_i}). Thus, each algorithm is represented by a one-dimensional set, D, where each d_i ∈ D is the Euclidean distance between (0,0) and (c_{A_i}, t_{A_i}) for every instance i. We selected the Euclidean distance because it is one of the most widely used similarity/distance measures. In software testing, the Euclidean distance has been used as a quality indicator in multi-objective test case/data generation (Filho and Vergilio 2015; Santiago JĂșnior and Silva 2017), to support the automation of test oracles for complex output domains (web applications (Delamaro et al. 2013), text-to-speech systems (Oliveira 2017)), and in many other applications.
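A minimal Java sketch of this computation (ours, for illustration):

```java
// Illustrative sketch: the one-dimensional cost-efficiency value of an
// algorithm on one instance is the Euclidean distance from the optimal point
// (0, 0) to (average cost, average time) over the 5 runs with shuffled inputs.
static double distanceToOptimal(double[] costs, double[] times) {
    double avgCost = average(costs);
    double avgTime = average(times);
    return Math.sqrt(avgCost * avgCost + avgTime * avgTime);
}

static double average(double[] xs) {
    double sum = 0.0;
    for (double x : xs) {
        sum += x;
    }
    return sum / xs.length;
}
```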

Based on this cost-efficiency one-dimensional representation, we relied on appropriate statistical evaluation to check data normality. Verification of normality was done in three steps: (i) by using the Shapiro-Wilk test (Shapiro and Wilk 1965) with a significance level α = 0.05; (ii) by checking the skewness of the frequency distribution (in this case, − 0.1 ≀ skewness ≀ 0.1 so that the data can be considered as normally distributed); and (iii) by graphical verification by means of a Q-Q plot (Kohl 2015) and a histogram. Thus, we believe we have greater confidence in this conclusion on data normality compared to an approach based only on the Shapiro-Wilk test, considering the bias effects due to the size of the samples.

If we concluded that the data came from a normally distributed population, then the paired, two-sided t-test was applied with α = 0.05. Otherwise, we applied the nonparametric paired, two-sided Wilcoxon test (Signed Rank) (Kohl 2015), also with α = 0.05. However, if the samples presented ties, we applied a variation of the Wilcoxon test, the asymptotic paired, two-sided Wilcoxon test (Signed Rank) (Kohl 2015), suitable for treating ties, with significance level α = 0.05.

In order to reject the Null Hypothesis, H 0.1, we checked whether p-value < 0.05 (t-test) or whether both p-value < 0.05 and |z| > 1.96 (Wilcoxon), where z is the z-score. If H 0.1 was rejected, we observed the average of all 80 Euclidean distances due to each algorithm. The algorithm that presented the lowest average of Euclidean distances was chosen as the most adequate. If H 0.1 could not be rejected, then the conclusion was that no statistical difference existed between both algorithms.

5 Results and discussion

In this section, we present the results of this first controlled experiment. Based on the cost-efficiency one-dimensional representation (Section 4.3), we considered four evaluation classes as follows:

  ‱ All strengths. In this case, all 80 instances/samples (Table 11) with all strengths (2, 3, 4, 5, and 6) were taken into account. Our idea here is to assess the cost-efficiency performance of both algorithms in a context where several different strengths are selected to generate a test suite;

  ‱ Low strengths. In this case, we selected only the samples with strength equal to 2. Our aim is to note how the algorithms perform for low strengths;

  ‱ Medium strengths. By selecting samples with strength equal to 3 or 4, we want to evaluate an intermediate strength context;

  • High strengths. We aim to assess the performance for higher strengths, i.e. t= 5 or 6.

Table 12 presents the Euclidean distances of part of the 80 samples (all strengths class only; complete data are in (Balera and Santiago JĂșnior 2017)) as well as the average values, \(\overline{x}\), of such distances. We checked data normality, and Table 13 presents the p-value, p, due to the Shapiro-Wilk test and the skewness. Note that this table shows p and skewness for all four classes above (all, low, medium, and high strengths). Moreover, Sol 1 is TTR 1.1 and Sol 2 is TTR 1.2. Figures 4 and 5 present the Q-Q plots and histograms for all strengths, Figs. 6 and 7 for lower strengths, Figs. 8 and 9 for medium strengths, and Figs. 10 and 11 for higher strengths, respectively.

Fig. 4

Experiment 1: Q-Q plots. a TTR1.1; b TTR 1.2 - All Strengths

Fig. 5

Experiment 1: Histograms. a TTR1.1; b TTR 1.2 - All Strengths

Fig. 6

Experiment 1: Q-Q plots. a TTR1.1; b TTR 1.2 - 2 Strength

Fig. 7

Experiment 1: Histograms. a TTR1.1; b TTR 1.2 - 2 Strength

Fig. 8

Experiment 1: Q-Q plots. a TTR1.1; b TTR 1.2 - 3 and 4 Strengths

Fig. 9

Experiment 1: Histograms. a TTR1.1; b TTR 1.2 - 3 and 4 Strengths

Fig. 10

Experiment 1: Q-Q plots. a TTR1.1; b TTR 1.2 - 5 and 6 Strengths

Fig. 11

Experiment 1: Histograms. a TTR1.1; b TTR 1.2 - 5 and 6 Strengths

Table 12 Experiment 1 - Results of the analysis of Euclidean Distance (all strengths)
Table 13 Experiment 1 - Results of the analysis of data normality

We can clearly see that these data did not come from a normally distributed population, because p < 0.05 and the skewness is far from 0. Moreover, the Q-Q plots and histograms reinforce this conclusion. Hence, we used the nonparametric paired, two-sided Wilcoxon test (Signed Rank) or its variation (asymptotic) when ties were detected. Table 14 presents the p-value, p, |z|, and additional information for the all and low strengths classes, while Table 15 shows the results for medium and high strengths.

Table 14 Experiment 1 - Results of the Wilcoxon test
Table 15 Experiment 1 - Results of the Wilcoxon test

Based on Tables 14 and 15, we could not reject H 0.1 (no difference) for the all strengths class, but we could do so for the other evaluation classes and hence accept the Alternative Hypothesis, H 1.1. As we have previously pointed out, when there is a difference regarding cost-efficiency, we examine the average values of the Euclidean distances: the smaller the better. TTR 1.1 is better, in terms of cost-efficiency, than TTR 1.2 for lower strengths (t=2). However, for medium (t=3,4) and higher strengths (t=5,6), TTR 1.2 surpassed TTR 1.1. This makes sense because TTR 1.2 does not generate, at the beginning, the matrix of t-tuples, and hence we expect this last version of our algorithm to handle higher strengths properly.

Therefore, even though we did not find a statistical difference for the all strengths class and TTR 1.1 was the best for lower strengths, we decided to select TTR 1.2 to compare with the other solutions for unconstrained CIT test case generation, because TTR 1.2 performed better than TTR 1.1 for medium and higher strengths.

5.1 Validity

The conclusion validity has to do with how confident we are that the treatment we used in an experiment is really related to the actual observed outcome (Wohlin et al. 2012). One of the threats to conclusion validity is the reliability of the measures (Campanha et al. 2010). We automatically obtained the measures via the implementations of the algorithms, and hence we believe that replication of this study by other researchers will produce similar results. Even if other researchers get different absolute results, especially related to the time to generate the test suites, simply because such results depend on the computer configuration (processor, memory, operating system), we do not expect a different conclusion validity. Moreover, we relied on adequate statistical methods to reason about data normality and whether we really found a statistical difference between TTR 1.1 and TTR 1.2. Hence, our study has a high conclusion validity.

The internal validity aims to analyze whether the treatment actually caused the outcome (result). Hence, we need to be sure that other parameters, which have not been controlled or measured, have not caused the outcome. There are many threats to internal validity, such as testing effects (measuring the participants repeatedly), history (external events or events between repeated measures of the dependent variable may influence the responses of the subjects, e.g. interruption of the treatment), instrument change, maturation (participants might mature during the study or between measurements), selection bias (differences between groups), etc. Note that the participants of our experiment are randomly generated samples composed of parameters, values, and strengths. Hence, we had neither any human, natural, or social parameter nor unanticipated events interrupting the collection of the measures once it started that could pose a threat to internal validity. Hence, we claim that our experiment has a high internal validity.

In the construct validity, the goal is to ensure that the treatment reflects the construction of the cause, and the result the construction of the effect. This validity is also high because we used the implementations of TTR 1.1 and TTR 1.2 to assess the cause, and the results, supported by the decision-making procedure via statistical tests, clearly provided the basis for the decision to be made between both algorithms.

Threats to external validity compromise the confidence in asserting that the results of the study can be generalized to and between individuals, settings, and under the temporal perspective. Basically, we can divide threats to external validity in two categories: threats to population and ecological threats.

Threats to population refer to how representative the selected samples of the population are. For our study, the ranges of strengths, parameters, and values are the determining factors for this threat. Note that, for such a study, the number of possible combinations of strengths and parameters/values is practically infinite. However, we believe that our set of samples (80) is significant, with strengths spanning from 2 to 6. Also, recall that the samples were determined completely at random (by combining parameters, values, and strengths), and the input order of parameters and values was also random (for the 5 executions addressing nondeterminism). With this, we guarantee one of the basic principles of the sampling process, which is randomness, to avoid selection bias.
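For illustration, the following sketch (in Java, the language of most of our implementations) shows how such random instances and input orders could be drawn. It is only a sketch: the ranges for the number of parameters and of values per parameter are illustrative assumptions, while the strength range (2 to 6) and the 5 input-order variations follow the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch of the random sampling scheme; not the authors' original generator. */
public class InstanceSampler {

    public static void main(String[] args) {
        Random rnd = new Random();

        int strength = 2 + rnd.nextInt(5);               // t in {2,...,6}, as in the experiment
        int numParameters = strength + rnd.nextInt(8);   // illustrative range, at least t parameters
        List<Integer> domainSizes = new ArrayList<>();
        for (int p = 0; p < numParameters; p++) {
            domainSizes.add(2 + rnd.nextInt(5));          // illustrative number of values per parameter
        }

        // 5 executions per instance, each with a random input order of the parameters.
        for (int run = 1; run <= 5; run++) {
            List<Integer> order = new ArrayList<>(domainSizes);
            Collections.shuffle(order, rnd);
            System.out.println("run " + run + ", t=" + strength + ", parameter domains=" + order);
        }
    }
}
```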

Ecological threats refer to the degree to which the results may be generalized between different configurations. Pre-test effects, post-test effects, and the Hawthorne effect (participants feeling stimulated simply by knowing that they are taking part in an innovative experiment) are some of these threats. The participants in our experiment are the instances/samples composed of parameters, values, and strengths and, therefore, this type of threat does not apply to our case.

6 Controlled experiment 2: TTR 1.2 × other solutions

In this section, we present a second controlled experiment where we compare TTR 1.2 with five other significant greedy approaches for unconstrained CIT test case generation. Many characteristics of this second controlled experiment resemble those of the first one (Section 4). We emphasize here the main differences and refer to that section whenever necessary.

6.1 Definition and context

The aim of this experiment is to compare TTR 1.2 with five other greedy algorithms/tools for unconstrained CIT: IPOG-F (Forbes et al. 2008), jenny (Jenkins 2016), IPO-TConfig (Williams 2000), PICT (Czerwonka 2006), and ACTS (Yu et al. 2013). These algorithms/tools have been selected due to their relevance for unconstrained CIT via greedy strategies.

The IPO algorithm (Lei and Tai 1998) is the basis for several other solutions such as IPOG, IPOG-D (Lei et al. 2007), IPOG-F, IPOG-F2 (Forbes et al. 2008), IPOG-C (Yu et al. 2013), IPO-TConfig (Williams 2000), ACTS (where several versions of IPO are implemented) (Yu et al. 2013), and CitLab (Cavalgna et al. 2013). Thus, we considered three of its variations: our own implementation of IPOG-F (in Java), IPO-TConfig (in Java), and IPOG-F2 as implemented within ACTS (in Java). Note that ACTS is probably one of the most popular CIT tools, used not only in academia but also by industry professionals for various purposes (NIST National Institute of Standards and Technology 2015). A tool implemented in C, jenny (Jenkins 2016), has been used in informal (Pairwise 2017) and more formal (Segall et al. 2011) CIT comparisons. PICT (in C++) can be regarded as a baseline greedy tool, on which other tools have been built (PictMaster 2017).

Like in Section 4, the metrics are cost, measured as the size of the test suites, and efficiency, which again refers to the time to generate them. However, to properly measure the time to generate the test suites, we should have access to the source code of the tools in order to instrument them and get more precise and accurate measures. We only had the code of the implementation of TTR 1.2, of our own implementation of IPOG-F, and of jenny. Thus, we could not measure the time to generate the test cases for IPO-TConfig, PICT, and ACTS (IPOG-F2). Moreover, note that the time measurements may be influenced by the different programming languages within the cost-efficiency evaluation (TTR 1.2, IPOG-F, and jenny). Hence, we implemented TTR 1.2 not only in Java but also in C in order to address a possible evaluation bias in the time measures when comparing TTR 1.2 against the other solutions. To sum up, we decided to perform two evaluations:

  • Cost-Efficiency (multi-objective). Here, we focused on TTR 1.2, IPOG-F, and jenny since these were the solutions we had the source code and could properly measure the time to generate the test suites. Hence, we compared TTR 1.2 (in Java) with IPOG-F (in Java), and TTR 1.2 (in C) with jenny (in C);

  • Cost (single objective). In this case, we compared TTR 1.2 (only in Java since efficiency is not considered here and thus time does not matter) with PICT, IPO-TConfig, and ACTS.

With respect to the subjects, the same 80 participants of Section 4 were used (Table 11; full data are in (Balera and Santiago Júnior 2017)).

6.2 Hypotheses and variables

Hypotheses of this second experiment are:

  • Null Hypothesis, H 0.2 - There is no difference regarding cost-efficiency between TTR 1.2 (in Java) and IPOG-F (in Java);

  • Alternative Hypothesis, H 1.2 - There is difference regarding cost-efficiency between TTR 1.2 (in Java) and IPOG-F (in Java);

  • Null Hypothesis, H 0.3 - There is no difference regarding cost-efficiency between TTR 1.2 (in C) and jenny (in C);

  • Alternative Hypothesis, H 1.3 - There is difference regarding cost-efficiency between TTR 1.2 (in C) and jenny (in C);

  • Null Hypothesis, H 0.4 - There is no difference regarding cost between TTR 1.2 (in Java) and PICT;

  • Alternative Hypothesis, H 1.4 - There is difference regarding cost between TTR 1.2 (in Java) and PICT;

  • Null Hypothesis, H 0.5 - There is no difference regarding cost between TTR 1.2 (in Java) and IPO-TConfig;

  • Alternative Hypothesis, H 1.5 - There is difference regarding cost between TTR 1.2 (in Java) and IPO-TConfig;

  • Null Hypothesis, H 0.6 - There is no difference regarding cost between TTR 1.2 (in Java) and ACTS;

  • Alternative Hypothesis, H 1.6 - There is difference regarding cost between TTR 1.2 (in Java) and ACTS.

The independent variable is the algorithm/tool for CIT test case generation in both assessments (cost-efficiency, cost). The dependent variables are the number of generated test cases (cost evaluation) and, for the cost-efficiency evaluation, this number together with the time to generate each test suite, considered in a multi-objective perspective as in the previous experiment.

6.3 Description of the experiment

The general description of both evaluations (cost-efficiency, cost) of this second study is basically the same as in Section 4. Each algorithm/tool was subjected to each one of the 80 test instances, one at a time, and the outcome was recorded. Cost is the number of generated test cases, and efficiency was obtained via instrumentation of the source code, on the same computer previously mentioned.
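As a minimal sketch of this kind of instrumentation (the generate method below is a hypothetical stand-in for TTR 1.2, IPOG-F, or jenny; only the measurement pattern reflects the text), the generation time can be taken as the wall-clock interval around the call that builds the test suite:

```java
/** Timing sketch; generate(...) is a placeholder, not the actual algorithm API. */
public class TimingSketch {

    // Placeholder: in the real setting this would invoke the actual generation algorithm.
    static int[][] generate(int[] domainSizes, int strength) {
        return new int[0][];
    }

    public static void main(String[] args) {
        int[] domainSizes = {3, 3, 4, 2, 5}; // example instance with 5 parameters
        int strength = 3;

        long start = System.nanoTime();
        int[][] suite = generate(domainSizes, strength);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Cost = number of generated test cases; efficiency = time to generate them.
        System.out.println("cost = " + suite.length + " test cases, time = " + elapsedMs + " ms");
    }
}
```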

For the multi-objective cost-efficiency evaluation (IPOG-F, jenny), we followed the same two steps previously mentioned: transformation of the two-dimensional cost-efficiency representation into a one-dimensional one, and usage of statistical tests, such as the t-test or the nonparametric Wilcoxon Signed Rank test (Kohl 2015), to compare each pair of solutions (TTR 1.2 and the other one). To address the nondeterminism of the algorithms/tools, we again generated test cases with 5 variations in the order of parameters and values, and took the average of these 5 executions into account for the statistical tests. Hence, we obtained the points \((c_{A_i}, t_{A_i})\) and calculated the Euclidean distances from the optimal point (0,0) to \((c_{A_i}, t_{A_i})\). Then, we checked data normality and, based on the result, we used the paired, two-sided t-test with α = 0.05 (normal data) or the nonparametric paired, two-sided Wilcoxon Signed Rank test, or its asymptotic version, with α = 0.05 (non-normal data).

For the evaluation of cost (PICT, IPO-TConfig, ACTS), we did not need to transform two dimensions into one because it is a single-dimension problem. The optimal point here is the value 0, and the Euclidean distance from 0 to \(c_{A_i}\) (average cost of algorithm A for instance i, 1 ≀ i ≀ 80) is \(|0 - c_{A_i}| = |c_{A_i}|\). We then performed the statistical evaluation just as in the multi-objective case.
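The decision procedure of both evaluations can be summarized by the sketch below. It is illustrative only: the use of Apache Commons Math 3 for the paired t-test and the Wilcoxon signed-rank test is an assumption of this sketch, not necessarily the tooling used in the experiments, and the normality check (e.g., Shapiro-Wilk) must be performed separately.

```java
import org.apache.commons.math3.stat.inference.TTest;
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

/** Sketch of the decision procedure; library choice and names are illustrative assumptions. */
public class DecisionProcedure {

    /** Multi-objective case: distance from the ideal point (0,0) to (cost, time)
     *  for each of the 80 instances (values already averaged over the 5 input-order runs). */
    static double[] costEfficiencyDistances(double[] avgCost, double[] avgTime) {
        double[] d = new double[avgCost.length];
        for (int i = 0; i < d.length; i++) {
            d[i] = Math.hypot(avgCost[i], avgTime[i]); // sqrt(c^2 + t^2)
        }
        return d;
    }

    /** Single-objective (cost only) case: |0 - c| = |c|. */
    static double[] costDistances(double[] avgCost) {
        double[] d = new double[avgCost.length];
        for (int i = 0; i < d.length; i++) {
            d[i] = Math.abs(avgCost[i]);
        }
        return d;
    }

    /** Paired two-sided comparison of TTR 1.2 against another solution, alpha = 0.05.
     *  'normal' must come from a separate normality analysis (e.g., Shapiro-Wilk),
     *  which this library does not provide. */
    static boolean significantDifference(double[] ttr, double[] other, boolean normal) {
        double p = normal
                ? new TTest().pairedTTest(ttr, other)
                : new WilcoxonSignedRankTest().wilcoxonSignedRankTest(ttr, other, false);
        return p < 0.05;
    }
}
```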

6.4 Results, discussion and validity

In this section, we present the outcomes of both assessments of our second controlled experiment. As in the first controlled experiment, to compare TTR 1.2 with IPOG-F, jenny, PICT, IPO-TConfig, and ACTS, we considered four evaluation classes: all, low, medium, and high strengths. Table 16 presents the Euclidean distances of part of the 80 samples (all strengths class only; complete data are in (Balera and Santiago Júnior 2017)) and the average values, \(\overline {x}\). Table 17 presents the results of the analysis of data normality (p-value (p) and skewness) for all evaluation classes. In this table, Sol 1 is the other solution and Sol 2 is TTR 1.2. Figures 12 and 13 present the Q-Q plots and histograms for all strengths, Figs. 14 and 15 for lower strengths, Figs. 16 and 17 for medium strengths, and Figs. 18 and 19 for higher strengths, respectively.

Fig. 12 Experiment 2: Q-Q plots. a IPOG-F; b jenny; c PICT; d IPO-TConfig; e ACTS - All Strengths

Fig. 13 Experiment 2: Histograms. a IPOG-F; b jenny; c PICT; d IPO-TConfig; e ACTS - All Strengths

Fig. 14 Experiment 2: Q-Q plots. a ACTS; b IPO-TConfig; c IPOG-F; d jenny; e PICT - Lower Strengths

Fig. 15 Experiment 2: Histograms. a ACTS; b IPO-TConfig; c IPOG-F; d jenny; e PICT - Lower Strengths

Fig. 16 Experiment 2: Q-Q plots. a ACTS; b IPO-TConfig; c IPOG-F; d jenny; e PICT - Medium Strengths

Fig. 17 Experiment 2: Histograms. a ACTS; b IPO-TConfig; c IPOG-F; d jenny; e PICT - Medium Strengths

Fig. 18 Experiment 2: Q-Q plots. a ACTS; b IPO-TConfig; c IPOG-F; d jenny; e PICT - Higher Strengths

Fig. 19 Experiment 2: Histograms. a ACTS; b IPO-TConfig; c IPOG-F; d jenny; e PICT - Higher Strengths

Table 16 Experiment 2 - Results of the analysis of Euclidean Distance (all strengths)
Table 17 Experiment 2 - Results of the analysis of data normality

Again we note that these data did not come from a normally distributed population. The nonparametric paired, two-sided Wilcoxon Signed Rank test, or its asymptotic variation, was then applied. Table 18 presents the p-value (p), |z|, and additional information for the all and low strengths classes, while Table 19 shows the results for medium and high strengths. We should mention that in 23 instances (3 with strength = 4, 12 with strength = 5, and 8 with strength = 6) jenny was not able to generate test cases, for some input orders of the parameters, due to an out-of-memory issue. Specifically, jenny failed to finish when the test suite size exceeded 1,000 test cases. A similar outcome happened with IPO-TConfig: even after waiting for about 6 hours, it produced no output and hence did not create test cases in 20 instances (3 with strength = 4, 9 with strength = 5, and 8 with strength = 6). In these cases, we adopted a penalty policy: in order to still take these unsuccessful participants into account, we doubled the respective measure obtained by TTR 1.2 (average value of the Euclidean distance) and assigned it to jenny and IPO-TConfig. We believe that this is a fair decision because TTR 1.2 was able to finish generating test cases for all 80 instances.
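A minimal sketch of this penalty policy is shown below; the method and array names are illustrative, with array indices corresponding to the 80 instances.

```java
/** Sketch of the penalty policy: when a competing tool failed on an instance,
 *  its measure is replaced by twice the average Euclidean distance obtained
 *  by TTR 1.2 for that same instance. Names are illustrative. */
public class PenaltyPolicy {

    static double[] applyPenalty(double[] ttrDistances, double[] otherDistances, boolean[] otherFailed) {
        double[] adjusted = otherDistances.clone();
        for (int i = 0; i < adjusted.length; i++) {
            if (otherFailed[i]) {
                adjusted[i] = 2.0 * ttrDistances[i]; // penalize the unsuccessful participant
            }
        }
        return adjusted;
    }
}
```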

Table 18 Results of the Wilcoxon test (all and low strengths)
Table 19 Results of the Wilcoxon test (medium and high strengths)

As shown in Table 18, for the all strengths class, two Null Hypotheses were rejected: H 0.2 (TTR 1.2 × IPOG-F) and H 0.5 (TTR 1.2 × IPO-TConfig). TTR 1.2 was better (lower average value of the Euclidean distances) than IPO-TConfig but worse than IPOG-F. There is no difference between TTR 1.2 and jenny, PICT, or ACTS.

As in controlled experiment 1, TTR 1.2 did not demonstrate good performance for low strengths. There is no difference between TTR 1.2 and IPO-TConfig. In all the other comparisons, the Null Hypothesis was rejected and TTR 1.2 was worse than the other solutions. This can be attributed to the fact that the algorithm favors test cases whose parameter interactions generate a large number of t-tuples, which is usually the case for larger strengths: the algorithm gives priority to covering the interaction of parameters with the greatest number of t-tuples.

For medium strengths, TTR 1.2 had mixed results. The Null Hypothesis H 0.6 (TTR 1.2 × ACTS) could not be rejected, and our algorithm was better than IPO-TConfig, while IPOG-F, jenny, and PICT surpassed TTR 1.2.

The greatest advantage of TTR 1.2 turned out to be again for higher strengths. Recall that TTR 1.2 does not create the matrix of t-tuples at the beginning, and this can potentially benefit our solution compared with the other five for higher strengths. Note that TTR 1.2 was better than jenny, PICT, IPO-TConfig, and ACTS. The only exception is the comparison against IPOG-F where the Null Hypothesis, H 0.2, could not be rejected and thus there is no statistical difference between both approaches.

In general, we can say that IPOG-F presented the best performance compared with TTR 1.2: IPOG-F was better for the all, lower, and medium strengths classes. For higher strengths, there was a statistical draw between the two approaches. An explanation for IPOG-F being better than TTR 1.2 is that TTR 1.2 ends up performing more interactions than IPOG-F. In general, the efficiency of IPOG-F is better than that of TTR 1.2, which influenced the cost-efficiency result. However, if we look at cost in isolation for all strengths, the average test suite size generated via TTR 1.2 (734.50) is better than that of IPOG-F (770.88).

As we have just stated, for higher strengths, TTR 1.2 is better than two IPO-based approaches (IPO-TConfig and ACTS/IPOG-F2), but there is no difference between our own implementation of IPOG-F and TTR 1.2. This can be explained as follows. The way the array that stores all t-tuples is constructed influences the order in which the t-tuples are evaluated by the algorithm. However, IPOG-F does not prescribe how this array should be built, leaving it to the developer to define the best way. As the order in which the parameters are presented to the algorithms alters the number of generated test cases, as previously stated, the order in which the t-tuples are evaluated can also produce a certain difference in the final result.

The conclusion of the two evaluations of this second experiment is that our solution is better and quite attractive for the generation of test cases considering higher strengths (5 and 6), where it was superior to basically all other algorithms/tools. Certainly, the main fact contributing to this result is that TTR 1.2 does not create the matrix of t-tuples at the beginning, which makes our solution more scalable (for higher strengths) in terms of cost-efficiency or cost compared with the other strategies. However, for low strengths, other greedy approaches, like IPOG-F, may be better alternatives.

As before, by making a comparison between pairs of solutions (TTR 1.2 × other) in both assessments (cost-efficiency and cost), we can say that we have high conclusion, internal, and construct validity. Regarding the external validity, we believe that we selected a significant population for our study. Detailed explanations have been given in Section 5.1 and remain valid here.

7 Related work

In this section we present some relevant studies related to greedy algorithms for CIT. The IPO algorithm (Lei and Tai 1998) is a very traditional solution designed for pairwise testing. Several approaches are based on IPO, such as IPOG, IPOG-D (Lei et al. 2007), IPOG-F, IPOG-F2 (Forbes et al. 2008), IPOG-C (Yu et al. 2013), IPO-TConfig (Williams 2000), ACTS (where IPOG, IPOG-D, IPOG-F, and IPOG-F2 are implemented) (Yu et al. 2013), and CitLab (Cavalgna et al. 2013). All IPO-based proposals have in common the fact that they perform horizontal and vertical growth to construct the final test suite. Moreover, some need two auxiliary matrices, which may decrease their performance by demanding more computer memory. Such algorithms also accomplish exhaustive comparisons within each horizontal extension, which may penalize efficiency.

IPOG-F (Forbes et al. 2008) is an adaptation of the IPOG algorithm (Lei et al. 2007). Through two main steps, horizontal and vertical growth, an MCA is built. Both growths work based on an initial solution. The algorithm is supported by two auxiliary matrices, which may decrease its performance by demanding more computer memory. Moreover, the algorithm performs exhaustive comparisons within each horizontal extension, which may cause longer execution times. On the other hand, TTR 1.2 only needs one auxiliary matrix to work and it does not generate, at the beginning, the matrix of t-tuples. These features make our solution better for higher strengths (5, 6), even though we did not find statistical difference when we compared TTR 1.2 with our own implementation of IPOG-F (Section 6.4).

IPO-TConfig is an implementation of IPO in the TConfig tool (Williams 2000). The TConfig tool can generate test cases based on strengths varying from 2 to 6. However, it is not entirely clear whether the IPOG algorithm (Lei et al. 2007) was implemented in the tool or if another approach was chosen for t-way testing. In our empirical evaluation, TTR 1.2 was superior to IPO-TConfig not only for higher strengths (5, 6) but also for all strengths (from 2 to 6). Moreover, IPO-TConfig was unable to generate test cases in 25% of the instances (strengths 4, 5, 6) we selected.

The ACTS tool (Yu et al. 2013) is one of the most used CIT tools to date. Several variations of IPO are implemented in ACTS: IPOG, IPOG-D (Lei et al. 2007), IPOG-F, and IPOG-F2 (Forbes et al. 2008). The implementation of our algorithm performed better in terms of cost, compared with IPOG-F2/ACTS, for higher strengths. However, both solutions performed similarly when we considered all strengths.

IPOG-C (Yu et al. 2013) generates MCAs considering constraints. It is an adaptation of IPOG where constraint handling is provided via a SAT solver. Its greatest contribution is a set of three optimizations that seek to reduce the number of calls to the SAT solver. As IPOG-C is based on IPOG, it accomplishes exhaustive comparisons in the horizontal growth, which may lead to longer execution times. Besides, each t-tuple must be evaluated to check whether it is valid or not.

The algorithm implemented in the PICT tool (Czerwonka 2006) has two main phases: preparation and generation. In the first phase, the algorithm generates all t-tuples to be covered. In the second phase, it generates the MCA. The upfront generation of all t-tuples can be a drawback, since a large number of tuples demands considerable storage space. With respect to its application, the tool seems best suited to lower strengths (Yamada et al. 2016). Other tools have been created based on PICT (PictMaster 2017).

The jenny tool is implemented in C (Jenkins 2016). It is a lightweight greedy tool, but one of its limitations is the number of parameters it handles: from 2 to 52. In the controlled experiment we performed, TTR 1.2 was superior to jenny for higher strengths (5, 6), but they presented similar performance for all strengths (from 2 to 6). In 27.5% of the samples (strengths 4, 5, 6), jenny could not create test cases, as we have mentioned before.

Automatic Efficient Test Generator (AETG) (Cohen et al. 1997) is based on algorithms that use ideas from statistical experimental design theory to minimize the number of tests needed for a specific level of coverage of the input test space. AETG generates test cases by means of Experimental Designs (ED) (Cochran and Cox 1950), statistical techniques used for planning experiments so that one can extract the maximum possible information from as few experiments as possible. It relies on greedy algorithms and the test cases are constructed one at a time, i.e. it does not use an initial solution.

In (Cavalgna et al. 2013), a new tool for generating MCAs with constraint handling support is presented: CitLab. Like ACTS, CitLab offers several algorithms for test suite generation: AETG, IPO, and others. The bottom line is that test case generation is only one of the characteristics of the tool. Like ACTS, CitLab does not present a new algorithm; it implements algorithms proposed in the literature. Hence, the same limitations of the existing proposals also apply here.

The Feedback Driven Adaptive Combinatorial Testing Process (FDA-CIT) algorithm is presented in (Yilmaz et al. 2014). At each iteration of the algorithm, verification of the masking of potential defects is accomplished, isolating their probable causes and then generating a new configuration that omits such causes. The idea is that masked defects exist and that the proposed algorithm provides an efficient way of dealing with this situation before test execution. However, there is no assessment of the cost of the algorithm to generate MCAs.

In order to better compare the previous studies with our algorithm, TTR 1.2, in Table 20 we show some main characteristics of all the algorithms/tools. In this table, * means that the characteristic is present, - means that it is not present, and empty (blank space) means that either it is not totally evident that the algorithm/tool has such a feature or it is not applicable.

Table 20 Greedy algorithms/tools for CIT

8 Conclusions

This paper presented a novel CIT algorithm, called TTR, to generate test cases specifically via the MCA technique. TTR produces an MCA M, i.e. a test suite, by creating and reallocating t-tuples into this matrix M, considering a variable called goal (ζ). TTR is a greedy algorithm for unconstrained CIT.

TTR was implemented in Java and C (TTR 1.2) and we developed three versions of our algorithm. In this paper, we focused on the description of versions 1.1 and 1.2 since version 1.0 was detailed elsewhere (Balera and Santiago Júnior 2015).

We carried out two rigorous evaluations to assess the performance of our proposal. In total, we performed 3,200 executions related to 8 solutions (80 instances × 5 variations × 8). In the first controlled experiment, we compared versions 1.1 and 1.2 of TTR in order to know whether there is significant difference between both versions of our algorithm. In such experiment, we jointly considered cost (size of test suites) and efficiency (time to generate the test suites) in a multi-objective perspective. We conclude that TTR 1.2 is more adequate than TTR 1.1 especially for higher strengths (5, 6). This is explained by the fact that, in TTR 1.2, we no longer generate the matrix of t-tuples (Θ) but rather the algorithm works on a t-tuple by t-tuple creation and reallocation into M. This benefits version 1.2 so that it can properly handle higher strengths.

Having chosen version 1.2, we conducted another controlled experiment where we confronted TTR 1.2 with five other greedy algorithms/tools for unconstrained CIT: IPOG-F (Forbes et al. 2008), jenny (Jenkins 2016), IPO-TConfig (Williams 2000), PICT (Czerwonka 2006), and ACTS (Yu et al. 2013). In this case, we carried out two evaluations where in the first one we compared TTR 1.2 with IPOG-F and jenny since these were the solutions we had the source code (to precisely measure the time). Moreover, to address a possible evaluation bias in the time measures when comparing TTR 1.2 against jenny (developed in C), we also implemented it in C in addition to the standard implementation in Java. Hence, a cost-efficiency (multi-objective) evaluation was performed. In the second assessment, we did a cost (single objective) evaluation where TTR 1.2 was compared with PICT, IPO-TConfig, and ACTS. The conclusion is as previously stated: TTR 1.2 is better for higher strengths (5, 6) where only in one case our solution is not superior (in the comparison with IPOG-F where we have a draw). The fact of not creating the matrix of t-tuples at the beginning explains this result.

Therefore, considering the metrics we defined in this work and based on both controlled experiments, TTR 1.2 is a better option if we need to consider higher strengths (5, 6). For lower strengths, other solutions, like IPOG-F, may be better alternatives.

Thinking about the testing process as a whole, one important metric is the time to execute the test suite, which may eventually be even more relevant than other metrics. Hence, we need to run multi-objective controlled experiments where we execute all the test suites (TTR 1.1 × TTR 1.2; TTR 1.2 × other solutions), probably assigning different weights to the metrics. We also need to investigate the parallelization of our algorithm so that it can perform even better when subjected to a more complex set of parameters, values, and strengths. One possibility is to use the Compute Unified Device Architecture/Graphics Processing Unit (CUDA/GPU) platform (Ploskas and Samaras 2016). Finally, we intend to develop another multi-objective controlled experiment addressing the effectiveness (ability to detect defects) of our solution compared with the other five greedy approaches.

Notes

  1. Despite this classification, some algorithms/tools are both SAT and greedy-based.

  2. Some authors (Kuhn et al. 2013; Cohen et al. 2003) abbreviate a Mixed-Level Covering Array as CA too. However, as we have made an explicit distinction between Fixed-value and Mixed-Level arrays, we prefer to abbreviate it as MCA. Note that an MCA is naturally a Covering Array. We use this abbreviation simply to stress that our work relates to mixed- and not fixed-value arrays.

  3. Θ is a matrix whose order varies. In other words, TTR knows the number of columns beforehand (|f|), but the number of rows (|C|) depends on the interaction of the t-way parameters’ values. During the reallocation process, TTR removes rows until Θ is empty.

Abbreviations

ACTS: Advanced combinatorial test system

AETG: Automatic efficient test generator

CA: Covering array

CIT: Combinatorial interaction testing

CUDA: Compute unified device architecture

ED: Experimental designs

GA: Genetic algorithm

GPU: Graphics processing unit

IPOG: In-parameter-order-general

IPO-TConfig: In-parameter-order TConfig

MCA: Mixed-level covering array

MOA: Mixed-level orthogonal array

OA: Orthogonal array

OOP: Object-oriented programming

PICT: Pairwise independent combinatorial testing

SA: Simulated annealing

SWPDC: Software for the payload data handling computer

TSA: Tabu search approach

TTR: T-tuple reallocation

References

  • Ahmed, BS (2016) “Test case minimization approach using fault detection and combinatorial optimization techniques for configuration-aware structural testing”. Eng Sci Technol, Int J 19(2):737–753. http://www.sciencedirect.com/science/article/pii/S2215098615001706.

  • Balera, JM, Santiago Júnior VA (2015) “T-tuple Reallocation: An algorithm to create mixed-level covering arrays to support software test case generation” In: 15th International Conference on Computational Science and Its Applications (ICCSA), 503–517. Springer International Publishing, Berlin, Heidelberg.

  • Balera, JM, Santiago Júnior VA (2016) “A controlled experiment for combinatorial testing” In: Proceedings of the 1st Brazilian Symposium on Systematic and Automated Software Testing (SAST), 2:1–2:10. ACM, New York, NY, USA. http://doi.acm.org/10.1145/2993288.2993289.

  • Balera, JM, Santiago Júnior VA (2017) Data set. https://www.dropbox.com/sh/to3a47ncqpliq5l/AACj34JQ9S1I4fzQJf0xPZfva?dl=0. Accessed 17 Oct 2016.

  • Bryce, RC, Colbourn CJ (2006) “Prioritized interaction testing for pair-wise coverage with seeding and constraints”. Inf Softw Technol 48(10):960–970.

  • Cochran, WG, Cox GM (1950) “Experimental designs”. John Wiley & Sons, New York; Chichester.

  • Cohen, MB, Dalal SR, Fredman ML, Patton GC (1997) “The AETG system: an approach to testing based on combinatorial design”. IEEE Trans Softw Eng 23(7):437–444.

  • Cohen, MB, Dwyer MB, Shi J (2008) “Constructing interaction test suites for highly-configurable systems in the presence of constraints: A greedy approach”. IEEE Trans Softw Eng 34(5):633–650.

  • Cohen, MB, Gibbons PB, Mugridge WB, Colbourn CJ, Collofello JS (2003) “A variable strength interaction testing of components” In: Proceedings of 27th Annual Int. Comp. Software and Applic. Conf. (COMPSAC), 413–418. IEEE, USA.

  • Campanha, DN, Souza SRS, Maldonado JC (2010) “Mutation testing in procedural and object-oriented paradigms: An evaluation of data structure programs” In: Brazilian Symposium on Software Engineering, 90–99. IEEE, USA.

  • Cavalgna, A, Gargantini A, Vavassori P (2013) “Combinatorial interaction testing with citlab” In: Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, 376–382. IEEE, New York.

  • Czerwonka, J (2006) “Pairwise testing in the real world: Practical extensions to test-case generators” In: Proceedings 24th Pacific Northwest Software Quality Conference, 285–294. Academic Press, Portland.

  • Dalal, SR, A Jain NK, Leaton JM, Lott CM, Patton GC, Horowitz B (1999) “Model-based testing in practice” In: Proceedings 21st International Conference on Software Engineering (ICSE’99), 285–294. ACM, New York.

  • Delamaro, ME, de Lourdes dos Santos Nunes F, de Oliveira RAP (2013) “Using concepts of content-based image retrieval to implement graphical testing oracles”. Softw Test Verif Reliab 23:171–198. doi:10.1002/stvr.463.

  • Filho, RAM, Vergilio SR (2015) “A Mutation and Multi-objective Test Data Generation Approach for Feature Testing of Software Product Lines” In: 29th Brazilian Symposium on Software Engineering, Belo Horizonte.

  • Forbes, M, Lawrence J, Lei Y, Kacker RN, Kuhn DR (2008) “Refining the in-parameter-order strategy for constructing covering arrays”. J Res Natl Inst Stand Technol 113(5):287–297.

  • Garvin, BJ, Cohen MB, Dwyer MB (2011) “Evaluating improvements to a meta-heuristic search for constrained interaction testing”. Empirical Softw Eng 16(1):61–102.

  • Hernandez, LG, Valdez NR, Jimenez JT (2010) “Construction of mixed covering arrays of variable strength using a tabu search approach”. Springer International Publishing, Berlin, Heidelberg.

  • Huang, CY, Chen CS, Lai CE (2016) “Evaluation and analysis of incorporating fuzzy expert system approach into test suite reduction”. Inf Softw Technol 79:79–105. http://www.sciencedirect.com/science/article/pii/S0950584916301197.

  • Jenkins, B (2016) “Jenny: A pairwise tool”. http://burtleburtle.net/bob/math/jenny.html. Accessed 6 June 2016.

  • Khan, SUR, Lee SP, Ahmad RW, Akhunzada A, Chang V (2016) “A survey on test suite reduction frameworks and tools”. Int J Inf Manag 36(6, Part A):963–975. http://www.sciencedirect.com/science/article/pii/S0268401216303437.

  • Kohl, M (2015) “Introduction to statistical data analysis with R”. bookboon.com, London.

  • Kuhn, DR, Wallace DR, Gallo AM (2004) “Software fault interactions and implications for software testing”. IEEE Trans Softw Eng 30(6):418–421. http://doi.ieeecomputersociety.org/10.1109/TSE.2004.24.

  • Kuhn, RD, Kacker RN, Lei Y (2013) “Introduction to Combinatorial Testing”. Chapman and Hall/CRC, USA.

  • Lei, Y, Kacker R, Kuhn DR, Okun V, Lawrence J (2007) “IPOG: A general strategy for t-way software testing”.

  • Lei, Y, Tai K-C (1998) “In-Parameter-Order: A test generation strategy for pairwise testing” In: Proceedings of the IEEE Int. Symp. on High-Assurance Syst. Eng. (HASE), 254–261. IEEE Computer Society Press, USA.

  • Mathur, AP (2008) “Foundations of software testing”. Dorling Kindersley (India), Pearson Education in South Asia, Delhi, India.

  • NIST National Institute of Standards and Technology (2015) “Automated combinatorial testing for software (ACTS)”. http://csrc.nist.gov/groups/SNS/acts/. Accessed 29 July 2017.

  • Oliveira, RAP (2017) “Test oracles for systems with complex outputs: the case of TTS systems”. PhD Thesis, Universidade de São Paulo, Brazil.

  • Pairwise (2017) “Pairwise Testing: Combinatorial Test Case Generation”. http://www.pairwise.org/tools.asp. Accessed 29 July 2017.

  • Petke, J, Cohen MB, Harman M, Yoo S (2015) “Practical combinatorial interaction testing: Empirical findings on efficiency and early fault detection”. IEEE Trans Softw Eng 41(9):901–924.

  • PictMaster (2017) “Combinatorial testing tool PictMaster”. https://osdn.net/projects/pictmaster/. Accessed 29 July 2017.

  • Ploskas, N, Samaras N (2016) “GPU Programming in MATLAB”. Morgan Kaufmann, Boston. http://www.sciencedirect.com/science/article/pii/B9780128051320099951.

  • Qu, X, Cohen MB, Woolf KM (2007) “Combinatorial interaction regression testing: A study of test case generation and prioritization” In: Proc. IEEE Int. Conf. Softw. Maintenance, 255–264. IEEE Computer Society Press, USA.

  • Santiago Júnior, VA (2011) “Solimva: A methodology for generating model-based test cases from natural language requirements and detecting incompleteness in software specifications”. PhD thesis, Instituto Nacional de Pesquisas Espaciais (INPE).

  • Santiago Júnior, VA, Silva FEC (2017) “From Statecharts into Model Checking: A Hierarchy-based Translation and Specification Patterns Properties to Generate Test Cases” In: Proceedings of the 2nd Brazilian Symposium on Systematic and Automated Software Testing (SAST), Fortaleza, 10–20. ACM Press, New York.

  • Santiago Júnior, VA, Vijaykumar NL (2012) “Generating model-based test cases from natural language requirements for space application software”. Softw Qual J 20(1):77–143. doi:10.1007/s11219-011-9155-6.

  • Schroeder, PJ, Korel B (2000) “Black-box test reduction using input-output analysis” In: Harold M (ed) Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’00), 173–177. ACM, New York.

  • Segall, I, Tzoref-Brill R, Farchi E (2011) “Using binary decision diagrams for combinatorial test design” In: Proceedings of the 2011 International Symposium on Software Testing and Analysis (ISSTA ’11), 254–264. ACM, New York.

  • Shapiro, SS, Wilk MB (1965) “An analysis of variance test for normality (complete samples)”. Biometrika 52(3-4):591.

  • Shiba, T, Tsuchiya T, Kikuno T (2004) “Using artificial life techniques to generate test cases for combinatorial testing” In: Proceedings 28th Int. Comput. Softw. Appl. Conf., Des. Assessment Trustworthy Softw.-Based Syst, 72–77. IEEE Computer Society Press, USA.

  • Stinson, DR (2004) “Combinatorial Designs: Constructions and Analysis”. Springer, New York.

  • Tai, KC, Lei Y (2002) “A test generation strategy for pairwise testing”. IEEE Trans Softw Eng 28(1):109–111.

  • Tzoref-Brill, R, Wojciak P, Maoz S (2016) “Visualization of combinatorial models and test plans” In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), 144–154. IEEE, USA.

  • Williams, AW (2000) “Determination of test configurations for pairwise interaction coverage” In: Testing of Communicating Systems: Tools and Techniques, IFIP TC6/WG6.1 13th International Conference on Testing Communicating Systems (TestCom 2000), August 29 - September 1, 2000, 59–74, Ottawa, Canada.

  • Wohlin, C, Runeson P, Host M, Ohlsson MC, Regnell B, Wesslén A (2012) “Experimentation in Software Engineering”. Springer-Verlag Berlin Heidelberg, Germany.

  • Yamada, A, Kitamura T, Artho C, Choi E, Oiwa Y, Biere A (2015) “Optimization of combinatorial testing by incremental SAT solving”. IEEE, USA.

  • Yamada, A, Biere A, Artho C, Kitamura T, Choi EH (2016) “Greedy combinatorial test case generation using unsatisfiable cores” In: Proceedings of the 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), 614–624. IEEE, USA.

  • Yilmaz, C, Cohen MB, Porter A (2014) “Reducing masking effects in combinatorial interaction testing: A feedback driven adaptive approach”. IEEE Trans Softw Eng:43–66.

  • Yoo, S, Harman M (2012) “Regression testing minimization, selection and prioritization: A survey”. Softw Test Verif Reliab 22(2):67–120. https://dl.acm.org/citation.cfm?id=2284813.

  • Yu, L, Lei Y, Nourozborazjany M, Kacker RN, Kuhn DR (2013) “An efficient algorithm for constraint handling in combinatorial test generation” In: 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, 242–251. IEEE, New York.

  • Yu, L, Lei Y, Kacker RN, Kuhn DR (2013) “ACTS: A combinatorial test generation tool” In: Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, 370–375. IEEE, New York.


Acknowledgements

The authors would like to thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for supporting this research and Leoni Augusto Romain da Silva for his support in running part of the second controlled experiment.

Funding

This work was partially funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) through a scholarship granted to the first author (JMB).

Availability of data and materials

Full data obtained during the experiments are in (Balera and Santiago Júnior 2017).

Author information

Contributions

JMB worked in the definitions and implementations of all three versions of the TTR algorithm, and carried out the two controlled experiments. VASJ worked in the definitions of the TTR algorithm, and in the planning, definitions, and executions of the two controlled experiments. All authors contributed to all sections of the manuscript. All authors read and approved the submitted manuscript.

Corresponding author

Correspondence to Juliana M. Balera.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Balera, J., Santiago Júnior, V. An algorithm for combinatorial interaction testing: definitions and rigorous evaluations. J Softw Eng Res Dev 5, 10 (2017). https://doi.org/10.1186/s40411-017-0043-z


Keywords