
Similarity testing for role-based access control systems



Access control systems demand rigorous verification and validation; otherwise, faults can lead to security breaches. Finite state machine based testing has been successfully applied to RBAC systems and yields effective test suites, but these suites are very expensive to execute. To deal with this cost, test prioritization techniques can be applied to improve fault detection along test execution. Recent studies have shown that similarity functions can be very efficient at prioritizing test cases. This technique is named similarity testing and relies on the hypothesis that resembling test cases tend to have similar fault detection capabilities. Thus, little is gained from running similar test cases, and the fault detection rate can be improved by increasing test diversity.


In this paper, we propose a similarity testing approach for RBAC systems named RBAC similarity and compare it to simple dissimilarity and random prioritization. RBAC similarity combines the dissimilarity degree of pairs of test cases with their relevance to the RBAC policy under test to maximize test diversity and the coverage of its constraints.


Five RBAC policies and fifteen test suites were prioritized using each of the three test prioritization techniques and compared using the Average Percentage Faults Detected metric.


Our results showed that combining the dissimilarity degree with the relevance of a test case to the RBAC policy, as done in RBAC similarity, can be more effective than random prioritization and simple dissimilarity in most of the cases.


The RBAC similarity criterion is suitable as a test prioritization criterion for test suites generated from finite state machine models specifying RBAC systems.


Access control is one of the major pillars of software security. It is responsible for ensuring that only intended users can access data and that only the permissions required to accomplish a task are granted (Ferraiolo et al. 2007). In this context, the Role-Based Access Control (RBAC) model has been established as one of the most significant access control paradigms. In RBAC, users receive privileges through role assignments and activate them during sessions (ANSI 2004). Despite its simplicity, mistakes can occur during development and lead to faults or even security breaches. Therefore, software verification and validation become necessary.

Finite State Machines (FSMs) have been widely used for model-based testing (MBT) of reactive systems (Broy et al. 2005). Previous investigations using random FSMs have shown that recent test generation methods (e.g., SPY (Simão et al. 2009)), compared to traditional methods (e.g., W (Chow 1978) and HSI (Petrenko and Bochmann 1995)), tend to rely on fewer and longer test cases, reducing the overall test cost without impacting test effectiveness (Endo and Simao 2013). In the RBAC domain, although very effective and less costly, recent test generation methods still tend to output large numbers of test cases (Damasceno et al. 2016). Thus, there is a need for additional steps during software testing, such as test prioritization (Mouelhi et al. 2015).

Test case prioritization aims at finding an ideal ordering of test cases so that maximum benefits can be obtained, even if test execution is prematurely halted at some arbitrary point (Yoo and Harman 2012). A test prioritization criterion that has recently shown very promising results is similarity testing (Cartaxo et al. 2011; Bertolino et al. 2015). In similarity testing, we assume that resembling test cases tend to cover identical parts of an SUT and have equivalent fault detection capabilities, so that no additional gain can be expected from executing both. This concept has been investigated in the MBT (Cartaxo et al. 2011), access control testing (Bertolino et al. 2015), and software product line (SPL) testing (Henard et al. 2014) domains, but it has never been applied to RBAC. Moreover, since the fault detection effectiveness of a test criterion is strongly related to its ability to represent faults of specific domains (Felderer et al. 2015), similarity testing may not necessarily be effective in the RBAC domain.

In this paper, we investigate similarity testing for RBAC systems. A similarity testing criterion named RBAC similarity is introduced and compared to the random prioritization and simple dissimilarity criteria using the Average Percentage Faults Detected (APFD) metric, five RBAC policies, and three FSM-based testing methods. Our results show that RBAC similarity makes test prioritization more suitable to the specificities of the RBAC model and achieves higher APFD values than simple dissimilarity and random prioritization in most of the cases.

This paper is organized as follows: Section 2 shows the theoretical background related to our investigation. Sections 2.1 to 2.3 give a brief introduction to FSM-based testing. The RBAC model and an FSM-based testing approach for RBAC systems are introduced in Sections 2.4 and 2.5. The test case prioritization problem and similarity testing are discussed in Section 2.6. Section 3 details our proposed similarity testing criterion named RBAC similarity. Section 4 describes the experiment we performed to compare RBAC similarity to the simple dissimilarity and random prioritization techniques. The results obtained from our experiments are analyzed and discussed in Section 5. The threats to validity and final remarks are presented in Sections 6 and 7, respectively.


This section introduces the background behind our similarity testing approach for RBAC systems. First, we present the concept of FSM-based testing and the three test generation methods (i.e., W, HSI, and SPY) considered in this study. Second, the RBAC model and an FSM-based testing approach for RBAC systems are described. Finally, the test case prioritization problem and the specificities of similarity testing are detailed.

Finite state machine based testing

A Finite State Machine (FSM) is a hypothetical machine composed of states and transitions (Gill 1962). Formally, an FSM can be defined as a tuple M=(S,s0,I,O,D,δ,λ) where S is a finite set of states, s0 ∈ S is the initial state, I is the set of input symbols, O is the set of output symbols, D ⊆ S×I is the specification domain, δ: D → S is the transition function, and λ: D → O is the output function. An FSM always has a single current (origin) state s_i ∈ S which changes to a destination (tail) state s_j ∈ S by applying an input x ∈ I, where s_j = δ(s_i,x), and returns an output y = λ(s_i,x). An input x is defined for s if in state s there is a transition consuming input x (i.e., (s,x) ∈ D). Such a transition is said to be defined. An FSM is complete if all inputs are defined for all states; otherwise, it is partial. Figure 1 depicts an example of a complete FSM with three states {q0,q1,q2}.

Fig. 1 Example of complete FSM

A sequence α = x1x2...x_n ∈ I* is defined for state s ∈ S if there are states s1,s2,...,s_{n+1} such that s = s1 and δ(s_i,x_i) = s_{i+1}, for all 1 ≤ i ≤ n. The concatenation of two sequences α and ω is denoted as αω. A sequence α is a prefix of a sequence β, denoted by α ≤ β, if β = αω for some input sequence ω. An empty sequence is denoted by ε, and a sequence α is a proper prefix of β, denoted by α < β, if β = αω for some ω ≠ ε. The set of prefix sequences of a set T is defined as pref(T) = {α | ∃β ∈ T and α < β}. If T = pref(T), then T is prefix-closed.

The transition and output functions can be lifted to input sequences as usual; for the empty sequence ε, we have that δ(s,ε) = s and λ(s,ε) = ε. For a sequence αx defined for state s, we have that δ(s,αx) = δ(δ(s,α),x) and λ(s,αx) = λ(s,α)λ(δ(s,α),x). A sequence α = x1x2...x_n ∈ I* is a transfer sequence from s to s_{n+1} if δ(s,α) = s_{n+1}; thus, s_{n+1} is reachable from s. If every state of an FSM is reachable from s0, then it is initially connected, and if every state is reachable from all states, it is strongly connected.
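The lifting of δ and λ to input sequences can be illustrated with a minimal Python sketch. The class name `FSM`, the method `run`, and the three-state machine below are hypothetical and chosen only for illustration; in particular, the machine is not the FSM of Fig. 1, and input symbols are assumed to be single characters.

```python
from typing import Dict, List, Tuple

State = str
Symbol = str


class FSM:
    """Deterministic FSM M = (S, s0, I, O, D, delta, lambda) as a lookup table."""

    def __init__(self, s0: State,
                 transitions: Dict[Tuple[State, Symbol], Tuple[State, Symbol]]):
        self.s0 = s0
        # The keys form the specification domain D; each key (s, x) maps to
        # the pair (delta(s, x), lambda(s, x)).
        self.transitions = transitions

    def run(self, s: State, alpha: str) -> Tuple[State, List[Symbol]]:
        """Return (delta(s, alpha), lambda(s, alpha)) for an input sequence."""
        outputs: List[Symbol] = []
        for x in alpha:
            if (s, x) not in self.transitions:
                raise ValueError(f"input {x!r} is not defined for state {s!r}")
            s, y = self.transitions[(s, x)]
            outputs.append(y)
        return s, outputs


# Hypothetical complete FSM with inputs {a, b} and outputs {0, 1}.
M = FSM("q0", {
    ("q0", "a"): ("q1", "0"), ("q0", "b"): ("q2", "0"),
    ("q1", "a"): ("q1", "1"), ("q1", "b"): ("q0", "0"),
    ("q2", "a"): ("q0", "0"), ("q2", "b"): ("q2", "1"),
})

final_state, outputs = M.run("q0", "ab")   # delta(q0, ab) and lambda(q0, ab)
print(final_state, outputs)
```

For the empty sequence, `run` returns the starting state and an empty output list, which mirrors δ(s,ε) = s and λ(s,ε) = ε.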

The symbol Ω(s) denotes all input sequences defined for a state s, and Ω_M abbreviates Ω(s0), which refers to all defined input sequences for an FSM M. A separating sequence for two states s_i and s_j is a sequence γ such that γ ∈ Ω(s_i) ∩ Ω(s_j) and λ(s_i,γ) ≠ λ(s_j,γ). In addition, if γ is able to distinguish every pair of states of an FSM, it is a distinguishing sequence. Considering the FSM presented in Fig. 1, the sequence a is a separating sequence for states q0 and q1 since λ(q0,a) = 0 and λ(q1,a) = 1.

Two FSMs M_S = (S,s0,I,O,D,δ,λ) and M_I = (S,s0′,I,O,D,δ,λ) are equivalent if their initial states are equivalent. Two states s_i, s_j are equivalent if ∀α ∈ Ω(s_i) ∩ Ω(s_j), λ(s_i,α) = λ(s_j,α). An FSM M may have a reset operation, denoted by r, which takes the machine to s0 regardless of the current state. An input sequence α ∈ Ω_M starting with a reset symbol r is a test case of M. A test suite T consists of a finite set of test cases of M such that there are no α, β ∈ T where α < β. Proper prefixes α < β are excluded from a test suite since the execution of β implies the execution of α. The length of a test case α is represented by |α| and describes the cost of executing α plus the reset operation. The number of test cases of a test suite T, denoted by |T|, also describes the number of resets of T.

Mutation analysis in FSM-based testing

In FSM-based testing, given a specification M, the symbol I(M) denotes the set of all deterministic FSMs, variants of M, with the same inputs as M and for which all sequences in Ω_M are defined. The set I(M) is called the fault domain for M. These variants of M are named mutants and can be obtained either manually or automatically by performing simple syntactic changes using mutation operators (Andrews et al. 2006). Given m ≥ 1, I_m(M) denotes all FSMs of I(M) with at most m states. Given a specification M with n states, a test suite T ⊆ Ω_M is m-complete if for each N ∈ I_m(M) distinguishable from M, there is a test case t ∈ T that distinguishes M from N. The following mutation operators are often used in FSM-based testing (Chow 1978): change initial state (CIS), which changes the s0 of an FSM to s_k, such that s0 ≠ s_k; change output (CO), which modifies the output of a transition (s,x), using a different function Λ(s,x) instead of λ(s,x); change tail state (CTS), which modifies the destination state of a transition (s,x), using a different function Δ(s,x) instead of δ(s,x); and add extra state (AES), which inserts a new state such that mutant N is equivalent to M. Figure 2 shows examples of mutants of the FSM shown in Fig. 1 using the CIS, CO, CTS, and AES operators. Changes are marked with an asterisk (*).

Fig. 2 Examples of FSM mutants. a FSM mutant - CIS. b FSM mutant - CO. c FSM mutant - CTS. d FSM mutant - AES

If the output of a mutant differs from that of the original FSM for at least one test case, the mutant is distinguished (or killed) and the seeded fault denoted by the mutant is detected. Moreover, some mutants can be syntactically different but functionally equivalent to the original model. These are called equivalent mutants. The process of analyzing whether test cases trigger failures and kill mutants is called mutation analysis and is often used in software testing research (Jia and Harman 2011; Fabbri et al. 1994).

The main outcome of the mutation analysis is the mutation score, which indicates the effectiveness of a test suite. Given a test suite T, the mutation score (or effectiveness) can be calculated using the equation \(T_{\text {eff}}=\tfrac {\#km}{(\#tm-\#em)}\). The #km parameter represents the number of killed mutants; the #tm defines the total number of generated mutants; and #em denotes the number of mutants equivalent to the original SUT. Thus, the mutation score consists of the ratio of the number of detected faults over the total number of non-equivalent mutants. An m-complete test suite has full fault coverage for a given domain I m (M) and can detect all faults in any FSM implementation with at most m states. Thus, it scores 1.0, by definition.
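As a small worked illustration of the mutation score equation, the following Python sketch computes T_eff from the three counts. The function name `mutation_score` and the example numbers are hypothetical and do not come from the experiments reported in this paper.

```python
def mutation_score(killed: int, total: int, equivalent: int) -> float:
    """Compute T_eff = #km / (#tm - #em)."""
    non_equivalent = total - equivalent
    if non_equivalent <= 0:
        raise ValueError("there must be at least one non-equivalent mutant")
    if not 0 <= killed <= non_equivalent:
        raise ValueError("killed mutants must lie between 0 and #tm - #em")
    return killed / non_equivalent


# Hypothetical example: 60 mutants, 10 of them equivalent, 45 killed.
score = mutation_score(killed=45, total=60, equivalent=10)
print(score)  # 45 / 50
```

A test suite that kills every non-equivalent mutant scores 1.0 under this definition.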

FSM-based testing methods

FSM-based testing relies on FSM models to derive test cases and evaluate if the behavior of an SUT conforms to its specification (Utting et al. 2012). To check this behavioral conformance, two basic sets of sequences are often used: the state cover (Q) and transition cover (P) sets (Broy et al. 2005).

A set of input sequences Q is a state cover set of M if for each state s_i ∈ S there exists an α ∈ Q such that δ(s0,α) = s_i, and ε ∈ Q to reach the initial state. A set of input sequences P is named a transition cover set of M if for each transition (s,x) ∈ D there are sequences α, αx ∈ P such that δ(s0,α) = s, and ε ∈ P to reach the initial state. The transition cover set of an FSM is obtained by generating the testing tree of this FSM (Broy et al. 2005). The state and transition cover sets of the FSM depicted in Fig. 1 are respectively Q = {ε,a,b} and P = {ε,a,aa,ba,b,ab,bb}. After obtaining state and transition coverage, FSM-based testing methods require some pre-defined sets to identify the reached parts of an FSM. These are the characterization set and separating families.

A characterization set (W set) contains at least one input sequence which distinguishes each pair of states of an FSM. Formally, it means that ∀ s_i, s_j ∈ S, i ≠ j, ∃ α ∈ W such that λ(s_i,α) ≠ λ(s_j,α).

A separating family, or set of harmonized state identifiers, is a set of sequences H_i for each state s_i ∈ S that satisfies the condition ∀ s_i, s_j ∈ S, s_i ≠ s_j, ∃ β ∈ H_i, γ ∈ H_j that have a common prefix α such that α ∈ Ω(s_i) ∩ Ω(s_j) and λ(s_i,α) ≠ λ(s_j,α). In the worst case, the separating family is the W set itself.

The characterization set of the FSM model shown in Fig. 1 is W = {a,b}, and the state identifiers of states q0, q1, and q2 are respectively H_0 = {a,b}, H_1 = {a}, and H_2 = {b}. These sets are building blocks for most traditional and recent testing methods, such as W (Chow 1978; Vasilevskii 1973), HSI (Petrenko and Bochmann 1995), and SPY (Simão et al. 2009).

W method

The W method is the most classic FSM-based test generation algorithm (Chow 1978; Vasilevskii 1973). It uses the P set, to traverse all transitions, concatenated with the W set, for state identification. Moreover, it can also detect an estimated number of extra states using a traversal set \(\bigcup ^{m-n}_{i=0}(I^{i})\), such that (m−n) is the number of extra states and I^i contains all sequences of length i combining symbols of I. Thus, by concatenating P, the traversal set, and W, the W method can detect (m−n) extra states (e.g., AES mutants). Assuming the FSM in Fig. 1, no extra states (m = n), and the removal of proper prefixes, the W method can generate T_W = {aaa,aab,aba,abb,baa,bab,bba,bbb}, and |T_W| = 8.
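The concatenation step of the W method can be sketched in a few lines of Python, using only the sets P and W quoted above. The helper names (`concat`, `remove_proper_prefixes`, `w_method`) are hypothetical, and sequences are assumed to be strings over single-character input symbols, with the empty string standing for ε.

```python
from itertools import product
from typing import Iterable, Set


def concat(A: Iterable[str], B: Iterable[str]) -> Set[str]:
    """Concatenate every sequence of A with every sequence of B."""
    B = list(B)
    return {a + b for a in A for b in B}


def remove_proper_prefixes(seqs: Set[str]) -> Set[str]:
    """Drop every sequence that is a proper prefix of another one."""
    return {s for s in seqs
            if not any(t != s and t.startswith(s) for t in seqs)}


def w_method(P: Set[str], W: Set[str], inputs: Set[str],
             extra_states: int = 0) -> Set[str]:
    """Build P . (I^0 u ... u I^(m-n)) . W and remove proper prefixes."""
    traversal = {"".join(p)
                 for i in range(extra_states + 1)
                 for p in product(sorted(inputs), repeat=i)}
    return remove_proper_prefixes(concat(concat(P, traversal), W))


# Sets quoted in the text for the FSM of Fig. 1 ("" stands for epsilon).
P = {"", "a", "aa", "ba", "b", "ab", "bb"}
W = {"a", "b"}

T_W = w_method(P, W, inputs={"a", "b"})
print(sorted(T_W))
```

With no extra states, the traversal set reduces to {ε}, so the result is P concatenated with W minus proper prefixes, which yields the eight sequences of length three listed for T_W.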

HSI method

The Harmonized State Identifiers (HSI) method (Petrenko and Bochmann 1995) uses state identifiers H_i to distinguish each state s_i ∈ S of an FSM model. The HSI test suite is obtained by concatenating each sequence α ∈ P with the identifier H_i of the state s_i ∈ S that it reaches, i.e., δ(s0,α) = s_i. The HSI method can be applied to complete and partial FSMs. Assuming the FSM in Fig. 1, no extra states, and the removal of proper prefixes, the HSI method can generate T_HSI = {aaa,aba,abb,baa,bba,bbb}, and |T_HSI| = 6, which is 75% of the size of T_W.

SPY method

The SPY method (Simão et al. 2009) is a recent test generation method able to generate m-complete test suites on-the-fly. First, the state cover set Q is concatenated with the state identifiers H_i. Afterwards, differently from traditional methods, such as W and HSI, the traversal set is distributed over the set containing Q concatenated with H_i based on sufficient conditions (Simão et al. 2009). Thus, by avoiding testing tree branching, test suite length and the number of resets can be reduced.

Experimental studies have indicated that SPY can generate test suites on average 40% shorter than traditional methods (Simão et al. 2009). Moreover, it can achieve higher fault detection effectiveness even if the number of extra states is underestimated (Endo and Simao 2013). Assuming the FSM in Fig. 1, no extra states, and the removal of proper prefixes, the SPY method can generate T_SPY = {aaaba,abbb,baa,bba}, and |T_SPY| = 4, which is 50% of the size of T_W.

Role-based access control

Access Control (AC) is one of the most important security mechanisms (Jang-Jaccard and Nepal 2014). Essentially, it ensures that only allowed users have access to protected system resources based on a set of rules, named security policies, that specify authorizations and access restrictions (Samarati and de Vimercati 2001). In this context, the Role-Based Access Control (RBAC) model has been established as one of the most significant access control paradigms (Ferraiolo et al. 2007). It uses the concept of grouping privileges to reduce the complexity of security management tasks (Samarati and de Vimercati 2001).

In RBAC, roles describe organizational figures (e.g., functions or jobs) which own a set of responsibilities (e.g., permissions). Roles can be assigned to or revoked from users via role assignments and performed under sessions through role activations. Role hierarchies can be specified as inheritance relationships between senior and junior roles (e.g., sales director inherits permissions from sales manager). Thus, the mapping between security policies and the organizational structure can be more natural. These elements compose the ANSI RBAC model (ANSI 2004), which can also be extended with groups of administrative roles and permissions (Ben Fadhel et al. 2015). In Fig. 3, the ANSI RBAC and, within dashed lines, the Administrative RBAC models are depicted.

Fig. 3 ANSI RBAC and administrative RBAC

Masood et al. (2009) define an RBAC policy as a 16-tuple P = (U,R,Pr,UR,PR,≤_A,≤_I,I,S_u,D_u,S_r,D_r,SSoD,DSoD,S_s,D_s), where:

  • U and R are the finite sets of users and roles;

  • Pr is the finite set of permissions;

  • UR ⊆ U×R is the set of user-role assignments;

  • PR ⊆ Pr×R is the set of permission-role assignments;

  • ≤_A ⊆ R×R and ≤_I ⊆ R×R are the role activation and inheritance hierarchy relationships;

  • I = {AS,DS,AC,DC,AP,DP} is the finite set of types of RBAC requests, which respectively stand for user-role assignments (AS), deassignments (DS), activations (AC), and deactivations (DC); and permission-role assignments (AP) and deassignments (DP);

  • \(S_{u},D_{u}: U \rightarrow \mathbb {Z}^{+}\) are static and dynamic cardinality constraints on users;

  • \(S_{r},D_{r}: R \rightarrow \mathbb {Z}^{+}\) are static and dynamic cardinality constraints on roles;

  • SSoD, DSoD ⊆ 2^R are the Static and Dynamic Separation of Duty (SoD) sets, respectively;

  • \(S_{s}: SSoD \rightarrow \mathbb {Z}^{+}\) specifies the cardinality of SSoD sets;

  • \(D_{s}: DSoD \rightarrow \mathbb {Z}^{+}\) specifies the cardinality of DSoD sets.

Role inheritance hierarchy is a role-to-role relationship (e.g., r_j ≤_I r_s) that enables users assigned to a senior role (r_s) to have access to all permissions of junior roles (r_j). Role activation is a variant of role hierarchy (e.g., r_j ≤_A r_s) which enables users assigned to a senior role (r_s) to activate junior roles (r_j) without being directly assigned to that junior role (Masood et al. 2009). Cardinality constraints specify a bound on the cardinality of user-role assignment and role activation relationships (Ben Fadhel et al. 2015). Static cardinality constraints (S_u and S_r) bound user-role assignments, and dynamic cardinality constraints (D_u and D_r) limit user-role activations (i.e., role activations); both can be specified from a user perspective (S_u and D_u, respectively) and from a role perspective (S_r and D_r, respectively). Separation of Duty (SoD) constraints define static and dynamic (SSoD and DSoD, respectively) mutual exclusion relationships among roles based on a positive integer number n ≥ 2 to avoid the simultaneous assignment or activation of conflicting roles (ANSI 2004) (e.g., given SSoD = {staff, accountant, director} and n = 2, S_s(SSoD) = 2 defines that no user can be assigned to more than two roles of the SSoD set). Listing 1 shows an example of an RBAC policy with two users (line 1), one role (line 2), and two permissions (line 3).

User u1 is assigned to role r1 (line 4), which is assigned to the permissions pr1 and pr2 (line 5). Both users can be assigned to and activate at most one role (lines 6-7). Role r1 can be assigned to at most two users (line 8); however, it can be activated by only one user at a time (line 9).
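The policy described for Listing 1 can be written down as a plain data structure. The Python sketch below is only one possible encoding: the class name `RBACPolicy`, its field names, and the helper `static_violations` are hypothetical and do not reproduce the syntax of Listing 1; hierarchies and SoD sets are omitted because the example policy does not use them.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class RBACPolicy:
    """Subset of the policy tuple needed for the example of Listing 1."""
    users: Set[str]
    roles: Set[str]
    permissions: Set[str]
    UR: Set[Tuple[str, str]]   # user-role assignments
    PR: Set[Tuple[str, str]]   # permission-role assignments
    Su: Dict[str, int]         # static cardinality per user
    Du: Dict[str, int]         # dynamic cardinality per user
    Sr: Dict[str, int]         # static cardinality per role
    Dr: Dict[str, int]         # dynamic cardinality per role


def static_violations(p: RBACPolicy) -> List[str]:
    """List static cardinality constraints violated by the initial UR set."""
    problems = []
    for u in sorted(p.users):
        if sum(1 for (x, _) in p.UR if x == u) > p.Su[u]:
            problems.append(f"user {u} exceeds Su")
    for r in sorted(p.roles):
        if sum(1 for (_, y) in p.UR if y == r) > p.Sr[r]:
            problems.append(f"role {r} exceeds Sr")
    return problems


# Values taken from the textual description of Listing 1.
policy = RBACPolicy(
    users={"u1", "u2"},
    roles={"r1"},
    permissions={"pr1", "pr2"},
    UR={("u1", "r1")},
    PR={("pr1", "r1"), ("pr2", "r1")},
    Su={"u1": 1, "u2": 1},
    Du={"u1": 1, "u2": 1},
    Sr={"r1": 2},
    Dr={"r1": 1},
)

print(static_violations(policy))
```

The initial assignment of u1 to r1 respects both S_u and S_r, so the check reports no violation.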

FSM-based testing of RBAC systems

Masood et al. (2009) propose an approach based on FSMs to specify and test the behavior of RBAC systems. Given an RBAC policy P, an FSM(P) consists of a complete FSM modeling all access control decisions that an RBAC mechanism must enforce. Formally, an FSM(P) is a tuple FSM(P) = (S_P,s0,I_P,O,D,δ_P,λ_P) where

  • S_P is the set of states that P can reach given its mutable elements;

  • s0 ∈ S_P is the initial state where P currently stands given UR and PR;

  • I_P is the input domain, where I_P = {(rq,up,r) | rq ∈ I, up ∈ U ∪ Pr, and r ∈ R};

  • O is the output domain formed by granted and denied;

  • D = S_P × I_P is the specification domain;

  • δ_P: D → S_P is the state transition function; and

  • λ_P: D → O is the output function.

Each state s ∈ S_P is labeled using a sequence of pairs of bits containing one pair for each combination of user-role and permission-role. A user-role pair can be assigned (10), activated (11), or not assigned (00); and a permission-role pair can be assigned (10) or not assigned (00). The maximum number of states of FSM(P) is bounded by 3^(|U|×|R|), and the number of reachable states depends on the constraints of P. The set of input symbols I_P contains all combinations of users, roles, permissions, and types of RBAC requests which can be applied to P. Formally, it means that I_P = {(rq,up,r) | rq ∈ I, up ∈ U ∪ Pr, and r ∈ R}.

Transitions of FSM(P) denote access control decisions. The specification domain is complete (Masood et al. 2009) and is composed of pairs of an origin state (s_i ∈ S_P) and an input symbol (rq,up,r) ∈ I_P; each pair is mapped to a destination state (s_j ∈ S_P) and an output symbol (granted or denied) according to the constraints of P. Given an origin state s_i and an input symbol (rq,up,r), the destination state s_j = δ_P(s_i,(rq,up,r)) is defined by flipping the bits of the label of s_i related to the user (or permission) up and the role r, if the constraints of P allow such a request. This procedure denotes how the state transition function δ_P operates.

Regarding the output function λ_P, a denied symbol is returned for inputs (requests) which do not change the state of P, such as user-role assignments already performed or requests denied due to some cardinality constraint. Thus, denied is only returned on self-loops. Transitions with different origin and destination states always return granted. The generation of an FSM(P) can be performed iteratively by evaluating all defined inputs, starting from state s0, given the constraints of P (Ω_FSM(P)).

Figure 4 shows the FSM(P) of the RBAC policy presented in Listing 1. Self-loop transitions, corresponding to requests returning denied, and transitions related to permissions are not shown to keep the figure uncluttered. The initial state 1000 depicts line 4 of Listing 1, where u1 is assigned to r1. From state 1000, all defined inputs are applied once to reach states 1100, 1010, and 0000 where, respectively, user u1 activates r1, both u1 and u2 are assigned to r1, and no user is assigned to r1. This procedure is iteratively repeated over all reached states until no new state is obtained. At the end, the resulting FSM(P) has a total of eight states rather than the maximum of 9 = 3^(|U|×|R|), because Dr(r1) = 1 makes state 1111 unreachable.
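The iterative generation procedure can be sketched as a breadth-first exploration. The Python code below is a simplified illustration under stated assumptions, not the generator used in this work: it handles only the user-role requests AS, DS, AC, and DC for the single role r1, ignores permission requests (as Fig. 4 does), omits S_u and D_u because they cannot be violated with one role, and assumes that deassigning an active user also deactivates the role. The names `step` and `reachable_states` are hypothetical.

```python
from collections import deque

USERS = ["u1", "u2"]                      # single role r1 is implicit
NOT_ASSIGNED, ASSIGNED, ACTIVE = "00", "10", "11"


def step(state, request, user, Sr=2, Dr=1):
    """Return (destination state, output) for one request on (user, r1)."""
    i = USERS.index(user)
    cur = state[i]
    assigned = sum(s != NOT_ASSIGNED for s in state)   # users assigned to r1
    active = sum(s == ACTIVE for s in state)           # users with r1 active
    new = cur
    if request == "AS" and cur == NOT_ASSIGNED and assigned < Sr:
        new = ASSIGNED
    elif request == "DS" and cur != NOT_ASSIGNED:
        new = NOT_ASSIGNED
    elif request == "AC" and cur == ASSIGNED and active < Dr:
        new = ACTIVE
    elif request == "DC" and cur == ACTIVE:
        new = ASSIGNED
    if new == cur:
        return state, "denied"            # self-loop
    return state[:i] + (new,) + state[i + 1:], "granted"


def reachable_states(initial, Sr=2, Dr=1):
    """Apply every request to every reached state until nothing new appears."""
    seen, queue = {initial}, deque([initial])
    while queue:
        s = queue.popleft()
        for rq in ("AS", "DS", "AC", "DC"):
            for u in USERS:
                t, _ = step(s, rq, u, Sr, Dr)
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen


initial = (ASSIGNED, NOT_ASSIGNED)        # state label 1000
labels = sorted("".join(s) for s in reachable_states(initial))
print(len(labels), labels)
```

Under these assumptions the exploration reaches eight states and never reaches 1111; calling `reachable_states(initial, Dr=2)` reaches all nine.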

Fig. 4 Example of FSM(P) specifying an RBAC policy

Test generation from FSM(P)

Given an RBAC system implementing a policy P, FSM-based testing can verify if the behavior of such system conforms to P using its respective FSM(P) and some test generation method, such as W or transition cover (Masood et al. 2009).

Let \(\mathcal {R}\) denote the set of all RBAC policies. Given a policy \(P \in \mathcal {R}\), the set \(\mathcal {R}\) can be partitioned into two subsets of policies: Equivalent (conforming) to P (\(\mathcal {R}^{P}_{conf}\)); and Faulty policies (\(\mathcal {R}^{P}_{fault}\)). Since \(\mathcal {R}\) is infinitely large, Masood et al. (2009) proposed a mutation analysis technique to measure the effectiveness of a test suite as its ability to detect if an RBAC system behaves as some faulty policy \(P' \in \mathcal {R}^{P}_{fault}\).

The RBAC mutation analysis restricts \( \mathcal {R}^{P}_{fault}\) to be finite by only considering policy mutants P′ = (U,R,Pr,UR,PR,≤_A′,≤_I′,I,S_u′,D_u′,S_r′,D_r′,SSoD,DSoD,S_s′,D_s′) generated by making simple changes to policy P = (U,R,Pr,UR,PR,≤_A,≤_I,I,S_u,D_u,S_r,D_r,SSoD,DSoD,S_s,D_s). Note that all mutants share the same set of users (U), roles (R), permissions (Pr), and inputs (I) of the original policy P. The set \( \mathcal {R}^{P}_{fault}\) of faulty policies is generated by making changes using two kinds of operators: mutation operators and element modification operators.

The mutation operators generate RBAC mutants by adding, modifying, and removing elements from the UR, PR, ≤_A, ≤_I, SSoD, and DSoD sets (e.g., add role to SSoD set). The element modification operators mutate policies by incrementing or decrementing the cardinality constraints S_u, D_u, S_r, D_r, S_s, and D_s. Each of these RBAC faults has corresponding faults in the FSM domain (Chow 1978), and FSM-based testing methods are also able to detect them (Masood et al. 2009). Figure 5 illustrates a part of one testing tree generated from four test cases and the FSM(P) in Fig. 4.

Fig. 5 Testing tree of an FSM(P)

By executing this test suite, an RBAC mutant generated from the policy shown in Listing 1 by applying the element modification operator to increment Dr(r1)=1 to Dr(r1)=2 can be detected. The FSM of this variant has state 1111 as reachable and, since test case t3 covers the transition 1110−AC(u2,r1)→1110, it can detect this fault.
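The way such a mutant is killed can be reproduced with a small simulation. The Python sketch below compares the outputs of the specification (Dr(r1) = 1) and of the mutant (Dr(r1) = 2) on a single request sequence. The sequence is a hypothetical stand-in that reaches state 1110 and then applies AC(u2,r1); it is not claimed to be test case t3 of Fig. 5, and the simplified request semantics (single role, no permission requests) are an assumption of this sketch.

```python
def outputs(test_case, Dr, Sr=2):
    """Run user-role requests on role r1 from state 1000 and collect outputs."""
    status = {"u1": "assigned", "u2": "none"}      # initial state 1000
    out = []
    for rq, u in test_case:
        active = sum(v == "active" for v in status.values())
        assigned = sum(v != "none" for v in status.values())
        before = status[u]
        if rq == "AS" and before == "none" and assigned < Sr:
            status[u] = "assigned"
        elif rq == "DS" and before != "none":
            status[u] = "none"
        elif rq == "AC" and before == "assigned" and active < Dr:
            status[u] = "active"
        elif rq == "DC" and before == "active":
            status[u] = "assigned"
        # A request that leaves the state unchanged is a denied self-loop.
        out.append("granted" if status[u] != before else "denied")
    return out


# Hypothetical test case: 1000 -AC(u1)-> 1100 -AS(u2)-> 1110 -AC(u2)-> ?
test = [("AC", "u1"), ("AS", "u2"), ("AC", "u2")]

spec = outputs(test, Dr=1)      # policy of Listing 1
mutant = outputs(test, Dr=2)    # element modification: Dr(r1) incremented
killed = spec != mutant
print(spec, mutant, killed)
```

The last request is denied by the specification and granted by the mutant, so the two output sequences differ and the mutant is killed.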

Test case prioritization

Although very effective, FSM-based testing of RBAC systems tends to generate a large number of test cases regardless of the method used (Damasceno et al. 2016). Thus, development processes of RBAC systems with time and resource constraints may demand improvements in test execution. To cope with this issue, different techniques have been proposed to improve the cost-effectiveness of test suites, such as Test Suite Minimization, also called test suite reduction, where redundant test cases are permanently removed; and Test Case Selection, which selects test cases based on changed parts of a System Under Test (SUT) (Yoo and Harman 2012). These techniques reduce time and effort, but they may not work effectively, since they may also omit important test cases able to detect certain faults (Ouriques 2015).

Test Case Prioritization improves test execution without filtering out any test case. It aims at identifying an efficient test execution ordering so that maximum benefits can be obtained, even if test execution is prematurely halted at some arbitrary point (Ouriques 2015). To that end, it uses a function f which quantitatively describes the quality of an ordering according to some test criterion (e.g., test effectiveness, code coverage). To illustrate test prioritization, consider a hypothetical SUT with 10 faults and five test cases A, B, C, D, and E, as shown in Table 1.

Table 1 Example of test cases with fault-detection capability, taken from Elbaum et al. (2000)

In this example, all faults can be detected by running test cases C and E, since they respectively have 70% and 30% of fault-detection effectiveness. Test case A, on the other hand, can detect only 20% of the faults so it can negatively affect fault detection along test execution if placed at the beginning of a test suite. Thus, it is possible to speed up fault detection during test cases execution by placing C and E at the beginning of the test suite.

After test prioritization, the quality of an ordering can be measured using the Average Percentage Faults Detected (APFD) metric. The APFD is a metric commonly used in test prioritization research (Elbaum et al. 2002), and it is defined as follows:

$$ \text{APFD} = \frac{\sum_{i=1}^{n-1} F_{i}}{n \times l} +\frac{1}{2n} \tag{1} $$

In Eq. 1, the parameter n describes the total number of test cases, l defines the number of faults under consideration, and F_i specifies the number of faults detected by a test case i. The APFD value depicts how fault detection (i.e., test effectiveness) progresses along test execution for a given test case ordering. This value ranges from 0 to 1, and the greater the APFD is, the better the test case ordering. Table 2 shows the APFD for three prioritized test suites, T1, T2, and T3, obtained from the test cases in Table 1. In this example, the APFD indicates that T3 performs better than T2 and T1.
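The APFD computation can be sketched in Python as follows. Two assumptions are made here and should be read as such: F_i is interpreted cumulatively, i.e., as the number of distinct faults detected by at least one of the first i test cases of the ordering, which is the reading under which the formula stays between 0 and 1; and the fault matrix `detects` is illustrative, chosen only to agree with the percentages quoted in the text for A, C, and E rather than copied from Table 1. The function name `apfd` and the three orderings are hypothetical.

```python
from typing import Dict, List, Set


def apfd(ordering: List[str], detects: Dict[str, Set[int]],
         num_faults: int) -> float:
    """APFD = sum_{i=1}^{n-1} F_i / (n * l) + 1 / (2n), with cumulative F_i."""
    n = len(ordering)
    found: Set[int] = set()
    cumulative = []
    for test in ordering:
        found |= detects[test]
        cumulative.append(len(found))   # faults found by the first i tests
    return sum(cumulative[:n - 1]) / (n * num_faults) + 1 / (2 * n)


# Illustrative fault matrix for 5 test cases and 10 faults (assumption).
detects = {
    "A": {1, 5},
    "B": {1, 5, 6, 7},
    "C": {1, 2, 3, 4, 5, 6, 7},
    "D": {5},
    "E": {8, 9, 10},
}

for order in (["A", "B", "C", "D", "E"],
              ["E", "D", "C", "B", "A"],
              ["C", "E", "B", "A", "D"]):
    print("".join(order), round(apfd(order, detects, num_faults=10), 2))
```

Placing C and E first gives the highest value of the three orderings, which is in line with the discussion of Table 1 above.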

Table 2 APFD value for the test cases example

Similarity testing

Similarity testing is a promising test case prioritization approach that uses similarity functions to calculate the degree of similarity between pairs of tests and define the test ordering (Cartaxo et al. 2011; Bertolino et al. 2015; Coutinho et al. 2014). It is an all-to-all comparison problem (Zhang et al. 2017) and, like most test prioritization algorithms (Elbaum et al. 2002), it has complexity O(n²). It assumes that resembling test cases are redundant in the sense that they cover the same features of an SUT and tend to have equivalent fault detection capabilities (Bertolino et al. 2015).

To run similarity testing, a similarity matrix describing the resemblance between all pairs of test cases of a test suite T must be calculated with a similarity function d_x. The similarity matrix SM of a test suite T with n test cases is a matrix where each element SM_ij = d_x(t_i,t_j) describes the similarity degree between two test cases t_i and t_j, such that 1 ≤ i < j ≤ n. In Eq. 2, an illustrative example of a similarity matrix is presented.

$$ SM = \begin{array}{c|ccccc} & t_{1} & t_{2} & \cdots & t_{n-1} & t_{n} \\ \hline t_{1} & 0 & d_{x}(t_1,t_2) & \cdots & d_{x}(t_1,t_{n-1}) & d_{x}(t_1,t_{n}) \\ t_{2} & 0 & 0 & \cdots & d_{x}(t_2,t_{n-1}) & d_{x}(t_2,t_{n}) \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ t_{n-1} & 0 & 0 & \cdots & 0 & d_{x}(t_{n-1},t_{n}) \\ t_{n} & 0 & 0 & \cdots & 0 & 0 \end{array} \tag{2} $$

After calculating the similarity matrix, test ordering is defined based on similarity degrees (Cartaxo et al. 2011; Bertolino et al. 2015; Henard et al. 2014; Coutinho et al. 2014). According to Elbaum et al. (2002), the ordering process can use total or additional information. Test prioritization based on total information uses only pairwise similarity for ordering test cases, whereas additional information includes the similarity of previously executed test cases to improve ordering (i.e., the most distinct test case compared to all previous).
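An ordering based on additional information can be sketched as a greedy farthest-first traversal: start with the most dissimilar pair, then repeatedly pick the remaining test case whose smallest distance to the already selected ones is largest. The Python code below is a generic illustration of this idea and is not claimed to be the prioritization algorithm used in the cited works or in this paper. The function name `prioritize`, the pairwise dissimilarity values in `dist`, and the tie-breaking by input order are all assumptions, and at least two test cases are expected.

```python
from typing import Dict, List, Tuple


def prioritize(tests: List[str],
               dist: Dict[Tuple[str, str], float]) -> List[str]:
    """Greedy ordering: most dissimilar pair first, then farthest-first."""
    def d(a: str, b: str) -> float:
        # The matrix is upper triangular, so look up either orientation.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    pairs = [(a, b) for i, a in enumerate(tests) for b in tests[i + 1:]]
    order = list(max(pairs, key=lambda p: d(*p)))
    remaining = [t for t in tests if t not in order]
    while remaining:
        # Pick the test that is most distinct from everything chosen so far.
        nxt = max(remaining, key=lambda t: min(d(t, s) for s in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical dissimilarity values for four test cases.
dist = {
    ("t1", "t2"): 0.1, ("t1", "t3"): 0.9, ("t1", "t4"): 0.5,
    ("t2", "t3"): 0.8, ("t2", "t4"): 0.4, ("t3", "t4"): 0.6,
}

order = prioritize(["t1", "t2", "t3", "t4"], dist)
print(order)
```

Here t2 is scheduled last because it is almost identical to t1, which has already been executed by then.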

Cartaxo et al. (2011) showed that similarity testing can be more effective than random prioritization when applied to test sequences automatically generated from Labelled Transition Systems (LTS). In their study, the similarity degree (d_sd) between two test cases was calculated as the number of identical transitions (nit) divided by the average test case length. The average length was used to avoid small (large) similarity degrees due to similar short (long) test sequences. An extensive investigation on similarity testing for LTS is found in Coutinho et al. (2014).
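The similarity degree described above, together with the upper-triangular similarity matrix, can be sketched in Python. This is a minimal reading of the textual definition and not the implementation of Cartaxo et al.: test cases are assumed to be lists of transition labels, identical transitions are counted as the size of the set intersection, and the labels and function names are hypothetical.

```python
from typing import List, Sequence


def similarity_degree(t_i: Sequence[str], t_j: Sequence[str]) -> float:
    """d_sd = number of identical transitions / average test case length."""
    nit = len(set(t_i) & set(t_j))
    avg_len = (len(t_i) + len(t_j)) / 2
    return nit / avg_len if avg_len else 0.0


def similarity_matrix(suite: List[Sequence[str]]) -> List[List[float]]:
    """Fill only the entries SM[i][j] with i < j; the rest stays 0."""
    n = len(suite)
    sm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sm[i][j] = similarity_degree(suite[i], suite[j])
    return sm


# Hypothetical test cases written as sequences of transition labels.
t1 = ["s0-a->s1", "s1-b->s2"]
t2 = ["s0-a->s1", "s1-c->s3"]
t3 = ["s0-d->s4"]

SM = similarity_matrix([t1, t2, t3])
for row in SM:
    print(row)
```

In this example t1 and t2 share one of their two transitions, so their similarity degree is 0.5, while t3 shares nothing with either of them.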

Bertolino et al. (2015) also investigated the application of similarity testing to XACML systems. XACML is an XML-based declarative notation for specifying access control policies and evaluating access requests (OASIS 2013). Essentially, they proposed a test prioritization approach named XACML similarity (d_xs) which considers three values for test prioritization: (i) a simple similarity (d_ss), which describes how much two test cases (t_i,t_j) resemble each other based on their lexical distance; (ii) an applicability degree (AppValue), which indicates the percentage of parts of an XACML policy affected by a test case; and (iii) a priority value (PriorityValue), which gives weight to pairs of test cases based on their applicability degree. Although investigations have shown that simple similarity d_ss is comparable to random prioritization, XACML similarity enabled significant improvements compared to simple similarity and random prioritization.

It should be noted that the XACML standard can be used to specify and implement RBAC policies (OASIS 2014). However, its current version (OASIS 2014) does not support the specification of SSoD and DSoD constraints. Moreover, since the effectiveness of a test criterion is strongly related to its ability to represent domain-specific faults (Felderer et al. 2015), there is no guarantee that similarity testing can be as effective on RBAC as it was on XACML and LTS.

Similarity testing for RBAC systems

In this section, we introduce our similarity testing approach specific to RBAC systems, named RBAC similarity. RBAC similarity builds on the approaches of Cartaxo et al. (2011) and Bertolino et al. (2015) and is suitable for FSM-based testing of RBAC systems. A prioritization algorithm used to order test cases based on similarity criteria is also discussed.

RBAC similarity

In XACML similarity, applicability is the relation between an access request and an XACML policy which quantitatively describes the impact of this request (i.e., test case) on the rules of the policy (Bertolino et al. 2015). In our work, we extend the concept of XACML applicability to the RBAC domain and propose RBAC similarity, a similarity testing approach specific to RBAC systems.

Essentially, the RBAC similarity (\(d_{rs}\)) takes an RBAC policy P and a test suite T generated from an FSM(P) and evaluates the degree of resemblance between all pairs of test cases \(t_i, t_j \in T\). To do so, it uses a dissimilarity function and the applicability of each pair of test cases to the policy P under test. Given this information, a test case prioritization algorithm orders the tests from the most distinct and relevant to the least diverse and suitable ones. To support similarity testing for RBAC, we propose the concept of RBAC applicability, which quantitatively describes the relevance of a test case to an RBAC policy. The dissimilarity function and the RBAC applicability are detailed in the following sections.

Simple dissimilarity:

The simple dissimilarity between test cases is measured based on the number of distinct transitions (ndt). Given two test cases \(t_i\) and \(t_j\), the degree of simple dissimilarity (\(d_{sd}\)) is calculated as presented in Eq. 3.

$$ d_{sd}(t_{i},t_{j})=\frac{ndt(t_{i},t_{j})}{avg(length(t_{i})+length(t_{j}))} $$

The number of distinct transitions (ndt) between two test cases (\(t_i, t_j\)) is counted and then divided by the average length of the test cases \(t_i\) and \(t_j\). Transitions are considered distinct when there is a mismatch between their origin states, input or output symbols, or destination (tail) states. The average test case length is used to avoid small (or large) dissimilarity degrees caused merely by short (or long) test cases. Listing 2 shows an example of four test cases and their respective transitions and states covered given the FSM(P) previously shown in Fig. 4. The number of distinct transitions, the average length and the simple dissimilarity \(d_{sd}\) for each pair of test cases are shown in Table 3.

Table 3 Simple dissimilarity of each pair of test cases
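The simple dissimilarity of Eq. 3 can be illustrated with a minimal Python sketch. It is not the RBAC-BT implementation: the transition encoding, the state and symbol names, and the function names are hypothetical, and "distinct transitions" is read here as the symmetric difference of the two transition sets, which is only one plausible interpretation of ndt.

```python
from typing import List, Tuple

# A transition is modelled as (origin, input, output, destination).
# State and symbol names used with this type are hypothetical placeholders.
Transition = Tuple[str, str, str, str]


def ndt(t_i: List[Transition], t_j: List[Transition]) -> int:
    """Count transitions that occur in only one of the two test cases.

    Assumption: "distinct" is read as the symmetric difference of the two
    transition sets; a position-by-position comparison is another reading.
    """
    return len(set(t_i) ^ set(t_j))


def simple_dissimilarity(t_i: List[Transition], t_j: List[Transition]) -> float:
    """d_sd = ndt divided by the average length of the two test cases."""
    avg_length = (len(t_i) + len(t_j)) / 2
    if avg_length == 0:
        return 0.0  # two empty test cases are treated as identical
    return ndt(t_i, t_j) / avg_length


# Two hypothetical test cases sharing their first transition only.
t_a = [("q0", "AS(u1,r1)", "grant", "q1"), ("q1", "AC(u1,r1)", "grant", "q2")]
t_b = [("q0", "AS(u1,r1)", "grant", "q1"), ("q1", "DS(u1,r1)", "grant", "q0")]
print(simple_dissimilarity(t_a, t_b))  # 1.0
```

Dividing by the average length keeps the degree comparable across pairs of short and long test cases, as argued above.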

RBAC applicability:

The idea of the RBAC applicability is to quantitatively describe the relevance of a test case to an RBAC policy under test. An RBAC constraint is applicable to a test case if there is a match between the users, roles, or permissions of any input of this test case and the attributes of the constraint. For example, if an RBAC policy contains a static cardinality constraint \(S_u(u1)=1\), this constraint must regulate (i.e., apply some regulation to) all test cases with user u1 as test input (e.g., AS(u1,r2)). This idea makes it possible to measure how much a test case t may impact a given policy P, without considering dynamic (behavioral) aspects of the RBAC model (e.g., FSM(P) states/transitions). Thus, it describes the structural or static coverage of a test case t over a policy P.

However, since an RBAC system is essentially reactive, a behavioral view of a test case is also necessary. To satisfy this requirement, we also propose the concept of behavioral or dynamic coverage. An RBAC constraint of a policy P reacts to a test case when this constraint is applicable to any input symbol and it influences (enforces) the access control decision. As an example, the test case t3, shown in Fig. 5, depicts a scenario of an RBAC policy containing a dynamic cardinality constraint \(D_r(r1)=1\) and two users u1 and u2 attempting to activate r1. This constraint is applicable (and reacts) to the last input, which requests the second activation of role r1, and enforces a denied response. This information is associated with many transitions of the FSM(P) and used as requirements-based coverage criteria (Utting et al. 2012). Thus, by quantifying the number of RBAC constraints reacting to the inputs of a test case, the dynamic coverage of a policy P can be measured and used to support test prioritization.

Based on the concepts of static and dynamic coverage, we proposed the RBAC Applicability Degree (AD), which is an array of four values defined as shown in Eq. 4.

$$ AD_{P(t)}=\left[ pad_{P(t)} \quad asad_{P(t)}\quad acad_{P(t)} \quad prad_{P(t)}\right] $$

The RBAC Applicability Degree (AD) of a test case t to a given policy P consists of four values:

  • Policy Applicability Degree (\(pad_{P(t)}\)), which shows the ratio of test inputs applicable to any RBAC constraint over the test case length;

  • Assignment Applicability Degree (\(asad_{P(t)}\)), which shows the number of RBAC constraints related to assignment faults reacting to t;

  • Activation Applicability Degree (\(acad_{P(t)}\)), which shows the number of RBAC constraints related to activation faults reacting to t; and

  • Permission Applicability Degree (\(prad_{P(t)}\)), which shows the number of RBAC constraints related to permission faults reacting to t.

The \(pad_{P(t)}\) measures how applicable a test case t is to a given policy based on all RBAC constraints applicable to t. The \(asad_{P(t)}\) quantifies how many RBAC constraints related to assignment faults (i.e., UR, \(S_u\), \(S_r\), SSoD, and \(S_s\)) react to t. The \(acad_{P(t)}\) quantifies how many RBAC constraints related to activation faults (i.e., \(\leq_A\), \(D_u\), \(D_r\), DSoD, and \(D_s\)) react to t. Finally, the \(prad_{P(t)}\) quantifies how many RBAC constraints related to permission faults (i.e., PR, \(\leq_I\)) react to t.

Based on the values of AD, the RBAC Applicability Degree (\(RA_{P(t)}\)) is calculated. The \(RA_{P(t)}\) value is a single quantitative attribute which summarizes the relevance of a single test case t to one policy P by summing the four applicability degrees.

$$ RA_{P(t)}= pad_{P(t)} + asad_{P(t)} + acad_{P(t)} + prad_{P(t)} $$

However, since test similarity is calculated for pairs of test cases, we also defined the RBAC Applicability Value (AppValue) which sums the applicability degrees of test cases (Eq. 6).

$$ AppValue(P,t_{i},t_{j}) = RA_{P(t_{i})} + RA_{P(t_{j})} $$
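As an illustration of Eqs. 4 to 6, the sketch below stores the four applicability degrees of a test case and derives the RA and AppValue figures from them. The class and function names are hypothetical, and the numeric degrees are invented for the example; they are not the values of Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicabilityDegree:
    """The four-valued AD of one test case t for a policy P (Eq. 4)."""
    pad: float  # ratio of applicable inputs over the test case length, in [0, 1]
    asad: int   # assignment-related constraints reacting to t
    acad: int   # activation-related constraints reacting to t
    prad: int   # permission-related constraints reacting to t

    def ra(self) -> float:
        """RA_P(t): the sum of the four applicability degrees."""
        return self.pad + self.asad + self.acad + self.prad


def app_value(ad_i: ApplicabilityDegree, ad_j: ApplicabilityDegree) -> float:
    """AppValue(P, t_i, t_j) = RA_P(t_i) + RA_P(t_j) (Eq. 6)."""
    return ad_i.ra() + ad_j.ra()


# Hypothetical degrees, not the values reported in Table 4.
ad_a = ApplicabilityDegree(pad=1.0, asad=1, acad=1, prad=0)
ad_b = ApplicabilityDegree(pad=0.5, asad=1, acad=0, prad=0)
print(app_value(ad_a, ad_b))  # 4.5
```

Note that pad is a ratio while the other three degrees are counts, so the counts tend to dominate the sum when many constraints react to a test case.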

A priority value (PriorityValue) is calculated to weight the pairwise relevance of two test cases. This PriorityValue is a constant number α, β, γ, or δ defined based on the \(pad_{P(t_{i})}\) and \(pad_{P(t_{j})}\) values. These α, β, γ, and δ constants are defined by the user, such that α>β>γ>δ. The α is given for pairs of test cases where all test inputs are applicable, and δ is given if none of the test inputs is applicable to the constraints of the RBAC policy P. The values 3, 2, 1 and 0 are suggested by Bertolino et al. (2015). Equation 7 shows the formula which derives the PriorityValue.

$$ PriorityValue(P,t_{i},t_{j}) =\left\{ \begin{array}{ll} & \alpha~ \text{if}~ (pad_{P(t_{i})} = pad_{P(t_{j})} = 1) \\ & \beta~ \text{if}~ (pad_{P(t_{i})}~XOR~pad_{P(t_{j})}) \\ & \gamma~ \text{if}~ (0 < pad_{P(t_{i})}, pad_{P(t_{j})}< 1) \\ & \delta~ \text{otherwise} \end{array} \right. $$
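The case analysis of Eq. 7 can be sketched as a small Python function. The function name is hypothetical, the defaults are the values 3, 2, 1 and 0 mentioned above, and the XOR branch is interpreted here as "exactly one of the two test cases is fully applicable (pad = 1)", which is an assumption about the intended reading.

```python
def priority_value(pad_i: float, pad_j: float,
                   alpha: float = 3, beta: float = 2,
                   gamma: float = 1, delta: float = 0) -> float:
    """PriorityValue of a pair of test cases, following Eq. 7.

    Assumption: the XOR branch means that exactly one of the two test
    cases has all of its inputs applicable (pad equal to 1).
    """
    if not alpha > beta > gamma > delta:
        raise ValueError("constants must satisfy alpha > beta > gamma > delta")
    full_i, full_j = pad_i == 1, pad_j == 1
    if full_i and full_j:
        return alpha   # both test cases fully applicable
    if full_i != full_j:
        return beta    # exactly one test case fully applicable
    if 0 < pad_i < 1 and 0 < pad_j < 1:
        return gamma   # both test cases partially applicable
    return delta       # every remaining combination


print(priority_value(1.0, 1.0))   # 3
print(priority_value(1.0, 0.5))   # 2
print(priority_value(0.5, 0.25))  # 1
print(priority_value(0.0, 0.0))   # 0
```

Under this reading, a pair in which one test case is partially applicable and the other is not applicable at all falls into the last branch and receives δ.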

The RBAC Similarity (\(d_{rs}\)) of a pair of test cases consists of the sum of the \(d_{sd}\), AppValue and PriorityValue values if \(d_{sd}(t_i,t_j) \neq 0\), as shown in Eq. 8. The RBAC similarity was designed based on the approach of Bertolino et al. (2015) for similarity testing of XACML policies.

$$ d_{rs}(P,t_{i},t_{j}) =\left\{ \begin{array}{ll} & 0 ~~~~~~~~~~~~~~~~~~~~~~~~ \text{if}~ d_{sd}(t_{i},t_{j})=0 \\ & d_{sd}(t_{i},t_{j})+\\ & AppValue{(P,t_{i},t_{j})}+\\ & PriorityValue{(P,t_{i},t_{j})}~ \text{otherwise }\\ \end{array} \right. $$
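A minimal sketch of Eq. 8 follows. The function name is hypothetical and the three inputs are assumed to have been computed beforehand; the numbers in the usage lines are invented for illustration only.

```python
def rbac_similarity(d_sd: float, app_value: float, priority_value: float) -> float:
    """d_rs following Eq. 8: zero for identical test cases, else the sum."""
    if d_sd == 0:
        return 0.0
    return d_sd + app_value + priority_value


# Hypothetical inputs, chosen only to show both branches.
print(rbac_similarity(0.5, 4.5, 2))  # 7.0
print(rbac_similarity(0.0, 6.0, 3))  # 0.0
```

The guard on the first branch means that a pair of identical test cases scores zero regardless of how applicable they are, so duplicated test cases are never favoured by their applicability alone.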

As an example, the applicability degrees of each test case presented in Listing 2, given the RBAC policy in Listing 1, are presented in Table 4.

Table 4 RBAC applicability degree of each test case

As shown in Table 4, all test inputs of t3 are applicable to at least one RBAC constraint and test case t3 has the greatest RBAC applicability degree. Test case t2 has the second greatest value, followed by t1 and t0 with the same applicability degree. Afterwards, the simple dissimilarity, RBAC applicability value, and priority value are calculated for all pairs of test cases. All these values are combined in the RBAC similarity (\(d_{rs}\)), which is calculated for each pair of test cases, as presented in Table 5.

Table 5 RBAC similarity of each pair of test cases

Test prioritization algorithm

Given the similarity of all pairs of test cases, a test prioritization algorithm has to be used for scheduling test case execution. The pseudocode of the test prioritization algorithm used in this study is presented in Algorithm 1. Essentially, the test prioritization algorithm iterates over a similarity matrix calculated using a similarity function \(d_x\), from the most dissimilar pairs of test cases of a test suite S to the least dissimilar ones. For each pair, the longest test case is added to the list of prioritized test cases if it has not been included yet; otherwise, the shortest one is added, if not previously included. This process is performed until all test cases of S are included in L, which stands for the prioritized test suite.

Using the RBAC similarity and the test suite shown in Listing 2, the similarity matrix shown in Eq. 9 is obtained.

$$ \begin{aligned} &\quad \quad \quad \ \ t_{0} \quad t_{1} \quad t_{2} \quad t_{3}\\ SM&= \begin{array}{c} t_{0}\\ t_{1}\\ t_{2}\\ t_{3} \end{array} \left[ \begin{array}{cccc} 0 & 4.55 & 5.55 & 7.77 \\ 0 & 0 & 5.55 & 7.77 \\ 0 & 0 & 0 & 8.77 \\ 0 & 0 & 0 & 0 \end{array} \right] \end{aligned} $$

Using Algorithm 1, the most dissimilar pair of test cases (t2,t3) is selected first and the longest test case t3 is added to L. Afterwards, test case t0 is included since it is the longest test case not yet included from the next most dissimilar pair (t0,t3). The next pair considered is (t1,t3) and t1 is the next to be included. The prioritization ends with test case t2, from pair (t0,t2), scheduled at the end of the test execution. Listing 3 shows the resulting test suite L prioritized according to RBAC similarity.
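The walk-through above can be reproduced with the following Python sketch. It is an interpretation of the textual description, not a transcription of Algorithm 1: the function name is hypothetical, ties between equally dissimilar pairs are broken by row-major order, a final step appends any test case that was never selected, and the test case lengths are invented because Listing 2 is not reproduced here (only t3 being the longest matters for this example).

```python
from itertools import combinations


def prioritize(similarity, lengths):
    """Order test case indices using a pairwise similarity matrix.

    similarity: upper-triangular matrix, read as similarity[i][j] for i < j.
    lengths: lengths[i] is the length of test case i.
    """
    n = len(lengths)
    # Visit pairs from the most to the least dissimilar. sorted() is stable,
    # so equally ranked pairs keep their row-major order (an assumption).
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: similarity[p[0]][p[1]], reverse=True)
    ordered = []
    for i, j in pairs:
        longer, shorter = (i, j) if lengths[i] >= lengths[j] else (j, i)
        if longer not in ordered:
            ordered.append(longer)
        elif shorter not in ordered:
            ordered.append(shorter)
        if len(ordered) == n:
            break
    # Safeguard (assumption): schedule test cases never selected at the end.
    ordered.extend(k for k in range(n) if k not in ordered)
    return ordered


# Similarity matrix of Eq. 9; the lengths are hypothetical.
SM = [[0, 4.55, 5.55, 7.77],
      [0, 0,    5.55, 7.77],
      [0, 0,    0,    8.77],
      [0, 0,    0,    0]]
print(prioritize(SM, lengths=[2, 2, 3, 4]))  # [3, 0, 1, 2]
```

With these inputs the order is t3, t0, t1, t2, which matches the sequence described above.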

Experimental evaluation

According to Damasceno et al. (2016), a large number of test cases tends to be generated regardless of the FSM-based testing method applied to RBAC systems. As the number of states and transitions of FSM(P) increases, the test suites grow in number of resets, total test suite length, and average test case length. Thus, additional steps become necessary to make software testing more cost-effective.

We proposed RBAC similarity to fill this research gap and designed an experiment to evaluate the cumulative effectiveness and the APFD of RBAC similarity and compare it to simple dissimilarity and random prioritization using test suites generated from FSM-based testing methods on RBAC systems. A schematic overview of this experiment is presented in Fig. 6.

Fig. 6 Comparison of test prioritization techniques - schematic overview

Fifteen test suites were taken from a previous study (Damasceno et al. 2016) where test characteristics (i.e., number of resets, test suite length, and avg. test case length) and effectiveness were analyzed based on the FSM(P) characteristics (i.e., numbers of states, and transitions). These test suites were generated from five RBAC policies specified as FSM(P) models using the RBAC-BT software (Damasceno et al. 2016) and implementations of the W (Chow 1978), HSI (Petrenko and Bochmann 1995), and SPY (Simão et al. 2009) methods. Table 6 shows a summary of the five RBAC policies and the total number of RBAC mutants.

Table 6 RBAC policies characteristics

The RBAC-BT tool (Footnote 1) is an FSM-based testing tool designed by Damasceno et al. (2016) to support FSM-based testing of RBAC systems and the automatic generation of FSM(P) models and RBAC mutants. RBAC-BT was extended to support test prioritization using RBAC similarity and simple dissimilarity. Due to the high number of pairwise comparisons required to perform test prioritization, a time constraint of 24 hours was defined for each test prioritization procedure. Procedures exceeding this limit were canceled and random subsets of the complete test suites, named subtest suites, were taken for prioritization.

In preliminary experiments, the prioritization of the test suites of policies P03, P04, and P05 took more than 24 hours.

Thus, subtest suites of the aforementioned policies containing 2528 test cases were randomly generated 30 times. The number 2528 was taken from the largest complete test suite with test prioritization duration below the 24 hours threshold, the W test suite of policy P02. Table 7 shows the characteristics of the FSM(P) models and their respective complete test suites.
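The random selection of subtest suites can be sketched as follows. The function name and the fixed seed are hypothetical choices made here for reproducibility, and sampling without replacement is an assumption based on the description of the subtest suites as random subsets.

```python
import random


def random_subtest_suites(test_suite, size=2528, repetitions=30, seed=0):
    """Draw `repetitions` random subsets with `size` test cases each.

    Assumption: sampling is done without replacement within a subset,
    while different subsets are drawn independently of each other.
    """
    rng = random.Random(seed)
    return [rng.sample(test_suite, size) for _ in range(repetitions)]


# Hypothetical complete test suite, represented by test case identifiers.
complete_suite = list(range(5000))
subsuites = random_subtest_suites(complete_suite)
print(len(subsuites), len(subsuites[0]))  # 30 2528
```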

Table 7 FSM(P) and test characteristics

The six complete test suites were prioritized using each test prioritization technique, and the cumulative effectiveness of these test suites was measured in twenty-one parts. Afterwards, the cumulative effectiveness was used to calculate the APFD of each scenario. The APFD value was calculated using Eq. 1, with \(F_i\) as the number of faults detected by one test fragment i and l as the number of RBAC mutants. Random prioritization was performed 10 times on each of the 30 random subtest suites of P03, P04 and P05.
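The measurement of cumulative effectiveness over a prioritized test suite can be sketched as below. The function name and the data layout are hypothetical, and the twenty-one parts are assumed to be evenly spaced checkpoints from 0% to 100% of the suite in steps of 5%; the sketch stops at the effectiveness curve and does not reproduce Eq. 1.

```python
def cumulative_effectiveness(kills, total_mutants, parts=21):
    """Fraction of mutants killed by growing prefixes of a prioritized suite.

    kills[k] is the set of mutant identifiers killed by the k-th test case
    in execution order. Assumption: the `parts` checkpoints are evenly
    spaced from 0% to 100% of the suite (0%, 5%, ..., 100% for 21 parts).
    """
    n = len(kills)
    curve = []
    for step in range(parts):
        prefix = round(n * step / (parts - 1))  # test cases executed so far
        killed = set().union(*kills[:prefix])   # mutants killed by the prefix
        curve.append(len(killed) / total_mutants)
    return curve


# Hypothetical kill data for four test cases and four mutants, five checkpoints.
example_kills = [{1, 2}, {2}, {3}, {4}]
print(cumulative_effectiveness(example_kills, total_mutants=4, parts=5))
# [0.0, 0.5, 0.5, 0.75, 1.0]
```

An ordering that detects faults earlier makes this curve rise sooner, which is the behaviour that the APFD summarizes in a single value.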

Using the R statistical package, we calculated the mean APFD with a confidence interval (CI) of 95% for all test scenarios and performed the nonparametric Wilcoxon matched-pairs signed ranks test, at a confidence level of 95%, to verify whether RBAC similarity reached different APFDs compared to simple dissimilarity and random prioritization. As the alternative hypothesis, we considered that RBAC similarity performed better (i.e., greater mean cumulative effectiveness) than the other criteria.

To complement the hypothesis tests, we analyzed the effect size by computing unstandardized (i.e., median and mean differences) and standardized measures (i.e., Cohen’s d, Hedges’ g (Kampenes et al. 2007) and Vargha-Delaney’s \(\hat{A}_{12}\) (Arcuri and Briand 2011)) using R and the effsize package (Torchiano 2017).

Analysis of the complete test suites

In this section, we discuss the results of the experiments comparing RBAC similarity, simple dissimilarity, and random prioritization based on complete test suites. The mean cumulative effectiveness values for P01 and P02 are respectively shown in Tables 8 and 9, and in Figs. 7 and 8 with error bars calculated with a confidence interval of 95%. At the end of this section, we also show the mean APFD and the results of the Wilcoxon matched-pairs signed ranks test.

Fig. 7 Cumulative effectiveness for P01 with error bars (CI=95%). (a) P01 + W. (b) P01 + HSI. (c) P01 + SPY

Fig. 8 Cumulative effectiveness for P02 with error bars (CI=95%). (a) P02 + W. (b) P02 + HSI. (c) P02 + SPY

Table 8 Cumulative effectiveness of the P01 complete test suites
Table 9 Cumulative effectiveness of the P02 complete test suites

In most of the cases, there was no statistically significant difference between the prioritization algorithms in the P01 and P02 scenarios. The P01 + HSI scenario was the only exception, where RBAC similarity reached an APFD higher than simple dissimilarity and random prioritization. In the five remaining scenarios, RBAC similarity showed no significant difference compared to at least one of the other methods. The mean APFD values for each scenario are shown in Table 10 with their respective confidence intervals of 95% subscripted.

Table 10 Mean APFD of the complete test suites with confidence interval of 95%

Table 11 shows the results of the Wilcoxon matched-pairs signed ranks test applied to the mean cumulative effectiveness at a confidence level of 95%. In this case, we compared RBAC similarity to simple dissimilarity and random prioritization, and random prioritization to simple dissimilarity. Significant results are highlighted in bold.

Table 11 Wilcoxon matched-pairs signed ranks test (CI=95%) for P01 and P02

Table 11 corroborates the findings of Fig. 7 and Table 10, where RBAC similarity had a statistically significant difference compared to the other criteria in the P01 + HSI scenario, and random prioritization reached significantly different APFDs compared to simple dissimilarity in all scenarios.

Analysis of the subtest suites

Since test prioritization for P03, P04 and P05 was too expensive, we considered 30 random subtest suites with 2528 test cases. Random prioritization was run 10 times for each of the 30 subtest suites.

The mean cumulative effectiveness values of P03, P04, and P05 are respectively presented in Tables 12, 13, and 14. Figs. 9, 10, and 11 show the mean cumulative effectiveness with error bars calculated using a confidence interval of 95%.

Fig. 9 Cumulative effectiveness for P03 with error bars (CI=95%). (a) P03 + W. (b) P03 + HSI. (c) P03 + SPY

Fig. 10 Cumulative effectiveness for P04 with error bars (CI=95%). (a) P04 + W. (b) P04 + HSI. (c) P04 + SPY

Fig. 11 Cumulative effectiveness for P05 with error bars (CI=95%). (a) P05 + W. (b) P05 + HSI. (c) P05 + SPY

Table 12 Cumulative effectiveness of the P03 subtest suites
Table 13 Cumulative effectiveness of the P04 subtest suites
Table 14 Cumulative effectiveness of the P05 subtest suites

In the P03 test scenarios, the first 5 to 10% of the W, HSI, and SPY subtest suites (i.e., a subset of 125 to 250 test cases) were sufficient to reach the maximum effectiveness. All test prioritization approaches presented similar results and no statistically significant difference was found between RBAC similarity and the other approaches. In scenarios like this, test minimization techniques may be more cost-effective than test prioritization, given the \(O(n^2)\) complexity of the latter.

In the P04 scenario, the benefits of RBAC similarity started to become more visible and statistically significant, as shown in Fig. 10 and Table 13. There was one exception where no significant difference was obtained. In the P04 + W scenario, the W method generated an extremely large test suite and, to enable test prioritization, we selected random subtest suites containing 2528 test cases. This random selection may have reduced test diversity. In the other scenarios, P04 + HSI and P04 + SPY, we found that the cumulative effectiveness of the RBAC similarity had a statistically significant difference compared to the other methods.

The mean cumulative effectiveness for the P05 test scenarios is presented in Fig. 11 and Table 14. In the P05 scenario, RBAC similarity, simple dissimilarity, and random prioritization clearly had statistically different cumulative effectiveness values. Respectively, 65% of the W and HSI, and 80% of the SPY subtest suites prioritized using RBAC similarity were sufficient to reach the highest effectiveness. RBAC similarity presented a significantly greater cumulative effectiveness compared to random prioritization and simple dissimilarity.

For P03, P04 and P05, we also calculated the mean APFD based on the cumulative effectiveness of all runs of the 30 random subtest suites. The mean APFD of each test scenario with a confidence interval of 95% is shown in Table 15. The highest APFD values are highlighted in bold.

Table 15 Mean APFD of the subtest suites with confidence interval of 95%

In the P03 scenario, the fault distribution along the FSM(P03) may have benefited fault detection and all methods performed similarly. In the P04 scenario, there was only one case where RBAC similarity did not work well and no statistically significant difference was found (i.e., P04 + W). Simple dissimilarity, in turn, did not reach an APFD higher than random prioritization. Finally, in all P05 scenarios, we found statistically significant differences between RBAC similarity, simple dissimilarity, and random prioritization. Table 16 shows the results of the Wilcoxon matched-pairs signed ranks test in the test scenarios of policies P03, P04 and P05. Significant results are highlighted in bold.

Table 16 Wilcoxon matched-pairs signed ranks test (CI=95%) for P03,P04 and P05

The analysis of the mean APFD and the confidence intervals of the subtest suites indicated that RBAC similarity performed better than simple dissimilarity and random prioritization in some scenarios. In addition to assessing whether an algorithm performs statistically better than another, it is crucial to measure the magnitude of such improvement. To analyze such aspect, effect size measures are required (Kampenes et al. 2007; Arcuri and Briand 2011; Wohlin et al. 2012).

Effect size to subtest suites

Effect size measures allow for quantifying the difference (i.e., magnitude of the improvement) between two groups (Wohlin et al. 2012). Kampenes et al. (2007) found that only 29% of software engineering experiments report some effect size measure. Thus, to improve our analysis, we also evaluated the effect that one test prioritization method had on the APFD compared with the other methods.

There are two main classes of effect size: (i) unstandardized, which depend on the unit of measurement; and (ii) standardized, which are independent of the measurement units of the evaluation criteria. For each pair of different prioritization methods, we computed five different measures: two unstandardized, (i) mean and (ii) median differences; and three standardized, (iii) Cohen’s d (Cohen 1977), (iv) Hedges’ g (Hedges 1981), and (v) Vargha-Delaney’s \(\hat{A}_{12}\) (Vargha and Delaney 2000).

Mean and median differences, Cohen’s d, and Hedges’ g are frequently reported metrics in the software engineering literature (Kampenes et al. 2007). Cohen’s d and Hedges’ g are computed based on the mean difference and an estimate of the population standard deviation \(\sigma_{pop}\) and interpreted using standard conventions (Cohen 1992).
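For illustration, the two standardized mean differences can be sketched in Python using their textbook formulations: the pooled sample standard deviation as the estimate of the population standard deviation, and the usual approximate small-sample correction for Hedges' g. The function names are hypothetical, the sample values are invented, and the effsize package used in this study may differ in implementation details.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Mean difference divided by the pooled sample standard deviation."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    if pooled_var == 0:
        # With no variation in either group the measure is undefined.
        raise ValueError("pooled standard deviation is zero; d is undefined")
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def hedges_g(x, y):
    """Cohen's d multiplied by the approximate small-sample bias correction."""
    correction = 1 - 3 / (4 * (len(x) + len(y)) - 9)
    return cohens_d(x, y) * correction


# Hypothetical samples, not APFD values from this study.
print(cohens_d([2, 4, 6], [1, 3, 5]))            # 0.5
print(round(hedges_g([2, 4, 6], [1, 3, 5]), 3))  # 0.4
```

The zero-variance guard shows why such measures cannot be computed when both methods being compared are deterministic and always yield the same value.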

Vargha-Delaney (VD) \(\hat{A}_{12}\) is an effect size measure based on stochastic superiority that denotes the probability that one method outperforms another (Vargha and Delaney 2000). If both methods are equivalent, then \(\hat{A}_{12}=0.5\). An effect size \(\hat{A}_{12}>0.5\) means that the treatment method has a higher probability of achieving a better performance than the control method, whereas \(\hat{A}_{12}<0.5\) means the opposite. Vargha-Delaney’s \(\hat{A}_{12}\) is recommended by Arcuri and Briand (2011) as a simple and intuitive measure of effect size for assessing randomized algorithms in software engineering research. Table 17 shows the pairwise comparison of the three test prioritization methods. The metrics presented can also be used in future research (e.g., meta-analysis (Kampenes et al. 2007)).

Table 17 Pairwise comparison among the methods with respect to the APFD
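The stochastic-superiority measure described above can be computed directly from its definition, as in the following sketch: the proportion of (treatment, control) pairs in which the treatment value is greater, counting ties as half. The function name and the sample values are hypothetical.

```python
def vargha_delaney_a12(treatment, control):
    """Probability that a treatment observation exceeds a control one,
    counting ties as half a win."""
    greater = sum(1 for t in treatment for c in control if t > c)
    ties = sum(1 for t in treatment for c in control if t == c)
    return (greater + 0.5 * ties) / (len(treatment) * len(control))


# Hypothetical samples, not APFD values from this study.
print(vargha_delaney_a12([1, 2, 3], [1, 2, 3]))            # 0.5
print(round(vargha_delaney_a12([3, 4, 5], [1, 2, 3]), 3))  # 0.944
```

Because it only compares values pairwise, the measure makes no assumption about the distribution of the observations.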

We did not compute the effect size for P01 and P02 due to the deterministic nature of RBAC similarity and simple dissimilarity and the consequent \(\sigma_{pop}=0\). The analysis of effect size corroborated the mean APFDs and the Wilcoxon matched-pairs signed ranks tests, and RBAC similarity had good results in P04+HSI, P04+SPY, and all P05 scenarios.

We found differences of medium magnitude between RBAC similarity and both simple dissimilarity and random prioritization in P04+HSI, and of large magnitude in P04+SPY and all P05 scenarios. There was only one case (i.e., P03+HSI) where RBAC similarity did not outperform the other methods. In the other scenarios, we found negligible to medium differences between the techniques. Thus, the following order was observed, from the method with the lowest to the highest APFDs: Simple < Random < RBAC.


Recently, Cartaxo et al. (2011) and Bertolino et al. (2015) showed that similarity functions can be helpful when it is necessary to prioritize exhaustive test suites automatically generated for LTS models and XACML policies, respectively. In our previous study (Damasceno et al. 2016), we found that, no matter which FSM-based testing method is applied to RBAC systems, larger test suites tend to be generated as the number of users and roles increases. Thus, domain-specific test criteria are required to optimize FSM-based testing for RBAC systems. To this end, there are three main approaches: (i) test minimization, (ii) test selection, and (iii) test prioritization.

Unlike (i) test minimization and (ii) test selection, which may compromise fault detection capability, (iii) test prioritization aims at finding an order of execution for an entire test suite (i.e., without filtering out any test case) based on some test criteria (Yoo and Harman 2012). In this paper, we investigated test prioritization for RBAC systems and proposed the RBAC similarity.

RBAC similarity compared to the other criteria

Our results showed that RBAC similarity performed better than simple dissimilarity and random prioritization in some of the scenarios, especially those with large FSM(P) models. For policies P01 and P02, we did not find statistically significant differences between the test prioritization criteria in most of the scenarios. The only exception was P01 + HSI, where a statistically significant difference between RBAC similarity and the other criteria was found. The HSI method reduces test dimensions by using harmonized state identifiers instead of the characterization set (Petrenko and Bochmann 1995). In this scenario, the characteristics of the HSI method may have affected test diversity and, as a result, benefited RBAC similarity.

Due to the large number of test cases generated from policies P03, P04, and P05, prioritizing the complete test suites became infeasible. To overcome this issue, we opted to apply test prioritization on random subtest suites.

For policy P03, all test prioritization approaches raised the cumulative effectiveness to its maximum value already within the first 5 to 10% of the subtest suites, and we did not find statistically significant differences between them. Thus, the fault distribution along the FSM(P03) model benefited fault detection and test prioritization. In scenarios like this, test minimization may be more suitable than test prioritization, which has an \(O(n^2)\) cost. However, as we highlighted earlier, there is a risk of reducing the capability of test suites to detect faults outside the RBAC domain.

The benefits of the RBAC similarity became more evident in the P04 and P05 scenarios, the largest FSM(P) models. In policy P04, we found a statistically significant difference between RBAC similarity and the other criteria for the subtest suites generated from HSI and SPY. The only exception was the P04 + W scenario, where the random selection of subtest suites may have compromised test diversity.

In the P05 scenario, the RBAC similarity outperformed both of the other test prioritization criteria with statistically significant differences. The analysis of the mean APFD values and effect size corroborates the mean cumulative effectiveness depicted in Figs. 9 to 11.

Random prioritization vs. simple dissimilarity

Our results showed a statistically significant difference between random prioritization and simple dissimilarity. In ten out of 15 scenarios, random prioritization presented an APFD significantly different from and higher than that of simple dissimilarity. RBAC faults can be exhibited across many different transitions of FSM(P) (Masood et al. 2010). Thus, test diversity may not imply a higher APFD.

Practical feasibility

We found that RBAC similarity may not be feasible for large complete test suites, as seen in scenarios P03, P04 and P05. The \(O(n^2)\) complexity is an inherent characteristic of most test prioritization approaches (Elbaum et al. 2002), especially similarity testing, which is also an all-to-all comparison problem (Zhang et al. 2017). However, RBAC similarity can still be improved through (i) test minimization and/or (ii) parallel programming.
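To give a sense of scale for this all-to-all comparison, the number of unordered pairs of n test cases is n(n-1)/2. The short sketch below, with a hypothetical function name, evaluates it for the subtest suite size used in this study.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n test cases, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


print(pairwise_comparisons(2528))  # 3194128
```

Doubling the number of test cases roughly quadruples the number of pairs, which is why the prioritization of the largest complete test suites did not finish within the time limit.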

The RBAC applicability can be used in test minimization as requirements coverage criteria to find test cases relevant to the constraints (i.e., requirements) of RBAC policies. Afterwards, RBAC similarity can be applied as we proposed. Thus, a significant test cost reduction can be achieved, but at the risk of reducing the fault-detection capability (Yoo and Harman 2012).

Recent studies have proposed parallel algorithms to efficiently calculate similarity matrices for mathematical modelling of heterogeneous hardware (Rawald et al. 2015) and ontology mapping (Gîză-Belciug and Pentiuc 2015). However, to the best of our knowledge, they have never been investigated for similarity testing. Using RBAC similarity as a test minimization criterion and parallel algorithms to calculate similarity matrices for test prioritization could boost similarity testing, but this is out of the scope of this study and left as future work.

Threats to validity

Conclusion Validity: Threats to conclusion validity relate to the ability to draw correct conclusions about the relation between the treatment and the outcomes. To mitigate this, we used the Wilcoxon matched-pairs signed ranks test, at a confidence level of 95%, to verify whether RBAC similarity reached different APFDs compared to simple dissimilarity and random prioritization. We also computed the mean APFD with a confidence interval of 95% and five effect size measures to quantify the difference between the methods. The statistical analyses were performed using the R statistical package and the effsize package (Torchiano 2017). The R scripts and the input and output statistical data are included in the RBAC-BT repository.

Internal Validity: Threats to internal validity are related to influences that can affect independent variables with respect to causality. They threaten conclusions about a possible causal relationship between treatment and outcome. To mitigate this threat, random tasks (i.e., subtest suite generation and random prioritization) were repeatedly performed to avoid results obtained by chance. Most of the artifacts used in this work were reused from the lab package of our previous study (Damasceno et al. 2016).

Construct Validity: Construct validity concerns generalizing outcomes to the concept or theory behind the experiment. We used first-order mutants from the RBAC fault domain (Masood et al. 2009) to simulate simple faults and evaluate the effectiveness of each prioritization criterion. Mutation analysis is a common assessment approach in software testing investigations (Jia and Harman 2011). Other RBAC fault models could be used in this experiment, such as malicious faults (Masood et al. 2009) and probabilistic models of fault coverage (Masood et al. 2010). These fault models could be used to analyse RBAC similarity testing from the perspective of faults of a different nature, but they were left as future work. Moreover, despite the relatively low number of faults, the RBAC fault model is still representative of functional faults of RBAC systems (Masood et al. 2009).

External Validity: External validity concerns the generalization of the outcomes to other scenarios. To mitigate this threat, we included test suites generated from three different test generation methods and RBAC policies with different characteristics.


Essentially, the RBAC model reduces the complexity of security management routines by grouping privileges through roles which can be assigned to users and activated in sessions. Access control testing is one important activity during the development of RBAC systems since implementation mistakes may lead to security breaches. In this context, previous studies have shown that FSM-based testing can be effective at detecting RBAC faults, but very expensive. Thus, additional steps become necessary to make RBAC testing more feasible and less costly.

Test case prioritization addresses this problem: it aims at finding an execution order for the test cases that maximizes some test criterion. Similarity testing is a variant of test case prioritization which has been investigated in the XACML and LTS domains, where it found better execution orders for test cases. In this paper, we introduce a test prioritization technique named RBAC similarity, which uses the dissimilarity between pairs of test cases and their pairwise applicability to the RBAC policy under test (i.e., the relevance of these test cases to the RBAC constraints) as test prioritization criteria.
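To make the idea of ordering test cases by pairwise dissimilarity and relevance more concrete, the following is a minimal sketch and not the RBAC-BT implementation. It assumes that each test case is a sequence of input labels, that dissimilarity is the Jaccard distance over the sets of inputs, and that a per-test `relevance` score is already available; the `pair_score` combination, the greedy selection rule, and all names are hypothetical choices made only for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Dissimilarity of two test cases viewed as sets of inputs (assumed measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def prioritize(tests, relevance):
    """Greedy ordering: seed with the best-scoring pair, then repeatedly append
    the remaining test whose worst score against the selected tests is highest."""
    if len(tests) < 2:
        return list(tests)

    def pair_score(i, j):
        # Hypothetical combination: dissimilarity weighted by the joint relevance.
        return jaccard_distance(tests[i], tests[j]) * (relevance[i] + relevance[j])

    # Seed the order with the pair that has the highest combined score.
    first = max(combinations(range(len(tests)), 2), key=lambda p: pair_score(*p))
    order = list(first)
    remaining = [i for i in range(len(tests)) if i not in order]

    while remaining:
        # Pick the test that is farthest from its closest already-selected test.
        best = max(remaining, key=lambda i: min(pair_score(i, j) for j in order))
        order.append(best)
        remaining.remove(best)
    return [tests[i] for i in order]
```

In this sketch, test cases that share most of their inputs with already selected ones are pushed towards the end of the order, while highly relevant and mutually dissimilar test cases are executed first.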

Our RBAC similarity approach was experimentally evaluated and compared with simple dissimilarity and random prioritization as baselines. The results showed that RBAC similarity improved the mean cumulative effectiveness and the APFD, and reached the maximum effectiveness of the test suites at a faster rate, with a significant difference in most of the cases. In some scenarios, prioritizing HSI and SPY test suites with RBAC similarity resulted in better APFD values than applying the technique to W test suites. The characteristics of the test cases generated by HSI and SPY favoured the similarity testing algorithms, while random selection applied to complete test suites generated by W negatively impacted test prioritization using similarity functions. Moreover, random prioritization also outperformed simple dissimilarity in most of the cases. We analyzed our data using the Wilcoxon matched-pairs signed-rank test, error bars with CI=95%, and five effect size metrics (i.e., mean and median differences, Cohen’s d, Hedges’ g and Vargha-Delaney’s Â12), and found statistically significant differences in some scenarios.
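For readers who want to recompute the reported metrics, the sketch below shows the commonly used definitions of APFD and of the Vargha-Delaney effect size. It is not taken from the RBAC-BT tool: the function names are hypothetical, and `apfd` assumes that every fault is detected by at least one test case and that positions in the prioritized suite are 1-based.

```python
def apfd(first_detection_positions, n_tests):
    """APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2 * n).

    first_detection_positions holds, for each of the m faults, the 1-based
    position TF_i of the first test case in the prioritized suite that
    detects it; n_tests is the suite size n.
    """
    m = len(first_detection_positions)
    if m == 0 or n_tests <= 0:
        raise ValueError("need at least one fault and one test case")
    return 1.0 - sum(first_detection_positions) / (n_tests * m) + 1.0 / (2 * n_tests)


def vargha_delaney_a12(xs, ys):
    """Estimate the probability that a value from xs exceeds a value from ys,
    counting ties as one half."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))
```

An effect size of 0.5 indicates that neither technique tends to produce larger values, while values closer to 1 indicate that the first sample (e.g., the APFD values of one prioritization technique) tends to be larger than the second.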

All test artifacts (i.e., RBAC-BT tool, test suites, test results, RBAC policies, and statistical data) are available online (see Footnote 2) and can be used to replicate, verify and validate this experiment. As future work, we want to investigate alternative algorithms for ordering test cases, such as algorithms using total information for test prioritization, and other fault models, such as simulated malicious faults and probabilistic fault models. We also intend to investigate the usage of RBAC similarity as a requirements coverage criterion for test minimization and as a fitness function in search-based software testing (McMinn 2004).







We thank all the members of the Software Engineering Laboratory (LabES) at the University of Sao Paulo (USP) for their valuable comments. We also thank the reviewers for their valuable comments and suggestions on this study.


Carlos Diego Nascimento Damasceno’s research project was supported by the National Council for Scientific and Technological Development (CNPq), process number 132249/2014-6.

Author information




CDND designed and conducted the experiment, adapted the RBAC-BT tool and analyzed the results. PCM and AS supported the validation of the experiment protocol and analysis of results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Carlos Diego N. Damasceno.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

N. Damasceno, C.D., Masiero, P.C. & Simao, A. Similarity testing for role-based access control systems. J Softw Eng Res Dev 6, 1 (2018).


  • Finite state machines
  • Role-Based Access Control (RBAC)
  • Test prioritization
  • Similarity testing