In order to share the architecture knowledge among the stakeholders, it must be adequately documented and communicated. The Software Architecture Document (SAD) is the usual artifact for capturing this knowledge. The SAD format can range from Word documents to a collection of Web pages hosted in a Wiki. The latter format is becoming common nowadays ([de Graaf et al. 2012]; [Farenhorst and van Vliet 2008]; [Jansen et al. 2009]). The SAD is generally structured around the concept of architectural views, which represent the many structures that are simultaneously present in software systems. A view presents an aspect or viewpoint of the system (e.g., static aspects, runtime aspects, allocation of software elements to hardware, etc.). Therefore, the SAD consists of a collection of documents (according to predefined templates) whose contents include views (e.g., module views, components-and-connectors views, allocation views) plus textual information about the views (e.g., system context, architectural drivers, key decisions, intended audience).
Stakeholders are important actors in the documentation process as they are the main consumers of the SAD. By stakeholder ([Mitchell et al. 1997]), we mean any person, group or organization that is interested in or affected by the architecture (e.g., managers, architects, developers, testers, end-users, contractors, auditors). We argue that the value of a SAD strongly depends on how its contents satisfy the stakeholders’ information needs. As we mentioned in Section 3, the V&B method proposes a stakeholder-based strategy for organizing the SAD views and their contents ([Clements et al. 2003]). V&B characterizes several types of stakeholders and then links them to specific architectural views, based on the anticipated usage of the views by the stakeholders and their preferred level of detail for the views. This information is summarized in the matrix of Figure 1.
We see the V&B characterization of stakeholders as a basic form of user profiles, and propose a semi-automated approach that leverages these profiles (and enriches them) to establish links with relevant SAD documents. From this perspective, the SAD is personalized according to each stakeholder’s information needs, as captured in his/her profile. The process requires a certain time to learn the user interests and build accurate profiles. This situation is known as the “cold start” problem ([Schiaffino and Amandi 2009]). Initially, the profiles are based only on the V&B matrix of stakeholders’ preferences for architectural views. The “cold start” phase lasts until the system is able to gather additional information about stakeholders’ interests so as to enrich the profiles. The user’s browsing activity over a Web-based SAD is an example of such information.
A key aspect of our approach is the granularity of SAD contents when mapped to Wiki pages. This granularity defines the “unit of recommendation” of our tool. In particular, we used one Wiki page per architectural view, plus one Wiki page per additional section (documentation beyond views) of the SEI’s V&B template, as suggested by other Wiki-based SADs based on V&B ([Clements et al. 2003]). However, this mapping choice is not mandatory.
A general schema of our profile-based recommendation approach is depicted in Figure 2. The design consists of a pipeline of processing units that: i) generates user profiles, ii) generates document representations, and iii) computes matching relationships among users and SAD documents. We refer to these relationships as relevance links, which are actually the recommendations provided by our tool. The relevance links are computed on the basis of the similarity between the user profiles and the document representations. A detailed description of our pipeline can be found in Section 4.3.
4.1 Inputs of the approach
The inputs needed to perform the analysis of stakeholders and construct their profiles are the following:
- SAD textual contents: The plain text from SAD documents (or sections) is automatically processed using NLP techniques (see Section 4.5) in order to generate document models.
- Interest indicators: When users interact with Web pages, several interest indicators can be recorded ([Al halabi et al. 2007]; [Claypool et al. 2001]). In particular, we analyzed indicators such as: time spent on reading a page, number of visits, mouse scrolls, mouse clicks, and the ratio between scrolls and time (which represents the frequency of scrolls while reading a Web page), among others ([Claypool et al. 2001]).
- Semantic dictionary: We considered a dictionary composed of a hierarchy of concepts and categories. Categories are concepts with a higher level of abstraction. This kind of semantic source of knowledge is derived from a previous work ([Nicoletti et al. 2013a]). Instead of using a general-purpose dictionary, we here customized it to consider only Software Architecture concepts along with their corresponding categories. This dictionary was built by combining concepts from an existing ontology for software architectures ([de Graaf et al. 2012]) with concepts described by the SEI’s software architecture bibliography ([Bass et al. 2012]). We should note that thesauri commonly used for NLP tasks, such as WordNet or Wikipedia ([Nicoletti et al. 2013a]), are not specific enough in the Software Architecture domain of knowledge.
- Semantic annotations: Our approach needs to build a model (or representation) of each SAD document. We argue that these models can be enriched with explicit annotations provided by an expert (e.g., a member of the software architecture team), who, in general, will also generate the SAD contents. This expert is able to select those concepts or categories that best describe the semantics of each document. The annotations are considered part of the document representations. Moreover, the annotations are helpful to model documents/sections of the SAD template that are partially completed (or even empty), or to refine the representation of documents that are not accurately described by their textual contents (e.g., documents that contain many images and little text).
Our approach is regarded as semi-automated because the intervention of experts is required to input semantic annotations in the SAD documents. The expert also makes annotations on role types and, thus, incorporates V&B-related information. The initial stakeholders’ profiles are mainly filled in with these annotations. For both annotation tasks, the expert uses the semantic dictionary as a “label catalog”. The rest of the tasks and computations can be performed automatically.
4.2 Modeling users and documents
Both user profiles and documents are represented by the same structure. This structure comprises two parts: i) a set of semantic concepts and categories, which are extracted from the dictionary mentioned above, and ii) a set of tags (or keywords) ([Schiaffino and Amandi 2009]; [Nicoletti et al. 2013a]). For each item (i.e., concept, category or tag), the number of occurrences is recorded as the item frequency.
In our context, tags are keywords or non-trivial words, often extracted with NLP techniques. Trivial words, such as pronouns or prepositions, are usually excluded. Concepts are basic units of meaning that serve humans to organize and share their knowledge. For instance, the English Wikipedia articles have been used as a source of concepts. Categories are concepts with a higher level of abstraction, which might be linked to concrete concepts or to other categories with a different level of abstraction. In our dictionary, for instance, performance, fault tolerance and security are examples of concepts within the category quality attributes, which, in turn, is linked to the high-level category requirements.
A user profile or a document model is a triple M = <CON, CAT, TAG>, in which:
- CON = {con_1, …, con_n} is a set of concepts, where con_i (1 ≤ i ≤ n) is a pair <C, F> with C as the concept and F as its frequency.
- CAT = {cat_1, …, cat_m} is a set of categories, where cat_i (1 ≤ i ≤ m) is a pair <C, F> with C as the category and F as its frequency.
- TAG = {tag_1, …, tag_t} is a set of tags, where tag_i (1 ≤ i ≤ t) is a pair <T, F> with T as the tag and F as its frequency.
This representation is convenient for calculating similarities between users and documents, as well as for quickly describing the interests of a given user or the contents of a given document. For instance, Figure 3 shows what the model of user interests might look like. We also decided to combine both concepts and tags, since the SAD is generally composed of general concepts and problem-specific concepts. Some examples of problem-specific concepts are: names of software components, specific stakeholders’ names, and tactics and patterns that are not included in the common catalogs, among others. The general concepts are defined in our semantic dictionary, whereas the problem-specific concepts are mined from the text.
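As an illustration, the triple structure can be sketched as a small Python class; this is a presentation aid only (the names Model and add, and the sample items, are our assumptions, not part of the tool):

```python
from dataclasses import dataclass, field

# Illustrative sketch of a model M = <CON, CAT, TAG>, represented as three
# frequency maps, one per kind of semantic item.
@dataclass
class Model:
    con: dict = field(default_factory=dict)  # concept -> frequency
    cat: dict = field(default_factory=dict)  # category -> frequency
    tag: dict = field(default_factory=dict)  # tag -> frequency

    def add(self, kind: str, item: str, freq: int = 1) -> None:
        """Accumulate the frequency of a concept ('con'), category ('cat'), or tag ('tag')."""
        bucket = getattr(self, kind)
        bucket[item] = bucket.get(item, 0) + freq

# A profile mixing a general concept, its category, and a problem-specific tag:
profile = Model()
profile.add("con", "fault tolerance")
profile.add("cat", "quality attributes")
profile.add("tag", "billing-service", freq=3)
```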
4.3 The processing procedure
The process is divided into three main stages, as depicted in Figure 4. First, the document representation generation is performed. This stage runs an NLP semantic analysis of the documents (hosted in the Wiki), and afterwards merges the semantic annotations with the partial representation of those documents. We refer to a model (of a document or a user) as being “partial” when its constituents (e.g., tags) must be refined by running one or more processing units. First, the document annotations are included in the partial models of documents. A prefixed value (parameter N) denotes the weight (or frequency) that each annotation will have in the document model. Second, an NLP analysis of documents is performed. To this end, we have configured a sub-pipeline of NLP tasks that produces a term-based representation of the documents. The details of this sub-pipeline are described in Section 4.5.
Second, the user profiles generation stage takes place. The initial user profiles, which are empty at this point, are enriched with semantic annotations coming from the roles (or stakeholder types) associated with each user. The role annotations and the user profiles are merged as described in the previous stage. Next, we perform what we call semantic items weighting: the semantic items that were extracted from the documents visited by a given user are added to that user’s profile. We assume that when a user accesses a SAD section, the contents of that section are likely to be relevant to that user. Therefore, we consider interest indicators from usage statistics to weight the relevance of the semantic items incorporated into user profiles. In particular, we used the number of visits as the frequency of the new items for the user profile. For example, if a user visited a given document N times, and that document contains a concept X and a tag T, then both X and T are incorporated into the user profile with a frequency of N. This indicator was prioritized over the others based on an empirical assessment of its relevance for inferring user interests (see Section 5).
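The semantic items weighting step can be sketched as follows; this is a simplified illustration over the frequency-map representation of Section 4.2 (the function name weight_profile and the sample data are our assumptions):

```python
# Every concept, category, and tag of a visited document is added to the user
# profile, using the number of visits as the frequency of the new items.
def weight_profile(profile: dict, document: dict, visits: int) -> None:
    # profile and document have the shape {"con": {...}, "cat": {...}, "tag": {...}}
    for kind in ("con", "cat", "tag"):
        for item in document.get(kind, {}):
            profile.setdefault(kind, {})
            profile[kind][item] = profile[kind].get(item, 0) + visits

profile = {"con": {}, "cat": {}, "tag": {}}
doc = {"con": {"performance": 2}, "tag": {"load-balancer": 1}}
weight_profile(profile, doc, visits=4)  # the user visited this document 4 times
```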
Finally, the user document linkage stage is executed. In this stage, the models for both users and documents have already been generated. These two kinds of models are processed by an algorithm that determines the degree of matching (or similarity) between the models. In this article, we analyzed several metrics to compute the similarity between two models (see Section 4.4) and compared them empirically (see Section 5).
The output of the complete procedure is a set of weighted links between users and documents, in which the weights indicate the relevance of the documents for each user. The output is grouped by user and ranked in descending order to select the most important links per user. A threshold k is used to establish the number of documents retrieved as relevant. For example, if we have N different sections in a SAD, with k=1/4 we are considering the first N/4 sections as relevant and the remaining 3N/4 as irrelevant. Figure 5 shows a snapshot of our recommendation tool at work.
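The per-user ranking and threshold selection can be sketched as follows (illustrative only; the section names and weights are made up):

```python
# Rank one user's relevance links in descending order of weight and keep the
# top fraction k of the SAD sections.
def relevant_sections(links: dict, k: float) -> list:
    """links maps section name -> relevance weight for a single user."""
    ranked = sorted(links, key=links.get, reverse=True)
    cutoff = int(len(ranked) * k)
    return ranked[:cutoff]

links = {"module view": 0.9, "context": 0.7, "c&c view": 0.4, "glossary": 0.1}
top = relevant_sections(links, k=1 / 4)  # N=4 sections, so N/4 = 1 is kept
```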
4.4 Computing similarity between models
In general, the strategies to compute the similarity (or distance) between two items map the similarity between the symbolic descriptions of two objects to a unique numerical value ([Huang 2008]). In this section we describe the different strategies we used to compute the degree of matching between user profiles and document models. For each strategy, we present a short description, its formal definition and, if necessary, some considerations for its usage. It is worth noting that for those strategies that compute the distance between two items, we take the similarity as the inverse of that distance.
4.4.1 Euclidean distance
The Euclidean distance represents the distance between two points in space, given by the Pythagorean formula. It is one of the most widely used distances for numerical data ([Deza and Deza 2006]; [Deza and Deza 2009]; [Liu 2011]). Equation 1 ([Deza and Deza 2009]) is the formal definition of this distance, where a and b are the vectors to be analyzed, and w_{t,a} and w_{t,b} are the weights associated with attribute t in a and b, respectively. In this case, weights correspond to numbers of occurrences.
d_{Euc}(a, b) = \sqrt{\sum_t (w_{t,a} - w_{t,b})^2}   (1)
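Over the frequency-map representation of Section 4.2, the distance can be sketched as follows (an illustrative sketch, not the tool’s code; attributes missing from one model are taken as having weight 0):

```python
from math import sqrt

# Euclidean distance (Equation 1) between two frequency maps.
def euclidean(a: dict, b: dict) -> float:
    attrs = set(a) | set(b)  # union of attributes; absent ones weigh 0
    return sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in attrs))

# (3-1)^2 + (1-0)^2 + (0-2)^2 = 9, so the distance is 3.0
d = euclidean({"performance": 3, "security": 1}, {"performance": 1, "usability": 2})
```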
4.4.2 Manhattan distance
This measure takes its name from its geometrical interpretation in the so-called Taxicab geometry, in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Its name alludes to the grid layout of most streets on the island of Manhattan, which causes the shortest path a car could take between two intersections in the borough to have a length equal to the distance between the intersections in taxicab geometry. Equation 2 shows the formal definition of the Manhattan distance between two vectors a and b.
d_{Man}(a, b) = \sum_t |w_{t,a} - w_{t,b}|   (2)
4.4.3 Chebyshev distance
This strategy defines the distance between two distributions considering the maximum difference between their attributes. It is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. Equation 3 shows its formal definition ([Deza and Deza 2009]).
d_{Cheb}(a, b) = \max_t |w_{t,a} - w_{t,b}|   (3)
4.4.4 Cosine similarity
The representation of distributions as vectors enables us to measure the similarity between them as the correlation between the corresponding vectors ([Huang 2008]). Such similarity can be quantified as the cosine of the angle between them. The strategy is independent of the length of the distributions, and it is one of the most widely used in information retrieval systems. Equation 4 presents the formal definition of the cosine similarity of vectors a and b, where a \cdot b represents the inner product of these vectors, and \|a\| and \|b\| represent their norms ([Deza and Deza 2009]; [Liu 2011]).
sim_{Cos}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_t w_{t,a} \, w_{t,b}}{\sqrt{\sum_t w_{t,a}^2} \, \sqrt{\sum_t w_{t,b}^2}}   (4)
4.4.5 Cosine distance
Given the previous equation, the cosine distance in Equation 5 adapts it to compute the distance between two distributions.
d_{Cos}(a, b) = 1 - sim_{Cos}(a, b)   (5)
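Equations 4 and 5 can be sketched as follows (illustrative; by convention we give similarity 0 when either vector has zero norm):

```python
from math import sqrt

# Cosine similarity (Equation 4) and cosine distance (Equation 5) over
# frequency maps.
def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cosine_distance(a: dict, b: dict) -> float:
    return 1.0 - cosine_similarity(a, b)

s = cosine_similarity({"performance": 1}, {"performance": 2})  # same direction
```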
4.4.6 Kullback-Leibler divergence
This strategy is also known as information gain or relative entropy ([Huang 2008]), and it is defined as in Equation 6.
d_{KL}(a \| b) = \sum_t w_{t,a} \log\frac{w_{t,a}}{w_{t,b}}   (6)
The formula presents an indetermination when w_{t,a} or w_{t,b} are equal to 0. To avoid this indetermination and be able to compute the formula correctly, a correction is applied to those values equal to 0: they are replaced by 10^{-6}, which is small enough not to affect the original distributions. Additionally, the strategy is not symmetrical, that is, d_{KL}(a \| b) \neq d_{KL}(b \| a). In this condition, it cannot be used to measure distances. To overcome this difficulty, the average divergence is computed using the formula in Equation 7, where \pi_1 = \frac{w_{t,a}}{w_{t,a} + w_{t,b}}, \pi_2 = \frac{w_{t,b}}{w_{t,a} + w_{t,b}}, and w_t = \pi_1 w_{t,a} + \pi_2 w_{t,b}.
d_{AvgKL}(a, b) = \sum_t \left( \pi_1 \, w_{t,a} \log\frac{w_{t,a}}{w_t} + \pi_2 \, w_{t,b} \log\frac{w_{t,b}}{w_t} \right)   (7)
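The corrected, averaged divergence can be sketched as follows (illustrative code using the 10^{-6} replacement for zero weights described above, not the authors’ implementation):

```python
from math import log

EPS = 1e-6  # replaces zero weights to avoid the indetermination

# Averaged Kullback-Leibler divergence (Equation 7) over frequency maps;
# the pi_1/pi_2 weights and w_t follow the definitions given in the text.
def avg_kl(a: dict, b: dict) -> float:
    total = 0.0
    for t in set(a) | set(b):
        wa = a.get(t, 0) or EPS
        wb = b.get(t, 0) or EPS
        pi1, pi2 = wa / (wa + wb), wb / (wa + wb)
        wt = pi1 * wa + pi2 * wb
        total += pi1 * wa * log(wa / wt) + pi2 * wb * log(wb / wt)
    return total
```

Unlike Equation 6, this measure is symmetrical, so it can be used as a distance.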
4.4.7 Dice-Sorensen similarity
The Dice-Sorensen coefficient is a statistic used for comparing the similarity of two samples ([Dice 1945]; [Sørensen 1948]). It is based on an analysis of the presence or absence of data in the samples considered. As compared to the Euclidean distance, the Sorensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers. Its formal definition is given by Equation 8 ([Deza and Deza 2009]).
sim_{Dice}(a, b) = \frac{2 \sum_t w_{t,a} \, w_{t,b}}{\sum_t w_{t,a}^2 + \sum_t w_{t,b}^2}   (8)
4.4.8 Jaccard distance
The Jaccard distance is a statistic used for comparing the similarity and diversity of sample sets. This strategy is very useful to analyze text similarity in huge collections ([Rajaraman and Ullman 2012]). Equation 9 shows the formal definition of this strategy, where T_a and T_b represent the attribute sets of distributions a and b, respectively.
d_{Jac}(a, b) = 1 - \frac{|T_a \cap T_b|}{|T_a \cup T_b|}   (9)
4.4.9 Overlap coefficient
The overlap coefficient, also known as Simpson similarity, is a similarity measure related to the Jaccard index that computes the overlap between two sets, which is defined as Equation 10 ([Deza and Deza 2009]).
sim_{Overlap}(a, b) = \frac{|T_a \cap T_b|}{\min(|T_a|, |T_b|)}   (10)
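The set-based Jaccard distance and overlap coefficient can be sketched as follows (illustrative; the inputs are the attribute sets of two models):

```python
# Jaccard distance (Equation 9): dissimilarity of two attribute sets.
def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

# Overlap (Simpson) coefficient (Equation 10): intersection over smaller set.
def overlap_coefficient(a: set, b: set) -> float:
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

jd = jaccard_distance({"performance", "security"}, {"performance"})
oc = overlap_coefficient({"performance", "security"}, {"performance"})
```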
4.4.10 Pearson correlation coefficient
This strategy measures how related two distributions are. Equation 11 shows its formal definition, where m is the number of attributes, TF_a = \sum_t w_{t,a}, and TF_b = \sum_t w_{t,b}.
sim_{Pearson}(a, b) = \frac{m \sum_t w_{t,a} \, w_{t,b} - TF_a \times TF_b}{\sqrt{\left[ m \sum_t w_{t,a}^2 - TF_a^2 \right] \left[ m \sum_t w_{t,b}^2 - TF_b^2 \right]}}   (11)
This strategy gives as a result a value in the range [−1, +1], being +1 when a = b.
4.4.11 Pearson correlation coefficient distance
This strategy applies a change to the Pearson correlation coefficient so that the value of the metric fits in the range [0, +1] and, hence, the strategy represents the distance between two distributions a and b.
d_{Pearson}(a, b) = \begin{cases} 1 - sim_{Pearson}(a, b) & \text{if } sim_{Pearson}(a, b) \geq 0 \\ |sim_{Pearson}(a, b)| & \text{otherwise} \end{cases}   (12)
4.4.12 Tanimoto distance
The Tanimoto distance can be defined as a variation of the Jaccard distance ([Huang 2008]). It compares the weights of shared attributes with the weights of those attributes that belong to one of the distributions but are not shared between them. The strategy calculates the similarity between two distributions a and b, giving a value in the range [0, 1], being 1 when a = b and 0 when the distributions are completely different. The formula of the distance is shown in Equation 14.
sim_{Tan}(a, b) = \frac{a \cdot b}{\|a\|^2 + \|b\|^2 - a \cdot b}   (13)
d_{Tan}(a, b) = 1 - sim_{Tan}(a, b)   (14)
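Equations 13 and 14 can be sketched as follows (illustrative; the denominator-zero case is handled by convention):

```python
# Tanimoto (extended Jaccard) similarity over weighted attributes
# (Equation 13) and the corresponding distance (Equation 14).
def tanimoto_similarity(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na2 = sum(v * v for v in a.values())
    nb2 = sum(v * v for v in b.values())
    denom = na2 + nb2 - dot
    return dot / denom if denom else 0.0

def tanimoto_distance(a: dict, b: dict) -> float:
    return 1.0 - tanimoto_similarity(a, b)

t = tanimoto_similarity({"performance": 2}, {"performance": 2})  # identical
```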
4.4.13 TF.IDF-based similarity function
In addition to the previous similarity metrics, we propose a candidate function that specifically fits the problem of matching stakeholder profiles against architecture documents. Our function is based on the TF.IDF (Term Frequency x Inverse Document Frequency) metric of the Information Retrieval field ([Baeza-Yates and Ribeiro-Neto 2011]). The function is computed as indicated in Equation 15, in which U is a triple describing a user profile, D is a triple describing a document model (Section 4.2), N is the number of concepts, M is the number of categories, T is the number of tags (all from the user profile), ConF.IDF(x) is the CF.IDF-value for user concept x, CatF.IDF(y) is the CF.IDF-value for user category y, and TF.IDF(t) is the TF.IDF-value for user tag t ([Goossen et al. 2011]; [Nicoletti et al. 2012]). This computation outputs a value in the range [0, +∞), which is then normalized to the range [0, 1]. A high value represents a good similarity between the user and the document. If the value is close to 0, it means that there are few or no semantic items shared between the two models.
sim_{TFIDF}(U, D) = \sum_{i=1}^{N} ConF.IDF(con_i) + \sum_{j=1}^{M} CatF.IDF(cat_j) + \sum_{k=1}^{T} TF.IDF(tag_k)   (15)
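This kind of matching can be sketched as follows; this is a simplified stand-in for the actual CF.IDF/TF.IDF computation (here idf is assumed to be precomputed over the whole SAD, the normalization divides by the user’s maximum attainable score, and all names are our assumptions):

```python
from math import log

# Sum the TF.IDF-style scores of the user's semantic items that also occur in
# the document; items absent from the document contribute 0, so the score is
# low when few items are shared.
def tfidf_match(user: dict, doc: dict, idf: dict) -> float:
    score = sum(freq * idf.get(item, 0.0)
                for item, freq in user.items() if item in doc)
    max_score = sum(freq * idf.get(item, 0.0) for item, freq in user.items())
    return score / max_score if max_score else 0.0  # normalized to [0, 1]

# idf built from hypothetical document frequencies over a 10-section SAD:
idf = {"performance": log(10 / 2), "security": log(10 / 5)}
m = tfidf_match({"performance": 3, "security": 1}, {"performance": 7}, idf)
```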
4.5 NLP Semantic analysis
This analysis aims at extracting concepts and tags from a textual input. The analysis is executed on the raw text from the Wiki pages. This involves two processes: tag mining and concept mining. The sequence of tasks for tag mining is the following:
1. Text parsing: The input text from the SAD is parsed in order to remove custom annotations from the Wiki syntax as well as invalid characters.
2. Sentence detection: The parsed input text is split into a set of sentences. The OpenNLP implementation was used for this task.
3. Tokenizer: The sentences are divided into tokens (terms). The OpenNLP implementation is again used here.
4. Stop-words removal: Frequently used terms are removed. We use approximately 600 words for this task (a mixture of commonly-used stop-words).
5. Stemming: The terms are reduced to their root form to improve the keyword matching. Porter’s stemming algorithm is used here.
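The five tag-mining steps above can be sketched with the standard library alone (illustrative: a regex stands in for OpenNLP’s sentence detector and tokenizer, a crude suffix stripper stands in for Porter’s stemmer, and the stop-word list is a tiny sample of the ~600 words actually used):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "is", "are", "to", "in", "and"}  # tiny sample

def mine_tags(raw: str) -> dict:
    text = re.sub(r"\[\[|\]\]|\{\{.*?\}\}", " ", raw)       # 1. strip Wiki markup
    sentences = re.split(r"(?<=[.!?])\s+", text)            # 2. sentence detection
    tags = {}
    for sentence in sentences:
        for token in re.findall(r"[a-z]+", sentence.lower()):  # 3. tokenize
            if token in STOP_WORDS:                         # 4. stop-word removal
                continue
            stem = re.sub(r"(ing|ed|s)$", "", token)        # 5. naive stemming
            tags[stem] = tags.get(stem, 0) + 1
    return tags

tags = mine_tags("The [[layers]] pattern improves modifiability. Layers are ordered.")
```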
The sequence of tasks for concept mining is the following:
1. Text parsing: As done in tag mining (above).
2. Sentence detection: As done in tag mining (above).
3. Concept matching: A set of concepts is associated with each sentence. Since the size of the concept dictionary is relatively small, we process the complete dictionary and try to match concepts with sentences. We apply stop-words removal and stemming (Porter’s algorithm) to both concept names and sentence text alike, aiming at improving the string matching algorithm. In case the match is positive, the concept is associated with the sentence.
4. Categories matching: The category hierarchy tree is built for each concept. We associate a set of intermediate-level categories to each concept, which is already associated with each sentence, based on our previous work ([Nicoletti et al. 2013a]). The process is repeated for the upper-level categories. In the resulting profile, we register the matching categories and their frequency.
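The concept and category matching steps can be sketched as follows (the dictionary contents and the PARENT mapping are illustrative assumptions, and simple lowercasing stands in for the stop-word removal and stemming applied in practice):

```python
import re

# Hypothetical fragment of the semantic dictionary: concepts plus a
# parent link modeling the category hierarchy.
CONCEPTS = {"fault tolerance", "performance", "security"}
PARENT = {
    "fault tolerance": "quality attributes",
    "performance": "quality attributes",
    "security": "quality attributes",
    "quality attributes": "requirements",
}

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def match_concepts(sentence: str) -> dict:
    found = {}
    norm = normalize(sentence)
    for concept in CONCEPTS:                  # 3. concept matching
        if normalize(concept) in norm:
            found[concept] = found.get(concept, 0) + 1
            cat = PARENT.get(concept)         # 4. walk up the category tree
            while cat:
                found[cat] = found.get(cat, 0) + 1
                cat = PARENT.get(cat)
    return found

items = match_concepts("Performance and fault tolerance drive the design.")
```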