Requirements Engineering (2011) 16:251

Relevance-based abstraction identification: technique and evaluation

Authors

  • Ricardo Gacitua
    • School of Computing and Communications, Lancaster University
  • Pete Sawyer
    • School of Computing and Communications, Lancaster University
  • Vincenzo Gervasi
    • Dipartimento di Informatica, Università di Pisa
Best Papers of RE'10: Requirements Engineering in a Multi-faceted World

DOI: 10.1007/s00766-011-0122-3

Cite this article as:
Gacitua, R., Sawyer, P. & Gervasi, V. Requirements Eng (2011) 16: 251. doi:10.1007/s00766-011-0122-3

Abstract

When first approaching an unfamiliar domain or requirements document, it is often useful to get a quick grasp of what the essential concepts and entities in the domain are. This process is called abstraction identification, where the word abstraction refers to an entity or concept that has a particular significance in the domain. Abstraction identification has been proposed and evaluated as a useful technique in requirements engineering (RE). In this paper, we propose a new technique for automated abstraction identification called relevance-based abstraction identification (RAI), and evaluate its performance—in multiple configurations and through two refinements—compared to other tools and techniques proposed in the literature, where we find that RAI significantly outperforms previous techniques. We present an experiment measuring the effectiveness of RAI compared to human judgement, and discuss how RAI could be used to good effect in requirements engineering.

Keywords

Abstractions · Natural language · Requirements elicitation · Evaluation of tool

1 Introduction

Initial studies on abstraction identification in RE [1, 17] were motivated by the realization that rich contextual information about an envisioned system’s domain was needed in order to properly understand the stakeholders’ needs and formulate appropriate requirements. Abstraction identification was conceived as a contributor to understanding this context. It was seen as a means to discover the set of important concepts within the problem domain that encapsulated the scope of an envisioned system. Abstraction identification can thus be thought of as where the expertise of the domain expert and the requirements engineer meet: The domain expert has the rich contextual information needed for the framing of the requirements, whereas the requirements engineer has to evaluate the relevance of the context for the system-to-be.

Domain expertise is often available to the requirements engineer as documents. These documents may take many forms: They may be formal documents such as standards, problem descriptions or existing system specifications, or they may be less formal documents such as interview transcripts or field reports of ethnographic studies.

Identifying abstractions from such (often large) documents imposes a high cognitive load on the requirements engineer. Attention lapses, for example, can result in important abstractions being overlooked. To support RE, therefore, “\(\ldots\) the desire is for a clerical tool that helps with the tedious, error-prone steps of what a human elicitor does \(\ldots\)” [17]. Abstraction identification in RE has thus come to mean the automatic identification of abstractions from documents. Most of the work on abstraction identification from documents applies only to textual documents. Abstractions are not only contained in text, of course, but natural language remains the primary medium of communication in software development projects [16]. It can be reasonably assumed, therefore, that a large proportion of the key abstractions in any domain will be discoverable from textual sources, given sufficiently powerful text processing techniques. This paper proposes and evaluates one such technique.

Early work [1] recognized that abstractions cannot be lifted directly from text, but must be inferred from the words or phrases that serve as their signifiers [29]. The semantics of the abstractions are ascribed by human interpretation, but if the lexical signifiers (the words or phrases) identified by the software are correct, a requirements engineer or domain expert should be able to ascribe the correct semantics. Abstraction identification is thus broader in scope than automatic term recognition (ATR) [24], adding human interpretation of the terms returned by ATR.

It is important to note that not every term returned by ATR must signify an abstraction, and that a document may well contain abstractions that are not signified by any of the terms returned by ATR. Indeed, an abstraction may never appear as a term in the document at all, yet be implicit. Also, different terms might signify the same abstraction and, at times, although more rarely in an RE context, the same term might signify different, unrelated abstractions in different occurrences.

A set of abstractions is naturally very far from being a set of requirements. However, if organized in an appropriate way, perhaps in an ontology or a conceptual model, the identified abstractions can serve a number of useful purposes, such as:
  • The act of recognizing and organizing the abstractions helps the requirements engineer understand the problem domain, the entities it encapsulates and their relationships;

  • The set of abstractions provides a lexicon of terms or project dictionary that helps stakeholders and requirements engineers communicate effectively;

  • The set of abstractions may serve as a checklist against which a requirements engineer can validate the coverage of the formulated requirements;

  • The contents of an abstraction, i.e., the set of sentences that contain the abstraction, can be easily retrieved to (e.g.,) help verify the completeness of a set of requirements in their treatment of the abstraction, or infer relationships with other abstractions from co-occurrences.

RE is not the only domain where abstraction identification is useful. Much research in information retrieval has been applied to automatic abstraction generation and automatic indexing [9, 32, 35], and this has had particular value during the last 15 years or so because of its applications in search engines. ATR has also been applied to domain ontology construction [31] which serves a variety of purposes, including in RE [25]. With its high reliance on domain-specific jargon, often stylized language, and focus on precision and completeness, RE constitutes a distinctive challenge for abstraction identification.

The ATR technique that we present and evaluate in this paper, Relevance-based Abstraction Identification (RAI), has been designed to support abstraction identification in RE. It combines a number of existing natural language processing (NLP) techniques in a novel way to enable it to handle both single and multiword terms, ranked in order of confidence. One of the main contributions of this paper is the evaluation method that we use for RAI, which avoids the problems associated with employing expert human judgement for evaluating how well the terms returned by ATR map onto the problem domain’s underlying abstractions.

The remainder of this paper is organized as follows. In Sect. 2, we present a formalization of the problem and introduce our basic technique. Then, in Sect. 3, we present our approach to evaluate abstraction identification techniques in general, and describe how experiments were conducted, with results presented in Sect. 4.

An improved version of the RAI technique is then introduced in Sect. 5, and its performance is evaluated, in Sect. 6, against human judgement and against two comparable techniques (AbstFinder and C-Value). Section 7 presents a review of related work, and Sect. 8 concludes the paper.

2 Extraction technique

2.1 Defining the problem

We can formalize the problem of extracting abstractions from a document as follows.

Given a document D (written in natural language, e.g., English), and possibly other static sources of knowledge K, not dependent on D, we want to extract a set of terms \(A_D\), where each term is either a word or a multiword compound that denotes a significant entity in the domain of discourse of D. On \(A_D\), we may also define a ranking, i.e., an ordering on the elements of \(A_D\), with the understanding that elements that rank higher are more significant in the domain than those that rank lower. Finally, we define a measure of significance \(\sigma: A_D\rightarrow {\cal R}\) so that higher values correspond to a higher degree of significance. Given σ, we denote with ≤σ the ranking induced by σ, namely \(\forall a_1, a_2 \in A_D, a_1 \leq_\sigma \, a_2 \iff \sigma(a_1) \leq \sigma(a_2). \)

The understanding is that terms in \(A_D\) are those that most accurately describe the domain of discourse of D. In RE, the terms in \(A_D\) could be those that most accurately identify entities in the domain whose behavior and relationships need to be examined in the course of the analysis (e.g., if D is a domain description document), or those whose behavior and relationships need to be defined as part of the requirements (e.g., if D is a requirements document), or maybe those that most effectively describe what a particular stakeholder’s goals are (e.g., if D is an interview transcript).

Of course, how good a particular set of abstractions \(A_D\) is depends entirely on two factors: (1) whether it includes all and only the significant terms, as described above, and (2) whether its associated ranking ≤σ corresponds to the intuitive notion of different degrees of significance. For now, we will leave the assessment of quality of a set of abstractions to this intuitive definition, whereas in Sect. 3.1 we will propose a more rigorous definition. It should be noted though that while we use the word extraction to indicate the derivation of a set of abstractions from a document, there is no formal obligation that each and every abstraction must occur textually in the document—in fact, it may very well be the case that a relevant abstraction is not signified textually in the document at all. For example, in a document describing the domain for an inventory and logistics management application, the term RFID might never appear, and yet be relevant to the possible solution. By contrast, the term barcode might appear often in the document, and yet be discarded in favor of RFID tagging in the final warehouse management system.

2.2 Our approach

At this point, it is necessary to define the terminology that we use. A document is made up of lexical entities (or tokens), some of which will be words in the common meaning. RAI is designed to identify terms, i.e., sequences of tokens. A term is a potential signifier of an underlying abstraction in the domain of interest. We stress here that the relation between terms and abstractions is potential, as the vast majority of terms occurring in a document will not in fact signify relevant abstractions. Examples of terms are "tag", "RFID", and "RFID tag". In the following, we will take the liberty of referring to tokens as "words" (which is indeed the most common case), even though they might be numbers, acronyms or single characters.

Abstraction identification in RE is complicated by the nature of the documents containing domain knowledge. Standards and specification documents may use controlled language that is tractable to symbolic linguistic techniques that can exploit regular grammatical structure to identify terms, as is done for example in [2, 4]. In general, however, such well-behaved use of language cannot be assumed, so the techniques used to identify abstractions need to be robust enough to handle a variety of writing styles with different degrees of adherence to the norms of language use (or to any style manual or requirement template that might be in use). This has led to a number of approaches that use statistical [23, 42] or hybrid techniques [22]. AbstFinder [17] is perhaps the best known abstraction identification technique for RE and is unusual in that it eschews standard text processing techniques to treat the domain document as a stream of bytes, ignoring the lexical elements.

The main purpose of applying statistical methods for abstraction identification is to rank candidate abstractions based on a particular criterion that gives higher scores to likely abstraction candidates. The most common statistical technique is to infer the significance of a candidate term (and thus its likelihood of signifying an underlying abstraction) from the number of times it occurs in the document. Wermter and Hahn [46] point out that simple frequency profiling is hard to beat; however, one way to improve upon it is to apply additional knowledge such as the standard distributional properties of candidate terms by performing corpus-based frequency profiling. These properties can be determined if there exists a large enough normative corpus (our K from Sect. 2.1) within which the term occurs a representative number of times. The rate of occurrence thus predicted by the normative corpus can be compared with the actual rate of occurrence in the analyzed document, and the difference used to infer the strength of the term’s relevance to the domain. The more overrepresented a term is within a domain document (as compared to the representation in the corpus), the more likely it is to signify an important domain abstraction.

Corpus-based frequency profiling works as follows [40]. Assume we are interested in the significance of word w in the domain document. The domain document contains a total of \(n_d\) words, and the normative corpus contains \(n_c\) words. w occurs \(w_d\) times in the domain document and \(w_c\) times in the normative corpus. \(w_d\) and \(w_c\) are called the observed values of w. Based on the occurrences of w in the domain document and the normative corpus, we can define two expected values for w:
$$ \begin{aligned} E_d &= \frac{n_d (w_d+w_c)}{(n_d+n_c)}\\ E_c &= \frac{n_c (w_d+w_c)}{(n_d+n_c)} \end{aligned} $$
The log-likelihood value for w is then:
$$ LL_w = 2 \left(w_d \cdot \ln\frac{w_d}{E_d} + w_c \cdot \ln\frac{w_c}{E_c} \right) $$

Given a log-likelihood value for each term in the domain document, the terms can be ranked, placing the term with the highest LL value, and thus most likely to represent an underlying abstraction, at the top. This corpus-based frequency profiling is the primary technique used successfully by WMatrix [42]. It is also used by RAI, but the results of RAI are modified by the technique described below in order to cope with multiword terms.
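To make the computation concrete, the following minimal Python sketch implements the log-likelihood profiling defined above; the guards for zero counts (where the corresponding term of the sum is taken as zero) and the example figures are our own illustrative additions, not values from the paper.

```python
import math

def log_likelihood(w_d, w_c, n_d, n_c):
    """Corpus-based frequency profiling (LL) for one word: w_d, w_c are the
    observed counts in the domain document and the normative corpus,
    n_d, n_c the total word counts of each."""
    expected_d = n_d * (w_d + w_c) / (n_d + n_c)   # E_d
    expected_c = n_c * (w_d + w_c) / (n_d + n_c)   # E_c
    ll = 0.0
    if w_d > 0:
        ll += w_d * math.log(w_d / expected_d)
    if w_c > 0:
        ll += w_c * math.log(w_c / expected_c)
    return 2.0 * ll

# A word that is heavily over-represented in the domain document relative
# to the normative corpus receives a high LL value (counts are invented).
print(log_likelihood(w_d=120, w_c=40, n_d=156_028, n_c=100_000_000))
```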

There is a particular challenge associated with multiword terms since most techniques, including corpus-based frequency profiling, rely on identifying individual words, and count these individually. There are collocation analysis techniques (e.g., [3]) that can infer lexical affinities [30]; however, since most association measures are defined to measure the pair-wise adhesion of words \((w_i, w_j)\) only, they cannot be used for measuring the association between more than two words. In requirements engineering, it is fairly common to encounter domain terms, such as software requirements specification, that comprise more than two words. Correctly handling such sequences is therefore an important challenge, since several researchers claim that in specialized domains over 85% of domain-specific terms are multiword units [41, 45]. In RAI, we apply simple syntactic patterns that posit multiword terms as common combinations of adjectives and nouns, adverbs and verbs, and prepositions.
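As an illustration of this kind of pattern matching, the sketch below collects candidate multiword terms from PoS-tagged tokens. The specific pattern (sequences of adjectives and nouns ending in a noun) and the toy tagged sentence are our assumptions for demonstration; the paper does not list RAI's exact pattern set, which also covers adverbs, verbs, and prepositions.

```python
import re

# Toy PoS-tagged sentence (tags are assumed, e.g. from any PoS tagger).
tagged = [("passive", "ADJ"), ("rfid", "NOUN"), ("tag", "NOUN"),
          ("is", "VERB"), ("read", "VERB"), ("by", "ADP"),
          ("the", "DET"), ("antenna", "NOUN")]

def multiword_terms(tagged, max_len=4):
    """Collect token sequences whose tag sequence looks like a noun phrase:
    adjectives/nouns followed by a head noun."""
    terms = []
    for i in range(len(tagged)):
        for j in range(i + 2, min(i + max_len, len(tagged)) + 1):  # >= 2 words
            words, tags = zip(*tagged[i:j])
            if re.fullmatch(r"(ADJ|NOUN)( (ADJ|NOUN))* NOUN", " ".join(tags)):
                terms.append(" ".join(words))
    return terms

print(multiword_terms(tagged))  # ['passive rfid', 'passive rfid tag', 'rfid tag']
```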

2.3 RAI-0

A key problem is that although multiword terms can be identified, in abstraction identification we want to rank terms in order of the relevance of their signified abstractions. In terms of pure frequency, it is common for important multiword terms to occur relatively infrequently in a document. Worse, no normative corpus of which we are aware contains large numbers of multiword terms. This is because most such terms are specific to particular domains and hence are unlikely to find their way into a corpus whose role is to serve as a guide to general usage of a language (e.g., English). Hence, while the corpus-based frequency profiling technique described above works well for terms that are single words, in practice it doesn’t help with multiword terms.

To solve this problem, we synthesize a significance value for all terms using a heuristic based on the number of words of which the term is composed, and the LL value for each word. In its simplest form, the significance value for a term \(t=\langle w_1, w_2,\ldots,w_l\rangle\) is given by the formula:
$$ S_{t} = \frac{\sum_{i}LL_{w_i}}{ l} $$
(1)
Equation 1 simply calculates the mean of the LL values for all the component words comprising a multiword term. However, we hypothesize that not all the words contribute equally to the significance value of the multiword term of which they are a component. Our hypothesis is based on an assumption that such a term is typically composed of a headword and one or more modifiers. Thus, in the term sailing ship, the headword is the noun ship and the adjective sailing is a modifier that denotes sailing ship as a type or class of ship. We assume that the headword is the most significant component of the term; thus the word ship is more significant than sailing, and the LL value of ship should carry more weight than the LL value of sailing. To accommodate our hypothesis, the significance equation is modified to incorporate a weight \(k_i\), assigned to each component word of the term based on its position:
$$ S_{t} = \frac{\sum_{i}k_{i} LL_{w_i}}{ l} $$
(2)

We will evaluate whether our hypothesis is correct in Sect. 4. Notice, however, that assigning weights purely based on positional information, as we did, is a simplifying (but convenient) assumption, as position is certainly not the only indicator of the presence of a headword.
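A direct transcription of Eqs. 1 and 2 in Python might look as follows; the LL values and the weight vector in the example call are invented for illustration (the weights anticipate the scheme evaluated in Sect. 4).

```python
def term_significance(ll_values, weights=None):
    """Significance of a multiword term: the weighted mean of Eq. 2,
    which reduces to the plain mean of Eq. 1 when every weight is 1.0."""
    l = len(ll_values)
    if weights is None:
        weights = [1.0] * l          # Eq. 1: unweighted mean
    return sum(k * ll for k, ll in zip(weights, ll_values)) / l

# 'sailing ship': the headword 'ship' (last position) weighted 0.6, the
# modifier 'sailing' 0.4; the LL values here are invented for illustration.
print(term_significance([3.2, 8.7], weights=[0.4, 0.6]))   # 3.25
```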

The first version of RAI, RAI-0, that we reported in [14] is implemented by the following procedure:
  1. Every word in the domain text is annotated with a Part-of-Speech tag (PoS tag).
  2. The set of words is filtered to remove common words unlikely to signify abstractions. Several lists of such stop words exist; RAI uses the ONIX list [37].
  3. The remaining words are lemmatized to reduce them to their dictionary form, to collapse inflected forms of words to a base form or lemma. Thus, for example, tag and tags will be recognized as terms referring to the same concept despite having different lexical forms.
  4. Each word is assigned an LL value by applying the corpus-based frequency profiling approach described above, using the 100 million word British National Corpus (BNC) [27] as the normative corpus K.
  5. Syntactic patterns are applied to the text to identify multiword terms.
  6. A significance score is derived for every term by applying (2).
  7. Identified terms are sorted based on their significance score and the resulting list is returned.
If RAI-0 is working well, the terms with high scores will be likely to signify genuine domain abstractions, while terms with low scores will be unlikely to signify abstractions. Thus, attempts to validate the terms will be most usefully focused on terms with high scores. To aid validation of the terms, the set of terms can be filtered by, for example, ignoring all those with a score below some threshold, or by presenting only the topmost n significant terms.
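The following self-contained sketch walks through the seven steps above on a toy sentence. The tagger, lemmatizer, stop list and LL values are trivial stand-ins for the real components (a full PoS tagger, the ONIX stop list and BNC-based profiling), so only the overall flow is meant to be faithful.

```python
# A toy walk-through of the RAI-0 steps 1-7 (all resources are stand-ins).
STOP_WORDS = {"the", "is", "by", "of", "a"}                    # step 2 (toy list)
LL = {"rfid": 9.1, "tag": 6.3, "antenna": 4.8}                 # step 4 (toy values)

def lemmatize(word):                                           # step 3 (toy)
    return word[:-1] if word.endswith("s") else word

def pos_tag(tokens):                                           # step 1 (toy)
    return [(t, "NOUN" if lemmatize(t) in LL else "OTHER") for t in tokens]

def candidate_terms(tagged, max_len=3):                        # step 5 (simplified)
    terms = []
    for i in range(len(tagged)):
        for j in range(i + 1, min(i + max_len, len(tagged)) + 1):
            words, tags = zip(*tagged[i:j])
            if all(t == "NOUN" for t in tags):                 # adjacent nouns only
                terms.append(list(words))
    return terms

def rai0(tokens, weights_for_length=lambda l: [1.0] * l):
    tagged = [(lemmatize(w), t) for w, t in pos_tag(tokens)
              if w not in STOP_WORDS]                          # steps 1-3
    scored = []
    for words in candidate_terms(tagged):                      # step 5
        k = weights_for_length(len(words))
        s = sum(ki * LL.get(w, 0.0) for ki, w in zip(k, words)) / len(words)
        scored.append((" ".join(words), s))                    # step 6 (Eq. 2)
    return sorted(scored, key=lambda ts: ts[1], reverse=True)  # step 7

print(rai0("the rfid tags are read by the antenna".split()))
```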

3 Evaluation methodology

3.1 Definitions

As discussed in Sect. 2.1, the successful extraction of abstractions from a document means that the selected terms are indeed the most relevant for the purpose of understanding, analyzing, and characterizing the contents of the document. However, the concept of relevance is still defined only informally, as the concept of significance cannot be separated from that of purpose, or in other words, from the intent of the stakeholder using the abstractions. Thus, while RAI extracts terms, the actual significance of those terms (as opposed to the significance computed by RAI) is what determines whether a term indicates a genuine abstraction or is just noise that has either been generated, or has failed to be filtered, by the algorithm.

In this section, we describe the evaluation methodology that we use to discover the extent to which RAI-0 and its successor, RAI-1 (described in Sect. 5), successfully identify abstractions. The results of applying the evaluation methodology presented here to RAI-0 and RAI-1 are described in Sects. 4 and 6.

In order to define a measure for success, we assume the existence of a reference set of abstractions \(A'_D\), which is what the author(s) of D would consider the correct set of abstractions. We will see in Sect. 3.2 how we can obtain a valid \(A'_D\), for evaluation purposes, without imposing excessive burden on the author of a test document.

Once the reference set is obtained, several measures can be easily defined. Precision tells us how many of the abstractions we extracted were relevant, and Recall how many of the relevant abstractions we could extract. Formally,
$$ \begin{aligned} Precision & = \frac{\mid {A_D \cap A'_D} \mid}{\mid {A_D} \mid}\\ Recall &= \frac{\mid {A_D \cap A'_D} \mid}{\mid {A'_D} \mid} \end{aligned} $$

For many applications, it is reasonable to assume that the requirements engineer will only look at the top-n most relevant abstractions returned by a given extraction technique, according to their ranking. We will denote these additional metrics as Precision\(_n\) and Recall\(_n\), respectively; these metrics are analogous to Precision and Recall, but in place of \(A_D\) they use \(A_D^n\), where \(A_D^n \subseteq A_D,\,\mid {A_D^n} \mid=n\) and \(\forall a\in A_D^n, b\in (A_D \setminus A_D^n),\, a\geq_\sigma b. \)
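In code, the two pairs of metrics can be computed directly from the ranked list and the reference set, for instance as in the sketch below (assuming exact set membership as the matching criterion; Sect. 3.3 replaces this with a similarity threshold).

```python
def precision_recall(ranked_terms, reference, n=None):
    """Precision/Recall of the extracted terms A_D against the reference
    set A'_D; with n given, only the top-n ranked terms are considered
    (Precision_n / Recall_n)."""
    considered = ranked_terms[:n] if n is not None else ranked_terms
    hits = sum(1 for t in considered if t in reference)
    return hits / len(considered), hits / len(reference)

# Example with a ranked list of four terms and three reference abstractions.
ranked = ["rfid tag", "antenna", "technology rfid", "reader"]
print(precision_recall(ranked, {"rfid tag", "antenna", "reader"}, n=2))  # (1.0, 2/3)
```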

Depending on the particular application context, different weights can be assigned to Precision and Recall. For example, we can consider two different scenarios:
  • A development project is being executed according to extreme programming principles. The programmer needs to get a quick idea of what the main entities in the domain are, in order to build a first software prototype, where those same entities are implemented as objects in some object-oriented language. Further analysis will be conducted later on, by progressively extending the functions and relationships of those objects. In this scenario, precision is probably more relevant, as the programmer expects to uncover more relevant entities later on, but does not want to waste effort now by writing code to implement the behavior of abstractions that, possibly, will turn out to be unneeded in the end (false positives).

  • In the validation of the requirements for a high-assurance system, structured inspections are being conducted. In particular, for every requirement, all that is known in the domain about all the abstractions that appear in the requirement must be collected and presented for inspection. In this scenario, recall is more important. Presenting some extraneous material might marginally increase the cost of inspection, whereas missing some relevant abstractions (a false negative) might result in an incomplete—and ultimately, unreliable—validation.

The significance values that RAI infers for terms are used to rank the terms to help the requirements engineer single out those that signify genuine abstractions. Thus, for identification to work efficiently, ranking must cause terms in which we have low confidence, as signaled by low significance values, to appear toward the bottom of the list, and terms in which we have high confidence to appear toward the top of the list. The lag metric [19] was devised to quantify this tendency for separation of high and low confidence results, albeit in a requirements tracing problem. For abstraction identification, lag can be informally defined as the average number of false positives (terms that do not signify abstractions) with higher inferred significance value than a term that signifies a genuine abstraction.

Formally, for any term \(a \in A'_D\) in the ranked list of terms that signifies a genuine abstraction, the lag of a is
$$ lag(a) = \mid {\{ b\in(A_D\setminus A'_D)\;{\text {s.t.}}\; b >_\sigma a\}} \mid $$
Then, the overall lag is the average lag over all terms in the ranked list that signify genuine abstractions,
$$ lag = \frac{\sum_{a \in A'_D} lag(a)}{\mid {A'_D} \mid} $$
The lower the lag, the more differentiated are the genuine abstractions from the spurious terms. Thus, the lower the lag, the more effective ranking is as a means to help the requirements engineer distinguish genuine abstractions from spurious terms.
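The lag computation can be phrased as a single pass over the ranked list; in this sketch the average is taken over the genuine abstractions that actually appear in the list, which matches the definition above whenever every element of \(A'_D\) occurs in the ranking.

```python
def lag(ranked_terms, reference):
    """Average number of false positives ranked above each term that
    signifies a genuine abstraction (ranked_terms is ordered by
    decreasing significance)."""
    lags, false_positives = [], 0
    for term in ranked_terms:
        if term in reference:
            lags.append(false_positives)
        else:
            false_positives += 1
    return sum(lags) / len(lags) if lags else 0.0

# Two spurious terms precede 'antenna', none precede 'rfid tag': lag = 1.0.
print(lag(["rfid tag", "technology rfid", "serie tag", "antenna"],
          {"rfid tag", "antenna"}))
```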

3.2 Experimental protocol

We have evaluated the effectiveness of our extraction technique by means of a case study. In order to test the technique on a sufficiently large document D, and yet keep the burden placed on the author for producing \(A'_D\) manageable, we have made recourse to readily available material, rather than trying to produce original D and \(A'_D\) for the case study. In fact, we have used the full source text of a book on a technical domain as D, and the corresponding analytical index, prepared by the author at the time the book was written, as source for \(A'_D\).

This approach is probably more representative of the case when abstractions are extracted for the purpose of understanding an application domain, rather than of that when abstractions are extracted from requirements (e.g., for validation or reverse-engineering purposes). We believe that the approach has, in addition to its obvious practical advantages, some methodological ones. First, we can rest assured that in no way was the selection of \(A'_D\) influenced by the preparation of the experiment (in fact, both the book and the index predate the present work). Second, the selected book is sufficiently long and the matter sufficiently complex that it can be considered representative of those cases in which abstraction extraction is more useful. Third, the hierarchical nature of an analytical index provides us with a suitable proxy for ranking of constituent parts of multiword terms.

Last but not least, the experimental setup mimics a situation which is common in multiple RE activities, including cases in which domain analysis is conducted by using, as source material, pre-existing documents (e.g., operation manuals, textbooks, bodies of legislation), and cases in which a new member joins a team working on the requirements for a large product or product family, and needs to gain rapid knowledge of the job by studying large bodies of pre-existing requirements.

In practice, we obtain D from the body matter of the book, and \(A'_D\) from its analytical index (through a series of pre-processing steps that will be better discussed in the following); then we provide D as input to our prototype implementation, which provides us with \(A_D\). Then, \(A_D\) and \(A'_D\) are compared as discussed in Sect. 3.1, in order to evaluate the technique. The technical details are presented in the following, whereas results are presented in the next section, and discussed in Sect. 6.1.

3.3 Experimental set-up

The book that was chosen as the source document D was a textbook on RFID and its applications [10]. The book was chosen because the full text was available in machine-readable form, because its subject is representative of the kind of technical domain that a requirements engineer might plausibly need to develop a conceptual understanding of, and because its independent authorship removed any risk of bias on our part. The book is also large enough to present a real problem of information overload. It is 595 pages long and contains 156,028 words so its size serves to simulate the volume of text that a requirements engineer might encounter in a range of domain materials such as standards, manuals or indeed text books. The book’s analytical index holds 911 entries. An example of an index entry is given below for the abstraction antenna denoted by the headword "Antennas".
  • Antennas
    • description of, 21, 28–29, 32–33

    • impedance, 442

    • Mark IV, 61–62

    • mounting of, 273–274

    • position of, 183

    • Symbol, 48

In order to assess RAI’s performance, we need to detect which of the terms it identifies in the text match the abstractions signified by index entries. The concept of match deserves some explanation. Our assumption is that the index entries are terms that signify underlying abstractions judged by the author to be the key to the book’s subject domain. By matching terms returned by RAI against abstractions in the index, we really mean that we are looking for lexical similarities between entries in the two data sets. We assume that a lexical similarity is indicative of semantic relatedness; that if a match exists between a term identified by RAI and one listed in the index, they signify a common underlying abstraction.

With this in mind, the antenna entry illustrates some interesting features of the index that slightly complicate this task. First, the concept antenna is denoted by an inflected form of the word antenna: the plural Antennas. Secondly, the index’s hierarchical structure includes a hierarchy of abstractions. For example, Mark IV Antenna can be considered to signify a specialization of antenna. Antenna Symbol, by contrast, might be considered an independent abstraction, albeit a related one. These two examples are interesting because the order of the words that comprise the abstractions’ multiword terms is different; in one Antenna comes first, in the other it comes last. Thirdly, some index entries arguably represent attributes of the headword-signified abstractions, rather than abstractions in themselves. Examples include impedance, mounting, and position of antennas.

These three features all have an effect on the design of the experiment. Requiring an exact literal match between terms returned by RAI and index entries is likely to indicate very low performance, even when there is a genuine conceptual match. It is therefore necessary to apply compensation. We do this by applying a set of standard techniques used in information retrieval. The first feature can be handled by lemmatizing the index entries in the same way that RAI lemmatizes the terms in the body of the book. This guarantees that the same words returned by RAI and occurring in the index will have the same morphology and match perfectly.

We don’t detect when index terms are specializations, independent abstractions or attributes so we treat them all as abstractions signified by multiword terms (although we ignore these problems here, they would become important if we intended to organize the terms in an ontology). This still leaves the problem of term ordering. We handle word ordering by simply not caring which order the component terms appear in. Hence, the index entry (and thus, the abstraction) "Antenna, position_of" would be treated as the unordered set of terms {antenna, position_of}, and we measure the lexical similarity of this set of words with terms extracted from the text. If either position_of_antenna or antenna_position was found in the text, a perfect similarity value of 100% would be recorded (notice that the preposition of is handled transparently). In the general case, we define the similarity between two terms as follows.

Given \(t_1 = \langle w^1_1, \ldots, w^1_{l_1} \rangle\) and \(t_2 = \langle w^2_1, \ldots, w^2_{l_2} \rangle, \) with \(t_1\in A_D\) (i.e., t1 is a term extracted by RAI from the body text of the book) and \(t_2\in A'_D\) (i.e., t2 is an abstraction from the book index), the lexical similarity between t1 and t2 is given by
$$ H_1\frac{\mid {t_1 \cap t_2} \mid}{\mid {t_1} \mid}+ H_2\frac{\mid {t_1 \cap t_2} \mid}{\mid {t_2} \mid} $$
where \(H_1\) and \(H_2\) are parameters that weight the relative importance of the two terms (in our experiments, we had \(H_1 = H_2 = 1/2\) to reflect a lack of preference).

We also make adjustments where separate terms identified in the text match component words in the index but are not collocated. For example, the index contains an abstraction represented by the set {access_card, bar_code}. Both access_card and bar_code are recognized as separate multiword terms by RAI, both partially matching the index abstraction and both scoring a similarity measure of 75%. Neither of the 4-grams access_card_bar_code and bar_code_access_card, which would score 100% similarity with the index entry, is identified by RAI. This is because the four words never appear as a collocated sequence in the text. The two terms access_card and bar_code do, however, appear within the same sentence, even though they are not adjacent. We use this neighborhoodness as evidence of semantic similarity with the index abstraction and add a corrective factor of 20% (with a cap for the total score at 100%) to the two terms’ similarity values so that they both score 95%.

To qualify as a match between terms identified by RAI and index entries, we set a similarity threshold of 85%. In our previous example, we would consider an extracted term "bar code" appearing in a sentence which also contains "access card" (hence, with a total score of 95%) to match the relevant abstraction given in the analytical index as "access card, bar code".
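The matching rule just described can be sketched as follows; terms are represented as sets of lemmatized words, and the same-sentence test that triggers the 20% bonus is abstracted into a boolean flag rather than implemented.

```python
def lexical_similarity(t1, t2, h1=0.5, h2=0.5):
    """Similarity between an extracted term t1 and an index abstraction t2,
    both given as sets of lemmatized words (word order is ignored)."""
    overlap = len(t1 & t2)
    return h1 * overlap / len(t1) + h2 * overlap / len(t2)

def matches(t1, t2, same_sentence_bonus=False, threshold=0.85):
    """Apply the 85% threshold; the optional 20% neighborhood bonus (capped
    at 100%) models partial matches that co-occur in one sentence."""
    score = lexical_similarity(t1, t2)
    if same_sentence_bonus:
        score = min(1.0, score + 0.20)
    return score >= threshold, score

# {access, card} vs {access, card, bar, code}: 0.5*2/2 + 0.5*2/4 = 0.75,
# raised to 0.95 by the same-sentence bonus, so the match succeeds.
print(matches({"access", "card"}, {"access", "card", "bar", "code"},
              same_sentence_bonus=True))
```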

The values of the various parameters mentioned above have been set experimentally and somewhat arbitrarily, and seem to work well in practice. However, they were not optimized to the case study, nor were they proven to be optimal in any way. It is possible that finely tuned parameters would give better results than those reported in Sects. 4 and 6, and hence our measures are to be considered as a lower bound to how effective the technique could be.

RAI returns its identified terms as a ranked list; terms with the highest inferred significance at the top. Ranking reflects the way in which RAI is designed to be used by a human requirements engineer who cannot be expected to examine each of the thousands of terms identified in a document as long as the RFID book. Instead, they might reasonably be expected to examine the 20, 50 or 100 most highly-ranked terms, particularly since precision tends to decrease as inferred significance decreases.

While ranking is useful in the real-life scenario of how RAI is designed to be used, it poses a problem for our experimental set-up. This is because the abstractions in the book index are not ranked by significance except insofar as any ranking is implicit in the shallow hierarchical structure of the index. Because of this, the index is essentially an unordered set of abstractions. In the absence of ranking, we therefore have to make a simplifying assumption that all 911 abstractions in the index are of equal importance to the RFID domain and that to achieve perfect recall, all 911 abstractions would have to be found by RAI.

To calculate RAI’s absolute recall, we take the highest-ranked 911 terms returned by RAI and count the proportion that return a similarity value of 85% or over. However, this doesn’t accurately reflect our expectations of how a requirements engineer would exploit RAI’s ranking by significance, as described above. We therefore present a second assessment of recall for RAI, in which we count the number of abstractions in the index that correspond to the most significant 5, 10, 15, 20 and so on terms returned by RAI, again using an 85% threshold of similarity. The problem with this approach is that we should be calculating recall against the corresponding top-most 5, 10, 15 and so on index entries, but we can’t because the index is unranked. Thus, even if every one of the topmost n terms extracted by RAI corresponds to an entry in the index, recall can never exceed (n/911). More perniciously, even if those top n terms do match index entries, we have no way of knowing whether the index entries with which they correspond represent the n most significant abstractions in the RFID domain. In the worst case, the set of terms would match the n least significant abstractions in the set of 911 index entries. We just cannot tell.

Despite this limitation, plotting recall for the top-ranked n terms lets us examine the rate of increase in recall as n increases. If RAI performed perfectly, we would expect recall to start at 0% and rise linearly to 100%. The closer to this ideal linear rate of increase in recall that RAI achieves, the more confident we should be that RAI’s ranking of terms reflects the relative importance of the abstractions signified by the index entries with which the terms match. Since the significance values that RAI attaches to terms represent confidence in the terms’ correctness, we would expect the rate of increase in recall to be close to the ideal where n is small, but for the gradient to become increasingly flat as n increased. Thus, not only are we interested in how close to the ideal rate of increase in recall RAI achieves as n increases, but we are also interested in the value of n over which recall that is close to the ideal is sustained.

For the calculation of precision, the absence of ranking in the index causes similar problems. We can calculate absolute precision over the full set of 911 index entries, but we have to be careful to correctly interpret how precision varies as n increases. Using the same increments of 5 terms that we use for recall, we plot precision against n. For precision, the ideal is a constant 100% and if RAI works well, close to 100% should be achieved when n is small. As n increases, we would expect precision to decay. Thus, mirroring recall, we are interested in how close precision is to 100% and over how large a range of n this is achieved.

Finally, we measure lag over the set of 911 most significant terms to quantify the success of ranking as a means to bring the terms that signify genuine abstractions to the top of the ranked list.

4 Evaluation 1: Constant versus variable weights for RAI-0

In this section, we evaluate the performance of RAI-0 using the methodology described in Sect. 3. The key objective of this evaluation is to test our hypothesis that the LL values of the different component words of a multiword term should be weighted variably according to their position in the term. We therefore compare two versions of RAI-0. In the first, the value of \(k_i\) in (2) is set to 1.0 for all i. In the second, the sum over i of \(k_i\) is 1.0, but the value of \(k_i\) for any i varies as follows:
$$ K_{l} = \left\{\begin{array}{ll} k_0 =1.0 & \hbox{if}\,l=1 \\ k_1=0.4,\, k_0= 0.6 & \hbox{if}\,l=2 \\ k_2=0.2,\, k_1=0.3,\, k_0= 0.5 & \hbox{if}\,l=3 \\ k_3=0.2,\, k_2=0.2,\, k_1=0.3,\, k_0=0.5 & \hbox{if}\,l=4 \\ k_i=0,\, k_3=0.1,\, k_2=0.15,\, k_1=0.25,\, k_0=0.5 & \hbox{if}\,l>4, i\ge 4 \end{array}\right. $$
(3)

Here, component words are weighted in descending order from the last word in the term, with exact values assigned again arbitrarily (but under the intuition that headwords are most probably the last word in a multiword term).

Hence, if the term was "RFID tag", and the corpus-based frequency profiling scheme described above had assigned the component words "RFID" and "tag" LL values of \(LL_{RFID}\) and \(LL_{tag}\) respectively, \(LL_{tag}\) would be weighted higher at 0.6 than \(LL_{RFID}\) at 0.4. This reflects a simplifying assumption (valid for English) that in multiword terms, the core element is the final word, with the preceding words representing modifiers such as adjectives and adverbs.
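Read this way, the weight vectors of Eq. 3 can be tabulated as below, listed from the first word of the term to the last (the presumed headword); the handling of terms longer than four words follows our reading of the final case of Eq. 3.

```python
def weights_for_length(l):
    """Positional weights k_i of Eq. 3, listed from the first word of the
    term to the last (the presumed headword, weighted most heavily)."""
    table = {1: [1.0],
             2: [0.4, 0.6],
             3: [0.2, 0.3, 0.5],
             4: [0.2, 0.2, 0.3, 0.5]}
    if l in table:
        return table[l]
    # terms longer than four words: all leading words beyond the last four
    # receive weight 0 (our reading of the final case of Eq. 3)
    return [0.0] * (l - 4) + [0.1, 0.15, 0.25, 0.5]

print(weights_for_length(2))  # [0.4, 0.6]: 'RFID' at 0.4, 'tag' at 0.6
```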

For the absolute assessment of recall, of the most significant 911 terms returned by RAI-0 using constant and variable ki, respectively, 136 and 140 matched one of the 911 abstractions in the index as measured by their scoring 85% or more using our similarity measure. This corresponds to a recall of 14.93 and 15.37%. Since RAI-0 using constant and variable k respectively also returned 775 and 771 terms which fell below the similarity threshold when compared with index abstractions, the precision values are also 14.93 and 15.37%.

Figure 1 shows the performance of RAI-0 for constant and variable k in terms of rate of increase in recall and in precision. In the recall graph, it can be seen that for constant k, RAI-0’s recall increases as the sample size increases, at first linearly but then (after approximately 15–20 terms) quickly tails off. Eventually, it would reach the maximum of 14.93% if the x-axis was extended to 911. The recall for RAI-0 with variable k is similar for the brief initial linear part, but then flattens out before climbing again. Overall, the recall for variable k climbs more steeply than that for constant k.
Fig. 1 Recall and precision for RAI-0 using a constant weight of 1.0 (per Eq. 1), and a variable weight (per Eqs. 2 and 3)

In the precision graph, RAI-0’s precision decays from the maximum of 100% as larger subsets of the ranked list are selected. With k variable, the gradient of decrease in precision is slightly steeper, but levels out sooner. Were the x-axis to be extended up to 911 terms, RAI-0’s precision graph would continue to decay to the minimum values of 14.93 and 15.37%.

Finally, the lag values for constant and variable k are 27.54 and 30.85, respectively.

The results provide some support for our hypothesis that not all the component words contribute equally to the significance value of the multiword term of which they are a part. The version of RAI-0 with variable k performs somewhat better than that with constant k. The reason for this appears to be as follows. With k constant, modifiers with high LL values distort the overall significance values of multiword terms. For example, in the RFID book, the term rfid, unsurprisingly, gains the highest significance value for both variable and constant values of k. There are many other terms in which rfid forms one of the component words. In some, such as passive_rfid_tag, rfid forms part of the compound noun rfid_tag which is the second most significant term for both constant and variable values of k. Similarly, there are also several terms in which rfid_tag forms the noun, modified by an adjective such as passive_rfid_tag. Here there is no problem, because such terms tend to occur in the book index and so signify genuine abstractions. We infer that this is because terms with the pattern <adjective>_rfid_tag signify major specializations of rfid_tag. However, there are a number of terms identified by RAI-0 which do not occur in the book index and in which rfid is a component. Examples include rfid_animal_tag and rfid_pallet_tag. Here, rfid acts only as an adjective for the compound nouns animal_tag and pallet_tag. We infer that terms in which rfid is used merely as a modifier signify minor specializations of the rfid_tag abstraction and thus do not occur in the index. The problem for RAI-0 is that, if k is constant, the high LL value of rfid results in a high overall significance value for the terms signifying these specialized abstractions. If, however, k is variable, then the LL value of rfid carries less weight by virtue of having a k value of 0.2 instead of 1.0, resulting in the overall inferred significance being reduced enough to place the terms lower in the ranked list of terms. Note that we haven’t established whether the variable values that we assign to k are optimal, but our variable k scheme does appear to replicate to some extent the use of modifiers for nouns in English.

Despite giving us confidence in our hypothesis that k should be variable, the performance of even the version of RAI-0 with variable k appears modest. We explain later why the RFID book is a tough test for RAI (and how other techniques don’t fare better on such a tough test). Nevertheless, RAI-0’s performance conforms to the general pattern for ATR techniques. That is, recall necessarily starts from 0, and climbs toward its absolute maximum at a decreasing rate as inferred significance decreases. Similarly, precision typically starts at the technique’s maximum value (which is often at or near 100%) and then decays at first rapidly and then at an ever-decreasing rate. The effect of these performance profiles on a requirements engineer using the techniques to identify abstractions is that their work is subject to the law of diminishing returns: The further down the ranked list of terms they look, the lower the density of genuine abstractions contained in the list. Hence, we seek to minimize the rate at which precision decays and maximize the rate at which recall grows.

In the next section, we describe improvements to RAI-0, giving RAI-1, designed to increase its performance.

5 RAI-1

The major modification to RAI-0 made for RAI-1 addressed a pattern that we noticed in the output from RAI-0. This pattern was an effect of the variable values of k assigned to component words of multiword terms. We established that assigning variable weights produced slightly better performance than weighting every component word equally. However, as a side-effect, compound terms composed of the same number of words and ending in the same headword tended to cluster together. This is because the headword is weighted most heavily by being assigned the highest value of k in (2). If the headword has a relatively high LL value compared to the other words in the term, it will make a dominating contribution to the term’s overall significance value. Thus, terms that share the same high LL-value headword tend to have similar significance values. For example, the ranked list produced by RAI-0 on the RFID book contained a cluster of 2-grams ending in rfid. This cluster included valid terms such as passive_rfid but also spurious terms such as technology_rfid. On inspection, this phenomenon was a cause of the curious plateau in recall for RAI-0 with variable k that appears in the range 15–45 terms in Fig. 1a.

Several solutions were tried to resolve the problem. Tuning the variable values of k was not effective. In the case where equal weight was given to each component word, the effect was somewhat damped, but even here, high LL-value words such as rfid could appear at any position in a multiword term, causing the terms containing them to cluster together in the ranked list. A richer set of syntactic patterns for identifying multiword terms might be defined to reduce the occurrence of spurious terms, but we were reluctant to abandon our small set of simple patterns, which was easy to understand and based on common patterns in English.

The solution that we adopted combines the corpus-based frequency profiling significance value used by RAI-0 (\(S_t\)) with the raw frequency of the term in the text under analysis. As already noted by [46], raw frequency can be indicative of a term’s significance. The formula is given below, where \(S_t\) is the result of applying Equation 2, and the resulting \(S'_t\) is used in the ranking for RAI-1.
$$ S'_{t} = S_{t} \frac{termFreq_D(t)}{\max\limits_{t'\in A_D}termFreq_D(t')} $$
(4)

Thus, the significance value for a given term is found by multiplying the result of (2) by the ratio of the term’s actual frequency to the frequency of the most frequent term in the text. This damps the effect that high LL-value headwords have on multiword terms.
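A sketch of the RAI-1 re-scoring of Eq. 4, given the RAI-0 scores and the raw term frequencies for the document (both assumed to have been computed beforehand):

```python
def rai1_scores(rai0_scores, term_freq):
    """Damp each RAI-0 significance S_t by the term's raw frequency
    relative to the most frequent term in the document (Eq. 4)."""
    max_freq = max(term_freq[t] for t in rai0_scores)
    return {t: s_t * term_freq[t] / max_freq
            for t, s_t in rai0_scores.items()}

# A frequent term keeps most of its score; a rare one is damped heavily.
print(rai1_scores({"rfid tag": 7.7, "technology rfid": 7.5},
                  {"rfid tag": 120, "technology rfid": 3}))
```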

The top abstractions identified in the RFID book by RAI-1 are shown in Table 1.
Table 1
The 30 topmost abstractions identified by RAI-1 in the RFID book:

rfid, Inventory, rfid antenna, Tag, Passive tag, rfid project, rfid tag, Datum, Company, rfid technology, Passive rfid tag, rfid application, rfid system, Technology, Project, System, rfid wristband, rfid chip, Cost, Active rfid tag, Case, Antenna, rfid reader, Application, Active tag, Reader, Serie tag, Logistic, Supply chain, Product

6 Evaluation 2: comparative evaluation of RAI-1

In this section, we evaluate RAI-1 against RAI-0 (with variable k) and two other techniques proposed in the literature: AbstFinder [17] and C-Value [12]. AbstFinder is perhaps the best-known abstraction identification technique in RE, while C-value is a relatively recent general-purpose technique for extracting multiword terms.

We ran AbstFinder in its default form. We did not change any of AbstFinder’s default parameter settings which are hard-coded in the original C implementation. However, AbstFinder does produce some noise in the output. In our experiment, this noise took the form of duplicate terms and numbers that were clearly not abstractions. We removed this noise manually so as to leave only genuine inferred abstractions. Like RAI, C-value derives its list of inferred abstractions from a combination of linguistic and statistical information. We ran a Java implementation of C-value in the configuration described in [12]. As run, C-value produced no noise so its output was used unmodified.

For the absolute assessment of recall, of the most significant 911 terms returned by RAI-1, 294 matched one of the 911 abstractions in the index. This corresponds to recall and precision values of 32.27%, a considerable improvement over RAI-0. In the same test, AbstFinder found 65 correct abstractions, representing 7.14% recall and precision, while C-value found 117 correct abstractions at 12.84% recall and precision.

Figure 2 shows the performance of RAI-0, RAI-1, AbstFinder and C-value in terms of rate of increase in recall and in precision. In the recall graph, it can be seen that recall for RAI-1 increases linearly well beyond the point at which the gradient decreases in the other techniques.
Fig. 2 Recall and precision for RAI-1 compared with RAI-0, AbstFinder, and C-value

For precision, while RAI-0, AbstFinder, and C-value begin at 100% and immediately fall away, RAI-1 manages to sustain 100% precision for the first 80 terms before beginning to decline. Of the top 90 abstractions identified by RAI-1, 82 are correct (i.e., they are found in the book index). As discussed, however, this does not mean they are necessarily the 82 most relevant entries in the index, given that we do not have a ranking reflecting the book author’s intentions.

Finally, the lag for RAI-1 is 5.99, compared to 27.54 for RAI-0, 62.61 for AbstFinder and 32.66 for C-value. This means that with RAI-1, averaged over the whole set of 911 abstractions, the requirements engineer would have to discard about six irrelevant abstractions before finding a relevant one. However, the precision profile means that the density of correct abstractions is higher at the top of the list, where the requirements analyst would be expected to focus their effort. Thus, in practice the workload would be quite manageable.

In the context of the RFID book evaluation, there is a clear performance advantage to RAI-1 over AbstFinder and C-value. Moreover, the effect of the modifications made to RAI that led to RAI-1 has been dramatic. RAI-1 performs much better than RAI-0 on all our metrics: absolute recall and precision, the rate of increase in recall and rate of decline in precision, and lag.

6.1 Discussion

The results for RAI-1 are encouraging. The low lag, combined with the sustained linear increase in recall and the ability to sustain high precision before beginning to decline, suggests that RAI-1 would be of genuine help to a requirements engineer seeking to identify the key abstractions in an unfamiliar problem domain. Nevertheless, we must be cautious because, for absolute recall, RAI-1 failed to identify 617 of the 911 abstractions. However, two points can be made in defense of RAI-1.
  • The first is that the book index represents a tough test. There are features of the way the index is put together that make the relationship between the abstraction and its signifying terms lexically indirect. For example, the index contains an entry Department of Defense, which causes no problems. However, it also has a sub-clause, tag use, which, in the way we formulated the experiment, we treat as an independent abstraction. This abstraction yields the set of terms on which to match {Department_of_Defense, tag_use}. The term Department_of_Defense is recognized but tag_use isn’t. This is because tag use merely indicates sections of the book that describe applications of RFID tagging by the DoD. Department of Defense appears in sentences with active tag and passive tag, but tag use doesn’t appear anywhere in the body of the text. Similar examples occur many times and are rooted in the fact that the relationships between abstractions and the terms that signify them are often indirect and lexically unfocused. A human will have no difficulty making the connection between a set of terms and the underlying abstraction intended by the book’s author, even where the terms returned by a tool and the terms listed in the book’s index are lexically dissimilar. In those same situations, our similarity measure will often simply fail to find enough of a lexical match to reach the 85% threshold value. An interesting area for further work would be to discover whether more sophisticated means of calculating the similarity, such as Latent Semantic Indexing [8], might yield significantly better results.

  • The second point is related to the first one: the experiment assesses RAI’s performance operating in an unsupervised mode for which it is not intended. Human judgement is needed to infer abstractions from their signifiers. RAI is designed so that the requirements engineer would work down from the most significant term, stopping when the density of genuine abstractions dropped below an acceptable level but drawing on other evidence to help form an opinion about whether each term was a genuine domain abstraction. Such evidence can be provided by the use of a KeyWord in Context (KWIC) viewer, such as that provided by the OntoLancs environment [15] (Fig. 3), of which RAI is a part.

Fig. 3 Keyword in context. The illustration shows the requirements engineer viewing the ranked list of terms, and exploring the validity of the term "passive rfid tag" by viewing its occurrences in the source text

Despite these limitations, RAI-1 does show a notable improvement in performance over the AbstFinder and C-value benchmarks, at least under the conditions of unsupervised execution. Moreover, although book indexes have idiosyncrasies that make verifying the results of abstraction identification tools a challenge, they do have the benefits of freedom from bias and ready availability as identified in Sect. 3.2.

Some threats to the validity of the results, suggested by the results data, that may lead us to overestimate the efficacy of RAI are:
  • Lexical similarity is no guarantor of semantic relatedness, so just because a term identified by RAI matches a term in the index, it does not guarantee that both terms signify a common abstraction. However, this is a recognized risk of information retrieval that has not inhibited the usefulness of (e.g.,) Google.

  • We may be mistaken in our suggestion that a textbook on RFID is sufficiently representative of sources of domain information commonly used by a requirements engineer. Even if we are mistaken about representativeness, as we observe above, a textbook still represents a tough test of our tools.

  • The RFID textbook is a single, coherent text written by a single author. This is unusual in an RE context, where domain extractions would usually be collected from a variety of sources which might exhibit less consistency in the use of terms. Such inconsistency would also confound AbstFinder and C-value, so it is really a threat to the feasibility of ATR for abstraction identification rather than to RAI in particular.

We haven’t directly demonstrated that RAI-1 would adequately support a requirements engineer in the task of domain understanding. However, we have shown that, at least in some circumstances, its unsupervised performance exceeds that of AbstFinder for which evidence of real utility does exist [17]. Moreover, we have demonstrated that our method of evaluation is practicable, even if it offers a tough test for automatic abstraction identification techniques.

Finally, we would like to stress that the evaluation methodology that we have proposed and applied in this work lends itself to replication in the future, and offers an objective way of comparing different techniques for the abstraction identification task. We explicitly invite fellow researchers to test other approaches and further improvements in the same way.

7 Related work

The key contribution to abstraction identification in RE was made by AbstFinder [17]. In the experimental evaluation described in [17], AbstFinder achieved over 25% precision and a remarkable 100% recall. AbstFinder is unusual in its use of signal processing techniques to identify co-occurring words in different sentences in a way that tolerates different morphological forms and different relative orderings within sentences. AbstFinder treats text as byte streams and searches for patterns of co-occurring byte sequences within pairs of sentences using a series of circular shifts.

AbstFinder's toleration of different word morphologies is achieved by setting a WordThreshold parameter as the minimum length of string on which to match. If WordThreshold is set too high, then (e.g.) the common concept underlying flying and flight might be missed. Set lower (to two in this case), the common substring fl would be identified, but so would unrelated concepts signified by other words containing the same substring, such as flanking.
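The effect of such a threshold can be seen with a deliberately simplified common-substring check between pairs of words. The sketch below is a stand-in for illustration only, not a reimplementation of AbstFinder's circular-shift matching over byte streams.

```python
def common_substrings(a, b, min_len):
    """Return common substrings of `a` and `b` that are at least `min_len` characters long.
    A simplified stand-in for threshold-based substring matching (illustration only)."""
    found = set()
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                found.add(a[i:i + k])
    return found

# With the threshold set to 2, morphological variants are linked ...
print(common_substrings("flying", "flight", min_len=2))    # {'fl'}
# ... but so is an unrelated word containing the same substring.
print(common_substrings("flying", "flanking", min_len=2))  # {'fl', 'ing', 'ng'}
```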

In contrast to WordThreshold, which is set globally, a stemming algorithm [38] should give better results on a per-word basis. Lemmatization [44] should perform better still since, unlike stemming, it does not simply trim word endings seeking a common base form, but trims only the inflectional endings to derive the dictionary form of a word: its lemma. However, both stemming and lemmatization depend on first identifying the word as a lexical entity, an approach that is at odds with AbstFinder's treatment of documents as byte streams.

Corpus-based frequency profiling was first applied in RE by [26] and by [39]. To work, the normative corpus against which to compare must be very large in order for it to reliably reflect the frequency of occurrence of a significant subset of the chosen language's lexicon. Additionally, the corpus must have been annotated with part-of-speech tags, to resolve lexical ambiguities, and a frequency profile compiled for all the words it contains. After several decades of painstaking work by corpus linguists, several general-purpose normative corpora exist [11, 27].
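One measure widely used for such corpus comparison is the log-likelihood (G2) statistic. The sketch below illustrates the idea with invented counts; it is not necessarily the scoring used by the tools discussed in this paper.

```python
import math

def log_likelihood(freq_doc, total_doc, freq_norm, total_norm):
    """Log-likelihood (G2) score comparing a word's frequency in a domain document
    against a normative corpus. Higher scores indicate words that are unexpectedly
    frequent in the document relative to general language use."""
    expected_doc = total_doc * (freq_doc + freq_norm) / (total_doc + total_norm)
    expected_norm = total_norm * (freq_doc + freq_norm) / (total_doc + total_norm)
    ll = 0.0
    if freq_doc > 0:
        ll += freq_doc * math.log(freq_doc / expected_doc)
    if freq_norm > 0:
        ll += freq_norm * math.log(freq_norm / expected_norm)
    return 2 * ll

# Invented counts: 'transponder' is rare in general English but frequent in an
# RFID text, so it scores far higher than a general word such as 'system'.
doc_total, norm_total = 50_000, 100_000_000
print(log_likelihood(120, doc_total, 30, norm_total))       # 'transponder'
print(log_likelihood(150, doc_total, 250_000, norm_total))  # 'system'
```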

Normative corpora cover general language usage; by contrast, specialist domains are often defined by concepts that do not occur in normative corpora. Such terms are often n-grams such as RFID tag. N-grams are hard to identify, but a number of techniques exist. In RE, lexical affinities [30] represent collocations of words within text. A significant collocation is defined by Oakes [36] as one where the probability of one lexical item co-occurring with another word or phrase within a specified linear distance, or span, is greater than would be expected from pure chance. In [42], Berry-Rogghe's z-score [3] was used successfully to identify frequently occurring n-grams. However, important abstractions do not always occur frequently within a document and may be missed by z-scores and similar techniques.
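For reference, a common formulation of this kind of collocation z-score compares the observed number of co-occurrences of a collocate within a span of the node word against the number expected by chance. The sketch below follows that formulation with invented counts; it is an illustration, not the exact computation used in [42].

```python
import math

def collocation_z_score(co_occurrences, node_freq, collocate_freq, corpus_size, span):
    """Z-score for a node/collocate pair: observed co-occurrences within `span`
    tokens of the node versus the number expected if the collocate were
    distributed across the corpus by chance."""
    p = collocate_freq / (corpus_size - node_freq)   # chance of the collocate at any position
    expected = p * node_freq * span                  # expected co-occurrences within the spans
    return (co_occurrences - expected) / math.sqrt(expected * (1 - p))

# Invented counts for an RFID text: 'tag' occurs often near the node word 'passive',
# so the z-score is high, flagging 'passive tag' as a significant collocation.
print(collocation_z_score(co_occurrences=45, node_freq=80,
                          collocate_freq=300, corpus_size=50_000, span=4))
```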

An alternative approach is to look for syntactic patterns such as <adjective> <noun> [21] which can deliver high precision but also low recall [13]. There is much disagreement with respect to the accepted syntactic configuration for collocations [33].
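A minimal sketch of this pattern-based approach, assuming NLTK and its standard tokenizer and tagger models are available, is to scan POS-tagged text for <adjective> <noun> bigrams. The example sentence is invented, and this is an illustration in the spirit of [21] rather than the cited method itself.

```python
import nltk  # assumes the NLTK tokenizer and POS-tagger models have been downloaded

def adjective_noun_pairs(text):
    """Return <adjective> <noun> bigrams from POS-tagged text: a simple
    syntactic-pattern term extractor for illustration."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("JJ") and t2.startswith("NN"):
            pairs.append(f"{w1} {w2}")
    return pairs

# Invented sentence; candidate terms such as 'passive tag' should be picked up.
print(adjective_noun_pairs("A passive tag reflects the incident signal back to the active reader."))
```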

Semantic tagging offers another aid to abstraction identification. The requirements engineer can use semantic tags to filter information and infer the meaning of the phrases and passages in which they occur. A number of semantic taxonomies exist and some, such as WordNet [34] and Cyc [28], are very large. Semantic tagging can help classify terms as part of the process of identifying abstractions from their lexical signifiers [42], but it is generally not effective for ATR.
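To illustrate how a semantic taxonomy can help classify a candidate term, the sketch below walks up a word's hypernym chain in WordNet via NLTK. The word chosen is illustrative, and this is not the approach used in [42].

```python
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus has been downloaded

def hypernym_chain(word):
    """Return the hypernym chain of the first (most common) noun sense of `word`,
    which can suggest a coarse semantic category for a candidate term."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    chain, synset = [], synsets[0]
    while synset.hypernyms():
        synset = synset.hypernyms()[0]
        chain.append(synset.name())
    return chain

# 'transponder' should lead up through device-like categories, hinting that the
# term names a physical artifact rather than, say, an activity or a role.
print(hypernym_chain("transponder"))
```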

There has been significant research on ontology learning from text [15, 31], including applications to RE. Such work relies upon the identification of abstractions, but also seeks to identify where and how those abstractions are related. AbstRM [18] is an interesting example of closely related work that uses the KAOS meta-model [7] and builds upon AbstFinder to infer abstractions and inter-abstraction relationships from a set of requirements. AbstRM uses these to build an ontology that can be used to manage the requirements by helping infer traceability relationships.

While abstraction identification and ATR involve the application of a range of natural language processing techniques, recent work in requirements trace recovery [5, 20, 43] draws on the related field of information retrieval (IR). This work is motivated by common failures to manually record requirements trace information. It aims to infer trace relationships from lexical similarities between requirements statements, derived using vector space techniques such as TF–IDF, cosine similarity and latent semantic analysis. Lexical similarity is taken as an indicator of semantic similarity: the more lexically similar two requirements are, the more likely it is that a trace relationship exists between them. While not specifically about tracing, similar work has been applied to inferring semantic relationships in large volumes of requirements and customer requests [6]. This application of IR to RE has been shown to achieve excellent results, with c. 90% recall and c. 20–35% precision [5].
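The vector-space core of such trace-recovery work can be sketched in a few lines with scikit-learn. The requirements below are invented, and the sketch is not any of the cited tools; it simply shows TF–IDF weighting plus cosine similarity thresholding to propose candidate trace links.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented requirements for illustration.
requirements = [
    "The reader shall detect every passive RFID tag within two metres.",
    "Tag detection range for passive tags shall be at least two metres.",
    "The user interface shall display the battery level of the handheld reader.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(requirements)
similarities = cosine_similarity(vectors)

# Pairs whose lexical similarity exceeds a threshold become candidate trace links.
threshold = 0.2
for i in range(len(requirements)):
    for j in range(i + 1, len(requirements)):
        if similarities[i, j] > threshold:
            print(f"Candidate trace link: R{i + 1} <-> R{j + 1} "
                  f"(similarity {similarities[i, j]:.2f})")
```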

Although the work on trace recovery and on requirements semantic similarity inference is not concerned specifically with abstraction identification, implicitly, it does exploit abstractions within requirements statements since these typically make up a significant proportion of the sets of terms on which two requirements are matched.

8 Conclusions

In this paper, we have presented a new technique for the identification of single- and multiword abstractions which we call relevance-based abstraction identification (RAI). Abstraction identification is an essential step in automatic ontology learning or model generation. However, while these two applications are still in their infancy, the identification of abstractions, even as a simple ranked list of terms, has a useful purpose in many circumstances in RE: among them, when becoming familiar with new domains or with large requirements documents, when conducting structured inspections, when needing to quickly identify the most important concerns in an interview transcript, or even when sketching a list of objects that will need to be implemented in a first prototype.

Tools implementing efficient abstraction identification techniques have the potential to assist the requirements engineer in an otherwise painstaking and error-prone task. The effort needed, for example, to learn the key concepts of a new domain is the main motivation for the attention, albeit sporadic, that the RE community has paid to the problem over the last 20 years. Indeed, it is often assumed that domain experts are readily available—and while that may be so in major consulting firms, it is common experience in practice that meetings with domain experts end with a tall pile of standards, specifications, books and manuals on the desk of some unfortunate requirements engineer.

While the evidence of the work presented in this paper is that human judgement is still needed in abstraction identification, a push-button tool that provides at least some guidance by focusing effort on the terms that most probably signify the most relevant abstractions is a significant contribution. Naturally, in no way would such a tool substitute for what is ultimately needed: diligent work by competent requirements engineers. However, by reducing the effort needed in the initial steps of an analysis process, we expect better returns on the initial effort investment, and hence a quicker turnaround in iterative elicitation processes.

In our unsupervised evaluation, we used a pre-existing textbook to guarantee the absence of bias. While the evaluation was unsupervised, the mechanism we used does not ignore the need for expert human judgement: in fact, human judgement is crystallized in the input artifacts, in that the book's author had to use their judgement to generate the index. However, our approach does make that judgement easy to access. We encourage other researchers developing linguistic tools for RE to adopt a similar technique, at least in the early phases of evaluation. Any representative document will do, provided it has a suitable analytical index.

As described in the paper, there are weaknesses in our technique for deriving the lexical similarity between tool-derived terms and index terms, and we plan to invest some effort in trying to improve this. However, we must be careful not to make the criteria for identifying a match too weak, which would exaggerate tool efficacy. Also, many weights, coefficients and thresholds have been assigned without conducting a sensitivity analysis or a formal optimization process. We intend to investigate both issues in future work, but at the same time caution will be exercised to avoid optimizing the technique to the particular benchmark being used. Indeed, by not optimizing those values for better results on our benchmark, we have obtained a stronger indication that the technique is of more general applicability.

Another important issue is to investigate scenarios of interactive usage, where feedback from the requirements engineer (for example, by recording his or her choices of which abstractions among the top-scoring ones proposed by the tool are indeed significant) can be used to filter the remaining ones, or to boost the extraction of further abstractions. Indeed, as part of future work we intend to verify whether the various weights and thresholds can be adapted dynamically based on the particular user's behavior on a particular document.

RAI is part of the MaTREx project, which investigates the role of tacit knowledge in RE. Ontology construction is one strand of this research, and in addition to seeking to improve RAI's performance, we are investigating ways to infer relationships between abstractions in order to represent a conceptual domain model.

Footnotes

1. Available on request from the authors.

Acknowledgments

This work was funded by EPSRC grant EP/F069227/1 MaTREx.

Copyright information

© Springer-Verlag London Limited 2011