1 Introduction

Data curation refers to the processes and activities related to the integration, annotation, publication, and presentation of data throughout its lifecycle [8]. One category of data curation is data annotation, which aims at labelling raw data to generate value and increase productivity. Data annotation has been used extensively in various machine learning algorithms for information extraction, item classification, and record linkage [6, 38, 39]. However, in dynamic environments, e.g. Twitter (twitter.com/) and Facebook (facebook.com/), where data are continuously changing, pure algorithmic approaches do not scale to the needs of businesses that must annotate data over an extended period of time: algorithms make predictions based on historical data only, while in dynamic environments the distribution of data changes, and algorithms need to be updated to capture the changes, which is expensive and time-consuming.

In recent years, several pioneering solutions (e.g. [5, 23, 30, 33, 44, 51]) have been proposed to augment algorithms with rule-based techniques. Rules can alleviate many of the shortcomings inherent in pure algorithmic approaches. Rules can be written by non-technical analysts, which is less expensive than training algorithms [23]. Updating rules is faster than retraining algorithms, and rules can supplement algorithms in cases where they do not work well [44].

To keep a rule applicable and precise [5, 23, 33, 44, 51], an analyst needs to adapt the rule based on changes in the curation environment. Rule adaptation has proven to be painstakingly difficult, as the analyst needs to understand the context of the data and the impact of the modifications she applies to the rule [32]. In many cases, the analyst needs to apply different changes to identify the optimal one. This problem is exacerbated in dynamic curation environments, as adaptation is not a single rule modification task and the rule needs to be updated over its lifetime.

In this paper, we take the first step toward creating an adaptive rule adaptation model for dynamic and constantly changing environments. While previous approaches rely on analysts to identify the optimal modifications for rules, we propose a different learning task. We focus on incrementally adapting a rule based on changes in the curation environment, thereby offloading analysts and updating the rule autonomously. We do this by utilizing a Bayesian multi-armed-bandit algorithm, which learns the optimal modification by observing the rule's performance over time. In addition, previous systems adapt rules at the syntactic level, e.g. keywords and regular expressions. Syntactic-level adaptation limits a rule's ability to annotate data, as the rule skips a large number of semantically related items. Instead, we couple syntactic-level features with conceptual features to boost rules to annotate a larger number of items.

Overall, our solution is made up of the following stages: (1) Each time a rule annotates a set of items, we extract a set of candidate features (e.g. syntactic and conceptual features) as the potential modifications. (2) Then, a Bayesian multi-armed-bandit algorithm determines the optimal modification for the rule by estimating a probability distribution for candidate features. (3) Over time, by annotating more items, the algorithm learns the performance of candidate features better and modifies the rule to keep the rule applicable and precise.

Fig. 1 The overview of adapting rules through analysts and crowd workers

Example: rule adaptation through analyst Consider a government that intends to analyze citizens’ opinions regarding the quality of social services, e.g. health care, domestic violence, and aged care services. Social media is one of the sources that decision-makers may rely on to understand public satisfaction levels. Social media allows users to express their opinions regarding their communities through their posts, Tweets, or comments. The government may rely on learning algorithms to analyse and understand people’s opinions. However, learning algorithms need training data, which is not available in many cases. One solution is to craft rules to curate and extract the data to train or augment the learning algorithms. For example, an analyst can write the rule \(Rule_1\) to curate items relevant to ‘Health Care’.

$$\begin{aligned} Rule_1 = {{\varvec{IF}} \,tweet \,contains \,{(`health')} \,keyword\ {\varvec{THEN}} \,tag \,with \,`Health \,Care'}. \end{aligned}$$
(1)

However, rule \(Rule_1\) will tag a large number of irrelevant items, as not every item, e.g. Tweet, post, or comment, that contains the ‘Health’ keyword expresses an issue relevant to ‘Health Care’. Thus, the analyst may adapt rule \(Rule_1\) by adding new keywords to make the rule more precise. For example:

$$\begin{aligned} Rule^{'}_1 = {{\varvec{IF}} \,tweet \,contains \,(`Health') \,keyword\, AND\, tweet \,contains \,{(`Care')} \,keyword \,{\varvec{THEN}} \,tag \,with \,`Health \,Care'}. \end{aligned}$$

After adaptation, rule \(Rule^{'}_1\) curates items that contain both the ‘Health’ and ‘Care’ keywords. But the topic ‘Health Care’ contains a large number of topical subspaces, which rule \(Rule^{'}_1\) will ignore, as they may not contain both ‘Health’ and ‘Care’ keywords. Adapting a rule through an analyst is both time-consuming and challenging, either because the analyst does not really know what the perfect rule is, e.g. whether anything is missing, or simply because a certain type of analysis involves too many different conditions, which would require many complex rules. This problem is exacerbated because social media data are ‘ever-changing and never-ending’ [23], and social media users trickle in millions of pieces of new information every day. Thus, rule adaptation is not a one-time rule modification task, and the analyst needs to update a rule over time based on changes in the curation environment. Figure 1 shows a typical workflow of adapting rules through the analyst.

Contribution Building on the work proposed by Tabebordbar et al. [21], in this paper we propose a feature-based technique for enhancing rules to annotate a larger number of items. Our solution couples both conceptual and syntactic features for annotating data. It leverages a summarization technique to identify the semantic relationships among items and extract conceptual-level features.

The rest of this paper is organized as follows: we discuss related work in Sect. 2. Then, in Sect. 3, we explain the problem. We discuss our solution in Sect. 4. Next, we present the performance of our approach on three different curation domains: mental health, domestic violence, and budget. The experimental results show our approach can significantly improve the precision of rules in annotating data (by as much as 29% compared to the initial results). Finally, we discuss future work and conclude the paper in Sect. 7.

2 Related Work

In this section, we discuss prior work related to rule adaptation (Sect. 2.1) and online learning algorithms (Sect. 2.2). In particular, we discuss the usage of a Bayesian multi-armed-bandit algorithm in unstructured and constantly changing environments. We also discuss approaches to feature extraction to position our proposed feature summarization technique (Sect. 2.3).

2.1 Rule Adaptation

Rule adaptation is a continuous process focused on modifying a rule to better fit the curation environment. However, rule adaptation is a challenging and error-prone task. Thus, many solutions [23, 24, 30, 33, 48, 51] have been proposed to assist analysts in adapting rules. Several solutions [24, 30, 32, 33, 48] focused on interactively adapting rules. In these solutions, a system proposes possible adaptations for a rule, and an analyst adapts the rule through interacting with the system. For example, Milo et al. [33] proposed a cost-benefit approach for generalizing or specializing fraud detection rules. The approach uses a heuristic algorithm to interactively adapt rules with domain experts until the desired set of rules is obtained. Volks et al. [48] proposed a cost function for adapting integrity constraint (IC) rules. The approach relies on analyst feedback to update the cost function and resolve inconsistencies in IC rules. Liu et al. [30] proposed an interactive approach for refining a rule using a set of positive and negative results. The method uses a provenance graph to identify candidate changes that can eliminate negative results. However, these solutions focus on adapting rules that operate on structured data, where a rule may be adapted with a limited number of features. In addition, many of these solutions assume the analyst has access to a ground truth, e.g. a dataset of items tagged with the correct label, to verify the effectiveness of an adaptation.

Alternatively, to adapt rules in unstructured and dynamic environments, some solutions [23, 44, 51] focused on augmenting interactive rule adaptation systems by coupling crowds with analysts. These solutions rely on crowd workers to determine the precision of rules. For example, Xie et al. [51] proposed an approach for validating rules for information extraction purposes. The approach relies on a voting technique to identify whether an adaptation of a rule produces a positive impact on extracting information. GC et al. [23] designed an interactive system that couples analysts and crowds for adapting rules. The system verifies items annotated with a rule using crowd workers and assists analysts in identifying the optimal modification using a relevance feedback algorithm (Rocchio). Sun et al. [44] proposed a rule-based technique (Chimera) for large-scale data classification systems. The approach first identifies misclassified items in cooperation with crowd workers and then forwards the items to analysts, who write rules to address the errors. Bak et al. [5] rely on visualization, showing the result of applying a rule to a set of data records. The system asks crowd workers to verify the outcome of applying the rule to the data records, indicating the optimal adjustment for the rule.

Although coupling crowd workers with interactive systems provides more flexibility in adapting rules in dynamic environments, these systems still rely on analysts to identify the optimal modification of rules. In contrast, our approach not only offloads analysts but also autonomically modifies a rule in response to changes in the curation environment. Table 1 shows a sample list of techniques used for adapting rules and their domains.

Table 1 A sample list of different rule adaptation techniques and their domains

2.2 Multi-armed Bandit Algorithm

In this section, we discuss how a Bayesian multi-armed-bandit algorithm has been used in dynamic and constantly changing environments. This algorithm is increasingly used in large-scale randomized A/B experimentation by technology companies [28]. One area of work that uses a Bayesian multi-armed-bandit algorithm is educational learning, to improve learners’ learning rates. For example, Williams et al. [50] proposed a system (AXIS) to improve explanation generation for online learning materials by employing a combination of crowds and a Bayesian multi-armed-bandit algorithm. Clement et al. [18] used a multi-armed bandit algorithm in intelligent tutoring systems to choose activities that provide better learning for students. Other areas that rely on a Bayesian multi-armed-bandit algorithm are feature engineering [3], gaming [31], and online marketing [11]. We follow a similar trend by employing a Bayesian multi-armed-bandit algorithm with crowd workers. Over time, based on the collected feedback, the algorithm determines an adaptation for the rule to keep the rule applicable and precise.

2.3 Feature Extraction

In addition to interactive systems for helping analysts adapt rules, we consider it appropriate to include approaches to feature extraction to position our proposed summarization technique. Feature extraction is the process of identifying a set of variables that best describe the data [19]. Feature extraction is an ongoing task and requires iteratively exploring the curation environment to identify features that capture the salient aspects of the data. Several approaches have been proposed to aid analysts in feature extraction (e.g. [7, 9, 10, 12, 29, 46, 49]). For example, Anderson et al. [2] proposed BrainWash, a system that provides a pipeline to ease the process of feature extraction in large datasets. The system focuses on helping a user explore, extract, and evaluate features faster. Cheng et al. [13] relied on crowd workers for feature extraction. The approach refines the performance of machine learning algorithms based on the feedback received from crowds. Veeramachaneni et al. [46] proposed an approach that engages crowd workers in extracting features and predicting students’ stopout in massive open online course (MOOC) systems. The approach provides a pipeline for evaluating and examining the relevancy of features through crowd workers.

Another line of work focuses on facilitating feature extraction through visualization techniques. For example, Patel et al. [35] relied on visualizing the confused regions of machine learning classifiers to help analysts extract features. Brooks et al. [10] provide a visual summary of the data to aid a user in creating a dictionary of features. Stoffel et al. [43] relied on visualization for examining errors in machine learning features. The system iteratively interacts with a user to remove ineffective features.

In contrast, we propose a summarization technique that identifies the semantic relationships among keywords and extracts features at the conceptual level. Each conceptual feature represents a group of semantically related keywords, which boosts rules to annotate a larger number of items.

3 Preliminaries and Problem Statement

We first introduce the components of the rules used in this paper (Sect. 3.1). We then describe the problem in Sect. 3.2. Finally, we provide an overview of our solution in Sect. 3.3.

3.1 Preliminaries

Feature We express a rule R in terms of features, where each feature \(f \in R\) corresponds to a function of the form

$$\begin{aligned} \langle Dataset.Function.Operator \rangle \ \rightarrow Value \end{aligned}$$

where Dataset is the data source, such as Twitter or Facebook, Function performs the curation task (e.g. feature extraction), Operator represents the condition for a feature to curate the data, and Value is the output of the feature. Examples of features are extraction functions, e.g. named-entity or similarity extraction. Expressing a feature as a function allows us to leverage standard data types as the feature’s operators. For example, if a feature operates over textual data, its operators will include string operators, such as contains and exact. Similarly, if a feature curates integer data, its operators will include integer operators, such as equals and less-than. As an example, consider the feature \(f_1 =\ \langle Tweet.Keyword.Contains\)\((`Mental')\ \rangle\), which curates Tweets that contain the ‘Mental’ keyword. In this example, Tweet represents the dataset the feature operates on for curating the data, Keyword represents the function of the feature, and \(Contains\ ('Mental')\) is the operator and represents the condition for curating a Tweet.

Rule We represent a rule R as a tree of features, where each feature \(f \in R\) can have K children. We denote a path p in the tree as a sequence of features \(f_1,...,f_m\), where \(f_1\) represents the root feature and \(f_m\) represents the last feature in the path. More precisely, a path p is a conjunction of features of the form \(f_1\ \wedge \ ... \wedge f_m\). For an item to be curated by a rule, the item should be annotated with all features within a path. Notice that we do not need to invent our own rule language. Rather, the benefit of rules being expressed as features is that we can adopt any suitable functional or rule-expression language for our purpose.

Tag A tag is the label, e.g. ‘Mental Health’, that a rule assigns to a curated item, e.g. a Tweet, to describe the item. In this paper, we use the terms tag and annotate interchangeably. As an example, consider the rule presented in Figure 2. This rule is made up of three features \(\{f_1, f_2, f_3\}\) and tags a Tweet with ‘Mental Health’ if the Tweet is curated with features \(\ f_1\ \wedge \ f_2\) or \(\ f_1\ \wedge \ f_3\). More concretely, \(Rule_1\) tags a Tweet with ‘Mental Health’ if the Tweet contains the ‘Mental’ and ‘Health’ keywords, or the Tweet contains ‘Mental’ and a keyword related to the ‘Medical’ topic.
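To make this rule model concrete, the following is a minimal Python sketch of a rule as a tree of features, where each feature is modelled as a predicate and a tag is assigned when all features on at least one path match. The names `Feature`, `paths`, and `tag` are illustrative, not the authors’ implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Feature:
    """A feature <Dataset.Function.Operator> -> Value, modelled as a predicate."""
    name: str
    predicate: Callable[[str], bool]
    children: List["Feature"] = field(default_factory=list)

def paths(root: Feature) -> List[List[Feature]]:
    """Enumerate root-to-leaf paths; each path is a conjunction of features."""
    if not root.children:
        return [[root]]
    return [[root] + p for child in root.children for p in paths(child)]

def tag(rule_root: Feature, item: str, label: str) -> Optional[str]:
    """Tag the item if every feature on at least one path matches it."""
    for path in paths(rule_root):
        if all(f.predicate(item) for f in path):
            return label
    return None

# Rule_1 from Figure 2: tag 'Mental Health' if (f1 and f2) or (f1 and f3)
f2 = Feature("Tweet.Keyword.Contains('Health')", lambda t: "health" in t.lower())
f3 = Feature("Tweet.Topic.Contains('Medical')", lambda t: "doctor" in t.lower())
f1 = Feature("Tweet.Keyword.Contains('Mental')", lambda t: "mental" in t.lower(),
             children=[f2, f3])

print(tag(f1, "Mental health services matter", "Mental Health"))  # -> 'Mental Health'
```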

Fig. 2 The overview of the proposed approach for adapting rules

3.2 Problem Statement

In the following, we discuss two major problems in rule-based systems.

Adaptation through analyst Typically, to adapt a rule, an analyst examines the annotated items to identify the potential modifications that make the rule precise [23, 30, 32, 33]. However, rule adaptation is challenging and error-prone, as the analyst needs to evaluate the impact of each modification she applies to the rule. Such a problem falls under the category of online learning, where an analyst does not have access to the entire knowledge needed to craft an adequate rule. Instead, over time she learns to better adapt a rule through examining the items.

To offload analysts from adapting rules, we formulate the problem as a Bayesian multi-armed-bandit problem. The algorithm is suitable when the information required for making a decision arrives piece-by-piece in a serial fashion. Each time a rule annotates a set of items, the algorithm collects feedback on the number of items the rule correctly/incorrectly annotated. Then, by receiving more feedback, the algorithm learns a better adaptation for the rule over time.

For example, consider a rule R that operates over a dataset and must annotate data with a precision above a threshold \(\jmath\). Assume that at time \(\tau _i\) rule R annotated a set of items \(I_{\tau _i}\ = \{i_1, i_2, ..., i_n\}\). We denote by \(P[R_{\tau _i}]\) the precision of the rule observed at time \(\tau _i\). Our algorithm adapts rule R at time \(\tau _{i+1}\) such that \(P[R_{\tau _{i+1}}]\ >\ \jmath\).

Syntactic-level data annotation Typically, an analyst adapts a rule at the syntactic level, e.g. keywords and regular expressions. Syntactic-level features allow the analyst to modify a rule conveniently by replacing irrelevant keywords or phrases with new ones. However, relying on syntactic-level features limits the capacity of a rule in annotating data, as these features skip a large number of semantically related items. For example, consider the rule:

$$\begin{aligned}&Rule_{11}= Tweet.Keyword.Contains(`Mental') {\wedge } \,Tweet.Keyword.\\&\quad Contains(`Health') :\, `Mental\, Health' \end{aligned}$$

This rule tags a Tweet if the Tweet contains the ‘Mental’ and ‘Health’ keywords. However, there exists a large number of Tweets relevant to ‘Mental Health’ that cannot be tagged by \(Rule_{11}\), as those Tweets may not contain both the ‘Mental’ and ‘Health’ keywords.

3.3 Solution Overview

The overview of our proposed solution is shown in Figure 2. The approach consists of four steps: feature extraction, observation, estimation, and adaptation.

Feature extraction The initial step in the workflow is feature extraction, which extracts a set of candidate features \(T = \{t_1, t_2, ..., t_n\}\) from annotated items. The approach extracts candidate features at both the syntactic and conceptual levels. Each syntactic-level feature represents a keyword extracted from items annotated with a rule, while a conceptual feature represents a group of semantically related keywords. In Sect. 4.1, we detail how our approach extracts candidate features for adapting a rule.

Observation The second step in the workflow is observation, which gathers feedback to update a Bayesian multi-armed-bandit algorithm about changes in the curation environment. For gathering feedback, we rely on crowd workers. Each time a rule annotates a set of items \(I= \{i_1, i_2, i_3, ..., i_n\}\), the algorithm receives feedback on a sample of annotated items \(S =\{i^{\prime }_1, i^{\prime }_2, i^{\prime }_3, ..., i^{\prime }_n\}\), where \(S \subset I\), to identify the rule’s latest performance in annotating the data. Crowd workers verify whether the rule tagged each item correctly. In Sects. 4.2 and 5, we review how crowd workers contribute to verifying items.

Estimation The third step in the workflow is estimation, where a Bayesian multi-armed-bandit algorithm determines the performance of candidate features by estimating a probability distribution \(\theta\). The algorithm calculates the performance of features using the feedback collected from workers.

To formulate workers’ feedback as a Bayesian multi-armed-bandit problem, we propose a reward/demote schema. Each time the rule annotates a set of items, the schema calculates a reward/demote for candidate features to update the algorithm about changes in the curation environment. In Sect. 4.3, we review how the approach estimates the probability distribution for candidate features.

Adaptation Given a set of candidate features T along with their probability distribution \(\theta\), we identify potential modifications that keep a rule applicable and precise. We do this by removing or restricting features that deteriorate the rule’s performance. In Sect. 4.4, we review how our approach modifies a rule.

4 Adaptive Rule Adaptation

In this section, we explain the components (feature extraction, observation, estimation, and adaptation) of our proposed solution.

4.1 Feature Extraction

The first step in our workflow is feature extraction, where we extract a set of candidate features as the potential modifications for a rule. Each time a rule annotates items, we extract a set of candidate features to calculate their performance in adapting the rule. We extract two types of candidate features: syntactic and conceptual. A syntactic-level feature represents a keyword within an annotated item, while the latter represents a group of semantically related keywords. The following explains how we extract each type.

Syntactic candidate feature

For extracting syntactic candidate features, we conduct a preprocessing task on the annotated items I. The preprocessing performs tokenization, normalization, and noise removal. In tokenization, we split each item \(i \in I\) into smaller tokens. Normalization removes stop words and conducts stemming, and noise removal strips certain characters, e.g. emoji and URLs, that occur in items. We consider the remaining tokens as candidate features of type keyword.
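A minimal sketch of this preprocessing step might look as follows; the stop-word list is an illustrative subset, and a full implementation would also apply a stemmer (e.g. NLTK’s PorterStemmer), which is omitted here.

```python
import re

# Illustrative subset; a real pipeline would use a full stop-word list.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

def extract_syntactic_features(item: str) -> list[str]:
    """Tokenize, normalize, and strip noise from one annotated item."""
    text = re.sub(r"https?://\S+", " ", item)           # noise removal: URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # noise removal: emoji, punctuation
    tokens = text.lower().split()                       # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # normalization: stop words
    return tokens  # remaining tokens are keyword-type candidate features

print(extract_syntactic_features("Mental health services are underfunded https://t.co/x"))
# -> ['mental', 'health', 'services', 'underfunded']
```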

Conceptual candidate feature

Conceptual candidate features are proposed to alleviate shortcomings that exist in annotating data using syntactic features. Although syntactic features allow an analyst to modify a rule more conveniently, relying on them cannot capture the salient aspects of the data and limits the rules’ capacity in annotating items. Thus, there is a need for more productive features to boost rules to annotate a larger number of items. We propose a summarization technique, which extracts and groups semantically related keywords to form a concept. Summarization consists of two steps: (1) mapping and (2) grouping. In the mapping step [37, 42], we map each syntactic feature to an abstract concept using a knowledge base and associate a descriptor with it. In the grouping step, we group features with an identical descriptor and consider each group as a conceptual candidate feature. The following explains how our proposed technique extracts two new conceptual features using two readily available knowledge bases, WordNet [20] and Empath [22]; a code sketch follows the list below. Algorithm 1 shows the pseudo-code of the summarization technique.

1. WordNet WordNet is a semantic lexicon that groups English words into sets of synonyms called synsets. We use WordNet to identify semantic relations between keywords through their hypernym relations. A hypernym is a relationship between a generalized term and a specific instance of it. For example, based on the hypernym relations in WordNet, we can describe the keyword ‘doctor’ as a ‘\(medical\_practitioner\)’. Thus, in the mapping step, we map each keyword through its hypernym relation to a more generalized form, where the hypernym acts as the descriptor for the keyword. Next, in the grouping step, we group features with the same descriptor and consider each group as a conceptual candidate feature. For example, consider \(Rule_{11}^{\prime }\), which tags Tweets with mental health.

    $$\begin{aligned}&Rule_{11}^{\prime }= Tweet.Keyword.Contains(`Mental') {\wedge }\, Tweet.Topic.\\&\quad Contains(`Medical\_Practitioner') :\ `Mental \,Health' \end{aligned}$$

    This rule tags a Tweet if the Tweet contains the ‘Mental’ keyword and a keyword relevant to ‘\(Medical\_Practitioner\)’. The topic ‘\(Medical\_Practitioner\)’ represents a large number of semantically related keywords, including doctor, physician, and dentist.

2. Empath The second knowledge base we rely on for extracting conceptual features is Empath [22]. Empath is a deep learning skip-gram network that categorizes text over 200 built-in categories. It represents a token as a vector using a vector space model (VSM) [36] and assigns tokens to categories based on their vector similarity. To extract conceptual candidate features, we query the Empath vector space model to map each keyword to a category. We use categories to represent keywords as abstract concepts. Then, we group keywords with the same category and consider each group as a conceptual candidate feature. For example, consider the keywords \(T\ = \{t_1:fund,\ t_2:illness,\ t_3:budget,\ t_4:disease\}\). To generate conceptual features, we query the Empath vector space model to map each keyword to a category. Assume the following categories are identified: \(T\ = \{t_1:Economy,\ t_2:Health,\ t_3:Economy,\ t_4:Health\}\). Then, we group keywords with identical categories and represent the \(\{fund,\ budget\}\) keywords as the Economy topic and the \(\{disease,\, illness\}\) keywords as the Health topic.
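As a concrete illustration of the mapping and grouping steps, here is a hedged sketch of the WordNet variant using NLTK’s WordNet interface. It takes each keyword’s first noun sense, which a production system would replace with proper word-sense disambiguation, and the exact groupings depend on the WordNet version.

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def conceptual_features(keywords: list[str]) -> dict[str, list[str]]:
    """Mapping: keyword -> hypernym descriptor. Grouping: descriptor -> keywords."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for kw in keywords:
        synsets = wn.synsets(kw, pos=wn.NOUN)
        if not synsets:
            continue  # no mapping found; the keyword stays a syntactic feature
        hypernyms = synsets[0].hypernyms()  # first sense only, for illustration
        if hypernyms:
            descriptor = hypernyms[0].lemmas()[0].name()
            groups[descriptor].append(kw)
    # Each group of two or more keywords becomes one conceptual candidate feature.
    return {d: kws for d, kws in groups.items() if len(kws) > 1}

# e.g. {'medical_practitioner': ['doctor', 'dentist', 'physician']}
print(conceptual_features(["doctor", "dentist", "physician", "budget"]))
```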

4.2 Observation

The second step in our proposed approach is observation, which gathers feedback to update a Bayesian multi-armed-bandit algorithm about changes in the curation environment. For gathering feedback, we rely on crowd workers. Each time a rule annotates a set of items, we take a sample of the items \(S =\{i^{\prime }_1, i^{\prime }_2, i^{\prime }_3, ..., i^{\prime }_n\}\), where \(S \subset I\), to send to the crowd. The crowd workers verify whether an item is correctly tagged by the rule, e.g. if a rule tags an item with ‘Mental Health’, the task is to confirm whether the item is relevant to ‘Mental Health’ or not. For taking samples, we divide annotated items into subgroups [25] and represent each subgroup by a candidate feature; the population of each subgroup is determined by the frequency of the candidate feature in the annotated items.

More concretely, consider the candidate features \(T\ = \{t_1:fund,\ t_2:illness,\ t_3:budget,\ t_4:economy\}\) extracted from items annotated with a rule. The approach divides the annotated items into four subgroups, where feature \(t_1\) represents items containing fund, feature \(t_2\) represents items containing illness, and so forth.

Our sampling strategy helps a Bayesian multi-armed-bandit algorithm (see Sect. 4.3) to better learn the performance of features in adapting a rule. If we used a more obvious technique, such as random sampling, the algorithm would consider all items equally likely and would therefore take longer to learn the performance of candidate features.
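The sampling step could be sketched as follows, assuming each item is assigned to the subgroup of the first candidate feature it contains; the 3% sampling rate matches the share of annotated items sent to workers in Sect. 5, while the assignment rule and function name are our simplifications.

```python
import random
from collections import defaultdict

def stratified_sample(items: list[str], features: list[str],
                      rate: float = 0.03) -> list[str]:
    """Divide annotated items into per-feature subgroups, then sample each
    subgroup in proportion to its population (feature frequency)."""
    subgroups: defaultdict[str, list[str]] = defaultdict(list)
    for item in items:
        for f in features:
            if f in item.lower():
                subgroups[f].append(item)  # subgroup represented by feature f
                break                      # simplification: one subgroup per item
    sample: list[str] = []
    for group in subgroups.values():
        k = max(1, round(len(group) * rate))  # larger subgroups contribute more
        sample.extend(random.sample(group, min(k, len(group))))
    return sample
```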

4.3 Estimation

This step computes a probability distribution \(\theta\) for candidate features to determine their performance in adapting the rule. It consists of two components: (i) a reward/demote schema, which calculates a reward/demote for candidate features using workers’ feedback, and (ii) a Bayesian multi-armed-bandit algorithm, which estimates the performance of candidate features based on their collected rewards/demotes.

Reward/demote schema To adapt rules, we formulated rule adaptation as a Bayesian multi-armed-bandit problem. This algorithm is suitable when a system needs to improve its decisions over time. Based on the feedback collected from the curation environment, the algorithm learns increasingly consistent patterns of change and makes decisions that maximize its performance. A Bayesian multi-armed-bandit algorithm is a good fit for our problem because each time a rule annotates a set of items, it is updated by the workers’ feedback.

To frame rule adaptation as a Bayesian multi-armed-bandit problem, we propose a reward and demote schema using the feedback collected from workers. The schema assigns a reward/demote to the candidate features \(t \in T\) that appear in annotated items. It assigns a reward r to a candidate feature if the feature appears in an item verified as relevant. Similarly, it assigns a demote d to a candidate feature if the feature appears in an irrelevant item. Over time, as a rule annotates more items, the schema updates the candidate features’ rewards/demotes, allowing a Bayesian multi-armed-bandit algorithm to update its estimation of the features’ performance in adapting the rule.

As each conceptual candidate feature represents a group of keywords, we calculate the reward/demote for these features based on the rewards/demotes collected by their associated keywords. More precisely, consider a conceptual candidate feature t. Suppose \(t = \{t_1^{\prime }, t_2^{\prime }, ..., t_n^{\prime }\}\), where \(t^{\prime }\) represents a keyword associated with t. We calculate the reward as \(r_{t}=\ \sum _{t^{\prime }=1}^{n} r_{t^{\prime }}\) and the demote as \(d_{t}=\ \sum _{t^{\prime }=1}^{n} d_{t^{\prime }}\). For example, consider the candidate feature ‘\(Medical\_Practitioner\)’ introduced in the previous section, and suppose the following keywords are associated with it: doctor, dentist, and physician. We calculate the reward/demote for ‘\(Medical\_Practitioner\)’ by summing the rewards/demotes collected by doctor, dentist, and physician.
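A minimal sketch of this reward/demote bookkeeping, assuming binary relevance feedback per item; the counter names are ours.

```python
from collections import Counter

rewards: Counter = Counter()  # r_t per keyword
demotes: Counter = Counter()  # d_t per keyword

def record_feedback(item_features: list[str], relevant: bool) -> None:
    """Reward every candidate feature in a relevant item; demote otherwise."""
    for f in item_features:
        (rewards if relevant else demotes)[f] += 1

def concept_totals(concept_keywords: list[str]) -> tuple[int, int]:
    """A conceptual feature's reward/demote is the sum over its keywords."""
    return (sum(rewards[k] for k in concept_keywords),
            sum(demotes[k] for k in concept_keywords))

record_feedback(["doctor"], relevant=True)
record_feedback(["dentist"], relevant=False)
print(concept_totals(["doctor", "dentist", "physician"]))  # -> (1, 1)
```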


Bayesian multi-armed-bandit algorithm This section explains how a Bayesian multi-armed-bandit algorithm estimates the performance of candidate features. First, we explain the algorithm, and then we discuss how it operates in our settings.

We utilized Thompson sampling [41] for the Bernoulli bandit problem, i.e. when rewards are either 0 or 1. Thompson sampling is a Bayesian multi-armed-bandit algorithm that provides a dynamic policy for choosing which feature should be selected for adapting a rule, together with a procedure for incorporating new information to update this policy based on the candidate features’ rewards/demotes. Thompson sampling stores an estimated probability distribution \(\theta\) for each candidate feature to indicate its performance in adapting the rule. Each time the algorithm receives a set of candidate features \(T=\{t_1, t_2, ..., t_n\}\) along with their rewards/demotes, it updates the candidate features’ probability distributions \(\theta = \{\theta _1, \theta _2, ..., \theta _n\}\), where \(0< \theta < 1\), using the Bayesian formula:

$$\begin{aligned} P(\theta \ |\ t)\ =\ \frac{P(t\ |\ \theta ) \times P(\theta )}{P(t)} \propto P(t\ |\ \theta ) \times P(\theta ) \end{aligned}$$

\(P(t\ |\ \theta )\) represents the likelihood and \(P(\theta )\) is the prior. The likelihood is a Bernoulli distribution, and the prior is a Beta distribution.

It has been proven that the Beta distribution is a good choice of prior for Bernoulli rewards [1]. Beta distributions form a family of continuous probability distributions on the interval (0, 1). The pdf of \(Beta(\alpha , \beta )\), the Beta distribution with parameters \(\alpha > 0\) and \(\beta > 0\), is given by

$$\begin{aligned} f(X;\alpha ,\beta )\ =\ \frac{\varGamma (\alpha \ +\ \beta )}{\varGamma (\alpha )\varGamma (\beta )} X^{\alpha \ -\ 1}(1\ -\ X)^{\beta \ -\ 1} \end{aligned}$$

Beta distribution is useful for Bernoulli rewards because if the prior is a \(Beta(\alpha , \beta )\) distribution, then after observing a Bernoulli trial, the posterior distribution is simply \(Beta(\alpha +\ 1, \beta )\) or \(Beta(\alpha , \beta \ +\ 1)\), depending on whether the trial resulted in a success or failure, respectively.

The Thompson sampling algorithm initially assumes each feature t has prior Beta(1, 1) on \(\theta _t\); Beta(1, 1) is the uniform distribution on (0, 1). Each time a feature appears in the feedback and receives a reward \(r_t\) or a demote \(d_t\), the algorithm updates the distribution on \(\theta _t\) as \(Beta(r_t + 1, d_t + 1)\). The algorithm then samples from these posterior distributions of the \(\theta _t\)’s and selects the feature whose sampled value is the largest.

To formulate feature feedback as a Bernoulli bandit problem, we treat the total observations of a feature as Bernoulli trials. The number of successes is the number of positive feedback items, and the number of failures is the total number of feedback items minus the successes. The update to \(Beta(\alpha , \beta )\) is simply Beta(\(\alpha +\) number of new successes, \(\beta \ +\) number of new failures).
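The posterior update and selection step can be sketched in a few lines. The function below samples \(\theta \sim Beta(r+1, d+1)\) for each candidate feature, under the Beta(1, 1) prior described above, and returns the top-k features; the reward/demote counts at the bottom are invented for illustration.

```python
import numpy as np

def thompson_select(rewards: dict[str, int], demotes: dict[str, int],
                    features: list[str], k: int = 1) -> list[str]:
    """Sample theta ~ Beta(r + 1, d + 1) for each feature (prior Beta(1, 1))
    and return the k features with the largest sampled values."""
    samples = {f: np.random.beta(rewards.get(f, 0) + 1, demotes.get(f, 0) + 1)
               for f in features}
    return sorted(samples, key=samples.get, reverse=True)[:k]

features = ["medical", "health", "wellbeing", "care", "qanda"]
rewards = {"medical": 40, "health": 35, "wellbeing": 12, "care": 9, "qanda": 1}
demotes = {"medical": 5, "health": 9, "wellbeing": 6, "care": 12, "qanda": 20}
print(thompson_select(rewards, demotes, features, k=2))  # likely ['medical', 'health']
```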

Algorithm 2 shows how the approach estimates the value of \(\theta\) for candidate features. As an example, consider the following rule, which tags Tweets with ‘Mental Health’:

$$\begin{aligned} Rule_1 = Tweet.Keyword.Contains(`Mental') : \,`Mental \,Health' \end{aligned}$$

Assume the following candidate features \(T = \{t_1:\ medical, t_2:\ health, t_3:\ wellbeing, t_4:\ care, t_5:\ qanda\}\) are extracted from annotated items as the potential modifications for the rule. First, to identify the performance of the candidate features, the algorithm calculates their rewards/demotes using workers’ feedback. Then, a Bayesian multi-armed-bandit algorithm estimates a probability distribution \(\theta\) for the candidate features. Each time the rule annotates a set of items, the algorithm updates the value of \(\theta\) based on the feedback gathered from workers to better understand the features’ performance in adapting rules.

Fig. 3 Adapting a rule by replacing/restricting its features

4.3.1 Discussion on Thompson Sampling

We selected Thompson sampling as it has shown a near-optimal regret bound compared to other alternatives. Regret is the difference between the best performance obtainable from an option and the performance of the option chosen by the algorithm at time \(\tau _i\) [1]. Other variants of the multi-armed bandit algorithm, such as \(\epsilon\)-Greedy and Upper Confidence Bound (UCB), could also be used for adapting rules. However, \(\epsilon\)-Greedy may get stuck in local optima by exploiting a sub-optimal option. This algorithm greedily explores the environment for a period and then exploits the best-identified option \(1 - \epsilon\) of the time, exploring alternatives \(\epsilon\) of the time. For example, if we set \(\epsilon =\ 0.05\), the algorithm will exploit the best option \(95\%\) of the time and explore random alternatives \(5\%\) of the time. UCB has been proposed to reduce the randomness in \(\epsilon\)-Greedy. The algorithm is optimistic about options with high uncertainty: it chooses such an option and observes its reward, becoming less uncertain about the option’s performance, and continues this behaviour until the uncertainty about the option drops below a threshold.

4.4 Adaptation

In this section, we explain how our approach modifies a rule. Recall from Sect. 3.1 that we introduced a rule R as a tree of features, where each feature \(f \in R\) can have K children. We also defined a path p in a rule as a conjunction of features of the form \(p\ = f_1 \wedge \ f_2\wedge \ ...\ \wedge \ f_n\). To adapt a rule, we first identify imprecise paths, i.e. paths that annotate data with a precision below a threshold \(\jmath\). The threshold represents the minimum precision a path should have to be considered precise. We determine the precision of paths by calculating the number of relevant/irrelevant items their features annotated. After identifying imprecise paths, we determine whether to replace or further restrict their features.

We replace a feature in an imprecise path if the number of items it annotated is below the average number of items annotated by its siblings, indicating the feature is imprecise and incapable of adequately annotating data. Conversely, we restrict a feature if the number of annotated items is greater than or equal to the average, indicating the feature is applicable but should be restricted to be precise. For replacing or restricting features, we select the candidate features that yielded the highest probability distribution \(\theta\) estimated by the Bayesian multi-armed-bandit algorithm.
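The replace-versus-restrict decision can be sketched as below, assuming per-feature annotation counts and path precisions have already been computed; the 0.75 threshold mirrors the value used in the experiments (Sect. 6.3.1), and the function name is ours.

```python
def plan_adaptation(annotation_counts: dict[str, int],
                    path_precision: dict[str, float],
                    threshold: float = 0.75) -> dict[str, str]:
    """For each feature on an imprecise path, decide replace vs restrict by
    comparing its annotation count with the average over its siblings."""
    avg = sum(annotation_counts.values()) / len(annotation_counts)
    plan: dict[str, str] = {}
    for feature, precision in path_precision.items():
        if precision >= threshold:
            continue  # the path is precise; leave its feature alone
        plan[feature] = ("replace" if annotation_counts[feature] < avg
                         else "restrict")
    return plan

counts = {"f2": 120, "f3": 900}        # items annotated via each sibling feature
precision = {"f2": 0.41, "f3": 0.62}   # both paths below the threshold j = 0.75
print(plan_adaptation(counts, precision))  # -> {'f2': 'replace', 'f3': 'restrict'}
```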

Example Suppose that, after annotating a set of items at time \(\tau _i\), the algorithm identifies that the rule is imprecise. It then examines the number of annotated items and adapts the rule by appending the K candidate features that yielded the highest probability distribution (restriction) (Figure 3b). After adaptation, \(Rule_1\) annotates an item if the item is curated with the features in paths \(p_1 = \ f_1\ \wedge \ f_2\), or \(p_2 = \ f_1\ \wedge \ f_3\), or \(p_3 = \ f_1\ \wedge \ f_4\). Alternatively, the algorithm may replace a feature if it identifies that the feature annotates fewer items than the average number of items annotated by its siblings. For example, suppose that at time \(\tau _{i+n}\) feature \(f_2\) is identified as imprecise and incapable of annotating data adequately. The algorithm removes feature \(f_2\) and replaces it with the candidate feature that yielded the highest probability distribution value (Figure 3c). To select a candidate feature, the algorithm performs a feature extraction task and estimates the candidate features’ probability distribution \(\theta\) based on the rewards/demotes the features accumulated from time \(\tau _1\) to \(\tau _{i+n}\).

The proposed adaptation strategy allows rules to be adapted according to changes in the curation environment. For example, by replacing an imprecise feature with a content-bearing feature that obtained a high value of \(\theta\) over an extended period of time, we keep the rule applicable, as the new feature better captures the salient aspects of the data. Similarly, by restricting an imprecise feature that annotates a large number of items, we make the rule precise by filtering out the irrelevant items.

Fig. 4 Sample of questions to workers to verify the tag of items

5 Gathering Workers Feedback

This section explains how we engage workers in verifying items annotated with rules. We created a task on the Figure Eight micro-tasking market. The workers’ task was to confirm whether an item is relevant to the tag assigned by a rule. Workers could choose ‘Yes’ if they identified the item as related to the tag, and ‘No’ if they determined the item was irrelevant. In cases where workers could not verify an item, they could choose ‘I don’t know’. For example, we present to workers a Tweet that a rule tagged as relevant to ’Mental Health’; the workers’ task is then to verify whether the Tweet is related to ’Mental Health’ or not. In addition, we provided workers with textual instructions explaining how to confirm items. We explained the steps workers need to follow and provided them with three positive and three negative examples. We paid 1 cent for verifying each item, and each worker verified ten items per page. At each round of the annotation task, we sent 3% of the annotated items to workers. Figure 4 shows a sample question to workers.

5.1 Stopping Condition

In the previous section, we explained how workers verify annotated items. However, continuously sending items to crowds increases the cost of the adaptation task. Thus, there is a need to identify when a rule has stabilized, to stop verifying more items. To address this problem, we developed a solution using the probabilistic policy defined in the Thompson sampling algorithm to determine whether a path in a rule has stabilized. For each path, we estimate a probability distribution \(\theta\) based on the number of relevant/irrelevant items annotated. Then, we define a smoothing window Q to record the value of \(\theta\). We set the size of the smoothing window to \(Q\ =\ 3\) and use the average as the smoothing function. We consider a path stabilized if the value of Q increases or remains stable within \(3\epsilon\), where \(\epsilon \ =\ 0.01\). More concretely, consider path \(p_3\ =\ f_1\ \wedge \ f_4\) presented in Figure 3. Each time the rule annotates a set of items, the algorithm records the value of \(\theta\) for the path. The approach then computes the value of Q, where \(Q_1\ =AVG (\theta _{1}, \theta _{2}, \theta _{3})\), \(Q_2\ =AVG (\theta _2, \theta _3, \theta _4)\), and so forth. The algorithm stops sending items to workers when \(Q_{i+1} + 3\epsilon \ \ge \ Q_i\), indicating the path has stabilized.
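A sketch of this stopping check, assuming the per-round \(\theta\) values of a path are kept in a list; the window size and \(\epsilon\) follow the values above, and the function name is ours.

```python
def is_stabilized(thetas: list[float], q: int = 3, eps: float = 0.01) -> bool:
    """Smooth the recorded theta values with an averaging window of size q and
    report stabilization when the latest window has not dropped beyond 3*eps."""
    if len(thetas) < q + 1:
        return False  # not enough rounds observed yet
    windows = [sum(thetas[i:i + q]) / q for i in range(len(thetas) - q + 1)]
    return windows[-1] + 3 * eps >= windows[-2]

# Q1 = AVG(0.71, 0.74, 0.78), Q2 = AVG(0.74, 0.78, 0.79); Q2 + 3*eps >= Q1 -> stop
print(is_stabilized([0.71, 0.74, 0.78, 0.79]))  # -> True
```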

6 Experiments

First, we discuss the dataset used for examining the performance of our proposed approach (Sect. 6.1). Then, in Sect. 6.2, we explain the three scenarios defined to show the applicability of our approach. Finally, we discuss the results in Sect. 6.3.

6.1 Experiment Settings and Dataset

The core components of the techniques described in the previous sections are implemented in Python. Three months of Twitter data (Australian region) were used as the input dataset (from May 2017 to August 2017), with \(\approx \ 15\) million Tweets. MongoDB and ElasticSearch were used for storing and indexing the input dataset. We demonstrate the performance of our approach in three different curation domains (domestic violence, mental health, and budget) and show how our approach learns to adapt a rule to annotate data more precisely over time. As the initial rules for annotating the data, we used rules that contain only one feature. For example, the initial rule for annotating Tweets in the mental health domain was of the form \(Tweet.keyword.contains(`Mental')\)\(: `Mental \,Health'\), which tags Tweets that contain the ‘Mental’ keyword. Then, at each timestep in which rules annotate a set of items, our approach adapts the rules to make them more precise. We demonstrate the performance of the approach over five rounds of rule adaptation.

Fig. 5 The performance of a Bayesian multi-armed-bandit algorithm in adapting rules. As presented, the algorithm could improve rules’ precision in all domains

Fig. 6 Comparison between the number of items annotated using conceptual- and syntactic-level features (budget domain)

Fig. 7 Comparison between the number of items annotated using conceptual- and syntactic-level features (mental health domain)

Fig. 8 Comparison between the number of items annotated using conceptual- and syntactic-level features (domestic violence domain)

6.2 Experiment Scenarios

To evaluate the performance of our solution and the applicability of the proposed algorithm, we have defined three different experiment scenarios:

1. Evaluating the performance of a Bayesian multi-armed-bandit algorithm in adaptation This scenario presents the performance of a Bayesian multi-armed-bandit algorithm in adapting rules. We demonstrate how the algorithm keeps a rule precise and applicable by adding or removing features. We adapt rules with two different numbers of features (\(K=\ 10,\ K=\ 20\)) (see Sect. 3.1). Adapting a rule with a higher number of features allows the rule to annotate a larger number of items, but with lower precision.

2. Evaluating the proposed feature-based adaptation This scenario demonstrates the performance of the proposed feature-based technique in augmenting rules to annotate a larger number of items. We demonstrate the improvement rules make in the number of annotated items when adapting rules using both syntactic- and conceptual-level features. We also compare the obtained results with a technique that adapts rules at the syntactic level only.

3. Comparison with existing studies In the third scenario, we conducted a controlled experiment and compared the performance of our approach with the system proposed by GC et al. [23]. The proposed system is an interactive rule adaptation system that relies on analysts for adapting rules. Each time a rule annotates a set of items, the system sends a sample of items to crowds and receives feedback on the number of items correctly/incorrectly tagged by the rule. Then, the system tokenizes items and weights every token using the TF-IDF weighting scheme. Subsequently, the system ranks tokens based on their TF-IDF weights and iteratively shows tokens to an analyst to adapt a rule. The system continues showing tokens until the analyst is satisfied with the resulting rule. To help the analyst adapt the rule more effectively, the system incorporates the analyst’s feedback by adjusting the weight of tokens using a relevance feedback algorithm [40]. Whenever the analyst selects a token, the algorithm increases the weight of other candidate tokens that co-occur with the selected token.

Table 2 Precision of the approach in adapting rules using the summarization approach (\(K\ =\ 20\))
Table 3 Precision of rules adapted through participants in the budget, mental health, and domestic violence domains

6.3 Results

6.3.1 Performance of a Bayesian Multi-armed-bandit Algorithm in Adapting Rules

In this section, we demonstrate the performance of a Bayesian multi-armed-bandit algorithm in adapting rules (see Sect. 4.3). We show the precision of the rules adapted with two different numbers of candidate features (\(K\ =\ 10, K\ =\ 20\)) that yielded the highest probability distribution \(\theta\). As presented in Figure 5, by adapting rules with 10 candidate features the algorithm could significantly improve rules’ precision in all curation domains. For example, in the budget domain, the algorithm improved the precision by \(36.65\%\), from \(54.56\%\) to \(91.21\%\). Similarly, in the domestic violence and mental health domains, the algorithm improved the precision by \(18.20\%\) and \(32.47\%\), respectively. Also, to demonstrate the applicability of the algorithm in adapting rules, we repeated the experiment with a higher number of features (\(K\ =\ 20\)). This boosts rules to annotate a larger number of items, but with less precise features. Figure 5 shows the obtained results for each domain. As presented, adapting rules with a higher number of features decreases the precision of rules; however, the algorithm could learn the performance of features and adapt the rule to improve its precision over time. For example, in the mental health domain the precision improved by \(30.81\%\), and in the budget and domestic violence domains the precision improved by \(33.22\%\) and \(16.36\%\), respectively. In this experiment, we considered features that annotate data with a precision below 75% (\(\jmath <\) 75%) as imprecise.

Discussion on rules’ performance As presented in Figure 5, the initial rules added to the curation system were imprecise and annotated a large number of irrelevant items. For example, the initial precision of rules in the budget and mental health domains was below \(55\%\). However, after collecting a set of feedback, the algorithm identified the need to restrict the rules by adding a new set of features. Although restricting the rules improved their precision, it limited the rules to annotating only those items that contain the features selected by the algorithm during adaptation. As presented in Figures 6, 7, and 8, after adaptation the rules annotate fewer items compared to their initial states. For example, in the budget domain the number of annotated items dropped by 15,240 after two rounds of adaptation. We can see similar trends in the other curation domains as well. However, the promising fact is that a Bayesian multi-armed-bandit algorithm can learn a better adaptation for rules by incrementally collecting more feedback over time. This can be seen in Figure 5, where the algorithm dramatically improved rule precision. For example, in the budget domain the difference in precision between the adaptations that occurred at \(\tau _2\) and \(\tau _5\) is over \(10\%\); for the mental health domain, this difference is over \(20\%\). Based on the obtained results, we conclude that a Bayesian multi-armed-bandit algorithm learns a better adaptation for rules by collecting more feedback over time, and that adapting rules with more robust features can improve both precision and recall. This can be confirmed by comparing the precision and the number of annotated items between Figure 5 and Figures 6, 7, and 8. As presented, by adapting rules with a higher number of features (\(K\ =\ 20\)), the algorithm could annotate a larger number of items while maintaining rules’ precision. In the next section, we discuss how feature-based adaptation augments the performance of rules to annotate a larger number of items.

Fig. 9 Number of items annotated with rules adapted through participants after five rounds of annotations in three different curation domains: budget, mental health, and domestic violence

6.3.2 Feature-Based Adaptation

As discussed in Sect. 6.3.1, adaptation limits the ability of rules to annotate items. To alleviate this problem, we argued that adapting rules with a higher number of features could boost rules to annotate a larger number of items. However, increasing the number of features correlates negatively with precision (as the number of features used in adaptation increases, the precision of rules drops). Thus, to diminish the impact of an adaptation and maintain the performance of a rule in annotating items, we proposed feature-based adaptation. In feature-based adaptation, we hypothesize that adapting a rule with a group of semantically related features has a similar impact on the rule’s precision as adapting it with a single feature. In this section, we study the impact of feature-based adaptation on rules’ performance. The goal is to determine whether adapting a rule with a group of related features can enhance the rule’s ability to annotate a larger number of items while maintaining its precision. To test our hypothesis, we conducted two sets of experiments. First, we discuss the precision of rules adapted through our approach. Then, we compare the number of annotated items with rules adapted using syntactic-level features.

Table 2 shows the precision of rules adapted using the feature-based technique. The obtained results confirm that feature-based adaptation can dramatically increase the performance of a rule in annotating items. At the same time, the Bayesian multi-armed-bandit algorithm could learn the performance of features and improve rules’ precision over time. This improvement is \(10.11\%\) for the domestic violence domain, and \(25.87\%\) and \(28.47\%\) for the mental health and budget domains, respectively. Although the learning rate of the algorithm using the feature-based approach is slower than with syntactic-level features, the algorithm could still improve rules’ precision in all domains. In addition, Figures 6, 7, and 8 compare the number of items annotated with rules adapted using the syntactic and feature-based approaches. As presented, feature-based adaptation could boost rules to annotate a larger number of items. For example, in the domestic violence domain, the rule could annotate over 12,000 items. In the mental health and budget domains, rules could annotate 13,574 and 8304 items, respectively. These numbers are much higher than when adapting rules using syntactic-level features. For example, in the budget domain, the rule (\(K=10\)) could annotate only 2137 items. Annotating data using syntactic-level features in the mental health and domestic violence domains shows a similar trend, with rules annotating only 5198 and 4127 items, respectively.

Discussion on feature-based adaptation An advantage of feature-based adaptation is that it allows users to better investigate their information needs when seeking topics that contain a large number of topical subspaces. Suppose a user intends to curate data relevant to ‘mental health’. There exists a large number of keywords, e.g. health, disorder, service, that are relevant to mental health but may not receive enough feedback to be considered for adapting the rule. Using feature-based adaptation, we group all keywords that are associated with a topic; thus, the rule can easily curate a varied and comprehensive list of items relevant to the user’s information need.

6.3.3 Comparison with Existing Studies

In this section, we compare the performance of our approach with the state-of-the-art technique on rule adaptation. We implemented the system proposed by GC et al. [23] (see Sect. 6.2) and conducted a controlled experiment. We asked three Ph.D. students in a laboratory, who were familiar with the concepts of learning algorithms, e.g. true positive rate and false positive rate, to participate in the experiment. We explained to them how the system works and how they can use it to adapt rules, and we allowed them to work with the system to gain the understanding required for adapting rules. To better compare the performance of our approach with the interactive system, we asked participants to adapt rules in all domains. Then, in each curation domain, we selected the rule with the highest obtained precision and compared it with the rules adapted by our approach. In this experiment, we asked participants to adapt rules with 20 features (\(K=20\)). Table 3 shows the results. As presented, our approach has comparable performance to interactive systems. For example, in the budget domain, participants could adapt the rule to \(90.86\%\) precision, which is \(3.08\%\) higher than our proposed approach. In the domestic violence and mental health domains, participants adapted rules to precisions of \(92.59\%\) and \(86.32\%\), respectively. In addition, Figure 9 shows the number of items annotated with the rules adapted by participants; the figure shows the most precise rules in each domain. Although our approach and the participants showed similar performance when using syntactic-level features for adapting rules, with the proposed feature-based technique our approach could annotate a significantly larger number of items. For example, the number of annotated items in the budget domain is higher by 3233 items; the differences in the mental health and domestic violence domains are 5320 and 4305 items, respectively. The overall cost we paid for verifying items was $35.10 in the mental health domain, $29.92 in the budget domain, and $21.22 in the domestic violence domain.

Comparing the precision and the number of items annotated by our approach and by the participants, we believe that our adaptive approach outperforms current rule adaptation techniques. In particular, considering the prohibitive cost of analysts for adapting rules, our proposed approach can help companies and data enthusiasts that need to annotate data in unstructured and constantly changing environments on a limited budget.

7 Conclusion and Future Works

In this paper, we proposed an approach for adapting data annotation rules in unstructured and changing environments. Our approach offloads analysts from adapting rules and autonomically modifies rules based on changes in the curation environment. We utilize a Bayesian multi-armed-bandit algorithm, an online learning algorithm that learns the optimal modification for rules using the feedback gathered from the curation environment. In addition, our approach adapts rules at the conceptual level, which boosts rules to annotate a larger number of items compared to current methods that rely on syntactic similarity, e.g. keywords and regular expressions, for adapting rules. We evaluated the performance of our approach on three months of Twitter data in three different curation domains: domestic violence, mental health, and budget. The evaluation results showed our approach has comparable performance to systems relying on analysts for adapting rules.

There are several exciting directions for future work. In this paper, we introduced a summarization technique, which boosts rules to annotate data at the conceptual level. As part of future work, we plan to identify more features for adapting rules. Specifically, we will focus on adapting rules with three other types of features: entities, word2vec embeddings, and relations. We believe adapting rules with different kinds of conceptual features will not only enhance the performance of rules in annotating a larger number of items but also allow rules to better capture the salient aspects of the data.