In this section, we present the proposed autoBOT approach. First, we discuss the representations of text considered, followed by the overall formulation of the approach. A schematic overview of autoBOT is shown in Figure 1.
Here, the training set of documents is first represented at different granularities (\(\varvec{F}\)): from sparse bag-of-words type vectors at the level of characters, words, part-of-speech (POS) tags, as well as keywords and relations spanning multiple tokens, to dense document embeddings and knowledge graph-based features (\({\mathcal {K}}\)). This is followed by the process of representation evolution (the G field in the figure). The obtained initial set of representations serves as the basis for evolutionary optimization. Here, weights (individuals), multiplied with the feature values corresponding to the parts of this space, are evolved so that a given performance score is maximized. The final set of solutions is used to obtain a set of individual classifiers, each trained on a different part of the space; final predictions are obtained via a majority vote scheme. Hence, evolution effectively emits an ensemble of classifiers. More details follow below.
Multi-level representation of text
Let \(\textrm{FT}\) represent the set of all feature types considered during evolution, and let D denote the set of considered document instances. Examples of feature types include single word features, their n-grams, character n-grams, etc. Assume f represents a given feature type and let \(d_f\) denote the number of features of this type. The number of all features is defined as \(d = \sum _f d_f\). Hence, the final d-dimensional document space consists of concatenated matrices \(\varvec{F}_f \in {\mathbb {R}}^{|D| \times d_f}\), i.e.
$$\begin{aligned} \varvec{F} = \Big |\Big |_i \varvec{F}_i, \end{aligned}$$
where i denotes the i-th feature type, and \(\big |\big |\) denotes column-wise concatenation. The matrix is next normalized (L2, row-wise), as is common practice in text mining. The types of features considered by autoBOT are summarized in Table 1.
Table 1 Different feature types considered by autoBOT

The considered features, apart from the relational ones and document embeddings, are subject to TF-IDF weighting, i.e.,
$$\begin{aligned} \text {TF-IDF}(t, m) = \sum _{j \in m}\mathbb {1}[j = t] \cdot \log { \bigg ( \frac{|D|}{\sum _{k \in D}\mathbb {1}[t \in k] + 1} \bigg )}, \end{aligned}$$
(1)
where t is a token of interest, m the document of interest, and D the set of all documents. While word and character n-grams, POS tags, as well as document embeddings (Footnote 2) are commonly used, the relational, knowledge graph-based and keyword-based features are a novelty of autoBOT, discussed below.
Relational features. One of the key novelties introduced in this paper is the relational feature construction method, summarized as follows. Consider two tokens, \(t_1\) and \(t_2\). autoBOT already considers n-grams of length 2, which account for patterns of the form (\(t_1\),\(t_2\)). However, longer-range relations between tokens are not captured this way. As part of autoBOT, we implemented an efficient relation extractor, capable of producing symbolic features described by the following (i-th) first-order rule: \({\mathcal {R}}_i := \text {presentAtDistance}(t_1,t_2,{{\overline{\delta }}}(t_1,t_2))\), where \({{\overline{\delta }}}\) represents the average distance between a given token pair across the training documents. Thus, the features represent pairs of tokens, characterized by binary feature values and derived from the top \(d_{t=\text {relational}}\) distances (the number of considered features) between token pairs. An example is given next.
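The following is a minimal sketch of this idea, assuming whitespace tokenization, distances between first token occurrences only, and ranking of pairs by co-occurrence frequency; the function name and these simplifications are illustrative assumptions, not autoBOT's exact implementation:

```python
from collections import defaultdict
from itertools import combinations

def relational_features(docs, top_n=100):
    """Collect token pairs and their average distance across a corpus."""
    distances = defaultdict(list)
    for doc in docs:
        positions = defaultdict(list)
        for idx, tok in enumerate(doc.lower().split()):
            positions[tok].append(idx)
        for t1, t2 in combinations(sorted(positions), 2):
            # distance between the first occurrences of the two tokens
            distances[(t1, t2)].append(abs(positions[t1][0] - positions[t2][0]))
    # rank pairs by how often they co-occur; keep the top_n as features
    avg = {pair: sum(d) / len(d) for pair, d in distances.items()}
    ranked = sorted(avg, key=lambda p: len(distances[p]), reverse=True)[:top_n]
    return {f"presentAtDistance({t1},{t2},{avg[(t1, t2)]:.1f})"
            for t1, t2 in ranked}

print(relational_features(["the cat sat on the mat",
                           "the dog sat near the mat"]))
```

Each returned rule such as presentAtDistance(sat,mat,3.0) then corresponds to a binary feature indicating whether the token pair is present in a document.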
Keyword-based features. The second type of features introduced in this work are features based on keywords. Given a document, keywords represent a subset of tokens that are representative of the document. There exist many approaches to keyword detection. For example, statistical methods, such as KP-MINER (El-Beltagy and Rafea 2009), RAKE (Rose et al. 2010) and YAKE (Campos et al. 2018), use statistical characteristics of texts to capture keywords. On the other hand, graph-based methods, such as TextRank (Mihalcea et al. 2004), Single Rank (Wan and Xiao 2008), TopicRank (Bougouin et al. 2013), Topical PageRank (Sterckx et al. 2015) and RaKUn (Škrlj et al. 2019), build graphs to rank words based on their position in the graph. The latter is also the method adopted as part of autoBOT for the feature construction process, which proceeds in the following steps:
1. Keyword detection. First, for each class, the documents from the training corpus corresponding to this class are gathered. Next, keywords are detected with the RaKUn algorithm for each set of documents separately. In this way, a set of keywords is obtained for each target class.

2. Vectorization. The set of unique keywords is next obtained and serves as the basis for novel features, constructed as follows. For each document in the training corpus, only the keywords from the subset of all keywords corresponding to the class with which the document is annotated are recorded (in the order of appearance in the original document) and used as a token representation of the given document. In this way, the keywords specific to a given class are used to construct novel, simpler "documents". Finally, a TF-IDF scheme is adopted as for e.g., character or word n-grams, yielding the n most frequent keywords as the final features (Footnote 3).
The rationale behind incorporating keyword-based features is that more local information, specific to documents of a particular class, is considered, potentially uncovering more subtle token sets that are relevant for differentiating between the classes.
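A sketch of the two-step procedure follows. Here, extract_keywords is a crude frequency-based stand-in for RaKUn (any keyword detector with the same signature would do), and the toy documents and labels are invented for illustration:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(docs, top_k=50):
    # placeholder keyword detector standing in for RaKUn:
    # plain token frequency as a crude keyword score
    counts = Counter(tok for d in docs for tok in d.lower().split())
    return {tok for tok, _ in counts.most_common(top_k)}

def keyword_documents(docs, labels):
    # step 1: detect keywords separately for each class
    per_class = {c: extract_keywords([d for d, y in zip(docs, labels) if y == c])
                 for c in set(labels)}
    # step 2: keep only in-class keywords, preserving order of appearance,
    # yielding simpler "documents" that are then TF-IDF vectorized
    return [" ".join(t for t in d.lower().split() if t in per_class[y])
            for d, y in zip(docs, labels)]

docs = ["cheap meds available online now", "meeting agenda for monday morning"]
labels = ["spam", "ham"]
F_keywords = TfidfVectorizer().fit_transform(keyword_documents(docs, labels))
```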
Knowledge graph-based features. A key novelty introduced as part of autoBOT is the incorporation of knowledge graph-based features. Knowledge graphs are large, mostly automatically constructed relational sources of knowledge. In this work we explored how ConceptNet (Speer et al. 2017), one of the currently largest freely available multilingual knowledge graphs, could be used to construct novel features whose scope extends beyond the considered data set (Footnote 4). We propose an algorithm for the propositionalization of grounded relations, discussed next.
Assuming a collection of documents D, the proposed propositionalization procedure identifies which relations, present in the knowledge graph, are also present in a given \(k \in D\). Let \({\mathcal {K}} = (N,E)\) represent the knowledge graph used, where N is the set of terms and E the set of subject-predicate-object triplets, so that the subject and the object are two terms. We are interested in finding a collection of features \(F_{\text {KG}}\) (i.e. knowledge graph-based features). We build on the recent propositionalization ideas of Lavrač et al. (2020), where automatically identified zero-order logical structures are effectively used as features. We refer to the algorithm capable of such scalable extraction of first-order features as PropFOL, summarized next. The key idea of PropFOL is to ground the triplets appearing in a given knowledge graph while traversing the document space. More specifically, each document k is traversed, and the relations present in each document are stored. The relations considered by PropFOL are shown in Table 2.
Table 2 Relations from ConceptNet considered by PropFOL

PropFOL operates by memorizing the collections of grounded relations in each document k. Once the document corpus is traversed, the bags of grounded relations are vectorized in a TF-IDF manner. Finally, for each new document, two operations need to be conducted. First, the grounded relations need to be identified. Second, the collection of relations is vectorized by using the stored weights of the individual relations, obtained from the training data. The feature construction algorithm is given as Algorithm 1.
The algorithm consists of two main steps. First, the document corpus (D) is traversed (line 4), whilst the relations are recorded for each document (k). Once memorized (for training data, line 7), a vectorizer is constructed, which in this work conducts TF-IDF re-weighting (line 16) of first-order features and, based on their overall frequency, selects the top n such features to be used during evolution. Note that this simple propositionalization scheme is adopted due to the large knowledge graph considered in this work, as one of the key purposes of autoBOT is to maintain scalability (such a graph can be processed on an off-the-shelf laptop). Note that in practice, even though millions of entities and tens of millions of possible relations are inspected, the final collection of grounded relations, particular to a considered data set, remains relatively small. In more detail, the getAllTokens method (line 2) maps a given document corpus D to a finite set of possible tokens (e.g. words). The obtained token base is retrieved for each document (k, line 7) via the getTokens method. The subset of tokens corresponding to a given document is next used to extract the subgraph of the input knowledge graph \({\mathcal {K}}\) corresponding to that document. This step is mandatory, as the subgraph effectively corresponds to the set of triplets that are used as features. The missing component at this point is the relations, which are retrieved via the decodeToTriplet method (line 12). Such triplets represent potentially interesting, background knowledge (\({\mathcal {K}}\))-based features. In the final part of the algorithm, triplet sets are processed as standard bags-of-items to obtain a real-valued feature space suitable for learning (\(\varvec{F}_{\text {KG}}\)).
The following example demonstrates how the constructed features are obtained, and what potentially interesting relations such feature construction entails.
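As an illustration, consider the minimal PropFOL-style sketch below (not the paper's original example); the two toy triplets stand in for ConceptNet, and real usage would load the actual triplet dump rather than an in-memory set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for ConceptNet's subject-predicate-object triplets
triplets = {("dog", "IsA", "animal"), ("mat", "AtLocation", "floor")}

def ground(document):
    # a triplet is grounded if both its subject and object occur in the text
    tokens = set(document.lower().split())
    return [f"{s}-{p}-{o}" for (s, p, o) in triplets
            if s in tokens and o in tokens]

corpus = ["the dog is a loyal animal", "the mat lies on the floor"]
bags = [" ".join(ground(doc)) for doc in corpus]
# the first document grounds the symbolic feature "dog-IsA-animal",
# the second grounds "mat-AtLocation-floor"
F_kg = TfidfVectorizer(token_pattern=r"\S+").fit_transform(bags)
```

For instance, a document mentioning both "dog" and "animal" yields the grounded relation dog-IsA-animal, a feature no purely lexical representation would produce.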
This type of feature construction is thus able to extract relations otherwise inaccessible to conventional learners that operate solely on e.g., word-based representations. Even though the current implementation of autoBOT exploits the ConceptNet knowledge graph due to its generality, the implementation permits the use of any triplet knowledge base that can be mapped to parts of texts, and as such offers many potentially interesting domain-specific applications.
Solution specification and weight updates
The key part of every genetic algorithm is the notion of a solution (an individual). A solution is commonly represented as a (real-valued) vector, with each element corresponding to a part of the overall solution. Let FT represent the set of feature types. The solution vector employed by autoBOT is denoted by \(\textrm{SOL} \in [0,1]^{|\textrm{FT}|}\), where |FT| is the number of feature types.
Note that the number of parameters a given solution consists of is exactly equal to the number of unique feature types (as seen in Table 1). The solution is denoted as:
$$\begin{aligned} \textrm{SOL} = \big [\underbrace{w_1,w_2, \dots , w_{|\textrm{FT}|}}_{\text {Subspace weights}}\big ]. \end{aligned}$$
Thus, the solution vector of the current implementation of autoBOT consists of eight (hyper)parameters, one for each of the eight feature types in Table 1. Next, we discuss solution evaluation, i.e., the process of obtaining a numeric score from a given solution vector.
Each solution vector \(\textrm{SOL}\) consists of a set of weights, applicable to particular parts of the feature space. Recall that the initial feature space, as discussed in Section 3.1, consists of \(d\) features. Given the weight part of \(\textrm{SOL}\), i.e. \([w_1, w_2, \dots , w_{|\textrm{FT}|}]\), we denote by \(I_i^\text {from}\) and \(I_i^{\text {to}}\) the two column indices delimiting the columns of the i-th feature type. The original feature space \(\varvec{F}\) is updated as follows:
$$\begin{aligned} \varvec{F}^{I_i^\text {from} \text { to } I_i^\text {to}}_{s} = w_i^{s} \odot \varvec{F}^{I_i^\text {from} \text { to } I_i^\text {to}}. \end{aligned}$$
(2)
where \(\odot\) refers to the scalar-matrix product and s to a particular individual (updated feature space). Note also that the superscript of the weight corresponds to the considered individual. The union of the obtained subspaces represents the final representation used for learning.
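Equation 2 amounts to a per-block rescaling of the feature matrix. A minimal sketch follows; the column bounds and toy dimensions are illustrative:

```python
import numpy as np

def apply_solution(F, sol, bounds):
    """F: |D| x d matrix; sol: |FT| weights; bounds: (from, to) column pairs."""
    F_s = F.copy()
    for w_i, (i_from, i_to) in zip(sol, bounds):
        # scale the columns of the i-th feature type by its evolved weight
        F_s[:, i_from:i_to] = w_i * F_s[:, i_from:i_to]
    return F_s

F = np.random.rand(4, 6)                               # toy feature space, d = 6
F_s = apply_solution(F, [0.2, 1.0], [(0, 3), (3, 6)])  # two feature types
```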
The key idea of autoBOT is that instead of evolving at the learner level, evolution is conducted at the representation level. The potential drawback of such a setting is that if only a single learner were used to evaluate the quality of a given solution (representation), the fitness score (which in this work equals the mean score obtained during five-fold cross validation on the training set) would be skewed. To overcome this issue, autoBOT considers, instead of a single classifier, a wide spectrum of linear models parameterized with different levels of elastic net regularization (the trade-off between the L1 and L2 norms) and losses (hinge and log loss are considered). Trained by stochastic gradient descent, hundreds of models can be evaluated in a matter of minutes, offering a more robust estimate of a given representation's quality. Note that each solution is considered by hundreds of learners, and there are multiple solutions in the overall population. More formally, we denote with
$$\begin{aligned} {\mathcal {S}}_c(\varvec{F}) = \mathop {\mathrm {arg \, max}}\limits _{h} \big [ \textrm{SGD}(\textrm{SOL},h,\varvec{F}) \big ] \end{aligned}$$
(3)
the optimization process yielding the best performing classifier when considering feature space \(\varvec{F}\), where SGD represents a single stochastic gradient descent-trained learner parameterized via h (a set of hyperparameters such as the loss function and regularization). Note that SGD considers the labeled feature space during learning.
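A sketch of Equation 3 using scikit-learn follows; the hyperparameter grid is illustrative and much smaller than the hundreds of configurations described above ("log_loss" is named "log" in older scikit-learn versions):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def fitness(F_s, y):
    """Best mean 5-fold CV score across a family of SGD-trained linear models."""
    hyper_grid = [{"loss": loss, "alpha": alpha, "l1_ratio": l1_ratio}
                  for loss in ("hinge", "log_loss")
                  for alpha in (1e-4, 1e-3, 1e-2)
                  for l1_ratio in (0.0, 0.5, 1.0)]
    scores = [cross_val_score(
                  SGDClassifier(penalty="elasticnet", **h), F_s, y,
                  cv=5, scoring="f1_macro").mean()
              for h in hyper_grid]
    return max(scores)  # the arg max over h yields the best learner
```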
A detailed specification of the family of linear models considered during fitness computation is given in Section 4.2. We next discuss the initialization, a final component of autoBOT that can notably impact the evolution. Let \(\varvec{F}_f\) represent a feature subspace (see Section 3.1 for details). The initial solution vector is specified as:
$$\begin{aligned} \textrm{SOL}_{\text {init}} = [{\mathcal {S}}_c(\varvec{F}_f) \cdot {\mathcal {U}}(0.95,1.05)]_{f \in \textrm{FT}}. \end{aligned}$$
(4)
Note the link to Equation 3: the vector consists of feature type-specific performances. \({\mathcal {U}}(a,b)\) represents a random number between a and b, drawn from the uniform distribution; it serves as noise that prevents the initialization of overly similar individuals. As the F1 score is adopted for classifier performance evaluation in this work, its range is known (0 to 1), and the proposed initialization thus offers a stable initial weight setting (Footnote 5).
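In code, Equation 4 reduces to a few lines (a sketch reusing the fitness function above; the jitter interval is the one from Equation 4):

```python
import random

def init_solution(subspaces, y):
    # one weight per feature type: its standalone CV score, slightly jittered
    return [fitness(F_f, y) * random.uniform(0.95, 1.05) for F_f in subspaces]
```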
Dimension estimation
Commonly, the dimension of a learned representation is treated as a hyperparameter. However, many recent works in the area of representation learning indicate that a high-enough dimension is a robust choice across multiple domains, albeit at the cost of additional computational complexity. The proposed autoBOT exploits these insights and adapts them for learning from sparse data. The dimension estimation is parametrized via the following relation:
$$\begin{aligned} d_f = \text {round}(d_d/s), \end{aligned}$$
where \(d_f\) is the final dimension, \(d_d\) the dense dimension and s the estimated sparsity. The idea is that autoBOT attempts to estimate the size of the sparse vector space based on the assumption that models operating with dense matrices require \(d_d\) dimensions for successful performance, and that s is the expected sparsity of the space produced by autoBOT. In this work, we consider \(d_d = 128\) and \(s = 0.1\); the dense dimension is based on the existing literature, and s is low enough to yield a sparse space.
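With these default values, the relation yields \(d_f = \text {round}(128 / 0.1) = 1280\) dimensions.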
Formulation of autoBOT
Having defined the key steps for the evaluation of a single solution vector \(\textrm{SOL}\), we continue by discussing how such evaluation forms part of the evolution process undertaken by autoBOT. The reader can observe that the genetic algorithm adopted as part of autoBOT is one of the simplest ones, introduced already in the 1990s (Davis 1991).
The key steps of autoBOT, summarized in Algorithm 2, are outlined below. They involve initialization (line 2), followed by offspring creation (line 6). These two steps first initialize a population of a fixed size, followed by the main while loop, where each iteration generates a novel set of individuals (solutions) and finally (line 14) evaluates them against their parents in a tournament scheme. Note that prior to being evaluated, each population undergoes crossover and mutation (lines 7 and 10), where individuals are changed either pointwise (mutation) or piecewise (crossover). Once the evolution finishes, the HOF (hall-of-fame) object is inspected and used to construct an ensemble learner that performs classification via a voting scheme. In this work, we explore only time-bound evolution: after a given time period, the evolution is stopped. The methods in Algorithm 2 are described in more detail below (a condensed sketch of the main loop follows the list).

1. The generateSplits method generates the data splits used throughout the evolution. This step ensures that subsequent steps of the evolution operate on the same feature spaces and are as such comparable.

2. The generateInitial method generates a collection of real-valued vectors that serve as the initial population, as discussed in Equation 4.

3. The initializeRepresentation method constructs the initial feature space considered during evolution. By initializing this space prior to evolution, the space needs to be constructed only once, compared to the naïve implementation where it is constructed for each individual.

4. The mate and mutate methods correspond to the standard crossover and mutation operators.

5. The evaluateFitness method returns a real-valued performance assessment score of a given representation (Footnote 6).

6. The updateHOF method maintains the storage of the best-performing individuals throughout all generations, and is effectively a priority queue of fixed size.

7. The selectTournament method is responsible for comparing individuals and selecting the best-performing ones that constitute the next generation of representations.

8. Finally, the trainFinalLearners method considers the best-performing representations from the hall-of-fame and trains the final classifier via extensive grid search.
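The condensed sketch below reuses the fitness and init_solution functions from the earlier sketches. It is a plain-Python stand-in, not autoBOT's actual implementation: population size, mutation rate, hall-of-fame size and the time budget are illustrative, and crossover is omitted for brevity:

```python
import heapq
import random
import time

import numpy as np

def weighted_space(subspaces, sol):
    # Equation 2 applied per feature type, then concatenated column-wise
    return np.hstack([w * F_f for w, F_f in zip(sol, subspaces)])

def evolve(subspaces, y, popsize=8, time_budget_s=3600, mut_p=0.3, hof_size=3):
    population = [init_solution(subspaces, y) for _ in range(popsize)]
    hof = []                                     # hall of fame
    deadline = time.time() + time_budget_s
    while time.time() < deadline:                # time-bound evolution
        offspring = []
        for parent in population:
            child = parent[:]                    # crossover omitted for brevity
            for i in range(len(child)):          # pointwise mutation
                if random.random() < mut_p:
                    child[i] = min(1.0, max(0.0, child[i] + random.gauss(0, 0.1)))
            offspring.append(child)
        # evaluate parents and offspring on their weighted representations
        scored = [(fitness(weighted_space(subspaces, s), y), s)
                  for s in population + offspring]
        scored.sort(key=lambda t: t[0], reverse=True)
        population = [s for _, s in scored[:popsize]]    # survivor selection
        hof = heapq.nlargest(hof_size, hof + scored[:hof_size],
                             key=lambda t: t[0])
    return hof  # best representations; used to train the final voting ensemble
```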
We next discuss the family of linear models considered during evolution. Note that the following optimization is conducted both during evolution (line 13) and final model training (line 16). The error term considered by stochastic gradient descent is:
$$\begin{aligned} \textrm{Err}(\varvec{w},b) = \underbrace{\frac{1}{|D|}\sum _{i=1}^{|D|} {\mathcal {L}}(y_i, \varvec{w}^T \varvec{x}_i + b)}_{\text {Loss term}} + \alpha \Bigg [\underbrace{ \frac{1 - \beta }{2} \sum _{j=1}^{d} w_j^2}_{\text {L2}} + \underbrace{\beta \sum _{j=1}^{d} |w_j| }_{\text {L1}} \Bigg ], \end{aligned}$$
where \(y_i\) is the target of the i-th instance \(\varvec{x}_i\), \(\varvec{w}\) is the weight vector, \({\mathcal {L}}\) is the considered loss function, and \(\alpha\) and \(\beta\) are two numeric hyperparameters: \(\alpha\) represents the overall weight of the regularization term, and \(\beta\) the ratio between L1 and L2 regularization. The loss functions considered are the hinge and the log loss, discussed in detail for the interested reader in Friedman et al. (2001).
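This error term matches the objective minimized by scikit-learn's SGDClassifier, with alpha and l1_ratio playing the roles of \(\alpha\) and \(\beta\); one member of the model family looks as follows (the values shown are illustrative):

```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="hinge",       # or "log_loss" for the logistic loss
                      penalty="elasticnet",
                      alpha=1e-4,         # overall regularization weight
                      l1_ratio=0.5)       # beta: 1.0 pure L1, 0.0 pure L2
```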
Theoretical considerations and explainability
We next discuss relevant theoretical aspects of autoBOT, with a focus on computational complexity and parallelism, as the no-free-lunch nature of generic evolution as employed in this work has been studied previously (Wolpert and Macready 1997; English 1996). In terms of computational complexity, the following aspects impact the evolution the most:
Feature construction. Let \(\tau\) represent the number of unique tokens in the set of documents D. Currently, the most computationally expensive part is the computation of keywords, where the load centrality is computed (Škrlj et al. 2019). The worst-case complexity of this step is \({\mathcal {O}}(\tau ^3)\): the number of nodes times the number of edges in the token graph, which is in the worst case \(\tau ^2\). Note, however, that such a scenario is unrealistic, as real-life corpora do not contain all possible token-token sequences (Zipf's law). The complexities of e.g., word, character, relational and embedding-based features are lower. Additionally, the features based on the knowledge graph also contribute to the overall complexity. Let \(E({\mathcal {K}})\) denote the set of all subject-predicate-object triplets considered. PropFOL (Algorithm 1) needs to traverse the space of triplets only once (\({\mathcal {O}}(|E({\mathcal {K}})|)\)). Finally, both of the mentioned steps take additional |D| steps to read the corpus. We assume the remaining feature construction methods are less expensive.
Fitness function evaluation. As discussed in Section 3.2, the evaluation of a single individual that encodes a particular representation is conducted not by training a single learner, but a family of linear classifiers. Let the number of models be denoted by \(\omega\), the number of individuals by \(\rho\), and the number of generations by |G| (G is the tuple of aggregated evaluations, one per generation). The complexity of conducting evolution, guided by learning, is \({\mathcal {O}}(\rho \cdot \omega \cdot |G|)\).
Initial dimensionality estimation. The initial dimensionality is computed via a linear equation, and is \({\mathcal {O}}(1)\) w.r.t. the |FT| (number of feature types).
Space complexity. When considering space complexity, we recognize the following aspects as relevant. Let |I| denote the number of instances and |FT| the number of distinct feature types. With the number of all features denoted by d, as discussed in Section 3.1, the space required by the evolution is \({\mathcal {O}}(|I| \cdot d \cdot \rho )\). In practice, however, the feature space is mostly sparse, resulting in no significant spatial bottlenecks even when tens of thousands of features are considered.
The individual computational steps considered above can be summarized as the following complexity:
$$\begin{aligned} {\mathcal {O}}(\underbrace{|D| + \tau ^3 + |E({\mathcal {K}})|}_{\text {Representation construction}} + \underbrace{\rho \cdot \omega \cdot |G|}_{\text {Evolution}}). \end{aligned}$$
We next discuss how autoBOT computes solutions in parallel, offering significant speedups when multiple cores are used. There are two main options for adopting parallelism when considering both the evolution and the learning. Parallelism can be adopted either at the level of individuals, where each CPU core is occupied with a single individual, or at the learner level, where the grid search used to explore the space of linear classifiers is conducted in parallel. In autoBOT, we employ the second option, which we justify as follows. Adopting parallelism at the individual level implies that each worker considers a different representation, thus rendering the sharing of the feature space amongst the learners problematic. This is not an issue when adopting parallelism at the level of learners: here, individuals are evaluated sequentially, but the space of learners is explored in parallel for a given solution (representation). This setting, ensuring more memory-efficient evolution, is implemented in autoBOT; a minimal illustration is given below. Formally, the space complexity when parallelizing at the individual level rises to \({\mathcal {O}}(c \cdot |I| \cdot d \cdot \rho )\), which, albeit differing only linearly by the parameter c (the number of concurrent processes), could result in an order of magnitude higher memory footprint (when running autoBOT on e.g., a 32-core machine). The option with sequential processing of the individuals but parallel evaluation of the learners retains the favorable complexity \({\mathcal {O}}(|I| \cdot d \cdot \rho )\) (assuming shared memory).
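Learner-level parallelism can be sketched with scikit-learn's grid search, where n_jobs distributes the learner grid over all cores while the weighted feature space of the current individual stays shared (the grid itself is illustrative):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# For each individual (evaluated sequentially), the learner grid runs on all
# available cores over the shared, already-weighted feature space F_s.
learner_search = GridSearchCV(
    SGDClassifier(penalty="elasticnet"),
    {"loss": ["hinge", "log_loss"],
     "alpha": [1e-4, 1e-3, 1e-2],
     "l1_ratio": [0.1, 0.5, 0.9]},
    cv=5, scoring="f1_macro", n_jobs=-1)
```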
An important aspect of autoBOT is also its explainability, discussed next. As individual features constructed by autoBOT already represent interpretable patterns (e.g., word n-grams), the normalized coefficients of the top-performing classifiers obtained as part of the final solution can be inspected directly. However, in practice, this can require the manual curation of tens of thousands of features, which is not necessarily feasible and can be time consuming. To remedy this shortcoming, autoBOT's evolved weights, corresponding to semantically different parts of the feature space, can be inspected directly. At this granularity, only up to e.g., eight different importances need to be considered, one per feature type, giving practical insights into whether the method, for example, benefits most from word-level features, or whether it performs better when knowledge graph-based features are considered. In practice, we believe that combining both granularities can offer interesting insights into the model's inner workings, as considering only a handful of the most important low-level (e.g., n-gram) features can also be highly informative and indicative of the patterns the model recognized as relevant.
Finally, autoBOT also offers a direct high-level overview of which types of features were the most relevant. We believe such information can serve transfer learning purposes at the task level, which we explore as part of the qualitative evaluation.
How successful was evolution?
Quantification of a given evolution trace, i.e. the fitness values w.r.t. generations, has been previously considered by Beyer et al. (2002), and even earlier by Rappl (1989), where the expected value of the fitness was considered alongside the optimum in order to assess how efficient the evolution is, given a fixed amount of resources. To our knowledge, however, these scores were not adapted specifically to a machine learning setting, which we address with the heuristic discussed next. We remind the reader that \(G = (\textrm{perf}(i))_i\) represents a tuple denoting the evolution trace, i.e. the sequence of performances. Each element of G is in this work a real-valued number between 0 and 1. Note that the tuple is ordered, meaning that, moving from left to right, the values correspond to the initial vs. late stages of the evolution's performance. Further, \(\textrm{perf}(i)\) corresponds to the maximum performance in the i-th generation. Let \(\max _g(G)\) denote the maximum performance observed in a given evolution trace G, and let \({{\,\mathrm{arg max}\,}}_g (G)\) represent the generation (i.e. evolution step) at which the maximum occurs. Finally, let |G| denote the total number of evolution steps. Intuitively, both the maximum performance and the time (in generations) required to reach it need to be taken into account. We propose the following score:
$$\begin{aligned} \text {GPERF}(G) = \underbrace{\max _g (G)}_{\text {Top score}} \cdot \underbrace{\bigg (1 - \frac{{{\,\mathrm{arg max}\,}}_g (G)}{|G|} \bigg )}_{\text {How late it converged to the top score?}}. \end{aligned}$$
Intuitively, the score is high if the overall performance is good and the evolution found the best-performing solution quickly. On the other hand, if all the available time was spent, no matter how good the solution, \(\textrm{GPERF}\) will be low. Note that the purpose of GPERF is to give insight into the evolution's efficiency, which should take into account the time needed to reach a certain optimum; if the reader is interested solely in performance, such comparisons are also offered. Note that \(\max _g (G)\) represents the best-performing solution obtained during evolution. The heuristic, once computed for evolution runs across different data sets, also offers a potential insight into how suitable particular classification problems are for an evolution-based approach; this information is potentially correlated with problem hardness.
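GPERF is straightforward to compute from a trace of per-generation maxima (a sketch; the trace values are invented for illustration):

```python
def gperf(trace):
    # trace: per-generation maximum performances, ordered by generation
    best = max(trace)
    reached_at = trace.index(best)   # generation at which the max first occurs
    return best * (1 - reached_at / len(trace))

print(gperf([0.70, 0.81, 0.85, 0.85]))  # converges at generation 2 of 4,
                                        # yielding 0.85 * (1 - 2/4) = 0.425
```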