Introduction

The human brain can be thought of as the best pattern recognizer in the known universe. Since early childhood, we have been observing patterns in the objects around us (e.g., flowers, toys, pets and faces). Learning patterns also reinforces, and is reinforced by, the acquisition of language. It is well known that most 5-year-old children are already able to recognize digits and letters [1]. At the same time, scientists, engineers and practitioners know that designing a general-purpose machine for Pattern Recognition (PR) able to rival the brain's performance remains, today, an elusive goal. PR, intended as a superordinate field, involves disciplines such as Cognitive Sciences, Psychology, Artificial Intelligence (AI), and, to some extent, Neuroscience, Linguistics and Philosophy. As a discipline that studies the ability to discover and recognize regularities in observations, PR can help us understand how human perception works and uncover the secrets behind the ability to gain new knowledge and to exploit it in appropriate ways. Thus, besides giving more insight into the study of the human senses and the nervous system, PR allows building automated systems adopted in medical diagnosis, industrial inspection, personal identification, man-machine interaction, Natural Language Processing (NLP) tasks, and so on.

For its part, an open issue in Cognitive Science is the causal relation between low-level phenomena occurring in the senses and nerves and higher levels of understanding and conceptual thinking [2, 3]. Interestingly, the analogous issue remains open in automated mechanical PR as well, even though there we deal with sensors, actuators, processing units and so on.

In fact, one of the main problems affecting both disciplines (namely, automated PR and Cognitive Science), each within its own phenomenology, is “representation” [4, 5]. In other words, Cognitive Science, in studying the cognitive activities of humans and other animals, provides us with a set of explanatory theories and even a set of constructive prescriptions that can aid in designing artifacts such as robots, animats, and chess-playing programs, with the aim of accomplishing various cognitive tasks. A key issue is the representation of information or data in such a way that the cognitive system can be modeled, from the stimuli coming from biological or digital sensors up to high-level processing capabilities, for example, to perform tasks at higher semantic levels. It can be affirmed, without pretense of completeness, that self-organizing phenomena of the physical world are relevant also for understanding cognitive processes [6]; hence, some properties of cognitive systems are in resonance with those attributed to Complex Systems. Even language and text production can be thought of as activities of complexly organized brains [7], and semantic meaning can be hidden in the middle-ware of the complexity around us. The engineer Alfred Korzybski, the founder of General Semantics, who attached great importance to language and the use of words, affirmed in 1933 that thinking is a matter of a multilevel order of abstraction and that content is a declination of structure with complex relationships [8]. The multilevel order of abstraction can be found even in the organization of most Complex Systems, where emergent properties and vertical information processing generate new abstract levels, each dominated by its own semantic content.

From a computational point of view, Granular Computing (GrC) is the umbrella term covering any theory, methodology, technique, and tool that makes use of information granules in complex problem solving [9, 10]. Information granules are atomic units [11] that naturally give rise to hierarchical structures: the same problem or system can be perceived at different levels of specificity (detail), depending on the complexity of the problem, the available computing resources and the particular needs to be addressed [9, 12]. Some authors (e.g., W. Pedrycz) conceive GrC as a conceptual and algorithmic platform supporting the analysis and design of human-centric intelligent systems [13]. Zadeh, the founder of fuzzy logic, considers GrC as a basis for computing with words, i.e., computation with information described in natural language [14, 15]. For example, the text in a book can be seen as an increasing granulation of the information content, starting from the alphabet letters and ending with the aggregation of concepts and topics, passing through “mesoscopic” structures such as words, sentences, paragraphs, chapters and so on. In this regard, E. G. Altmann et al. affirm that [16]: “literary texts are an expression of the natural language ability to project complex and high-dimensional phenomena into a one-dimensional, semantically meaningful sequence of symbols. For this projection to be successful, such sequences have to encode the information in form of structured patterns, such as correlations on arbitrarily long scales”.

Moreover, when dealing with text, the representation problem in automated PR systems becomes more difficult, because text is intrinsically structured at various levels, while classical PR problems solved by standard Machine Learning (ML) approaches work within the \(\mathbb {R}^n\) vector space geometry. In general, the GrC approach allows designing automated PR systems able to deal directly with structured or unconventional input domains [17]. Hence, a challenging task is finding effective models and algorithms able to represent and process a set of samples coming from a structured domain.

The aim of the current work is twofold:

  • from a more general point of view, it investigates how to bridge the gap between some findings in Cognitive Science (such as Conceptual Spaces and Prototype Theory [3, 18]) and the GrC approach, in light of the problem of representing text excerpts in text mining;

  • from a specific point of view, the objective of the current study is to experiment with some text embedding methods through a GrC approach, known as symbolic histograms [19, 20], in solving two specific text classification problems with standard Machine Learning algorithms able to process n-tuples of real numbers or more structured objects, such as sequences.

It is well known that in PR applied to text data a traditional representation approach consists in embedding words or documents in a mathematical space with useful algebraic properties, such as a linear vector space, also known as a feature space. The essence of such an algebraic space, capturing some kind of co-occurrence between words and contexts, is built on top of Distributional Semantics (DS) [21], grounded, in turn, on the distributional hypothesis: similarity of meaning correlates with similarity of distribution. After all, Wittgenstein claimed in his Philosophical Investigations that “the meaning of a word is its use in the language” [22]. In other words, as the American linguist Z. S. Harris maintained, “words that are used and occur in the same contexts tend to purport similar meanings” [23], or, paraphrasing the British linguist J. R. Firth, “a word is characterized by the company it keeps” [24]. In ML, specifically in document classification and even in Computer Vision [25], the approach is known as “bag of words (BoW)” — sometimes referred to as the surface form — pointing out the fact that a text is represented by the frequency of occurrence of each word, building a feature space for training a classifier [26] while disregarding grammar and even the order of words. The methodology is heavily adopted in Information Retrieval (e.g., the traditional Vector Space Model (VSM) [27]) and text mining, but it is well known that it has some limitations, such as (i) the orthogonality of word representations, (ii) the construction of the vocabulary, which requires a careful design due to its size, and (iii) the sparsity of the model and the lack of context due to discarding all information brought by surrounding words.
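As a minimal illustration of the BoW/VSM representation discussed above, the following sketch builds a term-document matrix on a toy corpus (assuming scikit-learn is available; the documents are invented for illustration only):

```python
# Minimal BoW / Vector Space Model sketch (toy corpus, scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the heart pumps blood through the body",
    "the transistor switches current in the circuit",
    "entropy measures the information content of a source",
]

# Raw term counts: each document becomes a sparse frequency vector.
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)      # shape: (3, |vocabulary|)

# TF-IDF weighting, as used in the traditional VSM.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(len(bow.vocabulary_))             # the vocabulary size grows with the corpus
print(X_counts.toarray())               # word order and grammar are discarded
```

The sketch makes the limitations listed above tangible: the matrix is sparse, its width equals the vocabulary size, and nothing in it relates synonymous terms.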

Researchers have attempted to obtain representations of natural language capable of capturing meaning through what they call semantic spaces, a family of language models that adopt DS. For example, the Hyperspace Analogue to Language (HAL) [28] is a method for creating a simulation that exhibits some of the characteristics of human semantic memory, finding lexical co-occurrences by moving a window of length l over the corpus. HAL allows representing words as vectors.

In general, authors refer to word-context models as explicit models, while some families of transformations of the underlying data structure lead to implicit representations [29]. Canonical co-occurrence models are simple to implement and work well within standard ML pipelines. However, they possess a number of drawbacks, for example, sparsity (a lot of zeros due to Zipf’s law) and high dimensionality when dealing with huge corpora with large vocabularies. A simple frequency count, for example, does not intrinsically encode the fact that two words have the same meaning (synonymy), because words are treated as named entities, that is, as symbols. Moreover, contexts can be similar, or highly correlated, too. Furthermore, these raw representations can be very noisy.

In order to avoid some of these drawbacks, a number of implicit representations have been proposed in the literature, some of which are known as dense representations because they reach a non-sparse representation, often in a reduced feature space. The most adopted methodologies, in practice, use an implicit representation of features in a latent space, where latent features are computed starting from the distributional models. For example, Latent Semantic Analysis (LSA), representing the text in a latent space through a set of linear algebraic transformations, aims at constructing a rich semantic space. LSA is obtained by means of a (linear) matrix decomposition procedure known as Singular Value Decomposition (SVD), which allows dimensionality reduction (truncated SVD) and noise filtering. The dense embeddings produced by SVD sometimes perform better than the raw ones (grounded on PPMI matrices) on semantic tasks like word similarity. Various aspects of the dimensionality reduction contribute to the improved performance. If low-order dimensions represent unimportant information, the truncated SVD may be able to remove noise. By reducing the input dimension, the truncation may also help the models generalize better to unseen data. Due to interesting, and in some ways unexpected, properties, LSA has also been proposed as a cognitive model for human language use [30, 31]. Other techniques adopt other matrix factorization methods, such as non-negative matrix factorization (NMF), or ML methods such as GloVe [32], which is based on a regression technique.
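A minimal sketch of LSA as truncated SVD over a TF-IDF matrix is shown below (scikit-learn assumed; the toy corpus and the tiny number of latent dimensions are illustrative choices, real corpora typically use a few hundred dimensions):

```python
# LSA sketch: dense word/document vectors via truncated SVD of a TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "information entropy channel capacity coding",
    "string theory branes supersymmetry compactification",
    "semiconductor bandgap doping junction",
    "anatomy tissue organ vascular system",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # sparse term-document matrix

svd = TruncatedSVD(n_components=2)     # truncation (real corpora use e.g. 100-300)
doc_vectors = svd.fit_transform(X)     # dense document vectors in the latent space
word_vectors = svd.components_.T       # dense word vectors, one row per vocabulary term
```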

Recently, some powerful neural approaches have appeared in the technical literature, for example, the word2vec algorithm [33, 34], which embeds the meaning of text in a similar way to HAL (windowing), but constructs a dense representation by training a shallow Artificial Neural Network (ANN) — e.g., Skip-gram with negative sampling (SGNS). More recent approaches in neural language embedding adopt sophisticated Recurrent Neural Networks (RNN) bound with attention mechanisms for language modeling, such as the Bidirectional Encoder Representations from Transformers (BERT) [35] and related architectures. Another technique that uses an external corpus to build a semantic space is Explicit Semantic Analysis (ESA) [36], where words are represented as vectors and each entry is a Wikipedia article. In other words, each Wikipedia article is a kind of concept and words are embedded in a “concept space”. Hence, some attempts at embedding “meaning” and working with concepts are based on the so-called Bag of Concepts (BoC) [37] that, rather than identifying features directly with some surface form, adopts some artifices to make practical the intuition that the meaning of a document can be approximated by the union of the meanings of the terms appearing in the document itself. There are a number of practical implementations of BoC that use concept vectors. They differ in how they construct the concept space, for example, adopting implicit or explicit representations, such as WordNet-like approaches [38] or hyper-linked encyclopedic textual corpora.
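A minimal sketch of neural word embedding in the word2vec (SGNS) style, assuming the gensim library is available and using an invented toy corpus, is the following:

```python
# word2vec (skip-gram with negative sampling) sketch, assuming the gensim library.
from gensim.models import Word2Vec

sentences = [
    ["the", "heart", "pumps", "blood"],
    ["the", "transistor", "switches", "current"],
    ["entropy", "measures", "information"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension of the dense word vectors
    window=5,          # context window length
    sg=1,              # skip-gram architecture
    negative=5,        # negative sampling
    min_count=1,
)
vector = model.wv["entropy"]                      # dense 100-dimensional word vector
similar = model.wv.most_similar("heart", topn=3)  # neighbors in the embedding space
```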

In this paper, as concerns textual conceptualization, we deal with a simple type of BoC useful for building a suitable feature space, where both traditional ML algorithms and advanced ones, such as RNNs — for example, a Long Short-Term Memory (LSTM) — can safely operate.

In doing so, as stated above, we adopt GrC as a general toolbox, while the road-map of the proposed approach is grounded in a specific framework drawn from Cognitive Psychology and, in general, from Cognitive Science, that is, the “Conceptual Space” [3]. The theory of Conceptual Spaces is a modern extension of the Prototype Theory developed by Rosch [39, 40]. P. Gärdenfors affirmed that the problem of representation in Cognitive Science, that is, the problem of the vertical information processing by which stimuli and sense data become high-level thinking and concepts, is due to the lack of a middle level between the Sub-conceptual Representations based on associations and the Symbolic Representations where rational thinking operates. This level is the Conceptual Level, a bridge where information is organized in a smooth space and where the notions of prototype and similarity (intended as a mathematical distance) allow dealing with concepts and properties (a particular instance of concepts) in representing real-world objects. Concepts are particular “natural” regions of the Conceptual Space [3].

The proposed methodology foresees first the embedding of the words in a given corpus through either (i) the neural word embedding technique — word2vec — which is based on the association between words and contexts computed through a neural technique, or (ii) the classical LSA. The aim here is to build a semantic space — a Conceptual Space — where words coded as vectors are embedded.

The space of word vectors is thus partitioned into “natural” regions (Voronoi regions) through a clustering algorithm, where each region is intended as a semantically homogeneous container around its prototype. Once the Conceptual Space is constructed, each word in a given document takes part in a new representation, known as a symbolic histogram.

The symbolic histogram [17] is an embedding technique in which a pivotal role is played by a set of meaningful and recurrent substructures in the original data space; it is often adopted for representing structured objects lying in a non-metric structured space, such as graphs, sequences, strings, and images. In the current approach each document in a given corpus is represented as a symbolic histogram.

Specifically, concepts are represented by symbols (i.e., prototypes). In this sense, the vectors correspond to sub-symbols [41] that are transformed into symbols through a process characterized by information loss.

In other words, a document is represented as a probability distribution over a set of alphabet symbols — which we will call representatives of concepts within the Conceptual Space — used as a feature vector for feeding a classification algorithm. Specifically, a comparison will be offered among a Random Forest (RF), a Support Vector Machine (SVM) and an advanced RNN model able to deal with sequences (LSTM). In the last case, as a further novelty, instead of a classical feature space where features are concepts, the RNN processes sequences of concepts, that is, ultimately, a new representation of a document. By the way, Wiggins argues [42] that learning is not only a matter of acquiring static co-occurrences, but also includes generalization and the ability to process sequences of events or even sequences of concepts.

In light of the Conceptual Space Theory, this approach adds a middle layer in the representation/embedding of the text in documents. Hence, starting from a sub-conceptual layer where associations dominate the representation (neural embedding or LSA), the construction of the alphabet — obtained at training time — is based on a conceptual organization of the underlying associative layer, where a (small) set of prototypes is elicited that, in turn, offers a symbolic level used to build the embedding representation by symbolic histograms. The proposed embedding allows representing documents in a feature space of smaller dimension compared to BoW approaches, providing good performance for further recognition tasks. Moreover, the new feature space, constructed on top of the granulation of the semantic information contained in the word embedding model, is a classical real-valued feature space, allowing the adoption of standard ML algorithms (as mentioned earlier). This is a strong point of the proposed approach. It is worth noting that the proposed methodological framework opens the way to knowledge discovery applications and, in general, to the Explainable AI paradigm [43, 44]; a fact not so obvious for the modern neural architectures used in the NLP context.

The paper is organized as follows.

In “Related Works” a brief overview of related works is reported. In “Background: Prototypes and Conceptual Spaces” the Conceptual Space Theory and the Prototype Theory are outlined. In “Methods” the adopted approach and the problem framing are presented. The description of the data sets for the experiments and the main results are provided in “Experiments”. Lastly, conclusions are drawn in “Conclusions”.

Related Works

The symbolic histograms technique within the GrC model is widely adopted in many PR tasks [17], such as online handwriting recognition [45] or protein classification [46]. This technique is heavily adopted when dealing with unconventional structured data, such as graphs, for example, performing frequent substructure mining in graph seriation [47, 48] and classification methods [19, 49]. In the specific field of text mining and text categorization, GrC has been found very promising [50, 51]. Concerning Knowledge Discovery applied to text mining problems, the authors in [52] deal with concept formation and the identification of concept relationships through the construction of a network of granules. An automatic text categorization system is proposed in [53], which considers a document as an ordered sequence of words and is able to automatically mine frequent terms, considering as a term not only a single word, but also a sub-sequence of a few consecutive words (i.e., n-grams). The categorization system is tailored to process sequences of atomic elements (i.e., encoded words) by means of an embedding procedure based on clustering and adopting the symbolic histograms technique.

Many authors have adopted the BoC terminology referring to techniques for dealing with more general representations of words or sentences than the BoW model. In [37] the authors adopt a particular technique within the BoC paradigm called Random Indexing, training an SVM with good results. Random Indexing is also used in [54], along with the Holographic Reduced Representation, previously proposed in cognitive models, which can encode relations between words. In [55] the authors propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia inter-language links to convert concept vectors from the Spanish to the English language space. They synthesize a classifier of text documents, represented as vectors in spaces of Wikipedia concepts, and provide an analysis of its suitability for the classification of Spanish biomedical documents when only English documents are available for training. An approach called Mined Semantic Analysis is proposed in [56]. The study tries to address and mitigate problems arising in concept space models, such as the limitation to direct associations between words and concepts, which affects the ability of models to transfer the association relation to other implicit concepts contributing to the meaning of these words. This particular BoC paradigm is able to build concepts through concept-rich encyclopedic corpora, even exploring the “see also” link graph in Wikipedia. A different declination of the BoC technique is provided in an interesting investigation [57] in line with the current research work, where the authors create concepts by clustering word vectors generated from word2vec and use the frequencies of the clusters’ representatives to compute document embedding vectors. They propose a suitable weighting scheme, such as the concept frequency-inverse document frequency. Through these data-driven concepts, the method allows semantically similar words to be preserved effectively in a suitable document proximity measure. A related BoC approach is proposed in [58] for solving an emotion estimation task from text excerpts, characterized even by youth slang, an ambiguous and difficult task when using existing dictionaries, such as thesauri. In an interesting work [59] the authors try to overcome the lack of concept overlap in some text mining tasks, which results in a data sparsity problem, proposing an efficient vector aggregation method, grounded on a neural embedding model, able to generate fully continuous BoC representations.

Background: Prototypes and Conceptual Spaces

Humans are extremely efficient at learning new concepts. Cognitive Science is interested in modeling concept learning starting from the ability of humans to learn concepts from a few examples. On the other side, ML, along with the data-driven approach, uses its own models to learn from examples. The main approaches to modeling concept learning are the one known as “symbolic” and the one known as “associationist” [3]. The symbolic approach starts from the assumption that cognitive systems can be described as Turing machines; hence, cognition is a matter of symbol manipulation. Within the associationist paradigm, associations between different kinds of information elements carry the main burden of representation [60]. The Swedish cognitive scientist P. Gärdenfors maintains that connectionism — the ANN approach — is a special case of associationism [3]. However, the same author admits that there is no unique correct way of describing cognition: there are phenomena for which neither the symbolic nor the associationist representation appears to offer appropriate modeling tools. He proposes the “Conceptual Spaces”, a framework placed in the middle of the two main approaches, as the most appropriate for modeling concept learning and representation. The theory of conceptual spaces, due to its versatility and capability in dealing with high-dimensional spaces, has been extended, together with the 3-way formal analysis, to investigate phenomenal consciousness within a quantum framework [61]. By the way, the three approaches mentioned can be seen as three levels of representation of cognition with different scales of resolution or “granulation”. Conceptual Spaces are able to geometrize thought, because world objects are embedded in a geometric space where the notions of distance, region and prototype can be used to model concepts [62]. Actually, the embedding of real-world objects, through a series of suitable measures on them, is a standard procedure in automated PR systems. Measurable properties are called “features” in automated PR and ML, while in Conceptual Space theory they are called “quality dimensions”. However, neither with the symbolic approach (for example, first-order logic) nor with the associationist/connectionist approach is it easy to deal with similarities [3]. While the associationist approach suffers from the black-box problem — think of ANNs — the symbolic approach seems not to work at the appropriate abstraction level, for example, lacking creative induction and new knowledge creation and being basically unable to perform conceptual discoveries. Moreover, the symbolic approach lacks an automatic management of semantics and meaning. On the contrary, in Conceptual Spaces induction can be derived “naturally” from the metric properties of the underlying algebraic space, allowing what is known in automated PR and ML as “generalization capability”, that is, the capability of generalizing predictions to unseen data. By the way, P. Gärdenfors asserts that the symbolic level is not completely non-significant and that it depends strictly on the underlying conceptual level [3].

An important distinction, useful in the context of the current work and due to Palmer [63], is that between intrinsic and extrinsic representation. The former holds when the representing relation has the same inherent constraints as the represented relation. For example, in the isomorphism between the dimension “age” and the “height” of a bar in a chart, the structure of the represented relation (age) is intrinsic in the representing relation (height). In contrast, extrinsic representations must be accompanied by a rule that specifies how the representation is to be interpreted; such a rule provides the “meaning” of the representation. On the symbolic level, atomic concepts are not modeled, just named by the basic symbols. Even if complex concepts can be constructed through compositions of logical or syntactical rules, they remain extrinsically represented. In DS the BoW model intrinsically considers words as named entities, that is, as symbols with no further relational structure, and the frequency count used for representing documents is a symbol count. This leads to the synonymy problem.

Within the Conceptual Space theory the geometric characteristics of the quality dimensions are utilized to introduce a spatial structure for properties:

Criterion P: A Natural Property is a Convex Region in Some Domain

A subset C of a conceptual space S is said to be convex if, for all points x and y in C, all points between x and y are also in C [3]. It is worth noting that Criterion P assumes that a notion of “betweenness” among objects is provided when each concept is represented as a point in a given space [3]. Convexity holds, for example, for color naming and the three-dimensional representation of the color space. It is worth noting that properties defined by Criterion P are a special case of concepts.

Studying the phenomenology of colors and their perceptual representation in Cognitive Psychology, E. Rosch and collaborators defined the Prototype Theory, providing us with a model of categorization [39, 40]. The main idea in this theory is that within a category of objects, like those instantiating a property or a concept, certain members are judged to be more representative of the category than others. The prototype representation of a category is generally taken to be a generalization or abstraction of a class of instances falling into the same category [64]. In cognitive linguistics a prototype is a typical instance of a category, and other elements are assimilated to the category on the basis of their perceived similarity to the prototype [65].

The appealing feature of Conceptual Spaces lies in the underlying algebraic structure, which can be metric. This means that all or some of the properties of metric spaces are fulfilled [66]. A natural partition of such spaces is the Voronoi tessellation, a particular tessellation of the space based on a simple rule. If \(p_1,p_2,...,p_n\) are prototypes of a space S, the Euclidean distance \(d_E(p,p_i)\) between a point p and the prototypes \(p_i\) can be defined. If we now state that p belongs to the same category as the closest prototype \(p_i\), it can be shown that this rule generates a partitioning of the space, the so-called Voronoi tessellation [67]. Not every distance metric (e.g., the Manhattan distance or, in general, the Minkowski distance for some values of its parameter) generates a set of regions that fulfill the convexity property; for the Euclidean distance, however, this property holds. Among the many methods used to compute Voronoi cells [67], the k-means clustering algorithm can help, in an unsupervised fashion, to compute centroidal Voronoi regions, where the centroidal points are the centroids of the regions [68]. Hence, centroids are isomorphic to prototypes of some Conceptual Space. Thereby, depending on the nature of the space S (i.e., the nature of its dimensions), the Conceptual Space becomes a semantic space (here the term semantic is used in a weak interpretation). In this way, the Voronoi tessellation provides a constructive geometric answer to how a similarity measure, together with a set of prototypes, determines a set of categories [3]. Conceptual Spaces have also been adopted in trying to pragmatically untie the knot of semantics, intended as the relationship between an expression and an extralinguistic reality, within the riverbed of cognitive semantics. The latter assumes that the referents of words are identified with conceptual structures in people’s minds. However, semantics is a huge field of study where numerous disciplines converge, such as Semiology, Semiotics, Linguistics, Psychology, Pragmatics, Communication, and Philosophy of Language. In Linguistics and, specifically, in Computational Linguistics, the meaning, and in general the semantic content of a word or expression, assumes a specific way of being related to a context, which is empirical and measurable. For example, it is common to refer to the space generated by the BoW model as a semantic space, specifically, a mathematical space grounded on DS.
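A minimal sketch of the nearest-prototype rule and of centroidal Voronoi regions computed via k-means is given below (scikit-learn assumed; the two-dimensional toy points are purely illustrative stand-ins for quality dimensions):

```python
# Sketch: centroidal Voronoi tessellation of a toy conceptual space via k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))      # toy "quality dimensions" (2-D for clarity)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(points)
prototypes = kmeans.cluster_centers_    # centroids play the role of prototypes

def category(p):
    """Assign a point to the category of its closest prototype (Euclidean rule)."""
    return int(np.argmin(np.linalg.norm(prototypes - p, axis=1)))

# With the Euclidean distance each resulting region is convex (Criterion P).
print(category(points[0]), kmeans.labels_[0])   # the two assignments normally coincide
```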

Methods

The approach presented in detail hereinafter is an attempt at systematizing the theory of Conceptual Spaces with a specific declination of the BoC paradigm built upon the background of the GrC approach. The overall processing pipeline is composed of several steps where the information extracted from the text is granulated, and the information granules are adopted, in turn, for constructing a new embedding space grounded on the symbolic histogram technique. The main objective is to find an economic representation of documents as BoC for classification purposes, hence for text categorization. Following the scheme proposed in Fig. 1, given a corpus of documents, the first step is to perform the embedding of words in an algebraic space, called in the following the Conceptual Semantic Space (CSS). The embedding of words can be performed through the various methodologies outlined in “Introduction”. In this work the LSA and the neural word embedding through the word2vec algorithm are adopted.

Fig. 1

Information processing scheme

The word embedding step is grounded on the co-occurrences (collocates) of words obtained through a context window of suitable length. In the case of LSA, the word vectors within the reduced latent space are obtained on top of a BoW model with TF-IDF weighting, where contexts are documents. Hence, this layer fits with the “associative layer” [3]. The word vectors generate a vector semantic space endowed with the standard Euclidean norm; thus, a dissimilarity measure based on the Euclidean distance is defined [69]. In the case of neural embedding through the word2vec algorithm, word vectors are directly obtained by the training procedure, ready to be further processed. Instead of using the word vectors directly, a Voronoi tessellation is computed, where each region coincides with a concept whose instances are linked by semantic relations. The Voronoi tessellation is obtained by computing the representatives — the prototypes — through a clustering algorithm. The k-means algorithm is used in the following, but in principle other clustering algorithms can be adopted. This step embodies the “conceptual layer”, that is, the layer interposed between the “associative” and the “symbolic” one. Figure 2 depicts an example of CSS obtained for a corpus of scientific paper abstracts (“Abstracts” data set hereinafter), with four classes (“Anatomy”, “Information Theory”, “String Theory”, “Semiconductors”), of which a detailed description will be given in the experiments section. The CSS in Fig. 2 is synthesized by a Voronoi tessellation in \(k=8\) regions, where the prototypes are highlighted by crosses. Dots represent words embedded (initially in a 100-dimensional space) through the word2vec algorithm. Principal Component scores are computed for dimensionality reduction with the aim of data visualization.

Fig. 2

Centroidal Voronoi regions of the Conceptual Semantic Space for the Abstracts data set obtained through k-means. Dots are words computed with the word2vec algorithm (word embedding) and projected in a bi-dimensional space through PCA. Crosses are the prototypes of each region. In this explanatory example the number of conceptual regions is \(k=8\)

Accordingly, the conceptual semantic layer is the ground for a symbolic representation of documents: each word is abstracted by its concept, computed by measuring the semantic similarity between the vector representation of the word and the prototypes of the underlying CSS. Thus, documents are represented as discrete probability distributions over concepts.

Fig. 3

Concept cloud for each one of the \(k=8\) conceptual regions for the Abstracts data set. The thickness of each word is proportional to the similarity (Euclidean distance) to the prototype computed as the centroid of the conceptual region

Prototypes are intended, therefore, as symbols of a suitable alphabet \(\mathcal {A}\) of concepts used for the symbolic representation.

Let \(\mathcal {H}=\left\{ D_1,D_2,...,D_L\right\}\) be a corpus with L documents, where each document \(D=\left\{ w_1,w_2,...,w_{|D|}\right\} \in \mathcal {H}\) is a collection of words \(w_i, i=1,2,...,|D|\), drawn from a vocabulary \(\mathcal {V}\). The prototype \(c_j \in \mathcal {A}, j=1,2,...,k\), abstracting a concept of a region \(\mathcal {R}_j, j=1,2,...,k\) of the Conceptual Space \(\mathcal {P}\), defines what we can call a symbol of a suitable alphabet \(\mathcal {A}\). It is worth noting that the parameter k defines the level of granulation of the CSS. Each document can be suitably represented by some statistics on the alphabet symbols \(c_i \in \mathcal {A}\): if each prototypical region of the partition obtained from the word embedding vectors is a “concept”, the document is represented as a “bag of concepts”. In the limit where the number of prototypes (i.e., the cardinality of the alphabet \(\mathcal {A}\)) equals the number of words in the corpus of documents \(\mathcal {H}\), the standard BoW model is recovered. It is worth noting that the symbol \(c_i\) is obtained by a suitable mapping \(\mathcal {M}\) from the underlying word vectors \(\mathbf {w}_i \in \mathcal {W}\), obtained through the word embedding, to the concepts in \(\mathcal {A}\), that is \(\mathcal {M} : \mathcal {W} \rightarrow \mathcal {A}\), where \(\mathcal {M}(\mathbf {w}_i)=c_j, j=1,2,...,k\), for the i-th word within a document.

Figure 3 depicts the word clouds for a CSS \(\mathcal {P}\) partitioned in \(k=8\) semantic regions. The thickness of each word is proportional to the similarity (based on the Euclidean distance) to the prototype computed as the centroid of the conceptual region.

Fig. 4

Symbolic histogram for four documents pertaining the classes “Anatomy”, “Information Theory”, “String Theory”, “Semiconductors” of the Abstracts data set

Moreover, in Fig. 4 the symbolic histograms of four documents pertaining to the classes “Anatomy”, “Information Theory”, “String Theory”, and “Semiconductors” of the Abstracts data set are reported. The length of a bar represents the number of occurrences of each of the (ten) symbols (prototypes) for a given class.

The symbolic histogram representation naturally allows embedding documents in a vector space, paving the way for classification or regression ML algorithms. However, other representations are possible. For instance, it is possible to build a centroidal prototype for each document by simply averaging the prototype vectors associated with its words. In other words, instead of having a representation derived from a count histogram, we have the average value of the prototype vectors associated with each word in a document. This alternative will be introduced more formally below. Another, quite different, representation of documents adopting prototypes consists in conceiving a document as a sequence of words, hence as the sequence of the prototypes associated with those words. Namely, a document is represented by a sequence of concepts, where concepts are semantic abstractions of words. Words, in this setting, are a fine-grained representation, while concepts pertain to a coarser one. This new representation paves the way for sequence-based ML algorithms, such as the deep learning-based LSTM. Interestingly, this sequence-based representation allows framing a document as a random walk of concepts, instead of a random walk of words.
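A minimal sketch of the sequence-of-concepts representation just described is reported below; the names `embed` (the word-to-vector mapping) and `prototypes` (the k prototype vectors) are assumptions standing for the outputs of the embedding and clustering steps:

```python
# Sketch: a document as an ordered sequence of concept indices.
import numpy as np

def concept_sequence(doc_words, embed, prototypes):
    """Replace each word by the index of its nearest concept prototype, preserving order."""
    return [int(np.argmin(np.linalg.norm(prototypes - embed(w), axis=1)))
            for w in doc_words]

# The resulting "random walk of concepts" can be fed directly to a sequence model (LSTM).
```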

Classification Problem Framing with Symbolic Histograms

A general classification problem instance is defined as a triple of disjoint sets, namely the training set (\(\mathcal {S}_{tr}\)), the validation set (\(\mathcal {S}_{vs}\)), and the test set (\(\mathcal {S}_{ts}\)). Given a specific parameter setting, a classification model is built on \(\mathcal {S}_{tr}\) and it is validated at training stage on \(\mathcal {S}_{vs}\). The generalization capability of the optimized model (the one synthesized by the whole training procedure) is finally measured on \(\mathcal {S}_{ts}\). Hence, given a corpus \(\mathcal {H}=\left\{ D_1,D_2,...,D_L\right\}\) composed of L documents D, we have

$$\begin{aligned} \mathcal {H}=\left\{ \mathcal {S}_{tr} \cup \mathcal {S}_{vs} \cup \mathcal {S}_{ts}| \mathcal {S}_{tr} \cap \mathcal {S}_{vs}=\emptyset , \mathcal {S}_{tr} \cap \mathcal {S}_{ts}=\emptyset , \mathcal {S}_{vs} \cap \mathcal {S}_{ts}=\emptyset \right\} . \end{aligned}$$
(1)

The CSS \(\mathcal {P}\) is conceived as a hard partition of order k, as a collection of k disjoint and non-empty clusters, \(\mathcal {P}=\{\mathcal {C}_{1}, \mathcal {C}_2, ..., \mathcal {C}_k\}\). In this study the partition is obtained through the well-known k-means algorithm [70, 71]. Each cluster \(\mathcal {C}_i\in \mathcal {P}\) is synthetically described by a representative or prototype element, which we denote as \(\mathbf {c}_i=R(\mathcal {C}_i)\); let \(R(\mathcal {P})=\{\mathbf {c}_1, \mathbf {c}_2, ..., \mathbf {c}_k\}\) be the set of representatives of the partition \(\mathcal {P}\).

The definition of a cluster representative is straightforward for vector feature spaces equipped with an algebraic structure, where it can be simply computed as the average vector of a set of real-valued vectors, i.e., \(\mathbf {c}_i=\frac{\sum _{\mathbf {w} \in \mathcal {C}_i }\mathbf {w}}{\left| \mathcal {C}_i\right| }\).

Alternatively, the representative of \(\mathcal {C}_i\) can be computed as the element \(\mathbf {c}_i\) that minimizes the sum of distances (MinSOD) [72]:

$$\begin{aligned} \mathbf {c}_i = \mathop{\text{arg min}}\limits_{\mathbf {w}_j\in \mathcal {C}_i} \sum _{\mathbf {w}_k\in \mathcal {C}_i} d(\mathbf {w}_j, \mathbf {w}_k). \end{aligned}$$
(2)

In this case the representative is an object of the cluster, that is \(\mathbf {w}_j\in \mathcal {C}_i\). Note that computing the MinSOD does not require an algebraic structure, demanding just the definition of a dissimilarity measure. From this point of view, the MinSOD representative is much more general, and can be applied in any data domain.
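A minimal sketch of the MinSOD rule of Eq. (2) follows; note that only a dissimilarity function is required, so the same code works for non-vector data as well (the toy cluster below is illustrative):

```python
# MinSOD sketch (Eq. 2): the representative is the cluster element minimizing the
# sum of dissimilarities to all other elements; only a dissimilarity is required.
import numpy as np

def minsod(cluster, dissimilarity):
    sods = [sum(dissimilarity(x, y) for y in cluster) for x in cluster]
    return cluster[int(np.argmin(sods))]

# Usage with Euclidean vectors (any data type with a dissimilarity would work).
cluster = [np.array([0.0, 0.0]), np.array([1.0, 0.2]), np.array([0.9, 0.1])]
rep = minsod(cluster, lambda a, b: np.linalg.norm(a - b))
```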

Finally, each prototype \(\mathbf {c}_j\) identifies a centroidal Voronoi region \(\mathcal {R}_j\) through the Euclidean distance \(d_j=\left\| \mathbf {w}_i-\mathbf {c}_j\right\| _2\), with \(\mathbf {w}_i \in \mathcal {W}\), where \(\mathcal {W}\) is the set of word vocabulary vectors.

Embedding Words

As concerns the word embedding procedure, a comparison will be offered between the word embedding obtained through LSA, by means of the SVD decomposition, and the neural embedding, by means of the word2vec algorithm — see “Introduction”. In particular, the two techniques provide new ways to represent each word \(w \in \mathcal {V}\) through suitable vectors \(\mathbf {w} \in \mathcal {W}\), by considering a mapping \(\Phi : \mathcal {V} \rightarrow \mathcal {W}\) from the set of vocabulary words to word vectors, i.e., \(\Phi (w)=\mathbf {w}\).

The Symbolic Histogram Construction

In order to abstract a concept, a prototype must be associated with each word of a given document. Hence, given a document \(D=\left\{ w_1,w_2,...,w_{|D|}\right\} \in \mathcal {H}\) as a collection of words \(w_i\), and its vector representation \(\Phi (w_i)=\mathbf {w}_i\), first the nearest cluster prototype \(\mathbf {c}^*\in R(\mathcal {P})\) is identified according to the following expression:

$$\begin{aligned} c(\mathbf {w})=\mathbf {c}^{*}_{\mathbf {w}} = \mathop {\arg \min }\limits _{{\mathbf {c}_j} \in R(P)} d( \mathbf {w}, {\mathbf {c}_j}). \end{aligned}$$
(3)

The construction of the symbolic histogram is performed as follows. An array \(\mathbf {I}_{w_i}=[\delta _1,\delta _2,...,\delta _k]^T\) of indicator functions is constructed, where:

$$\begin{aligned} \delta _{j}= \begin{cases} 1 &{} \text {if } c(\mathbf {w}_i) = \mathbf {c}_j,\\ 0 &{} \text {otherwise}, \end{cases} \qquad i=1,2,\ldots ,|D|. \end{aligned}$$
(4)

Finally, the symbolic histogram for a document D is provided by:

$$\begin{aligned} \mathbf {h}^{D}=\sum _{i=1}^{|D|}\mathbf {I}_{w_i}. \end{aligned}$$
(5)

Alternatively, instead of constructing the symbolic histogram as an array of counters, it is possible to represent the document \(D \in \mathcal {H}\) as the average of the prototypes \(c(\mathbf {w}_i)\) associated with its words \(w_i \in D\), that is:

$$\begin{aligned} \mathbf {h}^{D}_{avg}=\frac{1}{|D|}\sum _{i=1}^{|D|}c(\mathbf {w}_i). \end{aligned}$$
(6)

At this point, each document in the corpus has an associated symbolic histogram, hence a vector of integer or real-valued numbers, depending on the specific rule adopted. In other words, documents are embedded in a bag-of-concepts vector space.
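A minimal sketch of the whole construction of Eqs. (3)-(6) is given below; `embed` and `prototypes` are again assumed to be the outputs of the word-embedding and clustering steps described above:

```python
# Symbolic histogram sketch (Eqs. 3-6).
import numpy as np

def symbolic_histogram(doc_words, embed, prototypes):
    k = prototypes.shape[0]
    hist = np.zeros(k)                           # one counter per alphabet symbol
    for w in doc_words:
        j = np.argmin(np.linalg.norm(prototypes - embed(w), axis=1))   # Eq. (3)
        hist[j] += 1                             # Eqs. (4)-(5)
    return hist

def averaged_representation(doc_words, embed, prototypes):
    idx = [np.argmin(np.linalg.norm(prototypes - embed(w), axis=1)) for w in doc_words]
    return prototypes[idx].mean(axis=0)          # Eq. (6): average of associated prototypes
```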

Classification Layer

Once the new representation is obtained, that is, the new vector space (through the symbolic histograms) or the new sequences of concepts, a learning layer can be designed depending on the problem at hand. In this work a classification problem is faced, comparing three different classification algorithms, namely an SVM with Gaussian kernel [73, 74], a Bagged Tree RF [75, 76] and an LSTM [77,78,79]. The first two learning algorithms are suited for working with real-valued patterns, while the LSTM is conceived for learning with a representation grounded on sequences of objects. Specifically, in the current approach the LSTM is fed with the sequences of prototype vectors \(c(\mathbf {w}_i)\) obtained through Eq. 3, corresponding to the sequence of words \(w_i\) pertaining to a given document D. These classification algorithms belong to three large and heterogeneous families of learning algorithms, namely kernel-based, where learning is conceived as a convex optimization problem (SVM), random tree-based (RF), and deep learning-based, specifically RNNs. Hence, this choice guarantees the diversity of the learning paradigms applied to the proposed method. It is worth noting that RF algorithms are based on the bootstrap technique (some samples are used multiple times) and the observations that are left out of the bootstrap sample are called out-of-bag (OOB). This technique allows estimating the importance of variables (features) through a suitable procedure described, for example, in [80].

Experiments

Data Sets

As concerns the text data for the experiments, the “Reuters-21578” data set and the “Abstracts” data set have been used. Reuters-21578 is a benchmark data set for document classification consisting of 8 classes. The collection of documents appeared on the Reuters newswire in 1987; the documents were assembled and indexed with categories by personnel from Reuters Ltd. [81]. The adopted splitting is the “ModApte” split [82] on 7674 documents and 8 classes. The “Abstracts” data set is a collection of 575 abstracts of scientific papers belonging to 5 classes (‘Anatomy’, ‘Information theory’, ‘Smart Grid’, ‘String Theory’, ‘Semiconductors’), collected by the authors. Some statistics on the experimented data sets are reported in Tables 1 and 2. The former provides some general information about each data set, while the latter reports some statistics per class, such as the mean and standard deviation of the document length per class.

Table 1 Data set statistics (the standard deviation is reported in brackets)
Table 2 Data set statistics per class (the standard deviation is reported in brackets)

Specifically, Table 1 reports the total number of documents (\(\#\) docs), the dimension of the vocabulary before pre-processing (\(\left| \mathcal {V}\right|\)), the dimension of the vocabulary after pre-processing (\(\left| \mathcal {V}\right| _{pre}\)), the number of classes (\(\#\) class) and the average length of documents in terms of tokens (words) (\(\left| \bar{D}\right|\)), with the standard deviation in brackets. Table 2 reports the class names (class), the average length of documents in terms of tokens (words) (\(\left| \bar{D}\right|\)), with the standard deviation in brackets, and the number of documents per class (\(\#\) docs). In Fig. 5 the statistics on document lengths per class and for each data set are reported, while in Fig. 6 the histograms of the document lengths for both data sets are depicted. This information will be useful for setting the sequence length parameter in the experiments with the LSTM algorithm.

Fig. 5

Document length per class

Fig. 6

Document length histogram

Fig. 7

Class distribution for the experimented data sets

In Fig. 7 the class distributions of both data sets are reported. We note that the “Reuters-21578” data set has a heavily skewed class distribution, leading to a strongly unbalanced data set and making the classification task challenging, while the “Abstracts” data set classes are equally distributed.

Experimental Settings

As concerns the performance measures of the classifiers, several metrics for the multi-class case are adopted.

Specifically, considering the i-th class (\(i=1,2,...|C|\)) of the available data sets it is possible to define:

  • \(TP_i\) (true positive): number of patterns belonging to the i-th class and correctly classified by the system;

  • \(FN_i\) (false negative): number of patterns belonging to the i-th class that are incorrectly assigned by the system to a class other than the i-th;

  • \(FP_i\) (false positive): number of patterns not belonging to the i-th class that are incorrectly assigned to the i-th class by the system;

  • \(TN_i\) (true negative): number of patterns not belonging to the i-th class and correctly not assigned to the i-th class by the system.

Starting from these counts, a set of derived indicators for each class can be computed, such as \(Accuracy_i\), \(Precision_i\) and \(Recall_i\), together with other global figures of merit, such as the Informedness and Cohen’s kappa. Besides these metrics, the global classification performance can be assessed in two ways: (i) macro-averaging, that is, averaging the same measure calculated for each class; (ii) micro-averaging, that is, summing the counts to obtain cumulative TP, FN, TN, FP and then calculating the performance measure. Macro-averaging treats all classes equally, while micro-averaging favors classes characterized by a relatively higher number of patterns [83].

The final metrics adopted in the current study are (the higher, the better):

  • the average Accuracy (Acc.) in [0,1], that is the average per-class effectiveness of a classifier;

  • the Precision (P) in [0,1], that is the fraction of relevant instances among the retrieved instances by the classifier;

  • the Recall (R) in [0,1], that is the fraction of the total amount of relevant instances that were actually retrieved by the classifier;

  • the Informedness (Inf.) in [0,1] — known as the J-index — is the maximum distance between the bisector diagonal line of the Receiver Operating Characteristic (ROC) diagram [84] and the estimated ROC curve. It indicates the probability of an informed decision compared to chance;

  • the Cohen’s kappa (kappa) in [0,1] that, considering the classification task as a rating process, measures inter-rater reliability (sometimes called inter-observer agreement) [85];

  • the macro F1 score (Fmacro) in [0,1] that is the unweighted mean of the F1 scores calculated per class, where \(F1= \frac{2TP}{2TP+FP+FN}\) [83];

  • the micro F1 score (Fmicro) in [0,1], computed with the same expression as Fmacro, but using the total number of TP, FP and FN instead of computing these scores for each class [83] (a minimal computation sketch is given after this list).
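The sketch below, assuming scikit-learn, computes most of the figures of merit listed above from ground-truth and predicted labels; as a simplification it uses the plain overall accuracy in place of the average per-class Accuracy of the text and omits the Informedness:

```python
# Sketch of the adopted figures of merit (scikit-learn assumed); y_true and y_pred are
# the ground-truth and predicted class labels on the test set.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

def report(y_true, y_pred):
    return {
        "Acc.":   accuracy_score(y_true, y_pred),
        "P":      precision_score(y_true, y_pred, average="macro"),
        "R":      recall_score(y_true, y_pred, average="macro"),
        "kappa":  cohen_kappa_score(y_true, y_pred),
        "Fmacro": f1_score(y_true, y_pred, average="macro"),
        "Fmicro": f1_score(y_true, y_pred, average="micro"),
    }
```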

The experimental settings are organized as follows.

In order to assess the proposed approach, two main sets of experiments are provided. The first set aims at comparing the three learning algorithms (namely, LSTM, C-SVM, RF) adopting both the word2vec and the LSA embedding, for both the “Abstracts” and the “Reuters-21578” data sets. The embedding is computed on the given corpus. In this case the cardinality of the alphabet \(\mathcal {A}\), that is, the number of clusters k or the number of concept regions, is left to vary in the integer range [2,1002] (see Fig. 8), while a snapshot of the performance is provided, for \(k=502\), in Table 3 for the “Reuters-21578” data set and, for \(k=202\), in Table 4 for the “Abstracts” data set. The specific choice of the granularity level k simulates an arbitrary setting in which we have no information about the variability of the performance as a function of the granularity level. In other words, this setting simulates the case in which the performance cannot be computed for an increasing set of granularity levels due to, for example, computational and time constraints. The best granularity level, instead, is considered in the second set of experiments. In fact, the second set of experiments — reported in Tables 5 and 6 — allows evaluating and comparing the proposed methodology with a baseline approach. Specifically, for the mentioned learning algorithms the best level of granulation k in terms of performance is compared with a classical approach where the feature vectors representing documents are obtained either from the TF-IDF representation or from the LSA representation. In other words, the features for the classification task are the TF-IDF weighted word counts or the weights related to the latent variables, respectively. In the end, taking advantage of the implicit feature weighting offered by the RF algorithm, a knowledge discovery task is performed. Hence, a threshold filtering is adopted in order to select and show the most important concepts within the concept regions that, in turn, allowed reaching a good classification performance. Before discussing the main results, it is worth dealing with the text pre-processing steps and the parameter setting of the adopted learning algorithms.

Regarding the pre-processing steps, words in documents are lowercased and stop words are eliminated using a stop words list.
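A minimal sketch of these two pre-processing steps follows; the stop-word list shown here is a tiny illustrative placeholder, not the one used in the experiments:

```python
# Pre-processing sketch: lowercasing and stop-word removal.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

preprocess("The heart pumps blood to the body")   # -> ['heart', 'pumps', 'blood', 'body']
```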

As concerns the LSTM algorithm, it is preceded by a word embedding layer (through the word2vec algorithm) — namely a concept embedding layer — with a dimension of the word vectors equal to 100. The LSTM layer has 180 cells, followed by a fully connected layer and a softmax layer. The classification layer computes the cross-entropy loss for multi-class classification problems with mutually exclusive classes. The maximum number of epochs is set to 50, while the initial learning rate is set to 0.005. In the case of SVM, a C-SVM with multiple kernels is used. A set of hyper-parameters is optimized with the Bayesian optimization technique and 5-fold cross-validation. Specifically, the hyper-parameters optimized are the multi-class coding (One-versus-All and One-versus-One), the Box Constraint, the kernel scale, the type of kernel function (Gaussian, linear, polynomial), the polynomial order and the binary variable indicating whether to standardize the data or not. For the ensemble learning, an ensemble of boosted classification trees is experimented with, hence with trees as weak learners. Also in this case, a Bayesian hyper-parameter optimization has been chosen and 5-fold cross-validation is performed; in particular, the training algorithm (Bag, Subspace, AdaBoostM1, AdaBoostM2, GentleBoost, LogitBoost, LPBoost, RobustBoost, RUSBoost, TotalBoost) and the number of learning cycles are optimized [80, 86,87,88]. The OOB performance is measured for establishing the predictor importance.
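A Keras-style sketch of the described sequence classifier is given below as an approximation only: it mirrors the quantities stated in the text (100-dimensional concept embedding, 180 LSTM cells, fully connected plus softmax output, cross-entropy loss, learning rate 0.005, 50 epochs), while the alphabet cardinality, number of classes and sequence length are illustrative assumptions and the original experiments may use a different framework and details.

```python
# Hedged sketch of the sequence classifier fed with concept sequences (TensorFlow/Keras).
import tensorflow as tf

k = 502            # alphabet cardinality (number of concepts), assumed here
num_classes = 8    # number of document classes, assumed here
# Concept sequences are assumed to be padded/truncated to a fixed length beforehand.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=k, output_dim=100),  # concept embedding layer
    tf.keras.layers.LSTM(180),                               # 180 LSTM cells
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
    loss="sparse_categorical_crossentropy",                  # cross-entropy for exclusive classes
    metrics=["accuracy"],
)
# model.fit(train_sequences, train_labels, epochs=50,
#           validation_data=(val_sequences, val_labels))
```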

Where not specified, for robustness purposes, the optimization of the hyper-parameters or the simple learning routines (for example, in LSTM) are repeated three times and the performance results are averaged. The data set splitting into training set \(\mathcal {S}_{tr}\) and test set \(\mathcal {S}_{ts}\) is 80% and 20%, respectively, both for C-SVM and RF. In the case of LSTM the data set is split into training set \(\mathcal {S}_{tr}\), validation set \(\mathcal {S}_{vs}\) and test set \(\mathcal {S}_{ts}\) with the following percentages: 50%, 25%, 25%, respectively.

Results and Granularity Assessment

Fixing the concept granularity value to \(k=202\) — see Table 4 — where the three classification algorithms are compared with both the word2vec embedding and the LSA embedding in building the concept space over the “Abstracts” data set (which is balanced), the best classification performance in terms of Accuracy (0.94) is obtained by C-SVM with word2vec. The second best performances are obtained by RF and C-SVM (Accuracy 0.93) with the LSA embedding. LSTM reaches lower performances on both embeddings, with an Accuracy of 0.85 for the word2vec and 0.87 for the LSA embedding. Interestingly, if we compute the figures of merit as the granularity of the concept space increases (varying k), by inspection of Fig. 8, it is found that already with a very low number of concepts all classifiers achieve high performances that, in turn, stabilize up to the end of the experimented range (\(k=1002\)). A similar behavior can be found analyzing the “Reuters-21578” data set. Here the granularity level is fixed to \(k=502\). Also in this case the performance in terms of classification capability increases quickly as k rises (graphs not shown for the sake of brevity). However, in this case, examining the results reported in Table 3, the highest Accuracy value is attained by C-SVM (0.95 with the LSA embedding), but the LSTM also obtains a good Accuracy (0.94 with the word2vec embedding).

If we consider the classification task as a rating process, Cohen’s kappa coefficient is low for LSTM and RF, for both embeddings, but reaches the highest value for C-SVM with word2vec embedding. In terms of Fmicro and Fmacro C-SVM obtains its best results with the word2vec embedding. The best informed decision, taking into account the unbalance of the “Reuters-21578” data set is achieved by C-SVM with Informedness of 0.86 (LSA embedding). In general, as an expected behavior, we have a moderate variability of the classifiers’ performances for both embeddings and data sets. The low Accuracy attained by LSTM for the “Abstracts” data set is likely to be addressed to the low granularity level and the short dimension of the data set, in terms of the number of documents and documents length. However, if we look at results for the best granularity level reported in Table 6 (“Abstracts” dataset), where the three classifiers are compared for both embedding types and with the TF-IDF features, LSTM obtains better performances with Accuracy 0.90 (word2vec embedding for best \(k=222\)) and 0.92 (LSA embedding for best \(k=162\)), outperforming the plain case (Accuracy 0.70), where sequences are directly generated without conceptualizing the corpus. In this particular setting, C-SVM with LSA embedding, for the best \(k=382\), outperforms both LSTM and RF. For C-SVM the LSA embedding adopted for constructing the conceptual space is found better than the TF-IDF case (Accuracy 0.98 and 0.96, respectively). It is worth to note that for both C-SVM and RF the results for the plain case (TF-IDF feature space) are good (RF attains an Accuracy of 0.97 for \(k=342\) and 0.95 with TF-IDF) in spite of the high dimensionality of the features space. This confirms the capability of both classifiers to work well in high-dimensional spaces. Both algorithms achieve high Accuracy and high Informedness for a similar granularity level above \(k=300\) in the LSA embedding case, while LSTM obtain even good performances (Accuracy 0.92) with the same embedding with a very low granularity (\(k=162\)). Furthermore, considering the “Reuters-21578” data set on the same experimental setting, we have C-SVM that outperforms the other classifiers in terms of Informedness and Fscores in TF-IDF case, with a not so great separation from the LSA embedding with a granularity level of \(k=722\). For this data set even LSTM achieves good results in terms of Accuracy and Informedness, specially for the word2vec embedding (Accuracy 0.95, Informedness 0.82, Fmicro 0.95, Fmacro 0.83). Similar performances are obtained with RF that only in the word2vec embedding case outperforms the TF-IDF setting. In fact, for the LSA embedding, despite the high granularity level (\(k=962\)) the TF-IDF setting attains a high Accuracy (0.95 v.s. 0.93) and a higher Informedness (082 v.s. 0.89). A similar behavior can be found for the Fmicro and Fmacro and for the kappa coefficient. Comparing the two data sets, that are very different in their structure and contents, the granularity level needed for attaining good results is found proportional to the complexity of the data set itself. In fact, the “Abstracts” data set consists of a set of short documents and each class is well separated in term of contents, at least at a semantic point of view. The “Reuters-21578” data set possesses more classes that are strongly unevenly distributed (see Fig. 7). 
For both data sets, the conceptualization does not degrade the results; on the contrary, good performance is obtained with a lower granularity level. This implies a lower complexity of the feature space, whose dimension, instead of being equal to the cardinality of the vocabulary (as in the plain TF-IDF case), matches the number of concepts adopted for representing the documents.

In the current experiments, neither word2vec nor LSA clearly prevails in building the concept space. In fact, even though for both embeddings a smaller concept space already attains good performance, some classifiers perform best with LSA and others with the word2vec embedding. It is well known that both embeddings possess suitable semantic characteristics, and the overall performance depends on the entire processing chain and on the particular hyper-parameter settings.
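Purely as an illustration of the two embedding techniques (not the exact configuration used in these experiments), the sketch below derives word vectors with gensim’s word2vec implementation and, for LSA, with a truncated SVD of the transposed TF-IDF term-document matrix; the corpus, dimensions and hyper-parameters are toy placeholders.

```python
# Illustrative sketch (assumed configuration): word vectors from word2vec
# (gensim >= 4.0) and from LSA (truncated SVD of the TF-IDF matrix, where each
# term vector is a row of the reduced term-document matrix).
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the market reacted to the interest rate decision",
    "laser emission observed in the semiconductor cavity",
    "grain exports increased after the trade agreement",
]
tokenized = [doc.split() for doc in corpus]

# word2vec embedding
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=5, min_count=1, epochs=50)
w2v_vectors = {w: w2v.wv[w] for w in w2v.wv.index_to_key}

# LSA embedding: SVD on the transposed TF-IDF matrix gives one vector per term
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)            # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(X.T)      # terms x n_components
lsa_vectors = dict(zip(tfidf.get_feature_names_out(), term_vectors))
```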

In general, the granulation of the word embedding space can be seen as a dimensionality reduction paradigm, at the cost of inserting a new block in the downstream text processing pipeline before the classification task. However, this conceptualization block can be constructed once and for all, even adopting richer corpora (e.g., Wikipedia) different from the one employed for the specific classification task. Care must be taken with the granulation parameter, which significantly affects classifier performance depending on how well each classifier copes with high- or low-dimensional inputs.
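A minimal sketch of such a conceptualization block follows, assuming a dictionary word_vectors mapping tokens to embedding vectors (e.g., from the previous sketch) and tokenized documents: concepts are obtained as k-means prototypes, and each document is mapped to a normalized symbolic histogram of length k, replacing the vocabulary-sized TF-IDF vector.

```python
# Minimal sketch of the conceptualization block (assumed inputs: word_vectors,
# a dict token -> vector, and tokenized documents).
import numpy as np
from sklearn.cluster import KMeans

def build_concept_space(word_vectors, k, seed=0):
    """Cluster the word vectors into k concept regions (k <= vocabulary size)."""
    words = list(word_vectors.keys())
    matrix = np.vstack([word_vectors[w] for w in words])
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(matrix)
    return km, words

def symbolic_histogram(tokens, word_vectors, km, k):
    """Represent a tokenized document as a normalized histogram over the k concepts."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(k)
    labels = km.predict(np.vstack(vecs))
    hist = np.bincount(labels, minlength=k).astype(float)
    return hist / hist.sum()

# Usage (assumed data): km, words = build_concept_space(word_vectors, k=50)
#                       X = np.vstack([symbolic_histogram(d, word_vectors, km, 50) for d in docs])
```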

Towards the Explainable AI paradigm

The choice of the three particular classification algorithms depends on their specific characteristics. On the one hand, SVM is known to perform well even with high-dimensional feature spaces embedded in \(\mathbb {R}^n\); LSTM, on the other hand, is suited to sequences and needs an additional dense layer to be appropriate for classification tasks. RF, for its part, is suited to classification tasks where estimating the importance of features is also of interest. In fact, RF provides a set of weights, as many as the number of features, which in this study are a kind of superordinate word that we call concepts. Fixing an arbitrary threshold on these weights allows estimating the strongest concepts related to the specific classification task. From a different point of view, this procedure can be seen as a concept filtering task, where only the strongest concepts survive. This methodology can help to infer knowledge about the corpus, moving towards the Explainable AI paradigm.
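The concept-filtering step can be sketched as follows, under the assumption that X contains the symbolic histograms (documents by k concepts) and y the class labels; the threshold fraction and the number of trees are placeholders rather than the exact experimental settings.

```python
# Hedged sketch of concept filtering via Random Forest feature importances:
# importances are thresholded at a fraction of the largest value, and the
# surviving indices identify the strongest concept regions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def strongest_concepts(X, y, threshold_fraction=0.2, n_estimators=500, seed=0):
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    rf.fit(X, y)
    importances = rf.feature_importances_
    threshold = threshold_fraction * importances.max()
    surviving = np.where(importances >= threshold)[0]
    return surviving, importances

# Usage (assumed data): surviving_concepts, weights = strongest_concepts(X, y)
```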

In Fig. 9(a) and (b), the weights related to the concepts and the thresholds (fixed to 20% of the largest weight value) are depicted as bar charts for the “Reuters-21578” and the “Abstracts” data set, respectively. For these specific experiments, for the sake of brevity, the granularity of the concept space is fixed to \(k=50\) and the space is constructed through the word2vec neural embedding. It is worth noting that the CSS is obtained through the k-means clustering algorithm; hence the prototype, being the average word vector of a given concept region, is a surrogate word vector. Therefore, in order to find an existing word related to the prototype, the \(\ell _2\) distance is computed and the nearest word vector, with its corresponding token, is selected. Tables 7 and 8 report the surviving concepts together with the five nearest words, obtained by computing the \(\ell _2\) distance between the prototype and the respective word vectors and normalizing by the highest similarity value, which corresponds to the existing word vector closest to the prototype. For clusters with cardinality lower than five, all words within the cluster are shown. For the “Abstracts” data set, the closest prototypes, for the given threshold value, are holographic, concurrently, explicitly, explanation, succesfully, achieve, intrusiveness, leastsquares, pregnancy, furthermore, nonabelian, effectiveness, percutaneous, approximation, nanolasers, robustness, insulator, chalcogenide, rolling, infrared. From Table 8, selecting, for example, a populated region such as the fourth, and considering the word closest to its prototype (concurrently), we obtain electrification, resistant, diameter, platform, industrial. In general, words with high semantic content can be found, which roughly indicate the terms that the algorithm deems important for the classification task; besides such words, there are verbs (for example, achieve) and other lexemes typically used in papers’ abstracts. The same rationale lies behind the results for the “Reuters-21578” data set illustrated in Table 7. Here, fewer singleton or low-populated clusters are found, owing to the larger dimension of the corpus. Even in this case, a set of prototype words that uniformly span the conceptual space can be identified. Nevertheless, the richness of the semantic content of the prototypes and of the underlying word clouds is attributable to the dimension and the heterogeneity of the corpus, since a larger and more heterogeneous corpus is likely to yield better word representations, which in turn lead to a better-performing CSS.
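The mapping from surviving prototypes back to existing words can be sketched as below, reusing the (assumed) km, words, word_vectors and surviving_concepts objects from the previous sketches: for each surviving concept region, the \(\ell _2\) distance between the centroid and the word vectors of the cluster members is computed, and the five nearest tokens are returned.

```python
# Illustrative sketch: nearest existing words to a k-means prototype (the
# surrogate word vector of a concept region), restricted to cluster members.
# km, words, word_vectors and surviving_concepts are assumed from the previous sketches.
import numpy as np

def nearest_words(km, words, word_vectors, concept_idx, top_n=5):
    centroid = km.cluster_centers_[concept_idx]
    members = [w for w, lab in zip(words, km.labels_) if lab == concept_idx]
    if not members:
        return []
    dists = np.array([np.linalg.norm(word_vectors[w] - centroid) for w in members])
    order = np.argsort(dists)[:top_n]
    return [(members[i], float(dists[i])) for i in order]

# Usage (assumed data):
# for c in surviving_concepts:
#     print(c, nearest_words(km, words, word_vectors, c))
```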

Table 3 Classification performances for a given granularity level k for LSTM, C-SVM and RF (“Reuters-21578” data set)
Table 4 Classification performances for a given granularity level k for LSTM, C-SVM and RF (“Abstracts” data set)
Fig. 8 Classification performance varying the number of concept regions from 2 to 402 (“Abstracts” data set) for the LSTM, SVM and RF algorithms, for both word2vec and LSA techniques

Table 5 Performance comparison over the “Reuters-21578” data set between LSTM, C-SVM and RF for both word2vec and LSA embeddings. Results for the best granulation level k are compared with the plain solution, given by a sequence of the word vectors of the words in the given text for LSTM, and by the TF-IDF weighting scheme for the other classifiers
Table 6 Performance comparison over the “Abstracts” data set between LSTM, C-SVM and RF for both word2vec and LSA embeddings. Results for the best granulation level k are compared with the plain solution, given by a sequence of the word vectors of the words in the given text for LSTM, and by the TF-IDF weighting scheme for the other classifiers
Fig. 9 Concept importance for the “Reuters-21578” (a) and “Abstracts” (b) data sets. The threshold value filters out low-importance concepts and is set to half of the maximum concept importance value

Table 7 Most important concepts obtained by filtering the Concept Regions through the feature importance estimation provided by the RF algorithm (“Reuters-21578” data set). The first five words are reported for each region whose importance exceeds the threshold value (see Fig. 9(a))
Table 8 Most important concepts obtained by filtering the Concept Regions through the feature importance estimation provided by the RF algorithm (“Abstracts” data set). The first five words are reported for each region whose importance exceeds the threshold value (see Fig. 9(b))

Conclusions

The current study is an effort to establish a clear relationship between findings in the Conceptual Spaces theory and the problem of text representation in Pattern Recognition, specifically in NLP tasks involved in text mining. Text mining, as a particular application of Machine Learning techniques to textual data, benefits from better representations of text as hierarchically organized sets of features, where Granular Computing techniques offer a wide range of tools for designing well-performing classification algorithms. Within this framework, where Granular Computing is bound to the Conceptual Spaces theory and to Machine Learning, a comparison is offered of classification algorithms working on features constructed over the prototypes of a conceptual space, obtained, in turn, over a suitable neural word embedding. Results show primarily that the conceptual layer placed between the associative layer and the symbolic layer (the symbolic histograms layer in this study) can be used to work with entities more abstract than words. These entities are a byproduct of the granulation of the conceptual space, which can be obtained with any algorithm in charge of embedding the text in a vector space. The three algorithms compared, two of them (SVM and RF) accepting real-valued n-tuples as input and the other (LSTM) working with input sequences, perform well over a large range of granulation levels of the conceptual space. Interestingly, depending on the nature of the textual data set, a low granulation level already allows achieving good classification results, which at the current stage of the study depend only weakly on the specific algorithm. Moreover, the conceptual level, together with the symbolic histograms technique, can aid in Knowledge Discovery tasks, providing a framework for transforming black-box classifiers into gray-box ones by mining the strongest concepts that lead to a particular classification, thus taking a small step towards understanding how a machine can represent meaning. Future work foresees training the conceptual space on exogenous corpora and, furthermore, extending the approach to n-gram prototypes, where a suitable dissimilarity measure between sequences of vectors needs to be carefully designed in order to build the symbolic histograms for the concept embedding.