Detecting and learning from unknown by extremely weak supervision: exploratory classifier (xClass)

In this paper, we break with the traditional approach to classification, which is regarded as a form of supervised learning. We offer a method and algorithm that make possible fully autonomous (unsupervised) detection of new classes, and learning following a very parsimonious training priming (only a few labeled data samples). Moreover, new unknown classes may appear at a later stage, and the proposed xClass method and algorithm are able to discover them and learn from the data autonomously. Furthermore, the features (inputs to the classifier) are automatically sub-selected by the algorithm based on the accumulated data density per feature per class. In addition, the automatically generated model is easy to interpret; it is locally generative and based on prototypes, which define the modes of the data distribution. As a result, a highly efficient, lean, human-understandable, autonomously self-learning model (which needs only an extremely parsimonious priming) emerges from the data. To validate our proposal, we tested it on four challenging problems: the imbalanced Faces-1999 database, the Caltech-101 dataset, the vehicles dataset, and the iRoads dataset, which is a dataset of images of autonomous driving scenarios. Not only did we achieve higher precision (in one of the problems outperforming all other methods by 25%), but, more significantly, we used only a single class beforehand, while other methods used all the available classes, and we generated interpretable models that use a smaller number of features, through extremely weak and weak supervision. We demonstrated the ability to detect and learn new classes for both images and numerical examples.


Introduction
Machine learning and pattern recognition, including classification, are perhaps at the peak of their development, attracting sharp interest not only from scientists and practitioners, but also from the wider public and media. This is, in part, thanks to the boom surrounding the wider area of artificial intelligence (AI) and recent successful and widely publicized applications ranging from games [14,34] and driverless cars [10,33] to defense and security [1,32,35] and home applications [23,28]. Despite the great success of the standard-bearer algorithm in this area, so-called deep learning, in image and speech recognition [18,27], the underlying concept of machine learning, which requires a large amount of labeled training data, remains unchanged. So-called reinforcement learning offers some departure from complete labeling, but still requires user input for each individual data sample. The most powerful approaches, such as deep learning and support vector machines (SVM), suffer from a lack of interpretability [5,11,25,30], are extremely hungry for power, time and computational resources, and, like dinosaurs, are unable to adapt and change with agility. They require complete retraining even for a single or a few new data samples.
In this paper, we propose a method and algorithm that depart from the traditional approach and offer a paradigm shift, bringing machine learning in general, and pattern recognition and classification in particular, extremely close to a fully unsupervised form. In a nutshell, the method offers self-learning, locally generative models that work together and require extremely light supervision in the form of a few data samples. It is able to automatically detect the unknown and to learn from it. This is in sharp contrast to the traditional approach, where learning is, in essence, only an averaging of the history. The current approaches struggle to detect changes, dynamical evolution or the appearance of new classes. They also assume a certain number of features (the same for all classes) provided at the start of the process. This is one of the reasons traditional approaches struggle to predict or react quickly to sudden changes in the data pattern, such as the economic crash of 2008 [15].
Methods such as eClass [8], FLEXFISClass [20] and similar ones are called ''evolving'' classifiers. They are designed to take newly arriving data samples into account. However, when it comes to new classes (rather than just new data samples), a class label is required, which means that these methods are supervised learning methods. The method proposed in this paper is unsupervised with regard to new data that represent a new class. There are also unsupervised evolving algorithms for clustering [9], but these methods do not deal with classification, unlike the method proposed in this paper. Another type of method that claims to address similar problems is so-called zero-shot learning (ZSL). Its objective is to transfer a learnt model to unknown classes without the acquisition of new features. However, the main problem with this type of technique is its dependence on additional information to relate unknown classes to previously trained models. Such information is not always available or possible to acquire [17]. In this respect, the ZSL approach is not unsupervised in terms of the new class and is not a direct comparator.
The proposed approach is prototype-based and learns locally around the prototypes, extracting the empirical data distribution, called typicality, as well as the data density [6]. The approach is recursive and thus computationally very lean. It is also non-iterative and nonparametric, which adds to its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users since it can be represented in a linguistic IF...THEN form. It combines reasoning and logic with machine learning. It can also be presented as a deep neural network. Finally, it also has a statistical nature and offers an empirical form of the probability density function (pdf) [7].
In this paper, we apply this new, fundamentally different type of machine learning to four challenging problems and demonstrate its significant advantages. The main challenges that the method proposed in this paper addresses are: i) to detect when a certain unlabeled (new) data sample is not from a class that was used in training, i.e. to have a class ''Unknown'' or ''New''; ii) to learn from such new unlabeled data in an unsupervised manner. The proposed approach to the first issue is based on the drop of the density, which represents the confidence in a decision. The proposed approach to the second issue is to learn from the data for which the class is ''New''. The proposed approach further selects prototypes out of the data samples of the ''New'' class according to their density, in the same way as for the other (known) classes. Because the learning in the proposed approach is per class, all new data from a ''New'' class are analysed separately from the data of the known classes. The remainder of this paper is organized as follows: The method and algorithm section introduces the proposed exploratory approach for extremely weakly supervised classification. The experimental data employed in the analysis and the results are presented in the Results section. Discussion is presented in the last section of this paper.

Concept and basic algorithm
Traditionally, the pipeline of learning from data includes the following steps: (1) Pre-processing, which includes different substeps such as normalization/standardization, dealing with missing data, and feature selection [16]. Specifically for image processing, there are often other stages, such as rotation, augmentation, scaling, and elastic deformation [26]. Even deep learning methods, which claim to avoid handcrafting, apply some of these steps. (2) Learning phase, which can be offline, when the full dataset is available, or online, when the data arrive in the form of a data stream (sample-by-sample). Evolving learning, the ability of the algorithms to adapt their parameters and structure according to non-stationary data streams, is a more sophisticated form of online learning [3,29]. (3) Generating outputs for new unseen data, which is the validation phase. Different algorithms use different strategies to validate the model generated in the learning phase.
The proposed method also starts with a pre-processing step which involves mostly the same steps, depending on the specific problem. For example, for image processing we may also apply scaling, augmentation, rotation, etc. For practically all problems, normalization and standardization are required. The proposed xClass method uses standardization and normalization as follows: Firstly, it standardizes the newly observed data sample, $x_i$, where $i = 1, 2, \ldots, n$ denotes the time stamp of the current moment and $j = 1, 2, \ldots, n$ refers to the features of the given $x$.
Here, $\hat{x}$ denotes the standardized data sample. Outliers ($|\hat{x}| \geq 3$) are ignored and not used for training. After that, the data are rescaled to the range [0, 1] so that all features are considered in the same proportion. It is important to highlight that in the proposed xClass method, the normalization is applied to the standardized data. Unity-based normalization of the i-th element of the j-th sample yields $x$, the normalized data sample. Prototype-based learning is the core of the proposed method: the prototypes are the focal points of locally valid generative models described by a multimodal Cauchy distribution [6]. The meta-parameters are initialized with the first observed data sample. The proposed algorithm works per class; therefore, all the calculations are done for each class separately.
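As an illustration only, the pre-processing described above can be sketched as follows. The sketch assumes that standardization is the usual per-feature zero-mean, unit-variance transform and that unity-based normalization is min-max rescaling; the function name `preprocess`, the use of the population standard deviation, and the rule of dropping a whole sample when any of its standardized values reaches the threshold are assumptions made here, not details confirmed by the text.

```python
import math

def preprocess(samples, threshold=3.0):
    """Standardize per feature, drop outlier samples, then rescale to [0, 1]."""
    n = len(samples)
    n_features = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in samples) / n
        stds.append(math.sqrt(var) or 1.0)  # constant feature: avoid division by zero
    standardized = [
        [(row[j] - means[j]) / stds[j] for j in range(n_features)]
        for row in samples
    ]
    # Assumed outlier rule: ignore a sample if any standardized value reaches the threshold.
    kept = [row for row in standardized if all(abs(v) < threshold for v in row)]
    if not kept:
        return []
    # Unity-based (min-max) normalization, applied to the standardized data.
    lows = [min(row[j] for row in kept) for j in range(n_features)]
    highs = [max(row[j] for row in kept) for j in range(n_features)]
    return [
        [(row[j] - lows[j]) / ((highs[j] - lows[j]) or 1.0) for j in range(n_features)]
        for row in kept
    ]

# Eleven regular samples plus one sample with an extreme second feature.
data = [[float(i), 0.0] for i in range(11)] + [[5.0, 100.0]]
normalized = preprocess(data)
```

In this toy input the last sample has a standardized second feature of about 3.3, so it is ignored, and the remaining eleven samples are rescaled to [0, 1].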
Here, $\mu$ denotes the global mean of the data samples of the given class, and $P$ is the total number of prototypes identified from the observed data samples. Each class $C$ is initialized by the first data sample of that class, where $p_1$ is the prototype of $C_1$, $S_1$ is the corresponding support (number of members), and $r_1$ is the corresponding radius of the area of influence of $C_1$.
In this paper, we use $r^* = \sqrt{2 - 2\cos(30^\circ)}$, the same as in [6]; the rationale is that two vectors separated by an angle of less than $\pi/6$ ($30^\circ$) point in close or similar directions. That is, two feature vectors are considered similar if the angle between them is smaller than 30 degrees. Note that $r^*$ is data derived, not a problem- or user-specific parameter. In fact, it can be defined without prior knowledge of the specific problem or data. The next step is to calculate the data density at $x_i$ and $p_j$ ($j = 1, 2, \ldots, P$).
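The value of $r^*$ can be checked numerically. The short sketch below assumes, for illustration, that feature vectors are scaled to unit norm, in which case the Euclidean distance between two vectors separated by an angle $\theta$ equals $\sqrt{2 - 2\cos\theta}$; the helper names `unit` and `euclidean` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Radius of the area of influence: distance between unit vectors 30 degrees apart.
R_STAR = math.sqrt(2.0 - 2.0 * math.cos(math.radians(30.0)))

a = unit([1.0, 0.0])
b = unit([math.cos(math.radians(30.0)), math.sin(math.radians(30.0))])  # 30 degrees from a
c = unit([math.cos(math.radians(20.0)), math.sin(math.radians(20.0))])  # 20 degrees from a
```

Under this assumption, `euclidean(a, b)` coincides with `R_STAR` (about 0.518), while the vector at 20 degrees lies inside the radius and would be treated as similar.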
Here, $p_j$ ($j = 1, 2, \ldots, P$) is the set of prototypes and $\sigma_i$ is the standard deviation. The choice of the Cauchy form is not arbitrary [4]. It can be demonstrated theoretically that if Euclidean or Mahalanobis types of distance in the feature space are considered, the data density reduces to the Cauchy type, as in equation (5). It can also be demonstrated that the so-called typicality, $\tau$, is the weighted average of the data density, $D$, with weights representing the frequency of occurrence of a data sample [6]. Furthermore, the typicality $\tau$ can be considered an empirically derived form of the pdf with the same properties; notably, it integrates to 1 over an infinite range.
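A minimal sketch of a Cauchy-type data density with recursively updated statistics is given below. It assumes the form $D(x) = 1 / (1 + \|x - \mu\|^2 / \sigma^2)$, with $\sigma^2$ taken as the mean squared norm minus the squared norm of the mean, in the spirit of recursive density estimation [2,6]; the class name `CauchyDensity` and the handling of the zero-variance case are illustrative assumptions rather than the exact expressions of the paper.

```python
class CauchyDensity:
    """Recursive mean and scatter, with a Cauchy-type density (illustrative sketch)."""

    def __init__(self, n_features):
        self.count = 0
        self.mean = [0.0] * n_features  # recursive mean of the samples
        self.mean_sq_norm = 0.0         # recursive mean of the squared norms

    def update(self, x):
        """Absorb one sample without storing past samples."""
        self.count += 1
        k = self.count
        self.mean = [((k - 1) * m + v) / k for m, v in zip(self.mean, x)]
        sq_norm = sum(v * v for v in x)
        self.mean_sq_norm = ((k - 1) * self.mean_sq_norm + sq_norm) / k

    def variance(self):
        return self.mean_sq_norm - sum(m * m for m in self.mean)

    def density(self, x):
        """Cauchy-type density: equals 1 at the mean and decays with squared distance."""
        var = self.variance()
        dist_sq = sum((v - m) ** 2 for v, m in zip(x, self.mean))
        if var <= 0.0:
            # Degenerate case (for example a single sample): assumed convention.
            return 1.0 if dist_sq == 0.0 else 0.0
        return 1.0 / (1.0 + dist_sq / var)

estimator = CauchyDensity(n_features=2)
for point in [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]:
    estimator.update(point)
```

For these four corner points the mean is (1, 1) and the variance is 2, so the density is 1 at the mean and 0.5 at each corner.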
Density per feature $f$ is obtained according to equation (5), where $D^f_i$ denotes the density of the f-th feature of the sample $x_i$. The cumulative effect across all data samples per feature can be obtained according to equation (6).
The cumulative contributions $\Lambda^f_i$ of the features can be rank ordered; $n$ represents the number of samples. The higher the value of $\Lambda^f_i$ for a particular feature, the more important the f-th feature is. The rationale is that an interesting feature has higher density than other features, meaning that it conveys unique, clear information and, as a consequence, contributes more to the classifier's result because the overlap between data of different classes is less pronounced for this feature.
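The ranking idea can be sketched as follows for the samples of a single class. Since the exact per-feature expressions are not reproduced here, the sketch assumes a one-dimensional Cauchy-type density per feature, summed over all samples of the class; the function names `feature_scores` and `rank_features` and the parameter `keep` are hypothetical.

```python
def feature_scores(samples):
    """Cumulative per-feature density over the samples of one class (assumed 1-D Cauchy form)."""
    n = len(samples)
    n_features = len(samples[0])
    scores = []
    for f in range(n_features):
        column = [row[f] for row in samples]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        if var == 0.0:
            # Constant feature: every sample sits on the mean, so each density is 1.
            scores.append(float(n))
            continue
        scores.append(sum(1.0 / (1.0 + (v - mean) ** 2 / var) for v in column))
    return scores

def rank_features(samples, keep):
    """Indices of the `keep` features with the highest cumulative density."""
    scores = feature_scores(samples)
    order = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return order[:keep]

# Feature 0 is concentrated around one value; feature 1 is split evenly between two values.
class_samples = [[0.0, -1.0], [0.0, 1.0], [0.0, -1.0], [0.0, 1.0], [0.0, -1.0], [10.0, 1.0]]
scores = feature_scores(class_samples)
selected = rank_features(class_samples, keep=1)
```

Running the selector separately on the samples of each class gives a per-class feature subset, which is how the per-class selection described in the text could be organized.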
Neural Computing and Applications (2021) 33:15145-15157
Then the algorithm absorbs the new data samples one by one by assigning them to the nearest (in the feature space) prototype. Because of this form of assignment, the shape of the data partitioning is of the so-called Voronoi tessellation type [21]. We call all data points associated with a prototype a data cloud, because their shape is not regular (e.g. hyper-spherical, hyper-ellipsoidal, etc.) and the prototype is not necessarily the statistical and geometric mean [6]. If the condition from [6] is met, $x_i$ is outside the area of influence of $p_j$. Therefore, $x_i$ becomes the prototype of a new data cloud, which is added with meta-parameters initialized by equation (9). Otherwise, the data cloud parameters are updated online by equation (10). It has to be stressed that all calculations per data cloud are performed on the basis of the data points associated with that data cloud only (i.e. locally, not globally on the basis of all data points).
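The assignment step can be sketched as follows. The condition for opening a new data cloud is not reproduced in the text, so the sketch assumes a simple distance test against the radius $r^*$; the running-mean prototype update and the fixed radius are likewise simplifying assumptions, and the names `DataCloud` and `absorb` are hypothetical.

```python
import math

R_STAR = math.sqrt(2.0 - 2.0 * math.cos(math.radians(30.0)))

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class DataCloud:
    """Prototype, support (number of members) and radius of one data cloud."""

    def __init__(self, x):
        self.prototype = list(x)
        self.support = 1
        self.radius = R_STAR  # kept fixed in this sketch

def absorb(clouds, x):
    """Assign x to the nearest data cloud or open a new one; return the cloud index."""
    if not clouds:
        clouds.append(DataCloud(x))  # the first sample initializes the class
        return 0
    j = min(range(len(clouds)), key=lambda k: distance(x, clouds[k].prototype))
    cloud = clouds[j]
    if distance(x, cloud.prototype) > cloud.radius:
        # Assumed condition: x lies outside the area of influence of the nearest prototype.
        clouds.append(DataCloud(x))
        return len(clouds) - 1
    # Assumed online update: the prototype follows the running mean of its members.
    cloud.support += 1
    s = cloud.support
    cloud.prototype = [((s - 1) * p + v) / s for p, v in zip(cloud.prototype, x)]
    return j

clouds = []
assignments = [absorb(clouds, x) for x in ([0.0, 0.0], [0.1, 0.0], [1.0, 1.0])]
```

Each class keeps its own list of data clouds, so the update touches only the cloud that receives the sample, in line with the local nature of the calculations.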
One of the strongest aspects of the proposed approach is its high level of interpretability, which comes from its prototype-based, local generative models as well as its ability to be expressed as a set of linguistic IF...THEN fuzzy rules. The fuzziness represents the degree of association/similarity to the prototypes. Indeed, the value of the data density, $D$, equation (5), can be interpreted as a membership function of the fuzzy set $(x \sim p)$ [6], with a maximum of 1 when $x = p$. The continuous typicality, $\tau$, given by equation (12), is an empirically derived form of the probability distribution. The value of $\tau$, even at the point $x = p_i$, is much less than 1 because its integral satisfies $\int_{-\infty}^{\infty} \tau \, dx = 1$. The typicality per class offers a conditional probability that is the basis of a generative model, but within both xDNN and xClass, from the classifier design point of view, we are interested in the local peaks of the typicality, which coincide with the peaks of the data density. Indeed, it can be demonstrated that since the mathematical expression of the typicality is a mixture of Cauchy expressions and that of the data density is a Cauchy expression, the peaks of $\tau$ and $D$ are at the same value $x^*$. The data density $D$ is much easier to calculate; therefore, we use $D$ rather than $\tau$ in what follows.

Detect and learn from unknown
This is the most innovative part of the proposed algorithm, in addition to the feature selection per class. It makes the classifier exploratory (hence the name xClass) and allows it to detect new data patterns autonomously and learn from them.

Drop of confidence (detect the novelty)
Unlabeled data samples become available as soon as the training process with labeled samples finishes. The eXploratory classifier (xClass) can then continue to learn from these unknown data samples. The unlabeled training samples are defined as the set $\{u\}$, and the number of unlabeled samples is defined as $U$.
The first step in the weakly supervised learning process of xClass is to extract the vector of confidence (degrees of closeness to the nearest prototypes) for each unlabeled data sample, defined as $\lambda(u_i)$, $i = 1, 2, \ldots, U$, where $\lambda$ denotes the degree of confidence. The recursive mean $\mu_i$ of $\lambda_{max}$ for the labeled data samples is used to detect a sudden drop in the confidence generated by the xClass classifier when a new unknown class arrives, and it can be calculated recursively [2]. Then the m-$\sigma$ rule is applied; for a detailed explanation of the m-$\sigma$ rule, please refer to [24]. New classes are actively added by the proposed xClass classifier, and rules are actively created, when the inequality (15) is satisfied. Otherwise, the newly arrived unlabeled data samples are used to update the structure and meta-parameters of the xClass classifier. Figure 1 illustrates the drop of confidence of the proposed method when a new unseen class arrives. The black line indicates the confidence of xClass. When the fall is detected and the inequality (15) is satisfied, the label of this data sample is not any of the labels known to xClass. There are two options: (a) the drop is a one-off due to an outlier, noise or randomness, or (b) a number of data samples for which a drop of confidence is detected are close to each other in the data space (please note that they may not necessarily arrive one after the other as in Fig. 1).
$$\text{IF } \left(\lambda_{max}(u_i) < (\mu_i - m\sigma)\right) \text{ THEN } (u_i \in \text{Possible new class detected}) \text{ ELSE } (\text{Update structure and meta-parameters}) \qquad (15)$$
When the inequality (15) is satisfied, the arriving data sample is marked as a potential outlier and temporarily saved. When several potential outliers are close to each other in the data space and have similar densities, they are labeled ''new class 1''; if more than one group is formed, then further new classes are formed as well and new labels such as ''new class 2'' are generated. The user can be proactively asked to (optionally) provide a semantically meaningful label, for example, ''apple''; however, no retraining is required.
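The detection logic can be sketched as follows, under stated assumptions: the mean and standard deviation of the confidence are tracked recursively, a sample is flagged when its confidence falls below the mean minus $m$ standard deviations, and flagged samples are grouped greedily by distance alone (the text also mentions similar densities, which this sketch does not check). The names `ConfidenceMonitor` and `group_candidates` are hypothetical.

```python
import math

class ConfidenceMonitor:
    """Recursive mean and spread of the confidence, with an m-sigma test (sketch)."""

    def __init__(self, m=3.0):
        self.m = m          # multiplier of the m-sigma rule
        self.count = 0
        self.mean = 0.0     # recursive mean of the confidence
        self.mean_sq = 0.0  # recursive mean of the squared confidence

    def update(self, confidence):
        self.count += 1
        k = self.count
        self.mean = ((k - 1) * self.mean + confidence) / k
        self.mean_sq = ((k - 1) * self.mean_sq + confidence ** 2) / k

    def std(self):
        return math.sqrt(max(self.mean_sq - self.mean ** 2, 0.0))

    def is_possible_new_class(self, confidence):
        # Inequality (15): confidence below mean - m * sigma signals a possible new class.
        return confidence < self.mean - self.m * self.std()

def group_candidates(candidates, radius):
    """Greedily group buffered samples lying within `radius` of a group's first member."""
    groups = []
    for x in candidates:
        for group in groups:
            if math.dist(x, group[0]) <= radius:
                group.append(x)
                break
        else:
            groups.append([x])
    return groups

monitor = ConfidenceMonitor(m=3.0)
for value in [0.90, 0.92, 0.88, 0.91, 0.89]:  # confidences observed on labeled samples
    monitor.update(value)

buffered = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
groups = group_candidates(buffered, radius=0.5)
```

In a full loop, a sample that is not flagged would be passed on to update the structure and meta-parameters, and each sufficiently large group of buffered samples would receive a label such as ''new class 1''.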
One or a few labels are provided for the newly detected classes. The validation process follows the 'winner-takes-all' principle. The general structure of the proposed xClass approach is illustrated by the block diagram presented in Fig. 2.
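The 'winner-takes-all' decision can be sketched as follows: each class is scored by the closeness of the sample to its nearest prototype, and the class with the highest score wins. The Cauchy-type closeness $1 / (1 + \|x - p\|^2)$ used here is an assumption for illustration, as are the function names and the toy prototypes.

```python
import math

def closeness(x, prototype):
    """Assumed Cauchy-type similarity: 1 at the prototype, decaying with squared distance."""
    return 1.0 / (1.0 + math.dist(x, prototype) ** 2)

def classify(x, prototypes_by_class):
    """Return the winning class label and its confidence score."""
    scores = {
        label: max(closeness(x, p) for p in prototypes)
        for label, prototypes in prototypes_by_class.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores[winner]

# Toy prototypes: one known class and one class discovered during exploration.
prototypes = {
    "Night": [[0.0, 0.0]],
    "new class 1": [[1.0, 1.0], [2.0, 2.0]],
}
label, confidence = classify([0.9, 1.0], prototypes)
```

The returned confidence is the quantity that a drop-of-confidence test would monitor for subsequent samples.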

Results
In this section, we demonstrate the results obtained by the proposed extremely weakly supervised classification approach. Computational simulations were performed to assess the accuracy of the classification methods on 4 different benchmark problems. The experiments with the proposed algorithm aim to demonstrate that it offers:
- high precision as compared with the top state-of-the-art algorithms.
- ability to detect unseen/new data patterns autonomously and learn from them.
- ability to learn with extremely low supervision (few labeled data samples) for the newly detected classes.
- ability to autonomously select the most effective features per class.
- highly transparent interpretable model.
- no user- or problem-specific algorithmic parameter (except for feature selection, which can be done by ad hoc decision).
- non-iterative algorithm able to learn continuously.

iRoads dataset
In the first experiment, the iRoads dataset [22] was considered. The convolutional deep neural network VGG-16 was trained with 80% of the available iRoads dataset; however, images for the 'Rainy day' scenario were omitted from the training phase. After the training phase, 'Rainy day' images were presented to the neural network. As VGG-16 was not trained for this situation and is not able to adapt its structure to the newly arrived class, it misclassified the 'Rainy day' scenario as a 'Night' scenario with almost 90% confidence, as illustrated in Fig. 3.
The misclassification of the 'Rainy day' driving scenario by VGG-16 is not surprising, because VGG-16 (like other mainstream deep neural networks) can only recognise what it was trained for and is not equipped with an exploratory mechanism to enable detection of, and learning from, unknown data samples. In such new situations, mainstream deep networks require full retraining in order to correctly classify new classes. However, full retraining of a deep neural network is usually time consuming, computationally expensive and costly, and involves a human for labeling purposes.
The xClass exploratory mechanism is able to discover new classes as they arrive at the system thanks to its mechanism based on recursive density estimation [2] and the Chebyshev inequality approach [24], as shown in Fig. 4. The blue line indicates the confidence value ($\Lambda_{max}$ boundary) given by the xClass classifier, the red line indicates the recursive density estimation value, and the green line is the 3-$\sigma$ level. The sudden fall of the blue line indicates the moment when the unlabeled set of images belonging to an unknown class arrives at the system. The proposed xClass classifier was trained with 80% of the available iRoads images of all classes except the 'Rainy day' class. Then, the new unlabeled class was presented to the proposed classifier; xClass was able to detect the sudden, drastic fall in the confidence (Fig. 4) and proactively create a new class, as illustrated in Fig. 5. The prototype-based and non-iterative nature of the proposed method made it possible to detect the fall in the confidence ($\lambda_{max}$) in real time and, unlike traditional deep learning approaches, no retraining is required to learn the new class.
The proposed xClass classifier obtained 99.12% classification accuracy for unlabeled images using the 3-$\sigma$ approach. The semantically meaningful label 'Rainy Day Scene' is optional and requires only one-off involvement by the human (by default it will stay as 'new class 1'). The final rule generated for this new class detected by the proposed xClass classifier is given in Fig. 6.

Faces-1999 dataset
As a second example, we consider the Faces-1999 dataset provided by Caltech [12]. For the face recognition problem, the xClass classifier is trained with just one type of face, unlike traditional approaches, which are primed with all available classes (20 different types of faces). We used the fully connected layer of VGG-16 for feature extraction. For each image, it produces 4096 values that can be considered to be abstract features.
As the traditional approaches are not equipped with an exploratory mechanism, they are not able to discover new data patterns, and they therefore classify newly arriving data samples as the trained class. Then, the new classes were presented to the proposed approach, and it was able to detect these new types of faces through the drop of confidence, as illustrated in Fig. 7. After the detection of these new classes, extremely weak supervision (1% of training data labeled) and weak supervision (10% of training data labeled) were provided in order to label the newly arrived classes. After the labeling phase, the classification task was performed. As one can see from Figs. 8 and 9, the proposed xClass method can surpass its state-of-the-art competitors, as they require more labeled data to provide good results. With just 1% of the training data labeled, the advantage of xClass is clearly visible. In real scenarios, the labeling process is very time consuming and is not always possible. Figure 7 illustrates the sudden drop in the confidence when new unknown classes are presented to the xClass classifier; xClass uses the drop of confidence, based on the density of the data, to discover new classes. It is notable that the proposed xClass classifier can obtain better results without the need for a huge number of labeled data, unlike traditional approaches. The performance curve is given in Fig. 9; as illustrated, xClass still produces better classification rates when more training data are provided.

Caltech-101 dataset
As a third case, we consider the Caltech-101 dataset [13]. As in the other experiments, the proposed xClass classifier was primed with 80% of the data samples from the first class for training and then used its exploratory mechanism to discover the other classes autonomously and learn from them, based on the data density, through the drop of confidence as detailed in Fig. 10. As illustrated in Fig. 11, traditional approaches are not able to detect new data patterns after the training phase (the traditional approaches were trained with just 1 class) and therefore tend to produce results with low accuracy. Unlike supervised methods, which are data hungry, the proposed xClass approach could obtain high classification accuracy with extremely weak supervision (Fig. 11), in other words, with as little training data as possible. The acquisition of labeled data requires enormous human effort and is very time consuming. Figure 12 gives the evolution of the performance of the proposed exploratory classifier as more training samples are provided. As illustrated in Fig. 12, the xClass classifier is able to produce better results in terms of accuracy, demonstrating its ability to detect and learn from the unknown effectively. The Caltech-101 dataset consists of 101 different classes; however, only 10 classes were used in the experiment. Supervised methods such as Decision tree, k-nearest neighbors (KNN), Adaboost, and SVM require information about all the available classes beforehand in order to produce better results (the red bars in Fig. 11 illustrate the results obtained when just one class is used in the training phase). In comparison, the proposed extremely weakly supervised approach requires knowledge of just one class beforehand, as illustrated in Fig. 10, as the other classes are discovered through its exploratory mechanism. The blue bar in Fig. 11 illustrates the classification results when just 1% of labeled training data is provided for all classes.
The proposed exploratory xClass classifier obtained almost 90% classification accuracy. State-of-the-art approaches need labeled training data to produce acceptable results, as illustrated in Fig. 12. Even when more labeled training data are provided, the proposed xClass classifier still produces better results in terms of accuracy than its competitors. Furthermore, the ZSL method proposed in [19] was reported to provide 57% accuracy for the same problem, which is a significantly poorer result than the one obtained by the proposed xClass method. In addition to its significantly higher accuracy than the ZSL method, the proposed xClass method also has the advantage of allowing human inspection of the decision-making process (it is explainable).

Vehicles dataset
In the fourth case, we consider the vehicles dataset [31], which is a non-image-based dataset. xClass is first trained with just one sample of the first class, and then it has to autonomously detect the other classes based on the empirically observed data and the sudden drop of confidence (Fig. 13). The inner parallel feature selector of the proposed approach selected 7 out of the 18 original features, differently for each class. This helps to improve the interpretability of the proposed classifier. Results obtained by xClass and its competitors are given in Fig. 15. It is important to highlight that SVM, KNN, Decision Tree, Adaboost, and long short-term memory (LSTM) are all supervised methods, and they were trained with all available classes beforehand (the red bars in Fig. 14 illustrate the results obtained by the traditional supervised approaches if just one class is used in the training phase). However, the proposed xClass approach could obtain better results in terms of accuracy even though it uses extremely weak supervision (Fig. 14). Figure 13 illustrates the drop of confidence when new unseen classes are presented to the proposed classifier. Unlike traditional approaches, which require knowledge of all available classes beforehand, the proposed xClass uses its exploratory mechanism to autonomously discover a new class on the basis of the empirical data alone. The red bars in Fig. 14 show the results obtained by state-of-the-art methods if just one class is presented during the training phase; as they are not able to detect newly arriving data patterns and adapt their structure to this scenario, they wrongly classify the newly arrived data samples as the known class. Different types of supervision (extremely weak, weak, full) were provided during the experiments; in all cases the proposed method provided better classification performance than its competitors, as illustrated in Fig. 15. Figure 14 shows that the results obtained with xClass under extremely weak supervision surpass those of its competitors by more than 25% in terms of classification performance, which indicates the efficiency of the proposed method.
As shown in Fig. 15, xClass is able to improve its results if more training data and all classes are provided. For validation purposes, 20% of the data samples were used in all cases, and labels for the classes newly detected by xClass are attached during this phase. The AnYa fuzzy rule [6] for the newly identified class, $R_{new}$, is expressed in terms of $x$, the set of selected features given by the density-based feature selector.

Conclusion
In this paper, we break with the traditional approach to supervised classification. We offer a new, fully autonomous, extremely weakly supervised approach (xClass) which is able to learn from just a single class and a handful of labeled data samples. Then, as new classes, unknown to the human user and to the trained classifier, appear at a later stage, the proposed xClass method is able to discover them and learn from the data autonomously, as demonstrated in the Results section. Furthermore, the features (inputs to the classifier) are automatically sub-selected by the algorithm based on the accumulated data density per feature per class. The results demonstrated that the proposed approach offers high precision as compared with the top state-of-the-art algorithms.
The proposed xClass approach surpassed its competitors in terms of accuracy in all experiments, using extremely weak supervision as well as full supervision. Moreover, the proposed algorithm produced highly transparent, interpretable results, which are helpful for analysis by human experts. No user- or problem-specific algorithmic parameters (except for feature selection, which can be done by ad hoc decision) are required, which is also an advantage of the proposed xClass classifier.
To validate our proposal, we tested it on four challenging problems, including adversarial autonomous car scenario classification, imbalanced face detection, and object detection. Not only did we achieve higher accuracy (in one of the problems outperforming the other methods by 25%), but, more significantly, we used the knowledge of just a single class beforehand and extremely weakly labeled data, and we generated interpretable models that use a smaller number of features. Furthermore, the proposed xClass method demonstrated the ability to learn from the unknown without retraining, the need for which is one of the biggest problems of deep learning based on neural networks. As illustrated, the convolutional deep learning network misclassified an unknown class with high confidence; on the other hand, the proposed approach was able to detect a sudden drop in the confidence, learn from this unknown data, and then proactively create a new class for this new scenario. The proposed method is applicable to a wide range of problems, especially problems with unknown dimension and problems for which the concept changes over time.
As future work, we will investigate the occurrence of more than one unknown class at the same time. Furthermore, we will also explore highly dynamic problems such as video and other forms of data streams and address the time needed to learn online.

Declarations
Competing Interests The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.