On the learning of vague languages for syntactic pattern recognition

A method for learning vague languages, which represent distorted/ambiguous patterns, is proposed in the paper. The goal of the method is to infer the quasi-context-sensitive string grammar which is used in our model as the generator of patterns. The method is an important component of the multi-derivational model of the parsing of vague languages used for syntactic pattern recognition.


Introduction
Pattern recognition methods fall into two main approaches: the decision-theoretic approach [7,10] and the syntactic-structural approach. The latter contains three groups of methods: the algebraic group [36], the structural group [8] and the syntactic group, called syntactic pattern recognition, SPR [20,22,24]. The problem of the self-learning of SPR systems is one of the basic open problems in syntactic pattern recognition [20,22]. The learning modules in SPR systems are constructed according to the theory of language induction (grammatical inference). Although the first language induction algorithms were proposed in the late 1960s, the results in this research area still seem to be unsatisfactory from the point of view of their practical applications [22]. Therefore, the requirement of the availability of a grammatical inference algorithm for any model of syntactic pattern recognition has been formulated as the condition sine qua non of the effectiveness of the model [22].
Among the variety of syntactic pattern recognition applications, signal analysis has been especially popular since the early 1970s [9]. These applications include electrocardiography, electroencephalography, pulse wave analysis, auditory brainstem response audiometry, cardiotocography, technical analysis in economics, information flow management, industrial signal processing, signal analysis in seismology, etc. [22,24,29,39]. The syntactic pattern recognition method based on the class of DPLL(k) context-free grammars, which are able to generate a considerable subclass of context-sensitive languages, was proposed in [17,30]. The method was applied to process monitoring and control in particle physics [5], auditory brainstem response audiometry [18], and fetal palates diagnosis [33].
Various pattern recognition and computer science methods have recently been widely used for short-term electrical load, STEL, prediction. The most recent publications in this application area include the following papers. Amjady and Keynia used neural networks and evolutionary algorithms in the context of price forecasting of electricity markets in [2]. Fan and Hyndman presented in [12] a STEL forecasting method for the National Electricity Market of Australia which is based on additive regression. Yang, Wu, Chen and Li proposed a hybrid model which consists of neural networks and autoregressive integrated moving average in [47]. Hong and Wang developed a method based on fuzzy interaction regression for applications in the areas of operations, maintenance, demand response and energy market activities in [28]. Wang, Liu and Hong proposed a big data approach using multiple linear regression for STEL prediction with recency effects in [45]. A hybrid method of STEL forecasting in the Philippines based on a hidden Markov model and autoregressive integrated moving average was presented by Hermias, Teknomo and Monje in [25]. Tian, Ma, Zhang and Zhan developed a STEL prediction model based on long short-term memory neural networks and convolutional neural networks [44]. The variety of methods in this area is surveyed in [1,3,4,6,11,26,27,38,43,46].
In our previous paper [21], we used the syntactic pattern recognition approach for short-term electrical load prediction. In the application of STEL prediction, the method had to be extended considerably in order to process ambiguous patterns. This was done by introducing the class of vague DPLL(k) languages and constructing the Syntactic Pattern Recognition-based Electrical Load Prediction, SPRELP, system on the basis of this class [21]. Since new signal patterns occur very often in this application area, the requirement of the self-learning of the SPRELP system has turned out to be crucial for the effectiveness of the system. The purpose of the research presented in the paper has been to develop a holistic method of syntactic pattern learning that includes both the learning of self-organizing maps, SOMs (used for the generation of a structural pattern representation) and the learning in the parsing system for DPLL(k) grammars [21] (used for the structural pattern recognition).
The generic scheme of our hybrid model is presented in Sect. 2. The concept of the use of self-organizing maps for the generating of vague patterns which belong to a DPLL(k) language is introduced in Sect. 3. The algorithm of the language induction for the DPLL(k) class is discussed in Sect. 4. The application of the SPRELP system for short-term electrical load prediction is described in Sect. 5. The concluding remarks are contained in the final section.

Syntactic pattern recognition with vague languages
The solving of the following two fundamental open problems seems to be crucial for the construction of an effective syntactic/structural pattern recognition system:
• the generation of a structural pattern representation in the case of distorted/ambiguous objects (including the development of stochastic models [24,40]) and
• the learning of SPR systems (including the use of grammatical induction algorithms [20], learning automata [48], etc.).
For the solving of the first problem, the two-phase recognition model was proposed in [21]. In order to avoid the loss of information which represents the vagueness/ambiguity of objects or processes to be recognized, a model based on the so-called vague structural patterns has been defined. The generic scheme of such a model/system is shown in Fig. 1. Firstly, let us present the recognition phase in the system. In the first step, a vague structural pattern is generated on the basis of an input feature vector. Such a pattern is defined with the help of vague primitives. A vague primitive allows us to describe the fuzzy nature of objects/processes to be recognized. Let us introduce it formally in the following way.
where: $a_{k_1}, a_{k_2}, \ldots, a_{k_j} \in \Sigma_T$ are different symbols, and the measures indexed by $k_1, k_2, \ldots, k_j$, called attributes, are ascribed to $a_{k_1}, a_{k_2}, \ldots, a_{k_j}$, correspondingly. □
In our approach, the attributes of vague primitives can be of the form of distance, probability, fuzziness measures, etc. Let us consider the following example. Let a set of structural primitives be defined as shown in Fig. 2a. The exemplary vague primitive is depicted in Fig. 2b, where attributes are assigned to its component primitives according to the Euclidean metric. We assume that this metric is used for the attributing of vague patterns in the examples presented in the paper. The first step, i.e., the generation of vague structural patterns, will be discussed in a more detailed way in the next section.
In the second step, the vague structural pattern is recognized by the DPLL(k) syntax analyzer (parser). The syntax analyzer determines b best vague patterns. These patterns are treated as acceptable approximations of the (idealized) template. This is the fundamental difference between our approach and standard syntactic pattern recognition approaches in which the single "correct" structural pattern is determined during parsing. The quality of these b best patterns is defined with the help of the attributes of their component vague primitives. Finally, the quality measures are used by the problem solver to obtain the "averaged" structural pattern. The recognition path, presented briefly here, was described in detail in [21].
In the syntactic pattern recognition paradigm, the availability of a structural sample set is assumed for the purpose of the self-learning of a pattern recognition system. (A sample set in syntactic pattern recognition corresponds to a learning (training) set in the decision-theoretic approach.) This sample set consists of structural patterns, and it represents the underlying formal language. A sample set should be big enough to represent a variety of structural patterns. Therefore, the constructing of the formal grammar or the control table of the syntax analyzer (parser) "by hand" is impossible. Fortunately, grammatical induction algorithms can be used in order to automate this process [19,22]. The availability of a grammatical induction algorithm for the class of grammars used in the proposed model of syntactic pattern recognition is so important that it has been formulated as the fundamental methodological principle in [20]: A syntactic pattern recognition model should include the following three components: a generative grammar, a computationally efficient parsing algorithm and a grammatical induction algorithm of the polynomial time complexity.
Since we propose the two-step syntactic pattern recognition model, the learning should be performed at both steps of pattern processing, i.e., during the structural pattern generation and the grammar/control table induction. This two-step (self-)learning path in our model is denoted with dotted arrows in Fig. 1. In the model presented in the paper, the self-organizing map [34,35], SOM, is applied for the generation of vague structural patterns at the recognition phase [21]. Therefore, the SOM is to be trained in the first step of the learning phase. At the same time, the SOM generates vague patterns of a structural sample set. The SOM training process will be described in the next section. The induction algorithm for the generation of the DPLL(k) parser control table based on the structural sample set in the second step of the learning phase will be presented in Sect. 4.

Learning vague patterns with self-organizing maps
In machine learning, considered as an area of artificial intelligence [19], three generic approaches are distinguished: supervised learning, unsupervised learning and reinforcement learning. For the constructing of syntactic pattern recognition systems, unsupervised learning is usually used [22]. Since numerical patterns are assumed in the first step of the learning in our system, the approaches corresponding to decision-theoretic pattern recognition, like cluster analysis, principal component analysis, etc., or the neural network learning approaches can be applied. Let us note that during the learning process vague structural patterns are to be generated, i.e., the symbolic/discretized representations are to be defined and they are to be associated with the measures that can express their vagueness (as discussed in the previous section). At the same time, the dimensionality reduction is to be done. (We assume an optimized, small number of terminal symbols of the underlying formal grammar in the second step.) Thus, a centroid-based-clustering-like model with a strong dimensionality reduction is preferable. In the area of neural networks, typical models with unsupervised learning include self-organizing maps, SOM, adaptive resonance theory, ART, and Hebbian learning models. Taking into account the requirement analysis performed above, self-organizing maps enhanced with centroid-based clustering seem to be the most suitable to meet these requirements, i.e., the generating of a low-dimensional discretized representation associated with the Euclidean distance in a feature space. Our model is based on such an approach. In our model, we follow the principles of competitive learning used in SOMs, i.e., the winner-take-all principle and the limiting of the strength of each neuron. The SOM neurons correspond to the classes which, in turn, are represented by the terminal symbols (from $\Sigma_T$, cf. Definition 1) of the underlying formal grammar. The Euclidean metric is used for the defining of the activation function. The activation of a neuron is calculated as the reciprocal of the distance between the unknown pattern and the centroid of the class/cluster corresponding to this neuron. The neurons' activations are normalized to the unit vector.
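The activation computation described above can be sketched as follows. This is only a minimal illustration: it assumes that "normalized to the unit vector" means division by the Euclidean norm, and the function name `activations` is hypothetical. The distances 5, 3 and 7 are those of the example discussed next.

```python
import math

def activations(distances):
    """Return normalized reciprocal-distance activations, one per class.

    Assumes all distances are positive (a zero distance is not handled).
    """
    # Activation of a neuron: reciprocal of the distance to its class centroid.
    raw = [1.0 / d for d in distances]
    # Assumption: normalization to a unit vector uses the Euclidean (L2) norm.
    norm = math.sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw]

# Distances of a new pattern to the centroids of the classes d, e and f.
acts = activations([5.0, 3.0, 7.0])
print([round(a, 3) for a in acts])
```

Under this assumption, the class with the smallest distance (here e) receives the highest activation, and the squared activations sum to one.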
Let us consider the following example which is depicted in Fig. 2c. A new pattern is denoted with the small diamond, and the distances between this pattern and the centroids of the classes (clusters) d, e and f are equal to 5, 3 and 7, correspondingly. The activations for the classes, which correspond to the attributes of the related structural primitives (cf. Fig. 2b), are calculated as the normalized reciprocals of these distances. If the unknown pattern is recognized, then the neurons which exhibit the highest activations are chosen for the generation of a fuzzy primitive.
Now, we can present the learning phase. Firstly, for each class (represented by the corresponding neuron) the maximum diameter is set. The new (unknown) pattern is added to the class if the condition defining the (maximum) diameter of the corresponding cluster is not violated. Let us assume the following denotations.
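$N_\Omega(j)$ denotes the cluster which relates to the neuron $\Omega$ after the $j$-th iteration, and $Y$ denotes the new pattern. The learning process is defined with the formulas (1) and (2):

$$N_\Omega(j+1) = \begin{cases} N_\Omega(j), & Y \text{ has not been added to the class } \Omega \\ N_\Omega(j) \cup \{Y\}, & Y \text{ has been added to the class } \Omega \end{cases} \qquad (1)$$

and (2) …, which distinguishes the same two cases: $Y$ has not been added to the class $\Omega$, and $Y$ has been added to the class $\Omega$.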
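A minimal sketch of one learning iteration for a single class is given below. It assumes that the cluster diameter is the largest pairwise Euclidean distance between cluster members and that the centroid is the arithmetic mean of the members; both are assumptions made for illustration, and the names `Cluster`, `try_add` and `centroid` are hypothetical.

```python
import math
from dataclasses import dataclass, field

def euclid(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

@dataclass
class Cluster:
    max_diameter: float                       # maximum diameter set for the class
    members: list = field(default_factory=list)

    def try_add(self, y):
        """Add y if the maximum diameter is not violated; report success."""
        # Existing members already satisfy the constraint pairwise,
        # so it is enough to check y against each of them.
        if any(euclid(y, m) > self.max_diameter for m in self.members):
            return False                      # cluster stays unchanged
        self.members.append(y)                # cluster becomes N ∪ {Y}
        return True

    def centroid(self):
        """Arithmetic mean of the members (assumed centroid definition)."""
        if not self.members:
            raise ValueError("empty cluster has no centroid")
        n = len(self.members)
        return [sum(m[i] for m in self.members) / n
                for i in range(len(self.members[0]))]

cluster = Cluster(max_diameter=2.0)
cluster.try_add((0.0, 0.0))
cluster.try_add((1.0, 1.0))
print(cluster.try_add((3.0, 3.0)), cluster.centroid())
```

The two branches of `try_add` correspond to the two cases "Y has not been added" and "Y has been added" distinguished in the text.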

Learning parser control table with DPLL(k) grammar induction
Before we define the induction algorithm which is applied for the generating of the parser control table during the learning phase in our approach (cf. Fig. 1), we introduce the class of the grammars used. The class of DPLL(k) grammars [17] is used because it fulfills two basic methodological requirements of syntactic pattern recognition [20,22]. On the one hand, this class is of large generative power. DPLL(k) grammars are able to generate all the context-free languages and a remarkable subclass of context-sensitive languages. On the other hand, the syntax analysis model constructed for this class is efficient, i.e., the DPLL(k) parser is of the $O(n^2)$ time complexity [17].
The large descriptive power of DPLL(k) grammars results from the fact that they belong to the family of programmed grammars. The first (statically) programmed context-free grammars were proposed by Rosenkrantz in 1969 [41] in order to generate some context-sensitive languages. The increase in the generative power of these grammars has been obtained by the controlling of derivations with the help of the static fields associated with the grammar productions. The static fields are pre-specified when the grammar is defined. Then, the concept of dynamically programmed, DP, context-free grammars was introduced in 1995 [16] (also described in [22]). The dynamic fields which can be processed (by the storing and retrieving of the indices of productions) during a derivation are used in these grammars. Let us introduce them formally with the following definition.
Definition 2 A dynamically programmed, DP, (context-free) grammar is a quadruple $G = (\Sigma_N, \Sigma_T, P, S)$, where: $\Sigma_N$ is a set of nonterminal symbols; $\Sigma_T$ is a set of terminal symbols; $P$ is a set of $n$ productions of the form $p_i = (\mu_i, L_i, R_i, A_i, DCL_i)$, in which $\mu_i : \bigcup_{k=1,\ldots,n} DCL_k \longrightarrow \{TRUE, FALSE\}$ is the predicate of applicability of $p_i$; $L_i \in \Sigma_N$ and $R_i \in \Sigma^*$, where $\Sigma = \Sigma_N \cup \Sigma_T$, are the left- and right-hand sides of $p_i$, respectively; $A_i$ is the sequence of actions add, read, move performed over $\bigcup_{k=1,\ldots,n} DCL_k$; $DCL_i$ is the derivation control tape for $p_i$; $S$ is the start symbol (axiom), $S \in \Sigma_N$. □
A pair $(L_i, R_i)$ is called the core of $p_i$. For every $p_i, p_j \in P$, $i \neq j$, the core of $p_i$ differs from the core of $p_j$. For every $p_i \in P$, the derivation control tape $DCL_i$ with the head operations add, read, move is defined, where add(k, m) writes a symbol m on the cell of $DCL_k$ under the head, read(k) returns the value which has been read by the head, and move(k) moves the head of $DCL_k$ right.
A derivation is defined in the following way. At the beginning, the production (1) is applied. The production (i) is applied if its predicate of applicability is true. (The predicate is defined with the help of the operation read.) After the application of the core of the production, the sequence of actions add, move is performed for selected tapes.
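The mechanism of derivation control tapes can be illustrated with the following sketch. The toy grammar for the language a^n b^n c^n is constructed only for this illustration and is not taken from the paper; the names `ControlTape`, `PRODUCTIONS`, `apply_production` and `derive` are hypothetical. It is also assumed that add appends a symbol at the first free cell of a tape, whereas read and move operate on a separate reading head.

```python
class ControlTape:
    """Illustrative model of one derivation control tape DCL_k."""
    def __init__(self):
        self.cells = []   # symbols written so far
        self.head = 0     # position of the reading head

    def add(self, m):
        self.cells.append(m)          # assumption: write at the first free cell

    def read(self):
        # Symbol under the reading head, or None if the cell is blank.
        return self.cells[self.head] if self.head < len(self.cells) else None

    def move(self):
        self.head += 1                # move the reading head right

# index: (left-hand side, right-hand side, predicate, actions)
# predicate None means TRUE; (k, m) means read(k) == m.
PRODUCTIONS = {
    1: ("S", "ABC", None, []),
    2: ("A", "aA", None, [("add", 4, 4), ("add", 6, 6)]),
    3: ("A", "a", None, [("add", 4, 5), ("add", 6, 7)]),
    4: ("B", "bB", (4, 4), [("move", 4)]),
    5: ("B", "b", (4, 5), [("move", 4)]),
    6: ("C", "cC", (6, 6), [("move", 6)]),
    7: ("C", "c", (6, 7), [("move", 6)]),
}

def applicable(i, tapes):
    pred = PRODUCTIONS[i][2]
    return pred is None or tapes[pred[0]].read() == pred[1]

def apply_production(i, form, tapes):
    lhs, rhs, _, actions = PRODUCTIONS[i]
    assert applicable(i, tapes), f"production {i} is not applicable"
    form = form.replace(lhs, rhs, 1)  # apply the core to the leftmost lhs
    for name, k, *args in actions:    # then perform the actions
        getattr(tapes[k], name)(*args)
    return form

def derive(n):
    """Derive a^n b^n c^n (n >= 1) with the toy DP grammar above."""
    tapes = {4: ControlTape(), 6: ControlTape()}
    form = apply_production(1, "S", tapes)
    for _ in range(n - 1):
        form = apply_production(2, form, tapes)
    form = apply_production(3, form, tapes)
    # From here on, productions are selected by their predicates alone.
    for nonterminal, candidates in (("B", (4, 5)), ("C", (6, 7))):
        while nonterminal in form:
            i = next(j for j in candidates if applicable(j, tapes))
            form = apply_production(i, form, tapes)
    return form

print(derive(3))
```

In this sketch, the indices stored on the tapes while the symbols a are generated force the same number of symbols b and c to be generated later, which is how a context-free core combined with control tapes can produce a context-sensitive language.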
DP grammars and standard top-down parsable LL(k) grammars [37,42] have been incorporated into DPLL(k) grammars in [17] in order to define a polynomial parser. The LL(k) parser makes a derivational step looking ahead to the successive k-length prefixes of the input word. (Let us recall that a word/string u is a prefix of a word/string w if there is a word/string v (possibly empty) such that w = uv.) Before we present the definition of DPLL(k) grammars, we introduce the following notions and denotations.
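Let $G = (\Sigma_N, \Sigma_T, P, S)$ be a context-free (dynamically programmed) grammar, $\alpha \in \Sigma^*$, and let $|x|$ denote the length of a string $x \in \Sigma^*$. $FIRST_k(\alpha)$ denotes the set of all the terminal prefixes of the length $k$ (or of a length less than $k$, if a terminal string shorter than $k$ is derived from $\alpha$) of the strings that can be derived from $\alpha$ in the grammar $G$, i.e.,

$$FIRST_k(\alpha) = \{x \in \Sigma_T^* : (\alpha \overset{*}{\Longrightarrow} x\beta \wedge |x| = k) \vee (\alpha \overset{*}{\Longrightarrow} x \wedge |x| < k),\ \beta \in \Sigma^*\}.$$

Now, we can define DPLL(k) grammars [17].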
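The notion of the set of k-length terminal prefixes can be illustrated with the standard fixed-point computation for an ordinary context-free grammar. The sketch below takes only production cores into account (predicates and control tapes are ignored), and the representation of the grammar as a dictionary as well as the names `concat_k` and `first_k` are assumptions made for this illustration.

```python
def concat_k(xs, ys, k):
    """Concatenate two sets of terminal strings and truncate to length k."""
    return {(x + y)[:k] for x in xs for y in ys}

def first_k(grammar, terminals, k):
    """Compute the k-length terminal prefix sets for every nonterminal.

    grammar maps each nonterminal to a list of right-hand sides,
    each right-hand side being a string of one-character symbols.
    """
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for nt, right_sides in grammar.items():
            for rhs in right_sides:
                acc = {""}
                for sym in rhs:
                    step = {sym} if sym in terminals else first[sym]
                    acc = concat_k(acc, step, k)
                if not acc <= first[nt]:
                    first[nt] |= acc
                    changed = True
    return first

# Toy grammar: S -> aSb | ab
toy = {"S": ["aSb", "ab"]}
print(first_k(toy, {"a", "b"}, 2)["S"])
```

For the toy grammar generating a^n b^n, the prefixes of length 2 are aa and ab; an LL(2) parser would use such sets to choose between the two productions.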

Definition 3
Let $G = (\Sigma_N, \Sigma_T, P, S)$ be a (context-free) dynamically programmed grammar, and let $\overset{*}{\underset{core}{\Longrightarrow}}$ denote a sequence of derivation steps consisting in the applying of production cores only. G is called a DPLL(k) grammar iff the following two conditions are fulfilled.
1. …
2. For G there exists $m > 0$ such that for any leftmost derivation $S \overset{*}{\underset{core}{\Longrightarrow}} wA\alpha \overset{\pi}{\Longrightarrow} w\beta\alpha$, where $w \in \Sigma_T^*$, $A \in \Sigma_N$, $\alpha, \beta \in \Sigma^*$ and $\pi$ is the string of the indices of productions applied, if $|\pi| \geq m$ then the first symbol of $\beta\alpha$ is terminal. □
Now we can define the induction algorithm for DPLL(k) grammars. Firstly, let us introduce the preliminary notions and denotations, which concern the problem of grammatical induction [22].
A sample of a language $L$ over an alphabet $\Sigma_T$ is an ordered pair $(S_+, S_-)$, where a finite set $S_+ \subseteq L$, $S_+ \neq \emptyset$, is called a positive sample and a finite set $S_- \subseteq (\Sigma_T^* \setminus L)$ is called a negative sample (i.e., $S_+ \cap S_- = \emptyset$). We talk about text learning (induction from a positive sample) if $S_- = \emptyset$. We talk about informed learning (induction from positive and negative samples) if $S_- \neq \emptyset$. The problem of grammatical induction consists in looking for a grammar $G$ (or the control table of the underlying syntax analyzer $A$) which generates the language $L$. We say also that we look for a grammar which is consistent with the sample $(S_+, S_-)$. In the case of text learning, $G$ is consistent with $(S_+, S_-)$ iff $\forall x \in \Sigma_T^* : x \in S_+ \Rightarrow x \in L(G)$. Our induction method is, as usual in syntactic pattern recognition, a text learning method. Let us introduce the so-called polynomial specification of the language [32], which will be used by the grammatical induction rules for DPLL(k) grammars.
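The consistency condition can be expressed as a short check. In the sketch below, the language of a grammar is represented by a membership predicate `accepts`, which is a hypothetical stand-in for a parser; the treatment of a non-empty negative sample (no negative example may be accepted) is the usual condition for informed learning and is added here as an assumption.

```python
def is_consistent(accepts, positive, negative=frozenset()):
    """Check whether a recognizer is consistent with a sample (S+, S-)."""
    # Text learning: every positive example must belong to L(G).
    covers_positive = all(accepts(x) for x in positive)
    # Informed learning (assumed condition): no negative example is accepted.
    rejects_negative = not any(accepts(x) for x in negative)
    return covers_positive and rejects_negative

def accepts_anbn(word):
    """Hypothetical recognizer of the language a^n b^n, n >= 1."""
    n = len(word) // 2
    return n >= 1 and word == "a" * n + "b" * n

print(is_consistent(accepts_anbn, {"ab", "aabb"}))
print(is_consistent(accepts_anbn, {"ab", "aab"}))
```

With an empty negative sample, the check reduces to the text learning condition given above.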

Definition 4
Let $T$ be the set of terminal symbols occurring in $S_+$, and let $V$ be a set of positive integer variables. A polynomial specification of a language is of the form $L(T, V) = S_i^{\,w_j(n_k)}$, where $w_j$ is a polynomial of a variable $n_k \in V$. $S_i$ is called a polynomial structure, and it is defined in a recursive way as follows: (1) $S_i = (a_{i_1} \cdots a_{i_r})$, where $a_{i_j} \in T$ (then $S_i$ is called a basic polynomial structure), or (2) $S_i = (S_{i_1}^{\,w_{i_1}(n_{k_1})} \cdots S_{i_r}^{\,w_{i_r}(n_{k_r})})$, where $S_{i_k}$ is defined as in (1) or (2) (then $S_i$ is called a complex polynomial structure). □
Let us consider the following example of a polynomial specification of a language. Let there be given $L(T, V) = (a^n b^{2n} (ab)^{n^2+1})^n$, where $T = \{a, b\}$ is a set of terminal symbols and $V = \{n\}$ is a set of integer variables. Then, the polynomial structures are defined for $L(T, V)$ as shown in Table 1.
The structure of the polynomial specification is shown in Fig. 3.
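The exemplary polynomial specification can be made concrete with a small generator of the words of the language. The sketch handles a single variable n only; the representation of a complex structure as a list of (substructure, polynomial) pairs and the names `expand`, `S1` and `word` are assumptions made for this illustration.

```python
def expand(structure, n):
    """Expand a polynomial structure into a terminal string for a given n."""
    if isinstance(structure, str):
        return structure              # basic structure: a string of terminals
    # Complex structure: each substructure is repeated w(n) times.
    return "".join(expand(sub, n) * w(n) for sub, w in structure)

# S^1 = a^n b^{2n} (ab)^{n^2+1}: substructures a, b, ab with their polynomials.
S1 = [
    ("a", lambda n: n),
    ("b", lambda n: 2 * n),
    ("ab", lambda n: n * n + 1),
]

def word(n):
    """Word of L(T, V) = (a^n b^{2n} (ab)^{n^2+1})^n for a given n >= 1."""
    return expand(S1, n) * n          # outer exponent w^1(n) = n

print(word(1))   # abbabab
```

For n = 1 the generator yields the word abbabab, i.e., one a, two b and two repetitions of ab.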
The rules of grammatical induction are based on the concept of the polynomial specification and its structures. Let us introduce them according to the theory introduced in [31].
The Rules of DPLL(k) Grammar Induction
Let $L(T, V)$ be a polynomial specification of a language $L$. The DPLL(k) grammar $G = (\Sigma_N, \Sigma_T, P, S)$, which generates $L$, is constructed as follows.
1. $\Sigma_T := T$.
2. For every $n \in V$, two positive integer variables $v_n$ and $d_n$ are defined.
3. For every polynomial structure of $L(T, V)$, which is denoted by $S$, the nonterminal $X_S \in \Sigma_N$ and positive integer variables $c_S$ and $e_S$ are defined.
4. For every polynomial structure of $L(T, V)$, which is denoted by $S$, the following productions of $P$ are defined.
a) If $S$ is of the form $(a_1 \cdots a_r)$, then the corresponding productions are defined as shown in Table 2.
b) If $S$ is of the form $(S_1^{\,w_1(n_1)} \cdots S_r^{\,w_r(n_r)})$, then the corresponding productions are defined as shown in Table 3.
5. Let $X_{S^1}$ be the nonterminal defined for the first polynomial structure. The initial production of $P$ is defined as shown in Table 4.
Now, we can consider an example of the inducing of a DPLL(k) grammar based on the rules introduced above. Let us assume that the polynomial specification of the language is given as in the example above, i.e., $L(T, V) = (a^n b^{2n} (ab)^{n^2+1})^n$. Only one variable $n \in V$ is used in $L(T, V)$. Therefore, we define two variables $v_n$ and $d_n$. Since there are four polynomial structures: $S^1$, $S^1_1$, $S^1_2$, $S^1_3$ (cf. Fig. 3), we should define four corresponding nonterminals: $X_1$, $X_{11}$, $X_{12}$, $X_{13}$ in $\Sigma_N$, and eight variables: $c_1$, $c_{11}$, $c_{12}$, $c_{13}$, $e_1$, $e_{11}$, $e_{12}$, $e_{13}$. One can easily notice that $S^1_1$, $S^1_2$ and $S^1_3$ are basic polynomial structures, and $S^1$ is a complex polynomial structure.
The set of DPLL(k) grammar productions is generated with the help of the induction rules introduced above, as shown in Table 5.
Finally, the initial production of $P$ is defined in the following way: the predicate of applicability is true, the core is $S \longrightarrow X_1$, and the actions $A_1$ are: $c_1 := 0$; $c_{11} := 0$; $c_{12} := 0$; $c_{13} := 0$;
At the end of this section, let us consider the following example of the derivation of the word abbabab by the DPLL(k) grammar induced above. We monitor the values of the variables: $v_n$, $d_n$, $c_1$, $c_{11}$, $c_{12}$, $c_{13}$, $e_1$, $e_{11}$, $e_{12}$ and $e_{13}$ during the derivation process.

Table 1 Exemplary polynomial structures
$S^1 = a^n b^{2n} (ab)^{n^2+1}$    $w^1(n) = n$
$S^1_1 = a$    $w^1_1(n) = n$
$S^1_2 = b$    $w^1_2(n) = 2n$
$S^1_3 = ab$    $w^1_3(n) = n^2 + 1$

Fig. 3 The structure of the polynomial specification of the language $L = \{(a^n b^{2n} (ab)^{n^2+1})^n\}$

Table 2 The rules of the production definition for a basic polynomial structure

Table 3 The rules of the production definition for a complex polynomial structure
$(c_S = e_S)$ and $(d_n = false)$    $X_S \longrightarrow X_{S_1} \cdots X_{S_r} X_S$    $c_S := c_S + 1$; $v_n := v_n + 1$; $e_S := w(v_n)$

Table 4 The rules of the initial production definition

… customer each year.) The prediction of customers' electrical demand for one day ahead is performed daily. It concerns a 24-hour period, and it is defined for each hour. This forecast is made on the basis of the following two groups of input data:
• an hourly temperature forecast for two days ahead and
• an hourly insolation forecast for two days ahead.
These data define a feature vector which is read into the SPRELP system (cf. Fig. 1). The short-term electrical load, STEL, prediction in the SPRELP system was described in detail in [21]. Let us only mention here that the generating of a vague pattern is made on the basis of the feature vector with the help of a self-organizing map, as presented in Sect. 3. During the self-learning stage, the SPRELP system uses the rules of the DPLL(k) grammar induction introduced in Sect. 4.
Examples of daily actual electrical loads and short-term electrical load predictions performed by the SPRELP system for selected months are shown in Fig. 4a-d. It can be easily seen that the accuracy of predictions differs for various seasons. For the months of winter and autumn, when forecasts of temperature and insolation are less precise, the load forecast errors are bigger than for the months of spring and summer. The monthly forecast errors are shown in Fig. 4e-f.
The accuracy of short-term electrical load prediction methods which have been published recently and the accuracy of the SPRELP system forecast are included in Table 6. For the comparison of the forecast accuracy, the following metrics have been used: MAPE (Mean Absolute Percentage Error, the metric which is commonly used to evaluate the performance of STLF methods), MAE (Mean Absolute Error) and RMSE (Root Mean Square Error). (Some authors do not present MAE and/or RMSE errors.) As we can see, a variety of methodologies is used for short-term electrical load forecasting, including neural networks, evolutionary algorithms, various regression methods and hidden Markov models. Our method is the first syntactic pattern recognition method which has been used for short-term electrical load prediction. As one can notice, this method generates reasonably good forecasts with respect to the criteria of MAPE, MAE and RMSE. As we have mentioned in the introduction, the first version of the SPRELP system was presented in 2016 in [21]. The hybrid syntactic pattern recognition methodology that is the theoretical framework for the system has been developed since then, and the results
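The three accuracy metrics mentioned above can be computed as in the following sketch, which uses their standard definitions. The load values in the usage example are hypothetical and serve only to illustrate the calculation; they are not results of the SPRELP system.

```python
import math

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def mae(actual, forecast):
    """Mean Absolute Error, in the units of the load."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root Mean Square Error, in the units of the load."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Hypothetical hourly loads and forecasts (illustrative values only).
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
print(mape(actual, forecast), mae(actual, forecast), rmse(actual, forecast))
```

MAPE is scale-independent, which makes it suitable for comparing methods evaluated on different data sets, whereas MAE and RMSE are expressed in the units of the load and RMSE penalizes large errors more strongly.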

Concluding remarks
As we have mentioned in the introduction, the availability of a self-learning method is a fundamental methodological requirement when a new syntactic pattern recognition model is proposed [20,22]. The novel syntactic pattern recognition method which solves one of the crucial open problems, i.e., the losing of information about the uncertainty/ambiguity of objects during the generating of their structural patterns on the basis of feature vectors, was introduced in [21]. This pattern recognition method has been used successfully for short-term electrical load prediction [21]. However, the occurring of a variety of numerical patterns at the input of the constructed short-term electrical load prediction system has encouraged us to develop the two-step hybrid learning method. This method uses self-organizing maps to generate vague structural patterns in the first step. In the second step, the rules of the grammar induction are applied to generate the control table of the syntax analyzer which is used in the system for syntactic pattern recognition. On the one hand, the comparison of the method with other methods of short-term electrical load prediction has shown that the implemented system generates reasonably good forecasts with respect to the criteria used for the assessment of the performance of such systems, i.e., MAPE, MAE and RMSE. On the other hand, it seems that the generating of forecasts for subareas having homogeneous characteristics, instead of the making of the forecast for a really big area as a whole (as it is done in the case of the electricity distribution company mentioned), would give better results. For such a separation of areas, the graph model can be used as a representation formalism and graph parsing [13][14][15] can be applied for syntactic pattern recognition at the meta-level of the whole structure. The research into developing such a two-level structural approach is to be started, and its results will be the subject of further publications.