Evaluation of a machine learning classifier for metamodels

Modeling is a ubiquitous activity in the process of software development. In recent years, this activity has reached a high degree of intricacy, driven by the heterogeneity of the components, data sources, and tasks. The democratized use of models has led to the need for suitable machinery for mining modeling repositories. Among others, the classification of metamodels into independent categories facilitates personalized searches by boosting the visibility of metamodels. Nevertheless, the manual classification of metamodels is not only tedious but also error-prone. According to our observation, misclassification is the norm, which reduces both the reachability and the reusability of metamodels. Handling such complexity requires suitable tooling to turn raw data into practical knowledge that can help modelers with their daily tasks. In our previous work, we proposed AURORA as a machine learning classifier for metamodel repositories. In this paper, we present a thorough evaluation of the system, taking into consideration different settings as well as evaluation metrics. More importantly, we improve the original AURORA tool by changing its internal design. Experimental results demonstrate that the proposed amendment is beneficial to the classification of metamodels. We also compare our approach with two baseline algorithms, namely gradient boosted decision tree and support vector machines. Overall, AURORA outperforms the baselines with respect to various quality metrics.

the years by shifting the focus of development from coding to modeling. With the emerging trend of globalization, modeling repositories [28,38,42,43] are useful platforms to locate artifacts for reuse [39]. In recent years, an increasingly large number of metamodels has been uploaded to open repositories. Such repositories contain a valuable knowledge base that offers the wisdom of the crowd. In this context, engineers can benefit and learn from what others have done by searching the repositories. Nevertheless, locating relevant information in a repository poses technical challenges, as the related artifacts need to be appropriately classified, which until recently has been done manually. Such a task is both time-consuming and error-prone, which may lead to accuracy issues.
We aim to support modelers and developers in their metamodeling tasks. For the sake of presentation, we consider a developer who wants to create new metamodels for the database domain, even though the scenario generalizes to any domain in practice. A category consisting of metamodels for working with databases would then come in handy: it helps the developer quickly reach the most suitable ones by narrowing the search scope down to metamodels in the same category. Given such a category, the developer can browse and review the metamodels that best fit her needs. Starting from, or getting inspired by, existing metamodels helps tame the complexity of developing new artifacts by reducing the development time and increasing efficiency.
The manual annotation of metamodels is typically tedious and usually infeasible for a large amount of data. It is therefore essential to have convenient machinery to automate the process. Clustering techniques have been exploited to group similar metamodels [10]. Nevertheless, the performance of clustering techniques heavily relies on the chosen similarity measure. Moreover, clustering is considered to suffer from two main disadvantages: (i) timing performance and (ii) the difficulty of identifying the appropriate number of clusters [58].
Machine learning (ML) has gained a profound momentum in recent years [35], thereby fostering a plethora of applications and yielding a superior performance compared to that of conventional approaches. An ML algorithm attempts to acquire real-world knowledge by autonomously simulating human's learning activities [66]. ML systems can automatically extract meaningful patterns from training data. Given the circumstances, we anticipate a high potential of ML in the MDE context. Interestingly, though learning algorithms have been applied to solve different issues, such as refactoring rules [54], or domain modeling [36], still the presence of machine learning in MDE is not commensurate with its potential.
In our previous work [58], we designed and implemented AURORA, an automated supervised classifier for metamodel Repositories based on a feed-forward neuRAl network. Our work is among the first attempts to automate the classification of metamodel repositories. Metamodels are considered data points in our approach, and they are classified using supervised learning techniques. In this way, we overcome the main limitation of unsupervised clustering techniques, where it is necessary to specify the number of clusters in advance. With supervised learning, the number of categories is determined by the labeled training data. In our experiment, we used a dataset with nine categories; however, this number can easily change if more categories are present in the labeled data. In this respect, there is no need to dictate any number in advance with our approach. By experimenting on the dataset of manually classified metamodels using the tenfold cross-validation methodology, we obtained an encouraging performance in terms of success rate, precision, recall, and F1 score.
This paper further extends our work by conducting a comprehensive evaluation on different machine learning classifiers for metamodels. First, we investigate the mathematical background and propose an amendment in the internal design of AURORA, with the final aim of making it more effective. Second, we study AURORA's performance in detail, investigating different configurations by means of a thorough evaluation. Third, since there exist several supervised learning techniques and they have been widely used for different purposes, in the evaluation we compare AURORA with two well-founded classification algorithms, namely gradient boosted decision tree [31] (GBDT) and support vector machines [74] (SVM). Both techniques have been exploited in a wide range of applications, such as pattern classification [22], face recognition [65], or recommender systems [40], to name a few. However, to the best of our knowledge, they have never been utilized to classify metamodels. This motivates us to perform an investigation on the practicality of GBDT and SVM in metamodel classification.
In summary, our paper makes the following contributions:
- implementation of a supervised classifier for categorizing metamodels based on a neural network;
- investigation of various factors that affect AURORA's classification performance;
- an experimental validation on a dataset of metamodels that has been collected from GitHub and manually categorized afterward;
- comparison of AURORA with gradient boosted decision tree and support vector machines;
- publication of the conceived tool together with the parsed metadata to facilitate future research. 1

Paper's structure
We organize the paper into the following sections. The background and motivations of this work are introduced in Sect. 2. The supervised classifier for metamodels is detailed in Sect. 3. The evaluation is presented in Sect. 4, and the main results are analyzed in Sect. 5. Afterward, Sect. 6 presents some related discussions. The threats to validity of our findings are discussed in Sect. 7. We review the related work and conclude the paper in Sects. 8 and 9, respectively.

Motivations and background
As a base for further presentation, Sect. 2.1 briefly describes metamodels, while Sect. 2.2 reports a motivating example to clarify the usefulness of classifying metamodels. Afterward, in Sect. 2.3, we provide a background on clustering and classification of metamodels. Finally, Sect. 2.4 emphasizes the need to find a suitable grouping technique.

A brief introduction to metamodels
A metamodel specifies the abstract concepts of a domain, where the concepts and the relationships among them are expressed by means of a modeling infrastructure [72]. As an example, we depict in Fig. 1 a simplified representation of the Ecore metamodel. The Eclipse Modeling Framework (EMF) 2 allows modelers to design, define, and edit models. EMF contains Ecore as a meta-metamodel for describing metamodels. All the represented metaclasses are named elements. EPackage is composed of sub-packages as well as classes, while EClass is made of EStructuralFeatures, i.e., EAttributes and EReferences. Moreover, EClass can inherit structural features from other classes. EAttribute and EReference inherit the lowerbound and upperbound attributes, which define the cardinalities of structural features. EReference links the container to an EClass through the eReferenceType reference. Moreover, containment is a Boolean attribute that specifies whether or not the linked EClass is contained in the container being modeled, e.g., to trigger cascade deletion. EAttributes are typed by an EDataType instance through the eAttributeType reference.

Motivating example
Model repositories offer a valuable resource for users who are interested in acquiring knowledge from already developed modeling artifacts. In recent years, the adoption of repositories has become increasingly commonplace in different application domains, including power grids [63] and biology, e.g., BioModels Database, 3 to name a few. Since the searchability and discoverability of metamodels strongly affect the reuse of repositories, such platforms have to provide domain experts with advanced searching algorithms and facilities that are suitable for managing large datasets and enable quick identification of stored items [63]. To give an intuition about the possible size of a publicly available dataset of models, we refer to GenMyModel, 8 an online modeling platform that offers a predefined set of modeling formalisms, e.g., UML, BPMN, Flowchart. Currently, the platform hosts more than one million models, and it has more than 900,000 registered users. Concerning duplicates in modeling repositories, a recent work [50] reveals that there is a substantial amount of clones. In particular, through an empirical study to identify duplicate code files in 4.5 million GitHub repositories, including those for metamodels, the authors show that more than 40% of the files are duplicated in the analyzed repositories. In contrast, some platforms impose mechanisms to filter duplicates. For instance, using a hash function, MDEForge [11] checks for clones before adding new metamodels, which considerably reduces the number of duplicates in the platform. Altogether, this implies that the level of duplication may vary from platform to platform, depending on their purpose. Focusing on the management of metamodel repositories, grouping the stored metamodels into independent categories facilitates future lookup as well as maintenance [8].
In particular, ReMoDD allows repository contributors to tag an artifact with predefined sets of labels, for example, artifact type, keywords, or System/Software Domains. Meanwhile in ATL Zoos, repository maintainers can assign a weak domain classification to metamodels.
However, manual classification is not only time-consuming but also susceptible to error. A misspecification of metadata, such as artifact description, labels, or domains, may substantially compromise the reachability and thus the potential reuse of a metamodel. As an example, Fig. 2 depicts the metadata stored in the ATL Ecore Zoo repository for four different metamodels, i.e., xasm, AsmL, UMLDI-StateMachines, and SyncCharts. Through careful inspection, we realized that all four metamodels share many elements and that they are related to the State Machine domain. In particular, as shown in Figs. 3 and 4, there are pairwise similarities between xasm and AsmL, and between UMLDI-StateMachines and SyncCharts. From the figures, it is evident that the metamodels should be classified in the same category because of the large commonalities among them. In fact, however, three of them have been classified by the modelers of ATL Ecore Zoo into three independent domains, while xasm is left unclassified. Through this example, we see that the wrong categories deteriorate

Clustering and classification of metamodels
The motivating example in Sect. 2.2 highlights an urgent need to find suitable machinery to automate the process of grouping metamodel repositories. In other words, advanced techniques and tools to automatically organize modeling artifacts are needed [7], especially when metamodels are added or changed frequently, e.g., in cloud-based settings (https://www.eclipse.org/che/) [11]. Clustering and classification are among the critical operations used to dig deep into available data for gaining knowledge and for identifying repetitive patterns. Both have been widely used in different fields, e.g., medicine, biology, physics, and geology, to name a few. Clustering is an unsupervised learning technique, which aims at distributing a set of objects into different groups according to some similarity function [61]. Thus, the number of classes is not known ex ante, unlike in supervised learning. In software engineering, such techniques have been employed, e.g., in reverse engineering and software maintenance for categorizing software artifacts [47]. In the context of open-source software, categorizing available projects by understanding their similarities allows for reusing source code or learning how existing and similarly complex systems have been developed, thereby improving software quality. Meanwhile, classification is a supervised learning technique, which relies on the existence of predefined classes of objects and aims at understanding which class a new item belongs to [45]. In this respect, labeled training data are a prerequisite for classification, as they are needed to guide the learning process.
The possibility of applying clustering techniques has been investigated to organize reusable metamodels automatically, aiming to solve the daily task in MDEForge [11]. As an example, Fig. 5 depicts the resulting dendrogram by clustering a real set of metamodels [64] (whose characteristics are going to be introduced in detail in Sect. 4.2). In particular, there are nine categories of metamodels and they are grouped using the clustering technique presented in our previous work [10]. Modelers are provided with structured overviews of available metamodels, which instead are typically shown as mere lists of stored elements. Metamodels are automatically organized in views consisting of connected graphs. Each graph represents identified clusters, where nodes correspond to metamodels and the similarity among nodes is represented as the thickness of the edges.

On the need for a suitable grouping technique
Since both classification and clustering are suitable for our purpose, i.e., automatic grouping of metamodel repositories, we need to select the most feasible method to be deployed in practice. Although unsupervised categorization obtains encouraging results and has been exploited as the main engine to group metamodels in MDEForge [10], we realized that two main issues obstruct the adoption of the hierarchical clustering technique to categorize metamodels in the long term, as discussed below:
- Usability: Our previously conceived mechanism can automatically group metamodels according to their content and structure, relying on a similarity function [10]. We applied hierarchical agglomerative clustering, shifting the dendrogram cutoff from 0.1 to 1.0 in steps of 0.05, with 0.25 as the dendrogram cutoff value. Given such settings, the number of clusters varies enormously (cf. Fig. 5). Thus, the repository administrator, i.e., the person who configures the similarity function, has to specify a concrete number of clusters. Moreover, since various similarity functions are available, the selection of such a function might have a profound effect on the final clusters. In this context, the administrator may face difficulties in grasping the application domains formalized by the available metamodels and in getting a bird's eye view of them. In this sense, the usability of the clustering technique is substantially compromised;
- Efficiency: Some clustering techniques suffer from low efficiency. For instance, on the Ecore Zoo dataset consisting of around 300 metamodels, the clustering takes around four hours to finish on an ordinary computer [10]. Moreover, it is worth noting that every time a new metamodel needs to be categorized, the same procedure must be replicated on all the existing metamodels. Such a categorization activity turns out to be not only time-consuming but also inefficient.

Fig. 6 (a) A simple perceptron with 3 inputs; (b) A three-layer neural network
On the quest for a mechanism that transcends the limitations of hierarchical clustering and automates the grouping process by exploiting the available data, we came up with a supervised learning technique, using a neural network to learn from existing labeled data and automatically classify unknown items afterward. The AURORA tool [58] has been conceived as the first effort to realize a proper classification tool for MDEForge. In the next section, we provide more details on its design and implementation.

A neural network to classify metamodel repositories
When developers create or update a repository, they specify one or more categories to the contained metamodels, and afterward, the categories serve as a means to help other developers narrow down the search scope and efficiently approach the repository. We hypothesize that the perception of humans toward the relationship among metamodels and categories is encoded in the manually classified categories.
In this respect, we attempt to simulate human cognition of the metamodels-categories relationship, exploiting the manually categorized data. To this end, the features of a metamodel, such as classes and attributes, together with the corresponding label are used as input to tune the internal parameters, so as to eventually assign the metamodel to a specific category. Supervised learning techniques attempt to mimic human learning activities by generalizing from labeled data and deducing conclusions for new incoming data. The ability to learn from data gives neural networks a wide range of applications, e.g., pattern recognition [14] or forecasting [80]. We designed and implemented AURORA [58] based on a neural network, which learns from a training set consisting of labeled metamodels and effectively classifies incoming unlabeled ones. As a base for further presentation, Sect. 3.1 systematically explains the mathematical background behind the training process of a feed-forward neural network [62]. Afterward, Sect. 3.2 introduces the proposed architecture to realize the automated classifier. In Sect. 3.3, we briefly introduce gradient boosted decision tree and support vector machines as two baseline algorithms for our approach.

Training and testing
The atomic element of a neural network is called a perceptron. Each perceptron receives a set of inputs and produces an output. Figure 6a depicts a simplified perceptron with three inputs. The inputs $x_i$ are associated with a weight vector $\omega_1 = (\omega^{(1)}_{11}, \omega^{(1)}_{21}, \omega^{(1)}_{31})$, with one weight per input. A bias $b^{(1)}_1$ is attached to increase the flexibility in adjusting the output.
Considering Fig. 6a, the output of the perceptron is $f\left(\sum_{i=1}^{3} \omega^{(1)}_{i1} x_i + b^{(1)}_1\right)$, where $f$ is the activation function, which maps a set of inputs onto the final output. There are various activation functions, and one of the most popular is ReLU (rectified linear unit) [1], which returns 0 given a negative input and returns the input itself if it is larger than 0, i.e., $f(x) = \max(0, x)$. In this sense, ReLU is computationally efficient as it performs only a simple operation. However, it also suffers from the following setback: if too many activations are smaller than zero, then the corresponding output neurons are set to zero, and the network cannot produce correct predictions. We anticipate that the application of ReLU, or other activation functions, would be more beneficial to the final classification given a large amount of training data. Thus, we plan to consider ReLU and other activations in our future study. In this work, we make use of the sigmoid function since it is simple, yet effective. More importantly, the function has been widely exploited in machine learning studies [62].
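As an illustration of the perceptron described above, the following minimal Python sketch computes the weighted sum of three inputs plus a bias and passes it through either the sigmoid or the ReLU activation. The input, weight, and bias values are hypothetical and are chosen only to make the arithmetic easy to follow; they are not taken from Fig. 6a.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: f(z) = max(0, z)."""
    return max(0.0, z)


def perceptron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical values for the three inputs, their weights, and the bias.
x = [1.0, 0.5, -1.0]
w = [0.2, -0.4, 0.1]
b = 0.1

out_sigmoid = perceptron_output(x, w, b)                # weighted input is 0, so 0.5
out_relu = perceptron_output(x, w, b, activation=relu)  # weighted input is 0, so 0.0
```

With these values the weighted input is zero, which is exactly the boundary where ReLU stops passing its input through, while the sigmoid returns its midpoint.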
A feed-forward neural network is made of several connected layers of neurons, and the output of one layer is fed as input to the next layer's neurons, except for the output layer. The edges of the network convey information in a single direction [68], e.g., from left to right. Depending on the purpose, the number of neurons in a hidden layer, as well as the number of hidden layers, may vary. Figure 6b depicts a simple example of a neural network with one input layer, one output layer, and one hidden layer. For the sake of clarity, most of the constituent weights, biases, and activation functions of the perceptrons are omitted from the figure, and only some weights are kept. The input layer consists of $L$ neurons, corresponding to the number of input features, i.e., $X = (x_1, x_2, \ldots, x_L)$. The hidden layer consists of $M$ perceptrons. We illustrate the learning process using the pseudo-code in Algorithm 1. In this listing, an epoch is one round of learning performed on the training data, i.e., introducing all the input vectors. For each epoch, only some mini-batches of the training inputs are selected, and the training is done on these samples. At the beginning of each iteration, the training set is shuffled (Line 5) to randomize the input data, which helps to avoid getting stuck in local minima [62]. We consider a set of training inputs $(X, y)$, and $T$ is the batch size, i.e., the number of training samples fed to the network in one iteration.
The network is fed with input data X and labels y obtained after being split (Line 6), which is going to be detailed in Sect. 4.3.
The final aim of the learning process is to find a function that best fits the input data to the output data. We consider a single training sample $(X^{(t)}, y^{(t)})$, and for the sake of presentation, it is called $(X, y)$ in the rest of the paper. The difference between the real category $y$ and the predicted value $\hat{y}$ (Line 7) is computed using an error function (mean square error, also called cost function in various studies) as follows (Line 8).
The error converges to zero when $\hat{y} \approx y$, i.e., the prediction matches the real label. This boils down to minimizing the error function $E(\Omega, b)$ by choosing a suitable set of weights and biases [14], which now become the function's variables, while the input training vector and its label become constants. Such a minimization is done by applying stochastic gradient descent (SGD) [17,62]. We consider the example in Fig. 6b to illustrate how SGD works. The set of weights that bridges the input layer to the hidden layer is called $\omega^{(1)} = \{\omega^{(1)}_{ij}\}$, where $\omega^{(1)}_{ij}$ is the weight connecting neuron $i$ to neuron $j$; the set of weights from the hidden layer to the output layer is called $\omega^{(2)} = \{\omega^{(2)}_{jk}\}$; and the set of biases is called $B = \{b_j\}$. The predicted value $\hat{y}_k$ for neuron $k$ of the output layer is computed using the following formula.
where $\sigma$ is the sigmoid function, which is defined as $\sigma(z) = 1/(1 + e^{-z})$; furthermore, its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The gradient is computed by exploiting the chain rule for differentiation, and the constituent derivatives are obtained from Eqs. (1), (2), (3). In practice, the reduction of the error becomes slow, and even with a large number of epochs, there may be no improvement in the learning [55]. To this end, the cross-entropy cost function has been introduced to increase the learning efficiency, and it is defined as follows.
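For reference, the quantities discussed in this part can be sketched in their standard textbook forms for a single training sample, using the notation introduced above. The symbol $h_j$ for the output of hidden neuron $j$, the subscripts on $E$, and the normalization constants are assumptions made for this sketch, and the original numbering may differ.

```latex
\begin{align}
h_j &= \sigma\Big(\sum_{i=1}^{L} \omega^{(1)}_{ij}\, x_i + b_j\Big), \qquad
\hat{y}_k = \sigma\Big(\sum_{j=1}^{M} \omega^{(2)}_{jk}\, h_j\Big) \\
E_{\mathrm{quad}} &= \frac{1}{2} \sum_{k} \big(y_k - \hat{y}_k\big)^2 \\
E_{\mathrm{ce}} &= -\sum_{k} \big[\, y_k \ln \hat{y}_k + (1 - y_k) \ln\big(1 - \hat{y}_k\big) \,\big]
\end{align}
```

Writing $z_k$ for the weighted input of output neuron $k$, the quadratic cost gives $\partial E_{\mathrm{quad}} / \partial z_k = (\hat{y}_k - y_k)\,\sigma'(z_k)$, whereas the cross-entropy cost gives $\partial E_{\mathrm{ce}} / \partial z_k = \hat{y}_k - y_k$. The factor $\sigma'(z_k)$ is close to zero when the output neuron saturates, which is the usual explanation for why learning with the quadratic cost can slow down and why the cross-entropy cost avoids that effect.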
The model performs prediction on the training data; in each new epoch, the weights and biases are updated by gradient descent, incorporating the error [37]. The error between the actual outcome and the predicted results is used to tune the model, and eventually to minimize errors (Line 9).
where $\gamma$ is called the learning rate, or step size, which can be empirically set during the training; $t \in \{1, \ldots, T\}$ indicates an individual training sample; and $T$ is the batch size, i.e., the total number of training instances fed as input in one iteration.
The outcome of the training phase is a model with weights and biases, i.e., $\omega^{(1)}$, $\omega^{(2)}$, $B = \{b_j\}$, $b_j \in \mathbb{R}$, $j = 1, \ldots, M$ (Line 12 of Algorithm 1), that can be used to approximately produce the outputs from the inputs.
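The training procedure described above can be sketched in plain Python as follows. This is a minimal illustration and not AURORA's actual implementation: the network size, the toy data, the hyperparameter values, the function names, and the additional output-layer bias `b2` are all assumptions made for the example. It uses sigmoid units, the cross-entropy cost, and mini-batch stochastic gradient descent with a shuffle at the start of every epoch.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Forward pass: returns hidden activations h and output scores y_hat."""
    h = [sigmoid(sum(w1[i][j] * x[i] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    y_hat = [sigmoid(sum(w2[j][k] * h[j] for j in range(len(h))) + b2[k])
             for k in range(len(b2))]
    return h, y_hat


def train(samples, n_hidden, epochs=2000, batch_size=2, gamma=1.0, seed=0):
    """Mini-batch SGD with the cross-entropy cost on a one-hidden-layer network."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    b1, b2 = [0.0] * n_hidden, [0.0] * n_out
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # randomize the order of the training inputs
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            step = gamma / len(batch)  # learning rate averaged over the batch
            # Compute all gradients with the current weights before updating.
            grads = []
            for x, y in batch:
                h, y_hat = forward(x, w1, b1, w2, b2)
                # Sigmoid output with cross-entropy: the output error is y_hat - y.
                d_out = [y_hat[k] - y[k] for k in range(n_out)]
                # Back-propagate the error through the hidden sigmoid units.
                d_hid = [h[j] * (1.0 - h[j])
                         * sum(w2[j][k] * d_out[k] for k in range(n_out))
                         for j in range(n_hidden)]
                grads.append((x, h, d_out, d_hid))
            for x, h, d_out, d_hid in grads:
                for j in range(n_hidden):
                    for k in range(n_out):
                        w2[j][k] -= step * h[j] * d_out[k]
                    for i in range(n_in):
                        w1[i][j] -= step * x[i] * d_hid[j]
                    b1[j] -= step * d_hid[j]
                for k in range(n_out):
                    b2[k] -= step * d_out[k]
    return w1, b1, w2, b2


def predict(x, model):
    """Index of the output neuron with the highest score."""
    _, y_hat = forward(x, *model)
    return max(range(len(y_hat)), key=lambda k: y_hat[k])


# Hypothetical toy data: two-dimensional feature vectors with one-hot labels.
TOY = [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [1.0, 0.0]),
       ([0.0, 1.0], [0.0, 1.0]), ([0.1, 0.9], [0.0, 1.0])]
model = train(TOY, n_hidden=3)
predictions = [predict(x, model) for x, _ in TOY]
```

The returned tuple plays the role of the trained model: the weights and biases that the testing phase reuses without further updates.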

As shown in Algorithm 2, in contrast to learning, the testing process is much more straightforward: only the testing data, i.e., the metamodels that need to be classified, are fed to the model that has been obtained from the training phase. The outcome is the predicted labels for the input data.

Summary.
Learning is the process of refining the constituent weights and biases, by introducing all the instances of the training data. Meanwhile, testing is to predict a label for an input metamodel by means of the weights and biases obtained through learning.
Neural networks can also deal with multi-label classification by imposing some adaptations, e.g., using Bernoulli distributed variables instead of a single categorical variable. In this respect, the labels that belong to an instance have a higher ranking than those that do not belong to the instance [81]. In the scope of this work, however, we focus on multiclass (or multinomial) classification [2], i.e., each instance of the data needs to be classified into one of the nine categories. During the prediction phase, the classifier returns a real score for each output of the neural network in Fig. 6b, and we assign the output with the highest score to the metamodel. Eventually, there is exactly one label for each testing metamodel.
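The difference between the two decision rules can be sketched as follows. The category names and scores are hypothetical, and the fixed threshold used for the multi-label case is one common convention assumed here for illustration rather than something prescribed by the approach.

```python
def multiclass_label(scores, categories):
    """Multiclass: return the single category with the highest output score."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return categories[best]


def multilabel_labels(scores, categories, threshold=0.5):
    """Multi-label (assumed rule): keep every category scoring at or above the threshold."""
    return [c for s, c in zip(scores, categories) if s >= threshold]


# Hypothetical category names and output scores for one testing metamodel.
CATEGORIES = ["category_1", "category_2", "category_3"]
SCORES = [0.10, 0.70, 0.55]

single = multiclass_label(SCORES, CATEGORIES)
several = multilabel_labels(SCORES, CATEGORIES)
```

In the multiclass setting used in this work only `single` is needed, since each metamodel receives exactly one label.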

Realization
The framework for classification is depicted in Fig. 7 and it consists of the training and deployment phases which are going to be clarified in this subsection.
The training phase starts with collecting metamodels from public repositories like GitHub (1) using the Crawler component (2). In particular, while metamodels can be collected from different sources, in the scope of this paper we make use of GitHub as the main data provider, since the platform offers a huge number of metamodel repositories. Afterward, expert modelers are asked to inspect the metamodels and manually classify them into independent groups based on their experience. By means of labeled data, including metamodels and labels assigned by humans, the training process is in charge of identifying the right weights to configure the Weights Calculator component. Then, the NLP Processor component (3) parses terms from the input metamodel and normalizes the extracted raw terms by performing the following natural language processing (NLP) steps on all the named elements contained in the collected metamodels:
- Stemming: this phase is conducted to reduce all words that share the same root to a common term;
- Lemmatization: it is performed to remove inflectional endings, attempting to restore words to their dictionary form;
- Stop words removal: articles or supporting words such as prepositions, e.g., the, is, to, which, are removed from the corpus.
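The normalization steps listed above can be illustrated with a deliberately simplified sketch. The stop-word list, the lemma lookup table, and the suffix-stripping rule below are toy stand-ins invented for this example; a real pipeline would rely on an established stemmer and lemmatizer, and the order of the steps is simplified here so that stop words are dropped first.

```python
# Hypothetical, minimal resources used only for illustration.
STOP_WORDS = {"the", "is", "to", "which", "a", "an", "of"}
LEMMAS = {"written": "write", "children": "child"}  # toy dictionary lookup


def stem(word):
    """Toy stemmer: strip a trailing plural 's' (not a real stemming algorithm)."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(terms):
    """Lowercase the terms, drop stop words, then apply lemma lookup or stemming."""
    normalized = []
    for term in terms:
        t = term.lower()
        if t in STOP_WORDS:
            continue
        # Use the dictionary form when known, otherwise fall back to the stem.
        normalized.append(LEMMAS.get(t, stem(t)))
    return normalized


raw_terms = ["Books", "the", "Writers", "title", "is", "written"]
clean_terms = normalize(raw_terms)
```

After normalization, a metaclass name and a feature name that differ only by inflection map to the same term, which keeps the feature set compact.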
The Data Encoder component (4) takes as input the processed metamodels and encodes them as matrices, enabling the subsequent phases of the approach. Input configuration directives select the scheme to be used for encoding the metamodels, i.e., uni-gram, bi-gram, or n-gram, as explained below.
- uni-gram represents a simple collection of terms which do not preserve any metamodel structure;
- bi-gram partly represents the structure of the metamodels by considering the containment relations between named elements;
- n-gram represents a metamodel structure by enriching bi-grams with attribute properties, e.g., typing information, cardinality, and reference multiplicity.
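To make the three encoding schemes more concrete, the sketch below derives terms from a small in-memory description of a Library-like metamodel. The dictionary layout, the feature names and types, the cardinality strings, and the `#` separator are all assumptions made for illustration; the concrete term format produced by AURORA is not reproduced here.

```python
# Hypothetical, simplified description of a Library-like metamodel:
# class name -> list of (feature name, type, cardinality, containment).
LIBRARY = {
    "package": "library",
    "classes": {
        "Library": [("name", "EString", "0..1", False),
                    ("books", "Book", "0..*", True),
                    ("writers", "Writer", "0..*", True)],
        "Book": [("title", "EString", "0..1", False),
                 ("pages", "EInt", "0..1", False),
                 ("writers", "Writer", "0..*", False)],
        "Writer": [("firstname", "EString", "0..1", False),
                   ("lastname", "EString", "0..1", False),
                   ("books", "Book", "0..*", False)],
    },
}


def unigrams(mm):
    """Flat collection of names; no structure is preserved."""
    terms = [mm["package"]]
    for cls, features in mm["classes"].items():
        terms.append(cls)
        terms.extend(name for name, _, _, _ in features)
    return terms


def bigrams(mm):
    """Container#contained pairs capture the containment relations."""
    pkg = mm["package"]
    terms = []
    for cls, features in mm["classes"].items():
        terms.append(pkg + "#" + cls)
        terms.extend(cls + "#" + name for name, _, _, _ in features)
    return terms


def ngrams(mm):
    """Full path of each feature enriched with type, cardinality, containment."""
    pkg = mm["package"]
    terms = []
    for cls, features in mm["classes"].items():
        for name, typ, card, contained in features:
            terms.append("#".join([pkg, cls, name, typ, card, str(contained).upper()]))
    return terms


uni, bi, n = unigrams(LIBRARY), bigrams(LIBRARY), ngrams(LIBRARY)
```

Moving from uni-grams to n-grams makes each term more specific, so the same vocabulary carries more structural information but individual terms recur less often across metamodels.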
To filter out terms that do not occur frequently, cutoff values are given: they define the minimum number of occurrences of a term across all the metamodels. Less frequent terms represent singularities that must be avoided. On the other hand, higher cutoff values give rise to accuracy problems. Afterward, the Data Encoder component (4) yields filtered feature vectors whose terms have a frequency higher than the cutoff value. The resulting feature vectors form the so-called term document matrix (TDM), a two-dimensional matrix where each row represents a metamodel in the repository and each column an extracted term, so that entry (i, j) represents the frequency of term j in metamodel i.
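A term document matrix with a frequency cutoff can be sketched in a few lines. The function name, the toy documents, and the alphabetical ordering of the columns are assumptions made for this example; the filtering rule follows the description above, keeping only terms whose total frequency is higher than the cutoff.

```python
from collections import Counter


def term_document_matrix(documents, cutoff):
    """Build a TDM from lists of terms, keeping terms with total frequency > cutoff.

    Returns the retained vocabulary (column order) and one row per document,
    where entry (i, j) is the frequency of term j in document i.
    """
    totals = Counter(term for doc in documents for term in doc)
    vocabulary = sorted(term for term, count in totals.items() if count > cutoff)
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical normalized terms extracted from three metamodels.
DOCS = [
    ["book", "writer", "book"],
    ["book", "table"],
    ["table", "column", "table"],
]
vocab, tdm = term_document_matrix(DOCS, cutoff=1)
```

With a cutoff of 1, the terms `writer` and `column` occur only once overall and are filtered out, so each metamodel is described by just two columns.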
Subsequently, the feature vectors and the corresponding labels are used to train AURORA through the Weights Calculator component (5). The result consists of the weights and biases that are going to be used by the neural network to classify any incoming metamodel in the deployment phase. Whenever an unlabeled metamodel (6) is fed to AURORA, it is parsed through the NLP Processor (7) and the Data Encoder (8) components; the generated feature vector is then passed to the Classifier component (9) that, in turn, performs the needed classification and assigns a label to the vector. Finally, the outcome is a label that classifies the metamodel given as input.
We take an example to illustrate how the data processing phase works by considering the Library metamodel shown in Fig. 8. The metamodel is used to generate models, where a Library may contain a set of Books together with Writers. A writer in turn can write books, and a book can be written by several writers. A writer is identified by a firstname and a lastname, while a book is characterized by a title together with the number of pages.
The constituent terms in the metamodel are extracted and encoded using the aforementioned encodings. We summarize in Table 1 the final results by showing all the terms of Library, as well as the corresponding encodings for the raw vectors and the normalized feature vectors. Each element name is characterized by an identifier and its type. For example, the package name library is (i) individually encoded in uni-grams; (ii) in bi-grams as part of the terms of the metaclasses #2, #6, and #10; and (iii) in n-grams as part of the terms of the structural features #3, #4, #5, #7, #8, #9, #11, #12, and #13. With respect to the normalized feature vectors, the terms need to be parsed through the following phases: stemming, lemmatization, and stop words removal. As a result, the term books is converted into book, which covers both the metaclass Book and the feature books. Furthermore, additional information about cardinalities (e.g., 0.1) and whether the node is subject to containment (e.g., TRUE) is eventually incorporated.

Baselines: gradient boosted decision tree and support vector machines
In recent years, several machine learning approaches [45] have been proposed to address the classification problem. In this section, we review two techniques, i.e., gradient boosted decision tree [31] and support vector machines [78], that are exploited as additional classification engines for metamodels in this work. These two techniques have been widely adopted [22,53,65], and we further study their performance in MDE for the classification of metamodels.
- In machine learning, weak classifiers are those that are modified to enhance their prediction capability [49]. This can be done by restraining classifiers from using all the available information, or by simplifying their internal structure. Such an intervention enables classifiers to learn from different parts of a training set, resulting in a reduction in error correlation. Gradient boosted decision tree [31] (GBDT) is an ensemble model of decision trees to predict a target label. During its learning phase, GBDT populates a series of weak decision trees, where each of them attempts to correct errors from the previous one. In other words, a GBDT learns by relying on the following assumption: the observations that a weak decision tree classifier can handle are left, while the remaining difficult observations are subject to developing new weak decision trees.
- The support vector machine algorithm [74] (SVM) can also be used to classify metamodels. First, it represents metamodels as labeled points in an N-dimensional space; then, to learn from data, it extracts hyperplanes that best split the points among the categories. A hyperplane can be understood as a border between two classes. Its dimension varies according to the number of input features, i.e., N. In this way, the learning phase is an optimization process that attempts to find hyperplanes maximizing the distance among points in independent classes. Whenever an unlabeled metamodel is fed into SVM, it is transformed into a data point; then, the generated hyperplanes are used to assign a label to the metamodel.
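To give an intuition of how a separating hyperplane is learned, the following sketch trains a linear soft-margin SVM on a binary toy problem by subgradient descent on the hinge loss. It is only an illustration under stated assumptions: the data points, the hyperparameters, and the function names are hypothetical, the sketch covers two classes only (a multiclass setting would need, e.g., a one-vs-rest scheme on top), and it is not the SVM implementation used in the evaluation.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Soft-margin linear SVM trained by subgradient descent on the hinge loss.

    Labels must be +1 or -1. Returns the weight vector w and the bias b of the
    separating hyperplane w . x + b = 0.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: move the hyperplane.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only apply the regularization.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def svm_predict(w, b, x):
    """Assign +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable two-dimensional data points.
POINTS = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
          [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
LABELS = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(POINTS, LABELS)
train_predictions = [svm_predict(w, b, x) for x in POINTS]
```

The regularization term `lam` trades a wider margin against training errors; on separable data such as this toy set, the learned hyperplane classifies every training point correctly.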
In the evaluation, we investigate the performance of both classifiers in comparison with our proposed approach.
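The error-correcting behavior of GBDT described above can be illustrated with a minimal sketch. It assumes a squared-error loss, one-dimensional inputs, and depth-one trees (stumps), so that each new tree is simply fitted to the residuals left by the previous ones. The function names (fit_stump, boost, predict) and the toy data are hypothetical; this is not the implementation used in the experiments.

```python
def fit_stump(xs, residuals):
    """Fit a depth-one regression tree to the residuals by minimizing squared error."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), max(xs), mean, mean)  # fallback: no split, predict the mean
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_output(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Build a series of weak trees, each one fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_output(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_output(s, x) for s in stumps)


# Toy data (hypothetical): one feature, binary target encoded as 0/1.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
labels = [1 if predict(model, x) >= 0.5 else 0 for x in xs]
```

Each round shrinks the residuals of the observations that are still predicted poorly, which is the intuition behind the assumption stated above.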

Evaluation
In this section, we analyze the performance of our proposed classifier employing a thorough evaluation based on two datasets as well as various quality indicators. The section is organized as follows: Sect. 4.1 presents the research questions. The dataset used for evaluation is introduced in Sect. 4.2. Section 4.3 explains the applied methodology, and Sect. 4.4 recalls the evaluation metrics.

Research questions
We study the performance of our proposed approach by considering the following research questions:
- RQ 1: Which encoding scheme is most useful for classifying metamodel repositories? We experiment with various configurations to study which encoding technique among uni-gram, bi-gram, and n-gram facilitates a better prediction performance.
- RQ 2: How does the cutoff value affect the classification's outcome? Parsing metamodels with a small cutoff value results in a big feature set as given in Table 2, thus increasing the computational complexity. We are interested in ascertaining whether a compact feature set can yield a good trade-off between accuracy and efficiency.
- RQ 3: Does the cross-entropy cost function account for a gain in performance? In our previous work [58], we employed the quadratic cost function, and still, the obtained results were promising. In this paper, we incorporated the cross-entropy error function, which is supposed to be more effective and efficient in computing errors, with the ultimate aim of further improving the performance.
- RQ 4: Is an increase in the network's width or depth beneficial to the performance? We investigate whether changing the depth or width of the network significantly influences the prediction performance.
- RQ 5: How does AURORA compare with GBDT and SVM? Neural networks are a well-founded machine learning technique. In this paper, we compare AURORA with two more supervised learning algorithms, namely gradient boosted decision trees (GBDT) and support vector machines (SVM).

Dataset
We made use of an existing dataset (https://doi.org/10.5281/zenodo.2585456) which consists of metamodels related to domain-specific languages (DSLs). The metamodels were inspected and manually classified by expert modelers based on their experience [64]. The final dataset consists of 555 metamodels distributed in nine independent categories. The results of the classification process are summarized in Fig. 9a, while Fig. 9b depicts the distribution of the metamodels with respect to the number of metaclasses and the number of features, i.e., attributes and references.
According to Fig. 9a, the most populous category is State machine DSLs with 159 metamodels, accounting for 29% of the total amount. Meanwhile, the smallest category is Issue Tracker DSLs with only 7 metamodels. Such a category may render the classification difficult since there is a limited amount of data for training. As shown in Fig. 9b, most of the metamodels contain a low number of metaclasses as well as attributes and references, i.e., they are concentrated in the bottom left corner of the 3D space. Only those of the category Review system DSLs contain a large number of metaclasses, and they are scattered across the 3D space.
The metamodels have been manually classified following human perception. In practice, such a classification might vary from modeler to modeler when it comes to the application domain. For simplicity, we consider the application domain as the engineered intent of the metamodels, even if some could be classified into various categories, depending on the granularity level. For instance, Office-tools DSLs and State Machine DSLs can be considered general-purpose with respect to DSLs, which by definition target specific domains.
Moreover, there might be an overlap between the categories; for instance, the Review systems DSLs category appears to have much in common with Bibliography DSLs. Data drives the performance of AURORA, and, once trained, the tool is expected to do exactly what has been dictated by the training data. In this way, we assume that depending on the input data, we can get different granularity levels in the classification outcomes.
To discard those terms that are considered sparse for the whole corpus (and as such not too representative), the following cutoff values are considered: k = {2, 4, 6, 8}. The scheme is inspired by one of the preprocessing phases done in the domain of Document Representation to reduce the dimensionality of a set of documents [77]. Larger values of k would miss the intrinsic features of the dataset by excluding too many terms. Table 2 describes the network configuration used in the experiments. The number of input features corresponds to the number of perceptrons in the first layer of the neural network depicted in Fig. 6b. It is worth noting that using n-gram as the encoding scheme results in the largest number of input perceptrons, i.e., 10,837 neurons when k = 2, while uni-gram needs far fewer neurons to represent a metamodel, e.g., 1,240 neurons when k = 8. For the most basic configuration, we use only 10 perceptrons in the second layer. In Sect. 5.4, such number is going to be adjusted to

Experimental settings
The tenfold cross-validation technique has been utilized to study AURORA's performance, as it is regarded as one of the best methods for model selection in machine learning [44].
In such a setting, the original dataset is divided into ten equal parts, so-called folds, numbered from 1 to 10. For each test configuration, the validation is conducted in ten rounds. For each round i, fold i is used as the test set and the remaining nine folds, i.e., fold j (j ≠ i, 1 ≤ i, j ≤ 10), are combined to form a training set. The training data is used to build the neural network following the paradigm presented in Sect. 3, i.e., to identify a set of weights and biases that best produce the corresponding outputs, given the inputs, while the testing data is used to validate the system. Like in the training set, every metamodel in the test set has a category (label) associated with it. However, for the validation phase, such a label is removed and saved as ground-truth data. Only the feature vector of a testing metamodel is built using one of the encoding schemes mentioned in Sect. 4.2. Each vector is then fed into the network, which eventually generates a label. This label is compared against the one stored as ground-truth data to see if they match. Ideally, the labels for all testing metamodels match those stored as ground-truth data.
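The round structure described above can be sketched as follows. This is a minimal illustration, assuming the metamodels have already been shuffled and are assigned to folds in a round-robin fashion; the name k_fold_rounds and the placeholder identifiers are hypothetical. Since 555 is not divisible by 10, the folds are only approximately equal in size (56 or 55 metamodels each).

```python
def k_fold_rounds(items, k=10):
    """Yield (round_number, training_set, test_set) for k-fold cross-validation."""
    # Round-robin assignment to folds; assumes the items were shuffled beforehand.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test_set = folds[i]
        training_set = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield i + 1, training_set, test_set


# Placeholder identifiers standing in for the 555 metamodels of the dataset.
metamodels = [f"mm_{n}" for n in range(555)]
rounds = list(k_fold_rounds(metamodels))

for round_number, training_set, test_set in rounds:
    # In each round, the two sets are disjoint and together cover the whole dataset.
    assert not set(training_set) & set(test_set)
    assert len(training_set) + len(test_set) == len(metamodels)
```

Over the ten rounds, every metamodel is used exactly once for testing and nine times for training.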
The validation process simulates a real deployment as follows. The training data corresponds to the repositories being available for mining; they are collected from different sources and manually classified by humans, i.e., when developers create a metamodel. Meanwhile, the testing data represents the repositories that need to be classified. Thus, the validation process attempts to investigate whether AURORA applies to real deployment, e.g., being suitable for classifying metamodels crawled from GitHub. For the sake of reproducibility, we published in GitHub the AURORA tool together with all metadata parsed from the original dataset. 11
Fig. 10 Relationship between TP_c, TN_c, FP_c, and FN_c [69]

Metrics
We consider S as a set of metamodels that need to be classified into C independent categories. For a category c ∈ C, we denote by Ŷ_c the set of metamodels that have been classified to c by any of the recommendation engines, i.e., GBDT, SVM, and AURORA (cf. Sects. 3.1 and 3.3), and by Y_c the actual set of metamodels of c in the ground-truth data, i.e., the data manually labeled by humans.
First, we explain the notations used in this paper by referring to the Venn diagram in Fig. 10. Given a category c ∈ C, we consider the following definitions:
- True positives (TP_c): the metamodels that are classified to c and belong to c in the ground-truth data, i.e., those in both Ŷ_c and Y_c;
- False positives (FP_c): the metamodels that are classified to c but do not belong to c, i.e., those in Ŷ_c but not in Y_c;
- False negatives (FN_c): the metamodels that belong to c but are not classified to c, i.e., those in Y_c but not in Ŷ_c;
- True negatives (TN_c): the metamodels that neither belong to c nor are classified to c.
In the scope of this paper, we are interested in understanding how well the produced categories match those that have been manually classified. In particular, we compared the results obtained by running the three tools, i.e., GBDT, SVM, and AURORA, against the ground truth using accuracy, precision, recall, F1-score, ROC, and AUC, which are defined as follows.

Accuracy
Accuracy is the ratio of the number of correctly classified items to the total number of items, regardless of their categories.

Precision and Recall
Precision is the ratio of the number of metamodels correctly classified to category c to the total number of metamodels classified to this category. Recall is the ratio of the number of items correctly classified to c to the total number of items of c in the ground-truth data. Recall is also known as true positive rate (TPR).

F1-score (or F1)
The metric is a combination of precision and recall.

False positive rate (FPR)
This metric measures the ratio of the number of items that are falsely classified into category c, to the total number of items that are either correctly not classified, or falsely classified into the category.
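For reference, the verbal definitions above correspond to the following standard formulas, written with the counts TP_c, FP_c, FN_c, and TN_c of Fig. 10. The block is a sketch under the assumption that the standard textbook definitions are intended, with F1 being the harmonic mean of precision and recall.

```latex
\begin{align}
\mathit{accuracy}    &= \frac{\sum_{c \in C} TP_c}{|S|} \\
\mathit{precision}_c &= \frac{TP_c}{TP_c + FP_c} \\
\mathit{recall}_c    &= \mathit{TPR}_c = \frac{TP_c}{TP_c + FN_c} \\
F_{1,c}              &= \frac{2 \cdot \mathit{precision}_c \cdot \mathit{recall}_c}
                             {\mathit{precision}_c + \mathit{recall}_c} \\
\mathit{FPR}_c       &= \frac{FP_c}{FP_c + TN_c}
\end{align}
```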
Finally, we compute the evaluation metrics by averaging out the corresponding scores for all the categories.
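The computation of the per-category scores and their average can be sketched as follows. The function name evaluate, the category names, and the toy labels are hypothetical; the sketch assumes the standard definitions of the metrics and a plain (unweighted) average over the categories, and it returns 0.0 whenever a denominator is zero.

```python
def evaluate(true_labels, predicted_labels):
    """Return overall accuracy, per-category scores, and their unweighted averages."""
    pairs = list(zip(true_labels, predicted_labels))
    total = len(pairs)
    categories = sorted(set(true_labels) | set(predicted_labels))
    accuracy = sum(1 for t, p in pairs if t == p) / total

    per_category = {}
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        fpr = fp / (fp + tn) if fp + tn else 0.0
        per_category[c] = {"precision": precision, "recall": recall,
                           "f1": f1, "fpr": fpr}

    # Average each score over all the categories.
    averaged = {name: sum(scores[name] for scores in per_category.values()) / len(categories)
                for name in ("precision", "recall", "f1", "fpr")}
    return accuracy, per_category, averaged


# Toy example with three hypothetical categories.
truth = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]
accuracy, per_category, averaged = evaluate(truth, predicted)
```

Note that with this kind of averaging, a small category such as Issue Tracker DSLs weighs as much as a populous one in the final score.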

ROC curve and AUC
We represent the relationship between FPR and TPR using a receiver operating characteristic (ROC) curve in a 2D space [27], where the x-axis and the y-axis correspond to FPR and TPR, respectively. An ROC curve spans from (0,0) to (1,1) in the space.
In this paper, we deal with multi-class classification, where a metamodel needs to be assigned to one (and only one) of the nine considered categories. However, ROC curves are by nature used for binary classification. Therefore, we need to transform the multi-class problem into binary classification using the following workaround. Each category c_i among the nine considered categories is in turn considered the positive class, while the remaining eight categories are considered negative. That means, if a metamodel m that belongs to c_i is correctly classified into c_i, we count it as true. Similarly, any metamodel that does not belong to c_i and is not classified to c_i is also marked as true. In contrast, if m is classified into any of the negative classes, it is counted as false; likewise, any metamodel that does not belong to c_i but is classified to c_i is marked as false. In this way, instead of multi-class classification, we end up with binary classification, where there are two distinct classes: True (i.e., true positive and true negative) and False (i.e., false positive and false negative). A good classifier should assign a high probability to the former while giving a low probability to the latter. In other words, the classifier is expected to have a low FPR and obtain a high TPR. We can define a threshold on the predicted probability above which a class is accepted with high confidence. In this way, an ROC curve is plotted in the 2D space by varying such a threshold. 12 This allows one to easily comprehend the performance at the different decision levels at which a classifier's outcome is accepted.
Altogether, this implies that an ROC curve located closer to the upper left corner corresponds to a better prediction performance than one that resides near the lower right corner in the 2D space. The area under the ROC curve (AUC) is an explicit indication of how good a classifier is. A dummy classifier, i.e., one that randomly assigns a label to metamodels, has an AUC value of 0.5, while a perfect classifier has an AUC value of 1.0.
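The threshold sweep behind an ROC curve and the resulting AUC can be sketched for a single category as follows. The scores stand for the probabilities that a classifier assigns to the positive class c_i, and is_positive marks the metamodels that actually belong to c_i; both lists, as well as the names roc_points and auc, are hypothetical. The area is approximated with the trapezoidal rule, which is one common choice and an assumption here.

```python
def roc_points(scores, is_positive):
    """Return the (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(is_positive)
    negatives = len(is_positive) - positives  # both counts are assumed to be non-zero
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, pos in zip(scores, is_positive) if s >= threshold and pos)
        fp = sum(1 for s, pos in zip(scores, is_positive) if s >= threshold and not pos)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# One-vs-rest view for a single category: True marks members of the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
is_positive = [True, True, False, True, False, False]
points = roc_points(scores, is_positive)
area = auc(points)
```

Repeating the computation with each of the nine categories as the positive class yields one curve and one AUC value per category.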
In the next section, we present in detail the experimental results by referring to the research questions in Sect. 4.1.

Experimental results
By executing the experiments, we computed accuracy, precision, recall, and F 1 scores for every test configuration with respect to the error function (EF) in two cases: quadratic (QD) and cross-entropy (CE). The final results are depicted in Table 3. We answer the aforementioned research questions by referring to the table. For RQ 1 and RQ 2 , we consider only the configurations with the quadratic error function. Meanwhile, for RQ 3 , the results obtained by using the cross-entropy function are incorporated to study the effect of changing the error function. In RQ 4 , we analyze whether an increase in the number of neurons or network depth enhances the overall accuracy. Finally, we compare AURORA with GBDT and SVM in RQ 5 .

RQ 1 : Which encoding scheme is most useful for classifying metamodel repositories?
We evaluate which technique among uni-gram, bi-gram, and n-gram yields the best prediction outcome, given that the quadratic error (QD) function is used. According to Table 3, between bi-gram and n-gram, the former always contributes to better performance. For instance, when k = 2, using n-gram yields 84.30% as the accuracy, while the corresponding value by bi-gram is 85.90%. This also applies to the other cutoff values, i.e., k = {4, 6, 8}. The lowest accuracy for bi-gram obtained by k = 8 is 77.40%, which is better than 72.40%, the corresponding value by n-gram. By the other quality metrics, i.e., precision, recall, and F1 score, bi-gram also brings a superior prediction outcome. The best precision obtained with bi-gram is 0.899 when k = 2, while using n-gram yields 0.841 as precision. Similarly, the best recall by bi-gram and n-gram is 0.782 and 0.759, respectively. The best F1 scores by using bi-gram and n-gram are 0.828 and 0.798, respectively, implying that bi-gram surpasses n-gram by all evaluation metrics.
Among the three encodings, uni-gram achieves the best performance concerning all quality metrics. In particular, the maximum accuracy yielded by using uni-gram is 93.44%, which is considerably higher than 85.90% and 84.30% by bi-gram and n-gram, respectively. Consistent with the accuracy scores, uni-gram is also the most useful encoding in terms of precision, which is larger than 0.90. Moreover, using uni-gram with a larger cutoff value can still obtain a good performance. For example, when k = 4 the accuracy is 95.60%, which is better than 94.30%, the corresponding score when k = 2.
Finally, we investigate whether the observed differences are statistically significant. To this end, we perform ANOVA tests on the results obtained by the encoding schemes. In particular, we compare the three encodings pairwise, resulting in three comparison configurations as follows. C1: n-gram vs. bi-gram; C2: uni-gram vs. bi-gram; and C3: uni-gram vs. n-gram. Table 4 reports which encodings have significant differences. For each pairwise comparison, the column diff shows the actual difference between the means, the p-adj column represents the adjusted p-value, while the lwr and upr columns collect the lower-end and upper-end points of the interval, respectively. Two groups are considered different, i.e., the null hypothesis can be rejected, if the interval between lwr and upr does not include 0, or if p-adj < 0.05. For instance, there is a significant difference in precision values between uni-gram and n-gram, i.e., adjusted p-value p-adj = 0.001, as well as between uni-gram and bi-gram, i.e., p-adj = 0.008, while there is no significant difference in precision values between n-gram and bi-gram encodings, i.e., p-adj = 0.485.
Besides the three aforementioned encoding algorithms, i.e., uni-gram, bi-gram, and n-gram, we also tried to embed word2vec [52] and BERT [24] as two additional encodings. Nevertheless, by means of an empirical evaluation, we realized that using AURORA with data parsed by word2vec yields a mediocre performance. This happens because word2vec uses a context window with an arbitrary size to represent the terms of a document. Unlike the n-gram encoding, which has been specifically tailored to represent metamodels, the technique places a term within a context that does not reflect the structure of metamodels. Meanwhile, BERT cannot be used to parse metamodels as it generates a completely different output, whose format is incompatible with AURORA. We anticipate that BERT can be exploited as an independent classifier by training with the bert-based model and the uni-gram representation of each training metamodel. Nevertheless, this is out of the scope of the current work. Thus, we consider the issue as future work.
Answer to RQ 1 . Uni-gram is the most effective encoding scheme with respect to all metrics; bi-gram helps obtain a better outcome than n-gram does. The differences in performance among the encodings are statistically significant.
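The decision rule applied to the pairwise comparisons (reject the null hypothesis if the interval between lwr and upr excludes 0, or if p-adj < 0.05) can be written down as a small helper. The function name is_significant is hypothetical; the p-adj values are the ones reported above for precision, while the interval bounds are not repeated here and are therefore left optional.

```python
def is_significant(p_adj, lwr=None, upr=None, alpha=0.05):
    """Apply the decision rule used for the pairwise comparisons."""
    if lwr is not None and upr is not None and (lwr > 0 or upr < 0):
        return True  # the interval [lwr, upr] does not include 0
    return p_adj < alpha


# Adjusted p-values for precision, as reported in the text.
precision_p_adj = {
    "uni-gram vs. n-gram": 0.001,
    "uni-gram vs. bi-gram": 0.008,
    "n-gram vs. bi-gram": 0.485,
}
decisions = {pair: is_significant(p) for pair, p in precision_p_adj.items()}
```

With these values, the first two comparisons are flagged as significant and the third one is not.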

RQ 2 : How does the cutoff value affect the classification's outcome?
Given that the quadratic loss function is used, we study how the cutoff value influences the performance of all the encoding schemes in Table 3. First, with uni-gram, the accuracy fluctuates among the cutoff values; however, there is no distinct difference. For instance, the maximum score obtained when k = 2 is 93.44%, which is slightly better than 93.00%, the accuracy when k = 6. When the cutoff value increases, the accuracy changes only slightly; for example, it decreases to 91.70% when k = 8. Concerning precision, we see a similar pattern. Though the scores vary with the cutoff values, there is no significant improvement in precision when we incorporate more features as the input data. The best precision, 0.941, is obtained when k = 2, while the worst precision, 0.903, is obtained when k = 8. By recall and F1, the best performance is seen when k = 4, while by other cutoff values, there is no clear distinction between the outcomes.
Considering bi-gram, it is evident that increasing the cutoff value induces a reduction in performance concerning all evaluation metrics. For example, when k = 2, AURORA achieves an accuracy of 85.90%, while it gets 80.60% correctly classified items when k = 4. The precision, recall, and F1 scores gained by using bi-gram also decrease considerably as the cutoff value grows.
The results demonstrate a clear difference in performance when using n-gram as the encoding scheme, i.e., all the evaluation metrics considerably decrease when k increases. For example, the accuracy changes from 85.60% to 80.30% when we change k from 2 to 4. The accuracy declines further down to 75.10% when k = 6. By other metrics, i.e., precision, recall, and F1 scores, the same trend can be observed: they substantially fall when we use fewer features to feed AURORA.
Answer to RQ 2 . Using uni-gram, there is no distinct difference in performance when the cutoff value changes. However, by bi-gram and n-gram, a better prediction performance is achieved for a more detailed feature set.
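The way a cutoff value shrinks the feature set can be illustrated with a minimal sketch. It assumes that a term is kept only if it occurs at least k times in the whole corpus; the exact rule used by AURORA is the one defined in the encoding section, so this criterion, the name build_vocabulary, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(corpus, k):
    """Keep the terms whose total frequency in the corpus is at least k (assumed rule)."""
    counts = Counter(term for document in corpus for term in document)
    return sorted(term for term, count in counts.items() if count >= k)


# Toy corpus: each metamodel is represented by the list of its terms (hypothetical).
corpus = [
    ["state", "transition", "event"],
    ["state", "transition", "guard"],
    ["state", "issue", "comment"],
    ["state", "transition", "issue"],
]
sizes = {k: len(build_vocabulary(corpus, k)) for k in (2, 4, 6, 8)}
```

A larger k leaves fewer input features, and therefore fewer input perceptrons, at the price of discarding terms that may still help to separate small categories.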

RQ 3 : Does the cross-entropy cost function account for a gain in performance?
By changing the loss function, we expect to get a gain in performance. To answer this research question, we consider the scores presented in Table 3 and Figs. 11a, b, c and 12a, b, c. From Table 3, it is clear that when the cross-entropy (CE) cost function is used, better accuracy is obtained by all encoding schemes as well as cutoff values. For instance, when uni-gram is used to parse the metamodels and k = 4, we get 95.60% as accuracy, which is also the maximum value among all the configurations. Similarly, by other cutoff values for uni-gram, the cross-entropy error function facilitates a superior performance in comparison with the quadratic one. By bi-gram, we witness the same trend: the cross-entropy function contributes to better performance. In particular, we get 86.80%, 81.40%, 79.90%, and 78.80% as accuracy for k = {2, 4, 6, 8}, respectively. Meanwhile, the corresponding scores by employing the quadratic function are 85.90%, 80.60%, 77.12%, and 77.40%. The same effect can also be seen with n-gram: by all cutoff values, the cross-entropy function helps achieve a better accuracy compared to the quadratic cost function.
Similarly, by other metrics, i.e., precision, recall, and F1, it can be seen that using cross-entropy as the loss function brings in a performance gain. Concerning precision, as shown in Table 3, bi-gram shows only a slight increment with cross-entropy compared to quadratic, whereas for the other encoding schemes the cross-entropy function yields a clearer improvement. For example, with n-gram and k = 2, the precision is 0.874 compared to 0.841 by the quadratic function using the same setting. Similarly, the recall and F1 scores yielded by the cross-entropy function are always superior to those yielded by the quadratic cost function.
We further compare the two loss functions by reporting the ROC curves for all the categories in Figs. 11a, b, c and 12a, b, c. By observing Figs. 11a and 12a together to compare the two functions with respect to using uni-gram as the encoding scheme, it is evident that the ROC curves corresponding to the cross-entropy function converge to the upper left corner of the diagram, implying a high true positive rate as well as a low false positive rate. Moreover, the AUC values for the categories range from 0.97 to 1.00. Meanwhile, the quadratic function suffers from a low prediction accuracy for a category with a small number of metamodels. In particular, in Fig. 11a, for Class 2 (Issue Tracker DSLs) with only seven metamodels, the corresponding ROC curve is close to the diagonal line, which corresponds to a mediocre performance. Similarly, by considering the figures pairwise, i.e., Figs. 11b and 12b, or Figs. 11c and 12c, the same trend can be witnessed: in comparison with the quadratic function, the cross-entropy one contributes to a more effective classification of the metamodels into their correct categories.
Altogether, we conclude that the application of cross-entropy as the error function brings about an improvement in the performance, compared to the original AURORA design [58]. Moreover, it is worth noting that also for this configuration, uni-gram is the most suitable encoding scheme as it contributes to a gain in performance in comparison with bi-gram and n-gram.
The gain in performance by using cross-entropy can be explained as follows. The quadratic function is based on mean squared error (MSE), which is by definition more suitable for linear regression. Meanwhile, the cross-entropy function is more suitable for classification as it works with particular sets of output values, i.e., multinomial classification. Altogether, we see that this is important in practice since we can substantially improve the prediction performance by choosing a suitable loss function, depending on the requirements.
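The difference between the two loss functions can be made concrete with a small numerical sketch. It assumes a single sigmoid output neuron instead of the full AURORA network, and it illustrates one commonly cited reason why cross-entropy suits classification, which goes beyond the explanation above: when the neuron is confidently wrong, the gradient of the quadratic loss with respect to the neuron's input almost vanishes, whereas the gradient of the cross-entropy loss stays large. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_loss(z, y):
    return 0.5 * (sigmoid(z) - y) ** 2


def cross_entropy_loss(z, y):
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def quadratic_gradient(z, y):
    """Derivative of the quadratic loss with respect to the neuron input z."""
    a = sigmoid(z)
    return (a - y) * a * (1 - a)


def cross_entropy_gradient(z, y):
    """Derivative of the cross-entropy loss with respect to the neuron input z."""
    return sigmoid(z) - y


# A confidently wrong prediction: the target is 1, but the output is close to 0.
z, y = -5.0, 1.0
qd_step = abs(quadratic_gradient(z, y))
ce_step = abs(cross_entropy_gradient(z, y))
```

In this setting the cross-entropy gradient is more than a hundred times larger than the quadratic one, so a badly misclassified training item corrects the weights much faster.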
Answer to RQ 3 . In comparison with the quadratic loss function, the cross-entropy loss function fosters a superior prediction performance.

RQ 4 : Is an increase in the network's width or depth beneficial to the performance?
We investigate whether an increase in the network's width or depth boosts the prediction accuracy. Figure 6b depicts a simple neural network where there are one input, one hidden, and one output layer. In practice, the number of perceptrons, i.e., the network's width, as well as the number of hidden layers, i.e., the network's depth, can be flexibly changed to aim for better accuracy. However, increasing either depth or width contributes to computational complexity. To this end, we are interested in understanding which network topology sustains the overall performance, i.e., how many hidden layers as well as how many neurons for each layer should be chosen to maintain a trade-off between accuracy and efficiency. To the best of our knowledge, there exist some rules of thumb to set the number of neurons M for a hidden layer, e.g., L ≤ M ≤ N [15], or M ≤ 2 × L, where L and N are the number of input and output neurons, respectively (cf. Fig. 6b). Still, selecting the right topology for neural networks, i.e., the number of layers, as well as the number of nodes for each hidden layer, remains an empirical matter [80].
First, we examine how an increase in the network's width affects performance. For this experiment, we chose the best configuration with the following settings: cross-entropy error function, cutoff value k = 2, and uni-gram as the encoding scheme. The number of perceptrons has been varied as follows: M = {10, 20, 30, 40, 50, 100}, and the corresponding accuracies are depicted in Table 5.
The table shows that generally, there is no radical improvement in accuracy when M is increased. There is only a tiny growth in the accuracy when M is changed from 20 to 30, i.e., the corresponding accuracies are 94.50% and 94.82%. Interestingly, when M = 40, there is a decrease in the accuracy compared to when M = 30. Similarly, when M rises to
To further investigate the effect of changing M, we also sketched the ROC curves for these experiments. We noticed that there are no significant differences between the performance obtained by changing M. Thus, the corresponding images are omitted from the paper for the sake of clarity. Similarly, we have seen that there is no evident change in the AUC values obtained for various categories. In this sense, we conclude that on the given dataset, a growth in the number of perceptrons in the hidden layer is not beneficial to performance, since it brings no gain in accuracy while adding to the computational complexity.
Second, to see the effect of changing the network's depth, we fixed 10 as the number of perceptrons for each layer, i.e., width=10, and varied the number of hidden layers, i.e., ρ = {2, 3, 4, 5, 10}. We obtained similar results: There is no definite improvement in performance when we add more hidden layers, while the computational complexity goes up very quickly. For the sake of presentation and clarity, the corresponding results are omitted from the paper.
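To give a feeling for how the width drives the size of the network, the following sketch counts the weights and biases of a fully connected feed-forward network. It is an illustration under stated assumptions: 1,240 input neurons (the uni-gram feature set for k = 8 mentioned in the dataset description, used here only as an example), one output neuron per category (nine in total), and a single hidden layer of width M. The name parameter_count is hypothetical.

```python
def parameter_count(layer_sizes):
    """Number of weights plus biases in a fully connected feed-forward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


INPUTS, OUTPUTS = 1240, 9  # assumed sizes, for illustration only
widths = [10, 20, 30, 40, 50, 100]
sizes = {m: parameter_count([INPUTS, m, OUTPUTS]) for m in widths}
```

Under these assumptions, the number of trainable parameters grows almost linearly with M, i.e., from 12,509 for M = 10 to 125,009 for M = 100, which makes a wider network costlier to train when it brings no gain in accuracy.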
Answer to RQ 4 . On the given dataset, an increase in the network's depth or width does not help AURORA enhance its classification performance.

RQ 5 : How does AURORA compare with GBDT and SVM?
Besides neural networks, GBDT and SVM are among the most well-founded machine learning algorithms [31,74]. They have been widely applied in various domains, such as pattern classification [22], face recognition [65], image segmentation [53], or recommender systems [40]. In this research question, we perform additional experiments to see how well AURORA works compared to GBDT and SVM in terms of prediction performance. To simplify the presentation, we make use of uni-gram to encode the data for this evaluation. Moreover, aiming for a fair comparison, we run ten-fold cross-validation using exactly the same input data for all three classifiers, i.e., GBDT, SVM, and AURORA. We made use of the GBDT and SVM implementations embedded in the scikit-learn library (www.scikit-learn.org).
The experimental results are shown in Table 6. Concerning accuracy, it is evident that AURORA always obtains a superior performance compared to both GBDT and SVM. For instance, with k = 2, AURORA gets 94.30% as accuracy. With precision, the same trend can be witnessed, i.e., AURORA gets the best precision compared to the baselines by all the cutoff values. Only by recall is GBDT the classifier that gets the best performance by almost all configurations, i.e., k = {2, 6, 8}. This means that GBDT is good at retrieving the metamodels that belong to the ground-truth data. However, when we consider F1-score, it is clear that AURORA is the best classifier since it always achieves the largest F1 value. Altogether, by taking all the evaluation metrics into consideration, we conclude that AURORA brings the best classification results among the three classifiers.
Answer to RQ 5 . AURORA outperforms both GBDT and SVM by most of the quality metrics.

Discussions
This section discusses the lessons learned as well as our vision for future developments.
• Encoding The experimental results indicate that the techniques used to parse and represent metamodels play an essential role in obtaining good performance. We came across an interesting outcome: running AURORA on a feature set parsed by uni-gram leads to the best classification results. This appears to be counterintuitive at first sight because the structure in metamodels has highly informative content, and as given in Sect. 3.2, uni-gram is not structure-preserving. A possible explanation of why uni-gram fosters the best classification performance might be related to the nature of the manual classification of the dataset involved in the experiment. In essence, the assigned labels reflect categories that are unrelated to the structural information of the metamodels, i.e., the structure does not offer any useful information for distinguishing the categories.
• Applicability In the scope of this paper, we deal with categorization in the metamodel application domains. In particular, we considered a set of metamodels for domain language specifications classified according to their functionalities. Consequently, it is expected that the outcome might be different if classification criteria other than the application domain were adopted. For instance, if we classify metamodels according to their complexity using metrics, such as the number of classes, attributes, and hierarchies, we may obtain completely different outcomes. This suggests that the encoding performance is not a general property of the method alone but also depends on the specific classification. However, this is a mere assumption, although an interesting one, which needs to be further investigated.
• Model classification Models could also be automatically classified with respect to some characteristics derived from the language they conform to. For this reason, it is important to first classify metamodels and then models. In this respect, we plan to apply the proposed approach to the classification of models. This aspect is linked to the concepts of the metamodel they conform to; in fact, it would be particularly useful to apply it to general-purpose languages (GPL), e.g., UML models. For instance, UML models could be classified according to the domain they represent: statecharts for IoT systems or statecharts for Web applications could be classified using our approach.
• Deeper and wider networks Recently, several attempts have been made to boost a neural network's performance by increasing various dimensions. However, what we can see from the experiments is that the extension of the given network does not bring any gain. As a consequence, we conclude that for the given dataset, it is reasonable to choose the most compact network configuration to maximize both accuracy and efficiency. We attempt to ascertain the rationale behind the decline in performance when the network is deeper or wider as follows. Deepening or widening allows a network to approximate the target function with increased nonlinearity, thereby getting better feature representations. Nevertheless, it also adds to the complexity of the network, which makes the network more difficult to optimize and susceptible to overfitting. For example, given a training metamodel, a network with more layers may simply memorize the target class, and thus fail to generalize from data. Therefore, whether extending a network brings benefits depends on the domain and the amount of input data. For instance, our experiments give evidence that AURORA performs poorly on categories with a small number of metamodels; however, it substantially improves its performance on more populous categories. In this respect, the results suggest that data plays an essential role in obtaining a high prediction performance: a denser dataset is likely to be beneficial to the final classification.
• Deployment A question that may arise is: "How can AURORA be deployed in practice?" As shown in Sect. 3.2, it is initially necessary to involve humans, e.g., a group of experienced modelers, to manually categorize a large enough dataset. Then, the dataset is fed to the system to train it; this building phase might be done offline, e.g., overnight. Afterward, the deployment stage can be conducted to classify an arbitrary metamodel automatically. The neural network can be trained incrementally: when more labeled metamodels become available over time, for example, by including more modelers in the manual labeling process, we continue refining the network without needing to train it from scratch. This is one of the main advantages of a neural network. In the first place, it is related to efficiency, since there is no need to run the training again whenever a metamodel is added, as is the case with clustering [10]. Notably, the system earns the benefit of hindsight: it can improve the prediction outcome once it has learned from additional training data. Furthermore, once the training data has been used, it can be discarded to make room for additional data, i.e., it is not necessary to store data that has already been used for training. Our proposed tool is a data-driven approach, and thus, its prediction performance is greatly influenced by the quality of the data. To be concrete, if the data is not adequately labeled, i.e., there is label noise, accuracy may be reduced [30]. Consider the motivating examples in Sect. 2.2, where very similar metamodels have been labeled with entirely different names by humans. When such data is used to train AURORA, the final prediction is negatively affected, since it is challenging to teach a machine learning tool to distinguish two metamodels that are similar in structure but carry different labels [30,59].
This can be considered as a pitfall that modelers should avoid when deploying AURORA. In summary, we conclude that the labeled data quality accounts for variation in the classification outcome, which confirms the findings of an existing work [73].
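The incremental training idea described in the Deployment item can be illustrated with a minimal sketch. It is not AURORA's implementation: the class `TinySoftmaxClassifier`, its `partial_fit` method, the learning rate, and the toy feature vectors are all hypothetical, and a single-layer softmax classifier stands in for the feed-forward network. The sketch only shows that a second batch of labeled data refines the existing weights, so the first batch no longer needs to be stored.

```python
import math
import random


class TinySoftmaxClassifier:
    """Single-layer softmax classifier trained by SGD (a stand-in, not AURORA's network)."""

    def __init__(self, n_features, n_classes, lr=0.5, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(n_features)]
                  for _ in range(n_classes)]
        self.b = [0.0] * n_classes
        self.lr = lr

    def predict_proba(self, x):
        scores = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        m = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(self, x):
        p = self.predict_proba(x)
        return p.index(max(p))

    def partial_fit(self, batch, epochs=50):
        """Continue training from the current weights on a new labeled batch."""
        for _ in range(epochs):
            for x, y in batch:
                p = self.predict_proba(x)
                for c in range(len(self.w)):
                    # Gradient of the cross-entropy loss w.r.t. the score of class c.
                    grad = p[c] - (1.0 if c == y else 0.0)
                    for j, xj in enumerate(x):
                        self.w[c][j] -= self.lr * grad * xj
                    self.b[c] -= self.lr * grad


# Hypothetical toy data: 3-dimensional feature vectors, two categories (0 and 1).
first_batch = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1)]
second_batch = [([0.9, 0.1, 0.0], 0), ([0.0, 0.8, 0.2], 1)]

clf = TinySoftmaxClassifier(n_features=3, n_classes=2)
clf.partial_fit(first_batch)   # initial offline training
del first_batch                # the used data can be discarded
clf.partial_fit(second_batch)  # refinement once new labels arrive, no restart
```

Because the weights are kept between the two `partial_fit` calls, the second call starts from what was learned in the first one instead of retraining from scratch.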

Threats to validity
We distinguish between internal, construct, external, and conclusion validity as follows.
• Internal validity Such threats are the internal factors that could have influenced the final outcomes. One possible threat could be seen through the results obtained for the categories with a considerably low number of items (e.g., class Tracker DSLs or Review sys. DSLs in Fig. 9a). Such a threat is eased by denser groups (e.g., class Database DSLs and State mac. DSLs in Fig. 9b). Furthermore, the selection of the cutoff values may introduce bias, since k takes only even values throughout the evaluation, i.e., k = {2, 4, 6, 8}. Besides the experiments discussed in this paper, we conducted similar ones with odd numbers, i.e., k = {1, 3, 5, 7}. As far as we can see, the conclusions obtained from the new experiments do not differ significantly; therefore, the results are omitted from the paper for the sake of clarity.
• External validity The main threat to external validity concerns the generalizability of our findings, i.e., whether they would still be valid outside the scope of this study. The type of categorization has a profound impact on the performance of any machine learning algorithm.
In this respect, we suppose that the generalizability of our approach only applies to the metamodel application domain level. For the other types of categorizations, it may not hold any more, and thus, this needs further investigation. The external validity can also be affected by categories with a small number of metamodels. Thus, we attempt to moderate the threat by considering a set of 555 metamodels that are of different sizes and cover various categories. We suppose that the threat is partially moderated by the categories with a significant number of metamodels. However, it can be fully mitigated only if there is a considerably large as well as balanced dataset.
In this respect, it is necessary to consider a bigger dataset with more metamodels for each category, and this is deferred to future work.
• Construct validity Such threats are related to the experimental settings presented in the paper, i.e., the simulated setting used to evaluate the tool. The threat has been mitigated by applying tenfold cross-validation, attempting to simulate a real classification scenario.
• Conclusion validity This is related to the experiment methodology, i.e., whether it is intrinsically related to the obtained outcome or whether there are other contributing factors. The quality metrics, i.e., accuracy, precision, recall, F1 score, ROC, and AUC, may pose a threat to conclusion validity. To deal with the issue, we employed the same metrics for evaluating all the experimental configurations.
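The tenfold cross-validation and some of the quality metrics named above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used in the paper: the helper names `k_fold_indices`, `precision_recall_f1`, and `accuracy` are hypothetical, folds are assigned round-robin (how the folds were actually drawn is not stated here), and the labels `"db"` and `"sm"` are made up for the example.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    # Assumption: round-robin assignment of items to folds.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 score for the given label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Ten train/test splits over a dataset of 555 items.
splits = list(k_fold_indices(555, k=10))

# Hypothetical labels and predictions for one fold.
y_true = ["db", "db", "sm", "sm"]
y_pred = ["db", "sm", "sm", "sm"]
p, r, f1 = precision_recall_f1(y_true, y_pred, "db")
acc = accuracy(y_true, y_pred)
```

Applying the same functions to every experimental configuration is what keeps the configurations comparable, which is the mitigation described under conclusion validity.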

Related work
Our main contributions are related to the deployment of ML techniques in MDE. Thus, in this section, we review some of the most notable related work on the adoption of ML algorithms in the context of MDE, as well as the categorization of models. Afterward, we list some significant applications of neural networks and associate our work with them.

Machine learning in MDE
The MDE community has made considerable progress in recent years in adopting various ML algorithms to solve different issues [5,19]. Kusmenko [34]. In the context of collaborative modeling, Barriga et al. [9] propose the adoption of an unsupervised and reinforcement learning (RL) approach to repairing broken models, which have been corrupted because of conflicting changes. The primary intent is to repair models with human quality, potentially without requiring supervision. Various studies [20,25] use ML techniques to automatically infer model transformation rules from sets of source and target models. Metalearning is a technique that aims at using ML itself to automatically learn the most appropriate algorithms and parameters for an ML problem. Hartmann et al. [36] propose interleaving ML with domain modeling. In particular, they decompose machine learning into reusable, chainable, and independently computable small learning units. These microlearning units are modeled together with and at the same level as the domain concepts. Breuker [18] reviews the main modeling languages used in ML as well as inference algorithms and corresponding software implementations. This work aims to explore the opportunities of defining a DSML for probabilistic modeling. Similarly, ML has been treated collaterally in [70], where the authors present an extension of the modeling language and tool support of ThingML, an open-source modeling tool for IoT/CPS, to address ML needs for IoT applications. To allow developers to design solutions to machine learning-based problems by automatically generating code for other technologies as a transparent bridge, a language- and technology-independent development environment was introduced [32]. A similar tool called OptiML has been proposed [76], attempting to bridge the gap between ML algorithms and heterogeneous hardware to provide a productive working environment.
Cabot et al. [23] discuss how Model-driven software engineering can benefit from the adoption of machine learning techniques and in general of cognification, which is the "application of knowledge to boost the performance and impact of a process." Modeling bots, model inferences, and real-time model reviewers are only some of the examples of new functionalities that can be introduced thanks to cognification.
Based on the experience we have acquired by developing recommender systems in software engineering [56,57], we see a great potential of applying similar recommendation techniques for the MDE domain. We already developed a system for providing developers with suitable packages and classes when they create metamodels. As future work, we plan to further improve the proposed approach by employing long short-term memory (LSTM) neural networks.
In summary, learning algorithms have been successfully applied to tackle different issues in the MDE context. Nevertheless, the adoption of cognitive techniques in this domain seems not to have received attention commensurate with its potential. In this respect, we believe that the introduction of AURORA advances the existing approaches to the classification of metamodels.

Categorization of metamodels
Classification and clustering techniques have been employed to categorize metamodels for a variety of purposes, including reuse [6]. In particular, an agglomerative hierarchical clustering technique has been employed to organize metamodels in repositories automatically. Hierarchical clustering techniques are the main engine to perform metamodel comparison, analysis, and visualization [6]. In these approaches, the named elements of metamodels are converted into lexical terms and represented in a vector space model. This encoding enables the use of hierarchical clustering algorithms to group similar objects and present the result as a dendrogram. The approaches mentioned above share the same intent as this paper, even though they are based on unsupervised learning techniques, whereas our work follows the supervised learning strategy.
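The vector space encoding mentioned above can be sketched as follows. This is only an illustration under assumptions: the functions `lexical_terms` and `cosine`, the camel-case tokenization, and the raw term-frequency weighting are hypothetical choices, and the cited approaches may tokenize and weight terms differently. The hierarchical clustering step itself is not shown; the sketch covers only the encoding and the pairwise similarity that a clustering algorithm would consume.

```python
import math
import re
from collections import Counter


def lexical_terms(element_names):
    """Split element names into lowercase terms and count them."""
    terms = []
    for name in element_names:
        # Split camelCase, acronyms, snake_case, and digits into separate terms.
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
        terms.extend(part.lower() for part in parts)
    return Counter(terms)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical metamodels, each given as a list of named elements.
state_machine_a = lexical_terms(["StateMachine", "State", "Transition"])
state_machine_b = lexical_terms(["State", "Transition", "Event"])
database = lexical_terms(["Table", "Column", "ForeignKey"])

sim_close = cosine(state_machine_a, state_machine_b)  # shares "state", "transition"
sim_far = cosine(state_machine_a, database)           # no shared terms
```

A clustering algorithm would merge the two state machine metamodels before either is merged with the database metamodel, since their vectors share terms while the database vector shares none.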
Lopez et al. [51] present a domain-oriented approach for software requirements reuse. First, requirements captured from semiformal diagrams are injected into models, which are then analyzed to check for any probable quality issues. Finally, a mechanism to cluster the requirements is applied to allow the domain requirements patterns to be identified and reused. Reuser [67] is a tool used to automatically retrieve related UML artifacts and propose them to the modeler. A search algorithm is applied to find similar artifacts by classifying them into a concept lattice. Then it compares the input artifact query against the concept lattice, to retrieve the closest set of matching artifacts.
Jiang et al. [41] discuss different characteristics of UML metamodel extension mechanisms according to a four-level classification. Each level of metamodel extension has different features in (contrasting) aspects such as readability, expression capability, use scope, and tool support. The paper aims at providing modelers with a reliable theoretical base for selecting the right level at which to extend the metamodel, so as to find the right trade-off among the above aspects. A classification model for UML stereotypes has been developed in [13], where the artifacts are analyzed according to their potential to alter the syntax and semantics of the base language, so that their application can be controlled in practice. Feature-based criteria for classifying approaches according to the type of spatial relation involved have been widely discussed [16]. In contrast to our work, this is a taxonomic approach that manually identifies categories for visual languages based on their structure (neglecting concrete syntax and semantics). The taxonomy promoted by Burnett et al. [21] only considers visual programming languages, mainly with respect to the paradigm they express. This is a brief (and non-exhaustive) survey of purpose-oriented classification approaches, and further secondary studies are available. However, it goes beyond the scope of our paper to discuss why classifications, as well as their enabling techniques, are necessary.
Most of the approaches that are agnostic of the classification purpose are based on lexical and structural information encoded in the metamodels. Such knowledge can be leveraged in a more semantic-aware analysis when the artifacts are constrained to specific categories. For instance, the approach in [51] focuses on requirement diagrams, where the various relationships within a diagram have particular meanings that help achieve better classification performance. More recently, generic clustering techniques have been extensively investigated [75]. The work presents a tool that supports the decomposition of a metamodel into clusters of model elements. The main goal is to improve model maintainability by facilitating model comprehension. In particular, modularity is achieved by decomposing the input model into a set of sub-models. The main feature that makes this study different from others is that it works by splitting a single model into clusters, and a post-processing phase is introduced: users are allowed to add more clusters or to rename, remove, and regroup existing ones. However, in contrast to our approach, the identified clusters remain unnamed, and the representation only groups similar concepts in the same containers.

Applications of neural networks
The ability to learn from labeled data underpins the main strength of neural networks, making them a well-established technique in machine learning. To name just a few, pattern recognition [14], forecasting [80], and classification [4,26] are among the main application domains of neural networks. For classification, neural networks have demonstrated their suitability in various applications. Augusteijn et al. [4] investigate two different types of neural networks for classification, i.e., back-propagation and probabilistic neural networks, and find that only the latter is suitable for the detection of novel patterns. Convolutional neural networks (CNNs) were originally designed to work with images, and thus, they have been widely used to solve various recognition tasks. To name but a few, CNNs have been applied to detect olive fruits [33] and to classify images [46]. The success of CNNs in computer vision is a source of inspiration for applications in other domains. For instance, CNNs have been exploited to classify DNA sequences [3] and to process natural language [12]. We envisage the adoption of CNNs to solve different issues in MDE, such as metamodel specification or classification [60].
Through careful observation, we found that AURORA is the first application of neural networks to classify metamodels. Furthermore, it is among the initial attempts to deploy neural networks in a completely new domain, i.e., MDE [71]. Our work distinguishes itself from existing studies in MDE as follows: (i) we employed a well-founded ML technique to automate the classification of metamodels; (ii) the tool can automatically learn from labeled metamodels to deal with unlabeled metamodels, and is thereby applicable to real metamodel categorization scenarios; and (iii) last but not least, our proposed paradigm is expected to pave the way for further adoption of advanced machine learning algorithms, including deep learning [35,62], in MDE.

Conclusions
We proposed AURORA as the first attempt to automate the classification of metamodels, exploiting a well-founded machine learning technique. AURORA is built on top of a feed-forward neural network utilizing two different loss functions. Thus, it can learn from labeled data and classify unknown metamodels. The prototype is useful since it considerably reduces the effort required to classify metamodels by hand, thus automating the categorization process.
AURORA has been evaluated on a manually classified dataset using the tenfold cross-validation technique. The experimental results demonstrate that the tool is capable of categorizing the test data with high accuracy. We also see that AURORA outperforms gradient boosted decision tree and support vector machines, two well-defined machine learning algorithms.
As future work, we plan to investigate the adoption of different types of networks, e.g., convolutional neural networks or recurrent neural networks, to classify various modeling artifacts, e.g., models and code generators, or to recommend model transformations. The aim is to conceive recommender systems able to support modelers in a wide range of modeling activities.