Identifying algorithm in program code based on structural features using CNN classification model

In software, an algorithm is a well-organized sequence of actions that provides the optimal way to complete a task. Algorithmic thinking is also essential to break down a problem and conceptualize a solution in steps. The proper selection of an algorithm is pivotal for improving computational performance and software productivity, as well as for programming learning. That is, determining a suitable algorithm from a given code is widely relevant in software engineering and programming education. However, both humans and machines find it difficult to identify algorithms from code without any meta-information. This study proposes a program code classification model that uses a convolutional neural network (CNN) to classify codes by algorithm. First, program codes are transformed into sequences of structural features (SFs). Second, the SFs are transformed into a one-hot binary matrix through several procedures. Third, different structures and hyperparameters of the CNN model are fine-tuned to identify the best model for the code classification task. To this end, 61,614 real-world program codes implementing different types of algorithms, collected from an online judge system, are used to train, validate, and evaluate the model. Finally, the experimental results show that the proposed model can identify algorithms and classify program codes with high accuracy. The average precision, recall, and F-measure scores of the best CNN model are 95.65%, 95.85%, and 95.70%, respectively, indicating that it outperforms other baseline models.


Introduction
Information technology (IT) has become an indispensable part of global society. One of the essential requirements for developing IT tools is computer programming, and the importance of programming education is attracting global attention [1]. Programming languages, curricula, teaching and learning methods, and platforms have become the subjects of representative basic research on programming education [2][3][4]. As a result, a considerable amount of code is generated and accumulated daily by learners at different levels on platforms such as online judge (OJ) systems [5]. These large code archives can be used as a suitable reference for problem solving, searching for problems and answers, as well as for educational research and analysis [6]. In the context of education, identifying the algorithm in a code can be useful for advanced code analysis, including code evaluation [7,8], plagiarism checking, and problem evaluation (or difficulty estimation) [9,10]. Furthermore, educational data mining (EDM) using large-scale programming data from repositories enables various empirical analyses. These analyses demonstrate the correlation between academic achievement and programming skills, user assessment, and learning path recommendations to facilitate programming learning [1,11,12].

(Author note: Yutaka Watanobe and Md. Mostafizer Rahman contributed equally to this work and are considered primary contributors. Corresponding author: Md. Mostafizer Rahman, mostafiz26@gmail.com. Extended author information is available on the last page of the article.)
In software engineering (SE), algorithms are implemented at the functional level of the code. Solution codes can be reused for various purposes in SE in the form of libraries, open-source projects, components, and APIs. One of the important aspects of faster coding is code reuse [13]. Code reuse is the practice of using existing code snippets to create a new function or code, and it requires understanding other codes and the algorithms they use. The identification of algorithms is also important for development environments (IDEs, editors, etc.) and related intelligent software tools that provide feedback and support functions. In a development environment, services for various types of searches against a set of program codes are indispensable. Identifying algorithms in code can be useful for advanced code analysis, including code cloning, refactoring, function prediction, debugging, code evaluation, and software metrics. On the other hand, as intelligent software tools, various ML models have been specifically designed for generating, evaluating, modifying, supplementing, and improving source code. The accuracy and efficiency of many specialized ML models for these operations, as well as for augmentation and retrieval tasks, are highly dependent on identifying the program code [14,15]. Therefore, the algorithm implemented in the code can be a useful feature for ML models.
Due to the vast amount of accumulated code, manually searching for codes using keywords, comments/documents, tags, names, and other metadata is a challenging task. The unavailability, non-uniformity, and inadequacy of metadata are also major obstacles to code retrieval. Many keywords are freely defined by programmers (the main reason for non-uniformity), and these keywords may not be suitable for accurate code classification. Moreover, to find similar codes for reference purposes, it is not enough to search for codes of similar algorithms based only on metadata. Therefore, artificial intelligence (AI) can serve as a core technology to solve this problem. In recent years, advanced deep neural network (DNN) models, such as recurrent neural networks (RNN), feed-forward neural networks (FNN), long short-term memory (LSTM) [16], bidirectional long short-term memory (BiLSTM) [17], and convolutional neural networks (CNN) [18], have been used effectively for tasks as diverse as computer vision [19][20][21][22], travel and Internet-of-Things time series data [23,24], fault diagnosis of chemical data [25], and autonomous transportation systems [26]. Meanwhile, DNN models are considered an effective method in the context of programming activities.
In recent times, DNN models have achieved significant results in program code classification, recommendation, error detection, prediction, and code assessment [7,12,[27][28][29][30]. Moreover, DNN models are used for various programming tasks (e.g., code completion, evaluation, repair, generation, and summarization) [31][32][33]. To make DNN models more effective in programming-related tasks, real-world programming data resources can be advantageous, and one such source is OJ data. The OJ system is an effective platform for programming exercises and competitions, allowing programmers to practice throughout the year [34,35]. OJ systems can effectively provide autonomous learning opportunities through code evaluation and detailed feedback on program errors [9,10,12]. Let P = {p_1, p_2, p_3, ..., p_n} be the set of problems related to various algorithms and V = {v_1, v_2, v_3, ..., v_m} be the set of verdicts. For each problem p_i in P, there are many solutions S = {s_1, s_2, s_3, ..., s_w}, and each solution receives a verdict in V together with evaluation values such as CPU time. Typically, OJ systems provide decisions or verdicts depending on the errors in and acceptance of the codes, and each error decision gives a specific reason for an error in the code. For example, error decisions such as memory limit exceeded (MLE), time limit exceeded (TLE), and runtime error (RE) are made when the performance of the algorithm is not sufficient to solve the corresponding problem. In contrast, the wrong answer (WA) decision is made when the code contains logical errors. Thus, large real-world OJ data (solution codes with verdicts and performance logs for problem sets) can be a real treasure for the task of AI for coding [36][37][38].
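The problem-solution-verdict relationship described above can be sketched with a minimal data model (a hedged illustration only; the dictionary layout, function name, and verdict codes follow the notation in the text but are not the OJ system's actual API):

```python
# Minimal sketch of the OJ data model: each problem has many solution
# codes, and each solution receives exactly one verdict plus metrics.
VERDICTS = {"AC", "WA", "TLE", "MLE", "RE"}  # accepted + error verdicts

def add_solution(archive, problem_id, code, verdict, cpu_time_ms):
    """Record one submission for a problem; the verdict must be known."""
    assert verdict in VERDICTS
    archive.setdefault(problem_id, []).append(
        {"code": code, "verdict": verdict, "cpu_time_ms": cpu_time_ms})

archive = {}
add_solution(archive, "p1", "int main(){...}", "AC", 12)
add_solution(archive, "p1", "int main(){...}", "TLE", 2000)

# Only accepted solutions serve as "correct" reference or training data.
accepted = [s for s in archive["p1"] if s["verdict"] == "AC"]
```

In this view, the training data used later in the paper corresponds to the accepted subset of the archive.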
Despite the remarkable results of DNN models in programming tasks, the structural (or algorithmic) features of code have not been adequately discussed. However, knowing the algorithms used in a program code is important, from both an educational and a software development perspective, to better understand the code. Therefore, the classification of program code based on its structural features remains an open problem. To address this research gap, we propose a CNN-based program code classification model that can be applied to both programming education and software development. The proposed model classifies program codes by identifying the algorithms contained in them. In addition, this study presents a new data preprocessing approach for program codes. The code preprocessing requires several steps: (i) user-defined properties/tokens of program codes, such as functions, classes, keywords, and variables, are filtered out; (ii) structural features (SFs), such as if, else, loops, mathematical operators, bitwise operators, and assignment operators, are retained; and (iii) the SFs of each program code are converted into a one-hot binary matrix (OBM). We have collected two different datasets of real-world program codes covering various algorithms for model training, validation, and evaluation. Three CNN models with different structures and hyperparameters are developed, trained, and evaluated to select the best model for program code classification, and the best CNN model is applied to the classification task based on the experimental results. The contributions of this research work are as follows:
• The proposed CNN model can identify the algorithm used in a program code and classify the code based on the identified algorithm.
• We present a novel strategy for program code preprocessing. SFs are extracted from program codes and converted into an OBM for model training. SFs help the model better understand the algorithmic properties of codes.
• The average precision, recall, and F-measure values of the proposed model are 95.65%, 95.85%, and 95.70%, respectively, which outperform the values obtained by other referenced models.
• The proposed classification model and its novel data preprocessing approach can be useful for various educational and industrial applications.
The remainder of this paper is structured as follows. Section 2 presents the background and related works. Section 3 describes the proposed approach, and Section 4 presents the experimental results and evaluations. Section 5 discusses the results in detail, and finally, Section 6 concludes this study with suggestions for future work.

Background and related works
This section presents prior studies related to programming education and its challenges, ML in software development practices and its challenges, code evaluation and repair, and code classification.

Programming education and its challenges
Research in programming education has gained momentum worldwide, and learning programming in higher education has been recognized as significantly important for the sustainable development of IT infrastructure [39]. A data-driven study [1] has shown that better programming skills have a positive impact on students' academic performance. In [40], EDM has been performed to support programming learning based on programming data. Sun et al. [41] proposed a model to evaluate students' programming skills in terms of programming and test performance. Based on object-oriented programming tasks, the model observed the improvement of students' programming skills, and the experimental results showed that test performance was positively correlated with programming performance. Qian et al. [42] conducted a comprehensive study to identify students' misconceptions and difficulties in introductory programming courses; the misconceptions students most often face concern conceptual, syntactic, and strategic knowledge. The challenges faced by students depend on many factors, including unfamiliar language syntax, programming environments, incorrect concepts and strategies, and instructor competence. Medeiros et al. [2] categorized the challenges in introductory programming and the essential issues for learning programming and teaching in higher education. In addition, the study [43] identified significant challenges such as writing, debugging, conceptualizing, and tracing code, and presented pedagogical teaching/learning techniques and valuable learning tools to overcome these challenges. Meanwhile, due to rapid social and technological changes, many interesting and convenient tools are available, which sometimes have a negative impact on programming learning and students' motivation [39].

Machine learning in software development practices and its challenges
Recently, ML has been gaining attention as a method for developing various software systems, such as speech recognition, computer vision, natural language processing (NLP), robot control, and other application domains. ML capabilities can be integrated into a software system in many ways, including ML components, tools, libraries (covering ML functionalities), and frameworks [14]. A widespread trend has emerged in which the development and implementation of ML-enabled systems are fast and inexpensive; however, their long-term maintenance is not cost-effective [44]. Wan and collaborators investigated the differences in software development practices between ML and non-ML systems [14]. Moreover, common practices and workflows for building large-scale ML applications, systems, and platforms at Microsoft, Amazon, and Google have been presented in [6,[45][46][47]. Additionally, various testing and debugging tools have been proposed for ML-based applications and systems [48][49][50][51]. Despite these efforts, the standardization and operationalization of reliable ML systems remain necessary. Based on real-world ML-enabled software development practices [6], around eleven challenges have been identified, from data collection to model evolution, evaluation, and deployment. Our proposed classification model can serve as a supporting component in building large-scale ML-based applications and systems that deal with SFs.

Program code evaluation and repairing
Recently, researchers have made continuous efforts to achieve significant results in this area. Programming languages are quite different from natural languages, as program codes contain a large amount of complex structural information; consequently, conventional NLP models are inadequate for program codes. Therefore, in [52], a tree-based CNN model for program code processing tasks has been proposed. Rahman et al. [8] presented a model for source code evaluation using LSTM neural networks combined with an attention mechanism to understand the complex context of the code. During code evaluation, the model identified errors, including logic and syntax errors, with a high percentage of accuracy.
In [53], a multi-modal attention network (MMAN) has been proposed to properly represent the SFs of source codes and to improve the reasoning about which features have the most impact on the final results. The MMAN can represent both structured and unstructured features of source code, using a tree-LSTM for the abstract syntax tree (AST) and a gated graph neural network (GNN) for the control flow graph. In another study [7], an LSTM model has been developed to identify source code errors in C programming. In this model, characters, variables, keywords, tokens, numbers, functions, and classes are encoded with defined IDs, and the model detected errors in faulty solution codes with high accuracy. Terada et al. [29] presented a model, built with an LSTM network, for predicting the next unknown code sequence to complete the code. Their model can help novice programmers who have difficulty writing complete code from scratch, and it effectively predicted the correct words to complete the code. In addition, code evaluation, completion, and repair tasks have been performed using an LSTM neural network at different levels of programming learning [31,32].

Program code classification
A program code classification model is essential for a better understanding of code, and researchers have proposed various approaches for program code classification. In the early stages of code classification and prediction, NLP models were applied to source code to perform various prediction tasks [54][55][56]. A GNN model [57] was proposed for classifying students' program code; it integrates the AST and data flow to improve performance and classifies student program code with an accuracy of 97%. Fan et al. [28] proposed a method for classifying defective source codes using RNNs with attention mechanisms. Two evaluation indicators, area under the curve (AUC) and F1-measure, were used; these showed improvements of about 7% and 14%, respectively, compared to other benchmark models.
Furthermore, many models have been proposed for classifying program codes based on programming languages.
Ugurel et al. [58] performed two types of classification using SVM: first, classification of programming languages and, second, classification of different categories of programs (e.g., databases, multimedia, and graphics). Tian et al. [59] used a Latent Dirichlet mapping method to classify the programming language associated with source code based on its words. Alreshedy et al. [60] presented an ML language model to classify source code snippets based on the programming language; a multinomial Naive Bayes (MNB) classifier was used in their work, with Stack Overflow posts as experimental data. This classification method used features such as comments, variables, and functions instead of syntactic information. Reyes et al. [61] presented a model for classifying source code using LSTM, in which archived source codes are classified by the programming language they are written in. Empirical results show that the LSTM model performed better than the Naive Bayes and linguistic classifiers. Gilda [62] used a CNN model to identify programming languages from source code snippets.
In [63], classification based on code tags has been performed using three classification methods: SVM, random forest, and AdaBoost. In [64], a decision tree-based classification method has been used to classify source codes related to sorting algorithms. LeClair et al. [65] noted that source code can be classified into six categories: games, admin, network, words, science, and usage. Xu et al. [66] used LSTM and CNN models to identify vulnerabilities in source code. In addition, a CNN-based classification model was used to classify code based on the algorithms used.
In brief, numerous promising methods have been proposed and experimented with in various studies. The researchers have used traditional unsupervised and supervised classifiers. In addition, CNN and LSTM have been employed as language models for source code-related research and applications. However, the relative importance of the methods is challenging to identify. The proposed code classification model differs from other models due to its novel data preprocessing and selection approach for the CNN model. In this study, three CNN models based on different structures and hyperparameters are trained, validated, and evaluated. The best CNN model is selected for the classification task based on the results.

Proposed approach
Programmers prefer implementing algorithms for efficient code; however, implementing algorithms in code is not a trivial task. This research aims to identify the algorithm contained in a program code and classify the code based on the identified algorithm. We have used real-world solution codes for different algorithms from programming competitions and academic courses. A crucial step is data preprocessing for model training and evaluation, in which SFs are extracted from the codes while all user-defined elements (e.g., variables, classes, and functions) are excluded. These SFs help the DNN model better understand the algorithm's flow. CNN-based classification models with various structures and hyperparameters are developed for classifying codes. Although CNN models are widely used in computer vision research, they have recently achieved significant success in various programming-related tasks (classification, error detection, prediction, and language modeling) [67,68]. The proposed classification approach includes several phases, from data acquisition to model training and evaluation: (i) data acquisition and categorization, (ii) data preprocessing, (iii) CNN model training, and (iv) program code classification with the optimal CNN model. The basic framework of our proposed approach is shown in Fig. 1 and explained in detail in the following sections.

Data collection and categorization
Selecting relevant datasets from a real-world data repository is essential in research. In this study, real-world program codes are collected from the Aizu Online Judge (AOJ) system [69,70]. All program codes are written in the C++ programming language. AOJ is a platform that hosts various academic programming activities and programming competitions. As of February 2022, AOJ has over 3,000 programming problems and 100,000 users, and it organizes programming problems by category and algorithm. The AOJ system has archived more than 6 million solution codes and submission logs, creating research opportunities for SE and programming education. For example, IBM and MIT have used solution codes from AOJ for their CodeNet project [36,71].
In this study, all program codes are divided into two separate datasets: A and B. In Dataset A, we considered categories that cover a large number of algorithms in computer science and engineering, such as computational geometry problems (CGP), number theory problems (NTP), flow network problems (FNP), shortest path problems (SPP), query for data structures problems (QDSP), and combinatorial optimization problems (COP), as shown in Table 1. These categories include basic algorithms from graph theory, geometry, numerical analysis, puzzles, numbers, search, computational theory, networks, advanced mathematics, and advanced data structures and algorithms. All program codes in each category of Dataset A are collected from the problems of programming competitions in AOJ (https://onlinejudge.u-aizu.ac.jp/challenges/search/categories).

As shown in Table 2, all program codes related to sorting, such as counting sort, bubble sort, insertion sort, merge sort, selection sort, shell sort, and quick sort, are contained in Dataset B. In addition, essential key features such as the complexity and method of the sorting algorithms are presented.

Data preprocessing
To achieve better results from DNN models, effective input shapes can play a vital role, so it is essential to create a suitable input shape that represents the actual features of the original data. Program code has a much more complex representation than natural language. Therefore, we extracted suitable features from the program codes so that the model can be trained effectively. The workflow for preprocessing the program code is shown in Fig. 2.
For tokenization in program code transformation, only structural properties are extracted from the code. Program code usually consists of operators, operands, loops, branches, keywords, methods, and classes; therefore, the key attributes of the program code are extracted. In contrast, user-defined elements such as comments, variables, classes, and functions, which have little structural impact, are not considered. A list of feature tokens (T) and their corresponding IDs is shown in Table 3. Initially, SFs are extracted from the program codes according to Algorithm 1. The steps of program code preprocessing are described in the following subsections.

Comments deletion
All comments in the program code are identified and removed with the removeComments() function, because comments do not affect the behavior of the code and carry no structural information.
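A minimal sketch of such a comment-removal step for C++ code is shown below (the function name follows the text; the regex-based implementation is an assumption and deliberately ignores corner cases such as comment markers inside string literals):

```python
import re

def remove_comments(code: str) -> str:
    """Strip /* ... */ block comments and // line comments from C++ code.

    A simplified sketch: it does not handle comment markers that appear
    inside string literals.
    """
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return code

src = "int x = 1; // counter\n/* temp\n buffer */ int y = 2;"
cleaned = remove_comments(src)
```

The non-greedy `.*?` with `re.DOTALL` keeps each block comment match from swallowing code between two separate comments.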

Extraction of feature tokens
After removing comments from the code, feature tokens such as if, else, loops, mathematical operators, bitwise operators, assignment operators, compound assignment operators, comparison operators, braces, parentheses, and square brackets are selected. Typically, in C++ programming, parentheses are used for function calls and declarations, conditional statements (if, while, do), loops, and operator precedence. In contrast, braces are used for functions, classes, structs, if statements, and loops, while square brackets are used to access arrays. With this definition, all the feature tokens in the program code are selected for extraction using the extractSelectedFeatures() function, as shown in Fig. 4. In addition, irrelevant tokens, such as all variables and functions arbitrarily defined by the programmers, are identified and removed from the code. The names of variables and functions may vary depending on the programmer's definitions, so a single code can have many different variable and function names. Also, C++ is a statically typed programming language in which the type of variables must be explicitly specified in the code, whereas Ruby and Python are dynamically typed languages. Therefore, all user-defined variable and function names are removed from the code so that only the structural features remain.

Tokenization of the features
All the feature tokens are extracted from the code, as shown in Fig. 5(a). Next, the extracted feature tokens are converted into token IDs according to Table 3; this process is called tokenization or encoding. In this research, the tokenization/encoding process represents each SF of a code as a token, and all tokens are mapped to numeric values to feed the DNN models. During training, a sequence of tokens is converted into a sequence of numerical vectors, which are then processed by the neural network. DNN models know neither the SFs of the code, such as { + & = [ ] }, nor its semantic or algorithmic features. Therefore, tokenization/encoding is an important step that allows a neural network model to learn from scratch. For example, we defined token IDs from 0-16 for the different features of the code (Table 3).
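The extraction-plus-encoding step can be sketched as follows (the token table here is an illustrative subset; the paper's actual 17-entry mapping is defined in Table 3):

```python
import re

# Illustrative subset of the feature-token table (Table 3 defines IDs 0-16).
TOKEN_IDS = {
    "if": 0, "else": 1, "for": 2, "while": 3,
    "+": 4, "=": 5, "==": 6, "&": 7,
    "{": 8, "}": 9, "(": 10, ")": 11, "[": 12, "]": 13,
}

def tokenize(code: str) -> list:
    """Scan the code and keep only structural feature tokens, as IDs.
    User-defined identifiers never match the table and are simply dropped."""
    pattern = r"==|if|else|for|while|[+=&{}()\[\]]"
    return [TOKEN_IDS[t] for t in re.findall(pattern, code)]

ids = tokenize("for (i = 0; i < n; i++) { if (a[i] == x) count = count + 1; }")
```

Note the ordering of alternatives in the pattern: `==` is tried before `=` so that a comparison is not split into two assignment tokens.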

One-hot binary matrix conversion from token IDs
After the tokenization process, IDs are assigned to the corresponding feature tokens. A sequence of token IDs is converted into a P × Q matrix, where P is the number of token IDs and Q is the highest token ID value plus one. Since the maximum token ID is 16, the number of columns Q is 17. Finally, the token IDs are converted into an OBM of P rows according to Algorithm 2, which constructs the OBM using the one-hot rule in (1):

OBM[P][Q] = 1 if Q = S_P, and 0 otherwise,   (1)

where S_P represents the token ID of the P-th iteration and Q is the column index. In Algorithm 2, line 5 first takes the entire tokenized solution code (e.g., Fig. 5(b)), then line 6 processes the individual tokens for OBM conversion, and finally lines 7-14 are repeated until the tokens of a code run out. The conversion of token IDs to an OBM is shown in Fig. 6.
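The conversion of Algorithm 2 can be sketched in a few lines of plain Python (a hedged sketch, not the paper's actual implementation; row/column indices follow the P × Q layout described above):

```python
Q = 17  # columns: maximum token ID (16) + 1

def to_obm(token_ids):
    """Convert a sequence of token IDs into a P x Q one-hot binary matrix:
    row p has a single 1 in column token_ids[p] and zeros elsewhere (Eq. (1))."""
    return [[1 if q == t else 0 for q in range(Q)] for t in token_ids]

obm = to_obm([3, 0, 16])  # three tokens -> 3 x 17 matrix
```

Each row of the resulting matrix encodes exactly one structural feature, so the matrix preserves both the identity and the order of the extracted tokens.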

Padding
The final step of preprocessing the program code is padding, an essential step for training a DNN model with batches. To train a DNN model, all input sequences in each batch must have the same length. Therefore, random tokens are added at the end (post) or beginning (pre) of an input sequence to equalize the lengths; using random rather than constant tokens also helps avoid overfitting.
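A minimal padding sketch under these assumptions (fixed target length, pad tokens drawn at random from the token-ID range; the fixed seed is only for reproducibility of the illustration):

```python
import random

def pad_sequence(token_ids, length, mode="post", seed=None):
    """Pad (or truncate) a token-ID sequence to a fixed length with random
    token IDs, placed before ('pre') or after ('post') the sequence."""
    rng = random.Random(seed)
    if len(token_ids) >= length:
        return token_ids[:length]
    pad = [rng.randint(0, 16) for _ in range(length - len(token_ids))]
    return pad + token_ids if mode == "pre" else token_ids + pad

padded = pad_sequence([5, 2, 9], 8, mode="post", seed=0)
```

After this step, every sequence in a batch has identical length and can be stacked into a single input tensor.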

Architecture of the CNN model
CNN has become an effective deep learning technique for solving complex tasks in various domains in recent years. Thus, the use of CNNs has increased significantly in various fields of computer science and engineering [22,68,72]. The architecture of the CNN model is illustrated in Fig. 7. It includes convolutional layers (CL) of different sizes, activation functions (AF), max-pooling, fully connected layers (FCL), dropout layers, and a softmax function for the classification task. The OBM is used as input to the different-sized CLs via a dropout layer. Each CL learns features of the code from the input sequences, and the output of each CL is passed to an AF (e.g., ReLU/LeakyReLU). The ReLU and LeakyReLU AFs are expressed by (2) and (3), respectively:

ReLU(z) = max(0, z),                            (2)
LeakyReLU(z) = z if z > 0, and αz otherwise,    (3)

where z is an input and α is a small slope value. A max-pooling layer is added after each CL. The max-pooling layer extracts the maximum value from each activation/feature map generated by the convolutional filter/kernel. In this manner, important information is preserved while the size of the feature/activation map is reduced. Thereafter, the outputs of the different max-pooling layers are concatenated, and the pooled results are passed to an FCL via a dropout layer. The FCL learns combinations of filters that are highly correlated with each algorithm category. Finally, the output of the FCL is converted into probabilities via the softmax layer according to (4), and the loss function L is calculated by (5) using the predicted value Y_k and the actual value t_k:

Y_k = exp(a_k) / Σ_i exp(a_i),                  (4)
L = − Σ_k t_k log(Y_k),                         (5)

where a_k is the k-th output of the FCL.
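The computations in (2)-(5) can be checked with a few lines of plain Python (standard definitions; α and the example logits below are arbitrary illustrative values):

```python
import math

def relu(z):                    # Eq. (2)
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):  # Eq. (3)
    return z if z > 0 else alpha * z

def softmax(a):                 # Eq. (4): Y_k = exp(a_k) / sum_i exp(a_i)
    exps = [math.exp(x) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, t):        # Eq. (5): L = -sum_k t_k * log(y_k)
    return -sum(tk * math.log(yk) for yk, tk in zip(y, t))

probs = softmax([2.0, 1.0, 0.1])     # example FCL outputs for 3 categories
loss = cross_entropy(probs, [1, 0, 0])  # one-hot target: category 0
```

As the text notes, the softmax outputs always sum to 1, and the loss penalizes low probability assigned to the true category.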
Dropout layers are placed in front of the CLs and the FCL to avoid overfitting. The initial dropout layer randomly zeroes out entire columns, while the other dropout layers randomly zero out individual inputs. Lastly, the softmax layer classifies the program codes based on probability: output probabilities are calculated for each category given a code, the probabilities of all categories sum to 1 (one), and the category with the highest probability is declared the winner.

Hyperparameters
Different architectures and hyperparameters of the CNN model are fine-tuned to select the best/optimal model for program code classification. We used filters/kernels of different sizes, such as 16 × 17, 32 × 17, and 64 × 17, in the CLs. The lengths/batch sizes (BS) for the input sequences are 16, 32, and 64. The horizontal length of a convolutional filter and of the OBM is always equal, i.e., 17. The output length of all convolutional layers is 64, and thus the length of the training sequence is also 64. A large convolutional filter length allows a CL to learn the characteristics of an entire code block of the program code. Some hyperparameters are given in Table 4.
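Fine-tuning over these settings amounts to a small grid search; a hedged sketch is shown below (filter heights, batch sizes, and activation choices follow the values listed above; actually training each configuration is left as a placeholder):

```python
from itertools import product

FILTER_HEIGHTS = [16, 32, 64]  # filter sizes 16x17, 32x17, 64x17 (width fixed at 17)
BATCH_SIZES = [16, 32, 64]
ACTIVATIONS = ["relu", "leaky_relu"]

# Enumerate every combination of the tuned hyperparameters.
grid = [
    {"filter_height": f, "batch_size": b, "activation": a}
    for f, b, a in product(FILTER_HEIGHTS, BATCH_SIZES, ACTIVATIONS)
]
# Each configuration would be trained and validated; the configuration with
# the best validation score is kept as the final classification model.
```

With three filter heights, three batch sizes, and two activation functions, the grid contains 18 candidate configurations.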

Overview
In this section, we present the target models and experimental steps, dataset preparation, evaluation metrics, and experimental environment. In this paper, we conducted the experiments in two phases. In the first phase, experiments are conducted using different architectures of CNN models. Based on the performance of the CNN models, the best model is selected for further experiments. In the second phase, experiments are conducted with the best CNN model and two other baseline models (i.e., LSTM and BiLSTM). An overview of the experimental phases is shown in Fig. 8.

Data preparation for experiments
The details of our datasets and their preprocessing procedures are presented in Sections 3.1 and 3.2. We have two datasets, A and B. Dataset A covers a wide variety of algorithms, including combinatorial, geometric, graph, and numerical algorithms, whereas Dataset B consists of codes related to sorting algorithms. In the experiments, 45,398 and 16,216 program codes are used for Datasets A and B, respectively, and about 10% of the program codes of each dataset are randomly selected for evaluation. All the program codes are written in the C++ programming language and have been accepted by the AOJ, which means that all codes are "correct" and efficient enough. Since Dataset A contains more program codes and more diversity than Dataset B, it is used for training and evaluation in the first phase of the experiments. Next, the best CNN model is selected based on performance, and both Datasets A and B are used for evaluation. In the second phase, experiments are performed on Dataset A, and comparisons are made between the best CNN, LSTM, and BiLSTM models.
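The random ~10% holdout described above can be sketched as follows (dataset size from the text; the fixed seed is only to make the sketch reproducible and is an assumption, not the paper's setting):

```python
import random

def split_dataset(samples, eval_fraction=0.10, seed=42):
    """Shuffle the samples and hold out ~10% of them for evaluation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)

train_a, eval_a = split_dataset(list(range(45398)))  # Dataset A indices
```

The same split routine would be applied to Dataset B's 16,216 codes.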

Evaluation metrics
To evaluate model performance, precision (P o ), recall (R o ), F-measure (F o ), and accuracy (A o ) are calculated using (6), (7), (8), and (9), respectively:

P o = TP / (TP + FP) (6)

R o = TP / (TP + FN) (7)

F o = (2 × P o × R o ) / (P o + R o ) (8)

A o = G / N (9)

where TP is the number of true positives, FP the number of false positives, FN the number of false negatives, G the number of correctly classified program codes, and N the total number of program codes. A larger P o value indicates higher credibility of the classification results for a particular category; in other words, P o indicates the accuracy of the classification predictions. R o measures the proportion of program codes of a certain category that are correctly classified into that category, and F o is the harmonic mean of P o and R o .
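The four metrics can be computed directly from the classification counts. A minimal sketch; the helper names and the example counts below are illustrative.

```python
# Precision, recall, F-measure, and accuracy from classification counts,
# following the standard definitions used in the text.
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are recovered."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def accuracy(g, n):
    """g: correctly classified codes, n: total codes."""
    return g / n

# Illustrative counts for one category:
p = precision(90, 10)
r = recall(90, 10)
f = f_measure(p, r)
```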

Implementation details
All the experiments are executed with the PyTorch framework on two NVIDIA GeForce GTX 1080 GPUs with 32 GB of memory. Details of the experimental hyperparameters of the models are presented in Section 3.4. Experimental results on training accuracy and time, classification accuracy, comparisons between training, validation, and evaluation scores, and accuracy under 10-fold cross-validation of the models are presented below.

Training accuracy and time of the CNN models
To investigate the training accuracy, dataset A is first used for model training because it contains a greater variety of program codes from different algorithms than dataset B. ReLU discards all negative neuron values, which can cause many neurons to become inactive and output only 0; this is known as the dying ReLU/dead neuron problem [73]. LeakyReLU is used to mitigate the dying ReLU problem by applying a small slope value (e.g., α = 0.01) to negative inputs. The models achieved classification accuracies of up to about 94%. Therefore, the classification scores of the models help to identify and select the optimal hyperparameters for each model to achieve the best results. Based on the classification A o and F o scores, the top-3 results are presented in Table 8. (Bolded entries in these tables are used for comparison and description.)
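The difference between the two activation functions on negative inputs can be shown with a minimal element-wise sketch, using α = 0.01 as in the text:

```python
# ReLU vs. LeakyReLU on a single pre-activation value. A neuron whose
# pre-activation stays negative outputs 0 under ReLU and receives no
# gradient (the "dying ReLU" problem); LeakyReLU keeps a small slope.
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

relu(-3.0)        # 0.0 -> the neuron is silent, no gradient flows back
leaky_relu(-3.0)  # a small negative value, so a gradient signal survives
leaky_relu(2.0)   # positive inputs pass through unchanged, as with ReLU
```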

Comparison between training, validation, and evaluation of the top-3 models
To evaluate the performance of the top-3 models, the training, validation, and evaluation curves, generated over 100,000 iterations for each model, are compared in Fig. 15(a), (b), and (c), respectively. From these figures, the following observations can be made: (i) all models achieved a training accuracy of approximately 96% and validation and evaluation accuracies of approximately 94%; (ii) during the first 55,000 iterations, all three models experienced more overfitting; (iii) all accuracies increase almost linearly up to 80,000 iterations and then become more stable. Because the top-3 models achieved almost similar F o and A o scores, as shown in Table 8, 10-fold cross-validation is performed with the top-3 models and their corresponding hyperparameters to select the best/optimal model. In each cross-validation round, different sets of training, validation, and test data are randomly selected to verify the effectiveness of the models. The accuracy comparison between the top-3 models for each validation round is shown in Fig. 16. In addition, the average cross-validation accuracy (ACV) is calculated for each model using (10).
ACV = (1/H) × Σ A o (10)

where H is the number of cross-validation rounds and A o is the accuracy of each round. Figure 16 demonstrates that (i) the CNN-Arch-II and CNN-Arch-III models achieved higher accuracy than the CNN-Arch-I model in the 10-fold cross-validations, except for the 10th round; (ii) the CNN-Arch-III model achieved an ACV of 92.76%, which is higher than that of the other two models; (iii) the ACV values of the CNN-Arch-I and CNN-Arch-II models are 91.56% and 92.69%, respectively. Considering the results in training, validation, evaluation, and classification of all the models, the CNN-Arch-III model achieved better results. To validate the superiority of the CNN-Arch-III model, we also performed additional experiments with more convolution layers (i.e., 4, 5, and 6 layers; see Section 4.2.2). The obtained experimental results could not exceed the performance of the CNN-Arch-III model. Henceforth, all experiments are performed with the CNN-Arch-III model. As shown in Figs. 12, 13, and 14, the training time increased for all models, regardless of LR or AF, when BS was set to 16. In addition, all models consumed approximately 0.60% additional time for training when LeakyReLU was used. Furthermore, in most cases, the models achieved better classification results, with F o scores of 94.20% and A o of 93.90%, when LR was lowered to 0.0001 and LeakyReLU was used as the AF, as shown in Tables 5, 6, and 7. Thus, optimal hyperparameter settings have a significant impact on model performance.
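Eq. (10) amounts to averaging the per-round accuracies over the H rounds. A minimal sketch; the individual fold accuracies below are illustrative, not the paper's actual per-round figures.

```python
# Average cross-validation accuracy (ACV): the mean of the per-round
# accuracies A_o over H cross-validation rounds.
def average_cv_accuracy(fold_accuracies):
    return sum(fold_accuracies) / len(fold_accuracies)

# Illustrative accuracies for H = 10 rounds (not the paper's values).
folds = [92.1, 93.0, 92.8, 92.5, 93.2, 92.9, 92.6, 93.1, 92.4, 93.0]
acv = average_cv_accuracy(folds)
```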

Program code classification with the optimal CNN model
In this part of the experiment, the results of program code classification using the best model are presented. The best CNN model (CNN-Arch-III) is used for further experiments. Here, program code classification tasks are performed with datasets (A and B), and the corresponding results are presented.

Model performance with Dataset A
Dataset A contains a large number of program codes for various algorithms, such as tree, graph, geometry, computational theory, discrete mathematics, and data structure algorithms (see Table 1); it is therefore more diverse than dataset B. During model training, P o , R o , and F o scores are calculated on the validation data for each category, as shown in Fig. 17. All learning curves are generated over 100,000 iterations. The confusion matrices for validation and evaluation are shown in Fig. 18(a) and (b), respectively. They indicate that the model achieves approximately 100% P o and R o values for the FNP and CGP categories, respectively. Table 9 shows the validation performance for each category of algorithms, and the evaluation results are summarized in Table 10. In this part of the experiment, precision and recall scores are calculated for each category, and validation and evaluation are performed on dataset A using the CNN-Arch-III model. Given the diversity of dataset A, the overall classification results achieved with the best model are significant.
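Per-category P o and R o , as read off a confusion matrix, can be sketched as follows. FNP and CGP appear in the text; the third label and all counts here are illustrative.

```python
# Build a confusion matrix (rows: true labels, columns: predictions)
# and read off per-category precision and recall from it.
def confusion_matrix(y_true, y_pred, labels):
    idx = {label: i for i, label in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def per_class_precision_recall(m, k):
    """Precision and recall for class index k."""
    tp = m[k][k]
    fp = sum(m[i][k] for i in range(len(m))) - tp  # column k minus tp
    fn = sum(m[k]) - tp                            # row k minus tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

labels = ["FNP", "CGP", "XYZ"]  # "XYZ" is a made-up third category
y_true = ["FNP", "FNP", "CGP", "XYZ", "CGP"]
y_pred = ["FNP", "CGP", "CGP", "XYZ", "CGP"]
m = confusion_matrix(y_true, y_pred, labels)
```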

Model performance with Dataset B
Dataset B is also used for training, validation, and evaluation of the model (CNN-Arch-III) similar to Dataset A. The program codes of Dataset B refer to sorting algorithms. The purpose of all sorting algorithms is the same, but the way they are applied in the codes is different. The SFs of the program codes of all the sorting algorithms are used for model training, allowing the models to learn the actual features of the sorting algorithms, instead of the codes. For the evaluation, the average P o , R o , and F o are calculated for each category of sorting algorithm, as shown in Table 11. The model obtained an average P o , R o , and F o scores of 97.00%, 96.90%, and 96.90%, respectively.
Comparing the performance of the model on datasets A and B, the model achieved a higher F o score for dataset B (96.90%) than for dataset A (94.50%). This is because dataset A contains more diverse program codes and algorithms, whereas dataset B is restricted to sorting algorithms, so the model could better process and learn the features of the sorting algorithms.

Program code classification with the LSTM and BiLSTM models
To compare the classification performance of the proposed model, experiments with baseline models, such as LSTM and BiLSTM, are performed under the same experimental settings.

Comparison with baseline models
To validate the effectiveness of our CNN-based program code classification model, it is compared with different state-of-the-art models in two ways: first, a comparison with models from other studies, as shown in Table 14, and second, a comparison of the results with state-of-the-art models on our dataset, as shown in Table 15.
The experimental results, datasets, number of program codes, languages, and models are considered when making comparisons with other studies, as shown in Table 14. Models such as DP-ARNN [28], RF [28], LSTM [8], and LSTM-AttM [8] are used to classify the defective source codes as either defective or non-defective (i.e., binary classification). In the binary classification, the LSTM-AttM model achieved a comparatively higher F o score of 94.00% than the other referenced models. The Stacked Bi-LSTM model achieved an F o score of about 89.24% for the multiclass classification task, which is higher than that for other models. In contrast, the proposed CNN-Arch-III model achieved a higher F o score of 95.70% than the other comparative multiclass classification models. In addition, the CNN-Arch-III model achieved a higher F o score among all classification models (binary and multiclass). Moreover, the experimental data size of our study is 61,614, which is also larger and more diverse than that of the other compared baseline classification models from different studies.
In addition, experiments are performed on the same dataset with the LSTM and BiLSTM models, as shown in Tables 12 and 13, respectively, and the results are compared with those of the proposed CNN models, as shown in Table 15. The classification results of the proposed CNN models show their potential for detecting algorithms in program codes.

Discussion
In this section, we discuss the approach, including the scalability of the model compared with other state-of-the-art models and the usefulness of the model in programming learning and software engineering. In addition, we discuss the threats to the validity of the proposed model.

Model performance analysis
In this paper, we focus on training the DNN models using the algorithmic features of the code rather than meta-information. We considered SFs as key components of the algorithm in each solution code. A large number of practice-oriented solution codes were collected and processed for training and evaluation of the model. We conducted extensive experiments with different CNN architectures and hyperparameters. The CNN-Arch-III model achieved better training, validation, and evaluation accuracy than the other CNN models. Comparisons were also made between the CNN, LSTM, and BiLSTM models to demonstrate their classification performance. Experimental results show that DNN models recognize the algorithm in solution codes with an acceptable degree of accuracy. For code classification, the CNN-Arch-III model achieved average F o scores of 94.5% and 96.9% for datasets A and B, respectively. This result shows that the model achieved high accuracy in classifying program codes without meta-information.
In addition, we reviewed a large body of literature on program code classification. We found that existing studies classify codes based on various types of meta-information of the code, including programming language [58][59][60][61], code tags [63], errors [8,28], and category [64,65]. To the best of our knowledge, no study has considered the algorithmic (structural) features of codes in the classification task. Accordingly, a comparison of the proposed CNN-Arch-III model with other relevant classification methods is presented in Table 14. In this paper, we recognize the importance of the algorithmic (structural) features of the codes for the classification task, and the experimental results (Tables 5, 6, 7, 12, and 13) show that DNN models achieve significant results using the SFs of the program codes.

Model scalability
In this study, SFs are extracted from the codes, and then the CNN model is trained to classify the program codes. The model classifies the program codes based on the category of algorithms with a high F o score of about 95.7%. This high accuracy demonstrates that the proposed approach, including SF extraction, OBM conversion, and training and evaluation of the best CNN model with real-world program codes, is effective. Moreover, the experiments are conducted with program codes written in C++, a procedural programming language; thus, the proposed model can also be utilized for classifying program codes of other procedural languages, such as Python, Java, and C. Based on the comparison studies with the baseline classification models, the proposed model (CNN-Arch-III) performs favorably, as shown in Tables 14 and 15. The proposed model also has the scalability to classify large industrial program codes. Typically, industrial program codes are quite long and contain many functions and classes. As these functions may contain different algorithms, the proposed model can be useful for classifying codes at the function level. Thus, the proposed code classification model can be useful and scalable for various programming-related tasks.

Model usage in programming learning
One of our research objectives is how the model can help programmers learn to program in real-world environments. From this viewpoint, the proposed model has been developed. The experimental results indicate that the present study can be useful for programming learning. A considerable amount of programming code is regularly generated from various sources such as academia, industry, programming platforms, and the OJ. However, programmers often find it challenging to identify the algorithms in the reference program codes while learning and searching from a large number of codes. Therefore, knowing the program code algorithm can help programmers better understand the code and accelerate their learning progress. Here, the proposed code classification model can effectively assist programmers in identifying algorithms contained in program codes. Moreover, the proposed model can be integrated with various real-world programming learning platforms, including OJ systems.

Model usage in software engineering
Repositories of real-world program codes play a key role in building effective ML models in SE. ML models are suitable in various fields of SE, such as strategic decision making, rapid prototyping, design and analysis, bug detection, code review, bug fixing, code reuse, and intelligent programming assistants (IPA). In addition, ML-enabled IPA systems can provide the best relevant code examples, best practices, and related texts as just-in-time support. As a result, the importance of ML models in software development and their application in SE is increasing significantly [14,75]. The proposed CNN model classifies program codes by identifying the algorithms contained in the codes. Therefore, this model can also be used directly/indirectly for various SE tasks such as code review, bug detection, code examples, and code refactoring. In particular, the proposed model can be used as a supporting component of other ML models in SE that deal with SFs of program codes.

Threats to validity
This study applied several novel ideas from data preprocessing to model development. The model achieved significant results in classifying program codes during the experiments. However, the proposed model may suffer from the following threats: (i) variation in the list of feature tokens for other programming languages; (ii) different strategies for data preprocessing; (iii) different sets of programming problems; (iv) problem sets in other programming languages, such as C, Python, Java, and C#; and (v) different values of hyperparameters and architectures of the CNN model. In follow-up work, we plan to validate the model's performance by addressing the above-mentioned threats.

Conclusion and future work
We developed CNN models to classify program codes based on the identified algorithms. Real-world program codes were collected from the AOJ system and utilized in all experimental tasks. The SFs of the program codes were extracted to train the CNN models and were converted into OBMs through several processing steps. Different sets of hyperparameters, such as CL, LR, AF, and BS, were used in the CNN models in different combinations. The top-3 CNN models and their hyperparameters were selected based on the superior experimental results. In addition, a 10-fold cross-validation was performed to select the most suitable (topmost) CNN model and hyperparameters for further experiments. Subsequently, all the experiments with the best CNN model were performed on both datasets (A and B). The model achieved significant classification results for both datasets: average P o , R o , and F o scores of 94.30%, 94.80%, and 94.50%, respectively, for dataset A, and average P o , R o , and F o scores of 97.00%, 96.90%, and 96.90%, respectively, for dataset B. Furthermore, the performance of the proposed CNN model was compared with those of other baseline models. The results indicate that the proposed model outperforms the referenced models and is scalable in classifying program codes of diverse algorithms. In addition, the model can be useful in classifying program codes of other procedural programming languages, such as C, Java, Python, and C#.
In the future, the code block sequence of program codes can be considered, instead of SFs, to investigate the model performance. Moreover, a multi-label classification model can be considered to classify program codes with multiple labels. In addition, the model can be used to evaluate large-scale industrial program codes.

Funding
This research work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (Grant Number 19K12252).

Code Availability
In this paper, all real-world program codes for the experimental tasks are collected from the AOJ platform. All codes can be accessed via the following reference URLs: https:// onlinejudge.u-aizu.ac.jp, https://judge.u-aizu.ac.jp/onlinejudge/, and http://developers.u-aizu.ac.jp/index.

Conflict of Interests
The authors declare that they have no conflicts of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.