1 Introduction

Native Language Identification (NLI) is a multi-class supervised classification task that determines an author's native language (L1) by analyzing their spoken or written production in a second language (L2). NLI uses natural language processing techniques to extract and identify the linguistic behaviors of native speakers from learner corpora [1, 2]. It finds applications in second language acquisition [3], authorship attribution [4], forensic linguistics [5], and language education. Many recent approaches in the machine learning and natural language processing literature have attempted NLI with varying degrees of success. Most of them model NLI as a classification problem and use machine learning algorithms such as support vector machines to predict the L1 of unseen input. Deep learning approaches have also been applied to NLI recently [6, 7].

In the recent past, a simple, robust, and shallow technique based on string kernels has proven to be an effective solution to NLI problems. The technique was further improved by merging multiple string kernels. When kernels are merged by sum, product, or kernel alignment, the features are embedded in a higher-dimensional space, allowing a better choice of classifier discriminant function and superior NLI performance. These string kernels were implemented at the character level, where they could capture linguistic patterns and similarities across languages. A notable property of character-level string kernels is that they are language/topic independent and linguistically neutral [8]. Despite these benefits, character-level string kernel approaches require significant training time due to the high-dimensional feature space. Moreover, operating at the character level, they may be unable to capture local syntactic effects effectively.

This paper addresses these problems by presenting approaches for native language identification using string kernels at the word level. The main aim is to study the feasibility of constructing native language classifiers that disregard spelling errors, grammatical errors, and punctuation while analyzing text written in L2. Feature sets based on word n-grams and noun phrases are generated to build our approach. The proposed techniques then incorporate different string kernels [8,9,10] and their combinations for feature encoding, combined with classifiers. Among the many classifiers available, support vector machine, random forest, and XGBoost are selected in this paper due to their popularity and applicability to the problem considered. As the proposed technique operates at the word level, it requires less training time while retaining the advantages of string kernel-based text handling. The main contributions of this paper may be summarized as follows:

  • Proposes different feature sets using n-grams and noun phrases that may better encode the textual features.

  • Introduces an approach for native language identification using string kernels that employ word-level features.

  • Conducts detailed experimental analysis of different native language identification techniques based on string kernels using benchmark datasets.

The remainder of the manuscript is organized as follows: Sect. 2 reviews related work, with a summary highlighting significant contributions, and explains the motivation behind the presented work. The proposed methodology, including a description of the features used, is given in Sect. 3. Experimental details of the proposed NLI techniques, along with a description of the corpus and pre-processing, are given in Sect. 4. Section 5 presents and discusses the experimental results. Finally, Sect. 6 concludes the paper and outlines future work.

2 Related Studies

NLI determines a speaker's native language from the linguistic content they generate in a second language and is often handled as a multi-class classification problem. This section details and evaluates some of the recent and prominent approaches reported in the literature that attempt to solve the native language identification problem by employing string kernels and word-level features.

An approach that represents every textual specimen as a set of points in a topic-independent feature space was recently reported [11]. The top-k Stylistically Similar Text samples (SSTs) from the corpus are pre-identified and used to identify the author's native language with a k-nearest neighbors classifier. On the UGC corpus, this method reported an accuracy of 84.51%. Another approach applied to the UGC corpus [12] used logistic regression as the classification model. This work incorporates character 3-grams, token unigrams, spelling and grammar errors, function words, POS 3-grams, sentence length, social networks, and subreddits as features, and achieved an accuracy of 78.99% in cross-corpus testing. Accuracies of 85.2% and 86.8% were the best results from Cross Validation (CV) and the test set on the TOEFL dataset in [13], which incorporated ensemble methods using features such as word or lemma n-grams, character n-grams, function word unigrams and bigrams, parts-of-speech n-grams, dependencies, context-free grammar rules, adaptor grammars, and TSG fragments. When applied to two distinct datasets, the ASK corpus of Norwegian learners [14] and the Jinan Chinese learner corpus [15], the ensemble approaches achieved 81.8% and 76.5% accuracy, respectively, with Latent Dirichlet Allocation (LDA) [16, 17] based classification producing the best results in both cases.

The model proposed by Humayun et al. [18] combines hierarchical Convolutional Neural Networks (CNNs) [19] applied to vowel segments and entire voice utterances. On a dataset of five native Indian languages with an average of 55 speakers each, the model achieves an average accuracy of up to 83.6% for an 80-20 train-test ratio. The categorization results are reported separately for each ARPABET vowel segment for both short and long speech durations. The low-pass filtering-based voice augmentation used by the model works well to increase classification accuracy. In the work of Cimino and Dell'Orletta [20], a sentence-level classifier is used to improve the predictions of a second, document-level classifier. A logistic regression model trained on common NLI lexical, stylistic, and syntactic features serves as the sentence classifier. An SVM trained with the same feature set plus the labeled sentence predictions is used to classify documents. With an accuracy of 88.18% on the official NLI Shared Task 2017 test data, this sentence-document stacking architecture outperformed all other competitors on the essay track.

An SVM-based NLI method that considers several lexical and syntactic features is detailed in [21]. The method combined typed character n-grams and syntactic n-grams with word, lemma, and POS n-grams, function words, and spelling-error character n-grams. Additionally, it introduced two new feature categories: syntactic n-grams and typed character n-grams. The log-entropy measure is used to assign a weight to each feature. On the official NLI Shared Task 2017 test data, the system achieved an overall accuracy of 88.09%. Rather than performing conventional feature selection, the technique presented in [22] employs multiple kernel learning to combine several string kernels. Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR) are used separately in the learning step. The empirical studies in this work show that the technique performs at the state of the art in NLI. Experimental analysis was conducted on the ICLE [23], TOEFL11 [24], and TOEFL11-Big corpora. The system that integrates the intersection string kernel and the presence bits kernel through kernel alignment and learns with KDA achieved the most remarkable performance on the TOEFL11 corpus, with an accuracy of 85.3% on the test set and 84.1% under CV. On the ICLE corpora, integrating the presence bits kernel with either the Local Rank Distance (LRD)-based kernel or the intersection string kernel yields the best systems. The KRR classifier combining the presence bits kernel with the LRD-based kernel achieves the highest performance of 91.3%. Several KRR- and KDA-based systems trained on the TOEFL11 corpus were tested on the TOEFL11-Big corpus, reaching an overall accuracy of 67.7%. It is worth noting that the significant performance improvement stems not from the learning technique (KRR or KDA) but from the string kernels operating at the character level.

An improvement of the above work on the essay, speech transcript, and fusion tracks using i-vectors is presented in [25]. The i-vectors were used to build the kernel by computing the Euclidean distance between pairs of normalized i-vectors; a Radial Basis Function (RBF) kernel was then applied to convert the distance into a similarity measure. An essay, a speech transcript, and an i-vector representation derived from an audio file, produced by speakers of 11 languages, were used in the work. Normalized versions of the presence bits kernel and intersection string kernel, as well as squared RBF versions of both, were used in the experimental analysis. The sum of the squared RBF presence bits kernel and the squared RBF intersection kernel gave the best accuracy on the essay training dataset, at 85.5%. For the speech training dataset, the sum of the squared RBF presence bits kernel, the squared RBF intersection kernel, and the squared RBF kernel based on i-vectors gave the best accuracy of 85.45%. Reference [26] contains information on other significant works completed as part of the NLI Shared Task 2017.

Recent deep learning approaches for native language identification have also been reported in natural language processing venues. A cross-corpus evaluation of the performance of content-based and content-independent features is presented by Yasmeen Bassas et al. [27]. The paper presents an NLI technique using a model pre-trained on the TOEFL dataset, with evaluation carried out on the Reddit dataset. Three classifiers, namely logistic regression, linear SVM, and baseline classifiers, are used for classification. It is observed that content-based features produced more accurate and robust results. A deep generative NLI technique based on transformer adapters is presented and discussed by Ahmet Yavuz Uluslu et al. [28]. The adapters were introduced to overcome memory limitations and thereby improve training time, enabling the scaling of NLI applications. The paper observed that this scaling came at the cost of a decrease in model performance. An n-gram NLI model with logistic regression applied to English literature in computer science, achieving an accuracy of 90%, is presented in [29]. The experimental analysis in that article demonstrated the possibility of using vocabulary to detect the native language with reasonable accuracy.

Table 1 Summary of the literature review highlighting the details of methods, features, datasets used and accuracy achieved

A summary of the significant papers reviewed, highlighting the methods, datasets used, and accuracy achieved, is given in Table 1. The prominent approaches reported by Ionescu et al. [22, 25] clearly demonstrate the viability of string kernels [32] within the framework of NLI. Experiments using string kernels have revealed state-of-the-art performance when character n-gram features are considered. It is therefore worth exploring the possibility of using string kernels at the word level for native language identification, which is the main motivation for the work reported in this paper.

3 Proposed Methodologies

Native language identification involves analyzing the linguistic features of a non-native speaker's written or spoken language use. These linguistic features are extracted and used to build a statistical or machine learning model capable of predicting the speaker's native language.

This section provides an in-depth description of the proposed methodology for identifying native languages. The methodology involves a three-step process comprising feature extraction, computation of similarity measures, and classification. Figure 1 depicts the general workflow. The corpus undergoes pre-processing before feature extraction; Sect. 4.1 discusses the pre-processing techniques in detail. After pre-processing, the linguistic characteristics inherent in the corpus are extracted. The feature sets are built from word n-grams and noun phrases as their fundamental components, with a particular emphasis on noun phrases. After extraction, the feature sets are vectorized to render them suitable for machine learning; string kernels are used to perform this vectorization. The vectorized documents are then used to train classifiers such as SVM, RF, and XGB, with training carried out over all documents in the corpus. This completes the training phase. To test an unknown query document, the document is first pre-processed, after which its features are extracted and vectorized using string kernels, in the same way as during the training phase. The classifier then uses the vectorized query document to predict the native language.

Fig. 1 Overall workflow of the proposed string kernel-based approach for native language identification

A comprehensive explanation of the process of generating the feature sets and computing the similarity measures with the three string kernels is given below. The feature sets use word n-grams and noun phrases as their fundamental components. Noun phrases hold significant importance in certain languages, such as English: along with verbs, they constitute a dominant portion of speeches and essays and carry much of the semantic and syntactic content of sentences. Since a noun phrase spans more than one word, treating its words only as unigrams may cause similarity measures to lose the underlying semantics. Hence, it is worth examining how proper extraction and inclusion of noun phrases in the features affects the performance of native language identification. This work uses three distinct feature sets (denoted FS) to evaluate the similarity measures and classify the documents (a construction sketch follows the list below).

  1. FS1: This feature set considers word n-grams only.

  2. FS2: The extracted noun phrases are combined with the word n-grams.

  3. FS3: The noun phrases are removed from the text, leaving only word n-grams.
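
The construction of the three feature sets can be pictured with a minimal sketch, assuming spaCy's en_core_web_sm model for noun phrase (noun chunk) extraction; all function names here are illustrative, not the authors' actual implementation.

    # Minimal sketch of FS1-FS3 construction (illustrative only);
    # assumes spaCy's en_core_web_sm model is installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def word_ngrams(tokens, n):
        """All contiguous word n-grams of a token sequence."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def build_feature_sets(text, n=2):
        doc = nlp(text)
        tokens = [t.text for t in doc]
        noun_phrases = [chunk.text for chunk in doc.noun_chunks]

        fs1 = word_ngrams(tokens, n)      # FS1: word n-grams only
        fs2 = fs1 + noun_phrases          # FS2: word n-grams plus noun phrases
        np_words = {w for phrase in noun_phrases for w in phrase.split()}
        # FS3: n-grams after removing the words that occur inside noun phrases
        fs3 = word_ngrams([t for t in tokens if t not in np_words], n)
        return fs1, fs2, fs3

    fs1, fs2, fs3 = build_feature_sets("the little boy quickly opened the old wooden door")

Here FS3 removes every word that appears inside a noun phrase before forming n-grams, which is one plausible reading of the definition above.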

The feature sets are created from the training data, and every word n-gram and noun phrase in the dataset is used as a feature according to the definition of each feature set. The second stage computes the similarity measure with the help of a string kernel. Three string kernels [10, 22], namely the Spectrum Kernel (SPK), the Presence Bits Kernel (PBK), and the Intersection Kernel (ISK), are used to compute the similarity measure for feature sets FS1, FS2, and FS3; a word-level sketch of all three kernels is given after the list below.

  • Spectrum Kernel: Counting the number of substrings of length n shared between two strings is one way to evaluate the degree to which the strings are similar; this yields the n-spectrum kernel. The mathematical formulation of the spectrum string kernel is given in Eq. 1 [8].

    $$\begin{aligned} k_{sp}(s_{1},s_{2})=\sum _{u \in \Sigma ^n} numb_u (s_{1}) \times numb_u (s_{2}) \end{aligned}$$
    (1)

    where \(k_{sp}(s_{1},s_{2})\) is the n-gram spectrum kernel giving the similarity between two strings \(s_{1}, s_{2} \in \Sigma ^*\) over an alphabet \(\Sigma \), and \(numb_u(s_1)\), \(numb_u(s_2)\) denote the number of times the string u occurs as a substring in \(s_{1}\) and \(s_{2}\), respectively. The feature map associated with the spectrum kernel maps each string to a vector of dimension \(|\Sigma ^n |\) holding the histogram of frequencies of all its substrings of length n.

  • Intersection Kernel: The intersection string kernel gives a high score when an n-gram occurs often in both strings, taking the lower of the two frequencies, whereas the spectrum kernel can give a high score even when the n-gram is frequent in only one of the two strings. The intersection kernel thus captures deeper information about how the frequency patterns of the two strings relate to each other. It is calculated as in Eq. 2 [8].

    $$\begin{aligned} k_{it}(s_{1},s_{2})=\sum _{u \in \Sigma ^n} \min \bigl (numb_u (s_{1}), numb_u (s_{2})\bigr ) \end{aligned}$$
    (2)

    where \(k_{it}(s_{1},s_{2})\) is the n-gram intersection kernel of strings \(s_{1}\) and \(s_{2}\), and \(numb_u(s)\) is the frequency of occurrence of string u as a substring of s.

  • Presence Bits Kernel: When the embedding feature map is adjusted so that each string is associated with a vector holding the presence bits (instead of frequencies) of all its substrings of length n, a variant known as the n-gram presence bits kernel is obtained, as shown in Eq. 3 [8].

    $$\begin{aligned} k_{p}(s_{1},s_{2})=\sum _{u \in \Sigma ^n} it_u (s_{1}) \times it_u (s_{2}) \end{aligned}$$
    (3)

    where \(k_{p}(s_{1},s_{2})\) is the presence bits kernel between strings \(s_{1}\) and \(s_{2}\), and \(it_u (s) = 1\) if string u is present as a substring of s, and 0 otherwise.
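
Under the word-level interpretation used in this paper, the three kernels of Eqs. 1-3 reduce to simple operations over word n-gram counts. The following is a hedged sketch, with words playing the role of the alphabet symbols; a real implementation would precompute the counts for the whole corpus.

    # Word-level versions of Eqs. 1-3 over n-gram count dictionaries
    # (a sketch; words replace characters as the alphabet symbols).
    from collections import Counter

    def ngram_counts(tokens, n):
        """numb_u(s) for every word n-gram u of the token sequence s."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def spectrum_kernel(c1, c2):
        # Eq. 1: product of frequencies, summed over shared n-grams
        return sum(c1[u] * c2[u] for u in c1.keys() & c2.keys())

    def intersection_kernel(c1, c2):
        # Eq. 2: minimum of the two frequencies, summed over shared n-grams
        return sum(min(c1[u], c2[u]) for u in c1.keys() & c2.keys())

    def presence_bits_kernel(c1, c2):
        # Eq. 3: it_u(s1) * it_u(s2) is 1 only for n-grams present in both
        return len(c1.keys() & c2.keys())

Restricting the sums to the shared n-grams is safe because every term with \(numb_u(s)=0\) vanishes in Eqs. 1-3.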

To increase the overall performance of a classifier, it is reasonable to integrate all the kernels into a single one. This embeds the features in a higher-dimensional space, expanding the search space of linear patterns from which the classifier can choose an appropriate discriminant function. Summing the kernels is a straightforward technique, analogous to concatenating feature vectors. The computation of similarity measures using combinations of string kernels, namely SPK + PBK, PBK + ISK, SPK + ISK, and SPK + PBK + ISK, on the extracted features is also explored in this paper (a training sketch follows below). Once the similarity kernel matrix of the training dataset is calculated, different classifiers such as SVM, RF, and XGB are trained and then used to identify the native language of unseen text scripts. The pseudocode of the proposed native language identification is given in Algorithm 1, and the notation used there is defined in Table 2.
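
As a rough sketch of the kernel merger and the precomputed-kernel training step, assuming scikit-learn's SVC: the kernel functions are those sketched above, and train_docs, test_docs, and y_train are placeholders for the per-document n-gram Counters and the L1 labels. The RF and XGB learners would instead consume the kernel matrix rows as ordinary feature vectors.

    # Sketch of kernel summation (SPK + PBK) and precomputed-kernel SVM
    # training; train_docs/test_docs/y_train are placeholders, and
    # spectrum_kernel / presence_bits_kernel come from the sketch above.
    import numpy as np
    from sklearn.svm import SVC

    def gram(docs_a, docs_b, kernel):
        """Gram matrix K[i, j] = kernel(docs_a[i], docs_b[j])."""
        return np.array([[kernel(a, b) for b in docs_b] for a in docs_a], dtype=float)

    def merged(docs_a, docs_b):
        # Summing kernels is analogous to concatenating feature vectors
        return (gram(docs_a, docs_b, spectrum_kernel)
                + gram(docs_a, docs_b, presence_bits_kernel))

    clf = SVC(kernel="precomputed").fit(merged(train_docs, train_docs), y_train)
    # At test time the kernel is evaluated between query and training documents
    predictions = clf.predict(merged(test_docs, train_docs))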

Table 2 Notations and definitions of attributes used in Algorithm 1

4 Experiments

This section details the experimental testbed and dataset used to implement the proposed algorithms. All the experiments were conducted on a server with an NVIDIA Tesla P100 GPU and an Intel Xeon Platinum 8160 CPU running at 2.10 GHz with 96 cores and 48 GB of RAM. Python 3.10 was used to implement the algorithms, and the natural language processing functions were taken from the Natural Language Toolkit (NLTK) available at https://www.nltk.org/.

4.1 Dataset

This paper uses the UD English-ESL / Treebank of Learner English (TLE) dataset [33]. The dataset comprises 9232 answers to the two FCE tasks produced by learners taking the Cambridge ESOL First Certificate in English (FCE) examination in 2000 and 2001, covering ten native languages. Table 3 provides a breakdown of the selected textual components across languages. The training and test sets consist of 8235 and 997 documents, respectively. This dataset is among the most commonly used collections of English learner essays, making it well suited to this research. Before the dataset is used in the experimental analyses, several pre-processing steps are applied. The spelling errors in the corpus were removed, proper word spacing was ensured, and fluent text was obtained by removing HTML tags. Extra spaces, punctuation, and symbols were also cleaned, and all text was converted to lowercase. Finally, common salutation words such as "dear", "sir", "madam", "mrs.", and "ms." were removed. Noun phrases are extracted using the spaCy library available at https://spacy.io/. The BeautifulSoup Python library available at https://pypi.org/project/beautifulsoup4/ is used to strip the HTML tags and recover the text. The NLTK library was used for tokenization, and punctuation was removed with the help of the Python regex library available at https://pypi.org/project/regex/. A snapshot of the dataset before and after pre-processing is shown in Fig. 2.

Fig. 2 Sample document from the UD English-ESL / Treebank of Learner English (TLE) dataset before and after pre-processing
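
A rough sketch of the pre-processing pipeline described above is shown below, assuming BeautifulSoup and NLTK as stated; the salutation list is illustrative and spelling correction is omitted for brevity.

    # Simplified pre-processing sketch (spelling correction omitted);
    # the salutation list is illustrative. nltk.download('punkt') may
    # be required before word_tokenize can run.
    import re
    from bs4 import BeautifulSoup
    from nltk.tokenize import word_tokenize

    SALUTATIONS = {"dear", "sir", "madam", "mrs", "ms"}

    def preprocess(raw_html):
        text = BeautifulSoup(raw_html, "html.parser").get_text(" ")  # strip HTML tags
        text = text.lower()                                          # lowercase everything
        text = re.sub(r"[^a-z\s]", " ", text)                        # drop punctuation/symbols
        text = re.sub(r"\s+", " ", text).strip()                     # collapse extra spaces
        return " ".join(t for t in word_tokenize(text) if t not in SALUTATIONS)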

Preliminary tests with different feature sets (different word n-gram orders) and classifiers were conducted on the corpus to determine which learning technique would produce the most accurate results on the test data. All word n-gram orders between 1 and 10 were assessed, and the best accuracies were obtained with a random forest classifier. As mentioned in the methodology, the preliminary tests were carried out using standalone string kernels and their combinations with the SVM, RF, and XGB classifiers. Model performance is evaluated using accuracy, precision, recall, F1-score, and training time. The results of this implementation, followed by a detailed discussion, are given in Sect. 5.
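
The preliminary sweep can be pictured as follows, assuming scikit-learn and xgboost; build_kernels is a hypothetical helper returning the train and test Gram matrices for a given n-gram order, and y_train/y_test are placeholders assumed to be integer-encoded labels. As noted earlier, the RF and XGB models consume the kernel matrix rows as feature vectors.

    # Illustrative preliminary sweep over word n-gram orders (1-10);
    # build_kernels is a hypothetical helper, y_train/y_test are
    # placeholder integer-encoded labels.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    for n in range(1, 11):
        K_train, K_test = build_kernels(n)   # Gram matrices for this n-gram order
        for name, clf in [("SVM", SVC(kernel="precomputed")),
                          ("RF", RandomForestClassifier()),
                          ("XGB", XGBClassifier())]:
            clf.fit(K_train, y_train)
            print(f"n={n} {name}: {accuracy_score(y_test, clf.predict(K_test)):.4f}")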

Algorithm 1 Pseudocode of the proposed native language identification technique
Table 3 Number of text scripts belonging to different languages in the UD English-ESL / TLE training dataset

5 Results and Discussions

This section details the experimental results obtained on the test dataset. Table 4 shows the precision, recall, and F1-score values obtained for each native language in the test dataset using the spectrum string kernel with FS3 and the RF classifier (found to be the best configuration). The classification accuracies obtained for NLI with different string kernels (individual and combined) applied to the feature sets defined in Sect. 3 are given in Tables 5 and 6. The precision, recall, and F1-score of the most accurate model among SVM, RF, and XGB for each feature set, when applied to the test data using the proposed and existing string kernel NLI techniques, are also presented in the respective tables.
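
For reference, per-language and macro-averaged scores of the kind reported in Tables 4-6 can be produced directly from the predictions, e.g. with scikit-learn; y_test and y_pred are placeholders for the true labels and the best model's predictions.

    # Per-language precision/recall/F1 plus macro averages (a sketch);
    # y_test and y_pred are placeholders, not the authors' variables.
    from sklearn.metrics import classification_report

    print(classification_report(y_test, y_pred, digits=4))
    # the "macro avg" row corresponds to the macro-average scores in Tables 5-6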

Table 4 Precision, recall, and F1-score of the proposed spectrum string kernel using FS3 with a random forest classifier, which attains the maximum accuracy of 99.09% on the UD English-ESL / TLE test dataset

The existing string kernel techniques are also applied to the UD English-ESL/TLE corpus to conduct a comparative study. Character p-grams are utilized as the feature sets in [10], and different combinations of string kernels are used to construct the similarity measure. In the remainder of the paper, an existing technique is denoted ET. The existing character string kernel techniques implemented in this paper alongside the proposed techniques are character p-grams with the spectrum kernel (ET1), the presence bits kernel (ET2), and the intersection string kernel (ET3). Different combinations of character p-gram string kernels, namely SPK+PBK+ISK (ET4), SPK+PBK (ET5), PBK+ISK (ET6), and SPK+ISK (ET7), are also considered. The experiments were carried out for character 1-grams to 8-grams with different classifiers, and only the best character n-gram order in terms of classification accuracy is presented here. Finally, string kernels over the range of 5-8 grams were calculated using the spectrum kernel (ET8), presence bits kernel (ET9), intersection kernel (ET10), SPK+PBK (ET11), PBK+ISK (ET12), SPK+ISK (ET13), and SPK+PBK+ISK (ET14).

The findings in Tables 5 and 6 demonstrate that all proposed string kernels with the RF classifier consistently provide superior classification accuracy. Once again, merging kernels results in more robust systems: any merger of the spectrum, presence bits, and intersection kernels yields better or comparable accuracy with respect to character n-grams. In Tables 5 and 6, the feature set FS2-T denotes FS2 applied to the tokenized original training dataset.

Table 5 Experimental results on the test dataset: accuracy (%), macro-average precision, recall, and F1-score for the UD English-ESL/TLE corpus using individual string kernels and existing techniques [8] with SVM, RF, and XGB classifiers
Table 6 Experimental results on the test dataset: accuracy (%), macro-average precision, recall, and F1-score for the UD English-ESL/TLE corpus using proposed and existing string kernel merger techniques [8] with SVM, RF, and XGB classifiers

It is observed that the spectrum kernel technique had the lowest performance among the proposed techniques with the SVM classifier, and that the proposed string kernel techniques outperformed the state-of-the-art native language identification techniques based on character string kernels. It is further observed that the string kernels applied to all three feature sets (FS1, FS2, and FS3) and to FS2 on the tokenized original training dataset produced consistent results with only marginal variation. Table 5 indicates that the best performance is achieved using the spectrum string kernel with FS3 and the RF classifier on the test dataset. Similarly, Table 6 shows that the best accuracy on the test dataset was achieved with character 8-grams using the combined presence bits and intersection kernels; however, the accuracy of the proposed string kernels was comparable or better for the other pairwise and triplet combinations. Thus, the experimental findings on the UD English-ESL/Treebank of Learner English (TLE) corpus show that the proposed string kernel techniques perform better than or comparably to conventional character n-gram string kernel native language identification systems.

An illustration of the accuracy achieved using the various proposed fast NLI techniques with the SVM, RF, and XGB learners, along with the existing techniques from Tables 5 and 6, is given in Fig. 3. Figure 3 gives a streamlined view of how the proposed NLI techniques are positioned relative to the existing techniques in terms of accuracy.

From Tables 5 and 6, the following summary of observations can be made.

  1. Standalone string kernels with the SVM classifier: the best accuracy was obtained with the spectrum kernel on FS2, the presence bits kernel on FS1, and the intersection string kernel on FS2-T.

  2. Standalone string kernels with the RF classifier: the spectrum, presence bits, and intersection string kernels all achieved their best accuracy with FS3.

  3. Standalone string kernels with the XGB classifier: the best accuracy was obtained with the spectrum kernel on FS1 and FS2, the presence bits kernel on FS3, and the intersection string kernel on FS1 and FS2.

  4. Overall, all proposed standalone string kernels with SVM, RF, and XGB outperformed the corresponding existing character 8-gram and 5-8 gram techniques.

  5. Pairwise and triplet combinations of string kernels with the SVM classifier: all the proposed techniques achieve better accuracy than the existing techniques.

  6. Pairwise and triplet combinations of string kernels with the RF and XGB classifiers: the performance of the proposed string kernel techniques is comparable to the existing character 8-gram and 5-8 gram techniques.

Fig. 3 Illustration of the accuracy obtained from the existing and proposed fast NLI techniques using string kernels with SVM, RF, and XGB classifiers [the best and worst accuracies are shown in bold and italics, respectively]

The training times of the proposed and existing native language identification techniques using the SVM, RF, and XGB classifiers are illustrated in Fig. 4. In the figure, the x-axis and y-axis represent the classifiers and the training time in hours, respectively; training is carried out using the 8235 second-language training documents.

Fig. 4 Training time (hours) of existing and proposed native language identification techniques using standalone and combined Spectrum (SPK), Presence Bits (PBK), and Intersection (ISK) string kernels for 8235 training documents

It is apparent that the standalone string kernel methods using the spectrum, presence bits, and intersection kernels with the SVM, RF, and XGB classifiers, shown in Fig. 4a, took only around half an hour to complete training with feature set FS1, while the existing techniques using character 8-grams (ET1, ET2, ET3) and 5-8 grams (ET8, ET9, ET10) took more than 3 and 10 hours, respectively, across the different classifiers. This considerable reduction indicates the faster operation of the proposed techniques.

The training times of the pairwise combinations of string kernels with the SVM, RF, and XGB classifiers are shown in Fig. 4b, which also illustrates a significant reduction: the existing character 8-gram and 5-8 gram techniques took more than 6 and 20 hours, respectively, to complete training, with similar training times across the different classifiers. With the proposed string kernels, the training time is only around two hours for feature sets FS2, FS2-T, and FS3, and under one hour for FS1. A similar pattern was seen for the triplet combination of string kernels, as illustrated in Fig. 4c: the proposed techniques completed training in around 3 hours, with the best time being 1.5 hours using FS1, whereas the existing character n-gram techniques took more than 9 and 30 hours.

To better quantify the impact of the proposed techniques on training time, the percentage training time reduction between the fastest existing and proposed techniques is given in Table 7 as a relative measure. Table 7 shows that the proposed techniques using string kernels and their different combinations cut the training time by approximately 84% and 95%, respectively, compared to the existing character 8-gram and 5-8 gram NLI systems. This reduction was similar across the SVM, RF, and XGB classifiers.
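
For reference, the reduction reported in Table 7 follows the usual relative-difference form; writing \(T_{ex}\) and \(T_{pr}\) (symbols introduced here for illustration) for the training times of the fastest existing and proposed techniques:

$$\begin{aligned} \text {reduction}\,(\%)=\frac{T_{ex}-T_{pr}}{T_{ex}} \times 100 \end{aligned}$$

For example, taking \(T_{ex} \approx 3\) h and \(T_{pr} \approx 0.5\) h for the standalone kernels gives \((3-0.5)/3 \times 100 \approx 83\%\), consistent with the approximately 84% reduction noted above.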

Thus, all of the presented techniques are much quicker than the existing character n-gram string kernel techniques while producing superior or comparable accuracy. The paper thereby provides more accurate techniques for identifying one's native language while drastically reducing training time.

Table 7 Percentage training time reduction among fastest existing and proposed native language identification techniques

6 Conclusions and Future Works

This paper introduced word-level string kernel approaches for determining a person's native language. It is observed that the string kernel-based approaches improved classification accuracy on all four feature sets while significantly reducing training time. In all the experiments reported in this paper, the best native language identification performance was observed when the string kernels were paired with random forest classifiers. Among the proposed string kernel approaches, the spectrum kernel yielded the best results with a random forest classifier. The performance of all the proposed string kernels on the various feature sets was comparable to that of the state-of-the-art techniques, while their training time was considerably lower than that of existing approaches. The fastest approaches with substantial accuracy used the standalone spectrum, presence bits, and intersection string kernels in conjunction with the SVM, RF, and XGB classifiers.

As the results are promising, it is worth investigating the feasibility of employing higher-order semantics in the kernel-based similarity computation in the future. As further future work, the proposed string kernel techniques could be evaluated in a cross-corpus setting and possibly extended to other languages. Combining the proposed techniques with a deep SVM learner based on the arc-cosine kernel, or with CNN-based techniques, may also be worth considering as an extension of this work on the native language identification problem.