Semi-supervised deep learning based named entity recognition model to parse education section of resumes

A job seeker’s resume contains several sections, including educational qualifications. Educational qualifications capture the knowledge and skills relevant to the job. Machine processing of the education sections of resumes has been a difficult task. In this paper, we attempt to identify educational institutions’ names and degrees from a resume’s education section. Usually, a significant amount of annotated data is required for neural network-based named entity recognition techniques. A semi-supervised approach is used to overcome the lack of large annotated data. We trained a deep neural network model on an initial (seed) set of resume education sections. This model is used to predict entities of unlabeled education sections and is rectified using a correction module. The education sections containing the rectified entities are augmented to the seed set. The updated seed set is used for retraining, leading to better accuracy than the previously trained model. This way, it can provide a high overall accuracy without the need of large annotated data. Our model has achieved an accuracy of 92.06% on the named entity recognition task.


Introduction
Globally, companies receive resumes in large numbers that require screening. Resumes carry semi-structured text, which is difficult to parse. The difficulty arises from differences in the structures, styles, formats, order, and types of information that resumes incorporate. A resume usually consists of various sections that reflect the candidate's competency. Accurate parsing of these resume sections without manual intervention is a pressing need.
A widely used technique for recognizing entities is named entity recognition (NER). NER refers to identifying all the occurrences belonging to a specific type of entity in the text. NER tasks require a large amount of annotated data that could be extremely cumbersome to produce. There exists a need for auto-annotated data, which can provide good accuracy.
One of the most common approaches used for NER is a lookup against a reference list [18,35]. This approach usually performs well, but it depends entirely on the list and therefore fails for names that the list does not contain. We can also perform NER tasks using various deep learning models.
In NER, the combination of word embedding [23,24], convolutional neural networks (CNN) [13], bidirectional long short-term memory (Bi-LSTM) [14] and conditional random fields (CRF) is the most preferred combination [27]. In our model, we have used the combination without the CRF layer. CRFs perform better with structured data. Since resumes have semi-structured data, we decided to skip the CRF layer [5].

Resume parsing
Much research has been carried out in the field of resume parsing in recent times. Jiang et al. [16] have used statistical and rule-based algorithms for extracting relevant information. However, this approach fails to generalize for resumes in English. Farkas et al. [10] have devised an application where the user uploads his resume from which details are automatically extracted, and an application form is subsequently filled. The user is then allowed to edit the form, if required, and submit it. This method relies on high recall so that even if the information fetched is not precise, the user can edit the automatically extracted information in the form and submit it. A CRF-based resume miner to extract information provides a method for ranking applicants for a given job profile [31]. These methods give low precision and low recall for institute and degree names [10,31].
A cascaded hybrid model for resume parsing uses a combination of the hidden Markov model and support vector machine to extract information in a hierarchic manner [36]. This method again suffers from low precision and recall for institute and degree names. Pawar, Srivastava, and Palshikar [25] have developed an unsupervised algorithm for automatically creating a gazette. They use a search algorithm for the NER task, which performs better than a naïve approach based on regular expressions.
Jiang et al. [16] have designed a parser, which first partitions the resume with the help of rule-based regular expressions by analyzing the characteristics of a Chinese resume. For necessary information, squeeze and sliding window algorithms have been used to achieve 87% accuracy. For complex information, SVM was trained with 1200 resumes, and on testing with 300 resume test samples, 81% accuracy was obtained. This approach involves more rules than automatic information extraction.
Chuang et al. [8] have leveraged the characteristics of the Chinese resume, wherein it is divided into simple and complex items by an iterative process. Eigenvectors, TF-IDF, and SVM algorithm further identify the multiple items with an average accuracy of 81%.
The task of recognizing the education section of a resume has also been taken up. Ravindranath et al. [29] have taken a Gibbs sampling approach to recognize the education section and other parts of a resume, such as the work experience, by converting them to a parse-tree. Tikhonova and Gavrishchuk [32] have used NLP-based methods to recognize the education section of a resume. A Jaccard score of 0.806 was achieved on a Russian dataset of resumes.
Sayfullina et al. [30] have presented a novel technique for resume classification into 27 different categories using domain adaptation. An attempt to overcome the paucity of labeled resume data has been made. Three different kinds of resume component datasets have been worked on, namely job descriptions, resume summaries, and children's dream job descriptions, which differ considerably from one another. Two models have been considered, a word embeddings-based fastText method for classification and a CNN model. For each of the categories mentioned, the CNN outperforms the fastText method. It opens prospects for future work in improving the results using the CNN model when only a small number of labeled resumes is available.

Named entity recognition using deep learning
Recently, deep learning-based methods have also been explored for NER. A pre-trained word embedding, together with character-level features, is used as an input to a neural network model [19,27]. A comparison of a Bi-LSTM-CRF model with a transition-based chunking model with shift-reduce parsers is made for NER, concluding that the former gave a better performance [19]. Stacking of recurrent neural network layers for a biomedical sequence was employed for classification purposes [34]. Bi-LSTM, CNN, and CRF layers have been incorporated into the neural network model for unstructured data [26,27]. Transfer learning is adopted to solve the labeled data scarcity issue, where an artificial neural network (ANN) trained on a large dataset is used to predict another large dataset [20]. To reduce manual annotation, a semi-supervised approach with bilingual corpora is used to increase the annotated training data [37]. Yu et al. [36] have developed a cascaded information extraction (IE) framework. In the first pass, a CV is segmented into blocks with labels for different information types using an HMM. In the second pass, detailed information, such as the name, is extracted from individual blocks, instead of from the entire resume, using a hybrid model consisting of HMM and SVM.
Maheshwary and Misra [21] have proposed a Siamese adaptation of CNN to accomplish the task of matching resumes with a particular job opening. The model consists of a pair of identical CNN, which they propose gives them a measure of semantic relatedness of words in the resume in a controllable manner with low computational costs. They test their model against simple models like Bag of Words, and TF-IDF compared to which their model performed better.
Deep learning approaches have shown a significant improvement in performance and accuracy for the named entity recognition task. Moreover, they can be generalized over a wide variety of data, unlike the rule-based approaches.

Named entity recognition for semistructured data
Chifu et al. [6] have created a skill ontology, and web-crawled resumes are checked for POS patterns after text preprocessing using the Stanford NLP framework. If words are not present in the skill ontology, new skills are added for further skill detection using algorithms trained for specific lexical patterns. Wikipedia is the primary source of the ontology, and the whole system is highly dependent on it. Ghufran et al. [11] leverage the fact that an individual's resume contents are available on Wikipedia for automatic annotation without POS tags. N-grams are constructed from keywords in a resume and then queried against Wikipedia. Returned results are in the form of an interpretation graph, processed for disambiguation and cross-language references. On 153 resumes, education section entities are recognized with an F-score of 85.68%. Dependency on Wikipedia for information is very high, and the dataset of resumes is also limited.
Zhang et al. [8] have proposed a technique for parsing the semi-structured data of Chinese resumes. The system consists of the following key components: firstly, the set of classes used for classification of the entities in the resume; secondly, the algorithm used for identification of those entities; and lastly, the system design. The entities are divided into two major subcategories, that is, simple items like name and date of birth, and miscellaneous items like the learning experience, skills, etc., which exhaustively cover all entities. A total of 5000 resumes is used, and a system comprising an SVM, regular expressions, and a vector space model-based classifier has been implemented. An overall accuracy of 87% for necessary information and 81% for complex information has been achieved. Future work could improve the rough segmentation accuracy in resumes, thereby leading to improved information extraction.
Zhang et al. [38] have proposed an analytic system for mining and visualizing the semi-structured data in resumes. The semantic information is first extracted, after which it is visualized in the form of an individual's career progression, social relationships, and an overall view of the resume. Prospects include incorporating visualization of geographical dimensions.
Darshan [9] has used Perl-based regular expressions to convert semi-structured data into an ontological structure. This semantic information is further represented in XML format for information extraction from resumes.
The limited literature on resume parsing has not demonstrated very high accuracy in identifying institutions and degrees [10,16,36].
In our work, we focus on accurate identification of academic degrees and institute names in a resume's education section. The proposed method eases the recruiter's search for candidates from specific institutions or academic qualifications. It further helps in the analysis regarding recruitment trends specific to colleges, compensations, and industry exposures provided to the candidates [15].
The main contributions of this paper are summarized as follows:
• We demonstrated the use of a modified semi-supervised technique for parsing institute and degree names. Instead of following the traditional semi-supervised approach, we introduced a correction module to rectify the predictions. We added these corrected predictions back to the original seed set, thereby increasing its size. On retraining, this procedure results in improved accuracy, precision, and recall in comparison with the previously trained model.
• We achieved high performance for recognizing degrees and institutes in a resume without large annotated data.

Methodology
In this section, we explain the different modules of the proposed method.

Preprocessing
The seed set contains 550 resumes, which are split into 50 for testing and 500 for training purposes. The preprocessing included the following steps:
• Conversion of PDF resumes to JSON using PDF2JSON [39]
• Extraction of words from JSON for Part-of-Speech (POS) tagging using the Natural Language Toolkit (NLTK) tagger [3,4]. POS tags were used as a feature since the identification of proper nouns in the data aids in the identification of institute names.

Corpus
BILOU is an encoding scheme in which the last token of a multi-token chunk is denoted explicitly by a last (L) tag.
The BILOU encoding scheme suggests learning classifiers that identify the Beginning, the Inside, and the Last tokens of multi-token chunks as well as Unit-length chunks [28]. This increases the number of parameters to be learned by the model. Since we followed a semi-supervised approach with little tagged data initially, we decided not to burden the model with additional parameters. The BIO (Beginning, Inside, Outside) encoding scheme can represent the same data without any additional tags. Hence, we chose it to retain accuracy. We manually annotated the seed set using BIO encoding. The annotations are given in Table 1. From the tagged resumes, we extracted the education sections, which constitute our raw corpus.
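As an illustration of BIO encoding, the following minimal sketch converts entity spans into per-token tags. The label DEG follows the I-DEG tag mentioned later in this paper; the label INS for institute names, the function name, and the example sentence are assumptions made for illustration, since Table 1 is not reproduced here.

```python
def bio_encode(tokens, entities):
    """Return one BIO tag per token.

    entities: list of (start, end, label) spans, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)              # every token starts as Outside
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the chunk
    return tags


# Hypothetical example; "INS" is an assumed label name for institutes.
tokens = ["B.Tech", "from", "Manipal", "Institute", "of", "Technology"]
entities = [(0, 1, "DEG"), (2, 6, "INS")]
tags = bio_encode(tokens, entities)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Compared with BILOU, no separate L or U tag is produced: a single-token entity receives only a B tag, and the last token of a longer entity receives an I tag.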

Data cleaning
When resumes in PDF format are converted to JSON, many inconsistencies can arise. Since these irregularities occur in many variations, bringing them all to a common format is necessary for the model to function properly. The cleaning steps are:
• Resolving unbalanced parentheses
• Removing irregular spacing between words
• Replacing &amp; with &, and
• Removing unwanted characters (including non-ASCII characters)

Irregular spacing between words
If the resume contains ''Studied in CMS'' with an extra space between ''in'' and ''CMS'', the extracted words would include one extra empty word. This is corrected by ignoring empty words in the document (Fig. 2).

Replacing &amp; with &
When the resumes in their PDF form are converted to JSON, the '&' symbol is sometimes returned as '&amp;' and sometimes as '&'; therefore, before moving ahead, it is essential to resolve any inconsistencies of this kind. Consider the following example: if we have ''Studied 10th & 12th from CMS'' in our resume, the extracted words may contain '&amp;' in place of '&'. We replace '&amp;' with '&' to maintain consistency among all such cases (Fig. 3).

Removing unwanted characters (including non-ASCII characters)
There may be some extra unwanted characters present in the words that are extracted. These extra characters must be removed to ensure a common format.
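The cleaning steps described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name is hypothetical, only non-ASCII characters are treated as unwanted (an assumption), and the handling of unbalanced parentheses is left out because its rule is not described in the text.

```python
def clean_tokens(words):
    """Clean the list of words extracted from a converted resume."""
    cleaned = []
    for word in words:
        # Restore ampersands that the conversion returned as an HTML entity.
        word = word.replace("&amp;", "&")
        # Assumption: "unwanted characters" are limited to non-ASCII ones here.
        word = word.encode("ascii", errors="ignore").decode("ascii")
        word = word.strip()
        # Irregular spacing yields empty words, which are ignored.
        if word:
            cleaned.append(word)
    return cleaned


raw = ["Studied", "in", "", "CMS", "10th", "&amp;", "12th", "caf\u00e9"]
cleaned = clean_tokens(raw)
print(cleaned)
```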

Model
In this subsection, we describe the various components of the proposed model (see Fig. 5).

Classification model
The classification model (see Fig. 6) consists of the following layers implemented using Keras [7]:
• Word Embedding Layer: Words used in resumes, such as institute names or degree names, may not be present in a pre-built word embedding. Therefore, we created a new word embedding built on our corpus using Keras. We produced two word embedding layers, one for the classification entities and another for their respective POS tags. We then concatenated these two to form the base model. We added a Dropout layer with a probability of 0.1 to prevent over-fitting, as this value experimentally gave us the best result. We have used 10% of the dataset as the development set. A large batch size reduces the generalization ability of a deep learning model [17]. We, therefore, trained the model using a small batch size of five for 20 epochs. We observed that the model converges well in 20 epochs, as shown in Fig. 7.
In Fig. 7 we have used cross-entropy loss as the loss function. Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. A perfect model would have a cross-entropy loss of 0. The cross-entropy loss function used is given by
$$L = -\sum_{i \in K} y_i \log(y'_i)$$
where $y_i$ is the ground truth and $y'_i$ is the predicted score for each class $i \in K$.
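To make the behavior of the loss concrete, the sketch below evaluates the standard categorical cross-entropy, the negative sum over classes of the ground truth times the logarithm of the predicted score, for a single token. The function name, the clipping constant eps, and the example probabilities are assumptions introduced for illustration.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one example.

    y_true: one-hot ground truth over the classes.
    y_pred: predicted probabilities over the same classes.
    eps:    assumed clipping constant that avoids log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0, 0.0]                                  # true class is the second one
loss_good = cross_entropy(y_true, [0.05, 0.90, 0.05])     # confident and correct
loss_bad = cross_entropy(y_true, [0.60, 0.20, 0.20])      # confident and wrong
print(loss_good, loss_bad)
```

The loss for the second prediction is larger because the probability assigned to the true class is lower, and a prediction that assigns probability 1 to the true class gives a loss of 0.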

Correction module (Processing predictions made on the unlabeled dataset)
The unlabeled data (360 files) undergoes the same cleaning as specified above. We split the unlabeled data into 12 sets of 30 records each. We rectified the classification model predictions made on each set using a correction module. This module's overall task is to ensure that the predicted tags are correct. We used two pointers to extract the initially predicted entity and to check it for correctness, as depicted in Fig. 8. A tag that starts with a 'B' or an 'I' is marked with a start pointer. The subsequent end of the tag is marked with the end pointer. The entities present between these pointers are extracted for further comparisons. The two lists generated by us are a comprehensive collection of the institute and degree names in the USA and India, so the trivial approach would be to opt for direct string matching of the extracted entities with the list. However, we aim to create a model for recognizing institute and degree names for any resume in the English language, including resumes that do not originate from the USA or India. If we use the direct string-matching approach, we may encounter a name that is not present in our list, since this approach relies solely on the names in the list for recognition.
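The two-pointer extraction of predicted entities can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function name, the inclusive end pointer, the INS label, and the example tags are assumptions. A span is opened at a tag starting with 'B' or 'I' and closed when the tag sequence no longer continues it.

```python
def extract_entities(tokens, tags):
    """Return (start, end, label, text) for each predicted entity.

    `start` and `end` are the two pointers; `end` is inclusive.
    """
    entities = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing entity
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and start is not None and tag_label == label
        if start is not None and not continues:
            # The end pointer is the last token before the sequence breaks.
            entities.append((start, i - 1, label, " ".join(tokens[start:i])))
            start, label = None, None
        if prefix in ("B", "I") and start is None:
            start, label = i, tag_label            # place the start pointer
    return entities


# Hypothetical prediction in which "of Technology" was mistaken for a degree.
tokens = ["B.Tech", "from", "Manipal", "Institute", "of", "Technology"]
tags = ["B-DEG", "O", "B-INS", "I-INS", "I-DEG", "I-DEG"]
entities = extract_entities(tokens, tags)
print(entities)
```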
Our model incorporates a neural network component, which does not depend on the list to recognize entities in the resume data and instead works on identifying patterns in the data. Thus, it can successfully recognize new names with similar patterns even if they are not present in the list.
We created a dictionary for mapping short-forms present in institution and degree names to a standard uniform format. For example, we mapped the short-forms ''engg.'' and ''eng.'' to the word ''engineering,'' ''tech.'' to ''technology'' and ''edu.'' to ''education.'' This mapping ensures that short-forms found in either the education section of a resume or the list are compared uniformly.
The original resume's data is not transformed: these mappings are used only to aid the correction module in comparing the original name found in the list with the extracted entity. This is done to reduce the error in tag prediction. For example, while comparing the institute entity ABC Institute of Tech. to the list entry ABC Institute of Technology, we map the word Tech. to Technology to resolve the ambiguity. This ensures that, without making any change to the data (i.e., resumes), our model can handle real-world data with such ambiguities. Our model can therefore identify institute or degree names even if they are present in a different format.
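The short-form mapping can be illustrated with a small dictionary lookup applied only at comparison time. The four mappings are the ones given in the text; the function name and the lowercasing of both strings are assumptions made for this sketch.

```python
# Mappings taken from the examples in the text.
SHORT_FORMS = {
    "engg.": "engineering",
    "eng.": "engineering",
    "tech.": "technology",
    "edu.": "education",
}


def normalize_name(name):
    """Return a comparison key; the resume text itself is left unchanged."""
    words = name.lower().split()          # lowercasing is an assumption
    return " ".join(SHORT_FORMS.get(word, word) for word in words)


extracted = "ABC Institute of Tech."
list_entry = "ABC Institute of Technology"
print(normalize_name(extracted) == normalize_name(list_entry))
```

Because the normalized key is used only for comparison, the original string ''ABC Institute of Tech.'' remains in the training data as it appeared in the resume.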
Since the extracted entity could be incomplete, we employed Fuzzywuzzy [12], a search system, to select the best match for the institute or degree name in L. Fuzzywuzzy is an open-source Python library for fuzzy searching that measures the similarity between two strings. To obtain the similarity ratio between two strings, it uses Levenshtein distance [2]. We fed Fuzzywuzzy with the institute and degree names and the extracted entity to fetch the top five closest matches. The words ''and'', ''of'', ''in'', and ''&'' are ignored during the comparison, since we noticed that people often interchange ''and'' and ''&'' and sometimes omit ''in'' or ''of''.
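The idea of ranking list entries by a Levenshtein-based similarity while ignoring the words ''and'', ''of'', ''in'', and ''&'' can be sketched in plain Python. This is a stand-in written for illustration and is neither Fuzzywuzzy's API nor its exact scoring formula; the function names, the 0 to 100 scale, and the example names are assumptions.

```python
IGNORED = {"and", "of", "in", "&"}


def strip_ignored(name):
    """Lowercase the name and drop the words ignored during comparison."""
    return " ".join(w for w in name.lower().split() if w not in IGNORED)


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed score in [0, 100]; 100 means identical after stripping."""
    a, b = strip_ignored(a), strip_ignored(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def top_matches(entity, names, limit=5):
    """Return up to `limit` (name, score) pairs, best first."""
    scored = [(name, similarity(entity, name)) for name in names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical list entries.
names = ["Manipal Institute of Technology",
         "Indian Institute of Technology Delhi",
         "Stanford University"]
best = top_matches("Manipal Institute Technology", names)
print(best)
```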
A comparison with the list is made in two ways depending upon whether the name contains punctuation. If no exact match is found among the results fetched from L1 (L2), the following procedure is followed:
1. The extracted entity is expanded to accommodate its immediate neighboring word on the left and the right, that is, start pointer = start pointer - 1 and end pointer = end pointer + 1.
2. This entity is then sent to Fuzzywuzzy to fetch a new set of top five matches w.r.t. this entity.
3. The process mentioned above is repeated to check if an exact match can be found.
4. If it is found, then correct the corresponding predictions in the document.
5. Else, the other list L2 (L1) is checked.
6. If it is not found in the other list either, then tag the whole predicted entity as O (others).
Entities that are already corrected are not rechecked in this iterative process for each document. For degree name correction, the same procedure is repeated. The details of the correction module are given in Algorithm 1. Some of the cases covered by the correction module for institute names and degrees are given in Tables 2 and 3, respectively. In case 3 of Table 3, the term 'of technology' is a common occurrence in degree names (Bachelor 'of technology') as well as institute names. The model hence predicted it as I-DEG, but the correction module rectified it by associating the term with its neighboring words and finding the name Manipal Institute of Technology in the institute list created by us.
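The expansion procedure of the correction module can be sketched as follows. This is a simplified illustration under stated assumptions and not Algorithm 1 itself: the standard-library difflib.get_close_matches stands in for Fuzzywuzzy, the limit on the number of expansions (max_expand) is a hypothetical parameter, and the function names, the INS label, and the example lists are invented for the example.

```python
from difflib import get_close_matches


def norm(text):
    return " ".join(text.lower().split())


def find_exact(candidate, names, limit=5):
    """Return the list entry equal to the candidate among its top fuzzy matches."""
    lookup = {norm(name): name for name in names}
    close = get_close_matches(norm(candidate), list(lookup), n=limit, cutoff=0.0)
    for key in close:
        if key == norm(candidate):
            return lookup[key]
    return None


def correct_entity(tokens, tags, start, end, label,
                   primary, other, other_label, max_expand=2):
    """Correct tags[start..end] in place (pointers inclusive); return the match."""
    for names, name_label in ((primary, label), (other, other_label)):
        s, e = start, end
        for _ in range(max_expand + 1):
            match = find_exact(" ".join(tokens[s:e + 1]), names)
            if match is not None:
                tags[s] = f"B-{name_label}"
                for i in range(s + 1, e + 1):
                    tags[i] = f"I-{name_label}"
                return match
            # Expand by one neighboring word on the left and on the right.
            s, e = max(s - 1, 0), min(e + 1, len(tokens) - 1)
    for i in range(start, end + 1):        # not found in either list
        tags[i] = "O"
    return None


# Hypothetical lists and prediction: "of Technology" was tagged as a degree.
degrees = ["Bachelor of Technology"]
institutes = ["Manipal Institute of Technology"]
tokens = ["Studied", "at", "Manipal", "Institute", "of", "Technology"]
tags = ["O", "O", "B-INS", "I-INS", "I-DEG", "I-DEG"]
found = correct_entity(tokens, tags, 4, 5, "DEG", degrees, institutes, "INS")
print(found, tags)
```

In this example the degree list yields no exact match even after expansion, so the institute list is checked, and the expanded span is retagged as a single institute entity.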

Corpus expansion and retraining
The newly annotated data produced by our correction module, for a set, is added to the training set of our corpus. The model is retrained over the newly formed training set for future predictions. This addition to the corpus is repeated for each set of unlabeled data until all the data is exhausted. After retraining the model every time, it is tested against the test data we had set aside initially during the train-test split to ascertain the result's accuracy.
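The expansion and retraining cycle can be summarized by the loop below. It is a schematic sketch: the callables train, predict, correct, and evaluate are hypothetical placeholders for the classification model, the correction module, and the evaluation script, and the demo replaces them with trivial stand-ins only to show the control flow.

```python
def chunk(items, size):
    """Split a list into consecutive sets of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def self_train(seed, unlabeled, test, train, predict, correct, evaluate, set_size=30):
    """Iteratively grow the training set with corrected predictions."""
    training_set = list(seed)
    model = train(training_set)
    history = [evaluate(model, test)]                 # result on the seed set alone
    for batch in chunk(unlabeled, set_size):
        predicted = [predict(model, doc) for doc in batch]
        corrected = [correct(doc, tags) for doc, tags in zip(batch, predicted)]
        training_set.extend(corrected)                # corpus expansion
        model = train(training_set)                   # retraining
        history.append(evaluate(model, test))         # same held-out test set each time
    return model, history


# Demo with stand-in callables: the "model" is just the training-set size.
model, history = self_train(
    seed=list(range(500)),
    unlabeled=list(range(360)),
    test=[],
    train=lambda data: len(data),
    predict=lambda model, doc: ["O"],
    correct=lambda doc, tags: (doc, tags),
    evaluate=lambda model, test: model,
)
print(len(history), history[0], history[-1])  # prints: 13 500 860
```

With 360 unlabeled documents and sets of 30, the loop runs 12 iterations, so the history holds one result for the seed model and one per added set.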

Experimental results and discussion
The performance of our model is measured using four metrics, namely accuracy, precision, recall, and F1 score [22]. The precision is the percentage of correctly identified entities out of the total identified entities, and the recall is the percentage of entities present in the dataset that are found by our model. F1 is the harmonic mean of precision and recall. Accuracy is the number of correct predictions divided by the total number of predictions.
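The four metrics can be computed from gold and predicted tag sequences as sketched below. The sketch assumes that precision, recall, and F1 are computed over exactly matching entity spans and that accuracy is computed per token; it is not the CoNLL evaluation script, and the function names, the INS label, and the example tags are illustrative assumptions.

```python
def spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO sequence."""
    found, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel closes the last span
        prefix, _, tag_label = tag.partition("-")
        if start is not None and not (prefix == "I" and tag_label == label):
            found.add((start, i - 1, label))
            start, label = None, None
        if prefix in ("B", "I") and start is None:
            start, label = i, tag_label
    return found


def evaluate(gold, predicted):
    """Entity-level precision, recall, F1 and token-level accuracy."""
    gold_spans, pred_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


gold = ["B-DEG", "O", "B-INS", "I-INS", "I-INS", "I-INS"]
pred = ["B-DEG", "O", "B-INS", "I-INS", "O", "O"]
scores = evaluate(gold, pred)
print(scores)
```

In this example four of six tokens are tagged correctly, but only one of the two entities is recovered in full, which illustrates why token-level accuracy can be considerably higher than the entity-level F1 score.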
The experimental result is evaluated based on the evaluation script for the Conference on Computational Natural Language Learning (CoNLL) 2003 shared task [33]. The initial (seed) training set consists of the education sections of 500 resumes, and the test set comprises the education sections of 50 resumes. The results obtained are given in Tables 4, 5, 6, and 7. Initially, the program counted 1574 tokens (words and punctuation). In Tables 4, 5, 6, and 7, the iteration number is equal to the number of unlabeled sets added to the seed set.
With every iteration, each set of unlabeled education sections is fed to the classification and correction module. As the results in Tables 4, 5, and 7 suggest, the increase in training corpus size leads to increased model performance on test data.
It is known that CRF performs better with structured data [5], and since resumes are semi-structured documents, we have experimented with the CRF layer, and the results are given in Table 6. The results obtained through our proposed model are given in Table 7. Our model excludes the CRF layer, and from Tables 6 and 7, we can infer that the proposed model gives more accurate results while being less complex.
To present the effectiveness of our proposed model, we ran the model without the correction module on 860 (500 + 360) manually annotated resumes. The performance of this supervised model (see Table 8) is similar to the performance achieved by our semi-supervised proposed model (see Table 7), which lacked annotated data. It solidifies our claim that our proposed model helps overcome the paucity of large annotated data.

Comparison with other approaches
Note that due to limited work on resumes for the NER task and the absence of a standard and large dataset (due to the proprietary nature of the data), a direct comparison between any two techniques presented in different studies may not be viable or fair. We have proposed a novel technique and draw a parallel by comparing it with the available literature on NER tasks for resumes. Our work focuses on the education section, since the degree and the institution from which the degree is obtained are available only in the education section of the resume. Given the factors mentioned above, it is difficult to claim our technique's superiority directly. In this section, we therefore provide a relative performance comparison rather than a direct one. The results compared with other approaches are purely based on the overall accuracy and precision mentioned therein.
The literature on resume parsing using a semi-supervised NER approach is limited. Zhang et al. [8] used resume document block analysis based on pattern matching, multi-level information identification, and feedback control algorithms and obtained an accuracy of 81% on complex items (like school name) using a larger dataset (2000 resumes for training and 3000 resumes for testing), all manually tagged. Ayishathahira et al. [1] made use of the Bi-LSTM-CNN model to arrive at an F1 score of 76 and 73 for qualification and institution names, respectively, with manual tags. Our model achieved an F1 score of 68.09 for institute names, an F1 score of 73.28 for institute and degree names together, and total accuracy of 92.06% with only 500 handcrafted resumes and 360 automatically tagged resumes. The cascaded hybrid model described by Yu et al. [36] gives an average F1 score of 73.20 for degrees, but for graduate school, the average F1 score is only 40.96, which is considerably lower than the proposed model. The graph-based semi-supervised approach proposed by Zafrani et al. [37] for NER can achieve an F1 score of 41.51 for organizations.
For a recruiter, it is essential to scrutinize and filter candidates based on their resumes. Since resumes contain much information about the candidate, a fast way of parsing the resume and obtaining relevant information is essential. This can be achieved by identifying degree and institute names, to gauge the educational qualifications. Previous work done in this field provides low precision and recall for institute and degree name recognition [34,37]. Our proposed work can provide improved performance in identifying these entities without the need for a large number of annotated resumes, which is generally required for the NER task [19,20,37]. The correction module substitutes for manual annotation by rectifying the initial predictions made by the model, assuming that those predictions are reasonably reliable. The experimental results are consistent with the above claim, and we see an increase in overall accuracy.
Due to privacy issues concerning the personal details present in resumes, they are not available in large quantities for research. Therefore, the peak accuracy could not be identified. Despite these constraints, the iterations performed indicate that accuracy increases as more resumes become available.
We have shown that, using our technique, one can achieve high performance on the NER task even when few annotated resumes are available, by making use of unannotated resumes.

Conclusion
In this paper, we presented a model for accurate identification of degrees and institute names in a resume based on NER. It is composed of Word Embeddings, CNN, and Bi-LSTM. A correction mechanism for rectification of the predictions made on the unlabeled data is incorporated. These corrected predictions are added to the training corpus. As the neural network processes more unlabeled data, the model accuracy on retraining increases. It achieves high performance without the need for extensive annotated data using a semi-supervised approach. An overall F1 score of 73.28 and an accuracy of 92.06% are obtained. Future work in this area includes extending this model for majors, specialization, and other components of an education section.
Acknowledgements We thank the anonymous reviewers whose insightful comments and suggestions have significantly improved this paper.
Funding Open access funding provided by Manipal Academy of Higher Education, Manipal.
Availability of data and material A private company owns the dataset used for this research work, and due to privacy issues, it cannot be made public.

Compliance with ethical standards
Conflicts of interest Authors do not have any conflict of interest or competing interest to declare.
Code availability Code is not available on any public server; however, it can be provided if the need arises.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.