New labeled dataset of interconnected lexical typos for automatic correction in bug reports
Large-scale and especially open-source projects use software triage systems such as Bugzilla to manage their users' requests, including bugs, suggestions, and requirements. These triage systems perform many tasks automatically, such as prioritizing bug reports, finding duplicates, and assigning reports to developers, which requires text mining, information retrieval, and natural language processing techniques. We have already shown that bug reports contain many typos, which reduce the performance of these artificial intelligence techniques, and that interconnected terms are one of the most common types of typos in bug reports. We also introduced algorithms to correct interconnected terms in earlier work, but no labeled dataset existed for evaluating the accuracy of the typo correction process. We have now built a new labeled dataset containing 42,970 typos out of 182,096 that can be used for the typo correction evaluation process. Connected typos make up 52% of the labeled dataset, which confirms the previous results about the number of connected typos. We then used the typo correction algorithms introduced in prior studies to evaluate their accuracy. The experimental results show 81.6% and 83.3% accuracy for the top-5 and top-10 suggestions of the typo correction list, respectively.
Keywords: Natural language processing · Typo correction · Interconnected lexical typo · Tree structure · Bug reports
Mathematics Subject Classification: 68T50 · 68T20 · 68U15 · 68P05 · 68P10 · 68P20 · 68P30 · 94A13 · 68Q25 · 68R15 · 68W10 · 68W32 · 68W40 · 05C05
JEL Classification: C88 · L86 · D81 · D83 · L17 · Z13
1 Introduction
Software triage systems such as Bugzilla usually receive bug reports online; triagers then process these reports to evaluate the importance and priority of each one, find duplicate reports based on their contents, assign reports to developers for bug checking, and plan future modifications to the project. Because of the large number and volume of bug reports, many researchers have tried to automate these processes since 2004 using artificial intelligence techniques and algorithms. Duplicate bug report detection is a major problem in this research area [1, 2]. Its algorithms and techniques, such as Term Frequency and Inverse Document Frequency from information retrieval, need to compare two bug reports word by word, so the lexical correctness of words and terms is essential for them. There are many typos in bug reports: more than 50% of bug reports contain typos, and more than 2.5% of bug reports consist of more than 50% typos. These typos distort the similarity detection process in duplicate detection. It is crucial to detect and correct these typos automatically because the Mozilla Firefox, Android, OpenOffice, and Eclipse datasets contain more than 1.5 million typos, with about 390,000 unique typos among them. A scientific semi-dictionary has been built for automatic typo detection in bug reports.
There are many types of typos in texts, such as added, removed, or substituted characters. Interconnected terms are a common typo in the software context because this context contains many method or class names made of joined terms, such as 'LinkedList' or 'connectToServer'. Sometimes these names are camel case, and sometimes users type them without any consistent casing. Users also sometimes forget to type a space between words, so software bug reports contain many interconnected terms. These interconnected terms must be separated; otherwise, information retrieval techniques such as term frequency cannot detect text similarities for the duplicate bug report detection problem. Some new algorithms were introduced in prior studies [6, 7] for correcting interconnected terms, but there is no standard labeled dataset for evaluating their accuracy. The primary purpose of this research is to build such a labeled dataset and to evaluate the accuracy of the interconnected term correction algorithms.
2 Literature review
Typo detection and correction is a long-standing issue in text mining and natural language processing [8, 9]. There have been many efforts at typo detection and correction in scientific contexts such as clinical records, using Shannon's noisy channel model to predict the next word from the previous word sequence. In some cases, such as web queries, little previous word-sequence context is available, so the web query log can be used as a baseline, and a maximum entropy model can help with rare queries to overcome the sparseness of the prior data.
Other researchers focus on correcting misspelled typos with various machine learning and natural language processing models, e.g., building a confusion matrix for the different types of misspellings (added, removed, transposed, or replaced characters), searching for these patterns in terms, and predicting the correction. String transduction, which tries to map one string to another, can also be used for misspelling correction. Machine learning has been applied at the character level for typo detection and correction, but its recall is low (about 30%).
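The edit patterns mentioned above (added, removed, transposed, or replaced characters) can be enumerated directly. The sketch below is a minimal, Norvig-style single-edit candidate generator, not the exact method of the cited works, and the dictionary passed in is a toy assumption:

```python
import string

def one_edit_candidates(word):
    """All strings reachable from `word` by one character edit."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]                      # removed char
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]  # swapped pair
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]      # replaced char
    inserts = [a + c + b for a, b in splits for c in letters]                # added char
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    """Return `word` if known, else any dictionary word one edit away."""
    if word in dictionary:
        return word
    hits = one_edit_candidates(word) & dictionary
    return min(hits) if hits else None
```

For example, `correct('reprot', {'report', 'bug'})` finds 'report' via the transposition pattern.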
Phonetic, language, and keyboard models can also be useful for predicting corrections with a decision tree as a machine learning technique [15, 16]. Another approach is to build a machine learning model that detects typos and predicts corrections according to context and domain knowledge [17, 18].
Other researchers focus on tree structures for typo correction. It is possible to build a tree from a probabilistic model of the relationships between the characters of words, describing which characters can follow a particular character or, in a more advanced mode, a sequence of characters. These models use Bayes' theorem to build a prediction model on a tree called a trie and use it for typo correction while the user is typing [19, 20]. The tree structure can also be used for grammar checking and translation by merging several grammatical trees in a trie. A simple trie (without probabilities) is used for spell checking as well. A deterministic acyclic finite automaton is a graph with a similar structure that can be used for spell checking and typo correction. There are also methods for querying a trie with wildcard characters. A trie-based index structure can support real-time interaction such as search recommendation and query completion.
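As a concrete illustration of the trie structure these methods rely on, here is a minimal sketch with insertion and prefix completion; it uses a toy word list and omits the probabilities of the models in [19, 20]:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def insert(root, word):
    """Add `word` to the trie rooted at `root`."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def completions(root, prefix):
    """All stored words starting with `prefix`, in lexicographic order."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def dfs(n, path):
        if n.is_word:
            out.append(prefix + path)
        for ch, child in sorted(n.children.items()):
            dfs(child, path + ch)
    dfs(node, "")
    return out
```

With the words 'link', 'linked', and 'list' inserted, `completions(root, "lin")` returns `['link', 'linked']`, which is the operation query-completion systems build on.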
The interconnected term problem has not been prominent in other contexts, and there is no specific method for correcting interconnected terms. In our tests, Google Translate and Microsoft Word could detect two-part interconnected terms and recommend a correction for them, but when a term contains more than two meaningful parts, they could not identify it or suggest any correction. This shows that even large companies have not yet addressed this problem in general-purpose settings.
A divide-and-conquer algorithm based on the longest common sequence algorithm can be used to find the meaningful terms inside an interconnected term. It is a simple brute-force algorithm that considers all combinations of start and end indices of substrings of an interconnected term to find meaningful terms. Checking meaningfulness requires a dictionary; a trustworthy dictionary for the computer context has been built and can be used for this purpose too.
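The brute-force idea can be sketched as a recursive enumeration of all dictionary-word segmentations; the toy dictionary below is an assumption for illustration, whereas the actual algorithms use the scientific dictionary:

```python
def all_splits(term, dictionary):
    """Return every way to split `term` into dictionary words."""
    if term == "":
        return [[]]  # one valid split of the empty string: no words
    results = []
    for end in range(1, len(term) + 1):
        head = term[:end]
        if head in dictionary:
            # recurse on the remainder after each meaningful prefix
            for rest in all_splits(term[end:], dictionary):
                results.append([head] + rest)
    return results
```

For 'linkedlist' with the toy dictionary `{'link', 'ed', 'linked', 'list'}` this yields both `['link', 'ed', 'list']` and `['linked', 'list']`; a metric such as the average word length then ranks the candidates.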
3 Making the labeled dataset
The primary purpose of this study is to evaluate the accuracy of the algorithms used to correct interconnected typos. The prior dataset has no labels; in other words, the correction of each typo was not given in the dataset. We therefore selected typos randomly, divided them into separate files of 1,000 items each, and asked computer engineering students to determine the correction of each typo manually. It was a time-consuming process. All the corrections were then gathered and combined. The labeled dataset now contains 42,970 typos.
4 Experimental results
Evaluation results of the correction of interconnected typos
There are 13,820 true predictions out of 16,932 interconnected typos (I.T.s) in the top-5 suggestions and 14,120 in the top-10 suggestions. So, the accuracy of the predictor algorithms is 81.62% and 83.39% for the top-5 and top-10 lists, respectively. Interestingly, the accuracy of the top-1 suggestion is 66.69%, which is considerable.
The last row of Table 1 shows that 9,004 typos have one space in their correct form, in other words, two meaningful words; the others consist of more than two connected words and make up about 46.8% of interconnected terms. This shows that the efforts of this study were worthwhile and that it is important to pursue this line of research further, especially for software triage systems and similar systems such as FAQ forums, e.g., Stack Overflow.
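The reported percentages follow directly from the counts given in the text; a quick sanity check:

```python
total_its = 16932            # interconnected typos evaluated
top5, top10 = 13820, 14120   # true predictions in the top-5 and top-10 lists
two_word = 9004              # typos whose correct form has exactly one space

top5_acc = round(100 * top5 / total_its, 2)                  # 81.62
top10_acc = round(100 * top10 / total_its, 2)                # 83.39
multi_share = round(100 * (total_its - two_word) / total_its, 1)  # 46.8
```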
Sometimes there are mistakes in the dataset that the human correctors did not notice. The good results show that these mistakes are few, but they are not zero. For example, the dataset contains some incomprehensible terms, such as 'wszelkie', that should not have been classified as interconnected terms, but a human corrector labeled them as such. There are many other examples where the predictor algorithms suggested the accurate correction but the human correctors labeled it wrongly.
Sometimes the human correctors of the labeled dataset treated a term together with a prefix or suffix as a single term, such as the plural 's' in 'students' or the 'un' prefix in 'unregister'. The primary reference of the predictor algorithms is the scientific dictionary, which may not contain 'unregister' as a term. So, after prediction, the stems of both the correct form and the predicted combination should be checked as well; sometimes even this is impossible, e.g., 'isenabling', which misleads the predictor because the term 'enabling' is not in the dictionary.
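One way to make the comparison lenient to such affixes is to compare stems. The sketch below is crude and illustrative only: the affix lists are assumptions, and a real implementation would use a proper stemmer such as Porter's:

```python
SUFFIXES = ("ing", "ed", "s")   # assumed affix lists for illustration only
PREFIXES = ("un",)

def strip_affixes(word):
    # remove at most one known prefix and one known suffix,
    # keeping at least three characters of stem
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) > 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) > 2:
            word = word[:-len(s)]
            break
    return word

def lenient_match(predicted_word, labeled_word):
    # count a prediction as correct if it matches the label up to affixes
    return strip_affixes(predicted_word) == strip_affixes(labeled_word)
```

With this check, a predicted 'register' matches the labeled 'unregister' and 'student' matches 'students', so such affix disagreements no longer count as prediction errors.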
The selected scientific dictionary contains many abbreviations and similar words, such as file extensions, which can misdirect the predictor algorithms. For example, consider 'xredline' as an I.T. The algorithms predict 'xre dli ne' as the first combination because 'xre' and 'dli' are meaningful terms (abbreviations) in the selected dictionary.
Sometimes a new term is missing from the scientific dictionary and leads the predictor algorithms astray. For example, the term 'slideshow' was not in the dictionary, so the predictor selected 'slide show', which did not match the human corrector's selection.
The average word length (AWL) is not a useful metric in all cases. For example, the AWLs of 'x red line' and 'xre dli ne' are the same, 8/3, but the lengths of the individual words of the two combinations differ. Sometimes a combination containing the longest single word is more acceptable. It would be better to introduce new metrics to cover these situations as well.
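The tie on this example can be checked directly; the longest-word tie-breaker below is only a suggested alternative, not a metric from the prior studies:

```python
def awl(words):
    # average word length of a candidate combination
    return sum(len(w) for w in words) / len(words)

a = ["x", "red", "line"]   # intended correction
b = ["xre", "dli", "ne"]   # dictionary-abbreviation artifact

assert awl(a) == awl(b) == 8 / 3  # AWL alone cannot distinguish them

# a candidate tie-breaker: prefer the combination with the longest single word
longest = lambda ws: max(len(w) for w in ws)
assert longest(a) == 4 and longest(b) == 3  # 'line' beats 'xre'
```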
The prior predictor algorithms [6, 7] are heuristic and fast but do not search the whole space. They try to find the best combination by choosing, from left to right, the term that yields the highest AWL at the beginning of the combination, then the second term, and so on. In some cases, however, the term that should be chosen first (the one contributing the most to the AWL) lies in the middle or at the end of the combination. For example, the predictor checks 'mypassword' and returns 'myp ass word', which is incorrect; if the predictor had first selected the longest meaningful term ('password') and then chosen 'my', the AWL would have been higher than that of the selected combination.
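This failure mode can be reproduced with a toy dictionary. The entries, including the abbreviation-like 'myp' and 'ass', are assumptions for illustration, and the greedy function is one plausible reading of the left-to-right heuristic, not the exact algorithm of [6, 7]:

```python
# toy dictionary; 'myp' and 'ass' stand in for abbreviation-like entries
DICTIONARY = {"my", "myp", "ass", "word", "password"}

def greedy_split(term):
    """Left to right: always take the longest dictionary prefix."""
    parts, i = [], 0
    while i < len(term):
        for j in range(len(term), i, -1):
            if term[i:j] in DICTIONARY:
                parts.append(term[i:j])
                i = j
                break
        else:
            return None  # no meaningful split found
    return parts

def longest_term_first_split(term):
    """Take the longest dictionary substring anywhere, then recurse on the rest."""
    if term == "":
        return []
    for length in range(len(term), 0, -1):
        for start in range(len(term) - length + 1):
            if term[start:start + length] in DICTIONARY:
                left = longest_term_first_split(term[:start])
                right = longest_term_first_split(term[start + length:])
                if left is not None and right is not None:
                    return left + [term[start:start + length]] + right
    return None
```

Here `greedy_split("mypassword")` gives `['myp', 'ass', 'word']` (AWL 10/3), while `longest_term_first_split("mypassword")` gives `['my', 'password']` (AWL 5), matching the example in the text.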
5 Conclusion
This study introduces a new labeled dataset for interconnected typo (I.T.) correction and supplements prior studies. The new dataset is used to evaluate the accuracy of previous algorithms. The experimental results show that more than 46% of interconnected terms contain more than two meaningful words. Also, the accuracy of I.T. correction was more than 81%, which is acceptable but can be improved in the future. It should be noted that the runtime of these algorithms is very low (less than 1 s) and their memory usage is low too (on the order of the size of the dictionary) [6, 7]. So, the Neural Match Tree-based algorithm is a good choice for the prediction and correction of interconnected terms.
In future work, the extraction of meaningful combinations can be improved in many ways to find the best combination among the alternatives, also taking the surrounding context into account. Other metrics can be introduced for this purpose instead of the average word length used in the state of the art. The algorithm for finding meaningful combinations can be improved as well.
Acknowledgements
It is my duty to thank my dear students from the University of Kashan, Islamic Azad University of Isfahan (Khorasgan), and Shahid Ashrafi Esfahani University, who helped us make the new labeled dataset.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
- 1. Soleimani Neysiani B, Babamir SM (2019) Improving performance of automatic duplicate bug reports detection using longest common sequence. In: IEEE 5th international conference on knowledge-based engineering and innovation (KBEI), Tehran, Iran
- 2. Soleimani Neysiani B, Babamir SM (2019) New methodology of contextual features usage in duplicate bug reports detection. In: IEEE 5th international conference on web research (ICWR), Tehran, Iran
- 3. Soleimani Neysiani B, Babamir SM (2016) Methods of feature extraction for detecting the duplicate bug reports in software triage systems. Paper presented at the international conference on information technology, communications and telecommunications (IRICT), Tehran, Iran
- 4. Soleimani Neysiani B, Babamir SM (2018) Automatic typos detection in bug reports. Paper presented at the IEEE 12th international conference on application of information and communication technologies, Kazakhstan
- 5. Alipour A, Hindle A, Rutgers T, Dawson R, Timbers F, Aggarwal K (2013) Bug reports dataset. https://github.com/kaggarwal/Dedup. Accessed 25 Feb 2019
- 6. Soleimani Neysiani B, Babamir SM (2019) Automatic interconnected lexical typo correction in bug reports of software triage systems. Paper presented at the international conference on contemporary issues in data science, Zanjan, Iran
- 7. Soleimani Neysiani B, Babamir SM (2019) Fast language-independent correction of interconnected typos to finding longest terms. Paper presented at the 24th international conference on information technology (IVUS), Lithuania
- 8. Zhuang L, Jing F, Zhu X-Y (2006) Movie review mining and summarization. In: Proceedings of the 15th ACM international conference on information and knowledge management. ACM, pp 43–50
- 11. Chen Q, Li M, Zhou M (2007) Improving query spelling correction using web search results. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)
- 12. Noaman HM, Sarhan SS, Rashwan M (2016) Automatic Arabic spelling errors detection and correction based on confusion matrix-noisy channel hybrid system. Egypt Comput Sci J 40(2):54–64
- 13. Ribeiro J, Narayan S, Cohen SB, Carreras X (2018) Local string transduction as sequence labeling. In: Proceedings of the 27th international conference on computational linguistics, pp 1360–1371
- 14. Korpusik M, Collins Z, Glass J (2017) Character-based embedding models and reranking strategies for understanding natural language meal descriptions. In: Proc Interspeech, pp 3320–3324
- 15. Almeida GAM (2016) Using phonetic knowledge in tools and resources for natural language processing and pronunciation evaluation. Universidade de São Paulo
- 16. de Mendonça Almeida GA, Avanço L, Duran MS, Fonseca ER, Nunes MGV, Aluísio SM (2016) Evaluating phonetic spellers for user-generated content in Brazilian Portuguese. In: International conference on computational processing of the Portuguese language. Springer, pp 361–373
- 18. Huang Y, Murphey YL, Ge Y (2013) Automotive diagnosis typo correction using domain knowledge and machine learning. In: IEEE symposium on computational intelligence and data mining (CIDM). IEEE, pp 267–274
- 19. Duan H, Hsu B-JP (2011) Online spelling correction for query completion. In: Proceedings of the 20th international conference on World Wide Web. ACM, pp 117–126
- 20. Hsu B-J, Wang K, Duan H (2012) Online spelling correction/phrase completion system. Google Patents
- 21. Oflazer K (1996) Error-tolerant tree matching. In: Proceedings of the 16th conference on computational linguistics, vol 2. Association for Computational Linguistics, pp 860–864
- 23. Deorowicz S, Ciura MG (2005) Correcting spelling errors by modeling their causes. Int J Appl Math Comput Sci 15:275–285
- 24. Ito N (1997) Character-string retrieval system and method. Google Patents
- 25. Fafalios P, Tzitzikas Y (2015) Type-ahead exploratory search through typo and word order tolerant autocompletion. J Web Eng 14(1&2):80–116