1 Introduction

There are many existing English language proficiency tests, such as the Test of English as a Foreign Language (TOEFL), the Test of English for International Communication (TOEIC), and the International English Language Testing System (IELTS), which aim to measure a person's ability to use English for later academic or professional needs. In particular, IELTS is designed to assess whether participants are ready to study or train in further or higher education at a university. Constructing IELTS test questions involves six stages: commissioning, pre-editing, editing, pretesting, standards fixing, and test construction and grading. Writing IELTS questions naturally requires experts, money, and a great deal of time; according to Cotton [9], devising the set of questions takes 50% of the total production time. The IELTS test has four sections: listening, reading, writing, and speaking. The reading comprehension test includes several question types: multiple choice, identifying information, identifying the writer's views/claims, matching information, matching headings, matching features, matching sentence endings, sentence completion, summary completion, note completion, table completion, flow-chart completion, diagram label completion, and short-answer questions. This research therefore focuses on developing an application that automatically generates short-answer questions for the reading comprehension section. The flow of the system's computational model was adopted from previous research, namely a model for generating What, Where, When, Who, Which, and How (5W + 1H) questions [2].

The short-answer question type in IELTS asks the test taker to answer questions about factual details contained in the text. Test takers must write their answers as words and numbers on the answer sheet provided, using words from the reading text. In addition, there are usually instructions on the allowed answer length, for example "No More Than Three Words and/or A Number from the passage", "One Word Only", or "No More Than Two Words". A participant who writes an answer longer than instructed loses the points for that question. Numbers can be written as numerals or words, and hyphenated words count as single words. The questions appear in the same order as the information in the text [25]. This question type assesses the test taker's ability to locate and understand the right information in the text.

Systems that automatically generate questions have been widely studied. For example, Mazidi and Tarau [29] discuss automatic question generation using the DeconStructure algorithm, dependency and SRL parses, the TextRank algorithm, and internal NLU analysis. Kumar et al. [22] used a Part-of-Speech (POS) tagger and a Support Vector Machine (SVM) to generate fill-in-the-blank questions, with tools including Amazon Mechanical Turk, the "scikit-learn" Python package, a Radial Basis Function (RBF) kernel, Word2Vec, and WordNet; the result of their research is a system called "RevUP". Further research on automatic question generation uses semantic pattern recognition to create questions of different depths and types [28]; it applies negation detection and linguistic considerations, with the "SENNA" software, Python, and WordNet as tools. Its results show a 44% reduction in error rate relative to the previous best system, the top average across all metrics, and a 61% reduction in error rate on grammaticality assessments.

This study seeks to produce an automatic question generation system that creates short-answer reading comprehension questions using NLP and the K-Nearest Neighbors (KNN) algorithm. The NLP methods process the text data, while KNN, a machine learning method, selects the best questions based on training data (i.e., questions that have previously appeared in IELTS tests). The machine learning step is intended to keep the quality of the generated questions close to that of previously published questions. The main contributions of this research are as follows: (i) by applying a machine learning method, we attempt to maintain question quality using a dataset of historical questions; (ii) questions can be generated automatically and quickly, because the input is an article and users only need to specify the number of questions; and (iii) the model works on Part-of-Speech (POS) tags instead of words, so it can be applied to any word available in the dictionary.

2 Research method

2.1 Computational model

Figure 1 shows the flow model of the system built in this study; this flow was adopted from research conducted by Ali [2]. The system can scrape article content from a website given the URL of the web page. Scraping uses the Python library "newspaper3k", which can retrieve an article's content, author, and publication date. Before the article text is extracted into candidate questions, it must first be preprocessed. A detailed explanation of this computational model follows.

Fig. 1 Automatic question generation model flow

2.2 Data collection

This stage collects the articles that will be extracted into questions suitable for use as short-answer questions in IELTS reading comprehension. Figure 2 shows an example article taken from the BBC news website, about the coronavirus situation in India. The library used at this stage is "newspaper3k", which retrieves article content from links entered by users.
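As an illustration, the following is a minimal sketch of this scraping step with "newspaper3k"; the URL is a hypothetical placeholder, not taken from the paper.

```python
# Minimal sketch of the article-scraping stage using newspaper3k.
from newspaper import Article

url = "https://www.bbc.com/news/world-asia-india-00000000"  # hypothetical URL
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.authors)        # author(s), as mentioned above
print(article.publish_date)   # publication date
text = article.text           # article body passed to preprocessing
```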

Fig. 2 Examples of news articles taken from the website

2.3 Data preprocessing

After obtaining the article to be converted into questions, the next step is to separate it into sentences. A sentence is kept only if it begins with a capital letter and ends with a period; otherwise it is not passed to the next step. Figure 3 shows an example of separating sentences from a paragraph.
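The paper does not give the exact splitting implementation, so the following is a minimal sketch of the stated rule (keep a sentence only if it begins with a capital letter and ends with a period), assuming a naive regex-based split.

```python
import re

def split_sentences(paragraph):
    """Split on sentence-ending punctuation, then keep only sentences
    that start with a capital letter and end with a period."""
    candidates = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [s for s in candidates if re.fullmatch(r'[A-Z].*\.', s, re.DOTALL)]

print(split_sentences("India is in crisis. cases keep surging! Hospitals are full."))
# -> ['India is in crisis.', 'Hospitals are full.']
```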

Fig. 3 Breaking paragraphs into sentences

The next step after splitting sentences is to preprocess each sentence generated from the paragraph. This preprocessing removes special characters and symbols so that the sentences can be processed at the next stage. The removal uses regular expressions ("regex"), and an example of the preprocessing results can be seen in Fig. 4.
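A minimal sketch of this cleaning step follows; the exact character set kept is an assumption, since the paper states only that characters and symbols are removed with regex.

```python
import re

def clean_sentence(sentence):
    """Drop everything except letters, digits, whitespace,
    and basic sentence punctuation (an assumed whitelist)."""
    return re.sub(r"[^A-Za-z0-9\s.,']", "", sentence)

print(clean_sentence("Cases rose by ~350,000 (a record) today."))
# -> 'Cases rose by 350,000 a record today.'
```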

Fig. 4 Removing characters and symbols

2.4 Converting to tree structure

At this stage, the sentences cleaned of characters and symbols in the previous stage are converted into a tree structure; the cleaning is necessary because such characters and symbols cannot be converted into POS tags. The library used for this conversion is "Stanford Core NLP" via "NLTK". The resulting trees are used in the next stage to extract simple sentences from complex sentences. An example of the conversion into a tree structure can be seen in Fig. 5. Because the conversion produces a tree data type, a further conversion from tree to string is required.
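Below is a sketch of this conversion using NLTK's CoreNLP wrapper; it assumes a Stanford CoreNLP server is already running locally on port 9000.

```python
# Constituency parsing via NLTK's CoreNLP wrapper; assumes a CoreNLP
# server was started separately, e.g. on http://localhost:9000.
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
tree = next(parser.raw_parse('Husam cooks the rice at 4 pm.'))

tree.pretty_print()          # inspect the tree structure
tree_as_string = str(tree)   # the string form required by later stages
```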

Fig. 5 Converting to tree structure

2.5 Elementary sentence extracted system (ESES)

The Elementary Sentence Extracted System (ESES) extracts elementary (simple) sentences from complex sentences. Because question generation requires simple sentences, ESES is very useful for extracting them. Figure 6 shows the flow of the computational model for simple sentence extraction, which builds four arrays: noun phrases (NP), verb phrases (VP), the depth of each word in the sentence, and the word order in the sentence.

Fig. 6 ESES flow model

Each NP and VP is then combined by considering the depth and order of each phrase. A simple sentence has exactly one subject, no less and no more, while a sentence with more than one subject is considered complex. This requirement follows the research conducted by Kalady et al. [20]. Figure 7 shows an example of extracting simple sentences from a complex sentence.
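The following is an illustrative sketch of the ESES bookkeeping described above, collecting NP and VP subtrees together with their depth and left-to-right order; the actual recombination rules are simplified away here.

```python
from nltk.tree import Tree

def collect_phrases(tree):
    """Return (phrase text, depth, order) records for NP and VP subtrees."""
    nps, vps = [], []
    for order, pos in enumerate(tree.treepositions()):
        subtree = tree[pos]
        if isinstance(subtree, Tree) and subtree.label() in ('NP', 'VP'):
            record = (' '.join(subtree.leaves()), len(pos), order)
            (nps if subtree.label() == 'NP' else vps).append(record)
    return nps, vps

tree = Tree.fromstring(
    '(S (NP (NNP Husam)) (VP (VBZ cooks) (NP (DT the) (NN rice))))')
print(collect_phrases(tree))
# NPs: ('Husam', depth 1, ...), ('the rice', depth 2, ...); VP: ('cooks the rice', depth 1, ...)
```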

Fig. 7 Extraction of complex sentences into simple sentences

2.6 Classification

After obtaining simple sentences from the previous stage, the next stage determines the Fine Class and Coarse Class. As seen in Fig. 8, the Fine Class assigns an initial class to each word in the sentence, while the Coarse Class regroups the Fine Class into five predetermined classes: Human, Location, Entity, Time, and Count. These five classes form the sentence rules, such as "Human verb Human", according to the coarse classes produced from the sentence.
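As an illustration, a toy version of this regrouping follows; the fine-class inventory and the mapping table are assumptions, since the paper does not list them in full.

```python
# Hypothetical fine-to-coarse mapping; only the five coarse classes
# (Human, Location, Entity, Time, Count) come from the paper.
FINE_TO_COARSE = {
    'person': 'Human', 'group': 'Human',
    'city': 'Location', 'country': 'Location',
    'date': 'Time', 'hour': 'Time',
    'number': 'Count', 'percentage': 'Count',
}

def coarse_class(fine):
    return FINE_TO_COARSE.get(fine, 'Entity')   # Entity as the default bucket

# Sentence rule for "Husam cooks the rice at 4 pm":
print([coarse_class(f) for f in ('person', 'food', 'hour')])
# -> ['Human', 'Entity', 'Time']
```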

Fig. 8 Classification into fine and coarse classes

2.7 Determining and generating question types

The sentence rules from the previous stage are used at this stage to determine the types of questions that can be generated; this stage also produces the question sentence and the corresponding answer. For example, in Fig. 9 the sentence rules are extracted from simple sentences, and the question types are then obtained by looking at the subject, object, and preposition. Because the sentence rule contains "Time", one of the question types that can be generated is "when"; Fig. 9 shows which question types can be generated from a given sentence rule.
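A minimal sketch of this mapping follows; the rule table is illustrative rather than the paper's exact rule set.

```python
# Illustrative mapping from coarse classes in a sentence rule to
# candidate question types.
RULE_TO_QTYPES = {
    'Human': ['who'],
    'Location': ['where'],
    'Time': ['when'],
    'Count': ['how many'],
    'Entity': ['what'],
}

def question_types(rule):
    """rule: coarse-class sequence, e.g. ['Human', 'Entity', 'Time']."""
    types = []
    for coarse in rule:
        types.extend(RULE_TO_QTYPES.get(coarse, []))
    return types

print(question_types(['Human', 'Entity', 'Time']))
# -> ['who', 'what', 'when']   ('when' because the rule contains Time)
```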

Fig. 9 Getting the question types

After determining which question types can be generated from a sentence, the next step is to compose a question sentence for each type. Each question type produces one question, except for "whom" questions, which arise when the sentence contains more than one coarse class of type "Human". For example, as seen in Fig. 10, the question type "when" produces the question sentence "When Husam cooks the rice?", which asks for the time (the "Time" element of the sentence rule), with the answer "at 4 pm".

Fig. 10 Generating a question from a question type

2.8 Eliminating pronouns

The resulting question sentences must still be filtered by removing questions and answers that contain pronouns, because a question or answer with a pronoun is ambiguous to answer. Therefore, at this stage we remove candidate questions that have a pronoun in the question sentence or in the answer. For example, in Fig. 11 the question sentence contains the pronoun reference "The country", so the question is deleted and not processed further.
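A sketch of the tag-based part of this filter follows, using NLTK's tagger for illustration; catching definite anaphors such as "The country" would need an extra rule beyond the PRP/PRP$ test shown here.

```python
# Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' models.
import nltk

def has_pronoun(text):
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    return any(tag in ('PRP', 'PRP$') for _, tag in tags)

candidates = [
    {'question': 'When did it record the highest cases?', 'answer': 'on Friday'},
    {'question': 'When did India record the highest cases?', 'answer': 'on Friday'},
]
kept = [c for c in candidates
        if not has_pronoun(c['question']) and not has_pronoun(c['answer'])]
print(len(kept))  # -> 1: the candidate containing "it" is discarded
```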

Fig. 11 Examples of question sentences that have a pronoun

2.9 POS Tagging on question candidates

The next step is to convert each candidate question into POS tags and determine the number of words in the sentence. The POS tag sequence and the word count are used to assess the feasibility of the candidate questions generated by the system. The steps of POS tagging and word counting are shown in Figs. 12 and 13: each question sentence is preprocessed to remove characters and symbols, then converted to POS tags, and finally the word count is obtained by counting the POS tags. The library used at this stage is "Stanford Core NLP", which converts sentences into trees.
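For illustration, the following sketch derives the POS-tag sequence and the tag count with NLTK's off-the-shelf tagger instead of a running CoreNLP server; the example question is hypothetical.

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' models

question = 'When did India record the highest daily cases'  # already cleaned
tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(question))]
print(tags)       # e.g. ['WRB', 'VBD', 'NNP', 'VB', 'DT', 'JJS', 'JJ', 'NNS']
print(len(tags))  # the word-count feature, obtained by counting POS tags
```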

Fig. 12 Results of preprocessing the training data, POS tag conversion, and word count

Fig. 13 Preprocessed candidate question results, POS tag conversion, and word count

2.10 Calculating the distance between training and testing data

The POS tag sequence and word count obtained in the previous stage are used here to measure the feasibility of each candidate question. Before applying the KNN formula, each POS tag is first converted to a numeric value. The first step is to map each tag to a number; the numbers for each tag can be seen in Table 1.

Table 1 Initialize tag values

After assigning a value to each tag, the next step is to determine the value of S, given a target range of 0 to 100 and 36 tags. The calculation in Eq. 1 gives S = 2.86. With S obtained, the next step is to calculate V, the numeric value of each tag. As seen in Eq. 2, V equals S multiplied by the tag value P, plus the offset O = −1.

$$S= \frac{100}{36-1}=2.86$$
(1)
$$O=-1,\quad P=tag\;value,\quad V=S\cdot P+O;\qquad tag=VB\Rightarrow P=27,\;\; V=2.86\cdot 27+(-1)=76.22$$
(2)

Once the numeric values of the POS tags are available, the distance between the test data and the training data can be calculated. Figure 14 shows an example of this calculation; the result is 91.96, where a smaller value means the test data is more similar to the training data. Because the result is 91.96, the test question can be considered not similar to the training question.
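A worked sketch of the conversion and the distance computation follows. Only VB = 27 is given explicitly in Eq. 2; the other tag indices stand in for Table 1, and the use of Euclidean distance over the aligned tag positions is an assumption.

```python
import math

TAG_INDEX = {'VB': 27, 'NNP': 14, 'DT': 3, 'NN': 12}  # VB from Eq. 2; rest are placeholders

S = round(100 / (36 - 1), 2)   # Eq. (1): S = 2.86
O = -1                         # offset

def numeric(tags):
    """Eq. (2): V = S * P + O for each tag value P."""
    return [S * TAG_INDEX[t] + O for t in tags]

def distance(test_tags, train_tags):
    # Euclidean distance; zip truncates to the shorter sequence, since
    # the paper does not specify how unequal lengths are handled.
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(numeric(test_tags), numeric(train_tags))))

print(round(S * 27 + O, 2))   # -> 76.22, matching Eq. (2) for VB
print(distance(['NNP', 'VB', 'DT', 'NN'], ['DT', 'NN', 'VB', 'NNP']))
```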

Fig. 14 Example of the distance between the training data and the test data

2.11 Sorting question candidates

After obtaining the list of candidate questions filtered of pronouns, and the distance of each candidate from the training data, the next step is to sort the candidates by that distance, producing a list ordered by proximity to the training data, as in the sketch below.
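A minimal sorting sketch, assuming each candidate already carries a single precomputed distance value:

```python
candidates = [('When did India record the highest cases?', 91.96),
              ('Who reported the new cases?', 42.10)]   # hypothetical values
ranked = sorted(candidates, key=lambda c: c[1])         # ascending distance
print(ranked[0][0])   # the candidate most similar to the training data
```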

2.12 Replacing words with synonyms

Next, certain words are replaced with their synonyms. The words replaced are those tagged as adjectives, adverbs, and verbs. This substitution is intended to make the generated questions more difficult than the original sentences, whose words are taken directly from the article text; the resulting candidates therefore have a higher level of difficulty than before. For example, Fig. 15 shows the word "now" changed to its synonym "today", and the word "tested" changed to "proved".
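A sketch of this substitution using WordNet via NLTK follows; taking the first differing lemma is a simplification of whatever selection rule the system actually applies.

```python
from nltk.corpus import wordnet as wn  # requires the 'wordnet' corpus

POS_MAP = {'JJ': wn.ADJ, 'RB': wn.ADV, 'VB': wn.VERB}  # adjective, adverb, verb

def synonym(word, tag):
    pos = POS_MAP.get(tag[:2])
    if pos is None:                     # only JJ*/RB*/VB* tags are replaced
        return word
    for syn in wn.synsets(word, pos=pos):
        for lemma in syn.lemma_names():
            candidate = lemma.replace('_', ' ')
            if candidate.lower() != word.lower():
                return candidate        # first differing lemma found
    return word

print(synonym('now', 'RB'), synonym('tested', 'VBD'))
```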

Fig. 15 An example of converting certain words to their synonyms

2.13 Refining grammar

This stage checks and corrects the grammar of the candidate question sentences. For the check we use a Python library, namely LanguageTool. For example, Fig. 16 shows the result of checking with the LanguageTool library: "ebooks" is changed to "e-books".
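A sketch of this step follows, using "language_tool_python", one Python wrapper for LanguageTool (the paper names only "the language tool"); the wrapper downloads LanguageTool itself on first use.

```python
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')
question = 'When ebooks was first tested?'
matches = tool.check(question)       # list of detected grammar issues
corrected = tool.correct(question)   # question with suggestions applied
print(len(matches), corrected)
```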

Fig. 16 Example of grammar checking and improvement

2.14 Deleting duplication

This stage selects among candidate questions that come from the same sentence and discards the duplicates. An example of the final generated questions can be seen in Fig. 17.
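A minimal sketch of this selection, assuming the best-ranked (smallest-distance) question is kept for each source sentence:

```python
def deduplicate(candidates):
    """candidates: list of (source_sentence, question, distance) tuples."""
    best = {}
    for source, question, dist in candidates:
        if source not in best or dist < best[source][1]:
            best[source] = (question, dist)
    return [question for question, _ in best.values()]
```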

Fig. 17 Example of paragraphs and system generated questions

Evaluation methods

The following methods are used to evaluate the results:

Grammar Checker: this analysis identifies grammatical errors in the question sentences. We use a web application to check grammar, namely https://www.reverso.net/spell-checker/english-spelling-grammar/.

Expert Judgement: the next stage is an evaluation of question feasibility by human experts, who assess the generated questions against the following criteria:

  • Grammatical Correctness (GC): determines whether the question sentences generated by the system are syntactically well-formed. We distinguish three levels according to the number of grammatical errors:

    • Best: the generated question has no grammatical errors.

    • Good: the generated question has one or two grammatical errors.

    • Worst: the generated question has three or more grammatical errors.

  • Answer Existence (AE): determines whether the answer to the question is appropriate, i.e., whether the question can be answered correctly by the answer key.

  • Difficulty Index (DI): rates how difficult the questions generated by the system are, on three levels: easy, medium, and hard.

Next, we calculate the percentage value of the expert assessment using the formula shown in Eq. 3.

$$Percentage=\frac{Total\;score\;of\;the\;evaluation}{Maximum\;score}\times 100\%$$
(3)

3 Experimental design

In this study, the data used are news articles from which questions are generated; these become the test data, or candidate questions. The other data are previously published IELTS questions, used as training data. The training data were drawn from 49 books and 3 websites, for a total of 1010 questions. Examples of the e-books used as sources of IELTS training questions include:

  • Cambridge Practice Tests for IELTS 2 [6]

  • Cambridge Practice Tests for IELTS 1 [31]

  • Cambridge IELTS 3 With Answer Edition [7]

  • IELTS Reading Tests [30]

  • Collins Reading for IELTS [41]

  • Insight into IELTS Update Edition [18]

  • Barron's IELTS [26]

  • Headway Academic Skills: Level 1 Student Book [15]

  • Master IELTS 6: IELTS Precise Reading [35]

  • High Impact IELTS Academic Module [5]

The articles used as sources of candidate questions were obtained from several reputable websites, namely the BBC, CNN, The Jakarta Post, and The New York Times. The topics of the articles differ, covering health, hoaxes, holidays, the environment, and sports (see Table 2).

Table 2 Article sources

4 Results and discussion

The results of this experiment are the questions and answers generated by the developed system. Table 3 shows the candidate questions and answers generated in this experiment, from which some example questions are taken. Grammar analysis is used to identify grammatical errors in the question sentences, i.e., to determine whether the generated questions conform to grammar rules. The checker reports errors, misspellings, uncertainties, and undefined words. Of the 21 questions, the grammar check found 5 questions with grammatical errors, giving a correct-grammar percentage of 76.19%.

Table 3 Generate question experiment results

Next, the feasibility of the questions was evaluated by two human experts, to ensure the resulting questions have a high level of feasibility. Table 4 shows the results of the expert evaluation. The first expert gave the grammatical correctness parameter a score of 29, a percentage of 46.03%; the answer existence parameter a score of 39, or 92.86%; and the difficulty index parameter a score of 21, or 33.33%. The second expert gave the GC parameter a total of 46, or 73.02%; the AE parameter a total of 41, or 97.62%; and the DI parameter a total of 23, or 36.51% (see Table 4).

Table 4 Results of evaluation by expert

After calculating the percentages per text and per expert, the next step is to combine the evaluations of the two experts to obtain the final percentages and averages. The results are shown in Table 5. The ideal value and maximum score of each parameter differ: grammatical correctness has an ideal value of 3, a maximum combined score of 126 (63 per expert), and a combined total score of 75; answer existence has an ideal value of 2, a maximum combined score of 84, and a combined total score of 80; and the difficulty index has an ideal value of 3, a maximum combined score of 126, and a combined total score of 44.

Table 5 Calculation results for each parameter

Furthermore, this study compares its results with existing research, as shown in Table 6. In the research conducted by Heilman and Eskenazi [16], the questions generated by the system achieved a precision of 80%; that work used a thesaurus extraction technique, was also evaluated by experts, and produced multiple-choice questions. In contrast, the research by Yuan et al. [42] produced the same question type as this study, namely short-answer questions, but evaluated differently: their system was compared with an existing system from previous work, "Seq2Seq", using the fluency (perplexity, PPL) score. Their system scored 175.7, while the "Seq2Seq" system scored 153.2, meaning their system produces more specific questions, though with remaining shortcomings in terms of context.

Table 6 Comparisons with previous research

5 Conclusions

After conducting this research on automatic generation of short-answer questions for IELTS reading comprehension, the important contributions are as follows:

This research has designed a computational model for automatically generating short-answer questions for IELTS reading comprehension, using the KNN algorithm and NLP methods. The stages comprise scraping news articles, tokenization, conversion to a tree structure, simple sentence extraction, question generation, removal of question sentences containing pronouns, conversion of question sentences to POS tags and then to numeric values, calculation of the distance between test and training data using those numeric values, replacement of words with synonyms, grammar refinement, and finally removal of duplicate questions derived from the same sentence.

Experiments on five articles with five different themes from four different websites produced candidate questions that were then evaluated by two experts. The evaluation showed a grammatical correctness percentage of 59.52%, an answer existence percentage of 95.24%, and a difficulty index percentage of 34.92%. The answer existence result, in particular, is very good.