1 Introduction

Textbooks are considered a primary resource of any education system. Therefore, almost all educators rely on them to prepare their teaching plans and syllabi, and they provide them as references to their students. This is because textbooks are high-quality textual resources that introduce the basic concepts and primary knowledge necessary to understand a specific domain (Alpizar-Chacon & Sosnovsky, 2021).

Recently, electronic textbooks have become preferable to a wide range of students and educators over physical (hard-copy) textbooks. This is because they give learners the ability to easily navigate through pages, search for specific words or phrases, add editable annotations, and much more. Textbooks can be found in different electronic formats, such as HTML, DOCX, and PDF. However, PDF is the most popular electronic textbook format, which students and educators seek more than the other electronic formats (Alpizar-Chacon & Sosnovsky, 2021; Bast & Korzen, 2017). The difference between a PDF document and documents in other formats is that a PDF document does not contain normal text; it actually contains thousands of basic objects, various compression techniques, font formats, lines, and curves. These objects are instructions to draw shapes, images, and text (Whitington, 2011). The disadvantage of PDF textbooks is that they rarely preserve information about their structure (Tkaczyk et al., 2015), which affects the quality of the text extracted from them.

Textbooks are created by domain experts who put considerable effort into identifying the main concepts and facilitating the basic knowledge of a domain. Furthermore, they focus on organizing the content of textbooks in a logical sequence according to their understanding of the domain. This logical sequence appears in the form of a set of chapters, sections, and subsections that are provided with informative headings (Alpizar-Chacon & Sosnovsky, 2021).

One of the distinguishing features of textbooks is that their authors define key terms explicitly, in a way that lets learners keep reading without struggling to look up the meaning of new terminology. Thus, they take care to define simple terms first and then gradually define more complex ones that build on the simple ones defined earlier. In good-quality textbooks, authors create a section, called a glossary, at the end of the textbook. The glossary is very helpful for learners, especially students: it gathers in one place all the key terms defined and explained across the chapters of a textbook, together with their definitions (Gacitua et al., 2011; Park et al., 2002), which enables learners to easily find and review the key terms related to the important concepts introduced by the textbook.

In the literature, it is useful to distinguish between two research areas relating to key terms (Campos et al., 2020): term extraction and term assignment. In term extraction, researchers propose solutions for extracting all possible terms referring to the domain concepts present in a particular text document. The extracted terms in this research area are called key terms, glossary terms, keywords, or keyphrases. On the other hand, researchers in the term assignment area propose solutions for labeling a document or a collection of documents with one or more terms that refer to one or more domain concepts for the purpose of document classification. The selected terms can also be called key terms, keywords, or keyphrases, but not glossary terms.

A term in both areas is a word or group of words that refers to a particular domain concept (Dwarakanath et al., 2013); sometimes, terms are also called concepts (Castellví et al., 2001; Gul et al., 2022). Extracting terms from text plays a vital role and is the foundation of various tasks, including text summarization, clustering, thesaurus building, opinion mining, categorization, query expansion, recommendation, information visualization, retrieval, indexing, and digital libraries (Campos et al., 2020). It can also play an important role in the academic publishing industry in tasks such as recommending articles and books, highlighting missing citations to authors, and suggesting available reviewers for submissions (Augenstein et al., 2017).

However, extracting terms from text documents is not a trivial task for experts (Silva et al., 2014), as manual term extraction consumes a lot of time and human effort (Berry et al., 2003; Liu et al., 2010; Xu & Zhang, 2021). Thus, there is an urgent need to use the computing power of computers for automatic term extraction (Babar & Patil, 2015). Another issue is that term extraction is subjective even when done manually: a study in Silva et al. (2014) shows that about 60% of experts agree that identifying terms in a domain is a subjective task. This indicates that the automatic extraction of terms may not be totally precise either (Silva et al., 2014).

In automatic term extraction, there are two families of machine learning methods: supervised and unsupervised. Supervised term extraction requires labeled training data sets that turn the term extraction problem into a classification problem (Wang & Wang, 2019; Witten et al., 1999). This approach trains a model on labeled data in a particular domain and uses the trained model to determine whether a word in a text is related to that domain. Its limitation is that the model must be trained on a large collection of labeled documents, and such collections are not available in many domains. Moreover, producing them requires a large amount of manual key term extraction and labeling (Campos et al., 2020; Sun et al., 2020; Xu & Zhang, 2021).

An alternative is to use unsupervised methods, which include linguistic, statistical, and graph-based methods for term extraction (Campos et al., 2020; Sun et al., 2020). Unlike supervised methods, unsupervised methods do not need training data sets. Consequently, they save the time and cost of manual term extraction and of labeling the extracted terms to create a training data set (Campos et al., 2020). Another advantage over supervised methods is that they can extract terms from text belonging to any domain, as they do not need labeled data sets in the domain of the problem for training (Papagiannopoulou & Tsoumakas, 2020). For these reasons, unsupervised methods are preferred as an alternative approach for term extraction (Duari & Bhatnagar, 2019).

This paper proposes a model for an automatic, unsupervised, linguistic-based method for extracting key terms from a PDF textbook. It is based on basic linguistic techniques of natural language processing: pattern recognition, sentence tokenization, part-of-speech (POS) tagging, and chunking. The proposed model succeeds in extracting relevant and domain-related terms from a single PDF textbook, even if a term occurs only once in the text. Consequently, it goes a step beyond methods that rely on statistical techniques to filter domain-related terms from the extracted candidates; such methods need to compute term frequencies and set a frequency threshold in order to select the top terms whose frequency exceeds that threshold. It is worth mentioning that, unlike these methods, the proposed method can extract key terms without computing term frequencies or setting a frequency threshold. The proposed method also focuses on textbooks in PDF format and provides solutions that work around problems resulting from extracting text from PDF. This is because the ultimate goal of the proposed method is to let educators and learners extract the important terms directly from their PDF textbooks, regardless of the textbook's domain.

The paper is structured as follows. Section 2 discusses recent related work. Section 3 explains the proposed model. Section 4 illustrates the model's phases and architecture. Section 5 explains the foundation layer of the model. Section 6 explains the preprocessing phase and its processes. Section 7 describes the defining sentence extraction phase and its processes. Section 8 focuses on the term extraction phase and its processes. Section 9 describes the experiments through which the proposed model is evaluated. Section 10 discusses the experiments' results. Section 11 concludes the paper and outlines future work.

2 Related work

Alpizar-Chacon and Sosnovsky (2021) use the PDFBox library to extract text from PDF textbooks and then propose two phases to extract key terms from the index section. They propose 12 rules to identify the index section, including rules to identify the beginning of the index section, the layout of index pages where text is organized into two columns, and the index entries of each column on every index page. In the first phase, they extract index terms from non-hierarchical index entries. In the second phase, they propose an algorithm to extract index terms from hierarchical index entries. Each hierarchical index entry consists of an unordered group of words that have to be arranged in a way that matches a term in the textbook. The algorithm detects the right order of these words in order to identify the different terms formed by this group of words. It uses the stem and lemma forms of each word in the hierarchical index entry and creates all their possible permutations. The text is then broken down into noun chunks, and every permutation is compared to every noun chunk. The matched permutations are chosen as key terms and added to the list of index terms extracted from the non-hierarchical index entries.

Mishra and Sharma (2020) propose a domain-specific approach to extract glossary terms from large text requirement documents. First, they build a text corpus for the home automation domain. The corpus consists of the CrowdRE dataset proposed by Murukannaiah et al. (2016, 2017) and Wikipedia's home automation web pages crawled with Python web scraping packages. Python's Natural Language Toolkit (NLTK) is then used in the preprocessing phase to tokenize the textual data in the corpus into sentences. All tokenized words are converted into lowercase, and all alphanumeric tokens and stopwords are removed. The NLTK tagger is used to tag the tokens of each sentence, and text chunking is applied to identify noun phrase chunks. Lemmatization is then applied to the generated chunks, and the lemmatized noun phrases are considered candidate glossary terms. A semantic filter is then applied to select domain-related terms among these candidates. The semantic filter uses the Word2Vec model (Mikolov et al., 2013a, b) for word embeddings and computes semantic similarity scores between candidates using cosine similarity, identifying the candidates related to the domain as the final glossary terms.

Gul et al. (2022) extract key terms from textual documents in the education domain using common NLP techniques and DBpedia as a knowledge base. The NLP techniques are used to divide the textual documents into n-grams representing the candidate terms. All n-grams representing stopwords are removed based on a predefined stopword list. The frequency of each remaining n-gram is computed in order to remove n-grams whose frequency is less than a predefined threshold. DBpedia is then used to remove meaningless n-grams that do not have an entry in it. The remaining n-grams are considered the key terms of the textual documents.

Campos et al. (2020) propose YAKE!, an unsupervised statistical method to extract keywords from a single text document without relying on external sources such as Wikipedia as a knowledge base. They use Python's Segtok sentence segmenter to divide the document into sentences and every sentence into chunks, where chunks are blocks of characters separated by punctuation marks. The web_tokenizer module of the Segtok segmenter is then used to divide every chunk into tokens. Statistics for each token are computed, including the token frequency and the indices of the sentences where the token occurs. Moreover, a co-occurrence matrix is created for each token to record its preceding and following tokens within a given window. They also propose five features to be extracted for each token; a score for each feature is computed based on the previously computed statistics, and the feature scores are combined into a token score. Candidate keywords are then generated from all token n-grams in every chunk. Keywords that start or end with a stopword from a given stopword list are rejected, but stopwords inside a keyword are accepted. A keyword score is then computed from the scores of all tokens forming the keyword, and keywords with lower scores are considered the most relevant.

Duari and Bhatnagar (2019) propose a graph-based method called sCAKE to extract keywords from text. Following the work in Rousseau and Vazirgiannis (2015), they extract candidate keywords as nouns and adjectives, using the Apache OpenNLP toolkit for tokenization and POS tagging. After that, they remove all stopwords and produce stemmed tokens using the Porter stemming algorithm. The stemmed tokens are used as candidates that are connected by the proposed Context-Aware Text Graphs (CAG) method, which links candidates related to each other based on their co-occurrences. They also propose the Semantic Connectivity based Word Scoring method (SCScore), which identifies important and relevant candidates by scoring every candidate based on its relation to its neighbors and its position in the text. The candidate keywords are then sorted by their SCScore, and the user can choose the top K candidates as the final keywords.

Dwarakanath et al. (2013) propose a linguistic and statistical technique to extract glossary terms from software requirement documents. They identify requirement sentences as those that begin with an explicit label. In every sentence, words in title case or capital case, words in quotes, and URLs are considered acronyms, which are automatically treated as glossary terms. All brackets and their contents are removed from sentences. Every sentence is then parsed by the Link Grammar (LG) parser (Sleator & Temperley, 1993), a dependency parser that provides information about the syntax of a sentence. They use the LG parser, instead of a POS tagger, to identify noun phrases and verb phrases as candidate glossary terms. Nouns are classified into concrete, process, and abstract nouns, and verbs into concrete and auxiliary verbs. Abstract nouns that do not refer to a physical object, such as 'capability', and auxiliary verbs, such as 'is', are not included in glossary terms. They then use a statistical technique based on occurrence and frequency to choose the final glossary terms from the candidates.

Conde et al. (2016) propose a tool called LiTeWi that uses the TF-IDF, KP-Miner, and CValue techniques to extract domain-related terms from electronic textbooks. The TF-IDF statistical technique proposed in Salton and Buckley (1988) is used to identify terms in a document and then select candidate terms based on their frequencies. KP-Miner, a statistical keyphrase extraction system proposed in El-Beltagy and Rafea (2009), is used to extract keyphrases from a document as candidate terms. They also use a method that combines linguistic and statistical techniques, called CValue: the Illinois POS tagger and chunker are used to extract noun phrases, which may contain multiple adjectives and prepositions, as candidate terms, and every noun phrase is given a CValue score based on its frequency. All candidate terms extracted by these techniques are combined into one list. Then, based on the combined stopword lists of Salton (1971) and Fox (1990), they filter all stopwords out of the candidate list. The remaining candidates are mapped to Wikipedia articles to measure their domain-relatedness. Finally, terms whose CValue score and domain-relatedness value exceed specific thresholds are considered the final key terms.

Most existing work treats stopwords negatively and removes them before identifying keywords; in our work, however, they play a positive role in extracting domain-related key terms. Some existing work is based on statistical techniques that depend on keyword frequencies and a given frequency threshold to select domain-related keywords among candidates. However, specifying a threshold is subjective; moreover, statistical techniques are not effective when extracting terms from a single document. Other methods rely on external sources as knowledge bases to filter domain-related keywords from the extracted terms. This approach cannot select a domain-related keyword from the candidates if the keyword has no entry in the knowledge base.

Our work is related to approaches that use NLP techniques to extract key terms. However, it uniquely employs chunks and stopwords to extract not only keywords but domain-related ones from a single document, without using statistical techniques or external knowledge bases to measure the domain-relatedness of the extracted terms. Unlike most existing work, which focuses on plain text documents, our work focuses on documents in PDF format, especially textbooks, and provides solutions to work around problems with the text extracted from PDF.

3 The proposed model

The proposed model focuses on extracting glossary terms from text, especially text extracted from textbooks in PDF format. The main idea the model revolves around is to find the sentences that define glossary terms; the proposed model refers to these sentences as defining sentences. Therefore, we first define what a sentence and a term are, and then describe the proposed method to extract terms from sentences.

The proposed model defines a sentence as the main unit of text of any textual material, including textbooks, for defining terms in a particular domain or giving more explanation about the defined terms. Simply put, a sentence consists of a subject, a verb, and an object. Subjects and objects can be a word or a group of words. The verb part consists of either a main verb or an auxiliary verb followed by a main verb. Auxiliary verbs can be the verb to be, the verb to do, the verb to have, or modal verbs. The verb to be includes am, is, are, was, and were; the verb to do includes does, do, and did; the verb to have includes have, has, and had; the modal verbs include can, could, will, would, shall, should, may, might, and must. In contrast, the main verb can be any verb that connects the subject with the object in a sentence.

As shown in Fig. 1, the proposed model classifies sentences in text into: (1) is-a sentences, denoted by IS, which are sentences having the verb to be as a main or auxiliary verb, and (2) not-is-a sentences, denoted by NIS, which are sentences that do not have the verb to be as a main or auxiliary verb and are used to give more explanation about the defined key terms. The defining sentences are sentences derived from is-a sentences and not-is-a sentences under some conditions. The proposed model classifies the defining sentences into four main types: (1) ISm sentences, which are is-a sentences that have the verb to be as a main verb, (2) ISx sentences, which are is-a sentences that have the verb to be as an auxiliary verb followed by a selected main verb, (3) NISsmv sentences, which are not-is-a sentences that have a selected main verb other than the verb to be, and (4) NIScalled sentences, which are not-is-a sentences having the verb called, known as, referred to as, termed, or termed as.

Fig. 1 Defining sentences classification

Moreover, the proposed model defines a term as a word or group of words in a sentence, which can also be called a key term, glossary term, keyword, or keyphrase, that represents a concept, entity, object, or process in a domain. In addition, the proposed model classifies terms, on the basis of the forms in which a term can appear within a sentence, into two main types: (1) regular terms and (2) irregular terms. The regular term is denoted by R and defined as a term that consists of a single noun, a group of nouns, or a group of nouns and adjectives. The irregular term is denoted by IR and defined as a term that consists of not only a group of nouns and adjectives but also other parts of speech, including determiners, adverbs, prepositions, and conjunctions.

In light of the previous definitions and classifications of a term and a sentence, the proposed model, as shown in Fig. 2, classifies the ISm sentences into four types: (1) ISmR sentences, which are ISm sentences that have a regular term as a subject, (2) ISmRadv sentences, which are ISm sentences that have a regular term as a subject with a preceding adverb sentence, (3) ISmRe sentences, which are ISm sentences that have a regular term as a subject with extra information between commas before the main verb, and (4) ISmIR sentences, which are ISm sentences that have an irregular term as a subject.

Fig. 2 Types of ISm sentences

If the sentences of the text are denoted by S, the defining sentences by D, and the terms by T, the proposed model can be formally described as follows:

$$S=IS \cup NIS$$
(1)
$$\left(IS_{m}\cup IS_{x}\right)\subset IS$$
(2)
$$IS_{m}\cap IS_{x}= \varnothing$$
(3)
$$\left(NIS_{called} \cup NIS_{smv}\right)\subset NIS$$
(4)
$$NIS_{called}\cap NIS_{smv}= \varnothing$$
(5)
$$D=\left(IS_{m}\cup IS_{x}\cup NIS_{called} \cup NIS_{smv}\right)$$
(6)
$$T=R\cup IR$$
(7)
$$IS_{m}= \left(IS_{m}R\cup IS_{m}R_{e}\cup IS_{m}R_{adv}\cup IS_{m}IR\right)$$
(8)
$$IS_{m}R\cap IS_{m}R_{e}\cap IS_{m}R_{adv}\cap IS_{m}IR= \varnothing$$
(9)

4 Model architecture and workflow

The model architecture, as shown in Fig. 3, takes a textbook in PDF format as input and produces the textbook's key terms as output. The textbook passes through three phases of the model until the key terms are extracted: (1) preprocessing, (2) defining sentence extraction, and (3) term extraction. The first phase deals with the PDF textbook, where the text is extracted from the PDF format and prepared for the next phases. The other two phases deal with the text extracted from the PDF textbook in order to extract the defining sentences and the key terms they include. The foundation layer of the model is not a phase; it shows the tools and techniques that all processes in the three phases rely on to achieve the ultimate goal of textbook key term extraction.

Fig. 3 The main phases and the foundation layer of the model

More details about the processes involved in every phase and the basic techniques used in the foundation layer are shown in Fig. 4. As shown in the figure, the model has 21 processes that depend on the basic techniques of the foundation layer. In the following sections, the techniques in the foundation layer are described, and each process is discussed together with the proposed algorithm it builds on.

Fig. 4 Architecture and workflow of the proposed model

5 Foundation layer

As shown in Fig. 4, the foundation layer consists of a method and four NLP (Natural Language Processing) techniques that all processes of the model are built on:

  1. PDF text extraction. A method used in only one process of the model. It receives a PDF textbook and extracts the text from its PDF format, which all the other NLP techniques then work on.

  2. Sentence tokenization. An NLP technique used to break the extracted text down into meaningful sentences.

  3. Part-of-speech (POS) tagging. An NLP technique used to identify the role of each word in every sentence, whether it acts as a noun, verb, adverb, adjective, determiner, conjunction, etc.

  4. Chunking. An NLP technique that works on the result of POS tagging in order to identify certain phrases.

  5. Pattern recognition. An NLP technique used to find specific character patterns directly in the extracted text, without depending on the result of sentence tokenization or POS tagging.
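To make the foundation layer concrete, the following is a minimal sketch of how techniques (2)–(5) can be exercised with Python's NLTK, the toolkit used in the implementation described in Section 9; the sample text and the simple noun-phrase grammar are illustrative only and are not the model's actual patterns. The PDF text extraction step (1) is sketched separately in Section 9.

```python
import re
import nltk

# One-time downloads of the NLTK resources assumed by this sketch.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "An information system is a set of interrelated components. It collects data."

# (2) Sentence tokenization: break the extracted text into sentences.
sentences = nltk.sent_tokenize(text)

# (3) POS tagging: label every word of every sentence with its part of speech.
tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]

# (4) Chunking: group tagged words into phrases; this toy grammar marks simple noun phrases.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ.?>*<NN.?.?>+}")
chunks = [chunker.parse(t) for t in tagged]

# (5) Pattern recognition: a plain regular expression applied directly to the raw text.
is_a_hits = re.findall(r"\b\w+\s+(?:is|are)\b", text)

print(sentences)
print(tagged[0])
print(chunks[0])
print(is_a_hits)
```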

6 Preprocessing phase

The preprocessing phase is the first phase of the proposed model. It contains one process that extracts the text from the PDF format, while the remaining processes prepare the extracted text so that the processes of the later phases can perform their tasks effectively. The preprocessing phase consists of five processes: (1) text extraction from a PDF textbook, (2) text cleaning from footnote and reference numbers, (3) text cleaning from brackets, (4) sentence extraction, and (5) sentence cleaning from headings. These five processes are discussed in the following sections.

6.1 Text extraction from a PDF textbook

The purpose of this process is to receive a PDF textbook and extract the text from its PDF format, relying on the PDF text extraction method defined in the foundation layer. However, the text extractor leaves some challenges that must be tackled in order to extract meaningful and precise key terms. Each process in the following sections addresses one or more of these challenges that influence its performance and provides a solution to work around them within its proposed algorithm.

6.2 Text cleaning from footnote and reference numbers

One challenge that hinders extracting key terms from the text extracted from PDF textbooks is how the text extractor handles footnote and reference numbers. Textbooks can attach footnote numbers to some terms to give more information about them in the page footer. Other textbooks can place a reference number at the end of a sentence to cite a reference included in a reference list at the end of a chapter. Both footnote and reference numbers make it harder to extract accurate terms from the text.

Footnote and reference numbers appear in the text extracted from the PDF format as normal numbers, not in superscript, and that raises an issue. Footnote numbers appear to be part of a key term and are thus extracted with it; consequently, they hinder extracting accurate key terms. For reference numbers at the end of sentences, the text extractor treats the period ending the sentence and the reference number together as one word, and thus it joins the sentence with the next one. Therefore, reference numbers mislead the sentence extractor and prevent it from accurately separating sentences.

This process tackles this challenge so that terms and sentences can be extracted effectively and accurately from the text extracted from the PDF format. It cleans the text of footnote and reference numbers by recognizing all patterns representing them in the text. The following regular expression is proposed to identify the different patterns of footnote and reference numbers in text; the process then removes the numbers from its matches.

"\w+\.\d+|\w+\.\s\d+|\w+\s\d+|\w+\d+"

6.3 Text cleaning from brackets

Sentences that mention a term with its abbreviation between brackets, where the abbreviation appears not at the end but within the term's words, raise another challenge to extracting accurate terms. For example (Stair & Reynolds, 2012), the term Business-to-Business (B2B) E-Commerce is mentioned in a sentence with its abbreviation placed between brackets within the words of the term. In this example, the term that should be extracted is not Business-to-Business but Business-to-Business E-Commerce. Consequently, brackets can hinder extracting terms accurately, so this process cleans the text of brackets in preparation for accurate term extraction.
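A possible implementation of this cleaning step is sketched below, under the assumption that bracketed content (such as an inline abbreviation) can simply be dropped before term extraction; the function name and the example sentence are illustrative.

```python
import re

def remove_bracketed_content(text: str) -> str:
    # Drop round brackets and everything inside them, then collapse the double space left behind.
    without_brackets = re.sub(r"\([^)]*\)", "", text)
    return re.sub(r"\s{2,}", " ", without_brackets).strip()

sentence = "Business-to-Business (B2B) E-Commerce is a subset of e-commerce."
print(remove_bracketed_content(sentence))
# -> "Business-to-Business E-Commerce is a subset of e-commerce."
```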

6.4 Sentence extraction

This process receives the text extracted from the PDF textbook after it has been cleaned of footnote and reference numbers and of brackets (see Sections 6.2 and 6.3). The process then extracts all sentences from the cleaned text. It relies on the NLTK sentence tokenizer, which splits sentences based on the period at the end of every sentence.

6.5 Sentence cleaning from headings

This process receives as input all sentences extracted from the text, i.e., the output of the sentence extraction process discussed in Section 6.4. It is noticed that headings and subheadings are included in the extracted sentences. These headings and subheadings hinder extracting key terms precisely, as they appear at the beginning of the first sentence following them. For example, suppose the heading is INFORMATION CONCEPTS, in uppercase, and the first sentence after the heading is Information is the important concept in information technology. Although the heading and the first sentence appear on separate lines, the sentence extractor joins them and considers them one sentence, as follows (Stair & Reynolds, 2012):

  • INFORMATION CONCEPTS Information is the important concept in information technology.

As a result of merging a heading with the first sentence that follows it, the term extracted from the first sentence in this example would be INFORMATION CONCEPTS Information instead of Information. Therefore, headings and subheadings must be removed from the text to extract terms accurately. A challenge of the text extracted from the PDF format is that all information about word formatting, including that of headings and subheadings, is lost, except that the text extractor leaves words that are in uppercase or capitalized as they are and places them on separate lines. As shown in Algorithm 1, which is proposed for this process, these two features of headings, the uppercase format and the separate line, are employed to extract headings and subheadings from the text.

Another issue that complicates extracting headings and subheadings is sentence-special-format words. A sentence-special-format word is a word that appears in a sentence capitalized or in uppercase, similar to the format of headings and subheadings. The text extractor also places these words on separate lines. Consequently, headings, subheadings, and sentence-special-format words look very similar and cannot be distinguished by format alone.

Algorithm 1 Extract headings and subheadings from text

This challenge is also addressed in Algorithm 1 so that only headings and subheadings are extracted. The algorithm splits the text into lines; if it finds that all words in a line are capitalized or in uppercase, it checks the next line, and if the next line starts with a verb or a punctuation mark such as a period, it decides that the words in the line do not represent a heading or subheading.
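The following is a minimal sketch of this heading-detection idea, assuming the extracted text keeps headings on their own lines in uppercase or capitalized form; it is a simplified approximation of Algorithm 1, not a reproduction of it, and the verb/punctuation check on the next line is reduced to its essentials.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def looks_like_heading_line(line: str) -> bool:
    words = line.split()
    return bool(words) and all(w.isupper() or w.istitle() for w in words)

def extract_headings(text: str) -> list:
    """Collect lines that look like headings, skipping sentence-special-format words."""
    headings = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for current, following in zip(lines, lines[1:] + [""]):
        if not looks_like_heading_line(current):
            continue
        # If the next line starts with punctuation, the uppercase words are probably part
        # of a sentence (a sentence-special-format word) rather than a heading.
        if following and following[0] in ".,;:":
            continue
        # Likewise if the next line starts with a verb.
        if following:
            first_tag = nltk.pos_tag(nltk.word_tokenize(following))[0][1]
            if first_tag.startswith("VB"):
                continue
        headings.append(current)
    return headings

sample = "INFORMATION CONCEPTS\nInformation is the important concept in information technology."
print(extract_headings(sample))   # -> ['INFORMATION CONCEPTS']
```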

7 Defining sentence extraction phase

The defining sentence extraction phase focuses on extracting the defining sentences explained earlier in Section 3. It consists of eight processes: (1) Extract ISm sentences, (2) Extract ISmR sentences, (3) Extract ISmRadv sentences, (4) Extract ISmRe sentences, (5) Extract ISmIR sentences, (6) Extract ISx sentences, (7) Extract NIScalled sentences, and (8) Extract NISsmv sentences. In the following sections, each of these processes is discussed with examples, and the algorithm it is built on is also given.

7.1 Extract ISm sentences

This process receives all sentences from the process discussed in Section 6.5. It then extracts the ISm sentences described in Section 3 as one of the defining sentence types. The following examples (Stair & Reynolds, 2012) show how ISm sentences define key terms:

  • Input is the activity of gathering and capturing raw data defines the key term input.

  • Workstations are more powerful than personal computers defines the key term Workstations.

  • A byte is typically eight bits defines the key term byte.

  • An enterprise system is central to an organization defines the key term enterprise system.

This process is built on Algorithm 2, which proposes a regular expression pattern to recognize is-a sentences and extracts them from all the text's sentences produced by the process discussed in Section 6.5. It then shows how ISm sentences are extracted from the is-a sentences.

Algorithm 2 Extract is-a sentences from text and ISm sentences from is-a sentences
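The exact regular expression of Algorithm 2 is given in the algorithm itself; as a hedged illustration only, the sketch below shows one plausible way such a filter could look, splitting is-a sentences into ISm and ISx as described in Sections 3, 7.1, and 7.6. The regular expressions and helper names here are assumptions, not the paper's patterns.

```python
import re

# Assumed pattern: the finite forms of the verb to be listed in Section 3.
TO_BE = re.compile(r"\b(?:am|is|are|was|were)\b", re.IGNORECASE)
# Selected main verbs that, directly after the verb to be, make a sentence ISx rather than ISm.
SELECTED_AFTER_BE = re.compile(
    r"\b(?:am|is|are|was|were)\s+"
    r"(?:called|termed(?:\s+as)?|referred\s+to\s+as|defined\s+as|known\s+as)\b",
    re.IGNORECASE,
)

def split_is_a_sentences(sentences):
    """Return (ISm, ISx): is-a sentences with 'to be' as a main verb vs. as an auxiliary."""
    ism, isx = [], []
    for sentence in sentences:
        if not TO_BE.search(sentence):
            continue  # not an is-a sentence at all
        if SELECTED_AFTER_BE.search(sentence):
            isx.append(sentence)
        else:
            ism.append(sentence)
    return ism, isx

sentences = [
    "Input is the activity of gathering and capturing raw data.",
    "Processing that uses several processing units is called multiprocessing.",
    "Workstations store files locally.",
]
print(split_is_a_sentences(sentences))
```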

7.2 Extract ISmR sentences

This process focuses on extracting ISmR sentences, which are ISm sentences having regular terms. It does not identify regular terms in any position in a sentence; instead, since the traditional way to define a key term is to use it as the subject at the beginning of a sentence, this process focuses only on regular terms in the subject position of an ISm sentence. The process is built on Algorithm 3, which proposes the following chunk pattern to identify regular terms in the subject position.

<DT>?<RB>*<VBN>*<JJ.?>*<NN.?.?>*<JJ.?>*<:>?(<VBG>*<NN.?.?>*<VBG>*)+<RB>*<VBZ|VBP|VBD>

As shown in this chunk pattern, the sentence is extracted if it starts with a term that consists of a noun or nouns (NN, NNS, NNP, or NNPS tags), possibly preceded by a determiner (DT tag), an adverb (RB tag), a verb in past participle (VBN tag), and/or an adjective (JJ tag). The tag < : > is used to include multiple hyphenated adjectives. The pattern also makes sure that the regular term is a subject by requiring it to be followed by a verb in the third person singular present (VBZ tag), a verb in the non-third person singular present (VBP tag), or a verb in the past tense (VBD tag). The following sentences give examples (Stair & Reynolds, 2012) of regular terms in the subject position; a small chunking sketch follows the examples:

  • Knowledge workers are people who create, use, and disseminate knowledge. The term in this sentence is Knowledge workers.

  • Highly structured problems are straightforward, requiring known facts and relationships. The term in this sentence is Highly structured problems.
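As an illustration of how such a chunk grammar can be applied with NLTK's RegexpParser (the toolkit used in Section 9), the sketch below extracts the subject-position chunk of an ISmR sentence. The grammar here is a deliberately simplified version of the pattern above, the chunk label TERM is an assumption, and the printed result depends on how NLTK's default tagger tags the sample sentence.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Simplified version of the subject chunk above: optional determiner, adjectives, then nouns.
# The full pattern additionally allows adverbs, past participles, gerunds, and the <:> tag.
chunker = nltk.RegexpParser("TERM: {<DT>?<JJ.?>*<NN.?.?>+}")

def extract_subject_term(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "TERM"):
        # The first chunk is in the subject position; drop a leading determiner if present.
        return " ".join(word for word, tag in subtree.leaves() if tag != "DT")
    return None

print(extract_subject_term(
    "Knowledge workers are people who create, use, and disseminate knowledge."))
# Expected, assuming the default tagger marks both words as nouns: 'Knowledge workers'
```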

Algorithm 3 Extract ISmR sentences from ISm sentences

7.3 Extract ISmRadv sentences

This process focuses on extracting ISmRadv sentences, which are ISm sentences that have a regular term in the subject position but with a preceding adverb sentence. The following examples (Stair & Reynolds, 2012) show ISmRadv sentences:

  • Originally, a hacker was a person who enjoyed computer technology. The term is hacker.

  • In the context of systems development, stakeholders are people who benefit from the system. The term is stakeholders.

The process is built on Algorithm 4 that proposes the following chunk pattern to recognize ISmRadv sentences.

^<^(DT)|RB|RBR|VBN|IN|JJ><.+>*<,><DT>?<RB>*<VBN>*<JJ.?>*<:>?<NN.?.?>+<VBZ|VBP|VBD>

As shown, this pattern is similar to the pattern that recognizes ISmR sentences, except that it begins with a sub-pattern that recognizes the adverb sentence as well.

Algorithm 4 Extract ISmRadv sentences from ISm sentences

7.4 Extract ISmRe sentences

Although regular terms usually come as subjects followed by the verb to be as a main verb, they can also be followed by extra information mentioned between two commas before the main verb. The following is an example (Stair & Reynolds, 2012) of ISmRe sentences:

  • A CD burner, the informal name for a CD recorder, is a device…. The term is CD burner.

This process focuses on extracting these sentences from ISm sentences. It is built on Algorithm 5, which proposes the following chunk pattern to extract ISmRe sentences from ISm sentences:

^<DT>?<RB>*<JJ.?>*<:>?<NN.?.?>+<,><.+>+<,><VBZ|VBP|VBD>

This pattern is similar to the one that recognizes ISmR sentences except that it adds a pattern for recognizing the extra information coming between commas after the subject and before the verb.

Algorithm 5 Extract ISmRe sentences from ISm sentences

7.5 Extract ISmIR sentences

This process focuses on extracting ISmIR sentences from ISm sentences. ISmIR sentences are sentences that have irregular terms in the subject position and the verb to be as a main verb. The following sentence (Stair & Reynolds, 2012) gives an example of an ISmIR sentence:

  • Redundant array of independent/inexpensive disks (RAID) is a method of storing data.

The irregular term of this sentence is Redundant array of independent/inexpensive disks (RAID).

This process is built on Algorithm 6 that proposes the following chunk pattern to recognize ISmIR sentences:

^<DT>*<VBG|VBN|JJ.?|RB|NN.?.?>*<IN><DT>*<VBG|VBN|JJ.?|RB|NN.?.?>+<VBZ|VBP|VBD>

Algorithm 6 Extract ISmIR sentences from ISm sentences

7.6 Extract ISx sentences

Besides ISm sentences, there are other defining sentences (ISx) that have the verb to be as an auxiliary verb followed by one of the selected main verbs. The selected main verbs are called, termed, termed as, referred to as, defined as, and known as. The following sentence (Stair & Reynolds, 2012) gives an example of an ISx sentence:

  • Processing that uses several processing units is called multiprocessing

This sentence defines the key term multiprocessing; it has the verb to be as an auxiliary verb followed by the verb called, one of the selected main verbs. Algorithm 7 shows the proposed regular expression pattern that recognizes these sentences and how they are extracted from is-a sentences.

Algorithm 7 Extract ISx sentences from is-a sentences

7.7 Extract NIScalled sentences

The not-is-a sentences, as defined earlier in Section 3, can be considered defining sentences when they satisfy some conditions. This section sets one condition that turns a not-is-a sentence into a defining sentence: having the verb called, known as, referred to as, termed, or termed as among its words.

The following are examples (Stair & Reynolds, 2012) of sentences that are considered not-is-a sentences and defining sentences as well. They also show different forms of having the verb called in not-is-a sentences.

  • This concept, called cloud computing, allows people to get the information they need from the Internet. This sentence defines the term cloud computing.

  • An executive support system, also called an executive information system, helps top-level managers. This sentence defines the term executive information system.

  • A small company called Microsoft developed PC-DOS and MS-DOS to support the IBM personal computer introduced in the 1980s. This sentence defines the term Microsoft.

  • A group support system includes the DSS elements just described as well as software, called groupware, to help groups make effective decisions. This sentence defines the term groupware.

  • A company can either develop a one-of-a-kind program for a specific application (called proprietary software). This sentence defines the term proprietary software.

The following algorithm, Algorithm 8, is proposed to extract the NIScalled sentences from all sentences in a text, regardless of the form in which the verb called comes. Since Algorithm 7 already extracts ISx sentences, which can also have the verb called as a main verb preceded by the auxiliary verb to be, this algorithm focuses only on extracting NIScalled sentences in which the verb called does not come directly after the verb to be.

Algorithm 8 Extract NIScalled sentences from all sentences in the text
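A plausible sketch of this selection step is given below, assuming a simple regular expression over the raw sentence: it keeps sentences containing called and its sisters but skips those where the verb directly follows a form of the verb to be, since those are handled as ISx sentences by Algorithm 7. The patterns and names are assumptions, not the paper's Algorithm 8.

```python
import re

CALLED_FAMILY = r"(?:called|known\s+as|referred\s+to\s+as|termed(?:\s+as)?)"
# 'called' and its sisters anywhere in the sentence...
HAS_CALLED = re.compile(r"\b%s\b" % CALLED_FAMILY, re.IGNORECASE)
# ...but not directly after a form of the verb to be (that case is an ISx sentence).
AFTER_TO_BE = re.compile(r"\b(?:am|is|are|was|were)\s+%s\b" % CALLED_FAMILY, re.IGNORECASE)

def is_nis_called(sentence: str) -> bool:
    return bool(HAS_CALLED.search(sentence)) and not AFTER_TO_BE.search(sentence)

examples = [
    "This concept, called cloud computing, allows people to get the information they need.",
    "Processing that uses several processing units is called multiprocessing.",
]
print([is_nis_called(s) for s in examples])   # -> [True, False]
```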

7.8 Extract NISsmv sentences

The NISsmv sentences reflect another condition that can transform a not-is-a sentence into a defining sentence: the sentence must contain one of the selected verbs involve, refer to, specify, include, combine, define, contain, consist of, house, or mean. Algorithm 9 shows the regular expression pattern used to select the NISsmv sentences from all sentences in the text.

Algorithm 9 Extract NISsmv sentences from all sentences in the text

8 Term extraction phase

This phase focuses on extracting terms that come either as regular terms or as irregular terms. These terms are extracted only from the defining sentences produced in the previous phase. The phase consists of eight processes: (1) Extract TISmR terms from ISmR sentences, (2) Extract TISmRadv terms from ISmRadv sentences, (3) Extract TISmRe terms from ISmRe sentences, (4) Extract TSISmIR terms from ISmIR sentences, (5) Extract TOISmIR terms from ISmIR sentences, (6) Extract TISx terms from ISx sentences, (7) Extract TNIScalled terms from NIScalled sentences, and (8) Extract TNISsmv terms from NISsmv sentences. Each process focuses on extracting terms from a particular type of defining sentence. In the following sections, each process is discussed with examples and with the algorithm it builds on.

8.1 Extract TISmR terms from ISmR sentences

The TISmR terms are regular terms in the subject position of the ISmR sentences discussed in Section 7.2. Every ISmR sentence is extracted based on the following chunk pattern:

<DT>?<RB>*<VBN>*<JJ.?>*<NN.?.?>*<JJ.?>*<:>?(<VBG>*<NN.?.?>*<VBG>*)+<RB>*<VBZ|VBP|VBD>

Consequently, every regular term in these sentences must be extracted without any determiner. The following sentence gives an example (Stair & Reynolds, 2012) of this case:

  • An information system is a set of interrelated components that collect, manipulate, store, and disseminate data and information. The term in this sentence is information system without the article an.

In addition, the process extracts regular terms consisting of a noun or nouns preceded by two or three hyphenated words acting as an adjective. The following sentence gives an example (Stair & Reynolds, 2012) of two hyphenated words acting as an adjective:

  • A computer-based information system is a single set of hardware. The extracted term in this sentence is computer-based information system.

Another sentence that gives an example (Stair & Reynolds, 2012) of three hyphenated words as an adjective is:

  • Consumer-to-consumer e-commerce is a subset of e-commerce. The extracted term in this sentence is Consumer-to-consumer e-commerce.

The following chunk pattern is proposed to recognize a regular term in the subject position with three hyphenated adjectives:

^<DT>?<JJ>*<:>?<NN.?>+<VBZ|VBP|VBD>

The tag < : > is used in the pattern to work around a problem of wrongly tagged three-word hyphenated adjectives, where the NLTK POS tagger tags the first two hyphenated words together as < JJ >, the second hyphen as < : >, and the third word as < NN >.

Moreover, the process can extract regular terms that contain an apostrophe. The following sentence (Stair & Reynolds, 2012) gives an example of this case:

  • Nvidia’s GeForce D is software that can display images on a computer screen. The extracted term in this sentence is Nvidia’s GeForce D.

Sometimes, in PDF textbooks, there is not enough space in a line to write a whole word. The word is then divided into two parts: the first part completes the line and is followed by a hyphen to indicate that the remaining part continues at the beginning of the next line. Unfortunately, the PDF text extractor leaves words with end-of-line hyphens unmerged, which is a real challenge for extracting terms accurately. This process also faces this challenge by merging the two parts back into the whole word representing a key term.

The following is an example (Stair & Reynolds, 2012) of a term with the end-of-line hyphenation:

  • bandwidth, the more information can be exchanged at one time. Broadband communications is a relative term but generally means a telecommunications system that can

As shown in this example, the term with the end-of-line hyphen is Broadband communications. The word communications is divided into two parts, commu and nications, because there is not enough space in the line to write the whole word: the first part is written followed by a hyphen, commu-, and the second part, nications, starts the next line. The process extracts the whole term without a hyphen as Broadband communications. This process is built on Algorithm 10, which proposes a chunk pattern to recognize regular terms in ISmR sentences and explains how they are extracted; a sketch of the de-hyphenation step follows the algorithm.

Algorithm 10 Extract regular terms TISmR from ISmR sentences
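The de-hyphenation described above could also be approached as a simple text repair step before tokenization. The following is a minimal sketch under that assumption; it merges a word split across lines by an end-of-line hyphen and, unlike the full process, does not check whether the hyphen belongs to a genuinely hyphenated compound.

```python
import re

def merge_end_of_line_hyphens(text: str) -> str:
    # Join 'commu-\nnications' back into 'communications': a hyphen at the end of a line
    # followed by the continuation of the word at the start of the next line.
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

raw = "Broadband commu-\nnications is a relative term but generally means a telecommunications system."
print(merge_end_of_line_hyphens(raw))
# -> "Broadband communications is a relative term but generally means a telecommunications system."
```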

8.2 Extract TISmRadv terms from ISmRadv sentences

The ISmRadv sentences, discussed in Section 7.3, are ISm sentences that start with an adverb sentence before the subject. The TISmRadv terms are regular terms in the subject position of the ISmRadv sentences. Algorithm 11 is proposed to enable this process to extract TISmRadv terms from ISmRadv sentences.

Algorithm 11 Extract regular terms TISmRadv from ISmRadv sentences

8.3 Extract TISmRe terms from ISmRe sentences

The ISmRe sentences, discussed in Section 7.4, have extra information about the subject, located between the subject and the main verb (the verb to be) and separated by commas. The TISmRe terms are the subjects of the ISmRe sentences. The following is an example (Stair & Reynolds, 2012) of a sentence with a TISmRe term:

  • A CD burner, the informal name for a CD recorder, is a device. The term is CD burner.

This process is built on Algorithm 12, which is proposed to extract TISmRe terms from ISmRe sentences.

Algorithm 12 Extract regular terms TISmRe from ISmRe sentences

8.4 Extract TSISmIR terms from ISmIR sentences

The TSISmIR terms are irregular terms in the subject position of the ISmIR sentences discussed in Section 7.5. Algorithm 13 is proposed to enable this process to extract TSISmIR terms from ISmIR sentences.

Algorithm 13 Extract subjects as irregular terms TSISmIR from ISmIR sentences

8.5 Extract TOISmIR terms from ISmIR sentences

The TOISmIR terms are irregular terms in the object position of the ISmIR sentences that are discussed in Section 7.5. The following are examples (Stair & Reynolds, 2012) of ISmIR sentences having TOISmIR terms:

  • Another major activity of a TPS is data manipulation, the process of….. The term is data manipulation.

  • An important part of an expert system is the explanation facility, which allows …. The term is explanation facility.

  • Turning data into information is a process, or a set of logically related tasks. The term is process.

Algorithm 14 is proposed to enable this process to extract TOISmIR terms from ISmIR sentences.

Algorithm 14 Extract objects as regular terms TOISmIR from ISmIR sentences

8.6 Extract TISx terms from ISx sentences

The ISx sentences, discussed in Section 7.6, are is-a sentences that have the verb to be as an auxiliary verb followed by one of the selected main verbs. The TISx terms are regular or irregular terms in the subject position of the ISx sentences. Algorithm 15 is proposed to enable this process to extract TISx terms from ISx sentences.

Algorithm 15 Extract regular or irregular terms TISx from ISx sentences

8.7 Extract TNIScalled terms from NIScalled sentences

The NIScalled sentences, discussed in Section 7.7, are NIS sentences having the verb called where it is not a main verb following the auxiliary verb to be. The TNIScalled terms are regular or irregular terms that come after the verb called or one of its sisters mentioned in Section 7.7. Algorithm 16 is proposed to enable this process to extract TNIScalled terms from NIScalled sentences.

Algorithm 16 Extract regular or irregular terms TNIScalled from NIScalled sentences

8.8 Extract TNISsmv terms from NISsmv sentences

The NISsmv sentences, discussed in Section 7.8, are NIS sentences having one of the selected main verbs. The TNISsmv terms are regular or irregular terms that come after a selected main verb. Algorithm 17 is proposed to enable this process to extract TNISsmv terms from NISsmv sentences.

Algorithm 17 Extract subjects as terms TNISsmv from NISsmv sentences

9 Experiments

This section describes the experimental evaluation of the proposed model, which assesses the effectiveness of the different processes and algorithms. The proposed model is implemented using Textract, a Python package for PDF text extraction, and Python's NLTK for sentence tokenization, POS tagging, chunking, and pattern recognition. Textract and NLTK are the Python libraries used to implement the method and NLP techniques of the foundation layer (Section 5).

The proposed algorithms are evaluated through two experiments. In the first experiment, a PDF textbook from the business domain is used; in the second, one from the science domain. The terms extracted from each PDF textbook are compared to the textbook's key terms (glossary terms) included in its key terms sections. These textbook key terms are used as a gold standard to evaluate the effectiveness of the proposed work. The method is parameterless: no parameters need to be given or tuned, except for the PDF textbook given as input. More details about the two experiments are given in the following sections.
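A minimal sketch of this setup is given below, assuming the textract and nltk packages are installed; the file name textbook.pdf is a placeholder, not one of the evaluated textbooks.

```python
import textract
import nltk

# Placeholder path to a PDF textbook given as the only input.
PDF_PATH = "textbook.pdf"

# PDF text extraction (foundation layer): textract returns bytes, so decode to a string.
raw_text = textract.process(PDF_PATH).decode("utf-8", errors="ignore")

# NLTK resources assumed by the other foundation-layer techniques.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = nltk.sent_tokenize(raw_text)                   # sentence tokenization
tagged = nltk.pos_tag(nltk.word_tokenize(sentences[0]))    # POS tagging of the first sentence
print(len(sentences), tagged[:5])
```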

9.1 Experiment 1

In this experiment, a free and openly licensed PDF textbook from the business domain is used, entitled Organization development (OpenStax, 2019). The textbook consists of 704 pages, which contain a cover page, title page, copyright pages, acknowledgments page, table of contents, preface, 19 chapters, 2 appendixes, and an index. Every chapter contains figures, a caption for each figure, tables, and a header and footer on every page. At the end of each chapter there are key terms, a summary, a case study, review questions, and exercises. The textbook has 19 lists of key terms containing 468 terms in total.

In order to evaluate the different processes of the proposed work, the 468 key terms of the textbook are classified into 15 categories with their weights. As shown in Table 1, these categories can be grouped into two sections: (1) defined categories and (2) undefined categories. The defined categories are those introduced by the proposed model. The undefined categories are new categories that are not proposed by the model but were discovered through the experiments. The weight of each category is calculated as the ratio of the number of terms in the category to the total number of key terms.

Table 1 Categories of the textbook key terms used in Experiment 1

The first eight categories in Table 1 are for terms that are described in detail in the previous sections of the proposed work, while categories 9 to 15 are new categories described as follows:

  • Category (9). This category is about terms that are not defined using a sentence but are mentioned between brackets after their explanation. The following sentence (OpenStax, 2019) gives an example of these terms:

    • Successful employees know what they want to achieve (direction). This sentence defines the term direction.

  • Category (10). This category is about terms that are defined indirectly within a sentence such as the following sentences (OpenStax, 2019):

    • This set of perceptions often leads to abundance-based change, in which leaders assume. This sentence defines the term abundance-based change.

    • He meant that corporate culture is more influential. This sentence defines the term corporate culture.

    • Entrepreneurs can access debt capital, which is …. This sentence defines the term debt capital.

    • Over time, these areas of specialization mature through differentiation, the process of organizing employees into groups. This sentence defines the term differentiation.

  • Category (11). This category is about terms that come in the object position of sentences that are not ISmIR sentences. The following sentences (OpenStax, 2019) give examples of these terms:

    • Another prevalent type is family entrepreneurship that …. This sentence defines the term family entrepreneurship.

    • An appraisal system that has received increasing attention in recent years is the behaviorally anchored rating scale. This sentence defines the term behaviorally anchored rating scale.

  • Category (12). This category is about terms that are defined by main verbs other than the verb to be and other than the selected main verbs. Examples of these verbs are enable, begin, allow, automate, connect, combine, handle, give, perform, require, support, and many more.

  • Category (13). This category is about terms that come with the verb to be as an auxiliary verb, but the main verb is not one of the selected main verbs. For example (OpenStax, 2019):

    • Expert power is demonstrated when person …. This sentence defines the term Expert power.

  • Category (14). This category is about terms that come in sentences starting with a conjunction. For example (OpenStax, 2019):

    • Whereas frustration is a reaction to an obstruction in instrumental activities or behavior, anxiety is a feeling of inability. This sentence starts with the conjunction whereas, combining two clauses that define two terms, frustration and anxiety.

  • Category (15). This category is about terms that are defined in the key terms section only and do not have a sentence defining them outside this section.

As shown in Table 1, the terms of category 15 are excluded from the total key terms because they are not mentioned in the main text; moreover, in the key terms section, these terms do not appear in a complete sentence. Therefore, the fair total of key terms used as a gold standard becomes 416 instead of 468 after excluding the terms of category 15.

9.2 Experiment 2

In this experiment, another free and openly licensed PDF textbook is used, this time from the science domain, entitled University Physics Volume 1 (Ling et al., 2021). The textbook consists of 979 pages, which contain a cover page, title page, copyright pages, acknowledgments page, table of contents, preface, 17 chapters, 7 appendixes, an answer key, and an index. Every chapter contains figures, a caption for each figure, tables, equations, and a header and footer on every page. At the end of each chapter there are key terms, key equations, a summary, conceptual questions, and problems. The textbook has 17 lists of key terms containing 314 terms in total.

As shown in Table 2, the 314 key terms of the textbook are classified, as in Experiment 1, into the 15 categories with their weights. The fair total of key terms used as a gold standard for evaluation becomes 304 after excluding the terms of category 15.

Table 2 Categories of the textbook key terms used in Experiment 2

10 Results and discussion

This section discusses the results of the two experiments. It also evaluates the proposed work by comparing the extracted terms with the PDF textbooks' key terms in each experiment. In preparation for this comparison, the textbooks' key terms and the extracted terms are converted to singular form and any duplicate terms are removed. In addition, a key term is considered extracted even if the extracted term is not exactly the same as the key term, i.e., it contains one or more extra words beyond those of the key term; only a few extracted terms fall under this assumption. The performance of the proposed term extraction method is measured in terms of recall and precision as follows:

$$Category\;Recall= \frac{Number\;of\;terms\;extracted\;from\;a\;category}{Number\;of\;the\;textbook\;key\;terms\;in\;that\;category}$$
(10)
$$Recall= \frac{Number\;of\;terms\;extracted\;from\;all\;categories}{Number\;of\;the\;textbook\;key\;terms\;in\;all\;categories}$$
(11)
$$Precision= \frac{Number\;of\;terms\;extracted\;from\;all\;categories}{Number\;of\;all\;extracted\;terms\;in\;the\;result}$$
(12)
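
As a concrete illustration, the sketch below implements the comparison preparation and the three measures above in Python. The singularization rule and the containment-based matching are simplifying assumptions of this sketch, and the per-category counts in the usage example are placeholders rather than the actual values from Table 3; only the precision call reuses the experiment 1 totals reported later (221 matched key terms out of 1140 extracted terms).

```python
def normalize(term: str) -> str:
    """Lowercase a term and reduce each word to a naive singular form
    (illustrative heuristic only; a lemmatizer could be used instead)."""
    words = term.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and not w.endswith("ss") else w
                    for w in words)

def is_extracted(key_term: str, extracted_terms: set) -> bool:
    """A key term counts as extracted if some extracted term contains all of
    its words, even when the extracted term carries one or more extra words."""
    key_words = set(normalize(key_term).split())
    return any(key_words <= set(e.split()) for e in extracted_terms)

def category_recall(extracted_in_category: int, key_terms_in_category: int) -> float:
    """Eq. (10): recall within a single category."""
    return extracted_in_category / key_terms_in_category

def recall(extracted_per_category: dict, key_terms_per_category: dict) -> float:
    """Eq. (11): overall recall over all categories (category 15 excluded beforehand)."""
    return sum(extracted_per_category.values()) / sum(key_terms_per_category.values())

def precision(extracted_from_categories: int, all_extracted_terms: int) -> float:
    """Eq. (12): extracted key terms over all terms returned by the model."""
    return extracted_from_categories / all_extracted_terms

# Usage: deduplicated, singularized terms and placeholder category counts.
extracted = {normalize(t) for t in ["glossary terms", "key equations"]}
print(is_extracted("glossary term", extracted))             # True (containment match)
print(round(recall({1: 120, 2: 40}, {1: 200, 2: 80}), 3))   # 0.571 (placeholder counts)
print(round(precision(221, 1140), 3))                       # 0.194, i.e. ~19%
```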

The result of term extraction in each experiment is shown in Table 3. The table shows the category numbers, the number of key terms in each category, the number of extracted terms from each category, and the recall percentage of each category. In addition, it shows the total key terms, the total number of extracted terms from all categories, and the recall percentage of each experiment. In experiment 1, as shown in Table 3, the proposed method was able to extract 221 out of 416 key terms with a recall of 53.1%. The table also shows that, in experiment 2, it was able to extract 232 out of 304 key terms with a recall of 76.3%. As mentioned in Section 9.1, the fair total key terms of each experiment, i.e., the total key terms after excluding the terms of category 15, is used to calculate the recall. The difference in performance between the two experiments suggests that the proposed work extracts key terms from mathematical and science textbooks better than from theoretical textbooks.

Table 3 The recall percentage of the extracted terms

It is noticed, as shown in Table 3, that the key terms in the defined categories represent about 60% and 80% of the total key terms in experiments 1 and 2, respectively. This observation is evidence that the proposed model succeeded in proposing patterns and algorithms that identify up to 80% of the key terms defined by textbook authors, without relying on a statistical technique or an external knowledge base to measure the domain-relatedness of the extracted terms.

The proposed model deliberately avoids extracting key terms in categories 12 and 13 because their key terms come with main verbs other than the selected main verbs. Not including these verbs with the selected main verbs avoids extracting many terms that are not domain-related, as these verbs can accompany any term regardless of whether it is a key term or not. These two categories also carry a low weight among the key terms, 15% and 0.4% in experiment 1 and 4.8% and 1.3% in experiment 2, which means that excluding them has a small negative impact on the total recall. Therefore, avoiding them achieves two advantages: (1) a strong positive impact on the precision and (2) a higher accuracy of the proposed model in extracting domain-related terms.
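
To illustrate this design choice, a minimal check in the spirit of the model's verb filtering is sketched below using NLTK POS tagging. The verb set shown is a hypothetical stand-in, not the actual list of selected main verbs defined earlier in the paper.

```python
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time setup

# Hypothetical stand-in for the selected main verbs defined earlier in the paper.
SELECTED_MAIN_VERBS = {"refer", "mean", "describe", "denote", "represent"}

def uses_selected_main_verb(sentence: str) -> bool:
    """Return True only if a verb in the sentence (crudely reduced to a base
    form) is in the selected set, so sentences built on generic verbs such as
    'enable', 'allow', or 'require' (categories 12 and 13) are skipped."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for word, tag in tagged:
        if tag.startswith("VB"):
            base = word.lower().rstrip("s")  # naive handling of third-person -s
            if base in SELECTED_MAIN_VERBS:
                return True
    return False

print(uses_selected_main_verb("A glossary refers to a list of defined terms."))   # True
print(uses_selected_main_verb("A router allows devices to share a connection."))  # False
```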

On the other hand, as shown in Table 4, 1140 terms are extracted from 17889 sentences in experiment 1, whereas 1783 terms are extracted from 20878 sentences in experiment 2. Consequently, the proposed work extracts terms with a precision of 19% in experiment 1 and 13% in experiment 2. Nevertheless, it is worth noting that most of the extracted terms that do not appear in the textbook’s key terms section fall into the highest-weight term categories. Therefore, these extracted terms are recommended for inclusion in the key terms section, as the textbook’s author introduced them in a way that reflects their considerable importance and relevance to the domain.

Table 4 Recall and precision of the two experiments

Another observation about the proposed model is that it automatically filters out nouns and noun phrases appearing in a textbook’s figures without the need to remove them through preprocessing operations, as they do not occur in a sentence. This feature has a positive impact on precision and is further evidence of how the proposed model employs the NLP linguistic techniques uniquely to avoid selecting irrelevant terms. Nevertheless, not filtering out the preface, copyright, dedication, table of contents, and exercise pages, as well as figure captions and sentences inside figures, in the preprocessing phase has a negative impact on the precision, because the proposed work extracts from these pages all terms that follow the proposed patterns, most of which are not relevant. Moreover, extracting irrelevant subjects from ISmIR sentences when extracting the TISmOIR terms (objects) is a side effect that also reduces precision.
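
A minimal sketch of why figure labels drop out naturally is given below, under the simplifying assumption that a fragment only counts as a sentence when POS tagging finds at least one verb in it; the model's actual sentence identification is richer than this.

```python
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time setup

def looks_like_sentence(fragment: str) -> bool:
    """Figure labels and axis captions are usually bare noun phrases; requiring
    at least one verb keeps them out of term extraction without an explicit
    preprocessing step."""
    tagged = nltk.pos_tag(nltk.word_tokenize(fragment))
    return any(tag.startswith("VB") for _, tag in tagged)

print(looks_like_sentence("Force versus displacement"))                        # False
print(looks_like_sentence("Work is the product of force and displacement."))  # True
```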

11 Conclusion and future work

This paper introduces a novel model for automatically extracting key terms from a single PDF textbook. The proposed model is a linguistic-based unsupervised machine learning approach that does not rely on training with huge amounts of data or require particular computing resources. The novelty of this approach is that it utilizes the basic NLP linguistic techniques, namely pattern recognition, sentence tokenization, POS tagging, and chunking, not only to extract key terms, but also to select the most relevant and domain-related ones. Furthermore, the proposed model extracts domain-related terms without relying on a statistical technique or an external knowledge base, such as Wikipedia, DBPedia, or WordNet, to measure the domain-relatedness of terms.

The proposed model introduces a unique classification of textbook sentences and terms. Sentences are classified into two main types: is-a sentences and not-is-a sentences. The is-a sentences include two types: ISm and ISx sentences. The not-is-a sentences also include two types: NIScalled and NISsmv sentences. From this classification, the proposed model identifies the defining sentences that are used to define relevant and important terms in textbooks. The defining sentences include four types: ISm, ISx, NIScalled, and NISsmv. The proposed work also classifies terms into two types: regular terms (R) and irregular terms (IR). Based on the term type, only ISm sentences are further classified into ISmR, ISmRadv, ISmRe, and ISmIR sentences. Consequently, the model proposes a process to extract regular and irregular terms from each type of defining sentence, including the ISm subtypes.
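
This classification can be summarized compactly as two small enumerations; the sketch below only records the taxonomy names recapped in this paragraph, not the patterns or algorithms behind them.

```python
from enum import Enum

class DefiningSentenceType(Enum):
    """The four defining-sentence types identified by the model."""
    ISM = "ISm"               # is-a sentence of type ISm
    ISX = "ISx"               # is-a sentence of type ISx
    NIS_CALLED = "NIScalled"  # not-is-a sentence of type NIScalled
    NIS_SMV = "NISsmv"        # not-is-a sentence of type NISsmv

class TermType(Enum):
    """Terms are either regular (R) or irregular (IR)."""
    REGULAR = "R"
    IRREGULAR = "IR"

# Only ISm sentences are further subdivided according to the term they contain.
ISM_SUBTYPES = ("ISmR", "ISmRadv", "ISmRe", "ISmIR")
```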

The input of the proposed model is a PDF textbook, and its output is a list of key terms extracted from that textbook. These key terms are extracted through 21 processes that are grouped into three phases. All of these processes are built on the basic NLP linguistic techniques of pattern recognition, sentence tokenization, POS tagging, and chunking, which form the foundation layer of the model. The first phase contains a process for extracting the text from the PDF textbook and four processes for cleaning the text in order to work around problems found in the extracted text. The second phase contains eight processes in which eight patterns and algorithms are proposed to identify the defining sentences and extract them from the text. The third and last phase contains another eight processes, each of which has an algorithm that handles a specific type of defining sentence in order to extract its key terms. The proposed model achieved recalls of 53.1% and 76.3% in experiments 1 and 2 respectively, with an average of 64.7%. From the precision perspective, the proposed work extracts terms with precisions of 19% and 13% in experiments 1 and 2 respectively, with an average of 16%. This low precision is due to the large number of extracted terms that fall into the high-weight categories even though the textbook authors did not include them as key terms in the key terms section.
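
A runnable skeleton of this three-phase flow is sketched below; every helper is a hypothetical placeholder standing in for one or more of the 21 processes, so only the phase structure, not the actual algorithms, is reproduced.

```python
from typing import Dict, List

# Hypothetical placeholders for the model's processes; each returns canned
# output so the three-phase structure can be executed end to end.
def extract_text_from_pdf(pdf_path: str) -> str:          # phase 1: text extraction
    return "A force is a push or a pull."

def clean_extracted_text(text: str) -> str:                # phase 1: cleaning processes
    return " ".join(text.split())

def find_defining_sentences(text: str) -> Dict[str, List[str]]:   # phase 2
    return {"ISm": [text]}                                  # pretend a pattern matched

def extract_terms_from(sentence_type: str, sentences: List[str]) -> List[str]:  # phase 3
    # crude subject extraction, sufficient only for the placeholder sentence
    return [s.split(" is ")[0].removeprefix("A ").lower() for s in sentences]

def extract_key_terms(pdf_path: str) -> List[str]:
    """Phase 1: extract and clean the text; phase 2: identify defining
    sentences; phase 3: run the sentence-type-specific extraction."""
    text = clean_extracted_text(extract_text_from_pdf(pdf_path))
    defining = find_defining_sentences(text)
    terms: List[str] = []
    for sentence_type, sentences in defining.items():
        terms.extend(extract_terms_from(sentence_type, sentences))
    return sorted(set(terms))   # keep the extracted terms unique

print(extract_key_terms("textbook.pdf"))   # ['force'] with the placeholders above
```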

In the near future, it is planned to improve the average recall percentage of the defined categories, which is currently 91.5%. Furthermore, the possibility of maximizing the overall recall percentage by extracting more terms from the undefined categories will be studied. Moreover, it is planned to take further steps towards improving the precision of the proposed work by introducing a filtering process to reduce or eliminate the irrelevant terms in the result. Finally, it is planned to evaluate the proposed work against state-of-the-art methods that have available implementations for extracting terms from textbooks.