
1 Introduction

Our recent research has demonstrated [1] that TSA could help reveal terminological patterns in domain-bounded textual data, e.g. topical collections of scholarly publications. We also discovered that it could instrument the discovery of trends in technology adoption in industry [3]. We found that the major factors hampering terminological saturation were: (i) the immaturity of the domain, implying that the domain-bounded corpus is too small; (ii) heterogeneity within the domain, e.g. fragmentation due to competition among different R&D strands; or (iii) the volatility of the domain terminology over time. Based on these findings, it was remarkable to notice that the existence of a terminologically saturated sub-collection in a corpus of texts – a terminological core sub-collection – indicates the maturity and stability of the respective topic or domain. On the other hand, the absence of terminological saturation points out that an opportunity window is open for the further development of the focal subject domain, including mergers of competing strands. Application-wise, our research was aimed at ensuring the completeness of a text corpus in a domain for ontology learning from texts. However, the results seem to have a broader potential R&D impact.

One promising use is in extracting the smallest possible, yet representatively complete, datasets for training machine learning models for natural language processing tasks. Furthermore, a knowledge graph, built using the terminology extracted from such a terminological core dataset, could be used as a structured representation of the set of features characterising the description of the domain. This might help make the training of the models more efficient and the outputs from the trained deep learning models better explainable, hence – trustworthy.

Another potential use case could be in event detection and prediction using social media text or document streams. We hypothesise that, if terminological saturation is detected in a timed topical stream of texts, it might indicate that either (i) the stream is dominated by authors that use coherent terminology; or (ii) the majority of the community around the topic is focused on something important that has already happened or will happen soon. On the other hand, the lack of terminological saturation in a topical stream might indicate that the situation around the topic is stable in the democratic sense, i.e. characterised by a plethora of different competing opinions and judgements on the topic.

The remainder of the paper is structured as follows. In Sect. 2, our approach to TSA [1] is outlined to make the paper self-contained. The review of the recent related work is provided in Sect. 3, with an aim to reveal the current research gaps regarding the potential use of TSA in the context of related open research questions. Section 4 outlines our results in answering some of these questions. Section 5 deliberates on the ways to refine the TSA method for making it more effective and efficient. In Sect. 6, our vision is presented of how the answers to the rest of the open questions, within our focus of interest, could be approached. This vision outlines the plans of our future work. Finally, the conclusions are drawn in Sect. 7.

2 An Outline of TSA

TSA seeks a locally optimal solution for the automated extraction of representative terminologies and respective document sub-collections using an iterative successive approximation approach. As an output, if the process converges to a solution, a terminological core sub-collection (TCSC) is extracted from the input collection of documents. The TCSC carries the saturated set of terms (Tsat) that is representative for the subject domain. This set of terms is also extracted for further use; each term in it is supplemented with a significance value, and the extracted set is sorted in decreasing order of term significance.

Successive approximation starts with an empty set of documents – the dataset D0. At the i-th iteration, several (inc) new documents are taken from the input collection as a plain text file and appended to the dataset Di-1 that was processed at the (i-1)-st iteration, resulting in the dataset Di. Hence, the datasets grow incrementally across the iterations. It is supposed that, while growing, the datasets successively come closer to the dataset Dsat that represents the TCSC. Tsat is finally acquired from Dsat if saturation is detected at the i-th iteration.
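For illustration, this successive approximation loop can be sketched in Python as follows. This is a schematic outline rather than the actual TSA implementation: the term extraction, cut-off, and distance steps described below are passed in as callback functions, and all names are illustrative.

```python
def tsa_loop(collection, inc, extract_terms, cut_off, distance):
    """Sketch of TSA's successive approximation: grow D_i by `inc` documents
    per iteration, retain significant terms, and stop once thd_i < eps_i
    (only the first saturation condition; the steady-state check of
    condition (ii) needs further iterations)."""
    dataset, prev = [], None
    for start in range(0, len(collection), inc):
        dataset = collection[:start + inc]      # D_i = D_{i-1} + inc documents
        scores = extract_terms(dataset)         # B_i: term -> significance score
        eps = cut_off(scores)                   # cut-off threshold eps_i
        retained = {t: s for t, s in scores.items() if s >= eps}   # T_i
        if prev is not None and distance(retained, prev) < eps:
            return dataset, retained            # candidate TCSC and T_sat
        prev = retained
    return dataset, prev                        # no saturation detected
```

In a real run, `extract_terms` would be the MPCV-based pipeline and `distance` the thd metric; here they are injected so the control flow alone is visible.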

To evaluate how close we are to Dsat, terminological difference is measured between the sets of significant terms retained from the term candidates extracted from the datasets in successive iterations. At the i-th iteration, terminological difference (thdi) is measured between the sets of retained significant terms Ti and Ti-1.

For generating Ti within the i-th iteration, the following steps are performed:

  1. Extract the set of term candidates, with their significance values, from Di as pairs \(b=<{t}_{i}^{k},{sc}_{i}^{k}> \in {B}_{i}\), where \({B}_{i}\) is the set of terms b extracted from Di, t is a term candidate, and sc is its significance score reflecting the number of occurrences of t in Di; Bi is ordered by decreasing significance scores

  2. Compute the cut-off threshold epsi to retain significant terms

  3. Retain significant terms \(t=<{t}_{i}^{k},{ns}_{i}^{k}>\) from Bi into \({T}_{i}\) and measure \({thd}_{i}=thd({T}_{i},{T}_{i-1})\), where ns is a normalised significance score

For detecting terminological saturation, thdi and epsi values are observed. The process is stopped when thdi reliably goes below epsi (c.f. the chart in Fig. 1) – hence the following two conditions of terminological saturation (c.f. [5]) both hold true:

(i) \(thd\left({T}_{i},{T}_{i-1}\right)<{eps}_{i}\); and

(ii) \(\forall j>i,\; thd\left({T}_{j},{T}_{j-1}\right)<{eps}_{j}\).
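Condition (ii) quantifies over all future iterations, so in a finite run it can only be checked against the iterations observed so far. A minimal Python sketch of such a finite-window check (the function name and input lists are illustrative, not part of TSA's implementation):

```python
def saturated_at(thds, epss):
    """Return the first iteration index i at which thd_i < eps_i holds and
    keeps holding for all later observed iterations, or None if the
    observed run never reliably goes below the threshold."""
    for i in range(len(thds)):
        # check conditions (i) and (ii) over the observed suffix
        if all(t < e for t, e in zip(thds[i:], epss[i:])):
            return i
    return None
```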

Significance scores for b in B are computed using the MPCV method [6], which is an optimised and scalable refinement of the C-Value method [7]. The eps function sets the individual term significance threshold for B based on the sc in B. The rationale behind eps is to regard the sum of the sc in the upper part of B as the simple majority opinion: \(eps={sc}_{i} : {\sum }_{j=1}^{i}{sc}_{j}>\frac{1}{2}{\sum }_{j=1}^{\| B\| }{sc}_{j}\).
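A minimal sketch of this simple-majority cut-off, assuming the significance scores are given as a plain list (the function name is illustrative):

```python
def cut_off_eps(scores):
    """Return the score sc_i at which the cumulative sum of scores,
    taken in decreasing order, first exceeds half of the total sum
    (the 'simple majority' of the significance mass)."""
    ordered = sorted(scores, reverse=True)   # decreasing sc order
    half_total = sum(ordered) / 2.0
    running = 0.0
    for sc in ordered:
        running += sc
        if running > half_total:
            return sc
    return ordered[-1] if ordered else 0.0
```

Terms with scores at or above the returned value form the retained set T; everything below is treated as the insignificant minority.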

The thd function measures the distance between the significance vectors of the sets of retained terms (\(thd:\{<{T}_{i}, {T}_{j}>\} \to {\mathfrak{R}}^{+}\)), which is the Manhattan distance (c.f. [8]) metric in the space of all possible sets of terms:

\(thd\left({T}_{i}, {T}_{j}\right)={\sum }_{k=1}^{\| int\left({T}_{i},{T}_{j}\right)\| }\left|{ns}_{i}^{k}-{ns}_{j}^{k}\right|+{\sum }_{k=1}^{\| dif\left({T}_{i},{T}_{j}\right)\| }{ns}_{i}^{k}+{\sum }_{k=1}^{\| dif\left({T}_{j},{T}_{i}\right)\| }{ns}_{j}^{k}\).
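Representing each retained-term set T as a mapping from terms to their ns values, the distance can be sketched in a few lines of Python (an illustrative rendering, not the actual TSA code). Terms in the intersection contribute the absolute difference of their ns values; terms present in only one set contribute their full ns value, exactly as the two difference sums above:

```python
def thd(t_i, t_j):
    """Manhattan distance between the normalised-significance vectors of
    two retained-term sets (dicts term -> ns). Terms missing from one set
    are treated as having ns = 0 there."""
    terms = set(t_i) | set(t_j)
    return sum(abs(t_i.get(t, 0.0) - t_j.get(t, 0.0)) for t in terms)
```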

3 Related Work and Open Problems

As mentioned in [4], TSA is not only a valid method for extracting terminological core sub-collections of texts and respective representative sets of terms for a particular subject domain. The method could be used to explore the patterns and trends that shape the community sentiments around and interpretations of (the semantics of) this domain. In particular, we envisioned that TSA could become an effective instrument for solving several open research problems. These problems relate to exploiting TSA in relevant use cases (Sect. 3.2 to 3.4). However, TSA needs to be further improved to become more effective in the settings where it currently falls short. These settings are discussed in Sect. 3.1. In this Section, we explore and analyse the most recent State-of-the-Art (SotA) research regarding these two strands. Via this analysis, we outline the corresponding open problems.

3.1 Shortcomings of TSA

TSA works well in the cases of established and mature subject domains. These domains are characterised by a well-shaped terminological consensus among the stakeholders, presented in a representative body of mainstream publications. Furthermore, new terms are contributed over time in a small proportion relative to the stable part of the terminology. As the fraction of these new terms is small, the induced volatility does not affect the saturation trend. Therefore, terminological saturation in such subject domains is quickly reachable and steady, as reported, for example, with regard to our experiments with the Springer Knowledge Management collection of journal articles [1, Ch. 5].

TSA is a hybrid domain-neutral method based on linguistic and statistical processing of data and designed to qualify mainstream terms as more significant. It falls short under several conditions. These conditions and the corresponding reasons for TSA shortcomings are as follows.

Immature, Hence Quickly Evolving, or Niche Domains.

If a subject domain is new, immature, and evolves quickly, its terminology cannot be stable. Therefore, its mainstream interpretation by the knowledge stakeholders is blurred and volatile. This factor complicates assembling a collection of documents that reliably fall within the domain. It is also worth mentioning that immature domains often adopt terms from other, topically neighbouring domains. This terminological re-use, though natural, is also a complication for the probabilistic topic modelling and snowball sampling approach [9] that is used in TSA for collecting relevant source documents. There might be two possible ways to approach this challenge. One could be to use a human-in-the-loop methodology for refining topic modelling [10]. Another, yet complementary, way could be using a deep learning model (c.f. [11]) to gain more accuracy.

Innovative but not yet Frequently Cited Sources.

One more deficiency of TSA is that it sacrifices recently introduced innovative terms in favour of the well-established terms in the domain, thus hampering terminological trend capture and analysis. This happens because the term significance measure in TSA is based on the frequency of term occurrence in the analysed document corpus. A balanced account for the citation-based frequency of use and innovativeness of a term could be a better indicator for assessing its significance. A recently published approach to measure the innovativeness of a term is reported in [12]. Their score is based on “how much the predicted publication year is ahead of or behind the actual publication year, which reflects whether the paper covers more topics researched by papers published in the past or more of its topics are covered by future papers” [12].

Quality of Terms Recognition.

The C-Value terms recognition method that lies at the basis of the term recognition pipeline of TSA in its current implementation is known to be one of the most effective unsupervised hybrid methods in terms of its accuracy (c.f. [13]). However, its average precision on the mix of datasets (0.53) [13] is not sufficient and needs to be improved. One of the possible approaches is to employ a deep learning (DL) model in a refinement phase for the chunk of documents within the iteration of the TSA method. One of the most recent surveys of the use of DL transformer-based approaches for terms recognition is [14]. It could serve as a starting point for selecting a relevant DL model for domain-bound terms recognition.

Table 1. Selected LLMs: NLP/I/U tasks and the volume of the language corpora used for (pre-) training. The information in the table is given based on [16]. The LLMs were selected as “landmark” according to [16].

Prognosis of Terminological Saturation.

One of the shortcomings of TSA, lowering its performance, is the lack of a method for prognosing steady saturation after processing several successive approximation iterations (c.f. Sect. 2). A possible approach to remediate this deficiency might be the analysis of the statistical distributions of the terms extracted in these several iterations. A good reference for relevant distributions is provided by [15].

3.2 Representative Datasets of Minimal Size for NLP/I/U

A current mainstream approach in Natural Language Processing, Interaction, and Understanding (NLP/I/U) is the use of pre-trained Large Language Models (LLMs) (c.f. [16, 17]). LLMs are transformer-based neural models that are built and used following the pre-training approach. In this context, it is worth mentioning that LLMs are the larger successors of the smaller-scale Pre-trained Language Models (PLMs). LLMs demonstrate outstanding performance in several important NLP/I/U tasks (c.f. Table 1). Table 1 also shows the scale (number of parameters), the pre-training dataset(s) and their size(s) for several prominent PLMs and LLMs, and their reported use in NLP/I/U tasks.

One important shortcoming that hinders the development and use of LLMs in research, development, and practice is the amount of resources needed to pre-train an LLM for achieving their SotA performance level. Indeed, an LLM has to be (pre-) trained on a huge language corpus to be as effective as indicated in Table 1. This complication has been recognized as important by the research community and funding bodies, e.g. the European Commission. For example, one of the expected outcomes in the recent Horizon 2020 Call for Proposals under the Advanced Language Technologies theme was developing “Language models, capable of learning from smaller language corpora”.

We also found out that LLMs have not been frequently used, so far, for topic modelling in scientific domains and domain-bounded terminology recognition in textual documents. Rare examples are [27] for scientific topic modelling and [28] for domain-bounded terminology recognition. A possible reason for this scarcity might be the need to fine-tune a basic pre-trained LLM for achieving acceptable quality. A necessary resource for fine-tuning is the representative corpus of documents that covers the domain of focus sufficiently completely.

In subject domain-bounded settings, TSA could serve as an effective instrument for generating the representative document collections of minimal size that could be further labelled and used for LLM training within these subject domains. We outline our results in using TSA for extracting these TCSCs in Sect. 4.1.

3.3 Building Scientific Domain Ontologies and Knowledge Graphs

A scientific knowledge graph (SKG) is an emerging representation for scholarly knowledge in a particular scientific field or domain. SKG complements the traditional way of representing and disseminating this knowledge in the form of scholarly papers by enabling machine processing of this knowledge for semantic search, querying, analysis involving reasoning, and visualisation.

As pointed out, e.g. in [29], a scientific “domain is characterised by its specific terminology and phrasing which is hard to grasp for a non-expert reader”. Hence, extracting conceptual knowledge and building an SKG needs to be based on the human domain-specific professional expertise and specifically tailored extraction procedures. It also implies that an SKG for a scholarly subject domain needs to be built using a well-formed domain ontology that representatively covers the concepts (terminology) in the domain. The recent related work offers the SotA approaches for narrowing this gap.

A topical example of building an ontology for a broad scholarly domain is the extraction of the Computer Science Ontology (CSO) [30] using Klink-2 [31], and later building an SKG for Artificial Intelligence [32] using the CSO Classifier in the pipeline. This approach demonstrated impressive performance in terms of scalability – the dataset for building CSO covered 16M papers. However, it used only metadata in combination with external resources, like DBpedia entries. This allowed only for shallow document analysis with regard to the conceptual/terminological coverage of the domain.

A domain-neutral approach for the extraction of scientific concepts from scholarly papers is proposed in [29]. This approach is based on the analysis of the paper abstracts using a DL pipeline. The pipeline is trained on a document corpus, covering 10 scientific domains, manually annotated by labelling generic scientific concepts. The experiments showed satisfactory transferability of these generic concepts across domains. The solution was reported to achieve “a fairly high F1 score”.

The PLUMBER framework [33] is an integrative approach aiming at bringing “together the research community’s disjoint efforts on KG completion”. PLUMBER collects and dynamically integrates “40 reusable components for various KG completion subtasks” into the pipelines that fit best for SKG completion in a particular domain.

Our approach for building scientific domain ontologies from texts is OntoElect [34]. It uses TSA as the first phase, which ensures sufficient coverage of the domain by the extracted terminology that is further conceptualised into ontological fragments. The solution is sufficiently scalable [1, Ch. 5] to allow processing industrial-size full-text paper collections. Being a modular pipeline (c.f. PLUMBER), it allows including reusable components at different stages. However, such inclusions/replacements need to be done manually, which is a shortcoming compared to PLUMBER, which is automatically configurable.

3.4 Trend Analysis in Non-Stationary Time Series of Publications

Trend discovery and analysis are widely used techniques for analysing the history of, and prognosing the future development in, many fields and disciplines. Most of these methods are based on processing time series data and discovering statistical patterns and turning points in it. Among the vast available body of research papers, a good recent overview of these methods, with pointers to relevant applications, is [35].

In our work, we are focused on using TSA for analysing the trends and time lags in technology transfer to and adoption in industries [1, Ch. 6]. In this setting, the datasets containing relevant scientific paper sub-collections are regarded as non-stationary time series [36] with respect to the terms carried by the papers published at time points. It is also worth noting that the use of neural networks and machine learning has been gaining increasing popularity in non-stationary time series analysis for quite some time already [37]. Hence, it might be reasonable to combine these approaches with DL-based refinement of TSA.

3.5 Event Detection and Prediction

One of the recent reviews of the SotA approaches to event detection and prediction in online social networks, based on text analytics, is [38]; the paper presents the timeline and taxonomy of existing methods and outlines the major open issues in the field. An advanced approach to detect correlated events from individual documents using a graph model of event relationships is proposed in [39]. It also contains a review of the SotA in document-level event detection. One of the recent contributions of a DL-based approach to fine-grained event detection from texts is [40]. This work proposes the BMRMC model. These SotA techniques and models could be put on top of TSA to enable its use for event detection and prediction based on textual information streams.

4 Tested Uses of TSA

In this Section, we present two important application use cases in which TSA has been experimentally proven [1] to be instrumental and effective. The first case covers the situation when a mature and stable body of scientific texts exists in the corresponding subject domain and describes this domain sufficiently fully. Therefore, steady terminological saturation could be expected in the document collection. The second case is for an immature, hence quickly evolving, domain of scientific knowledge. In this case, a representative collection of scholarly publications cannot be easily collected or does not even exist. Hence, the saturation of terminology is not achievable.

4.1 Extracting a TCSC for a Mature Scholarly Domain

TSA has been used in the experiments with several well-established real-world document collections coming from different scientific subject domains with different breadths in the topic coverage [1, Ch. 5]. Please refer to Table 2 for the overview of these collections.

Table 2. The summary of the characteristics of real-world document collections and datasets

The results of TSA on these collections are summarised in Table 3. As a result, the extracted TCSCs in all the domains contain significantly fewer documents while having statistically very similar terminological coverage of the domain. Hence, if these core sub-collections are used as the datasets for training DL models for NLP/I/U tasks, the effort to label these documents manually will be significantly lower. Remarkably, the quality of training could be expected to be the same as if the corresponding entire collections were used. Furthermore, as terminologically redundant documents are eliminated, it might be expected that the trained models will not be over-fitted toward a subset of terms.

The most substantial decrease in the numbers of documents and retained significant terms has been demonstrated with regard to the KM collection. This result is indeed encouraging as it proves that using TSA to extract ontologies and knowledge graphs from full-text domain-bounded document collections of industrial volumes is a feasible and scalable approach.

Table 3. Compactness of TCSCs and saturated bags of retained significant terms (adapted from [1])

4.2 Analysing Trends in an Immature and Evolving Domain

In our industrial use case, TSA has been used [1, Ch. 6] to verify Gartner's prognosis [41] of the Generative Adversarial Network (GAN) technology adoption by the IT industry. This case demonstrates the utility of observing the absence of terminological saturation in the collection of publications for an immature and rapidly evolving domain. In this study, we also tried iterative refinement of the process of collecting relevant publications. This refinement was done by using several of the most terminologically significant papers, among those collected at the previous iteration, as a refined seed for topic modelling and citation network analysis at the subsequent iteration.

Our research questions were: (1) Are the research contributions and respective professional community around the technology mature? and (2) Are there any gaps in the knowledge about the technology between the academic researchers and industrial adopters? To answer these questions, the terminological footprints, over the domain description, of different sub-collections of publications were examined: (i) authored by academics; (ii) authored by people from industry; and (iii) authored collaboratively by academics and industrialists. Figure 1 shows that only the academic part of the GAN community is mature, as it possesses a terminologically saturated body of publications. Hence, the GAN technology is ready to be transferred to industry.

On the other hand, Fig. 2 highlights substantial terminological differences between the sub-collections. Therefore, further collaboration between the academic and industrial parts of the community might contribute to the increase of the maturity in the industrial body of knowledge and practice.

Fig. 1. Terminological saturation measurements for the Academic, Industrial, and Collaborative parts of the GAN collection (adapted from [1]). The vertical scale indicates the value of \(eps\) or \(thd\); the horizontal scale indicates the pair of compared datasets.

Fig. 2. Terminological difference curves for the pairs of collection parts (adapted from [1]). The vertical scale indicates the value of \(eps\) or \(thd\); the horizontal scale indicates the pair of compared datasets.

The combination of factors that has been discovered points out the opportunity window for the successful transfer of the GAN technology to industry, which might increase the competitive advantage of the participating companies.

5 Toward Improving TSA

We plan to improve the quality of our baseline (probabilistic topic modelling and snowball sampling) pipeline for collecting relevant scholarly texts within a subject domain by exploiting a domain-neutral DL-based topic modelling approach (c.f. Sect. 3.3), especially in niche and emerging domains. This will be done following an iterative bootstrapping method with a human in the loop. The baseline of this method was initially tested in our GAN use case (c.f. Sect. 4.2). In the initial iteration, the following steps of the document collection workflow will be performed:

  1. Collect the draft set of documents and extract the TCSC and Tsat using the TSA pipeline [1, Ch. 4]

  2. Form the validated \({TCSC}_{val}\) by having a domain expert manually examine the TCSC and filter out the irrelevant documents

  3. Generate the validated set of relevant significant terms \({T}_{val}^{sat}\) by applying the TSA term extraction pipeline to \({TCSC}_{val}\)

  4. Automatically label the documents of \({TCSC}_{val}\) using the terms from \({T}_{val}^{sat}\)

  5. Use \({TCSC}_{val}\) for training the DL model for topic modelling and relevant documents discovery (DL-TMDD)
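The initial bootstrapping iteration above can be sketched as a single Python function. Since step 2 involves a human domain expert, every step is injected as a callback; all names are illustrative, and this is a workflow skeleton, not the TSA implementation:

```python
def bootstrap_initial_iteration(collect_draft, run_tsa, expert_filter,
                                label, train):
    """Sketch of the initial document-collection bootstrapping iteration.
    run_tsa is assumed to return a (sub-collection, term-set) pair."""
    draft = collect_draft()                 # collect the draft document set
    tcsc, _ = run_tsa(draft)                # step 1: extract the TCSC (and Tsat)
    tcsc_val = expert_filter(tcsc)          # step 2: expert-validated TCSC_val
    _, t_val = run_tsa(tcsc_val)            # step 3: validated terms T_val^sat
    labelled = label(tcsc_val, t_val)       # step 4: auto-label TCSC_val
    return train(labelled), t_val           # step 5: train the DL-TMDD model
```

In subsequent iterations, the `run_tsa` callback in the first step would be swapped for the trained DL-TMDD model, as described next.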

In every subsequent iteration, the TSA pipeline in the first step will be replaced by the use of DL-TMDD. The outlined workflow could be stopped at iteration i if \(thd\left({T}_{val}^{sat}\left[i\right],{T}_{val}^{sat}\left[i-1\right]\right)<{eps}_{i}\).

Another important issue is improving the accuracy of term recognition in the linguistic part of the TSA pipeline. Currently, TSA uses the NLTK API [42] to implement its linguistic part and extract the bag of candidate terms. To improve it, a similar iterative bootstrapping approach could be used for training a DL model to perform a classification task on the input text dataset for discovering term candidates (DL-TD). The initial iteration could use the domain-neutral approach of [29], which builds upon training a DL model, on a manually labelled cross-domain corpus of scholarly articles, to discover generic scientific terms. The subsequent iterations will follow the incremental successive approximation pattern of TSA, using, however, DL-TD as the substitute for NLTK and the C-Value method in the pipeline, as follows for the i-th iteration.

  1. Train DL-TD on the labelled TSA dataset Di-1 of the (i-1)-th iteration

  2. Add the increment of inc documents to Di-1 to form Di

  3. Use DL-TD to discover terms in Di

  4. Validate the discovered additional terms by the inspection of a human domain expert

  5. Automatically label Di using the validated set of retained significant terms

This workflow should continue until either terminological saturation is detected or all the documents in the collection are processed.
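This loop can be sketched analogously to the document-collection workflow, again with illustrative names and injected callbacks, since training, term discovery, and expert validation are all external to the sketch:

```python
def dl_td_loop(collection, inc, train, discover, expert_ok, saturated):
    """Sketch of the DL-TD successive approximation loop: retrain on the
    labelled D_{i-1}, grow the dataset by inc documents, discover and
    expert-validate new terms, and relabel D_i, until saturation is
    detected or the whole collection is processed."""
    labelled, terms = [], set()
    for start in range(0, len(collection), inc):
        model = train(labelled)                     # step 1: train on D_{i-1}
        dataset = collection[:start + inc]          # step 2: D_i = D_{i-1} + inc
        candidates = discover(model, dataset)       # step 3: discover terms
        validated = {t for t in candidates if expert_ok(t)}   # step 4
        prev, terms = terms, terms | validated
        labelled = [(doc, terms) for doc in dataset]          # step 5: label D_i
        if saturated(prev, terms):
            break
    return labelled, terms
```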

As valuable side-effects, the labelled dataset (\({TCSC}_{val}\) and \({T}_{val}^{sat}\)) for training third-party models (e.g. for PLUMBER) in the domain will be developed. This dataset could also be used for the iterative development of the domain ontology using the OntoElect methodology [34].

Yet further improvement of TSA performance could be sought via exploring the ways to predict terminology saturation, after initial iterations. This could be done by examining the collected knowledge about the statistical distributions of the terms in the collection under examination, in particular, to recognize, as significant, innovative [12] or emerging terms in addition to mainstream significant terms. The starting point for this work could be exploiting the distributions surveyed in [15].

Finally, the iterative bootstrapping approach, outlined above, could be made transferable, as the subset of general scientific terms is domain-neutral. This subset of general terms could also be extended in the process of building the saturated sets of terms for different subdomains within a broader domain, or for other neighbouring or overlapping domains.

6 Potential Applications of TSA

Based on the already tested uses of the baseline TSA method (Sect. 4) and its planned refinements (Sect. 5), the following applications of TSA are envisioned as plausible.

Generating Datasets for the Focused Training of LLMs in Scientific Subject Domains.

In Sect. 3.2, the shortcoming of LLMs has been pointed out with regard to the very substantial volume of resources required for achieving the necessary accuracy in performing several signature tasks for these models (c.f. Table 1). Furthermore, as rightfully mentioned in [29], terminology and phrasing are specific to a scientific subject domain, which complicates the use of LLMs within such a domain. These complications raise the need for tailored LLM training for scientific domains in order to achieve acceptable performance. Hence, an efficient and effective method yielding quality, representative training datasets is in high demand, especially if the method ensures that such a dataset has the minimal possible volume. The iterative bootstrapping approach, proposed in Sect. 5, might result in such a method.

Building Domain Ontologies to Support SKG Completion.

A more advanced potential application of the refined TSA could be the use of the discovered domain-bounded saturated sets of retained significant terms as the input for constructing the ontologies describing the respective subject domains. As proposed in [34], TSA is the method underpinning the initial phase of our OntoElect methodology for ontology refinement. In the ontology refinement cycle, the method supplies the Conceptualization phase of OntoElect [43] with \({T}_{val}^{sat}\), whose terms are used as the building blocks for developing ontological fragments around the significant terms that represent concepts. The terms that represent properties are used to enrich these concepts semantically and connect ontological fragments in a semantic network. Using an evolving domain ontology-in-the-loop approach, which could be supported by OntoElect, might enable better performance of frameworks like [31, 33] for semi-automatically completing SKGs. This is especially relevant for immature and evolving domains, or for the domains of broad public interest covered by social media text streams.

Detecting and Predicting Events.

Another promising idea for applying TSA arises from our observations of terminology volatility over time in real document collections [1, Ch. 5]. It was interesting to observe that, in the scientific domains we used for evaluating TSA (Sect. 4.1 and 4.2), there was a certain correlation between the appearance of an influential paper, which was subsequently well cited, and a rise in terminological volatility. The volatility increased due to the appearance of a bunch of new papers contributing new terms. If this observation is generalised to what happens, e.g., in social media texts within respective communities and topics, the following use case scenario could be thought of. Those who followed Twitter on the agricultural exports from Ukraine in the spring of 2023 definitely noticed the rise of negative sentiment coming from Central-Eastern Europe, related to the prices of Ukrainian grain. This sentiment could have been detected if the volatility of the used wordings had been analysed using TSA. This volatility could have been interpreted as an indicator of a potential event of a grain export ban in several involved countries. The event indeed happened after a short time lag. On the contrary, a noticeable decrease in terminology volatility over time, compared to the usual distributions, could indicate a coordinated campaign aimed at artificially forming desired sentiments on a topic. This is a potential indicator of propaganda, which might be useful for selecting messages for origin verification and fact checking.

7 Conclusive Remarks

In this visionary paper, we presented our views and prognoses on how TSA could be refined and further exploited, for the public good, in several important application fields. These include, but are not limited to, tailored training of LLMs in scientific subject domains, instrumenting the completion of SKGs, trend analysis in immature and evolving domains, and detecting and predicting events based on social media stream analysis. For enabling this spectrum of applications, we proposed to design an iterative bootstrapping approach within the refined TSA method, based on the development and use of domain-neutral DL models for topic modelling and term recognition. This vision of the bootstrapping approach involves a human-and-ontology-in-the-loop, as presented in Sect. 5. The envisioned applications of the improved TSA also constitute our plans for future work. We also look forward to trying the refined TSA method as an instrument in our OntoElect ontology refinement methodology and in SKG development frameworks like Klink-2 or PLUMBER.