Introduction

Online rumors are false information spread through online media, which have the characteristics of wide content1, hard to identify2,3. Online rumors can mislead the public, disrupt social order, damage personal and collective reputations, and pose a great challenge to the governance of internet information content. Therefore, in order to effectively detect and govern online rumors, it is necessary to conduct an in-depth semantic analysis and understanding of the rumor text content features.

The research on the content features of online rumors focuses on the lexical, syntactic and semantic features of the rumor text, including lexical, syntactic and semantic features4, syntactic structure and functional features5, source features5,6, rhetorical methods7, narrative structure6,7,8, language style6,9,10, corroborative means10,11 and emotional features10,12,13,14,15,16,17,18. Most of the existing researches on rumor content features are feature mining under a single domain topic type, and lack of mining the influence relationship between multiple features. Therefore, this paper proposes to build an online rumor domain ontology to realize fine-grained hierarchical modeling of the relationship between rumor content features and credible verification of its effectiveness. Domain ontology is a systematic description of the objective existence in a specific discipline19. The construction methods mainly include TOVE method20, skeleton method21, IDEF-5 method22,23, methontology method24,25 and seven-step method26,27, among which seven-step method is the most mature and widely used method at present28, which has strong systematicness and applicability29, but it does not provide quantitative indicators and methods about the quality and effect of ontology. The construction technology can be divided into the construction technology based on thesaurus conversion, the construction technology based on existing ontology reuse and the semi-automatic and automatic construction technology based on ontology engineering method30. The construction technology based on thesaurus conversion and the construction technology based on existing ontology reuse can save construction time and cost, and improve ontology reusability and interoperability, but there are often differences in structure, semantics and scene. Semi-automatic and automatic construction technology based on ontology engineering method The application of artificial intelligence technology can automatically extract ontology elements and structures from data sources with high efficiency and low cost, but the quality and accuracy are difficult to guarantee. Traditional domain ontology construction methods lack effective quality evaluation support, and construction technology lacks effective integration application. Therefore, this paper proposes an improved TFI network rumor domain ontology construction method based on the seven-step method. Starting from the terminology layer, the framework layer and the instance layer, it integrates the top-level ontology and core document content feature reuse technology, the bottom-up semi-automatic construction technology based on N-gram new word discovery algorithm and RoBERTa-Kmeans clustering algorithm, defines the fine-grained features of network rumor content and carries out hierarchical modeling. Using SWRL rules and pellet inference machine, the tacit knowledge of ontology is mined, and the quality of ontology validity and consistency is evaluated and verified.

The structure of this paper is as follows: Sect “Related work” introduces the characteristics of rumor content and the related work of domain ontology construction.; Sect “Research method” constructs the term layer, the frame layer and the instance layer of the domain ontology; Sect “Domain ontology construction” mines and verifies the implicit knowledge of the ontology based on SWRL rules and Pellet reasoner; Sect “Ontology reasoning and validation” points out the research limitations and future research directions; Sect “Discussion” summarizes the research content and contribution; Sect “Conclusion” summarizes the research content and contribution of this paper.

Related Work

Content features of online rumors

The content features of online rumors refer to the adaptive description of vocabulary, syntax and semantics in rumor texts. Fu et al.5 have made a linguistic analysis of COVID-19’s online rumors from the perspectives of pragmatics, discourse analysis and syntax, and concluded that the source of information, the specific place and time of the event, the length of the title and statement, and the emotions aroused are the important characteristics to judge the authenticity of the rumors; Zhang et al.6 summarized the narrative theme, narrative characteristics, topic characteristics, language style and source characteristics of new media rumors; Li et al.7 found that rumors have authoritative blessing and fear appeal in headline rhetoric, and they use news and digital headlines extensively, and the topic construction mostly uses programmed fixed structure; Yu et al.8 analyzed and summarized the content distribution, narrative structure, topic scene construction and title characteristics of rumors in detail; Mourao et al.9 found that the language style of rumors is significantly different from that of real texts, and rumors tend to use simpler, more emotional and more radical discourse strategies; Zhou et al.10 analyzed the rumor text based on six analysis categories, such as content type, focus object and corroboration means, and found that the epidemic rumors were mostly “infectious” topics, with narrative expression being the most common, strong fear, and preference for exaggerated and polarized discourse style. Huang et al.11 conducted an empirical study based on WeChat rumors, and found that the “confirmation” means of rumors include data corroboration and specific information, hot events and authoritative release; Butt et al.12 analyzed the psycholinguistic features of rumors, and extracted four features from the rumor data set: LIWC, readability, senticnet and emotions. Zhou et al.13 analyzed the semantic features of fake news content in theme and emotion, and found that the distribution of fake news and real news is different in theme features, and the overall mood, negative mood and anger of fake news are higher; Tan et al.14 divided the content characteristics of rumors into content characteristics with certain emotional tendency and social characteristics that affect credibility; Damstra et al.15 identified the elements as a consistent indicator of intentionally deceptive news content, including negative emotions causing anger or fear, lengthy sensational headlines, using informal language or swearing, etc. Lai et al.16 put forward that emotional rumors can make the rumor audience have similar positive and negative emotions through emotional contagion; Yuan et al.17 found that multimedia evidence form and topic shaping are important means to create rumors, which mostly convey negative emotions of fear and anger, and the provision of information sources is related to the popularity and duration of rumors; Ruan et al.18 analyzed the content types, emotional types and discourse focus of Weibo’s rumor samples, and found that the proportion of social life rumors was the highest, and the emotional types were mainly hostile and fearful, with the focus on the general public and the personnel of the party, government and military institutions.

The forms and contents of online rumors tend to be diversified and complicated. The existing research on the content features of rumors is mostly aimed at the mining of content characteristics under specific topics, which cannot cover various types of rumor topics, and lacks fine-grained hierarchical modeling of the relationship between features and credible verification of their effectiveness.

Domain ontology construction

Domain ontology is a unified definition, standardized organization and visual representation of the concepts of knowledge in a specific domain31,32, and it is an important source of information for knowledge-based systems19,33. Theoretical methods include TOVE method20, skeleton method21, IDEF-5 method22,23, methontology method24,25 and seven-step method26,27. TOVE method transforms informal description into formal ontology, which is suitable for fields that need accurate knowledge, but it is complex and time-consuming, requires high-level domain knowledge and is not easy to expand and maintain. Skeleton method forms an ontology skeleton by defining the concepts and relationships of goals, activities, resources, organizations and environment, which can be adjusted according to needs and is suitable for fields that need multi-perspective and multi-level knowledge, but it lacks formal semantics and reasoning ability. Based on this method, Ran et al.34 constructed the ontology of idioms and allusions. IDEF5 method uses chart language and detailed description language to construct ontology, formalizes and visualizes objective knowledge, and is suitable for fields that need multi-source data and multi-participation, but it lacks a unified ontology representation language. Based on this method, Li et al.35 constructed the business process activity ontology of military equipment maintenance support, and Song et al.36 established the air defense and anti-missile operation process ontology. Methontology is a method close to software engineering. It systematically develops ontologies through the processes of specification, knowledge acquisition, conceptualization, integration, implementation, evaluation and document arrangement, which is suitable for fields that need multi-technology and multi-ontology integration, but it is too complicated and tedious, and requires a lot of resources and time37. Based on this method, Yang et al.38 completed the ontology of emergency plan, Duan et al.39 established the ontology of high-resolution images of rural residents, and Chen et al.40 constructed the corpus ontology of Jiangui. Seven-step method is the most mature and widely used method at present28. It is systematic and applicable to construct ontology by determining its purpose, scope, terms, structure, attributes, limitations and examples29, but it does not provide quantitative indicators and methods about the quality and effect of ontology. Based on this method, Zhu et al.41 constructed the disease ontology of asthma, Li et al.42 constructed the ontology of military events, the ontology of weapons and equipment and the ontology model of battlefield environment, and Zhang et al.43 constructed the ontology of stroke nursing field, and verified the construction results by expert consultation.

Domain ontology construction technology includes thesaurus conversion, existing ontology reuse and semi-automatic and automatic construction technology based on ontology engineering method30. The construction technology based on thesaurus transformation takes the existing thesaurus as the knowledge source, and transforms the concepts, terms and relationships in the thesaurus into the entities and relationships of domain ontology through certain rules and methods, which saves the time and cost of ontology construction and improves the quality and reusability of ontology. However, it is necessary to solve the structural and semantic differences between thesaurus and ontology and adjust and optimize them according to the characteristics of different fields and application scenarios. Wu et al.44 constructed the ontology of the natural gas market according to the thesaurus of the natural gas market and the mapping of subject words to ontology, and Li et al.45 constructed the ontology of the medical field according to the Chinese medical thesaurus. The construction technology based on existing ontology reuse uses existing ontologies or knowledge resources to generate new domain ontologies through modification, expansion, merger and mapping, which saves time and cost and improves the consistency and interoperability of ontologies, but it also needs to solve semantic differences and conflicts between ontologies. Chen et al.46 reuse the top-level framework of scientific evidence source information ontology (SEPIO) and traditional Chinese medicine language system (TCMLS) to construct the ontology of clinical trials of traditional Chinese medicine, and Xiao et al.47 construct the domain ontology of COVID-19 by extracting the existing ontology and the knowledge related to COVID-19 in the diagnosis and treatment guide. Semi-automatic and automatic construction technology based on ontology engineering method semi-automatically or automatically extracts the elements and structures of ontology from data sources by using natural language processing, machine learning and other technologies to realize large-scale, fast and low-cost domain ontology construction48, but there are technical difficulties, the quality and accuracy of knowledge extraction can not be well guaranteed, and the quality and consistency of different knowledge sources need to be considered. Suet al.48 used regular templates and clustering algorithm to construct the ontology of port machinery, Zheng et al.49 realized the automatic construction of mobile phone ontology through LDA and other models, Dong et al.50 realized the automatic construction of ontology for human–machine ternary data fusion in manufacturing field, Linli et al.51 proposed an ontology learning algorithm based on hypergraph, and Zhai et al.52 learned from it through part-of-speech tagging, dependency syntax analysis and pattern matching.

At present, domain ontology construction methods are not easy to expand, lack of effective quality evaluation support, lack of effective integration and application of construction technology, construction divorced from reality can not guide subsequent practice, subjective ontology verification and so on. Aiming at the problems existing in the research of content characteristics and domain ontology construction of online rumors, this paper proposes an improved TFI network rumor domain ontology construction method based on seven-step method, which combines top-down existing ontology reuse technology with bottom-up semi-automatic construction technology, and establishes rumor domain ontology based on top-level ontology reuse, core document content feature extraction and new concept discovery in the real corpus from the terminology layer, framework layer and instance layer. Using Protégé as a visualization tool, the implicit knowledge mining of ontology is carried out by constructing SWRL rules to verify the semantic parsing ability and consistency of domain ontology.

Research method

This paper proposes a TFI online rumor domain ontology construction method based on the improvement of the seven-step method, which includes the term layer, the frame layer and the instance layer construction.

Term layer construction

Determine the domain and scope: the purpose of constructing the rumor domain ontology is to support the credible detection and governance of online rumors, and the domain and scope of the ontology are determined by answering questions.

Three-dimensional term set construction: investigate the top-level ontology and related core literature, complete the mapping of reusable top-level ontology and rumor content feature concept extraction semi-automatically from top to bottom; establish authoritative real rumor datasets, and complete the domain new concept discovery automatically from bottom to top; based on this, determine the term set of the domain ontology.

Frame layer construction

Define core classes and hierarchical relationships: combine the concepts of the three-dimensional rumor term set, based on the data distribution of the rumor dataset, define the parent class, summarize the subclasses, design hierarchical relationships and explain the content of each class.

Define core properties and facets of properties: in order to achieve deep semantic parsing of rumor text contents, define object properties, data properties and property facets for each category in the ontology.

Instance layer construction

Create instances: analyze the real rumor dataset, extract instance data, and add them to the corresponding concepts in the ontology.

Encode and visualize ontology: use OWL language to encode ontology, and use Protégé to visualize ontology, so that ontology can be understood and operated by computer.

Ontology verification: use SWRL rules and pellet reasoner to mine implicit knowledge of ontology, and verify its semantic parsing ability and consistency.

Ethical statements

This article does not contain any studies with human participants performed by any of the authors.

Domain ontology construction

Term layer construction

Determine the professional domain and scope of the ontology description

This paper determines the domain and scope of the online rumor domain ontology by answering the following four questions:

  • (1) What is the domain covered by the ontology?

The “Rumor Domain Ontology” constructed in this paper only considers content features, not user features and propagation features; the data covers six rumor types of politics and military, disease prevention and treatment, social life, science and technology, nutrition and health, and others involved in China’s mainstream internet rumor-refuting websites.

  • (2) What is the purpose of the ontology?

To perform fine-grained hierarchical modeling of the relationships among the features of multi-domain online rumor contents, realize semantic parsing and credibility reasoning verification of rumor texts, and guide fine-grained rumor detection and governance. It can also be used as a guiding framework and constraint condition for online rumor knowledge graph construction.

  • (3) What kind of questions should the information in the ontology provide answers for?

To provide answers for questions such as the fine-grained rumor types of rumor instances, the valid features of rumor types, etc.

  • (4) Who will use the ontology in the future?

Users of online rumor detection and governance, users of online rumor knowledge graphs construction.

Three-dimensional term set construction

Domain concepts reused by top-level ontology

As a mature and authoritative common ontology, top-level ontology can be shared and reused in a large range, providing reference and support for the construction of domain ontology. The domain ontology of online rumors established in this paper focuses on the content characteristics, mainly including the content theme, events and emotions of rumor texts. By reusing the terminology concepts in the existing top-level ontology, the terminology in the terminology set can be unified and standardized. At the same time, the top-level concept and its subclass structure can guide the framework construction of domain ontology and reduce the difficulty and cost of ontology construction. Reusable top-level ontologies include: SUMO, senticnet and ERE after screening.

SUMO ontology: a public upper-level knowledge ontology containing some general concepts and relations for describing knowledge in different domains. The partial reusable SUMO top-level concepts and subclasses selected in this paper are shown in Table 1, which provides support for the sub-concept design of text topics in rumor domain ontology.

Table 1 Explanation of some reusable SUMO top-level concepts and examples of subclasses.

Senticnet: a knowledge base for concept-based sentiment analysis, which contains semantic, emotional, and polarity information related to natural language concepts. The partial reusable SenticNet top-level concepts and subclasses selected in this paper are shown in Table 2, which provides support for the sub-concept design of text topics in rumor domain ontology.

Table 2 Explanation of some reusable SenticNet top-level concepts and examples of subclasses.

Entities, relations, and events (ERE): a knowledge base of events and entity relations. The partial reusable ERE top-level concepts and subclasses selected in this paper are shown in Table 3, which provides support for the sub-concept design of text elements in the rumor domain ontology.

Table 3 Explanation of some reusable ERE top-level concepts and examples of subclasses.
Extracting domain concepts based on core literature content features

Domain core literature is an important source for extracting feature concepts. This paper uses ‘rumor detection’ as the search term to retrieve 274 WOS papers and 257 CNKI papers from the WOS and CNKI core literature databases. The content features of rumor texts involved in the literature samples are extracted, the repetition content features are eliminated, the core content features are screened, and the canonical naming of synonymous concepts from different literatures yields the domain concepts as shown in Table 4. Among them, text theme, text element, text style, text feature and text rhetoric are classified as text features; emotional category, emotional appeal and rumor motive are classified as emotional characteristics; source credibility, evidence credibility and testimony method are classified as information credibility characteristics; social context is implicit.

Table 4 Part of rumor domain concepts based on literature content features extraction.
Extracting domain concepts based on new concept discovery

This paper builds a general rumor dataset based on China’s mainstream rumor-refuting websites as data sources, and proposes a domain new concept discovery algorithm to discover domain new words in the dataset, add them to the word segmentation dictionary to improve the accuracy of word segmentation, and cluster them according to rumor type, resulting in a concept subclass dictionary based on the real rumor dataset, which provided realistic basis and data support for the conceptual design of each subclass in domain ontology.

Building a general rumor dataset

The rumor dataset constructed in this paper contains 12,472 texts, with 6236 rumors and 6236 non-rumors; the data sources are China’s mainstream internet rumor-refuting websites: 1032 from the internet rumor exposure platform of China internet joint rumor-refuting platform, 270 from today’s rumor-refuting of China internet joint rumor-refuting platform, 1852 from Tencent news Jiaozhen platform, 1744 from Baidu rumor-refuting platform, 7036 from science rumor-refuting platform, and 538 from Weibo community management center. This paper invited eight researchers to annotate the labels (rumor, non-rumor), categories (politics and military, disease prevention and treatment, social life, science and technology, nutrition and health, others) of the rumor dataset. Because data annotation is artificial and subjective, in order to ensure the effectiveness and consistency of annotation, before inviting researchers to annotate, this paper formulates annotation standards, including the screening method, trigger words and sentence break identification of rumor information and corresponding rumor information, and clearly explains and exemplifies the screening method and trigger words of rumor categories, so as to reduce the understanding differences among researchers; in view of this standard, researchers are trained in labeling to familiarize them with labeling specifications, so as to improve their labeling ability and efficiency. The method of multi-person cross-labeling is adopted when labeling, and each piece of data is independently labeled by at least two researchers. In case of conflicting labeling results, the labeling results are jointly decided by the data annotators to increase the reliability and accuracy of labeling. After labeling, multi-person cross-validation method is used to evaluate the labeling results. Each piece of data is independently verified by at least two researchers who did not participate in labeling, and conflicting labeling results are jointly decided by at least five researchers to ensure the consistency of evaluation results. Examples of the results are shown in Table 5.

Table 5 Examples of the rumor dataset annotation.
N-gram word granularity rumor text new word discovery algorithm

Existing neologism discovery algorithms are mostly based on the granularity of Chinese characters, and the time complexity of long word discovery is high and the accuracy rate is low. The algorithm’s usefulness is low, and the newly discovered words are mostly already found in general domain dictionaries. To solve these problems, this paper proposes an online rumor new word discovery algorithm based on N-gram word granularity, as shown in Fig. 1.

Figure 1
figure 1

Flowchart of domain new word discovery algorithm.

First, obtain the corpus to be processed \({\varvec{c}}=\{{{\varvec{s}}}_{1},{{\varvec{s}}}_{2},...,{{\varvec{s}}}_{{{\varvec{n}}}_{{\varvec{c}}}}\}\) , and perform the first preprocessing on the corpus to be processed, which includes: sentence segmentation, Chinese word segmentation and punctuation removal for the corpus to be processed. Obtain the first corpus \({{\varvec{c}}}^{{\varvec{p}}}=\{{{\varvec{s}}}_{1}^{{\varvec{p}}},{{\varvec{s}}}_{2}^{{\varvec{p}}},...,{{\varvec{s}}}_{{{\varvec{n}}}_{{\varvec{c}}}}^{{\varvec{p}}}\}\) ; where \({s}_{i}\) represents the \(i\)-th sentence in the corpus to be processed, \({n}_{c}\) represents the number of sentences in the corpus to be processed, and \({s}_{i}^{p}\) is the i-th sentence in the first corpus; perform N-gram operation on each sentence in the first corpus separately, and obtain multiple candidate words \(n=2\sim 5\); count the word frequency of each candidate word in the first corpus, and remove the candidate words with word frequency less than the first threshold, and obtain the first class of candidate word set;calculate the cohesion of each candidate word in the first class of candidate word set according to the following formula:

$${\varvec{min}}\left\{ {\frac{{{\varvec{P}}\left( {{\varvec{g}}_{1} {\varvec{g}}_{2} {\varvec{g}}_{3} {\varvec{g}}_{4} } \right)}}{{{\varvec{P}}\left( {{\varvec{g}}_{1} } \right){\varvec{P}}\left( {{\varvec{g}}_{2} {\varvec{g}}_{3} {\varvec{g}}_{4} } \right)}},\frac{{{\varvec{P}}\left( {{\varvec{g}}_{1} {\varvec{g}}_{2} {\varvec{g}}_{3} {\varvec{g}}_{4} } \right)}}{{{\varvec{P}}\left( {{\varvec{g}}_{1} {\varvec{g}}_{2} } \right){\varvec{P}}\left( {{\varvec{g}}_{3} {\varvec{g}}_{4} } \right)}},\frac{{{\varvec{P}}\left( {{\varvec{g}}_{1} {\varvec{g}}_{2} {\varvec{g}}_{3} {\varvec{g}}_{4} } \right)}}{{{\varvec{P}}\left( {{\varvec{g}}_{1} {\varvec{g}}_{2} {\varvec{g}}_{3} } \right){\varvec{P}}\left( {{\varvec{g}}_{4} } \right)}}} \right\}$$
(1)

In the formula, \(P(\cdot )\) represents word frequency.Then filter according to the second threshold corresponding to N-gram operation, and obtain the second class of candidate word set; after loading the new words in the second class of candidate word set into LTP dictionary, perform the second preprocessing on the corpus to be processed \({\varvec{c}}=\{{{\varvec{s}}}_{1},{{\varvec{s}}}_{2},...,{{\varvec{s}}}_{{{\varvec{n}}}_{{\varvec{c}}}}\}\); and obtain the second corpus \({{\varvec{c}}}^{{\varvec{p}}\boldsymbol{^{\prime}}}=\{{{\varvec{s}}}_{1}^{{\varvec{p}}\boldsymbol{^{\prime}}},{{\varvec{s}}}_{2}^{{\varvec{p}}\boldsymbol{^{\prime}}},...,{{\varvec{s}}}_{{{\varvec{n}}}_{{\varvec{c}}}}^{{\varvec{p}}\boldsymbol{^{\prime}}}\}\); where the second preprocessing includes: sentence segmentation, Chinese word segmentation and stop word removal for the corpus to be processed; after obtaining the vector representation of each word in the second corpus, determine the vector representation of each new word in the second class of candidate word set; according to the vector representation of each new word, use K-means algorithm for clustering; according to the clustering results and preset classification rules, classify each new word to the corresponding domain. The examples of new words discovered are shown in Table 6:

Table 6 Examples of domain neologism discovery.
RoBERTa-Kmeans rumor text concepts extraction algorithm

After adding the new words obtained by the new word discovery to the LTP dictionary, the accuracy of LTP word segmentation is improved. The five types of rumor texts established in this paper are segmented by using the new LTP dictionary, and the word vectors are obtained by inputting them into the RoBERTa word embedding layer after removing the stop words. The word vectors are clustered by k-means according to rumor type to obtain the concept subclass dictionary. The main process is as follows:

  • (1) Word embedding layer

The RoBERTa model uses Transformer-Encode for computation, and each module contains multi-head attention mechanism, residual connection and layer normalization, feed-forward neural network. The word vectors are obtained by representing the rumor texts after accurate word segmentation through one-hot encoding, and the position encoding represents the relative or absolute position of the word in the sequence. The word embedding vectors generated by superimposing the two are used as input X. The multi-head attention mechanism uses multiple independent Attention modules to perform parallel operations on the input information, as shown in formula (2):

$${\varvec{Attention}}\left( {{\varvec{Q}},{\varvec{K}},{\varvec{V}}} \right) = {\varvec{Softmax}}\left( {\frac{{{\varvec{QK}}^{{\varvec{T}}} }}{{\sqrt {{\varvec{d}}_{{\varvec{k}}} } }}} \right){\varvec{V}}$$
(2)

where \(\left\{{\varvec{Q}},{\varvec{K}},{\varvec{V}}\right\}\) is the input matrix, \({{\varvec{d}}}_{{\varvec{k}}}\) is the dimension of the input matrix. After calculation, the hidden vectors obtained after computation are residual concatenated with layer normalization, and then calculated by two fully connected layers of feed-forward neural network for input, as shown in formula (3):

$${\varvec{W}}_{{\varvec{e}}} = {\varvec{max}}\left( {0,{\varvec{XW}}_{0} + {\varvec{b}}_{0} } \right){\varvec{W}}_{0} \user2{^{\prime}} + \user2{b^{\prime}}$$
(3)

where \(\left\{{{\varvec{W}}}_{{\varvec{e}}},{{\varvec{W}}}_{0}\boldsymbol{^{\prime}}\right\}\) are the weight matrices of two connected layers, \(\left\{{{\varvec{b}}}_{{\varvec{e}}},{{\varvec{b}}}_{0}\boldsymbol{^{\prime}}\right\}\) are the bias terms of two connected layers.

After calculation, a bidirectional association between word embedding vectors is established, which enables the model to learn the semantic features contained in each word embedding vector in different contexts. Through fine-tuning, the learned knowledge is transferred to the downstream clustering task.

  • (2) K-means clustering

Randomly select k initial points to obtain k classes, and iterate until the loss function of the clustering result is minimized. The loss function can be defined as the sum of squared errors of each sample point from its cluster center point, as shown in formula (4).

$${\varvec{L}}\left( {{\varvec{\alpha}},{\varvec{\mu}}} \right) = \mathop \sum \limits_{{{\varvec{i}} = 1}}^{{\varvec{N}}} \parallel{\varvec{x}}_{{\varvec{i}}} - {\varvec{u}}_{{{\varvec{a}}_{{\varvec{i}}} }}\parallel^{2}$$
(4)

where \({x}_{i}\) represents the \(i\) sample, \({a}_{i}\) is the cluster that \({x}_{i}\) belongs to, \({u}_{{a}_{i}}\) represents the corresponding center point, \(N\) is the total number of samples.

After RoBERTa-kmeans calculation, the concept subclasses obtained are manually screened, merged repetition items, deleted invalid items, and finally obtained 79 rumor concept subclasses, including 14 politics and military subclasses, 23 disease prevention and treatment subclasses, 15 social life subclasses, 13 science and technology subclasses, and 14 nutrition and health subclasses. Some statistics are shown in Table 7.

Table 7 Rumor Theme Partial Concept Subcategory Statistics.

Each concept subclass is obtained by clustering several topic words. For example, the topic words that constitute the subclasses of body part, epidemic prevention and control, chemical drugs, etc. under the disease prevention and treatment topic are shown in Table 8.

  • (3) Determining the terminology set

Table 8 Statistics of some conceptual subcategories and their subject terms under the theme of disease prevention and treatment.

This paper constructs a three-dimensional rumor domain ontology terminology set based on the above three methods, and unifies the naming of the terms. Some of the terms are shown in Table 9.

Table 9 Examples of three-dimensional rumor domain ontology terminology set.

Framework layer construction

Define core classes and hierarchy

Define parent classes

This paper aims at fine-grained hierarchical modeling of the relationship between the content characteristics of multi-domain network rumors. Therefore, the top-level parent class needs to include the rumor category and the main content characteristics of a sub-category rumor design. The main content characteristics are the clustering results of domain concepts extracted based on the content characteristics of core documents, that is, rumor text feature, rumor emotional characteristic, rumor credibility and social context. The specific contents of the five top parent classes are as follows:

Rumor type: the specific classification of rumors under different subject categories; Rumor text feature, the common features of rumor texts in terms of theme, style, rhetoric, etc. Rumor emotional characteristic: the emotional elements of rumor texts, the Rumor motive of the publisher, and the emotional changes they hope to trigger in the receiver. Rumor credibility: the authority of the information source, the credibility of the evidence material provided by the publisher, and the effectiveness of the testimony method. Social context: the relevant issues and events in the society when the rumor is published.

Induce subclasses and design hierarchical relationships

In this paper, under the top-level parent class, according to the top-level concepts of top-level ontologies such as SUMO, senticnet and ERE and their subclass structures, and the rumor text features of each category extracted from the real rumor text dataset, we summarize its 88 subclasses and design the hierarchical relationships, as shown in Fig. 2, which include:

  • (1) Rumor text feature

Figure 2
figure 2

Diagram of the core classes and hierarchy of the rumor domain ontology.

①Text theme6,8,13,18,53: the theme or topic that the rumor text content involves. Based on the self-built rumor dataset, it is divided into politics and military54, involving information such as political figures, political policies, political relations, political activities, military actions, military events, strategic objectives, politics and military reviews, etc.; nutrition and health55, involving information such as the relationship between human health and nutrition, the nutritional components and value of food, the plan and advice for healthy eating, health problems and habits, etc.; disease prevention and treatment10, involving information such as the definition of disease, vaccine, treatment, prevention, data, etc.; social life56, involving information such as social issues, social environment, social values, cultural activities, social media, education system, etc.; science and technology57, involving information such as scientific research, scientific discovery, technological innovation, technological application, technological enterprise, etc.; other categories.

②Text element15: the structured information of the rumor text contents. It is divided into character, political character, public character, etc.; geographical position, city, region, area, etc.; event, historical event, current event, crisis event, policy event, etc.; action, protection, prevention and control, exercise, fighting, crime, eating, breeding, health preservation, rest, exercise, education, sports, social, cultural, ideological, business, economic, transportation, etc.; material, food, products (food, medicine, health products, cosmetics, etc.) and the materials they contain and their relationship with human health. effect, nutrition, health, harm, natural disaster, man-made disaster, guarantee, prevention, treatment, etc.; institution, government, enterprise, school, hospital, army, police, social group, etc.; nature, weather, astronomy, environment, agriculture, disease, etc.

③Text style7,10: the discourse style of the rumor text contents, preferring exaggerated and emotional expression. It is divided into gossip style, creating conflict or entertainment effect; curious style, satisfying people’s curiosity and stimulation; critical style, using receivers’ stereotypes or preconceptions; lyrical style, creating resonance and influencing emotion; didactic style influencing receivers’ thought and behavior from an authoritative perspective; plain style concise objective arousing resonance etc.

④Text feature7,58: special language means in the rumor text contents that can increase the transmission and influence of the rumor. It is divided into extensive punctuation reminding or attracting receivers’ attention; many mood words enhancing emotional color and persuasiveness; many emoji conveying attitude; induce forwarding using @ symbol etc. to induce receivers to forward etc.

⑤Text rhetoric15: common rhetorical devices in rumor contents. It is divided into metaphor hyperbole repetition personification etc.

  • (2) Rumor emotional characteristic

①Emotion category17,59,60: the emotional tendency and intensity expressed in the rumor texts. It is divided into positive emotion happy praise etc.; negative emotion fear10 anger sadness anxiety61 dissatisfaction depression etc.; neutral emotion no preference plain objective etc.

②Emotional appeal16,62,63: the online rumor disseminator hopes that the rumor they disseminate can trigger some emotional changes in the receiver. It is divided into “joy” happy pleasant satisfied emotions that prompt receivers to spread or believe some rumors that are conducive to social harmony; “love” love appreciation admiration emotions that prompt receivers to spread or believe some rumors that are conducive to some people or group interests; “anger” angry annoyed dissatisfied emotions that prompt receivers to spread or believe some rumors that are anti-social or intensify conflicts; “fear” fearful afraid nervous emotions that prompt receivers to spread or believe some rumors that have bad effects deliberately exaggerated; “repugnance” disgusted nauseous emotions that prompt receivers to spread or believe some rumors that are detrimental to social harmony; “surprise” surprised shocked amazed emotions that prompt receivers to spread or believe some rumors that deliberately attract traffic exaggerated fabricated etc.

③Rumor motive17,64,65,66: the purpose and need of the rumor publisher to publish rumors and the receiver to forward rumors. Such as profit-driven seeking fame and fortune deceiving receivers; emotional catharsis relieving dissatisfaction emotions by venting; creating panic creating social unrest and riots disrupting social order; entertainment fooling receivers seeking stimulation; information verification digging out the truth of events etc.

  • (3) Rumor credibility

①source credibility7,17: the degree of trustworthiness that the information source has. Such as official institutions and authoritative experts and scholars in the field with high credibility; well-known encyclopedias and large-scale civil organizations with medium credibility; small-scale civil organizations and personal hearsay personal experience with low credibility etc.

②evidence credibility61: the credibility of the information proof material provided by the publisher. Data support such as scientific basis based on scientific theory or method; related feature with definite research or investigation result in data support; temporal background with clear time place character event and other elements which related to the information content; the common sense of life in line with the facts and scientific common sense that are widely recognized.

③testimony method10,11,17: the method to support or refute a certain point of view. Such as multimedia material expressing or fabricating content details through pictures videos audio; authority endorsement policy documents research papers etc. of authorized institutions or persons; social identity identity of social relation groups.

  • (4) Social context

①social issue67: some bad phenomena or difficulties in society such as poverty pollution corruption crime government credibility decline68 etc.

②public attention63: events or topics that arouse widespread attention or discussion in the society such as sports events technological innovation food safety religious beliefs Myanmar fraud nuclear wastewater discharge etc.

③emergency(public sentiment)69: some major or urgent events that suddenly occur in society such as earthquake flood public safety malignant infectious disease outbreaks etc.

  • (5) Rumor type

①Political and military rumor:

Political image rumor: rumors related to images closely connected to politics and military, such as countries, political figures, institutions, symbols, etc. These include positive political image smear rumor, negative political image whitewash rumor, political image fabrication and distortion rumor, etc.

Political event rumor: rumors about military and political events, such as international relations, security cooperation, military strategy, judicial trial, etc. These include positive political event smear rumor, negative political event whitewash rumor, political event fabrication and distortion rumor, etc.

②Nutrition and health rumor:

Food product rumor: rumors related to food, products (food, medicine, health products, cosmetics, etc.), the materials they contain and their association with human health. These include positive effect of food product rumor, negative effect of food product rumor, food product knowledge rumor, etc.

Living habit rumor: rumors related to habitual actions in life and their association with human health. These include positive effect of living habit rumor, negative effect of living habit rumor, living habit knowledge rumor, etc.

③Disease prevention and treatment rumor:

Disease management rumor: rumors related to disease management and control methods that maintain and promote individual and group health. These include positive prevention and treatment rumor, negative aggravating disease rumor, disease management knowledge rumor, etc.

Disease confirmed transmission rumor: rumors about the confirmation, transmission, and immunity of epidemic diseases at the social level in terms of causes, processes, results, etc. These include local confirmed cases rumor, celebrity confirmed cases rumor, transmission mechanism rumor, etc.

Disease notification and advice rumor: rumors that fabricate or distort the statements of authorized institutions or experts in the field, and provide false policies or suggestions related to diseases. These include institutional notification rumor, expert advice rumor, etc.

④Social life rumor:

Public figure public opinion rumor: rumors related to public figures’ opinions, actions, private lives, etc. These include positive public figure smear rumor, negative public figure whitewash rumor, public figure life exposure rumor, etc.

Social life event rumor: rumors related to events, actions, and impacts on people's social life. These include positive event sharing rumor, negative event exposure rumor, neutral event knowledge rumor, etc.

Disaster occurrence rumor: rumors related to natural disasters or man-made disasters and their subsequent developments. These include natural disaster occurrence rumor, man-made disaster occurrence rumor, etc.

⑤Science and technology rumor:

Scientific knowledge rumor: rumors related to natural science or social science theories and knowledge. These include scientific theory rumor, scientific concept rumor, etc.

Science and technology application rumor: rumors related to the research and development and practical application of science and technology and related products. These include scientific and technological product rumor, scientific and technological information rumor, etc.

⑥Other rumor: rumors that do not contain elements from the above categories.

Definition of core properties and facets of properties

Properties in the ontology are used to describe the relationships between entities or the characteristics of entities. Object properties are relationships that connect two entities, describing the interactions between entities; data properties represent the characteristics of entities, usually in the form of some data type. Based on the self-built rumor dataset, this paper designs object properties, data properties and facets of properties for the parent classes and subclasses of the rumor domain ontology.

Object properties

A partial set of object properties is shown in Table 10.

Table 10 Partial set of the rumor domain ontology object properties.
Data attributes

The partial data attribute set is shown in Table 11.

Table 11 Partial set of the rumor domain ontology data attributes.

Instance layer construction

Creating instances

Based on the defined core classes and properties, this paper creates instances according to the real rumor dataset. An example is shown in Table 12.

Table 12 Examples of the rumor domain ontology.

This paper selects the online rumor that “Lin Chi-ling was abused by her husband Kuroki Meisa, the tears of betrayal, the shadow of gambling, all shrouded her head. Even if she tried to divorce, she could not get a solution…..” as an example, and draws a structure diagram of the rumor domain ontology instance, as shown in Fig. 3. This instance shows the seven major text features of the rumor text: text theme, text element, text style, emotion category, emotional appeal, rumor motivation, and rumor credibility, as well as the related subclass instances, laying a foundation for building a multi-source rumor domain knowledge graph.

Figure 3
figure 3

Schematic example of the rumor domain ontology.

Encoding ontology and visualization

Encoding ontology

This paper uses OWL language to encode the rumor domain ontology, to accurately describe the entities, concepts and their relationships, and to facilitate knowledge reasoning and semantic understanding. Classes in the rumor domain ontology are represented by the class “Class” in OWL and the hierarchical relationship is represented by subclassof. For example, in the creation of the rumor emotional characteristic class and its subclasses, the OWL code is shown in Fig. 4:

Figure 4
figure 4

Partial OWL codes of the rumor domain ontology.

The ontology is formalized and stored as a code file using the above OWL language, providing support for reasoning.

Ontology visualization

This paper uses protégé5.5 to visualize the rumor domain ontology, showing the hierarchical structure and relationship of the ontology parent class and its subclasses. Due to space limitations, this paper only shows the ontology parent class “RumorEmotionalFeatures” and its subclasses, as shown in Fig. 5.

Figure 5
figure 5

Ontology parent class “RumorEmotionalFeatures” and its subclasses.

Ontology reasoning and validation

SWRL reasoning rule construction

SWRL reasoning rule is an ontology-based rule language that can be used to define Horn-like rules to enhance the reasoning and expressive ability of the ontology. This paper uses SWRL reasoning rules to deal with the conflict relationships between classes and between classes and instances in the rumor domain ontology, and uses pellet reasoner to deeply mine the implicit semantic relationships between classes and instances, to verify the semantic parsing ability and consistency of the rumor domain ontology.

This paper summarizes the object property features of various types of online rumors based on the self-built rumor dataset, maps the real rumor texts with the rumor domain ontology, constructs typical SWRL reasoning rules for judging 32 typical rumor types, as shown in Table 13, and imports them into the protégé rule library, as shown in Fig. 6. In which x, n, e, z, i, t, v, l, etc. are instances of rumor types, text theme, emotion category, effect, institution, event, action, geographical position, etc. in the ontology. HasTheme, HasEmotion, HasElement, HasSource, HasMood and HasSupport are object property relationships. Polarity value is a data property relationship.

Table 13 Partial SWRL rules for the rumor domain ontology.
Figure 6
figure 6

Partial SWRL rules for the rumor domain ontology.

Implicit knowledge mining and verification based on pellet reasoner

This paper extracts corresponding instances from the rumor dataset, imports the rumor domain ontology and SWRL rule description into the pellet reasoner in the protégé software, performs implicit knowledge mining of the rumor domain ontology, judges the rumor type of the instance, and verifies the semantic parsing ability and consistency of the ontology.

Positive prevention and treatment of disease rumors are mainly based on the theme of disease prevention and treatment, usually containing products to be sold (including drugs, vaccines, equipment, etc.) and effect of disease names, claiming to have positive effects (such as prevention, cure, relief, etc.) on certain diseases or symptoms, causing positive emotions such as surprise and happiness among patients and their families, thereby achieving the purpose of selling products. The text features and emotional features of this kind of rumors are relatively clear, so this paper takes the rumor text “Hong Kong MDX Medical Group released the ‘DCV Cancer Vaccine’, which can prevent more than 12 kinds of cancers, including prostate cancer, breast cancer and lung cancer.” as an example to verify the semantic parsing ability of the rumor domain ontology. The analysis result of this instance is shown in Fig. 7. The text theme is cancer prevention in disease prevention and treatment, the text style is plain narrative style, and the text element includes product-DCV cancer vaccine, positive effect-prevention, disease name-prostate cancer, disease name-breast cancer, disease name-lung cancer; the emotion category of this instance is a positive emotion, emotional appeal is joy, love, surprise; The motive for releasing rumors is profit-driven in selling products, the information source is Hong Kong MDX medical group, and pictures and celebrity endorsements are used as testimony method. This paper uses a pellet reasoner to reason on the parsed instance based on SWRL rules, and mines out the specific rumor type of this instance as positive prevention and treatment of disease rumor. This paper also conducted similar instance analysis and reasoning verification for other types of rumor texts, and the results show that the ontology has high consistency and reliability.

Figure 7
figure 7

Implicit relationship between rumor instance parsing results and pellet reasoner mining.

Comparison and evaluation of ontology performance

In this paper, the constructed ontology is compared with the representative rumor index system in the field. By inviting four experts to make a comprehensive evaluation based on the self-built index system70,71,72, their performance in the indicators of reliability, coverage and operability is evaluated. According to the ranking order given by experts, they are given 1–4 points, and the first place in each indicator item gets four points. The average value given by three experts is taken as the single indicator score of each subject, and the total score of each indicator item is taken as the final score of the subject.

As can be seen from Table 14, the rumor domain ontology constructed in this paper constructs a term set through three ways: reusing the existing ontology, extracting the content features of core documents and discovering new concepts based on real rumor data sets, and the ontology structure has been verified by SWRL rule reasoning of pellet inference machine, which has high reliability; ontology covers six kinds of Chinese online rumors, including the grammatical, semantic, pragmatic and social characteristics of rumor text characteristics, emotional characteristics, rumor credibility and social background, which has a high coverage; ontology is coded by OWL language specification and displayed visually on protege, which is convenient for further expansion and reuse of scholars and has high operability.

Table 14 Comparison and evaluation of ontology performance.

Discussion

The construction method of TFI domain ontology proposed in this paper includes terminology layer, framework layer and instance layer. Compared with the traditional methods, this paper adopts three-dimensional data set construction method in terminology layer construction, investigates top-level ontology and related core documents, and completes the mapping of reusable top-level ontology from top to bottom and the concept extraction of rumor content features in existing literature research. Based on the mainstream internet rumor websites in China, the authoritative real rumor data set is established, and the new word discovery algorithm of N-gram combined with RoBERTa-Kmeans clustering algorithm is used to automatically discover new concepts in the field from bottom to top; determine the terminology set of domain ontology more comprehensively and efficiently. This paper extracts the clustering results of domain concepts based on the content characteristics of core documents in the selection of parent rumors content characteristics in the framework layer construction, that is, rumors text characteristics, rumors emotional characteristics, rumors credibility characteristics and social background characteristics; based on the emotional characteristics and the entity categories of real rumor data sets, the characteristics of rumor categories are defined. Sub-category rumor content features combine the concept of three-dimensional rumor term set and the concept distribution based on real rumor data set, define the sub-category concept and hierarchical relationship close to the real needs, and realize the fine-grained hierarchical modeling of the relationship between multi-domain network rumor content features. In this paper, OWL language is used to encode the rumor domain ontology in the instance layer construction, and SWRL rule language and Pellet inference machine are used to deal with the conflict and mine tacit knowledge, judge the fine-grained categories of rumor texts, and realize the effective quality evaluation of rumor ontology. This makes the rumor domain ontology constructed in this paper have high consistency and reliability, and can effectively analyze and reason different types of rumor texts, which enriches the knowledge system in this field and provides a solid foundation for subsequent credible rumor detection and governance.

However, the study of the text has the following limitations and deficiencies:

  • (1) The rumor domain ontology constructed in this paper only considers the content characteristics, but does not consider the user characteristics and communication characteristics. User characteristics and communication characteristics are important factors affecting the emergence and spread of online rumors, and the motivation and influence of rumors can be analyzed. In this paper, these factors are not included in the rumor feature system, which may limit the expressive ability and reasoning ability of the rumor ontology and fail to fully reflect the complexity and multidimensional nature of online rumors.

  • (2) In this paper, the mainstream Internet rumor-dispelling websites in China are taken as the data source of ontology instantiation. The data covers five rumor categories: political and military, disease prevention, social life, science and technology, and nutrition and health, and the data range is limited. And these data sources are mainly official or authoritative rumor websites, and their data volume and update frequency may not be enough to reflect the diversity and variability of online rumors, and can not fully guarantee the timeliness and comprehensiveness of rumor data.

  • (3) The SWRL reasoning rules used in this paper are based on manual writing, which may not cover all reasoning scenarios, and the degree of automation needs to be improved. The pellet inference engine used in this paper is an ontology inference engine based on OWL-DL, which may have some computational complexity problems and lack of advanced reasoning ability.

The following aspects can be considered for optimization and improvement in the future:

  • (1) This paper will introduce user characteristics into the rumor ontology, and analyze the factors that cause and accept rumors, such as social attributes, psychological state, knowledge level, beliefs and attitudes, behavioral intentions and so on. This paper will introduce the characteristics of communication, and analyze the propagation dynamic factors of various types of rumors, such as propagation path, propagation speed, propagation range, propagation period, propagation effect, etc. This paper hopes to introduce these factors into the rumor feature system, increase the breadth and depth of the rumor domain ontology, and provide more credible clues and basis for the detection, intervention and prevention of rumors.

  • (2) This paper will expand the data sources, collect the original rumor data directly from social media, news media, authoritative rumor dispelling institutions and other channels, and build a rumor data set with comprehensive types, diverse expressions and rich characteristics; regularly grab the latest rumor data from these data sources and update and improve the rumor data set in time; strengthen the expressive ability of rumor ontology instance layer, and provide full data support and verification for the effective application of ontology.

  • (3) The text will introduce GPT, LLaMA, ChantGLM and other language models, and explore the automatic generation algorithm and technology of ontology inference rules based on rumor ontology and dynamic Prompt, so as to realize more effective and intelligent rumor ontology evaluation and complex reasoning.

Conclusion

This paper proposed a method of constructing TFI network rumor domain ontology. Based on the concept distribution of three-dimensional term set and real rumor data set, the main features of network rumors are defined, including text features, emotional features, credibility features, social background features and category features, and the relationships among these multi-domain features are modeled in a fine-grained hierarchy, including five parent classes and 88 subcategories. At the instance level, 32 types of typical rumor category judgment and reasoning rules are constructed, and the ontology is processed by using SWRL rule language and pellet inference machine for conflict processing and tacit knowledge mining, so that the semantic analysis and reasoning of rumor text content are realized, which proves its effectiveness in dealing with complex, fuzzy and uncertain information in online rumors and provides a new perspective and tool for the interpretable analysis and processing of online rumors.