1 Introduction

Epidemiology, the aspect of research focusing on disease modelling, is data intensive. Research epidemiologists in different research groups played a key role in developing data-driven models for COVID-19 [1,2,3] and monkeypox [4, 5]. Access to highly accurate data for disease modelling is beneficial, but not without challenges. The key benefit of using highly accurate data is the development of reliable models suitable for accurate forecasting of disease propagation and infection rates [6, 7]. The challenges include bottlenecks during data acquisition. These bottlenecks arise from the administrative bureaucracies and protocols that are prerequisites for data access [8,9,10].

The bottlenecks observed with data acquisition mostly arise from the manual approach to acquiring infection-related data. A COVID-19 modelling research group with findings in [11] utilizes the individual-manual approach of acquiring COVID-19 infection rate related data. The challenges observed with the manual data acquisition approach can be addressed via the development, deployment, and use of virtual agents such as web crawlers. Web crawlers have been deemed suitable for acquiring data on the internet in relation to different healthcare applications as seen in [12,13,14,15,16,17]. The use of web crawlers is also recognized to be capable of influencing the efficiency of data acquisition and eventual processing in healthcare applications.

The use of web crawlers in the manner proposed in [12,13,14,15,16,17] enables the realization of advances as regards data stream acquisition. Data stream acquisition and analysis play a significant role in the evolution of data science. Web crawlers therefore enable the realization of web mining as seen in [18]. Web mining is recognized to be an important aspect of data science. In this regard, research focused on the design and description of web crawlers provides an important precursor to the emerging field of data science and analytics as seen in [19]. The discussion by Shi et al. [20] describes an important aspect of data science i.e., data use in intelligent mechanism design. The use of data in this regard focuses on the design of intelligent classification and optimization algorithms. Such applications i.e., intelligent classification and optimization algorithms play a crucial role in communication network frameworks such as the internet of things. This was recognized in the context of the internet of things by Tien [21].

The discussion in [18,19,20,21] presents research and solutions traversing the role of data in different aspects of data science. The tasks recognized in this regard comprise web mining [18], analysis of data obtained via web mining [19], data driven intelligent mechanism design [20], and application of results of data analysis in networks as seen in [21].

2 Research Contribution

The presented research addresses the challenge of quantifying how the use of web crawlers enhances the important aspects of data acquisition and data analytics. Data acquisition and analytics are important aspects of data science. The research makes a contribution to data science by formulating and investigating how the use of web crawlers (as important data gathering tools) in different patterns enhances efficiency in data science. Essentially, the research investigates and presents results on how varying crawler use dynamics in data science influence data acquisition efficiency and data analytics efficiency.

Existing work in [22] recognizes the capability of using web crawlers to improve the efficiency of acquisition and use of healthcare related data. The use of web crawlers enables health domain researchers to make use of digital tools in acquiring data to be used for future analysis. An additional consideration is also required to examine how the use of web crawlers can enhance data acquisition and analysis in research groups.

The model presented in [22] examines the context of acquiring data to provide solutions to health care challenges without suitable closed form expressions. The acquisition of data to address this kind of challenge is the goal of the crawler architecture presented in [22]. The acquisition of data to meet this goal can be done using different approaches i.e., with crawlers only, humans only, or a combination of humans and crawlers. However, it is important to examine the performance benefit of either of these approaches being utilized in data acquisition.

This quantification is required for research groups where the manner of data acquisition is yet to make significant use of web crawlers. It is also significant to examine how the use of web crawlers enhances important metrics such as (1) data acquisition efficiency (DAqE) and (2) data analytics efficiency (DAnE). These metrics should be examined in the context of their relations with the number of research groups (NRGs). The focus on the use of crawlers in the domain of health applications [12,13,14,15,16,17, 22] has been their use in acquiring data. However, it is important to examine their performance benefits in different contexts. The realization of this analysis is the main goal of the research being presented.

The proposed research examines how the use of web crawlers in acquiring healthcare data from the internet enhances the identified metrics of the DAqE and DAnE. The quantification is done in a manner that enables determining whether human use that complements the crawler's activity results in significant performance benefits. The performance benefit in this case considers the DAqE and DAnE.

Three data acquisition models are considered in the presented research. These are the (1) Crawler only, (2) Human only, and (3) Hybrid (Crawler and human) data acquisition models.

The rest of the research is structured as follows: Sect. 2 discusses the existing and background work. Section 3 describes the problem being considered. Section 4 formulates the performance model. Section 5 presents the results of analysis. Section 6 is the conclusion.

2.1 Background and Existing Work

Zowalla et al. [12] propose a filter adjusted web crawler that selectively finds health related content on the internet. The proposed web crawler addresses the challenge of finding health related data by first introducing a partition via the filter functionality. The filter functionality reduces the overhead associated with the data acquisition process. This is done by reducing the need to filter a significant number of pages on the internet. The consideration in [12] also introduces a geographic dimension to the role of the filter adjusted web crawler. This is because it considers only three German speaking countries i.e., Germany, Austria, and Switzerland. Therefore, the usefulness of the filter assisted web crawler is not scalable considering the geographical dimension and syntactical dimension (since it considers only German related content). However, approaches and suggestions to make the filter more scalable in the geographical dimension and syntactical dimensions have not been considered.

The discussion in [13] considers data source heterogeneity and recognizes that health care data can be found in different sources. This is important and should be given attention considering the role of cloud computing platforms in storing health care data. The influence of semantic web technology on the approaches to acquiring healthcare data in online environments is recognized by Haque et al. [13]. Crawlers are recognized to play an important role as an evolution in the role of the semantic web for the purposes of data acquisition. The resource description framework (RDF) is a semantic web technology capable of improving the acquisition of data via web crawlers. This is because the RDF approach embraces data heterogeneity in an improved manner when compared to web crawlers. However, a performance evaluation of the RDF approach in comparison to the use of web crawlers has not been presented. Nevertheless, the notion in [13] demonstrates that different design approaches can be used to enhance and complement the performance of web crawlers.

Fei et al. [14] focus on the subject of healthcare data analytics. In the context of [14], healthcare data analytics is recognized to be executed via the existence of acquired big data. The use of pre-acquired big data in the role of developing machine learning applications is considered. However, the discussion in [14] has not discussed the suite of technologies and methods that are used to acquire the concerned big data. From the discussion in [14], it can be inferred that the RDF and intelligent web crawlers alongside other methods are required for acquiring data being used for healthcare machine learning applications. However, a hybrid architecture showing the combination of different data acquisition systems has not been presented in [14]. Instead, the focus in [14] is on ensuring that the data being acquired and used is free of the influence of fake news as regards the occurrence of COVID-19.

The increasing role of cloud platforms in storing healthcare related data is recognized in [15]. The discussion in [15] is set in the context where a healthcare entity seeks to select a suitable cloud vendor. The cloud vendor being considered will execute the tasks of storing and managing health care related data. Therefore, the use of web crawlers is set in the context of finding data from the internet. The concerned data provides user reviews on the services offered by each cloud vendor. The focus of the web crawler is cloud vendor review platforms. The drawback of the web crawler developed and presented in [15] is its limited applicability to other web forums with diverse and more heterogeneous content, alongside its limited coverage.

The focus of the discussion by Guo et al. [16] is the presentation of a web crawler used for acquiring COVID-19 infection-related data. The use of the crawler considers the efficient use of computational resources. This is because of the specification of parameters associated with the location domain and the temporal domain. From the perspective of the location domain, the web crawler activity was limited to the region of mainland China. In terms of the temporal domain, the web crawler activity was intended for the period spanning Jan 15, 2020 to March 15, 2020. The choice of the parameters associated with the location domain and temporal domain demonstrates that web crawlers have varying performance from the perspective of scalability. This is further accentuated by the choice of MySQL, which is used to specify the crawler data handling rules. Therefore, it can be inferred that the web crawler has limited performance with regard to obtaining data from non-structured or ill-structured online data sources. A similar perspective on the use of web crawlers, in which activity is limited to the World Bank health support for the period 2005–2016, can be found in [17].

The discussion in the next part of this section presents existing work on the use of web crawlers in contexts outside of the healthcare domain.

Renrer et al. [23] describe the use of web crawlers in extracting social media messages. The concerned social media sites are Facebook and Twitter. The proposed solution involves the identification and selection of health focused messages based on the functionality of the named entity recognition module. The crawlers consider 19 data sources. These data sources are all based in France, a single location.

The use of crawlers in this case enables the conduct of analysis related to the patient's experience of treatment in a health facility. The approach considered in this case is not able to track patients' experiences across multiple healthcare institutions on a global scale. This is because the web crawlers are designed to function within the national domain of France.

Sommer [24] describes how the emergence of the internet and the development of crawler technology have transformed traditional epidemiology. The development of web crawlers is recognised to enable the development of the concept and paradigm of syndromic surveillance. The approach of syndromic surveillance has been observed to provide a precursor to COVID-19 by enabling the detection of respiratory infections. However, the design of an internet-based computing architecture that fully explains the role of web crawlers in future epidemiology requires future research consideration.

Cinelli et al. [25] address the challenge of limiting and eliminating the propagation of false information about COVID-19 via the social media sites Twitter and YouTube. The use of crawlers has not been explicitly identified. Nevertheless, the propagation of false news is limited via the analysis of interactions with COVID-19 related concerns on the Twitter and YouTube platforms. The proposed solution in [21, 25] focuses on the sentiment analysis of proposed search terms that are related to the COVID-19 pandemic. However, additional work is required to consider and analyse the relations associated with the term-individual interactions for other social media platforms besides Twitter and YouTube.

Yu et al. [26] present a summary of the role of web crawlers. In this case, different types of web crawlers have been identified. These are generic, focus web, and incremental web crawlers. From the perspective of [26], the discussion can be deemed to make use of focus web crawlers. In a similar manner, the discussion in [23] focuses on 19 selected forums and makes use of web crawlers limited to these 19 online forums.

The distributed crawler is also identified as a type of crawler. This type of crawler is suited for the execution of data crawlers in a manner that uses different crawl nodes with node categorization based on functionality. The discussion in Yu et al. [26] does not focus on a sector or application specific suitability of each type of identified crawler.

The discussion in [26] demonstrates the maturation of web crawler technology alongside the increasing use and pervasiveness of the internet. The analysis of a limited set of data and individual responses from multiple select online forums [23], and social media websites [25] provides a basis for the integration of digital technology with epidemiology. This leads to the emergence of the domain of digital epidemiology.

Kalteh et al. [27] identify the emergence and usefulness of digital epidemiology. In this case, digital epidemiology relates to the use of data that is external to the public health system. This definition presumes that the boundary constituents of the public health system are known. However, it is intended that the data concerned in digital epidemiology are not primarily generated for epidemiology related applications. In [27], it is recognised that the domain of digital epidemiology incorporates the use of more digital tools. However, the description of systems enabling the use of these digital tools requires further research consideration.

John et al. [28] identify that epidemiological modelling is initialized with the acquisition of the required data and subsequent pre-processing. The data relating to epidemiological modelling is recognized to be acquired and processed via a complex sequence of steps. A generic discussion on the design and development of the data acquisition and associated pre–processing steps is presented. The discussion has not focused on methods suitable to achieve the implied automated acquisition of epidemiological related data. However, the discussion in [28] focuses on defining the guiding criteria for a framework enabling the acquisition, refining, and use of data in epidemiological applications.

Normah et al. [29] identify that clinical data inconsistency affects epidemiological modelling. The research presented in [29] addresses the challenges of evaluating the impact of SARS-CoV-2 and HIV infection. The focus in [29] is on the conduct of epidemiological studies on patients experiencing HIV and SARS-CoV-2 infections concurrently. The target here is the analysis of infection rate and occurrences in co-infected persons. In this case, complex data analysis is required to address the observed data inconsistencies. The analysis executed in [29] makes use of data from 37 countries. These data are accessible via the repository of the World Health Organization (WHO) global clinical platforms. However, the mechanism of populating the identified WHO global clinical platforms has not been discussed as this is beyond the intended coverage of [29]. The critical role of the computing platform identified in [29] is recognized in [30]. Bertagnolio et al. [30] note that the data aboard the WHO global clinical platform emerges from a sample of hospitalised people. A similar notion is also found in [31, 32].

However, the notion of a global health data platform is preceded by the development of local data repositories as seen in [33,34,35].

The discussion here recognizes the usefulness of data acquisition in epidemiological related studies. The role of the crawlers and the emergence of new paradigms such as digital epidemiology is also recognised in the discussion. It is also seen that global data platforms play a crucial role in assembling data from multiple global sources. The data aboard these global data platforms such as that of the WHO arise from smaller platforms. These smaller data platforms such as those whose role is recognized in [33,34,35] are important for providing data to research groups. However, the effectiveness of the data–group relations especially considering the increasing role of the internet requires further consideration and analysis. The design of a mechanism and model addressing this challenge is the goal of the solution being proposed.

3 Problem Description

The case being considered is one concerning a research group context seeking access to epidemiological related data. The research group makes use of computing tools and data frameworks such as those in [27,28,29,30]. It seeks to enhance performance in executing tasks associated with accessing and using epidemiological data. The context being considered comprises multiple research groups seeking epidemiological related data. Let \(\alpha \) and \(\beta \) be the set of research groups and epidemiological data sets, respectively.

$$\alpha =\left\{{\alpha }_{1},{\alpha }_{2},\dots , {\alpha }_{I}\right\}$$
(1)
$$\beta =\left\{{\beta }_{1},{\beta }_{2},\dots , {\beta }_{J}\right\}$$
(2)

The \(i\mathrm{th}\) group \({\alpha }_{i}, {\alpha }_{i} \epsilon \alpha \) comprises multiple individuals such that:

$${\alpha }_{i}=\left\{{\alpha }_{i}^{1},\dots , {\alpha }_{i}^{M}\right\}$$
(3)

In addition, the epidemiological data set \({\beta }_{j}, {\beta }_{j}\epsilon \beta \) comprises multiple information elements such that:

$${\beta }_{j}=\left\{{\beta }_{j}^{1},\dots , {\beta }_{j}^{N}\right\}$$
(4)

Let \({I}_{A}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right) \epsilon \{0,1\}\) denote the accessibility of the \(j\mathrm{th}\) dataset \({\beta }_{j}\) to the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}, {t}_{y} \epsilon t , t=\left\{{t}_{1},\dots , {t}_{Y}\right\}.\) The dataset \({\beta }_{j}\) is accessible and inaccessible to the research group \({\alpha }_{i}\) at the epoch \({t}_{y}\) if \({I}_{A}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=1\) and \({I}_{A}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=0\), respectively. In addition, let \({I}_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right) \epsilon \{0,1\}\) denote the request status of the dataset \({\beta }_{j}\) by the research group \({\alpha }_{i}\) at the epoch \({t}_{y}\). The dataset \({\beta }_{j}\) is being requested and not being requested by the research group \({\alpha }_{i}\) at the epoch \({t}_{y}\) when \({I}_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=1\) and \({I}_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=0\), respectively.

The dataset \({\beta }_{j}\) is being requested and can be accessed by the research group \({\alpha }_{i}\) at the epoch \({t}_{y}\) when the condition \({I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)=1, {I}_{A}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)=1\) holds true. Dataset redundancy challenges arise when the conditions \({I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)=0, {I}_{A}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)=1\) or \({I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)=0, {I}_{A}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)=0\) holds true.
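The conditions above can be illustrated with a minimal Python sketch. It assumes that the indicators \({I}_{R}\) and \({I}_{A}\) for one (research group, dataset, epoch) triple are available as the integers 0 or 1, with \({I}_{A}=1\) taken to mean accessible as in the conditions above. The names `Status` and `classify` are hypothetical and are introduced only for illustration.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical labels for the indicator combinations discussed in the text."""
    SERVED = "requested and accessible"
    INACCESSIBLE = "requested but inaccessible"
    REDUNDANT = "not requested"


def classify(i_r: int, i_a: int) -> Status:
    """Map the pair (I_R, I_A) for one (group, dataset, epoch) triple to a status."""
    if i_r not in (0, 1) or i_a not in (0, 1):
        raise ValueError("indicators must be 0 or 1")
    if i_r == 1 and i_a == 1:
        return Status.SERVED
    if i_r == 1 and i_a == 0:
        return Status.INACCESSIBLE
    # I_R = 0 with I_A = 1 or I_A = 0: the text lists both as redundancy conditions.
    return Status.REDUNDANT
```

For example, `classify(1, 0)` returns `Status.INACCESSIBLE`, which corresponds to a dataset that is requested but cannot be accessed at the given epoch.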

In addition, group member and information element migration can be executed in the \(i\mathrm{th}\) group \({\alpha }_{i}\) and the \(j\mathrm{th}\) dataset \({\beta }_{j}\) at the epoch \({t}_{y}.\) The migration indicator for the entity \(v, v \epsilon \left\{{\alpha }_{i},{\beta }_{j}\right\}\) at the epoch \({t}_{y}\) is denoted as \({I}_{M}\left(v,{t}_{y}\right)\). The indicator values \({I}_{M}\left(v={\alpha }_{i},{t}_{y}\right)=1\) and \({I}_{M}\left(v={\alpha }_{i},{t}_{y}\right)=0\) signify the addition of a new group member to and the removal of an existing group member from \({\alpha }_{i}\) at the epoch \({t}_{y},\) respectively. The indicator values \({I}_{M}\left(v={\beta }_{j} ,{t}_{y}\right)=1\) and \({I}_{M}\left(v={\beta }_{j} ,{t}_{y}\right)=0\) signify the addition of new information elements to and the removal of existing information elements from the dataset \({\beta }_{j}\) at the epoch \({t}_{y},\) respectively. The following challenges arise for a research group defined in the model above.

  1. Accessibility of Required Data: this challenge arises when the dataset is requested by a research group via one or more of its members but cannot be accessed. This challenge can be described by the condition given as \({I}_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=1, {I}_{A}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=0\) or \({I}_{R}\left({\alpha }_{i},{\beta }_{j}^{m}, {t}_{y}\right)=1, {I}_{A}\left({\alpha }_{i},{\beta }_{j}^{m}, {t}_{y}\right)=0, {\beta }_{j}^{m} \epsilon {\beta }_{j}\). In the second condition, the request by the research group \({\alpha }_{i}\) is for the information element \({\beta }_{j}^{m}\). Data accessibility challenges can also arise due to policy motivated information element migration. This can be described by the condition: \({I}_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=1, {I}_{M}\left({\alpha }_{i},{\beta }_{j}^{1}, {t}_{y}\right)=1, {I}_{M}\left({\alpha }_{i},{\beta }_{j}^{2}, {t}_{y}\right)=1,{I}_{M}\left({\alpha }_{i},{\beta }_{j}^{3}, {t}_{y}\right)=1, {I}_{A}\left({\alpha }_{i},{\beta }_{j}^{1}, {t}_{y}\right)=0, {I}_{A}\left({\alpha }_{i},{\beta }_{j}^{2}, {t}_{y}\right)=0,{I}_{A}\left({\alpha }_{i},{\beta }_{j}^{3}, {t}_{y}\right)=0, \left\{{I}_{A}\left({\alpha }_{i},{\beta }_{j}^{4}, {t}_{y}\right),\dots , {I}_{A}\left({\alpha }_{i},{\beta }_{j}^{M}, {t}_{y}\right)\right\}=1.\) In this case, the first three information elements from the \(j\mathrm{th}\) dataset \({\beta }_{j}\) have migrated to another data repository due to organizational challenges. The information elements that have migrated are \({\beta }_{j}^{1},{\beta }_{j}^{2}\) and \({\beta }_{j}^{3}\). The information elements that have not migrated span the 4th information element to the \(M\mathrm{th}\) information element.

  2. Dataset Redundancy: the challenge of data redundancy arises when an acquired dataset is accessible but is not used. This occurs when the condition \({I}_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=0,{I}_{A}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)=1\) holds. Another condition describing redundancy is given as:

    $$\left({I}_{A}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right),{I}_{R}\left({\alpha }_{i},{\beta }_{j}^{1},{t}_{y}\right),{I}_{R}\left({\alpha }_{i},{\beta }_{j}^{2},{t}_{y}\right),{I}_{R}\left({\alpha }_{i},{\beta }_{j}^{3},{t}_{y}\right),\dots , {I}_{R}\left({\alpha }_{i},{\beta }_{j}^{M},{t}_{y}\right)\right)=\left\{1,1,1,0,\dots , 0\right\}$$
    (5)
    $${\beta }_{j}=\left\{{\beta }_{j}^{1},{\beta }_{j}^{2},\dots , {\beta }_{j}^{M}\right\}$$
    (6)

The scenario in (5) and (6) describes the case where certain information elements i.e., the first information element \({\beta }_{j}^{1}\) and second information element \({\beta }_{j}^{2}\) are requested (non-redundant) while the third information element \({\beta }_{j}^{3}\) to the last information element \({\beta }_{j}^{M}\) are redundant. Migration can lead to a scenario where datasets or information elements become redundant. Such a scenario at the instant \({t}_{y+1},\) \({t}_{y+1} \epsilon t\) is described as:

$$\left({I}_{M}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right),{I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right),{I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)\right)=\left\{\mathrm{1,1},1\right\}$$
(7)
$$\left({I}_{M}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right),{I}_{R}\left({{\alpha }_{f},\beta }_{j},{t}_{y}\right),{I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right),{I}_{A}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right),{I}_{R}\left({{\alpha }_{f},\beta }_{j},{t}_{y}\right)\right)=\left\{\mathrm{1,1},\mathrm{1,1},1\right\}$$
(8)
$$\left({I}_{M}\left({{\alpha }_{i},\beta }_{j},{t}_{y+1}\right),{I}_{R}\left({{\alpha }_{f},\beta }_{j},{t}_{y+1}\right),{I}_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y+1}\right),{I}_{A}\left({{\alpha }_{i},\beta }_{j},{t}_{y+1}\right),{I}_{R}\left({{\alpha }_{f},\beta }_{j},{t}_{y+1}\right)\right)=\left\{\mathrm{1,0},\mathrm{1,1},1\right\}$$
(9)

The relations in (7), (8) and (9) describe a scenario in which there is a migration of the dataset \({\beta }_{j}\) at the epoch \({t}_{y}\) to enable its access by the \(f\mathrm{th}\) research group \({\alpha }_{f},{\alpha }_{f} \epsilon \alpha \) at the epoch \({t}_{y+1} ,{t}_{y+1} \epsilon t\) as seen in (9). The challenges described in the aspects of ensuring accessibility and non-redundancy of data influence the dynamics of research groups. This transfers to influencing the metrics of the DAqE and DAnE. It is significant to ensure that solutions that address the identified challenges improve the identified metrics of the DAqE and DAnE.
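The element-level redundancy pattern described around (5) and (6) can be sketched as follows. The sketch assumes that the dataset-level indicator \({I}_{A}\) and the per-element request indicators \({I}_{R}\) are given as integers, and it treats an element as redundant only when the dataset is accessible and the element is not requested. The function name `redundant_elements` is hypothetical.

```python
def redundant_elements(i_a_dataset: int, i_r_elements: list[int]) -> list[int]:
    """Return the 1-based indices m of elements that are accessible but not requested.

    i_a_dataset  -- I_A for the whole dataset (1 = accessible, 0 = inaccessible)
    i_r_elements -- I_R for each information element, in element order
    """
    if i_a_dataset != 1:
        # Assumption: an inaccessible dataset holds no redundantly acquired elements.
        return []
    return [m for m, i_r in enumerate(i_r_elements, start=1) if i_r == 0]
```

With an accessible dataset of five elements of which only the first two are requested, `redundant_elements(1, [1, 1, 0, 0, 0])` returns `[3, 4, 5]`.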

4 Proposed Solution

The discussion here is divided into three aspects. The first aspect presents the solution that addresses the inaccessibility challenge. The second aspect presents the solution that addresses the redundancy challenge. The third aspect focuses on the integration of the solutions proposed to address the challenges of data inaccessibility and redundancy.

4.1 Proposed Solution: Inaccessibility Challenges

The inaccessibility challenge arises when a requested dataset is unavailable to the research group or an individual in the research group. The challenge of unavailability is addressed via the use of proposed intelligent interactive persistent crawlers (IIPCs). The IIPC is a robust crawler that can maintain a record of the web portals that have been visited. In addition, the IIPC has the capability to communicate with web portals when new datasets or information is updated. Furthermore, the IIPC incorporates crawler networking capability. Crawler networking capability implies that IIPCs communicate with each other and other compatible networking crawlers.

The identified crawler capabilities enable the realization of a cognitive crawler. The cognitive crawler is capable of functioning in an autonomous or human assisted manner. The autonomous cognitive crawler mode transitions to the human assisted cognitive crawler mode when a change in the epidemiological data search policies is required. The transition is also necessitated when the persistent search action of the cognitive crawler is unable to provide the desired data or information element. Furthermore, the cognitive crawler executes the bundling of information elements to form datasets based on the preferences and objective of the research group.

The architecture of the proposed IIPC i.e., cognitive crawler is presented in Fig. 1. Figure 1 shows the entities comprising the proposed cognitive crawler. The entities in the cognitive crawler are: (1) Resource Record Entity (RRE), (2) Resource Visit Record Entity (RVRE), (3) Data Return Objective Entity (DROE), (4) Resource Visit Entry (RVE) and (5) Transition Mode Entity (TME). The RRE receives information on the specified data search objective. It initiates the search for the datasets or information elements. The RVRE maintains a record of the web locations from which data has been obtained. The information elements in the obtained data are maintained in the DROE. The DROE keeps a record of the time spent acquiring the information elements and associated dataset. The RVE receives information on the update of the dataset and information elements. In addition, it also decides on the need to revisit a previously visited web location. The RVE provides information on its output to the TME. The TME decides whether to retain the functionality of the cognitive crawler in the autonomous mode or to transition to the human assisted mode of cognitive crawler functionality. The output of the TME initiates the execution of the data search objective i.e., searching for datasets or information elements to meet the research group’s objectives.

Fig. 1 Block diagram of the proposed solution
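The mode decision attributed to the TME can be illustrated with a minimal Python sketch. The text does not specify how the RVE output is encoded or how an unsuccessful persistent search is detected, so the `RVEOutput` fields and the `max_failed_attempts` threshold below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN_ASSISTED = "human_assisted"


@dataclass
class RVEOutput:
    """Hypothetical summary that the RVE passes to the TME."""
    policy_change_required: bool  # a change in data search policies is needed
    failed_attempts: int          # consecutive searches that did not return the desired data


def tme_decide(rve: RVEOutput, max_failed_attempts: int = 3) -> Mode:
    """Choose the crawler mode for the next search cycle."""
    if rve.policy_change_required:
        return Mode.HUMAN_ASSISTED
    if rve.failed_attempts >= max_failed_attempts:
        # Assumed threshold: persistent search has not provided the desired data.
        return Mode.HUMAN_ASSISTED
    return Mode.AUTONOMOUS
```

In this sketch the crawler stays autonomous by default and hands over to a human only under the two transition conditions named in the text. The threshold value is a tunable placeholder rather than a value from the proposed design.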

4.2 Proposed Solution: Redundancy Challenge

The datasets and information elements that have been acquired are held in the DROE (which stores accessed datasets and information elements). The DROE incorporates sub-entities that monitor the number of epochs for which an information element or dataset is held, the associated duration, and the number of transfers of the dataset (or information elements) to another location via sharing with other research groups. The DROE sub-entity that monitors these dataset (or information element) related parameters is the data monitoring sub-entity (DMSE). The DROE hosts the DMSE. The DMSE executes its functionality before the DROE communicates with the RVE as regards deciding on the need to revisit a web or internet location to access a dataset or information element.
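The bookkeeping role of the DMSE can be sketched in Python as below. The counters follow the quantities named in the text (epochs and transfers), while the redundancy rule in `is_redundant` and its `min_idle_epochs` parameter are assumptions, since the text does not state how the DMSE turns its counters into a decision. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ItemStats:
    epochs_held: int = 0       # epochs during which the item was stored in the DROE
    epochs_requested: int = 0  # epochs during which the item was requested
    transfers: int = 0         # times the item was shared with another research group


@dataclass
class DMSE:
    """Hypothetical data monitoring sub-entity hosted by the DROE."""
    stats: dict[str, ItemStats] = field(default_factory=dict)

    def record_epoch(self, item_id: str, requested: bool) -> None:
        s = self.stats.setdefault(item_id, ItemStats())
        s.epochs_held += 1
        if requested:
            s.epochs_requested += 1

    def record_transfer(self, item_id: str) -> None:
        self.stats.setdefault(item_id, ItemStats()).transfers += 1

    def is_redundant(self, item_id: str, min_idle_epochs: int = 3) -> bool:
        """Assumed rule: idle for several epochs and never shared."""
        s = self.stats.get(item_id)
        if s is None:
            return False
        idle = s.epochs_held - s.epochs_requested
        return idle >= min_idle_epochs and s.transfers == 0
```

A DROE built along these lines could consult `is_redundant` before asking the RVE to revisit a web location, so that revisits are not triggered for items that nobody is using.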

4.3 Integration for the Cognitive Crawler

The crawler networking capability has not yet been considered in addressing the unavailability and redundancy challenges. In Fig. 1, the RVE hosts the capability of crawler internetworking. The RVE hosts the crawler network sub-entity (CNSE). The functionality of the CNSE limits the occurrence of unnecessary visits to a location on the web or internet. The CNSE enables a crawler to obtain information on new web locations where datasets and new information elements can be obtained. The CNSE enables this information to be obtained without making visits and executing search operations at other web portals. In addition, the CNSE enables the receipt of information from other crawlers carrying data from web locations where dataset and information element access is prevented due to policies on data sovereignty. The architecture of the proposed cognitive crawler showing DMSE and CNSE integration is in Fig. 2.

Fig. 2 Proposed cognitive crawler showing the role of the integrated DMSE and CNSE
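The way the CNSE limits unnecessary visits can be illustrated with a short Python sketch. It assumes that crawlers exchange two sets of web locations (those already visited and those known to hold relevant data) and that a location visited by a peer need not be visited again because the peer can share what it obtained. The `Crawler` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Hypothetical crawler with a minimal CNSE-style exchange."""
    name: str
    visited: set[str] = field(default_factory=set)        # locations already crawled
    known_sources: set[str] = field(default_factory=set)  # locations known to hold data

    def share(self) -> dict[str, set[str]]:
        """Report sent to peer crawlers."""
        return {"visited": set(self.visited), "known_sources": set(self.known_sources)}

    def plan_visits(self, peer_reports: list[dict[str, set[str]]]) -> set[str]:
        """Learn new sources from peers and skip locations any crawler has covered."""
        covered = set(self.visited)
        for report in peer_reports:
            self.known_sources |= report["known_sources"]
            covered |= report["visited"]
        return self.known_sources - covered
```

For instance, if crawler `a` has visited `u1` and knows `u1` and `u2`, while crawler `b` has visited `u2` and knows `u2` and `u3`, then `a.plan_visits([b.share()])` returns `{"u3"}`: a learns of `u3` without searching for it and does not revisit `u2`.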

5 Performance Formulation

The use of the proposed solution is expected to improve the DAqE and DAnE. The DAqE in the existing case, the case involving crawlers, and the case involving human assisted crawlers are denoted as \(\zeta_{1} ,\zeta_{2}\) and \(\zeta_{3}\), respectively.

$$ \zeta_{1} = \mathop \sum \limits_{i = 1}^{I} \mathop \sum \limits_{y = 1}^{Y} \mathop \sum \limits_{j = 1}^{J} \mathop \sum \limits_{m = 1}^{M} \frac{{\theta_{R} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)I_{A} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)I_{R} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)N_{R}^{1} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)}}{{\theta_{R} \left( {\alpha_{i} ,\beta_{j} ,t_{y} } \right)\left| {\theta_{R} \left( {\alpha_{i} ,\beta_{j} ,t_{y} } \right)} \right|}} $$
(10)
$$ \zeta_{2} = \mathop \sum \limits_{i = 1}^{I} \mathop \sum \limits_{y = 1}^{Y} \mathop \sum \limits_{j = 1}^{J} \mathop \sum \limits_{m = 1}^{M} \frac{{\theta_{R} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)I_{A} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)I_{R} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)N_{R}^{2} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)}}{{\theta_{R} \left( {\alpha_{i} ,\beta_{j} ,t_{y} } \right)\left| {\theta_{R} \left( {\alpha_{i} ,\beta_{j} ,t_{y} } \right)} \right|}} $$
(11)
$$ \zeta_{3} = \mathop \sum \limits_{i = 1}^{I} \mathop \sum \limits_{y = 1}^{Y} \mathop \sum \limits_{j = 1}^{J} \mathop \sum \limits_{m = 1}^{M} \mathop \sum \limits_{f = 1}^{2} \frac{{\theta_{R} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)I_{A} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)I_{R} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right)}}{{\theta_{R} \left( {\alpha_{i} ,\beta_{j} ,t_{y} } \right)\left| {\theta_{R} \left( {\alpha_{i} ,\beta_{j} ,t_{y} } \right)} \right|}}N_{R}^{f} \left( {\alpha_{i} ,\beta_{j}^{m} ,t_{y} } \right) $$
(12)

\({\theta }_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)\) is the size of the dataset \({\beta }_{j}\) being requested by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}\).

\(\left|{\theta }_{R}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)\right|\) is the number of datasets being requested by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) (associated with the dataset \({\beta }_{j}\)) at the epoch \({t}_{y}\).

\({\theta }_{R}\left({\alpha }_{i},{\beta }_{j}^{m},{t}_{y}\right)\) is the size of the \(m\mathrm{th}\) information element in the \(j\mathrm{th}\) dataset \({\beta }_{j}^{m}\) being requested by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}\).

\({N}_{R}^{1}({\alpha }_{i},{\beta }_{j}^{m},{t}_{y})\) and \({N}_{R}^{2}({\alpha }_{i},{\beta }_{j}^{m},{t}_{y})\) are the number of information element requests for the \(m\mathrm{th}\) information element in the \(j\mathrm{th}\) dataset \({\beta }_{j}\) by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}\) in the case of human search and crawler search procedures, respectively.
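Eqs. (10) to (12) share the same per-term weight and differ only in the request count that multiplies it. The following Python sketch evaluates the three sums under the assumption that each \((i, y, j, m)\) combination is supplied as one record and that \(I_{A}\) and \(I_{R}\) are supplied as numeric factors; the `Term` class and the function names are illustrative and not part of the formulation.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """One (i, y, j, m) summand of Eqs. (10)-(12)."""
    elem_size: float  # theta_R(alpha_i, beta_j^m, t_y)
    i_a: float        # I_A(alpha_i, beta_j^m, t_y)
    i_r: float        # I_R(alpha_i, beta_j^m, t_y)
    n_human: float    # N_R^1(alpha_i, beta_j^m, t_y)
    n_crawler: float  # N_R^2(alpha_i, beta_j^m, t_y)
    set_size: float   # theta_R(alpha_i, beta_j, t_y)
    n_sets: float     # |theta_R(alpha_i, beta_j, t_y)|

    def weight(self) -> float:
        # Fraction common to Eqs. (10)-(12), without the request count.
        return self.elem_size * self.i_a * self.i_r / (self.set_size * self.n_sets)


def zeta1(terms):
    # Eq. (10): existing case, human search requests.
    return sum(t.weight() * t.n_human for t in terms)


def zeta2(terms):
    # Eq. (11): crawler search requests.
    return sum(t.weight() * t.n_crawler for t in terms)


def zeta3(terms):
    # Eq. (12): sum over f = 1, 2 of the same weight times N_R^f.
    return sum(t.weight() * (t.n_human + t.n_crawler) for t in terms)
```

Because the weight does not depend on \(f\), the inner sum in Eq. (12) gives \(\zeta_{3} = \zeta_{1} + \zeta_{2}\) whenever the three expressions are evaluated on identical inputs.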

The DAnE is formulated in a similar manner. The use of intelligent solutions is recognized to have significant benefits over a manual data analysis approach. Here, data analysis in the existing case is performed by intelligent algorithms, but without the capabilities of the DMSE and CNSE proposed in the cognitive crawler. Let \({\varphi }_{1}\) and \({\varphi }_{2}\) denote the DAnE in the existing case, the crawler-only case (proposed) and the human-assisted crawler case (proposed), respectively.

$${\varphi }_{1}= \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{m=1}^{M}\sum_{y=1}^{Y}\frac{{\theta }_{R}\left({\alpha }_{i},{\beta }_{j}^{m},{t}_{y}\right){I}_{A}\left({\alpha }_{i},{\beta }_{j}^{m},{t}_{y}\right){I}_{R}\left({\alpha }_{i},{\beta }_{j}^{m},{t}_{y}\right)}{{\theta }_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)\left|{\theta }_{R}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)\right|{\theta }_{P}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)}{\theta }_{p}^{1}\left({\alpha }_{i},{\beta }_{j},{t}_{y}\right)$$
(13)
$${\varphi }_{2}= \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{m=1}^{M}\sum_{y=1}^{Y}{A}_{3}\left(i,j,k,m,y\right){\theta }_{p}^{2}\left({{\alpha }_{i},\beta }_{j},{t}_{y}\right)$$
(14)

\({\theta }_{P}({{\alpha }_{i},\beta }_{j},{t}_{y})\) is the duration spent processing one bit of the data in the dataset (information element) for the \(j\mathrm{th}\) dataset \({\beta }_{j}\) being requested by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}\).

\({\theta }_{p}^{1}({\alpha }_{i},{\beta }_{j}^{m},{t}_{y})\) is the seconds per bit associated with the processing of the \(m\mathrm{th}\) information element in the \(j\mathrm{th}\) dataset \({\beta }_{j}^{m}\) being requested by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}\). This processing capability is associated with the computing platform being used by the research group \({\alpha }_{i}\).

\({\theta }_{p}^{2}({\alpha }_{i},{\beta }_{j}^{m},{t}_{y})\) is the seconds per bit associated with the processing of the \(m\mathrm{th}\) information element in the \(j\mathrm{th}\) dataset \({\beta }_{j}^{m}\) being requested by the \(i\mathrm{th}\) research group \({\alpha }_{i}\) at the epoch \({t}_{y}\). The variable \({\theta }_{p}^{2}({\alpha }_{i},{\beta }_{j}^{m},{t}_{y})\) is a composite of the processing capability of all computing platforms along the path which the crawler traverses.
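Eq. (13) can be evaluated term by term in the same way as the DAqE expressions. The sketch below is a minimal Python illustration that assumes each summand is supplied as a tuple of already-evaluated quantities; the function name and the tuple layout are hypothetical. Eq. (14) is not reproduced because it depends on the term \({A}_{3}\), which is not defined in this section.

```python
def phi1(terms):
    """Evaluate Eq. (13) for an iterable of summands.

    Each summand is a tuple (illustrative layout, not from the source):
    (elem_size, i_a, i_r, set_size, n_sets, theta_big_p, theta_p1)
    where theta_big_p is theta_P and theta_p1 is theta_p^1.
    """
    total = 0.0
    for elem_size, i_a, i_r, set_size, n_sets, theta_big_p, theta_p1 in terms:
        numerator = elem_size * i_a * i_r * theta_p1
        denominator = set_size * n_sets * theta_big_p
        total += numerator / denominator
    return total
```

Since \({\theta }_{p}^{1}\) and \({\theta }_{P}\) are both per-bit processing durations, their ratio is dimensionless and is not bounded above by one, so a summand in this sketch can exceed the corresponding DAqE weight.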

6 Performance Evaluation

The performance evaluation results are presented in this section for two cases, Case 1 and Case 2. The scenario in Case 1 is that presented in [22]. The scenario in Case 2 arises from the proposed approach and incorporates a reduction of crawler persistence. Crawler persistence has been incorporated in the simulation to capture the capability of crawlers to recover data that is missing because of corruption or mistaken packet transmission sequences.

Crawler persistence is not considered for the existing mechanism, because human data search aimed at recovering missing data or wrongly packaged information leads to a significant decrease in the DAqE and DAnE. The existing case considers a scenario in which the computing associated with information retrieval involves minimal use of custom-designed crawlers. In addition, the existing approach focuses mainly on using computing entities to process data received through correspondence with health authorities, with or without use of the internet. The proposed case of using crawlers is one in which there are multiple mediating computing entities, i.e., servers, each with a different crawler persistence capability in the concerned networks. The reduction in crawler capability implies that the crawler persistence capability is reduced on average.

The performance simulation and evaluation are conducted for a health-related research group seeking access to epidemiological data. The data is sought to analyse the infection rate and to derive associated parameters, such as the mortality rate in age groups, for a certain disease or condition. The disease or condition can occur at the level of an epidemic or a pandemic. The research group has varying computing and connectivity resources for obtaining the required data from different sources. Its composition is not tied to one location and may comprise individuals from across a region or across the globe. In addition, the use of computing tools and products, i.e., the proposed crawler, at different levels of sophistication arising from interaction with the internet and computing platforms has been considered.

Furthermore, the DAqE and DAnE are simulated using MATLAB. All scenarios in Case 1 and Case 2 are simulated in a stochastic fashion. This accounts for the emergence of different contexts for the considered scenarios as the values of the simulation parameters associated with the variables in the performance formulation change. Case 1 and Case 2 each have three scenarios, i.e., Existing Mechanism, Proposed Mechanism (Crawlers only) and Proposed Mechanism (Hybrid–Combination of Existing Mechanism and Crawlers).
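The stochastic procedure can be illustrated with a short Monte Carlo sketch. It is written in Python rather than MATLAB, and every numerical range in it is a placeholder chosen for illustration, not a value from Tables 1 or 2. It also assumes \(I_{A} = I_{R} = 1\) and reuses the structure of Eqs. (10) to (12); it is not the simulation used to produce the reported results.

```python
import random


def simulate_daqe(trials=1000, terms_per_trial=20, seed=0):
    """Average the DAqE sums of Eqs. (10)-(12) over random parameter draws."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    totals = {"existing": 0.0, "crawler": 0.0, "hybrid": 0.0}
    for _ in range(trials):
        z1 = z2 = 0.0
        for _ in range(terms_per_trial):
            # Placeholder ranges: dataset size, element size, dataset count.
            set_size = rng.uniform(1e3, 1e4)
            elem_size = rng.uniform(1.0, set_size / terms_per_trial)
            n_sets = rng.randint(1, 5)
            weight = elem_size / (set_size * n_sets)  # I_A = I_R = 1 assumed
            z1 += weight * rng.randint(1, 3)  # placeholder human request count
            z2 += weight * rng.randint(1, 5)  # placeholder crawler request count
        totals["existing"] += z1
        totals["crawler"] += z2
        totals["hybrid"] += z1 + z2  # Eq. (12) sums both request counts
    return {name: value / trials for name, value in totals.items()}
```

Changing the placeholder ranges changes the simulated context, which corresponds to the variation in simulation parameters described above.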

The existing mechanism is one in which data acquisition in the research group relies on human effort only. This approach is common in a significant number of developing contexts, such as that in [27]. The proposed mechanism (crawlers only) is one where only crawlers are used; an example of such a scenario in the healthcare context is presented in [13, 15]. The hybrid case of the proposed mechanism arises from the proposed approach.

The performance simulation parameters for Case 1 and Case 2 are presented in Tables 1 and 2, respectively. The simulation results for the DAqE and DAnE in Case 1 are presented in Figs. 3 and 4, respectively, and those in Case 2 in Figs. 5 and 6, respectively. In the presented simulation results, the DAqE is the amount of data acquired through the use of the crawlers or the hybrid approach as a proportion of the desired datasets and associated information elements. A high value of the DAqE is beneficial. For the DAnE, a high value signifies the processing of more information elements to acquire a large proportion of the dataset required to meet the goal of acquiring health-related and epidemiological data.

Table 1 Simulation parameters for the DAqE and DAnE in Case 1 (results in Figs. 3 and 4)
Table 2 Simulation parameters for the DAqE and DAnE in Case 2 (results in Figs. 5 and 6)
Fig. 3 DAqE simulation results prior to the reduction of the mean number of packets recovered due to crawler persistence

Fig. 4 DAnE simulation results prior to the reduction of the mean number of packets recovered due to crawler persistence

From the results in Figs. 3, 4, 5, and 6, it can be seen that the DAqE values are less than one in all the concerned cases and associated scenarios. This arises because of the large number of requested datasets, with their constituent information elements, required to meet the research goal of the data-driven health-related research group. However, the value of the DAnE exceeds unity in some cases. This is because the processing capability, i.e., the duration associated with information element and dataset processing, is taken into account. A high value of the DAnE is obtained in the hybrid case, because the processing capability is combined with the use of human effort and crawlers. A value exceeding unity is beneficial, because it signifies that the desired dataset and information elements are acquired and that the acquisition process has been initialized to satisfy other data acquisition goals of the research group.

Performance analysis shows that the use of the proposed crawler instead of the existing mechanism enhances the DAqE by an average of 14.2%. The use of the considered hybrid approach instead of the existing mechanism and the crawler alone (in the proposed case) enhances the DAqE by an average of 34.8247% and 11.9867%, respectively.

In addition, the performance analysis examines how the DAnE metric is enhanced. The use of the hybrid approach (combining the proposed crawler and human effort) instead of the proposed crawler and the existing case (human approach alone) enhances the DAnE by 99% and 46.3% on average, respectively. Furthermore, the use of the proposed crawler instead of relying on human effort (and associated non-heterogeneous computing entities), as in the existing mechanism, enhances the DAnE by 99% on average.

Furthermore, the performance is examined for a case in which the persistence of the crawler is reduced. In this case, the crawler persistence is reduced by 70% on average. The performance analysis shows that the use of the proposed mechanism (Crawler) instead of the existing approach improves the DAqE and DAnE by 10% and 99%, respectively. The use of the proposed mechanism in a hybrid fashion is also seen to improve the DAqE relative to the existing mechanism and the proposed mechanism (Crawler) by an average of 64.7% and 43.5%, respectively.

The DAnE metric is also examined in this case. The performance analysis shows that the use of the proposed mechanism in a hybrid fashion instead of the existing mechanism and the proposed mechanism (Crawler) improves the DAnE by an average of 99% and 45.5%, respectively.
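The average percentage enhancements reported above compare one mechanism against another across the simulated points. How the averages were computed is not stated, so the following Python sketch assumes one plausible definition, the mean of the per-point relative gains; the function name is illustrative.

```python
def mean_percent_gain(baseline, improved):
    """Mean of per-point relative gains, in percent (assumed definition)."""
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative gain of each improved value over its paired baseline value.
    gains = [(new - old) / old * 100.0 for old, new in zip(baseline, improved)]
    return sum(gains) / len(gains)
```

Under this definition, baseline values of 2.0 and 4.0 paired with improved values of 3.0 and 5.0 give per-point gains of 50% and 25% and a mean gain of 37.5%. Averaging the two series first and then taking a single ratio would generally give a different figure, so the choice of definition matters when comparing reported percentages.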

7 Results and Discussion

The performance evaluation results are discussed in this section for two cases, Case 1 and Case 2. Case 2 differs from Case 1 in that it considers a reduction in crawler persistence. Crawler persistence has been incorporated in the simulation to capture the capability of crawlers to recover data that is missing because of corruption or mistaken packet transmission sequences.

The performance simulation parameters for Case 1 and Case 2 are presented in Tables 1 and 2, respectively. The simulation results for the DAqE and DAnE in Case 1 are presented in Figs. 3 and 4, respectively, and those in Case 2 in Figs. 5 and 6, respectively.

Fig. 5 DAqE simulation results prior to the reduction of the mean number of packets recovered due to a reduction in crawler persistence (by 25% in a new scenario)

Fig. 6 DAnE simulation results prior to the reduction of the mean number of packets recovered due to a reduction in crawler persistence (by 25% in a new scenario)

From the results in Figs. 3, 4, 5, and 6, it can be seen that the DAqE values are less than one in all the concerned cases and associated scenarios. This arises because of the large number of requested datasets, with their constituent information elements, required to meet the research goal of the data-driven health-related research group. However, the value of the DAnE exceeds unity in some cases. This is because the processing capability, i.e., the duration associated with information element and dataset processing, is taken into account. A high value of the DAnE is obtained in the hybrid case, because the processing capability is combined with the use of human effort and crawlers. A value exceeding unity is beneficial, because it signifies that the desired dataset and information elements are acquired and that the acquisition process has been initialized to satisfy other data acquisition goals of the research group.

8 Conclusion

The internet plays an increasingly important role in epidemiological research, because it enables the efficient execution of communications and additional processes associated with epidemiological data acquisition. However, data acquisition in epidemiological research groups is still executed manually by individuals seeking access to epidemiological data. The presented research proposes the design and use of epidemiology-focused crawlers for acquiring epidemiological data. The proposed intelligent crawlers can be used in an automated manner as a standalone unit or in a hybrid mode (along with human assistance). These approaches to epidemiological data acquisition differ from the approach used in existing research groups in the developing country context. Using the crawler in an automated or a hybrid manner is an innovation that benefits from the increasing access to internet platforms hosting digital epidemiological data. The presented research also proposes a quantitative approach to evaluating the effectiveness of epidemiological data acquisition and analysis, using the data acquisition efficiency and the data analytics efficiency. Performance evaluation shows that the data acquisition efficiency and data analytics efficiency are enhanced by the use of the intelligent crawlers (with and without human assistance) in comparison to the existing approach of manual data acquisition.