1 Introduction

In December 2019, a new virus known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in Wuhan city, Hubei province, China [1]. A few weeks later, in January 2020, the whole genome sequence of the SARS-CoV-2 virus was reported in China [1]. The lethal virus belongs to the coronavirus family, and the disease it causes is called COVID-19 [2, 3]. One particular property of the SARS-CoV-2 virus is that it spreads very rapidly; therefore, the World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern (PHEIC) [2, 4]. According to Johns Hopkins statistics for the SARS-CoV-2 virus, as of March 20, 2022, more than 468 million cases had been reported, with over 6 million deaths worldwide.

In March 2020, the first confirmed case was reported in Saudi Arabia. On June 24, 2021, Saudi Arabia was ranked fortieth among affected countries worldwide, with over 749,471 confirmed cases and 8,485 SARS-CoV-2-related deaths.

Globally, SARS-CoV-2 causes no symptoms, mild symptoms, or severe symptoms in patients as the virus replicates. Patients usually suffer from dry cough, fever, dyspnea, pneumonia, anosmia, fatigue, gastrointestinal manifestations, and ageusia [5, 6].

Data Mining (DM) techniques play a major role in many research fields. These techniques can solve complex tasks and extract useful knowledge from small and big data for many diseases, including infectious diseases [7]. In addition, DM techniques can assist medical practitioners in healthcare systems in the fight against the COVID-19 pandemic.

Our work is motivated by the success of the Association Rules Apriori (ARA) algorithm [8] in finding patterns in data. The ARA algorithm is a popular DM technique proposed by Agrawal in 1993 [9]; it scans a dataset to identify common itemsets and generate associative rules. It was used commercially on business transactions to identify interesting associations and patterns between customers and the most frequently purchased items using minimum support and minimum confidence thresholds. The patterns are expressed as association rules. Each association rule consists of an antecedent and a consequent that describe the pattern. A famous example of association rule mining is market basket analysis, which helps businesses expand their marketing strategies by providing insight into which items are most frequently purchased by customers. In this research, we employ the ARA algorithm to find the most frequent infection patterns in COVID-19 recovered patients’ data.

Notably, a large number of infected people have recovered. This raises several questions: How were these people infected? What are their health patterns? What are the most frequent infection patterns among recovered patients? Answering these questions scientifically is important for finding and understanding the different infection patterns of COVID-19 patients. Understanding these patterns can help medical practitioners characterize the behavior of the infection and spread public health awareness of the COVID-19 pandemic. Discovering the infection patterns, however, requires infection and health information from recovered COVID-19 patients, and existing public COVID-19 datasets lack this information. Peiffer et al. [10] and Alafif et al. [11] support this statement by asserting that machine learning and DM efforts require global collaboration and data sharing to tackle the COVID-19 pandemic. In addition to the lack of datasets with COVID-19 infection information, existing predictive COVID-19 methods such as [12, 13] and [14] focus only on predicting the states of infected patients and the spread of COVID-19, respectively, rather than on finding infection patterns. To address these limitations in the existing COVID-19 datasets and methods, we make the following contributions:

  • We introduce our COVID-19 infection patterns (IP) dataset which includes a sample of recovered patients’ data collected in Saudi Arabia.

  • We apply the ARA algorithm to discover associative infection patterns from recovered COVID-19 patients’ data using the IP dataset.

  • We show manual computation results and verify them with experimental results.

  • We discuss the results and their relation to the results of existing works.

The remainder of this paper is organized as follows. In Sect. 2, we describe and analyze the IP dataset. Then, we present our proposed method to discover infection patterns from COVID-19 patients in Sect. 3. Experimental implementation and results are provided in Sect. 4. Then, discussions on the results, limitations, and challenges are provided in Sect. 5. Finally, we conclude our work in Sect. 6.

Fig. 1 Age distribution in the study population

2 IP dataset

This study is conducted on 131 Saudi and non-Saudi subjects using a survey method. The study includes COVID-19 positive cases who spent a quarantine period in an isolation facility. Negative cases, critical positive cases admitted to hospital, non-critical cases subjected to home isolation, and cases with missing information are excluded from the survey. The survey questionnaire includes biographic data such as age, gender, nationality, smoking status, weight, height, blood type, and sport activity. It also includes medical history such as multivitamins, medicine usage before infection, medicine usage after infection, symptoms after infection, chronic disease, previous infection of MERS-CoV, previous infection of COVID-19, and period of isolation. The IP dataset is publicly available (see Footnote 1).

The collected data from the questionnaire are tabulated as shown in Table 1. The data is manually processed and converted to a binary format.
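The conversion to a binary format can be sketched as follows. This is a minimal illustration and not the authors' processing procedure: the field names (`gender`, `weight_kg`, `height_cm`, `symptoms`) and the example row are hypothetical, and the thresholds are borrowed from the patterns that appear in the rules of Sect. 3.

```python
def binarize(row):
    """Map one questionnaire row (a dict) to 0/1 pattern flags."""
    return {
        "Male": int(row["gender"] == "male"),
        "Weight>70": int(row["weight_kg"] > 70),
        "Height>160": int(row["height_cm"] > 160),
        "Fever": int("fever" in row["symptoms"]),
    }

# Hypothetical example row, not taken from the IP dataset.
example_row = {"gender": "male", "weight_kg": 82, "height_cm": 175,
               "symptoms": ["cough", "fever"]}
flags = binarize(example_row)
print(flags)
```

Each flag then corresponds to one binary pattern that the mining step can count.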

Table 1 Questionnaires data

Our preliminary analysis of the collected data shows that 21.4% of the study population is under 20 years old while 83.6% is above 20 years old (33.7% of them are between 30 and 40 years old while 25.5% are between 20 and 30 years old). Although this indicates that the incidence rate is higher in adults, obesity possibly increases the incidence in younger populations [15]. Figure 1 shows the age distribution in our study population.

Gender is also examined in the dataset. Of the total subjects, 64.1% are male (84 subjects) while 37.4% are female (49 subjects). A similar male predominance was reported in China in February 2020 [16]. Figure 2 shows the gender distribution in the study population.

Fig. 2 Gender domination in the study population

In the IP dataset, 43% (56 subjects) have the O+ blood group while 24% (32 subjects) have A+. Our data suggest that subjects with the O+ blood group are more susceptible to COVID-19 than subjects with other blood types. However, this finding contradicts a similar study in Germany that excludes such a link [17]. Figure 3 shows the blood group distribution in the study population.

Fig. 3 Blood group type in the study population
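As a small worked check, the blood group percentages can be recomputed from the reported counts and the sample size of 131. The helper name `percentage` is an illustrative assumption; only the counts come from the text.

```python
TOTAL_SUBJECTS = 131  # sample size reported in this section

def percentage(count, total=TOTAL_SUBJECTS):
    """Return the share of `count` in `total`, rounded to a whole percent."""
    return round(100 * count / total)

o_positive = percentage(56)  # subjects with O+ blood group
a_positive = percentage(32)  # subjects with A+ blood group
print(o_positive, a_positive)
```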

Studies have highlighted an insignificant relation between smoking and infection with COVID-19 [18]. In our study, 23% of the 131 subjects are smokers whereas the remaining 77% are non-smokers. Figure 4 shows the percentage of smoker and non-smoker subjects in the study population.

Fig. 4 Smoker and non-smoker subjects in the study population

Moreover, most subjects develop symptoms such as cough, fever, diarrhea, headache, fatigue, and loss of taste and smell, while a few develop no symptoms, as shown in Fig. 5. We notice that neither the subjects’ history of chronic diseases nor the administration of multivitamins is related to the infection pattern, as supported in [19]. Figures 6 and 7 show the distribution of chronic disease and multivitamins administration in the study population, respectively.

Fig. 5 Subjects with symptoms in the study population

Fig. 6 Chronic disease in the study population

Fig. 7 Multivitamins administration in the study population

Due to COVID-19 precautionary measures, both surveys and consents are conducted online using smartphones, tablets, and other electronic devices. A soft copy is sent to the subjects as a link. The initial pool consists of 330 subjects. However, difficulties in obtaining consent and conducting the survey limit the sample size to 131. These difficulties include, but are not limited to, wrong or unresponsive contact numbers, refusal to participate, difficulties in accessing the electronic link, elderly subjects, and subjects’ unfamiliarity with their blood groups, medications, height, and weight. Furthermore, most of the isolation facilities have recently been closed, and COVID-19 positive cases are directed to home isolation. This prevents us from reaching a larger sample size.

3 Methodology

In this section, we manually employ the ARA algorithm to extract and generate the strong associative rules and correlated patterns using the IP dataset. Figure 8 shows a pipeline of our proposed approach. The goal of using the ARA algorithm is to find the most frequent patterns relating to COVID-19 infection patterns.

Fig. 8 The pipeline methodology of the proposed approach

Let \(P = \lbrace p_{1}, p_{2}, \dots , p_{m} \rbrace\) be the set of binary COVID-19 infection patterns, and let D be the IP dataset. Each record R of an infected patient in D is represented as a binary vector, with \(R[n] = 1\) if R has the pattern \(p_{n}\) and \(R[n] = 0\) otherwise, where n is the index of the pattern in P. Let Q be a set of some patterns in P. We say that a record R satisfies Q if \(R[n] = 1\) for every pattern \(p_{n}\) in Q.
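The record representation can be sketched in a few lines. This is an illustration only: the pattern list below is a hypothetical four-pattern P, the helper names are assumptions, and indices are 0-based as is usual in Python.

```python
# Hypothetical pattern list P = (p_1, ..., p_m); the real P is larger.
PATTERNS = ["Male", "Weight>70", "Height>160", "Fever"]

def to_binary_vector(present, patterns=PATTERNS):
    """Return R with R[n] = 1 if the record has pattern p_n, else 0."""
    return [1 if p in present else 0 for p in patterns]

def satisfies(record, q_indices):
    """A record R satisfies Q if R[n] = 1 for every pattern index n in Q."""
    return all(record[n] == 1 for n in q_indices)

record = to_binary_vector({"Male", "Height>160"})
print(record)                     # [1, 0, 1, 0]
print(satisfies(record, [0, 2]))  # True
print(satisfies(record, [0, 1]))  # False
```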

By applying the ARA algorithm, we obtain rules of the form \(Q \longrightarrow p_{n}\), where Q is a set of some patterns in P and \(p_{n}\) is a pattern in P that is not in Q. The rule \(Q \longrightarrow p_{n}\) is satisfied in the dataset D with a confidence threshold C, where \(0 \le C \le 1\), if and only if at least a fraction C of the records in D that satisfy Q also satisfy \(p_{n}\).

Given the set of infection patterns P, we are interested in generating all the rules that satisfy the minimum support MinSup and minimum confidence MinConf thresholds. The MinSup threshold helps to find the frequent patternsets in D, where the support of a rule is measured on the union of the patternsets in its antecedent and consequent. Using the MinSup threshold, we generate all combinations of patternsets and select only the patternsets whose support exceeds the threshold. The support of a patternset, denoted MinSup in Eq. 1, is computed as follows:

$$\text{MinSup} = \frac{\text{count for patternset}}{\text{Total number of } R}$$
(1)
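Equation 1 can be illustrated with a short sketch. The records below are hypothetical toy data written with the letter-style pattern IDs used later in this section; they are not the IP dataset, and the function names are assumptions.

```python
# Hypothetical toy records (sets of pattern IDs), not the IP dataset.
RECORDS = [
    {"I", "L", "M"},
    {"I", "M", "DA"},
    {"I", "L", "M", "DA"},
    {"L", "M"},
    {"I", "DA"},
]

def support_count(patternset, records):
    """Number of records that contain every pattern in the patternset."""
    return sum(1 for r in records if set(patternset) <= r)

def support(patternset, records):
    """Eq. 1: count for the patternset divided by the total number of records."""
    return support_count(patternset, records) / len(records)

print(support_count({"I", "M"}, RECORDS))  # 3
print(support({"I", "M"}, RECORDS))        # 0.6
```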

Then, we apply the MinConf threshold to find the strong rules among all the frequent patternsets in D. The confidence of a rule \(Q \longrightarrow p_{n}\), denoted MinConf in Eq. 2, is computed as follows:

$$\text{MinConf} = \frac{\text{No. of support count in both } Q \text{ and } p_{n}}{\text{No. of support count in } Q}$$
(2)
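Equation 2 can be sketched in the same way. The records are again hypothetical toy data, and the function name `confidence` is an assumption made for illustration.

```python
# Hypothetical toy records (sets of pattern IDs), not the IP dataset.
RECORDS = [
    {"I", "L", "M"},
    {"I", "M", "DA"},
    {"I", "L", "M", "DA"},
    {"L", "M"},
    {"I", "DA"},
]

def confidence(q, p_n, records):
    """Eq. 2: support count of Q together with p_n over support count of Q."""
    q = set(q)
    both = sum(1 for r in records if (q | {p_n}) <= r)
    only_q = sum(1 for r in records if q <= r)
    return both / only_q

conf_im_l = confidence({"I", "M"}, "L", RECORDS)  # 2 of the 3 records with I and M also have L
conf_il_m = confidence({"I", "L"}, "M", RECORDS)  # both records with I and L also have M
print(round(conf_im_l, 2), conf_il_m)
```

A rule is kept as strong when this value reaches the chosen MinConf threshold.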

To generate strong associative rules, we first need to find all frequent patternsets. The first step is a join step that generates \(C_{k}\), a set of candidate k-patternsets, where k is the number of elements in each patternset. Then, a pruning step uses the MinSup threshold to reduce \(C_{k}\) to \(L_{k}\), the set of frequent k-patternsets. The MinSup threshold is applied at every pass to obtain progressively larger frequent patternsets. Our approach follows Algorithm 1.

Algorithm 1
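The join step can be sketched as follows. This is an assumption about how candidates are formed, based on the standard Apriori formulation rather than on the exact contents of Algorithm 1: two frequent (k-1)-patternsets are joined when their union has k elements, and a candidate is kept only if all of its (k-1)-subsets are frequent. The input set of frequent 2-patternsets is hypothetical.

```python
from itertools import combinations

def join_step(prev_frequent, k):
    """Build candidate k-patternsets C_k from frequent (k-1)-patternsets."""
    prev = set(prev_frequent)
    candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    # Keep a candidate only if every (k-1)-subset is itself frequent.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

# Hypothetical frequent 2-patternsets, written with letter-style pattern IDs.
L2 = {frozenset(p) for p in [("I", "L"), ("I", "M"), ("I", "DA"),
                             ("L", "M"), ("M", "DA")]}
C3 = join_step(L2, 3)
print(sorted(sorted(c) for c in C3))
```

The surviving candidates are then counted against the dataset and compared with the MinSup threshold.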

Table 2 shows the IP dataset, which contains a patient identification (PID) and a list of Pattern_IDs for each record. Each patient may have a different list of Pattern_IDs. The Pattern_IDs consist of letters in alphabetical order, which correspond to the infection patterns p in the Excel sheet. The letters are used to ease the manual computation and improve readability (Tables 2 and 3).

Table 2 List of pattern IDs for each patient’s record in the dataset
Table 3 Scanning the dataset for counting each 2-frequent patternset (C2)

First, we scan the whole dataset D in Table 2 to count the support of each candidate pattern in the first pass. We call these first candidates \(C_{1}\), or the 1-frequent patternsets, as shown in Table 4. After finding the support counts for the \(C_{1}\) candidates, the support counts are compared with the MinSup threshold. We set the MinSup threshold to 40% to filter out weak candidates that do not satisfy it. We call the remaining candidates \(L_{1}\), as shown in Table 5. The same procedure is followed to compute the 2-frequent patternsets (\(C_{2}\), \(L_{2}\)), 3-frequent patternsets (\(C_{3}\), \(L_{3}\)), and 4-frequent patternsets (\(C_{4}\), \(L_{4}\)), as shown in Tables 3, 6, 7, 8, 9, 10, and 11. The ARA algorithm stops joining and generating further candidates in the fifth pass since the 4-frequent patternsets do not satisfy the MinSup threshold.

Table 4 Scanning the dataset for counting each 1-frequent patternset (C1)
Table 5 Comparing the support count of 1-frequent patternset (L1) with the MinSup count
Table 6 Comparing the support count of 2-frequent patternset (L2) with the MinSup count
Table 7 Generating 3-frequent patternset C3 from L2
Table 8 Scanning the dataset for counting each 3-frequent patternset C3
Table 9 Comparing the support count of 3-frequent patternset (L3) with the MinSup count
Table 10 Generating 4-frequent patternset C4 from L3
Table 11 Scanning the dataset for counting each 4-frequent patternset C4
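The pass-by-pass procedure above can be sketched as a single loop. The records are hypothetical toy data rather than the IP dataset, the function name is an assumption, and the threshold test uses "greater than or equal to", which is also an assumption about how ties are treated. Integer arithmetic is used for the comparison to avoid floating-point edge cases.

```python
from itertools import combinations

# Hypothetical toy records (sets of pattern IDs), not the IP dataset.
RECORDS = [
    {"I", "L", "M"},
    {"I", "M", "DA"},
    {"I", "L", "M", "DA"},
    {"L", "M"},
    {"I", "DA"},
]

def apriori_frequent(records, min_sup_percent):
    """Return [L_1, L_2, ...], the frequent patternsets found at each pass."""
    n = len(records)

    def is_frequent(patternset):
        count = sum(1 for r in records if patternset <= r)
        return count * 100 >= min_sup_percent * n

    items = sorted({p for r in records for p in r})
    level = [frozenset([p]) for p in items if is_frequent(frozenset([p]))]
    levels, k = [], 1
    while level:
        levels.append(level)
        k += 1
        # Join step: unions of two frequent (k-1)-patternsets with k elements.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: keep only candidates that satisfy the MinSup threshold.
        level = sorted((c for c in candidates if is_frequent(c)), key=sorted)
    return levels

LEVELS = apriori_frequent(RECORDS, 40)
for k, level in enumerate(LEVELS, start=1):
    print(k, [sorted(ps) for ps in level])
```

On this toy data the loop ends after the third pass because no 4-patternset reaches the 40% threshold, which mirrors the stopping behaviour described above.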

After pruning the infrequent patternsets, we reach the 3-frequent patternsets \(L_{3}\) that satisfy the MinSup threshold. Then, for each frequent patternset l in \(L_{3}\), we generate all non-empty proper subsets s of l and output the rules \(s \longrightarrow (l-s)\). From Table 9, we see that \(L_{3}\) consists of \(l_{1} = \lbrace I, L, M \rbrace\) and \(l_{2} = \lbrace I, M, DA \rbrace\). Then, we generate all possible rules and use Equation 2 to compute the confidence for each rule, as shown in Table 12.

Table 12 Generated rules from 3-frequent patternsets \(L_{3}\)
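Rule generation from one frequent patternset can be sketched as below. The records are hypothetical toy data, so the confidences do not reproduce Table 12; the function names and the "greater than or equal to" comparison against the confidence threshold are assumptions.

```python
from itertools import combinations

# Hypothetical toy records (sets of pattern IDs), not the IP dataset.
RECORDS = [
    {"I", "L", "M"},
    {"I", "M", "DA"},
    {"I", "L", "M", "DA"},
    {"L", "M"},
    {"I", "DA"},
]

def count(patternset, records):
    """Number of records containing every pattern in the patternset."""
    return sum(1 for r in records if patternset <= r)

def strong_rules(patternset, records, min_conf_percent):
    """Return (s, l - s, confidence) for every rule that meets the threshold."""
    full = frozenset(patternset)
    full_count = count(full, records)
    rules = []
    for size in range(1, len(full)):           # non-empty proper subsets only
        for s in combinations(sorted(full), size):
            s = frozenset(s)
            s_count = count(s, records)
            if s_count and full_count * 100 >= min_conf_percent * s_count:
                rules.append((s, full - s, full_count / s_count))
    return rules

RULES = strong_rules({"I", "L", "M"}, RECORDS, 60)
for s, rest, conf in RULES:
    print(sorted(s), "->", sorted(rest), round(conf, 2))
```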

We set the MinConf threshold to 60% to exclude weak generated rules from Table 12. Then, the strong generated rules that satisfy the MinConf threshold are as follows:

  • \(\{I\} \longrightarrow \{L, M\}\)

  • \(\{L\} \longrightarrow \{I, M\}\)

  • \(\{M\} \longrightarrow \{I, L\}\)

  • \(\{I, L\} \longrightarrow \{M\}\)

  • \(\{I, M\} \longrightarrow \{L\}\)

  • \(\{L, M\} \longrightarrow \{I\}\)

  • \(\{I, M\} \longrightarrow \{DA\}\)

  • \(\{I, DA\} \longrightarrow \{M\}\)

  • \(\{M, DA\} \longrightarrow \{I\}\)

We substitute the alphabetical letters with the infection patterns. Then, we obtain the following rules:

  • \(\{Male\} \longrightarrow \{(Weight>70), (Height>160)\}\)

  • \(\{(Weight>70)\} \longrightarrow \{Male, (Height>160)\}\)

  • \(\{(Height>160)\} \longrightarrow \{Male, (Weight>70)\}\)

  • \(\{Male, (Weight>70)\} \longrightarrow \{(Height>160)\}\)

  • \(\{Male, (Height>160)\} \longrightarrow \{(Weight>70)\}\)

  • \(\{(Weight>70), (Height>160)\} \longrightarrow \{Male\}\)

  • \(\{Male, (Height>160)\} \longrightarrow \{Fever\}\)

  • \(\{Male, Fever\} \longrightarrow \{(Height>160)\}\)

  • \(\{(Height>160), Fever\} \longrightarrow \{Male\}\)
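The substitution step can be sketched with a lookup table. The mapping below is inferred by comparing the letter-based rules with the named rules above; the function name and the plain-text arrow are illustrative assumptions.

```python
# Mapping inferred from the two rule lists above.
PATTERN_NAMES = {"I": "Male", "L": "Weight>70", "M": "Height>160", "DA": "Fever"}

def render_rule(antecedent, consequent, names=PATTERN_NAMES):
    """Replace pattern IDs with readable names and format the rule as text."""
    left = ", ".join(names[p] for p in antecedent)
    right = ", ".join(names[p] for p in consequent)
    return "{" + left + "} -> {" + right + "}"

rendered = render_rule(["I", "M"], ["DA"])
print(rendered)  # {Male, Height>160} -> {Fever}
```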

As shown by the above generated associative rules, we obtain nine strong generated rules. We discover four strong COVID-19 infection patterns from COVID-19 recovered patients. The infection patterns are Male, \((Weight > 70)\), \((Height > 160)\), and Fever. These are the most frequent and strongest patterns obtained by the ARA algorithm.

From the above strong generated associative rules, one can notice that the first six rules share the same patterns: Male, \((Weight > 70)\), and \((Height > 160)\). The last three rules contain the Fever pattern in addition to the Male and \((Height > 160)\) patterns.

From the rule \(\{Male\} \longrightarrow \{(Weight>70), (Height>160)\}\), if a recovered patient is a male, he tends to have a weight of more than 70 kilograms and a height of more than 160 centimeters. Similarly, the rules \(\{(Weight>70)\} \longrightarrow \{Male, (Height>160)\}\), \(\{(Height>160)\} \longrightarrow \{Male, (Weight>70)\}\), \(\{Male, (Weight>70)\} \longrightarrow \{(Height>160)\}\), \(\{Male, (Height>160)\} \longrightarrow \{(Weight>70)\}\), and \(\{(Weight>70), (Height>160)\} \longrightarrow \{Male\}\) contain the same patterns, which are strongly associated with each other.

For the rule \(\{Male, (Height>160)\} \longrightarrow \{Fever\}\), if a recovered patient is a male and his height is more than 160 centimeters, he tends to have fever. Similarly, the rules \(\{Male, Fever\} \longrightarrow \{(Height>160)\}\) and \(\{(Height>160), Fever\} \longrightarrow \{Male\}\) contain the same patterns, which are strongly associated with each other.

4 Experiment

The ARA algorithm experiment is conducted using Google Colaboratory with the Apriori Python library on a 64-bit Windows 10 Pro machine with an 8-core Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and 32.0 GB of RAM.

The goal of the experiment is to verify the results obtained from our manual computation. The MinSup and the MinConf thresholds are used in the experiment. The experiment produces the same strong associative rules as the manual computation and additionally yields the non-smoking pattern. The experiment runs in 1.1 seconds.

5 Discussion

The ARA algorithm is robust in finding and extracting patterns from data with the use of minimum support and minimum confidence thresholds. The ARA algorithm reveals strong associative COVID-19 infection patterns in both the manual and the experimental computations using the IP dataset. As a result, male, weight above 70 kilograms, height above 160 centimeters, and fever are the strongest COVID-19 infection patterns among recovered patients.

To relate our findings to medical findings in the literature, we note that several medical research works have associated patterns with COVID-19 infection. In China, the severity of symptoms in active COVID-19 cases and the mortality tendency are two and a half times higher in males than in females [20]. In Italy, 70% of COVID-19 mortality is seen in men, while the death rate in Wuhan is 1.1% higher in males than in females [21, 22]. Although the ministry of health reported on the 20th of April that 55% of the reported active cases are females, we believe that this is related to public compliance with precautionary measures rather than a significant link to COVID-19. Similar findings were recorded in Oman in May 2020, where 70% of active cases are men as well [23]. Our generated associative rules have detected a pattern linking height and weight to COVID-19 infection. In 2019, the CDC reported that obesity increases the risk of severe illness from COVID-19 and that being overweight triples the risk of hospitalization [24]. Kompaniyets et al. [25] reported that the risks of hospitalization, intensive care unit admission, invasive mechanical ventilation, and death increase with increasing Body Mass Index (BMI). Our finding in the current study is consistent with these previous studies: overweight individuals have an increased risk of COVID-19 infection. Other studies identified a BMI above 25 kg/m² as a strong predisposing factor for COVID-19 infection [26].

Our results demonstrate a link between COVID-19 infection and non-smokers in our population. Although the same was observed in the UK, that study found that the mortality rate is higher among infected smokers than among non-smokers [27]. Furthermore, smoking is considered a risk factor for the progression of COVID-19 symptoms [28].

In terms of the height pattern, our study shows that tall individuals are more likely to have COVID-19. This finding is in line with the study reported by Desmet et al., which finds that individuals taller than 182 cm (6 ft) most probably have double the chance of being diagnosed with COVID-19 [29].

Knowing the first symptom of COVID-19 infection plays an important role in preventing the spread of the virus and in planning the patient’s treatment. Several clinical studies have shown that almost all infected patients reported fever as the first symptom of the COVID-19 disease [30, 31]. For example, in China, 98% of infected individuals have suffered from fever, and in Italy, more than 72% reported the same symptom [32]. Furthermore, the Centers for Disease Control and Prevention (CDC) in the US lists fever or chills as the first symptoms, which appear 2-14 days after infection by the virus (https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html). Also, fever lasts longer in infected individuals who are admitted to intensive care units after the start of the COVID-19 symptoms [33]. Clearly, fever should be considered the first sign or indicator of the onset of COVID-19 infection.

6 Conclusion

Discovering infection patterns helps health providers understand the infection factors. In this research work, we propose to use the ARA algorithm to discover infection patterns from recovered COVID-19 patients’ data. A non-clinical COVID-19 dataset, which consists of COVID-19 infection patterns, is introduced and analyzed. A sample of recovered patients’ data is manually collected in Saudi Arabia. Our manual computation and experimental results show strong associative rules with high confidence scores among the male, weight above 70 kilograms, height above 160 centimeters, and fever patterns. These patterns are the strongest infection patterns discovered from COVID-19 recovered patients’ data. In the future, we will work on the discovery of COVID-19 recovery and death patterns.