1 Introduction

Non-compliance with recommendations of self-quarantine, social distancing, and mask-wearing, particularly among exposed/asymptomatic and other infected individuals, appears to be driving the increases in COVID-19 incidence being observed in many regions of the country [1,2,3]. While the true extent of compliance with social distancing guidelines remains unknown, observational data suggest that people continue to move freely without wearing masks or maintaining the recommended minimum distance [4]. Recent national studies highlight that mask-wearing is not fully adopted [5, 6]. Several US states experiencing serious outbreaks have poor compliance with social distancing guidelines and mask-wearing, particularly among young adult populations [6]. Some states are reporting that people between 20 and 44 years of age now make up more than half of new cases [7,8,9]. Cases have been linked to specific bars, parties, or social events [10]. This situation is not unique to the USA but repeats across the globe.

In order to address social distancing, mask-wearing, and population monitoring, many institutions implement their own solutions ranging from complete telework to routine testing to symptom monitoring. The presented work is based on efforts by a large public university that implemented tools for daily monitoring of COVID-19 symptoms, as well as a research study that enrolled a cohort of participants to monitor their detailed movements during the pandemic and linked these movements to reported symptoms and contacts. Efforts at other universities are not much different from ours. As reported in the literature, significant efforts are being made to move a large portion of education online so that the number of people on campus can be reduced [11, 12]. Summaries of selected measures taken by universities are reported in [13]. Virtually all universities operate dedicated websites for COVID-19 response, as exemplified by https://www.harvard.edu/coronavirus, https://www.ox.ac.uk/coronavirus, https://now.mit.edu/, and https://www2.gmu.edu/Safe-Return-Campus.

Most research to understand the impacts of human movement on infectious disease transmission has focused on the macro-level [14, 15] and falls short of incorporating actual individual-level behavior and specific movement patterns, with only a few individual-level studies available, e.g., [16,17,18]. More commonly, insect movement models are used to observe infectious disease transmission with attempts to extrapolate to human populations [19,20,21]. It is possible to address this gap using individual-level movement data from GPS tracking, which has been done previously when studying movements of animals [22], and only in limited scope in human disease transmission [23, 24]. This is partially because GPS tracker data for studying movement are extremely difficult for researchers to obtain. In fact, there is a common misconception that large amounts of COVID-19-related GPS movement data are available for research. These data are collected and owned by private companies and available only in aggregated form [25, 26], preventing any individual-level studies. Several recently completed studies use such aggregated GPS movement data [27,28,29] and demonstrate the population-level impact of social distancing, but they are unable to analyze or link individual movements to symptoms or testing, or to study individual reasons for (non)compliance. Many of these studies use SafeGraph data [25], which are freely available to those who wish to study COVID-19, social distancing, and policy implications, yet are of unclear quality. The data appear to be based on a proprietary methodology for processing and aggregation whose details are not available, thus making any obtained results questionable. In contrast, some studies aim at analyzing social distancing and movements without GPS. For example, Moore et al. [30] used survey data, but not GPS, to study how COVID-19 affected movements in children.

To address this issue, our team continues to collect and analyze individual-level data including exact GPS locations and daily reported symptoms for a selected cohort of participants. In the presented work, we describe results obtained when merging individual-level GPS data with daily symptom monitoring.

2 Data Collection

The initial data collection efforts started on March 22, 2020, as part of the overall university response to the quickly developing COVID-19 pandemic. On May 5, 2020, we started geolocation data collection for a selected cohort of participants who consented to the data collection in an IRB-approved study.

2.1 Symptom Monitoring

Our group designed and implemented a screening tool and daily symptom journal based on the US Centers for Disease Control and Prevention (CDC) symptom monitoring guidelines. The CDC's guidelines have been revised several times since we implemented the tool, and these changes are reflected in our work. Additional items have been added to the revised tool as summarized in Table 1. The numbers in the table indicate the total number of responses as well as the total number of individuals reporting symptoms.

Table 1 Summary of symptom data collected by three survey instruments used since March 2020

The initial implementation of the daily symptom journal was done using the Qualtrics survey system, which allowed us to deploy the tool in just a few days. The tool was used by the general university population between March 22, 2020, and August 17, 2020, and for the research study of social distancing until October 2020. Starting August 18, 2020, a new Health✓™ tool implemented in the .NET framework has been deployed to the university population (https://healthcheck.gmu.edu). The university currently requires students, faculty, staff, and visitors to complete the screening tool every day before coming to campus, which has significantly increased the number of responses.

The symptom data collection timeline is shown in Fig. 1, which presents daily numbers of respondents. The initial high number of responses to the screening tool (red line) corresponds to inclusion of the link to the tool in the university learning management system (LMS), and the subsequent spike corresponds to an email sent to the university listserv. The tool was disabled on August 18. The daily symptom journal (blue line) was completed by a relatively constant number of individuals, with a slight increase in May 2020 when the research study on social distancing started to recruit participants. Another increase in responses corresponds to the university population preparing for and returning to campus for the Fall semester. The small number of respondents who continue to answer the daily journal are participants of the social distancing study. The required Health✓™ tool has a high number of responses, with a clear weekly cycle but also an overall declining trend. The significantly lower number of responses to the original Qualtrics survey, as compared to the Health✓™ tool, can be explained by the latter being required.

Fig. 1 Daily numbers of responses collected by the three reporting tools

Until September 2020, we received 23,786 responses to the old COVID-19 screening tool from 15,603 people, 34,965 responses to the old daily journal from 5867 people, and 125,186 responses to the new screening tool from 17,662 people.

2.2 Movement Tracking

Individual movement data collection requires apps installed on individuals’ smartphones. There is a wide range of apps that can be used for this purpose, ranging from specialized ones developed for tracking people during the COVID-19 pandemic, to fitness trackers, to general-purpose loggers. A summary of selected top-ranked apps is available in Table 2. A detailed survey of contact tracing apps is available in [31].

Table 2 Selected top-ranked movement tracking apps available in Android and iOS app stores

Using specialized tracking apps developed for the COVID-19 pandemic was originally considered an attractive option for our study, as they combine movement, symptom, and potentially Bluetooth data in one place. However, each of these apps is essentially a data black box that research teams cannot access. Many of the apps also have unacceptable privacy terms, which precluded us from using them for our participant cohort. We decided to use two general-purpose, commercially available GPS tracking apps which participants can download from app stores: myTracks on iOS devices and GPSLogger on Android devices. Both apps collect data locally and can be configured to send data only to the research team. We compared data from both apps by carrying an iPhone and an Android phone at the same time for about 3 months and concluded that the data do not differ in a way that would affect further analysis.

The average distance traveled (Fig. 2) and the average number of visited locations (Fig. 3) are relatively constant during the reported period. The change in total distance and total number of visited locations over time is due to the changing number of participants in the study. The spike in total distance on July 31 is due to one participant’s travel. Increased movements in mid-to-late August correspond to preparations for the new academic year.

Fig. 2 Daily total (red) and average per participant (gray) distance traveled in the recorded data

Fig. 3 Daily total (red) and average per participant (gray) number of locations visited for at least 10 min in the recorded data

2.3 Data Integration

GPS coordinates data were retrieved from the study participants via automatically generated email (sent to dedicated account on our server) and later preprocessed using procmail and Python program, or automatically uploaded to Google Drive and retrieved by the study team. The two applications used in the study created data in different formats (csv and xml) with additional variations when some users enabled additional features not described in the guide provided to them. These were processed and normalized by a series of Python programs. The symptom data were extracted as csv files and securely copied to the project server. Data from daily symptom reports and GPS movements were integrated using a common identifier within a PostgreSQL database. The overall diagram of data flow in the project is shown in Fig. 4.
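The normalization step described above can be illustrated with a short sketch. It is not the study's code: the paper says only that the two apps produced csv and xml files, so the CSV column names (`time`, `lat`, `lon`), the assumption that the XML is GPX 1.1, the participant identifiers, and all function names below are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumption: the XML export is GPX 1.1, which uses this namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def parse_iso(ts):
    """Parse an ISO 8601 timestamp; map a trailing 'Z' to an explicit UTC offset."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def from_csv(text, participant_id):
    """Read rows with hypothetical columns time, lat, lon into common records."""
    rows = csv.DictReader(io.StringIO(text))
    return [(participant_id, parse_iso(r["time"]), float(r["lat"]), float(r["lon"]))
            for r in rows]


def from_gpx(text, participant_id):
    """Read GPX track points into the same (id, time, lat, lon) record layout."""
    root = ET.fromstring(text)
    records = []
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        ts = pt.find("gpx:time", GPX_NS).text
        records.append((participant_id, parse_iso(ts),
                        float(pt.get("lat")), float(pt.get("lon"))))
    return records


# Toy inputs, invented for illustration only.
csv_text = "time,lat,lon\n2020-05-06T12:00:00Z,38.8300,-77.3100\n"
gpx_text = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1"><trk><trkseg>'
    '<trkpt lat="38.8305" lon="-77.3100"><time>2020-05-06T12:01:00Z</time></trkpt>'
    "</trkseg></trk></gpx>"
)

# One time-ordered list of records, ready to be loaded under a common identifier.
records = sorted(from_csv(csv_text, "P001") + from_gpx(gpx_text, "P002"),
                 key=lambda r: r[1])
```

Reducing both formats to one record layout keyed by a participant identifier is what makes the later join with the symptom reports straightforward.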

Fig. 4 Overall architecture and data flow in the presented project

Symptom cluster analysis was done in Python 3.7 and partially in Weka [32], in which the Apriori algorithm was executed to find combinations of symptoms. GPS data processing and overall GIS analysis were done in PostgreSQL (PostGIS library) and Python (geopy library). Finally, preprocessed data were moved to R for the statistical analysis.

3 Data Analysis

3.1 Symptom Clustering

To analyze reported symptoms, we first derived groups of symptoms reported together within one screening response as well as groups of symptoms reported by the same individuals regardless of time. This was done by a simple application of the Apriori algorithm [33] to the survey-level and individual-level data. Two sets of results were obtained: one based on the entire dataset and one for surveys/individuals with at least one reported symptom. The number of individuals with at least one reported symptom at some point is 1382, and the number of surveys with at least one reported symptom is 2818. Most individuals report just one symptom. The leading symptom in both individual- and survey-level data is “headache” followed by “cough” and “throat”, as summarized in Table 3, which shows top symptoms and their combinations.
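The level-wise search behind Apriori can be sketched in a few lines. The sketch below is illustrative only: the analysis in the paper was run in Weka and Python, whereas this is a from-scratch toy, and the responses, the support threshold, and the function name are invented for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Apriori-style level-wise search: return {itemset: support} for all
    itemsets whose relative support is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # level-1 candidates
    result = {}
    k = 1
    while current:
        # Keep the candidates of size k that are frequent enough.
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return result


# Hypothetical survey responses: each set holds the symptoms ticked in one response.
responses = [
    {"headache"},
    {"headache", "cough"},
    {"headache", "cough", "throat"},
    {"cough"},
    {"headache", "throat"},
    set(),
]
itemsets = frequent_itemsets(responses, min_support=0.3)
for itemset, support in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(support, 3))
```

Running the same function once on per-response sets and once on per-individual sets (the union of everything a person ever reported) mirrors the survey-level and individual-level views described above.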

Table 3 Most frequent combinations of symptoms by response and by individual

Results presented in Table 4 show that the rules found by the Apriori algorithm are all reliable (confidence = 1). This means that all individuals with the left-side symptoms always report the right-side symptom(s). For most of the top generated rules, headache was the symptom that was reported together with other symptoms or their combinations. Lift is always greater than one and ranges from 2.15 to 37.3. This indicates that, for all these rules, the likelihood of the right-side symptoms occurring increases by ((lift - 1) * 100)% if the left-side symptoms are present. For instance, based on rule 4, if an individual already has the “difficulty breathing, aches, chills, headache, and nausea” symptoms, it is 193% more likely for the individual to have “cough” as well [34].
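The relationship between confidence, lift, and the quoted percentage increase can be checked with a small worked example. The data below are invented and do not reproduce any rule from Table 4; the function name is hypothetical.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the association rule lhs -> rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("rule sides must each occur at least once")
    support = n_both / n               # P(lhs and rhs)
    confidence = n_both / n_lhs        # P(rhs | lhs)
    lift = confidence / (n_rhs / n)    # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


# Ten hypothetical individuals; each set holds the symptoms that person reported.
individuals = [
    {"cough", "throat", "headache"},
    {"cough", "throat", "headache"},
    {"headache"},
    {"headache"},
    {"cough"},
    set(), set(), set(), set(), set(),
]

support, confidence, lift = rule_metrics(individuals, {"cough", "throat"}, {"headache"})
percent_increase = (lift - 1) * 100
print(support, confidence, lift, percent_increase)
```

In this toy data everyone with cough and sore throat also reports headache, so confidence is 1, while headache occurs in 4 of 10 individuals overall, so lift is 1 / 0.4 = 2.5 and the right-side symptom is 150% more likely given the left side. By the same arithmetic, the 193% quoted for rule 4 corresponds to a lift of 2.93.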

Table 4 Selected association rules generated from the symptom data

The cluster analysis indicates that the largest group of individuals report no symptoms or a small number of symptoms (Figs. 5 and 6). The second largest cluster includes people who reported headache and additional related symptoms. Finally, the smallest of the clusters includes people who reported multiple symptoms.

Fig. 5 Clusters across entire dataset (all individuals) with frequency of symptoms (top) and cluster size (bottom)

Fig. 6 Clusters for individuals with at least one symptom with frequency of symptoms (top) and cluster size (bottom)

Finally, we looked at the group of 39 individuals who tested positive for COVID-19 in the data collected since August 18. Their symptoms are summarized in Table 5. Symptoms “before test” indicate symptoms reported prior to the reported COVID-19 test result. Symptoms “with test” indicate symptoms reported on the same response as the positive test result. The majority of positive cases report cough, loss of smell/taste, sore throat, and headache, followed by runny nose and aches. There are also 13 individuals in the data who are asymptomatic, i.e., tested positive for COVID-19 but did not report any symptoms. These findings are consistent with some of the literature on COVID-19-related symptoms [36,37,38,39].

Table 5 Summary of symptoms reported among COVID-19-positive individuals

Three clusters of individuals who tested positive for COVID-19 are shown in Fig. 7. The largest of the clusters includes individuals who are asymptomatic and those with few symptoms, while the smallest cluster includes individuals with a large number of symptoms.

Fig. 7 Clusters for individuals who tested positive for COVID-19 with frequency of symptoms (top) and cluster size (bottom)

3.2 Movement Analysis

Raw GPS data recorded by GPS trackers are imprecise due to GPS accuracy and sampling error. The data are also subject to noise that results from connection errors. For example, a number of recorded datapoints have coordinates (0,0) on the equator. There are also a number of points that “jump” to random coordinates or have random timestamps. While the number of such outliers is small, their extreme values significantly affect results. These outliers were removed using a simple approach that detects “jumps” between coordinates in three consecutive points with speed exceeding 200 km/h.
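A speed-based jump filter of this kind can be sketched as follows. This is a minimal illustration rather than the study's implementation: the record layout `(unix_seconds, lat, lon)`, the function names, and the reading of "three consecutive points" as "drop the middle point when both the speed into it and the speed out of it exceed the threshold" are assumptions. Only the 200 km/h threshold and the (0,0) artifact come from the text.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def speed_kmh(p, q):
    """Average speed between two (unix_seconds, lat, lon) points."""
    dist = haversine_km(p[1], p[2], q[1], q[2])
    hours = (q[0] - p[0]) / 3600.0
    if hours <= 0:
        # No elapsed time: treat any displacement as an impossible speed.
        return float("inf") if dist > 0 else 0.0
    return dist / hours


def remove_jumps(points, max_kmh=200.0):
    """Drop (0, 0) artifacts, then drop every interior point that is reached
    AND left at more than max_kmh. Input must be sorted by time."""
    pts = [p for p in points if not (p[1] == 0.0 and p[2] == 0.0)]
    kept = []
    for i, p in enumerate(pts):
        if 0 < i < len(pts) - 1:
            speed_in = speed_kmh(pts[i - 1], p)
            speed_out = speed_kmh(p, pts[i + 1])
            if speed_in > max_kmh and speed_out > max_kmh:
                continue  # isolated jump between two plausible neighbours
        kept.append(p)
    return kept


# Toy track, one point per minute, with one far-away jump and one (0, 0) artifact.
track = [
    (0, 38.8300, -77.3100),
    (60, 38.8305, -77.3100),
    (120, 40.0000, -75.0000),   # jump
    (180, 38.8310, -77.3100),
    (240, 0.0, 0.0),            # connection-error artifact
    (300, 38.8315, -77.3100),
]
clean = remove_jumps(track)
```

Requiring both the incoming and the outgoing speed to be implausible keeps the legitimate neighbours of a jump, which would otherwise also appear to move too fast.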

The extracted GPS data include a total of 3.3 million distinct coordinates collected over 14.6 thousand person-days. On average, one individual in the study traveled 224 km every week and visited 6 locations outside home. These numbers, however, have a high standard deviation due to outliers who traveled more than typical study participants. The summary of GPS movement data is presented in Table 6.
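Summaries such as distance traveled and number of locations visited can be derived from a cleaned track roughly as sketched below. The paper does not describe how visited locations were detected, so the anchor-based stay detection and the 100 m radius are assumptions made for illustration; the 10-minute minimum matches the caption of Fig. 3. The record layout `(unix_seconds, lat, lon)` and all names are hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def total_distance_km(points):
    """Sum of segment lengths between consecutive points."""
    return sum(haversine_m(a[1], a[2], b[1], b[2])
               for a, b in zip(points, points[1:])) / 1000.0


def stay_points(points, radius_m=100.0, min_seconds=600):
    """Find runs of consecutive points that remain within radius_m of the
    first point of the run for at least min_seconds.
    Returns a list of (start_time, end_time, mean_lat, mean_lon)."""
    stays = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        while j < n and haversine_m(points[i][1], points[i][2],
                                    points[j][1], points[j][2]) <= radius_m:
            j += 1
        run = points[i:j]
        if run[-1][0] - run[0][0] >= min_seconds:
            lat = sum(p[1] for p in run) / len(run)
            lon = sum(p[2] for p in run) / len(run)
            stays.append((run[0][0], run[-1][0], lat, lon))
            i = j        # continue after the stay
        else:
            i += 1       # not a stay; slide the anchor forward
    return stays


# Toy track: 11 minutes at place A, a short trip north, then 10 minutes at place B.
track = [(k * 60, 38.8300, -77.3100) for k in range(12)]
track += [(720, 38.8350, -77.3100), (780, 38.8400, -77.3100), (840, 38.8450, -77.3100)]
track += [(900 + k * 60, 38.8500, -77.3100) for k in range(11)]

stays = stay_points(track)
distance = total_distance_km(track)
```

Aggregating `distance` per participant and day, and counting the entries in `stays` that are not the home location, would give quantities comparable to those plotted in Figs. 2 and 3.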

Table 6 Summary of GPS movement data for N = 162 individuals between May 6, 2020, and September 6, 2020

3.3 Symptom-Movement Relation

In order to understand the movement pattern in relation to the reported symptoms, test result status, and whether individuals have been in close contact with a person with possible COVID-19 infection, we constructed a number of different linear models with (R1) the distance traveled (in km) next day and (R2) the decrease in distance traveled (difference between distance traveled (in km) next day and the day of symptom entry) as response. For these models, we dropped the missing values and analyzed only complete cases in the data, resulting in usable data for 95 individuals. Both response variables, namely, the distance traveled next day and the difference in distance traveled, have skewed distributions. Therefore, the variables were transformed using a Box-Cox power transformation to make their distributions approximately normal. To adjust for the dependence in data due to repeated observations from the same subject, we fitted a linear mixed model with a random effect due to the subject. All the other 21 covariates related to the journal entries were assumed to have a fixed linear effect on the response. Using all the covariates separately in the model yields unstable estimators due to extreme multicollinearity present in the covariates. To solve this issue, we used different lower dimensional versions of the fixed effect covariates and fitted the following models for both responses.
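The effect of a Box-Cox transformation on a skewed response can be illustrated with a short, dependency-free sketch. The statistical analysis in the paper was carried out in R; the code below is only a toy in Python, the distances are invented, and the grid search over lambda is an assumption about how the power could be chosen. The paper does not say how zero or negative values (possible for the difference response R2) were handled, so the sketch assumes strictly positive inputs.

```python
import math


def boxcox(y, lam):
    """Box-Cox transform of strictly positive values: log(y) when lam is 0,
    otherwise (y**lam - 1) / lam."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    n = len(y)
    z = boxcox(y, lam)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in y)


def best_lambda(y, grid=None):
    """Pick the lam on a grid (default -2.0 to 2.0 in steps of 0.1) that
    maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]
    return max(grid, key=lambda lam: profile_loglik(y, lam))


def skewness(x):
    """Sample skewness (third standardized moment)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return (sum((v - mean) ** 3 for v in x) / n) / var ** 1.5


# Hypothetical next-day distances in km: mostly short trips plus one long one.
distances = [1.2, 2.5, 3.1, 4.0, 5.5, 7.9, 12.0, 18.4, 35.0, 120.0]

lam = best_lambda(distances)
transformed = boxcox(distances, lam)
print(lam, skewness(distances), skewness(transformed))
```

The transformed values would then serve as the response in a linear mixed model with a subject-level random intercept, which is the model structure described above.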

3.3.1 Factors Using Multiple Factor Analysis

To reduce the dimension of the covariate space, we performed a multiple factor analysis (which can be interpreted as principal component analysis for categorical data) on the 21 covariates. In particular, this method identifies the important features in the covariate space by transforming the original variables (presence/absence of categorical variables) into a new set of variables (factors), which are uncorrelated and ordered so that the first few account for most of the variation in the data. We then fitted linear mixed models for both responses with a subject-specific random effect and fixed linear effects of the first two components. The contribution of each of the variables is shown in Fig. 8. The first factor is affected mostly by loss of smell and taste, and the prominent contributing variables in the second factor are chills, chest pain, and nausea.

Fig. 8 The contribution of the individual symptoms in the first two components of the factor analysis

3.3.2 Number of Symptoms

The explanatory variables can be divided into two categories. There are two variables assessing the test status of the participant (whether they are awaiting a test result and whether they had a negative test result) and two variables summarizing whether they have been in close contact with anyone who had either COVID-19 or COVID-19 symptoms without a definite test result. All the other 17 variables are different health symptoms related to COVID-19. We summed over these 17 variables to obtain the total number of symptoms reported every day. We then used this total number of symptoms instead of the 17 categorical variables in our model (Model ST). In order to identify whether a particular level of severity affects mobility, we also used categorical versions of this total symptom variable, identifying whether the person has reported at least k symptoms, for k = 1, 2, 3, and 4 (Models S1, S2, S3, and S4, respectively). All these models included the subject-specific random effects and fixed linear effects of the four variables related to test result status and possible contact, as shown in Table 8 for (R1) and Table 9 for (R2).
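The construction of the symptom-count covariates can be sketched as follows. The feature names ST and S1 to S4 follow the model labels above; the function name, the journal-entry layout, and the shortened symptom list (four keys instead of the 17 used in the study) are hypothetical and chosen only to keep the example small.

```python
def symptom_features(entry, symptom_keys, thresholds=(1, 2, 3, 4)):
    """Collapse binary symptom indicators into a total count (ST) and
    'at least k symptoms' indicators (S1..S4)."""
    total = sum(int(entry.get(key, 0)) for key in symptom_keys)
    features = {"ST": total}
    for k in thresholds:
        features[f"S{k}"] = int(total >= k)
    return features


# Illustrative subset of symptom variables; the study used 17.
symptom_keys = ["headache", "cough", "throat", "chills"]

# One hypothetical daily journal entry; non-symptom fields are ignored by the count.
entry = {"headache": 1, "cough": 1, "throat": 0, "chills": 0, "awaiting_test": 1}

features = symptom_features(entry, symptom_keys)
print(features)
```

Each of the five variants replaces the full set of symptom indicators with a single covariate, which sidesteps the multicollinearity among the individual symptoms while the test-status and contact variables stay in the model unchanged.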

The results indicate that there is no evidence that people change their movement pattern when experiencing symptoms consistent with COVID-19. The factor analysis indicates that the loss of smell and taste affects the first component, while the second component is influenced by symptoms such as chills, chest pain, and nausea. But the models (Table 7) do not indicate any association of these components with the change in distance traveled. The models based on the number of symptoms also do not indicate any association of mobility pattern with physical symptoms. The actual distance traveled on the day after the reported symptom (Table 8) is not significantly affected by any reported symptom, test result status, or contact with confirmed or suspected COVID-19 patients. However, if we consider the change in distance traveled on the day after the symptom entry as compared to the previous day (Table 9), the contact with a confirmed COVID-19 case is marginally statistically significant, with p value ranging from 0.05 to 0.12 for the five different models under consideration. Apart from this mild association with contact with confirmed COVID-19 cases, none of the models indicates a significant effect of physical symptoms, or of the possibility of spreading the infection, on the change in distance traveled, which suggests that self-quarantine recommendations are not being followed.

Table 7 Models for mobility pattern based on the factors as covariates
Table 8 Models for distance traveled (in km) next day using number of symptoms as covariates
Table 9 Models for decrease in distance traveled (in km) next day using number of symptoms as covariates

4 Conclusion

At the time of writing this paper, the COVID-19 pandemic continues to evolve. Despite a tremendous amount of work across the globe, it is unclear if a vaccine will be available to the public or if other measures will be used country-wide and internationally. It is our belief that there will be a need to monitor symptoms, contacts, and movements of people for some time in order to slow the spread of the disease.

The presented work illustrates a small portion of the efforts by a large public university to monitor and prevent the spread of COVID-19 among its students, faculty, staff, and visitors. We focused here only on a portion of the symptom monitoring efforts and on research that aims at understanding the movement of individuals in relation to their reported symptoms and contacts.

There are a number of interesting observations in the obtained results. Headache is by far the most frequently reported symptom, always reported by individuals who report any symptom. Individuals’ movement patterns are very diverse but clearly related to imposed movement restrictions. Some stay at home and leave only when they have to, while others tend to move freely without any limitations. However, the most important result is that there is no evidence that reported symptoms affect movement. This means that people who report symptoms consistent with COVID-19 do not self-quarantine but continue to move as before. These observations have significant implications, particularly for providing the right information and educating individuals. Further, the university currently requires every individual to self-report symptoms daily. Those who report any symptoms are not allowed on campus.

The presented results are a combination of data collected as part of university efforts to control COVID-19 spread and a research project that aims at studying social distancing. In this work, we were unable to link movements to the actual COVID-19 cases, because of very few cases within the university population (N = 39) and none in the movement study cohort. Such linking would employ supervised learning to construct models for predicting risk of infection based on symptoms and movements. Instead, we focused on describing the population and linking overall symptoms to movements as a simple way of studying social distancing. We collected a unique longitudinal dataset of movements linked to symptoms for about 170 individuals who participated in the study. Such data allow for a detailed individual-level analysis, in contrast to the large number of aggregate studies recently published.

The presented work has several limitations. First, the university cohort may not be generalizable to the general population. To address this concern, we plan to compare aggregated movement patterns to large-scale aggregated data from the general population (SafeGraph). Further, at the time of writing this manuscript, very few COVID-19 positive cases were reported in the university population. A surge of COVID-19 positive cases is expected in late November and December 2020. If this is the case, we will be able to create models that link symptoms with COVID-19 diagnosis.

Beyond work to address limitations, our current efforts focus on a more detailed analysis of the results. We are linking the GPS data to landscape information extracted from OpenStreetMap [35]. We are also reconstructing detailed movement trajectories of individuals in between GPS locations and constructing models to predict movements. In addition, we are reconstructing sequences of reported symptoms for COVID-19 positive cases. We are extending the cohort of participants that will be followed until spring 2021, which will allow for higher power analysis and longer follow-up period. Finally, we will be conducting surveys and semi-structured interviews to understand reasons for compliance and non-compliance with social distancing.