In this section, we present our experimental results and we answer our research questions.
What are Covid-Related Apps Used for?
Motivation
The sudden increase in Covid-related apps during the pandemic shows that Android app developers have been active in providing end-users with solutions to address the COVID-19 pandemic. Nevertheless, the functionalities of Covid-related apps are not known and have not been studied. In this section, we study and characterize Covid-related apps, and build the categories to which they belong. The categorization of Covid-related apps offers a first layer of knowledge toward understanding them. The outcome of this research question will give the general public an overview of Covid-related apps’ functionalities.
Strategy
Textual descriptions of apps on markets generally provide a wealth of information on the purpose and functionalities that developers advertise. We undertake to systematically examine the descriptions of all the apps under study. Unfortunately, since Google Play is actively moderating Covid-related apps, some apps that we were initially able to collect were no longer available on the market at the time of analysis. Eventually, our analysis of descriptions was performed on 78 apps. In other words, between the time we read the descriptions of the apps to curate our dataset (as explained in Section 3.1) and the time we performed this more in-depth study, i.e., collecting information related to the features of the apps, 14 apps (92 − 78) became unavailable on Google Play.
A Taxonomy of Covid-Related Apps
After a careful analysis of information available in Google Play, we summarize for each app its general goal, i.e., which aspects of the COVID-19 crisis the app is precisely intended to address. Eventually, we identified three main categories to which each app can be associated with possible overlap between categories, i.e., an app can be associated with several categories:
1. Information broadcast (top-down) - Apps in this category aim to provide users with various types of information, from general guidelines and infection statistics to general COVID-19 news. Although such apps are not always officially released by government bodies, they often relay official information from top (authorities) down (users).
2. Upstream collection (bottom-up) - Apps in this category collect information from users and make it available to the developer and/or an official body, such as a country’s health authorities.
3. Tooling - Apps in this category serve as tools with functionalities that directly deal with daily aspects of the COVID-19 crisis (e.g., generation of certificates).
[1] Information broadcast
From the collected dataset of Covid-related apps we identified several distinguishing scenarios in apps performing information broadcasting. Figure 8 overviews the related characteristics, notably based on the types of information that are made available to the user:
1. Guidelines on measures to take to minimize the risk of infection - Among such apps, some render maps highlighting high-risk areas. Other apps provide behavioral advice (e.g., how to wash hands), leveraging the whole spectrum of available media: (1) textual descriptions (for the majority of apps), (2) videos, and (3) audio clips.
2. Continuously updated statistics on the pandemic evolution.
3. General information about COVID-19, such as the typical symptoms. We identified two different scenarios in the provision of general information:
   - Some apps present curated information, i.e., information that is checked and filtered by the development team before it is shown to the public. Such information is often tagged in a way that allows interested people to find the source and gauge its credibility. Sometimes, these apps are developed directly by an entity that itself carries credibility as a source of information, such as national healthcare authorities. An example is the MyHealth Sri Lanka app (footnote 10) developed by the national ICT Agency, which presents verified information on the current COVID-19 status to the user.
   - A number of apps appear to provide unfiltered information regarding COVID-19. Their developers are not always entities that would traditionally be assumed to have any specific credibility on the matter. For example, the DiagnoseMe app (footnote 11), which claims to provide the user with all the information on the virus, is proposed by an association with no recorded expertise in health.
[2] Upstream Collection
Most apps in our dataset perform data collection from users. This suggests that many app providers consider data to be key in the mitigation of the COVID-19 crisis. App providers indeed collect a variety of information, including personal information (e.g., name, age, address) and some medical information (e.g., whether a user is infected with COVID-19, the therapies that are used). Some apps are even used to keep a health diary (sharing information about symptoms every day), or to report the infection of people among the app user’s acquaintances.
Overall, we have identified three different ways in which apps collect user data, as summarized in Fig. 9. Note that in the case of data collection and spread tracking apps, we did not try to qualify whether apps were as privacy-preserving as their developers claimed they were (e.g., data is deleted after N days), nor to determine to what extent the collected data is shared with third parties.
Similarly, for this paper, we did not analyze the inner workings of contact-tracing apps, nor did we evaluate the merit or the opportunity of contact-tracing, a topic that has already been, and still is, discussed by security researchers (Culnane 2020; Anderson 2020; Baumgärtner et al. 2020).
Several apps take inputs from the users to offer diagnoses related to COVID-19. Such apps can provide a built-in questionnaire that users have to fill in within the app, or leverage a virtual assistant or chatbot. In these cases, the diagnosis can be made automatically, without any interaction with, or confirmation from, a trained medical practitioner.
Other apps, however, provide a somewhat more traditional medical visit experience, by offering the facilities needed to remotely exchange (e.g., via instant text messages as well as voice and/or video calls) with a medical doctor. Such apps are used from home, since millions of people worldwide were confined, and were potentially reluctant or unable to visit a brick-and-mortar doctor’s office.
Additionally, some apps are developed to track the spread of the virus by locating the users of the apps. While a few of those apps use simple geo-monitoring with GPS information for tracking users, most apps do it automatically. Nevertheless, we found a few apps that ask users to provide, a posteriori, the locations they have visited on a given day. We also identified one app which uses QR code scanning at the entrance of public buildings to obtain precise location information, while still being fully under users’ control.
With respect to tracing, a few apps promote social-distancing using the GPS location of users, the goal being to not approach other people too closely.
Furthermore, several apps implement contact-tracing, i.e., the ability to retrieve who a specific person has been in contact with, providing users a way to know if they have encountered someone infected, and potentially infectious. Contact-tracing apps mainly rely on three methods: (1) using the GPS location of users, (2) using Bluetooth technology to detect proximity, and (3) using a location diary that the users have to fill in manually.
[3] Tooling
The last category is the tooling category which includes several types of tools aimed at helping users deal with some consequences of the COVID-19 crisis (see Fig. 10). A few apps allow users to auto-generate documents for their local authorities (e.g., travel authorization that had been made mandatory in several countries during containment).
Users can also install apps offering appointment-capabilities for medical purposes, or selling Covid-related products (e.g., masks, hand-sanitizers, etc.).
On the entertainment front, apps were released proposing games around the pandemic, or providing users with COVID-19-themed image filters, for example adding a virtual mask, or adding virtual decorative elements to an actual mask.
Lastly, apps were also made to cater to the newly-discovered needs of massive remote education.
The interested reader can inspect Tables 2, 3 and 4 for more information about the mapping between categories and the apps for which we were able to retrieve the relevant information.
Table 2 First part of Covid-related apps’ characteristics retrieved from Google Play apps pages
Table 3 Second part of Covid-related apps’ characteristics retrieved from Google Play apps pages
Table 4 Third part of Covid-related apps’ characteristics retrieved from Google Play apps pages
Table 2 gives the list of the 78 Covid-related apps for which a Google Play page existed, which we gathered during this study, together with the data we were able to extract from them. The second column gives the country of origin of each app, the third column gives information about the type of developer of each app (e.g., governmental, researcher, company, etc.) and the fourth column gives the target of each app (e.g., citizens, journalists, etc.). The rest of the table covers the first category of our taxonomy: each column is a leaf node of the Information Broadcast branch of the taxonomy (see Fig. 8). A check mark indicates that the corresponding app belongs to this category. Table 3 also lists the 78 Covid-related apps, but it shows the mapping between each app and the upstream branch (see Fig. 9). Finally, Table 4 lists the 78 Covid-related apps with the remaining leaf nodes of the upstream branch and the tooling branch (see Fig. 10).
Do Covid-Related Apps Have Specific Characteristics?
Motivation
After categorizing Covid-related apps from their descriptions, an in-depth analysis is needed to better understand how they work compared to standard apps. This section aims at comparing Covid-related apps and standard apps from a technical point of view to provide insights for future research. To that end, we extract Android app-related features (e.g., GUI components, permissions, libraries, etc.). The outcome of this research question will provide the reader with detailed information about Covid-related apps. In particular, it indicates whether Covid-related apps are more prone to track users and/or display advertisements than standard apps.
Strategy
In prior work, Tian et al. (2015) have shown that specific sets of apps can have similar characteristics (e.g., similar permissions, components, size, etc.). In this section, we investigate to what extent the 92 Covid-related apps form one coherent group that is significantly different from other apps.
To that end, for each app, we counted the number of different Android components (i.e., Activities, Broadcast Receivers, Services, and Content Providers), computed the size of the dex file, extracted the permissions needed as well as the libraries used.
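The feature extraction described above can be illustrated with a minimal sketch. It is not the tooling used in the study: it assumes the manifest has already been decoded to plain-text XML (the manifest inside an APK is stored in a binary format), and it sums the sizes of all `classes*.dex` entries, which is an assumption for multi-dex apps. The sample manifest and the names `summarize_manifest` and `dex_size` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# Namespace prefix of Android manifest attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Manifest tags of the four Android component types.
COMPONENT_TAGS = {
    "activity": "Activities",
    "receiver": "Broadcast Receivers",
    "service": "Services",
    "provider": "Content Providers",
}


def summarize_manifest(manifest_xml):
    """Count components and list permissions of a decoded AndroidManifest.xml."""
    root = ET.fromstring(manifest_xml)
    application = root.find("application")
    counts = {label: 0 for label in COMPONENT_TAGS.values()}
    if application is not None:
        for tag, label in COMPONENT_TAGS.items():
            counts[label] = len(application.findall(tag))
    permissions = sorted(
        elem.get(ANDROID_NS + "name")
        for elem in root.findall("uses-permission")
        if elem.get(ANDROID_NS + "name")
    )
    return {"components": counts, "permissions": permissions}


def dex_size(apk_file):
    """Total uncompressed size, in bytes, of the classes*.dex entries of an APK."""
    with zipfile.ZipFile(apk_file) as apk:
        return sum(
            info.file_size
            for info in apk.infolist()
            if re.fullmatch(r"classes\d*\.dex", info.filename)
        )


# Hypothetical, hand-written manifest used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.covidapp">
  <uses-permission android:name="android.permission.WAKE_LOCK"/>
  <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION"/>
  <application>
    <activity android:name=".MainActivity"/>
    <activity android:name=".StatsActivity"/>
    <service android:name=".TracingService"/>
    <receiver android:name=".BootReceiver"/>
  </application>
</manifest>"""

summary = summarize_manifest(SAMPLE_MANIFEST)
```

In practice, a static analysis framework would decode the binary manifest directly; the sketch only shows which manifest elements the counted features correspond to.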
Comparison dataset
To compare the characteristics of Covid-related apps with those of other apps, we randomly selected 100 apps from each of 10 different Google Play categories. Those 1000 (10 x 100) apps are sampled from the same time span (i.e., they come from the same initial dataset) to ensure that time is not a factor in potential differences.
Google Play contains dozens of categories; we therefore decided to compare our set of Covid-related apps against apps from the categories that intersect those of our Covid-related apps. Table 5 shows the categories of Covid-related apps we were able to retrieve. Note that we were able to get the category of 87 of the 92 Covid-related apps.
Table 5 Categories of Covid-related apps and the number of apps in each category
Android Components
Figure 11 depicts differences between apps in different categories and our set of Covid-related apps regarding the number of components included in the app. We notice that Covid-related apps tend to use fewer Activities than the other apps. This difference is confirmed to be statistically significant by a Mann-Whitney-Wilcoxon (MWW) test (footnote 12) (Mann and Whitney 1947; Wilcoxon 1945). Regarding Services (used for background tasks), we can see that, apart from the category “Shopping”, Covid-related apps tend to use more Service components. Regarding Broadcast Receivers, however, the difference is less marked, although its statistical significance is confirmed by an MWW test. Finally, the median number of Content Providers in Covid-related apps is in most cases equal to the median number of Content Providers of apps in different categories (i.e., 2 Content Providers). An MWW test found no statistically significant difference (except for the Entertainment category).
Overall, the differences, which are most pronounced for Activities, suggest that Covid-related apps differ from other apps (of the same category) in terms of GUI layout. With fewer Activities, Covid-related apps may have a less complex GUI than other apps. The slightly higher use of Services hints that Covid-related apps are more data-centric than other apps (in the same categories).
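To make the MWW comparisons in this section concrete, the following is a minimal, standard-library sketch of a two-sided Mann-Whitney U test using the normal approximation with tie correction and no continuity correction. It is an illustration rather than the exact procedure used in the study, and the component counts in the example are made up.

```python
import math


def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns the U statistic of xs and the approximate two-sided p-value.
    """
    n1, n2 = len(xs), len(ys)
    combined = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])

    # Assign average ranks to runs of tied values.
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u_x, p_value


# Made-up numbers of Activities per app, for illustration only.
covid_activities = [3, 5, 4, 6, 2, 7, 5]
other_activities = [12, 9, 15, 8, 20, 11, 14]
u_stat, p_value = mann_whitney_u(covid_activities, other_activities)
```

A small p-value (conventionally below 0.05) indicates that the two samples are unlikely to come from the same distribution; for small samples an exact test is preferable to the normal approximation.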
Dex Files Size
Figure 12 shows the distributions of the dex sizes of Covid-related apps and the apps from the ten different categories. It shows that the median dex size of Covid-related apps is close to that of other apps in general. The MWW test confirms no statistically significant difference between the distributions of app size. However, the maximum dex size is higher for Covid-related apps than for other apps, hinting at more variability in terms of app size amongst Covid-related apps.
Permissions
In Table 6, we compare the permissions used by Covid-related apps and the permissions of other apps per app category. To that end, we extracted for all sets of apps the top ten most requested permissions. First, a notable difference is that Covid-related apps tend to use the wake_lock permission more than standard apps. This permission is used for preventing the screen of the device from being turned off, and/or to ensure an app remains active. Such a feature is often used for keeping the phone awake while locating the phone (e.g., for contact tracing). In the same way, access_fine_location and access_coarse_location tend to be used more by Covid-related apps. This is in line with the use of the wake_lock permission to facilitate user location tracking.
Table 6 Top ten most requested permissions in Covid-related apps and other apps per category. Percentage indicates the ratio of apps using the permission
Figure 13 shows the distribution of the number of permissions requested by Covid-related apps and other apps per category. The MWW tests revealed no statistically significant difference between the number of permissions used by Covid-related apps and by other apps.
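The ranking of most requested permissions reported in Table 6 amounts to a simple tally, sketched below. The app names and permission sets are hypothetical, and each permission is counted once per app, which is an assumption about how the ratios were computed.

```python
from collections import Counter


def top_permissions(apps, k=10):
    """Return the k most requested permissions with the percentage of apps using each."""
    if not apps:
        return []
    counts = Counter()
    for permissions in apps.values():
        counts.update(set(permissions))  # count each permission once per app
    total = len(apps)
    return [(perm, 100.0 * n / total) for perm, n in counts.most_common(k)]


# Hypothetical apps and permissions, for illustration only.
sample_apps = {
    "app_a": ["INTERNET", "WAKE_LOCK", "ACCESS_FINE_LOCATION"],
    "app_b": ["INTERNET", "WAKE_LOCK"],
    "app_c": ["INTERNET", "WAKE_LOCK", "ACCESS_FINE_LOCATION", "CAMERA"],
    "app_d": ["INTERNET"],
}
ranking = top_permissions(sample_apps, k=3)
```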
Libraries
To compare the patterns of libraries inclusion, we measure the use of libraries by relying on a collection of well-known libraries. More specifically, we re-use two lists of libraries established in prior works (Li et al. 2016, 2019): a list of 1 114 common libraries and a list of 240 advertisement libraries.
Therefore, for Covid-related apps and our dataset of apps by category, we computed the number of apps using at least one common library and the number of apps using at least one advertisement library.
Table 7 presents our results. First, we notice that almost all the apps (Covid-related and other) use common libraries, which is not surprising since Android software development, just like non-mobile software development, heavily relies on reusable libraries and frameworks.
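As a rough illustration of this counting, the sketch below flags an app as using a library when one of its package names equals, or is nested under, a listed library prefix. Matching by package prefix is an assumption, since the matching procedure is not detailed here; the app package lists and the common-library prefixes are hypothetical, while the three advertisement prefixes are those reported in this section.

```python
def uses_library(app_packages, library_prefixes):
    """True if any package of the app equals or is nested under a library prefix."""
    return any(
        pkg == prefix or pkg.startswith(prefix + ".")
        for pkg in app_packages
        for prefix in library_prefixes
    )


def count_apps_using(apps, library_prefixes):
    """Number of apps that include at least one library from the given list."""
    return sum(uses_library(pkgs, library_prefixes) for pkgs in apps.values())


# Advertisement library prefixes found in Covid-related apps according to the text.
AD_LIBS = {"com.facebook", "com.startapp.android", "com.flurry"}
# Hypothetical stand-in for the much longer list of common libraries.
COMMON_LIBS = {"com.google.gson", "okhttp3"}

# Hypothetical apps, each described by the Java packages found in its code.
sample_apps = {
    "app_a": ["org.example.a", "com.google.gson", "com.flurry.sdk"],
    "app_b": ["org.example.b", "okhttp3.internal"],
    "app_c": ["org.example.c", "com.facebookish.fake"],
}

common_count = count_apps_using(sample_apps, COMMON_LIBS)
ad_count = count_apps_using(sample_apps, AD_LIBS)
```

Requiring an exact match or a `prefix + "."` match avoids false positives such as `com.facebookish`, which merely shares leading characters with `com.facebook`.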
Table 7 Number of Covid-related/other apps using libraries. (C: Communication, E: Entertainment, H&F: Health & Fitness, L: Lifestyle, P: Productivity, M: Medical, SP: Shopping, S: Social)
However, the difference is significant regarding the advertisement libraries. Indeed, while advertisement libraries are used by more than 80% of other apps, they appear in fewer than 20% of Covid-related apps. Furthermore, only 3 out of 240 advertisement libraries are used in Covid-related apps, namely: (1) com.facebook, (2) com.startapp.android and (3) com.flurry. This strongly suggests that the primary goal of Covid-related apps is not to obtain financial gain from advertisement, in contrast to the vast majority of standard apps.
Are Covid-Related Android Apps More Complex Than Standard Apps?
Motivation
In Section 4.1, we have seen that Covid-related apps cover a large variety of categories and target various objectives (e.g., informing users, collecting data from users, etc.). The code complexity that is necessary to achieve these objectives may thus vary substantially. To investigate this aspect, we compute several standard metrics used in the state-of-the-art literature, and further assess the potential differences between Covid-related apps and other apps. Insights from this research question can improve developers’ knowledge and serve as the basis for future empirical research on code quality.
Strategy
App complexity is an elusive concept. Yet, in the literature, there are various studies that propose metrics to measure some form of complexity and attempt to show its correlation with app quality and maintainability (Jošt et al. 2013; Gao et al. 2019b). We undertake to investigate our research question based on these common metrics from the literature (Chidamber and Kemerer 1994). We provide in Appendix A the descriptions of the complexity metrics we use.
In this study, the data used for computing the complexity metrics are extracted at the smali code level. The apps are loaded with Androguard (2020), a static analysis tool for Android apps.
The different metrics attempt to capture the Lack of Cohesion in Methods (LCOM), the Weighted number of Methods per Class (WMC), the number of methods invoked per class, i.e., the Response For a Class (RFC), the Coupling Between Object classes (CBO) and the Number Of Children per class (NOC).
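The five metrics can be illustrated on a toy class model. The sketch below follows the classic Chidamber and Kemerer formulations under simplifying assumptions that may differ from the definitions in Appendix A: every method has weight 1 in WMC, CBO only counts outgoing method calls, and LCOM is max(|P| - |Q|, 0) where P and Q are the method pairs that share no field and at least one field, respectively. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Set


@dataclass
class Method:
    name: str
    fields_used: Set[str] = field(default_factory=set)  # instance fields accessed
    calls: Set[str] = field(default_factory=set)        # invoked methods, as "Class.method"


@dataclass
class ClassInfo:
    name: str
    parent: Optional[str] = None
    methods: List[Method] = field(default_factory=list)


def wmc(cls):
    """Weighted Methods per Class, with every method given weight 1."""
    return len(cls.methods)


def noc(cls, all_classes):
    """Number Of Children: classes that directly extend cls."""
    return sum(1 for other in all_classes if other.parent == cls.name)


def rfc(cls):
    """Response For a Class: own methods plus the distinct methods they invoke."""
    own = {f"{cls.name}.{m.name}" for m in cls.methods}
    invoked = set().union(*(m.calls for m in cls.methods))
    return len(own | invoked)


def cbo(cls):
    """Coupling Between Objects, restricted here to outgoing method calls."""
    targets = {call.rsplit(".", 1)[0] for m in cls.methods for call in m.calls}
    return len(targets - {cls.name})


def lcom(cls):
    """Lack of Cohesion in Methods: max(|P| - |Q|, 0) over all method pairs."""
    disjoint = sharing = 0
    for a, b in combinations(cls.methods, 2):
        if a.fields_used & b.fields_used:
            sharing += 1
        else:
            disjoint += 1
    return max(disjoint - sharing, 0)


# Hypothetical classes, for illustration only.
tracker = ClassInfo("Tracker", methods=[
    Method("start", {"gps"}, {"Uploader.send"}),
    Method("stop", {"gps"}),
    Method("report", {"log"}, {"Uploader.send", "Formatter.format"}),
])
bluetooth_tracker = ClassInfo("BluetoothTracker", parent="Tracker")
classes = [tracker, bluetooth_tracker, ClassInfo("Uploader"), ClassInfo("Formatter")]
```

For the hypothetical `Tracker` class, the three methods call two other classes and only `start` and `stop` share a field, so two of the three method pairs are disjoint.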
Figure 14 presents the distributions of metric values.
NOC appears to follow a similar distribution across standard and Covid-related apps, which is confirmed by an MWW test. However, the MWW test revealed significant differences between the distributions of Covid-related apps and standard apps for the other metrics (footnote 13).
Furthermore, Fig. 14 distinctly shows that the medians of the complexity metrics of Covid-related apps are below those of standard apps, which hints at a lower complexity.
We note that obfuscation is a factor that can have an impact on Android app studies, especially when app complexity is computed from smali code. Our set of Covid-related apps contains 2 apps (2.17%) with obfuscated code. To determine whether an app uses obfuscated code, we rely on APKiD (footnote 14). The obfuscation rate of apps in each category is given in Table 8.
Table 8 Rate of apps obfuscated by category
At first sight, we can see in Table 8 that for some categories, there is a high number of apps that contain obfuscated code, which suggests that the metrics computed can be biased by the obfuscation rates.
We therefore conducted the same comparisons, but based on random sets of non-obfuscated apps. The conclusions remain the same.
Overall, these results establish that Covid-related apps are, to some extent, less complex than standard apps. According to Jošt et al. (2013), this result suggests that Covid-related apps may be more maintainable and of better quality. Additionally, we note that a lower complexity could also indicate that Covid-related apps have on average fewer functionalities and/or are focused on more specific goals, as was already hinted above in the comparison of permission usage.
To What Extent Were Covid-Related Apps Removed From the Official Google Play And Why?
Motivation
Developers have to comply with strict Google policies (Google 2020b) before submitting an app to Google Play. The unprecedented COVID-19 crisis led Google to release new policies regarding Covid-related apps that are candidates for Google Play (Google 2020a), under which Google performs supplementary checks (e.g., reducing misinformation by favoring official sources). With this research question, we aim to check to what extent Google actually applied strict policies in Google Play. The outcome of this research question will open new research avenues for app policy modeling, and may shed light on Google’s vetting processes for developers.
Strategy
We have seen in Section 4.1 that during our analyses, some Covid-related apps disappeared from the official Google Play in a matter of days.
Therefore, for each app that was initially identified at the beginning of our study, we queried the Google Play market, at the time of writing, to check if the app is still available. Around 15% of Covid-related apps (i.e., 14 apps) have been removed from Google Play.
In comparison, among 1675 standard apps taken randomly from our initial dataset (see Section 2), we found that 277 (i.e. 16.54%) apps were removed from the Google Play market.
The removal rates of both app datasets are close. Actually, we expected a much higher removal rate for Covid-related apps. This relatively low ratio of removal for Covid-related apps could be explained in several ways:
- Google either enforces its policy very quickly or pre-screens each app that is potentially relevant to COVID-19 (i.e., before it is accepted on the market); in that case, apps would either never make it to the market, or would be removed too quickly for AndroZoo crawlers to catch them;
- App developers rapidly adapted to Google’s policy, and/or very few developers proposed apps that conflict with Google’s policy.
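The removal rates reported above can be recomputed from the raw counts. The sketch below also adds a pooled two-proportion z statistic as one possible way to quantify how close the two rates are; this check is an illustration and is not part of the analysis described in this section.

```python
import math


def removal_rate(removed, total):
    """Fraction of apps that are no longer available."""
    return removed / total


def two_proportion_z(removed_a, total_a, removed_b, total_b):
    """Pooled two-proportion z statistic for the difference between two rates."""
    p_a = removed_a / total_a
    p_b = removed_b / total_b
    pooled = (removed_a + removed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / std_err


# Counts taken from the text: 14 of 92 Covid-related apps, 277 of 1675 standard apps.
covid_rate = removal_rate(14, 92)
standard_rate = removal_rate(277, 1675)
z = two_proportion_z(14, 92, 277, 1675)
print(f"Covid-related: {covid_rate:.2%}, standard: {standard_rate:.2%}, z = {z:.2f}")
```

An absolute z value below 1.96 means the difference between the two rates is not significant at the 5% level under this approximation, which is consistent with the observation that the removal rates are close.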
Who Are Covid-Related Apps’ Developers?
Motivation
In Section 3.2, we have seen that the number of Covid-related apps increased drastically from March 2020. The important information behind this is that many entities quickly responded to the pandemic by providing users with specific Android apps with different purposes (e.g., information, contact tracing, health guide, etc.). However, the nature of these entities was not readily available. In this section, we investigate their type further by mining description data and following various links. In particular, we focus on retrieving the origin of those apps to see which countries quickly reacted to the pandemic by providing end-users with mobile app services. The outcome of this research question will give the general public a glimpse of the distribution of apps by country. Besides, by exposing the type of developers and the origin of Covid-related apps, we encourage future research such as studies of code reuse across apps and countries, plagiarism between apps, and the correlation between app releases and the number of COVID-19 cases.
Strategy
On Google Play, the web page of each app (footnote 15) contains a developer field that provides the name of the person or entity (e.g., a software company, a governmental institution, an NGO, etc.) who has released the app. After collecting this information, we detail in Table 2 (column Developer Type) the status (or the type) of the entity having released an app. Table 9 presents the number of released Covid-related apps for each type of entity.
Table 9 Number of Covid-related apps per entity type
We can see that most of the app providers are governmental institutions. We indeed find Covid-related apps that are officially promoted by national governments (e.g., Government of Brazil (footnote 16) or Government of France (footnote 17)). We also see apps released by more local governmental bodies (at the state or regional level). We have for instance apps from specific states of the USA (e.g., State of Rhode Island) (footnote 18), or from specific Swiss cantons (e.g., Gesundheitsdepartement des Kantons Basel-Stadt) (footnote 19).
About 20% of the Covid-related apps (17 apps) are provided by companies. In order to understand why these apps have not been removed by Google, we further check the description of these apps and the descriptions of the companies. We found that:
- Even if the developer is identified as a company, two apps have been developed on behalf of official bodies (Care19 (footnote 20) is the official COVID-19 app for the states of South Dakota and North Dakota, COVID AP-HM (footnote 21) is an app developed for a hospital);
- Seven apps are either endorsed by a ministry (footnote 22), or developed in close collaboration with medical/health actors (footnote 23), or developed in collaboration with renowned universities (footnote 24).
- Two apps are actually online shopping apps (footnote 25).
- One app is not on the market anymore (footnote 26).
- Finally, five apps related to social distancing (footnote 27), health (footnote 28), or Covid-related news (footnote 29) have been released by companies without any explicit link to official organizations. We recall that the official Google COVID-19 policy (Google 2020a) is that Covid-related apps with no explicit links with governmental bodies or health organizations cannot provide “health claims”. We further checked these 5 apps, and we confirm that they comply with the Google COVID-19 policy.
For the remaining nine Covid-related apps, we noticed that 3 apps have been provided by associations. More specifically, the DiagnoseMe app (footnote 30) has been released by the Faso Civic association from Burkina Faso, the Self Shield App (footnote 31) by the Commonwealth Medical Association (through the Commonwealth Centre for Digital Health organization) and the COVID Safe Paths app (footnote 32) by a non-profit organization related to MIT. We also noticed that two apps have been developed by independent developers, and two other apps have been provided by researchers: one by a group of researchers from German universities (footnote 33), and one by researchers from the Aga Khan University in Pakistan (footnote 34). Finally, one app has been provided by an NGO (i.e., the Austrian Red Cross), and one by a hospital (actually a group of hospitals in Paris, France).
We note that among all the Covid-related apps, 71% of them have been released by entities having multiple Android apps on Google Play.
Finally, the map in Fig. 15 shows the geographical distribution of the apps over the world. We can see that Covid-related apps are provided worldwide (though possibly less so in Africa). The countries in blue are the ones listed in Table 2. Note that we also identified 16 other apps from 16 countries that we were unable to obtain; these countries are represented in red.
Do Covid-Related Apps Have Security Issues?
Motivation
Security and privacy are critical concerns regarding mobile apps. In this section, we assess several aspects of the security of Covid-related apps. The outcome of this research question will provide the general public with a summary of some potential security problems found in Covid-related apps, which may help them adjust their level of trust in such apps. Similarly, we highlight potential security issues that developers of Covid-related apps should consider. Researchers may also use this information to pursue further investigation topics related to the security of Covid-related apps, e.g., API usage patterns, evolution of security and privacy in the lineage of apps, etc.
Strategy
In contrast to a recent work (He et al. 2020), which focused on dissecting Covid-related malware, our aim in this work is not to perform an extensive security analysis of these apps. Nevertheless, we propose to leverage four practical security and privacy scanners on our set of 92 Covid-related apps in order to systematically evaluate four S&P aspects: (1) the presence of privacy leaks; (2) the number of apps flagged by VirusTotal; (3) the misuse of crypto-APIs; (4) the matching between descriptions and behavior.
[Privacy leaks] As we have seen in Section 4.1, most of the Covid-related apps are made for collecting personal and sensitive data, e.g., health data and/or the location of users. Therefore, the security and privacy aspects of these apps are crucial, and many people started to share concerns related to this topic (Page 2020; Parliament 2020; Stolton 2020). In order to assess the privacy of Covid-related apps, we applied the state-of-the-art data leak detector FlowDroid-IccTA (Arzt et al. 2014; Li et al. 2015). Through static analysis, this tool is able to detect sensitive data leaks intra-component (e.g., inside an Activity) or inter-component (e.g., across Activities). Note that we used the default sources and sinks provided with the tool.
FlowDroid-IccTA detected 24 intra-component data leaks in 2 different apps and no inter-component leak. The app SODI (footnote 35) contained only 1 potential leak, whereas the app Coronavirus - SUS (footnote 36) contained 23 potential leaks. Given that static analysis tools are subject to false positives, we undertook to manually analyze every detected leak.
We compiled the list of the 24 leaks in Table 10. The second column gives the source of the potential leak, i.e., the sensitive information that forms the first link of the chain. The third column lists the sinks associated with the sources, i.e., the methods responsible for leaking the sensitive information.
Table 10 List of the leaks detected by FlowDroid-IccTA. Note that there can be multiple leaks for each source/sink pair
SODI is an app promoting social distancing. The app does not originate from a government. Our manual analysis concluded, however, that the reported leak is a false-positive alarm and does not constitute a real data leak.
Regarding Coronavirus - SUS, which is an official app of the government of Brazil, FlowDroid-IccTA flagged 23 potential sensitive leaks (i.e., there is a path from a source, which can access sensitive data, to a sink, e.g., sendTextMessage). We notice that four of these leaks allow the app to get the longitude and/or latitude of the device (the sources) and log it internally (the sink). However, this does not necessarily constitute malicious behavior.
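The notion of a leak as a path between a source and a sink can be illustrated with a toy reachability search. This is a deliberately simplified sketch, not the taint analysis performed by FlowDroid-IccTA, which is considerably more precise; the data-flow graph and its intermediate variable names are hypothetical, while `Location.getLatitude()` and `Log.d()` are standard Android methods used here as example source and sink.

```python
from collections import deque


def find_leak_path(flow_graph, source, sinks):
    """Breadth-first search for a data-flow path from a source to any sink."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path
        for successor in flow_graph.get(node, []):
            if successor not in visited:
                visited.add(successor)
                queue.append(path + [successor])
    return None  # no path: the source never reaches a sink


# Hypothetical data-flow graph: an edge a -> b means data flows from a to b.
flow_graph = {
    "Location.getLatitude()": ["lat"],
    "lat": ["message"],
    "message": ["Log.d()"],
    "userName": ["greeting"],
}
SINKS = {"Log.d()", "SmsManager.sendTextMessage()"}

leak = find_leak_path(flow_graph, "Location.getLatitude()", SINKS)
no_leak = find_leak_path(flow_graph, "userName", SINKS)
```

A reported path still has to be inspected manually, since a path in an over-approximated graph may not correspond to a real leak.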
[AntiVirus detection] For each of the Covid-related apps, we have collected the detection reports from over 60 antivirus products, thanks to the VirusTotal API (footnote 37). None of the Covid-related apps is flagged by any of the 60 antivirus products available in VirusTotal at the time of writing.
[Crypto-API misuses] Next, we leverage the state-of-the-art static analyzer CogniCrypt (Krüger et al. 2017) through its headless implementation CryptoAnalysis (CryptoAnalysis 2020) for detecting cryptographic API misuses in Java programs. Such misuses could indeed indicate security issues. We found that 81 apps among our set of 92 Covid-related apps use JCA (footnote 38) APIs. However, CogniCrypt did not report any cryptographic misuse.
In contrast, Gao et al. (2019a) have shown that in a dataset of more than 598,000 APKs, 96% of the APKs using JCA exhibit dangerous misuses of cryptographic APIs. With 0%, Covid-related apps seem to be totally exempt from such misuses.
[Description/Behavior matching] Covid-related apps may propose functionalities that are sensitive due to the security and privacy concerns that they can raise (e.g., the Indian app Aarogya Setu (Clarance 2020) was found to share users’ private information with third parties). The apps we study have been released on Google Play, therefore users can only rely on the description provided by developers.
However, it has been shown by Gorla et al. (2014) that apps’ behavior does not always match the apps’ description. For this reason, we replicated the CHABADA approach (Gorla et al. 2014) to check to what extent Covid-related apps’ descriptions match their behavior (approximated by API usages). CHABADA unfolds as follows:
1. Preprocessing descriptions with NLP techniques: tokenization, stop word removal, stemming
2. Extracting topics with Latent Dirichlet Allocation (Blei et al. 2003)
3. Clustering apps based on topics with K-means (MacQueen 1967)
4. Identifying, in each cluster, the apps that have outlier API usages. This outlier identification is performed via One-Class Support Vector Machine learning (Schölkopf et al. 2001)
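As an illustration of the first step only, the sketch below tokenizes a description, removes stop words, and applies a crude suffix-stripping stemmer. The stop-word list and the stemming rule are simplified placeholders chosen for this example, not the ones used in our implementation; a real pipeline would typically rely on an established stemmer, and the later steps (topic extraction, clustering, outlier detection) are not shown.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for",
    "is", "are", "with", "on", "your", "you", "this", "that",
}

# Suffixes removed by the crude stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(description):
    """Lowercase, tokenize, drop stop words, and stem an app description."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [crude_stem(tok) for tok in tokens if tok not in STOP_WORDS]


# Hypothetical description fragment, for illustration only.
tokens = preprocess("Tracking the symptoms of users and reporting infections")
```

The resulting token lists are what a topic model would consume in the second step.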
In Section 4.1, we have seen that we were able to retrieve the descriptions of 78 apps. We applied our implementation of CHABADA to these 78 apps. The clusters generated by CHABADA can be seen in Table 11. Five clusters have been generated. We have named these clusters by considering the three most used words per cluster. We can see that the first cluster (i.e., Spread tracking) contains 30 apps, whereas the other clusters are smaller and all roughly the same size (i.e., between 10 and 14 apps per cluster). Note that the clusters, which are independently built using the CHABADA approach, can each be associated with a category of our taxonomy (defined in Section 4.1).
Table 11 Clusters of apps generated by our implementation of CHABADA
After clustering the apps based on their description, CHABADA searches for outliers in each cluster based on API usage. Table 11 shows the number of outliers detected per cluster. The three outliers in “Spread tracking” have been detected because, contrary to other apps in the cluster, they use the Android vibration API and the Android MediaPlayer API. Similarly, in the cluster “Sharing health information”, the two outliers use Bluetooth APIs, which is not the case for other apps in the cluster. Regarding the cluster “Sharing general information”, the outliers use the SmsManager API, the TelephonyManager API, and the SpeechRecognizer API. In the “Data collection” cluster, the use of the Bluetooth API is common, and the outliers are the apps that do not use this API. Finally, in the “COVID-19 self-diagnosis” cluster, the outliers use the MediaPlayer API, which is not used by the other apps in the cluster.
CHABADA allowed us to identify several Covid-related apps that deviate from the expected behavior given their description. Although the outliers do not necessarily present a danger for end-users, because, in general, they only deviate with respect to non-sensitive APIs (e.g., MediaPlayer, Vibrator, etc.), our empirical results show that descriptions do not always reliably approximate the expected app behavior.