1 Introduction

Digital media usage is now driven by mobile devices, with smartphones and tablets accounting for 66% of all time spent, against only 34% for desktop usage [2]. Specifically, more than 80% of mobile minutes in all markets are spent on mobile apps [35]. Indeed, the development of mobile apps has been growing exponentially since the establishment of a number of app stores from which they can be downloaded and installed.

The main success factor of mobile apps is in fact the distribution model offered by dedicated app stores, such as Google Play for Android apps, and the Apple app store for iOS apps. As of today, these stores make available millions of mobile apps of different categories to millions of people, who use them for their everyday activities like purchasing products, messaging, etc. [2]. Clearly, this is a highly competitive business in which even the smallest error may have a tremendous financial impact. The revenue and profit of a mobile app are often proportional to the number of its users [240], who may enjoy using the app (and possibly rate it positively in the store) or dislike it (and possibly abandon it or even leave a negative review in the store). This implies that improving the level of user satisfaction is fundamental for app developers to both keep existing users active and attract new ones.

Technically, mobile apps consist of executable files that are downloaded directly to the end user’s device and stored locally. Mobile apps are developed atop the services provided by their underlying mobile platform (e.g., Android). Those services are exposed via a dedicated Application Programming Interface (API) with methods related to communication and messaging, graphics, and security. Programming languages and tools for developing mobile apps are platform-specific (e.g., Java code for Android apps, and Swift code for Apple iOS apps), and present many challenges that may hamper the success of a mobile app as a whole [122, 249]. As empirically shown in [122], app developers strongly need better analysis and testing support, with a focus on important features like mobility, location services, sensors, as well as different gestures and inputs. Indeed, although it may be assumed that app developers adhere to development best practices (mainly related to well-established software engineering principles and design patterns), there is still a need to assess, or even guarantee, properties of apps with a certain degree of confidence. Examples of those properties include low energy consumption, efficient use of computational resources, security, performance, and reliability. Satisfying these needs would allow (i) app developers to raise the level of quality of their products and, potentially, their revenues, (ii) app users to use high-quality products in their everyday activities, and (iii) app store moderators (e.g., Google and Apple) to raise the overall level of quality and trustworthiness of their stores.

Static program analysis allows for predicting (precise or approximated) quantitative and qualitative properties related to the run-time behavior of a program without actually executing it [189]. For instance, static analysis techniques allow for statically inferring cost-related properties (such as the estimation of the maximal number of loop iterations and the related worst-case execution time), as well as properties related to resource consumption [13] (such as memory/heap usage and energy consumption).

Under this perspective, static analysis of mobile apps can be a valuable instrument for both (i) app developers, who can use it to quickly get non-trivial insights about their apps (e.g., subtle security issues, energy hotspots due to some programming antipattern, inefficient use of hardware sensors) and (ii) app store moderators, who can use static analysis for systematically assessing the level of quality of the apps they distribute, possibly identifying those apps with an unacceptable level of quality (e.g., apps with well-known security flaws, apps asking for suspicious permissions, apps with strong energy inefficiencies).

Static analysis of mobile apps is gaining growing interest in both academia and industry. Literally hundreds of (often overlapping) kinds of (theoretical and practical) static analysis approaches exist in the literature, ranging from structural and control-flow analysis, to data-flow and state-based analysis, interval analysis (used in optimizing compilers) and so on [189]. Such approaches exploit static analysis techniques from different perspectives and belong to extremely different research areas of software engineering, such as software analytics, security, testing, and verification. Industrial tools are also emerging and being maintained by key players in the technological panorama. For example, Facebook’s Infer (Footnote 1) applies separation logic and bi-abduction for inter-procedural analysis [42] and it is used by Facebook itself, Spotify, Mozilla, the Amazon Web Services division, etc.

The goal of this paper is to precisely characterize existing software engineering research on static analysis of mobile apps from three different perspectives, namely: (i) research trends, (ii) the characteristics of the proposed approaches, and (iii) their potential for industrial adoption.

In order to achieve this goal, we applied the systematic mapping study methodology [203, 257]. The aim of this methodology is to provide an objective, replicable, and unbiased approach to answer a set of research questions about the state of the art on a given topic. In this paper, we systematically selected 261 primary studies from over 12,000 potentially relevant publications on static analysis of mobile apps. Then, we defined a classification framework for categorizing the selected approaches and rigorously applied it to the 261 primary studies. Finally, we synthesized the obtained data to produce a clear snapshot of the state of the art on static analysis of mobile apps.

The main contributions of this study are:

  1. a classification framework for categorizing, comparing, and evaluating approaches for static analysis of mobile apps according to a number of parameters (e.g., analysis goal, supported platforms, type and number of needed inputs, types of supported analysis);

  2. an up-to-date map of the state of the art in static analysis of mobile apps;

  3. an evaluation of the potential for industrial adoption of existing research results on static analysis of mobile apps;

  4. a discussion of the emerging challenges and their implications for future research on static analysis for mobile apps;

  5. a replication package for independent replication and verification of this study.

The audience of this study is composed of both (i) researchers interested in adopting existing static analysis approaches, possibly to further contribute to this research area by targeting (a subset of) the identified research challenges (see Section 9), and (ii) app developers interested in critically understanding existing research results and thereby adopting or extending those approaches in the context of their products. The latter point is especially relevant since in this study we also assessed how approaches developed in academia can be successfully transferred and adopted in industrial projects. As a concrete case, the Infer approach has its theoretical foundations in academic results developed by researchers from Imperial College and Queen Mary University of London [42, 43], and it is now used by top tech companies such as Facebook, Amazon, Spotify, Mozilla, and Sky.

The rest of the paper is organized as follows. Section 2 provides background information on the mobile apps ecosystem and static program analysis. Section 3 puts our study in context with respect to related work. The design of our study from a methodological perspective is provided in Section 4. The main results of our study are reported in Sections 5, 6, 7, and 8. Section 9 discusses and puts the achieved results in context by also elaborating on future research challenges. Threats to validity are reported in Section 10. Section 11 closes the paper.

2 Background

This section provides the reader with background notions on the mobile apps ecosystem (Section 2.1) and static program analysis (Section 2.2).

2.1 The mobile apps ecosystem

A mobile app (short for mobile application) is a computer program designed to run on mobile devices such as smartphones and tablet computers. Mobile apps were originally offered for general productivity and information retrieval, including email, calendar, contacts, stock market and weather information. However, public demand drove rapid expansion into other categories and nowadays, according to a 2017 report, the global app economy is worth $1.3 trillion and is predicted to grow to $6.3 trillion in 2021 [9].

Mobile apps fall broadly into three categories: native, web-based, and hybrid [227]. Native apps run on a device’s operating system and are required to be adapted for different devices. Web-based apps require a web browser on a mobile device. Hybrid apps are web-based apps hosted inside a native application.

Apps that are not pre-installed are generally distributed to end users through app stores, i.e., application distribution platforms that first appeared in 2008. Dedicated app stores are typically operated by the owners of the mobile operating systems (such as the Apple App Store (Footnote 2), Google Play (Footnote 3), and the Windows Phone Store (Footnote 4)). Generally, mobile apps are downloaded directly from the distribution platform to a target mobile device. Currently, Android and iOS, the two most prominent mobile operating systems, make up over 99% of smartphone sales worldwide [114].

Still, being relatively new, mobile apps present a wide array of issues and challenges for both end users and developers. On the one hand, when using mobile apps, end users often face issues that stem from poor quality of development (such as apps that crash frequently, lack responsiveness, or consume an abnormal amount of energy or memory) or deliberate malicious behavior (such as apps that invade privacy or are unethical [128]). On the other hand, developers face multiple challenges when developing apps for mobile devices, such as fragmentation (both across multiple platforms and within the same platform), the lack of robust monitoring, analysis, and testing tools, and the need to keep up with frequent platform updates and changes [123].

In the following section, we provide a concise summary of the topic on which our research focusses, namely static program analysis techniques, followed by a concrete example of one such technique.

2.2 Static program analysis

The term static program analysis covers a set of static compile-time techniques that predict computable approximations of values or behaviors arising at run-time when executing a program [189]. When applied to mobile apps, static program analysis can be an effective instrument for both app developers and app store moderators (e.g., Google, Apple) to predict and evaluate (precise or approximated) quantitative and qualitative properties related to the run-time behavior of mobile apps without actually executing them. Hence, it can be a valuable instrument to create apps with better quality in a world where a low-quality release can have devastating consequences [123].

In the literature, static analysis of mobile apps has been applied with a variety of goals in mind, ranging from the detection of malware and privacy leaks, to the detection of bugs in the app source, to the reduction of energy and memory consumption [15, 16, 93, 113, 143]. To achieve these goals, researchers have experimented with a variety of different static analysis techniques. Among the ones worth mentioning is data-flow analysis, in which a program is considered as a graph: nodes are elementary blocks and edges describe how control passes from one block to another [189]. Taint analysis is a special case of data-flow analysis that aims to detect the existence of a data flow from sensitive data sources, often simply referred to as sources, to untrusted program statements, called sinks [113]. Type analysis aims to verify the type safety of a program, i.e., whether we can guarantee that the eventual value of any expression in the program will not violate the expression’s static type. In other words, type analysis aims to detect type errors in a program’s source code. Abstract interpretation is a sound approximation of the semantics of a program, based on monotonic functions over ordered sets. It is able to extract information about the semantics of a program without performing all the calculations. Program slicing aims to compute the set of program statements, referred to as the program slice, which may affect the values at some point of interest, referred to as a slicing criterion.
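To make the notion of taint analysis more concrete, the following is a minimal Python sketch; it is not taken from any of the surveyed approaches. It assumes a hypothetical straight-line program encoded as (target, operation, arguments) triples, and the source and sink names (get_device_id, send_sms) are invented for illustration. Real analyses operate on bytecode or source code and must also handle branches, aliasing, and callbacks.

```python
# Toy forward taint propagation over a straight-line program.
# Statement format (an assumption of this sketch): (target, op, args)
#   op == "call": target = args[0](args[1:])  # args[0] is the callee name
#   op == "copy": target = args[0]

SOURCES = {"get_device_id"}  # hypothetical sensitive source
SINKS = {"send_sms"}         # hypothetical untrusted sink


def find_leaks(program):
    """Return the indices of statements where tainted data reaches a sink."""
    tainted = set()
    leaks = []
    for index, (target, op, args) in enumerate(program):
        if op == "call":
            callee, call_args = args[0], args[1:]
            arg_tainted = any(a in tainted for a in call_args)
            if callee in SINKS and arg_tainted:
                leaks.append(index)
            # Conservative choice: a call result is tainted if the callee
            # is a source or any argument is already tainted.
            is_tainted = callee in SOURCES or arg_tainted
        else:  # "copy"
            is_tainted = args[0] in tainted
        if is_tainted:
            tainted.add(target)
        else:
            tainted.discard(target)  # variable overwritten with clean data
    return leaks


program = [
    ("id", "call", ["get_device_id"]),
    ("msg", "copy", ["id"]),
    ("ok", "call", ["send_sms", "msg"]),    # tainted msg reaches the sink
    ("msg", "call", ["make_greeting"]),     # msg is overwritten
    ("ok", "call", ["send_sms", "msg"]),    # no leak here
]
print(find_leaks(program))
```

In this sketch only the first call to the sink is reported, because the variable is overwritten with untainted data before the second call.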

In some cases, static analysis approaches rely on additional inputs other than the program itself (e.g., knowledge bases, code mappings), either to improve the accuracy of the analysis or to perform broader kinds of analyses that would be impossible without. When the analysis makes use of information collected at run-time, while executing the program, we refer to it as hybrid analysis.

Approaches for static analysis of mobile apps can be generic or platform specific. The latter approaches are able to analyze only apps developed for a specific mobile platform, as the analysis leverages or focuses on programming constructs that are available only on that platform (e.g., Android Intents).

Example – In the remainder of this section, we describe CHEX [170], one of the identified primary studies, in order to give a concrete idea of the typical traits and features of a software engineering technique for static analysis of mobile apps. The main goal of CHEX is to automatically detect component hijacking vulnerabilities, a specific class of security vulnerabilities existing on the Android platform. In this sense, CHEX is Android-specific. These vulnerabilities have been modeled from a data-flow analysis perspective, thus enabling their identification via a reachability analysis on custom system dependence graphs. In [170], the authors also devised novel techniques to tackle analysis challenges arising from Android’s programming paradigm, such as multiple app entry points and asynchronous code execution. CHEX has been implemented on top of Dalysis, a generic static analysis framework that the authors built to support many types of analysis on Android app bytecode. CHEX was evaluated on 5,486 real Android apps and correctly identified 254 potential component hijacking vulnerabilities.

3 Related work on static analysis of mobile apps

In this section, we discuss other existing studies related to our work. We consider as related research the literature reviews, surveys, and mapping studies on either static analysis approaches or analysis methodologies and techniques applied to mobile apps.

To the best of our knowledge, there is no systematic mapping study (SMS) and only one systematic literature review (SLR) on the specific topic of static analysis of mobile apps [151]. Thus, in the following, we first discuss in more detail the SLR reported in [151], which is a valuable and solid study closely related to ours. Then, we discuss other works in the literature that, although having different scopes and objectives, can be related to our research.

Similarly to our SMS, the SLR in [151] reviewed publications on approaches involving the use of static analysis for mobile apps. The main difference between the SLR in [151] and our SMS is methodological; as extensively discussed in [203] and [134], SLRs aim at synthesizing evidence with a very specific goal in mind (e.g., which static analysis technique achieves higher accuracy in specific contexts), whereas systematic maps are primarily concerned with structuring a research area [203], providing an overview of the direction and intensity of the scientific interest in a specific topic (static analysis for mobile apps in our case), which sub-topics are covered, and relevant research gaps and trends. This difference in aim implies profound methodological differences throughout the whole research protocol, ranging from the nature of the research questions, to the breadth of the searches, and most importantly to the synthesised findings.

In the following, we provide an overview of the main methodological differences between our study and the one in [151]. In addition to Android, our study also considered other platforms. As for the search strategy, the main difference is that we performed a manual search of top venues for SE and programming languages, followed by backward snowballing and then forward snowballing; in [151], the authors performed an automatic search followed by a manual search of top venues for SE, programming languages, security and privacy, and then an authors’ self-check followed by backward snowballing. Concerning the selection criteria, we considered only peer-reviewed work, excluding studies in the form of editorials and tutorials, as well as short and poster papers, and secondary or tertiary studies. In [151], only short papers were excluded. Moreover, differently from them, we accounted for the existence of some kind of evaluation together with the availability of an implementation. As a result, they collected 124 research papers in the timespan 2011-2015, whereas we have a broader coverage of 261 primary studies in the timespan 2007-2019. Finally, similarly to [151], in our study we perform a vertical analysis of the extracted data (i.e., we perform an in-depth analysis of the extracted data for each parameter of our classification framework); in addition, we complement the vertical analysis with a horizontal analysis (i.e., we build contingency tables across pairs of parameters and investigate interesting emerging correlations).
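As an illustration of what a horizontal analysis involves, the sketch below builds a contingency table across one pair of parameters. It is a hypothetical Python example, not the R scripts used in this study; the parameter names (goal, platform) and the records are made up for illustration only.

```python
from collections import Counter


def contingency_table(records, row_param, col_param):
    """Count how often each pair of parameter values co-occurs."""
    counts = Counter((r[row_param], r[col_param]) for r in records)
    rows = sorted({r[row_param] for r in records})
    cols = sorted({r[col_param] for r in records})
    # Counter returns 0 for pairs that never occur together.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


# Made-up extraction records, one per (hypothetical) primary study.
records = [
    {"goal": "security", "platform": "Android"},
    {"goal": "security", "platform": "Android"},
    {"goal": "energy", "platform": "Android"},
    {"goal": "security", "platform": "iOS"},
]

table = contingency_table(records, "goal", "platform")
for goal, row in table.items():
    print(goal, row)
```

Cells with unexpectedly high or low counts are the ones that would be inspected further when looking for correlations between two parameters.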

Importantly, in [151], the authors do not consider the potential for industrial adoption of existing research on static analysis of mobile apps, as we do through our research question RQ3. This is a substantial difference that allowed us to identify in the state of the art those approaches to static analysis of mobile apps that are ready for technology transfer and industrial adoption. Another relevant difference is in the nature of the study, SLR versus SMS, and in the target audience. As already introduced, in our SMS we target both researchers and practitioners, such as app developers, who are interested in selecting existing static analysis approaches and want to critically understand what they offer and how, in order to decide on their adoption or possible industrial transfer. The SLR in [151] more specifically targets researchers and practitioners who want to propose a new approach to static analysis or to extend existing ones. In this sense, we believe that our work and the work in [151] complement one another, and together they constitute a valuable asset to the academic and industrial world in the wide spectrum of static analysis.

In [87], a survey about static analysis and model checking approaches for searching patterns and vulnerabilities within a software system is reported. The authors examine the proposed algorithms and their effectiveness in finding bugs. A peculiarity of this research is the comparison between static analysis algorithms and mathematical logic languages for model checking.

In [204], the authors report on a survey about static analysis for identifying security issues and vulnerabilities in software systems in general (not specific to mobile apps). For each type of security vulnerability, the authors present both relevant studies and the implementation details of the used static analysis algorithms.

A systematic mapping study is reported in [65]. The study was conducted for classifying and analysing approaches that combine different static and dynamic quality assurance techniques. The study includes a discussion about reported effects, characteristics, and constraints of the various existing techniques.

A literature review about mobile usability models can be found in [105], as a means for validating a specific usability model. Among the main results, from this literature review it emerges that usability is usually measured in terms of three key indicators, namely, effectiveness, efficiency and satisfaction.

Even though some of the above-mentioned works are about static analysis, none of them is specifically focussed on the static analysis of mobile apps, and none of them is a systematic literature review.

4 Study design

This research was organized into three main phases, which are well-established when it comes to systematic literature studies [132, 257]: planning, conducting, and documenting.

Planning. We established the need for performing a review on static analysis of mobile apps (Section 3), we identified the main research questions (Section 4.1), and we defined the protocol to be followed by the involved researchers.

Conducting. We performed the mapping study by following all the steps defined in our research protocol, namely: (i) search and selection of primary studies, i.e., the relevant research articles on static analysis methods and techniques of mobile apps (Section 4.2), (ii) extraction of relevant data from each primary study according to a rigorously-defined classification framework (Section 4.3), and (iii) synthesis of main findings emerging from the analysis and summary of the data extracted in the previous activity (Section 4.4).

Documenting. The main activities performed in this phase are: (i) a thorough elaboration of the data extracted in the previous phase, with the main goal of setting the obtained results in their context, (ii) the discussion of possible threats to validity, especially the ones identified during the definition of the review protocol (in this activity new threats to validity may emerge too), and (iii) the writing of a final report (i.e., this article) describing the performed mapping study.

A complete replication package is publicly available to allow interested researchers to independently replicate and verify our study (Footnote 5). It includes the review protocol, the list of both searched and selected studies, a detailed data extraction form, the raw extracted data, and the R scripts for data analysis.

4.1 Research questions

We formulate the goal of this study by using the Goal-Question-Metric perspectives (i.e., purpose, issue, object, viewpoint [30]). Table 1 shows the result of the above mentioned formulation.

Table 1 Goal of this research

The results of this study are targeted to both (i) researchers willing to further contribute to this research area, and (ii) practitioners willing to understand existing research on static analysis approaches of mobile apps and thereby to be able to adopt those solutions that better fit with their needs. We refined our abstract goal into the following research questions:

  RQ1: What are the research trends on static analysis of mobile apps?

    Rationale: a multitude of researchers have been investigating static analysis for mobile apps over time, with different degrees of independence and different methodologies. By answering this research question, we aim at characterizing the scientific interest in static analysis approaches for mobile apps, the relevant venues where academics are publishing their results on the topic, and their contribution type.

  RQ2: What are the characteristics of existing approaches for static analysis of mobile apps?

    Rationale: static analysis of mobile apps is a multi-faceted research topic, where researchers can focus on very different aspects (e.g., energy consumption, security), applying very different research methodologies (e.g., industrial case studies, empirical evaluations), and providing different types of contributions (e.g., tools for automating development activities, techniques for analyzing a specific aspect of the mobile app). By answering this research question, we aim at providing (i) a solid foundation for classifying existing (and future) research on static analysis of mobile apps, and (ii) an understanding of current research trends and gaps in the state of the art on static analysis of mobile apps.

  RQ3: What is the potential for industrial adoption of existing research on static analysis of mobile apps?

    Rationale: while it is well known that mobile apps have their roots in industry, many research groups focus on them from an academic perspective. Therefore, it is natural to ask how the produced research findings and contributions can actually be transferred back to industry. By answering this research question, we aim at assessing whether and how the current state of the art on static analysis of mobile apps is ready to be adopted in industry.

4.2 Search and selection process

Our first choice for searching potentially relevant studies was to perform an automatic search on known data sources (e.g., IEEE Xplore, the ACM Digital Library, SCOPUS). However, the results of a preliminary study [14] showed that the research topic of mobile static analysis is extremely heterogeneous; for example, many keywords like “program analysis” proved to be heavily overloaded, leading to imprecise and inaccurate automatic search results. In order to prevent biases associated with automatic searches, we adopted two complementary manual search activities. This decision is supported by the evidence that automatic searches and backward snowballing activities lead to similar results, and that the decision on which to prefer is context-specific [116, 255]. Our search strategy was divided into two consecutive and complementary steps. The first step was carried out by manually inspecting all the publications of the top-level software engineering venues. The papers identified through this first step were then used as input for a backward and forward snowballing (Footnote 6) process [256]. In order to ensure the correctness of the adopted manual approach, the backward snowballing activity was based exclusively on the papers selected from the top-level software engineering venues. Furthermore, the backward snowballing results were complemented by a forward snowballing process, which ensured the soundness and relevance of the set of selected primary studies.

Figure 1 shows our search and selection process, whose main steps are detailed in the following. Our search and selection process is designed as a multi-stage process in order to have full control on the number and characteristics of the studies being either selected or excluded during the various stages.

Fig. 1 The search and selection process of this study

1. Perform initial manual search. We performed a manual search by considering exclusively articles published in the top-level software engineering conferences and international journals according to well-recognized sources in the field [55, 271]. It is important to note that the main aim of this step was not to select all primary studies; rather, as suggested in [255], we aimed at obtaining a good start set of papers for the subsequent snowballing procedure (stage 3), i.e., high-quality relevant papers about static analysis techniques for mobile apps in the field of software engineering. We used the quality of the publication venues as a proxy for the quality of the potentially relevant studies. Table 2 shows the considered conferences and journals. The time span of our search ranges from January 2007 (Footnote 7) to December 2019.

Table 2 Searched data sources with number of potentially relevant studies

The search was performed by manually screening the DBLP entries of all conference proceedings and journal issues within the considered time span while applying the selection criteria described in stage 2. DBLP is the Computer Science Bibliography from the University of Trier [141] and contains all proceedings and issues of the publication venues listed in Table 2. This step resulted in a total of 12,128 potentially relevant studies distributed across more than 9 years of research in software engineering.

2. Apply selection criteria. Each study was filtered according to a set of well-defined selection criteria. The adopted criteria are detailed in Section 4.2.1. An adaptive reading depth was applied in order to carry out the selection process in a time-efficient and objective manner [202], because it was not necessary to read the full text of approaches that clearly did not qualify. This step resulted in a total of 85 potentially relevant studies. This significant reduction in the number of potentially relevant studies is due to the fact that (i) we considered exclusively top-level venues in the field of software engineering, and (ii) the considered venues are quite general, with static analysis of mobile apps being only one of the many topics of interest of those venues. In order to reduce possible biases, three researchers were involved in this stage of the study, with a fourth researcher playing the role of arbiter in case of conflicts so as to ‘avoid endless discussions’ [291]. The application of the selection criteria led to an initial set of 85 primary studies.

3. Backward and forward snowballing. In this step, we applied backward and forward snowballing in order to also take into account studies published outside the conferences and journals considered in the previous step. In particular, this process was carried out by considering the studies selected in the initial search, and subsequently selecting relevant papers among those cited by the initially selected ones. This method is commonly referred to as a backward snowballing activity [255].

In addition to the backward snowballing, we also analyzed the studies citing the ones selected through the initial search. This process is usually referred to as a forward snowballing activity [255]. Specifically, we included this further literature search method in order to also consider newer studies that, at that time, had not yet been included in official journal volumes or conference proceedings. For the forward snowballing process, the Google Scholar (Footnote 8) bibliographic database was used to retrieve the studies citing the ones selected through the initial search phase.

The final decision about the inclusion of the papers was based on the adherence of the full text of the studies to the predefined selection criteria presented in Section 4.2.1. This step resulted in a total of 296 potentially relevant studies. The total number of potentially relevant studies increased significantly since in this step we considered papers published in all research venues, which by definition are far more than the top-level ones.
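The two snowballing activities described in this step can be sketched as simple operations on a citation graph. The Python example below is only an illustration under stated assumptions: the citation graph, the paper identifiers, and the function names are hypothetical, and in the actual study the candidates were retrieved manually (with Google Scholar for the forward direction) and then filtered with the selection criteria.

```python
def backward_snowball(start_set, cites):
    """Papers cited by at least one paper in the start set."""
    found = set()
    for paper in start_set:
        found.update(cites.get(paper, ()))
    return found - set(start_set)


def forward_snowball(start_set, cites):
    """Papers that cite at least one paper in the start set."""
    start = set(start_set)
    citing = {paper for paper, refs in cites.items() if start & set(refs)}
    return citing - start


# Hypothetical citation graph: paper -> papers it cites.
cites = {
    "A": ["X", "Y"],
    "B": ["Y", "A"],
    "N1": ["A"],   # newer paper citing the start set
    "N2": ["Z"],   # unrelated paper
}
start_set = {"A", "B"}  # stands in for the studies from the initial search

candidates = backward_snowball(start_set, cites) | forward_snowball(start_set, cites)
print(sorted(candidates))
```

Both functions perform a single round starting from the initial set only, mirroring the fact that snowballing here was based exclusively on the studies selected from the top-level venues.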

4. Exclude studies during data extraction activity. While reading each potentially relevant study in detail, we agreed that 35 studies were semantically out of the scope of this research, so they were excluded. This final step led us to the final set of 261 primary studies.

4.2.1 Selection criteria

Following the guidelines for systematic literature reviews in software engineering [132], in order to reduce the likelihood of biases, we defined a set of inclusion and exclusion criteria beforehand. In the following, we detail the set of inclusion and exclusion criteria that guided the selection of the potentially relevant studies. A potentially relevant study was included if it satisfied all the inclusion criteria stated below, whereas it was discarded if it satisfied at least one of the exclusion criteria reported below.

Inclusion criteria

I1) Studies proposing or using a static analysis method or technique for mobile apps.

I2) Studies in which the static analysis method or technique takes as input one or more mobile applications in the form of binary files or source code.

I3) Studies providing some kind of evaluation of the proposed method or technique (e.g., via formal analysis, controlled experiment, exploitation in industry, application to a simple example).

Exclusion criteria

E1) Studies not describing any implementation of the proposed method or technique.

E2) Secondary or tertiary studies (e.g., systematic literature reviews, surveys).

E3) Studies in the form of editorials, tutorials, short papers, and poster papers, because they do not provide enough information.

E4) Studies not published in the English language.

E5) Studies not peer reviewed.

E6) Studies in which the static analysis method or technique takes as input only store metadata (e.g., user reviews, ratings) or other app artifacts (e.g., manifest files).
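The selection rule (all inclusion criteria satisfied, no exclusion criterion satisfied) can be written as a small predicate. The sketch below is only illustrative: the `Study` record and the example studies are hypothetical, and in the review itself each criterion was assessed manually on the full text.

```python
from dataclasses import dataclass, field

INCLUSION = ("I1", "I2", "I3")
EXCLUSION = ("E1", "E2", "E3", "E4", "E5", "E6")


@dataclass
class Study:
    title: str
    # Identifiers of the criteria that the study satisfies (assessed manually).
    satisfied: set = field(default_factory=set)


def is_selected(study):
    """A study is kept if it meets every inclusion criterion and no exclusion criterion."""
    meets_all_inclusion = all(c in study.satisfied for c in INCLUSION)
    hits_any_exclusion = any(c in study.satisfied for c in EXCLUSION)
    return meets_all_inclusion and not hits_any_exclusion


# Hypothetical examples.
tool_paper = Study("static analysis tool with evaluation", {"I1", "I2", "I3"})
survey = Study("survey on static analysis", {"I1", "I2", "I3", "E2"})
no_evaluation = Study("approach without evaluation", {"I1", "I2"})

for s in (tool_paper, survey, no_evaluation):
    print(s.title, "->", "included" if is_selected(s) else "discarded")
```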

4.3 Data extraction

This phase concerns (i) the creation of a classification framework for the primary studies, and (ii) the collection of data from each primary study.

In order to carry out a rigorous data extraction process, as well as to ease the control and the subsequent analysis of the extracted data, a predefined data extraction form was designed prior to the data extraction process. The data extraction form is composed of the various categories of the classification framework. The classification framework is composed of three distinct parts, one for each research question of our study (footnote 9). The overview of each part of the classification framework, and its respective parameters, is reported in Table 3, whereas the definition and values of each specific parameter are given in Sections 5, 6, and 7.

Table 3 Overview of the classification framework

For each primary study, three researchers collaboratively collected a record with the extracted information in the data extraction form for subsequent analysis. As suggested in [257], in order to validate our data extraction strategy, we performed a sensitivity analysis to check whether the results were consistent, independently of the researcher performing the analysis. More specifically, each of the three researchers considered a random sample of 5 primary studies and analyzed them independently by filling in the data extraction form for each of them. Then, each disagreement was discussed and resolved with the intervention of a fourth researcher. Specifically, this process was carried out by jointly inspecting the disagreement items, and subsequently providing references available in the literature suited to resolving the item under discussion. For example, an early disagreement arose between two researchers on the internal or external nature of a quality attribute. This item was resolved by escalating it to a fourth researcher, who provided a reference to the relevant standard available in the literature [115] and additional examples of both types of attributes.

4.4 Data synthesis

The data synthesis activity involves collating and summarizing the data extracted from the primary studies [133] with the main goal of understanding, analysing, and classifying current research on static analysis of mobile apps.

Our data synthesis was split into two main phases: vertical analysis and horizontal analysis. When performing vertical analysis, we analyzed the extracted data to find trends and collect information about each parameter of each category of our classification framework. When performing horizontal analysis, we analyzed the extracted data to explore possible relations across different parameters of our classification framework. We used contingency tables for evaluating the actual existence of those relations (footnote 10).

In both phases, we performed a combination of content analysis (mainly for categorizing and coding the studies under broad thematic categories) and narrative synthesis (mainly for explaining in detail and interpreting the findings coming from the content analysis).
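As an illustration of the horizontal analysis, the sketch below builds a contingency table between two parameters of the classification framework. It is a minimal sketch under stated assumptions: the record layout, the parameter names, and the three toy records are hypothetical, and the paper does not state how its tables were actually computed. Parameter values are modeled as sets because several categories are not mutually exclusive.

```python
from collections import Counter


def contingency_table(records, row_param, col_param):
    """Count how many studies exhibit each (row value, column value) pair.

    A study with several values for a parameter contributes to several cells.
    """
    counts = Counter()
    for rec in records:
        for r in rec[row_param]:
            for c in rec[col_param]:
                counts[(r, c)] += 1
    return counts


# Hypothetical extracted records (not taken from the primary studies).
records = [
    {"goal": {"privacy"}, "technique": {"flow", "taint"}},
    {"goal": {"malware"}, "technique": {"machine learning"}},
    {"goal": {"privacy", "malware"}, "technique": {"flow"}},
]

table = contingency_table(records, "goal", "technique")
for (goal, technique), n in sorted(table.items()):
    print(f"{goal:8s} x {technique:16s} {n}")
```

Cells with high counts relative to the row and column totals point to parameter pairs that are worth discussing in the narrative synthesis.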

5 Results - research trends (RQ1)

5.1 Year of publication

An overview of the year of publication of the primary studies is reported in Figure 2. Overall, the publication rate increased steadily until 2016. In 2017, a significant decrease in the publication rate was registered (-25 publications with respect to the previous year). The number of publications surged in 2018 (+20 publications) and remained almost constant in 2019. The lack of growth in recent years could indicate that the initial push, tied to the novelty of the topic, has now stopped. However, the coming years will be decisive in confirming or denying this trend.

Fig. 2 Bubble plot of primary studies by year and venue type

A steep increase in the publication rate can be noticed between the years 2011-2012 and 2015-2016, with a difference of 17 and 10 publications, respectively. We conjecture that the first steep increase (years 2011-2012) is due to the popularity gained in those years by the Android operating system, with its version 4.0. The appearance of lightweight static analysis approaches for mobile applications, e.g., FlowDroid [12], could instead be one of the root causes of the increase in publications between the years 2013 and 2014. No publication was found before the year 2011. Considering that the concept of mobile app originated in 2007, we conjecture that the lack of publications in the years 2007-2011 is attributable to the time required by mobile apps to gain widespread diffusion and, hence, for the topic considered (static analysis methods for mobile) to attract the interest of researchers.

5.2 Publication venue

Studies on static analysis of mobile apps have been published to a certain extent in all the most prominent top-level conferences and journals in software engineering. An overview of the most targeted venues and the papers published there is reported in Figure 3. ICSE is the venue in which most studies on this topic were published (31/261), followed by ASE (30/261). Overall, the publication venues are highly heterogeneous, with a total of 112 different venues. Only a small number of venues focus on mobile-related topics. The vast majority of targeted venues cover general areas of computing, e.g., software engineering, security, testing, and program analysis.

Fig. 3 Most targeted publication venues

5.3 Publication venue type

As shown in Fig. 4, most of the papers were published in conferences (207/261), followed by journals (40/261) and workshops (14/261). The higher number of conference papers might be due to the high pace of technological advances in the topic. Specifically, researchers may have focused more on timely publication at conferences, rather than targeting journals, which (usually) have a slower publication timeline. Interestingly, as shown in Fig. 2, 31 out of 40 journal papers were published from 2016 onwards, which can be considered as an indication of the maturing of static analysis techniques for mobile apps as a scientific topic.

Fig. 4 Primary studies by venue type

5.4 Analysis goal

The analysis goal represents the principal purposes for which the static analysis approaches were conceived. By carefully analyzing the primary studies, sixteen main analysis goal categories emerged from the keywording process. In Fig. 5, the comprehensive mapping of primary studies to analysis goals is reported. The most recurrent goals are: privacy (96/261), malware (66/261), inter-component communication (33/261), energy (25/261) and inter-app communication (24/261).

Fig. 5 Primary studies by analysis goal (Categories are not mutually exclusive)

From an inspection of the most recurrent goals, we can observe that most of the studies focus either on analysing crucial aspects of the mobile ecosystem (e.g., privacy and malware) or on improving existing analysis methods (e.g., inter/intra-component communication). We conjecture that this trend may be due to the fast pace of development that usually characterizes mobile applications, where new app releases must be quickly developed and tested in order to be published in the app stores. This may lead to a lack of interest in analysing less critical software aspects of the app, such as refactoring the code of the app itself or identifying specific code anti-patterns.

Example. The European Union data protection regulations impose restrictions on the locations to which European users’ personal data may be transferred. In P2, Eskandari et al. investigate whether these regulations are respected by mobile apps, thus safeguarding end users’ privacy. For this purpose, they developed PDTLoc, a static analysis tool that analyzes an app to identify the location of servers to which personal data is transferred.

5.5 Macro analysis goal

The macro analysis goal refers to the generic goal considered by the static analyses. The values of this parameter are based on the definition of internal and external quality attributes provided in the ISO/IEC 25010 standard [115]. Specifically, external quality attributes provide a “black box” view of the software under consideration and address properties related to the execution of the software on hardware and an operating system, e.g., reliability [115]. Internal quality attributes provide a “white box” view of the software under consideration and address static properties that typically manifest themselves at development time, e.g., maintainability [115]. So, the macro analysis goal of a primary study can have the following values: (i) external quality, if the approach evaluates one or more external quality attributes, and (ii) internal quality, if the approach evaluates one or more internal quality attributes. In order to identify approaches which explicitly aim to improve existing methods referenced in the literature, we have a third possible value for this parameter called improving of methodology; we use this value if the main goal of the primary study is to improve a static analysis method or technique.

The macro analysis goals considered by the primary studies are reported in Fig. 6. The majority of the primary studies focus on external quality (168/261). A smaller number of studies focus on the improvement of static analysis methodologies (77/261) and on internal quality (65/261) (footnote 11). From this data, we conjecture that the high pace of mobile technological advances and the strong role of end users in the mobile ecosystem are leading researchers to give more importance to external qualities. Studies aimed at refining static analysis approaches are more numerous than those focusing on internal quality, which leads us to conjecture that approaches considering internal quality are either at an early stage of development or have been less explored than those improving existing methods.

Fig. 6 Primary studies by macro analysis goal (Categories are not mutually exclusive)

In addition, the distribution of macro analysis goals throughout the years is depicted in Figure 7. Here, we observe that, although studies focusing on external quality have been the majority in each of the considered years, a steady increase can be observed from 2013 to 2016 in the number of studies that focus on either methodology improvement or internal quality. Following the overall decrease in the number of publications that occurred in 2017, studies that focus on either methodology improvement or internal quality also decrease.

Fig. 7 Macro analysis goal by year (Categories are not mutually exclusive)

Example. A resource leak is a common bug caused by failing to release resources that programmers must release explicitly (e.g., camera and sensors). Although not directly observable by end users, a resource leak might lead to several problems such as performance degradation and crashes. Relda2 (P42), a lightweight static analysis tool for the automatic detection of resource leaks in Android apps, is an example of a primary study aimed at improving an internal quality attribute.
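To make the notion of a resource leak concrete, the following toy sketch flags acquire calls that have no matching release in a straight-line sequence of calls. It is not Relda2's algorithm: the `find_leaks` function and the simplified call-name strings are assumptions made for illustration, and a realistic analysis would have to reason over control flow, callbacks, and the component lifecycle.

```python
# Acquire -> release pairs, written as simplified call names (illustrative only).
RESOURCE_PAIRS = {
    "Camera.open": "Camera.release",
    "SensorManager.registerListener": "SensorManager.unregisterListener",
}


def find_leaks(calls):
    """Return the acquire calls with no matching release later in the sequence."""
    pending = []
    for call in calls:
        if call in RESOURCE_PAIRS:
            # A resource is acquired and must be released later.
            pending.append(call)
        else:
            for acquired in pending:
                if RESOURCE_PAIRS[acquired] == call:
                    # The matching release was found.
                    pending.remove(acquired)
                    break
    return pending


leaky = ["Camera.open", "Camera.startPreview"]
clean = ["Camera.open", "Camera.startPreview", "Camera.release"]
print(find_leaks(leaky))
print(find_leaks(clean))
```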

5.6 Paper goal

This parameter can be of two types, namely: (i) Quality attribute assessment, if the research reported in the primary study focuses on assessing a quality attribute of mobile apps (e.g., security); (ii) Improvement of methodology, if the research reported in the primary study focuses on improving existing static analyses for mobile apps.

The goals taken into account by the primary studies are documented in Fig. 8. The majority of the primary studies (187/261) focus on the assessment of some quality attribute(s) of mobile apps. A lower number (83/261) instead consider the improvement of static analysis techniques. We conjecture that this trend can be associated with the more “immediate impact” of quality attribute assessment, e.g., ease of adoption and real-life utilization by practitioners. From this, we also conjecture that a certain maturity with respect to the assessment of app quality attributes has been achieved (and hence a high presence of such approaches is observable), which is reflected in the reasonable number of techniques aimed exclusively at improving the existing methods.

Fig. 8 Primary studies by analysis goal (Categories are not mutually exclusive)

Example. Ripple (P4) is a static reflection analysis for Android apps that is aware of incomplete information environments. Ripple is an improvement of methodology, as it is able to resolve reflective calls more soundly than conventional string inference. It enables more precise taint analyses when used in combination with tools such as FlowDroid (P86).

6 Results - characteristics of approaches (RQ2)

6.1 Platform specificity

This parameter identifies whether the proposed approach is designed for a specific platform (e.g., Android or iOS) or is generic and can in principle be applied to any platform. As shown in Fig. 9, the vast majority of the studies (239/261) present an analysis approach specific to Android; only one study (1/261) presents an approach specific to iOS. A smaller number of studies (21/261) present a generic approach. This imbalance may be due to the popularity and the open-source nature of the Android platform, which reduces the effort required by researchers when designing new analyses. Furthermore, Android app binaries can be straightforwardly disassembled with off-the-shelf software libraries (e.g., apktool and dex2jar; see footnotes 12 and 13), and their internal structure and contained static resources are easily analyzable in an automatic way.

Fig. 9 Primary studies by platform specificity

Example. As an example of platform specificity, P20 presents a technique to optimize the energy consumption of mobile apps by minimizing the number of HTTP requests that they perform. The proposed technique uses static analysis to detect Sequential HTTP Requests Sessions, i.e., sequences of HTTP requests in which the generation of the first request implies that the following requests will also be made. Energy savings can be achieved by bundling these requests. The technique is Generic and applicable to all major mobile platforms, as the mechanisms available to perform HTTP requests are similar across these platforms.

6.2 Implementation

Values for the implementation parameter, summarized in Fig. 10, were extracted from the primary studies according to whether the implementation used for evaluation purposes targets a specific platform (e.g., Android or iOS) or is Generic, i.e., applicable to apps developed for any platform.

Fig. 10 Primary studies by platform implementation

Almost all the studies (257/261) implement the proposed approach exclusively for the Android platform. Two studies (2/261) present approaches with a generic implementation, applicable to any platform. Only one study (1/261) presents an approach that is implemented specifically for the iOS platform. Other less popular platforms are almost completely absent, with only one study (1/261) implementing the proposed analysis on TouchDevelop scripts [233]. We speculate that this disproportion, in addition to the reasons already discussed for the platform specificity parameter, stems from the fact that some of the most popular static analysis frameworks (e.g., Soot [238] and WALA [208]) have been adapted to support the analysis of Android apps. The same cannot be said for the other platforms and, hence, researchers interested in performing static analysis on apps designed for those platforms experience a higher barrier to entry, as they must develop their own tools, often from scratch.

Example. PiOS (P134) studies the privacy threats that applications written for Apple’s iOS may pose to users. To this end, the authors leverage static analysis techniques to extract data flows from iOS apps. PiOS is an iOS-specific implementation of the proposed technique that automates the data flow extraction process from binaries resulting from the compilation of Objective-C code.

6.3 Static/Hybrid approach

The static/hybrid approach parameter describes whether an approach relies on static analysis only (Static) or utilizes some form of dynamic analysis also (Hybrid).

Results for the extraction of this parameter are summarized in Fig. 11. Most of the studies (203/261) present an approach that relies on static analysis only. Nonetheless, a considerable number of them (58/261) present an approach that complements static analysis with dynamic analysis. The presence of dynamic analysis in a considerable portion of the studies can be explained by considering that, despite all its drawbacks, dynamic analysis still provides an invaluable contribution for a variety of purposes, such as privacy leak detection, GUI modeling, and energy profiling. A further discussion of the fields where dynamic analysis is most common can be found in Section 8.

Fig. 11 Primary studies by usage of dynamic analysis

Example. SmartDroid (P113) is a hybrid analysis technique whose goal is to identify the UI-based trigger conditions required to expose the sensitive behavior of Android malware. As shown in Fig. 12, SmartDroid uses static analysis to extract Activity and Function call graphs from the application binaries. Then, guided by the static analysis results, it uses dynamic analysis to interact with the UI and identify the UI-based conditions required to trigger sensitive APIs.

Fig. 12 Example of a hybrid analysis technique

6.4 Usage of machine learning techniques

Values for this parameter are summarized in Fig. 13. The possible values identify whether the approach under evaluation complements its analysis with machine learning techniques (Yes) or not (No). A vast majority of the studies (213/261) do not make use of machine learning in the proposed approach. The remaining studies (48/261) perform feature extraction from the application source code or other intermediate representations (e.g., a method-level call graph), and apply machine learning techniques to the extracted features. Machine learning techniques are widely used for some specific goals (e.g., malware detection), but their application to others has not been explored yet by researchers.

Fig. 13 Primary studies by usage of machine learning

Example. An example of the usage of machine learning coupled with static analysis is P28. In this study, the authors adopt a machine learning approach that uses data flow application program interfaces (APIs) as classification features to detect Android malware. Static analysis is employed to extract data-flow-related API-level features, which are used to train a k-nearest neighbor model for malware classification.
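The general idea of classifying apps with a k-nearest neighbor model over statically extracted API features can be sketched as follows. This is not the implementation of P28: the four binary features, the training vectors, the Hamming distance, and the value of k are all assumptions chosen to keep the example small.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions in which two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: presence (1) or absence (0) of API usage found statically,
# in the order [reads device id, reads location, sends SMS, opens socket].
train = [
    ((1, 1, 1, 1), "malware"),
    ((1, 0, 1, 1), "malware"),
    ((1, 1, 1, 0), "malware"),
    ((0, 0, 0, 1), "benign"),
    ((0, 1, 0, 0), "benign"),
    ((0, 0, 0, 0), "benign"),
]

print(knn_predict(train, (1, 0, 1, 0)))
print(knn_predict(train, (0, 1, 0, 1)))
```

With real data, the feature vectors would come from the static analysis step, and k and the distance function would be tuned on a labeled dataset.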

6.5 App artifact

The values of this parameter describe what formats are accepted as input by the selected studies for the apps to be analyzed. As shown in Fig. 14, the majority of the studies (238/261) accepts as input apps in the form of binary packages (Binary), i.e., APK (Android PacKage) files for the Android platform or IPA (iPhone Application Archive) packages for the iOS platform.

Fig. 14 Primary studies by additional app artifacts (Categories not mutually exclusive)

This implies that the proposed analyses can be performed by a variety of subjects (app store moderators, researchers, security experts), and not only by app developers. Nonetheless, a considerable amount of primary studies (31/261) takes as input the app source code (Source Code), hence targeting app developers and researchers. In those cases, developers can potentially integrate them into their development workflow, e.g., as dedicated analyses integrated into the Android Studio IDE or as specific steps in their continuous integration pipeline. Note that both APK and source code are valid inputs for some of the studies.

6.6 Additional inputs

The possible values for the additional inputs parameter, listed in Fig. 15, identify what other inputs, if any, are required by the primary studies to perform the proposed analysis (in addition to the app itself). Overall, the majority of primary studies (194/261) is able to perform the analysis without any additional input, whereas 67/261 studies require some additional inputs. We consider this to be a positive trend, as it simplifies the adoption of the proposed techniques by industry and other researchers, additionally enabling batch analysis of a large quantity of apps more easily. Nevertheless, as for P123, in some cases relying on additional inputs is a necessity, e.g., when the app needs to be executed in a controlled, non-random, and non-trivial manner.

Fig. 15 Primary studies by additional input (Categories are not mutually exclusive)

When focusing on the studies requiring additional inputs, we can observe that additional inputs are mostly required by techniques that verify whether given policies, rules, or constraints are violated (13/261). This is followed by mappings from the source code of the app to other auxiliary information (11/261) and by techniques that focus on a list of one or more methods leveraging the app source code (10/261). A number of studies (7/261) take as input app descriptions retrieved through app stores, and leverage this information in order to perform ad-hoc analyses. For example, CHABADA [94] aims at automatically identifying malicious apps by evaluating how their implementation differs from their description in the app store. Some proposed techniques take as input the platform (8/261) or system (1/261) profiles for application execution. Other studies (4/261) take as input test cases. This is particularly noteworthy as test cases are artifacts commonly produced during the software development cycle, and how information can be extracted from test artifacts has been widely investigated in the software engineering literature [3, 140]. Other studies (3/261) focus on problems pertaining to system permissions and, consequently, take as input an identifier of the permissions of interest. Two studies (2/261) require as input the specification of a user-defined analysis. Similarly, one study requires the user to write some additional code snippets to perform the analysis (1/261). Two studies focus on app evolution and extract change information from multiple APK versions (1/261) or from the Git repository code history (1/261). One study requires a description of the workload to be executed (1/261) and one study requires execution traces (1/261). One study focuses on the behavior triggered by the interaction with user interface (UI) elements and hence requires as input a list of such elements (1/261).
Interestingly, only one study (1/261) leverages information extracted from bug reports to perform the analysis and only one study takes as input information provided by other analysis tools (1/261). It is important to notice that the vast majority of these additional inputs require the knowledge of a developer or a domain expert in order to be reproduced and only a handful can be reproduced by end-users. This makes it harder to reproduce the results and might hinder large-scale adoption.

Example. As an example, Fig. 16 presents eCalc (P123), a technique involving two main steps that are performed by an Execution Traces Generator and an Analyzer, respectively. The Execution Traces Generator uses test cases for generating execution traces. Although this step requires executing the software artifact under analysis, the actual analysis step is statically performed by the Analyzer on the execution traces by taking as input a CPU profile, without requiring the execution of the software artifact. This additional input is needed for automatically running and profiling the app under analysis multiple times in order to take into account the well-known phenomenon of energy consumption fluctuations at run-time.

Fig. 16 Example of an analysis technique requiring additional inputs

6.7 Analysis pre-steps

The analysis pre-steps parameter identifies whether the studies under evaluation require steps that must be executed manually before the analysis can be performed. Results are listed in Fig. 17.

Fig. 17 Primary studies by need of analysis pre-steps

The majority of the approaches (192/261) do not require any analysis pre-step. A still considerable number (69/261) require some analysis pre-step to be performed manually. Examples of possible pre-steps include, but are not limited to, building models of the platform APIs or libraries used by the application under analysis, collecting execution traces, collecting runtime power consumption measures, and creating rule sets or security policies. Similarly to the previous parameter, having to perform manual steps before or during the application of a static analysis approach may hinder its reproducibility and large-scale adoption.

Example. UIPicker (P71) is a primary study that makes use of preprocessing steps. UIPicker aims to reduce the risks to which users are exposed when using an application by automatically identifying sensitive user inputs. To this end, in its preprocessing module, it extracts the layout texts and reorganizes them through natural language processing techniques for further usage. This pre-step includes word splitting, redundant content removal, and stemming.

6.8 Analysis technique

This parameter identifies the family of static analysis techniques performed by the approaches proposed in the primary studies. Results are summarized in Fig. 18.

Fig. 18 Primary studies by analysis technique (Categories are not mutually exclusive)

A wide variety of static analysis techniques is used in the primary studies, the most common being Flow (171/261). A considerable number of primary studies limit their analysis to data mining (46/261) to extract relevant information from the application bytecode or source code. Taint Analysis (33/261) follows as the third most adopted analysis technique. Machine learning classification, slicing, and model-based analysis are also used to a relevant extent, in twenty-nine (29/261), thirteen (13/261), and thirteen (13/261) studies, respectively. Other less frequently used techniques are symbolic execution (12/261), points-to analysis (9/261), abstract interpretation (6/261), similarity-based analysis (6/261), constant propagation (5/261), string analysis (5/261), model checking (5/261), type inference (4/261), code instrumentation (2/261), pattern-based analysis (2/261), class analysis (1/261), formal analysis (1/261), opcode analysis (1/261), nullness analysis (1/261), responsiveness analysis (1/261), statistical analysis (1/261), termination analysis (1/261), and typestate analysis (1/261). We speculate that the popularity of Flow and Taint analysis is due to the fact that many of the issues researchers want to detect in mobile apps can be modeled under those analysis paradigms and, as further discussed in Section 8, it appears that researchers identify the technique to be used in a goal-driven fashion. We also believe that, again, researchers are limited by the available frameworks and tools, and choose to focus more on those techniques for which mature tools exist (e.g., Soot).

Example. AppSealer (P79) aims to automatically detect and prevent component hijacking attacks, a class of vulnerabilities commonly appearing in Android applications. When triggered by attackers, the vulnerable apps can expose sensitive information and compromise data integrity. For this purpose, AppSealer employs a combination of flow analysis and backward slicing. First, flow- and context-sensitive inter-procedural dataflow analysis is performed to track the propagation of sensitive information and detect if it propagates into dangerous data sinks. Then, employing backward slicing, one or more program slices that directly contribute to the dangerous information flow are computed. With the guidance of the computed slices, AppSealer automatically creates patches to deal with the discovered vulnerability, placing guarding statements at affected sinks to block the propagation of dangerous information.
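The combination of taint tracking and backward slicing can be illustrated on a toy straight-line program. The sketch below is a strong simplification and not AppSealer's analysis: the statement encoding, the `PROGRAM` contents, and both functions are hypothetical, and there is no handling of branches, method calls, aliasing, or Android components.

```python
# Each statement is (line, defined variable, used variables, kind).
# kind is "source", "sink", or None for a plain assignment (hypothetical encoding).
PROGRAM = [
    (1, "id", [], "source"),     # id = getDeviceId()
    (2, "tag", [], None),        # tag = "log"
    (3, "msg", ["id"], None),    # msg = "id=" + id
    (4, "n", [], None),          # n = 42
    (5, None, ["msg"], "sink"),  # send(msg)
    (6, None, ["n"], "sink"),    # send(n)
]


def tainted_sinks(program):
    """Forward pass: return the lines of sinks that receive tainted data."""
    tainted = set()
    hits = []
    for line, target, used, kind in program:
        if kind == "source":
            tainted.add(target)
        elif kind == "sink":
            if any(v in tainted for v in used):
                hits.append(line)
        elif any(v in tainted for v in used):
            tainted.add(target)      # taint propagates through the assignment
        else:
            tainted.discard(target)  # the variable is overwritten with clean data
    return hits


def backward_slice(program, sink_line):
    """Backward pass: return the lines whose definitions reach the given sink."""
    relevant = set()
    lines = []
    for line, target, used, kind in reversed(program):
        if line == sink_line:
            relevant.update(used)
            lines.append(line)
        elif target in relevant:
            # This statement defines a variable the slice depends on.
            relevant.discard(target)
            relevant.update(used)
            lines.append(line)
    return sorted(lines)


hits = tainted_sinks(PROGRAM)
print(hits)
print([backward_slice(PROGRAM, line) for line in hits])
```

The slice contains only the statements that contribute to the tainted flow, which is the kind of information a patching step could use to decide where guarding statements are needed.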

7 Results - potential for industrial adoption (RQ3)

7.1 Target stakeholder

As shown in Fig. 19, app developers are the most recurrent stakeholders of static analysis approaches (150/261).

Fig. 19 Primary studies by target stakeholder (Categories are not mutually exclusive)

Platform vendors (126/261) like Apple and Google distribute apps via their own dedicated mobile application markets. They can benefit from the use of static analysis approaches in their marketplaces for systematically assessing the level of quality of their distributed apps, possibly identifying those apps with an unacceptable level of quality (e.g., apps with well-known security flaws, apps asking for suspicious permissions, apps with strong energy inefficiencies). Interestingly, some approaches directly target app users (20/261), who might use static analyses to better understand how their installed apps behave and for examining and granting explicit information flows within an application. Also, users may be interested in implicit information flows across multiple applications, such as permissions for reading the phone number and sending it over the network. As an example, one of the studies targeting users focuses on debugging the energy efficiency of apps in their real context of use. Specifically, in P39 the user can launch an automatically instrumented app to precisely record and report observed energy-related failures in order to assist the developer by automatically localizing the reported defects and suggesting patch locations. Last but not least, 7 primary studies explicitly mention researchers as target stakeholders, who can extend and/or apply the proposed techniques (and their results) to their own studies on mobile applications.

Example. FicFinder (P22) aims to ease the effort required by developers to deal with compatibility issues that might be present in their apps due to the fragmented nature of the Android platform. FicFinder automatically detects compatibility issues by performing static code analysis based on a model that captures the behavior of Android APIs as well as the associated context in which compatibility issues are triggered. Once detected, FicFinder reports actionable debugging information to developers.

7.2 Tool availability

All the primary studies contribute a tool implementing the proposed approach. Nonetheless, our results also show that only 97 out of 261 studies (see Fig. 20) released the tool, making it publicly available for download and adoption. When possible, the availability of a tool supporting the proposed approach is desirable, as it helps in making the obtained results more credible, reproducible, and replicable by the community.

Fig. 20. Primary studies by tool availability

7.3 Number of analysed apps

The authors of the analyzed primary studies evaluate and validate their findings by using an input set of applications. The evaluation of this parameter builds on the assumption that approaches evaluated on a larger set of apps are more adoptable in industry, since it is less likely that they exhibit unexpected behaviors (especially for corner cases). Here, we categorized the primary studies according to the number of apps used for evaluating them.

As shown in Fig. 21, in the largest share of studies (125/261) the number of applications used for evaluating the proposed approach is greater than 1,000, followed by those studies which evaluated their approach by using fewer than 100 apps (83/261), and those studies (53/261) which took into account a medium set of apps (between 100 and 1,000). This result is promising in that a relatively good number of approaches was evaluated on a high number of applications, making the scientific community and practitioners reasonably confident about their applicability in industrial contexts. Nevertheless, it is important to note that evaluating an approach on a low number of apps should not be seen as a strongly negative point, because it may have been a necessity from an empirical perspective. For example, the number of analyzed apps could depend on the execution time of the analysis tool; if the analysis tool requires a large amount of time for each app (e.g., because it includes user thinking time), then the input set of applications is inevitably small in order to keep the experiment duration acceptable from a pragmatic perspective.

Fig. 21. Primary studies by number of evaluated apps
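To make the three size bands concrete, the following Python sketch maps a number of evaluated apps to a band and recomputes the share of each band from the counts reported in the text. The function name and the treatment of the boundary values (100 and 1,000 counted as medium) are assumptions made for illustration; they are not part of the original classification scripts.

```python
def size_bucket(n_apps):
    """Map a number of evaluated apps to one of the three bands in the text.

    Assumption: exactly 100 and exactly 1,000 apps count as 'medium'.
    """
    if n_apps < 100:
        return "small (<100)"
    if n_apps <= 1000:
        return "medium (100-1,000)"
    return "large (>1,000)"


# Counts as reported in the text for the 261 primary studies.
reported = {
    "large (>1,000)": 125,
    "small (<100)": 83,
    "medium (100-1,000)": 53,
}

total = sum(reported.values())
# Percentage share of each band, rounded to one decimal place.
shares = {band: round(100 * count / total, 1) for band, count in reported.items()}

for band, share in shares.items():
    print(f"{band}: {reported[band]}/{total} ({share}%)")
```

The three counts sum to 261, so under this reading every primary study falls into exactly one band.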

Example. AutoPPG (P15) aims to facilitate the process of writing privacy policies for mobile apps. A privacy policy is a statement informing users how their information will be collected, used, and disclosed. Failing to adhere to privacy policies can lead to severe consequences, such as steep fines. AutoPPG conducts static code analysis on mobile apps by extracting their behavior and subsequently relating such behavior to the personal information stored by the end users. Once the relation between the app behavior and personal data is established, AutoPPG leverages natural language processing techniques to generate a textual description of the fair privacy policy which characterizes the analyzed app. Due to the time-consuming nature of manually comparing the statically generated privacy policy with the existing one, the evaluation of AutoPPG was limited to 20 randomly selected apps.

7.4 Applied research method

This parameter represents the type of applied research method used to assess the proposed technique. Possible values of this parameter are Validation and Evaluation. Validation is done in lab contexts using applications specifically created or customized for the purpose of their approach evaluation. Evaluation takes place in real-world (industrial) contexts, using exclusively unmodified applications. The latter generally provides a higher level of evidence about the practical applicability of a proposed technique.

From the analysis of the primary studies, it emerged that the majority, during the evaluation phase, use unmodified applications (see Fig. 22) mined from an app market (234/261). In other cases, the applications to be analysed were created for the purpose of the evaluation, or they were customized versions of real apps (46/261). In some cases (e.g., P14, P17, P169, P259), a combination of real and custom applications is used; in these cases, custom apps are used to exercise specific aspects of the proposed static analysis approach (e.g., corner cases when building a control flow graph of the app under analysis) that are not fully covered by the original mined apps.

Fig. 22. Primary studies by applied research method (categories are not mutually exclusive)
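Because the two categories are not mutually exclusive, their counts add up to more than 261. As a small worked example, the inclusion-exclusion principle gives a lower bound on the number of studies that fall into both categories; the bound is exact only under the assumption, made here for illustration, that every primary study belongs to at least one of the two categories. The function name is hypothetical.

```python
def min_overlap(total, in_a, in_b):
    """Lower bound on |A and B| by inclusion-exclusion.

    |A and B| = |A| + |B| - |A or B|, and |A or B| <= total,
    so |A and B| >= |A| + |B| - total (never below zero).
    """
    return max(0, in_a + in_b - total)


# Counts reported in the text: 234 studies using real apps,
# 46 studies using custom or customized apps, 261 studies overall.
overlap = min_overlap(total=261, in_a=234, in_b=46)
print(overlap)
```

With the reported counts the bound is 19, i.e., at least 19 studies combine real and custom applications.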

Overall, the obtained results are promising since approaches evaluated on (a potentially large number of) real apps, in principle, undergo a more realistic investigation than those evaluated on synthetically built apps. This realism also comes from the fact that apps mined from app stores are developed in real industrial contexts involving practitioners working under real business and organizational constraints (e.g., release deadlines, specific development workflows). Moreover, apps mined from app stores can be totally different from synthetic apps because the former are distributed to and downloaded by real users; it is well known that users play a central role in the success (and indirectly in the development process) of the apps, e.g., by providing publicly accessible app ratings and reviews [179] or by deciding to uninstall disappointing apps.

Example. In P14, the authors propose two static analysis techniques for the automatic detection of a privilege-escalation attack known as Android Wicked Delegation (AWiDe). In order to manually verify the correctness of the two detection techniques, the apps for the evaluation experiments were collected from F-Droid (Footnote 14), an online repository of free open source Android apps, so that the app source code could be inspected. As 70% of the collected apps were also published on the Google Play store, the study performs both Validation and Evaluation.

7.5 Industry involvement

Each primary study was classified as (i) Academia, if the authors are affiliated exclusively with an academic organization, e.g., a university or research center; (ii) Industry, if the authors are affiliated exclusively with an industrial organization, e.g., a company, startup, or software house; (iii) Academia and Industry, if some of the authors are affiliated with an academic organization and some others with an industrial one. As depicted in Fig. 23, the vast majority of our primary studies have academic authors only (231/261), followed by studies combining researchers and industrial practitioners (29/261), and finally 1 contribution involves industrial authors only. This result is quite disappointing, as the vast majority of the studies involve no industrial researchers or practitioners.

Fig. 23. Distribution of industry involvement

In the single industry-only primary study (P91), the authors tackle the problem of Android application collusion. Specifically, they state that existing analysis techniques focus on identifying undesirable behaviors in single apps, neglecting the danger of multi-application collusion. Therefore, the authors present a collection of tools that provide static information flow analysis across sets of applications, showing a holistic view of all the applications running on a particular device. The techniques proposed in P91 include: (i) static binary single-app analysis, (ii) a security lint tool to mitigate the limits of static binary analysis, (iii) multi-app information flow analysis, and (iv) an evaluation engine to detect information flows that violate specified security policies. We believe that P91 is a good example of a research study tackling an industrially relevant problem and proposing an industry-driven solution. Academic researchers could compare with, or be inspired by, the work in P91 when designing and evaluating future approaches for static analysis of mobile apps.
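The idea of multi-app information flow analysis can be sketched in a few lines of Python. The sketch below is not the tooling of P91; it only assumes that per-app flows and inter-app links have already been extracted as directed edges, and then searches the combined graph for a path from a sensitive source to a sink. All node names, the edge list, and the function names are hypothetical.

```python
from collections import deque


def find_flow(edges, source, sink):
    """Breadth-first search for a path from source to sink in a flow graph."""
    graph = {}
    for origin, target in edges:
        graph.setdefault(origin, []).append(target)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no flow from source to sink


def apps_on_path(path):
    """Apps involved in a path, given nodes named '<app>:<element>'."""
    return {node.split(":")[0] for node in path}


# Hypothetical flows: app A reads the phone number and passes it to app B
# through an intent; app B sends what it receives over the network.
edges = [
    ("A:phone_number", "A:intent_out"),   # intra-app flow in A
    ("A:intent_out", "B:intent_in"),      # inter-app link
    ("B:intent_in", "B:network"),         # intra-app flow in B
]

path = find_flow(edges, "A:phone_number", "B:network")
is_cross_app = path is not None and len(apps_on_path(path)) > 1
```

Neither app leaks the phone number on its own in this toy graph; the flow only becomes visible when the edges of both apps are analyzed together, which is the point the paragraph above makes about single-app analyses.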

8 Orthogonal findings

This section reports on the results of our horizontal analysis. It is worth recalling that, in this phase of the study, we (i) built contingency tables for pairs of parameters coming from our vertical analysis, (ii) analyzed each one of them, and (iii) identified perspectives of interest.
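As an illustration of step (i), the following Python sketch builds a contingency table for a pair of parameters from a list of classified studies. It is a minimal sketch, not the scripts used in this study: the record layout, the parameter names, and the three example records are hypothetical. Since a study can take several values for the same parameter, each combination of values is counted once per study.

```python
from collections import Counter
from itertools import product


def contingency_table(studies, row_param, col_param):
    """Count how many studies exhibit each (row value, column value) pair."""
    counts = Counter()
    for study in studies:
        for row_value, col_value in product(study[row_param], study[col_param]):
            counts[(row_value, col_value)] += 1
    return counts


# Hypothetical classification records (not data from the primary studies).
studies = [
    {"id": "S1", "goal": ["privacy"], "stakeholder": ["developers", "users"]},
    {"id": "S2", "goal": ["malware", "privacy"], "stakeholder": ["app stores"]},
    {"id": "S3", "goal": ["energy"], "stakeholder": ["developers"]},
]

table = contingency_table(studies, "goal", "stakeholder")
for (goal, stakeholder), count in sorted(table.items()):
    print(f"{goal:10s} x {stakeholder:12s}: {count}")
```

Because values are not mutually exclusive, the cells of such a table can sum to more than the number of studies, which is worth keeping in mind when reading the counts reported in the rest of this section.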

Analysis goal - Platform specificity. Privacy is the most recurrent analysis goal for all platforms, especially for the Android operating system. The only iOS approach found in the literature also focuses on privacy. Malware is the second most studied subject in both Android and generic approaches. Overall, very few studies are platform-independent, and none of them fall in the categories performance, inter-app communication, and antipatterns.

We conjecture that the popularity of the privacy and malware analysis goals can be associated with the ubiquity and handling of sensitive data that nowadays characterizes mobile apps. As a consequence, new methods and techniques to address the associated challenges are receiving growing attention. Indeed, many of the studies focusing on privacy rely on a technique, namely the inspection of the AndroidManifest.xml file, which is quite simple to implement. This consideration further explains the high number of such studies. Regarding the performance, inter-app communication, and antipatterns goals, we hypothesize that such goals can be studied exclusively from a platform-specific point of view due to their tight relationship with the platform on which the app is running.
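To show why manifest inspection is simple to implement, the following Python sketch extracts the permissions requested in an AndroidManifest.xml and flags one illustrative combination. It assumes the manifest is available as plain-text XML (inside an APK the manifest is stored in a binary XML format and has to be decoded first). The function names, the sample manifest, and the choice of flagged permission pair are assumptions made for illustration, not the method of any specific primary study.

```python
import xml.etree.ElementTree as ET

# Namespace used by the android: attribute prefix in manifests.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def requested_permissions(manifest_xml):
    """Return the set of permission names declared via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    names = (el.get(ANDROID_NS + "name") for el in root.iter("uses-permission"))
    return {name for name in names if name}


def can_read_phone_and_use_network(permissions):
    """Illustrative check: phone state access combined with network access."""
    wanted = {
        "android.permission.READ_PHONE_STATE",
        "android.permission.INTERNET",
    }
    return wanted <= permissions


# Hypothetical decoded manifest used only for this example.
SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
    <uses-permission android:name="android.permission.READ_PHONE_STATE"/>
    <uses-permission android:name="android.permission.INTERNET"/>
    <application android:label="Demo"/>
</manifest>"""

permissions = requested_permissions(SAMPLE)
flagged = can_read_phone_and_use_network(permissions)
```

A check of this kind only shows what an app is allowed to do, not what it actually does, which is one reason why manifest inspection is often combined with other analysis techniques.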

Analysis goal - Static/Hybrid approach. Except for frameworks and antipatterns, which are supported exclusively by static analysis, the majority of the goal categories are also studied through hybrid approaches. Overall, privacy is the most studied subject in both static and hybrid approaches (74 static approaches and 22 hybrid ones). Energy consumption (13 static approaches and 12 hybrid ones) is the second most recurrent goal of hybrid analyses.

We believe that the rationale behind the popularity of hybrid approaches resides in their ability to circumvent the weaknesses that arise when using only one kind of analysis, hence making it possible to gather more comprehensive, yet precise, results. As presented in the previous section, the popularity of the privacy goal can be justified by the interest of end users, developers, and app store vendors in protecting sensitive data from unauthorised access. The high number of hybrid approaches targeting the energy goal shows the reliance of such approaches on dynamic methodologies, used to exercise the applications under analysis and gather empirical energy consumption measurements. On the other hand, we conjecture that the lack of dynamic analysis in approaches aimed at the frameworks and antipatterns goals is due to the nature of these goals, which are more tightly related to source code metrics than to runtime ones, thus making static analysis techniques more suitable for them.

Analysis goal - App artifact. In general, the vast majority of the approaches require the APK package of the mobile application. This can be attributed to the skew in the data gathered for this research, in which most of the approaches focus on Android applications. In contrast, the goals that most often require source code are refactoring and performance. Additionally, some goals that do not require access to the source code of the application were identified, namely reflection, antipatterns, similarity, obfuscation, and authorship.

Regarding the goals for which analyses are often performed on source code, we believe that the reason underlying this trend is that these types of analysis require the exact source code of the app under analysis to be carried out properly. Even though Android decompilers and disassemblers do exist, at the time of writing their precision is not high enough to perform these kinds of analysis on packaged applications [191]. On the other hand, when focusing on the analysis goals requiring an APK as input, we can notice that for testing, privacy, and energy consumption researchers have been focusing on black-box approaches, while neglecting white-box ones (at least partially). For these goals, approaches of the latter kind could be of assistance during the development of mobile apps, either by notifying developers when they unknowingly insert known antipatterns in their code (e.g., an energy hotspot in the case of energy consumption or a privacy leak in the case of privacy) or by helping them perform more efficient testing (in the case of testing).

Analysis technique - Analysis pre-steps. Eight out of 24 analysis techniques do not require pre-steps. In fact, nullness, points-to, and termination analyses are carried out by inspecting the source code repository of the application, and hence do not require additional tooling or configuration. The remaining 16 analysis techniques require pre-steps of different kinds. As expected, most of the analysis techniques needing pre-steps require the manipulation of source code, such as abstract interpretation (for which two out of three papers required analysis pre-steps). In general, only three of the 24 identified analysis techniques require analysis pre-steps in the majority of cases. This indicates that the vast majority of analysis techniques are executable “as is”, i.e., without requiring any additional process before the analysis can actually be carried out.

Target stakeholder - Analysis goal. Approaches targeting app store vendors are mostly interested in malware (57 studies) and privacy (56 studies), followed by inter-app and inter-component communication (15 and 14 studies, respectively). Approaches targeting developers are also mostly interested in privacy-related analyses (48 studies), but also consider more low-level goals, such as energy consumption (25 studies), inter-component communication (24 studies), and testing (23 studies). Approaches targeting researchers are mostly related to the improvement of state-of-the-art analysis techniques, hence often considering goals related to inter-component communication (4 studies) and frameworks (3 studies). As expected, approaches targeting end users are mostly interested in privacy (14 studies), and approaches targeting app store vendors are more interested in malware than those targeting developers (57 against 8 studies). In contrast, approaches targeting developers are more interested than those targeting app store vendors in analyses related to testing (21 against 2 studies), resources (6 against 0), refactoring (16 against 2), performance (16 against 0), and energy (25 against 0). Again, this indicates that approaches targeting developers are more interested in the quality of the applications than those targeting app store vendors; the latter are mainly focused on ensuring the security of the end user by identifying potential malware and privacy leaks.

Usage of machine learning - Analysis goal. Usage of machine learning techniques is not evenly distributed among all goals. In particular, machine learning techniques are mostly employed for the goal of malware detection: out of 48 studies leveraging machine learning techniques in their analyses, 32 fall into the malware goal, and the remainder is split among privacy (11), inter-component communication (4), inter-app communication (2), energy (1), and obfuscator identification (1) (remember that goals are not mutually exclusive). This trend is traceable to the common techniques used to identify malware applications, which most often rely on training a classifier on a collected dataset of both benign and malicious applications. It is worth noting that the same machine learning techniques can potentially be applied when targeting other goals, such as performance or energy consumption; surprisingly, only one of the studies that fall into those goals makes use of machine learning. We believe that this is due to the greater effort required for the collection of large datasets when considering these goals.
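The general scheme of training a classifier on labelled benign and malicious apps can be sketched as follows. The sketch uses a small Bernoulli naive Bayes classifier over statically extracted features (here, requested permissions); this technique is chosen only because it fits in a few lines and is not claimed to be the one used by any primary study. The training set, feature names, and function names are hypothetical.

```python
import math
from collections import defaultdict


def train(samples):
    """Fit a Bernoulli naive Bayes model from (feature_set, label) pairs."""
    vocabulary = set().union(*(features for features, _ in samples))
    label_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, label in samples:
        label_counts[label] += 1
        for feature in features:
            feature_counts[label][feature] += 1
    return vocabulary, dict(label_counts), feature_counts


def predict(model, features):
    """Return the label with the highest log-posterior for a feature set."""
    vocabulary, label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for feature in vocabulary:
            # Laplace-smoothed probability that the feature is present.
            p = (feature_counts[label][feature] + 1) / (n + 2)
            score += math.log(p if feature in features else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled apps, described by their requested permissions.
training = [
    ({"SEND_SMS", "READ_CONTACTS", "INTERNET"}, "malicious"),
    ({"SEND_SMS", "INTERNET"}, "malicious"),
    ({"INTERNET"}, "benign"),
    ({"CAMERA", "INTERNET"}, "benign"),
]

model = train(training)
label_sms = predict(model, {"SEND_SMS", "INTERNET"})
label_camera = predict(model, {"CAMERA", "INTERNET"})
```

Even this toy version makes the dataset dependency visible: the classifier is only as good as the labelled apps it is trained on, which is consistent with the remark above about the effort needed to collect large datasets for other goals.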

Industry involvement - Analysis goal. As expected, all analysis goals are considered by academic researchers. Energy (25/25), inter-component communication (26/33), malware (59/66), and privacy (81/96) are the most targeted goals for academic researchers. In some cases, when the analysis goal concerns privacy (15/96), malware (6/66), inter-component communication (6/33), inter-app communication (2/24), framework (2/8), testing (1/23), resource (1/6), and refactoring (1/18), academic researchers are supported by industrial professionals.

By analyzing these results, we can conjecture that, although industrial organizations are interested in addressing the issues related to these goals, there is still a lack of industrial involvement when targeting other research goals, such as energy and performance, that would improve the overall user experience of mobile apps. We argue that researchers should more actively try to involve industry practitioners when working on such goals.

Target stakeholder - Analysis technique. Approaches to be used by app store vendors make more prominent use of techniques such as data mining (33/46), taint analysis (18/33), and classification (21/29). This is in line with the most prominent goal of such stakeholders, i.e., identifying malicious applications in order to remove them from their stores. On the contrary, approaches to be used by developers, who are more interested in the inner workings of the applications, are characterized by a higher usage of techniques based on flow analysis (108/171). An explanation for this trend is the difference in performance among static analysis techniques: approaches targeted at app stores must be highly scalable, as they have to be executed daily on thousands of apps, whereas approaches targeted at developers have less stringent requirements. This suggests that improving the performance of some techniques is a relevant open problem, as it is currently a limiting factor for the kind of analyses that can be performed on app stores.

Tool availability - Analysis goal. When dealing with static analysis, automation is a crucial requirement for an approach to be effectively adopted in practice. Although many different approaches have been proposed for the majority of the identified analysis goals, most of them do not have a (released) tool ready for adoption by practitioners. On the one hand, we can argue that addressing goals such as privacy and malware may require the realization of a mature supporting tool, with a development effort that cannot always be afforded. On the other hand, some goals are more of theoretical interest, with potentially marginal practical impact, such as the study of an analysis framework itself. Nonetheless, we encourage researchers to make the extra effort required to release their analysis tools to the research community: not only does it make their results easier to replicate, but analysis types for which a mature tool has been made available have also been far more explored by the scientific community (as in the case of FlowDroid [12] for flow analysis).

9 Discussion and future research challenges

The results presented in the previous sections give a data-driven, objective overview of the current state of the art on static analysis for mobile apps. In this section, we provide our own interpretation of the main points we deem as important challenges for future researchers in this area.

Is there life after Android? When considering the targeted platforms, it is evident that Android is the clear winner, with more than 90% of approaches targeting it. On the one hand, we could have expected this result (as of today, Android is the most popular mobile operating system with more than 90% market share [2] and a relatively large number of open-source tools for app analysis); on the other hand, it makes us wonder what the future holds for the large body of Android-specific knowledge and tools that we researchers are producing. Looking back in time, it is widely recognized that the mobile ecosystem is extremely dynamic, with platforms unpredictably rising and falling in terms of device sales, company acquisitions, and users flowing to and from other platforms. For example, 10 years ago, Apple iOS and Symbian had 38% and 16% of the market share, whereas today they account for less than 14% together (Footnote 15).

It is encouraging to see that 2 approaches out of 261 are generic (even though the implementation of the majority of them is again Android-specific). We believe that in the future researchers should reason at a higher level of abstraction, and focus more on approaches which are technology-independent, generic, and applicable to different platforms with reasonable effort. It is only in this way that our research results will pass the test of time and will (hopefully) remain relevant in the future, despite the inevitable technological waves we will be facing. It is important to note that we are not suggesting that platform-specific aspects be neglected altogether; rather, we are proposing to design our research products to be platform-independent and robust with respect to (future) technologies. Among other options, researchers might take advantage of the well-known principles of extensibility and separation of concerns, and of layered or plugin-based architectures, to make their research products applicable in the context of new technologies without disrupting their general principles and base mechanisms. This will also speed up research by helping researchers avoid reinventing the wheel whenever a (potentially applicable) research product is applied to a new mobile platform.

Analysis goals shall be expanded substantially. The results of our study tell that privacy and malware are the most targeted analysis goals, far more than the others (e.g., performance, energy, resources usage). This is a clear gap that we, as researchers in the area of mobile apps analysis, should be filling in the future.

Given its strong importance for mobile apps, it seems that performance is extremely under-explored. Indeed, performance is a fundamental aspect of mobile apps development and is one of the top concerns for both developers and users; indeed, frequent complaints in app stores are about apps’ performance, impacting the ratings of the apps and potentially undermining their chances of success [59, 162]. Moreover, anti-patterns identification and refactoring are among the least explored analysis goals so far, despite the fact that bug fixing and code re-organization are among the most recurrent activities of mobile apps developers [197]. In this context, P52 can be considered as a reference study about how to propose, design, and evaluate a refactoring method for mobile apps. Specifically, P52 presents a preliminary large-scale formative study about how developers approach asynchronous programming in Android apps. Then, based on the obtained results (e.g., that developers are using the Android AsyncTask construct also for long running operations, potentially leading to memory leaks, lost results, and wasted energy), a tool-based method is proposed for (i) statically identifying usages of the AsyncTask construct which can be automatically improved, and (ii) refactoring those parts of the app via a safe code rewriting algorithm. Finally, an empirical evaluation provides objective and reproducible evidence about the applicability and saved effort of the proposed method.

Users are being left out of the equation. From the results of RQ3, it emerged that only 20 studies consider end users as stakeholders, revealing that researchers are mostly focusing on techniques aimed at assisting developers, store moderators, and researchers instead. Although this imbalance is not unexpected, when also considering that the majority of studies focus on privacy as their goal, we can notice a lack of user-first privacy approaches. Indeed, privacy is a subjective property, as different users may have different concerns when judging the trustworthiness of an application. Current solutions fail to address this subjective aspect of privacy, considering all users as equals. In light of these considerations, we can identify one research area that is currently open and overlooked: the design of more user-centric approaches to privacy, where users are provided with the necessary tools to specify and validate the “personal” requirements with which an application must comply [218, 219].

Developers are being left out of the equation too! Even though when answering RQ3 it emerged that practitioners were involved in 30 studies, it also emerged that almost none of the approaches have been evaluated or adopted in an industrial environment. We consider this finding as an indication that practitioners are involved in the technical phases of the study (e.g., elicitation of the requirements for the approaches, definition of the analysis steps, evaluation of the experiment results), but not as subjects of the evaluation of the proposed approaches. This situation is in strong contrast with the fact that the most recurrent stakeholders of the proposed approaches are the practitioners themselves. For the future, we strongly advise closing the loop by including practitioners in all the phases of the studies, especially while (i) defining the assumptions, requirements, and usage scenarios of the proposed static analysis approaches, as well as (ii) evaluating the proposed approaches in terms of their usefulness, applicability, and usability. Ideally, the latter can be performed by applying the case study methodology [257]. This is already happening in other research areas within the software engineering domain, such as software energy efficiency [239], technical debt [180], and software testing [214].

Tools and datasets shall be released and publicly available. An underlying problem which hinders the effective uptake of research on static analysis of mobile apps lies in tool availability. In fact, from the results of our research, we observe that only a small portion of the tools used or developed in the primary studies are available online. This constitutes a serious problem for researchers interested in extending or adapting tools which have already been developed. Additionally, the data used in the primary studies (e.g., accurate versioning history of the apps used for experimentation) is only seldom available. This potentially slows down investigations, as datasets still have to be created on an ad-hoc basis because the number of already available ones is scarce. In recent times, this trend has been countered by the establishment of conference tracks explicitly aimed at making datasets publicly available. Among the most prominent ones are the “Artifact” track of the International Conference on Software Maintenance and Evolution (ICSME), and the “Data Showcase” track of the Mining Software Repositories (MSR) conference. Contributions to these tracks range from general-purpose datasets, e.g., large versioning datasets focusing on Android applications [89], to context-specific datasets, e.g., to support dynamic analyses of Android applications [41]. Finally, from the findings of our study, we detect a shortcoming shared by many studies on static analysis of mobile apps, namely the impossibility of replicating the reported results. In fact, the absence of structured replication packages, in the form of the tools and datasets used, precludes the possibility of replicating the results reported in the primary studies. This constitutes a major problem affecting not only researchers interested in the field of mobile static analysis, but also the soundness of the studies themselves.

10 Threats to validity

In order to ensure the high quality of the data gathered for this study, a well-defined research protocol was established before carrying out the data collection. The research activities were designed by following a set of well-accepted and revised guidelines for systematic mapping studies [133]. From the formalization of such guidelines, we established the research protocol that was strictly followed throughout the study, as documented in Section 4. In addition, in order to further ensure adherence to the established protocol and the envisioned quality standards, all the steps of the research (e.g., study design, search and selection, data extraction, data analysis) were carried out as a team. This activity was deemed necessary also to lower potential sources of bias by discussing crucial considerations as a team. Even when adopting a methodical literature review approach, threats to validity are still unavoidable. The remainder of this section reports on the main threats to the validity of our study and how we mitigated them.

External validity refers to conditions that hinder the ability to generalize the results of our research [257]. The major threat in this category is that our primary studies might not be representative of the state-of-the-art research on static analysis of mobile applications. In order to mitigate this threat, we adopted a search strategy consisting of a manual search encompassing all the top-level software engineering conferences (Footnote 16) and international journals (Footnote 17) according to well-known sources in the field. This process was further extended by executing a backward and forward snowballing process on the selected literature. In order to ensure the quality of the selected studies, we exclusively considered peer-reviewed papers and excluded the so-called grey literature, such as white papers, editorials, etc. We do not consider this decision a significant source of bias, as peer-review processes are a standard requirement for high-quality publications. Finally, we adopted a set of well-defined inclusion and exclusion criteria, which rigorously guided our selection of the literature.

Internal Validity refers to the influences that can affect the design of the study, without the researcher’s knowledge [257]. In this regard, we defined a priori a rigorous research protocol for the study. The classification framework adopted was established iteratively by strictly following the keywording process and it has been piloted by three researchers in an independent manner. Regarding the synthesis of the collected data, such process was carried out by adopting simple and well-assessed descriptive statistics. Subsequently, during the orthogonal analysis, we performed sanity tests on the extracted data by cross-analyzing different parameters of the established classification framework.

Construct validity refers to the extent to which the primary studies selected are suited to answer our research questions [257]. In order to mitigate this threat, we manually and thoroughly inspected the literature published in the top-level software engineering conferences and journals. This procedure was performed by adhering to a rigorous predefined protocol. In addition, the results of this process were expanded by integrating the results gathered through a backward and forward snowballing process. Subsequently, we methodically selected the identified studies by applying a set of well-documented inclusion and exclusion criteria. This latter process was carried out by three researchers independently. As recommended by Wohlin et al. [257], a random sample of eight studies was selected and analyzed by all three researchers in order to ensure that the analyses were aligned.

Conclusion validity refers to issues that might hinder the ability to draw the correct conclusion from the data gathered [257]. In order to minimize such threats, we carefully carried out the data extraction and analysis by strictly adhering to an a priori defined protocol. This protocol was specifically conceived to collect the data necessary to answer our research questions. This enabled us to reduce potential sources of bias resulting from the data extraction and analysis processes. In addition, this methodology ensured that the extracted data was suited to answer our research questions. In order to further mitigate potential threats to conclusion validity, we adhered to the best practices reported in several well-known guidelines for systematic literature reviews [132, 203, 257]. Such guidelines were strictly followed throughout each phase of our research, and were comprehensively documented in order to make our research approach transparent and replicable.

11 Conclusions

The systematic mapping study reported in this paper allowed us to precisely characterize the most relevant methods and techniques for statically analyzing mobile apps. Starting from over 12,000 potentially relevant studies, we applied a rigorous selection procedure resulting in 261 primary studies spanning 122 scientific venues and a time span of 9 years.

We rigorously defined a classification framework with the goal of identifying, evaluating, and classifying the characteristics of existing approaches to the static analysis of mobile apps, while understanding trends and the potential for industrial adoption.

The main findings of this study have been synthesized by performing (i) a combination of content analysis and narrative synthesis (vertical analysis), and (ii) a correspondence analysis via contingency tables (horizontal analysis).
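As an illustration of the horizontal analysis, the following minimal sketch shows how a contingency table can be built from a pair of classification parameters. The parameter values ("security", "Android", etc.), the sample data, and the function name contingency_table are hypothetical and chosen only for illustration; they are not taken from the study's actual data extraction.

```python
from collections import Counter

# Hypothetical extracted data: one (analysis goal, target platform) pair
# per primary study. These values are illustrative, not the study's data.
studies = [
    ("security", "Android"),
    ("security", "Android"),
    ("energy", "Android"),
    ("security", "iOS"),
    ("performance", "Android"),
]


def contingency_table(pairs):
    """Count how many studies fall into each (row, column) combination."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


table = contingency_table(studies)
for row, cells in table.items():
    print(row, cells)
```

Each cell counts the studies sharing a given combination of values; empty or unexpectedly large cells are the kind of pattern that a cross-analysis of two parameters is meant to surface.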

Our study will help researchers and practitioners in identifying the purposes and the limitations of existing research on static analysis of mobile apps. We also assessed the potential of research on static analysis of mobile apps, discussing how to foster industrial adoption and technology transfer. The knowledge of the potential of existing methods and techniques constitutes a reference framework in support of researchers and practitioners, such as app developers, who are interested in selecting existing static analysis approaches and want to critically understand what they offer and how. In this sense, we can argue that this work constitutes a valuable asset to both academia and industry across the wide spectrum of static analysis.

12 Appendix

12.1 Research team

Four researchers were involved in this study, each of them with a specific role within the research team.

  • Principal researchers: Gian Luca Scoccia and Roberto Verdecchia, postdocs. They took part in all the activities, i.e., planning the study, conducting it, and reporting;

  • Research methodologist: Ivano Malavolta, assistant professor with expertise in empirical software engineering, software architecture, and systematic literature reviews; he was mainly involved in (i) the planning phase of the study, and (ii) supporting the principal researchers during the whole study, e.g., by reviewing the data extraction form, selected primary studies, extracted data, produced reports, etc.;

  • Advisor: Marco Autili, associate professor with many years of expertise in software engineering methods applied to the modeling, verification, analysis, and automatic synthesis of complex distributed systems, and in the application of context-oriented programming and analysis techniques to the development of (adaptable) mobile applications. He took final decisions on conflicts and methodological options, and supported the other researchers during data and findings synthesis activities.

12.2 Primary studies

Table 4 reports the full list of the 261 primary studies.

Table 4 Primary studies