How do Android developers improve non-functional properties of software?

Nowadays there is an increased pressure on mobile app developers to take non-functional properties into account. An app that is too slow or uses much bandwidth will decrease user satisfaction, and thus can lead to users simply abandoning the app. Although automated software improvement techniques exist for traditional software, these are not as prevalent in the mobile domain. Moreover, it is yet unknown if the same software changes would be as effective. With that in mind, we mined overall 100 Android repositories to find out how developers improve execution time, memory consumption, bandwidth usage and frame rate of mobile apps. We categorised non-functional property (NFP) improving commits related to performance to see how existing automated software improvement techniques can be improved. Our results show that although NFP improving commits related to performance are rare, such improvements appear throughout the development lifecycle. We found altogether 560 NFP commits out of a total of 74,408 commits analysed. Memory consumption is sacrificed most often when improving execution time or bandwidth usage, although similar types of changes can improve multiple non-functional properties at once. Code deletion is the most frequently utilised strategy except for frame rate, where increase in concurrency is the dominant strategy. We find that automated software improvement techniques for mobile domain can benefit from addition of SQL query improvement, caching and asset manipulation. Moreover, we provide a classifier which can drastically reduce manual effort to analyse NFP improving commits.


Introduction
Several studies have shown that non-functional performance characteristics have a strong impact on user satisfaction (Inukollu et al. 2014;Lim et al. 2014;Khalid et al. 2014;Liu et al. 2014;Ferrucci et al. 2015;Kim et al. 2015;Martins et al. 2018;Gao et al. 2018Gao et al. , 2020. Performance issues are especially important in the mobile, resource constrained domain. Banerjee and Roychoudhury (2017) analysed 170,000 user reviews of mobile applications, and classified reasons for user downvotes. Three out of five identified categories related to non-functional properties. Khalid et al. (2014)'s study of iOS applications also showed that unresponsive, resource heavy applications, and those with network related issued were among the top most frequent sources of complaints. Issues related to non-functional software properties lead not only to downvotes, but also large number of uninstallations (Banerjee and Roychoudhury 2017) rendering them a priority in the mobile software development process.
Most of the aforementioned work provides analysis on what the issues are, rather than how these can be solved. Only recently Hort et al. (2021) carried out a comprehensive survey of existing research on performance optimisation for mobile applications. These included offloading, antipattern detection, refactoring, prefetching, choice of a different programming language, reordering of I/O calls and changes to hardware components (Hort et al. 2021). Their survey shows that tool support for automated improvement of non-functional properties has been scarce, mainly targeting energy consumption, and if available, targeting one property of choice, sometimes having adverse effects on the others (Hort et al. 2021).
Among automated approaches for improvement of non-functional properties for traditional software, search-based approaches have gained popularity in the last few years. For example, Wu et al. (2015) automatically applied changes directly to source code to produce a Pareto front of program variants that improve memory consumption and runtime efficiency. Basios et al. (2018) and Burles et al. (2015) both modified the data structures used by software to improve non-functional properties by finding less resource-demanding combinations. These and other techniques have been successful at optimising traditional software, though these are yet to be applied more widely in the mobile domain. The advantage of using search-based tools, is that they are well-suited for multi-objective optimisation, something that is missing in the mobile application improvement field (Hort et al. 2021).
We pose that software repositories offer researchers a wealth of information about the behaviour and techniques used by actual developers. These can be used to find patterns that can be mimicked by search-based software engineering approaches for optimisation of nonfunctional software properties (Harman and Jones 2001). Although several previous studies focus on performance bugs in traditional software, such as the study by Jin et al. (2012a), only a few studies on mining performance improving commits in the Android domain exist (e.g., Das et al. 2016;Moura et al. 2015). Moreover, those do not provide fine-grained enough information to guide developers of search-based software development tooling. Previous studies were also concerned with finding general patterns across as many projects as possible, thus employed a sampling strategy that would alleviate the expensive manual analysis cost. This leads to under-approximation of the true number of non-functional property improving changes.
To fill this gap we mined the most popular Android repositories, using single-keyword search and analyse all returned results, to find patterns that could be utilised in searchbased automated software improvement tooling. We focus on four non-functional properties in particular: execution time, memory consumption, bandwidth usage, and frame rate. We chose these as they are most related to mobile app performance, key issue for users, as previous studies show Banerjee and Roychoudhury (2016) and Hort et al. (2021).
First, we mined the repositories of the 20 most popularly downloaded mobile applications, according to Fossdroid, and manually examine the resultant 3,132 commits,finding 229 were actually NFP improving ones. 1 Although this process should give us a good overview of non-functional property improving strategies for performance, it only allows for analysis of a relatively small number of repositories. However, the detailed analysis provides us with a corpus of data on which we can train a classifier that could help gather and analyse more data. Therefore, we devised such a classifier and analysed a further randomly selected set of 80 repositories, manually analysing 495 commits found, which added 331 non-functional property improving commits to our dataset. We categorised all the commits found, to help us identify emerging patterns. We also report on whether current automated improvement tools already allow for such transformations to be found, and if not, if such tools could be extended to provide new, useful software transformations. Finally, we examined features of the repositories we analysed. This is to provide recommendations for software developers, for what types of mobile applications non-functional improvements are likely to be found.
Our results show that non-functional property improvements to app performance are rare: from 74,408 commits mined across 100 repositories, only 560 were deemed to improve execution time, memory consumption, frame rate or bandwidth (229 identified by manual search and 331 by using a classifier). However, we can still draw interesting conclusions about their nature. In particular: -In 10.7% of cases, developers were willing to sacrifice one non-functional property over another, while in 6.5% of cases developers were able to improve upon multiple properties at once. This shows the need for tooling that can handle multi-objective optimisation. -The strongest indicators for the number of non-functional property improving commits in a repository was the total number of commits, number of contributors and number of stars. -Current search-based improvement tooling mimics 5 out of 23 non-functional improvement strategies found. -Future automated techniques for improvement of non-functional properties could be enhanced by incorporating automated caching, SQL query, and image transformations. We propose detailed transformation patterns to aid researchers and developers in the design and adoption of such strategies.
Overall our results provide recommendations for software engineers, aiming to provide better tooling for automated software improvement; and for researchers, providing patterns of how developers improve mobile applications' non-functional properties related to mobile app performance, as well as a classifier that can help with future mining studies in this domain.
All our data and scripts are freely available to allow for reproduction (github.com/ SOLAR-group/NonFunctionalAndroidCommits), replication and extension of our work.
The rest of this paper is organised as follows: Section 2 describes our methodology; Section 3 presents our results; in Section 4 we discuss implications of our study in software engineering research and practice; Section 6 presents threats to validity; while Section 5 presents related work; Section 7 concludes the paper. Appendix A contains additional materials.

Methodology
In order to answer how Android developers improve the performance-related non-functional properties of software (performance NFPs), and how we can use this knowledge to potentially devise new software transformations for tools for automated software improvement, we mine open-source Android projects for commits that improve four non-functional software properties (NFPs): execution time, memory consumption, bandwidth usage, and frame rate. Along with energy efficiency, previous research shows these are often found in user reviews (Banerjee and Roychoudhury 2017;Khalid et al. 2014), yet have not been extensively tackled in the literature (Hort et al. 2021). 2 We aim to answer the following research questions: We want to know at which stage in software development do performance NFP improving commits occur, whether these are considered as standalone improvements, and whether these improve multiple NFPs or prioritise one whilst possibly sacrificing another. These should give us an overview of the current Android development practice with respect to performance NFP improvement. RQ3 What type of code changes do Android developers make to improve app's performance NFPs?
We want to also investigate what sort of changes developers make to source code to improve its performance NFPs. Examining these changes will allow us to compare current search-based improvement techniques to real-world commits and make suggestions for how these techniques can be improved.
To answer these research questions we have manually curated a corpus of 560 nonfunctional property improving commits, which were collected by analysing a total of 74,408 commits mined from 100 open-source Android repositories. In the following section we explain in detail our collection procedure. We have made this corpus publicly available to allow for replication and extension of our work (github.com/SOLAR-group/ NonFunctionalAndroidCommits).

Overview of Methodology
Below we present the methodology used to create our corpus. It consists of three steps: Keyword mining: In this step we collect a set of performance NFP improving commits by filtering them first based on keywords and then manual analysis. Classifier mining: In this step we expand this set by using a classifier trained on the commit messages gathered in the previous step.

Categorisation:
In this step we attempt to manually group the commits into categories. These categories allow us to find common patterns used to improve the four nonfunctional properties of interest: runtime, memory consumption, bandwidth and frame rate.

Corpus
In the first step, we mined the twenty most popularly downloaded Android applications according to Fossdroid, 3 and extracted a total of of 28,028 commits. As it would have been infeasible to manually inspect such a large set to identify NFP improving commits, we have adopted a semi-automatic approach that examines every commit message based on keyword search (as detailed in Section 2.3). This lead us to a total of 3,132 commits, which were then manually analysed in order to label them as performance NFP improving commits or not. A final set of 229 NFP improving commits was deemed to improve one of the four non-functional properties of interest. We note that in previous work Moura et al. (2015) opted for two-word key-phrases rather than keywords to massively narrow down the number of commits to manually analyse. Das et al. (2016) only mined commits from the main modules of applications, missing any changes to back-end modules. We opted not to take these actions, and avoid missing possible useful software transformations by mining all commits with generic keywords.
In the second step, we leverage this curated set of NFP improving commits, to train a classifier to be able to automatically identify such commits. This allowed us to automatically analyse a much larger set of commits (46,378), mined from 80 randomly selected F-droid repositories, and filter out irrelevant (i.e., not NFP improving) commits with a precision of 95%, as detailed in Section 2.4. Specifically, we used the classifier to automatically identify 331 additional NFP improving commits by randomly sampling F-droid. We initially found a total of 495 commits, which were then manually validated by two of the authors to make sure they improve any of the four non-functional properties of interest. This manual check led to the identification of 331 performance NFP improving commits.
The final size of our manually curated corpus thus consists of 560 NFP improving commits (229 from the first and 331 from the second step). We then manually categorised these commits by the type of change which was made to improve the NFP, by analysing their commit messages and diffs. This resulted in 23 categories of improvement types being found.
Next, we detail how we mine NFP improving commits by using keyword search (Section 2.3) and the classifier (Section 2.4), as well as how we manually validate the NFP improving commits and categorise them (Section 2.5).

Step 1: Identifying NFP Improving Commits Based on Keyword Search
We mined 28,028 commits from the twenty most popularly downloaded applications according to Fossdroid (as of 18/03/2020), a website which offers an alternative user interface to the standard F-Droid web page. These applications are diverse in nature (e.g., gaming applications, streaming applications, browsers) and size, having between 13 and 6,157 commits. Details of each application repository can be found in Tables 1 and 5. Whilst the repositories of these applications are hosted on a variety of platforms (GitHub, GitLab, etc.), all repositories use the git version control system. The git log command was used to generate a list of commit messages, which was then parsed and searched for sets of relevant commits, that suggest improvements to the following four non-functional properties: In order to identify relevant performance NFP improving commits, each repository was mined by searching every commit message for a series of keywords (or parts of words in some cases, e.g. "effic" to capture all words similar to "efficient", "inefficient", etc.) associated with the particular property, by following a three-step process, as described below, and then manually validated.
Initial Selection An initial set of keywords was generated by a combination of our knowledge of relevant terminology (which we have gained by writing NFP improving commits ourselves) and the examination of the language used in commit messages written by others. We then augmented this set with 15 keywords 4 used in previous work conducting similar analysis (Jin et al. 2012b;Mazuera-Rozo et al. 2020;Das et al. 2016;Linares-Vásquez et al. 2015;Chen et al. 2019b). Any commit containing any of these keywords was selected for manual evaluation. Every selected commit message was manually evaluated to see if it actually suggests that an NFP has been improved or not. This approach aims to highlight as many commits as possible that could improve non-functional properties and therefore result in many false positives being manually evaluated. This helps to reduce the number of false negatives and allows us to detect as many relevant commits as possible.
Keyword Expansion Synonyms for all keywords were searched for using the SEThesaurus (Chen et al. 2019a), a natural language processing (NLP) tool for finding synonyms in an SE context. Terminology found during manual evaluation of commits which suggests improvement but not present in the initial keyword set was added to a new keyword set. Another search took place with the new keywords in the same way as the original search. The keywords used can be found in Table 2.

Keyword Validation
To validate the keywords we conducted a text analysis by tokenising and lemmatising all words over all commit messages. The resulting 12,230 tokens were grouped according to the commits relevant (229), irrelevant (3132 -229), and filtered out (27028 -3132), based on the keywords used. These tokens were then ranked by how often they occur in each group. From these rankings we attempted to identify possible keywords that we may have missed. First we removed all tokens that occur less than 10 times in the commits identified as improving performance NFPs: This resulted in the identification of 76 tokens, which could potentially be used as keywords. Then we further filtered out tokens by focusing only those that occur in the relevant group more or as often than those occurring in the irrelevant group. This step allowed us to filter out words such as 'and', which are common in all commits. Of the 6 remaining tokens, three were already included as keywords (i.e., memory, faster, and leak). The remaining three were save, reduce and low. These three terms may be considered as additional keywords to identify additional NFP commits, yet their use could increase the already high manual effort needed to inspect the selected commits. In fact, in our study, these three keywords (save, reduce, low) relate to 111 filtered out commits. After manual inspection, we found that of these 111 commits only a single one could be identified as relevant; this commit also contains the word 'mem' instead of 'memory', suggesting that using keyword search may miss those commits that use abbreviations like this or contain misspellings of keywords. However, as most commits contain more than one keyword, the keyword set used herein can capture the majority of those commits too. As only three more relevant keywords were identified out of the 12,230 unique tokens present in the relevant commits, and they led to the identification of only one additional relevant commit out of 111, we are confident that the set of keywords used to conduct our study is comprehensive and effective.  Furthermore, the first author of this paper manually analysed the resultant commit set. Some commit messages were found ambiguous as to whether or not they offer any improvement. Developers sometimes write commit messages about what they have done but not why they have done it. Such commits were also independently analysed by another author. If the second author also found the commit to be ambiguous and not explicitly labelled as and improvement, it was discarded. We also discarded those commits which were merges with a single child commit as they were considered duplicates. We refer to the final set of manually curated commits gathered in this step as the "manual set".

Step 2: Identifying NFP Improving Commits Based on Automated Classification
While in the previous step we use keyword search to narrow-down the number of commits for manual investigation, in this step we explore the use of an automated classifier, which leverages on the manual set obtained from Step 1.
The classifier we propose has been trained with the classified data from Step 1, i.e., all commits manually excluded after the keyword search are labelled as irrelevant, while all commits included are labelled as relevant. 5 In addition, we have included 368 commits manually identified as relevant towards execution time in previous work (Mazuera-Rozo et al. 2020) to the relevant commit set. We train the classifier using only the commit messages of the commit. 6 In order to search for an accurate prediction model, we have investigated a total of 20 classification algorithms exploiting 6 different settings for feature selection. The settings were derived from the featurization of text tokens via TF/IDF, Bag of Words (Yamauchi et al. 2014), and an adapted version of Bag of Words where only words occurring with a discriminative significance in either the irrelevant or relevant groups were used in the feature vector. Next, we present only the best result of these attempts, while more information about the training of the classifier can be found in Appendix A. The best classifier was achieved via stemming as pre-processing step, TF/IDF for featurization using a Decision Tree classifier. We assessed its effectiveness via cross-validation by using 10 hold-out repetitions (80%/20% train/test split), each time using a different seed. The results show a good level of classification with a precision of 73% and recall of 80% in the relevant class (see Table 3).
In order to show the reduction in manual effort required when using our classifier we run it on two datasets. Table 4 shows a comparison of commits identified via keyword 5 We decided to group all relevant commits into one single group, as preliminary analysis showed that attempting to classify the commits into multiple classes (i.e., execution time, memory, bandwidth and frame rate) produces classes that are too small for building an accurate classification model (the Recall in all groups was less than 0.1). 6 We considered also using issue messages to identity commits. However the analysis of our KM data set showed that only 13% of commits had associated issues. Most (52%) of those issues were associated with 10 or more commits, meaning that only a small fraction of their messages and comments would be related to the commit that we are interested in. Table 4 Comparing our keyword search to our classification based approach on two datasets. The 368 number of relevant commits for the Mazuera-Rozo et al. dataset was taken from their work https://github.com/ amazuerar/perf-bugs-mobile/blob/master/bug-fixing-commits-performance.csv. We note that authors report 380 in their paper, but 11 commits don't exist anymore , and an additional 440 commits that may be relevant, but were filtered by the keyword search.
As the cross-validation confirms the effectiveness of the classifier, we re-train it on the entire available dataset in order to classify performance NFP improving commits on unseen data, thus further validating our classifier in a real usage scenario, and extending our corpus of NFP improving commits with the commits correctly classified as such.
To this end, we randomly selected 80 repositories from F-Droid and used the classifier to automatically classify all 46,378 commits extracted from these repositories. 7 Details of the repositories are provided in Table 5. The classifier identified 475 commits relevant commits. Two of the authors manually analysed these commits, as they did in Step 1, to check whether the commits classified as relevant are actually NFP improving commits, i.e., true positives. They found that only 164 commits were false positives, giving a manually evaluated precision of 66.87% for this classifier, and 331 commits added to our corpus. 8 To further verify our classifier, we evaluated its performance on 5 randomly selected repositories from the set that was mined with the classifier. We perform keyword mining on these repositories in order to identify the false negatives of the classifier. Of the 5 repositories selected, 3 were found in both CM and KM to contain no performance NFP improving commits. In the repositories where commits were found, one was found to have 5 performance NFP improving commits compared to 3 found by the classifier, and in the other the same 3 commits were found by both approaches. These repositories are all small yet representative of many of the repositories which were mined. In order to evaluate the classifier on a larger repository with many commits, we also run it on the Koreader repository where the most CM commits were found (147 overall, see Table 7). We manually analyse all commits  found using keyword search. In this project we found 2 additional relevant execution time commits, 2 additional memory commits, 1 additional frame rate improving commit and the same set of bandwidth improving commits with the keyword search. This means that the classifier only missed 5 relevant commits. Our final manually curated corpus of performance NFP commits thus contains a total of 560 commits, which we use to answer our RQs.

Step 3: Categorisation of Mined Performance NFP Improving Commits
To gain a greater understanding of the set of commits, we manually classified them by the type of change that was made. For each performance NFP (i.e., execution time, memory consumption, bandwidth, and frame rate) the set of extracted commits was examined and categories were generated, based on commit type. Commits were inserted into relevant categories or into new categories if they could not be classified inside current ones. Commits which could be classified into more than one category due to multiple changes were added to both categories. If two categories had a large shared membership or it became difficult to place a commit into either category, the categories were combined into a single category encompassing the traits of both. Some commits were unclassifiable, e.g., due to improvements being buried in a large list of changes, or changes requiring domain specific knowledge that is not explained in the commit message. If many commits were left without a category, the uncategorised commits were re-examined to determine if any categories had not been uncovered in the first class. Next another author examined the categorised commits to analyse whether or not they belonged to a given category. In case of disagreement the commit was placed in the unknown category, this occurred in 15 instances. The two authors also independently examined the issues associated with these commits in order to gather any extra information that could aid us in categorising them. In the case of non-classifiable commits we had to rely on the description of the optimisation written by the developer in the commit message as evidence that there was an improvement. 9 Table 6 summarizes the categories of the two separate datasets of found commits via keyword and classifier search respectively. The results show only around two percent discrepancy in the memory usage and framerate categories. The Execution time is represented 9.7% stronger via the classifier, and the bandwith around 9.9% less via the classifier. The reason for this may be a slight bias towards run time performance, but may also be simply because bandwidth is not as relevant for many projects. 19 out of 26 bandwidth commits we identified via keyword search are only from two repositories (see Table 7), and thus may simply be overrepresented in the keyword dataset. The analysis implies that the classifier can produce datasets that reflect the real-world considerations of Android developers towards NFP.

Results
In this section we present the results of our mining for commits improving execution time, memory consumption, bandwidth usage, and frame rate. We report on these nonfunctional property improving (NFP) commits returned from our keyword search (KM) and those returned using our classifier (CM) separately. We investigate how developers improve Openfoodfacts-androidapp 6 (0.08%) 2 (0.02%)   the four NFPs, and categorise them to see whether source-code level changes could be implemented in automated software improvement tooling, targeting the mobile domain. Table 7 shows the number of commits which were found to improve each particular NFP in each of the repository mined, split between KM and CM commits. We also report on what percentages of total commits improve each of the four NFPs considered.

KM commits:
We found NFP-improving commits in 12 out of 20 most popular Android repositories. 229 out of a total of 28,028 were deemed to improve 1 of our four NFPs considered. Execution time is the most commonly improved non-functional property in our KM set of commits (with 125 commits identified), appearing to be the most important non-functional property to Android developers. The next most common improvement was memory usage, with 73 commits. In three repositories, the number of commits improving memory consumption was actually greater than those improving execution time, showing its importance varies across projects. Bandwidth is improved less often than the previous two properties. It is to be expected, as some applications use little to no network data. A lot of network traffic also consists of large files, such as videos, pictures, or application APKs. Decreasing the network data used by these files is often not possible with source code changes. Frame rate is not improved very often. This could be due to developers being willing to tolerate frame rate in their applications, or that changes which improve frame rate are less well-known amongst developers. Also we found that the changes are larger than those for memory and execution time, so may require more effort to implement (see Table 10). CM commits: We found NFPs in 40 out of 80 randomly selected repositories. We find that our automatic classifier mostly selected performance improving commits (211). The next most commonly captured improved property was memory consumption with 115 out of 331 CM commits. Bandwidth and frame rate improving commits were, as with KM, much less common. With 15 frame rate commits and only 5 bandwidth improving commits found.

RQ2: How Android Developers Improve NFPs
To answer RQ2 we first analyse general characteristics of the commits found, i.e., age of commits, functionality changes made at the same time as NFP improving changes, commit size, multiple NFPs improved at once, and optimisation trade-offs between NFPs. This should give us an overview of the current software development practice with respect to NFP improvement.

Age of Repository When Commits are Made
To determine when developers make commits that improve performance NFPs, we look at the age of the repository when a commit was made in days, i.e., how long after the first commit was it made (see Table 8). In this table we show the median number of days between the first commit and the commits of each category, we also show the upper and lower quartiles to show the spread of the ages. We find that the age of the repository when these commits are made varies greatly between repositories. Not only that, but the spread of this figure also varies. For example, in the F-Droid repository (highlighted in the table), NFP improving commits were spread out in age, even more so than other commits. Perhaps making NFP commits throughout the life cycle of the application rather than in small time windows is one of the reasons F-Droid was found to have the most NFP improving commits. Table 9 shows the number of commits of each type in which functional changes were also made. For the KM commits three of the types contained very similar numbers (14%), however bandwidth improving commits contained more (30%). This is mostly due to many of the bandwidth commits coming from repositories in which commits tended to be larger and contain many changes. The CM commits are similar, with the exception of frame rate, where 1/3 of commits also modified functionality.

Size of Commits
To determine the size of non-functional property improving commits we consider four measures: the number of files changed, the number of chunks changed, the number of classes modified, and the number of lines of code inserted/removed. Git splits the diff of each commit into chunks, where a chunk is the set of lines containing a change and the surrounding lines. If changes are close together they will be contained within the same chunks, so chunks can be used to measure the distribution of changes. These measures will provide a holistic picture of both the size of commits and the distribution of the changes that are made. Some of the commits contained multiple changes which will inflate their     Med. UQ.

UQ.
Trickytripper  size, however so do many standard commits. We also take median values to attempt to mitigate this distortion and allow for a valid comparison. We compare the size of our identified performance NFP improving commits to the size of every other commit found in the repositories. As commits from both categories will contain multiple changes, we do not attempt to distinguish between functional changes and non-functional changes in individual commits. As shown in Table 10 we can see that KM commits improving performance NFPs tend to be larger than a generic commit, in every measure. They often span multiple files, multiple chunks and change multiple class definitions. They also add in more lines than they remove. For the CM commits, execution time and memory improving commits are of similar size to those in the KM set, spanning multiple files and make larger changes than other commits. The CM bandwidth commits are smaller than those in KM set, however still make larger changes than non NFP improving commits. Finally, the CM frame rate improving commits are very large (median lines inserted of 47 for KM and 77 for CM). This could be due to the large rate of frame improving commits which also alter the functionality of their projects, as shown in Table 9. Table 11 shows how often commits improve multiple properties at the same time. The vast majority of NFP improving commits improve one property at a time. This is true across both CM and KM commits. However, in 6.5% of cases (5.7% of KM and 7.0% of CM) developers are able to improve multiple properties at once. It is possible that some commits do improve multiple properties in ways that developers are not aware of or do not report, as this is not the primary purpose of the commit.

Multiple Improvements
Tradeoffs Some commit messages report that changes which improve one non-functional property negatively affect others. Table 12 shows that memory and execution time improvements are often traded-off against each other. The most often impaired property is memory.

RQ3: Types of NFP Commits
Subsequently, we discuss the results of the manual categorisation of NFP improving commits, as explained in Section 2.5. This should help us establish whether changes made by developers are already automated by existing tooling, and if not, whether new refactorings could be suggested for future work.  In order to answer RQ3, our categorisation is presented in Table 13, we also show this data in Fig. 1 as a histogram for easy comparison.
Add Condition: Conditions were added in 8 KM and 19 CM commits, this allowed applications to avoid additional computation unless it was actually necessary. Execution Time: 2 KM and 13 CM commits improved execution time by using caching. This is the most obvious application of caching as one can use it to avoid repeatedly performing the same computation unnecessarily.
Memory: 4 KM commits and 2 CM commits optimised memory usage by adding new conditions.
Bandwidth: 1 KM and 2 CM commits utilised caching to avoid making unnecessary network requests.
Frame Rate: 1 KM and 2 CM commits utilised new conditions to avoid unnecessary work on the UI thread, thus improving frame rate.
Pattern: Wrap blocks of code in if statements, or add new conditions to existing if statements. Add Delay: Delays were introduced in 2 KM and 2 CM commits, these delays allowed background execution to finish before proceeding, thus improving performance NFPs. Execution Time: 2 CM commits improved execution time using increased delays. Frame Rate: 2 KM commits utilised delays to improve frame rate. Pattern: Insert calls to the Time.sleep() method into code. Animation Length Reduction. Visual changes, such as changes to animations or UI elements, are used in 2 KM commits to reduce the frame rate of an application. Frame Rate: 1 KM commit animation length reduction improved the application execution time.
Frame Rate: 1 KM commit used programmatic animation changes were used to improve the application frame rate.
Pattern: When animations are done programmatically we recommend using profilers to identify hotspots. Frames could also be removed from animations to speed them up and a trade off between speed and smoothness could be found. Caching. Caching data to avoid rerunning code was one of the largest categories of change, with 28 KM and 28 CM changes being made.
Execution Time: 20 KM and 25 CM commits improved execution time using caching. This is the most obvious application of caching as we one use it to avoid repeatedly performing the same computation unnecessarily.
Memory: Whilst caching can often increase memory usage, if used to avoid memory intensive computation it can actually save memory. However, this is uncommon, with only 1 CM commit showing this use of caching.
Bandwidth: 8 KM commits utilised caching to avoid making unnecessary network requests.
Frame Rate: 1 CM commit utilised caching to avoid making unnecessary work on the UI thread, improving frame rate.
Pattern: Caching is a common pattern for decreasing the execution time of an application. It can often be easily implemented by assigning the results of a method call to a variable and replacing future calls to the method with that variable. An example of a caching pattern being applied can be found in Fig. 2.

Change in Operation Order. 2 KM and 9 CM commits altered the order of operations.
This involved swapping the order in which lines of code are executed, such as conditions in if statements. Execution Time: 2 KM and 6 CM commits improved the execution time of code by changing the order of operations.
Memory: 2 CM commits were found to reduce the memory consumed. Frame Rate: only 1 CM commit changed the order of operations in order to improve the application frame rate.
Pattern: Change in operation order can be achieved by swapping lines or blocks of code, or nodes in the abstract syntax tree. The swap operator has already been utilised in program repair (Le Goues et al. 2012), but is yet to be widely adopted in automated search-based techniques for improvement of NFPs of software ). An example of this pattern can be seen in Fig. 3. Data Structure Replacement. Data structure replacements were used in 8 KM and 12 CM commits. A data structure replacement could consist of swapping an ArrayList for a LinkedList where the LinkedList is more efficient. Execution Time: 8 KM and 9 CM commits replaced data structure to improve execution time.  Bandwidth: 1 CM commit replaced a data structure transmitted over the network with a smaller equivalent to improve bandwidth usage.
Frame Rate: 1 CM commit used a more efficient data structure to improve frame rate. Pattern: Various NFPs can be improved by finding data structures and automatically replacing them with compatible ones. Similar approach has been implemented by Basios et al. (2018), although their tool is not publicly available. Data Structure Size Reduction. Reduction in the size of data structures, such as arrays, were used in 6 KM and 4 CM commits. Memory: All KM and CM commits improved the memory consumption of applications by reducing the sizes of data structures.
Pattern: Reduce the size of data structures. This can be achieved, for instance, by changing the size of declared arrays. Program analysis could be required to prevent overflows. An example of this pattern can be seen in Fig. 4.

Decrease Asset Size.
Changes to assets such as images and fonts, mostly to improve the efficiency of loading said assets, account for 3 KM changes and 5 CM changes. Execution Time: 1 KM and 1 CM commit improved the execution by reducing asset size and speeding up their loading.
Memory: 3 KM commits and 5 CM commits improved memory by reducing the amount of memory which large assets consume.
Pattern: Use compression algorithms, such as gzip, to reduce the size of assets. Different Algorithm. The implementation of more efficient algorithms constituted 7 KM commits and 24 CM commits. Memory: 1 CM commit replaced one algorithm with another more memory efficient one.
Frame Rate: 1 CM commit implemented a more efficient algorithm to improve frame rate.
Subcategories: Use String Builder: In 4 CM commits, String Builders were used in place of naive string construction to improve execution time.
Use char instead of String: In 3 CM commits, method calls with Strings for arguments, such as indexOf, were replaced with equivalent methods with char arguments to improve execution time.
Improve Regex Performance: In 2 KM and 4 CM commits, regular expressions were modified to execute more quickly.
Other: 5 KM and 12 CM commits consisted of changes to algorithms which could not be grouped with others. We detail each of these changes in Appendix B. Early Return. Earlier return statements were introduced in 2 KM and 2 CM commits, preventing whole methods from executing when unnecessary. Execution Time: 2 KM and 2 CM commits used earlier returns to speed up application execution time.
Pattern: Insertion of a return statement. An example of this pattern can be found in Fig. 5.

Freeing Up Memory.
Adjustments to the amount of memory in use when devices were low on memory was another common change made to 10 KM commits. Memory: All commits which used this strategy were used to reduce the memory consumption of applications.
Pattern: Program analysis would be required to identify which resources could be freed from memory. Memory: 1 CM commit used more concurrency to become more memory efficient. Frame Rate: 5 KM commits improved frame rate by using multi-threading to reduce the load on the UI thread.
Subcategories: Move code to background: 6 KM and 4 CM improved execution time, 1 CM commit improved memory consumption, and 3 KM commits improved frame rate by executing code on separate threads.
Alter timing: 3 KM commits improved execution time by altering the timing of threads.
Use a thread pool: 2 CM commits found improvements to execution time by introducing thread pools to manage thread execution more efficiently. Layout Optimisation. 6 CM commits modified the layouts of applications. This consists of flattening layouts to reduce nesting. Execution Time: All 6 CM commits were used to speed up applications. Pattern: Automatic layout flattening obtained by removing layout components and replacing them with their child components. This can be achieved with the lint tool. 10 Leak Fix Fixing memory leaks was the most common approach to decrease memory consumption, with 35 KM and 74 CM commits. Memory: Leak fixes were made in 35 KM commits, which is almost half of KM memory improving changes. 74 CM commits fixed leaks.
Pattern: Automatically detecting objects that are being unnecessarily instantiated or not properly disposed of, using a tool like infer, 11 and removing them with program repair techniques like GenProg (Le Goues et al. 2012). Make Final: 2 KM commits introduced the final keyword to local variables, allowing more efficient code to be compiled. Execution Time: both commits improved execution time.
Pattern: Add the static keyword to local variables. Make Static: 1 CM commit added the static keyword to a method in order to reduce memory usage. Memory: The only commit in this category improved memory usage. Pattern: Add the static keyword to methods. Network Throttling. Network throttling (which is an intentional slowing down of internet speed) was used in 3 KM commits to improve the bandwidth usage. In fact, while this may not reduce the total amount of bandwidth used, it can still be useful to reduce the load on the network by speeding it up for other users. Bandwidth: All 3 KM commits used Network throttling to improve bandwidth usage. Pattern: Create a network monitor which reduces the networking of the application when traffic is high, a tool like android-varanus 12 could be used to this end. Parameter Change. 13 CM commits simply changed parameters in various function calls, speeding up the application.
Execution Time: 12 CM commits were used to decrease execution time.
Bandwidth: 1 CM commit was used to improve frame rate.
Pattern: Techniques such as deep parameter optimisation (Wu et al. 2015), have proven useful for finding optimal parameters in source code. Remove Caching: 2 KM commit removed a cache to improve NFPs. Bandwidth: 10 KM and 3 CM changes removed unnecessary network access, reducing bandwidth usage.
Frame Rate: 2 KM and 2 CM commits removed redundant code that was causing frame rate to be low.
Pattern: Remove lines or blocks of code, or nodes in the abstract syntax tree. This operation is standard in Genetic Improvement  tooling used for automated improvement of non-functional software's properties, such as execution time, energy consumption, but also for automated program repair. SQL Query. A large number of changes (11 KM and 11 CM) to SQL requests appear for improving performance NFPs. These changes were only present in 7 projects. Some of the changes removed unnecessary JOIN statements, others changed the order of JOIN statements. Execution Time: All SQL query commits improved execution time. Subcategories Change Primary Key: 2 CM commits improved execution times by changing the primary keys used in SQL tables.
Specify Column: 2 CM commits moved from selecting all columns to selecting individual columns, speeding up query execution.
Combine Queries: 3 CM commits combined multiple queries together to save executing them each individually.
Move File to DB: 1 KM commit and 2 CM replaced file I/O with an in memory database to improve the execution time.
Remove Unneeded JOIN: 1 KM commit improved execution time by removing an unnecessary JOIN statement.
Flatten Queries: 3 KM commits flattened queries containing sub-queries into a single select query, improving execution time.
Add table indices: 3 KM commits added indices to tables to speed up SQL queries. Use transactions: 1 KM commit and 1 CM commit wrapped a series of queries up into a single transaction, allowing them to be executed more quickly.
Parameter Binding: 1 CM commit introduced parameter binding, which allows similar queries to be made repeatedly with only their parameters changed.

Time Out Reduction.
In total, 1 KM and 7 CM commits made changes to the length of timeouts, waiting for other computation to complete. Execution Time: 1 KM commit and 7 CM commits reduced unnecessarily long timeouts to improve the execution time.
Pattern: Change timeout values in source code. Use Different Library. 3 KM 11 CM commits replaced the libraries they used with more efficient alternatives. Execution Time: 1 KM and 8 CM commits used different faster libraries.
Memory: 2 KM and 3 CM commit used a more memory efficient library. Frame Rate: 1 CM commit used a more efficient library to improve frame rate. Pattern: Use a set of similar libraries and automatically replace their existing usages in code. Unknown. Some of the commits (41 KM and 46 CM) were not classifiable as the type of optimising change was not obvious from the commit message or the diff. In fact, some changes were bundled within large commits making optimisations hard to pinpoint, yet from the developer message it was clear that execution time, memory consumption, bandwidth, or frame rate had been improved.

Discussion
In this section we discuss potential avenues for future research that stem from our study.

Recommendations for NFP Mining
Our work shows that a more fine-grained mining of NFP commits is feasible. The classifier that we presented has a good chance of finding commits with a recall of 0.8 (4 in 5 commits are found in our test set). The precision of 0.72 suggests that around 3 in 4 commits will be relevant. This means that the manual effort to identify relevant commits is much less than searching via keywords. This is confirmed by our mining via the classifier where we find 331 relevant commits by analyzing 495 commits. In comparison via keyword search we mined 229 relevant commits by manually checking 3,132 commits. I.e. analyzing 1.50 commits via classifier returns one relevant commit, vs. 13.68 commits for one relevant commit via keyword search. This enables the mining of repositories for NFP commits on a larger scale than previously possible as the manual effort is reduced significantly. With larger datasets in the future it may even be possible to expand the classification to identify categories and subcategories of NFP that a commit falls under.

Characteristics of Repositories Containing NFP Improving Commits
As we observed that the number of performance NFP improving commits greatly varies among the mined repositories, we further analyse these repositories in an attempt to find the characteristics of those repositories that contain many commits. This can allow future NFP mining studies to target the most commit rich repositories.
Tables 1 and 5 show the properties (i.e., the total number of commits made in the repository, the number of stars a repository has, the number of developers who have contributed, the age of the repository, the number of times the repository has been forked, and the number of lines of code in the repository) of the repositories together with the number of performance NFP improving commits found in each of them.
In order to quantify the relationship between each property and the number of NFP improving commits found, we calculate the Pearson Correlation Coefficient (Pearson 1895). Pearson's correlation (ρ) measures the linear relationship between two pairs of observations. It ranges from +1 (indicating perfect correlation) to −1 (indicating perfect inverse correlation), no correlation is indicated by 0. The results of this analysis are shown in Table 14.
We find that the total number of commits in a repository shows the strongest, statistically significant, positive correlation with the number of performance NFP improving commits (ρ = 0.754, p-value < 2.20e − 16). We also find strong, statistically significant, positive correlations between the number of NFP improving commits and both the number of stars (ρ = 0.678, p-value = 1.10e − 13) and the number of contributors (ρ = 0.601, p-value = 234e − 10). We find no statistically significant correlation with the age of a repository (ρ = 0.169, p-value = 0.107), the number of lines of code in a repository (ρ = 0.043, p-value = 0.756), or the number of forks of the repository (ρ = 0.033, p-value = 0.697).
We also attempt to see if there is a relationship between the categories of applications and the number of performance NFP improving commits we found. We present in Fig. 6 a box plot of the categories reported on F-Droid for each application against the number of performance NFP improving commits we found in it. We can observe that there are no categories that tend to have significantly more commits than others, however the applications in the connectivity category have the highest median number of NFP improving commits.
The above results suggest that it would be preferable to target large repositories that receive many stars by GitHub users and have a large number of contributors, when mining for performance NFPs commits.

Recommendations for Performance NFP-Improving Tooling
Our analysis of commits improving execution time, memory consumption, bandwidth usage, and frame rate of mobile applications provides several findings. In particular, our main goal has been to establish how Android developers improve these four NFPs, and whether we could utilise this knowledge to enhance automated software improvement tooling. After careful analysis of the literature, we found that 5 out of 23 improvement strategies (see Section 3) are already mimicked by software mutation operators in automated software improvement tooling, as described below.
Deletion of redundancy has been used in many fields, ranging from slicing to newer strategies such as genetic improvement . The delete operator, which can be applied at various granularity levels, e.g., line or statement level has proved effective in automated program repair . Similarly, 75% of the non-functional properties studied were improved through the removal of a redundancy (used to improve execution time, memory consumption, and frame rate). Android applications contain many lines of redundant code and its removal is one of the main techniques developers use to improve the four NFPs considered in this work. Burles et al. (2015) and Basios et al. (2018) swap the data structures to improve nonfunctional properties. This is also reflected in our mined data. Both the execution time and memory consumption commit sets contained changes which consisted of simply swapping one data structure for a more efficient one. This suggests that patches produced with their techniques will be similar to those made by humans.
The early return category is small but offers a simple software transformation. Rather than assigning a value to a variable, just return the value. This will only be valid if the return value will not change in the rest of the method. This could be used as an opportunity to insert early returns. Recently, Brownlee et al. (2020) incorporated this type of operator in their automated software improvement tooling for Java software.
Switching the order of operations in which these are executed is also mimicked by the swap operator in genetic improvement research. Such swaps have been done at the statement (Brownlee et al. 2019) and expression level (Schulte et al. 2014;Haraldsson et al. 2017).
The parameter change category is reminiscent of deep parameter optimisation (Wu et al. 2015), where parameters are exposed and automatically tuned to improve some property. Bokhari et al. (2017) successfully applied deep parameter optimisation to energy consumption in Android. Our results suggest that deep parameter optimisation could be effective for improving other properties in Android.
Although the changes mentioned above have been incorporated into automated software improvement tooling, these have generally not been applied in the Android ecosystem, but just to traditional software.
Our study has also revealed several improvement strategies that are not yet employed in current search-based software improvement tools.
One such strategy is caching. Caching has been repeatedly used to improve execution time, memory consumption, and bandwidth usage. Caching is useful as it reduces calls to areas of code by storing their output. The stored value is then reused instead of calling the code. However, it does increase memory usage. The trade-off between the NFPs that caching improves must be balanced with the cost to memory, pointing to multi-objective optimisation. A caching pattern could be as simple as replacing multiple calls to a function, with the same arguments, with a single call to the function. The result of this call should be stored in a variable, and the original calls to the function replaced by the variable. Another option would be to cache a function called in a loop, replacing the function call in the loop with a variable storing the result of the function which is called before the loop, as in Fig. 2.
Changes to assets, such as pictures, are found among the commits that improve execution time and memory consumption. Assets can be resized to improve memory consumption (and potentially bandwidth usage) or handled differently to improve execution time. When modifying assets, the changes made must result in the asset still being acceptable to the users. Measuring this acceptability and finding reasonable modifications poses an interesting challenge. However, a few genetic improvement approaches have tried to allow for decrease of output quality, when improving for other properties, such as energy consumption ), or shader simplification (Sitthi-amorn et al. 2011). This, however, has not been regarded as a code changing operator itself, but as a side-effect of the other software transformations.
SQL queries were modified a substantial number of times in order to improve execution time. Using search-based techniques to transform SQL queries could allow developers to automatically achieve large improvement to execution time in applications that have many database interactions. Das et al. (2016) found also that not only databases but also file systems led to speed bugs, suggesting that I/O could be a good target for improvement.
Many search-based techniques only focus on making changes within a single file. Table 10 suggest that a multi-file approach may be more useful for generating patches to improve non-functional properties. The number of chunks also suggests that patches are distributed widely across source code and changes are found in many locations. Thus, for Android applications, changes could be deliberately spread out across multiple files.

Related Work
There have been a few studies that mined for non-functional property improving commits in Android applications. Moura et al. (2015) performed a study on software repositories mining "energy aware commits". They mined commits which attempted to improve the energy efficiency of an application and categorised the commits that did. They used a two-wordkey-phrases approach, to narrow down the returned results to 2,189 commits, rather than 112,900 commits that were initially returned from single keyword search. The majority of the commits found in this study (∼70%), belong to categories which concern device configuration, e.g. Voltage Scaling. Das et al. (2016) mined Android repositories for "performance related commits". They found such messages in only 7% Android application repositories considered. Moreover, their categorisation was more top-level than what we propose, classifying commits into, e.g., "Memory", "File system", and other. They have also used one set of keywords for all performance-related issues. Their study restricted their searches to the main application modules. The patterns that they extracted mostly concerned what was being fixed, rather than how it was fixed, making direct comparison difficult. Although there are a few similarities, we both find changes to regular expressions to be useful for improving performance, we find that developers move computation to the background, and we find that caching is used to improve performance. In their work, Das et al. (2016) find that developers improve multiple properties 10.4% of the time, whereas we only found that in 6.5% of cases We have also attempted to use as many keywords as possible, this includes all of the relevant keywords from the related work, and keywords which we determined from analysing commit messages, this is detailed in Section 2.
Mazuera-Rozo et al. (2020) performed a similar study, producing a topology of Android performance bugs, and pointed out several other studies that focused on non-functional bugs. We are concerned with non-functional property optimisation rather than repair, and thus we describe the types of changes which can improve NFPs, whilst they describe the bugs which can be detrimental to them. In their study they uncovered a number of bug types, for which we have found related fixes. These include the leak fix category, the data structure replacement category, the layout optimisation category. Interestingly, we find around 3-4x the amount of commits dealing with caching for execution time improvement than they do. They also do not capture many of the categories for improving memory usage and bandwidth usage that we do, in particular the Remove Redundancy category. This may be due to our usage of a larger keyword set. As we are concerned with a detailed analysis of the improvement strategies found with respect to four specific performance non-functional property improvement criteria, we use a more comprehensive keyword search than the three aforementioned studies. Consequently, we manually analyse more commits. Moreover, we train a classifier to gather additional data, which we make open source. We also categorise and analyse commits in ways which allow the comparison to existing, and derivation of new, mutation operators for automated software engineering approaches for non-functional property improvement, such as genetic improvement (Hort et al. 2021;Petke et al. 2017). Linares-Vásquez et al. (2015) attempted to uncover ways in which developers address performance bottlenecks in Android, they did this by surveying developers. They found that developers use similar techniques to the caching and SQL query categories uncovered in this study. Our approach aims to find a more complete picture of the changes developers make by directly exploring the content of the changes they made, as developers may exclude techniques used less frequently when answering a survey. Chen et al. (2019b) perform a similar study to Mazuera-Rozo et al. (2020), but on 100 projects with C/C++ code. Changes to C/C++ are likely to be different to Java as there is more direct access to lower level operations like memory allocation. However, their arguments pattern is similar to the change parameter pattern that we found and the memorization is similar to our caching pattern. Jin et al. (2012b) conducted a study across a number of different programs in different languages, but did not include Android applications. However they did find fix patterns similar to change parameter, change order of operations, and add condition, showing that some changes made in Android may applicable in other type of software. We also find that changes in Android are larger than those in C/C++. Only a median of lines of code are changed in C/C++ commits, but for all categories we find median changes of 10+ lines and in some cases up to 70+ lines.
Our work uncovers a number of new patterns with respect to those discussed in related work. We find changes to animations, layouts and assets which have not been discussed in previous work, these categories all seem to be particularly relevant UI focused Android applications. We also find that searching for source code changes results in a number of patterns that have not been discussed in previous work, like remove redundancy, early return, different library, data structure changes and time out reduction.
The idea of mining repositories for software improving commits has shown promise in the automated program repair field. Long et al. (2017) mined Java repositories to automatically find fix patterns to apply to buggy software. Kim et al. (2013) manually evaluated human written patches from GitHub to generate patches which could then be automatically applied to buggy software. Bader et al. (2019) leverage the version history of a piece of software to extract fix patterns to suggest to developers. Martinez and Monperrus (2018) showed that repair templates generated by mining existing code could be used to generate large number of bug fixes. Several of the automated program repair techniques mentioned use of search-based approaches, which have been utilised for improvement of non-functional properties, in the field of genetic improvement , in particular. Therefore, we believe that the results of our study could be similarly used to provide recommendations for future tooling (Petke 2017).

Threats to Validity
Although we cannot claim that our results can apply to any type of mobile application, we mitigate the conclusion validity threat by mining commits from applications diverse in many ways, including their type, size, number of commits made and number of contributors.
Moreover, developers of open-source applications may view non-functional properties at a different priority than their closed source counterparts. However, the user base is the same (Android mobile users), and open-source repositories provide a rich amount of data that can be analysed to improve current software engineering practice. The number of NFP improving commits we analysed may give a partial picture of the commits which are actually made by developers, and it was limited by the manual effort needed to analyse the 3,132 commits. However, this is in par with the number of commits examined in previous studies (e.g., Moura et al. 2015;Das et al. 2016). Some commits which improve NFPs may have been missed by the keyword search. This may have been due the sets of keywords used being incomplete or the commits not having messages which reflect the non-functional properties which they improve. To mitigate the former, the keywords were repeatedly expanded to try to catch as many commits as possible and had a very large false positive rate (73%), so it is unlikely that many were missed. It is also likely that developers did not report tradeoffs that were made. The classifier may also be influenced by the keywords used, as it was trained on the gold set based on keywords search and manual analysis. We do provide the source code of both the keyword search and classifier (github.com/SOLAR-group/ NonFunctionalAndroidCommits), so that they can be expanded in the future.
Finally, the NFP improving commits could be categorised differently. The aim of the current categorisation was to provide meaningful classes from which insights could be drawn towards possible new mutation operators for non-functional property improvement using search-based approaches, such as genetic improvement . To alleviate such a threat and facilitate this aim, we provide our detailed commit categorisation on-line (github.com/SOLAR-group/NonFunctionalAndroidCommits).

Conclusions and Future Work
In this paper we have explored the ways in which Android application developers improve non-functional properties, in order to guide the development of new automatic transformations to improve the NFPs of Android applications. To this end we analysed 74,408 commits from 100 repositories, finding 560 commits that improve four performance-related non-functional properties (NFPs): execution time, memory consumption, bandwidth usage, and frame rate. We employed a combination of keyword search, and a classifier to filter irrelevant commits, and manually analysed over 3,000 commits to obtain our corpus. We have found that developers more commonly improve the execution time of their applications than other NFPs. However, manual changes to improve non-functional properties are uncommon, suggesting that automated tools could aid developers in this challenging task. Moreover, we find that developers occasionally improve multiple non-functional properties at once (5.2% of cases), or sacrifice one property for another (10.7% of cases). This suggests developers are willing to take multi-objective approaches. However, the rarity of these changes suggests a need for automatic tools that could propose a Pareto of solutions for the developer to choose from (Harman et al. 2010). We have also found similarity between 5 current mutation operators in automated software improvement and the changes that app developers make. Code removal was a very common technique used by app developers for improving multiple non-functional properties, as was data structure replacement. It would be interesting to explore whether the corresponding mutation operators, which have already been successfully used to optimise NFPs of traditional software, are effective in automatically improving NFPs of mobile applications too. We have also found novel ways in which real commits could be more closely mimicked by tools for automated NFP improvement, for example, by making changes across multiple files simultaneously.
Our results highlight a need for automated tools which improve the non-functional properties of Android applications, and provide initial guidance on what types of changes such tools could implement. These changes include automatic caching to improve speed and bandwidth, SQL query transformation to improve speed, and image modification to reduce memory consumption and execution time.
Our study could be further extended. First of all, our corpus could be extended in terms of commit types and size. Commits improving other non-functional properties such as code quality could be collected and analysed in future studies. Future work could use automatic pattern recognition from a large set of NFP improving commits, to generate generic patterns improving NFPs of Android applications.

Appendix A: Classifier Training
In order to build an accurate classifier for NFP improving commits, we investigated several different options as detailed in this Appendix. The source code and the results obtained are provided in our on-line repository (github.com/SOLAR-group/ NonFunctionalAndroidCommits). For clarity we compare between three different classes. unknown are commits that have not been analysed, and were excluded by the keyword search. irrelevant commits were identified via keyword search, but deemed not relevant towards non-functional properties by one or more of the examiners. Finally, relevant commits encompass all commits identified to deal with a non-functional property after keyword search.
We initially attempted to generate the classifier for all four subclasses of relevant commits, execution time, memory, bandwidth and frame rate. This yielded no satisfactory results as the classes bandwidth and framerate had no recall (not a single commit) and the other two groups yielded less than 0.1 recall (i.e., less than 1 in 10 commits found).
We also attempted balancing the data. This was done in two ways. The first was to balance the irrelevant commits (3,132) with the relevant class (229) as the drastic difference lets classifiers overfit towards the irrelevant class. The balancing yields large margins in the test and training sets (accuracy 0.8, with the same recall in both classes). However, when attempting to reproduce this on the entire dataset the precision in the relevant classes dropped to 0.07. In this case the classifier misses 20% of commits, but yields no advantage over keyword search as many commits that are not relevant are identified as such. The second attempt was balancing the classes of commits to reduce the under-representation of bandwidth and frame rate (see Table 6). This reduced the recall to 0.02 meaning that the classifier finds only 2 in 100 commits. We also investigated algorithms specifically targeted towards unbalanced data, also to no avail. All of the balancing was done by keeping all commits of the respective smallest class, and randomly removing commits from other classes until they were the same size.
Since balancing the data worsens the precision drastically, we continued training classifiers with the datasets as they were recorded (i.e., without balancing). This investigation was conducted with all combinations of text preprocessing and featurisation. In all cases we did a preliminary tokenisation and stop word removal of all commit messages. For string processing we used: -Lemmatisation-reducing word inflections to a word root -Stemming-removing word endings to approximate word roots For the featurisation of the remaining word roots we used: -TF/IDF-term frequency / inverse document frequency essentially ranking words in the source -Bag of words-counting words in the commit messages -Improved bag of words-the adaption was that we reduce to words that we identified as discriminative between the relevant and irrelevant sets. E.g. when a word occurs more often in one or the other it is included, otherwise it is not used in bag of words. Table 15 shows the best classifier found with this. Quadratic Discriminant Analysis yields a recall of 0.68 meaning two in three commits are identified, and a precision of 0.36 meaning that about three commits have to be consulted manually to yield one relevant commit. This is much better than keyword search (13,67 commits on average) but still represents a loss of information. The best results were yielded via using Lemmatisation as preprocessing step and improved bag of words to create the feature vectors. Quadratic Discriminant Analysis was the best algorithm found out of the following twenty algorithms: -Multi-layer Perceptron Classifier -C-Support Vector Classification -Linear C-Support Vector Classification -Gaussian Process Classifier -Decision Tree classifier -Extra Tree Classifier -Random Forest Classifier -Ada Boost Classifier -Bagging Classifier -Gaussian Native Bayes -Multinomial Native Bayes -Bernoulli Native Bayes -Complement Native Bayes -Quadratic Discriminant Analysis -Linear Discriminant Analysis -Stochastic Gradient Descent Classifier -Ridge Classifier -Passive Aggressive Classifier We chose Quadratic Discriminant Analysis as it supports the highest recall in the relevant category while still being a reasonable filter. Multinomial Native Bayes has the highest accuracy overall (0.95) but only a recall of 0.32, meaning two out of three relevant commits are lost. Quadratic Discriminiant Analysis is highly dependent on the feature selection, as selecting regular bag of words actually produces a recall of 1.0 for relevant commits (no loss), but with a precision of 0.09, meaning that it requires an average of 11 commits to find one relevant one. This is similar to the results that the keyword search itself produces. Further information is captured in the GitHub repository.
In an attempt to improve the quality of our classifier we then added the dataset from Mazuera- Rozo et al. (2020), which are the results discussed in the main body of this work. The results of this analysis are available in our online repository consisting of csv tables with 10 runs of a random training / test split and a final line with the averages over all runs. All of the Classifiers using TF/IDF are outperformed by classifiers using other featurisation methods. In a similar vein lemmatisation outperformed stemming. In general our improved bag of words slightly outperforms bag of words. The exception is that best classifier Decision Tree which performs best when using TF/IDF with stemming. It achieves a recall of 0.80 (i.e. 4 out of 5 commits) with a precision of 0.72 meaning that about 3 in 4 commits will be found.
In this Appendix we describe the commits in the "other" subcategory in the "Different algorithm" category, found in Section 3.2, in more detail. We First describe those from the KM set, then those from the CM set. For each commit, we report its SHA and a description of the changes made in it.  Funding This research is funded by the ERC advanced fellowship grant no. 741278 (EPIC: Evolutionary Program Improvement Collaborators) and by the EPSRC Fellowship grant no. EP/P023991/1 (Automated Software Specialisation Using Genetic Improvement).

Availability of Data and Material
All data and material from this study is available at github.com/ SOLAR-group/NonFunctionalAndroidCommits Code Availability All code, including the mining scripts and the classifier, is also available at github.com/ SOLAR-group/NonFunctionalAndroidCommits.

Conflict of Interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. James Callan is a Computer Science PhD student at University College London, under the supervision of Justyna Petke, Federica Sarro, and Mark Harman. His primary research interest are the Genetic Improvement of Software (GI) and the improvement of non-functional properties of Android applications.

Oliver Krauss a research project manager in the Research Group
Advanced Information Systems and Technology at the University of Applied Sciences of Upper Austria. He primarily works on research projects in the domains of machine learning and data science, and teaches project management. He is currently a PhD student at the Johannes Kepler University Linz, with the topic of Pattern Mining and Genetic Improvement in Compilers and Interpreters. Justyna Petke is a Principal Research Fellow and a Proleptic Associate Professor at the Centre for Research on Evolution, Search and Testing (CREST) and a member of the Software Optimisation, Learning and Analytics Research (SOLAR) group at University College London. Her main expertise lies in Genetic Improvement (GI), which uses automated search to find improved versions of existing software. She currently holds an EPSRC Early Career Fellowship on the topic of Automated Software Specialisation Using Genetic Improvement. She also has expertise in Combinatorial Interaction Testing, Search-Based Software Engineering, and Constraint Satisfiability.