Default Privacy Setting Prediction by Grouping User’s Attributes and Settings Preferences
While user-centric privacy settings are important for protecting users' privacy, users often have difficulty changing the defaults. This is partly due to a lack of awareness and partly due to the tediousness and complexity involved in understanding and changing privacy settings. In previous work, we proposed a mechanism for helping users set their default privacy settings at the time of registration to Internet services, by providing personalised privacy-by-default settings. This paper evolves and evaluates our privacy setting prediction engine by taking into consideration users' settings preferences and personal attributes (e.g. gender, age, and type of mobile phone). Results show that while models built on users' privacy preferences improved the accuracy of our scheme, grouping users by attributes had no impact on accuracy. As a result, services using our prediction engine could minimize the collection of user attributes and base the prediction only on users' privacy preferences.
Keywords: Privacy preference · Privacy setting · Machine learning
When providing a privacy function, the default settings are very important because many users do not spend the time and effort to set their privacy preferences adequately. Manually configuring appropriate privacy settings is especially difficult because the combinations of service providers, types of personal data, and applications for personal data have become so vast. Hence, it is important to simplify the task of setting privacy-preserving defaults by providing tailoring mechanisms that address individual privacy concerns and translate them into personalized privacy settings for users.
In our initial efforts to overcome this, we proposed a conceptual design and a mechanism based on a Support Vector Machine (SVM) for the automatic generation of personalized privacy settings. In our basic approach, we designed a questionnaire of 80 questions covering the combination of 16 different data types shared for 5 different utilization purposes and services. The basic approach delivered a minimal set of five questions to each user at registration time and, from the user's answers, predicted the default privacy settings for that user.
In this paper, we present a more advanced scheme and a prototype that improve the accuracy of the privacy setting prediction, based on grouping users' attributes and setting preferences. Thus, the contribution of this paper is twofold. First, we present an extension and improvement of previous work, which focused on selecting an optimal and minimal set of questions to predict the privacy settings. Here, we further elaborate and give an in-depth analysis of the improvement mechanisms by considering user attributes and privacy preferences. Second, to showcase the applicability of the proposed models, we implemented a prototype of the prediction engine in R, using SVM-based models to predict user privacy settings.
The rest of the paper is organized as follows. Section 2 provides an overview of related work in the area of privacy preferences. Section 3 describes the main methodology and approach of the SVM-based prediction scheme proposed in our previous work, and the questionnaires designed and used to derive the initial settings database. Section 4 describes the experimental evaluation for both user attributes and privacy preferences. Section 5 discusses the results of the evaluation. Section 6 draws the main conclusions and points out future directions for research.
2 Related Work
Solove suggested that the privacy self-management model cannot achieve its objectives and has been pushed beyond its limits, while privacy law has relied too heavily upon it. Moreover, other studies, such as the experimental study conducted by Acquisti and Grossklags, demonstrated users' lack of knowledge about technological and legal forms of privacy protection when accepting privacy policies. Their observations suggest that several difficulties obstruct individuals in their attempts to protect their own private information, even those concerned about and motivated to protect their privacy. This was reinforced by subsequent work, which also supported the presumption that users are not familiar with technical and legal terms related to privacy. Moreover, it has been suggested that users' knowledge about privacy threats and about technologies that help protect their privacy is inadequate. In this regard, Guo and Chen proposed an algorithm to optimise privacy configurations based on users' desired privacy level and utility preference.
Fang et al. [9, 10] proposed a privacy wizard for social networking sites. The purpose of the wizard is to automatically configure a user's privacy settings with minimal effort required from the user. The wizard is based on the observation that real users conceive their privacy preferences according to an implicit structure; thus, after asking the user a limited number of carefully chosen questions, it is usually possible to build a machine learning model that accurately predicts the user's preferences. This approach is very similar to ours. The difference is the target dataset: Fang et al. treated real data from Facebook, so the variety of items was limited and the number of participants was small. We treat more general data items, and our number of participants is larger, because our approach does not focus on a specific service such as Facebook.
Table 1. Types of personal data (excerpt): addresses and telephone numbers; device information (e.g., IP addresses, OS); logs on a search engine; personal info (age, gender, income); contents of email, blog, Twitter, etc.; session information (e.g., cookies); social info (e.g., religion, volunteer records); official ID (national IDs or license numbers).
Table 2. Utilization purposes (excerpt): providing the service.
There is some existing research on learning privacy preferences. Berendt et al. emphasised the importance of privacy preference generation, and Sadeh et al. suggested that machine learning techniques can generate more accurate preferences than users themselves in a mobile social networking application. Tondel et al. proposed a conceptual architecture for learning privacy preferences based on the decisions a user makes in their normal interactions on the web. They suggested that learning privacy preferences has the potential to increase the accuracy of preferences without requiring users to have a high level of knowledge or a willingness to invest time and effort in their privacy. Kelley et al. showed preferences for a mobile social network application. Preference modeling for eliciting preferences was studied by Buffett and Fleming. Mugan et al. proposed a method for generating personas and suggestions intended to help users incrementally refine their privacy preferences over time.
3 SVM Based Privacy Setting Prediction Scheme
This section introduces the SVM-scheme used as the basis of our approach, as well as the questionnaires designed in order to get the initial privacy settings database.
3.1 Design of Questionnaires
We designed a questionnaire survey focused on users' willingness to provide personal data, considering the combination of 16 data types (cf. Table 1) for 5 utilization purposes (cf. Table 2). The data types and usage purposes were selected from the items defined in P3P, and we prioritized keeping them close to the P3P categories. We recognize that some items are misleading or hard to understand, and we will revise them for the next evaluation. Additionally, other attributes related to demographics and the type of mobile device used were collected, because they might reveal distinctive features in the groups they define.
(Tables omitted: distribution of participants; distribution of types of mobile phone, including non-smartphone users; distribution of results.)
3.2 Comparison Based on Attributes
3.3 SVM-based Prediction Scheme
1. An existing user settings database is the input to a prediction model generator, which generates an optimal question set and the prediction model.
2. A user is provided with the question set (5 questions).
3. The user's answers to the selected questions are then input to the prediction model, so that the privacy setting prediction engine generates the corresponding (personalized) prediction values.
4. The prediction values are then recommended to the user.
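The pipeline above can be sketched in code. The following is an illustrative Python sketch only: the paper's prototype is in R with e1071, so the question indices, toy answer data, and helper names here are hypothetical, and `SVC(gamma=0.2, C=1.0)` mirrors the default SVM parameters reported in Sect. 4.

```python
# Illustrative sketch of the prediction pipeline (hypothetical data; the
# actual prototype is in R with e1071).
import random
from sklearn.svm import SVC

random.seed(0)
N_QUESTIONS = 80                  # 16 data types x 5 utilization purposes
SELECTED = [3, 17, 42, 55, 71]    # hypothetical indices of the 5 asked questions
REST = [q for q in range(N_QUESTIONS) if q not in SELECTED]

# Toy settings database: each row is one user's 80 answers in {0, 1, 2}
# (negative / neutral / positive); real rows come from the questionnaire.
db = [[random.choice([0, 1, 2]) for _ in range(N_QUESTIONS)] for _ in range(100)]

# Feature vectors: each user's answers to the 5 selected questions.
X = [[row[q] for q in SELECTED] for row in db]

# One SVM classifier per remaining question, with gamma=0.2 and cost=1
# as reported in the paper.
models = {}
for q in REST:
    y = [row[q] for row in db]
    if len(set(y)) < 2:                       # SVC needs at least 2 classes
        models[q] = ("const", y[0])
    else:
        models[q] = ("svm", SVC(gamma=0.2, C=1.0).fit(X, y))

def predict_settings(answers):
    """Predict the remaining 75 settings from the 5 registration answers."""
    out = {}
    for q, (kind, m) in models.items():
        out[q] = m if kind == "const" else int(m.predict([answers])[0])
    return out

settings = predict_settings([0, 1, 2, 1, 0])  # one hypothetical user's answers
```

The predicted values would then be shown to the user as recommended defaults (step 4).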
An overview of the prediction-model-generating algorithm is shown in Fig. 4; it proceeds as follows.
1. The existing user settings database is split into learning data and test data.
2. Questions are randomly selected for prediction.
3. SVM models are generated for the remaining 75 questions, using the answers to the selected questions in the learning data as feature vectors.
4. The SVM models created in the previous step are evaluated using the test data.
5. The process is repeated for an adequate number of question combinations, and the combination achieving the highest accuracy is adopted as the selected questions.
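The search over question combinations can be sketched as follows, again as a hedged Python stand-in for the R prototype. All data are synthetic, and the sizes are reduced (12 questions, 20 sampled combinations, 60 users) so the toy run stays fast; the paper samples up to 10,000 combinations over 80 questions.

```python
# Sketch of the question-selection search (steps 1-5 above), with toy sizes.
import random
from sklearn.svm import SVC

random.seed(1)
N_Q, N_USERS, N_ASK = 12, 60, 5   # reduced from 80 questions for brevity
db = [[random.choice([0, 1, 2]) for _ in range(N_Q)] for _ in range(N_USERS)]
learn, test = db[:40], db[40:]    # step 1: split into learning and test data

def accuracy(selected):
    """Steps 3-4: train one SVM per remaining question on the selected
    questions' answers, then score the predictions on the test data."""
    rest = [q for q in range(N_Q) if q not in selected]
    X_learn = [[r[q] for q in selected] for r in learn]
    X_test = [[r[q] for q in selected] for r in test]
    correct = total = 0
    for q in rest:
        y = [r[q] for r in learn]
        if len(set(y)) < 2:                    # constant column: trivial model
            preds = [y[0]] * len(test)
        else:
            preds = SVC(gamma=0.2, C=1.0).fit(X_learn, y).predict(X_test)
        for p, row in zip(preds, test):
            correct += int(p == row[q])
            total += 1
    return correct / total

# Steps 2 and 5: sample random question combinations and keep the best.
candidates = [random.sample(range(N_Q), N_ASK) for _ in range(20)]
best = max(candidates, key=accuracy)
```

In the full scheme, `best` becomes the question set presented to new users at registration.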
4 Experimental Evaluation
Appropriate parameters, such as the size of the learning and test data, the number of items for prediction of answers, and the number of combinations of items for evaluation, need to be chosen in order to run experiments efficiently under various conditions. Generally, using more learning data and more combinations of items for prediction yields higher accuracy, but the processing time (especially critical for generating the SVM models) also increases.
The experimental environment was an Intel Core i7-4770 @ 3.40 GHz, running R with the e1071 (SVM) and doSNOW (multi-core processing) packages.
In order to discover an adequate number of sampled combinations of items for finding the most suitable combination for prediction, the accuracy is evaluated by varying the number of sampled combinations from 1,000 to 10,000 in increments of 1,000, while fixing the number of learning data, test data, and items for prediction at 100, 50, and 5, respectively. Learning data and test data were randomly chosen from the original dataset twice, yielding datasets A and B. For each dataset, we randomly sample combinations of items, evaluate all of them, and find the best combination and its accuracy. We regard the average accuracy over five such evaluations as the accuracy of the dataset. The results (Fig. 5) show that 10,000 sampled combinations are sufficient, because the maximum differences in accuracy for datasets A and B are only about 0.46% and 0.67%, respectively.
For the learning data, the accuracy is evaluated by varying the number of learning data from 50 to 500, while fixing the number of test data, items for prediction, and sampled combinations of items at 1,000, 5, and 10,000, respectively. Test data are randomly chosen from the original dataset five times, as are the sampled combinations of items, yielding datasets A to E. For each dataset, we randomly choose learning data from the original dataset ten times, evaluate all combinations of items, and find the best combination and its accuracy. We regard the average accuracy over the ten evaluations as the accuracy of the dataset. The results (Fig. 7) show that the accuracy increases roughly linearly with the size of the learning data; considering the processing time for evaluation, the number of learning data is set to 100.
Finally, in order to discover an adequate number of items for prediction, the accuracy is evaluated by varying the number of items for prediction from 2 to 10, while fixing the number of learning data, test data, and sampled combinations of items at 100, 1,000, and 10,000, respectively. We randomly choose learning data and test data from the original dataset five times, evaluate all combinations of items, and find the best combination and its accuracy. We regard the average accuracy over the five evaluations as the accuracy of the dataset. The results (Fig. 8) show that the gain in accuracy diminishes once the number of items for prediction exceeds six; hence the number of items for prediction is set to five.
Based on these results, the parameters in this experiment are set as shown in Table 7. Note that the SVM parameters are not tuned; the default values \(\gamma = 0.2\) and \(cost = 1\) are used in this section and in Sects. 4.1 and 4.2.
Table 7. Parameters in this experiment: # learning data = 100; # test data = 1,000; # items for prediction = 5; # samples of combinations = 10,000; \(\gamma \) (SVM parameter) = 0.2; cost (SVM parameter) = 1.
4.1 Evaluation by Attributes Grouping
In this section, the original data set is grouped by the participants' attributes, namely gender, age, and type of mobile phone, and the accuracy of prediction models generated from each grouped data set is evaluated. The parameters used for the evaluations are the same as in Sect. 4. Note that the size of the learning data and test data does not decrease even when the data set is divided into small subsets. Learning data and test data are randomly chosen from each grouped subset 10 times, as are the sampled combinations of items, and the accuracy is averaged over the 10 trials. The result is shown in Table 8. Note that for the type of mobile phone, the item "other smart phone" is omitted because its sample size is too small.
Table 8. Accuracy by grouping by attributes: gender, age, and type of mobile phone ("other smart phone", "not smart phone"); values omitted.
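As a rough illustration of the grouping procedure, the sketch below splits a hypothetical settings database by one attribute (gender) and scores a simple per-question majority-vote predictor within each group, as a stand-in for the per-group SVM models of this section; all data and names are invented.

```python
# Sketch of the attribute-grouping evaluation with a majority-vote stand-in
# for the per-group SVM models (hypothetical data).
import random
from collections import defaultdict

random.seed(2)
# Hypothetical settings database with one demographic attribute per user.
users = [{"gender": random.choice(["male", "female"]),
          "answers": [random.choice([0, 1, 2]) for _ in range(10)]}
         for _ in range(60)]

# Group the answer rows by the attribute, as in Sect. 4.1.
groups = defaultdict(list)
for u in users:
    groups[u["gender"]].append(u["answers"])

def majority_accuracy(rows):
    """Score a per-question majority-vote predictor inside one group
    (a simple stand-in for training and testing a per-group model)."""
    n_q = len(rows[0])
    correct = 0
    for q in range(n_q):
        col = [r[q] for r in rows]
        correct += max(col.count(v) for v in set(col))
    return correct / (n_q * len(rows))

per_group = {g: majority_accuracy(rows) for g, rows in groups.items()}
```

Comparing `per_group` scores against the same score computed on the whole database mirrors the comparison reported in Table 8.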
4.2 Evaluation by Privacy Preferences
The original data set was clustered on the participants' answer preferences using the K-means algorithm. The results show two characteristic clusters: Cluster 1 and Cluster 4. The participants in Cluster 1 tend to answer "0" (negative), and the participants in Cluster 4 tend to answer "1" (neutral), for almost all the questions. It is easy to determine whether a person belongs to Cluster 1, Cluster 4, or another cluster, because it is only necessary to ask about his/her basic privacy attitude directly, for example, "Would you prefer that your personal data never be provided at all?". If accuracy is improved by grouping the original data set by clustering on the answer preferences, it may be possible to improve our scheme by adding only one question that determines to which cluster a person belongs. Hence, in the next subsection, the original data set is divided into Cluster 1, Cluster 4, and the other clusters, and a prediction model is generated for each cluster.
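The preference clustering can be sketched as follows, assuming scikit-learn's `KMeans` as a stand-in for the clustering used in the paper; the synthetic answer vectors mimic the two observed tendencies (mostly-"0" and mostly-"1" users), and all sizes are arbitrary.

```python
# Sketch of clustering users by their answer preferences (synthetic data).
import random
from sklearn.cluster import KMeans

random.seed(3)
# Cluster-1-like users answer mostly "0" (negative);
# Cluster-4-like users answer mostly "1" (neutral).
negative = [[0 if random.random() < 0.9 else 1 for _ in range(20)]
            for _ in range(30)]
neutral = [[1 if random.random() < 0.9 else 2 for _ in range(20)]
           for _ in range(30)]
X = negative + neutral

# Cluster users on their raw answer vectors.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
# Users sharing a tendency should mostly land in the same cluster, so one
# screening question could route a new user to the matching per-cluster model.
```

A per-cluster prediction model would then be trained on each subset of `X`, as evaluated in the next subsection.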
4.3 Evaluation by Grouping of Clusters
The parameters used for the evaluations are the same as in Sects. 4 and 4.1. Learning data and test data are randomly chosen 10 times from each grouped subset, as are the sampled combinations of items, and the accuracy is averaged over the 10 trials. The case of applying the prediction model built from the whole data set to each cluster is compared with the case of applying, to each cluster, the prediction model built from that cluster's own data. The result is shown in Table 9.
Table 9. Evaluation in grouping by clustering: accuracy and accuracy \(\times \) ratio when using the model from all data (previous scheme) versus the models from the divided data; values omitted.
Table 10. Evaluation in the case of dividing Clusters 2 and 3: accuracy and accuracy \(\times \) ratio when using the model from all data (previous scheme) versus the models from the divided data; values omitted.
Regarding the results in Sect. 4.3, accuracy is improved for Cluster 4; however, no significant improvement is obtained for Cluster 1 or Clusters 2+3. The reason the accuracy does not improve for Cluster 1 may be that sufficiently high accuracy was already achieved by the prediction model generated from the whole data set, because the ratio of "0" (negative) answers is very high (about 96.8%). The reason the accuracy does not improve for Clusters 2+3 may be that the prediction model is generated from mixed data from two clusters with different tendencies. Results of additional evaluations, in which Clusters 2+3 are split into Cluster 2 and Cluster 3, are shown in Table 10. They show accuracy improvements of about 2.4% and 3.4% for Clusters 2 and 3, respectively, raising the possibility of improving accuracy further by subdividing the clusters based on answer preferences.
In this paper, we proposed and evaluated the applicability of SVM-based models for predicting users' default privacy settings at the time of registration with service providers. Furthermore, we evaluated the improvement in accuracy of the privacy setting prediction scheme when the machine learning data sets were grouped based on users' attributes and setting preferences. First, we evaluated the case where the data sets were grouped by gender, age, and type of mobile phone; the accuracy was not improved. In terms of privacy protection, this result shows that the collection of additional user attributes can be minimized. We then evaluated our scheme by grouping privacy setting preferences using the K-means algorithm, and observed an improvement in accuracy. Future work will focus on enhancing the prediction accuracy, for instance by trying different combinations when merging the classes. We also plan to trial the model in real-world scenarios, i.e., by integrating our prediction engine into an online service such as a social network site, analyzing the behavior of users, and collecting their feedback regarding the usefulness and expected accuracy of the prediction engine. In addition, we plan to run statistical tests on the significance of the observed improvement, and to investigate the impact of the predicted settings with respect to service providers' regulatory requirements, such as the GDPR or Japan's Act on the Protection of Personal Information, and the rights of users.
This research work has been supported by JST CREST Grant Number JPMJCR1404, Japan.
- 2. Backes, M., Karjoth, G., Bagga, W., Schunter, M.: Efficient comparison of enterprise privacy policies. In: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC 2004, pp. 375–382 (2004)
- 3. Bekara, K., Ben Mustapha, Y., Laurent, M.: XPACML extensible privacy access control markup language. In: 2010 Second International Conference on Communications and Networking (ComNet), pp. 1–5 (2010)
- 6. Buffett, S., Fleming, M.W.: Applying a preference modeling structure to user privacy. In: Proceedings of the 1st International Workshop on Sustaining Privacy in Autonomous Collaborative Environments (2007)
- 8. Dehghantanha, A., Udzir, N., Mahmod, R.: Towards a pervasive formal privacy language. In: 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 1085–1091 (2010)
- 9. Fang, L., Kim, H., LeFevre, K., Tami, A.: A privacy recommendation wizard for users of social networking sites. In: Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 630–632. ACM (2010)
- 10. Fang, L., LeFevre, K.: Privacy wizards for social networking sites. In: Proceedings of the 19th International Conference on World Wide Web, pp. 351–360. ACM (2010)
- 11. Guo, S., Chen, K.: Mining privacy settings to find optimal privacy-utility tradeoffs for social network services. In: 2012 International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2012 International Conference on Social Computing (SocialCom), pp. 656–665 (2012)
- 13. Kelley, P.G., Hankes Drielsma, P., Sadeh, N., Cranor, L.F.: User-controllable learning of security and privacy policies. In: Proceedings of the 1st ACM Workshop on AISec, AISec 2008, pp. 11–18 (2008)
- 15. Madejski, M., Johnson, M., Bellovin, S.: A study of privacy settings errors in an online social network. In: 2012 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pp. 340–345 (2012)
- 16. Mugan, J., Sharma, T., Sadeh, N.: Understandable learning of privacy preferences through default personas and suggestions (2011)
- 17. Nakamura, T., Kiyomoto, S., Tesfay, W.B., Serna, J.: Personalised privacy by default preferences - experiment and analysis. In: Proceedings of the 2nd International Conference on Information Systems Security and Privacy, ICISSP, vol. 1, pp. 53–62 (2016)
- 20. Solove, D.J.: Privacy self-management and the consent dilemma. Harvard Law Review 126 (2013)
- 21. Tondel, I., Nyre, A., Bernsmed, K.: Learning privacy preferences. In: 2011 Sixth International Conference on Availability, Reliability and Security (ARES), pp. 621–626 (2011)
- 23. W3C: The Platform for Privacy Preferences 1.0 (P3P1.0) specification. Platform for Privacy Preferences (P3P) Project (2002)