Abstract
In today’s society, almost no one can do without a mobile phone. With the rapid expansion of the smartphone and app markets in recent years, smartphone penetration in the mobile phone market is expected to grow from the current 35%–40% to 60% by 2019. Users browse the internet, chat, and play popular games on their phones almost anywhere and at any time. As a result, a mobile phone records nearly all of a person’s behavior and preferences. Consequently, personal attributes such as gender and age, demographic attributes that are frequently used in precision marketing, can be predicted accurately. In this paper, a gender and age prediction algorithm (GAPA) is proposed that predicts a user’s gender and age using established supervised machine learning methods. The numerical results show that the proposed algorithm is efficient and keeps the log-loss between 2 and 3.
1 Introduction
With the rapid development of telecommunications and smartphones, almost no one can do without a personal mobile phone. According to the Ericsson Mobility Report, smartphone penetration in the mobile market will increase from 30% to 60% over roughly five years, by 2019. People use their mobile phones to browse the internet, chat, and play popular games almost every day. As a result, the mobile phone records nearly all of a person’s behavior and preferences. The recorded data collected by a tracing platform, such as the list of installed APPs, APP usage records, and the type and price of the mobile phone, contain abundant information about the customers. Consequently, a user’s personal information such as gender and age can be predicted accurately with machine learning. This information can be widely used for personalized targeted advertising. It not only helps APP companies understand their users’ behavioral characteristics and iterate on their products, but also helps enterprises deliver Internet advertising more accurately and save advertising costs. Recently, the authors of [1] proposed that behaviorally targeted advertisements can effectively improve advertisement click-through rates. To achieve this objective, big data and machine learning technologies can provide near-real-time solutions for processing the huge amount of data collected from the tracing platform.
There are several well-known supervised algorithms in machine learning, such as Support Vector Machines (SVM), Logistic Regression (LR), and Decision Trees. The gender of customers is divided into male and female, and age can be represented in groups of 10 years each. The prediction of gender and age can therefore be cast as a classification problem. Classical algorithms for classification include Decision Trees, GBDT (Gradient Boosting Decision Tree), and XGBoost. For example, XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework [2]. After XGBoost, LightGBM was proposed by Microsoft to improve the performance of boosting algorithms [3]. It reduces the cost of calculating the split gain and uses histogram subtraction for further speed-up. Based on the machine learning algorithms mentioned above, we develop a framework to estimate the gender and age of mobile users from their installation and usage of APPs.
This paper is organized as follows. In Sect. 2, research related to the prediction of user information is discussed. Section 3 presents the GAPA scheme. Section 4 describes the numerical results obtained by analyzing the tracing data. Finally, conclusions are given in Sect. 5.
2 Relevant Work
To the best of our knowledge, the feasibility of demographic inference from various kinds of customer tracing data has been demonstrated many times in the past. For example, the authors of [4] proposed a solution for predicting the gender, age and religious tendency of mobile users based on search queries and data from SNS such as Facebook. In [5], the authors developed a scheme to predict demographics such as relationship, age and gender. The scheme is based not only on behavioral features of application usage, voice call usage, and SMS usage, but also on environmental features, such as the Bluetooth and WiFi devices detected per day on mobile phones. In [6], Suranga warned that collecting data through over-permissioned platforms and sharing it with other companies raises multiple privacy and security issues. To verify this effect, the authors presented a framework that predicts mobile users’ gender based solely on installed APPs, reaching an accuracy of around 70% in the numerical results.
Mobile phones are widely used worldwide [7,8,9]. The proposed gender and age prediction algorithm is based on the LightGBM method and uses tracing data such as installed APPs, APP usage records, and the type and price of the mobile phone. To enhance the prediction accuracy, a cross-validation scheme is applied in the proposed GAPA. Through repeated iterative calculations on the training data set, we select the most accurate set of feature importances and use it to predict on the forecast data set. We then check the accuracy of the model with the log-loss function.
3 A Gender and Age Prediction Algorithm Based on Machine Learning
To enhance the prediction accuracy, the LightGBM algorithm and a cross-validation scheme are applied in the proposed GAPA. Figure 1 presents the GAPA process, which can be divided into five steps: data collection, feature engineering, model training, cross-validation and results evaluation.
3.1 Step 1: Collection
The first step is data collection. The collector in the tracing platform gathers each user’s mobile information, such as user ID, mobile brand, mobile sub-brand, mobile price, gender and age. We use the records that include gender and age as the training set, and the records without gender and age information as the forecast set. In this study, we collected information on nearly 73000 Android users, of whom 50000 provided their gender and age. These users form the training set, in which 32324 (≈64.6%) are male and 17676 (≈35.4%) are female.
Besides the user’s mobile information, the collector also gathers the user’s APP information. It contains the user ID, APP, APP series, APP sub-series, and the start and end times of each APP usage record. Tables 1 and 2 show fragments of the two kinds of data collected from the tracing platform. In the next step, we separate these data and carry out feature engineering, which is the key process of GAPA.
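The split between the training set and the forecast set can be illustrated with a minimal sketch. It assumes each record is a dictionary in which a missing gender or age is stored as `None`. The field names, the helper `split_by_label` and the sample values are hypothetical and are not taken from the tracing platform.

```python
def split_by_label(records):
    """Split records into (training, forecast) by whether gender and age are present."""
    training = []
    forecast = []
    for record in records:
        # A record is usable for training only if both target fields are known.
        if record["gender"] is not None and record["age"] is not None:
            training.append(record)
        else:
            forecast.append(record)
    return training, forecast


# Illustrative records only; brands, prices and labels are made up.
records = [
    {"user_id": "u1", "brand": "Xiaomi", "price": 1999.0, "gender": 1, "age": 3},
    {"user_id": "u2", "brand": "Huawei", "price": 2999.0, "gender": None, "age": None},
    {"user_id": "u3", "brand": "Samsung", "price": 3999.0, "gender": 2, "age": 4},
]
training, forecast = split_by_label(records)
print(len(training), "training records,", len(forecast), "forecast records")
```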
3.2 Step 2: Feature Engineering
Step 2 is feature engineering, which is the most important part of any machine learning project. In this process, we need to pick the features that have the most influence and discriminative power for classifying and identifying the user’s information. In short, the more effort we invest in this step, the more accurate the result we will obtain from the machine learning algorithm. First of all, we perform a statistical analysis of the training set, which contains the data of 50000 users.
Before the study, we define an encoding for the target features, gender and age, which is presented in Table 3.
We represent male and female as 1 and 2, and we divide age into groups of 10 years each. For example, the users in the 25–30 group are represented by 3 in the age feature.
3.2.1 Basic Features
Basic features capture the statistical patterns of the basic information in the training set.
Figure 2 describes the distribution of mobile brands in the training set. For both male and female users, Xiaomi, Samsung and Huawei clearly hold the top three market shares among Android smartphones. The share of Xiaomi is 20.5% among male users and 18.7% among female users. Figure 3 shows the age distribution for different mobile brands. From the figure, we can see that the top three brands compete intensely for young consumers below 30 years old.
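A per-gender brand share of this kind can be computed with a simple count, as in the sketch below. The function name `brand_share_by_gender` and the toy rows are hypothetical, and gender is assumed to be encoded as 1 (male) and 2 (female) as stated earlier.

```python
from collections import Counter, defaultdict


def brand_share_by_gender(rows):
    """Return {gender: {brand: share}} from an iterable of (gender, brand) pairs."""
    counts = defaultdict(Counter)
    for gender, brand in rows:
        counts[gender][brand] += 1
    shares = {}
    for gender, brand_counts in counts.items():
        total = sum(brand_counts.values())
        # Share of each brand among the users of this gender.
        shares[gender] = {brand: n / total for brand, n in brand_counts.items()}
    return shares


# Toy data only; it does not reproduce the shares reported in the paper.
rows = [
    (1, "Xiaomi"), (1, "Xiaomi"), (1, "Huawei"), (1, "Samsung"),
    (2, "Xiaomi"), (2, "Huawei"),
]
shares = brand_share_by_gender(rows)
```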
3.2.2 APP Features
APP features describe the statistical patterns of APP installation and usage in the training set. Figure 4 shows the distribution of installed APPs by gender. For male users, shown in the left panel, the top three favorite APP categories are social, mobile shopping and App-Manager. For female customers, social, mobile shopping and physical health are the favorites. This pattern is shaped by differences in living habits, ways of thinking and physiology.
Furthermore, we select game and physical health as representative APP series for the two genders to analyze this pattern in more depth. In Fig. 5, the physical health series is very popular among female consumers between 20 and 50 years old. The penetration of the game series skews young in the male group, as users below 40 years old are more likely to install and play games for fun.
Based on the above analysis, we chose the TF-IDF (Term Frequency–Inverse Document Frequency) algorithm, a well-known algorithm in information retrieval, to process the installed APP information, which is text-type data. TF-IDF is intended to reflect how significant a word is to a document in a collection or corpus.
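In the case of the TF, the simplest choice is to use the raw count of a term in a document. It can be described as in (1):
\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1} \]
where \( n_{i,j} \) is the number of times that term \( t_i \) occurs in document \( d_j \), and the denominator is the total number of words in \( d_j \).
The inverse document frequency is a measure of how much information the word provides. It can be described as in (2):
\[ \mathrm{idf}_{i} = \log \frac{|D|}{|\{ j : t_i \in d_j \}|} \tag{2} \]
where \( |D| \) is the total number of documents in the corpus, and \( |\{ j : t_i \in d_j \}| \) is the number of documents in which the term \( t_i \) appears.
Then TF-IDF is calculated as in (3):
\[ \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i} \tag{3} \]
The TF-IDF value increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word.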
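The TF-IDF weighting described above can be sketched in a few lines of Python. The sketch assumes that each user is treated as one document and each APP series as one term, and it uses the natural logarithm, since the base is not stated here. The function name `tf_idf` and the sample lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of documents, each a list of terms."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency (count / total) times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy example: one list of APP series per user.
user_apps = [
    ["social", "shopping", "game"],
    ["social", "shopping", "health"],
    ["social", "game", "game"],
]
weights = tf_idf(user_apps)
```

In this toy corpus, "social" appears in every document, so its inverse document frequency is zero and it carries no weight, whereas "health" appears in only one document and is weighted most strongly there.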
3.3 Step 3: Model Training
As mentioned above, GAPA is based on big data and machine learning technology. We studied GBDT, XGBoost and LightGBM, which are well-known machine learning algorithms. After comparing their accuracy and complexity, we chose LightGBM as the core algorithm of GAPA. The parameters for LightGBM in Python are set as follows (Table 4):
3.4 Step 4: Cross Validation
In this step, a cross-validation scheme is used to tune the tree-iteration parameters of the LightGBM algorithm. Cross-validation refers to a family of model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. We split the training data into 5 folds; in each round, one fold is held out for validation and the remaining folds are used for training.
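A 5-fold split can be sketched as follows. This is a minimal illustration that assumes the standard k-fold arrangement, in which each fold is held out once for validation while the other folds are used for training. The helper `k_fold_indices` and its `seed` parameter are hypothetical and do not reproduce the exact procedure used in GAPA.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        valid_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, valid_idx


# Ten samples split into five folds of two validation samples each.
splits = list(k_fold_indices(10, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), "training samples,", len(valid_idx), "validation samples")
```

Averaging the validation score over the five rounds gives an estimate of how a given parameter setting generalizes, which is the quantity that parameter tuning compares.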
3.5 Step 5: Results Evaluation
To evaluate the performance of the built model, we use the log-loss function to measure the accuracy of the prediction result. The log-loss function is given in formula (4):
\[ \mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \left( p_{ij} \right) \tag{4} \]
where \( N \) is the number of users in the forecast set, \( i \) indexes the users, \( j \) indexes the \( M \) groups of users divided by gender and age, \( y_{ij} \) indicates whether user \( i \) belongs to group \( j \) (1 if so, 0 otherwise), and \( p_{ij} \) is the probability calculated by GAPA that user \( i \) belongs to group \( j \).
The log-loss thus aggregates, over all users, the error between the actual group and the predicted probabilities. The ideal target of the proposed algorithm is 0.
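The evaluation metric can be sketched as below, assuming the standard mean multi-class log-loss: the negative logarithm of the probability assigned to the user's true group, averaged over users. The function name `log_loss`, the clipping constant `eps` and the class count of 12 in the example are hypothetical choices made for illustration.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean multi-class log-loss.

    y_true: list of true group indices, one per user.
    probs: list of per-user probability lists, one entry per group.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip the probability so that log(0) never occurs.
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# A uniform guess over 12 hypothetical groups scores log(12).
uniform = [[1.0 / 12] * 12 for _ in range(2)]
baseline = log_loss([0, 1], uniform)
```

Because the indicator is 1 only for the true group, the inner sum over groups reduces to a single term per user. A uniform guess over M groups yields a log-loss of log M, which gives a reference point for judging a reported value.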
4 Numerical Results
In this paper, the case analysis of machine learning is based on the user’s mobile information collected by the tracing platform. The data sets are processed and analyzed with Sklearn in the Python environment (Fig. 6).
From the above, we can observe a regular pattern: the performance of GAPA improves as features are accumulated, because all the basic features contribute gains to the Sklearn algorithms. The inflection point occurs when we add the ‘Top30–40 apps is installed’ features, because these ten features include some interfering features that hinder the classification of the forecast data set. The best result in our experiment is a log-loss of 2.67, obtained with the LightGBM algorithm.
5 Conclusions
In this paper, a gender and age prediction algorithm based on big data analytics and machine learning is proposed. The proposed framework can predict the user’s information with a log-loss of 2.67. The proposed algorithm considers more aspects and features than the algorithms in [4,5,6], and its precision can be improved further. The analysis of the collected data in the last part shows that the performance of GAPA is significant, and the results show that the GAPA scheme can be generalized to the area of targeted advertising.
References
1. Jakir, K., Fenil, A., Mithila, S.: Different approaches and methods for targeted advertisements by predicting user’s behavioral data and next location. In: Conference 2018, ICISC, pp. 1345–1350. IEEE (2018)
2. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Conference 2016, ACM, pp. 785–794. IEEE (2016)
3. Ke, G., Meng, Q., Finley, T.: LightGBM: a highly efficient gradient boosting decision tree. In: Conference 2017, NIPS, pp. 342–353. NIPS (2017)
4. Bi, B., Shokouhi, M., Kosinski, M.: Inferring the demographics of search users: social data meets search queries. In: Conference 2013, World Wide Web, pp. 131–140. IEEE (2013)
5. Aarthi, S., Bharanidharan, S., Saravanan, M.: Predicting customer demographics in a mobile social network. In: Conference 2011, International Conference on Advances in Social Networks Analysis and Mining, pp. 553–554. IEEE (2011)
6. Chen, J., Wang, C., He, K.: Semantics-aware privacy risk assessment using self-learning weight assignment for mobile apps. IEEE Trans. Dependable Secur. Comput. pp, 1 (2018)
7. Xu, L., Luan, Y., Cheng, X.: Telecom big data based user offloading self-optimisation in heterogeneous relay cellular systems. Int. J. Distrib. Syst. Technol. 8, 27–46 (2017)
8. Xu, L., Cheng, X., Chen, Y., Chao, K., Liu, D., Xing, H.: Self-optimised coordinated traffic shifting scheme for LTE cellular systems. In: 1st EAI International Conference on Self-Organizing Networks, pp. 67–75. Springer press, Beijing (2015)
9. Xu, L., Zhao, X., Luan, Y.: User perception aware telecom data mining and network management for LTE/LTE-advanced networks. In: 4th International Conference on Signal and Information Processing, Networking and Computers, pp. 237–245. Springer press, Qingdao (2018)
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Gao, J., Zhang, T., Guan, J., Xu, L., Cheng, X. (2019). A Gender and Age Prediction Algorithm Using Big Data Analytic Based on Mobile APPs Information. In: Sun, S., Fu, M., Xu, L. (eds) Signal and Information Processing, Networking and Computers. ICSINC 2018. Lecture Notes in Electrical Engineering, vol 550. Springer, Singapore. https://doi.org/10.1007/978-981-13-7123-3_60
DOI: https://doi.org/10.1007/978-981-13-7123-3_60
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-7122-6
Online ISBN: 978-981-13-7123-3
eBook Packages: Engineering, Engineering (R0)