1 Introduction

With the rapid development of telecommunications and smartphones, almost no one can do without a personal mobile phone. According to the Ericsson Mobility Report, smartphone penetration in the mobile market will increase from 30% to 60% over roughly five years, by the year 2019. People use their mobile phones to browse the Internet, chat and play popular games almost every day. As a result, the mobile phone reflects nearly all of a person’s behavior and preferences. The recorded data collected by a tracing platform, such as the list of installed APPs, APP usage records, and the type and price of the mobile phone, contain abundant information about the customers. A user’s personal information, such as gender and age, can therefore be predicted with machine learning technology. This information can be widely used for personalized targeted advertising. It not only helps APP companies understand their users’ behavior and iterate their products, but also helps enterprises deliver Internet advertising more accurately and save advertising costs. Recently, the author of [1] proposed that behaviorally targeted advertisements can effectively improve advertisement click-through rates. To achieve this objective, big data and machine learning technology can provide nearly real-time solutions for processing the huge amount of data collected from the tracing platform.

Machine learning offers several well-known supervised algorithms, such as Support Vector Machines (SVM), Logistic Regression (LR), and Decision Trees. The gender of customers is divided into male and female, and age can be represented in groups of 10 years. The prediction of gender and age can therefore be cast as a classification problem. Classical algorithms for classification include Decision Trees, GBDT (Gradient Boosting Decision Tree), and XGBoost. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable; it implements machine learning algorithms under the Gradient Boosting framework [2]. After XGBoost, LightGBM was proposed by Microsoft to improve the performance of boosting algorithms [3]. It reduces the cost of calculating split gain and uses histogram subtraction for further speed-up. Based on the machine learning algorithms mentioned above, we develop a framework to estimate the gender and age of mobile users from the installation and usage of APPs.
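
The histogram subtraction mentioned above can be illustrated with a minimal sketch. It is not LightGBM’s implementation: the bin count, the toy gradient values and the helper name `build_histogram` are assumptions made for illustration. The idea is that once the gradient histogram of a parent node and of one child are known, the histogram of the other child follows by bin-wise subtraction, so only one child needs a pass over its samples.

```python
def build_histogram(bin_ids, gradients, n_bins):
    """Sum the gradients of the samples that fall into each feature bin."""
    hist = [0.0] * n_bins
    for b, g in zip(bin_ids, gradients):
        hist[b] += g
    return hist

# Toy node (made-up values): each sample has a feature bin and a gradient.
bin_ids = [0, 1, 1, 2, 3, 3, 0, 2]
gradients = [0.5, -1.0, 0.25, 2.0, -0.5, 1.5, 1.0, -2.0]
n_bins = 4

parent_hist = build_histogram(bin_ids, gradients, n_bins)

# Suppose a split sends the samples at these positions to the left child.
left_idx = [0, 1, 6]
left_hist = build_histogram([bin_ids[i] for i in left_idx],
                            [gradients[i] for i in left_idx], n_bins)

# The right child is obtained by subtraction, without scanning its samples.
right_hist = [p - l for p, l in zip(parent_hist, left_hist)]

# Direct construction, shown only to check the subtraction.
right_idx = [i for i in range(len(bin_ids)) if i not in left_idx]
right_direct = build_histogram([bin_ids[i] for i in right_idx],
                               [gradients[i] for i in right_idx], n_bins)
```

In practice the histogram would be built directly for the smaller child, so that the larger one is the one recovered by subtraction.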

This paper is organized as follows. In Sect. 2, research related to the prediction of user information is discussed. Section 3 presents the scheme of the proposed gender and age prediction algorithm (GAPA). Section 4 describes the numerical results obtained by analyzing the tracing data. Finally, conclusions are given in Sect. 5.

2 Related Work

To the best of our knowledge, the feasibility of demographic inference from various tracing data of customers has been studied many times in the past. For example, the authors of [4] proposed a solution for predicting the gender, age and religious tendency of mobile users based on search queries from SNS such as Facebook. In [5], the authors developed a scheme to predict demographics such as relationship, age and gender. The scheme is based not only on behavioral features of application usage, voice call usage, and SMS usage, but also on environmental features, such as the Bluetooth and WiFi devices detected per day on mobile phones. In [6], Suranga warned that collecting data through over-permissioned platforms and sharing it with other companies raises multiple privacy and security issues. To verify this effect, the authors presented a framework to predict mobile users’ gender based simply on installed APPs, and the accuracy reached around 70% in the numerical results.

Mobile phones are widely used worldwide [7,8,9]. The proposed gender and age prediction algorithm is based on the LightGBM method and uses tracing data such as installed APPs, APP usage records, and the type and price of the mobile phone. To enhance the accuracy of prediction, a cross validation scheme is applied in the proposed GAPA. Through repeated iterative calculations on the training data set, we select the set of features whose importance yields the most accurate model and use it to predict the forecast data set. We then check the accuracy of this model with the log-loss function.

3 A Gender and Age Prediction Algorithm Based on Machine Learning

To enhance the accuracy of prediction, the LightGBM algorithm and a cross validation scheme are applied in the proposed GAPA. Figure 1 presents the process of GAPA, which can be divided into five steps: data collection, feature engineering, model training, cross validation and results evaluation.

Fig. 1. The flowchart of GAPA

3.1 Step 1: Collection

The first step is data collection. The collector in the tracing platform gathers each user’s mobile information, such as user ID, mobile brand, mobile sub-brand, mobile price, gender and age. We use the data with gender and age labels as the training set, and the data without gender and age information as the forecast set. In this study, we collect information on nearly 73000 Android users, of whom 50000 provide their gender and age; these users form the training set. In this data set, 32324 (≈64.6%) users are male and 17676 (≈35.4%) are female.

Besides the user’s mobile information, the collector also gathers the user’s APP information. It contains user ID, APP, APP series, APP sub-series, and the start time and end time of each APP usage record. Tables 1 and 2 show fragments of the two kinds of data collected from the tracing platform. In the next step, we separate these data and carry out feature engineering, which is the key process of GAPA.
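
A minimal sketch of how APP usage records of this kind might be summarized per user before feature engineering. The tuple layout, the epoch-second timestamps, the sample values and the helper name `summarize_usage` are all assumptions for illustration; the actual schema is the one given in Tables 1 and 2.

```python
from collections import defaultdict

# Hypothetical usage records: (user_id, app, app_series, start_ts, end_ts),
# with start and end times assumed to be epoch seconds.
records = [
    ("u1", "app_a", "social", 1000, 1600),
    ("u1", "app_b", "game", 2000, 2300),
    ("u1", "app_a", "social", 3000, 3100),
    ("u2", "app_c", "shopping", 1000, 1900),
]

def summarize_usage(records):
    """Return, per user, the set of used APPs and the usage seconds per APP series."""
    apps = defaultdict(set)
    seconds = defaultdict(lambda: defaultdict(int))
    for user_id, app, series, start_ts, end_ts in records:
        apps[user_id].add(app)
        # Guard against records whose end time precedes the start time.
        seconds[user_id][series] += max(0, end_ts - start_ts)
    return apps, seconds

apps, seconds = summarize_usage(records)
```

Summaries of this kind can then be joined with the mobile information on the user ID.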

Table 1. User’s mobile information
Table 2. User’s APP information

3.2 Step 2: Feature Engineering

Step 2 is feature engineering, which is the most important part of a machine learning project. In this process, we need to pick the features that have the most influence and discriminative ability for classifying and identifying the user’s information. In short, the more effort we invest in this step, the more accurate the result of the machine learning algorithm will be. First of all, we perform a statistical analysis of the training set, which contains the data of 50000 users.

Before the study, we set a regulation for the target features, gender and age. It is presented in Table 3.

Table 3. Regulation for gender and age

We represent male and female as 1 and 2, and we divide age into groups of 10 years. For example, users in the 25–30 group are represented by 3 in the age feature.
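
Since formula (4) sums over 22 groups defined by gender and age, the two codes can be combined into a single class label. The sketch below is one possible way to do so and is not taken from Table 3: it assumes 11 age groups (22 groups divided by 2 genders), 1-based age group codes, and a hypothetical helper name `joint_class`.

```python
N_AGE_GROUPS = 11  # assumption: 22 joint groups / 2 genders

def joint_class(gender_code, age_group_code, n_age_groups=N_AGE_GROUPS):
    """Map 1-based (gender, age group) codes to one class index in 1..2*n_age_groups."""
    if gender_code not in (1, 2):
        raise ValueError("gender_code must be 1 (male) or 2 (female)")
    if not 1 <= age_group_code <= n_age_groups:
        raise ValueError("age_group_code out of range")
    return (gender_code - 1) * n_age_groups + age_group_code

male_group_3 = joint_class(1, 3)
female_group_3 = joint_class(2, 3)
```

With this layout, classes 1 to 11 are the male age groups and classes 12 to 22 are the female age groups.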

3.2.1 Basic Features

Basic features capture the statistical patterns of the basic information in the training set.

Figure 2 describes the distribution of mobile brands in the training set. For both male and female users, Xiaomi, Samsung and Huawei clearly hold the top three market shares among Android smartphones. Xiaomi’s share is 20.5% among male users and 18.7% among female users. Figure 3 shows the distribution of age across mobile brands. From the figure, we can see that the top three brands compete intensely for young consumers below 30 years old.

Fig. 2. The distribution of mobile brand in training set

Fig. 3. The distribution of age in different mobile brand

3.2.2 APP Features

APP features describe the statistical patterns of APP installation and usage in the training set. Figure 4 shows the distribution of installed APPs by gender. For male users, in the left panel, the top three favorite APP series are social, mobile shopping and App-Manager. For female users, social, mobile shopping and physical health are the favorite APP series. This market pattern is shaped by differences in living habits, ways of thinking and physiology.

Fig. 4. The distribution of installed APP in different gender

Moreover, we select game and physical health as representative APP series for the two genders to analyze this pattern more deeply. In Fig. 5, the physical health series is very popular among female consumers from 20 to 50 years old. The penetration of the game series skews young in the male group, as users below 40 years old are more likely to install and play games for fun.

Fig. 5. The distribution of physical and game in different gender

Based on the above analysis, we chose the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, a well-known algorithm in information retrieval, to process the installed APP information, which is text-type data. TF-IDF is intended to reflect how significant a word is to a document in a collection or corpus.

In the case of TF, the simplest choice is to use the raw count of a term in a document; here the count is normalized by the length of the document. It can be described as follows:

$$ {\text{TF}}_{i,j} = \frac{n_{i,j}}{\sum\nolimits_{k} n_{k,j}} $$
(1)

where \( n_{i,j} \) is the number of times that term \( t_{i} \) occurs in document \( d_{j} \), and the denominator is the total number of words in \( d_{j} \).

The inverse document frequency is a measure of how much information the word provides. It can be described in (2):

$$ {\text{IDF}}_{i} = \log \frac{|D|}{|\{ j:t_{i} \in d_{j} \} |} $$
(2)

where |D| is the total number of documents in the corpus and \( |\{ j:t_{i} \in d_{j} \} | \) is the number of documents in which the term \( t_{i} \) appears.

Then TF-IDF is calculated as (3):

$$ {\text{TF-IDF}}_{i,j} = {\text{TF}}_{i,j} \times {\text{IDF}}_{i} $$
(3)

The TF-IDF value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
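
Formulas (1) to (3) can be sketched in a few lines of plain Python. The sketch treats each user’s list of installed APPs as a document and each APP as a term; the APP names, the helper name `tf_idf` and the use of the natural logarithm are assumptions for illustration, not details given in the text.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, following formulas (1)-(3)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())  # denominator of formula (1)
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights

# Hypothetical installed-APP lists for three users.
users = [
    ["chat", "shop", "game"],
    ["chat", "shop", "health"],
    ["chat", "game", "video"],
]
weights = tf_idf(users)
```

In this toy corpus, "chat" is installed by every user and therefore receives weight 0, while "health", installed by a single user, receives the largest weight in that user’s list. This is the behavior that makes rarely installed APPs more discriminative than ubiquitous ones.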

3.3 Step 3: Model Training

As mentioned above, GAPA is based on big data and machine learning technology. We studied GBDT, XGBoost and LightGBM, which are well-known machine learning algorithms. After comparing their accuracy and complexity, we chose LightGBM as the core algorithm of GAPA. The parameters for LightGBM in Python are set as follows (Table 4):

Table 4. Parameters of LightGBM

3.4 Step 4: Cross Validation

In this step, a cross validation scheme is used to optimize the parameters of the iteration trees in the LightGBM algorithm. Cross validation refers to any of several similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. We split the training data into 5 folds, 1 fold for training and the rest for validation.
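
The 5-fold split can be sketched as an index generator. The sketch follows the conventional arrangement, in which each fold is held out once for validation and the remaining folds are used for training; this is an assumption, and swapping the two returned lists gives the opposite arrangement. The helper name `k_fold_indices` is hypothetical, and the samples are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs; each fold is the validation part exactly once."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample when k does not divide n_samples.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

# Splitting the 50000 labeled users into 5 folds of 10000 users each.
splits = list(k_fold_indices(50000, k=5))
```

A model would be fitted once per split, and the validation losses of the five runs averaged to compare parameter settings.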

3.5 Step 5: Results Evaluation

To evaluate the performance of the built model, we use the log-loss function to measure the accuracy of the prediction result. The log-loss function is presented in formula (4):

$$ {\text{Loss}} = - \frac{1}{N}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{22} {y_{ij} \ln (p_{ij} )} } $$
(4)

where N is the number of users in the forecast set, i indexes the users, j indexes the 22 groups of users divided by gender and age, \( {\text{y}}_{\text{ij}} \) indicates whether user i belongs to group j (1 if so, 0 otherwise), and \( {\text{p}}_{\text{ij}} \) is the probability calculated by GAPA that user i belongs to group j.

The log-loss accumulates the error between the true groups and the predicted probabilities, averaged over all users. The ideal value for the proposed algorithm is 0.
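
Formula (4) can be sketched in plain Python as follows. The helper name `log_loss`, the 0-based class indices and the clipping constant `eps` (a common guard against taking the logarithm of zero) are assumptions for illustration. Because \( y_{ij} \) is 1 only for the true group, only the predicted probability of the true group contributes for each user.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Multi-class log-loss of formula (4).

    y_true: true class index (0-based) for each user.
    probs:  one list of per-class probabilities for each user.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip the probability of the true class to avoid log(0).
        total -= math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)

n_classes = 22
# A model that assigns equal probability to every group.
uniform = [[1.0 / n_classes] * n_classes for _ in range(3)]
baseline = log_loss([0, 5, 21], uniform)

# A model that puts all probability on the correct group.
perfect = log_loss([1], [[0.0, 1.0, 0.0]])
```

A uniform guess over the 22 groups yields ln 22 ≈ 3.09, which gives a reference point for reading loss values, while a perfect prediction yields 0.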

4 Numerical Result

In this paper, the case analysis of machine learning is based on the user’s mobile information collected by the tracing platform. The data sets are processed and analyzed with Sklearn in the Python environment (Fig. 6).

Fig. 6. The distribution of precision for different algorithms

From the above, we can observe that the performance of GAPA improves as features accumulate, because all the basic features are gains for the Sklearn algorithms. The inflection point occurs when we add the ‘Top30–40 apps is installed’ features to the algorithms, because these ten features contain some interfering features that hinder the classification of the forecast data set. The best result in our experiment is a log-loss of 2.67, obtained with the LightGBM algorithm.

5 Conclusions

In this paper, a gender and age prediction algorithm based on big data analytics and machine learning is proposed. The proposed framework can predict the user’s information with a machine learning algorithm, reaching a log-loss of 2.67. The proposed algorithm considers more aspects and features than the algorithms in [4,5,6], and its precision can be improved further. Finally, the analysis of the collected data in the last part demonstrates the performance of GAPA, and the results show that the GAPA scheme can be applied in the area of targeted advertising.