Local regression transfer learning with applications to users’ psychological characteristics prediction

It is important to acquire web users’ psychological characteristics. Recent studies have built computational models for predicting psychological characteristics by supervised learning. However, the generalization of built models might be limited due to the differences in distribution between the training and test dataset. To address this problem, we propose some local regression transfer learning methods. Specifically, k-nearest-neighbour and clustering reweighting methods are developed to estimate the importance of each training instance, and a weighted risk regression model is built for prediction. Adaptive parameter-setting method is also proposed to deal with the situation that the test dataset has no labels. We performed experiments on prediction of users’ personality and depression based on users of different genders or different districts, and the results demonstrated that the methods could improve the generalization capability of learning models.


Introduction
In recent decades, people spend more and more time on Internet, which implies an increasingly important role of Internet in human lives. To improve online user experience, online services should be personalized and tailored to fit consumer preference. Psychological characteristics, including consistent traits (like personality [1]) and changeable status (like depression [2,3]), are considered as key factors in determining personal preference. Therefore, it is critical to understand web user's personal psychological characteristics.
Personal psychological characteristics can be reflected by behaviours. As one type of human behaviour, web behaviour is also associated with individual psychological characteristics [4]. With the help of information technology, web behaviours can be collected and analysed automatically and timely, which motivates us to identify web user's psychological characteristics through web behaviours. Many studies have confirmed that it is possible to build computational models for predicting psychological characteristics based on web behaviours [5,6].
Most studies build computational models by supervised learning, which learns computational models on labelled training dataset and then applies the models on another independent test dataset. Supervised learning assumes that the distribution of the training dataset should be identical to that of test dataset. However, the assumption might not be satisfied in many cases, e.g. demographic variation (e.g. variation of gender and district), which results in the low performance of trained models. Previous studies have paid little attention to this problem. In this paper, we build models based on an innovative approach, which do not need to make the assumption of identical distribution. Transfer learning, or known as covariate shift, is introduced and investigated for this purpose.
Most existing covariate shift methods compute the resampling weight of training dataset and then train a weighted risk model to predict on test dataset. Commonly, these researches use the entire dataset to reweight in the whole procedure. We notice that probability density of data points is similar to each other in their local neighbour region, and this motivates us to use only the local region instead of the whole dataset to improve prediction accuracy and save computation cost. Therefore, we bring in some local learning views to improve covariate shift. In addition, the situation can be encountered that people do not know any labels of the test dataset before they decide to predict them, so it is difficult to learn the parameters of learning model. To cope with this problem, we propose an adaptive parameter-setting method which needs no test dataset label. Besides, we focus on the regression form of local transfer learning since psychological characteristics labels are often used in the form of continual values.
In this paper, based on our previous work [7], we intend to work on more domains of psychological characteristics predictions and propose some new local regression transfer learning methods, including training-test k-NN method and adaptive k-NN methods, which are more effective and can adaptively set the unknown parameter in prediction functions.
The rest of the paper is organized as follows: we present the local regression transfer learning methods in Sect. 2; we then introduce the background of covariate shift and local learning, and propose some local transfer learning methods to reweight the training dataset and build the weighted risk regression model. We perform some experiments of psychological characteristics prediction and analyse the experiment results in Sect. 3. Finally, we conclude the whole work in the last section. It is quite often that the test dataset has a different distribution from the training dataset. We focus on simple covariate shift that only inputs of the training dataset and inputs of the test dataset follow different distributions, i.e. only P tr ðXÞ 6 ¼ P te ðXÞ, while anything else does not change [8]. Then, we will introduce a general solution framework to cope with covariate shift problems. The key point is to compute probability of training data instances within the test dataset population, so that people can use labels of the training dataset to learn a test dataset model. We illustrate the process as [9,10] did.
Firstly, we represent the risk function in this situation and minimize its expected risk: where lðx tr ; y tr ; hÞ is the loss function, which depends on an unknown parameter h, and ðx tr ; y tr Þ $ P te denotes the probability with which ðx tr ; y tr Þ belongs to test dataset population. It is usually difficult to compute the distribution of P te , so people turn to compute the empirical risk form as follows: It is usually assumed that P tr ðyjxÞ ¼ P te ðyjxÞ, i.e. the prediction functions for both datasets are identical. Then, P te ðx tr ;y tr Þ P tr ðx tr ;y tr Þ is replaced by P te ðx tr Þ P tr ðx tr Þ . People usually directly compute the ratio P te ðx tr Þ P tr ðx tr Þ but do not estimate P tr and P te independently, which can avoid generating more errors.
To estimate the ratio P te ðx tr Þ P tr ðx tr Þ , also called the importance, researchers construct many kinds of forms of formula 2. Sugiyama et al. [11] computed the importance by minimizing the Kullback-Leibler divergence between training and test input densities and constructed the prediction model with a series of Gaussian kernel basis functions. Kanamori et al. [12] proposed a method which minimizes squares importance biases represented by Gaussian kernel functions centred at test points. Huang et al. [10] used a kernel mean matching method (KMM) which computed the importance by matching test and training distributions in a reproducing-kernel Hilbert space. Dai et al. [13] and Pardoe et al. [14] proposed a list of boosting-based algorithms for transfer learning.

Local machine learning
Local machine learning has shown a comparative advantage in many machine learning tasks [15][16][17]. In some situations, the size of local region of target data imposes a significant effect on prediction accuracy of model [17]. On the one hand, too many neighbour points can over-estimate the effects of long-distance points which may have little relationship with target point. Thus, this may bring unnecessary interferences to learning process and produce more computation cost. In another way, the predicted data point can be thought to have similar property only to points in its small region but not to all points in a very big region. On the other hand, too less neighbour points may introduce strong noise to local learning.
For covariate shift, density estimation is important. There are many density estimation methods including knearest-neighbour methods, histogram methods and kernel methods, which are localized with only a small proportion of all points which contribute most to the density estimation of a given point [18]. The k-nearest-neighbour approximation method is represented as follows: where k is the number of nearest neighbours, n is the total number of all data and V is the region volume containing all nearest neighbours. If the training and test data are in one volume, ratio between densities of both can be represented as k tr =k te , which do not require to compute nV any more. Moreover, Loog [19] proposed a local classification method which estimated the importance by using the number of test data falling in its neighbour region which consisted of training and test data. All of these inspired us to further study local learning within covariate shift.

Reweighting the importance
A complete covariate shift process is divided into two stages: reweighting importance of training data, and training a weighted machine learning model for prediction on the test dataset. In the first stage, we reweight the importance of training instances by estimating the ratio P te ðx tr Þ=P tr ðx tr Þ.
In this work, we use local learning to improve the performance in covariate shift. The key point is to use the neighbourhood of training points to compute their importance. In fact, this uses the knowledge of density similarity between the training point and its neighbour points. K-nearest-neighbour and clustering methods are used to determine the neighbourhood of training point and reweight the importance. Specifically, we first present k-NN reweighting method, which is simplest and can be seen as an origin form of all our k-NN methods. Training-test K-NN reweighting method is an extension of k-NN reweighting method, and adaptive K-NN reweighting method is an adaptation of training-test K-NN reweighting method to more common situations. Clustering-based reweighting method is another view about using local learning to reweight the importance.

K-NN reweighting method
We firstly introduce k-nearest-neighbour reweighting methods [7], which uses k-nearest test set neighbours of training instance to compute its importance. Gaussian kernel is chosen to compute density distance between training data and test data. Then the importance can be computed as follows: where k represents the number of the nearest test set neighbours of training data x tr , which determines the size of the local region, and c reflects the bandwidth of kernel function and c [ 0. Even though the exponential term in Weigðx tr Þ decreases according to an exponential law, the k value is helpful for obtaining an appropriate neighbour region and then computing the importance. It is easy to know that this k-nearest-neighbour reweighting method can save much computation time when the size of dataset is very large compared with k.

Training-test K-NN reweighting method
When we regard both the training and test neighbours of given training data in a local region, we develop a new knearest-neighbour reweighting method, called training-test k-NN reweighting method, which uses both training data and test data. The training-test k-NN reweighting method tries to use more training data points to balance the effect which is due to that the only training point does not have comparable probability with the other test points in the k-NN reweighting method sometimes, which may reduce the performance of the k-NN method. Simply, k tr =k te can be used as a reweighting formula if the training data and test data in the local region are treated to have similar probability. Further, we put forward the below formula to compute the importance after combining Gaussian kernels.
where the neighbour region divides into two parts: the training data part with a total number of k tr and the test data part with a total number of k te . The total number of data in the neighbour region is k ¼ k tr þ k te . When we determine the k, k tr and k te will be determined automatically. Here, since the training point itself is also defined as its neighbour, the denominator cannot be 0.

Adaptive K-NN reweighting method
For covariate shift methods, how to determine appropriate parameters is an important issue. Cross validation technique is used broadly for the problem. However, cross validation technique needs some labelled test data to be as validation dataset. When the prediction model is used in changed situation where test data are completely not labelled, people cannot apply cross validation. Here, we give an empirical parameter estimation way to modify the training-test k-NN reweighting method. We call it adaptive k-NN reweighting method, which includes how to determine k and how to determine c.
For k, we first assign k % n 3 8 in the way of Enas and Choi [20], where n is the population size. Then we reduce k to be a smaller value n neig when Gaussian kernel function ratio gauðn neig þ 1Þ=gauðn neig Þ is less than a threshold, which makes data in the region have similar probability. gau(i) is defined as expðÀcjjx tar À x ðiÞ jj 2 2 Þ. The reason is that, if a too small value gau(i) of nearest-neighbour point i is summed to compute the density together with other big values, that would bring big bias, and thus the point should be gotten rid of.
As to the parameter c, we set it as an empirical way c ¼ 1 2n neig P n neig i¼1 jjx tr À x ðiÞ jj 2 2 Þ. In fact, this way is somehow like a way of computing an approximated empirical variance of a dataset.

Clustering-based reweighting method
Finally, we introduce clustering-based reweighting methods [7], which are somehow similar to data-adaptive histogram method [18]. This kind of methods use clustering algorithm to generate histograms, whereas it uses training and test instances in one histogram to estimate the importance. In detail, clustering is performed on the whole training and test dataset, and P te ðx tr Þ=P tr ðx tr Þ is estimated through computing the ratio between number of test data and number of training data in one cluster. The idea is simple that training data and test data clustered in one small enough region can be thought to have the equal probability and then the importance can be computed with the ratio. Thus, we obtain the formula of clustering-based reweighting method as follows: where Weigðx Like the histogram method, this method may suffer from high-dimensional difficulty. Number of training data and test data in their cluster affects the probability estimation, and it needs very many data in high-dimensional situation. Clustering method also has a big influence on risk of importance weighting, because common clustering methods are not accurate density-region division methods. Clustering-based reweighting method can be taken as an approximate computation way.

Weighted regression model
When we get the importance of all training data in the previous stage, we train the weighted learning model and predict on the test dataset. The importance of training data is taken as weight of data and is integrated into the following formula: In this work, we integrate multivariate adaptive regression splines (MARS) method with local reweighting methods. MARS is an adaptive stepwise regression method [21], and its weighted learning model has the following form: where h j ðxÞ is a constant denoted by C, or a hinge function with the form maxð0; x À CÞ or maxð0; C À xÞ, or a product of two or more hinge functions. m denotes the total steps to get optimal performance, and f ðx ðiÞ tr Þ and f ðx ðiÞ te Þ denote the prediction values of training data and test data, respectively. This model is trained for solving unknown coefficients b j .

Experiments
Our experiments aim to predict microblog users' psychological characteristics. They include three parts: predicting users' personality across different genders, predicting users' personality across different districts and predicting users' depression across different genders.
In this paper, personality is evaluated by the Big Five personality framework, a wide accepted personality model in psychology. The Big Five personality model describes human personality with five dimensions as follows: agreeableness (A), conscientiousness (C), extraversion (E), neuroticism (N) and openness (O) [22]. Agreeableness refers to a tendency to be compassionate and cooperative. Conscientiousness refers to a tendency to be organized and dependable. Extraversion refers to a tendency to be socialized and talkative. Neuroticism refers to a tendency to experience unpleasant emotions easily. Openness refers to the degree of intellectual curiosity, creativity and a preference for novelty. Besides, CES-T scale [23] is employed to measure web users' depression.
We test the local transfer methods among web users with different genders and in different districts. There exists some relationship between users' web behaviours and their personality/depression. Gender is an important factor that can effect users' behaviours, so we choose it as example to test the local transfer methods. It is often encountered that users of the training set and the test set are in different districts, so we also study the suitability of the local transfer methods in this situation. Depression in male and female shows difference [24], so we also investigate it. In detail, our experiments are to predict male users' personality based on female users, predict non-Guangdong users' personality based on Guangdong users and predict male users' depression degree based on female users.

Experiment setup
In China, Sina Weibo (weibo.com) is one of the most famous microblog service providers and has more than 503 million registered users. In this research, we invited Weibo users to complete online self-report questionnaire, including personality and depression scales, and downloaded their digital records of online behaviours with their consent.
For the prediction of personality, between May and August in 2012, we collected data from 562 participants (male: 215, female: 347; Guangdong: 175, non-Guangdong: 387) and extracted 845 features from their online behavioural data. The extracted features can be divided into five categories: (a) profiles include features like registration time and demographics (e.g. gender); (b) self-expression behaviours include features reflecting the online expression of one's personal image (e.g. screen name, facial picture and self-statement on personal page); (c) privacy settings include features indicating the concern about individual privacy online (e.g. filtering out private messages and comments sent by strangers); (d) interpersonal behaviours include features indicating the outcomes of social interaction between different users (e.g. number of friends whom a user follows, number of followers, categories of friends whom a user follows and categories of forwarded microblogs); and (e) dynamic features can be represented as time series data (e.g. updating microblogs in a certain period or using apps in a certain period).
For the prediction of depression, between May and June in 2013, we collected data from 1000 participants (male: 426, female: 574). Compared with personality experiments, we supplemented additional linguistic features in depression experiments. These linguistic features included the total number of characters, the number of numerals, the number of punctuation marks, the number of personal pronouns, the number of sentiment words, the number of cognitive words, the number of perceptual processing words and so on.
Since all these experiments have very many feature dimensions and high dimension curse would weaken the learning model, we firstly use stepwisefit method in Matlab toolbox to reduce dimensions and select the most relevant features. For the gender-personality experiment, we process the female dataset and obtain 25, 14, 19, 25 and 20 features for predicting Big Five dimensions: A, C, E, N and O, respectively. For the district-personality experiment, the Guangdong dataset is processed and we obtain 19, 21, 18, 22 and 20 features for A, C, E, N and O, respectively. For the depression experiment, the female dataset is processed, and we obtain 20 features.
It also must be emphasized that we test whether the training set and the test set follow the same distribution before we do transfer learning. Both T test and Kolmogorov-Smirnov test are performed in the two-sample test. T test is fit to test dataset with Gaussian distribution, and Kolmogorov-Smirnov test can test dataset with unknown distribution. Specifically, we test the datasets along each dimension.
In the experiments, our local transfer learning methods are compared with non-transfer method, global transfer method and other transfer learning methods. The local transfer learning methods include k-NN transfer learning method, training-test k-NN transfer learning method, adaptive k-NN transfer learning methods and clustering transfer learning methods. The non-transfer method does not use a transfer learning way and is a traditional method. The global transfer method is also a k-NN transfer learning method, but it has a k value equalling the number of all test data, i.e. it takes all test data as neighbours. A famous transfer learning method called KMM [10] is also used here as a baseline method. After reweighting importance, we integrate the importance into weighted risk models. We choose weighted risk model MARS, which is open source regression software for Matlab/Octave from (http://www. cs.rtu.lv/jekabsons/regression.html).
In all tables and figures of this paper, MARS denotes the method with no transfer learning, KMM denotes combination of KMM reweighting method and MARS method in a weighted risk form, GkNN denotes global k-NN reweighting method and MARS, kNN denotes k-NN reweighting method and MARS, TTkNN denotes trainingtest k-NN reweighting method and MARS, and AkNN1 denotes adaptive k-NN reweighting method and MARS, where k value is determined as described in Sect.

Predicting users' personality across genders
This task is to predict male users' personality based on female users' labelled data and male users' unlabelled data. We firstly perform single-dimension T test and Kolmogorov-Smirnov test to test whether male and female datasets are drawn from the same distribution. As a result, 3, 1, 2, 3 and 2 features of all 25, 14, 19, 25 and 20 features are shown to follow different distributions by T test, and 2, 0, 0, 2 and 1 features by Kolmogorov-Smirnov test. All of these test results are with probability more than 95 % confidence. Thus, it can be thought that there exists some distribution divergence between male and female datasets, though the divergence is not big. Then, we examine the performance of all the local transfer learning methods in this experiment.
From Table 1, it can be seen that all regression transfer learning methods improve much on the prediction accuracy compared with non-transfer learning method in all situations. Local kNN reweighting methods beat global k-NN reweighting method GkNN in almost all situations. TTkNN method performs better than the others in 3 of 5 personality dimensions. AkNN1 performs nearly well with other k-NN reweighting methods, except in the dimension of C. Especially, AkNN1 beats GkNN in 4 dimensions, and this shows the advantage of its fixed k value. For AkNN2, it performs better only than MARS method. Clust also shows comparable performance compared with other local transfer learning methods.
To investigate the impact of k value in k-NN reweighting methods, we take experiment on trait A as an example. The results of GkNN, kNN and TTkNN are shown in Fig. 1. We can see that these methods perform the best when the values of k range between 20 and 30. As k approximates to the total size of test dataset, the performances of kNN and TTkNN become equal to GkNN method. For TTkNN method, it performs worse than GkNN when k is 1, and that could be caused by noise. When k of TTkNN method is very small, i.e. close to 0, outlier point can impose a strong influence. When k of TTkNN method is 50, its performance shows an exception and the reason may be that the local region caused by k experiences a shake-up. Thus, the value of k can be recognized as a factor affecting the prediction performance.
We then test how prediction accuracy of clustering transfer methods is affected by the number of clusters in all five personality traits. From Fig. 2, we can see that the number of clusters has a big influence on the prediction accuracy. There is no certain value of cluster number which achieves the best performance for all five traits. The method obtains the optimization result in C, E and O trait when the number of clusters is small. For these three traits,  Fig. 1 The impact of the number of nearest neighbours on the performance of k-NN transfer methods in trait A it could also be seen that their MSE gradually increases as number of clusters increases, and the least k value (here, the value is 1) may not be the optimised value because of noise. Meanwhile, it seems to follow no regular rule for the other two traits. Thus, we can think that there is no constant optimal value for cluster number in clustering transfer methods for all situations. The reasons are speculated that distributions of the datasets are of diversity, and clustering method is not a stable density estimation method here.

Predicting users' personality across districts
In this experiment, we use Weibo data of Guangdong province of China to train the model and predict personality of users in the other districts. Firstly, we still apply stepwisefit method to select 19, 21, 18, 22 and 20 features from a total of 845 features in A, C, E, N and O traits, respectively. We then use T test and get 3, 1, 3, 3 and 2 features following different distributions and use Kolmogorov-Smirnov test and get 3, 5, 6, 9 and 2 features following different distributions, both with probability more than 95% confidence. Finally, we perform our regression transfer methods on different-district datasets and compare all the methods as used in the above differentgender experiment.
We analyse performances of all methods. Table 2 shows that all local transfer learning methods perform better than non-transfer method MARS. GkNN behaves unstably: it performs worse than MARS in 2 of all 5 traits, while it performs best in O trait. kNN performs no worse than GkNN in all five traits. TTkNN is still the best method for most situations and performs stably. AkNN1 performs much better than MARS, but much worse in O trait than other local transfer learning methods except AkNN2. AkNN2 behaves only a little better than MARS in four traits and weaker in one trait. Clust also beats MARS method in all situations but behaves not so well in O trait.

Predicting users' depression across genders
This experiment is to predict male users' depression level based on female users' labelled data. Still, stepwisefit method is performed and 20 features are selected. 3 feature dimensions in T test and 5 feature dimensions in Kolmogorov-Smirnov test are thought as different-distribution feature. This suggests that training and test data also follow different distributions in this experiment.
In Table 3, the result shows that the transfer learning methods perform much better than non-transfer method MARS. KMM and Clust behave a little better than other transfer methods. AkNN1 and AkNN2 perform nearly equally well to other transfer learning methods.

Discussion and conclusion
It can be concluded from the above experiments that all our local transfer learning methods work better than nontransfer learning method, because they reduce the prediction bias of model which is trained and tested on differentdistribution datasets. Our local k-NN family transfer learning methods perform better than the global k-NN transfer learning method generally, and the reason may be that an appropriate k value in k-NN methods could reflect more subtle nature in density estimation. All our local transfer learning methods show comparable performance with KMM method in all situations. TTkNN method exceeds kNN and obtains the best performance among all the methods in half of situations. It could be guessed that TTkNN uses both test and training data information, while kNN only uses test data. Clust method performs well in  Local regression transfer learning with applications to users' psychological characteristics 151 most situations, this proves its applicability, and better density clustering methods may further enhance this method. Finally, we compare the performance of GkNN, AkNN1 and AkNN2; AkNN1 is the best, GkNN is the second and AkNN2 is the worst of them. AkNN1 performs better than GkNN in most situations, and this demonstrates that determining k in an AkNN1 way, same as AkNN2, can work well generally. We also note that AkNN1 and AkNN2 behave not well in O trait in Table 2, and it indicates that k in AkNN1 and AkNN2 is not an optimal choice in some situation because of the change of distribution of data set. It is pointed that AkNN2 is inferior to AkNN1 and GkNN, because it does not choose the optimal value for parameter c in prediction function preliminary. Since no parameter in AkNN2 needs to be set artificially, it could work in the situations where we completely have no idea about labels of predicted data, which can be of much significance.

Conclusions
In this paper, we propose some local regression transfer learning methods and apply them to predict users' psychological characteristics when the training set and the test set follow different distributions. We present k-NN reweighting methods and clustering reweighting method to estimate the importance of training set in covariate shift process. Specifically, these methods utilize training and test data in certain local neighbour region for importance estimations. We still apply them to psychological characteristics predictions including microblog users' personality prediction across different genders and different districts, and microblog users' depression prediction across different genders. The experiments demonstrate that these methods improve the accuracy of prediction models. Specially, the complete adaptive k-NN reweighting method is able to make prediction even without knowing any label of test data.