Big5 Tool for Tracking Personality Traits

  • Binh Thanh Nguyen
  • Dang Ngoc Dung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11431)


In the big data era, understanding consumers through digital data is as important as approaching and engaging customers through their behavioral and personality traits in the digital world. First, a data warehouse has been studied and developed to extract, transform and load big mobile log data. Afterwards, the data warehouse’s data cubes are aggregated and used to calculate a set of Big5 indicators. Big5 traits can then be predicted based on those indicators. As a proof of concept, implementation results are presented in the context of the Big5 tool, which has been designed and developed to predict Big5 personalities in a representative manner.


Big5 traits Personality Indicators Data management Mobile logs Machine learning Naive Bayes classification 

1 Introduction

As the number of mobile phone users in the world is expected to pass the five billion mark by 2019 [25] and carriers have increasingly made phone logs available to researchers [3], data patterns retrieved from mobile phone user logs can open the door to exciting avenues for future research in the social sciences [6]. On the other hand, the five-factor model (Big5) of personality, which comprises extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience [11, 12, 20, 23], can be used to explore the relationship between personality and various behaviors [23]. In [14], we indicated that understanding clients through these Big5 factors is a method commonly used by business people, e.g. service providers and marketing agents, who need information about their customers when investigating, planning policies, and developing campaigns.

In this context, determining the personality of a mobile phone user simply through standard carrier logs has become a topic of tremendous interest [3]. Based on the predicted results, mobile phone datasets could thus provide a valuable, unobtrusive and cost-effective alternative to survey-based measures of personality [3]. These data permit fine-grained, continuous collection of people’s social interactions (e.g., speaking rates in conversation, size of social groups, calls, and text messages), daily activities (e.g., physical activity and sleep), and mobility patterns (e.g., frequency and duration of time spent at various locations) [7].

In this paper, we first describe our methodology for processing, storing and retrieving the Call Detail Records (CDR) datasets provided by Orange Senegal [4]. These big data sets are the basis for our analysis of calling and texting behaviors and spatial interactions to predict Big5 traits. To this end, a data warehouse, namely Big5DW, has been studied and developed to store big mobile log data in a multi-dimensional data structure [7, 8, 15]. The data in its data cubes can then be used to calculate a set of Big5 indicators. Furthermore, the data sets of the Bandicoot tool [3] also provide many meaningful indicators for our study. In this context, machine learning algorithms, in particular Naive Bayes classification [22, 24], are studied and applied to predict whether a phone user scores low, average, or high on each Big5 trait based on those indicators. As a proof of concept, the Big5 tool has been developed to present the implementation results of our research in a representative manner.

The rest of this paper is organized as follows: Sect. 2 introduces some approaches and projects related to our work. Sect. 3 introduces the Big5 conceptual model, i.e. Big5 data management, Big5DW, indicators, and how to predict Big5 traits based on Naive Bayes classification. Sect. 4 presents our implementation results with their main use cases. Lastly, Sect. 5 summarizes what has been achieved and outlines future work.

2 Related Work

This work has been motivated by the requirements of analyzing users’ mobile log data, which are basic resources with more precise and up-to-date information provided by Orange Senegal. According to [27], determining the personality of mobile phone users, besides being important solely from the psychological point of view, can also provide an interesting application framework for wearable computing. Furthermore, patterns retrieved from mobile phone logs contain many features, which are very useful for researchers, marketing agents as well as phone companies [2].

In this context, in [9], various features extracted from Facebook data are explored to predict personality. In [10], the authors classified authors’ personality from weblog texts by using n-grams as the features and the Naïve Bayes algorithm as the classification algorithm. They performed experiments on the authors with the highest and lowest scores and reported how to automatically select features that yield the best performance. In [22], personality is believed to be an important factor in determining individual variation in thoughts, emotions and behavior patterns.

Also, [22] predicted conscientiousness by exploiting nuances in the usage of verbs, in other words, by measuring the specificity and objectivity of verbs taken from WordNet and SentiWordNet. In [1], personality is predicted by different classification methods such as Support Vector Machines and Bayesian Logistic Regression. However, most research used social texts as input sources. On the other hand, the ability to draw connections between behavioral aspects derived through contextual data collected by mobile phones, as well as personality, could lead to designing and applying machine learning methods to classify users into personality types [2, 5].

In our current research [2, 3, 15], we focus on the following key features: developing a data warehouse [9, 16, 17, 18, 19] to manage big mobile log data; specifying new Big5 indicators, which can be retrieved from the data warehouse’s data cubes; and implementing a tool to predict personalities.

3 Big5 Conceptual Model

The objective of our research is to build a framework application and provide a quantitative assessment methodology for using mobile phone log data to retrieve the distribution of Big5 traits in developing countries in general, and Senegal in particular. Based on those data sets, a data warehouse, namely Big5DW, has been designed. Afterwards, multi-dimensional data cubes can be defined and used to store aggregated data. In this context, Big5 indicators can be calculated from those data cubes. Furthermore, there are many useful indicators generated by the Bandicoot tool [3] and inherited by our study. Based on those indicators, Naive Bayes classification [22, 24] is studied and applied to predict whether a phone user scores low, average, or high on each Big5 trait. Afterwards, the implementation results are presented in Sect. 4 as a proof of concept.

3.1 Big 5 Data Management

As mentioned above, mobile phone usage logs could provide key insights into the personality traits of mobile phone users in Senegal through interesting data patterns retrieved by data science approaches. In this section, the Orange mobile data will be introduced briefly. Afterwards, our data warehouse approach will be used to manage such big data files. Furthermore, the data cube concepts [7, 8, 15] will be applied to aggregate data according to the dimensional hierarchies specified in the next section.

3.1.1 Orange Mobile Data

The anonymized mobile Call Detail Records (CDR) were collected in Senegal between January 1, 2013 and December 31, 2013 and provided by Orange in the context of the Data for Climate Action challenge [3]. In this study we analyzed the mobile data acquired from the 1666 towers distributed across Senegal. According to [3], those datasets are based on phone calls and SMS exchanged between 9 million users during the year 2013. In the context of this challenge, those data were properly anonymized before being handed over to researchers [4]. Fig. 1 shows a subset of SMS antenna to antenna mobile log data. According to [3], the data are organized into three sets:
Fig. 1.

A subset of SMS antenna to antenna log data

  • Set 1 contains the hourly voice and text traffic between Outgoing_site and Incoming_site. The information includes total call duration, number of calls and total number of text messages.

  • Set 2 contains the site-based fine-grained mobility data of about 300,000 randomly sampled and anonymized users for each two-week period. This data set also includes bandicoot user-based behavioral indicators. Figure 1 shows a subset of indicators provided in this data set from the Bandicoot tool.

  • Set 3 contains one year (2013) coarse-grained trajectories of 123 arrondissements. There are also bandicoot behavioral indicators at individual level for about 150,000 randomly sampled users as shown in Fig. 2.
    Fig. 2.

    An example list of indicators calculated by Bandicoot tool.

3.1.2 Data Preprocessing

First, the Orange data CSV files are imported into the fact tables of our Big5DW data warehouse, the concepts of which are introduced in the next section. Afterwards, the data in those fact tables are aggregated into multi-dimensional data cubes as shown in Fig. 3.
Fig. 3.

Pre-processing data
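
The two pre-processing steps can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the data warehouse, and the CSV column names, the timestamp format and the sample rows are assumptions made for illustration only; they are not taken from the Orange files.

```python
import csv
import io
import sqlite3

# Hypothetical sample in the spirit of Set 1 (hourly antenna-to-antenna voice
# traffic); the column names and values are assumptions, not the real file layout.
SAMPLE_CSV = """timestamp,outgoing_site_id,incoming_site_id,number_of_calls,total_call_duration
2013-01-01 00:00:00,1,2,3,120
2013-01-01 13:00:00,1,2,5,300
2013-01-02 09:00:00,1,3,2,60
"""


def load_ftcalls(conn, csv_text):
    """Extract rows from CSV text and load them into the FTCalls fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS FTCalls ("
        "timestamp TEXT, outgoing_site_id INTEGER, incoming_site_id INTEGER, "
        "number_of_calls INTEGER, total_call_duration INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["timestamp"], int(r["outgoing_site_id"]), int(r["incoming_site_id"]),
         int(r["number_of_calls"]), int(r["total_call_duration"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO FTCalls VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def call_by_day(conn):
    """Aggregate FTCalls to the Day level for each outgoing site."""
    cur = conn.execute(
        "SELECT substr(timestamp, 1, 10), outgoing_site_id, "
        "SUM(number_of_calls), SUM(total_call_duration) "
        "FROM FTCalls "
        "GROUP BY substr(timestamp, 1, 10), outgoing_site_id "
        "ORDER BY 1, 2"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse database
loaded = load_ftcalls(conn, SAMPLE_CSV)
cube = call_by_day(conn)
```

Here `cube` holds one row per day and outgoing site, which corresponds to a by-day data cube built from the fact table.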

3.2 Big5DW Data Warehouse

First, the Big5 data warehouse (Big5DW) is formalized on the basis of a mathematical model. The purpose of this conceptual model is to provide an extension of the standard data warehouse architecture used in our previous studies [7, 8, 13, 14]. As a result, the Big5DW can be defined as follows:
$$ Big5DW = \langle Big5Dims, Big5Facts, Big5FTs, Big5Gbys \rangle $$
  • Big5Dims = {Site, Time} is a set of dimensions.

  • Big5Facts = {number_of_calls, Total_call_duration, number_of_sms} is a set of log variables.

  • Big5FTs = {FTCalls, FTSMSs} is a set of fact tables.

  • Big5Gbys is a set of data cubes grouped by the hierarchical levels of Time and Site dimensions, e.g. CallbyNighttime, CallbyDay, CallbyMonth, CallbyYear, SMSbyNightTime, SMSbyDay, SMSbyMonth, SMSbyYear, etc.
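
As a reading aid, the four components of this definition can be written down as a small immutable record. This is only a sketch: the class name, the field names and the use of plain strings for the members are assumptions made for illustration, and the cube list is limited to the examples named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Big5DW:
    """Hypothetical container mirroring the four components of the definition."""
    dims: frozenset         # Big5Dims
    facts: frozenset        # Big5Facts
    fact_tables: frozenset  # Big5FTs
    group_bys: frozenset    # Big5Gbys


BIG5DW = Big5DW(
    dims=frozenset({"Site", "Time"}),
    facts=frozenset({"number_of_calls", "Total_call_duration", "number_of_sms"}),
    fact_tables=frozenset({"FTCalls", "FTSMSs"}),
    group_bys=frozenset({
        "CallbyNighttime", "CallbyDay", "CallbyMonth", "CallbyYear",
        "SMSbyNightTime", "SMSbyDay", "SMSbyMonth", "SMSbyYear",
    }),
)
```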

In the next sections, we introduce the main components of the Big5 multidimensional data model, i.e. dimensions and facts (variables), and their related elements.

3.2.1 Big5Dims

In the Big5DW data warehouse, Site and Time are the main dimensions. Furthermore, the Site dimension is specialized into the Outgoing_site and Incoming_site dimensions, where:
  • Outgoing_site_id: id of site the call/text originated from.

  • Incoming_site_id: id of site receiving the call/text.

According to the SITE_ARR_LON_LAT and SENEGAL_ARR_V2 tables [3], the hierarchical levels of the Site dimension can be defined as follows:

$$ \{ Senegal \rightarrow Regions \rightarrow arrondissement\_id \rightarrow Site\_ID \} $$

Based on the timestamp format [3], the Time dimension can be specified as:
$$ \begin{aligned} \{ Year \rightarrow Month \rightarrow Week \rightarrow Day & \rightarrow \\ & DayNight \rightarrow TimeStamp\} \\ \end{aligned} $$
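
A minimal sketch of how a single timestamp could be mapped onto the levels of this hierarchy is given below. The `YYYY-MM-DD HH:MM:SS` format, the ISO week numbering and the 19:00 to 07:00 night window are assumptions, since the boundaries are not specified here.

```python
from datetime import datetime

# Assumption: "night" covers 19:00-06:59; the actual boundary is not stated.
NIGHT_START_HOUR = 19
NIGHT_END_HOUR = 7


def time_levels(timestamp):
    """Map a 'YYYY-MM-DD HH:MM:SS' string to the levels of the Time dimension."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    is_night = ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR
    return {
        "Year": ts.year,
        "Month": ts.month,
        "Week": ts.isocalendar()[1],  # ISO week number (an assumption)
        "Day": ts.date().isoformat(),
        "DayNight": "night" if is_night else "day",
        "TimeStamp": timestamp,
    }


levels = time_levels("2013-01-01 22:30:00")
```

Each key of the returned dictionary can serve as a grouping level when a data cube is aggregated along the Time dimension.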

3.2.2 Big5 Fact Tables and Related Elements

Based on the antenna-to-antenna traffic data provided in Set 1 [3], two main fact tables, namely FTCalls and FTSMSs, are defined with Outgoing_site_id, Incoming_site_id and timestamp as dimensions and number_of_calls, Total_call_duration and number_of_sms as facts. Figure 4 shows an example of the FTCalls fact table. The Big5Gbys set can then be specified from FTCalls and FTSMSs as aggregations over the hierarchical levels of those dimensions.
Fig. 4.

An example of fact table FTCalls
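
The idea that each data cube is an aggregation over one level of the Site hierarchy and one level of the Time hierarchy can be sketched as a generic roll-up. The fact rows, the site-to-arrondissement mapping and all function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical FTCalls rows: (outgoing_site_id, incoming_site_id, timestamp,
# number_of_calls, total_call_duration).
FT_CALLS = [
    (1, 2, "2013-01-01 00:00:00", 3, 120),
    (1, 3, "2013-01-01 13:00:00", 5, 300),
    (4, 2, "2013-02-10 09:00:00", 2, 60),
]

# Hypothetical Site_ID -> arrondissement_id mapping (one step up the hierarchy).
SITE_TO_ARR = {1: 10, 2: 10, 3: 20, 4: 20}


def by_arrondissement(site_id):
    """Site level: replace a site id by its arrondissement id."""
    return SITE_TO_ARR[site_id]


def by_month(timestamp):
    """Time level: keep only the 'YYYY-MM' prefix of the timestamp."""
    return timestamp[:7]


def roll_up(rows, site_level, time_level):
    """Sum the call facts for each (site level, time level) combination."""
    cube = defaultdict(lambda: [0, 0])
    for out_site, _in_site, timestamp, n_calls, duration in rows:
        key = (site_level(out_site), time_level(timestamp))
        cube[key][0] += n_calls
        cube[key][1] += duration
    return {key: tuple(totals) for key, totals in cube.items()}


call_by_arr_month = roll_up(FT_CALLS, by_arrondissement, by_month)
```

Passing different level functions yields different members of the cube set, so one roll-up routine can cover the whole family of by-day, by-month and by-year cubes.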

3.3 Big5 Indicators

In [14], we denote a set of Big5 personality traits, which can be classified into agreeableness (A), conscientiousness (C), extraversion (E), neuroticism (N) and openness to experience (O) as follows:
$$ CL = \left\{ {A,C,E,N,O} \right\} \qquad (1) $$
The Big5 indicators have been calculated based on our multidimensional data cubes Big5Gbys as shown in Fig. 5:
Fig. 5.

Example list of indicators calculated by user from Orange data sets.

3.4 Predict Big5

In [14], we presented how Big5 personality can be predicted by applying Naive Bayes classification [21, 23]. This approach can be summarised as follows:

Given a class variable \( cl \in CL \) defined in formula (1) and a dependent feature vector \( x = (x_{1}, \ldots, x_{n}) \), Bayes' theorem gives the following relationship:
$$ P\left( {cl \mid x_{1}, \ldots, x_{n}} \right) = \frac{{P\left( {cl} \right)P\left( {x_{1}, \ldots, x_{n} \mid cl} \right)}}{{P\left( {x_{1}, \ldots, x_{n}} \right)}} $$
Then, assuming a uniform prior over the five traits [27], i.e. \( P\left( {cl} \right) = \frac{1}{5} \), and conditional independence of the indicators given the trait (the naive Bayes assumption), this becomes:
$$ P\left( {cl \mid x_{1}, \ldots, x_{n}} \right) = \frac{{\frac{1}{5}\prod\nolimits_{i = 1}^{n} P\left( {x_{i} \mid cl} \right)}}{{P\left( {x_{1}, \ldots, x_{n}} \right)}} $$

In this context, indicators \( x_{i} \) are mapped into low (l) and high (h) degrees [26]. Then, the Multinomial Naive Bayes method [23] is applied to calculate \( P\left( {x_{i} \mid cl} \right) \) in terms of the low or high degree in each personality dimension.
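
The posterior computation can be illustrated with a short sketch. It is a categorical naive Bayes over low/high indicator values with a uniform prior of 1/5; the add-one (Laplace) smoothing, the function names and the toy training samples are assumptions made for illustration and do not reproduce the estimator or the data of [14].

```python
from collections import defaultdict

TRAITS = ("A", "C", "E", "N", "O")
LEVELS = ("l", "h")


def fit(samples, alpha=1.0):
    """Estimate P(x_i = level | cl) with add-alpha smoothing (an assumption)."""
    n_features = len(samples[0][0])
    counts = {cl: [defaultdict(int) for _ in range(n_features)] for cl in TRAITS}
    totals = {cl: 0 for cl in TRAITS}
    for features, cl in samples:
        totals[cl] += 1
        for i, level in enumerate(features):
            counts[cl][i][level] += 1
    return {
        cl: [
            {lv: (counts[cl][i][lv] + alpha) / (totals[cl] + alpha * len(LEVELS))
             for lv in LEVELS}
            for i in range(n_features)
        ]
        for cl in TRAITS
    }


def posterior(model, features):
    """Return P(cl | x_1..x_n) under the uniform prior P(cl) = 1/5."""
    scores = {}
    for cl in TRAITS:
        score = 1.0 / len(TRAITS)
        for i, level in enumerate(features):
            score *= model[cl][i][level]
        scores[cl] = score
    evidence = sum(scores.values())  # P(x_1..x_n), the normalising constant
    return {cl: s / evidence for cl, s in scores.items()}


# Toy training set: (indicator levels, trait label); purely hypothetical.
SAMPLES = [
    (("h", "h"), "E"),
    (("h", "l"), "E"),
    (("l", "l"), "N"),
    (("l", "h"), "C"),
]
model = fit(SAMPLES)
post = posterior(model, ("h", "h"))
best = max(post, key=post.get)
```

The denominator is obtained by summing the numerators over the five traits, so the returned values form a probability distribution over \( CL \).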

4 Big5 Tool

The Big5 tool has been developed to enable users to explore personality traits. First, mobile phone log data provided by Orange Senegal [3] are extracted, transformed, and then loaded into the PostgreSQL data warehouse as illustrated in Fig. 6. Afterwards, Big5 indicators are calculated and stored in MongoDB. Those indicators are used to predict Big5 traits as presented in [14] and summarised in the previous section. The following paragraphs present the Big5 actors, their main use cases, and some typical examples of the Big5 tool.
Fig. 6.

Big5 system architecture

4.1 Big5 Dashboard Use Cases

Big5 dashboard use cases have been designed for two main actors, i.e. Mobile Phone Provider, and Marketing Agents as shown in Fig. 7.
Fig. 7.

Big5 tool use case diagram

First, the main actors can be generalized into an abstract actor, namely Big5 Tool User. This actor has three use cases, i.e. UC1: predict Big5 Traits, UC2: view Big5 Charts, and UC3: view Big5 Map.

Moreover, UC1: predict Big5 Traits is implemented through a sequence of back-end use cases. In other words, the calculate Big5 Indicator use case uses data cubes from the Big5DW, which is designed and developed by the specify Big5DW use case. In turn, the big mobile log data provided by Orange are extracted, transformed and loaded into the Big5DW by means of the etl Big5Data use case.

4.2 Big5 Tool Implementing Results

The Big5 tool is built with the Angular framework, Node.js, and the MongoDB and PostgreSQL databases. The top of Fig. 8 shows the results of UC2 and UC3, which use the output of UC1. In this context, the degree levels of the Big5 traits of a user can be predicted by using UC1 and displayed in chart format by using UC2: view Big5 Chart. Furthermore, using the indicators calculated by users (user_ids) and arrondissements (arr_ids), a Big5 prediction map of Senegal has been specified by UC1 and generated in the context of UC3: view Big5 Map.
Fig. 8.

Big5 tool

5 Conclusion

This paper introduced the concepts of Big5 and its indicators, which can be considered the underlying background for predicting a mobile phone user’s personality. First, Orange Senegal mobile phone log data are preprocessed and loaded into our data warehouse in terms of the callbyday and smsbyday fact tables. Afterwards, multi-dimensional data cubes can be specified as aggregations of the two fact tables. Based on the multi-dimensional data model, a set of indicators has been calculated and used for predicting Big5 traits with the Naive Bayes classification method. In this context, the Big5 tool has been designed in UML and developed as a proof of concept.

In future work, our approach could be extended to support Big5 prediction in related application domains, e.g. with data from other mobile phone providers or from other sources. Furthermore, we will focus on extending the Big5 tool with new features that make use of our concepts.



Thanks to Orange Sonatel Senegal and the D4D team for providing the mobile phone data. Support from the Duy Tan University, Vietnam is acknowledged.


  1. Alam, F., Stepanov, E.A., Riccardi, G.: Personality traits recognition on social network-facebook. In: Proceedings of Workshop on Computational Personality Recognition, pp. 6–9. AAAI Press, Menlo Park (2013)
  2.
  3. de Montjoye, Y.A., Quoidbach, J., Robic, F., Pentland, A.: Predicting personality using novel mobile phone-based metrics. In: Greenberg, A.M., Kennedy, W.G., Bos, N.D. (eds.) Social Computing, Behavioral-Cultural Modeling and Prediction, SBP 2013. Lecture Notes in Computer Science, vol. 7812, pp. 48–55. Springer, Heidelberg (2013)
  4. de Montjoye, Y.-A., Smoreda, Z., Trinquart, R., Ziemlicki, C., Blondel, V.: D4D-Senegal: The Second Mobile Phone Data for Development Challenge (2014)
  5. de Oliveira, R., et al.: Towards a psychographic user model from mobile phone usage. In: Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems. ACM (2011)
  6. Chittaranjan, G., Blom, J., Gatica-Perez, D.: Who’s who with big-five: analyzing and classifying personality traits with smartphones. In: Proceedings of the 2011 15th Annual International Symposium on Wearable Computers (ISWC 2011). IEEE Computer Society, Washington, pp. 29–36 (2011)
  7. Harari, G.M., Lane, N.D., Wang, R., Crosier, B.S., Campbell, A.T., Gosling, S.D.: Using smartphones to collect behavioral data in psychological science: opportunities, practical considerations, and challenges. Perspect. Psychol. Sci.: J. Assoc. Psychol. Sci. 11(6), 838–854 (2016)
  8. Hoang, D.T.A., Ngo, N.S., Nguyen, B.T.: Collective cubing platform towards definition and analysis of warehouse cubes. In: Nguyen, N.-T., Hoang, K., Jȩdrzejowicz, P. (eds.) ICCCI 2012. LNCS (LNAI), vol. 7654, pp. 11–20. Springer, Heidelberg (2012)
  9. Hoang, A.D.T., Nguyen, T.B.: An integrated use of CWM and ontological modeling approaches towards ETL Processes. In: ICEBE 2008, pp. 715–720 (2008)
  10. Hoang, A.D.T., Nguyen, T.B.: State of the art and emerging rule-driven perspectives towards service-based business process interoperability. In: RIVF 2009, pp. 1–4 (2009)
  11. Oberlander, J., Nowson, S.: Whose thumb is it anyway? Classifying author personality from weblog text. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions (COLING-ACL 2006), pp. 627–634. Association for Computational Linguistics, Stroudsburg (2006)
  12. McCrae, R.R., John, O.P.: An introduction to the five-factor model and its applications. J. Pers. 60(2), 175–215 (1992)
  13. Mount, M., Ilies, R., Johnson, E.: Relationship of personality traits and counterproductive work behaviors: the mediating effects of job satisfaction. Pers. Psychol. 59, 591–622 (2006)
  14. Nguyen, T.B., Dang, N.D., Nguyen, T.T.H., Ha, T.T., Phan, T.H.L., Truong, D.H.: Tracking Big5 traits based on mobile user log data. In: The 7th International Conference on Frontiers of Intelligent Computing: Theory And Application (FICTA 2018). Advances in Intelligent Systems and Computing (2018)
  15. Nguyen, T.B., Ngo, N.S.: Semantic cubing platform enabling interoperability analysis among cloud-based linked data cubes. In: Proceedings of the 8th International Conference on Research and Practical Issues of Enterprise Information Systems, CONFENIS 2014. ACM International Conference Proceedings Series (2014)
  16. Nguyen, T.B., Tjoa, A.M., Wagner, R.: Conceptual multidimensional data model based on metacube. In: Yakhno, T. (ed.) ADVIS 2000. LNCS, vol. 1909, pp. 24–33. Springer, Heidelberg (2000)
  17. Nguyen, T.B., Wagner, F.: Collective intelligent toolbox based on linked model framework. J. Intell. Fuzzy Syst. 27(2), 601–609 (2014)
  18. Nguyen, T.B., Wagner, F., Schoepp, W.: Federated data warehousing application framework and platform-as-a-services to model virtual data marts in the clouds. Int. J. Intell. Inf. Database Syst. 8(3), 280 (2014). ISSN 1751-5858, 1751-5866
  19. Nguyen, T.B., Wagner, F., Schoepp, W.: EC4MACS – an integrated assessment toolbox of well-established modeling tools to explore the synergies and interactions between climate change, air quality and other policy objectives. In: Auweter, A., Kranzlmüller, D., Tahamtan, A., Tjoa, A.M. (eds.) ICT-GLOW 2012. LNCS, vol. 7453, pp. 94–108. Springer, Heidelberg (2012)
  20. Nguyen, T.B., Wagner, F., Schoepp, W.: GAINS-BI: business intelligent approach for greenhouse gas and air pollution interactions and synergies information system. In: Proceedings of the International Organization for Information Integration and Web-Based Application and Services IIWAS 2008, Linz (2008)
  21. Peng, K.-H., Liou, L.-H., Chang, C.-S., Lee, D.-S.: Predicting personality traits of Chinese users based on Facebook wall posts, pp. 9–14 (2015)
  22. Tomlinson, M.T., Hinote, D., Bracewell, D.B.: Predicting conscientiousness through semantic analysis of facebook posts. In: Proceedings of Workshop on Computational Personality Recognition. AAAI Press, Menlo Park (2013)
  23. Zhang, W., Gao, F.: An improvement to Naive Bayes for text classification. Proc. Eng. 15, 2160–2164 (2011). ISSN 1877-7058
  24.
  25.
  26.
  27.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Duy Tan University, Danang, Vietnam
