Inferring Twitters’ Socio-demographics to Correct Sampling Bias of Social Media Data for Augmenting Travel Behavior Analysis


Many studies have demonstrated that social media data, especially Twitter data, have significant potential for developing models that estimate travel demand, manage operations, and support long-term planning. However, research with social media data is well known to face a looming challenge: sampling bias. The population of Twitter users differs substantially from the overall population, so social media data used directly for travel behavior analysis carry biases and errors. The objective of this study is to correct the sampling bias of Twitter data for travel behavior analysis by inferring Twitter users’ socio-demographics. The study first links travelers’ Twitter accounts with their Facebook accounts and verifies their socio-demographics against Facebook data, assuming that one’s Facebook information is accurate. Second, several models are proposed for predicting socio-demographics, including gender, age, ethnicity, and education level. The paper then resamples the social media data and compares them with the 2009 California Household Travel Survey data. The resampled data show characteristics comparable to the survey data. This research sheds light on tackling sampling bias when social media data are incorporated to augment travel behavior analysis and urban planning.
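
The abstract does not spell out the resampling procedure. As a minimal sketch only, and assuming a simple post-stratification scheme that may differ from the authors' method, the idea of resampling users so that an inferred attribute matches a reference population could look as follows. The function names, the toy user records, and the target shares are all hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return weight per group = population share / sample share.

    Assumes every group observed in the sample has a known population share.
    """
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


def resample(users, group_of, population_shares, k, seed=0):
    """Draw k users with replacement, weighting each by its group's weight."""
    groups = [group_of(u) for u in users]
    weights = poststratification_weights(groups, population_shares)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(users, weights=[weights[g] for g in groups], k=k)


# Hypothetical toy sample: 80% of users inferred as "18-34", 20% as "35+".
users = [{"id": i, "age": "18-34" if i < 80 else "35+"} for i in range(100)]

# Made-up reference shares, standing in for a household travel survey.
target = {"18-34": 0.4, "35+": 0.6}

resampled = resample(users, lambda u: u["age"], target, k=10_000)
share_young = sum(u["age"] == "18-34" for u in resampled) / len(resampled)
print(f"share of 18-34 after resampling: {share_young:.3f}")
```

In this toy setup the over-represented younger group is down-weighted and the under-represented older group is up-weighted, so the resampled share of each group should land close to the target shares. A real application would stratify jointly on several inferred attributes (for example gender, age, ethnicity, and education) and would have to account for errors in the inferred labels.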

Availability of data, material and code

Some or all data, models, or code generated or used during the study are proprietary or confidential in nature and may only be provided with restrictions (e.g., anonymized data). We can share (1) Twitter data, (2) Facebook data, and (3) code. Due to privacy concerns, the California Household Travel Survey (CHTS) data are confidential; access can be requested from the National Renewable Energy Laboratory (NREL). Our data and code are available on GitHub (


  • Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: inferring latent attributes of twitter users from neighbors. ICWSM 270:2012

  • Ardehaly EM, Culotta A (2014) Using county demographics to infer attributes of twitter users. In: Proceedings of the joint workshop on social dynamics and personal attributes in social media, pp 7–16

  • Barbieri F (2008) Patterns of age-based linguistic variation in American English. J Sociolinguist 12:58–88

  • Burger JD, Henderson J, Kim G, Zarrella G (2011) Discriminating gender on Twitter. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 1301–1309

  • Chang J, Rosenn I, Backstrom L, Marlow C (2010) ePluribus: Ethnicity on Social Networks. ICWSM 10:18–25

  • Conover M, Ratkiewicz J, Francisco MR, Gonçalves B, Menczer F, Flammini A (2011a) Political polarization on twitter. ICWSM 133:89–96

  • Conover MD, Gonçalves B, Ratkiewicz J, Flammini A, Menczer F (2011b) Predicting the political alignment of Twitter users. In: 2011 IEEE third international conference on privacy, security, risk and trust (PASSAT) and 2011 IEEE third international conference on social computing (SocialCom). IEEE, pp 192–199

  • Cui Y (2016) Behavior-based traveller classification using high-resolution connected vehicles trajectories and land use data. University at Buffalo, SUNY

  • Cui Y (2019) Repository for inferring Twitter's socio-demographics to correct sampling bias of social media data for augmenting travel behavior analysis. Accessed 26 Aug 2019

  • Cui Y, Meng C, He Q, Gao J (2018) Forecasting current and next trip purpose with social media data and Google Places. Transport Res Part C Emerg Technol 97:159–174

  • Cui Y, He Q, Khani A (2018) Travel behavior classification: an approach with social network and deep learning. Transport Res Rec 2672(47):68–80

  • Culotta A, Kumar N, Cutler J (2015) Predicting the demographics of twitter users from website traffic data. In: Proceedings of the AAAI conference on artificial intelligence, vol 29, no 1

  • Daisy NS, Hafezi MH, Liu L, Millward H (2018) Understanding and modeling the activity-travel behavior of university commuters at a large Canadian university. J Urban Plan Dev 144:04018006

  • Facebook (2018) Facebook publishes enforcement numbers for the first time. Facebook. Accessed 15 May 2018

  • Fink C, Kopecky J, Morawski M (2012) Inferring gender from the content of tweets: a region specific example. In: ICWSM, 2012

  • Goel S, Hofman JM, Sirer MI (2012) Who does what on the web: a large-scale study of browsing behavior. In: ICWSM, 2012

  • Gonzalez MC, Hidalgo CA, Barabasi A-L (2008) Understanding individual human mobility patterns. Nature 453:779

  • Goswami S, Sarkar S, Rustagi M (2009) Stylometric analysis of bloggers’ age and gender. In: Third International AAAI Conference on weblogs and social media, 2009

  • KickFactory (2016) The average twitter user now has 707 followers. Accessed 23 June 2016

  • Lee JH, Davis AW, Yoon SY, Goulias KG (2016) Activity space estimation with longitudinal observations of social media data. Transportation 43:955–977

  • Lin L, Ni M, He Q, Gao J, Sadek AW (2015) Modeling the impacts of inclement weather on freeway traffic speed: exploratory study with social media data. Transport Res Rec J Transport Res Board 2482(1):82–89

  • Liu W, Ruths D (2013) What's in a name? Using first names as features for gender inference in twitter. In: AAAI spring symposium: analyzing microtext, 2013. vol 1. pp 10–16

  • Maghrebi M, Abbasi A, Waller ST (2016) Transportation application of social media: Travel mode extraction. In: 2016 IEEE 19th International Conference on intelligent transportation systems (ITSC), 2016. IEEE, pp 1648–1653

  • Meng C, Cui Y, He Q, Su L, Gao J (2017) Travel purpose inference with GPS trajectories, POIs, and geo-tagged social media data. In: 2017 IEEE international conference on big data (Big Data). IEEE, pp 1319–1324

  • Mislove A, Lehmann S, Ahn Y-Y, Onnela J-P, Rosenquist JN (2011) Understanding the demographics of twitter users. ICWSM 11:25

  • Nasri A, Zhang L (2014) Assessing the impact of metropolitan-level, county-level, and local-level built environment on travel behavior: Evidence from 19 US urban areas. J Urban Plan Dev 141:04014031

  • Nguyen D, Gravel R, Trieschnigg D, Meder T (2013) "How old do you think I am?" A study of language and age in Twitter. In: ICWSM, 2013

  • Nguyen D, Trieschnigg D, Doğruöz AS, Gravel R, Theune M, Meder T, De Jong F (2014) Why gender and age prediction from tweets is hard: lessons from a crowdsourcing experiment. In: Proceedings of COLING 2014, the 25th International Conference on computational linguistics: technical papers, 2014. pp 1950–1961

  • NHTS (2011) Uses of National Household Travel Survey data in transportation. In: Using National Household Travel Survey data for transportation decision making: a workshop

  • Ni M, He Q, Gao J (2017) Forecasting the subway passenger flow under event occurrences with social media. IEEE Trans Intell Transp Syst 18:1623–1632

  • OECD (2018) Education at a Glance 2018.

  • Ouimet MC, Simons-Morton BG, Zador PL, Lerner ND, Freedman M, Duncan GD, Wang J (2010) Using the US National Household Travel Survey to estimate the impact of passenger characteristics on young drivers’ relative risk of fatal crash involvement. Accid Anal Prev 42:689–694

  • Pennacchiotti M, Popescu AM (2011) A machine learning approach to twitter user classification. In: Proceedings of the international AAAI conference on web and social media, vol 5. Barcelona, Catalonia, Spain, 17–21 July 2011

  • Picornell M, Ruiz T, Lenormand M, Ramasco JJ, Dubernet T, Frías-Martínez E (2015) Exploring the potential of phone call data to characterize the relationship between social network and travel behavior. Transportation 42:647–668

  • Polzin SE, Chu X, Raman VS (2008) Exploration of a shift in household transportation spending from vehicles to public transportation

  • Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in twitter. In: Proceedings of the 2nd international workshop on Search and mining user-generated contents, 2010. ACM, pp 37–44

  • Rao D, Paul MJ, Fink C, Yarowsky D, Oates T, Coppersmith G (2011) Hierarchical bayesian models for latent attribute detection in social media. ICWSM 11:598–601

  • Rashidi TH, Abbasi A, Maghrebi M, Hasan S, Waller TS (2017) Exploring the capacity of social media data for modelling travel behaviour: opportunities and challenges. Transport Res Part C Emerg Technol 75:197–211

  • Schler J, Koppel M, Argamon S, Pennebaker JW (2006) Effects of age and gender on blogging. In: AAAI spring symposium: Computational approaches to analyzing weblogs, vol 6, pp 199–205

  • Schwartz HA et al (2013a) Characterizing geographic variation in well-being using tweets. In: ICWSM, pp 583–591

  • Schwartz HA et al (2013b) Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8:e73791

  • Statista (2018) Distribution of Twitter users in the United States as of January 2017, by gender. Accessed 27 Jan 2021

  • Zhang Z, He Q (2019) Social media in transportation research and promising applications. In: Ukkusuri S, Yang C (eds) Transportation analytics in the era of big data. Springer, Cham, pp 23–45

  • Zhang Z, He Q, Zhu S (2017) Potentials of using social media to infer the longitudinal travel behavior: a sequential model-based clustering method. Transport Res Part C Emerg Technol 85:396–414

  • Zhang Z, He Q, Gao J, Ni M (2018) A deep learning approach for detecting traffic accidents from social media data. Transport Res Part C Emerg Technol 86:580–596


This work was partially supported by National Science Foundation award CMMI-1637604 and by the Tier 1 University Transportation Center Transportation Informatics at the University at Buffalo.

Corresponding author

Correspondence to Qing He.

Cite this article

Cui, Y., He, Q. Inferring Twitters’ Socio-demographics to Correct Sampling Bias of Social Media Data for Augmenting Travel Behavior Analysis. J. Big Data Anal. Transp. 3, 159–174 (2021).


  • Social media data
  • Twitter
  • Socio-demographics
  • Sampling bias correction
  • Travel behavior