On Utilizing Nonstandard Abbreviations and Lexicon to Infer Demographic Attributes of Twitter Users

Mosely, Nathaniel; Alm, Cecilia Ovesdotter; Rege, Manjeet

doi:10.1007/978-3-319-16577-6_11

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 346)

Included in the following conference series:

Workshop on Formal Methods Integration

Abstract

Automatically determining demographic attributes of writers with high accuracy, based on their texts, is useful in a range of application domains, including smart ad placement, security, the discovery of predator behaviors, and automatic enhancement of participants’ profiles for extended analysis. It is also of interest to linguists who may wish to build on such inference for further sociolinguistic analysis. Previous work indicates that attributes such as author gender can be determined with some success using various methods, such as analysis of shallow linguistic patterns or topic in authors’ written texts. Author age appears more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g., over or under 30), ternary, or even continuous variable. In this work, we show that word and phrase abbreviation patterns can be used toward determining user age using novel binning, as well as binary user gender and ternary user education level. Notable results include age classification accuracy of up to 83% (67% above the relative majority class baseline) using a support vector machine classifier and PCA-extracted features, including n-grams. User ages were classified into 10 equally sized age bins, with 51% accuracy (34% above baseline) achieved when using only abbreviation features. Gender classification achieved 75% accuracy (13% above baseline) using only PCA-extracted abbreviation features, and education classification achieved 62% accuracy (19% above baseline) with PCA-extracted abbreviation features. Also presented is an analysis of the evident change in author abbreviation use over time on Twitter.
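The abstract reports accuracies relative to a majority class baseline over equally sized age bins. The following is a minimal sketch of how such a comparison might be computed, not the authors' implementation. The function names are hypothetical, the rank-based binning is one assumed way to obtain equally sized bins, and "above baseline" is assumed here to mean a percentage-point difference.

```python
from collections import Counter


def equal_frequency_bins(ages, n_bins):
    """Assign each age a bin index so that bins hold roughly equal numbers of users.

    Assumption: bins are formed by rank, so identical ages that straddle a
    bin boundary may land in different bins.
    """
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    labels = [0] * len(ages)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(ages)
    return labels


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def points_above_baseline(predicted, gold):
    """Difference between classifier accuracy and the majority class baseline."""
    return accuracy(predicted, gold) - majority_baseline(gold)


if __name__ == "__main__":
    # Toy ages, purely illustrative; not data from the paper.
    ages = [18, 19, 21, 22, 25, 27, 30, 34, 41, 52]
    gold = equal_frequency_bins(ages, n_bins=5)
    predicted = [0, 0, 1, 1, 2, 2, 3, 3, 0, 0]  # a made-up classifier output
    print(gold)
    print(majority_baseline(gold))
    print(points_above_baseline(predicted, gold))
```

With perfectly balanced bins the majority class baseline equals one over the number of bins, so any imbalance in the real bins raises the baseline and shrinks the reported margin.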

Author information

Authors and Affiliations

Rochester Institute of Technology, Rochester, NY, 14623, USA
Nathaniel Mosely & Cecilia Ovesdotter Alm
University of St. Thomas, St. Paul, MN, 55105, USA
Manjeet Rege

Corresponding author

Correspondence to Nathaniel Mosely.

Editor information

Editors and Affiliations

Ecole Nationale Supérieure d'Informatique, Alger, Algeria
Thouraya Bouabana-Tebibel
SPAWAR Systems Center Pacific, San Diego, California, USA
Stuart H. Rubin

About this paper

Cite this paper

Mosely, N., Alm, C.O., Rege, M. (2015). On Utilizing Nonstandard Abbreviations and Lexicon to Infer Demographic Attributes of Twitter Users. In: Bouabana-Tebibel, T., Rubin, S. (eds) Formalisms for Reuse and Systems Integration. FMI 2014. Advances in Intelligent Systems and Computing, vol 346. Springer, Cham. https://doi.org/10.1007/978-3-319-16577-6_11

DOI: https://doi.org/10.1007/978-3-319-16577-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16576-9
Online ISBN: 978-3-319-16577-6
eBook Packages: Engineering, Engineering (R0)
