Abstract
This work reports stylistic differences in blogging across gender and age groups using slang word co-occurrences. We focus on the co-occurrence of non-dictionary words across bloggers of different genders and age groups, using slang usage as the feature for studying stylistic variation. We model the co-occurrences of slang words used by bloggers as a graph in which nodes are slang words and edges are weighted by the number of co-occurrences, and we study how well this model predicts age group and gender. We use the demographically tagged blog corpus from the ICWSM Spinn3r dataset for these experiments, with a Naive Bayes classifier and 10-fold cross-validation. Preliminary results show that the co-occurrence of slang words could be a better choice for predicting age and gender.
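The graph model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `DICTIONARY` set, the tokenizer, the sample posts, and the choice to count two slang words as co-occurring when they appear in the same post are all assumptions made here for illustration, since the abstract does not specify them.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical stand-in for a real spelling dictionary (assumption).
DICTIONARY = {"i", "am", "so", "tired", "see", "you", "later",
              "that", "was", "funny"}


def slang_words(post, dictionary=DICTIONARY):
    """Return the distinct non-dictionary tokens of one post, sorted."""
    tokens = re.findall(r"[a-z0-9']+", post.lower())
    return sorted({t for t in tokens if t not in dictionary})


def cooccurrence_graph(posts, dictionary=DICTIONARY):
    """Build a weighted edge list: nodes are slang words, and each edge
    weight is the number of posts in which both words appear.
    Co-occurrence within a single post is an assumption of this sketch."""
    edges = Counter()
    for post in posts:
        # Sorted input keeps each undirected edge under one key (u < v).
        for u, v in combinations(slang_words(post, dictionary), 2):
            edges[(u, v)] += 1
    return edges


# Made-up sample posts, for illustration only.
posts = [
    "lol that was funny omg",
    "omg lol i am so tired",
    "see you later gr8",
]
graph = cooccurrence_graph(posts)
for (u, v), weight in graph.items():
    print(u, v, weight)
```

Edge weights computed this way, aggregated per blogger, could then serve as features for a classifier such as Naive Bayes, evaluated with 10-fold cross-validation as the abstract describes.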
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Prasath, R.R. (2010). Learning Age and Gender Using Co-occurrence of Non-dictionary Words from Stylistic Variations. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds) Rough Sets and Current Trends in Computing. RSCTC 2010. Lecture Notes in Computer Science, vol 6086. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13529-3_58
DOI: https://doi.org/10.1007/978-3-642-13529-3_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13528-6
Online ISBN: 978-3-642-13529-3
eBook Packages: Computer Science (R0)