Knowledge and Information Systems, Volume 31, Issue 1, pp 1–21

Two-layered Blogger identification model integrating profile and instance-based methods

Regular Paper

Abstract

This paper introduces a two-layered framework that improves authorship identification accuracy over earlier work when the number of bloggers is large. Previous studies fall mainly into two categories: profile-based and instance-based methods, each with its own advantages and limitations. The two-layered framework integrates both approaches and offers a new solution to a key problem in authorship identification, namely the drop in accuracy as the number of authors increases. The paper first describes the regular instance-based core model and the features investigated. It then introduces a new psycholinguistic profile representation of authors, presents the extraction of similarity groups over these profiles, and applies the two-layered approach to blogger identification. The results confirm that, for an extended number of users, the proposed two-layered approach improves on both our regular classifier and a selected baseline.
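The general idea of combining a profile layer with an instance layer can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the features, the grouping algorithm, or the classifier, so everything below is an assumption. Toy numeric vectors stand in for psycholinguistic features, a greedy cosine-similarity threshold stands in for group extraction, and a 1-nearest-neighbour search stands in for the instance-based classifier. The function names (`build_profiles`, `group_authors`, `identify`) are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_profiles(posts):
    """Profile-based view: one mean feature vector per author.

    posts is a list of (author, feature_vector) pairs.
    """
    by_author = defaultdict(list)
    for author, vec in posts:
        by_author[author].append(vec)
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in by_author.items()
    }


def group_authors(profiles, threshold):
    """Assumed grouping rule: an author joins the first group whose
    seed profile is at least `threshold` similar, else starts a new group."""
    groups = []  # list of (seed_profile, [author, ...])
    for author in sorted(profiles):
        for seed, members in groups:
            if cosine(seed, profiles[author]) >= threshold:
                members.append(author)
                break
        else:
            groups.append((profiles[author], [author]))
    return groups


def identify(post_vec, posts, groups):
    """Two layers: pick the most similar group, then the nearest post in it."""
    # Layer 1 (profile-based): narrow the candidate authors to one group.
    _, members = max(groups, key=lambda g: cosine(g[0], post_vec))
    # Layer 2 (instance-based): 1-nearest-neighbour over that group's posts.
    candidates = [(a, v) for a, v in posts if a in members]
    best_author, _ = max(candidates, key=lambda av: cosine(av[1], post_vec))
    return best_author


# Toy data: four hypothetical authors, two posts each, two features per post.
posts = [
    ("ann", [1.0, 0.1]), ("ann", [0.9, 0.2]),
    ("bob", [0.8, 0.3]), ("bob", [0.7, 0.4]),
    ("cat", [0.1, 1.0]), ("cat", [0.2, 0.9]),
    ("dan", [0.3, 0.8]), ("dan", [0.4, 0.7]),
]
profiles = build_profiles(posts)
groups = group_authors(profiles, threshold=0.9)
predicted = identify([0.75, 0.38], posts, groups)
```

In this sketch the first layer only narrows the candidate set, so the second layer compares the unknown post against fewer authors. That is one plausible way a grouping step could counter the loss of accuracy as the number of authors grows, under the assumptions stated above.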

Keywords

Blog mining, Authorship identification, User representation, Group extraction, Profile modeling

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. School of Computer Science, University of Lincoln, Lincoln, UK
