Classifying with Co-stems

A New Representation for Information Filtering
  • Nedim Lipka
  • Benno Stein
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6611)

Abstract

Besides the content, the writing style is an important discriminator in information filtering tasks. Ideally, the solution of a filtering task employs a text representation that models both kinds of characteristics. In this respect, word stems clearly capture content, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part-of-speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation for a broad range of information filtering tasks to provide new insights into the adequacy and task-specificity of text representation models. Altogether, we observe that co-stems-based representations outperform the classical bag-of-words model for several filtering tasks.
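
To make the stem/co-stem split concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy suffix stripper (`toy_stem`, with an invented suffix list) in place of an established stemming algorithm, and it treats the co-stem as whatever remains of a word once its stem is removed. The function names and the `stem=` / `co=` feature prefixes are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy suffix list (an assumption for illustration only);
# a real system would use an established stemming algorithm instead.
TOY_SUFFIXES = ("ingly", "ness", "ing", "ly", "ed", "s")


def toy_stem(word):
    """Strip the longest matching toy suffix, keeping at least 3 characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def co_stem(word, stem):
    """Return what remains of `word` after removing `stem` (possibly empty).

    The remainder may be a prefix, a suffix, or both concatenated.
    """
    idx = word.find(stem)
    if idx < 0:
        # The stemmer rewrote characters; this sketch does not handle that.
        return ""
    return word[:idx] + word[idx + len(stem):]


def featurize(tokens):
    """Count stem features and co-stem features in one combined bag."""
    features = Counter()
    for token in tokens:
        word = token.lower()
        stem = toy_stem(word)
        rest = co_stem(word, stem)
        features["stem=" + stem] += 1
        if rest:
            features["co=" + rest] += 1
    return features


if __name__ == "__main__":
    bag = featurize(["The", "reviewers", "loved", "watching", "movies"])
    for name, count in sorted(bag.items()):
        print(name, count)
```

In this sketch the stem features carry the content words, while the co-stem features record only the stripped affixes, which is the part of the representation the abstract associates with writing style.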

Keywords

  • Support Vector Machine
  • Sentiment Analysis
  • Topic Detection
  • Spam Detection
  • Movie Review
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Nedim Lipka
    • 1
  • Benno Stein
    • 1
  1. Bauhaus-Universität Weimar, Weimar, Germany
