Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets

  • Yaakov HaCohen-Kerner
  • Hananya Beck
  • Elchai Yehudai
  • Dror Mughaz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4265)

Abstract

Text classification is an important and challenging research domain. This paper investigates identifying the historical period and ethnic origin of documents using stylistic feature sets. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present several interesting problems for stylistic classification. First, they include words from both languages. Second, Hebrew and Aramaic are morphologically richer than English. The classification uses six different sets of stylistic features, including quantitative features, orthographic features, topographic features, lexical features and vocabulary richness. Each set includes various baseline features, some of them formalized by us. SVM was chosen as the machine learning method because it has been very successful in text classification. The quantitative set was found to be very successful and superior to all other sets. Its features are domain-independent and language-independent. It will be interesting to apply these feature sets in general and the quantitative set in particular into other domains as well as into other.
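As a rough illustration of what a domain-independent, language-independent quantitative feature set might look like, the sketch below computes a few simple counts and averages from raw text. The abstract does not list the individual features, so the function name `quantitative_features` and the four features it returns are assumptions chosen for illustration, not the features used in the paper. The resulting values could be arranged into a vector and passed to an SVM implementation of the reader's choice.

```python
import re


def quantitative_features(text: str) -> dict:
    """Compute a few hypothetical quantitative style features.

    These features are illustrative assumptions, not the paper's feature list.
    """
    # Crude sentence split on runs of sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware for str patterns in Python 3, so letters of
    # non-Latin scripts are matched as word characters too.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "num_words": n_words,
        "num_sentences": n_sentences,
        # Average number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Average number of words per sentence.
        "avg_sentence_len": n_words / n_sentences if n_sentences else 0.0,
    }
```

Because the function relies only on word and sentence boundaries, it needs no dictionary or morphological analysis, which is the sense in which such features can be called language-independent.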

Keywords

Support Vector Machine; Classification Task; Ethnic Origin; Historical Period; Sequential Minimal Optimization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yaakov HaCohen-Kerner (1)
  • Hananya Beck (1)
  • Elchai Yehudai (1)
  • Dror Mughaz (1, 2)

  1. Department of Computer Science, Jerusalem College of Technology (Machon Lev), Jerusalem, Israel
  2. Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel
