Automated Classification and Categorization of Mathematical Knowledge

  • Radim Řehůřek
  • Petr Sojka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5144)

Abstract

There is a common Mathematics Subject Classification (MSC) System used for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on the classification task of top-level MSC categories exceeds 89%. We describe and evaluate our methods for measuring the similarity of papers in the digital library based on paper full texts.
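
As an illustration of the evaluation metric named above, the following minimal sketch computes the per-category F1-measure (the harmonic mean of precision and recall) and its macro average for single-label predictions. The abstract does not state which averaging scheme or evaluation code the authors used, so the function names, the macro averaging, and the toy labels (two-digit top-level MSC codes) are assumptions made for illustration only.

```python
from collections import Counter


def f1_per_class(gold, predicted):
    """Return {label: F1} for paired lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct assignment to category g
        else:
            fp[p] += 1          # p was assigned but is wrong
            fn[g] += 1          # g should have been assigned but was missed
    scores = {}
    for label in set(gold) | set(predicted):
        assigned = tp[label] + fp[label]
        relevant = tp[label] + fn[label]
        precision = tp[label] / assigned if assigned else 0.0
        recall = tp[label] / relevant if relevant else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-category F1 scores (an assumed averaging choice)."""
    scores = f1_per_class(gold, predicted)
    return sum(scores.values()) / len(scores)


# Toy data, not taken from the paper: top-level MSC codes for four papers.
gold = ["03", "03", "05", "11"]
predicted = ["03", "05", "05", "11"]
per_class = f1_per_class(gold, predicted)
overall = macro_f1(gold, predicted)
```

Macro averaging weights every category equally regardless of size; a micro average, which pools the counts over all categories first, would instead be dominated by the largest categories.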

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Radim Řehůřek (1)
  • Petr Sojka (1)
  1. Faculty of Informatics, Masaryk University, Brno, Czech Republic
