Stochastic Modelling of Scientific Terms Distribution in Publications

  • Rimantas Rudzkis
  • Vaidas Balys
  • Michiel Hazewinkel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4108)


In this paper, we address the problem of automatically assigning keywords to scientific publications. We consider the idea of identifying appropriate keywords from textual traces learned from training data in a supervised manner. We introduce the transparent concept of an identification cloud as a means of representing the semantics of scientific terms. This concept is defined mathematically through models of the stochastic distributions of scientific terms over publication texts. We present the characteristics of these models, together with procedures for both non-parametric and parametric estimation of the probability distributions.
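The abstract does not give the estimation procedures themselves, so the following is only a minimal sketch of what a non-parametric estimate of a term's distribution over publication texts might look like. It assumes that an identification cloud can be approximated by smoothed relative word frequencies in training texts labelled with a keyword, and that candidate keywords can be ranked by average log-probability. The function names `estimate_clouds` and `rank_keywords`, the smoothing parameter `alpha`, and the toy training data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def estimate_clouds(training_docs, alpha=1.0):
    """Estimate a smoothed word distribution for every keyword.

    training_docs: list of (tokens, keywords) pairs, both lists of strings.
    Returns (clouds, vocabulary) with clouds[keyword][word] = P(word | keyword).
    Assumption: additive (Laplace) smoothing stands in for the paper's estimators.
    """
    vocabulary = sorted({w for tokens, _ in training_docs for w in tokens})
    counts = {}
    for tokens, keywords in training_docs:
        for kw in keywords:
            # Pool the words of every text labelled with this keyword.
            counts.setdefault(kw, Counter()).update(tokens)
    clouds = {}
    for kw, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocabulary)
        clouds[kw] = {w: (counter[w] + alpha) / total for w in vocabulary}
    return clouds, vocabulary


def rank_keywords(tokens, clouds):
    """Rank keywords by the average log-probability of the known tokens."""
    scores = {}
    for kw, dist in clouds.items():
        known = [w for w in tokens if w in dist]
        if known:
            scores[kw] = sum(math.log(dist[w]) for w in known) / len(known)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy corpus: (tokens of a text, keywords assigned to it).
training_docs = [
    (["markov", "chain", "stationary", "transition"], ["Markov process"]),
    (["markov", "transition", "matrix", "chain"], ["Markov process"]),
    (["kernel", "margin", "classifier", "hyperplane"], ["support vector machine"]),
    (["margin", "kernel", "hyperplane", "training"], ["support vector machine"]),
]
clouds, vocabulary = estimate_clouds(training_docs)
ranking = rank_keywords(["transition", "chain", "matrix"], clouds)
print(ranking)
```

In this sketch each cloud is a proper probability distribution over the training vocabulary, so the scores of different keywords are directly comparable; a parametric variant would replace the relative frequencies with a fitted model family.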


Support Vector Machine; Latent Dirichlet Allocation; Latent Semantic Analysis; Scientific Term; Probabilistic Latent Semantic Analysis
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Rimantas Rudzkis (1)
  • Vaidas Balys (1)
  • Michiel Hazewinkel (2)
  1. Institute of Mathematics and Informatics, Vilnius, Lithuania
  2. Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands
