Comparison of Vector Space Representations of Documents for the Task of Information Retrieval of Massive Open Online Courses

  • Julius Klenin
  • Dmitry Botov
  • Yuri Dmitrin
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 789)


One of the important issues arising in the development of educational courses is maintaining relevance for the intended audience. In general, this requires course developers to draw on and borrow elements from similar content developed by others. This form of collaboration integrates the experience and points of view of multiple authors, which tends to result in better, more relevant content. This article addresses the problem of searching for relevant massive open online courses (MOOCs) using a course programme document as a query. As a novel solution to this task, we propose the application of language modelling. We present the results of an experiment comparing several of the most popular vector space representation models for text documents: the classical TF-IDF weighting scheme, Latent Semantic Indexing, topic modelling in the form of Latent Dirichlet Allocation, and the popular modern neural network language models word2vec and paragraph vectors. The experiment is carried out on a corpus of courses in Russian collected from several popular MOOC platforms. The effectiveness of the proposed model is evaluated taking into account the opinions of university professors.
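As a rough illustration of the retrieval setting described above, the following minimal sketch ranks a small set of course descriptions against a course programme used as the query. It covers only the TF-IDF baseline among the compared models. The whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity, the toy English corpus, and all function and variable names are assumptions made for illustration; the abstract does not specify the authors' preprocessing, parameters, or similarity measure.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_courses(query_text, courses):
    """Return (course_id, score) pairs sorted by descending similarity."""
    ids = list(courses)
    tokenized = [tokenize(courses[i]) for i in ids]
    idf = build_idf(tokenized)
    query_vec = tfidf_vector(tokenize(query_text), idf)
    scored = [
        (i, cosine(query_vec, tfidf_vector(toks, idf)))
        for i, toks in zip(ids, tokenized)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus standing in for MOOC descriptions.
COURSES = {
    "ml-intro": "machine learning regression classification neural networks",
    "db-basics": "relational databases sql queries transactions indexing",
    "ir-course": "information retrieval indexing ranking vector space model",
}
# Hypothetical course programme text used as the query document.
QUERY = "course programme on information retrieval and vector space ranking"

RANKING = rank_courses(QUERY, COURSES)
for course_id, score in RANKING:
    print(f"{course_id}: {score:.3f}")
```

In this toy setup only the course sharing vocabulary with the query receives a non-zero score. That exact-match limitation is the kind of weakness that latent and embedding-based representations such as LSI, LDA, word2vec, and paragraph vectors are commonly used to address.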


Vector space model; Educational course programme; Document modelling; Information retrieval; Word embedding; MOOC platform; Educational data mining



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Information Technologies Institute, Chelyabinsk State University, Chelyabinsk, Russian Federation
