Indexing and Searching Mathematics in Digital Libraries

Architecture, Design and Scalability Issues
  • Petr Sojka
  • Martin Líška
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6824)

Abstract

This paper surveys approaches and systems for searching mathematical formulae in mathematical corpora and on the web. The design and architecture of our MIaS (Math Indexer and Searcher) system are presented, and our design decisions are discussed in detail. An approach based on Presentation MathML that uses the similarity of math subformulae is proposed and verified by implementing it as a math-aware search engine built on the state-of-the-art Apache Lucene system.

Scalability was tested on 324,000 real scientific documents from the arXiv archive, containing 112 million mathematical formulae. More than two billion MathML subformulae were indexed using our Solr-compatible Lucene extension.
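
To make the idea of indexing subformulae concrete, the following is a minimal sketch and not the MIaS implementation. It assumes that every subtree of a Presentation MathML expression becomes one index term and that a term's weight halves with each level of nesting. The function names (`canonical`, `subformulae`), the term syntax, and the `decay` parameter are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}mi' -> 'mi'."""
    return tag.rsplit('}', 1)[-1]


def canonical(node):
    """Serialize a MathML subtree into a compact, whitespace-free string."""
    name = local_name(node.tag)
    children = list(node)
    if not children:
        # Leaf element: keep its token text (identifier, number, operator).
        return f"{name}({(node.text or '').strip()})"
    return f"{name}[" + ",".join(canonical(c) for c in children) + "]"


def subformulae(node, depth=0, decay=0.5):
    """Yield (term, weight) for the subtree at `node` and all nested subtrees.

    The weighting scheme (decay ** depth) is an assumption of this sketch.
    """
    yield canonical(node), decay ** depth
    for child in node:
        yield from subformulae(child, depth + 1, decay)


# Presentation MathML for a^2 + b
MATHML = """<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><msup><mi>a</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi></mrow></math>"""

root = ET.fromstring(MATHML)
terms = list(subformulae(root))

for term, weight in terms:
    print(f"{weight:.3f}  {term}")
```

With terms of this kind in the index, a query such as a^2 can match a document that contains a^2 + b through the shared `msup` subtree, at a lower weight than an exact match of the whole formula.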

Keywords

math indexing and retrieval, mathematical digital libraries, information systems, information retrieval, mathematical content search, document ranking of mathematical papers, math text mining, MIaS, WebMIaS

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Petr Sojka (1)
  • Martin Líška (1)
  1. Faculty of Informatics, Masaryk University, Brno, Czech Republic
