Structural Similarity Search for Formulas Using Leaf-Root Paths in Operator Subtrees

  • Conference paper
Advances in Information Retrieval (ECIR 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11437)

Abstract

We present a new search method for mathematical formulas based on Operator Trees (OPTs) representing the application of operators to operands. Our method provides (1) a simple indexing scheme using OPT leaf-root paths, (2) practical matching of the K largest common subexpressions, and (3) scoring matched OPT subtrees by counting nodes corresponding to visible symbols, weighting operators lower than operands. Using the largest common subexpression (K = 1), we outperform existing formula search engines for non-wildcard queries on the NTCIR-12 Wikipedia Formula Browsing Task. Stronger results are obtained when using additional subexpressions for scoring. Without parallelization or pruning, our system has practical execution times with low variance when compared to other state-of-the-art formula search engines.
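The indexing and scoring ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tuple-based tree encoding, the function names, and the weights 1.0 and 0.5 are assumptions chosen for illustration, and only full paths to the formula root are indexed here.

```python
from collections import defaultdict

# Hypothetical encoding: an OPT node is (token, [children]); a leaf has no children.

def leaf_root_paths(node, ancestors=()):
    """Yield one tuple per leaf, listing tokens from the leaf up to the root."""
    token, children = node
    if not children:
        yield (token,) + ancestors
        return
    for child in children:
        # Nearest ancestor goes first so each path reads leaf -> root.
        yield from leaf_root_paths(child, (token,) + ancestors)

def build_index(formulas):
    """Map each leaf-root path to the set of formula ids containing it."""
    index = defaultdict(set)
    for formula_id, tree in formulas.items():
        for path in leaf_root_paths(tree):
            index[path].add(formula_id)
    return index

def weighted_size(node, operand_weight=1.0, operator_weight=0.5):
    """Count nodes of a (matched) subtree, weighting operators below operands.

    The weight values are illustrative assumptions, not taken from the paper.
    """
    token, children = node
    if not children:
        return operand_weight
    return operator_weight + sum(
        weighted_size(child, operand_weight, operator_weight) for child in children
    )

# Example: the expression a + b * c as an operator tree.
example = ("add", [("a", []), ("times", [("b", []), ("c", [])])])
paths = list(leaf_root_paths(example))
index = build_index({"f1": example})
score = weighted_size(example)
```

Here `paths` holds one entry per operand, and a query can look up its own paths in `index` to collect candidate formulas before any structural matching is attempted.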

Notes

  1. Source code: https://github.com/approach0/search-engine/tree/ecir2019.

  2. Our expression grammar has roughly 100 grammar rules and 50 token types.

  3. Tangent-S is an improved version of the Tangent system [3] that participated in NTCIR-12.

References

  1. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 25–32. ACM (2004)

  2. Bunke, H., Shearer, K.: A graph distance metric based on the maximal common subgraph. Pattern Recogn. Lett. 19(3–4), 255–259 (1998)

  3. Davila, K.: Tangent-3 at the NTCIR-12 MathIR Task (2016). http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/ntcir/MathIR/06-NTCIR12-MathIR-DavilaK.pdf

  4. Davila, K., Zanibbi, R.: Layout and semantics: combining representations for mathematical formula search. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1165–1168. ACM (2017)

  5. Guidi, F., Sacerdoti Coen, C.: A survey on retrieval of mathematical knowledge. In: Kerber, M., Carette, J., Kaliszyk, C., Rabe, F., Sorge, V. (eds.) CICM 2015. LNCS (LNAI), vol. 9150, pp. 296–315. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20615-8_20

  6. Hijikata, Y., Hashimoto, H., Nishida, S.: An investigation of index formats for the search of MathML objects. In: 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, pp. 244–248, November 2007

  7. Kamali, S., Tompa, F.W.: Structural similarity search for mathematics retrieval. In: Carette, J., Aspinall, D., Lange, C., Sojka, P., Windsteiger, W. (eds.) CICM 2013. LNCS (LNAI), vol. 7961, pp. 246–262. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39320-4_16

  8. Kristianto, G., Topic, G., Aizawa, A.: MCAT Math Retrieval System for NTCIR-12 MathIR Task, June 2016

  9. Lin, X., Gao, L., Hu, X., Tang, Z., Xiao, Y., Liu, X.: A mathematics retrieval system for formulae in layout presentations. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2014. ACM, New York (2014)

  10. Lu, X., Moffat, A., Culpepper, J.S.: The effect of pooling and evaluation depth on IR metrics. Inf. Retr. 19(4), 416–445 (2016). https://doi.org/10.1007/s10791-016-9282-6

  11. Miller, B.R., Youssef, A.: Technical aspects of the digital library of mathematical functions. Ann. Math. Artif. Intell. 38(1–3), 121–136 (2003). https://link.springer.com/article/10.1023/A:1022967814992

  12. Misutka, J., Galambos, L.: Extending Full Text Search Engine for Mathematical Content, pp. 55–67, January 2008

  13. Shamir, R., Tsur, D.: Faster subtree isomorphism. J. Algorithms 33(2), 267–280 (1999)

  14. Sojka, P., Líška, M.: Indexing and searching mathematics in digital libraries. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F. (eds.) CICM 2011. LNCS (LNAI), vol. 6824, pp. 228–243. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22673-1_16

  15. Stalnaker, D., Zanibbi, R.: Math expression retrieval using an inverted index over symbol pairs. In: Document Recognition and Retrieval XXII, vol. 9402, p. 940207. International Society for Optics and Photonics (2015)

  16. Turtle, H., Flood, J.: Query evaluation: strategies and optimizations. Inf. Process. Manage. 31(6), 831–850 (1995). https://doi.org/10.1016/0306-4573(95)00020-H

  17. Valiente, G.: An efficient bottom-up distance between trees. In: Proceedings of Eighth Symposium on String Processing and Information Retrieval, pp. 212–219, November 2001

  18. Valiente Feruglio, G.A.: Simple and Efficient Tree Comparison (2001)

  19. Yokoi, K., Aizawa, A.: An approach to similarity search for mathematical expressions using MathML. In: Towards a Digital Mathematics Library, Grand Bend, Ontario, Canada, 8–9th July 2009, pp. 27–35 (2009)

  20. Zanibbi, R., Aizawa, A., Kohlhase, M., Ounis, I., Topic, G., Davila, K.: NTCIR-12 MathIR task overview. In: NTCIR (2016)

  21. Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. Int. J. Doc. Anal. Recognit. 15(4), 331–357 (2012)

  22. Zanibbi, R., Davila, K., Kane, A., Tompa, F.W.: Multi-stage math formula search: using appearance-based similarity metrics at scale. In: Proceedings of the 39th International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR 2016. ACM, New York (2016)

  23. Zhong, W., Fang, H.: A novel similarity-search method for mathematical content in LaTeX markup and its implementation. Master’s thesis, University of Delaware (2015)

  24. Zhong, W., Fang, H.: OPMES: a similarity search engine for mathematical content. In: Ferro, N., et al. (eds.) ECIR 2016. LNCS, vol. 9626, pp. 849–852. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30671-1_79

  25. Zukowski, M., Heman, S., Nes, N., Boncz, P.: Super-scalar RAM-CPU cache compression. In: Proceedings of the 22nd International Conference on Data Engineering, 2006. ICDE 2006, p. 59. IEEE (2006)

Corresponding author

Correspondence to Wei Zhong.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhong, W., Zanibbi, R. (2019). Structural Similarity Search for Formulas Using Leaf-Root Paths in Operator Subtrees. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11437. Springer, Cham. https://doi.org/10.1007/978-3-030-15712-8_8

  • DOI: https://doi.org/10.1007/978-3-030-15712-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15711-1

  • Online ISBN: 978-3-030-15712-8

  • eBook Packages: Computer Science (R0)
