Accelerating Substructure Similarity Search for Formula Retrieval
Abstract
Formula retrieval systems using substructure matching are effective, but suffer from slow retrieval times caused by the complexity of structure matching. We present a specialized inverted index and rank-safe dynamic pruning algorithm for faster substructure retrieval. Formulas are indexed from their Operator Tree (OPT) representations. Our model is evaluated using the NTCIR-12 Wikipedia Formula Browsing Task and a new formula corpus produced from Math StackExchange posts. Our approach preserves the effectiveness of structure matching while allowing queries to be executed in real-time.
Keywords
Math information retrieval, Query processing optimization, Dynamic pruning
1 Introduction
In the most recent math similarity search competition,^{1} the effective systems all took a tree-based approach, extracting query terms from tree representations of formulas. For example, an Operator Tree (OPT) is used in Fig. 1 to represent math formulas, where operands are represented by leaves and operators are located at internal nodes. This facilitates searching for substructures shared by two math expressions: we can extract paths from their tree representations and find their shared subtrees by matching common paths grouped by subtree root node. However, in order to carry structure information, structural queries commonly contain tens or even hundreds of path tokens, which is unusual for ordinary full-text search. This makes query processing costly for realistic math search tasks.
In text similarity search, query processing can be accelerated through dynamic pruning [18], which typically estimates score upperbounds to prune documents unlikely to be in the top K results. However, effective substructure search requires additional matching or alignment among query terms. This makes it hard to obtain a good score estimate and prevents us from applying traditional dynamic pruning effectively. In fact, reportedly few state-of-the-art MIR systems have achieved practical query run times, even when given a large amount of computing resources [11, 20]. In this paper we address this problem by introducing a specialized inverted index, and we propose a dynamic pruning method based on this inverted index to boost formula retrieval efficiency.
2 Related Work
Recently there has been an increasing amount of research on similarity search for math formulas, with most focusing on search effectiveness [5, 7, 11, 23]. There are many emerging issues regarding effectiveness, including handling mathematical semantics, and identifying interchangeable symbols and common subexpressions. However, the efficiency of math formula search systems is often not addressed.
A number of MIR systems apply text search models to math retrieval, extracting sequential features from formulas and using variants of TF-IDF scoring [12, 14, 16]. These approaches incorporate a bag-of-words model, and use frequency to measure formula similarity. Inevitably, they need to index different combinations of sequences or substrings to handle operator commutativity and subexpression identification. This index augmentation results in an index size that grows non-linearly with the number of indexed “words” [12] and thus hurts efficiency for large corpora. On the other hand, recent results [10, 20, 23] reveal that effective systems for formula retrieval use tree-based approaches distinct from text-based methods. However, tree-based systems usually need to calculate costly graph matching or edit distance metrics [9, 22], which generally have non-linear time complexity. Recently, a path-based approach [23] was developed to search substructures in formula OPTs approximately, by assuming that identical formulas have the same leaf-root path set. Although at the time of writing it obtains the best effectiveness for the NTCIR-12 dataset, the typically large number of query paths means that query run times are not ideal: maximum run times can be a couple of seconds.
Dynamic pruning has been recognized as an effective way to reduce query processing times [2, 8, 13, 18]. Dynamic pruning speeds up query processing by skipping scoring calculations or avoiding unnecessary reads for documents which are unlikely to be ranked in the top K results. Pruning methods can be based on different query processing schemes. Document-at-a-time (DAAT) requires all relevant posting lists to be merged simultaneously. Term-at-a-time (TAAT) or score-at-a-time (SAAT) processes one posting list at a time for each term, requiring additional memory to store partial scores; posting lists in this case are usually sorted by document importance (e.g., impact score [1]), with promising documents placed at the front of inverted lists. Pruning strategies are rank-safe (or safe up to rank K) [19] if they guarantee that the top K documents are ranked in the same order before and after pruning. The most well-known rank-safe pruning strategies for DAAT are MaxScore [8, 17, 19] and WAND variants [3, 6]. Shan et al. [15] show that MaxScore variants (e.g., BMM, LBMM) outperform other dynamic pruning strategies for long queries, and recently Mallia et al. [2] report a similar finding over a range of popular index encodings.
3 Preliminaries
Baseline Model. This work is based on our previous work [23], which extracts prefixes from OPT leaf-root paths as index or query terms. The OPT is parsed from a formula in LaTeX. Indexed paths are mapped to corresponding posting lists in an inverted index, where the IDs of expressions containing the path are appended. For query paths, the corresponding posting lists are merged and approximate matching is performed on candidates one expression at a time. The similarity score is measured from matched common subtree(s).
Because math symbols are interchangeable, paths are tokenized for better recall, e.g., variables such as a, b, c are tokenized into VAR. In our tokenized path representation, uppercase words denote token types, which may be for operators as well as operands (e.g., TIMES for symbols representing multiplication). In Fig. 1, when indexing “bc + xy + a + z,” its expression ID (or ExpID) will be appended to posting lists associated with tokenized prefix paths from its OPT representation, i.e., VAR/TIMES, VAR/ADD and VAR/TIMES/ADD. At query processing, the shared structures highlighted in black and gray are found by matching these tokenized paths (two paths match if and only if their tokenized forms are identical, for example, “a/+” and “z/+” can be matched), and common subtree roots are identified by grouping paths by their root nodes. As a result, the posting list entry also stores the root node ID for indexed paths, in order to reconstruct matched substructures at merge time.
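The tokenized prefix paths for the example above can be reproduced with a short sketch. The `Node` class and both function names are hypothetical and introduced only for illustration; the sketch assumes prefixes of length two or more are used, and it omits the root node IDs that real posting list entries also store.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    token: str  # tokenized symbol, e.g. "VAR", "ADD", "TIMES"
    children: List["Node"] = field(default_factory=list)


def leaf_root_paths(root):
    """Return one leaf-to-root token list per leaf of the OPT."""
    paths = []

    def walk(node, ancestors):
        if not node.children:
            # ancestors is ordered root-first; reverse it for leaf-to-root order
            paths.append([node.token] + ancestors[::-1])
        for child in node.children:
            walk(child, ancestors + [node.token])

    walk(root, [])
    return paths


def prefix_paths(root):
    """Collect tokenized prefixes (length >= 2, an assumption) of all leaf-root paths."""
    prefixes = set()
    for path in leaf_root_paths(root):
        for end in range(2, len(path) + 1):
            prefixes.add("/".join(path[:end]))
    return prefixes


def var():
    return Node("VAR")


# OPT for "bc + xy + a + z": one ADD root over two TIMES nodes and two variables
opt = Node("ADD", [Node("TIMES", [var(), var()]),
                   Node("TIMES", [var(), var()]),
                   var(), var()])
print(sorted(prefix_paths(opt)))  # ['VAR/ADD', 'VAR/TIMES', 'VAR/TIMES/ADD']
```

Because all variables collapse to VAR, the four distinct leaves under the two multiplications contribute the same two prefixes, which is why only three posting lists are touched.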
At query time, the similarity score is given by the size of matched common subtrees. Specifically, the model chooses a number of “widest” matched subtree(s) (e.g., \(a+bc\) is the widest match in Fig. 1 because it has 3 common leaves and is “wider” than the other choices) and measures formula similarity based on the size of these common subtrees.
Dynamic Pruning. In dynamic pruning, the top K scored hits are kept throughout the querying process, with the lowest score in the top K at a given point defining the threshold \(\theta \). Since at most K candidates will be returned, dynamic pruning strategies work by estimating score upperbounds before knowing the precise score of a hit, so that candidate hits with a score upperbound less than or equal to \(\theta \) can be pruned safely, because they will not appear in the final top K results. Moreover, if a subset of posting lists alone cannot produce a top K result from their upperbounds, they are called a non-requirement set, the opposite being the requirement set. Entries of posting lists in the non-requirement set with IDs less than the ID currently being evaluated in the requirement set can be skipped safely, because posting lists in the non-requirement set alone will not produce a top K candidate.
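The threshold mechanism can be sketched with a min-heap. The `TopK` class and its method names are hypothetical, not taken from the system's source code; the sketch assumes that no pruning happens until K hits have been collected.

```python
import heapq


class TopK:
    """Keeps the K best-scored hits and exposes the pruning threshold theta."""

    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap of (score, exp_id); heap[0] is the weakest hit

    @property
    def theta(self):
        # Assumption: until K hits are collected, nothing may be pruned
        if len(self.heap) < self.k:
            return float("-inf")
        return self.heap[0][0]

    def can_prune(self, upperbound):
        # A hit whose upperbound does not exceed theta cannot enter the top K
        return upperbound <= self.theta

    def offer(self, exp_id, score):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, exp_id))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, exp_id))


top = TopK(2)
for exp_id, score in [(1, 3.0), (2, 5.0), (3, 4.0)]:
    top.offer(exp_id, score)
print(top.theta)           # 4.0
print(top.can_prune(4.0))  # True
print(top.can_prune(4.5))  # False
```

As more hits are scored, \(\theta \) can only increase, so pruning becomes more aggressive later in the merge.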
4 Methodology
Let \(w_{m, t} = |Q^{(m)}_t|\). Since \(w_{m, t} \ge \min (|Q^{(m)}_t|, |D^{(n)}_t|)\), we call \(w_{m, t}\) our (pre-computed) partial score upperbound. This is analogous to text search, where each posting list has a partial score upperbound and the TF-IDF score upperbound is merely their sum. In our case, however, partial score upperbounds are summed only for one node or subtree.
In the following we propose three strategies to compute an upperbound of \(w^*_{Q, D}\) from partial score upperbounds and to assign the non-requirement set.
Max Reference (MaxRef) Strategy. In MaxScore [17, 19], each posting list has a single partial score upperbound; our scoring function, however, implies that each posting list can be involved with multiple partial score upperbounds. One way to select the non-requirement set in our case is to use an upperbound score \(MaxRef_t\) (for each posting list t), which is the maximum partial score from the query nodes by which this posting list gets “referenced”. If a set of posting lists alone has a sum of MaxRef scores less than or equal to \(\theta \), they can be safely put into the non-requirement set.
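A minimal sketch of MaxRef selection follows, using the lowest-first greedy order described in Sect. 5. The function names and the input layout are assumptions made for illustration. Node q1 and its two partial upperbounds (5 and 1) follow the Fig. 3 description, while node q2 and its value are made up.

```python
def max_ref(refs):
    """refs maps query node -> {token path: w_mt}; returns {token path: MaxRef_t}."""
    scores = {}
    for node_refs in refs.values():
        for path, w in node_refs.items():
            scores[path] = max(scores.get(path, 0), w)
    return scores


def split_by_maxref(refs, theta):
    """Return (requirement, non_requirement) lists of token paths."""
    scores = max_ref(refs)
    ordered = sorted(scores, key=scores.get, reverse=True)  # descending MaxRef
    non_requirement, total = [], 0
    # Move lists out from the lowest MaxRef while their sum stays <= theta
    while ordered and total + scores[ordered[-1]] <= theta:
        path = ordered.pop()
        total += scores[path]
        non_requirement.append(path)
    return ordered, non_requirement


refs = {
    "q1": {"VAR/TIMES/ADD": 5, "VAR/ADD": 1},
    "q2": {"VAR/TIMES": 2},  # hypothetical second query node
}
requirement, non_requirement = split_by_maxref(refs, theta=3)
print(requirement)      # ['VAR/TIMES/ADD']
print(non_requirement)  # ['VAR/ADD', 'VAR/TIMES']
```

With \(\theta = 3\) the two lists with MaxRef values 1 and 2 sum to exactly the threshold, so neither can produce a top-K hit without the remaining list.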
Both of the remaining strategies, GBP-NUM and GBP-LEN (compared in Sect. 6), require solving binary programming problems, which are known to be NP-complete and thus too intensive for long queries. Instead, we greedily follow one branch of the binary programming sub-problems to obtain a feasible (but not optimal) solution in \(O(|T_q| |\mathfrak {T}|^2)\).
5 Implementation
Figure 3 illustrates formula query processing using a modified inverted index for dynamic pruning. For each internal node m of the query OPT, we store the number of leaves of m as \(w_m = |Q^{(m)}|\). Each query node points to tokenized path entries in a dictionary, where each reference is associated with \(w_{m,t} = |Q^{(m)}_t|\) identified by tokenized path t (denoted as \(m/w_m\) of t). In Fig. 3, node q1 from the query has 6 leaves, which is also the upperbound number of path matches for q1, i.e., \(|Q^{(1)}|\). Since q1 consists of 2 tokenized leaf-root paths VAR/TIMES/ADD and VAR/ADD, q1 is linked to two posting lists, each associated with a partial score upperbound (5 and 1).
Each posting list maps to a token path \(t \in \mathfrak {T}\) with a dynamic counter for the number of query nodes referring to it (initially \(|Q_t|\)). A query node is pruned by our algorithm when its subtree width is no longer greater than the current threshold, because the corresponding subexpression cannot be in the top-K results. In this case the reference counter of each posting list it refers to decreases. A posting list is removed if its reference counter is less than one.
Query processing is described in Algorithm 1. RequirementSet returns selected iterators of the requirement set. Assignment according to different pruning strategies is described in Sect. 4. In the MaxRef strategy, we sort posting lists by descending \(\text {MaxRef}\) values, and take as many posting lists as possible into the non-requirement set, starting from the lowest \(\text {MaxRef}\) value. At merging, the candidate ID is the minimal ExpID among the current posting list iterators in the requirement set. Requirement set iterators are advanced by one using the next() function, while iterators in the non-requirement set are advanced directly to the ID equal to or greater than the current candidate by the skipTo() function. In Fig. 3 for example, the posting list corresponding to VAR/TIMES/ADD is in the requirement set under the MaxRef strategy, while the other two are not: document expressions 13 and 15 will be skipped if the next candidate is 90. For ease of testing termination, we append a special ExpID MaxID at the end of each posting list, which is larger than any ExpID in the collection.
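The reference counting described above can be sketched as follows. The function name and the three dictionaries are hypothetical data structures chosen for illustration, and the widths and references of the example nodes are made up apart from q1 having 6 leaves.

```python
def prune_query_nodes(node_width, node_refs, ref_count, theta):
    """Drop query nodes whose subtree width is <= theta.

    node_width: {query node: number of leaves}
    node_refs:  {query node: [token paths it refers to]}
    ref_count:  {token path: number of query nodes still referring to it}
    Returns the token paths whose posting lists can be removed.
    """
    removed = []
    for node in [n for n, w in node_width.items() if w <= theta]:
        for path in node_refs.pop(node):
            ref_count[path] -= 1
            if ref_count[path] < 1:  # no remaining query node needs this list
                removed.append(path)
                del ref_count[path]
        del node_width[node]
    return removed


node_width = {"q1": 6, "q2": 2}
node_refs = {"q1": ["VAR/TIMES/ADD", "VAR/ADD"], "q2": ["VAR/TIMES"]}
ref_count = {"VAR/TIMES/ADD": 1, "VAR/ADD": 1, "VAR/TIMES": 1}

removed = prune_query_nodes(node_width, node_refs, ref_count, theta=2)
print(removed)           # ['VAR/TIMES']
print(sorted(ref_count))  # ['VAR/ADD', 'VAR/TIMES/ADD']
```

Dropping whole posting lists this way shrinks the merge as \(\theta \) rises, on top of the skipping done within the remaining lists.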
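The merge loop can be sketched as below. This is a simplified reading of the description above rather than Algorithm 1 itself: scoring, threshold updates and reassignment of the requirement set are omitted, `skip_to` scans linearly instead of using skip pointers, the requirement set is assumed to be non-empty, and the posting list contents are made up except for ExpIDs 13, 15 and 90.

```python
MAX_ID = float("inf")  # sentinel larger than any ExpID in the collection


class PostingIterator:
    def __init__(self, exp_ids):
        self.ids = list(exp_ids) + [MAX_ID]  # sentinel simplifies termination tests
        self.pos = 0

    def current(self):
        return self.ids[self.pos]

    def next(self):
        if self.current() != MAX_ID:
            self.pos += 1

    def skip_to(self, target):
        # Linear scan for clarity only
        while self.current() < target:
            self.pos += 1


def merge(requirement, non_requirement):
    """Yield (candidate ExpID, number of posting lists containing it)."""
    while True:
        candidate = min(it.current() for it in requirement)
        if candidate == MAX_ID:
            return
        for it in non_requirement:
            it.skip_to(candidate)
        hits = [it for it in requirement + non_requirement
                if it.current() == candidate]
        yield candidate, len(hits)
        for it in requirement:
            if it.current() == candidate:
                it.next()


requirement = [PostingIterator([5, 90])]
non_requirement = [PostingIterator([13, 15, 90]), PostingIterator([5, 100])]
print(list(merge(requirement, non_requirement)))  # [(5, 2), (90, 2)]
```

ExpIDs 13 and 15 are never produced as candidates because only the requirement set drives candidate selection, and ExpID 100 is never reached once the requirement set is exhausted.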
6 Evaluation
We first evaluate our system^{2} on the NTCIR-12 Wikipedia Formula Browsing Task [20] (NTCIR-12 for short), which is the most current benchmark for formula-only retrieval. The dataset contains over 590,000 math expressions taken from English Wikipedia. Since work in formula retrieval is relatively new, there are only 40 queries in NTCIR-12 that can be compared with other published systems. However, these queries are well designed to cover a variety of math expressions of varying complexity. There are 20 queries containing wildcards in this task (using the wildcard specifier \(\backslash \)qvar to match arbitrary subexpressions or symbols, e.g., query “\(\backslash \)qvar{a}\(^2\) + \(\backslash \)qvar{b}\(^3\)” can match “\(x^2 + (y + 1)^3\)”). We add support for wildcards by simply treating internal nodes (each representing a rooted subexpression) of formulas as additional “leaves” (by ignoring their descendants), and the wildcard specifiers in a query are treated as normal leaves to match those indexed wildcard paths.
Since the corpus of NTCIR-12 is not large enough to show the full impact of pruning, we also evaluate query run times on a corpus containing over 1 million math-related documents/threads from the Math StackExchange (MSE) Q&A website,^{3} running the same query set from NTCIR-12. Run times are shown for the posting list merging stage (e.g., time for parsing the query into an OPT is excluded) and, unless specified otherwise, posting lists are compressed and cached in memory. Each system had five independent runs, and we report results from the overall distribution. The resulting uncompressed indexes for the NTCIR-12 and MSE corpora are around 2 GB and 16 GB in size, with 961,604 and 5,764,326 posting lists respectively. The (min, max, mean, standard deviation) for posting list lengths are (1, 262309, 16.95, 737.84) and (1, 7916296, 73.74, 9736.72).
Table 1 reports run time statistics. Non-pruning (exhaustive search) baselines with K = 100 are also compared here. Almost consistently, the GBP-LEN strategy achieves the best efficiency with smaller variance. This is expected, since GBP-LEN models the skipping possibility better than GBP-NUM. Although GBP-NUM gives a tighter theoretical upperbound than MaxRef, it only maximizes the number of posting lists in the non-requirement set and may lead to bad performance when these posting lists are short.
Table 1. Query merge time performance (in milliseconds) for different strategies.
Dataset | K | Strategy | Non-wildcards \(\mu \) | Non-wildcards \(\sigma \) | Non-wildcards median | Non-wildcards min | Non-wildcards max | Wildcards \(\mu \) | Wildcards \(\sigma \) | Wildcards median | Wildcards min | Wildcards max |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Wiki | 100 | Baseline | 540.12 | 569.44 | 360.50 | 7.00 | 2238.00 | 426.73 | 383.47 | 225.50 | 8.00 | 1338.00 |
Wiki | 100 | MaxRef | 90.29 | 74.14 | 79.00 | 3.00 | 312.00 | 145.50 | 121.19 | 136.00 | 7.00 | 573.00 |
Wiki | 100 | GBP-NUM | 84.90 | 80.44 | 52.50 | 3.00 | 321.00 | 138.82 | 102.55 | 135.00 | 9.00 | 428.00 |
Wiki | 100 | GBP-LEN | 67.49 | 61.40 | 45.00 | 2.00 | 218.00 | 125.27 | 97.28 | 103.50 | 9.00 | 404.00 |
Wiki | 200 | MaxRef | 107.71 | 82.64 | 102.00 | 5.00 | 322.00 | 160.10 | 121.40 | 149.00 | 9.00 | 583.00 |
Wiki | 200 | GBP-NUM | 105.34 | 99.51 | 71.50 | 5.00 | 357.00 | 155.52 | 110.61 | 153.00 | 8.00 | 479.00 |
Wiki | 200 | GBP-LEN | 89.63 | 83.20 | 62.00 | 5.00 | 330.00 | 142.78 | 103.11 | 143.50 | 9.00 | 446.00 |
Wiki | 1000 | MaxRef | 154.51 | 93.75 | 157.50 | 6.00 | 361.00 | 211.86 | 140.01 | 186.00 | 10.00 | 662.00 |
Wiki | 1000 | GBP-NUM | 159.80 | 143.70 | 120.50 | 6.00 | 626.00 | 208.91 | 136.42 | 178.50 | 10.00 | 591.00 |
Wiki | 1000 | GBP-LEN | 144.25 | 126.95 | 105.00 | 6.00 | 622.00 | 195.70 | 122.25 | 176.00 | 9.00 | 536.00 |
MSE | 100 | Baseline | 15134.10 | 15186.78 | 11161.00 | 157.00 | 55499.00 | 13450.57 | 12554.19 | 7075.50 | 304.00 | 47513.00 |
MSE | 100 | MaxRef | 1083.23 | 1274.23 | 745.50 | 28.00 | 5922.00 | 3188.66 | 2458.91 | 2925.00 | 85.00 | 10412.00 |
MSE | 100 | GBP-NUM | 1202.24 | 1240.21 | 815.00 | 37.00 | 4987.00 | 2943.79 | 2025.96 | 2987.00 | 84.00 | 8775.00 |
MSE | 100 | GBP-LEN | 562.83 | 635.26 | 382.50 | 24.00 | 2313.00 | 2257.95 | 1491.59 | 2346.50 | 86.00 | 4494.00 |
MSE | 200 | MaxRef | 1261.21 | 1368.93 | 1012.50 | 30.00 | 6439.00 | 3416.77 | 2753.09 | 3032.50 | 160.00 | 12412.00 |
MSE | 200 | GBP-NUM | 1378.19 | 1398.08 | 998.50 | 39.00 | 5863.00 | 3174.93 | 2283.05 | 3125.00 | 159.00 | 10099.00 |
MSE | 200 | GBP-LEN | 697.32 | 739.11 | 478.00 | 27.00 | 2925.00 | 2504.90 | 1683.16 | 2382.50 | 159.00 | 6049.00 |
MSE | 1000 | MaxRef | 2030.05 | 1746.17 | 1796.50 | 53.00 | 7816.00 | 4123.26 | 3510.01 | 3473.00 | 287.00 | 16981.00 |
MSE | 1000 | GBP-NUM | 1952.52 | 1746.05 | 1530.50 | 60.00 | 7197.00 | 3786.89 | 2744.99 | 3493.50 | 281.00 | 11323.00 |
MSE | 1000 | GBP-LEN | 1217.16 | 1083.53 | 764.50 | 47.00 | 3756.00 | 3304.69 | 2403.09 | 2812.00 | 285.00 | 9895.00 |
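To make the gap between pruning and exhaustive search concrete, the mean merge times reported above can be turned into speedup factors. The snippet is only illustrative arithmetic over four of the reported non-wildcard means at K = 100; the dictionary layout and function name are assumptions.

```python
# Mean merge times (ms) for non-wildcard queries at K = 100, copied from the table
mean_ms = {
    ("Wiki", "Baseline"): 540.12,
    ("Wiki", "GBP-LEN"): 67.49,
    ("MSE", "Baseline"): 15134.10,
    ("MSE", "GBP-LEN"): 562.83,
}


def speedup(dataset, strategy):
    """Ratio of the exhaustive baseline's mean time to the strategy's mean time."""
    return mean_ms[(dataset, "Baseline")] / mean_ms[(dataset, strategy)]


for dataset in ("Wiki", "MSE"):
    print(f"{dataset}: {speedup(dataset, 'GBP-LEN'):.1f}x")
```

On these means GBP-LEN is about 8 times faster than the baseline on the Wiki dataset and about 27 times faster on the larger MSE corpus, which is consistent with the expectation that pruning pays off more as posting lists grow.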
Secondly, we have compared our system's effectiveness (Fig. 4) and efficiency (Fig. 5) with Tangent-S [5], MCAT [11] and our baseline system without pruning [23], which are all structure-based formula search engines that have obtained the best published Bpref scores on the NTCIR-12 dataset. In addition, the ICST system [7] also obtains effective results for the math and text mixed task, but it is trained on the previous Wiki dataset and the system is currently not available.
All systems are evaluated in a single thread for top-1000 results. We use our best-performing strategy, GBP-LEN, in two versions: an on-disk version with posting lists uncompressed and always read from disk, and an in-memory version with compression. For the baseline system, only the 20 non-wildcard queries are reported because it does not support wildcards. We compare the baseline's best-performing run (base-best), which uses costly multiple tree matching, as well as its specialized version (base-opd-only), which considers only the largest matched tree width (see Eq. 2). Tangent-S has a few outliers as a result of the costly alignment algorithm it uses to rerank structure and find the Maximum Subtree Similarity [22]; its non-linear complexity makes it expensive for some long queries (especially in the wildcard case). MCAT reportedly has a median query execution time of around 25 s, using a server machine and multi-threading [11]. We therefore remove the Tangent-S outliers and MCAT from the runtime boxplot. For space, we only include the faster base-opd-only baseline in Fig. 5.
We outperform Tangent-S in efficiency even if we exclude its outlier queries, with higher Bpref in non-wildcard fully relevant results. Our efficiency is also better than that of the baseline system, even though the latter only considers less complex non-wildcard queries. However, our overall effectiveness is skewed by poor performance on wildcard queries, because other systems introduce a much more expensive phase to boost accuracy when handling inherently difficult “structural wildcards.”
Our pruning strategies are rank-safe (the pruning and exhaustive versions show the same Bpref scores), but there is a minor Bpref difference between our system and the baseline (base-opd-only). This is due to parser changes we have applied to support wildcards (e.g., handling a single left brace array as seen in a wildcard query), which happen to slightly improve accuracy in partially relevant cases.
7 Conclusion
We have presented rank-safe dynamic pruning strategies that produce an upperbound estimation of structural similarity in order to speed up formula search using subtree matching. Our dynamic pruning strategies and specialized inverted index differ from traditional linear text search pruning methods in that they further associate the query structure representation with posting lists. Our results show that we can obtain substantial improvement in efficiency over the baseline model, while still generating highly relevant non-wildcard search results. Our approach can process a diverse set of structural queries in real time.
Footnotes
- 1. The NTCIR-12 Wikipedia Formula Browsing Task.
- 2. Source code: https://github.com/approach0/search-engine/tree/ecir2020.
- 3. MSE corpus: https://www.cs.rit.edu/~dprl/data/mse-corpus.tar.gz.
References
- 1. Anh, V.N., Moffat, A.: Pruned query evaluation using pre-computed impacts. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 372–379. ACM (2006)
- 2. Mallia, A., Siedlaczek, M., Suel, T.: An experimental study of index compression and DAAT query processing methods. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) ECIR 2019. LNCS, vol. 11437, pp. 353–368. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15712-8_23
- 3. Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 426–434. ACM (2003)
- 4. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 25–32. ACM (2004)
- 5. Davila, K., Zanibbi, R.: Layout and semantics: combining representations for mathematical formula search. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1165–1168. ACM (2017)
- 6. Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 993–1002. ACM (2011)
- 7. Gao, L., Yuan, K., Wang, Y., Jiang, Z., Tang, Z.: The math retrieval system of ICST for NTCIR-12 MathIR task. In: NTCIR (2016)
- 8. Jonassen, S., Bratsberg, S.E.: Efficient compressed inverted index skipping for disjunctive text-queries. In: Clough, P., et al. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 530–542. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20161-5_53
- 9. Kamali, S., Tompa, F.W.: Structural similarity search for mathematics retrieval. In: Carette, J., Aspinall, D., Lange, C., Sojka, P., Windsteiger, W. (eds.) CICM 2013. LNCS (LNAI), vol. 7961, pp. 246–262. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39320-4_16
- 10. Davila, K., Joshi, R., Setlur, S., Govindaraju, V., Zanibbi, R.: Tangent-V: math formula image search using line-of-sight graphs. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) ECIR 2019. LNCS, vol. 11437, pp. 681–695. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15712-8_44
- 11. Kristianto, G.Y., Topic, G., Aizawa, A.: MCAT math retrieval system for NTCIR-12 MathIR task. In: NTCIR (2016)
- 12. Lin, X., Gao, L., Hu, X., Tang, Z., Xiao, Y., Liu, X.: A mathematics retrieval system for formulae in layout presentations. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014. ACM, New York (2014)
- 13. Macdonald, C., Ounis, I., Tonellotto, N.: Upper-bound approximations for dynamic pruning. ACM Trans. Inf. Syst. (TOIS) 29(4), 17 (2011)
- 14. Miller, B.R., Youssef, A.: Technical aspects of the digital library of mathematical functions. Ann. Math. Artif. Intell. 38(1–3), 121–136 (2003). https://doi.org/10.1023/A:1022967814992
- 15. Shan, D., Ding, S., He, J., Yan, H., Li, X.: Optimized top-k processing with global page scores on block-max indexes. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM 2012, pp. 423–432. ACM, New York (2012)
- 16. Sojka, P., Líška, M.: Indexing and searching mathematics in digital libraries. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F. (eds.) CICM 2011. LNCS (LNAI), vol. 6824, pp. 228–243. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22673-1_16
- 17. Strohman, T., Turtle, H., Croft, W.B.: Optimization strategies for complex queries. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 219–225. ACM (2005)
- 18. Tonellotto, N., Macdonald, C., Ounis, I., et al.: Efficient query processing for scalable web search. Found. Trends Inf. Retr. 12(4–5), 319–500 (2018)
- 19. Turtle, H., Flood, J.: Query evaluation: strategies and optimizations. Inf. Process. Manag. 31(6), 831–850 (1995)
- 20. Zanibbi, R., Aizawa, A., Kohlhase, M., Ounis, I., Topic, G., Davila, K.: NTCIR-12 MathIR task overview. In: NTCIR (2016)
- 21. Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. Int. J. Doc. Anal. Recogn. 15(4), 331–357 (2012)
- 22. Zanibbi, R., Davila, K., Kane, A., Tompa, F.W.: Multi-stage math formula search: using appearance-based similarity metrics at scale. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016. ACM, New York (2016)
- 23. Zhong, W., Zanibbi, R.: Structural similarity search for formulas using leaf-root paths in operator subtrees. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) ECIR 2019. LNCS, vol. 11437, pp. 116–129. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15712-8_8