Overview of ARQMath-2 (2021): Second CLEF Lab on Answer Retrieval for Questions on Math

This paper provides an overview of the second year of the Answer Retrieval for Questions on Math (ARQMath-2) lab, run as part of CLEF 2021. The goal of ARQMath is to advance techniques for mathematical information retrieval, in particular retrieving answers to mathematical questions (Task 1), and formula retrieval (Task 2). Eleven groups participated in ARQMath-2, submitting 36 runs for Task 1 and 17 runs for Task 2. The results suggest that some combination of experience with the task design and the training data available from ARQMath-1 was beneficial, with greater improvements in ARQMath-2 relative to baselines for both Task 1 and Task 2 than for ARQMath-1 relative to those same baselines. Tasks, topics, evaluation protocols, and results for each task are presented in this lab overview.

    We thank Deyan Ginev and Vit Novotny for helping reduce LaTeXML failures: for ARQMath-1, conversion failures affected 8% of SLTs and 10% of OPTs.

    We thank Frank Tompa for sharing this suggestion at CLEF 2020.

    Participating systems did not have access to this information.

    In ARQMath-1, all topics had links to at least one duplicate or related post that were available to the organizers.

    H+M binarization corresponds to the definition of relevance usually used in the Text Retrieval Conference (TREC). The TREC definition is “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant. Only binary judgments (‘relevant’ or ‘not relevant’) are made, and a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document).” (source:
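A minimal sketch of H+M binarization, for illustration only. It assumes graded judgments are coded as integers (3 = high, 2 = medium, 1 = low, 0 = not relevant); that coding, the function name `binarize_hm`, and the post ids are assumptions made for this example and do not come from the text above.

```python
# Assumed numeric coding of graded judgments (not stated in this text):
# 3 = high, 2 = medium, 1 = low, 0 = not relevant.
HIGH, MEDIUM, LOW, NOT_RELEVANT = 3, 2, 1, 0


def binarize_hm(grade: int) -> int:
    """Return 1 if the grade is high or medium (H+M), else 0."""
    return 1 if grade >= MEDIUM else 0


# Hypothetical judgments for one topic: post id -> graded relevance.
graded = {"post_a": HIGH, "post_b": MEDIUM, "post_c": LOW, "post_d": NOT_RELEVANT}
binary = {post: binarize_hm(grade) for post, grade in graded.items()}
print(binary)
```

Under this sketch, low-relevance posts are treated as non-relevant, which is what distinguishes H+M from a more lenient binarization that would also count them.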

    One assessor (with id 7) was not able to continue assessment.

    Two of the four dual-assessed topics had no high or medium relevant answers found by either assessor.

    Pooling to at least depth 10 ensures that there are no unjudged posts above rank 10 for any baseline, primary, or alternative run. Note that \(P^{\prime}\)@10 cannot achieve a value of 1 because some topics have fewer than 10 relevant posts.
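The ceiling on this measure can be illustrated with a short sketch. It assumes the prime notation means that unjudged posts are removed from a ranking before scoring, and that precision is divided by the cutoff k rather than by the number of posts actually returned; both are assumptions made for this example, as are the function name `p_prime_at_k` and all post ids.

```python
from typing import Dict, List


def p_prime_at_k(ranking: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    """P'@k: precision at k after dropping unjudged posts from the ranking.

    `judgments` maps post id -> binary relevance (1 or 0). Ids absent from
    it are treated as unjudged and removed before the cutoff is applied.
    """
    judged_only = [post for post in ranking if post in judgments]
    top_k = judged_only[:k]
    # Divide by k, not len(top_k), so short lists are not rewarded.
    return sum(judgments[post] for post in top_k) / k


# Hypothetical topic with only 4 relevant posts among the judged posts.
judgments = {f"rel{i}": 1 for i in range(4)}
judgments.update({f"non{i}": 0 for i in range(20)})

# Best possible ranking: every relevant post first, one unjudged post mixed in.
ideal = ["rel0", "unjudged_x", "rel1", "rel2", "rel3"] + [f"non{i}" for i in range(20)]
print(p_prime_at_k(ideal, judgments))  # 0.4
```

Even this ideal ranking scores 0.4, because only four of the ten judged positions can be filled with relevant posts.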

    This differs from the approach used for ARQMath-1, when only submitted formula instances were clustered. For ARQMath-2 the full formula collection was clustered to facilitate post hoc use of the resulting test collection.

    In ARQMath-2, Task 1 pools were not used to seed Task 2 pools.

    For ARQMath-1, 92% of formula instances had an SLT representation; for ARQMath-2 we reparsed the collection and improved this to 99.9%.

    As mentioned in Sect. 3, a relatively small number of formulae per topic had incorrectly generated visual ids. In 6 cases, assessors indicated that a pooled formula was ‘not matching’ the other formulae in the hits grouped under the same visual id, rather than assigning a relevance score to the formula.


We thank our student assessors from RIT and St. John Fisher College: Josh Anglum, Dominick Banasick, Aubrey Marcsisin, Nathalie Petruzelli, Siegfried Porterfield, Chase Shuster, and Freddy Stock. This material is based upon work supported by the National Science Foundation (USA) under Grant No. IIS-1717997 and the Alfred P. Sloan Foundation under Grant No. G-2017-9827.

