Abstract
Document retrieval enables users to find the documents they need accurately and quickly. To meet efficiency requirements, prevalent deep neural methods adopt a representation-based matching paradigm, which saves online matching time by pre-computing and storing document representations offline. However, this paradigm consumes vast storage space, especially when documents are stored as word-grained representations. To tackle this, we present TGTR, a Topic-Grained Text Representation-based model for document retrieval. Following the representation-based matching paradigm, TGTR stores document representations offline to ensure retrieval efficiency, while significantly reducing storage requirements by using novel topic-grained representations rather than traditional word-grained ones. Experimental results on TREC CAR and MS MARCO demonstrate that TGTR is consistently competitive with word-grained baselines in retrieval accuracy while requiring less than one tenth of their storage space. Moreover, TGTR substantially surpasses global-grained baselines in retrieval accuracy.
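The storage gap between the two granularities follows directly from how many vectors each document contributes to the offline index. The sketch below illustrates this with hypothetical numbers (corpus size, document length, vector dimension, and topics per document are all illustrative assumptions, not the paper's actual settings): a word-grained index stores one vector per token, while a topic-grained index stores only a handful of vectors per document.

```python
import numpy as np

# Hypothetical settings for illustration only (not the paper's configuration):
# 1M documents, ~150 tokens each, 128-dim float32 vectors,
# and ~8 topic-grained vectors per document.
num_docs = 1_000_000
tokens_per_doc = 150
topics_per_doc = 8
dim = 128
bytes_per_float = np.dtype(np.float32).itemsize  # 4 bytes

# Word-grained: one vector per token of every document.
word_grained_bytes = num_docs * tokens_per_doc * dim * bytes_per_float
# Topic-grained: a small fixed number of vectors per document.
topic_grained_bytes = num_docs * topics_per_doc * dim * bytes_per_float

print(f"word-grained index:  {word_grained_bytes / 1e9:.1f} GB")   # 76.8 GB
print(f"topic-grained index: {topic_grained_bytes / 1e9:.1f} GB")  # 4.1 GB
print(f"ratio: {topic_grained_bytes / word_grained_bytes:.3f}")    # 0.053
```

Under these assumed numbers the topic-grained index needs about 5% of the word-grained storage, consistent in spirit with the "less than 1/10" reduction the abstract reports.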
This work was supported by the Hunan Provincial Natural Science Foundation (Nos. 2022JJ30668 and 2022JJ30046).
M. Du and S. Li—Contributed equally to this work.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Du, M. et al. (2022). Topic-Grained Text Representation-Based Model for Document Retrieval. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13531. Springer, Cham. https://doi.org/10.1007/978-3-031-15934-3_64
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15933-6
Online ISBN: 978-3-031-15934-3
eBook Packages: Computer Science (R0)