Abstract
Document retrieval enables users to find the documents they need accurately and quickly. To meet efficiency requirements, prevalent deep neural methods adopt a representation-based matching paradigm, which saves online matching time by pre-computing and storing document representations offline. However, this paradigm consumes vast storage space, especially when documents are stored as word-grained representations. To tackle this, we present TGTR, a Topic-Grained Text Representation-based model for document retrieval. Following the representation-based matching paradigm, TGTR stores document representations offline to ensure retrieval efficiency, while significantly reducing storage requirements by using novel topic-grained representations rather than traditional word-grained ones. Experimental results on TREC CAR and MS MARCO demonstrate that TGTR is consistently competitive with word-grained baselines in retrieval accuracy while requiring less than one tenth of their storage space. Moreover, TGTR substantially surpasses global-grained baselines in retrieval accuracy.
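The storage gap between the two granularities follows directly from how many vectors each document contributes to the offline index. The sketch below illustrates this with hypothetical numbers (corpus size, document length, vector dimension, and topics per document are all illustrative assumptions, not the paper's actual settings): a word-grained index stores one vector per token, while a topic-grained index stores only a handful of vectors per document.

```python
import numpy as np

# Hypothetical settings for illustration only (not the paper's configuration):
# 1M documents, ~150 tokens each, 128-dim float32 vectors,
# and ~8 topic-grained vectors per document.
num_docs = 1_000_000
tokens_per_doc = 150
topics_per_doc = 8
dim = 128
bytes_per_float = np.dtype(np.float32).itemsize  # 4 bytes

# Word-grained: one vector per token of every document.
word_grained_bytes = num_docs * tokens_per_doc * dim * bytes_per_float
# Topic-grained: a small fixed number of vectors per document.
topic_grained_bytes = num_docs * topics_per_doc * dim * bytes_per_float

print(f"word-grained index:  {word_grained_bytes / 1e9:.1f} GB")   # 76.8 GB
print(f"topic-grained index: {topic_grained_bytes / 1e9:.1f} GB")  # 4.1 GB
print(f"ratio: {topic_grained_bytes / word_grained_bytes:.3f}")    # 0.053
```

Under these assumed numbers the topic-grained index needs about 5% of the word-grained storage, consistent in spirit with the "less than 1/10" reduction the abstract reports.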
This work was supported by the Hunan Provincial Natural Science Foundation (Nos. 2022JJ30668 and 2022JJ30046).
M. Du and S. Li—Contributed equally to this work.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Du, M. et al. (2022). Topic-Grained Text Representation-Based Model for Document Retrieval. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13531. Springer, Cham. https://doi.org/10.1007/978-3-031-15934-3_64
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15933-6
Online ISBN: 978-3-031-15934-3
eBook Packages: Computer Science (R0)