XDist: an effective XML keyword search system with re-ranking model based on keyword distribution

Gao, Ning; Deng, ZhiHong; Lü, ShengLong

doi:10.1007/s11432-012-4781-6

Research Paper
Published: 22 April 2014

Volume 57, pages 1–17, (2014)
Science China Information Sciences

Ning Gao¹,², ZhiHong Deng¹ & ShengLong Lü¹

Abstract

Keyword search enables web users to easily access XML data without understanding complex data schemas. However, the inherent ambiguity of keyword search makes it difficult to select relevant results that match the keywords. To address this problem, researchers have devoted much effort to ranking models that distinguish relevant from irrelevant passages, such as the widely cited TF*IDF and BM25. However, these statistics-based ranking methods mostly consider term frequency, inverse document frequency and length as ranking factors, and ignore how the keywords are distributed and how they connect to one another. Hence, these widely used methods cannot recognize irrelevant results that have high term frequency, which limits their performance. In this paper, a new search system, XDist, is proposed to address these problems. In XDist, we first use the semantic query model maximal lowest common ancestor (MAXLCA) to identify the results returned for a given query, and these candidate results are then ranked by BM25. In particular, XDist re-ranks the top several results using a combined distribution measurement (CDM), which considers four criteria: term proximity, intersection of keyword classes, degree of integration among keywords, and quantity variance of keywords. The weights of the four measures in CDM are trained by a listwise learning-to-optimize method. Experimental results on the INEX evaluation platform show that the CDM re-ranking method improves the performance of the BM25 baseline by 22% under iP[0.01] and 18% under MAiP. In addition, the semantic model MAXLCA and the search engine XDist perform best in their respective fields.
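The two-stage ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the BM25 function follows the standard Okapi formulation with common default parameters, and the re-ranking step assumes that CDM is a weighted sum of four per-result scores, which the abstract suggests but does not state. The function names, the feature tuples and the weights are hypothetical placeholders; the actual definitions of the four measures and the trained weights are given only in the full paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized candidate against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    terms = set(query)
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return scores


def rerank_top_k(ranked_ids, features, weights, k):
    """Re-order only the first k results by a weighted sum of feature scores.

    ASSUMPTION: a linear combination stands in for CDM here; `features`
    maps a result id to four hypothetical scores (term proximity, class
    intersection, integration degree, quantity variance).
    """
    head, tail = ranked_ids[:k], ranked_ids[k:]

    def combined(doc_id):
        return sum(w * f for w, f in zip(weights, features[doc_id]))

    # Results below the cut-off keep their BM25 order.
    return sorted(head, key=combined, reverse=True) + tail


if __name__ == "__main__":
    # Toy candidates standing in for the elements returned by MAXLCA.
    docs = [
        ["xml", "keyword", "search", "xml"],
        ["xml", "xml", "xml", "ranking"],
        ["keyword", "search"],
        ["database", "index"],
    ]
    scores = bm25_scores(["xml", "keyword"], docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Placeholder feature values and uniform weights, for illustration only.
    features = {i: (0.5, 0.5, 0.5, 0.5) for i in ranked}
    print(rerank_top_k(ranked, features, (0.25, 0.25, 0.25, 0.25), k=2))
```

Restricting the second stage to the top k results keeps the more expensive distribution features off the long tail of candidates, which matches the abstract's statement that only "the top several results" are re-ranked.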

Author information

Authors and Affiliations

Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Ning Gao, ZhiHong Deng & ShengLong Lü
College of Information Studies, University of Maryland, College Park, MD, 20742, USA
Ning Gao

Corresponding author

Correspondence to ZhiHong Deng.

About this article

Cite this article

Gao, N., Deng, Z. & Lü, S. XDist: an effective XML keyword search system with re-ranking model based on keyword distribution. Sci. China Inf. Sci. 57, 1–17 (2014). https://doi.org/10.1007/s11432-012-4781-6

Received: 17 December 2013
Accepted: 24 February 2014
Published: 22 April 2014
Issue Date: May 2014
DOI: https://doi.org/10.1007/s11432-012-4781-6
