Abstract
Keyword search lets web users access XML data without understanding complex data schemas. However, the inherent ambiguity of keyword queries makes it difficult to select the results that are truly relevant to the keywords. To address this problem, researchers have devoted much effort to ranking models that distinguish relevant from irrelevant passages, such as the widely cited TF*IDF and BM25. These statistics-based ranking methods mostly consider term frequency, inverse document frequency and length as ranking factors, and ignore how the different keywords are distributed and connected within a result. As a consequence, they cannot recognize irrelevant results that have high term frequency, which limits their performance. In this paper, we propose a new search system, XDist, to address these problems. XDist first uses the semantic query model maximal lowest common ancestor (MAXLCA) to identify the results returned for a given query, and then ranks these candidate results by BM25. In particular, XDist re-ranks the top several results by a combined distribution measurement (CDM), which considers four criteria: term proximity, intersection of keyword classes, degree of integration among keywords, and quantity variance of keywords. The weights of the four measures in CDM are trained by a listwise learning-to-optimize method. Experimental results on the INEX evaluation platform show that re-ranking with CDM improves on the BM25 baseline by 22% under iP[0.01] and 18% under MAiP. In addition, the semantic model MAXLCA and the search engine XDist each perform best in their respective fields.
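The two-stage ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give formulas for the four CDM criteria, so the sketch assumes each criterion has already been computed as a score where higher is better, and it assumes CDM is a plain weighted sum. The names `Candidate`, `CDM_FEATURES`, `cdm_score` and `rerank_top_k` are hypothetical, as are all numeric values.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical keys for the four CDM criteria named in the abstract.
CDM_FEATURES = (
    "proximity",
    "class_intersection",
    "integration",
    "quantity_variance",
)


@dataclass
class Candidate:
    element_id: str
    bm25: float                 # first-stage score (assumed given)
    features: Dict[str, float]  # assumed precomputed, higher is better


def cdm_score(c: Candidate, weights: Dict[str, float]) -> float:
    """Weighted sum of the four criteria (an assumed form of CDM)."""
    return sum(weights[name] * c.features[name] for name in CDM_FEATURES)


def rerank_top_k(candidates: List[Candidate],
                 weights: Dict[str, float],
                 k: int) -> List[Candidate]:
    """Rank by BM25, then reorder only the first k results by CDM."""
    ranked = sorted(candidates, key=lambda c: c.bm25, reverse=True)
    # sorted() is stable, so CDM ties keep their BM25 order.
    head = sorted(ranked[:k], key=lambda c: cdm_score(c, weights), reverse=True)
    return head + ranked[k:]


# Illustrative data only: "a" has the highest BM25 score but weak
# distribution features, so it drops below "b" after re-ranking.
weights = {name: 0.25 for name in CDM_FEATURES}
candidates = [
    Candidate("a", 9.0, {name: 0.1 for name in CDM_FEATURES}),
    Candidate("b", 8.0, {name: 0.9 for name in CDM_FEATURES}),
    Candidate("c", 1.0, {name: 1.0 for name in CDM_FEATURES}),
]
order = [c.element_id for c in rerank_top_k(candidates, weights, k=2)]
print(order)
```

Only the head of the list is reordered, so candidate "c" stays in third place even though its feature scores are the highest. In the paper the weights are learned by a listwise method rather than fixed by hand as they are here.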
Cite this article
Gao, N., Deng, Z. & Lü, S. XDist: an effective XML keyword search system with re-ranking model based on keyword distribution. Sci. China Inf. Sci. 57, 1–17 (2014). https://doi.org/10.1007/s11432-012-4781-6