XDist: an effective XML keyword search system with re-ranking model based on keyword distribution

  • Research Paper
  • Science China Information Sciences

Abstract

Keyword search lets web users access XML data without understanding complex data schemas. However, the inherent ambiguity of keyword queries makes it difficult to select the relevant results that match the keywords. To address this problem, researchers have devoted much effort to ranking models that distinguish relevant from irrelevant passages, such as the widely cited TF*IDF and BM25. These statistics-based ranking methods mostly rely on term frequency, inverse document frequency, and length as ranking factors, and ignore the distribution of, and connections between, different keywords. As a result, they fail to recognize irrelevant results that have high term frequency, which limits their performance. In this paper, we propose a new search system, XDist, to address these problems. XDist first uses the semantic query model maximal lowest common ancestor (MAXLCA) to identify the results returned for a given query, and then ranks these candidate results by BM25. In particular, XDist re-ranks the top several results with a combined distribution measurement (CDM), which considers four criteria: term proximity, intersection of keyword classes, degree of integration among keywords, and quantity variance of keywords. The weights of the four measures in CDM are trained by a listwise learning-to-optimize method. Experimental results on the INEX evaluation platform show that the CDM re-ranking method improves the performance of the BM25 baseline by 22% under iP[0.01] and 18% under MAiP. In addition, the semantic model MAXLCA and the search engine XDist each perform best in their respective fields.
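
The rank-then-re-rank pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the four CDM measures or how they are combined, so the sketch assumes a plain weighted sum and supplies only one stand-in measure, a term-proximity score based on the shortest window covering all query terms. The function names (`bm25_scores`, `min_cover_span`, `proximity`, `rerank_top_k`), the BM25 parameters `k1` and `b`, the toy documents, and the fixed weight are all hypothetical; in XDist the weights are learned, and the candidates come from MAXLCA rather than a flat document list.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    terms = set(query)
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return scores


def min_cover_span(doc, query):
    """Length of the shortest window holding every query term, or None."""
    terms = set(query)
    counts = Counter()
    have, left, best = 0, 0, None
    for right, tok in enumerate(doc):
        if tok in terms:
            counts[tok] += 1
            if counts[tok] == 1:
                have += 1
        # Shrink the window from the left while it still covers all terms.
        while have == len(terms):
            width = right - left + 1
            if best is None or width < best:
                best = width
            if doc[left] in terms:
                counts[doc[left]] -= 1
                if counts[doc[left]] == 0:
                    have -= 1
            left += 1
    return best


def proximity(doc, query):
    """Stand-in proximity measure (assumption): 1.0 when terms are adjacent."""
    span = min_cover_span(doc, query)
    return 0.0 if span is None else len(set(query)) / span


def rerank_top_k(docs, query, measures, weights, k=2):
    """Rank by BM25, then reorder only the top k by a weighted sum of measures."""
    base = bm25_scores(docs, query)
    order = sorted(range(len(docs)), key=lambda i: base[i], reverse=True)

    def combined(i):
        return sum(w * m(docs[i], query) for m, w in zip(measures, weights))

    return sorted(order[:k], key=combined, reverse=True) + order[k:]


# Toy data: document 0 repeats the keywords often but keeps them apart.
docs = [
    ["xml", "xml", "xml", "schema", "notes", "keyword", "keyword", "keyword"],
    ["an", "xml", "keyword", "search", "system", "demo"],
    ["other", "text"],
]
query = ["xml", "keyword"]
print(rerank_top_k(docs, query, [proximity], [1.0], k=2))
```

On this toy input BM25 alone places document 0 first because of its high term frequency, while the proximity-based re-ranking of the top two results promotes document 1, where the keywords are adjacent. This is the kind of case the abstract describes, in which frequency-only ranking favors a result whose keywords are poorly connected.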

Author information

Corresponding author

Correspondence to ZhiHong Deng.

Cite this article

Gao, N., Deng, Z. & Lü, S. XDist: an effective XML keyword search system with re-ranking model based on keyword distribution. Sci. China Inf. Sci. 57, 1–17 (2014). https://doi.org/10.1007/s11432-012-4781-6

  • DOI: https://doi.org/10.1007/s11432-012-4781-6
