
Information Retrieval, Volume 9, Issue 1, pp 5–31

Two-stage statistical language models for text database selection

  • Hui Yang
  • Minjie Zhang
Article

Abstract

As the number and diversity of distributed Web databases on the Internet increase exponentially, it is difficult for users to know which databases are appropriate to search. Given database language models that describe the content of each database, database selection services can help users locate the databases relevant to their information needs. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea behind the approach is that, for databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and the databases are then ranked by their similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent the inaccuracy caused by word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art database selection approaches.
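
The ranking idea described above can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract does not give the estimation formulas, so the code assumes simple linear interpolation (in the style of Jelinek-Mercer smoothing) of a database model with a category model and a global model, and ranks databases by query log-likelihood. The function names, the interpolation weights `lam` and `mu`, the `floor` value, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def unigram_model(text):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_prob(word, db_model, cat_model, global_model,
                  lam=0.5, mu=0.3, floor=1e-9):
    """Assumed smoothing: interpolate database, category and global estimates.

    The floor keeps the probability positive for words unseen everywhere,
    so that the logarithm below is always defined.
    """
    background = ((1 - mu) * cat_model.get(word, 0.0)
                  + mu * global_model.get(word, floor))
    return (1 - lam) * db_model.get(word, 0.0) + lam * background


def rank_databases(query, samples, category_of, lam=0.5, mu=0.3):
    """Rank database names by smoothed query log-likelihood, best first.

    samples:     database name -> sample text describing its content
    category_of: database name -> topic category label
    """
    global_model = unigram_model(" ".join(samples.values()))

    # Pool the samples of each category to estimate a category model.
    pooled = {}
    for name, text in samples.items():
        pooled.setdefault(category_of[name], []).append(text)
    cat_models = {c: unigram_model(" ".join(t)) for c, t in pooled.items()}

    scores = {}
    for name, text in samples.items():
        db_model = unigram_model(text)
        cat_model = cat_models[category_of[name]]
        scores[name] = sum(
            math.log(smoothed_prob(w, db_model, cat_model, global_model,
                                   lam, mu))
            for w in query.lower().split()
        )
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    samples = {
        "med": "heart disease treatment heart surgery",
        "sports": "football match score goal",
        "health": "diet exercise heart health",
    }
    category_of = {"med": "Health", "health": "Health", "sports": "Sports"}
    print(rank_databases("heart surgery", samples, category_of))
```

In this toy setting, the "health" database never mentions "surgery", yet it still receives a non-zero probability for that word from its category model. This is the kind of sparseness problem that smoothing is meant to address.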

Keywords

Database language model; Text database selection; Distributed information retrieval; Hierarchical topics; Statistical language modeling; Query expansion



Copyright information

© Springer Science + Business Media, Inc. 2006

Authors and Affiliations

  1. School of Information Technology and Computer Science, University of Wollongong, Wollongong, Australia
