Advertisement

The VLDB Journal

, Volume 17, Issue 1, pp 5–37 | Cite as

Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes

Special Issue Paper

Abstract

This paper presents a probabilistic relational modelling (implementation) of the major probabilistic retrieval models. Such a high-level implementation is useful since it supports the ranking of any object, it allows for the reasoning across structured and unstructured data, and it gives the software (knowledge) engineer control over ranking and thus supports customisation. The contributions of this paper include the specification of probabilistic SQL (PSQL) and probabilistic relational algebra (PRA), a new relational operator for probability estimation (the relational Bayes), the probabilistic relational modelling of retrieval models, a comparison of modelling retrieval with traditional SQL versus modelling retrieval with PSQL, and a comparison of the performance of probability estimation with traditional SQL versus PSQL. The main findings are that the PSQL/PRA paradigm allows for the description of advanced retrieval models, is suitable for solving large-scale retrieval tasks, and outperforms traditional SQL in terms of abstraction and performance regarding probability estimation.

Keywords

Probabilistic relational modelling Retrieval models Probabilistic databases DB + IR integration 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., Croft, B., DeWitt, D., Franklin, M., Garcia-Molina, H., Gawlick, D., Gray, J., Haas, L., Halevy, A., Hellerstein, J., Ioannidis, Y., Kersten, M., Pazzani, M., Lesk, M., Maier, D., Naughton, J., Schek, H., Sellis, T., Silberschatz, A., Stonebraker, M., Snodgrass, R., Ullman, J., Weikum, G., Widom, J., Zdonik, S.: The lowell database research self assessment (2003)Google Scholar
  2. 2.
    Agrawal, S., Chaudhuri, S., Das, G.: Dbxplorer: a system for keyword-based search over relational databases. In: ICDE, pp. 5–16 (2002)Google Scholar
  3. 3.
    Agrawal, S., Chaudhuri, S., Das, G., Gionis, A.: Automated ranking of database query results. In: CIDR (2003)Google Scholar
  4. 4.
    Azzam, H., Roelleke, T.: Efficient processing of ontological queries. In: 2nd VLDB Workshop on Ontologies-based Techniques for Databases and Information Systems, Seoul (2006)Google Scholar
  5. 5.
    Amati G. and van Rijsbergen C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4): 357–389 CrossRefGoogle Scholar
  6. 6.
    Bosc P., Galibourg M. and Hamon G. (1988). Fuzzy querying with sql: extensions and implementation aspects. Fuzzy Sets Syst. 28(3): 333–349 MATHCrossRefMathSciNetGoogle Scholar
  7. 7.
    Barbara D., Garcia-Molina H. and Porter D. (1990). A probabilistic relational data model. In: Bancilhon, F., Thanos, C., and Tsichrizis, D. (eds) Advances in Database Technology – EDBT ’90, pp 60–74. Springer, Berlin CrossRefGoogle Scholar
  8. 8.
    Barbara D., Garcia-Molina H. and Porter D. (1992). The management of probabilistic data. IEEE Trans. Knowl. Data Eng. 4(5): 487–502 CrossRefGoogle Scholar
  9. 9.
    Berger, A., Lafferty, J.: Information retrieval as statistical translation. In: SIGIR (ed.) SIGIR ’99 Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pp. 222–229, ACM, New York (1999)Google Scholar
  10. 10.
    Bosc, P., Pivert, O.: Fuzzy queries and relational databases. In: Proceedings of the 1994 ACM Symposium on Applied Computing, pp. 170–174 ACM Press, New York (1994)Google Scholar
  11. 11.
    Chaudhuri, S., Das, G., Hristidis, V., Weikum, G.: Probabilistic ranking of database query results. In: VLDB pp. 888–899 (2004)Google Scholar
  12. 12.
    Chaudhuri S., Das G., Hristidis V. and Weikum G. (2006). Probabilistic information retrieval approach for ranking of database query results. ACM Trans. Database Syst. 31(3): 1134–1168 CrossRefGoogle Scholar
  13. 13.
    Croft W.B. and Harper D.J. (1979). Using probabilistic models of document retrieval without relevance information. J. Doc. 35: 285–295 Google Scholar
  14. 14.
    Cavallo, R., Pittarelli, M.: The theory of probabilistic databases. In: Proceedings of the 13th International Conference on Very Large Databases, pp. 71–81 Morgan Kaufman, Los Altos (1987)Google Scholar
  15. 15.
    Chaudhuri, S., Ramakrishnan, R., Weikum, G.: Integrating db and ir technologies: What is the sound of one hand clapping? In: CIDR, pp. 1–12 (2005)Google Scholar
  16. 16.
    Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: VLDB, pp. 864–875 (2004)Google Scholar
  17. 17.
    Dalvi, N.N., Suciu, D.: Answering queries from statistics and probabilistic views. In: VLDB, pp. 805–816 (2005)Google Scholar
  18. 18.
    de Vries, A., Roelleke, T.: Relevance information: a loss of entropy but a gain for idf? In: ACM SIGIR, Salvador, Brazil (2005)Google Scholar
  19. 19.
    Ercegovac, V., DeWitt, D.J., Ramakrishnan, R.: The texture benchmark: Measuring performance of text queries on a relational dbms. In: VLDB, pp. 313–324 (2005)Google Scholar
  20. 20.
    Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems. Addison-Wesley, Reading (2000)Google Scholar
  21. 21.
    Frommholz I. and Fuhr N. (2006). Probabilistic, object-oriented logics for annotation-based retrieval in digital libraries. In: Marchionini, G., Nelson, M.L. and Marshall, C.C. (eds) JCDL., pp 55–64. ACM, New York CrossRefGoogle Scholar
  22. 22.
    Fagin R., Lotem A. and Naor M. (2003). Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614–656 MATHCrossRefMathSciNetGoogle Scholar
  23. 23.
    Fuhr, N., Roelleke, T.: A probabilistic NF2 relational algebra for integrated information retrieval and database systems. In: Tanik, M.M., Bastani, F.B., Gibson, D., Fielding, P.J. (eds.) Proceedings of the 2nd World Conference on Integrated Design and Process Technology (IDPT), Society for Design and Process Science (SDPS), Austin, pp. 17–30 (1996)Google Scholar
  24. 24.
    Fuhr N. and Rölleke T. (1997). A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Info. Syst. 14(1): 32–66 CrossRefGoogle Scholar
  25. 25.
    Forst, J.F., Tombros, A., Roelleke, T.: solving the enterprise trec task with probabilistic data models. In: Proceedings of TREC 2006, pp. xx–yy (2006)Google Scholar
  26. 26.
    Fuhr N. (1990). A probabilistic framework for vague queries and imprecise information in databases. In: McLeod, D., Sacks-Davis, R. and Schek, H. (eds) Proceedings of the 16th International Conference on Very Large Databases., pp 696–707. Morgan Kaufman, Los Altos Google Scholar
  27. 27.
    Fuhr N. (1995). Probabilistic datalog—a logic for powerful retrieval methods. In: Fox, E.A., Ingwersen, P. and Fidel, R. (eds) Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 282–290. ACM, New York CrossRefGoogle Scholar
  28. 28.
    Grabs, T., Böhm, K., Schek, H.-J.: Powerdb-ir - information retrieval on top of a database cluster. In: CIKM, pp. 411–418 (2001)Google Scholar
  29. 29.
    Grabs T., Böhm K. and Schek H.-J. (2004). Powerdb-ir - scalable information retrieval and storage with a cluster of databases. Knowl. Inf. Syst. 6(4): 465–505 CrossRefGoogle Scholar
  30. 30.
    Grossman D.A. and Frieder O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer, Massachusetts MATHGoogle Scholar
  31. 31.
    Grossman D.A. and Frieder O. (2004). Information Retrieval. Algorithms and Heuristics, 2nd edn. vol. 15 of The Information Retrieval Series. Springer, Berlin Heidelberg Google Scholar
  32. 32.
    Grossman D.A., Frieder O., Holmes D.O. and Roberts D.C. (1997). Integrating structured data and text: a relational approach. JASIS 48(2): 122–132 CrossRefGoogle Scholar
  33. 33.
    Hristidis, V., Gravano, L., Papakonstantinou, Y.: Efficient ir-style keyword search over relational databases. In: VLDB, pp. 850–861 (2003)Google Scholar
  34. 34.
    Hiemstra D. (2000). A probabilistic justification for using tf.idf term weighting in information retrieval. Int. J. Digit. Libr. 3(2): 131–139 CrossRefGoogle Scholar
  35. 35.
    Hristidis, V., Papakonstantinou, Y.: Discover: keyword search in relational databases. In: VLDB, pp. 670–681 (2002)Google Scholar
  36. 36.
    Li, C., Chang, K.C.-C., Ilyas, I.F., Song, S.: Ranksql: Query algebra and optimization for relational top-k queries. In: SIGMOD Conference, pp. 131–142 (2005)Google Scholar
  37. 37.
    Lee, S.K.: An extended relational database model for uncertain and imprecise information. In: Proceedings of the 18th VLDB Conference, pp. 211–220 Morgan Kaufman, Los Altos (1992)Google Scholar
  38. 38.
    Lakshmanan L.V.S., Leone N., Ross R. and Subrahmanian V.S. (1997). Probview: a flexible probabilistic database system. ACM Trans. Database Syst. 22(3): 419–469 CrossRefGoogle Scholar
  39. 39.
    Lalmas, M., Roelleke, T.: Modelling vague content and structure querying in XML retrieval with a probabilistic object-relational framework. In: Proceedings of the 6th International Conference on Flexible Query Answering Systems (FQAS), LNCS, Lyon, Springer, Berlin (2004)Google Scholar
  40. 40.
    Lalmas M., Roelleke T. and Fuhr N. (2002). Intelligent hypermedia retrieval. In: Szczepaniak, P.S., Segovia, F., and Zadeh, L.A. (eds) Intelligent Exploration of the Web, pp. Springer, Heidelberg Google Scholar
  41. 41.
    Lafferty, J., Zhai, C.: Probabilistic Relevance Models Based on Document and Query Generation, chap. 1. Kluwer (2002)Google Scholar
  42. 42.
    Macleod I.A. (1991). Text retrieval and the relational model. J. Am. Soc. Inf. Sci. 42(3): 155–165 CrossRefGoogle Scholar
  43. 43.
    Maron M.E. and Kuhns J.L. (1960). On relevance, probabilistic indexing and information retrieval. J. ACM 7: 216–244 CrossRefGoogle Scholar
  44. 44.
    Motro A. (1988). Vague: a user interface to relational databases that permits vague queries. ACM Trans. Off. Inf. Syst. 6(3): 187–214 CrossRefGoogle Scholar
  45. 45.
    Motro A. (1990). Accommodating imprecision in database systems: Issues and solutions. Sigmod rec. 19(4): 69 CrossRefGoogle Scholar
  46. 46.
    Niemi T. and Järvelin K. (1995). A straightforward NF2 relational interface with applications in information retrieval. Inf. Process. Manage. 31(2): 215–231 CrossRefGoogle Scholar
  47. 47.
    Ponte J.M. and Croft W.B. (1998). A language modeling approach to information retrieval. In: Bruce Croft, W., Moffat, A., van Rijsbergen, C.J., Wilkinson, R. and Zobel, J. (eds) Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 275–281. ACM, New York CrossRefGoogle Scholar
  48. 48.
    Pfeifer, U., Fuhr, N.: Efficient processing of vague queries using a data stream approach. In: Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 189–198, New York (1995)Google Scholar
  49. 49.
    Roelleke T., Lübeck R. and Kazai G. (2001). The HySpirit retrieval platform, demonstration. In: Croft, B., Harper, D.J., Kraft, D.H., and Zobel, J. (eds) Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New Orleans Google Scholar
  50. 50.
    Robertson, S.E.: Term frequency and term value. In: SIGIR, pp. 22–29 (1981)Google Scholar
  51. 51.
    Robertson S.E. (2004). Understanding inverse document frequency: On theoretical arguments for idf. J. Doc. 60: 503–520 CrossRefGoogle Scholar
  52. 52.
    Roelleke T. (1999). POOL: Probabilistic Object-Oriented Logical Representation and Retrieval of Complex Objects. Shaker Verlag, Aachen Dissertation Google Scholar
  53. 53.
    Roelleke, T.: A frequency-based and a Poisson-based probability of being informative. In: Callan, J., Cormarck, G., Clarke, C., Hawking, D., Smeaton, A. (eds.) Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 227–234 (2003)Google Scholar
  54. 54.
    Ross, J.G.R., Subrahmanian, V.S.: Probabilistic aggregates. In: 13th International Symposium on Methodologies for Intelligent Systems (ISMIS), Lyon, Foundations of Intelligent Systems. Springer, Heidelberg (2002)Google Scholar
  55. 55.
    Robertson S.E. and Sparck Jones K. (1976). Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27: 129–146 CrossRefGoogle Scholar
  56. 56.
    Roelleke, T., Wang, J.: A parallel derivation of probabilistic information retrieval models. In: ACM SIGIR (2006)Google Scholar
  57. 57.
    Robertson S.E., Walker S. and Hancock-Beaulieu M.M. (1995). Large test collection experiments on an operational interactive system: Okapi at TREC. Inf. Process. Manage. 31: 345–360 CrossRefGoogle Scholar
  58. 58.
    Suciu, D., Dalvi, N.N.: Foundations of probabilistic answers to queries. In: SIGMOD Conference, p. 963 (2005)Google Scholar
  59. 59.
    Silberschatz, A., Korth, H.F., Sudarshan, S.: Database Systems Concepts, 4th Edn. McGraw-Hill Higher Education (2002)Google Scholar
  60. 60.
    Schek, H.-J., Pistor, P.: Data structures for an integrated database management and information retrieval system. In: Proceedings of the 8th International Conference on Very Large Data Bases, pp. 197–207, Morgan Kaufman, Los Altos (1982)Google Scholar
  61. 61.
    Turtle H. and Croft W.B. (1990). Inference networks for document retrieval. In: Vidick, J.-L. (eds) Proceedings of the 13th International Conference on Research and Development in Information Retrieval., pp 1–24. ACM, New York CrossRefGoogle Scholar
  62. 62.
    Theobald, M., Weikum, G., Schenkel, R.: Top-k query evaluation withprobabilistic guarantees. In: VLDB, pp. 648–659 (2004)Google Scholar
  63. 63.
    van Rijsbergen C.J. (1986). A non-classical logic for information retrieval. Comput. J. 29(6): 481–485 MATHCrossRefGoogle Scholar
  64. 64.
    Wong S.K.M. and Yao Y.Y. (1995). On modeling information retrieval with probabilistic inference. ACM Trans. Inf. Sys. 13(1): 38–68 CrossRefGoogle Scholar
  65. 65.
    Yu C.T., Lam K. and Salton G. (1982). Term weighting in information retrieval using the term precision model. J. ACM 29(1): 152–170 MATHCrossRefMathSciNetGoogle Scholar
  66. 66.
    Zhai, C.X., Lafferty, J.D.: Two-stage language models for information retrieval. In: SIGIR, pp. 49–56 (2002)Google Scholar

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Thomas Roelleke
    • 1
  • Hengzhi Wu
    • 1
  • Jun Wang
    • 1
  • Hany Azzam
    • 1
  1. 1.Queen MaryUniversity of LondonLondonUK

Personalised recommendations