Knowledge and Information Systems, Volume 6, Issue 4, pp 465–505

PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases

Abstract

Our objective is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of updates. Timely processing of updates is important with novel application domains such as e-commerce. These issues are challenging, given the additional requirement that the system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This article describes its design, implementation, and evaluation. We follow a three-tier architecture with a database cluster as the bottom layer for storage management. The rationale for a database cluster is to ‘scale out’, i.e., to add further cluster nodes, whenever necessary for better performance. The middle tier provides IR-specific retrieval and update services. We deploy state-of-the-art middleware software to coordinate the cluster and to invoke IR-specific components. PowerDB-IR extends the middleware layer with service decomposition and parallelisation. PowerDB-IR has the following features: It supports state-of-the-art retrieval models such as vector space retrieval. It allows documents to be inserted and retrieved concurrently and ensures up-to-date retrieval results with almost no overhead. PowerDB-IR ensures the correctness of global concurrency and recovery. Alternative physical data organisation schemes and respective query processing techniques provide adequate performance for different workloads and database sizes. Scaling out the database cluster yields higher throughput and lower response times. We have run extensive experiments with PowerDB-IR using several commercial database systems as well as different middleware products. Further experiments have quantified the effect of transactional guarantees on performance. The main result is that PowerDB-IR shows surprisingly good scalability and low response times.
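To illustrate the idea of decomposing a retrieval request over cluster nodes and combining the partial results, the following is a minimal sketch rather than the PowerDB-IR implementation. It assumes documents are partitioned across nodes, uses a simple tf-idf weighting as one instance of vector space retrieval, and simulates nodes with in-memory objects and threads. The names `Partition`, `doc_freq`, `top_k`, and `search` are hypothetical and do not come from the article.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """Hypothetical stand-in for one cluster node holding a disjoint subset of documents."""

    def __init__(self, docs):
        # docs: {doc_id: text}; term frequencies are precomputed per document.
        self.tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    def size(self):
        return len(self.tf)

    def doc_freq(self, terms):
        # Number of local documents that contain each query term.
        return {t: sum(1 for tf in self.tf.values() if t in tf) for t in terms}

    def top_k(self, idf, k):
        # Local score of a document: sum over query terms of tf * idf.
        scores = []
        for doc_id, tf in self.tf.items():
            score = sum(tf[t] * weight for t, weight in idf.items())
            if score > 0:
                scores.append((score, doc_id))
        return heapq.nlargest(k, scores)


def search(partitions, query, k=3):
    """Decompose a query into per-partition subrequests and merge the partial rankings."""
    terms = set(query.lower().split())
    with ThreadPoolExecutor() as pool:
        # Step 1: collect global statistics so that all partitions use the same weights.
        local_dfs = list(pool.map(lambda p: p.doc_freq(terms), partitions))
        n_docs = sum(p.size() for p in partitions)
        idf = {}
        for t in terms:
            df = sum(d[t] for d in local_dfs)
            if df > 0:
                idf[t] = math.log(1 + n_docs / df)
        # Step 2: each partition ranks its own documents in parallel.
        partials = list(pool.map(lambda p: p.top_k(idf, k), partitions))
    # Step 3: the global top-k is contained in the union of the local top-k lists.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]
```

Adding a partition in this sketch corresponds loosely to scaling out: each node ranks fewer documents, while the merge step only ever sees at most `k` hits per node. The sketch omits updates, transactions, and recovery, which the article treats as central.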

Keywords

Information retrieval; Concurrent update and retrieval; Database cluster; Transaction management for IR


Copyright information

© Springer-Verlag 2004

Authors and Affiliations

  • Torsten Grabs (1)
  • Klemens Böhm (2)
  • Hans-Jörg Schek (3)
  1. Database Research Group, Institute of Information Systems, ETH Zürich, Zürich, Switzerland
  2. Department of Computer Science, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
  3. Institut für Informationssysteme, ETH Zentrum IFW C49.2, 8092 Zürich, Switzerland
