Skip to main content
Log in

Efficient parallel Text Retrieval techniques on Bulk Synchronous Parallel (BSP)/Coarse Grained Multicomputers (CGM)

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

In this paper, we present efficient, scalable, and portable parallel algorithms for the off-line clustering, the on-line retrieval and the update phases of the Text Retrieval (TR) problem based on the vector space model and using clustering to organize and handle a dynamic document collection. The algorithms are running on the Coarse-Grained Multicomputer (CGM) and/or the Bulk Synchronous Parallel (BSP) model which are two models that capture within a few parameters the characteristics of the parallel machine. To the best of our knowledge, our parallel retrieval algorithms are the first ones analyzed under these specific parallel models. For all the phases of the proposed algorithms, we analytically determine the relevant communication and computation cost thereby formally proving the efficiency of the proposed solutions. In addition, we prove that our technique for the on-line retrieval phase performs very well in comparison to other possible alternatives in the typical case of a multiuser information retrieval (IR) system where a number of user queries are concurrently submitted to an IR system. Finally, we discuss external memory issues and show how our techniques can be adapted to the case when processors have limited main memory but sufficient disk capacity for holding their local data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Aslam J, Pelekhov E, Rus D (2004) The star clustering algorithm for static and dynamic information organization. J Graph Algorithms Appl 8:95–129

    MATH  MathSciNet  Google Scholar 

  2. Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley, New York

    Google Scholar 

  3. Bilardi G, Fantozzi C, Pietracaprina A, Pucci G (2001) On the effectiveness of D-BSP as a bridging model of parallel computation. In: Proc of the int conf on comput sci (ICCS 2001), pp 579–588

  4. Can F (1993) Incremental clustering for dynamic information processing. ACM Trans Inf Process Syst 11:143–164

    Article  Google Scholar 

  5. Can F, Altingovde I, Demir E (2004) Efficiency and effectiveness of query processing in cluster-based retrieval. Inf Syst 29:697–717

    Article  Google Scholar 

  6. Chan A, Dehne F (1999) A note on coarse grained parallel integer sorting. Parallel Process Lett 9(4):533–538

    Article  Google Scholar 

  7. Chaudhri B (1994) Dynamic clustering for time incremental data. Pattern Recognit Lett 13:27–34

    Article  Google Scholar 

  8. Chung S, Kwon H, Ryu K, Chung Y, Jang H, Choi C (2001) Information retrieval on an SCI-based PC cluster. J Supercomput 19:251–265

    Article  MATH  Google Scholar 

  9. Cringean J, England R, Manson G, Willett P (1990) Parallel text searching in serial files using a processor farm. In: Proc of ACM SIGIR, pp 429–452

  10. Dehne F, Fabri A, Rau-Chaplin A (1993) Scalable parallel geometric algorithms for coarse grained multicomputer. In: Proc of the ACM 9th symp on comput geom, pp 298–307

  11. Dehne F, Fabri A, Kenyon C (1994) Scalable and architecture independent parallel geometric algorithms with high probability optimal time. In: Proc of the IEEE symp on parallel and distrib process, pp 586–593

  12. Dehne F, Dittrich W, Hutchinson D (1997) Efficient external memory algorithms by simulating coarse-grained parallel algorithms. In: Proc of the 9th annual ACM symp on parallel algorithms and archit (SPAA), pp 106–115

  13. Dehne F, Dittrich W, Hutchinson D (2003) Efficient external memory algorithms by simulating coarse-grained parallel algorithms. Algorithmica 36:97–122

    Article  MATH  MathSciNet  Google Scholar 

  14. Dehne F, Deng X, Dymond P, Fabri A, Kokhar A (1995) A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. In: Proc of the ACM symp on parallel algorithms and archit, pp 27–33

  15. Dehne F, Hutchinson D, Maheshwari A, Dittrich W (1999) Reducing I/O complexity by simulating coarse grained parallel algorithms. In: Proc of the 13th Int and 10th symp on parallel and distrib process (IPPS/SPDP), pp 14–20

  16. Dehne F, Dittrich W, Hutchinson D, Maheshwari A (2002) Bulk synchronous parallel algorithms for the external memory model. Theory of Comput Syst 35:567–597

    Article  MATH  MathSciNet  Google Scholar 

  17. Dehne F, Ferreira A, Caceres E, Wong S, Roncato A (2002) Efficient parallel graph algorithms for coarse-grained multicomputers and BSP. Algorithmica 33:183–200

    Article  MATH  MathSciNet  Google Scholar 

  18. Dobrynin V, Patterson D, Galushka M, Rooney N (2005) SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection. IEEE Trans Inf Technol Biomed 9(2):256–265

    Article  Google Scholar 

  19. Efraimidis P, Glymidakis C, Mamalis B, Spirakis P, Tampakas B (1995) Parallel text retrieval on a high performance supercomputer using the vector space model. In: Proc of ACM SIGIR, pp 58–66

  20. Gavalas D, Konstantopoulos C, Mamalis B, Pantziou G (2005) Efficient BSP/CGM algorithms for text retrieval. In: Proc of the int conf on parallel and distrib comput syst (IASTED PDCS), pp 301–306

  21. Gerbessiotis A, Siniolakis C (1996) Primitive operations on the BSP model. Technical Report PRG-TR-23-96, Computing Laboratory, Oxford University

  22. Gerbessiotis A, Valiant L (1994) Direct bulk synchronous parallel algorithms. J Parallel Distrib Comput 22:251–267

    Article  Google Scholar 

  23. Goodrich M (1999) Communication-efficient parallel sorting. SIAM J Comput 29(2):416–432

    Article  MathSciNet  Google Scholar 

  24. Hawking D (1996) Document retrieval performance on parallel systems. In: Proc of the int conf on parallel and distrib process tech and appl (PDPTA ’96), pp 1354–1365

  25. Ishikawa Y, Chen Y, Kitagawa H (2001) An on-line document clustering method based on forgetting factors. In: Proc of the 5th Eur conf on res and adv technol for digit libr, pp 325–339

  26. Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

    Article  Google Scholar 

  27. Juurlink B, Wijshoff H (1996) Communication primitives for BSP computers. Inf Process Lett 58:303–310

    Article  MATH  MathSciNet  Google Scholar 

  28. Kanellakis P, Shvartsman A (1997) Fault-tolerant parallel computation. Kluwer Academic, Norwell

    MATH  Google Scholar 

  29. Kang J, Ahn H, Jung S, Ryu K, Kwon H, Chung S (2004) Improving load balance and fault tolerance for PC cluster-based parallel information retrieval. In: Proc of the 5th int conf on parallel process and appl math (PPAM 2003), pp 682–687

  30. Karger D, Stein C, Wein J (1997) Scheduling algorithms. In: Atallah M (ed) Algorithms and theory of computation handbook. CRC Press, New York, pp 35-1–35-33

    Google Scholar 

  31. Katriel I, Meyer U (2003) Elementary graph algorithms in external memory. In: Meyer U, Sanders P, Sibeyn J (eds) Algorithms for memory hierarchies. LNCS, vol 2625. Springer, Berlin, pp 62–84

    Chapter  Google Scholar 

  32. Kleinberg J, Tardos E (2006) Algorithm design. Addison-Wesley, Boston

    Google Scholar 

  33. Kunder M (2008) The size of the World Wide Web: estimated size of Google’s index. In: The size of the World Wide Web. Available via http://www.worldwidewebsize.com. Accessed 2 June 2008

  34. Kurland O, Lee L (2004) Corpus structure, language models, and ad hoc information retrieval. In: Proc of SIGIR’04, pp 194–201

  35. Levering R, Cutler M (2006) The portrait of a common HTML Web page. In: Proc of the 2006 ACM symp on doc eng, pp 198–204

  36. Liu X, Croft W (2004) Cluster-based retrieval using language models. In: Proc of SIGIR’04, pp 186–193

  37. Lu Z, McKinley K (2000) Partial replica selection based on relevance for information retrieval. In: Proc of the 22nd Int ACM SIGIR conf on res and develop in inf retr, pp 97–104

  38. MacFarlane A, Robertson S, McCann J (2000) Pliers at TRECS. In: Proc of eighth text retr conf (TREC-8), pp 241–252

  39. Mamalis B, Spirakis P, Tampakas B (2003) Parallel processing of multiple text queries on hypercube interconnection networks. Int J Comput Appl 10(1):115–132

    Google Scholar 

  40. Manning C, Raghavan P, Schutze H (2008) An introduction to information retrieval, preliminary draft. Cambridge University Press. Available via http://www.informationretrieval.org. Accessed 1 June 2008

  41. Martin J, Tiskin A (2004) Dynamic BSP: towards a flexible approach to parallel computing over the grid. In: East I, Martin J, Welch P, Duce D, Green M (eds) Communicating process architectures. IOS Press, Amsterdam, pp 219–226

    Google Scholar 

  42. McColl W (1994) Scalable parallel computing: a grand unified theory and its practical development. In: Proc of IFIP world congr, pp 539–546

  43. Rasmussen E (1992) Clustering algorithms. In: Frakes W, Baeza-Yates R (eds) Information retrieval: data structures and algorithms. Prentice-Hall, Englewood Cliffs, pp 419–442

    Google Scholar 

  44. Rungsawang A, Laohakanniyom A, Lertprasertkune M (2001) Low-cost parallel text retrieval using PC-cluster. In: Proc of Eur PVM/MPI, pp 419–426

  45. Salton G, McGill M (1983) An introduction to modern information retrieval. McGraw-Hill, New York

    Google Scholar 

  46. Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

    Article  MATH  Google Scholar 

  47. Stanfill C (1989) Partitioned posting files: a parallel inverted file structure for information retrieval. In: Proc of ACM SIGIR, pp 413–428

  48. Valiant L (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111

    Article  Google Scholar 

  49. Valiant L (1991) General purpose parallel architectures. In: Leeuwen J (ed) Handbook of theoretical computer science, vol A. MIT Press, Cambridge, pp 943–971

    Google Scholar 

  50. van Rijsbergen C (1971) Information retrieval, 2nd edn. Butterworth-Heinemann, Newton

    Google Scholar 

  51. Vitter J (2001) External memory algorithms and data structures: dealing with massive data. ACM Comput Surv 33(2):209–271

    Article  Google Scholar 

  52. Williams T (2000) A general-purpose model for heterogeneous computation. PhD Thesis, University of Central Florida, Orlando

  53. Williams T, Parsons R (2000) The heterogeneous bulk synchronous parallel model. In: Proc of the IPDPS 2000 workshops on parallel and distrib process, pp 102–108

  54. Witten I, Moffat A, Bell T (1999) Managing gigabytes: compressing and indexing documents and images, 2d edn. Morgan Kaufmann, San Francisco

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Charalampos Konstantopoulos.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Konstantopoulos, C., Mamalis, B., Pantziou, G. et al. Efficient parallel Text Retrieval techniques on Bulk Synchronous Parallel (BSP)/Coarse Grained Multicomputers (CGM). J Supercomput 48, 286–318 (2009). https://doi.org/10.1007/s11227-008-0225-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-008-0225-x

Keywords

Navigation