Skip to main content

Searching Large Text Collections

  • Chapter

Part of the book series: Massive Computing ((MACO,volume 4))

Abstract

In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most used index, but also review suffix arrays, which are useful in a number of specialized applications. We also cover parallel and distributed implementations of these two structures. As an example, we show how mechanisms based upon inverted files can be used to index and search the Web.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   629.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   799.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD   799.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Bibliography

  • O. Alonso and R. Baeza-Yates. A model for visualizing large answers in WWW. In XVIII Int. Conf. of the Chilean CS Society,pages 2–7, Antofagasta, Chile, 1998. IEEE CS Press.

    Google Scholar 

  • AltaVista. A1taVista: Main page. http://www.altavista.com, 1996.

  • T. Anderson, D. Culler, and D. Patterson. A case for NOW (network of workstations). IEEE Micro, 15 (1): 54–64, 1995.

    Article  Google Scholar 

  • A. Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, NATO ISI Series, pages 85–96. Springer-Verlag, 1985.

    Google Scholar 

  • R. Baeza-Yates, E. Barbosa, and N. Ziviani. Hierarchies of indices for text searching. Information Systems, 21 (6): 497–514, 1996.

    Article  Google Scholar 

  • R. Baeza-Yates and G. Gonnet. All-against-all sequence matching. Dept. of Computer Science, University of Chile, 1990.

    MATH  Google Scholar 

  • R. Baeza-Yates and G. Gonnet. Fast text searching for regular expressions or automaton searching on a trie. J. of the ACM, 43 (6): 915–936, 1996.

    Article  MathSciNet  MATH  Google Scholar 

  • R. Baeza-Yates and G. Navarro. Block-addressing indices for approximate text retrieval. In Proc. ACM CIKM’97,pages 1–8, 1997. Extended version to appear in JASIS.

    Google Scholar 

  • R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley & ACM Press, Harlow, UK, 1999.

    Google Scholar 

  • V. N. Anh and A. Moffat. Compressed inverted files with reduced decoding overheads. In Croft et al. (1998), pages 290–297.

    Google Scholar 

  • E. Barbosa and N. Ziviani. From partial to full inverted lists for text searching. In R. Baeza-Yates and U. Manber, editors, Proc. 2nd South American Workshop on String Processing, WSP’95, pages 1–10, 1995.

    Google Scholar 

  • R. Bayer and K. Unterauer. Prefix B-trees. ACM TODS, 2 (1): 11–26, Mar 1977.

    Article  Google Scholar 

  • Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and Arthur Secret. The World-Wide Web. Comm. of the ACM, 37 (8): 76–82, 1994.

    Article  Google Scholar 

  • Krisha Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: fast access to linkage information on the Web. In 7th WWW Conf., Brisbane, Australia, April 1998.

    Google Scholar 

  • Philippe Bonnet and Anthony Tomasic. Partial answers for unavailable data sources. In Workshop on Flexible Query-Answering Systems, pages 43–54, 1998.

    Chapter  Google Scholar 

  • C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz. The Harvest information discovery and access system. In Proc. 2nd Inter. World Wide Web Conf., pages 763–771, October 1994.

    Google Scholar 

  • S. Brin. Extracting patterns and relations from the World Wide Web. In Workshop on Web Databases,Valencia, Spain, March 1998.

    Google Scholar 

  • A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In 6th Int’l WWW Conference,pages 391–404,Santa Clara, CA, USA, April 1997.

    Google Scholar 

  • A.Z. Broder. On the resemblance and containment of documents. In Conf. on Compression and Complexity of Sequences,pages 21–29,Salerno, Italy, 1997. IEEE Computer Society.

    Google Scholar 

  • C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In Proc. 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,pages 97–110,Montreal, Canada, June 1985. ACM Press, New York.

    Google Scholar 

  • S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In 7th WWW Conference, pages 65–74, Brisbane, Australia, April 1998.

    Google Scholar 

  • S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In Proc. of IPSJ Conference, pages 7–18, October 1994.

    Google Scholar 

  • C. Chen. Structuring and visualizing the WWW by generalized similarity analysis. In 8th ACM Conference on Hypertext and Hypermedia,pages 177–186, Southampton, England, 1997.

    Google Scholar 

  • J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In 7th WWW Conference, Brisbane, Australia, April 1998.

    Google Scholar 

  • D. R. Clark and J. I. Munro. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Annual Symposium on Discrete Algorithms,pages 383–391, Atlanta, Georgia, 1996.

    Google Scholar 

  • A. Clauser and P. Ferragina. On constructing suffix arrays in external memory. Algorithmica, 2000. (to appear). Part of this work appeared in European Symposium on Algorithms, LNCS 1643, 1999.

    Google Scholar 

  • E. G. Coffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for Web search engines. Technical Report 3317, INRIA, France, December 1997.

    Google Scholar 

  • W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998. ACM Press, New York.

    Google Scholar 

  • O. de Kretser and A. Moffat. Effective document presentation with a locality-based similarity heuristic. In M. Hearst, F. Gey, and R. Tong, editors, Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 113–120, New York, August 1999. ACM Press.

    Google Scholar 

  • O. de Kretser, A. Moffat, T. Shimmin, and J. Zobel. Methodologies for distributed information retrieval. In M. P. Papazoglou, M. Takizawa, B. Krämer, and S. Chanson, editors, Proc. 18th International Conference on Distributed Computing Systems, pages 66–73, Los Alamitos, CA, May 1998. IEEE Computer Society.

    Google Scholar 

  • Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. A query language for XML. http://www.research.att.com/—mff/xml/w3cnote.html, 1998.

    Google Scholar 

  • Fast. Fast: Main page. http://alltheweb.com, 1998, 1998.

  • P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. J. Assoc. Comput. Mach., 46: 236–280, 1999.

    Article  MathSciNet  MATH  Google Scholar 

  • W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, New Jersey, 1992.

    Google Scholar 

  • R. G. Gallager and D. C. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, IT-21(2): 228–230, March 1975.

    Google Scholar 

  • D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web communities from link topologies. In 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, USA, 1998.

    Google Scholar 

  • R. Goldman, J. McHugh, and J. Widom. Lore: A database management system for XML. Technical report, Stanford University Database Group, 1998.

    Google Scholar 

  • S. W. Golomb. Run-length encodings. IEEE Transactions on Infor[na- tion Theory, IT-12(3): 399–401, July 1966.

    Google Scholar 

  • G. Gonnet. A tutorial introduction to Computational Biochemistry using Darwin. Technical report, Informatik E.T.H., Zuerich, Switzerland, 1992.

    Google Scholar 

  • G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: Pat trees and Pat arrays, chapter 3, pages 66–82. In, Frakes and BaezaYates (1992), 1992a.

    Google Scholar 

  • G. Gonnet, M. Cohen, and S. Benner. Exhaustive matching of the entire protein sequence database. Science, 256: 1443–45, 1992b.

    Article  Google Scholar 

  • Google. Google: Main page. http://www.google.com, 1998.

    Google Scholar 

  • L. Gravano, K. Chang, H. García-Molina, C. Lagoze, and A. Paepcke. STARTS: Stanford protocol proposal for Internet retrieval and search. Technical report, Stanford University, Digital Library Project, 1997. http://www-db.stanford.edu/gravano/starts.html.

    Google Scholar 

  • Luis Gravano, Hector García-Molina, and Ant hony Tomasic. GlOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 2000. To appear.

    Google Scholar 

  • S.J. Green. Automated link generation: can we do better than term repetition. In 7th WWW Conference, Brisbane, Australia, 1998.

    Google Scholar 

  • D. K. Harman. Overview of the second text retrieval conference (TREC2). Information Prcessing é4 Management, 31 (3): 271–289, May 1995.

    Article  Google Scholar 

  • D. K. Harman and G. Candela. Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41 (8): 581–589, August 1990.

    Article  Google Scholar 

  • H. Itoh and H. Tanaka. An efficient method for in memory construction of suffix arrays. In IEEE CS Press, editor, Proc. 6th Symp. on String Processing and Info nation Retrieval, SPIRE’99, pages 81–89, Cancun, Mexico, 1999.

    Google Scholar 

  • J. Jâjâ. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.

    Google Scholar 

  • B. Kahle. Archiving the Internet. http://www.alexa.com/’brewster/ essays/sciam_article.html, 1997.

    Google Scholar 

  • M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems, 17 (4): 406–439, October 1999.

    Article  Google Scholar 

  • R. Khare and A. Rifkin. XML: A door to automated Web applications. IEEE Internet Computing, 1 (4): 78–86, 1977.

    Article  Google Scholar 

  • J. Kitajima and G. Navarro. A fast distributed suffix array generation algorithm. In Proc. 6th Symp. on String Processing and Information

    Google Scholar 

  • Retrieval, SPIRE’99,pages 97–104, Cancun, Mexico, 1999. IEEE CS Press.

    Google Scholar 

  • J. Kitajima, G. Navarro, B. Ribeiro-Neto, and N. Ziviani. Distributed generation of suffix arrays: a quicksort-based approach. In Proc. nth South American Workshop on String Processing, WSP’97, pages 5369. Carleton University Press, 1997.

    Google Scholar 

  • J. Kitajima, B. Ribeiro-Neto, and N. Ziviani. Network and memory analysis in distributed parallel generation of PAT arrays. In Proc. 8th Brazilian Symposium on Computer Architectures - High-Performance Processing, pages 193–202. Brazilian Computer Society - SBC, 1996.

    Google Scholar 

  • Jon Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms, pages 668–677, San Francisco, USA, Jan 1998.

    Google Scholar 

  • Stefan Kurtz. Reducing the space requirement of suffix trees. Software — Practice and Experience, 29 (13): 1149–1171, 1999.

    Article  Google Scholar 

  • N. J. Larsson and K. Sadakane Faster suffix sorting. Technical Report LU-CS-TR:99–214, LUNDFD6/NFCS3140/1–20/(1999), Department of Computer Science, Lund University, Sweden, May 1999.

    Google Scholar 

  • S. Lawrence and L. Giles. Accessability and distribution of information on the Web. Nature, July 1999.

    Google Scholar 

  • M. Lesk. Practical Digital Libraries: Books, Bytes, and Bucks. Morgan Kaufmann, San Francisco, 1997.

    Google Scholar 

  • U. Manber and E. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, pages 935–948, 1993.

    Google Scholar 

  • Udi Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Usenix Winter 199.E Technical Conference, pages 23–32, San Francisco, CA, January 1994.

    Google Scholar 

  • M. Marchiori. The quest for correct information on the Web: Hyper search engi nes. In 6th WWW Conf., pages 265–274, Santa Clara, CA, USA, 1997.

    Google Scholar 

  • M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proc. RIAO’97 Conf. on Computer-Assisted Information Searching on the Internet, pages 200–214, June 1997.

    Google Scholar 

  • A. Moffat and T. A. H. Bell. In-situ generation of compressed inverted files. Journal of the American Society for Information Science, 46 (7): 537–550, August 1995.

    Article  Google Scholar 

  • A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14 (4): 349–379, October 1996.

    Article  Google Scholar 

  • A. Moffat, J. Zobel, and N. Sharman. Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering, 9 (2): 302–313, March 1997.

    Article  Google Scholar 

  • E. S. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast searching on compressed text allowing errors. In Croft et al. (1998), pages 298–306.

    Google Scholar 

  • M. Nagao and S. Mori. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of japanese. In Proc. COLING’9.4, pages 611–615, 1994.

    Google Scholar 

  • G. Navarro and R. Baeza-Yates. A new indexing method for approximate string matching. In Maxime Crochemore and Mike Paterson, editors, Combinatorial Pattern Matching (CPM’99), number 1645 in LNCS, pages 163–185, Warwick, UK, July 1999. Springer Verlag.

    Google Scholar 

  • G. Navarro, E. Barbosa, R. Baeza-Yates, W. Cunto, and N. Ziviani. Binary searching with non-uniform costs and its application to text retrieval. Algorithmica,2000. To appear. Preliminary version in Proc. ESA ‘85,LNCS, Springer Verlag.

    Google Scholar 

  • G. Navarro, J. Kitajima, B. Ribeiro-Neto, and N. Ziviani. Distributed generation of suffix arrays. In Proc. CPM’97, LNCS 1264, pages 102–115, 1997.

    Google Scholar 

  • Northern. Northern Light: Main page. http://www.northernlight.com, 1997.

  • M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47 (10): 749–764, October 1996.

    Article  Google Scholar 

  • Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow’s ear: Extracting usable structures from the Web. In Proc. of the ACM SIGCHI Conference on Human Factors in Computing Systems,pages 118–125, Zurich, Switzerland, May 1996. ACM Press.

    Google Scholar 

  • J. Powell and E. Fox. Multilingual federated searching across heterogeneous collections. D-Lib Magazine, September 1998.

    Google Scholar 

  • D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In Deductive and Object-Oriented Databases, Proc. of the DOOD ‘85 Conference,pages 319344, Singapore, December 1995. Springer-Verlag.

    Google Scholar 

  • M. J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, second edition, 1994.

    Google Scholar 

  • B. Ribeiro-Neto and R. Barbosa. Query performance for tightly coupled distributed digital libraries. In Proc. ACM Int. Conference on Digital Libraries (DL’98), pages 182–190, 1998.

    Chapter  Google Scholar 

  • B. Ribeiro-Neto, J. Kitajima, G. Navarro, C. Sant’Ana, and N. Ziviani. Parallel generation of inverted lists for distributed text collections. In Y. Eterovic, editor, Proc. XVIII Chilean Computer Science Society, SCCC’98, pages 149–157. IEEE CS Press, 1998.

    Google Scholar 

  • B. Ribeiro-Neto, E. Moura, M. Neubert, and N. Ziviani. Efficient distributed algorithms to build inverted files. In Proc. ACM SIGIR’99, pages 105–112, 1999.

    Google Scholar 

  • K. Sadakane. A fast algorithm for making suffix arrays and for BurrowsWheeler transformation. In Proc. IEEE Data Compression Conference (DCC’98), pages 129–138, 1998.

    Google Scholar 

  • G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing é1 Management, 24 (5): 513–523, 1988.

    Article  Google Scholar 

  • G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

    MATH  Google Scholar 

  • N. Shivakumar and H. García-Molina. Finding near-replicas of documents on the Web. In Workshop on Web Databases, Valencia, Spain, March 1998.

    Google Scholar 

  • A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proc. 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,pages 21–29, New York, August 1996. ACM Press. Snap. Snap: Main page. http://www.snap.com, 1998.

  • Ellen Spertus. ParaSite: Mining structural information on the Web. In 6th Int’l WWW Conference,Santa Clara, CA, USA, April 1997.

    Google Scholar 

  • W. Szpankowski. Probabilistic analysis of generalized suffix trees. In Proc. CPM’92,pages 1–14, 1992. LNCS 644.

    Google Scholar 

  • A. Tomasic and H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proc. Int. Conf. on Parallel and Distributed Information Systems, PDIS’93, 1993.

    Google Scholar 

  • E. Ukkonen. Approximate string matching over suffix trees. In Proc. CPM’93, pages 228–242, 1993.

    Google Scholar 

  • R. Weiss, B. Vélez, M. Sheldon, C. Nemprempre, P. Szilagyi, and D.K. Gifford. HyPursuit: A hierarchical network engine that exploits content-link hypertext clustering. In 7th ACM Conference on Hypertext and Hypermedia,pages 180-193,Washington, D.C., USA, 1996.

    Google Scholar 

  • I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.

    MATH  Google Scholar 

  • B. Yuwono and D.L. Lee. Server ranking for distributed text retrieval systems on the Internet. In 5th Int. Conf. on Databases Systems for Advanced Applications (DASFAA), pages 41–50, Melbourne, Australia, 1997.

    Google Scholar 

  • J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25 (8): 891–903, August 1995.

    Google Scholar 

  • J. Zobel and A. Moffat. Exploring the similarity space. SIGIR Forum, 32 (1): 18–34, Spring 1998.

    Google Scholar 

  • J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23 (4): 453–490, December 1998.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Baeza-Yates, R., Moffat, A., Navarro, G. (2002). Searching Large Text Collections. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_7

Download citation

  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_7

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics