Topic Distillation and Spectral Filtering

Artificial Intelligence Review

Abstract

This paper discusses topic distillation, an information retrieval problem that is emerging as a critical task for the WWW. Algorithms for this problem must distill a small number of high-quality documents addressing a broad topic from a large set of candidates. We give a review of the literature, and compare the problem with related tasks such as classification, clustering, and indexing. We then describe a general approach to topic distillation with applications to searching and partitioning, based on the algebraic properties of matrices derived from particular documents within the corpus. Our method, which we call spectral filtering, combines the use of terms, hyperlinks and anchor-text to improve retrieval performance. We give results for broad-topic queries on the WWW, and also give some anecdotal results applying the same techniques to US Supreme Court law cases, US patents, and a set of Wall Street Journal newspaper articles.
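
As a rough illustration of what "algebraic properties of matrices derived from documents" can mean in practice, the sketch below runs a hub and authority power iteration over a small link graph. This is the well-known HITS-style iteration, shown only as one familiar example of a spectral technique; it is not claimed to be the method of this paper, and it ignores terms and anchor text entirely. The function names, the toy graph, and the iteration count are assumptions made for the example.

```python
from math import sqrt


def _normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def hubs_and_authorities(links, n_docs, iterations=50):
    """Hypothetical helper: power iteration on a link graph.

    links  -- list of (source, target) document indices, one per hyperlink
    n_docs -- number of documents in the candidate set
    Returns (hub_scores, authority_scores), each of unit length.
    """
    hubs = [1.0] * n_docs
    auths = [1.0] * n_docs
    for _ in range(iterations):
        # A document's authority score sums the hub scores of pages linking to it.
        new_auths = [0.0] * n_docs
        for src, dst in links:
            new_auths[dst] += hubs[src]
        # A document's hub score sums the authority scores of pages it links to.
        new_hubs = [0.0] * n_docs
        for src, dst in links:
            new_hubs[src] += new_auths[dst]
        auths = _normalize(new_auths)
        hubs = _normalize(new_hubs)
    return hubs, auths


# Toy graph (assumed data): documents 0, 1 and 4 link to documents 2 and 3.
toy_links = [(0, 2), (1, 2), (0, 3), (1, 3), (4, 3)]
hub_scores, auth_scores = hubs_and_authorities(toy_links, n_docs=5)
best_authority = auth_scores.index(max(auth_scores))
```

In this toy graph, document 3 receives the most in-links from well-connected pages and therefore ends up with the highest authority score. The scores converge toward the principal eigenvectors of the products of the link matrix with its transpose, which is the sense in which such methods are "spectral".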

Author information

Corresponding author

Correspondence to Byron E. Dom.

About this article

Cite this article

Chakrabarti, S., Dom, B.E., Gibson, D. et al. Topic Distillation and Spectral Filtering. Artificial Intelligence Review 13, 409–435 (1999). https://doi.org/10.1023/A:1006596506229

  • DOI: https://doi.org/10.1023/A:1006596506229
