Algorithmic Approaches to Information Retrieval and Data Mining
The emerging globalized information environment, with its unprecedented volume and diversity of information, is creating novel computational problems and transforming established areas of research such as information retrieval. I believe that many of these problems are susceptible to rigorous modeling and principled analysis. In this talk I will focus on recent research that exemplifies the value of theoretical tools and approaches in meeting these challenges.
Researchers in information retrieval have recently shown that spectral techniques can be applied to such stubborn problems as polysemy and synonymy. The value of these techniques was more recently demonstrated rigorously in a PODS 98 paper co-authored with Raghavan, Tamaki, and Vempala, using a formal probabilistic model of the corpus. The same paper also proposed a rigorous randomized simplification of the singular value decomposition process. In a SODA 98 paper, Kleinberg shows how spectral methods can extract, in a striking way, the semantics of a hypertext corpus such as the World Wide Web.
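As a rough illustration of the kind of spectral computation involved in the hypertext setting, the sketch below runs a hub-and-authority power iteration on a tiny link graph. It is not taken from the talk or from the papers cited above; the names `hits` and `_normalize`, the iteration count, and the toy graph are all assumptions made for the example.

```python
from math import sqrt


def _normalize(scores):
    """Scale a score dictionary to unit Euclidean length (hypothetical helper)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Hub/authority power iteration on a link graph.

    `links` maps each page to the list of pages it points to.
    Returns two dictionaries: hub scores and authority scores.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores it links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Toy graph (an assumption): three index-like pages pointing at two content pages.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub_score = max(hub.values())
```

The two score vectors converge to the principal eigenvectors of the matrices formed from the link structure, which is why the procedure counts as a spectral method. On this toy graph the page linked from every index page ends up with the largest authority score, and pages that only receive links get a hub score of zero.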
Although data mining has been promising the extraction of interesting patterns from massive data, there has been very little theoretical discussion of what “interesting” means in this context. In a STOC 98 paper co-authored with Kleinberg and Raghavan, we argue that such a theory must necessarily take into account the optimization problem faced by the organization that is doing the data mining. This point of view leads quickly to many interesting and novel combinatorial problems, and some promising approximation algorithms, while leaving many challenging algorithmic problems wide open.