Measuring the Complexity of a Collection of Documents

  • Vishwa Vinay
  • Ingemar J. Cox
  • Natasa Milic-Frayling
  • Ken Wood
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


Some text collections are more difficult to search, or more complex to organize into topics, than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters, and we compute this quantity for document collections represented as sets of term vectors. We consider applications of the Cox-Lewis statistic in three scenarios: comparing the clusterability of different text collections under the same representation, comparing different representations of the same text collection, and predicting query performance from the clusterability of the query result set. Our experimental results show a correlation between the observed effectiveness and this statistic, demonstrating the utility of such data analysis in text retrieval.
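
To make the idea of a clustering-tendency ratio concrete, here is a minimal sketch in Python. It is not the authors' implementation, and the abstract does not specify the exact variation they use. The sketch assumes one common reading of a Cox-Lewis-style ratio: draw random sampling points inside the bounding box of the data, find the data point nearest to each sampling point, and divide that distance by the distance from the data point to its own nearest neighbour. The function name `cox_lewis_ratio`, the Euclidean distance, the bounding-box sampling window, and the averaging of ratios are all assumptions made for illustration.

```python
import numpy as np


def cox_lewis_ratio(points, n_samples=200, seed=0):
    """Mean ratio of (sampling point -> nearest data point) distance to
    (that data point -> its nearest neighbour) distance.

    Hypothetical helper: larger values suggest that the data sit in tight
    groups separated by empty space, i.e. a stronger clustering tendency.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)

    # Assumption: the sampling window is the axis-aligned bounding box.
    lo, hi = points.min(axis=0), points.max(axis=0)
    samples = rng.uniform(lo, hi, size=(n_samples, points.shape[1]))

    ratios = []
    for s in samples:
        # Distance from the random sampling point to every data point.
        d_sample = np.linalg.norm(points - s, axis=1)
        i = int(np.argmin(d_sample))

        # Distance from the selected data point to every other data point.
        d_data = np.linalg.norm(points - points[i], axis=1)
        d_data[i] = np.inf  # exclude the point itself
        nn = d_data.min()

        # Skip exact duplicates, which would give a zero denominator.
        if nn > 0:
            ratios.append(d_sample[i] / nn)

    return float(np.mean(ratios)) if ratios else float("nan")


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy stand-ins for document vectors: one unstructured set and one
    # set made of three tight groups.
    uniform = rng.uniform(0, 1, size=(300, 2))
    centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
    clustered = np.vstack(
        [c + 0.01 * rng.standard_normal((100, 2)) for c in centers]
    )
    print("uniform:  ", cox_lewis_ratio(uniform))
    print("clustered:", cox_lewis_ratio(clustered))
```

On toy data like this, the clustered set should score noticeably higher than the uniform one, because most sampling points land in empty space far from any group while neighbours inside a group are very close. Real term vectors are high-dimensional and sparse, so a practical version would likely need a different distance measure and sampling scheme than the ones assumed here.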


Keywords: Average Precision, Retrieval Performance, Mean Average Precision, Query Performance, Latent Semantic Indexing





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Vishwa Vinay (1)
  • Ingemar J. Cox (1)
  • Natasa Milic-Frayling (2)
  • Ken Wood (2)

  1. Department of Computer Science, University College London, UK
  2. Microsoft Research Ltd, Cambridge, UK
