International Journal on Digital Libraries, Volume 16, Issue 2, pp 145–159

Information-theoretic term weighting schemes for document clustering and classification


Abstract

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents versus in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed from terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
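
To make the weighting pipeline concrete, the following is a minimal Python sketch of the classic TF*IDF baseline that the proposed schemes are compared against. The LIB and LIF formulas are defined in the paper and are not reproduced here; all names below (tfidf_vectors, docs) are illustrative assumptions, not code from the paper.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Compute TF*IDF weight vectors for a list of tokenized documents.

        docs: list of term lists, one per document.
        Returns one {term: weight} dict per document.
        """
        n_docs = len(docs)
        # Document frequency: in how many documents each term occurs (binary occurrence).
        df = Counter(term for doc in docs for term in set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)                      # raw term frequency in this document
            vec = {}
            for term, freq in tf.items():
                idf = math.log(n_docs / df[term])  # inverse document frequency
                vec[term] = freq * idf             # classic TF*IDF weight
            vectors.append(vec)
        return vectors

    # Illustrative usage with toy documents (not the paper's test collections):
    docs = [["entropy", "information", "term"],
            ["term", "weighting", "clustering"],
            ["document", "clustering", "term"]]
    for vec in tfidf_vectors(docs):
        print(vec)

In the combined LIB*LIF scheme described in the abstract, the per-term weight would instead be the product of the LIB and LIF quantities, each computed from the term's prior distribution in the collection and its posterior distribution in the document.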

Keywords

Term weighting · Information measure · Semantic information · Document representation · Text clustering


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

College of Computing and Informatics, Drexel University, Philadelphia, USA
