Knowledge and Information Systems, Volume 41, Issue 1, pp 223–245

Improving NCD accuracy by combining document segmentation and document distortion

  • Ana Granados
  • Rafael Martínez
  • David Camacho
  • Francisco de Borja Rodríguez
Regular Paper


Compression distances have been applied to a broad range of domains because of their parameter-free nature, wide applicability and leading efficacy. However, they have a characteristic that becomes a drawback in particular circumstances: when they are used to compare two objects of very different sizes, they do not consider them similar even if they are related by a substring relationship. This work addresses this issue when compression distances are used to calculate similarities between documents. The proposed approach combines document segmentation and document distortion. Document segmentation is used to tackle the above-mentioned drawback, while document distortion helps compression distances obtain more reliable similarities. The results show that combining both techniques gives better results than applying neither of them or applying them separately. These results are consistent across datasets of diverse nature.
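The size drawback and the two remedies described above can be illustrated with a minimal sketch. It uses the standard normalized compression distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Everything else is an assumption made for illustration and is not taken from the paper: `zlib` as the compressor, fixed-size segments as long as the smaller document, taking the minimum over segments, and a distortion that replaces the `k` most frequent words with asterisks. The function names `segments`, `segmented_ncd` and `distort` are hypothetical.

```python
import re
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def segments(doc: bytes, size: int) -> list[bytes]:
    """Split a document into consecutive fixed-size segments (last one may be shorter)."""
    if size <= 0:
        raise ValueError("segment size must be positive")
    return [doc[i:i + size] for i in range(0, len(doc), size)]


def segmented_ncd(small: bytes, large: bytes) -> float:
    """Hypothetical segmentation strategy: compare the small document with
    each segment of the large one and keep the smallest distance."""
    return min(ncd(small, seg) for seg in segments(large, len(small)))


def distort(doc: bytes, k: int) -> bytes:
    """Hypothetical distortion: replace each of the k most frequent words
    with asterisks of the same length, leaving all other text untouched."""
    words = re.findall(rb"\w+", doc)
    frequent = {word for word, _ in Counter(words).most_common(k)}

    def mask(match: re.Match) -> bytes:
        word = match.group()
        return b"*" * len(word) if word in frequent else word

    return re.sub(rb"\w+", mask, doc)
```

When a short document appears verbatim inside a much longer one, `ncd` on the whole documents should stay close to 1 because the denominator is dominated by the larger document, whereas `segmented_ncd` should drop sharply for the segment that contains the shared text. A combined pipeline under these assumptions would call `distort` on both documents before `segmented_ncd`.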


Keywords: Algorithmic information theory, Data compression, Information filtering, Word removal, Document representation



We thank Francisco Aura for his useful comments on the draft and the anonymous referees for their constructive comments on the manuscript. This work was partially supported by the Spanish Ministry of Science and Innovation under TIN2010-19607 and TIN2010-19872/TSI.



Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Ana Granados (1)
  • Rafael Martínez (1)
  • David Camacho (1)
  • Francisco de Borja Rodríguez (1)
  1. Department of Computer Science, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
