Partial Symbol Ordering Distance

  • Javier Herranz
  • Jordi Nin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5861)

Abstract

Sequences of symbols are becoming increasingly important, as they are the standard format for representing information in a wide variety of domains, such as ontologies, sequential patterns, or non-numerical attributes in databases. Developing new distances for this kind of data is therefore crucial. Many similarity functions have recently been proposed for sequences of symbols; however, such functions do not always satisfy the triangular inequality. This property is required by many data mining algorithms, such as clustering or k-nearest neighbors, which assume a metric space. In this paper, we propose a new distance for sequences of (non-repeated) symbols based on the partial distances between the positions of the common symbols. We prove that this Partial Symbol Ordering distance satisfies the triangular inequality, and we describe a set of experiments indicating that the new distance outperforms the Edit distance in scenarios where sequence similarity is related to the positions occupied by the symbols.
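
The abstract's point that some position-based similarity functions fail the triangular inequality can be checked by brute force on small inputs. The sketch below is illustrative only: `squared_position_gap` is a hypothetical dissimilarity invented for this example and is not the Partial Symbol Ordering distance, whose definition the abstract does not give. `edit_distance` is the standard Levenshtein distance, and `triangle_violations` is an assumed helper name.

```python
from itertools import permutations, product


def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def squared_position_gap(a, b):
    """Hypothetical dissimilarity (NOT the paper's distance): sum of squared
    differences between the positions of the symbols common to a and b.
    Assumes no symbol is repeated within a sequence."""
    pos_b = {s: i for i, s in enumerate(b)}
    return sum((i - pos_b[s]) ** 2 for i, s in enumerate(a) if s in pos_b)


def triangle_violations(dist, items):
    """Return every triple (x, y, z) with dist(x, z) > dist(x, y) + dist(y, z)."""
    return [(x, y, z) for x, y, z in product(items, repeat=3)
            if dist(x, z) > dist(x, y) + dist(y, z)]


# All orderings of three non-repeated symbols.
sequences = ["".join(p) for p in permutations("abc")]
edit_bad = triangle_violations(edit_distance, sequences)
squared_bad = triangle_violations(squared_position_gap, sequences)
print("edit distance violations:", len(edit_bad))
print("squared-gap violations found:", len(squared_bad) > 0)
```

On this small set the Edit distance produces no violations, while the hypothetical squared-gap function does: for example, the direct dissimilarity between "abc" and "bca" exceeds the sum of the two steps through "bac". An exhaustive check like this can only refute the property on a finite sample; it does not replace a proof such as the one the paper reports.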

Keywords

Sequences of Symbols; Distances; Triangular Inequality

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Javier Herranz (1)
  • Jordi Nin (2)
  1. Dept. Matemàtica Aplicada IV, Universitat Politècnica de Catalunya, Barcelona (Spain)
  2. LAAS, Laboratoire d’Analyse et d’Architecture des Systèmes, CNRS, Centre National de la Recherche Scientifique, Toulouse (France)
