Extracting Knowledge from Life Courses: Clustering and Visualization

  • Nicolas S. Müller
  • Alexis Gabadinho
  • Gilbert Ritschard
  • Matthias Studer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5182)

Abstract

This article presents some of the facilities offered by our TraMineR R package for clustering and visualizing sequence data. First, we discuss our implementation of the optimal matching algorithm for evaluating the distance between two sequences and its use for generating a distance matrix for the whole sequence data set. Once such a matrix is obtained, it can serve as input for a cluster analysis, which can be done straightforwardly with any method available in the R statistical environment. We then present three kinds of plots for visualizing the characteristics of the obtained clusters: an aggregated plot depicting the average sequential behavior of cluster members; a sequence index plot that shows the diversity inside clusters; and an original frequency plot that highlights the frequencies of the n most frequent sequences. TraMineR was designed for analysing sequences representing life courses, and our presentation is illustrated on such a real-world data set. The material presented should also be of interest for other kinds of sequential data, such as DNA sequences or web logs.
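
The optimal matching distance mentioned in the abstract can be illustrated with a minimal sketch. It is written in Python for illustration only and is not TraMineR's implementation or API. The function names `om_distance` and `distance_matrix`, the constant substitution cost of 2, the insertion/deletion cost of 1, and the toy state codes (S, E, U) are all assumptions made for this example.

```python
from typing import List, Sequence


def om_distance(a: Sequence[str], b: Sequence[str],
                indel: float = 1.0, sub_cost: float = 2.0) -> float:
    """Minimal total cost of insertions, deletions and substitutions
    needed to turn sequence a into sequence b (dynamic programming).

    Costs are assumed constant here; real analyses often use a
    state-dependent substitution cost matrix instead.
    """
    n, m = len(a), len(b)
    # prev[j] = cost of turning the first i-1 states of a into the first j of b
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            curr[j] = min(prev[j] + indel,      # delete a[i-1]
                          curr[j - 1] + indel,  # insert b[j-1]
                          prev[j - 1] + sub)    # match or substitute
        prev = curr
    return prev[m]


def distance_matrix(seqs: Sequence[Sequence[str]],
                    indel: float = 1.0, sub_cost: float = 2.0) -> List[List[float]]:
    """Symmetric matrix of pairwise optimal matching distances."""
    k = len(seqs)
    d = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d[i][j] = d[j][i] = om_distance(seqs[i], seqs[j], indel, sub_cost)
    return d


# Hypothetical toy life courses: S = school, E = employed, U = unemployed.
sequences = ["SSEE", "SEEE", "SSUU"]
d = distance_matrix(sequences)
```

A matrix such as `d` is the kind of input a standard clustering routine expects, for example agglomerative hierarchical clustering on pairwise dissimilarities.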

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Nicolas S. Müller (1)
  • Alexis Gabadinho (1)
  • Gilbert Ritschard (1)
  • Matthias Studer (1)
  1. Department of Econometrics and Laboratory of Demography, University of Geneva
