
Extracting and Rendering Representative Sequences

  • Conference paper
Knowledge Discovery, Knowlege Engineering and Knowledge Management (IC3K 2009)

Abstract

This paper is concerned with the summarization of a set of categorical sequences. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of the sequences in their neighbourhood. The proposed heuristic for extracting the representative subset requires as main arguments a pairwise distance matrix, a representativeness criterion, and a distance threshold below which two sequences are considered redundant or, equivalently, in the neighbourhood of each other. It first builds a list of candidates sorted by a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in our TraMineR R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless by no means limited to social science data and should prove useful in many other domains.
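The heuristic outlined in the abstract can be illustrated with a minimal Python sketch. It is not the TraMineR implementation: the function name `extract_representatives`, the parameter names `radius` and `coverage`, and the choice of neighbourhood density as the representativeness score are all assumptions made for illustration, and the toy distance matrix is invented.

```python
from typing import List, Sequence, Set


def extract_representatives(
    dist: Sequence[Sequence[float]],
    radius: float,
    coverage: float,
) -> List[int]:
    """Greedy sketch: pick non-redundant representatives until the requested
    share of sequences lies in the neighbourhood of at least one of them."""
    n = len(dist)
    # Neighbourhood of i: all sequences within `radius` of i (i included).
    neighbours = [
        {j for j in range(n) if dist[i][j] <= radius} for i in range(n)
    ]
    # Assumed representativeness score: neighbourhood density.
    # Candidates are sorted by decreasing density, ties broken by index.
    candidates = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))

    representatives: List[int] = []
    covered: Set[int] = set()
    for i in candidates:
        if len(covered) >= coverage * n:
            break  # requested coverage reached
        # Redundancy elimination: skip a candidate that lies in the
        # neighbourhood of an already selected representative.
        if any(dist[i][r] <= radius for r in representatives):
            continue
        representatives.append(i)
        covered |= neighbours[i]
    return representatives


# Toy symmetric distance matrix (hypothetical): sequences 0-2 are close to
# each other, as are sequences 3-4.
dist = [
    [0, 1, 1, 8, 9],
    [1, 0, 2, 8, 9],
    [1, 2, 0, 8, 9],
    [8, 8, 8, 0, 1],
    [9, 9, 9, 1, 0],
]
print(extract_representatives(dist, radius=2, coverage=1.0))
print(extract_representatives(dist, radius=2, coverage=0.5))
```

With a symmetric distance matrix, every sequence skipped as redundant is already covered by the representative that made it redundant, so the loop reaches full coverage when `coverage` is 1.0. Other scores, such as sequence frequency or centrality, could replace the density in the sort key without changing the rest of the sketch.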

This work is part of the Swiss National Science Foundation research project FN-122230 “Mining event histories: Towards new insights on personal Swiss life courses”.




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gabadinho, A., Ritschard, G., Studer, M., Müller, N.S. (2011). Extracting and Rendering Representative Sequences. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowlege Engineering and Knowledge Management. IC3K 2009. Communications in Computer and Information Science, vol 128. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19032-2_7


  • DOI: https://doi.org/10.1007/978-3-642-19032-2_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19031-5

  • Online ISBN: 978-3-642-19032-2

  • eBook Packages: Computer Science, Computer Science (R0)
