Part of the book series: Advanced Information and Knowledge Processing ((AI&KP))

Summary

Many interesting real-life mining applications rely on modeling data as sequences of discrete multi-attribute records. The existing literature on sequence mining is partitioned along application-specific boundaries. In this article we distill the basic operations and techniques that are common to these applications. These include conventional mining operations, such as classification and clustering, and sequence-specific operations, such as tagging and segmentation. We review state-of-the-art techniques for sequential labeling and show how they apply to two real-life applications: address cleaning and information extraction from websites.

Copyright information

© 2005 Dr Sanghamitra Bandyopadhyay

About this chapter

Cite this chapter

Sarawagi, S. (2005). Sequence Data Mining. In: Advanced Methods for Knowledge Discovery from Complex Data. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/1-84628-284-5_6

  • DOI: https://doi.org/10.1007/1-84628-284-5_6

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-85233-989-0

  • Online ISBN: 978-1-84628-284-3

  • eBook Packages: Computer Science, Computer Science (R0)
