Summary
Many interesting real-life mining applications rely on modeling data as sequences of discrete multi-attribute records. The existing literature on sequence mining is partitioned along application-specific boundaries. In this article, we distill the basic operations and techniques that are common to these applications. These include conventional mining operations, such as classification and clustering, and sequence-specific operations, such as tagging and segmentation. We review state-of-the-art techniques for sequential labeling and show how they apply in two real-life applications: address cleaning and information extraction from websites.
Copyright information
© 2005 Dr Sanghamitra Bandyopadhyay
About this chapter
Cite this chapter
Sarawagi, S. (2005). Sequence Data Mining. In: Advanced Methods for Knowledge Discovery from Complex Data. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/1-84628-284-5_6
DOI: https://doi.org/10.1007/1-84628-284-5_6
Publisher Name: Springer, London
Print ISBN: 978-1-85233-989-0
Online ISBN: 978-1-84628-284-3
eBook Packages: Computer Science (R0)