Sequence Data Mining

Sarawagi, Sunita

doi:10.1007/1-84628-284-5_6

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Summary

Many interesting real-life mining applications rely on modeling data as sequences of discrete multi-attribute records. The existing literature on sequence mining is partitioned along application-specific boundaries. In this article we distill the basic operations and techniques that are common to these applications. These include conventional mining operations, such as classification and clustering, and sequence-specific operations, such as tagging and segmentation. We review state-of-the-art techniques for sequential labeling and show how they apply in two real-life applications: address cleaning and information extraction from websites.

About this chapter

Cite this chapter

Sarawagi, S. (2005). Sequence Data Mining. In: Advanced Methods for Knowledge Discovery from Complex Data. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/1-84628-284-5_6

DOI: https://doi.org/10.1007/1-84628-284-5_6
Publisher Name: Springer, London
Print ISBN: 978-1-85233-989-0
Online ISBN: 978-1-84628-284-3
eBook Packages: Computer Science, Computer Science (R0)
