A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise—that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges.
The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
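As a rough illustration of how bursts can appear as state transitions, the sketch below uses a deliberately simplified two-state version of such an automaton rather than the infinite-state model described above. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function name `detect_bursts`, the exponential model of inter-arrival gaps, the rate multiplier `s`, and the transition penalty `gamma * ln(n)` charged only when moving up into the burst state. The cheapest state sequence is found with a standard Viterbi-style dynamic program.

```python
import math


def detect_bursts(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap as 0 (baseline) or 1 (burst).

    Hypothetical two-state sketch: state 0 emits gaps at the average rate,
    state 1 at s times that rate. Entering the burst state costs
    gamma * ln(n); dropping back to baseline is free (assumed costs).
    """
    n = len(gaps)
    base_rate = n / sum(gaps)            # average arrival rate over the stream
    rates = [base_rate, base_rate * s]   # state 1 is the higher-rate state
    up_cost = gamma * math.log(n)

    def emit_cost(state, gap):
        # Negative log-density of an exponential gap at this state's rate.
        r = rates[state]
        return r * gap - math.log(r)

    # Viterbi-style recursion over the two states; start in baseline.
    cost = [0.0, float("inf")]
    back = []
    for gap in gaps:
        new_cost = [0.0, 0.0]
        pointers = [0, 0]
        for j in (0, 1):
            candidates = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = 0 if candidates[0] <= candidates[1] else 1
            pointers[j] = best
            new_cost[j] = candidates[best] + emit_cost(j, gap)
        back.append(pointers)
        cost = new_cost

    # Trace back the minimum-cost state sequence.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


if __name__ == "__main__":
    # Ten slow arrivals, ten rapid ones, then ten slow ones again.
    gaps = [10.0] * 10 + [0.5] * 10 + [10.0] * 10
    print(detect_bursts(gaps))
```

In this sketch a maximal run of 1 labels corresponds to one burst. Allowing more than two states with increasing rates is what would produce nested bursts, and hence the hierarchical structure the abstract refers to.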
About this article
Cite this article
Kleinberg, J. Bursty and Hierarchical Structure in Streams. Data Mining and Knowledge Discovery 7, 373–397 (2003). https://doi.org/10.1023/A:1024940629314
Keywords
- data stream algorithms
- text mining
- Markov source models