Encyclopedia of Database Systems

Living Edition
| Editors: Ling Liu, M. Tamer Özsu

Text Stream Processing

  • Jeong-Hyon Hwang
  • Alan G. Labouseur
  • Paul W. Olsen Jr.
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_80751-1

Keywords

Latent Dirichlet Allocation · Neural Information Processing System · Cost Constraint · Inverted List · Text Stream

Definition

A text stream is a continuously generated series of comments or small text documents. Each comment or document may carry a time stamp indicating when it was produced or received by a particular device or system. Text stream processing refers to the real-time extraction of desired information from text streams, for example by categorizing and clustering documents, detecting and tracking topics, matching patterns, and discovering events. Streaming text media (e.g., Twitter, WeChat, Facebook, and news feeds) offer fresher content with richer attributes and tend to have broader coverage than traditional electronic media (e.g., forums, blogs, and web sites). These advantages make them ripe for use in many engaging, innovative, and empowering applications (see Key Applications, below). In contrast to offline text mining, which analyzes a static collection of text documents (see “Text Mining”), text stream processing requires techniques that produce answers quickly enough to keep up with their input streams.

Historical Background

Prior to the study of text stream processing, researchers had developed a variety of techniques for analyzing collections of text. Representative techniques include:

Text Categorization/Classification.

Text categorization techniques assign documents to predefined categories according to their contents. These techniques typically extract features (e.g., frequencies of key terms) from each document and classify documents according to their features. For this classification, a supervised machine learning scheme (e.g., a Bayesian classifier, support vector machine, decision tree) is used (see “Text Categorization”).

Text Clustering.

Text clustering techniques group documents according to their similarities without predefined categories or classes. Representative text clustering techniques include single-link clustering, k-means, and co-clustering (see “Text Clustering”).

Topic Detection and Tracking.

Given a series of text documents, topic detection and tracking (TDT) techniques seek to discover topics within those documents and associate related topics with each other (see “Topic Detection and Tracking”).

Scientific Fundamentals

In contrast to offline text mining techniques, text stream processing techniques are designed to quickly extract knowledge from continuously generated text data. These techniques can be categorized as follows:

Categorization/Classification

Given (i) a stream of incoming documents, (ii) users’ interest profiles (i.e., queries), and (iii) an integer k, text categorization/classification techniques typically maintain the k most relevant documents for each query [2, 3, 10]. By managing only a small collection of the most relevant documents (not all received documents), these techniques keep the runtime overhead low, thereby enabling prompt processing of text streams. Each technique uses its own metric(s) for measuring the relevance of a document to a query. For example, Mouratidis and Pang defined the relevance of a document d to a query Q as
$$\sum_{t\in Q}\frac{f_{Q,t}}{\sqrt{\sum_{t'\in Q}f_{Q,t'}^{2}}}\;\frac{f_{d,t}}{\sqrt{\sum_{t'\in d}f_{d,t'}^{2}}}$$
where $f_{Q,t}$ and $f_{d,t}$ denote the frequency of term $t$ in query $Q$ and in document $d$, respectively [10]. Chen et al. [3], on the other hand, adopted a metric that incorporates, given a tweet and a query, the PageRank of the user who produced that tweet, with the constraint that the tweet must contain every word appearing in the query. The graph over which user PageRank is computed represents users as vertices and following relationships as directed edges.
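This relevance metric is, in effect, the cosine similarity between the term-frequency vectors of the query and the document, restricted to query terms. A minimal sketch (function names are illustrative):

```python
import math
from collections import Counter

def relevance(query_terms, doc_terms):
    """Cosine similarity between the term-frequency vectors of a query
    and a document, summed over query terms (Mouratidis-Pang style)."""
    fq = Counter(query_terms)
    fd = Counter(doc_terms)
    q_norm = math.sqrt(sum(f * f for f in fq.values()))
    d_norm = math.sqrt(sum(f * f for f in fd.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return sum((fq[t] / q_norm) * (fd[t] / d_norm) for t in fq)
```

For example, `relevance(["text", "stream"], ["text", "mining"])` yields 0.5: the shared term “text” contributes, while “stream” does not.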

The above text categorization/classification techniques adopt methods for efficiently maintaining the k most relevant documents for each query. For example, they derive, with low overhead, a ranking threshold for each query such that all documents that rank below that threshold cannot be in the top-k result and thus can be safely ignored. Furthermore, these techniques usually maintain, for each term, a list of documents containing that term (i.e., an inverted list) to quickly find relevant documents given a term. By maintaining these inverted lists in a certain way (e.g., keeping each list in decreasing frequency order), these techniques can reduce the number of documents they examine to obtain the correct top-k result [10].
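The per-query top-k maintenance and frequency-ordered inverted lists described above can be sketched as follows (all names are illustrative; real systems such as [10] derive tighter thresholds):

```python
import heapq
from collections import defaultdict

class TopKQuery:
    """Maintains the k highest-scoring documents for one standing query.
    A new document is kept only if its score beats the current threshold
    (the k-th best score), so most documents are rejected cheaply."""
    def __init__(self, k):
        self.k = k
        self.heap = []            # min-heap of (score, doc_id)

    @property
    def threshold(self):
        return self.heap[0][0] if len(self.heap) >= self.k else 0.0

    def offer(self, doc_id, score):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, doc_id))
        elif score > self.threshold:
            heapq.heapreplace(self.heap, (score, doc_id))

# Per-term inverted lists kept in decreasing-frequency order, so a query
# can stop scanning once the remaining frequencies are too small to
# affect its top-k result.
inverted = defaultdict(list)      # term -> [(freq, doc_id)], descending

def index(doc_id, term_freqs):
    for term, freq in term_freqs.items():
        lst = inverted[term]
        lst.append((freq, doc_id))
        lst.sort(reverse=True)    # fine for a sketch; real systems do better
```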

Clustering

Given (i) a stream of documents and (ii) an integer k, a text clustering technique constructs and maintains k clusters of documents, where documents within each cluster are highly similar. In order to efficiently deal with large volumes of incoming text, clustering techniques typically use a sliding time window and keep only the documents received within that time window [1, 6]. Banerjee et al. provided three online versions of popular clustering techniques, namely, von Mises-Fisher (vMF), Dirichlet compound multinomial (DCM), and latent Dirichlet allocation (LDA) [1].
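The sliding-window bookkeeping can be sketched as follows (an illustrative structure, not the exact mechanism of [1] or [6]):

```python
from collections import deque

class SlidingWindow:
    """Keeps only documents whose timestamp falls within the last
    `width` time units; older documents are evicted before clustering."""
    def __init__(self, width):
        self.width = width
        self.docs = deque()        # (timestamp, doc), timestamps ascending

    def add(self, timestamp, doc):
        self.docs.append((timestamp, doc))
        while self.docs and self.docs[0][0] <= timestamp - self.width:
            self.docs.popleft()    # evict expired documents

    def current(self):
        return [d for _, d in self.docs]
```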

Another complexity that arises in the context of clustering within a text stream is how best to represent the documents to be clustered. The traditional vector space model for representing documents is not ideal for a streaming environment, because although two documents might appear similar in terms of features, they may refer to different real-world events depending on the time they were created. To meet this challenge, He et al. [6] presented a model that, given a document, takes into account both the document’s creation time and the bursty weight of the document’s features at that time. The bursty weight of a feature represents the intensity of a burst at a specific time [7]; if the feature is not experiencing a burst at that time, the bursty weight is 0.
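As a rough illustration, a bursty weight might be approximated by comparing a feature’s count in the current window against its historical average; this is a crude stand-in for Kleinberg’s automaton-based burst model [7], used here only to show how a document’s representation can be time-dependent:

```python
def bursty_weight(current_count, history):
    """Crude bursty-weight proxy: how far the current window's count of a
    feature exceeds its historical mean, normalized; 0 when not bursting.
    (Kleinberg's model [7] uses an automaton over arrival rates instead.)"""
    if not history:
        return 0.0
    mean = sum(history) / len(history)
    if current_count <= mean:
        return 0.0
    return (current_count - mean) / (mean + 1.0)

def bursty_vector(term_counts, histories):
    """Represent a document by term -> bursty weight at its creation time."""
    return {t: bursty_weight(c, histories.get(t, []))
            for t, c in term_counts.items()}
```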

Topic Monitoring

Given (i) a text stream, (ii) a classifier f that determines the relevance of a comment (or document) to a specific target topic, and (iii) a set of cost constraints (e.g., the maximum number of comments that can be examined per time unit), topic monitoring techniques strive to detect as many relevant comments as possible while meeting the cost constraints. Since manually defining keywords for a target topic is laborious and likely to yield suboptimal topic coverage (e.g., keywords may become outdated as the context of a topic changes over time), these techniques select and adjust keywords automatically.

Li et al. designed automatic topic-focused monitoring (ATM) [8], which supports two types of cost constraints: a limit n on the number of keywords and the maximum number B of comments to examine. As shown in Fig. 1, for each time window, ATM uniformly samples up to B comments from the input stream (“Sampler”), selects n keywords (“Keyword Selector”), and uses these keywords to identify relevant comments (“Classifier”). Since the problem of finding the n most useful keywords (i.e., those that lead to the largest number of relevant comments) under cost constraints is NP-hard, ATM uses a polynomial time greedy approximation algorithm. This keyword selection algorithm repeatedly chooses the next most useful keyword until the number of selected keywords reaches n.
Text Stream Processing, Fig. 1: Automatic topic-focused monitoring
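The greedy keyword selection can be sketched as a max-coverage heuristic; the utility function and the classifier below are simplified stand-ins for those in [8]:

```python
def select_keywords(sampled_comments, is_relevant, n):
    """Greedy keyword selection in the spirit of ATM [8]: repeatedly pick
    the keyword covering the most not-yet-covered relevant comments.
    `is_relevant` stands in for the classifier f; ATM's actual utility
    function differs, so treat this as an illustrative sketch."""
    relevant = [set(c.split()) for c in sampled_comments if is_relevant(c)]
    uncovered = set(range(len(relevant)))
    keywords = []
    while len(keywords) < n and uncovered:
        # Count how many uncovered relevant comments each word covers.
        gain = {}
        for i in uncovered:
            for word in relevant[i]:
                gain[word] = gain.get(word, 0) + 1
        best = max(gain, key=gain.get)
        keywords.append(best)
        uncovered = {i for i in uncovered if best not in relevant[i]}
    return keywords
```

This greedy strategy gives the standard (1 − 1/e) approximation guarantee for max-coverage-style objectives, which is why it is a natural choice for the NP-hard selection problem.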

Bursty Event Detection

Given a stream of documents, the goal of bursty event detection is to identify prominent events based on features (i.e., words or terms) of the documents and their time information. Such a prominent event, called a bursty event, is expressed as a set of bursty features found in the documents that appear within a certain time window, where a bursty feature is one that occurs with unusually high frequency in those documents. A set of bursty features may also serve as positive examples for training a supervised text classifier (i.e., text classification becomes feasible without user-provided positive examples).

Fung et al. developed a parameter-free probabilistic approach, called feature-pivot clustering, for effectively and efficiently identifying bursty events [5]. The technique first identifies bursty features and then groups them into bursty events. For this grouping, it uses a cost function that assigns a lower cost to a group whose bursty features are similar to one another and commonly appear together in a large number of documents. The grouping phase repeatedly extracts the bursty event with the lowest cost until every bursty feature has been assigned to an event. Finally, the technique finds the hot periods of each bursty event (i.e., the time periods during which the features of that event occur unusually often in documents).
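The grouping idea can be illustrated with a greedy merge based on document-set overlap; this is a simplified stand-in for the probabilistic cost function of [5]:

```python
def group_bursty_features(features, doc_sets, overlap=0.5):
    """Illustrative stand-in for feature-pivot clustering [5]: greedily
    merge bursty features whose supporting document sets overlap enough
    (Jaccard similarity >= `overlap`). The actual cost function in [5]
    is probabilistic; this sketch only captures the grouping idea."""
    events = []                     # each event: (feature set, doc set)
    for f in features:
        docs = doc_sets[f]
        for feats, edocs in events:
            inter = len(docs & edocs)
            union = len(docs | edocs)
            if union and inter / union >= overlap:
                feats.add(f)        # feature joins an existing event
                edocs |= docs
                break
        else:
            events.append(({f}, set(docs)))
    return [feats for feats, _ in events]
```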

Pattern Matching

Elkhalifa et al. [4] studied pattern matching in text streams, where, in addition to dealing with vast amounts of data, numerous continuous pattern detection queries must be supported. In this work, patterns are defined in terms of a tree called a pattern detection graph (PDG), where leaves correspond to simple patterns (e.g., term synonyms, regular expressions), and internal nodes are complex patterns (i.e., complex operators, such as term frequency and text proximity, that can be applied to simple patterns). The leaves of the tree are indexed in a suffix trie shared among all PDGs. When a new document is encountered, the terms in the document are compared against the leaves (simple patterns) stored in the suffix trie. If there is a leaf match, details concerning the match are propagated to every PDG that contains the pattern as a leaf. In this way, multiple patterns sharing a leaf node can be quickly detected without redundantly matching each individual pattern. As soon as a pattern is detected in the stream, an email message containing details of that detection is sent to the user who submitted the pattern.
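The shared-leaf idea can be sketched as follows, with plain terms standing in for the suffix-trie-indexed simple patterns of [4]:

```python
from collections import defaultdict

class SharedLeafIndex:
    """Sketch of InfoFilter-style shared matching [4]: simple patterns
    (here plain terms; the real system indexes them in a suffix trie) are
    stored once and shared by every pattern detection graph (PDG) that
    uses them, so each document is scanned only a single time."""
    def __init__(self):
        self.subscribers = defaultdict(set)   # term -> ids of PDGs using it

    def register(self, pdg_id, terms):
        for t in terms:
            self.subscribers[t].add(pdg_id)

    def match(self, document_terms):
        """Return {pdg_id: matched terms} for one incoming document."""
        hits = defaultdict(set)
        for t in set(document_terms):
            for pdg in self.subscribers.get(t, ()):
                hits[pdg].add(t)              # propagate the leaf match
        return dict(hits)
```

In a full system, each notified PDG would then evaluate its internal nodes (frequency, proximity, and similar operators) over the propagated leaf matches.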

Key Applications

Business Intelligence.

Analysis of tweets and news stories helps business owners understand customers’ opinions about products and make appropriate marketing decisions.

Emergency Management.

Crises and disasters are quickly identified from tweets and then addressed accordingly.

Political Analysis.

Politicians identify concerns of voters from tweets and blogs and strive to predict election results.

National Security.

Terror threats are identified from email messages, tweets, and blogs.

Healthcare.

Disease outbreaks are detected from tweets, blogs, and search engine log files.

Future Directions

Possible future research areas include:
  • Text stream processing with variable time windows

  • Text stream processing optimizations

  • Massively parallel text stream processing

  • Text stream processing under quality of service (QoS) constraints

Data Sets

SparkFun Electronics hosts a collection of public data streams at https://data.sparkfun.com.

Reuters news articles classified by topic and ordered by their date of issue are available at http://archive.ics.uci.edu/ml/.

Text from the proceedings of the Neural Information Processing Systems (NIPS) conference is provided at http://nips.djvuzone.org/txt.html.

A series of research articles can be obtained from http://www.acm.org/dl [9].

Recommended Reading

  1. Banerjee A, Basu S. Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the seventh SIAM international conference on data mining (SDM); 2007. p. 437–42.
  2. Bhide M, Chakaravarthy VT, Ramamritham K, Roy P. Keyword search over dynamic categorized information. In: Proceedings of the 2009 IEEE international conference on data engineering (ICDE); 2009. p. 258–69.
  3. Chen C, Li F, Ooi BC, Wu S. TI: an efficient indexing mechanism for real-time search on tweets. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data; 2011. p. 649–60.
  4. Elkhalifa L, Adaikkalavan R, Chakravarthy S. InfoFilter: a system for expressive pattern specification and detection over text streams. In: Proceedings of the 2005 ACM symposium on applied computing (SAC); 2005. p. 1084–8.
  5. Fung GPC, Yu JX, Yu PS, Lu H. Parameter free bursty events detection in text streams. In: Proceedings of the 31st international conference on very large data bases (VLDB); 2005. p. 181–92.
  6. He Q, Chang K, Lim E-P, Zhang J. Bursty feature representation for clustering text streams. In: Proceedings of the seventh SIAM international conference on data mining (SDM); 2007. p. 491–6.
  7. Kleinberg J. Bursty and hierarchical structure in streams. Data Min Knowl Disc. 2003;7(4):373–97.
  8. Li R, Wang S, Chang KC-C. Towards social data platform: automatic topic-focused monitor for twitter stream. Proc VLDB Endow (PVLDB). 2013;6(14):1966–77.
  9. Mei Q, Zhai C. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining; 2005. p. 198–207.
  10. Mouratidis K, Pang H. An incremental threshold method for continuous text search queries. In: Proceedings of the 25th international conference on data engineering (ICDE); 2009. p. 1187–90.

Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  • Jeong-Hyon Hwang, Department of Computer Science, University at Albany – State University of New York, Albany, USA
  • Alan G. Labouseur, School of Computer Science and Mathematics, Marist College, Poughkeepsie, USA
  • Paul W. Olsen Jr., Department of Computer Science, The College of Saint Rose, Albany, USA

Section editors and affiliations

  • Ugur Cetintemel, Department of Computer Science, Brown University, Providence, USA