Text Stream Processing
Keywords: Latent Dirichlet Allocation; Neural Information Processing System; Cost Constraint; Inverted List; Text Stream
A text stream is a continuously generated series of comments or small text documents. Each comment or text document may be associated with a time stamp indicating when it was produced or received by a certain device or system. Text stream processing refers to real-time extraction of desired information from text streams (through categorizing and clustering documents in text streams, detecting and tracking topics, matching patterns, and discovering events). Streaming text media (e.g., Twitter, WeChat, Facebook, news feeds, etc.) have fresher content with richer attributes and tend to have broader coverage compared to traditional electronic media (e.g., forums, blogs, and web sites). These advantages make them ripe for use in many engaging, innovative, and empowering applications (see Key Applications, below). In contrast to offline text mining, which analyzes a static collection of text documents (see “Text Mining”), text stream processing requires techniques that quickly produce answers while keeping up with input text streams.
Prior to the study of text stream processing, researchers had developed a variety of techniques for analyzing collections of text. Representative techniques include:
Text categorization techniques assign documents to predefined categories according to their contents. These techniques typically extract features (e.g., frequencies of key terms) from each document and classify documents according to their features. For this classification, a supervised machine learning scheme (e.g., a Bayesian classifier, support vector machine, decision tree) is used (see “Text Categorization”).
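As a concrete (if toy) illustration of this pipeline, the sketch below extracts bag-of-words features and classifies documents with a multinomial Bayesian classifier; the function names and the four-document training set are invented for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial naive Bayes text classifier.
    docs: list of (list_of_terms, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)   # label -> term -> count
    vocab = set()
    for terms, label in docs:
        class_counts[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)
    return class_counts, term_counts, vocab

def classify(terms, model):
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n_docs in class_counts.items():
        lp = math.log(n_docs / total_docs)            # class prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in terms:
            # Laplace smoothing keeps unseen terms from zeroing the score.
            lp += math.log((term_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

training = [(["goal", "match", "team"], "sports"),
            (["election", "vote", "party"], "politics"),
            (["team", "win", "score"], "sports"),
            (["party", "policy", "vote"], "politics")]
model = train_nb(training)
print(classify(["team", "goal"], model))   # → sports
```

A support vector machine or decision tree would slot into the same feature-extraction pipeline; naive Bayes is used here only because it fits in a few lines.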
Text clustering techniques group documents according to their similarities without predefined categories or classes. Representative text clustering techniques include single-link clustering, k-means, and co-clustering (see “Text Clustering”).
Topic Detection and Tracking
Given a series of text documents, topic detection and tracking (TDT) techniques seek to discover topics within those documents and associate related topics with each other (see “Topic Detection and Tracking”).
In contrast to offline text mining techniques, text stream processing techniques are designed to quickly extract knowledge from continuously generated text data. These techniques can be categorized as follows:
To operate over streams, text categorization/classification techniques adopt methods for efficiently maintaining the k most relevant documents for each continuous query. For example, they derive, with low overhead, a ranking threshold for each query such that all documents that rank below that threshold cannot be in the top-k result and thus can be safely ignored. Furthermore, these techniques usually maintain, for each term, a list of documents containing that term (i.e., an inverted list) to quickly find relevant documents given a term. By maintaining these inverted lists in a certain way (e.g., keeping each list in decreasing frequency order), these techniques can reduce the number of documents they examine to obtain the correct top-k result.
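A minimal sketch of the inverted-list/threshold idea, assuming a toy scoring function (the sum of query-term frequencies) and Fagin-style early termination; the function names and data are illustrative, not the exact algorithms of the cited systems:

```python
import heapq
from collections import defaultdict

def build_inverted_lists(docs):
    """docs: {doc_id: {term: tf}} -> {term: [(tf, doc_id)]} sorted by decreasing tf."""
    lists = defaultdict(list)
    for doc_id, tfs in docs.items():
        for term, tf in tfs.items():
            lists[term].append((tf, doc_id))
    for term in lists:
        lists[term].sort(reverse=True)
    return lists

def top_k(query_terms, lists, docs, k):
    """Scan frequency-ordered lists in lockstep and stop once k documents
    score at least as high as any unseen document still could."""
    seen, results = set(), []        # results: min-heap of (score, doc_id)
    depth = 0
    while True:
        frontier = []
        for term in query_terms:
            plist = lists.get(term, [])
            if depth < len(plist):
                tf, doc_id = plist[depth]
                frontier.append(tf)
                if doc_id not in seen:
                    seen.add(doc_id)
                    score = sum(docs[doc_id].get(t, 0) for t in query_terms)
                    heapq.heappush(results, (score, doc_id))
                    if len(results) > k:
                        heapq.heappop(results)
        # Ranking threshold: upper bound on the score of any unseen document.
        threshold = sum(frontier)
        if not frontier or (len(results) == k and results[0][0] >= threshold):
            return sorted(results, reverse=True)
        depth += 1

docs = {"d1": {"a": 3, "b": 1}, "d2": {"a": 1, "b": 4}, "d3": {"a": 2}}
lists = build_inverted_lists(docs)
print(top_k(["a", "b"], lists, docs, k=1))   # → [(5, 'd2')]
```

Because each list is kept in decreasing frequency order, the scan can stop after a few positions instead of touching every posting.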
Given (i) a stream of documents and (ii) an integer k, a text clustering technique constructs and maintains k clusters of documents, where documents within each cluster are highly similar. In order to efficiently deal with large volumes of incoming text, clustering techniques typically use a sliding time window and keep only the documents received within that time window [1, 6]. Banerjee and Basu provided online versions of three popular clustering techniques, namely, von Mises-Fisher (vMF), Dirichlet compound multinomial (DCM), and latent Dirichlet allocation (LDA).
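The sliding-window scheme can be sketched as follows; the class name, the term-overlap similarity, and the rebuild-on-eviction policy are illustrative simplifications, not the methods of the cited works:

```python
from collections import deque

class SlidingWindowClusterer:
    """Toy online clustering over a sliding time window: documents older
    than `window` are evicted, and each arriving document joins the
    centroid (here, a term set) it overlaps most."""
    def __init__(self, k, window):
        self.k, self.window = k, window
        self.docs = deque()                          # (timestamp, terms, cluster_id)
        self.clusters = [set() for _ in range(k)]    # centroid term sets

    def add(self, timestamp, terms):
        # Evict expired documents, then rebuild centroids (simple, not fast).
        evicted = False
        while self.docs and self.docs[0][0] <= timestamp - self.window:
            self.docs.popleft()
            evicted = True
        if evicted:
            self.clusters = [set() for _ in range(self.k)]
            for _, t, cid in self.docs:
                self.clusters[cid] |= t
        # Assign to the centroid with the largest term overlap;
        # ties go to the smallest (possibly empty) centroid.
        best = max(range(self.k),
                   key=lambda i: (len(self.clusters[i] & terms),
                                  -len(self.clusters[i])))
        self.clusters[best] |= terms
        self.docs.append((timestamp, terms, best))
        return best

c = SlidingWindowClusterer(k=2, window=10)
print(c.add(0, {"goal", "team"}))     # → 0
print(c.add(1, {"vote", "party"}))    # → 1
print(c.add(2, {"team", "win"}))      # → 0
```

A production system would update centroids incrementally on eviction rather than rebuilding them, but the window-based bookkeeping is the essential point.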
Another complexity that arises in the context of clustering within a text stream is how best to represent the documents to be clustered. The traditional vector space model for representing documents is not ideal for a streaming environment, because although two documents might appear similar in terms of features, they may refer to different real-world events depending on the time they were created. To meet this challenge, He et al. presented a model that, given a document, takes into account both the document’s creation time and the bursty weight of the document’s features at that time. The bursty weight of a feature represents the intensity of a burst at a specific time; if the feature is not experiencing a burst at that time, the bursty weight is 0.
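One simple way to realize such a burst-aware representation is sketched below; the mean-based formula is an illustrative stand-in, not the burst model actually used by He et al.:

```python
def bursty_weight(current_freq, history):
    """Bursty weight of a feature: how far its frequency in the current
    time window exceeds its historical average; 0 if there is no burst.
    (Illustrative formula only.)"""
    if not history:
        return 0.0
    expected = sum(history) / len(history)
    return max(0.0, current_freq - expected) / (expected + 1.0)

def burst_aware_vector(term_freqs, current_counts, histories):
    """Scale each term's weight by (1 + bursty weight), so two documents
    with identical terms are represented differently when created at
    times with different burst intensities."""
    return {t: tf * (1.0 + bursty_weight(current_counts.get(t, 0),
                                         histories.get(t, [])))
            for t, tf in term_freqs.items()}

# A term occurring 10 times now, against a history of 2 per window, bursts;
# a term at or below its historical average gets weight 0.
print(bursty_weight(10, [2, 2, 2]))   # → 2.6666666666666665
print(bursty_weight(1, [2, 2, 2]))    # → 0.0
```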
Given (i) a text stream, (ii) a classifier f that determines the relevance of a comment (or document) to a specific target topic, and (iii) a set of cost constraints (e.g., the maximum number of comments that can be examined per time unit), topic monitoring techniques strive to detect as many relevant comments as possible while meeting the cost constraints. Since manually defining keywords for a target topic is laborious and likely to result in suboptimal topic coverage (e.g., keywords may become outdated as contexts change over time), these techniques require automatic selection and adjustment of keywords.
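Automatic keyword selection under a cost budget can be framed as a greedy coverage problem; the data layout and greedy rule below are assumptions made for illustration, not the algorithm of any cited system:

```python
def select_keywords(candidates, budget):
    """Greedily pick keywords that maximize newly covered relevant
    comments per unit of cost, until the examination budget is spent.
    candidates: {keyword: (set_of_relevant_comment_ids, cost_per_time_unit)}."""
    covered, chosen, spent = set(), [], 0
    while True:
        best, best_gain = None, 0.0
        for kw, (hits, cost) in candidates.items():
            if kw in chosen or spent + cost > budget:
                continue
            gain = len(hits - covered) / cost   # marginal coverage per cost
            if gain > best_gain:
                best, best_gain = kw, gain
        if best is None:            # nothing affordable adds coverage
            return chosen
        hits, cost = candidates[best]
        covered |= hits
        chosen.append(best)
        spent += cost

candidates = {"worldcup": ({1, 2, 3, 4}, 2),
              "goal":     ({3, 4, 5}, 3),
              "match":    ({6}, 1)}
print(select_keywords(candidates, budget=3))   # → ['worldcup', 'match']
```

Rerunning the selection periodically, with hit sets estimated from recently classified comments, is one way to keep the keyword set from becoming outdated.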
Bursty Event Detection
Given a stream of documents, the goal of bursty event detection is to identify prominent events based on features (i.e., words or terms) of the documents and their time information. Such a prominent event, called a bursty event, is expressed using a set of bursty features found in the documents appearing in a certain time window. A bursty feature is one that occurs with unusually high frequency in those documents. A set of bursty features may be used to specify positive examples for training a supervised text classifier (i.e., text classification is feasible without user-provided positive examples).
Fung et al. have developed a parameter-free probabilistic approach for effectively and efficiently identifying bursty events. This technique, called feature-pivot clustering, first identifies bursty features. Then, it identifies bursty events by grouping bursty features. For this grouping, it uses a cost function that assigns a lower cost to a group as the group contains more bursty features that are similar to one another and commonly appear in a large number of documents. This grouping phase repeatedly finds a bursty event with the lowest cost until every bursty feature is assigned to a bursty event. Finally, the technique finds the hot periods of each bursty event (i.e., the time periods when the features related to that bursty event have unusually high occurrences in documents).
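The two phases, finding bursty features and then grouping them into events, can be sketched as follows; the mean-based burst test and the overlap-based grouping rule are simple stand-ins for the probabilistic model and cost function of Fung et al.:

```python
def bursty_features(window_counts, history, factor=2.0):
    """A feature is bursty in the current window if its count exceeds
    `factor` times its historical mean (illustrative burst test)."""
    out = set()
    for term, count in window_counts.items():
        past = history.get(term, [])
        mean = sum(past) / len(past) if past else 0.0
        if count > factor * mean:
            out.add(term)
    return out

def group_events(features, docs, min_overlap=0.5):
    """Group bursty features into events by document co-occurrence: a
    feature joins an event if it co-occurs with the event's first feature
    in at least `min_overlap` of the documents containing either
    (a single-representative check keeps the sketch short)."""
    appears = {f: {i for i, d in enumerate(docs) if f in d} for f in features}
    events = []
    for f in sorted(features):
        for event in events:
            g = next(iter(event))
            both = appears[f] & appears[g]
            either = appears[f] | appears[g]
            if either and len(both) / len(either) >= min_overlap:
                event.add(f)
                break
        else:
            events.append({f})
    return events

window_counts = {"earthquake": 9, "tsunami": 8, "weather": 3}
history = {"earthquake": [1, 2], "tsunami": [1, 1], "weather": [3, 3]}
docs = [{"earthquake", "tsunami"}, {"earthquake", "tsunami"}, {"earthquake"}]
print(group_events(bursty_features(window_counts, history), docs))
```

Here "earthquake" and "tsunami" burst and co-occur, so they form one event, while "weather" stays at its historical rate and is filtered out.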
Elkhalifa et al. studied pattern matching in text streams, where, in addition to dealing with vast amounts of data, numerous continuous pattern detection queries must be supported. In this work, patterns are defined in terms of a tree called a pattern detection graph (PDG), where leaves correspond to simple patterns (e.g., term synonyms, regular expressions), and internal nodes are complex patterns (i.e., complex operators, such as term frequency and text proximity, that can be applied to simple patterns). The leaves of the tree are indexed in a suffix trie shared among all PDGs. When a new document is encountered, the terms in the document are compared against the leaves (simple patterns) stored in the suffix trie. If there is a leaf match, details concerning the match are propagated to every PDG that contains the pattern as a leaf. In this way, multiple patterns sharing a leaf node can be quickly detected without redundantly matching each individual pattern. As soon as a pattern is detected in the stream, an email message containing details of that detection is sent to the user who submitted the pattern.
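The shared-leaf idea can be sketched with a plain dictionary standing in for the suffix trie; the `PatternMatcher` class and patterns modeled as flat term sets are deliberate simplifications of PDGs:

```python
from collections import defaultdict

class PatternMatcher:
    """Sketch of shared-leaf matching: each pattern is a set of required
    terms (a stand-in for a PDG). All leaves live in one shared index,
    so each term is matched once per document, not once per pattern."""
    def __init__(self):
        self.leaf_index = defaultdict(set)    # term -> ids of patterns using it
        self.patterns = {}                    # pattern id -> required terms

    def register(self, pattern_id, terms):
        self.patterns[pattern_id] = set(terms)
        for t in terms:
            self.leaf_index[t].add(pattern_id)

    def match(self, document_terms):
        # One pass over the shared index; leaf hits are propagated to
        # every pattern that contains the matched leaf.
        hits = defaultdict(set)
        for t in set(document_terms):
            for pid in self.leaf_index.get(t, ()):
                hits[pid].add(t)
        # A pattern is detected when every one of its leaves matched.
        return [pid for pid, matched in hits.items()
                if matched == self.patterns[pid]]

m = PatternMatcher()
m.register("p1", ["storm", "flood"])
m.register("p2", ["storm"])
print(sorted(m.match(["storm", "flood", "rain"])))   # → ['p1', 'p2']
```

A real implementation would support regular expressions, proximity, and frequency operators at the internal nodes; the point here is that the leaf "storm" is matched once and propagated to both patterns.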
Analysis of tweets and news stories helps business owners understand customers’ opinions about products and make appropriate marketing decisions.
Crises and disasters are quickly identified from tweets and then addressed accordingly.
Politicians identify concerns of voters from tweets and blogs and strive to predict election results.
Terror threats are identified from email messages, tweets, and blogs.
Disease outbreaks are detected from tweets, blogs, and search engine log files.
Text stream processing with variable time windows
Text stream processing optimizations
Massively parallel text stream processing
Text stream processing under quality of service (QoS) constraints
SparkFun Electronics hosts a collection of public data streams at https://data.sparkfun.com.
Reuters news articles classified by topic and ordered by their date of issue are available at http://archive.ics.uci.edu/ml/.
Text from the proceedings of the Neural Information Processing Systems (NIPS) conference is provided at http://nips.djvuzone.org/txt.html.
1. Banerjee A, Basu S. Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the seventh SIAM international conference on data mining (SDM); 2007. p. 437–42.
2. Bhide M, Chakaravarthy VT, Ramamritham K, Roy P. Keyword search over dynamic categorized information. In: Proceedings of the 2009 IEEE international conference on data engineering (ICDE); 2009. p. 258–69.
3. Chen C, Li F, Ooi BC, Wu S. TI: an efficient indexing mechanism for real-time search on tweets. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data; 2011. p. 649–60.
4. Elkhalifa L, Adaikkalavan R, Chakravarthy S. InfoFilter: a system for expressive pattern specification and detection over text streams. In: Proceedings of the 2005 ACM symposium on applied computing (SAC); 2005. p. 1084–8.
5. Fung GPC, Yu JX, Yu PS, Lu H. Parameter free bursty events detection in text streams. In: Proceedings of the 31st international conference on very large data bases (VLDB); 2005. p. 181–92.
6. He Q, Chang K, Lim E-P, Zhang J. Bursty feature representation for clustering text streams. In: Proceedings of the seventh SIAM international conference on data mining (SDM); 2007. p. 491–6.
8. Li R, Wang S, Chang KC-C. Towards social data platform: automatic topic-focused monitor for Twitter stream. Proc VLDB Endow (PVLDB). 2013;6(14):1966–77.
9. Mei Q, Zhai C. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining; 2005. p. 198–207.
10. Mouratidis K, Pang H. An incremental threshold method for continuous text search queries. In: Proceedings of the 25th international conference on data engineering (ICDE); 2009. p. 1187–90.