Abstract
This paper presents an evolutionary algorithm for modeling the arrival dates in time-stamped data sequences such as newscasts, e-mails, IRC conversations, scientific journal articles or weblog postings. These models are applied to the detection of buzz (i.e. terms that occur with a higher-than-normal frequency) in such sequences, a problem that has attracted considerable interest in the online world as the number of periodic content producers has grown. For this reason we have tested our system on online sequences of this kind, although it is also valid for other types of event sequences. The algorithm assigns frequencies (number of events per time unit) to time intervals so as to produce an optimal fit to the data. The optimization procedure is a trade-off between accurately fitting the data and avoiding too many frequency changes, thus overcoming the noise inherent in these sequences. This process has traditionally been performed using dynamic programming algorithms, which are limited by memory and efficiency requirements. This limitation can be a problem when dealing with long sequences, and suggests the application of alternative search methods with some degree of uncertainty to achieve tractability, such as the evolutionary algorithm proposed in this paper. This algorithm reaches the same solution quality as those classical dynamic programming algorithms, but in a shorter time. We also test different cost functions and propose a new one that, on real-world data, yields better fits than the one originally proposed by Kleinberg. Finally, several distributions of states for the finite state automata are tested, with the result that a uniform distribution produces much better fits than the geometric distribution also proposed by Kleinberg. We also present a variant of the evolutionary algorithm, which achieves a fast fit of a sequence extended with new data, by taking advantage of the fit obtained for the original subsequence.
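The trade-off between fitting the data and limiting frequency changes can be illustrated with a minimal sketch. It assumes the cost function of Kleinberg (2003), with exponentially distributed gaps between events and a geometric distribution of state rates; it is not the new cost function proposed in this paper, which the abstract does not specify. The function name `kleinberg_cost` and the parameter names `s` and `gamma` are chosen here for illustration only.

```python
import math

def kleinberg_cost(gaps, states, s=2.0, gamma=1.0):
    """Cost of labelling each inter-arrival gap with a state index.

    Assumed model (Kleinberg 2003): state i emits gaps with rate
    base_rate * s**i, and moving up k states costs k * gamma * ln(n).
    """
    n = len(gaps)
    base_rate = n / sum(gaps)       # rate of state 0: the average frequency
    cost = 0.0
    prev = 0                        # the automaton is assumed to start in state 0
    for gap, state in zip(gaps, states):
        rate = base_rate * s ** state
        # Fitting term: negative log-density of an exponential gap.
        cost += rate * gap - math.log(rate)
        # Change term: only moves to a higher state are penalised.
        if state > prev:
            cost += (state - prev) * gamma * math.log(n)
        prev = state
    return cost

# Hypothetical data: ten short gaps (a burst) surrounded by long gaps.
gaps = [10.0] * 3 + [1.0] * 10 + [10.0] * 3
flat = [0] * 16                     # a single frequency for the whole sequence
bursty = [0] * 3 + [2] * 10 + [0] * 3
print(kleinberg_cost(gaps, flat), kleinberg_cost(gaps, bursty))
```

Under these assumptions the labelling that raises the state during the run of short gaps has the lower cost, because the improved fit outweighs the penalty for the single upward change. An evolutionary algorithm in the spirit of the one described here would search over such state sequences instead of enumerating them by dynamic programming.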
Notes
The number of generations required to reach the optimal fit is about ten times that required with the strategy finally adopted.
The number of generations required to reach the optimal fit is about twenty times that required with the strategy finally adopted.
terrorism.
terrorist attack.
References
Araujo L (2004) Symbiosis of evolutionary techniques and statistical natural language processing. IEEE Trans Evol Comput 8(1):14–27
Araujo L, Merelo JJ (2006) Automatic detection of trends in dynamical text: an evolutionary approach. http://www.citebase.org/abstract?id=oai:arXiV.org:cs/0601047
Araujo L, Cuesta JA, Merelo JJ (2006) Genetic algorithm for burst detection and activity tracking in event streams. In: Runarsson TP, Beyer HG, Burke E, Guervós JJM, Bullinaria LDWA, Rowe J, Yao X (eds) Proceedings PPSN IX, no. 4193. Lecture notes in computer science, LNCS. Springer, Berlin, pp 453–462
Bingham E, Kabán A, Girolami M (2003) Topic identification in dynamical text by complexity pursuit. Neural Process Lett 17(1):69–83
Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on Automata, Languages, and Programming. http://citeseer.ist.psu.edu/charikar02finding.html
Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin
Elwalid AI, Mitra D (1993) Effective bandwidth of general Markovian traffic sources and admission control of high speed networks. IEEE/ACM Trans Netw 1(3):329–343
Forney GD (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278
Galvão RK, Becerra VM, Abou-Seada M (2004) Ratio selection for classification models. Data Min Knowl Discov 8(2):151–170. doi:10.1023/B:DAMI.0000015913.38787.b3
Girolami M, Kaban A (2004) Simplicial mixtures of Markov chains: distributed modelling of dynamic user profiles. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems 16. MIT Press, Cambridge
Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison Wesley, Reading
Gollapudi S, Sivakumar D (2004) Framework and algorithms for trend analysis in massive temporal data sets. In: CIKM’04: Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM Press, New York, pp 168–177. doi:10.1145/1031171.1031208
Gruhl D, Guha R, Kumar R, Novak J, Tomkins A (2005) The predictive power of online chatter. In: KDD’05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM Press, New York, pp 78–87. doi:10.1145/1081870.1081883. http://portal.acm.org/citation.cfm?id=1081883
Hsu WH, Welge M, Redman T, Clutter D (2002) High-performance commercial data mining: a multistrategy machine learning application. Data Min Knowl Discov 6(4):361–391
Ihler A, Hutchins J, Smyth P (2006) Adaptive event detection with time-varying poisson processes. In: KDD’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 207–216. doi:10.1145/1150402.1150428
Kleinberg JM (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7(4):373–397
Kleinberg J (2006) Temporal dynamics of on-line information streams. In: Garofalakis M, Gehrke J, Rastogi R (eds) Data stream management: processing high-speed data streams. Springer, Berlin. http://www.cs.cornell.edu/home/kleinber/stream-survey04.pdf
Kumar R, Novak J, Raghavan P, Tomkins A (2004) Structure and evolution of blogspace. Commun ACM 47(12):35–39. doi:10.1145/1035134.1035162
Michalewicz Z, Fogel DB (2004) How to solve it: modern heuristics, 2nd edn. Revised and extended edn. Springer, Berlin. ISBN:3-540-22494-7
Muthukrishnan S (2003) Data streams: algorithms and applications. In: SODA’03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, p 413. Extended version available at http://infolab.usc.edu/csci599/Fall2003/Data thms
Rabiner LR (1990) A tutorial on hidden Markov models and selected applications in speech recognition. In: Readings in speech recognition. Morgan Kaufmann Publishers Inc., Menlo Park, pp 267–296
Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 424–433. doi:10.1145/1150402.1150450
Yi J (2005) Detecting buzz from time-sequenced document streams. In: e-Technology, e-Commerce and e-Service, 2005. EEE ’05. Proceedings. The 2005 IEEE International Conference on, pp 347–352. http://ieeexplore.ieee.org/iel5/9634/30444/01402320.pdf
Acknowledgments
This work has been supported by the Spanish MICYT projects TIN2007-68083-C02-01 and TIN2007-67581-C02-01, the Junta de Andalucia CICE project P06-TIC-02025 and the Granada University PIUGR 9/11/06 project. We are also very grateful to the anonymous reviewers, who greatly contributed to the improvement of this paper and suggested new lines of research.
Cite this article
Araujo, L., Merelo, J.J. Automatic detection of trends in time-stamped sequences: an evolutionary approach. Soft Comput 14, 211–227 (2010). https://doi.org/10.1007/s00500-008-0395-8
DOI: https://doi.org/10.1007/s00500-008-0395-8