Soft Computing, 14:211

Automatic detection of trends in time-stamped sequences: an evolutionary approach

  • Lourdes Araujo
  • Juan Julián Merelo
Original Paper

Abstract

This paper presents an evolutionary algorithm for modeling the arrival dates in time-stamped data sequences such as newscasts, e-mails, IRC conversations, scientific journal articles or weblog postings. These models are applied to the detection of buzz (i.e. terms that occur with a higher-than-normal frequency) in such sequences, a problem that has attracted considerable interest in the online world as the number of periodic content producers has grown. For this reason we test our system on online sequences of this kind, although it is equally valid for other types of event sequences. The algorithm assigns frequencies (number of events per time unit) to time intervals so as to produce an optimal fit to the data. The optimization is a trade-off between fitting the data accurately and avoiding too many frequency changes, which lets it overcome the noise inherent in these sequences. This process has traditionally been performed with dynamic programming algorithms, which are limited by their memory and efficiency requirements. This limitation becomes a problem for long sequences, and suggests applying alternative search methods that accept some degree of uncertainty in exchange for tractability, such as the evolutionary algorithm proposed in this paper. This algorithm reaches the same solution quality as the classical dynamic programming algorithms, but in a shorter time. We also test different cost functions and propose a new one that yields better fits on real-world data than the one originally proposed by Kleinberg. Finally, several distributions of states for the finite state automata are tested, with the result that a uniform distribution produces much better fits than the geometric distribution also proposed by Kleinberg. We also present a variant of the evolutionary algorithm that quickly fits a sequence extended with new data by taking advantage of the fit already obtained for the original subsequence.
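
To make the trade-off between fit and frequency changes concrete, the following is a minimal Python sketch of the baseline the abstract refers to. It uses the cost function from Kleinberg's burst model as commonly stated (exponential inter-arrival gaps, with a penalty proportional to ln n for moving to a higher state), not the new cost function proposed in this paper, which the abstract does not specify. The function names, the parameters `s`, `gamma` and `max_factor`, and the particular definition of evenly spaced ("uniform") rates are all assumptions made for illustration.

```python
import math

def geometric_rates(gaps, k, s=2.0):
    """Kleinberg-style state rates: base rate n/T scaled by s**i."""
    base = len(gaps) / sum(gaps)
    return [base * s ** i for i in range(k)]

def uniform_rates(gaps, k, max_factor=8.0):
    """Hypothetical 'uniform' variant: rates evenly spaced between
    the base rate and max_factor times the base rate (an assumption)."""
    base = len(gaps) / sum(gaps)
    if k == 1:
        return [base]
    step = base * (max_factor - 1.0) / (k - 1)
    return [base + i * step for i in range(k)]

def sequence_cost(gaps, states, rates, gamma=1.0):
    """Cost of assigning one state (frequency level) to each gap:
    a fit term plus a penalty for every move to a higher state."""
    ln_n = math.log(len(gaps))
    cost, prev = 0.0, 0  # the automaton is assumed to start in state 0
    for x, q in zip(gaps, states):
        cost += max(q - prev, 0) * gamma * ln_n      # transition penalty
        cost += rates[q] * x - math.log(rates[q])    # -ln(rate * exp(-rate * x))
        prev = q
    return cost

def viterbi_fit(gaps, rates, gamma=1.0):
    """Dynamic programming baseline: minimum-cost state sequence.
    It stores one back-pointer per (gap, state) pair, which is the
    memory requirement that grows with sequence length."""
    k = len(rates)
    ln_n = math.log(len(gaps))
    best = [0.0] + [math.inf] * (k - 1)  # only state 0 is reachable at the start
    back = []
    for x in gaps:
        new_best, pointers = [], []
        for j in range(k):
            candidates = [best[i] + max(j - i, 0) * gamma * ln_n for i in range(k)]
            i_best = min(range(k), key=candidates.__getitem__)
            new_best.append(candidates[i_best] + rates[j] * x - math.log(rates[j]))
            pointers.append(i_best)
        best = new_best
        back.append(pointers)
    q = min(range(k), key=best.__getitem__)
    total, path = best[q], [q]
    for pointers in reversed(back[1:]):  # walk the back-pointers to the first gap
        q = pointers[q]
        path.append(q)
    path.reverse()
    return path, total

gaps = [10, 10, 10, 1, 1, 1, 1, 10, 10]  # made-up inter-arrival times
rates = geometric_rates(gaps, k=3)
path, total = viterbi_fit(gaps, rates, gamma=1.0)
print(path, round(total, 3))
```

An evolutionary search in this setting could use `sequence_cost` as its fitness function over candidate state sequences, with `viterbi_fit` serving only as the reference optimum for comparison.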

Keywords

  • Evolutionary algorithms
  • Event tracking
  • Data time-stamped sequences
  • Burst detection

Notes

Acknowledgments

This work has been supported by the Spanish MICYT projects TIN2007-68083-C02-01 and TIN2007-67581-C02-01, the Junta de Andalucia CICE project P06-TIC-02025 and the Granada University PIUGR 9/11/06 project. We are also very grateful to the anonymous reviewers, who greatly contributed to the improvement of this paper and suggested new lines of research.

References

  1. Araujo L (2004) Symbiosis of evolutionary techniques and statistical natural language processing. IEEE Trans Evol Comput 8(1):14–27
  2. Araujo L, Merelo JJ (2006) Automatic detection of trends in dynamical text: an evolutionary approach. http://www.citebase.org/abstract?id=oai:arXiV.org:cs/0601047
  3. Araujo L, Cuesta JA, Merelo JJ (2006) Genetic algorithm for burst detection and activity tracking in event streams. In: Runarsson TP, Beyer HG, Burke E, Guervós JJM, Bullinaria LDWA, Rowe J, Yao X (eds) Proceedings PPSN IX, no. 4193. Lecture notes in computer science, LNCS. Springer, Berlin, pp 453–462
  4. Bingham E, Kabán A, Girolami M (2003) Topic identification in dynamical text by complexity pursuit. Neural Process Lett 17(1):69–83
  5. Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on Automata, Languages, and Programming, 2002. http://citeseer.ist.psu.edu/charikar02finding.html
  6. Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin
  7. Elwalid AI, Mitra D (1993) Effective bandwidth of general Markovian traffic sources and admission control of high speed networks. IEEE/ACM Trans Netw 1(3):329–343
  8. Forney GD (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278
  9. Galvão RK, Becerra VM, Abou-Seada M (2004) Ratio selection for classification models. Data Min Knowl Discov 8(2):151–170. doi: 10.1023/B:DAMI.0000015913.38787.b3
  10. Girolami M, Kaban A (2004) Simplicial mixtures of Markov chains: distributed modelling of dynamic user profiles. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems 16. MIT Press, Cambridge
  11. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison Wesley, Reading
  12. Gollapudi S, Sivakumar D (2004) Framework and algorithms for trend analysis in massive temporal data sets. In: CIKM’04: Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM Press, New York, pp 168–177. doi: 10.1145/1031171.1031208
  13. Gruhl D, Guha R, Kumar R, Novak J, Tomkins A (2005) The predictive power of online chatter. In: KDD’05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM Press, New York, pp 78–87. doi: 10.1145/1081870.1081883. http://portal.acm.org/citation.cfm?id=1081883
  14. Hsu WH, Welge M, Redman T, Clutter D (2002) High-performance commercial data mining: a multistrategy machine learning application. Data Min Knowl Discov 6(4):361–391
  15. Ihler A, Hutchins J, Smyth P (2006) Adaptive event detection with time-varying poisson processes. In: KDD’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 207–216. doi: 10.1145/1150402.1150428
  16. Kleinberg JM (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7(4):373–397
  17. Kleinberg J (2006) Temporal dynamics of on-line information streams. In: Garofalakis M, Gehrke J, Rastogi R (eds) Data stream management: processing high-speed data streams. Springer, Berlin. http://www.cs.cornell.edu/home/kleinber/stream-survey04.pdf
  18. Kumar R, Novak J, Raghavan P, Tomkins A (2004) Structure and evolution of blogspace. Commun ACM 47(12):35–39. doi: 10.1145/1035134.1035162
  19. Michalewicz Z, Fogel DB (2004) How to solve it: modern heuristics, 2nd edn. Revised and extended edn. Springer, Berlin. ISBN:3-540-22494-7
  20. Muthukrishnan S (2003) Data streams: algorithms and applications. In: SODA’03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 413–413. Extended version available at http://infolab.usc.edu/csci599/Fall2003/Data thms
  21. Rabiner LR (1990) A tutorial on hidden Markov models and selected applications in speech recognition. In: Readings in speech recognition. Morgan Kaufmann Publishers Inc., Menlo Park, pp 267–296
  22. Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 424–433. doi: 10.1145/1150402.1150450
  23. Yi J (2005) Detecting buzz from time-sequenced document streams. In: e-Technology, e-Commerce and e-Service, 2005. EEE ’05. Proceedings. The 2005 IEEE International Conference on, pp 347–352. http://ieeexplore.ieee.org/iel5/9634/30444/01402320.pdf

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia, Madrid, Spain
  2. Departamento de Arquitectura y Tecnología de Computadores, Universidad de Granada, Granada, Spain