Automatic detection of trends in time-stamped sequences: an evolutionary approach

  • Original Paper
  • Published in: Soft Computing

Abstract

This paper presents an evolutionary algorithm for modeling the arrival dates in time-stamped data sequences such as newscasts, e-mails, IRC conversations, scientific journal articles or weblog postings. These models are applied to the detection of buzz (i.e. terms that occur with a higher-than-normal frequency) in such sequences, a problem that has attracted considerable interest in the online world as the number of periodic content producers grows. For this reason we test our system on online sequences of this kind, although it is equally valid for other types of event sequences. The algorithm assigns frequencies (number of events per time unit) to time intervals so as to produce an optimal fit to the data. The optimization is a trade-off between fitting the data accurately and avoiding too many frequency changes, which makes the fit robust to the noise inherent in these sequences. This process has traditionally been performed with dynamic programming algorithms, which are limited by their memory and efficiency requirements. This limitation becomes a problem for long sequences and suggests applying alternative search methods that accept some degree of uncertainty in exchange for tractability, such as the evolutionary algorithm proposed here. The algorithm reaches the same solution quality as the classical dynamic programming algorithms, but in less time. We also test different cost functions and propose a new one that yields better fits on real-world data than the one originally proposed by Kleinberg. Finally, we test several distributions of states for the finite state automaton and find that a uniform distribution produces much better fits than the geometric distribution also proposed by Kleinberg. In addition, we present a variant of the evolutionary algorithm that quickly fits a sequence extended with new data by taking advantage of the fit obtained for the original subsequence.
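
The trade-off described above can be illustrated with a minimal Python sketch. It assumes the cost function of Kleinberg (2003): exponentially distributed gaps between arrivals, a geometric distribution of state rates, and a penalty proportional to ln n for each upward state change. The search loop is a simplified mutation-only scheme written for illustration; it is not the evolutionary algorithm or the new cost function proposed in the paper, and the names `geometric_rates`, `cost` and `evolve`, as well as all parameter values, are hypothetical.

```python
import math
import random

def geometric_rates(gaps, k=4, s=2.0):
    """Kleinberg-style rates: base rate n/T, scaled by s**i for state i."""
    base = len(gaps) / sum(gaps)
    return [base * s ** i for i in range(k)]

def cost(states, gaps, rates, gamma=1.0):
    """Fit cost (negative log-likelihood of exponential gaps) plus a
    penalty for every upward state change; lower is better."""
    n = len(gaps)
    total, prev = 0.0, 0  # the automaton is assumed to start in state 0
    for q, x in zip(states, gaps):
        total += rates[q] * x - math.log(rates[q])  # -ln(a * exp(-a * x))
        if q > prev:
            total += (q - prev) * gamma * math.log(n)
        prev = q
    return total

def evolve(gaps, rates, gamma=1.0, generations=2000, seed=0):
    """Illustrative (1+1)-style search, not the paper's operators: set a
    block of consecutive gaps to one random state and keep the child
    only if it does not increase the cost."""
    rng = random.Random(seed)
    n, k = len(gaps), len(rates)
    best = [0] * n
    best_cost = cost(best, gaps, rates, gamma)
    for _ in range(generations):
        child = best[:]
        start = rng.randrange(n)
        end = min(n, start + rng.randint(1, max(1, n // 4)))
        child[start:end] = [rng.randrange(k)] * (end - start)
        c = cost(child, gaps, rates, gamma)
        if c <= best_cost:
            best, best_cost = child, c
    return best, best_cost

# Toy data: slow arrivals, a burst of short gaps, then slow arrivals again.
gaps = [10.0] * 10 + [0.5] * 10 + [10.0] * 10
rates = geometric_rates(gaps)
states, fitted_cost = evolve(gaps, rates)
baseline_cost = cost([0] * len(gaps), gaps, rates)
```

Here `states` assigns one automaton state to each gap, and `fitted_cost` can be compared with `baseline_cost`, the cost of staying in the lowest state throughout. Because `cost` takes the list of rates as an argument, a different distribution of states can be tried by passing another list.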

Notes

  1. The number of generations required to reach the optimal fit is about ten times that required by the strategy finally adopted.

  2. The number of generations required to reach the optimal fit is about twenty times that required by the strategy finally adopted.

  3. http://www.blogalia.com.

  4. terrorism.

  5. terrorist attack.

References

  • Araujo L (2004) Symbiosis of evolutionary techniques and statistical natural language processing. IEEE Trans Evol Comput 8(1):14–27

  • Araujo L, Merelo JJ (2006) Automatic detection of trends in dynamical text: an evolutionary approach. http://www.citebase.org/abstract?id=oai:arXiV.org:cs/0601047

  • Araujo L, Cuesta JA, Merelo JJ (2006) Genetic algorithm for burst detection and activity tracking in event streams. In: Runarsson TP, Beyer HG, Burke E, Guervós JJM, Bullinaria LDWA, Rowe J, Yao X (eds) Proceedings PPSN IX, no. 4193. Lecture notes in computer science, LNCS. Springer, Berlin, pp 453–462

  • Bingham E, Kabán A, Girolami M (2003) Topic identification in dynamical text by complexity pursuit. Neural Process Lett 17(1):69–83

  • Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on Automata, Languages, and Programming. http://citeseer.ist.psu.edu/charikar02finding.html

  • Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin

  • Elwalid AI, Mitra D (1993) Effective bandwidth of general Markovian traffic sources and admission control of high speed networks. IEEE/ACM Trans Netw 1(3):329–343

  • Forney GD (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278

  • Galvão RK, Becerra VM, Abou-Seada M (2004) Ratio selection for classification models. Data Min Knowl Discov 8(2):151–170. doi:10.1023/B:DAMI.0000015913.38787.b3

  • Girolami M, Kaban A (2004) Simplicial mixtures of Markov chains: distributed modelling of dynamic user profiles. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems 16. MIT Press, Cambridge

  • Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading

  • Gollapudi S, Sivakumar D (2004) Framework and algorithms for trend analysis in massive temporal data sets. In: CIKM’04: Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM Press, New York, pp 168–177. doi:10.1145/1031171.1031208

  • Gruhl D, Guha R, Kumar R, Novak J, Tomkins A (2005) The predictive power of online chatter. In: KDD’05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM Press, New York, pp 78–87. doi:10.1145/1081870.1081883. http://portal.acm.org/citation.cfm?id=1081883

  • Hsu WH, Welge M, Redman T, Clutter D (2002) High-performance commercial data mining: a multistrategy machine learning application. Data Min Knowl Discov 6(4):361–391

  • Ihler A, Hutchins J, Smyth P (2006) Adaptive event detection with time-varying poisson processes. In: KDD’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 207–216. doi:10.1145/1150402.1150428

  • Kleinberg JM (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7(4):373–397

  • Kleinberg J (2006) Temporal dynamics of on-line information streams. In: Garofalakis M, Gehrke J, Rastogi R (eds) Data stream management: processing high-speed data streams. Springer, Berlin. http://www.cs.cornell.edu/home/kleinber/stream-survey04.pdf

  • Kumar R, Novak J, Raghavan P, Tomkins A (2004) Structure and evolution of blogspace. Commun ACM 47(12):35–39. doi:10.1145/1035134.1035162

  • Michalewicz Z, Fogel DB (2004) How to solve it: modern heuristics, 2nd edn. Revised and extended edn. Springer, Berlin. ISBN:3-540-22494-7

  • Muthukrishnan S (2003) Data streams: algorithms and applications. In: SODA’03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 413–413. Extended version available at http://infolab.usc.edu/csci599/Fall2003/Data thms

  • Rabiner LR (1990) A tutorial on hidden Markov models and selected applications in speech recognition. In: Readings in speech recognition. Morgan Kaufmann Publishers Inc., Menlo Park, pp 267–296

  • Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 424–433. doi:10.1145/1150402.1150450

  • Yi J (2005) Detecting buzz from time-sequenced document streams. In: e-Technology, e-Commerce and e-Service, 2005. EEE ’05. Proceedings. The 2005 IEEE International Conference on, pp 347–352. http://ieeexplore.ieee.org/iel5/9634/30444/01402320.pdf

Acknowledgments

This work has been supported by the Spanish MICYT projects TIN2007-68083-C02-01 and TIN2007-67581-C02-01, the Junta de Andalucia CICE project P06-TIC-02025 and the Granada University PIUGR 9/11/06 project. We are also very grateful to the anonymous reviewers, who greatly contributed to the improvement of this paper and suggested new lines of research.

Author information

Corresponding author

Correspondence to Lourdes Araujo.

About this article

Cite this article

Araujo, L., Merelo, J.J. Automatic detection of trends in time-stamped sequences: an evolutionary approach. Soft Comput 14, 211–227 (2010). https://doi.org/10.1007/s00500-008-0395-8

  • DOI: https://doi.org/10.1007/s00500-008-0395-8
