Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays

Applied Intelligence

Abstract

The suffix array is a powerful data structure used mainly for pattern detection in strings. The main disadvantage of a full suffix array is its quadratic, O(n²), space requirement when the actual suffixes must be stored. In our previous work [39], we introduced the All Repeated Patterns Detection (ARPaD) algorithm and the Moving Longest Expected Repeated Pattern (MLERP) process. The former detects all repeated patterns in a string using a partition of the full suffix array; the latter can analyze large strings regardless of their size. Furthermore, the notion of the Longest Expected Repeated Pattern (LERP), also introduced by the authors in previous work, reduces the space needed for the full suffix array to linear, O(n). Until now, however, the LERP value had to be specified in an ad hoc manner, based on experimental or empirical values. To overcome this problem, this paper proves the Probabilistic Existence of LERP theorem and introduces a formula for an accurate upper-bound estimate of the LERP value that uses only the length of the string and the size of the alphabet from which the string is constructed. The importance of this method is that it provides an optimal upper bound on the LERP value without any preprocessing or prior knowledge of the string's characteristics. Moreover, a new data structure, the LERP Reduced Suffix Array, is defined; it is a variation of the suffix array that allows classification and parallelism to be implemented directly on the data structure.

All other methodologies have to deal with the very common problem of fitting a data structure into computer memory or on disk before time-efficient pattern detection methods can be applied. The proposed methodology instead restates the problem so that smaller classes of it can be distributed across different systems and processed with current state-of-the-art techniques, such as parallelism and cloud computing, using advanced DBMSs capable of storing and analyzing big data. The methodology is implemented by invoking our ARPaD algorithm. Extensive experiments were conducted on small, comparable strings of the Champernowne constant and DNA, as well as on extremely large strings of π with lengths of up to 68 billion digits. Furthermore, the novelty and superiority of the methodology were also tested on a real-life application, a Distributed Denial of Service (DDoS) attack early warning system.
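The ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ARPaD implementation: the function names are hypothetical, the per-class scan is a naive stand-in for ARPaD, and the `lerp` value is picked by hand rather than derived from the paper's upper-bound formula (which the abstract does not state). The sketch truncates every suffix to at most `lerp` characters, so the table stores at most n·LERP characters instead of the n(n+1)/2 characters of all full suffixes. It then splits the rows into classes by first character and scans each class independently, with a thread pool standing in for distribution across separate systems.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def lerp_reduced_suffix_array(text, lerp):
    """Sorted (truncated suffix, start position) pairs.

    Each suffix is cut to at most `lerp` characters (assumes lerp >= 1),
    so the table holds at most len(text) * lerp characters.
    """
    return sorted((text[i:i + lerp], i) for i in range(len(text)))


def classify(rows):
    """Split the rows into classes keyed by the first character."""
    classes = defaultdict(list)
    for suffix, pos in rows:
        classes[suffix[0]].append((suffix, pos))
    return classes


def repeated_in_class(rows):
    """Naive stand-in for ARPaD: every prefix that occurs at least twice."""
    occurrences = defaultdict(list)
    for suffix, pos in rows:
        for k in range(1, len(suffix) + 1):
            occurrences[suffix[:k]].append(pos)
    return {p: sorted(v) for p, v in occurrences.items() if len(v) > 1}


def detect_repeated_patterns(text, lerp):
    """Run the per-class scan on a thread pool and merge the results."""
    classes = classify(lerp_reduced_suffix_array(text, lerp))
    found = {}
    with ThreadPoolExecutor() as pool:
        # Classes never share a pattern (different first characters),
        # so the partial results can be merged without conflicts.
        for partial in pool.map(repeated_in_class, classes.values()):
            found.update(partial)
    return found


patterns = detect_repeated_patterns("abcabcab", lerp=3)
print(patterns["abc"])      # [0, 3]
print(patterns["ab"])       # [0, 3, 6]
print("abca" in patterns)   # False
```

The last line shows why the choice of LERP matters: with `lerp=3` the repeated pattern "abca" is missed because it is longer than the truncation length, so the value must be an upper bound on the longest repeated pattern actually present in the string.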

References

  1. Apostolico A, Preparata FP (1983) Optimal off-line detection of repetitions in a string. Theor Comput Sci 22:297–315

  2. Apostolico A, Szpankowski W (1992) Self-alignment in words and their applications. J Algorithms 13(3):446–467

  3. Borel E (1909) Les probabilités dénombrables et leurs applications arithmétiques. Rend Circ Mat Palermo 27:247–271

  4. Bailey DH, Crandall RE (2001) On the random character of fundamental constant expansions. Exp Math 10(2):175–190

  5. Bailey DH, Crandall RE (2002) Random generators and normal numbers. Exp Math 11(4):527–546

  6. Bailey DH, Borwein JM, Calude CS, Dinneen MJ, Dumitrescu M, Yee A (2012) An empirical approach to the normality of π. Exp Math 21(4):375–384

  7. Becher V (2012) Turing’s normal numbers: towards randomness. In: Cooper BS, Dawar A, Löwe B (eds) How the world computes: lecture notes in computer science, vol 7318. Springer, pp 35–45

  8. Calude C (1994) Borel normality and algorithmic randomness. In: Rozenberg G, Salomaa A (eds) Developments in language theory. World Scientific, Singapore, pp 113–129

  9. Calude C (1995) What is a random string? J Univ Sci 1(1):48–66

  10. Chaitin GJ (1988) Randomness in arithmetic. Sci Am 259(1):80–85

  11. Champernowne D (1933) The construction of decimals normal in the scale of ten. J London Math Soc 8:254–260

  12. Church A (1940) On the concept of a random sequence. Bull Amer Math Soc 46(2):130–135

  13. Copeland AH, Erdős P (1946) Note on normal numbers. Bull Amer Math Soc 52:857–860

  14. Dasgupta A (2011) Mathematical foundations of randomness. In: Gabbay DM, Thagard P, Woods J (eds) Philosophy of statistics. North Holland, Saint Louis, pp 641–710

  15. Davenport H, Erdős P (1952) Note on normal decimals. Canad J Math 4:58–63

  16. Devroye L, Szpankowski W, Rais B (1992) A note on the height of suffix trees. SIAM J Comput 21(1):48–53

  17. Franek F, Smyth WF, Tang Y (2003) Computing all repeats using suffix arrays. J Autom Lang Comb 8(4):579–591

  18. Gog S, Moffat A, Culpepper S, Turpin A, Wirth A (2013) Large-scale pattern search using reduced-space on-disk suffix arrays. arXiv:1303.6481v1

  19. Guo D, Hu X, Xie F, Wu X (2013) Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph. Appl Intell 39:57–74

  20. Hardy GH, Wright EM (1960) An introduction to the theory of numbers, 4th edn. Oxford University Press

  21. Kärkkäinen J, Sanders P, Burkhardt S (2006) Linear work suffix array construction. J ACM (JACM) 53(6):918–936

  22. Karlin S, Ghandour G, Ost F, Tavaré S, Korn L (1983) New approaches for computer analysis of nucleic acid sequences. Proc Natl Acad Sci USA 80:5660–5664

  23. Khoshnevisan D (2006) Normal numbers are normal. Clay Mathematics Institute Annual Report 15(2006):27–31

  24. Ko P, Aluru S (2003) Space efficient linear time construction of suffix arrays. In: Proceedings of the 14th annual conference on Combinatorial pattern matching, pp 200–210

  25. Long CT (1957) Note on normal numbers. Pac J Math 7(2):1163–1165

  26. Manber U, Myers G (1990) Suffix arrays: a new method for on-line string searches. In: Proceedings of the first annual ACM-SIAM symposium on discrete algorithms, pp 319–327

  27. Niven I, Zuckerman H (1951) On the definition of normal numbers. Pac J Math 1(1):103–109

  28. Orlandi A, Venturini R (2011) Space-efficient substring occurrence estimation. In: Proceedings of the 30th principles of database systems PODS, pp 95–106

  29. Phoophakdee B, Zaki M (2007) Genome-scale disk-based suffix tree indexing. In: Proceedings of international conference on management of data SIGMOD ’07, pp 833–844

  30. Puglisi SJ, Smyth WF, Yusufu M (2008) Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of PSC, pp 161–169

  31. Schürmann KB, Stoye J (2005) An incomplex algorithm for fast suffix array construction. In: Proceedings of the 7th workshop on algorithm engineering and experiments and the 2nd workshop on analytic algorithmics and combinatorics (ALENEX/ANALCO 2005), pp 77–85

  32. Sinha R, Moffat A, Puglisi S, Turpin A (2008) Improving suffix array locality for fast pattern matching on disk. In: Proceedings of international conference on management of data SIGMOD ’08, pp 661–672

  33. Wagon S (1985) Is Pi normal? Math Intell 7(3):65–67

  34. Weiner P (1973) Linear pattern matching algorithms. In: SWAT ’73 proceedings of the 14th annual symposium on switching and automata theory (SWAT 1973), pp 1–11

  35. Wu Y, Wang L, Ren J, Ding W, Wu X (2014) Mining sequential patterns with periodic wildcards. Appl Intell 41:99–116

  36. Xylogiannopoulos K, Karampelas P, Alhajj R (2012) Periodicity data mining in time series using suffix arrays. In: Proceedings of IEEE intelligent systems IS’12, pp 172–181

  37. Xylogiannopoulos K, Karampelas P, Alhajj R (2012) Minimization of suffix array’s storage capacity for periodicity detection in time series. In: Proceedings of IEEE international conference in tools with artificial intelligence

  38. Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Early DDoS detection based on data mining techniques. In: Proceedings of 8th workshop in information security theory and practice (WISTP), pp 190–199

  39. Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Analyzing very large time series using suffix arrays. Appl Intell 41(3):941–955

  40. Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Experimental analysis on the normality of π, e, φ, sqrt(2) using advanced data-mining techniques. Exp Math 23(2):105–128

  41. Yee A (2013) Y-cruncher – a multi-threaded Pi-program [Online]. Available: http://www.numberworld.org/y-cruncher/

  42. UCLA, (2006, Feb 26). http://www.lasr.cs.ucla.edu/ddos/traces/public/attacktrace2/udp/

Author information

Corresponding author

Correspondence to Reda Alhajj.

About this article

Cite this article

Xylogiannopoulos, K.F., Karampelas, P. & Alhajj, R. Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays. Appl Intell 45, 567–597 (2016). https://doi.org/10.1007/s10489-016-0766-2

  • DOI: https://doi.org/10.1007/s10489-016-0766-2
