Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays

Xylogiannopoulos, Konstantinos F.; Karampelas, Panagiotis; Alhajj, Reda

doi:10.1007/s10489-016-0766-2

Published: 05 April 2016

Volume 45, pages 567–597, (2016)

Applied Intelligence

Konstantinos F. Xylogiannopoulos¹,
Panagiotis Karampelas² &
Reda Alhajj¹

Abstract

The suffix array is a powerful data structure used mainly for pattern detection in strings. The main disadvantage of a full suffix array is its quadratic O(n²) space requirement when the actual suffixes must be stored. In our previous work [39], we introduced the All Repeated Patterns Detection (ARPaD) algorithm and the Moving Longest Expected Repeated Pattern (MLERP) process. The former detects all repeated patterns in a string using a partition of the full suffix array, and the latter can analyze large strings regardless of their size. Furthermore, the notion of the Longest Expected Repeated Pattern (LERP), also introduced by the authors in previous work, reduces the space required for the full suffix array to linear O(n). So far, however, the LERP value has had to be specified in an ad hoc manner, based on experimental or empirical values. To overcome this problem, this paper proves the Probabilistic Existence of LERP theorem and introduces a formula for an accurate upper-bound estimate of the LERP value that uses only the length of the string and the size of the alphabet from which the string is constructed. The importance of this method is that it gives an optimum upper bound on the LERP value without any preprocessing or prior knowledge of the string's characteristics. Moreover, a new data structure, the LERP Reduced Suffix Array, is defined; it is a variation of the suffix array that allows classification and parallelism to be implemented directly on the data structure. Alternative methodologies all deal with the very common problem of fitting a data structure into memory or on disk in order to apply time-efficient pattern detection methods. The proposed methodology instead recasts the problem so that smaller classes of it can be distributed across different systems, where current state-of-the-art techniques such as parallelism and cloud computing can be applied using advanced DBMSs capable of storing and analyzing big data. The methodology is implemented by invoking the ARPaD algorithm. Extensive experiments have been conducted on small, comparable strings of the Champernowne constant and DNA, as well as on extremely large strings of π with lengths of up to 68 billion digits. Furthermore, the novelty and advantages of the methodology have also been tested on a real-life application, a Distributed Denial of Service (DDoS) attack early warning system.

References

1. Apostolico A, Preparata FP (1983) Optimal off-line detection of repetitions in a string. Theor Comput Sci 22:297–315
2. Apostolico A, Szpankowski W (1992) Self-alignment in words and their applications. J Algorithms 13(3):446–467
3. Borel E (1909) Les probabilités dénombrables et leurs applications arithmétiques. Rend Circ Mat Palermo 27:247–271
4. Bailey DH, Crandall RE (2001) On the random character of fundamental constant expansions. Exp Math 10(2):175–190
5. Bailey DH, Crandall RE (2002) Random generators and normal numbers. Exp Math 11(4):527–546
6. Bailey DH, Borwein JM, Calude CS, Dinneen MJ, Dumitrescu M, Yee A (2012) An empirical approach to the normality of π. Exp Math 21(4):375–384
7. Becher V (2012) Turing’s normal numbers: towards randomness. In: Cooper BS, Dawar A, Löwe B (eds) How the world computes: lecture notes in computer science, vol 7318. Springer, pp 35–45
8. Calude C (1994) Borel normality and algorithmic randomness. In: Rozenberg G, Salomaa A (eds) Development in language theory. World Scientific, Singapore, pp 113–129
9. Calude C (1995) What is a random string? J Univ Sci 1(1):48–66
10. Chaitin GJ (1988) Randomness in arithmetic. Sci Am 259(1):80–85
11. Champernowne D (1933) The construction of decimals normal in the scale of ten. J London Math Soc 8:254–260
12. Church A (1940) On the concept of a random sequence. Bull Amer Math Soc 46(2):130–135
13. Copeland AH, Erdos P (1946) Note on normal numbers. Bull Amer Math Soc 52:857–860
14. Dasgupta A (2011) Mathematical foundations of randomness. In: Gabbay DM, Thagard P, Woods J (eds) Philosophy of statistics. North Holland, Saint Louis, pp 641–710
15. Davenport H, Erdos P (1952) Note on normal decimals. Canad J Math 4:58–63
16. Devroye L, Szpankowski W, Rais B (1992) A note on the height of suffix trees. SIAM J Comput 21(1):48–53
17. Franek F, Smyth WF, Tang Y (2003) Computing all repeats using suffix arrays. J Autom Lang Comb 8(4):579–591
18. Gog S, Moffat A, Culpepper S, Turpin A, Wirth A (2013) Large-scale pattern search using reduced-space on-disk suffix arrays. arXiv:1303.6481v1
19. Guo D, Hu X, Xie F, Wu X (2013) Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph. Appl Intell 39:57–74
20. Hardy GH, Wright EM (1960) An introduction to the theory of numbers, 4th edn. Oxford University Press
21. Karkkainen J, Sanders P, Burkhardt S (2006) Linear work suffix array construction. J ACM (JACM) 53(6):918–936
22. Karlin S, Ghandour G, Ost F, Tavaré S, Korn L (1983) New approaches for computer analysis of nucleic acid sequences. Proc Natl Acad Sci USA 80:5660–5664
23. Khoshnevisan D (2006) Normal numbers are normal. Clay Mathematics Institute Annual Report 15(2006):27–31
24. Ko P, Aluru S (2003) Space efficient linear time construction of suffix arrays. In: Proceedings of the 14th annual conference on combinatorial pattern matching, pp 200–210
25. Long CT (1957) Note on normal numbers. Pac J Math 7(2):1163–1165
26. Manber U, Myers G (1990) Suffix arrays: a new method for on-line string searches. In: Proceedings of the first annual ACM-SIAM symposium on discrete algorithms, pp 319–327
27. Niven I, Zuckerman H (1951) On the definition of normal numbers. Pac J Math 1(1):103–109
28. Orlandi A, Venturini R (2011) Space-efficient substring occurrence estimation. In: Proceedings of the 30th principles of database systems PODS, pp 95–106
29. Phoophakdee B, Zaki M (2007) Genome-scale disk-based suffix tree indexing. In: Proceedings of international conference on management of data SIGMOD ’07, pp 833–844
30. Puglisi SJ, Smyth WF, Yusufu M (2008) Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of PSC, pp 161–169
31. Schürmann KB, Stoye J (2005) An incomplex algorithm for fast suffix array construction. In: Proceedings of the 7th workshop on algorithm engineering and experiments and the 2nd workshop on analytic algorithmics and combinatorics (ALENEX/ANALCO 2005), pp 77–85
32. Sinha R, Moffat A, Puglisi S, Turpin A (2008) Improving suffix array locality for fast pattern matching on disk. In: Proceedings of international conference on management of data SIGMOD ’08, pp 661–672
33. Wagon S (1985) Is Pi normal? Math Intell 7(3):65–67
34. Weiner P (1973) Linear pattern matching algorithms. In: SWAT ’73 proceedings of the 14th annual symposium on switching and automata theory (swat 1973), pp 1–11
35. Wu Y, Wang L, Ren J, Ding W, Wu X (2014) Mining sequential patterns with periodic wildcards. Appl Intell 41:99–116
36. Xylogiannopoulos K, Karampelas P, Alhajj R (2012) Periodicity data mining in time series using suffix arrays. In: Proceedings of IEEE intelligent systems IS’12, pp 172–181
37. Xylogiannopoulos K, Karampelas P, Alhajj R (2012) Minimization of suffix array’s storage capacity for periodicity detection in time series. In: Proceedings of IEEE international conference in tools with artificial intelligence
38. Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Early DDoS detection based on data mining techniques. In: Proceedings of 8th workshop in information security theory and practice (WISTP), pp 190–199
39. Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Analyzing very large time series using suffix arrays. Appl Intell 41(3):941–955
40. Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Experimental analysis on the normality of π, e, φ, sqrt(2) using advanced data-mining techniques. Exp Math 23(2):105–128
41. Yee A (2013) Y-cruncher – a multi-threaded Pi-program [Online]. Available: http://www.numberworld.org/y-cruncher/
42. UCLA, (2006, Feb 26). http://www.lasr.cs.ucla.edu/ddos/traces/public/attacktrace2/udp/

Author information

Authors and Affiliations

Department of Computer Science, University of Calgary, Calgary, Alberta, Canada
Konstantinos F. Xylogiannopoulos & Reda Alhajj
Department of Informatics and Computers, Hellenic Air Force Academy, Dekelia Air Base, Acharnes, Greece
Panagiotis Karampelas

Corresponding author

Correspondence to Reda Alhajj.

About this article

Cite this article

Xylogiannopoulos, K.F., Karampelas, P. & Alhajj, R. Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays. Appl Intell 45, 567–597 (2016). https://doi.org/10.1007/s10489-016-0766-2

Issue Date: October 2016
