Knowledge and Information Systems, Volume 41, Issue 2, pp 499–530

Finding peculiar compositions of two frequent strings with background texts

Regular Paper

Abstract

We consider mining unusual patterns from a set \(T\) of target texts. A typical method reports a pattern as unusual if its observed frequency is far from the expectation estimated under an assumed probabilistic model. Such a method, however, has difficulty handling zero frequencies and therefore suffers from data sparseness. We instead employ a second set \(B\) of background texts and define a composition \(xy\) to be peculiar if both \(x\) and \(y\) are more frequent in \(B\) than in \(T\) while, conversely, \(xy\) is more frequent in \(T\) than in \(B\). Such an \(xy\) is unusual because \(x\) and \(y\) are infrequent in \(T\), yet \(xy\) is unexpectedly frequent in \(T\) compared with \(xy\) in \(B\). To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm based on the suffix tree and show that it scales almost linearly under practical parameter settings. Experiments on DNA sequences show that the peculiar compositions found appear mostly in rRNA, whereas patterns found by an existing method do not seem to relate to specific biological functions. We also show that the discovered patterns have lengths close to the length at which the frequency distribution of fixed-length substrings begins to skew. This fact explains why our method can find long peculiar compositions.
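The definition of a peculiar composition can be illustrated with a small sketch. The code below is a naive check by direct substring counting, not the suffix-tree algorithm developed in the paper. The function names, the normalization of counts by total text length, and the use of strict inequalities are assumptions made for illustration only; the abstract does not specify how "more frequent" is measured.

```python
def count_occurrences(pattern: str, texts: list[str]) -> int:
    """Count occurrences of pattern (overlaps included) across all texts."""
    total = 0
    for text in texts:
        start = text.find(pattern)
        while start != -1:
            total += 1
            start = text.find(pattern, start + 1)
    return total


def relative_frequency(pattern: str, texts: list[str]) -> float:
    """Occurrences per character of text (an assumed normalization)."""
    total_length = sum(len(text) for text in texts)
    if total_length == 0:
        return 0.0
    return count_occurrences(pattern, texts) / total_length


def is_peculiar(x: str, y: str, target: list[str], background: list[str]) -> bool:
    """Return True if x and y are each more frequent in the background
    than in the target, while the composition xy is more frequent in
    the target than in the background."""
    xy = x + y
    x_rarer_in_target = relative_frequency(x, background) > relative_frequency(x, target)
    y_rarer_in_target = relative_frequency(y, background) > relative_frequency(y, target)
    xy_commoner_in_target = relative_frequency(xy, target) > relative_frequency(xy, background)
    return x_rarer_in_target and y_rarer_in_target and xy_commoner_in_target


if __name__ == "__main__":
    T = ["ccabcc"]  # hypothetical target texts
    B = ["acbacb"]  # hypothetical background texts
    print(is_peculiar("a", "b", T, B))
    print(is_peculiar("b", "a", T, B))
```

In this toy input, "a" and "b" each occur twice in the background but only once in the target, while "ab" occurs only in the target, so the pair satisfies the three conditions. Checking every candidate pair this way would be far too slow for real data, which is the motivation for a suffix-tree approach.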

Keywords

Algorithms, Exceptional pattern mining, Text mining, Bioinformatics, Genetic maps, Suffix trees

Notes

Acknowledgments

This research was partially supported by KAKENHI Grant Nos. 21300053, 25280085, 19700150, 21650031, and 24300059, and by the Strategic International Cooperative Program funded by the Japan Science and Technology Agency (JST). This paper is a substantially extended version of a conference paper [16]. In addition to the genetic maps of peculiar compositions, which first appeared in [15], this paper adds genetic maps of patterns found by other criteria.

References

  1. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, pp 207–216
  2. Andrade MA, Valencia A (1998) Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7):600–607. doi: 10.1093/bioinformatics/14.7.600
  3. Apostolico A, Bock ME, Lonardi S, Xu X (2000) Efficient detection of unusual words. J Comput Biol 7(1/2):71–94
  4. Apostolico A, Pizzi C (2008) Scoring unusual words with varying mismatch errors. Math Comput Sci 1(4):639–653
  5. Arimura H, Shimozono S (1998) Maximizing agreement with a classification by bounded or unbounded number of associated words. In: Proceedings of the 9th international symposium on algorithms and computation. Lecture Notes Artif Intell 1533:39–48
  6. Beißbarth T, Speed TP (2004) GOstat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics 20(9):1464–1465
  7. Berry MW (ed) (2003) Survey of text mining: clustering, classification, and retrieval. Springer, Berlin
  8. Chan PK, Stolfo SJ (1998) Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Fourth international conference on knowledge discovery and data mining, pp 164–168
  9. Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Advances in knowledge discovery and data mining. AAAI/MIT Press, Menlo Park
  10. Gomez JC, Boiy E, Moens M-F (2012) Highly discriminative statistical features for Email classification. Knowl Inform Syst 31(1):23–53
  11. Gusfield D (1997) Algorithms on strings, trees, and sequences. Cambridge University Press, New York
  12. Horng J-T, Huang H-D, Huang S-L, Yang U-C, Chang Y-C (2002) Mining putative regulatory elements in promoter regions of Saccharomyces cerevisiae. In Silico Biol 2(3):263–273
  13. Huang H-D, Chang H-L, Tsou T-S, Liu B-J, Kao C-Y, Horng J-T (2003) A data mining method to predict transcriptional regulatory sites based on differentially expressed genes in human genome. J Info Sci Eng 19(6):923–942
  14. Ikeda D (1999) Characteristic sets of strings common to semi-structured documents. In: Proceedings of the second international conference on discovery science. Lecture Notes Artif Intell 1721:139–147
  15. Ikeda D, Maruyama O, Kuhara S (2013) Infrequent, unexpected, and contrast pattern discovery from bacterial genomes by genome-wide comparative analysis. In: Proceedings of the 4th international conference on bioinformatics models, methods and algorithms, pp 308–311
  16. Ikeda D, Suzuki E (2009) Mining peculiar compositions of frequent substrings from sparse text data using background texts. In: Proceedings of ECML PKDD, pp 596–611
  17. Jagadish HV, Ng RT, Srivastava D (1999) Substring selectivity estimation. In: Proceedings of the eighteenth symposium on principles of database systems, pp 249–260
  18. Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inform Syst 11(3):259–286
  19. Keogh E, Lin J, Lee S-H, Herle HV (2006) Finding the most unusual time series subsequence: algorithms and applications. Knowl Inform Syst 11(1):1–27
  20. Knorr EM, Ng RT (1999) Finding intensional knowledge of distance-based outliers. In: 25th international conference on very large data bases, pp 211–222
  21. Leung M-Y, Marsh GM, Speed TP (1996) Over- and underrepresentation of short DNA words in herpesvirus genomes. J Comput Biol 3(3):345–360
  22. Marschall T, Rahmann S (2009) Efficient exact motif discovery. Bioinformatics 25(12):i356–i364
  23. McCreight EM (1976) A space-economical suffix tree construction algorithm. J ACM 23(2):262–272
  24. Padmanabhan B, Tuzhilin A (2000) Small is beautiful: discovering the minimal set of unexpected patterns. In: Sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 54–63
  25. Parida L (2007) Pattern discovery in bioinformatics: theory & algorithms. Chapman & Hall/CRC, Marin
  26. Pham D-S, Saha B, Phung DQ, Venkatesh S (2013) Detection of cross-channel anomalies. Knowl Inform Syst 35(1):33–59
  27. Sarawagi S, Agrawal R, Megiddo N (1998) Discovery-driven exploration of OLAP data cubes. In: EDBT 1998. LNCS vol 1377, pp 168–182
  28. Schbath S (1997) An efficient statistic to detect over- and under-represented words in DNA sequences. J Comput Biol 4(2):189–192
  29. Suzuki E (1997) Autonomous discovery of reliable exception rules. In: Third international conference on knowledge discovery and data mining, pp 259–262
  30. Suzuki E (2002) Undirected discovery of interesting exception rules. Int J Patt Recog Artif Intell 16(8):1065–1086
  31. Suzuki E, Shimura M (1996) Exceptional knowledge discovery in databases based on information theory. In: Second international conference on knowledge discovery and data mining, pp 275–278
  32. Suzuki E, Tsumoto S (2000) Evaluating hypothesis-driven exception-rule discovery with medical data sets. In: PAKDD 2000. LNAI vol 1805, Springer, Berlin, pp 208–211
  33. Uemura T, Ikeda D, Arimura H (2008) Unsupervised spam detection by document complexity estimation. In: Proceedings of the 11th international conference on discovery science. Lecture Notes Artif Intell 5255:319–331
  34. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
  35. Wang J, Han J, Pei J (2003) CLOSET+: searching for the best strategies for mining frequent closed itemsets. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, pp 236–245
  36. Yan X, Han J, Afshar R (2003) CloSpan: mining closed sequential patterns in large databases. In: Proceedings of the 4th SIAM international conference on data mining

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  1. Department of Informatics, Kyushu University, Fukuoka, Japan