
Extracting significant pattern histories from timestamped texts using MapReduce

The Journal of Supercomputing

Abstract

This paper shows that, for trend analysis in text mining, one can tag texts with timestamps and then observe the frequency distributions of patterns over equally spaced time intervals to predict trends. Observing the frequency distributions (histories) of significant patterns plays an important role for trend analysts. To make the extraction of these frequency distributions scalable to huge collections of timestamped texts spanning long time periods, this paper proposes a novel approach based on the Hadoop MapReduce programming model that improves on our previous external-memory approach, reducing the computation time from several days to several hours. The history of a significant pattern is the frequency distribution of that pattern over equally spaced time intervals; a significant pattern is a maximal repeat of consecutive words within the texts. Note that a significant pattern can be as long as a sentence if that sentence appears twice. To demonstrate the contribution of this study, the experimental data comprised the titles and abstracts (12 GB in total) of 14,473,242 articles published from 1990 to 2014 (25 years), downloaded from PubMed, a well-known web site for biomedical literature. Experimental results show that the computation time can be reduced from days to hours using six computing nodes within one personal computer cluster. Notably, these pattern histories, spanning more than two decades, not only provide clues for analyzing trend variations within these articles, but also have the potential to reveal changes in article writing that may be valuable to linguists who engage in corpus analysis in the future.


Notes

  1. http://www.ncbi.nlm.nih.gov/pubmed/.

  2. http://www.ncbi.nlm.nih.gov/pubmed/.

  3. (PubMedPubDate PubStatus = “pubmed”).

  4. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html.

  5. http://tm.asia.edu.tw/TM/Search_PubMed_Simple.php.

  6. Based on this paper, the author filed a US provisional patent application (US 62/301,681), entitled “METHOD FOR EXTRACTING MAXIMAL REPEAT PATTERNS AND COMPUTING FREQUENCY DISTRIBUTION TABLES”, on 2016/3/1.

References

  1. Gusfield D (1997) Algorithms on strings, trees, and sequences : computer science and computational biology. Cambridge University Press, Cambridge


  2. Wang J-D (2006) External memory approach to compute the maximal repeats across classes from DNA sequences. Asian J Health Inf Sci 1(2):276–295


  3. Wang J-D (2011) A novel approach to compute pattern history for trend analysis. In: The 8th international conference on fuzzy systems and knowledge discovery, pp 1796–1800

  4. Lin J, Dyer C (2010) Data-intensive text processing with MapReduce. Morgan & Claypool Publishers

  5. White T (2012) Hadoop: the definitive guide (3rd edn), definitive guide series, O’Reilly Media. http://books.google.com.tw/books?id=Nff49D7vnJcC

  6. Witten IH, Frank E (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Elsevier, Amsterdam


  7. Zhang Z, Zhang R (2008) Multimedia data mining: a systematic introduction to concepts and theory, 1st edn. Chapman & Hall/CRC, London


  8. Berry MW, Kogan J (2010) Text mining: applications and theory. Wiley, New York


  9. Srivastava A, Sahami M (2009) Text mining: classification, clustering, and applications. Chapman & Hall/CRC, London


  10. Kao A, Poteet SR (2006) Natural language processing and text mining. Springer, Berlin


  11. Feldman R (2006) Text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, New York, NY


  12. Manu K (2006) Text mining application programming. Charles River Media

  13. Bilisoly R (2008) Practical text mining with Perl. Wiley, Amsterdam


  14. Mei Q, Zhai C (2005) Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery in data mining, KDD ’05, ACM, New York, NY, USA, pp 198–207. doi:10.1145/1081870.1081895

  15. Shaik Z, Garla S, Chakraborty G (2012) SAS since 1976: an application of text mining to reveal trends. In: SAS Global Forum 2012: data mining and text analytics, pp 1–10

  16. Conlon SJ, Simmons LL (2013) Mining IT business texts to analyze technology trends. To Know Press, pp S5_125–125. http://EconPapers.repec.org/RePEc:tkp:tiim13:s5_125-125

  17. Luo D, Yang J, Krstajic M, Ribarsky W, Keim D (2012) EventRiver: visually exploring text collections with temporal references. IEEE Trans Vis Comput Graph 18(1):93–105. doi:10.1109/TVCG.2010.225


  18. Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: Proceedings of the 4th ACM international conference on web search and data mining, WSDM ’11, ACM, New York, NY, USA, pp 177–186. doi:10.1145/1935826.1935863

  19. Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948


  20. Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discr Algorithm 2(1):53–86. doi:10.1016/S1570-8667(03)00065-0


  21. Shrestha AMS, Frith MC, Horton P (2014) A bioinformatician’s guide to the forefront of suffix array construction algorithms. Brief Bioinform 15(2):138–154. doi:10.1093/bib/bbt081


  22. Chien L-F (1997) PAT-tree-based keyword extraction for Chinese information retrieval. SIGIR Forum 31(SI):50–58. doi:10.1145/278459.258534


  23. Ferragina P, Grossi R (1999) The string B-tree: a new data structure for string search in external memory and its application. J ACM 46(2):236–280


  24. Kulekci MO, Vitter JS, Xu B (2012) Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree. IEEE/ACM Trans Comput Biol Bioinform 9(2):421–429. doi:10.1109/TCBB.2011.127


  25. Lam C (2010) Hadoop in action, 1st edn. Manning Publications Co., Greenwich, CT


  26. Li F, Ooi BC, Özsu MT, Wu S (2014) Distributed data management using mapreduce. ACM Comput Surv 46(3):31:1–31:42. doi:10.1145/2503009


  27. McCreadie R, Macdonald C, Ounis I (2012) Mapreduce indexing strategies: studying scalability and efficiency. Inf Process Manag 48(5):873–888, large-scale and distributed systems for information retrieval. doi:10.1016/j.ipm.2010.12.003. http://www.sciencedirect.com/science/article/pii/S0306457310001044

  28. Qin L, Yu JX, Chang L, Cheng H, Zhang C, Lin X (2014) Scalable big graph processing in mapreduce. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD ’14, ACM, New York, NY, USA, pp 827–838. doi:10.1145/2588555.2593661

  29. Zhang X, Yang L, Liu C, Chen J (2014) A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans Parallel Distrib Syst 25(2):363–373. doi:10.1109/TPDS.2013.48


  30. Tapiador D, OMullane W, Brown A, Luri X, Huedo E, Osuna P (2014) A framework for building hypercubes using mapreduce. Comput Phys Commun 185(5):1429–1438. doi:10.1016/j.cpc.2014.02.010. http://www.sciencedirect.com/science/article/pii/S0010465514000423

  31. Hsu C-H, Slagter KD, Chung Y-C (2015) Locality and loading aware virtual machine mapping techniques for optimizing communications in mapreduce applications. Fut Gener Comput Syst 53:43–54


  32. Slagter K, Hsu C-H, Chung Y-C, Zhang D (2013) An improved partitioning mechanism for optimizing massive data analysis using mapreduce. J Supercomput 66(1):539–555


  33. Slagter KD, Hsu C-H, Chung Y-C (2015) An adaptive and memory efficient sampling mechanism for partitioning in mapreduce. Int J Parallel Prog 43(3):489–507


  34. Wang J-D, Tsay J-J (2002) Mining periodic events from retrospective Chinese news. Int J Comput Process Orient Lang Special Issue “Web WAP Orient Lang Multimed Comput” 15(4):361–377


  35. Mount DW (2004) Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor Laboratory Press, New York


  36. Cao H, Phinney M, Petersohn D, Merideth B, Shyu C (2016) Mining large-scale repetitive sequences in a mapreduce setting. Int J Data Mining Bioinf (IJDMB) 14(3):210–228. doi:10.1504/IJDMB.2016.074873


  37. Tan YS, Tan J, Chng ES, Lee B-S, Li J, Date S, Chak HP, Xiao X, Narishige A (2013) Hadoop framework: impact of data organization on performance. Softw: Pract Exp 43(11):1241–1260. doi:10.1002/spe.1082



Acknowledgments

Special thanks go to Mr. Wang Yao-Tsung for valuable suggestions about the usage of Hadoop, and to Heri Wijayanto and Yan-Tang Chen for collecting the experimental results.

Author information


Correspondence to Jing-Doo Wang.


Appendix: Extraction of candidate significant pattern histories

The history of one significant pattern is a series of frequency distributions of that pattern over consecutive, equally spaced time intervals. The processes of extracting candidate significant patterns, together with the statistics of their tags, are derived from the previous work in [2], which was character-based. The main idea underlying the extraction of candidate significant pattern histories is to scan sorted word-suffixes that are tagged with timestamps and, for each pair of adjacent word-suffixes, to extract their longest common prefix words, if any exist, as a candidate significant pattern. Meanwhile, a stack is used to accumulate the history of each such pattern via push/pop operations.
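
As an illustration of this step only (not the paper's pseudocode or its MapReduce implementation), the following minimal Python sketch computes the longest common prefix, counted in whole words, of two adjacent timestamped word-suffixes; the function name, the sample suffixes, and the timestamp format are hypothetical.

def common_prefix_words(suffix_a, suffix_b):
    """Return the longest common prefix, in whole words, of two word-suffixes
    given as lists of words."""
    prefix = []
    for word_a, word_b in zip(suffix_a, suffix_b):
        if word_a != word_b:
            break
        prefix.append(word_a)
    return prefix

# Two hypothetical adjacent (lexicographically sorted) word-suffixes, each tagged
# with the timestamp of the text it came from.
suffix_1 = (["gene", "expression", "profile", "analysis"], "2012-3")
suffix_2 = (["gene", "expression", "profiling"], "2013-1")

# The candidate significant pattern contributed by this adjacent pair.
print(common_prefix_words(suffix_1[0], suffix_2[0]))  # ['gene', 'expression']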

Fig. 12

The pseudocode for extracting candidate significant pattern histories

Figure 12 presents the pseudocode for extracting candidate significant patterns and their corresponding histories. The input consists of consecutive, sorted word-suffixes labeled with timestamps, and the output consists of the histories of candidate significant patterns with right- or left-boundary verification. To describe the extraction process in detail, the key terms used in the pseudocode are defined as follows (an illustrative sketch of these structures in Python follows the list):

  • ParseSuffix: a subroutine to extract word-suffixes and their timestamps from one input line.

  • Record: an object for storing information about one candidate significant pattern.

    • Length: the number of words within a candidate significant pattern.

    • AddHistory: a method for adding the history of one pattern to that of a candidate significant pattern.

    • SubHistory: a method for subtracting the history of one pattern from that of a candidate significant pattern.

    • Boundary: a timestamp for the pattern whose history is computed twice.

  • PatternStack: a stack for keeping track of the histories of one pattern and its sub-patterns.

  • CommonPrefixWordsAndHistory: a method that returns a record with the longest common prefix words of two adjacent word-suffixes and their histories.

  • TopStackRecord: the record on the top of the stack PatternStack.
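
To make these terms concrete, the following Python sketch shows one possible rendering of the Record object and PatternStack, assuming that a history is represented as a Counter keyed by timestamp; the field and method names mirror the terms above, but the representation is an assumption for illustration, not the paper's implementation.

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Record:
    # One candidate significant pattern together with its accumulated history.
    pattern: list                                        # common prefix words of two adjacent word-suffixes
    history: Counter = field(default_factory=Counter)    # timestamp -> frequency
    boundary: str = None                                 # Boundary: a timestamp whose count was included twice

    @property
    def length(self):                                    # Length: number of words in the pattern
        return len(self.pattern)

    def add_history(self, other):                        # AddHistory: accumulate another pattern's history
        self.history.update(other)

    def sub_history(self, other):                        # SubHistory: remove counts, e.g. a double-counted boundary
        self.history.subtract(other)

# PatternStack is a plain list used with append()/pop();
# TopStackRecord corresponds to pattern_stack[-1].
pattern_stack = []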

A simple example of scanning sorted word-suffixes is presented below to illustrate the extraction process. Figure 13, in which each of \(W_\alpha \), \(W_\beta \), and so on represents one word, shows nine sorted word-suffixes, each with an attached timestamp such as “2012–3”. For simplicity, only the common prefix words between two adjacent word-suffixes are shown; after the last sorted word-suffix, one extra line (the tenth line) containing an empty string is added to force any patterns remaining in PatternStack to be popped out. Table 6 presents the histories of the four candidate significant patterns that are extracted from Fig. 13.

Fig. 13

Nine sorted word-suffixes

Table 6 The histories of four candidate significant patterns that are extracted from Fig. 13
Fig. 14

Push the pattern “\(W_\alpha \)” when at Line 2

Fig. 15

Push the “\(W_\alpha W_\beta W_\gamma \)” pattern when at Line 4

Fig. 16

Push the “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” pattern when at Line 7

To describe the extraction process more precisely, the temporary contents of PatternStack are presented below as the sorted word-suffixes of Fig. 13, together with the extra tenth line, are scanned. First, as shown in Fig. 14, scanning line 2 found the pattern “\(W_\alpha \)”, the longest common prefix word of the first and second lines; this pattern was pushed onto PatternStack because its length (\({=}1\)) was greater than that (\({=}-1\)) of the NullStr in the current top record of PatternStack. This action corresponds to line (S7) of the pseudocode in Fig. 12. Similarly, as shown in Figs. 15 and 16, scanning lines 4 and 7 revealed the two patterns “\(W_\alpha W_\beta W_\gamma \)” and “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)”, which were pushed onto PatternStack in turn.

However, as shown in Fig. 17, scanning line 8 revealed that the length (\({=}2\)) of the pattern “\(W_\alpha W_\beta \)” was less than that (\({=}6\)) of the pattern “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” in the top record of PatternStack. This situation met the condition at line (S12) and caused the record containing the pattern “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” and its history to be popped. The history of “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” was accumulated into that of the new top record, for the pattern “\(W_\alpha W_\beta W_\gamma \)”, because the latter is a sub-pattern of the former; the boundary history marked “2012–1” then had to be removed from the history of the pattern on the top of PatternStack, because that boundary history had been counted twice at line 6. A further pop operation then output the pattern “\(W_\alpha W_\beta W_\gamma \)” with its history, which was accumulated into that of the freshly generated pattern “\(W_\alpha W_\beta \)”; that record, with its history, was pushed onto PatternStack after removing the overlapping boundary of “\(W_\alpha W_\beta \)” at line 7. These operations correspond to lines (S12) to (S29) of the pseudocode in Fig. 12. Finally, when line 10 (the extra line) was scanned, the two remaining patterns, “\(W_\alpha \)” and “\(W_\alpha W_\beta \)”, and their histories were popped from PatternStack, as shown in Fig. 18.
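
For completeness, the following self-contained Python sketch reproduces the overall scan just described on a single machine. It is not the paper's MapReduce implementation and does not follow the pseudocode of Fig. 12 step by step: instead of adding and then subtracting a double-counted boundary, it attributes each suffix's timestamp exactly once as records are popped, and it omits the left-boundary (left-maximality) verification. The function names and the toy input are hypothetical.

from collections import Counter

def extract_candidate_histories(sorted_suffixes):
    """sorted_suffixes: a list of (word_list, timestamp) pairs in lexicographic order.
    Yields (pattern_words, history) for every candidate significant pattern, where
    history is a Counter mapping timestamps to frequencies."""
    # Each stack entry is [pattern_length, pattern_words, history]; the sentinel of
    # length 0 plays the role of NullStr, so the stack is never empty.
    stack = [[0, [], Counter()]]

    def flush(limit, carried):
        # Pop every record longer than `limit`, emitting it after it absorbs the
        # timestamps carried up from the suffixes it covers.
        while stack[-1][0] > limit:
            _length, pattern, history = stack.pop()
            history.update(carried)
            yield pattern, history
            carried = history                    # counts flow down to shorter prefixes
        if stack[-1][0] == limit:
            stack[-1][2].update(carried)         # same prefix length: merge the histories
        else:
            stack.append([limit, None, Counter(carried)])  # open a longer record; words set by the caller

    for i in range(1, len(sorted_suffixes)):
        prev_words, prev_ts = sorted_suffixes[i - 1]
        words, _ = sorted_suffixes[i]
        # Longest common prefix, in words, of the two adjacent word-suffixes.
        lcp = 0
        while lcp < len(prev_words) and lcp < len(words) and prev_words[lcp] == words[lcp]:
            lcp += 1
        yield from flush(lcp, Counter([prev_ts]))
        if stack[-1][1] is None:                 # a record was just opened: fill in its words
            stack[-1][1] = prev_words[:lcp]

    # A final, virtual empty suffix forces everything left on the stack to pop,
    # just as the extra tenth line does in Fig. 13.
    if sorted_suffixes:
        yield from flush(0, Counter([sorted_suffixes[-1][1]]))


# A tiny hypothetical run (not the data of Fig. 13).
suffixes = sorted([
    (["cell", "cycle", "arrest"], "2012-1"),
    (["cell", "cycle", "arrest"], "2013-2"),
    (["cell", "division"], "2012-3"),
])
for pattern, history in extract_candidate_histories(suffixes):
    print(" ".join(pattern), dict(history))
# Prints "cell cycle arrest" with counts at 2012-1 and 2013-2 (it appears twice),
# and "cell" with counts at all three timestamps.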

Fig. 17

Line 8(1): pop the “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” and its history

Fig. 18

Line 8(2): pop the “\(W_\alpha W_\beta W_\gamma \)” and its history


Cite this article

Wang, JD. Extracting significant pattern histories from timestamped texts using MapReduce. J Supercomput 72, 3236–3260 (2016). https://doi.org/10.1007/s11227-016-1713-z
