Abstract
This paper provides valuable clues for trend analysis in text mining: texts can be tagged with timestamps, and the frequency distributions of patterns over equally spaced time intervals can then be observed to predict trends. Observing the frequency distributions (histories) of significant patterns plays an important role for trend analysts. To make the extraction of these frequency distributions from huge amounts of timestamped text over long time periods scalable, this paper proposes a novel approach based on the Hadoop MapReduce programming model that improves on our previous external-memory approach, reducing the computation time from several days to several hours. The history of a significant pattern is the frequency distribution of that pattern over equally spaced time intervals; a significant pattern is a maximal repeat of consecutive words within texts. Note that a significant pattern can be as long as a whole sentence if that sentence appears twice. To solidify the contribution of this study, the experimental resources comprised the titles and abstracts (12 GB in total) of 14,473,242 articles from 1990 to 2014 (25 years), downloaded from PubMed, a well-known website for biomedical literature. Experimental results show that the computation time can be reduced from days to hours using six computing nodes within one personal-computer cluster. Notably, these pattern histories, over two decades in length, not only provide clues that can be analyzed for trend variations within these articles, but also have the potential to reveal revolutions in article writing that might be valuable to linguists who engage in corpus analysis.
Notes
(PubMedPubDate PubStatus = “pubmed”).
Based on this paper, the author filed a US provisional patent application (US 62/301,681) entitled “METHOD FOR EXTRACTING MAXIMAL REPEAT PATTERNS AND COMPUTING FREQUENCY DISTRIBUTION TABLES” on 2016/3/1.
References
Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge
Wang J-D (2006) External memory approach to compute the maximal repeats across classes from DNA sequences. Asian J Health Inf Sci 1(2):276–295
Wang J-D (2011) A novel approach to compute pattern history for trend analysis. In: The 8th international conference on fuzzy systems and knowledge discovery, pp 1796–1800
Lin J, Dyer C (2010) Data-intensive text processing with MapReduce. Morgan & Claypool Publishers
White T (2012) Hadoop: the definitive guide (3rd edn), definitive guide series, O’Reilly Media. http://books.google.com.tw/books?id=Nff49D7vnJcC
Witten IH, Frank E (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Elsevier, Amsterdam
Zhang Z, Zhang R (2008) Multimedia data mining: a systematic introduction to concepts and theory, 1st edn. Chapman & Hall/CRC, London
Berry MW, Kogan J (2010) Text mining: applications and theory. Wiley, New York
Srivastava A, Sahami M (2009) Text mining: classification, clustering, and applications. Chapman & Hall/CRC, London
Kao A, Poteet SR (2006) Natural language processing and text mining. Springer, Berlin
Feldman R (2006) Text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, New York, NY
Konchady M (2006) Text mining application programming. Charles River Media
Bilisoly R (2008) Practical text mining with Perl. Wiley, Amsterdam
Mei Q, Zhai C (2005) Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery in data mining, KDD ’05, ACM, New York, NY, USA, pp 198–207. doi:10.1145/1081870.1081895
Shaik Z, Garla S, Chakraborty G (2012) SAS since 1976: an application of text mining to reveal trends. In: SAS Global Forum 2012: data mining and text analytics, pp 1–10
Conlon SJ, Simmons LL (2013) Mining IT business texts to analyze technology trends. ToKnowPress, pp S5_125–125. http://EconPapers.repec.org/RePEc:tkp:tiim13:s5_125-125
Luo D, Yang J, Krstajic M, Ribarsky W, Keim D (2012) EventRiver: visually exploring text collections with temporal references. IEEE Trans Vis Comput Graph 18(1):93–105. doi:10.1109/TVCG.2010.225
Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: Proceedings of the 4th ACM international conference on web search and data mining, WSDM ’11, ACM, New York, NY, USA, pp 177–186. doi:10.1145/1935826.1935863
Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948
Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms 2(1):53–86. doi:10.1016/S1570-8667(03)00065-0
Shrestha AMS, Frith MC, Horton P (2014) A bioinformatician’s guide to the forefront of suffix array construction algorithms. Brief Bioinform 15(2):138–154. doi:10.1093/bib/bbt081
Chien L-F (1997) PAT-tree-based keyword extraction for Chinese information retrieval. SIGIR Forum 31(SI):50–58. doi:10.1145/278459.258534
Ferragina P, Grossi R (1999) The string B-tree: a new data structure for string search in external memory and its application. J ACM 46(2):236–280
Kulekci MO, Vitter JS, Xu B (2012) Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree. IEEE/ACM Trans Comput Biol Bioinform 9(2):421–429. doi:10.1109/TCBB.2011.127
Lam C (2010) Hadoop in action, 1st edn. Manning Publications Co., Greenwich, CT
Li F, Ooi BC, Özsu MT, Wu S (2014) Distributed data management using MapReduce. ACM Comput Surv 46(3):31:1–31:42. doi:10.1145/2503009
McCreadie R, Macdonald C, Ounis I (2012) MapReduce indexing strategies: studying scalability and efficiency. Inf Process Manag 48(5):873–888 (large-scale and distributed systems for information retrieval). doi:10.1016/j.ipm.2010.12.003. http://www.sciencedirect.com/science/article/pii/S0306457310001044
Qin L, Yu JX, Chang L, Cheng H, Zhang C, Lin X (2014) Scalable big graph processing in MapReduce. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD ’14, ACM, New York, NY, USA, pp 827–838. doi:10.1145/2588555.2593661
Zhang X, Yang L, Liu C, Chen J (2014) A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans Parallel Distrib Syst 25(2):363–373. doi:10.1109/TPDS.2013.48
Tapiador D, O’Mullane W, Brown A, Luri X, Huedo E, Osuna P (2014) A framework for building hypercubes using MapReduce. Comput Phys Commun 185(5):1429–1438. doi:10.1016/j.cpc.2014.02.010. http://www.sciencedirect.com/science/article/pii/S0010465514000423
Hsu C-H, Slagter KD, Chung Y-C (2015) Locality and loading aware virtual machine mapping techniques for optimizing communications in MapReduce applications. Future Gener Comput Syst 53:43–54
Slagter K, Hsu C-H, Chung Y-C, Zhang D (2013) An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J Supercomput 66(1):539–555
Slagter KD, Hsu C-H, Chung Y-C (2015) An adaptive and memory efficient sampling mechanism for partitioning in MapReduce. Int J Parallel Prog 43(3):489–507
Wang J-D, Tsay J-J (2002) Mining periodic events from retrospective Chinese news. Int J Comput Process Orient Lang (special issue “Web WAP Oriented Language Multimedia Computing”) 15(4):361–377
Mount DW (2004) Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor Laboratory Press, New York
Cao H, Phinney M, Petersohn D, Merideth B, Shyu C (2016) Mining large-scale repetitive sequences in a MapReduce setting. Int J Data Min Bioinform (IJDMB) 14(3):210–228. doi:10.1504/IJDMB.2016.074873
Tan YS, Tan J, Chng ES, Lee B-S, Li J, Date S, Chak HP, Xiao X, Narishige A (2013) Hadoop framework: impact of data organization on performance. Softw: Pract Exp 43(11):1241–1260. doi:10.1002/spe.1082
Acknowledgments
Special thanks go to Mr. Wang Yao-Tsung for valuable suggestions about the usage of Hadoop, and to Heri Wijayanto and Yan-Tang Chen for collecting experimental results.
Appendix: Extraction of candidate significant pattern histories
The history of one significant pattern is a series of frequency distributions of that pattern over consecutive, equal time intervals. The processes of extracting candidate significant patterns, together with the statistics of their tags, are derived from the previous, character-based work in [2]. The main idea underlying the extraction of candidate significant pattern histories is to scan sorted word-suffixes that are assigned timestamps and to extract the longest common prefix words between two adjacent word-suffixes, if they exist, as candidate significant patterns. Meanwhile, a stack is used to accumulate the history of each pattern via push/pop operations.
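The core of this scan can be sketched as follows. This is an illustrative in-memory reconstruction, not the paper's pseudocode: the names `extract_histories` and `lcp_len` are invented, each suffix's timestamp is carried into its deepest enclosing pattern (so no boundary timestamp is counted twice), and the input is assumed to fit in memory, unlike the external-memory and MapReduce settings of the paper.

```python
from collections import Counter

def lcp_len(a, b):
    """Number of leading words two word tuples share."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def extract_histories(sorted_suffixes):
    """sorted_suffixes: list of (word_tuple, timestamp), sorted by words.
    Returns {pattern: Counter mapping timestamp -> frequency} for every
    repeated prefix pattern (candidate significant pattern)."""
    out = {}
    stack = []  # entries: [pattern_length, Counter of timestamps]
    n = len(sorted_suffixes)
    for i in range(1, n + 1):
        prev_words, prev_ts = sorted_suffixes[i - 1]
        # an empty sentinel suffix after the last line forces final pops
        cur_words = sorted_suffixes[i][0] if i < n else ()
        l = lcp_len(prev_words, cur_words)
        carry = Counter({prev_ts: 1})  # suffix i-1 joins its deepest pattern
        while stack and stack[-1][0] > l:
            length, hist = stack.pop()
            hist.update(carry)
            out[prev_words[:length]] = hist  # pattern closed: emit history
            carry = hist                     # propagate up to the sub-pattern
        if stack and stack[-1][0] == l:
            stack[-1][1].update(carry)       # same pattern continues
        elif l > 0:
            stack.append([l, Counter(carry)])  # a new, longer pattern opens
    return out
```

For example, given the three sorted word-suffixes `("a","b","c")`, `("a","b","c","d")`, and `("a","b","x")` with distinct timestamps, the sketch emits the repeated patterns `("a","b","c")` (frequency 2) and `("a","b")` (frequency 3), each with its per-timestamp counts.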
Figure 12 presents the pseudocode for extracting candidate significant patterns and their corresponding histories. The input contains consecutive, sorted word-suffixes that are labeled with timestamps, and the output contains the histories of candidate significant patterns with right or left boundary verification. To describe the extraction processes in detail, the key terms used in the pseudocode are as follows:
- ParseSuffix: a subroutine to extract word-suffixes and their timestamps within one input line.
- Record: an object for storing information about one candidate significant pattern.
- Length: the number of words within a candidate significant pattern.
- AddHistory: a method for adding the history of one pattern to that of a candidate significant pattern.
- SubHistory: a method for subtracting the history of one pattern from that of a candidate significant pattern.
- Boundary: a timestamp for the pattern whose history is computed twice.
- PatternStack: a stack for keeping track of the histories of one pattern and its sub-patterns.
- CommonPrefixWordsAndHistory: a method that returns a record with the longest common prefix words of two adjacent word-suffixes and their histories.
- TopStackRecord: the record on the top of the stack PatternStack.
A simple example of scanning sorted word-suffixes is presented below to illustrate the extraction process. Figure 13, in which each of \(W_\alpha \), \(W_\beta \) and so on represents one word, includes nine sorted word-suffixes with individually attached timestamps, such as “2012–3”. For simplicity, only the common prefix words between two adjacent word-suffixes are presented; when the last sorted word-suffix is reached, one extra line (the tenth line) is added with an empty string to force any patterns that remain in the PatternStack to pop out. Table 6 presents the histories of the four candidate significant patterns that are extracted according to Fig. 13.
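The sorted, timestamped word-suffixes that such an example scans can be prepared, for small in-memory inputs, along the following lines. This is only a sketch: `build_sorted_suffixes` is an illustrative name, and the paper performs this suffix generation and sorting at scale with MapReduce rather than with an in-memory sort.

```python
def build_sorted_suffixes(records):
    """records: iterable of (timestamp, text) pairs.
    Returns every word-suffix of every text as a (word_tuple, timestamp)
    pair, in lexicographic order -- the input format the scan expects."""
    suffixes = []
    for ts, text in records:
        words = text.split()
        for i in range(len(words)):
            suffixes.append((tuple(words[i:]), ts))
    suffixes.sort(key=lambda s: s[0])  # lexicographic order of word tuples
    return suffixes
```

For instance, the two timestamped texts “a b” and “b a” yield the four sorted word-suffixes `("a",)`, `("a","b")`, `("b",)`, and `("b","a")`, each paired with the timestamp of its source text.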
To understand the extraction processes more precisely, the temporary contents of PatternStack are presented below as the ten sorted word-suffixes are scanned, as shown in Fig. 13. First, as shown in Fig. 14, scanning line 2 found pattern “\(W_\alpha \)”, the longest common prefix word derived from the first and second lines; this pattern was pushed into PatternStack because its length (\({=}1\)) was greater than that (\({=}{-}1\)) of the NullStr in the current top record of PatternStack. This action corresponds to the pseudocode at line (S7) in Fig. 12. Similarly, as shown in Figs. 15 and 16, scanning lines 4 and 7 revealed two patterns, “\(W_\alpha W_\beta W_\gamma \)” and “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)”, which were consecutively pushed into PatternStack. However, as shown in Fig. 17, scanning line 8 revealed that the length (\({=}2\)) of pattern “\(W_\alpha W_\beta \)” was less than that (\({=}6\)) of pattern “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” in the top record of PatternStack. This situation met the conditions of line (S12) and caused the record containing the pattern “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” and its history to be popped. The history of “\(W_\alpha W_\beta W_\gamma W_\delta W_\eta W_\rho \)” was accumulated into that of the top record for pattern “\(W_\alpha W_\beta W_\gamma \)” because the latter is a sub-pattern of the former; the boundary history marked “2012–1” then had to be removed from the history of the pattern on the top of PatternStack because it had been counted twice at line 6.
Furthermore, a pop operation ran once again and output the pattern “\(W_\alpha W_\beta W_\gamma \)” with its history, which was accumulated into that of the freshly generated pattern “\(W_\alpha W_\beta \)”; the record with its history was then pushed onto PatternStack after removing the boundary overlap of “\(W_\alpha W_\beta \)” at line 7. Note that the above operations are associated with the pseudocode from (S12) to (S29) in Fig. 12. When line 10 was scanned, the extra line forced the two patterns, “\(W_\alpha \)” and “\(W_\alpha W_\beta \)”, and their histories to be popped from PatternStack, as shown in Fig. 18.
Cite this article
Wang, JD. Extracting significant pattern histories from timestamped texts using MapReduce. J Supercomput 72, 3236–3260 (2016). https://doi.org/10.1007/s11227-016-1713-z