Abstract
Finding associations between itemsets from two categories (e.g., drugs and adverse effects, genes and diseases) is important in many domains. However, these association mining tasks often involve computation-intensive algorithms and large amounts of data. This paper investigates how to leverage MapReduce to effectively mine the associations between itemsets from two categories in a large set of unstructured data. While existing MapReduce-based association mining algorithms focus on frequent itemset mining (i.e., finding itemsets whose frequencies are higher than a threshold), we propose a MapReduce algorithm that can be used to compute all the interestingness measures defined on the basis of a 2 × 2 contingency table. The algorithm was applied to mine the associations between drugs and diseases from 33,959 full-text biomedical articles on the Amazon Elastic MapReduce (EMR) platform. Experimental results indicate that the proposed algorithm exhibits linear scalability.
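To make the idea of measures "defined on the basis of a 2 × 2 contingency table" concrete, the following is a minimal sketch and not the algorithm from the paper. It simulates a map step and a reduce step in a single Python process rather than on Hadoop or EMR, and it assumes that each document has already been reduced to a set of drug names and a set of disease names. The function names, the key layout, and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import product


def map_document(drugs, diseases):
    """Map step (illustrative): emit (key, 1) for each concept and each drug-disease pair in one document."""
    drug_set, disease_set = set(drugs), set(diseases)
    for drug in drug_set:
        yield ("drug", drug), 1
    for disease in disease_set:
        yield ("disease", disease), 1
    for drug, disease in product(drug_set, disease_set):
        yield ("pair", drug, disease), 1


def reduce_counts(emitted):
    """Reduce step (illustrative): sum the emitted values per key."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


def contingency_table(totals, n_docs, drug, disease):
    """Build the 2 x 2 table as document counts (n11, n10, n01, n00)."""
    n11 = totals[("pair", drug, disease)]      # documents with both
    n10 = totals[("drug", drug)] - n11         # drug only
    n01 = totals[("disease", disease)] - n11   # disease only
    n00 = n_docs - n11 - n10 - n01             # neither
    return n11, n10, n01, n00


def support(n11, n10, n01, n00):
    return n11 / (n11 + n10 + n01 + n00)


def confidence(n11, n10, n01, n00):
    # Assumes the drug occurs in at least one document.
    return n11 / (n11 + n10)


def lift(n11, n10, n01, n00):
    # Assumes both marginal counts are nonzero.
    n = n11 + n10 + n01 + n00
    return n11 * n / ((n11 + n10) * (n11 + n01))


# Hypothetical toy corpus: (drugs mentioned, diseases mentioned) per document.
docs = [
    (["aspirin"], ["headache"]),
    (["aspirin"], ["headache", "fever"]),
    (["ibuprofen"], ["fever"]),
    (["aspirin", "ibuprofen"], []),
]

emitted = (pair for drugs, diseases in docs for pair in map_document(drugs, diseases))
totals = reduce_counts(emitted)
table = contingency_table(totals, len(docs), "aspirin", "headache")

print(table)
print(support(*table), confidence(*table), lift(*table))
```

Once the four cell counts are available, any measure defined on the table can be computed from them without another pass over the documents; support, confidence, and lift are shown here only as familiar examples.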
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Ji, Y., Tian, Y., Shen, F., Tran, J. (2018). Mining Associations Between Two Categories Using Unstructured Text Data in Cloud. In: Latifi, S. (eds) Information Technology - New Generations. Advances in Intelligent Systems and Computing, vol 738. Springer, Cham. https://doi.org/10.1007/978-3-319-77028-4_70
DOI: https://doi.org/10.1007/978-3-319-77028-4_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77027-7
Online ISBN: 978-3-319-77028-4
eBook Packages: Engineering, Engineering (R0)