Mining Associations Between Two Categories Using Unstructured Text Data in Cloud

Ji, Yanqing; Tian, Yun; Shen, Fangyang; Tran, John

doi:10.1007/978-3-319-77028-4_70

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 738)

Abstract

Finding associations between itemsets within two categories (e.g., drugs and adverse effects, genes and diseases) is very important in many domains. However, these association mining tasks often involve computation-intensive algorithms and a large amount of data. This paper investigates how to leverage MapReduce to effectively mine the associations between itemsets within two categories using a large set of unstructured data. While existing MapReduce-based association mining algorithms focus on frequent itemset mining (i.e., finding itemsets whose frequencies are higher than a threshold), we propose a MapReduce algorithm that can be used to compute all the interestingness measures defined on the basis of a 2 × 2 contingency table. The algorithm was applied to mine the associations between drugs and diseases using 33,959 full-text biomedical articles on the Amazon Elastic MapReduce (EMR) platform. Experimental results indicate that the proposed algorithm exhibits linear scalability.
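
To make the 2 × 2 contingency table idea concrete, the following is a minimal single-machine sketch in Python, not the authors' algorithm (the abstract does not give its details). It assumes each document has already been reduced to a set of drug terms and a set of disease terms; the function names, key layout, and toy documents are all hypothetical. A map step emits document-level counts, a reduce step sums them, and the four table cells are derived from the totals. Support, confidence, and lift are shown as three standard measures that depend only on those cells.

```python
from collections import Counter
from itertools import product


def map_document(drugs, diseases):
    """Map step (hypothetical): emit (key, 1) pairs for one document."""
    drugs, diseases = set(drugs), set(diseases)
    yield ("N",), 1  # one document seen
    for drug in drugs:
        yield ("drug", drug), 1
    for disease in diseases:
        yield ("disease", disease), 1
    for drug, disease in product(drugs, diseases):
        yield ("pair", drug, disease), 1


def reduce_counts(emitted):
    """Reduce step (hypothetical): sum the values emitted for each key."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


def contingency_table(totals, drug, disease):
    """Derive the four cells of the 2 x 2 table from document counts."""
    n = totals[("N",)]
    n11 = totals[("pair", drug, disease)]      # drug and disease
    n10 = totals[("drug", drug)] - n11         # drug, no disease
    n01 = totals[("disease", disease)] - n11   # disease, no drug
    n00 = n - n11 - n10 - n01                  # neither
    return n11, n10, n01, n00


def measures(n11, n10, n01, n00):
    """Three standard measures; assumes both marginals are non-zero."""
    n = n11 + n10 + n01 + n00
    support = n11 / n
    confidence = n11 / (n11 + n10)
    lift = confidence / ((n11 + n01) / n)
    return {"support": support, "confidence": confidence, "lift": lift}


# Toy input (invented for illustration): (drug terms, disease terms) per document.
docs = [
    (["aspirin"], ["headache"]),
    (["aspirin"], ["headache", "fever"]),
    (["aspirin"], ["fever"]),
    (["ibuprofen"], ["headache"]),
]

emitted = (pair for doc in docs for pair in map_document(*doc))
totals = reduce_counts(emitted)
table = contingency_table(totals, "aspirin", "headache")
result = measures(*table)
```

Because every measure here is computed from the same four counts, other measures defined on the table could be added to `measures` without another pass over the documents. In a real MapReduce job the map and reduce functions would run on separate workers rather than in one process.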

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Gonzaga University, Spokane, WA, USA
Yanqing Ji
Department of Computer Science, Eastern Washington University, Cheney, WA, USA
Yun Tian
Department of Computer System Technology, New York City College of Technology, Brooklyn, NY, USA
Fangyang Shen
Frontier Behavioral Health, Spokane, WA, USA
John Tran

Corresponding author

Correspondence to Yanqing Ji.

Editor information

Editors and Affiliations

Department of Electrical & Computer Engineering, University of Nevada, Las Vegas, Las Vegas, Nevada, USA
Shahram Latifi

About this paper

Cite this paper

Ji, Y., Tian, Y., Shen, F., Tran, J. (2018). Mining Associations Between Two Categories Using Unstructured Text Data in Cloud. In: Latifi, S. (eds) Information Technology - New Generations. Advances in Intelligent Systems and Computing, vol 738. Springer, Cham. https://doi.org/10.1007/978-3-319-77028-4_70

DOI: https://doi.org/10.1007/978-3-319-77028-4_70
Published: 13 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77027-7
Online ISBN: 978-3-319-77028-4
eBook Packages: Engineering, Engineering (R0)
