Unstructured Data, NoSQL, and Terms Analytics

  • Richard K. Lomotey
  • Ralph Deters
Chapter
Part of the International Series on Computer Entertainment and Media Technology book series (ISCEMT)

Abstract

Today’s high-dimensional data, most of which is unstructured, makes data pattern discovery (a.k.a. data mining) challenging for services engineers. Unstructured data mining departs from previously proposed information extraction methodologies because recently formed and stored data has no standard schema and is heterogeneous. At the storage level, the NoSQL database has been proposed as a preferred technology for accommodating high-dimensional data, and it has received significant enterprise adoption. At the technology level, the query style of NoSQL databases differs from that of schema-based storage such as the RDBMS. Currently, there is a lack of tools, technologies, and methodologies to help the community support data pattern discovery in the big data epoch. Previously, an Analytics-as-a-Service (AaaS) framework was proposed for terms mining in document-based NoSQL systems. In this chapter, we provide a comprehensive view of the performance of several algorithms that have been employed for topics and terms mining tasks. The chapter reproduces several proposed algorithms so that the software engineering community can see what has been done to improve the accuracy of terms mining from document-based NoSQL systems.

Keywords

Bernoulli algorithm; Association rule; Big data; Analytics-as-a-Service (AaaS); Unstructured data; Data mining; Hidden Markov Model; Apriori; Optimistic search; Pessimistic search; Parallel search

Notes

Acknowledgement

• Special thanks to grad students in the MADMUC Lab, University of Saskatchewan.

• Thanks to Prof. Patrick Hung of the IT Security Unit, University of Ontario Institute of Technology.

• Final thanks to the Editors and Reviewers of this chapter for their feedback.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Information Sciences and Technology, The Pennsylvania State University - Beaver, Monaca, USA
  2. Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
