Utilizing Big Data Analytics for Automatic Building of Language-agnostic Semantic Knowledge Bases

  • Chapter
Distributed Computing in Big Data Analytics

Part of the book series: Scalable Computing and Communications (SCC)

Abstract

Most work on building semantic knowledge bases has so far focused either on manually building language-specific taxonomies and ontologies or on automatic techniques, such as clustering or dimensionality reduction, that discover latent semantic links within the content of a given corpus. The former is labor-intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. More recently, ontology learning systems have arisen with the hope of automatically extracting relationships from free-text content. Unfortunately, this approach is also problematic: it can only use relationships found within documents, and it loses substantial meaning encoded in the underlying documents when generating the ontology. We describe a combination of techniques which, taken together, can overcome these problems. First, we show how search logs represent a largely untapped source for discovering latent semantic relationships between phrases, which can be used to build a semantic knowledge base. Second, we show how a semantic knowledge graph of relationships between all terms and concepts can be automatically built and compactly represented to enable traversal and scoring of relationships without losing any of the nuanced meaning embedded in the underlying corpus of documents. We discuss how to use key big data analytics technologies and techniques for mining search logs, as well as textual content, to discover semantic relationships between key phrases in a manner that is language-agnostic, human-understandable, highly scalable, and mostly noise-free.
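To make the search-log idea concrete, here is a minimal sketch of one way related phrases might be scored from a log. It is not the chapter's actual method, which the abstract does not specify. It assumes a hypothetical log format of (user_id, query_phrase) pairs and uses pointwise mutual information over the set of users who searched each phrase. The function name `related_phrases` and the `min_users` threshold are illustrative choices.

```python
import math
from collections import defaultdict
from itertools import combinations


def related_phrases(search_log, min_users=2):
    """Score phrase pairs by PMI over the users who searched for them.

    search_log: iterable of (user_id, query_phrase) pairs (assumed format).
    Returns {(phrase_a, phrase_b): pmi} for pairs that at least
    `min_users` distinct users both searched for.
    """
    # Collect the distinct phrases each user searched for.
    phrases_by_user = defaultdict(set)
    for user_id, phrase in search_log:
        phrases_by_user[user_id].add(phrase.strip().lower())

    total_users = len(phrases_by_user)
    phrase_users = defaultdict(int)  # users who searched a phrase
    pair_users = defaultdict(int)    # users who searched both phrases
    for phrases in phrases_by_user.values():
        for phrase in phrases:
            phrase_users[phrase] += 1
        for a, b in combinations(sorted(phrases), 2):
            pair_users[(a, b)] += 1

    scores = {}
    for (a, b), n_ab in pair_users.items():
        if n_ab < min_users:
            continue  # drop pairs with too little support (noise filter)
        p_ab = n_ab / total_users
        p_a = phrase_users[a] / total_users
        p_b = phrase_users[b] / total_users
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy log: two users link "java" with "j2ee", two link "nurse" with "rn".
log = [
    ("u1", "java"), ("u1", "j2ee"),
    ("u2", "java"), ("u2", "j2ee"),
    ("u3", "java"), ("u3", "nurse"),
    ("u4", "nurse"), ("u4", "rn"),
    ("u5", "nurse"), ("u5", "rn"),
]
scores = related_phrases(log)
```

Because the sketch relies only on which users issued which phrases, it needs no tokenizer, stemmer, or dictionary for any particular language. The support threshold is a crude noise filter: the single user who searched both "java" and "nurse" does not produce a relationship.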

Notes

  1. https://github.com/careerbuilder/semantic-knowledge-graph

  2. https://github.com/OpenSextant/SolrTextTagger

Corresponding author

Correspondence to Khalifeh AlJadda.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

AlJadda, K., Korayem, M., Grainger, T. (2017). Utilizing Big Data Analytics for Automatic Building of Language-agnostic Semantic Knowledge Bases. In: Mazumder, S., Singh Bhadoria, R., Deka, G. (eds) Distributed Computing in Big Data Analytics. Scalable Computing and Communications. Springer, Cham. https://doi.org/10.1007/978-3-319-59834-5_9

  • DOI: https://doi.org/10.1007/978-3-319-59834-5_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59833-8

  • Online ISBN: 978-3-319-59834-5

  • eBook Packages: Computer Science (R0)
