Utilizing Big Data Analytics for Automatic Building of Language-agnostic Semantic Knowledge Bases

AlJadda, Khalifeh; Korayem, Mohammed; Grainger, Trey

doi:10.1007/978-3-319-59834-5_9

Part of the book series: Scalable Computing and Communications (SCC)

Abstract

Most work on building semantic knowledge bases has focused either on manually building language-specific taxonomies and ontologies or on automatic techniques, such as clustering or dimensionality reduction, that discover latent semantic links within the content of a given corpus. The former is labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. More recently, ontology learning systems have arisen with the hope of automatically extracting relationships from free-text content. Unfortunately, this approach is also problematic: it can only use relationships found within documents, and it loses substantial meaning encoded in the underlying documents when generating the ontology. We describe a combination of techniques which, taken together, can overcome these problems. First, we show how search logs represent a largely untapped source for discovering latent semantic relationships between phrases, which can be used to build a semantic knowledge base. Second, we show how a semantic knowledge graph of relationships between all terms and concepts can be automatically built and compactly represented to enable traversal and scoring of relationships without losing any of the nuanced meaning embedded in the underlying corpus of documents. We discuss how to use key big data analytics technologies and techniques for mining search logs, as well as textual content, to discover semantic relationships between key phrases in a manner that is language-agnostic, human-understandable, highly scalable, and mostly noise-free.
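
As a rough illustration of the first idea, mining search logs for related phrases, the sketch below counts how often two queries are issued by the same user and scores each pair with pointwise mutual information (PMI). It is a toy under stated assumptions, not the chapter's method: the `(user_id, query)` log format, the `related_phrases` function, the `min_users` threshold, the sample data, and the choice of PMI as the score are all hypothetical. Apart from lowercasing, nothing in it depends on the language of the queries.

```python
import math
from collections import Counter
from itertools import combinations


def related_phrases(log, min_users=2):
    """Score query pairs by PMI over users (hypothetical helper).

    log: iterable of (user_id, query) pairs -- an assumed log format.
    Returns {(phrase_a, phrase_b): pmi} for pairs shared by >= min_users users.
    """
    # Collect the distinct queries issued by each user.
    queries_by_user = {}
    for user, query in log:
        queries_by_user.setdefault(user, set()).add(query.strip().lower())

    n_users = len(queries_by_user)
    single = Counter()  # number of users who searched for a phrase
    pair = Counter()    # number of users who searched for both phrases
    for queries in queries_by_user.values():
        single.update(queries)
        pair.update(combinations(sorted(queries), 2))

    scores = {}
    for (a, b), n_ab in pair.items():
        if n_ab < min_users:
            continue  # drop pairs seen for too few users (noise filter)
        # PMI = log( P(a, b) / (P(a) * P(b)) ), probabilities taken over users
        scores[(a, b)] = math.log((n_ab * n_users) / (single[a] * single[b]))
    return scores


# Made-up sample log, for illustration only.
sample_log = [
    ("u1", "java"), ("u1", "j2ee"),
    ("u2", "Java"), ("u2", "j2ee"),
    ("u3", "java"), ("u3", "nurse"),
    ("u4", "nurse"), ("u4", "rn"),
    ("u5", "nurse"), ("u5", "rn"),
]
scores = related_phrases(sample_log)
for phrases, pmi in sorted(scores.items()):
    print(phrases, round(pmi, 3))
```

On the sample log, the pairs ("j2ee", "java") and ("nurse", "rn") survive the threshold, while ("java", "nurse") is dropped because only one user issued both queries. A real log would need far larger counts, and a normalized variant of PMI may be preferable for ranking.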

Author information

Authors and Affiliations

CareerBuilder, Norcross, GA, USA
Khalifeh AlJadda, Mohammed Korayem & Trey Grainger

Corresponding author

Correspondence to Khalifeh AlJadda.

Editor information

Editors and Affiliations

IBM Analytics, San Ramon, California, USA
Sourav Mazumder
Discipline of Computer Science and Engineering, Indian Institute of Technology Indore, Indore, Madhya Pradesh, India
Robin Singh Bhadoria
Directorate General of Training, Ministry of Skill Development and Entrepreneurship, New Delhi, Delhi, India
Ganesh Chandra Deka

About this chapter

Cite this chapter

AlJadda, K., Korayem, M., Grainger, T. (2017). Utilizing Big Data Analytics for Automatic Building of Language-agnostic Semantic Knowledge Bases. In: Mazumder, S., Singh Bhadoria, R., Deka, G. (eds) Distributed Computing in Big Data Analytics. Scalable Computing and Communications. Springer, Cham. https://doi.org/10.1007/978-3-319-59834-5_9

DOI: https://doi.org/10.1007/978-3-319-59834-5_9
Published: 31 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59833-8
Online ISBN: 978-3-319-59834-5
eBook Packages: Computer Science, Computer Science (R0)
