Abstract
Most work on building semantic knowledge bases has so far focused either on manually building language-specific taxonomies and ontologies or on automatic techniques, such as clustering or dimensionality reduction, that discover latent semantic links within the content of a given corpus. The former is labor-intensive and hard to maintain, while the latter is prone to noise and can be hard for a human to understand or interact with directly. More recently, ontology learning systems have arisen with the hope of automatically extracting relationships from free-text content. Unfortunately, this approach is also problematic: it can only use relationships found within documents, and it loses substantial meaning encoded in the underlying documents when generating the ontology. We describe a combination of techniques which, taken together, can overcome these problems. First, we show how search logs represent a largely untapped source for discovering latent semantic relationships between phrases, which can be used to build a semantic knowledge base. Second, we show how a semantic knowledge graph of relationships between all terms and concepts can be automatically built and compactly represented to enable traversal and scoring of relationships without losing any of the nuanced meaning embedded in the underlying corpus of documents. We discuss how to use key big data analytics technologies and techniques for mining search logs, as well as textual content, to discover semantic relationships between key phrases in a manner that is language-agnostic, human-understandable, highly scalable, and mostly noise-free.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
AlJadda, K., Korayem, M., Grainger, T. (2017). Utilizing Big Data Analytics for Automatic Building of Language-agnostic Semantic Knowledge Bases. In: Mazumder, S., Singh Bhadoria, R., Deka, G. (eds) Distributed Computing in Big Data Analytics. Scalable Computing and Communications. Springer, Cham. https://doi.org/10.1007/978-3-319-59834-5_9
DOI: https://doi.org/10.1007/978-3-319-59834-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59833-8
Online ISBN: 978-3-319-59834-5
eBook Packages: Computer Science, Computer Science (R0)