Distributed Computing in Big Data Analytics, pp. 137-160
Utilizing Big Data Analytics for Automatic Building of Language-agnostic Semantic Knowledge Bases
Abstract
Most work on building semantic knowledge bases has so far focused either on manually building language-specific taxonomies and ontologies or on automatic techniques, such as clustering or dimensionality reduction, that discover latent semantic links within the content of a given corpus. The former is labor-intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. More recently, ontology learning systems have arisen with the hope of automatically extracting relationships from free-text content. Unfortunately, this approach is also problematic: it can only use relationships found within documents, and it loses substantial meaning encoded in the underlying documents when generating the ontology. We describe a combination of techniques which, taken together, can overcome these problems. First, we show how search logs represent a largely untapped source for discovering latent semantic relationships between phrases, which can be used to build a semantic knowledge base. Second, we show how a semantic knowledge graph of relationships between all terms and concepts can be automatically built and compactly represented to enable traversal and scoring of relationships without losing any of the nuanced meaning embedded in the underlying corpus of documents. We discuss how to use key big data analytics technologies and techniques for mining search logs, as well as textual content, to discover semantic relationships between key phrases in a manner that is language-agnostic, human-understandable, highly scalable, and mostly noise-free.