Abstract
The first step toward automated taxonomy discovery is to identify important concepts. In this chapter, we discuss how to obtain high-quality concepts via concept set expansion techniques. Specifically, we first lay out general approaches to the concept set expansion task and then present an iterative expansion framework with a thorough experimental analysis. After that, we discuss how to extend this expansion framework by exploiting automatically discovered negative sets and incorporating signals from pre-trained language models. Finally, we conclude this chapter with interesting future research directions.
Notes
1. Results of SEISA on PubMed-CVD are omitted due to the scalability issue.
References
Balasubramanyan, R., Dalvi, B.B., Cohen, W.W.: From topic models to semi-supervised learning: Biasing mixed-membership models to exploit topic-indicative features in entity clustering. In: Proceedings of 2013 Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2013)
Chen, Z., Cafarella, M., Jagadish, H.: Long-tail vocabulary dictionary extraction from the web. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining (2016)
Chierichetti, F., Kumar, R., Pandey, S., Vassilvitskii, S.: Finding the Jaccard median. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (2010)
Curran, J.R., Murphy, T., Scholz, B.: Minimising semantic drift with mutual exclusion bootstrapping. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (2007)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2019)
Ghahramani, Z., Heller, K.A.: Bayesian sets. In: Proceedings of the 19th Conference on Neural Information Processing Systems (2005)
Gupta, S., MacLean, D.L., Heer, J., Manning, C.D.: Research and applications: induced lexico-syntactic patterns improve information extraction from online medical forums. J Amer Med Inform Assoc (2014)
Gupta, S., Manning, C.D.: Improved pattern learning for bootstrapped entity extraction. In: Proceedings of the 18th Conference on Computational Natural Language Learning (2014)
Gupta, S., Manning, C.D.: Distributed representations of words to guide bootstrapped entity classifiers. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015)
He, Y., Xin, D.: SEISA: set expansion by iterative similarity aggregation. In: Proceedings of the 20th International Conference on World Wide Web (2011)
Huang, J., Xie, Y., Meng, Y., Shen, J., Zhang, Y., Han, J.: Guiding corpus-based set expansion by auxiliary sets generation and co-expansion. In: Proceedings of the 2020 Web Conference (2020)
Jindal, P., Roth, D.: Learning from negative examples in set-expansion. In: Proceedings of IEEE 11th International Conference on Data Mining (2011)
Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (2009)
Lin, W., Yangarber, R., Grishman, R.: Bootstrapped learning of semantic classes from positive and negative examples. In: Proceedings of ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data (2003)
Ling, X., Weld, D.S.: Fine-grained entity recognition. In: Proceedings of the 2012 AAAI Conference on Artificial Intelligence (2012)
Liu, J., Shang, J., Wang, C., Ren, X., Han, J.: Mining quality phrases from massive text corpora. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)
Mamou, J., Pereg, O., Wasserblat, M., Eirew, A., Green, Y., Guskin, S., Izsak, P., Korat, D.: Term set expansion based NLP Architect by Intel AI Lab. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
McIntosh, T., Curran, J.R.: Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition. In: Proceedings of the Australasian Language Technology Association Workshop 2008 (2008)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th Conference on Neural Information Processing Systems (2013)
Pantel, P., Crestan, E., Borkovsky, A., Popescu, A.M., Vyas, V.: Web-scale distributional similarity and entity set expansion. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (2009)
Ren, X., El-Kishky, A., Wang, C., Tao, F., Voss, C.R., Han, J.: ClusType: effective entity recognition and typing by relation phrase-based clustering. In: Proceedings of the 24th International Conference on World Wide Web (2015)
Ren, X., Lv, Y., Wang, K., Han, J.: Comparative document analysis for large text corpora. In: Proceedings of the 10th ACM International Conference on Web Search and Data Mining (2017)
Riloff, E.: Automatically generating extraction patterns from untagged text. In: Proceedings of the 1996 AAAI Conference on Artificial Intelligence (1996)
Rong, X., Chen, Z., Mei, Q., Adar, E.: EgoSet: exploiting word ego-networks and user-generated ontology for multifaceted set expansion. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining (2016)
Shen, J., Wu, Z., Lei, D., Shang, J., Ren, X., Han, J.: SetExpan: corpus-based set expansion via context feature selection and rank ensemble. In: Proceedings of the 2017 Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2017)
Shi, B., Zhang, Z., Sun, L., Han, X.: A probabilistic co-bootstrapping method for entity set expansion. In: Proceedings of the 25th International Conference on Computational Linguistics (2014)
Shi, S., Zhang, H., Yuan, X., Wen, J.R.: Corpus-based semantic class mining: distributional vs. pattern-based approaches. In: Proceedings of the 23rd International Conference on Computational Linguistics (2010)
Talukdar, P.P., Reisinger, J., Pasca, M., Ravichandran, D., Bhagat, R., Pereira, F.: Weakly-supervised acquisition of labeled class instances using graph random walks. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (2008)
Tang, J., Qu, M., Mei, Q.: PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015)
Thelen, M., Riloff, E.: A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (2002)
Tong, S., Dean, J.: System and methods for automatically creating lists (2008). US Patent 7,350,187
Velardi, P., Faralli, S., Navigli, R.: Ontolearn reloaded: a graph-based algorithm for taxonomy induction. In: Computational Linguistics (2013)
Wang, C., Chakrabarti, K., He, Y., Ganjam, K., Chen, Z., Bernstein, P.A.: Concept expansion using web tables. In: Proceedings of the 24th International Conference on World Wide Web (2015)
Wang, R.C., Cohen, W.W.: Language-independent set expansion of named entities using the web. In: Proceedings of the 7th IEEE International Conference on Data Mining (2007)
Wang, Y.Y., Hoffmann, R., Li, X., Szymanski, J.: Semi-supervised learning of semantic classes for query understanding: from the web and for the web. In: Proceedings of the 18th ACM International Conference on Information and Knowledge Management (2009)
Yan, L., Han, X., Sun, L., He, B.: Learning to bootstrap for entity set expansion. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (2019)
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (2019)
Yu, P., Huang, Z., Rahimi, R., Allan, J.D.: Corpus-based set expansion with lexical features and distributed representations. In: Proceedings of the 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval (2019)
Zhang, Y., Shen, J., Shang, J., Han, J.: Empower entity set expansion via language model probing. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Shen, J., Han, J. (2022). Concept Set Expansion. In: Automated Taxonomy Discovery and Exploration. Synthesis Lectures on Data Mining and Knowledge Discovery. Springer, Cham. https://doi.org/10.1007/978-3-031-11405-2_2
DOI: https://doi.org/10.1007/978-3-031-11405-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11404-5
Online ISBN: 978-3-031-11405-2
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 11