Identifying Domains and Concepts in Short Texts via Partial Taxonomy and Unlabeled Data

  • Yihong Zhang
  • Claudia Szabo
  • Quan Z. Sheng
  • Wei Emma Zhang
  • Yongrui Qin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10253)


Accurate, real-time identification of the domains and concepts discussed in microblogging texts is crucial for applications such as earthquake monitoring, influenza surveillance, and disaster management. Existing techniques such as machine learning and keyword generation are application-specific and require a significant amount of training to achieve high accuracy. In this paper, we propose using a multiple domain taxonomy (MDT) to capture general user knowledge. We formally define the problems of domain classification and concept tagging. Using the MDT, we devise domain-independent, pure frequency count methods that require neither training data nor annotations and are not sensitive to misspellings or shortened word forms. Our extensive experimental analysis on real Twitter data shows that both methods achieve significantly better identification accuracy than existing methods while keeping runtime low on large datasets.
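The abstract does not spell out the counting rule, so the following is only an illustrative sketch of what a frequency-count approach over a taxonomy could look like, not the authors' method. The toy taxonomy, its terms, and the function names (`tokenize`, `count_term`, `classify`) are all hypothetical, and the sketch makes no attempt to handle misspellings or shortened word forms.

```python
import re
from collections import Counter

# Hypothetical toy taxonomy (not from the paper): domain -> concept -> terms.
TOY_TAXONOMY = {
    "earthquake": {
        "seismic event": ["earthquake", "quake", "tremor", "aftershock"],
        "damage": ["collapsed", "rubble", "damage"],
    },
    "influenza": {
        "symptom": ["fever", "cough", "sore throat"],
        "illness": ["flu", "influenza"],
    },
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def count_term(tokens, term):
    """Count occurrences of a (possibly multi-word) term as a token n-gram."""
    parts = term.split()
    n = len(parts)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == parts)


def classify(text, taxonomy=TOY_TAXONOMY):
    """Return (best_domain, sorted concept tags) using raw match counts.

    Assumption: the domain with the most term matches wins; ties go to the
    domain listed first in the taxonomy. Returns (None, []) when nothing matches.
    """
    tokens = tokenize(text)
    domain_counts = Counter()
    concept_hits = {}
    for domain, concepts in taxonomy.items():
        for concept, terms in concepts.items():
            hits = sum(count_term(tokens, term) for term in terms)
            if hits:
                domain_counts[domain] += hits
                concept_hits.setdefault(domain, []).append(concept)
    if not domain_counts:
        return None, []
    best = max(domain_counts, key=domain_counts.get)
    return best, sorted(concept_hits[best])


if __name__ == "__main__":
    print(classify("Huge quake just hit, buildings collapsed, aftershock expected"))
    print(classify("Down with the flu, fever and a sore throat"))
```

Because the sketch only counts matches against a fixed term list, it needs no training step, which is the property the abstract emphasizes; everything else about it is an assumption made for illustration.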


Text classification · Concept extraction · Unsupervised method · Twitter



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Yihong Zhang (1, email author)
  • Claudia Szabo (2)
  • Quan Z. Sheng (3)
  • Wei Emma Zhang (2)
  • Yongrui Qin (4)
  1. School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
  2. School of Computer Science, The University of Adelaide, Adelaide, Australia
  3. Department of Computing, Macquarie University, Sydney, Australia
  4. School of Computing and Engineering, University of Huddersfield, Huddersfield, UK
