New Generation Computing, Volume 28, Issue 3, pp 217–236

Organizing the Web's Information Explosion to Discover Unknown Unknowns

  • Kentaro Torisawa
  • Stijn De Saeger
  • Jun’ichi Kazama
  • Asuka Sumida
  • Daisuke Noguchi
  • Yasunori Kakizawa
  • Masaki Murata
  • Kow Kuroda
  • Ichiro Yamada
Abstract

This paper introduces the TORISHIKI-KAI project, which aims to construct a million-word-scale semantic network from the Web using state-of-the-art knowledge acquisition methods. The resulting network can be browsed as a Web search directory, and we show that the directory is useful for finding “unknown unknowns”: in the infamous words of D.H. Rumsfeld, things “we don't know we don't know.” Because we typically have no way to look for information we don't even know is missing, a crucial characteristic of unknown unknowns is that they are very difficult to discover through keyword-based Web search. Examples of the unknown unknowns we have found include unexpected troubles associated with commercial products, surprising combinations of ingredients in new recipes, and unexpected tools or methods for committing suicide. We expect such information to be useful for risk management, innovation support, and the detection of harmful information on the Web.

Keywords:

Information Retrieval, Knowledge Acquisition, Knowledge Management

Copyright information

© 2010 Ohmsha and Springer Japan jointly hold copyright of the journal.

Authors and Affiliations

  • Kentaro Torisawa (1)
  • Stijn De Saeger (1)
  • Jun’ichi Kazama (1)
  • Asuka Sumida (1, 2)
  • Daisuke Noguchi (1, 3)
  • Yasunori Kakizawa (1)
  • Masaki Murata (1)
  • Kow Kuroda (1)
  • Ichiro Yamada (1)

  1. Language Infrastructure Group, MASTAR Project, National Institute of Information and Communications Technology (NICT), Kyoto, Japan
  2. Japan Advanced Institute of Science and Technology, Ishikawa, Japan
  3. NEC BIGLOBE Ltd., Tokyo, Japan
