Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web

Original Paper

Abstract

Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. Previous work has exploited reference sets to aid such extraction, but it relied on supervised machine learning. In this paper, we present an unsupervised approach that automatically selects the relevant reference set(s) and then uses them for unsupervised extraction. We validate our approach with experimental results showing that our unsupervised extraction is competitive with supervised machine learning approaches, including the previous supervised approach that exploits reference sets.
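
As a rough illustration of what a reference set contributes, the sketch below matches a classified-style post against a small table of known entities and tags the post tokens that correspond to reference attributes. This is not the paper's algorithm, which the abstract does not describe: the reference records, the `extract` function, the use of `difflib.SequenceMatcher` as the string similarity, and the 0.8 threshold are all assumptions chosen for illustration, and the automatic selection among several reference sets is omitted.

```python
from difflib import SequenceMatcher

# Hypothetical reference set: each record maps attribute -> canonical value.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Corolla"},
]


def token_similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_token_match(value, tokens):
    """Return (score, token) for the post token most similar to value."""
    return max((token_similarity(value, t), t) for t in tokens)


def extract(post, reference_set, threshold=0.8):
    """Pick the reference record that best matches the post.

    Returns (record, fields), where fields maps each attribute that was
    found in the post to the post token that matched it. Attributes of the
    chosen record that are absent from the post can still be used to
    annotate it. The threshold is an assumed value, not a tuned one.
    """
    tokens = post.split()
    if not tokens:
        return None, {}
    best_record, best_score, best_fields = None, 0.0, {}
    for record in reference_set:
        fields, score = {}, 0.0
        for attr, value in record.items():
            s, tok = best_token_match(value, tokens)
            if s >= threshold:
                fields[attr] = tok
                score += s
        if score > best_score:
            best_record, best_score, best_fields = record, score, fields
    return best_record, best_fields


record, fields = extract("93 civic 5speed runs great obo", REFERENCE_SET)
print(record, fields)
```

In this toy example the post never mentions the make, yet the matched record supplies it, which is the kind of annotation a reference set makes possible without grammatical or structural cues.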

Keywords

Information extraction, Unsupervised, Semantic annotation, Information integration, Unstructured data sources

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  1. Information Sciences Institute, University of Southern California, Marina del Rey, USA