What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content

  • Sören Auer
  • Jens Lehmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4519)

Abstract

Wikis are an established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating by far the largest encyclopedia on the basis of a wiki alone. Recently, several approaches have been proposed for extending wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, although unconsciously, even used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g. more than 8 million RDF statements for the English Wikipedia version alone), leading to astonishing query answering possibilities (such as for the title question). We analyze the quality of the extracted content and propose strategies for improving quality with only minor modifications to the wiki systems currently in use.
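
As a rough illustration of what extracting information from template instances could look like, the following Python sketch turns the key-value attributes of one template instance into subject-predicate-object statements. It is not the authors' extraction algorithm: the `example.org` namespaces, the `usesTemplate` property, the function name `extract_triples`, the template name and the sample wikitext are all assumptions made for this example, and nested templates are deliberately not handled.

```python
import re

# Hypothetical namespaces, chosen only for this illustration.
RESOURCE_NS = "http://example.org/resource/"
PROPERTY_NS = "http://example.org/property/"

# Matches a non-nested template instance: {{Name | key = value | ...}}
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}", re.DOTALL)


def extract_triples(page_title, wikitext):
    """Return (subject, predicate, object) tuples for one wiki page."""
    subject = RESOURCE_NS + page_title.replace(" ", "_")
    triples = []
    for match in TEMPLATE_RE.finditer(wikitext):
        template_name = match.group(1).strip()
        # Record which template the page uses (assumed property name).
        triples.append((subject, PROPERTY_NS + "usesTemplate", template_name))
        for part in match.group(2).split("|"):
            if "=" not in part:
                continue  # skip positional (unnamed) parameters
            key, value = part.split("=", 1)
            key, value = key.strip(), value.strip()
            if key and value:  # ignore attributes left empty by the author
                predicate = PROPERTY_NS + key.replace(" ", "_")
                triples.append((subject, predicate, value))
    return triples


# Made-up sample wikitext; the template name and attributes are illustrative.
sample = """{{Infobox Town
| name = Innsbruck
| country = Austria
| population =
}}"""

triples = extract_triples("Innsbruck", sample)
for triple in triples:
    print(triple)
```

Each non-empty `key = value` attribute becomes one statement about the page. Statements from two pages that share a predicate and an object would then be a natural starting point for questions such as the one in the title.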

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Sören Auer (1, 2)
  • Jens Lehmann (1)
  1. Universität Leipzig, Department of Computer Science, Johannisgasse 26, D-04103 Leipzig, Germany
  2. University of Pennsylvania, Department of Computer and Information Science, Philadelphia, PA 19104, USA
