Transforming unstructured or semi-structured information into structured knowledge is one of the big challenges of today’s knowledge society. While this abstract goal has not been reached and is probably unreachable, intelligent information extraction techniques are considered key ingredients on the way to generating and representing knowledge for a wide variety of applications. This is especially true for current efforts to turn the World Wide Web, the world’s largest collection of information, into the world’s largest knowledge base. This introduction gives a broad overview of the major topics and current trends in information extraction.
Balke, WT. Introduction to Information Extraction: Basic Notions and Current Trends. Datenbank Spektrum 12, 81–88 (2012). https://doi.org/10.1007/s13222-012-0090-x