Looking for Information in Documents

Weiss, Sholom M.; Indurkhya, Nitin; Zhang, Tong

doi:10.1007/978-1-84996-226-1_6

Part of the book series: Texts in Computer Science (TCS)

Abstract

A common task in natural language processing and text mining is the extraction and formatting of information from unstructured text. One can think of the end goal of information extraction in terms of filling templates that codify the extracted information. The templates are then stored in a knowledge database for future use. This chapter describes several models and learning methods that can be applied to information extraction. We focus on two major subtasks: extracting entities, such as person names and organizations, from sentences, and determining the relationships among the extracted entities.
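
To make the template-filling idea concrete, here is a minimal sketch in Python. It is not the chapter's method: the chapter covers learned models, whereas this toy uses a single hand-written regular expression. The template class, its slot names, and the pattern are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical template: the slot names are illustrative assumptions,
# not taken from the chapter.
@dataclass
class EmploymentTemplate:
    person: Optional[str] = None
    organization: Optional[str] = None
    relation: Optional[str] = None


# Toy pattern (an assumption): "<Person> joined|works for|founded <Organization>".
# A person is two or more capitalized words; an organization is one or more.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) "
    r"(?P<relation>joined|works for|founded) "
    r"(?P<organization>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def fill_template(sentence: str) -> Optional[EmploymentTemplate]:
    """Return a filled template, or None if the toy pattern does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return EmploymentTemplate(
        person=match.group("person"),
        organization=match.group("organization"),
        relation=match.group("relation"),
    )


if __name__ == "__main__":
    print(fill_template("Jane Smith joined Acme Corp in 2009."))
```

The two subtasks named in the abstract both appear here in miniature: the `person` and `organization` groups stand in for entity extraction, and the `relation` group stands in for relationship determination. A filled template could then be written to whatever store serves as the knowledge database.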

Author information

Authors and Affiliations

T.J. Watson Research Center, IBM Corporation, Kitchawan Road 1101, Yorktown Heights, 10598, NY, USA
Sholom M. Weiss
School of Computer Science & Engg., University of New South Wales, Sydney, 2052, NSW, Australia
Nitin Indurkhya
Dept. Statistics, Hill Center, Rutgers University, Piscataway, 08854-8019, NJ, USA
Tong Zhang

Corresponding author

Correspondence to Sholom M. Weiss.

About this chapter

Cite this chapter

Weiss, S.M., Indurkhya, N., Zhang, T. (2010). Looking for Information in Documents. In: Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84996-226-1_6

DOI: https://doi.org/10.1007/978-1-84996-226-1_6
Publisher Name: Springer, London
Print ISBN: 978-1-84996-225-4
Online ISBN: 978-1-84996-226-1
eBook Packages: Computer Science (R0)
