Abstract
Text processing is a preliminary phase of many document content handling tasks that aim to extract and organize the information a document contains. The computer science disciplines devoted to understanding language, and hence useful for such objectives, are Computational Linguistics and Natural Language Processing. They rely on the availability of suitable linguistic resources (corpora, computational lexica, etc.) and of standard representation models of linguistic information to build tools that can analyze sentences at various levels of complexity: morphological, lexical, syntactic, and semantic. This chapter surveys the main Natural Language Processing tasks (tokenization, language recognition, stemming, stopword removal, Part of Speech tagging, Word Sense Disambiguation, and parsing) and presents some related techniques, along with lexical resources of interest to the research community.
Notes
- 1.
A thesaurus is a dictionary organized according to semantic models that reports for each word, in addition to its meaning, a list of related words (e.g., its synonyms and antonyms).
- 2.
A “formal, explicit specification of a shared conceptualization” [13], i.e., a vocabulary to model the type of objects, concepts, and their properties and relations, in a domain. In philosophy, a systematic account of Existence.
- 3.
Website http://wordnet.princeton.edu.
- 4.
Since backward compatibility is not guaranteed across WordNet versions, and hence synset IDs may change, WordNet Domains might not be aligned with the latest WordNet version. This can be a problem when both are to be used: one must either give up using the latest WordNet version or install two different WordNet versions on the same system.
- 5.
Description Logics are fragments of First-Order Logic, having a formal semantics, and typically exploited to define concepts (i.e., classes) and roles (i.e., binary relations) in ontologies.
- 6.
Considering spoken text, a further preliminary level is the phonetic one.
- 7.
To improve the approximation between the conventional year and the actual movement of the Earth, a year is a leap year, i.e., includes February 29th, if it is divisible by 4, unless it is also divisible by 100 (in which case it is not), unless it is divisible by 400 as well (in which case it is).
- 8.
This term is often used also to denote a sequence of n consecutive words in a text, but here the letter-based interpretation is assumed.
- 9.
The cited experiment adopted a generative perspective on this subject. First, random sequences of equiprobable letters and spaces were generated, showing that no particular language was suggested by the outcome. Then a first-order approximation of English was produced by generating sequences where the probability of adding a character to the sequence was the same as the frequency computed for that character on actual English texts. Again, this was of little help for inferring from the sequence the language whose frequency distribution had generated it. Switching to a second-order approximation, where each character was drawn from a different distribution depending on the previously drawn character, started showing some hints of English here and there. Indeed, it was able to represent phenomena such as the ‘qu’ pair, where the ‘u’ must necessarily follow the ‘q’, that the single-letter approximation could not express. Finally, a third-order approximation, in which the probability distribution for drawing the next symbol depended on the previous two characters, showed a significant improvement in the outcome: English was quite evident in the overall flavor of the sound, and several monosyllabic words were actually produced. Comparable results were obtained for other languages as well, e.g., Latin.
- 10.
In cases of binary ambiguity, the estimation is that in a given collocation a word is used with only one sense with a 90–99% probability [27].
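The leap-year rule stated in note 7 can be written as a short predicate. The following is a minimal sketch in Python; the function name `is_leap_year` is an illustrative choice, not something taken from the chapter.

```python
def is_leap_year(year: int) -> bool:
    """Apply the rule from note 7 to a calendar year."""
    # Divisible by 400: leap year, overriding the century exception.
    if year % 400 == 0:
        return True
    # Divisible by 100 but not by 400: not a leap year.
    if year % 100 == 0:
        return False
    # Otherwise, leap exactly when divisible by 4.
    return year % 4 == 0
```

Checking the most specific condition first keeps the three nested "unless" clauses of the rule from interfering with each other.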
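The generative experiment summarized in note 9 can be approximated with a small character-level model. The sketch below is an illustration under assumptions, not the procedure of the cited work: the names `build_model` and `generate`, the restart strategy for unseen contexts, and the use of a sample string as the source of frequencies are all choices made here. An order of 1 uses single-character frequencies, order 2 conditions on the previous character, and order 3 on the previous two.

```python
import random
from collections import Counter, defaultdict
from typing import Dict


def build_model(text: str, order: int) -> Dict[str, Counter]:
    """Count which character follows each context of length order - 1."""
    context_len = order - 1
    model: Dict[str, Counter] = defaultdict(Counter)
    for i in range(context_len, len(text)):
        context = text[i - context_len:i]
        model[context][text[i]] += 1
    return dict(model)


def generate(model: Dict[str, Counter], order: int, length: int,
             rng: random.Random) -> str:
    """Draw characters one at a time according to the counted frequencies."""
    if not model:
        raise ValueError("the model is empty; the sample text was too short")
    context_len = order - 1
    contexts = sorted(model)
    # Seed the output with a context observed in the sample
    # (the empty string when order is 1).
    out = list(rng.choice(contexts))
    while len(out) < length:
        context = "".join(out[len(out) - context_len:])
        counts = model.get(context)
        if counts is None:
            # Assumption of this sketch: a context seen only at the very end
            # of the sample has no successor, so restart from a known context.
            out.extend(rng.choice(contexts))
            continue
        chars = list(counts)
        weights = list(counts.values())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out[:length])
```

For example, `generate(build_model(sample, 3), 3, 200, random.Random(0))` produces 200 characters of a third-order approximation of whatever language `sample` is written in; `sample` is a placeholder for any reasonably long text.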
References
Allen, J.F.: Natural Language Understanding. Benjamin-Cummings, Redwood City (1994)
Bentivogli, L., Forner, P., Magnini, B., Pianta, E.: Revising WordNet domains hierarchy: Semantics, coverage, and balancing. In: Proceedings of COLING 2004 Workshop on Multilingual Linguistic Resources, pp. 101–108 (2004)
Berry-Rogghe, G.: The computation of collocations and their relevance to lexical studies. In: Aitken, A.J., Bailey, R.W., Hamilton-Smith, N. (eds.) The Computer and Literary Studies, pp. 103–112. Edinburgh University Press, Edinburgh (1973)
Brill, E.: A simple rule-based part of speech tagger. In: HLT ’91: Proceedings of the Workshop on Speech and Natural Language, pp. 112–116 (1992)
Brill, E.: Some advances in transformation-based part of speech tagging. In: Proceedings of the 12th National Conference on Artificial Intelligence (AAAI), vol. 1, pp. 722–727 (1994)
Brill, E.: Unsupervised learning of disambiguation rules for part of speech tagging. In: Natural Language Processing Using Very Large Corpora Workshop, pp. 1–13. Kluwer, Amsterdam (1995)
Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-lite family. Journal of Automated Reasoning 39(3), 385–429 (2007)
Calzolari, N., Lenci, A.: Linguistica computazionale—strumenti e risorse per il trattamento automatico della lingua. Mondo Digitale 2, 56–69 (2004) (in Italian)
De Mauro, T.: Grande Dizionario Italiano dell’Uso. UTET, Turin (1999) (in Italian)
Dewey, M., et al.: Dewey Decimal Classification and Relative Index. Edition 22. OCLC Online Computer Library Center (2003)
Gale, W., Church, K., Yarowsky, D.: One sense per discourse. In: Proceedings of the ARPA Workshop on Speech and Natural Language Processing, pp. 233–237 (1992)
Grishman, R.: Computational Linguistics: An Introduction. Studies in Natural Language Processing. Cambridge University Press, Cambridge (1986)
Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5(2), 199–220 (1993)
Halliday, M.: Categories of the theory of grammar. Word 17, 241–292 (1961)
Ide, N., Véronis, J.: Introduction to the special issue on Word Sense Disambiguation: The state of the art. Computational Linguistics 24(1), 1–40 (1998)
Krovetz, R.: More than one sense per discourse. In: Proceedings of SENSEVAL Workshop, pp. 1–10 (1998)
Lafferty, J., Sleator, D.D., Temperley, D.: Grammatical trigrams: A probabilistic model of link grammar. In: Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language (1992)
Lesk, M.: Automatic sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the 5th International Conference on Systems Documentation (SIGDOC), pp. 24–26 (1986)
Magnini, B., Cavaglià, G.: Integrating subject field codes into WordNet. In: Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), pp. 1413–1418 (2000)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
McCarthy, J., Minsky, M.L., Rochester, N., Shannon, C.E.: A proposal for the Dartmouth Summer research project on Artificial Intelligence. Tech. rep., Dartmouth College (1955)
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3(4), 235–244 (1990)
Oltramari, A., Vetere, G.: Lexicon and ontology interplay in Senso Comune. In: Proceedings of OntoLex 2008 Workshop, 6th International Conference on Language Resources and Evaluation (LREC) (2008)
Pierce, J.R.: Symbols, Signals and Noise—The Nature and Process of Communication. Harper Modern Science Series. Harper & Brothers (1961)
Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Sleator, D.D., Temperley, D.: Parsing English text with a link grammar. In: Proceedings of the 3rd International Workshop on Parsing Technologies (1993)
Yarowsky, D.: One sense per collocation. In: Proceedings of the ARPA Human Language Technology Workshop, pp. 266–271 (1993)
Yarowsky, D.: Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 88–95 (1994)
Yarowsky, D.: Unsupervised Word Sense Disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196 (1995)
© 2011 Springer-Verlag London Limited
About this chapter
Cite this chapter
Ferilli, S. (2011). Natural Language Processing. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_6
DOI: https://doi.org/10.1007/978-0-85729-198-1_6
Publisher Name: Springer, London
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1
eBook Packages: Computer Science, Computer Science (R0)