Natural Language Processing

Ferilli, Stefano

doi:10.1007/978-0-85729-198-1_6

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

Text processing represents a preliminary phase to many document content handling tasks aimed at extracting and organizing information therein. The computer science disciplines devoted to understanding language, and hence useful for such objectives, are Computational Linguistics and Natural Language Processing. They rely on the availability of suitable linguistic resources (corpora, computational lexica, etc.) and of standard representation models of linguistic information to build tools that are able to analyze sentences at various levels of complexity: morphologic, lexical, syntactic, semantic. This chapter provides a survey of the main Natural Language Processing tasks (tokenization, language recognition, stemming, stopword removal, Part of Speech tagging, Word Sense Disambiguation, Parsing) and presents some related techniques, along with lexical resources of interest to the research community.

Notes

1. A thesaurus is a dictionary organized according to semantic models that reports, for each word, a list of related words (e.g., its synonyms and antonyms) in addition to its meaning.
2. A “formal, explicit specification of a shared conceptualization” [13], i.e., a vocabulary to model the types of objects and concepts, along with their properties and relations, in a domain. In philosophy, a systematic account of Existence.
3. Website http://wordnet.princeton.edu.
4. Since backward compatibility is not guaranteed across WordNet versions, synset IDs may change, so WordNet Domains might not be aligned with the latest WordNet version. This can be a problem when both are to be used: one must either give up the latest WordNet version or install two different WordNet versions on the same system.
5. Description Logics are fragments of First-Order Logic, having a formal semantics, and typically used to define concepts (i.e., classes) and roles (i.e., binary relations) in ontologies.
6. For spoken text, a further preliminary level is the phonetic one.
7. To improve the approximation between the conventional year and the actual movement of the Earth, a year is a leap year, i.e., includes February 29th, if it is divisible by 4, unless it is also divisible by 100 (in which case it is not), unless it is divisible by 400 as well (in which case it is).
8. This term is also often used to denote a sequence of n consecutive words in a text, but here the letter-based interpretation is assumed.
9. The cited experiment adopted a generative perspective on this subject. First, random sequences of equiprobable letters and spaces were generated, showing that no particular language was suggested by the outcome. Then a first-order approximation of English was produced by generating sequences where the probability of adding a character to the sequence was the same as the frequency computed for that character on actual English texts. Again, this was of little help for guessing, from the sequence alone, the language whose frequency distribution had generated it. Switching to a second-order approximation, where each character was drawn from a different distribution depending on the previously drawn character, started showing some hints of English here and there. Indeed, it was able to represent phenomena such as the ‘qu’ pair, where the ‘u’ must necessarily follow the ‘q’, that the single-letter approximation could not express. Finally, a third-order approximation, in which the probability distribution for drawing the next symbol depended on the previous two characters, showed a significant improvement in the outcome: English was quite evident in the overall flavor of the sound, and several monosyllabic words actually appeared. Comparable results were obtained for other languages as well, e.g., Latin.
10. In cases of binary ambiguity, it is estimated that in a given collocation a word is used with only one sense with 90–99% probability [27].
11. http://www.abisource.com/projects/link-grammar.
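
The leap-year rule stated in note 7 translates directly into a chain of divisibility checks. The following is a minimal sketch in Python; the function name `is_leap_year` is chosen here for illustration and does not come from the chapter.

```python
def is_leap_year(year: int) -> bool:
    """Return True if `year` is a leap year under the rule in note 7."""
    # Divisible by 400: leap year, overriding the exception for centuries.
    if year % 400 == 0:
        return True
    # Divisible by 100 but not by 400: not a leap year.
    if year % 100 == 0:
        return False
    # Otherwise a year is a leap year exactly when it is divisible by 4.
    return year % 4 == 0
```

Testing the most specific condition first (divisibility by 400) keeps each exception in the rule as a separate, readable branch.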

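The successive approximations described in note 9 can be illustrated with a small character-level model. The sketch below is an assumption-laden illustration, not the procedure of the cited experiment: the names `train_char_model` and `generate` are hypothetical, and "order k" is taken to mean that the next character depends on the previous k - 1 characters, following the terminology of the note.

```python
import random
from collections import Counter, defaultdict


def train_char_model(text: str, order: int) -> dict:
    """Count which character follows each context of `order - 1` characters."""
    context_len = order - 1
    model = defaultdict(Counter)
    for i in range(context_len, len(text)):
        context = text[i - context_len:i]  # empty string when order == 1
        model[context][text[i]] += 1
    return model


def generate(model: dict, order: int, length: int, seed: int = 0) -> str:
    """Draw `length` characters, each according to the counts of its context."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    context_len = order - 1
    context = rng.choice(sorted(model))  # start from some observed context
    out = list(context)
    while len(out) < length:
        counts = model.get(context)
        if not counts:
            # Context never seen with a follower: fall back to an observed one.
            context = rng.choice(sorted(model))
            counts = model[context]
        chars, weights = zip(*sorted(counts.items()))
        out.append(rng.choices(chars, weights=weights)[0])
        context = "".join(out[-context_len:]) if context_len else ""
    return "".join(out)[:length]
```

Trained with `order=1`, the model reproduces only single-letter frequencies; with `order=2` it can already capture constraints such as ‘u’ following ‘q’, provided the training text exhibits them.
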
References

1. Allen, J.F.: Natural Language Understanding. Benjamin-Cummings, Redwood City (1994)
2. Bentivogli, L., Forner, P., Magnini, B., Pianta, E.: Revising WordNet domains hierarchy: Semantics, coverage, and balancing. In: Proceedings of COLING 2004 Workshop on Multilingual Linguistic Resources, pp. 101–108 (2004)
3. Berry-Rogghe, G.: The computation of collocations and their relevance to lexical studies. In: Aitken, A.J., Bailey, R.W., Hamilton-Smith, N. (eds.) The Computer and Literary Studies, pp. 103–112. Edinburgh University Press, Edinburgh (1973)
4. Brill, E.: A simple rule-based part of speech tagger. In: HLT ’91: Proceedings of the Workshop on Speech and Natural Language, pp. 112–116 (1992)
5. Brill, E.: Some advances in transformation-based part of speech tagging. In: Proceedings of the 12th National Conference on Artificial Intelligence (AAAI), vol. 1, pp. 722–727 (1994)
6. Brill, E.: Unsupervised learning of disambiguation rules for part of speech tagging. In: Natural Language Processing Using Very Large Corpora Workshop, pp. 1–13. Kluwer, Amsterdam (1995)
7. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-lite family. Journal of Automated Reasoning 39(3), 385–429 (2007)
8. Calzolari, N., Lenci, A.: Linguistica computazionale: strumenti e risorse per il trattamento automatico della lingua. Mondo Digitale 2, 56–69 (2004) (in Italian)
9. De Mauro, T.: Grande Dizionario Italiano dell’Uso. UTET, Turin (1999) (in Italian)
10. Dewey, M., et al.: Dewey Decimal Classification and Relative Index. Edition 22. OCLC Online Computer Library Center (2003)
11. Gale, W., Church, K., Yarowsky, D.: One sense per discourse. In: Proceedings of the ARPA Workshop on Speech and Natural Language Processing, pp. 233–237 (1992)
12. Grishman, R.: Computational Linguistics: An Introduction. Studies in Natural Language Processing. Cambridge University Press, Cambridge (1986)
13. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5(2), 199–220 (1993)
14. Halliday, M.: Categories of the theory of grammar. Word 17, 241–292 (1961)
15. Ide, N., Véronis, J.: Introduction to the special issue on Word Sense Disambiguation: The state of the art. Computational Linguistics 24(1), 1–40 (1998)
16. Krovetz, R.: More than one sense per discourse. In: Proceedings of SENSEVAL Workshop, pp. 1–10 (1998)
17. Lafferty, J., Sleator, D.D., Temperley, D.: Grammatical trigrams: A probabilistic model of link grammar. In: Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language (1992)
18. Lesk, M.: Automatic sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the 5th International Conference on Systems Documentation (SIGDOC), pp. 24–26 (1986)
19. Magnini, B., Cavaglià, G.: Integrating subject field codes into WordNet. In: Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), pp. 1413–1418 (2000)
20. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
21. McCarthy, J., Minsky, M.L., Rochester, N., Shannon, C.E.: A proposal for the Dartmouth Summer research project on Artificial Intelligence. Tech. rep., Dartmouth College (1955)
22. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3(4), 235–244 (1990)
23. Oltramari, A., Vetere, G.: Lexicon and ontology interplay in Senso Comune. In: Proceedings of OntoLex 2008 Workshop, 6th International Conference on Language Resources and Evaluation (LREC) (2008)
24. Pierce, J.R.: Symbols, Signals and Noise: The Nature and Process of Communication. Harper Modern Science Series. Harper & Brothers (1961)
25. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
26. Sleator, D.D., Temperley, D.: Parsing English text with a link grammar. In: Proceedings of the 3rd International Workshop on Parsing Technologies (1993)
27. Yarowsky, D.: One sense per collocation. In: Proceedings of ARPA Human Language Technology Workshop, pp. 266–271 (1993)
28. Yarowsky, D.: Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 88–95 (1994)
29. Yarowsky, D.: Unsupervised Word Sense Disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196 (1995)

Author information

Authors and Affiliations

Dipartimento di Informatica, Università di Bari, Via E. Orabona 4, 70126, Bari, Italy
Stefano Ferilli

Corresponding author

Correspondence to Stefano Ferilli.

About this chapter

Cite this chapter

Ferilli, S. (2011). Natural Language Processing. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_6

DOI: https://doi.org/10.1007/978-0-85729-198-1_6
Publisher Name: Springer, London
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1
eBook Packages: Computer Science (R0)
