Natural Language Processing

  • Taylor Arnold
  • Lauren Tilton

Abstract

This chapter introduces low-level natural language processing. Techniques such as tokenization, lemmatization, part-of-speech tagging, and coreference detection are described in relation to text analysis. The methods are applied to a corpus of short stories by Sir Arthur Conan Doyle featuring his famous detective, Sherlock Holmes.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Taylor Arnold (1)
  • Lauren Tilton (1)
  1. Yale University, New Haven, USA
