Identification of Lost or Deserted Written Texts Using Zipf’s Law with NLTK

  • Devanshi Gupta
  • Priyank Singh Hada
  • Deepankar Mitra
  • Niket Sharma
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 27)

Abstract

Identifying a valuable text written by a notable author can be very difficult, especially when the text is unsigned or the author is anonymous. Abandoned manuscripts and documents without a title or heading add to the difficulty. The works of eminent authors may be lost, or only parts of them may survive in libraries or on other storage media. Applying Zipf’s law with the NLTK module available in Python can solve this problem to a great extent, helping to preserve the originality of valuable texts rather than leaving them unidentified. The approach can also help in real-time data analysis where word frequency plays an important role; plagiarism detection in written texts is one such example. NLTK is a powerful toolkit that supports extracting, segmenting, parsing, tagging, and searching text in many natural languages through Python modules. This paper attempts to combine Zipf’s law with NLTK to produce a tool for identifying anonymous or abandoned valuable texts.
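As an illustration of the idea only, and not the authors' implementation, the sketch below builds a rank-frequency table for a text and fits the slope of log(frequency) against log(rank). Zipf’s law says that a word's frequency is roughly inversely proportional to its rank, so natural-language text tends to give a slope near -1 on large samples. The function names `rank_frequency` and `zipf_slope`, the regular-expression tokenizer, and the sample sentence are all assumptions made for this example. NLTK's `FreqDist` is used when NLTK is installed; `collections.Counter`, which offers the same `most_common` method, serves as a fallback.

```python
import math
import re

try:
    from nltk import FreqDist  # NLTK's frequency distribution class
except ImportError:
    # Fallback (assumption for portability): Counter also provides most_common().
    from collections import Counter as FreqDist


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Simple regex tokenizer (hypothetical choice; avoids NLTK data downloads).
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = FreqDist(tokens)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(freq.most_common(), start=1)
    ]


def zipf_slope(table):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input; a real comparison would need a text of substantial length.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
slope = zipf_slope(table)
print(table[:3], round(slope, 2))
```

One possible use, again as an assumption rather than a description of the paper's tool, is to compute such rank-frequency profiles for an unidentified fragment and for candidate source texts, then compare them.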

Keywords

NLTK, Zipf’s law, NLP

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Devanshi Gupta (1)
  • Priyank Singh Hada (1)
  • Deepankar Mitra (1)
  • Niket Sharma (1)
  1. Computer Science Department, Manipal University Jaipur, Jaipur, India
