News Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites

  • Tomas Krilavičius
  • Žygimantas Medelis
  • Jurgita Kapočiūtė-Dzikienė
  • Tomas Žalandauskas
Part of the Communications in Computer and Information Science book series (CCIS, volume 319)

Abstract

The amount of information that is created, used, or stored is growing exponentially, and the types of data sources are diverse. Most of it is available as unstructured text, and a considerable part is available online, usually as Internet resources. It is too expensive, or even impossible, for humans to analyze all of these resources for the required information. Classical Information Technology techniques are not sufficient to process such amounts of information and render it in a form convenient for further analysis. Information Retrieval (IR) and Natural Language Processing (NLP) provide a number of instruments for information analysis and retrieval. In this paper we present a combined application of NLP and IR for Lithuanian media analysis. We demonstrate that a combination of IR and NLP tools, with appropriate changes, can be successfully applied to Lithuanian media texts.

Keywords

Information Retrieval, Natural Language Processing, stemming, focused crawl, Lithuanian language

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Tomas Krilavičius (1)
  • Žygimantas Medelis (2)
  • Jurgita Kapočiūtė-Dzikienė (1)
  • Tomas Žalandauskas (1)

  1. Baltic Institute of Advanced Technology, Vilnius, Lithuania
  2. UAB “Tokenmill”, Lithuania
