On the problem of Wiki texts indexing

  • A. A. Krizhanovsky
  • A. V. Smirnov
Artificial Intelligence

Abstract

A new type of document, the “wiki page,” is spreading across the Internet. This shows not only in the growing number of such pages but also in the popularity of wiki projects (Wikipedia in particular), so the problem of parsing wiki texts is becoming increasingly topical. A new method for indexing Wikipedia texts in three languages (Russian, English, and German) is proposed and implemented. The architecture of the indexing system, which includes the software components GATE and Lemmatizer, is considered. The rules for converting wiki texts into natural-language texts are described. Index databases are constructed for the Russian Wikipedia and the Simple English Wikipedia, and the validity of Zipf’s laws is tested on both.
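The Zipf's law check mentioned in the abstract can be illustrated with a minimal sketch. It assumes the law in its simplest form (word frequency roughly proportional to 1/rank) and a list of tokens that has already been lemmatized. The function names `rank_frequency` and `zipf_slope` are hypothetical and are not taken from the paper, whose actual procedure is not described here.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies sorted from most to least frequent."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a slope close to -1 is read as agreement with
    Zipf's law in its simplest form.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

On a real index database the token list would come from the lemmatized corpus, and the fit is usually inspected on a log-log plot rather than reduced to a single number.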

Keywords

  • System Science International
  • Word Form
  • Indexing System
  • Index Database
  • Full Text Search
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Pleiades Publishing, Ltd. 2009

Authors and Affiliations

  • A. A. Krizhanovsky (1)
  • A. V. Smirnov (1)
  1. St.-Petersburg Institute of Informatics and Automation, Russian Academy of Sciences, St.-Petersburg, Russia
