Automatic Semantic Subject Indexing of Web Documents in Highly Inflected Languages

Sinkkilä, Reetta; Suominen, Osma; Hyvönen, Eero

doi:10.1007/978-3-642-21034-1_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6643)

Included in the following conference series:

Extended Semantic Web Conference

Abstract

Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly inflected language requires word form normalization that goes beyond rule-based stemming algorithms. We have tested the state-of-the-art automatic indexing tool Maui on Finnish texts using three stemming and lemmatization algorithms, with documents and vocabularies from different domains. Both of the lemmatization algorithms we tested performed significantly better than a rule-based stemmer, and the subject indexing quality was found to be comparable to that of human indexers.
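
To make the difference between stemming and lemmatization concrete, here is a minimal Python sketch. It is a toy illustration only and not the method evaluated in the paper: the suffix list, the function names toy_stem and toy_lemmatize, and the hand-made lemma table are all assumptions introduced for this example. It uses inflected forms of the Finnish word "käsi" (hand), whose stem changes under inflection, so stripping suffixes alone does not map every form to the same term.

```python
# Toy illustration only: neither function is the algorithm used in the paper.

# A few Finnish case and plural endings (assumed, far from complete).
SUFFIXES = ("ssa", "ssä", "sta", "stä", "n", "t")


def toy_stem(word: str) -> str:
    """Strip the longest matching suffix; purely rule-based."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-made table standing in for a real morphological analyzer.
TOY_LEMMAS = {
    "käsi": "käsi",
    "käden": "käsi",    # genitive singular
    "kädessä": "käsi",  # inessive singular
    "kädet": "käsi",    # nominative plural
}


def toy_lemmatize(word: str) -> str:
    """Look up the dictionary form; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


forms = ["käsi", "käden", "kädessä", "kädet"]
stems = {toy_stem(w) for w in forms}
lemmas = {toy_lemmatize(w) for w in forms}

print("distinct stems: ", len(stems))
print("distinct lemmas:", len(lemmas))
```

In this sketch the stemmer leaves the base form untouched but reduces the inflected forms to a different string, so the four forms end up under two index terms, while the lookup maps all four to one dictionary form that can be matched against a controlled vocabulary. A real lemmatizer derives this mapping from morphological analysis rather than from a fixed table.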

Author information

Authors and Affiliations

Department of Media Technology, Semantic Computing Research Group (SeCo), Aalto University, Finland
Reetta Sinkkilä, Osma Suominen & Eero Hyvönen
Department of Computer Science, University of Helsinki, Finland
Reetta Sinkkilä, Osma Suominen & Eero Hyvönen

Editor information

Editors and Affiliations

Institute of Computer Science, FORTH-ICS and University of Crete, 71110, Heraklion, Crete, Greece
Grigoris Antoniou
Jožef Stefan Institute, Dept. of Knowledge Technologies, Jamova 39, 1000, Ljubljana, Slovenia
Marko Grobelnik
Karlsruhe Institute of Technology, 76128, Karlsruhe, Germany
Elena Simperl
University of Manchester, M13 9PL, Manchester, United Kingdom
Bijan Parsia
Institute of Computer Science, FORTH-ICS and University of Crete, 700 13, Heraklion, Crete, Greece
Dimitris Plexousakis
VU University of Amsterdam, 1012 ZA, Amsterdam, The Netherlands
Pieter De Leenheer
Department of Computing Science, University of Aberdeen, AB24 3UE, Aberdeen, United Kingdom
Jeff Pan

About this paper

Cite this paper

Sinkkilä, R., Suominen, O., Hyvönen, E. (2011). Automatic Semantic Subject Indexing of Web Documents in Highly Inflected Languages. In: Antoniou, G., et al. The Semantic Web: Research and Applications. ESWC 2011. Lecture Notes in Computer Science, vol 6643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21034-1_15

DOI: https://doi.org/10.1007/978-3-642-21034-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21033-4
Online ISBN: 978-3-642-21034-1
eBook Packages: Computer Science, Computer Science (R0)
