Language Resources and Evaluation, Volume 48, Issue 2, pp 227–248

General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes

  • Jan Švec
  • Jan Lehečka
  • Pavel Ircing
  • Lucie Skorkovská
  • Aleš Pražák
  • Jan Vavruška
  • Petr Stanislav
  • Jan Hoidekr
Original Paper

Abstract

The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models and thus the framework needs to also incorporate algorithms for appropriate text processing and duplicity detection in order to secure quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus containing more than one billion text tokens collected using the described framework, shows the results of the topic detection methods and finally also describes the design and outcomes of the automatic speech recognition experiments with domain-specific language models estimated from the collected data.

Keywords

Text data mining; Language modeling; Topic identification; Duplicity detection

Acknowledgements

This work has been supported by the grant of The University of West Bohemia, project No. SGS-2010-054 and by the Grant Agency of the Czech Republic, project No. GAČR P103/12/G084. The access to the MetaCentrum computing facilities provided under the programme Projects of Large Infrastructure for Research, Development, and Innovations LM2010005 funded by the Ministry of Education, Youth, and Sports of the Czech Republic is appreciated.


Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Jan Švec (1)
  • Jan Lehečka (1)
  • Pavel Ircing (1)
  • Lucie Skorkovská (1)
  • Aleš Pražák (1)
  • Jan Vavruška (1)
  • Petr Stanislav (1)
  • Jan Hoidekr (1)

  1. Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
