Spoken Information Extraction from Italian Broadcast News
Current research on information extraction from spoken documents focuses mainly on recognizing named entities, such as the names of organizations, locations, and persons, in transcripts automatically generated by a speech recognizer. In this work we present research carried out at ITC-irst on named entity recognition in Italian broadcast news. In particular, we describe an original statistical named entity tagger that can be trained with relatively few language resources: a seed list of named entities and a large untagged text corpus. The paper also presents and discusses named entity recognition experiments on case-sensitive automatic transcripts generated by the ITC-irst speech recognizer, with the named entity model trained on seed lists of different sizes.
Keywords: Name Entity Recognition, Entity Recognition, Word Error Rate, Broadcast News, Speech Recognizer