Rediscovering 15 + 2 years of discoveries in language resources and evaluation
This paper analyzes the content of the proceedings of the Language Resources and Evaluation Conference (LREC) over 17 years (1998–2014), with the goal of gaining a picture of the LREC community and of the topics most relevant to the field. We follow the methodology used in similar studies, including the survey of the IEEE ICASSP conference proceedings from 1976 to 1990, the survey of the Association for Computational Linguistics conference proceedings over 50 years, and the survey of the proceedings of the conferences contained in the ISCA Archive over 25 years (1987–2012). We expand on results originally presented at LREC 2014, now including the proceedings of LREC 2014 itself together with an analysis of various citation graphs. We show the evolution over time of the number of papers and authors, including their distribution by gender and affiliation, as well as collaborations and citation patterns among authors and papers, funding sources for reported research, and plagiarism and reuse in LREC papers; results for LREC are compared with similar results for major conferences in related fields. We also consider the evolution of research topics over time and identify the authors who introduced key terms. Finally, we propose and apply a measure of a researcher’s notability and provide the results for LREC authors. The study uses NLP methods that were themselves published in the corpus under study. In addition to providing a revealing characterization of the LRE community, the study demonstrates the need for a system that uniquely identifies authors, papers and other sources in order to facilitate this type of analysis.
Keywords: ELRA Anthology, Language resources, Language processing systems evaluation, Text analytics, Social networks, ISLRN, Bibliometrics, Scientometrics
The authors wish to thank the ACL colleagues Ken Church, Sanjeev Khudanpur, Amjad Abu Jbara, Dragomir Radev and Simone Teufel, who helped them in the starting phase; Isabel Trancoso, who gave her ISCA Archive analysis on the use of assessment and corpora; Wolfgang Hess, who produced and provided a 14 GB ISCA Archive; Emmanuelle Foxonet, who provided a list of authors' given names with gender; Florian Boudin, who made available the TALN Anthology; Helen van der Stelt and Jolanda Voogd (Springer), who provided the LRE data; Douglas O’Shaughnessy, Denise Hurley, Rebecca Wollman and Casey Schwartz (IEEE), who provided the IEEE ICASSP and TASLP data; and Nancy Ide and Christopher Cieri, who greatly improved the readability of the paper. They also thank Khalid Choukri, Alexandre Sicard and Nicoletta Calzolari, who provided information about the past LREC conferences; Victoria Arranz, Ioanna Giannopoulou, Johann Gorlier, Jérémy Leixa, Valérie Mapelli and Hélène Mazo, who helped in recovering the metadata for LREC 1998; and all the organizers, reviewers and authors over the 17 years of conferences, without whom this analysis could not have been conducted.