Identifying author–inventors from Spain: methods and a first insight into results

Abstract

The purpose of this paper is twofold: methodological and empirical. Methodologically, we describe a matching and disambiguation procedure for the identification of author–inventors (researchers who publish and patent) located in the same country. Our methodology aims to maximize precision and recall rates by taking into account national name writing customs and country-specific dictionaries for person and institution names (academic and non-academic) in the name matching stage and by including a recursive validation step in the person disambiguation stage. An application of this methodology to the identification of Spanish author–inventors is described in detail. Empirically, we present the first results of applying the described methodology to the matching of all SCOPUS 2003–2008 publications of Spanish authors to all 1978–2009 EPO applications with Spanish inventors. Using this data, we identify 4,194 Spanish author–inventors. A first look at their patenting and publication patterns reveals that they make quite a significant contribution to the country’s overall scientific and technological production in the time period considered: 27 % of all EPO patent applications invented in Spain and 15 % of all SCOPUS publications authored in Spain, excluding non-technological disciplines. To our knowledge, this is the first time that a large scale identification of author–inventors from Spain has been done, with no limitation in terms of fields, regions or types of institutions. We also make available online for scientific use an anonymized subset of the database (patent applications invented by authors affiliated to Spanish public universities).

Notes

  1.

    The objective of the European Science Foundation (ESF) Research Networking Programme Academic Patenting in Europe (APE-INV) was to combine efforts from different research groups and create a European database of academic patenting. The present project has been developed in the framework of that Programme. For further information see http://www.esf-ape-inv.eu.

  2.

    This data is available, together with data from other countries, at the website of the ESF programme on Academic Patenting in Europe: http://www.esf-ape-inv.eu/index.php?page=3#acadpat. As noted on the website, “Observations in each table are patent applications filed at the European Patent Office, with at least one designated inventor from the country concerned, for the time period indicated. Each table consist of two columns only, the first containing the patent publication number, as assigned by the European Patent Office, and the second one consisting of a dummy = 1 for academic patents (that is, with at least one academic researcher among the designated inventors).”

  3.

    Other works describing large-scale matching of author-inventors include Boyack and Klavans (2008), who link 2002–2006 Scopus publications and 2002–2006 USPTO patents with rare inventor-author names only and exclude non-academic patents; and Cassiman et al. (2007) who match EPO patents and WoS publications based on their similarity in content using text mining techniques and then compare inventor and author names on the highest ranked matches.

  4.

    For a detailed analysis of SCOPUS coverage see Moya-Anegón et al. (2007).

  5.

    This is the case in any Spanish-speaking country, not just in Spain, and also affects person names from Spanish-speaking communities in other countries (e.g. United States). Portuguese person names, for instance, also have multiple surnames, but follow different rules.

  6.

    We structured SCOPUS first name and surname fields because we found some errors in the allocation of names and surnames to different fields (see Table 2b).

  7.

    Based on a very large corpus with grammatical tagging (also called part-of-speech tagging) and word frequencies, this project gives access to dictionaries, thesauri and lemmas (www.wiktionary.org).

  8.

    Syntactic patterns (also known as grammars) are the structural rules of any given language. A grammar can be very complex (English or Spanish grammar, for instance) or very simple (how to write an address).

  9.

    We classify these cases as having ‘fuzzy syntactic patterns’. In the name matching step, we apply the same procedures for blocking and matching except that we change the rules for blocking and we define specific ‘matching events’ for matching.

  10.

    At this stage, we also introduce a non-name blocking criterion: we discard pairs that are very unlikely to correspond to author-inventors because the author's publication is in a non-technical area. We consider that non-technical scientific fields correspond to the following five Scopus Science Classification areas (ASJC): (1) Arts and Humanities; (2) Business, Management and Accounting; (3) Economics, Econometrics and Finance; (4) Psychology and (5) Social Sciences.

  11.

    For example, an author affiliated to a chemical institute can co-author an article with another author from an institute specialized in archaeology because they work together in the analysis of samples, although at first sight chemistry and archaeology would seem to be very distant. When this type of collaboration is not frequent, it takes a low value, but still different from zero.

  12.

    Imagine we have two candidate matching pairs, (Author A1, Inventor I1) and (Author A2, Inventor I2), where A1 and A2 are co-authors and I1 and I2 are co-inventors. If (A1, I1) and (A2, I2) are both false positives and we use them to calculate the indirect disambiguation variable, the two errors will reinforce each other. If, instead, (A1, I1) has been validated as a true positive and (A2, I2) is still a candidate pair, (A2, I2) will benefit from the indirect disambiguation variable.

  13.

    We treat a non-dubious match as a validated match with a high level of confidence, because our criteria for detecting potential false positives are very conservative. With less conservative criteria, the definition of a validated match would need to be more restrictive.

  14.

    Like those available in the framework of the ESF APE-INV programme, www.ape-inv.eu.

  15.

    We also explored alternative techniques to estimate the global score function and the corresponding variable weights (Smalheiser and Torvik 2009; Pezzoni et al. 2012; Lai et al. 2013), but their successful implementation required further investment, and preliminary results were neither robust nor satisfactory enough compared to using our calibrated weights. We finally opted to leave the use of these techniques for further research.

  16.

    The ROC curve plots sensitivity on the y axis against (1 − specificity) on the x axis. The area under the ROC curve ranges from 0.5 to 1.0, with larger values indicating better fit.

  17.

    Dornbusch et al. (2013) use a more general definition of the F measure, with different weights given to precision and recall. The F measure presented here is the traditional one, where both rates are equally weighted.

  18.

    Scientific areas correspond to ASJC SCOPUS journal classifications. We exclude scientific articles from journals assigned to the following first ASJC areas (first two digits) and consider all others as technologically-relevant: 10—Multidisciplinary, 12—Arts and Humanities, 14—Business, Management and Accounting, 18—Decision Sciences, 19—Earth and Planetary Sciences, 20—Economics, Econometrics and Finance, 29—Nursing, 32—Psychology, 33—Social Sciences, 35—Dentistry, 36—Health Professions. For the full list of ASJC codes, see http://ebrp.elsevier.com/pdf/Scopus_Custom_Data_Documentation_v4.pdf.

  19.

    Other institutions, not part of the Spanish public research sector and thus considered as non-academic, include businesses, public administration, private universities and other higher education centres different from public universities, hospitals and other institutions from the health sector whose main activity is not research, as well as institutions not elsewhere classified.

  20.

    Articles written by academic author-inventors amount to 93 % of all articles written by author-inventors in Chemistry, 84 % in Biotechnology (Biochemistry, Genetics and Molecular Biology) and 57 % in Medicine.

  21.

    In the full sample of 2000–2008 EPO applications with Spanish inventors (authors and non-authors), 51 % have Spanish business applicants, 28 % foreign companies, 15 % Spanish individuals and 7 % Spanish public research institutions. The annual share of EPO applications invented in Spain and held by Spanish PROs doubled from 1990 to 2008, from 4 to 8 %, whereas the share held by Spanish firms grew by only four percentage points, from 50 to 54 %. The share of EPO patent applications invented in Spain that are filed by individuals decreased significantly, from 29 % in 1990 to 13 % in 2008.

  22.

    The WIPO concordance between patent IPC classes and fields can be found here: http://www.wipo.int/ipstats/en/statistics/technology_concordance.html.

  23.

    Patents are classified as academic or business-owned based on the keyword-based method of KUL/Eurostat (van Looy et al. 2006). The more fine-grained classification of academic-owned patents by type of public research organization is done manually relying on information from different sources, including SCImago Institution rankings (http://www.SCImagoir.com/).

  24.

    The share of business-owned patents (with no academic co-ownership) goes down from 63 to 55 % (from 64 to 54 % in chemical patents).
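
For reference, the measures mentioned in notes 16 and 17 can be written out with their standard textbook definitions. The symbols TP, FP, FN and TN (true and false positives and negatives among candidate pairs) and the weight β are notation introduced here for illustration; they are not taken from the paper, and the weighted form shown is the usual generalization rather than a confirmed reproduction of the formula in Dornbusch et al. (2013).

```latex
\begin{align*}
\text{precision } P &= \frac{TP}{TP + FP}, &
\text{recall (sensitivity) } R &= \frac{TP}{TP + FN}, \\
1 - \text{specificity} &= \frac{FP}{FP + TN}, &
F &= \frac{2PR}{P + R}, \\
F_\beta &= \frac{(1 + \beta^2)\,PR}{\beta^2 P + R}.
\end{align*}
```

Setting β = 1 in the weighted form recovers the equally weighted F measure.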

References

  1. Azagra-Caro, J. M. (2011). Do public research organisations own most patents invented by their staff? Science and Public Policy, 38(3), 237–250.

  2. Balconi, M., Breschi, S., & Lissoni, F. (2004). Networks of inventors and the role of academia: An exploration of Italian patent data. Research Policy, 33, 127–145.

  3. Bonaccorsi, A., & Thoma, G. (2007). Institutional complementarity and inventive performance in nano science and technology. Research Policy, 36(6), 813–831.

  4. Boyack, K. W., & Klavans, R. (2008). Measuring science–technology interaction using rare inventor–author names. Journal of Informetrics, 2(3), 173–182.

  5. Breschi, S., & Catalini, C. (2010). Tracing the links between science and technology: An exploratory analysis of scientists’ and inventors’ networks. Research Policy, 39(1), 14–26.

  6. Buenstorf, G. (2009). Is commercialization good or bad for science? Individual-level evidence from the Max Planck Society. Research Policy, 38, 281–292.

  7. Cassiman, B., Glenisson, P., & Van Looy, B. (2007). Measuring industry-science links through inventor–author relations: A profiling methodology. Scientometrics, 70(2), 379–391.

  8. Dornbusch, F., Schmoch, U., Schulze, N., & Bethke, N. (2013). Identification of university-based patents: A new large-scale approach. Research Evaluation, 22, 52–63.

  9. Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1–16.

  10. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han, & U. M. Fayyad (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96) (pp. 226–231). AAAI Press.

  11. FECYT. (2011). Principales indicadores bibliométricos de la actividad científica española 2009. FECYT. Madrid. http://icono.fecyt.es/informesypublicaciones/Paginas/Listadodepublicaciones.aspx. Retrieved Jan 2014.

  12. Geuna, A., & Rossi, F. (2011). Changes to university IPR regulations in Europe and the impact on academic patenting. Research Policy, 40(8), 1068–1076.

  13. Giuri, P., Mariani, M., Brusoni, S., Crespi, G., Francoz, D., Gambardella, A., et al. (2007). Inventors and invention processes. Results from the PatVal-EU Survey. Research Policy, 36(8), 1107–1127.

  14. Iversen, E. J., Gulbrandsen, M., & Klitkou, A. (2007). A baseline for the impact of academic patenting legislation in Norway. Scientometrics, 70, 393–414.

  15. Lai, R., D’Amour, A., Doolin, D. M., Li, G.-C., Sun, Y., Torvik, V. et al. (2013). Disambiguation and co-authorship networks of the US patent inventor database (1975–2010), mimeo, 17 June 2013, http://funginstitute.berkeley.edu/sites/default/files/Disambiguation%20and%20Co-authorship%20Networks%20of%20the%20U.S.%20Patent%20Inventor%20Database%20(1975-2010)_0.pdf. Retrieved Jan 2014.

  16. Lissoni, F. (2012). Academic patenting in Europe: an overview of recent research and new perspectives. World Patent Information, 34(3), 197–205.

  17. Lissoni, F., Llerena, P., McKelvey, M., & Sanditov, B. (2008). Academic patenting in Europe: New evidence from the KEINS database. Research Evaluation, 17(2), 87–102.

  18. Lissoni, F., Llerena, P., & Sanditov, B. (2013a). Small worlds in networks of inventors and the role of academics: An analysis of France. Industry and Innovation, 20(3), 195–220.

  19. Lissoni, F., Maurino, A., Pezzoni, M., & Tarasconi, G. (2010). APE-INV's “name-game” algorithm challenge: A guideline for benchmark data analysis and reporting. http://www.esf-ape-inv.eu/download/Benchmark_document.pdf. Retrieved Jan 2014.

  20. Lissoni, F., Montobbio, F., & Zirulia, L. (2013b). Inventorship and authorship as attribution rights: An enquiry in the economics of scientific credit. Journal of Economic Behavior & Organization, 95, 49–69.

  21. Maraut, S., Dernis, H., Webb, C., Spiezia, V., & Guellec, D. (2008). The OECD REGPAT database: A presentation. OECD STI Working Papers, 2008/2. Paris: OECD.

  22. Martínez, C., Azagra-Caro, J. M., & Maraut, S. (2013). Academic inventors, scientific impact and the institutionalisation of Pasteur’s Quadrant in Spain. Industry and Innovation, 20(5), 438–455.

  23. Meyer, M. (2003). Academic patents as an indicator of useful research? A new approach to measure academic inventiveness. Research Evaluation, 12(1), 17–27.

  24. Meyer, M. (2006). Are patenting scientists the better scholars? An exploratory comparison of inventor–authors with their non-inventing peers in nano-science and technology. Research Policy, 35, 1646–1662.

  25. Moed, H., Glänzel, W., & Schmoch, U. (Eds.). (2004). Handbook of quantitative science and technology research: The use of publication and patent statistics in studies of S&T systems. Dordrecht: Kluwer Academic Publishers.

  26. Moya-Anegón, F., Chinchilla-Rodríguez, Z., Vargas-Quesada, B., Corera-Álvarez, E., Muñoz-Fernández, F. J., González-Molina, A., et al. (2007). Coverage analysis of Scopus: A journal metric approach. Scientometrics, 73(1), 53–78.

  27. Murray, F., & Stern, N. (2007). Do formal intellectual property rights hinder the free flow of scientific knowledge? An empirical test of the anti-commons hypothesis. Journal of Economic Behavior & Organization, 63(4), 648–687.

  28. Noyons, E., Buter, R., van Raan, A., Schmoch, U., Heinze, T., Hinze, S. et al. (2003a). Mapping excellence in science and technology across Europe Nanoscience and Nanotechnology. Report of project EC-PPN CT-2002-0001 to the European Commission, October 2003. ftp://cordis.europa.eu/pub/indicators/docs/mapex_nano.pdf. Retrieved Jan 2014.

  29. Noyons, E., Buter, R., van Raan, A., Schmoch, U., Heinze, T., Hinze, S. et al. (2003b). Mapping excellence in science and technology across Europe Life Sciences. Report of project EC-PPLS CT-2002-0001 to the European Commission, October 2003. ftp://cordis.europa.eu/pub/indicators/docs/mapex_ls.pdf. Retrieved Jan 2014.

  30. Noyons, E. C. M., van Raan, A. F. J., Grupp, H., & Schmoch, U. (1994). Exploring the science and technology interface: Inventor–author relations in laser medicine research. Research Policy, 23, 443–457.

  31. Pezzoni, M., Lissoni, F. & Tarasconi, G. (2012). How to kill inventors: Testing the Massacrator Algorithm for inventor disambiguation, Cahiers du GREThA, 2012-29, December.

  32. Raffo, J., & Lhuillery, S. (2009). How to play the “names game”: Patent retrieval comparing different heuristics. Research Policy, 38, 1617–1627.

  33. Schmoch, U. (2008). Concept of a technology classification for country comparisons. Final report to the World Intellectual Property Office (WIPO). Karlsruhe: Fraunhofer ISI. http://www.wipo.int/edocs/mdocs/classifications/en/ipc_ce_41/ipc_ce_41_5-annex1.pdf. Retrieved Jan 2014.

  34. Schmoch, U., Dornbusch, F., Mallig, N., Michels, C., Schulze, N., & Bethke, N. (2012). Vollständige Erfassung von Patentanmeldungen aus Universitäten. Bericht an das Bundesministerium für Bildung und Forschung (BMBF). Revidierte Fassung, Karlsruhe: Fraunhofer ISI. http://www.isi.fraunhofer.de/isi-media/docs/p/de/publikationen/Endbericht-Unipatente-Maerz-2012.pdf. Retrieved Jan 2014.

  35. SCImago. (2011). SCImago Institutions Ranking. SIR World Report 2011: Global Ranking. http://www.SCImagoir.com/. Retrieved Jan 2014.

  36. Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. In B. Cronin (Ed.), Annual review of information science and technology, 43, 287–313.

  37. Trajtenberg, M., Shiff, G., & Melamed, R. (2006). The ‘names game’: Harnessing inventors’ patent data for economic research. NBER Working Paper 12479.

  38. Van Looy, B., du Plessis, M. & Magerman, T. (2006). Data Production Methods for Harmonized Patent Indicators: Patentee sector allocation, Eurostat Working Paper and Studies, Luxembourg.

  39. Verspagen, B. (2006). University Research, intellectual property rights and European Innovation Systems. Journal of Economic Surveys, 20(4), 607–632.

  40. Veugelers, R., Callaert, J., Song, X., & Van Looy, B. (2012). The participation of universities in technology development: do creation and use coincide? An empirical investigation on the level of national innovation systems. Economics of Innovation and New Technology, 21(5–6), 445–472.

  41. Wang, G. B., & Guan, J. C. (2011). Measuring science-technology interactions using patent citations and author–inventor links: an exploration analysis from Chinese nanotechnology. Journal of Nanoparticle Research, 13, 6245–6262.

  42. Winkler, W. E. (2006). Overview of record linkage and current research directions. Statistical Research Division, U.S. Census Bureau. https://www.census.gov/srd/papers/pdf/rrs2006-02.pdf. Retrieved Jan 2014.

Acknowledgments

We thank the SCImago Group for comments and help with SCOPUS data, in particular Félix de Moya and Elena Corera; Kenedy Alva for statistical assistance from the beginning; Mª José Moyano for support with the quality control and manual validation phase; and two anonymous referees for helpful comments. We also gratefully acknowledge advice and support, at different stages of a long process, from Francesco Lissoni, Luis Sanz-Menéndez, Domingo Represa, Joaquín M. Azagra-Caro, Ana Caldera, Antonio Fernández-Borrella, Laura Barrios, Paolo Freri, José Manuel Rojo and Matthijs den Besten. This work has also greatly benefited from exchanges with members of the ESF-APE-INV research networking programme. Preliminary versions of the methodology were presented at the name-game ESF-APE-INV workshops held in Paris (December 2009) and Brussels (September 2011). We thank participants for their comments. Finally, we acknowledge funding from the Spanish National Plan Project CSO2009-10845.

Author information

Corresponding author

Correspondence to Catalina Martínez.

Appendix: calibrating weights for the token and name matching phases

We created two test sets (one for tokens and one for person names) to calibrate the weights for the different events entering the token and name matching functions. The test sets were built in a comparable manner, so that they contained all types of mistakes, Spanish regional variations and languages other than Spanish (mostly English, French and German), including equivalent names in different languages. Both test sets were built from observations in our data source and from constructed examples. The token test set contains around 5,000 tokens and the person name test set contains 1,000 names. Matching results were allocated to four categories: the first for pairs whose elements will never be considered equivalent (not matched), and the other three for potentially equivalent tokens or person names (good, medium and bad, the three Match Classes we consider) (Table 7).

Table 7 Descriptive statistics and t tests of name and disambiguation variables: valid v. invalid pairs

First, the blocking phases were designed to reach a 100 % recall rate on the test sets (all pairs assigned to a matching category from Bad to Good should survive the blocking). No precision rates were calculated for the blocking steps, the only criterion being good filtering capacity. Second, the weights used in the matching phases were calibrated to reach a 100 % recall rate on the test set (pairs assigned to a matching category from Bad to Good should not be rejected) and a precision rate of at least 95 %.
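
The calibration targets described above can be checked mechanically once a labelled test set exists. The following is a minimal sketch only: the `LabelledPair` record, the function names and the binary reading of "matched" (any class from Bad to Good) versus "not matched" are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Classes that count as "matched"; the label strings mirror the appendix,
# but treating them as one binary outcome is an assumption of this sketch.
MATCHED = {"Bad", "Medium", "Good"}


@dataclass(frozen=True)
class LabelledPair:
    """One test-set pair (hypothetical record layout)."""
    expected: str   # manual label: "Not matched", "Bad", "Medium" or "Good"
    predicted: str  # class returned by the matching function under calibration


def recall_precision(pairs: Iterable[LabelledPair]) -> Tuple[float, float]:
    """Return (recall, precision) for the matched / not matched decision."""
    tp = fn = fp = 0
    for p in pairs:
        should_match = p.expected in MATCHED
        did_match = p.predicted in MATCHED
        if should_match and did_match:
            tp += 1
        elif should_match and not did_match:
            fn += 1
        elif did_match:
            fp += 1
    recall = tp / (tp + fn) if tp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    return recall, precision


def meets_targets(pairs: Iterable[LabelledPair],
                  min_recall: float = 1.0,
                  min_precision: float = 0.95) -> bool:
    """Check the targets stated in the text: 100 % recall, >= 95 % precision."""
    recall, precision = recall_precision(list(pairs))
    return recall >= min_recall and precision >= min_precision
```

In a calibration loop, one would adjust the event weights, re-run the matching function on the test set and keep a weight configuration only if `meets_targets` returns true. For the blocking steps, only the recall part of the check would apply.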

For tokens, the number of comparisons surviving the first blocking step fell by a factor of approximately 1,000 relative to the Cartesian number of comparisons, and after the second blocking step the number of comparisons dropped by a factor of approximately 20. The table below presents some examples to illustrate how the token blocking and matching steps were implemented.

Examples of matched and not matched tokens

Token 1       Token 2       Blocking (1st step)   Blocking (2nd step)   Token MatchClass
Marco         José          Blocked
Marco         Amerigo       Pass                  Blocked
Marco         Mauricio      Pass                  Pass                  Not matched
Marco         Maria         Pass                  Pass                  Not matched
Marco [a]     Mario [a]     Pass                  Pass                  Bad
Mario [a]     Maria [a]     Pass                  Pass                  Bad
Elena         Helen         Pass                  Pass                  Medium
Arantxa       Arancha       Pass                  Pass                  Good
Mario         Marrio        Pass                  Pass                  Good
Catalina      Catlina       Pass                  Pass                  Good
Catalina      Catalnia      Pass                  Pass                  Good

[a] According to the name dictionary used in the name matching phase, Mario and Marco both exist and are different names; the same holds for Mario and Maria. That is why both pairs are assigned a Bad MatchClass, even though they differ in only one letter.
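
The dictionary rule in the table footnote can be illustrated with a short sketch. Everything below is an assumption made for illustration: the paper does not state which string similarity measure it uses, so an optimal string alignment edit distance stands in for it, `KNOWN_NAMES` is a toy dictionary, and the one-edit threshold is arbitrary. The sketch covers only the one-letter cases in the table; the Elena/Helen and Arantxa/Arancha rows would need cross-language and regional equivalence rules that are not sketched here.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]


# Hypothetical toy dictionary of attested first names (lower-cased).
KNOWN_NAMES = {"marco", "mario", "maria", "catalina"}


def match_class(token1: str, token2: str) -> str:
    """Classify a token pair that has already survived blocking."""
    a, b = token1.lower(), token2.lower()
    if a == b:
        return "Good"
    if osa_distance(a, b) > 1:
        return "Not matched"
    # Two distinct names that both exist in the dictionary are probably
    # different people's names rather than a typo of one another.
    if a in KNOWN_NAMES and b in KNOWN_NAMES:
        return "Bad"
    return "Good"
```

Under these assumptions, Mario and Marrio are classed as Good because Marrio is not an attested name and is therefore read as a misspelling, whereas Marco and Mario are classed as Bad because both are attested names.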

Cite this article

Maraut, S., Martínez, C. Identifying author–inventors from Spain: methods and a first insight into results. Scientometrics 101, 445–476 (2014). https://doi.org/10.1007/s11192-014-1409-1

Keywords

  • Author–inventors
  • Science–technology links
  • Academic patenting
  • Matching
  • Disambiguation
  • SCOPUS
  • PATSTAT
  • Spain