Background

Clinical trials involve experiments on human beings, and the results of many of them have implications for patient care. As such, the reporting of trials must be prompt, accurate, and comprehensive. Sadly, this is not always the case, and for some decades there has been a push to raise the standard of reporting. Recent years have seen distinct improvements, brought about by several kinds of effort. In particular, since 1986 there have been calls to establish registries in which sponsors register the details of their trials [1]. Over the past two decades many registries have been established, including the Australian New Zealand Clinical Trials Registry (ANZCTR), the Clinical Trial Registry-India (CTRI), the European Union Clinical Trial Registry, the German Clinical Trials Register, the Japan Primary Registries Network, and the United States’ ClinicalTrials.gov. The International Committee of Medical Journal Editors (ICMJE) has also pressured researchers to register their trials. Since 13 September 2005, it has required that trials intended for publication in its member journals be registered in a publicly accessible database before the recruitment of patients begins [2]. Subsequent to the announcement of this policy, there was a substantial increase in the number of trials registered with ClinicalTrials.gov [3].

Although trials have often been registered retrospectively, several steps have been taken to push for prospective registration [4,5,6,7]. This would ensure that it is possible to verify, for instance, that (a) the results of all trials have been reported and that trials with positive outcomes have not been selectively reported; and (b) there is no difference in the protocol, in the number or definitions of the primary and secondary outcomes, or in the strategy for data analysis between the information in the registry and that in the subsequent academic publication [8].

As a form of public accountability, it is of paramount importance that the results of individual trials be made public. This also enables scrutiny of trial data by people unconnected with the trial in any way, and therefore without bias. Beyond this, ClinicalTrials.gov and other registries have been used to perform a variety of other analyses. These include (a) determining the typical size of trials and the kinds of trial methodologies adopted by industry and non-industry sponsors [9]; (b) for globalized trials, determining the distribution of trial leadership over different nations [10]; (c) analyzing why trials have been terminated prematurely [11]; (d) enumerating which organizations sponsor trials, how many trials each sponsor conducts, and where these trial sponsors are located globally [12]; (e) profiling the clinical trial landscape of a country [4]; and (f) examining how the funders of clinical trials change over time [13]. Such analyses have policy implications since they may, for instance, help a government understand (a) what kinds of trials are ongoing in the country and whether they are relevant to the nation’s health burden; (b) whether a large fraction of trials are sponsored by local institutions or whether the country is simply a low-cost destination for trials sponsored by foreign organizations; and (c) how many trials a given Principal Investigator (PI) has conducted and whether there is cause for concern on this score. As such, trial registries serve a greater purpose than merely holding data pertaining to individual trials.

Registries have transformed the landscape of trial reporting. In India, for instance, the reporting of methods was better on CTRI than in academic publications [14]. Nevertheless, all is not well with the registries; some of the problems are as follows: (a) not all trials that are required by law to be registered are actually registered [15]; (b) trials may be registered with incomplete information [16, 17]; (c) for a multi-country trial registered with different registries, there may be discrepancies in the data across these registries [18]; (d) the results of a trial may not be reported within a year of completion, as is required in the USA, for instance [19]; and (e) a registry may be misused to market an unapproved procedure as a trial [20].

The present study arose from a research question inspired by a ruling regarding trials in India. In 2012, Indian Principal Investigators (PIs) were barred from running more than three trials at a time [21]. Since any discussion of the optimal number should be based on the norms in countries with the best regulations, and in order to understand whether this restriction (since revoked) was justified, we wished to look at the situation in the US, where the largest number of trials is conducted [22]. During this work we realized that it was not possible to answer the question, partly because of issues with the quality of the data in ClinicalTrials.gov. We have therefore quantified or enumerated some of these data quality issues, specifically those pertaining to the PI or Responsible Party (RP; a field that may list the PI). As such, this study is, in effect, an audit of these two fields.

Methods

We used data hosted by ClinicalTrials.gov, the largest registry of clinical trials in the world [23]. We accessed http://clinicaltrials.gov on 14 October 2016 and performed an advanced search with the following filters. For “Study type” we chose “Interventional studies”. For “Recruitment status” we considered the following categories: (a) Active, not recruiting; (b) Completed; (c) Enrolling by invitation; (d) Recruiting; (e) Suspended; and (f) Terminated. We then selected “Phase” 0–4 and “Record first received” 1/1/2005 to 12/31/2014. This yielded a total of 112,013 records (each with a unique NCT ID), which were downloaded in six lots corresponding to the categories (a)–(f) above. The data were processed in these six lots for several steps before being merged into a single file. Note that the 112,013 XML files and Additional file 3: Table S1, Additional file 4: Table S2, Additional file 5: Table S3, and Additional file 6: Table S4 are large files and have therefore been hosted at https://osf.io/jcb92. The scripts used for particular steps are provided in Additional file 2: S1 Folder, also available at https://osf.io/jcb92. A summary of the first set of steps taken to process the data is provided in Table 1, and additional details of the methodology are in Additional file 1: S1 Text.
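For instance, the fields used in the steps below can be pulled from each downloaded XML record along the following lines. The miniature record is a stand-in for a real file, and the tag names reflect the legacy ClinicalTrials.gov XML layout as we understand it; they should be verified against an actual download.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one downloaded record; real files contain many
# more fields. Tag names assume the legacy ClinicalTrials.gov XML schema.
record = """
<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <start_date>January 2010</start_date>
  <completion_date>June 2012</completion_date>
  <overall_official>
    <last_name>Jane A Doe</last_name>
    <role>Principal Investigator</role>
  </overall_official>
</clinical_study>
"""

root = ET.fromstring(record)
nct_id = root.findtext("id_info/nct_id")
start_date = root.findtext("start_date")
# One record may contain several officials; collect (name, role) pairs
officials = [
    (o.findtext("last_name"), o.findtext("role"))
    for o in root.iter("overall_official")
]
```

In a real run, the same extraction would be applied in a loop over all 112,013 files.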

Table 1 Steps taken to process the data

From the 112,013 records (Additional file 3: Table S1) we first selected those that had a start date from 1/1/2005 to 12/31/2014 (both inclusive). This yielded 79,838 records (Additional file 4: Table S2). From these, we selected those that contained “drug:”, “biological:”, or both of these terms in the intervention field. This yielded 64,496 “medicine” trials (Additional file 5: Table S3). We then examined the studies for their completion dates. This date had to be listed, but did not have to be in the 10-year window. If there was “null” for the “completion date”, but a valid entry for the “primary completion date”, the record was selected. We rejected the record if it lacked both these dates. This step yielded 63,786 records (Additional file 6: Table S4 and Additional file 7: Table S5), which were bifurcated into those that were registered with at least one authority in the US (a total of 35,121 records in the Additional file 8: Table S6 and Additional file 9: Table S7) and those that were not. Examples of such authorities are listed in Additional file 1: S1 Text.
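The selection logic above can be sketched as follows. The record layout and field names here are our own illustration, not the registry's schema; the dates and interventions are invented.

```python
from datetime import date

# Hypothetical pre-parsed records (keys are our own naming)
records = [
    {"nct_id": "NCT1", "start": date(2006, 3, 1), "intervention": "drug: aspirin",
     "completion": date(2008, 1, 1), "primary_completion": None},
    {"nct_id": "NCT2", "start": date(2004, 5, 1), "intervention": "drug: x",
     "completion": date(2006, 1, 1), "primary_completion": None},   # starts before window
    {"nct_id": "NCT3", "start": date(2010, 7, 1), "intervention": "device: stent",
     "completion": None, "primary_completion": None},               # not a "medicine" trial
    {"nct_id": "NCT4", "start": date(2012, 1, 1), "intervention": "biological: vaccine",
     "completion": None, "primary_completion": date(2013, 1, 1)},   # primary date suffices
]

WINDOW = (date(2005, 1, 1), date(2014, 12, 31))

selected = [
    r["nct_id"] for r in records
    if WINDOW[0] <= r["start"] <= WINDOW[1]                         # start date in window
    and any(t in r["intervention"] for t in ("drug:", "biological:"))
    and (r["completion"] or r["primary_completion"]) is not None    # some completion date
]
```

The completion-date test mirrors the rule in the text: a record is kept if either the completion date or, failing that, the primary completion date is present.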

We then processed the 35,121 records to identify those that listed both the name of the investigator and his or her role in the trial. This yielded 31,392 records (Additional file 10: Table S8) that contained both the names and the corresponding roles, and 3729 records that were missing one or both pieces of information. Many of the 31,392 records had multiple names, and in 17 cases the number of names did not match the number of roles. We rejected these 17 (Additional file 11: Table S9) and took forward the remaining 31,375 records.

The next step was to process records that contained multiple names and roles so that there were two columns, with each row containing one name and the corresponding role. This yielded 71,359 pairs of names and roles (Additional file 12: Table S10). In many of these pairs, the name of a real person had been replaced with “non-person” junk information such as designations, call-center numbers, and so on. We rejected 10,572 rows and took forward the 60,787 rows (Additional file 13: Table S11) that had the names of real persons. Note that the 10,572 rejected names correspond to 8907 unique NCT IDs (Additional file 14: Table S12). We wished to determine the frequency of occurrence of a person’s name in these 60,787 rows but, on examining the names, identified many problems that prevented this.
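The pairing and junk-rejection steps can be sketched as follows, with a hypothetical handful of records and a small sample junk list; the real junk list was assembled by manual inspection of the data.

```python
# Sample of "non-person" strings to reject (illustrative, not the full list)
JUNK = {"Central Contact", "Medical Director", "Clinical Trials", "TBD TBD"}

# Hypothetical records: names and roles are parallel lists
records = [
    {"nct_id": "NCT1", "names": ["Jane Doe", "Central Contact"],
     "roles": ["Principal Investigator", "Study Director"]},
    {"nct_id": "NCT2", "names": ["John Roe"],
     "roles": ["Principal Investigator"]},
    {"nct_id": "NCT3", "names": ["A Smith", "B Jones"],
     "roles": ["Principal Investigator"]},  # counts do not match: reject
]

pairs = []
for r in records:
    if len(r["names"]) != len(r["roles"]):  # mismatched counts (the 17 rejects)
        continue
    for name, role in zip(r["names"], r["roles"]):
        if name in JUNK:                    # drop non-person entries
            continue
        pairs.append((r["nct_id"], name, role))
```

This reproduces, in miniature, both the rejection of records with mismatched name/role counts and the removal of junk "names" from the exploded pairs.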

Results

As mentioned above, we used various criteria to create a well-defined set of 35,121 trials, which we processed to yield 71,359 pairs of investigators’ names and their roles. We wished to determine the frequency of occurrence of individual names in these records but discovered that many “names” were junk information which prevented any meaningful assessment of the number of PIs or their frequency. Overall, we encountered four categories of errors with PI (or RP) information in ClinicalTrials.gov data as detailed below.

Missing data

In two of the several steps of data processing we found that a notable amount of data was missing. First, in trying to match name and role we found that one or both pieces of information were missing in 3729 (11.9%) of the 35,121 trial records (Table 1). Also, in 17 cases the number of names and number of roles in a given trial record did not match (Table 1). Second, since a given record may have more than one name and role, subsequent processing led us to a list of 71,359 pairs of names and roles. In 10,572 (17.4%) of these (Table 1), the “name” field contained junk information instead of the name of a real person. Examples of this “non-person” junk information were Bioscience Center, Central Contact, Chief Development Officer, Chief Medical Officer, Clinical Development Manager, Clinical Development Support, Clinical Director Vaccines, Clinical Program Leader, Clinical Project Lead, Clinical R&D, Clinical Sciences & Operations, Clinical Study Operations, Clinical Trial Management, Clinical Trials, [company’s call center number], [company’s name], Global Clinical Registry, Investigational Site, MD, Medical Director, Medical Director Clinical Science, Medical Monitor, Medical Responsible, Professor, Program Director, Sponsor’s Medical Expert, Study Physician, TBD TBD, Use Central Contact, Vice President Medical Affairs, VP Biological Sciences, and VP Clinical Science Strategy. After removing such junk “names”, we were left with 60,787 pairs of names and roles.

For the rejected records of both Additional file 10: Table S8 and Additional file 13: Table S11, we also wished to determine whether the PI had been listed at any point during the history of the trial. To do this we examined the history of a sample of records (Additional file 1: S1 Text and Additional file 15: Table S13). We used a 5% sample each of the NCT IDs of the 3729 rejects of Additional file 10: Table S8 and of the 8907 unique NCT IDs rejected in Additional file 13: Table S11 (Additional file 14: Table S12), which amounted to 211 and 422 trials, respectively. We found that only 16 (7.5%) of the 211 Additional file 10: Table S8 rejects and only 9 (2%) of the 422 Additional file 13: Table S11 rejects had a PI in at least one history record. Overall, this amounted to 25 of 633, or 4%, of the rejects. Taking these percentages into account, the 3729 rejects of Additional file 10: Table S8 are reduced to 3449, and the 8907 rejects of Additional file 13: Table S11 are reduced to 8729.

Finally, we summarized the data above. The overall number of records with missing or junk information was as follows: (a) 3449/35,121 in Additional file 10: Table S8; (b) 17/35,121 in Additional file 11: Table S9; and (c) 8729/35,121 in Additional file 13: Table S11. These add up to 12,195/35,121 (35%) of NCT IDs with missing or junk information in the PI field.

Variations in names

Next we wished to determine the frequency with which a given person’s name appeared as the PI in the set of 60,787 names in Additional file 13: Table S11. It turned out that, of the 60,787 names, 82% were those of a PI, with the rest being those of sub-investigators (5%), Study Directors (9%), and Study Chairs (4%). For the purpose of the results described below, however, this variety of designations did not matter. We took several steps to clean up the names to ensure that each individual was represented by a single name. However, there were different categories of problems with respect to the way names were entered in the database which made this process challenging. These issues are listed in 18 categories below.

  a) Extraneous information along with the name:

     (i) The name may have had a prefix (e.g., Prf., Prof. Dr., COL) or suffix (e.g., MD; Jr.; III; M.D., Principal Investigator; BSc, MBCHB, MD, Study Director) of varying lengths.

     (ii) The name may have included a punctuation mark.

  b) Different kinds of variations of the name:

     (i) The name may have had spelling mistakes.

     (ii) One or more parts of the name may have been abbreviated or truncated.

     (iii) Parts of the name may have been ordered differently.

     (iv) The middle name may or may not have been mentioned.

     (v) Parts of the name may or may not have been hyphenated.

     (vi) The surname may have been modified.

     (vii) The surname may have been repeated.

     (viii) The person’s initials may or may not have been separated by spaces.

     (ix) The entire name, or parts of it, may have been in capitals.

     (x) The name may have contained a non-English character or the closest English character.

     (xi) The first name may have been split into two, or the first and middle name may have been merged.

     (xii) The surname may have been split into two, or the middle and surname may have been merged.

     (xiii) A nickname, in brackets, may have been mentioned in the middle of the name.

     (xiv) The Americanized nickname of part of a foreign name may have replaced the original.

  c) Other complications with the names:

     (i) A person’s entire name may have been represented by just one word.

     (ii) Two individuals may have shared the same name.

We went on to eliminate or quantify categories a(i, ii), b(iv, ix, xiii), and c(i) (Additional file 1: S1 Text and Additional file 16: Table S14, Additional file 17: Table S15, Additional file 18: Table S16, Additional file 19: Table S17 and Additional file 20: Table S18), and estimated that 12.8% of the names could not be identified unambiguously. Although we have not quantified the other categories of errors, based on preliminary work we believe that they are not numerous.
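As an illustration, a minimal normalization routine covering a few of the categories above (prefixes, suffixes, punctuation, capitals) might look like the following. The prefix and suffix sets are small examples, not the full lists used in the study, and categories such as reordered or misspelled name parts are beyond a routine this simple.

```python
import re

# Illustrative prefix/suffix sets (lower-case, periods stripped)
PREFIXES = {"prof", "dr", "prf", "col"}
SUFFIXES = {"md", "phd", "jr", "iii", "mbchb", "bsc"}

def normalize(name: str) -> str:
    """Strip common prefixes/suffixes and punctuation; normalize capitals."""
    # Split on whitespace, commas, and semicolons; drop edge periods
    tokens = [t.strip(".") for t in re.split(r"[\s,;]+", name) if t.strip(".")]
    while tokens and tokens[0].lower() in PREFIXES:   # leading titles
        tokens = tokens[1:]
    while tokens and tokens[-1].lower() in SUFFIXES:  # trailing degrees
        tokens = tokens[:-1]
    return " ".join(t.capitalize() for t in tokens)
```

A run over the pairs table would apply `normalize` to each name before counting frequencies; names that still collide or remain ambiguous would need the manual treatment described in the text.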

Multiple PIs per trial

Another category of error concerned trials that listed more than one person as PI. Examples included NCT01954056 (with 18 PIs), NCT00405704 (21 PIs), NCT01819272 (50 PIs), NCT00419263 (73 PIs), and NCT01361308 (74 PIs).
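Flagging such trials is straightforward once name-role pairs have been extracted; a sketch with invented data:

```python
from collections import Counter

# Hypothetical (NCT ID, role) pairs as extracted from trial records
pairs = [
    ("NCT0000001", "Principal Investigator"),
    ("NCT0000001", "Principal Investigator"),  # a second overall PI
    ("NCT0000002", "Principal Investigator"),
    ("NCT0000002", "Study Director"),
]

# Count PI entries per trial and flag trials listing more than one
pi_counts = Counter(nct for nct, role in pairs if role == "Principal Investigator")
multi_pi = sorted(nct for nct, n in pi_counts.items() if n > 1)
```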

Missing RP tag

Finally, we wished to know whether PI information was available from the RP tag. For this, we examined the 35,121 records from Additional file 10: Table S8. We found that the RP tag was missing in 1221 (3.5%) of the 35,121 records (Additional file 21: Table S19). As explained in Additional file 1: S1 Text, the RP details were usually provided both at the top and at the bottom of the NCT ID record. At the top, the exact wording was usually “Information provided by (Responsible Party):...”. In 1221 records, however, the wording was “Information provided by:...”, and these records did not have the RP information at the bottom either. Thus, anybody using automated methods to search for RP information based on the RP tag would not find it.
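A script keying on the quoted wording would behave as follows; the sponsor name is invented for illustration.

```python
import re

# Match only the wording that names a Responsible Party; records using the
# shorter wording are invisible to this search (the 1221 cases above).
RP_RE = re.compile(r"Information provided by \(Responsible Party\):")

tagged = "Information provided by (Responsible Party): Acme Pharma"
untagged = "Information provided by: Acme Pharma"

has_rp_tagged = bool(RP_RE.search(tagged))
has_rp_untagged = bool(RP_RE.search(untagged))
```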

To assess whether the RP field was useful for obtaining PI information, we examined a sample of 500 records and found PI information in only 138 of them (Additional file 21: Table S19). All of these cases already had PI information, as determined in Additional file 10: Table S8. Thus, the RP field did not yield any additional PI information.

Discussion

As discussed above, data in clinical trial registries are often repurposed. It may be useful to know the number of unique PIs, and their frequency of occurrence, in trials registered with ClinicalTrials.gov. If, as we set out to do, one wishes to identify all the PIs and then count their frequency, one must go to great lengths to process the data. The many challenges we encountered are enumerated above.

The first challenge concerned missing PI data. Under the Federal regulations outlined in 42 CFR Part 11, and as detailed at https://prsinfo.clinicaltrials.gov/definitions.html, it is not mandatory to list each member of the scientific leadership, such as the PI, by name and role when registering a trial with ClinicalTrials.gov. The sponsor is the Responsible Party, and it is mandatory to list the sponsor, the sponsor-investigator, or the PI designated by the sponsor as the RP. Despite this, the name and role fields for the PI or other scientific leadership are filled in for most trials. In some cases, however, junk information is provided instead. Given that trial data are repurposed for other kinds of analyses, trial records would be even more valuable if these data were captured.

The non-registering and non-reporting of trials has become a high profile issue, with the naming and shaming of sponsors that do not register their trials or do not report trial results on time [19, 24, 25]. However, even for trials that are registered there are quality issues that we need to be concerned about. Examples of errors previously noted in such registries include (a) observational or other kinds of trials that were labeled as interventional trials [26] and (b) the non-listing of trial sites both at the start of a study and even after its completion [16]. In contrast, a recent analysis of over 10,000 Australian trials registered either with ANZCTR or ClinicalTrials.gov noted that data regarding the primary sponsor was missing for just one trial each on the two registries [4]. This is an example of how low the error rate could be for a given field of information, although it ought to be possible to ensure complete accuracy. Were the filing of PI information to be made mandatory, the current errors could be brought down to nil.

Aside from missing data, there were problems with regard to names due to both junk information and variations in a given PI’s name. With 35% of the NCT IDs lacking PI information or containing junk information in this field, and about 13% of PI names that are ambiguous, overall about half of the NCT IDs do not contain PI names that could readily be repurposed.

For the sake of accuracy it is important that there be just one version of a person’s name which should unambiguously identify that person. The registration system should be modified so that a person’s name has to be separately registered, and thereafter only registered names can be chosen in the “name” field. A person’s name may change with time, and so it should be possible to choose a registered name and also list the current name. This should be supplemented with a unique and permanent ID such as the Open Researcher and Contributor ID (ORCID). Only valid ORCID numbers should be accepted by the system and the database should not permit the registration of a trial unless the name of a PI—in these standardized formats—has been entered. Other researchers have also noted the absence of proper information in the name field [26, 27] and clearly the situation has not improved much over time. If, as stated several years ago [26], it is important that the scientific leadership of a trial be named, then those names must be accurate. Further, each name needs to be in a standardized format to permit an automated method of determining the frequency of its occurrence in a given registry or across registries. This is the only way to ensure full transparency of this part of the database.
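ORCID iDs include a built-in check digit computed with the ISO 7064 mod 11-2 algorithm, so a registry could reject mistyped identifiers at the point of entry. A sketch of that published checksum:

```python
def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 mod 11-2 check digit over the first 15 digits of an ORCID."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate a hyphenated 16-character ORCID iD such as 0000-0002-1825-0097."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_digit(digits[:15]) == digits[15].upper()
```

A registration form could call `is_valid_orcid` before accepting the PI entry; of course, a checksum catches only typographical errors, not a valid iD attached to the wrong person.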

We now come to the issue of a trial listing multiple PIs. Since a PI is defined as “The person who is responsible for the scientific and technical direction of the entire clinical study” [28], each trial should list just one overall PI. Trials may have multiple sites, each with its own PI, and therefore there may be a large number of site PIs. The trial summary may therefore list several PIs under “Study Locations”, but must not list more than one person in the (overall) “Investigator” field. As such, during the registration process, ClinicalTrials.gov should not permit the entry of more than one name as overall PI for the trial.

Although, in principle, PI information may be available from the RP field, we note that there are complications in identifying the RP easily. These are as follows: (a) there is an RP field both at the top and bottom of an NCT record, and information may be missing in either of them; (b) as discussed above, and surprisingly, the RP tag may be missing altogether at the top of the NCT record; (c) retrieving information from the RP at the bottom of the NCT record is non-trivial due to the non-uniform manner in which the information is stored; (d) the details of the RP are not necessarily identical in these two fields (although they do refer to the same organization). Future versions of ClinicalTrials.gov should aim to solve these issues.

Aside from correcting or preventing obvious errors in ClinicalTrials.gov, a bigger issue is the need to validate all the information in the database. As Dr. Scott Gottlieb, the former Commissioner of the US Food and Drug Administration (FDA), admitted, it is challenging to link trial information registered with ClinicalTrials.gov to the relevant application to the FDA for approval of a candidate drug [29]. It would aid transparency if it were mandatory to list the NCT IDs of the relevant clinical trials in these applications to the FDA.

Another issue concerns trials registered in multiple registries. In such cases, certain information should match in all the records related to a given trial [18]. However, there is no easy way to verify that it is really the same. In the interest of complete truthfulness and transparency, it is important that data in all registries are regularly and thoroughly cross-validated.

Given that there are many trial registries, each with many fields of information that need to be filled in correctly, a limitation of our study is that it was restricted to one registry and examined only the PI and RP fields.

Conclusions

Registries were created in order to fulfill the ethical and scientific objectives of reporting on all trials promptly, completely, and accurately. However, the very objectives of creating a registry are defeated if trials are not registered or the registered data have substantial errors. As such there is increasing emphasis both on the importance of registering trials [30] and on improving the quality of data in the registries [31]. For trials registered with ClinicalTrials.gov, we have outlined four categories of problems with the names or roles of the PIs, or with the RP information, and have quantified three of them. We have also suggested how these errors could be prevented in future. Other researchers may wish to conduct additional audits of the database to identify or quantify other categories of errors in the hosted data.