Background

Tissue microarray (TMA) technology was introduced in 1998 [1]. A TMA differs from a conventional glass slide only in the number of tissue samples it holds [see Figure 1]. Tissue microarrays typically contain between 100 and 1,000 core tissue samples. A single TMA block can be sectioned and distributed to dozens of laboratories, saving years of preparation time and hundreds of thousands of dollars in tissue collection costs, and conserving experimental reagents, because a marker's distribution can be measured on hundreds of specimens arrayed on a single glass slide [1]. Several studies have demonstrated the value of TMAs for validating the biologic relevance of candidate genes expressed in prostate cancers [2–6].

Figure 1

Prostate Cancer TMA slide. Hematoxylin & Eosin stained example of a TMA slide of prostate samples.

Because TMAs are designed to answer questions applicable to pathologic lesions with specific sets of attributes (e.g. stage, grade or diagnostic subtype), preparation of a TMA requires access to large archives of paraffin-embedded tissues. Each TMA core tissue must be annotated with clinical, demographic or histopathologic information so that measurements on the TMA core samples can result in clinically useful correlations. To ensure inter-laboratory reproducibility, information describing the preparation of TMA blocks and slides needs to be provided along with the TMA data records.

The Cooperative Prostate Cancer Tissue Resource (CPCTR) is a multi-institutional virtual tissue bank funded by the U.S. National Cancer Institute (NCI) to provide researchers with samples of prostate cancer tissues [7]. The member institutions of the CPCTR are New York University, George Washington University, University of Pittsburgh and Medical College of Wisconsin. The CPCTR began service to the cancer research community on December 6, 2001. The CPCTR has over 5,000 prostate cancer specimens including radical prostatectomy cases (paraffin and fresh-frozen) and paraffinized needle biopsies. The CPCTR represents the largest repository of histologically-characterized and clinically annotated prostate cancer tissue in the USA. All accrued cases undergo pathology review and all clinical data is collected using methodology standardized across the participating institutions. CPCTR resources are available to all researchers, academic and commercial. Further information can be obtained from the CPCTR website [8].

The CPCTR has constructed a prostate cancer TMA implemented in conformance with the new TMA Data Exchange Specification (hereinafter designated "the Specification"). The Specification was developed through a series of open workshops sponsored by the National Cancer Institute and the Association for Pathology Informatics [9]. Tissue data included in the CPCTR TMA database is de-identified and assembled in an open access database to permit data sharing, in compliance with current NIH policy on data sharing [10] and in concert with ongoing NIH initiatives to develop new methods for sharing research data [11, 12].

Results and Discussion

The TMA data exchange specification was designed to allow TMA database files to be totally self-describing. The properties of a self-describing database file would include:

  1. An informative header that explained the purpose of the file and provided all the information to understand the file (i.e., its organization).
  2. Information regarding the creation of the file (e.g., creator, date of creation)
  3. Rights of use (e.g. specifying any restrictions on use)
  4. Warranty information
  5. Methodology (e.g. how the data contained in the file was obtained)
  6. Data
  7. Metadata (the data that describes or defines the actual data)
  8. Metadata definitions (clear descriptions and definitions of the metadata)

The typical database contains data (property #6) but nothing else in the way of self-descriptive annotation. The CPCTR implementation of the Specification has all eight properties and employs the following enhancements:

  1. Uses Uniform Resource Locators (URLs) to link the TMA database with web documents that provide detailed information supplementing the metadata tags. These external URLs are:
     a. A link to the Dublin Core Meta Data Elements used in the header section of the document [13].
     b. A link to the ISO-11179-compliant listing of Common Data Elements (CDEs) provided in the Specification [14].
     c. A link to the CPCTR CDEs [15].
     d. Links to external documents that provide methodologies for preparing CPCTR TMA blocks and sectioning slides [16, 17].
  2. Supports complex TMAs within a single TMA file. In this case, a single TMA file contains four blocks, with cores from a single tissue sample appearing in multiple locations in more than one block.
  3. Protects patient privacy (by de-identifying all data)
  4. Allows data sharing (by permitting free distribution of the XML data document)

Conclusions

Tissue microarrays allow high-throughput analysis of tissue samples and their association with clinical or outcomes data. Yet these experiments require a large amount of information for subsequent analysis and evaluation, in particular by interested second parties. The Specification provides an accurate and reproducible method for transferring this information, as is required for inter-laboratory reproducibility. One of the most important problems with modern data specifications is the daunting technical expertise required for their implementation. The Specification was written to permit maximal flexibility and minimal implementation requirements [9]. This study demonstrates that the Specification can be implemented using a simple Perl script that converts an Excel database into XML-tagged data elements. The resulting large section of core-related XML text can simply be inserted into a conformant document containing header, block and slide information. The resulting TMA database can be validated with a Perl script provided with the Specification document.

Methods

Human subjects protections

All institutions participating in the CPCTR have Institutional Review Board (IRB) approval for human subjects research. Each CPCTR institution develops its own local protocols to protect the confidentiality and privacy of human subjects and obtains local IRB approval for all CPCTR activities. The IRB assurance numbers for each cooperating institution are: New York University – M1177; Medical College of Wisconsin – M1061; University of Pittsburgh Medical Center – M1256; and George Washington University Medical Center – M1125. Tissue data records from the cooperating institutions are submitted to a central data manager (Information Management Services, Inc., contracted by the NCI) as de-identified records. All institutions assign an arbitrary number to each record before submitting the de-identified record to the central database. This ensures that the central database has no links connecting records to patients. In addition, the set of 18 data elements proscribed by HIPAA is omitted from core sample records (the so-called safe harbor approach to HIPAA compliance) [18].

Tissue and data collection

The CPCTR maintains a publicly available Manual of Operations that describes its tissue collection procedures and policies [19].

Pathological characterization of specimens involves review of all cases by a CPCTR pathologist using diagnostic criteria explained in the publicly available CPCTR histologic atlas and manual [20].

Protocols for the construction of the TMA block and TMA slide are publicly available at the CPCTR web site and are linked from the TMA database [16, 17].

The TMA Data Exchange Specification

The Specification is an open access document that can be used without restriction [9].

The Specification requires four general sections for each TMA file: 1) Header, containing the Specification's Dublin Core identifiers, 2) Block, describing the paraffin-embedded array of tissues, 3) Slide, describing the glass slides produced from the Block, and 4) Core, containing all data related to the individual tissue samples contained in the array. The simplest possible structure for a conforming TMA file consists of nothing more than empty tags designating the four required sections [see Figure 2] [9].

Figure 2

Simplest conforming TMA file. Image displaying the simplest possible XML file conforming to the TMA Data Exchange Specification.

Common Data Elements (CDEs) are metadata tags that describe the data elements included in an XML database. To be of value, CDEs must be well-defined, uniquely identified and available for human review or computer access. Eighty CDEs, conforming to the ISO-11179 [21] specification for data elements, constitute the XML tags provided in the Specification [9]. CDE descriptors are publicly available [14]. However, the only CDEs that must appear in any conforming TMA file are the section CDEs (header, block, slide and core), the root CDE (histo) and the tma CDE itself (tma). A set of six simple semantic rules describes the syntax for the data exchange specification [9].
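As an illustration of how little is needed for a skeleton file, the following Python sketch builds the six mandatory CDEs named above. It is not the authors' code (their tools are written in Perl), and the nesting shown here, with the four section elements as direct children of tma inside the histo root, is an assumption; Figure 2 and the Specification [9] remain the authoritative layout.

```python
import xml.etree.ElementTree as ET

# The four section CDEs named in the text.
SECTION_TAGS = ("header", "block", "slide", "core")


def build_minimal_tma() -> ET.Element:
    """Return a <histo> root holding one <tma> element with four empty sections.

    Assumption: sections are direct children of <tma>, and <tma> is a
    direct child of <histo>.
    """
    root = ET.Element("histo")        # root CDE
    tma = ET.SubElement(root, "tma")  # tma CDE
    for tag in SECTION_TAGS:
        ET.SubElement(tma, tag)       # empty section CDE
    return root


if __name__ == "__main__":
    # Serialize the skeleton as a text string and print it.
    print(ET.tostring(build_minimal_tma(), encoding="unicode"))
```

A skeleton built this way would then be filled section by section, which is the workflow described under "Constructing the TMA Data file".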

The Specification was designed for maximal flexibility. Flexibility in the first version of an XML specification permits the addition of greater structure in later versions built on tested implementations. A similar approach has been used for the ANSI/HL7 Clinical Document Architecture (CDA), wherein the earliest version (Level One) is intentionally sparse [22]. At this time, there is no DTD (Document Type Definition) or Schema included in the Specification. For those wishing to use a DTD, a Specification-compliant DTD has been prepared by David G. Nohle, Ohio State University Department of Pathology and the Mid-Region AIDS & Cancer Specimen Resource (ACSR) [23].

Constructing the TMA Data file

Constructing a TMA Database consists of the following:

  1. Filling the four sections (header, block, slide and core)
  2. Assembling the four sections into a TMA file with a proper file declaration, root element and TMA CDE.
  3. Validating that the TMA file conforms to the specification

The header, block and slide sections of the TMA will vary only slightly from project to project within a laboratory. The CPCTR header, block and slide sections were prepared "by hand" using the section-specific CDEs provided in the specification.

The header section contains descriptive information about the file and its contents. With the exception of one CDE (filename), the header CDEs are the same CDEs used in the Dublin Core set of XML identifiers used by librarians. Detailed information describing the Dublin Core elements is available [13]. A link to the Dublin Core elements is also included in the CPCTR TMA database. The first few lines of the TMA database are shown [see Figure 3]. The block and slide headers of the TMA database are short and are also completed manually.

Figure 3

TMA XML opening section. Image displaying the first few lines of the TMA XML document.

The cores are distributed for each block in an array, with cores assigned to specific locations [see Figure 4], and all the cores in an array are assigned to a slide, which is a numbered section derived from a block [see Figure 5]. The core section contains annotated data for each core in the TMA. The central database for all CPCTR tissues is maintained as an Excel database by an NCI-contracted information management service (IMS, Rockville, MD). IMS extracts an Excel sub-file consisting of records pertaining to the tissues selected for the TMA block. CPCTR-specific data elements included in the IMS records are publicly available [15].

Figure 4

Core Array Mapping Image. Schematic image showing the array locations of cores listed in the TMA document.

Figure 5

TMA XML slide section information. Image displaying the data elements describing the glass slide sectioning information.

A Perl script was written that converts Excel files to XML, enclosing the data associated with the spreadsheet cells in XML CDEs corresponding to the column headings. This creates the "core" section of the TMA database. A sample of an XML-tagged extracted data record is shown [see Figure 6]. The Perl script is available as an open access file with this article [see Additional file 1].

Figure 6

TMA XML core section information. Image displaying the data elements comprising a record for a single tissue core.
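The conversion step described above, in which each spreadsheet cell is wrapped in a tag named after its column heading, can be sketched in a few lines. The example below is written in Python rather than Perl and is not the script supplied as Additional file 1. It assumes the spreadsheet has been exported to CSV, that every column heading is already a valid XML element name, and that every row has one value per heading. The column names in SAMPLE and the core_record wrapper tag are hypothetical and are not taken from the Specification or from the CPCTR CDEs.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_core_section(csv_text: str) -> ET.Element:
    """Wrap every cell of a CSV table in a tag named after its column heading."""
    core_section = ET.Element("core")
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Hypothetical wrapper element grouping the fields of one record.
        record = ET.SubElement(core_section, "core_record")
        for heading, value in row.items():
            field = ET.SubElement(record, heading.strip())
            field.text = (value or "").strip()
    return core_section


# Hypothetical column headings and values, for illustration only.
SAMPLE = "case_id,gleason_sum,array_location\n1001,7,A1\n1002,6,A2\n"

if __name__ == "__main__":
    print(ET.tostring(rows_to_core_section(SAMPLE), encoding="unicode"))
```

Because ElementTree escapes reserved characters in cell text when it serializes, free-text cells containing characters such as `&` or `<` would not break the resulting XML.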

The CPCTR prostate cancer TMA consists of 299 core samples distributed in four blocks, each block having 300 arrayed cores. Each block contains about 150 core samples, each placed in two different locations within the block. The core duplicates are staggered in the array to maximize the chance that a given core will be represented if an area of the slide section is lost in processing. The distribution of one set of core samples in multiple array locations in four blocks yields a complex TMA that cannot be adequately represented by separate descriptions of each block. The Specification permits multi-block TMA files. Within the block CDE are the nested sets of four blocks that compose the complex TMA. Each core CDE is nested within a specific block CDE, and one core may have two associated array locations [see Figure 6].
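The nesting described in this paragraph (an outer block CDE containing the individual blocks, each block containing its cores, and each core carrying one or two array locations) can be pictured with a short Python sketch. Only the block and core tag names come from the text; block_identifier, sample_identifier and array_location are hypothetical names chosen for illustration, as are the sample identifiers and grid positions in LAYOUT. The actual CPCTR file [see Additional file 2] should be consulted for the real element names.

```python
import xml.etree.ElementTree as ET


def build_complex_block(layout: dict[str, dict[str, list[str]]]) -> ET.Element:
    """Nest blocks, cores and array locations.

    layout maps block id -> sample id -> list of array locations.
    """
    outer = ET.Element("block")                # outer block CDE
    for block_id, samples in layout.items():
        inner = ET.SubElement(outer, "block")  # one nested block per physical block
        ET.SubElement(inner, "block_identifier").text = block_id        # hypothetical tag
        for sample_id, locations in samples.items():
            core = ET.SubElement(inner, "core")
            ET.SubElement(core, "sample_identifier").text = sample_id   # hypothetical tag
            for location in locations:
                ET.SubElement(core, "array_location").text = location   # hypothetical tag
    return outer


# Hypothetical layout: sample S001 appears twice in block_1 and twice in block_2.
LAYOUT = {
    "block_1": {"S001": ["A1", "H7"], "S002": ["A2", "H8"]},
    "block_2": {"S001": ["B3", "K9"]},
}

if __name__ == "__main__":
    print(ET.tostring(build_complex_block(LAYOUT), encoding="unicode"))
```

Keeping the locations inside the core element, rather than repeating the whole core record once per location, means the annotation for a duplicated core is stored only once per block.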

The four sections are concatenated as a single XML database file. The CPCTR database file is provided with this manuscript [see Additional file 2].

Validating the TMA Data file

Once a TMA database is prepared, it needs to be validated to ensure conformance with the Specification. At this time, all TMA files should be validated using a software implementation written in Perl and distributed as an open access supplemental file with the Specification and with this publication [see Additional file 3]. The validating script requires a Perl installation but should operate equally well on any operating system. The validation software has a simple command-line interface. When the file successfully validates, the Perl script outputs the encountered CDEs from the Specification, a statement that the file is valid, and a one-way hash value specific for the validated file [see Figure 7].

Figure 7

Validation script output. Image displaying the interaction between Perl validating script and user.
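The three outputs of the validation step (the CDEs encountered, a validity statement and a one-way hash of the file) can be mimicked with the Python sketch below. It is a simplified stand-in and not the Perl validator distributed as Additional file 3: it checks only that the file is well-formed XML with a histo root, a tma element and the four section elements, and it does not implement the six semantic rules of the Specification. The direct-child nesting and the choice of SHA-256 are assumptions; the text does not state which hash algorithm the Perl script uses.

```python
import hashlib
import xml.etree.ElementTree as ET

REQUIRED_SECTIONS = ("header", "block", "slide", "core")


def validate_tma(xml_bytes: bytes) -> dict:
    """Run a minimal structural check and report tags seen plus a file digest."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        return {"valid": False, "problems": [f"not well-formed XML: {err}"],
                "tags": [], "digest": None}

    problems = []
    if root.tag != "histo":
        problems.append("root element is not <histo>")
    tma = root.find("tma")  # assumption: <tma> is a direct child of the root
    if tma is None:
        problems.append("missing <tma> element")
    else:
        for section in REQUIRED_SECTIONS:
            if tma.find(section) is None:
                problems.append(f"missing <{section}> section")

    valid = not problems
    return {
        "valid": valid,
        "problems": problems,
        "tags": sorted({element.tag for element in root.iter()}),
        # SHA-256 is an assumed choice of one-way hash.
        "digest": hashlib.sha256(xml_bytes).hexdigest() if valid else None,
    }
```

Because the digest is computed over the exact bytes of the file, any later change to the file, even to whitespace, would produce a different value, which is what makes such a hash useful as a fingerprint of the validated version.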

Availability and requirements

The Perl scripts and files for the production of TMA databases that meet the Specification are available with this publication. The example prostate cancer TMA database is available as a supplementary file with this article [see Additional file 2]. The actual tissue microarray slides are available after an application process. Although the CPCTR is a non-profit, government-sponsored resource, a surcharge is attached for glass slides to help defray a portion of the costs of TMA production. The application process and charges are described at the CPCTR web site [8]. Questions regarding any aspect of the CPCTR can be directed to the CPCTR email query service [ask-cpctr-l@list.nih.gov].