Findings

A typical genetics research laboratory is burdened by an information management challenge: to accurately store and make sense of a wide array of data of very different types [1]. A lab working with companion animals such as dogs, cats or horses needs to manage not only oligonucleotides, PCR products, markers, genes and DNA samples, but also data particular to companion animal work: owner information, pedigree relationships, registration numbers and breed names [2]. While it may be possible to track these disparate data independently, a database framework that coherently relates them and enforces lab-wide data input rules would allow for both higher overall productivity and fewer data entry errors [3]. Furthermore, such a data model would mean a single instance of the data can be easily located, backed up and computed upon.

We have developed a relational database, "DOG-SPOT" (Dogs, Owners, Genotypes, Samples, Pedigrees, Oligos and Traits), implemented in Microsoft Access to provide information management for dog genetics research (additional file 1). Within a single database a laboratory can manage all of its data relating to dogs, their breeds, phenotypes, pedigrees and kennel clubs; owners and communications with them; laboratory personnel and their activities; and samples and biomaterials. Data relating to genetics experiments can also be captured, including gene lists, oligonucleotides, amplicons and capillary sequence trace quality, in addition to genotypes for SNP, microsatellite and indel markers. See Table 1 for a list of the key database tables and fields and additional file 2 for the entity relationship diagram. Users can extend and customize DOG-SPOT by writing new forms, queries, macros, tables and relationships. Data quality is partially maintained by enforcing rules on data types and on the relationships between records. For example, a dog record is required to have a non-null breed entry, and the breed must be an entry in the breed table. As a second example, a dog's sire must be a dog of known sex "M" while its dam must be of known sex "F".
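The two data-quality rules described above can be illustrated outside Microsoft Access. The following is a minimal sketch using Python's built-in sqlite3 module; it is not the DOG-SPOT schema, and the table names, column names and the helper `try_insert` are hypothetical choices made only for illustration.

```python
import sqlite3


def make_db():
    """Build an in-memory database with hypothetical breed and dog tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript("""
    CREATE TABLE breed (breed_name TEXT PRIMARY KEY);
    CREATE TABLE dog (
        dog_id     INTEGER PRIMARY KEY,
        sex        TEXT CHECK (sex IN ('M', 'F')),  -- NULL means unknown sex
        breed_name TEXT NOT NULL REFERENCES breed(breed_name),
        sire_id    INTEGER REFERENCES dog(dog_id),
        dam_id     INTEGER REFERENCES dog(dog_id)
    );
    -- Only inserts are checked here; a fuller sketch would also cover updates.
    CREATE TRIGGER dog_parent_sex BEFORE INSERT ON dog
    BEGIN
        SELECT RAISE(ABORT, 'sire must be a dog of known sex M')
        WHERE NEW.sire_id IS NOT NULL
          AND NOT EXISTS (SELECT 1 FROM dog
                          WHERE dog_id = NEW.sire_id AND sex = 'M');
        SELECT RAISE(ABORT, 'dam must be a dog of known sex F')
        WHERE NEW.dam_id IS NOT NULL
          AND NOT EXISTS (SELECT 1 FROM dog
                          WHERE dog_id = NEW.dam_id AND sex = 'F');
    END;
    """)
    return conn


def try_insert(conn, **fields):
    """Insert a dog row; return True on success, False if a rule rejects it."""
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    try:
        conn.execute(f"INSERT INTO dog ({cols}) VALUES ({marks})",
                     tuple(fields.values()))
        return True
    except sqlite3.Error:
        return False
```

In this sketch a dog with a missing or unlisted breed is rejected by the column constraints, while a sire or dam of the wrong or unknown sex is rejected by the trigger.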

Table 1 Tables, fields, and primary and foreign keys in DOG-SPOT.

The solicitation and acquisition of high quality DNA samples, pedigrees and phenotypes from companion animals are labor-intensive. Often a lab needs to track a flood of correspondence between laboratory members and dozens or even hundreds of animal owners. DOG-SPOT provides an owner record for managing such relationships. Lab personnel can generate multiple communication records and reminders linked to each owner, which are displayed in an owner-centric form along with a list of dogs sampled from that owner (Figure 1). Reminders with past-due dates are filtered to an "alarm" list so they can be prioritized. Owner records can also be marked for inclusion in mailing-label print jobs. These features modularize the workflow and enable solicitation work to be handled efficiently by a team of lab members.
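The "alarm" list amounts to a date filter over reminder records. A minimal sketch follows, assuming a hypothetical `Reminder` record with a `done` flag; neither the field names nor the flag are taken from DOG-SPOT itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Reminder:
    owner_id: int
    note: str
    due: date
    done: bool = False  # assumed field: completed reminders never raise an alarm


def alarm_list(reminders: List[Reminder],
               today: Optional[date] = None) -> List[Reminder]:
    """Return open reminders whose due date has passed, most overdue first."""
    today = today or date.today()
    overdue = [r for r in reminders if not r.done and r.due < today]
    return sorted(overdue, key=lambda r: r.due)
```

Sorting by due date places the most overdue reminders at the top, which is one plausible way to prioritize them.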

Figure 1

Screen shot of the owner form. Owner information is entered and displayed here, including a log of communications between the lab and the owner. The owner's sampled dogs are also listed. Shipment of collection kits is tracked as well, and this owner can be marked for inclusion in the next batch of printed mailing labels.

Within DOG-SPOT, owners and dogs have a one-to-many relationship, as do dogs and the records of samples collected from them. The dog record itself stores registration information, breed, name, date of birth, the dam and sire IDs, as well as values for measured traits (e.g. body weight and coat color). A sample record links to exactly one dog record and stores the yield and concentration of the extracted nucleic acid, the extraction date, and the biomaterial type (e.g. blood, gDNA or RNA). From the dog record a three-generation pedigree can be displayed as a quality check.
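Because each dog record carries its sire and dam IDs, a three-generation pedigree can be assembled by following those links to a fixed depth. The sketch below assumes dog records held in a plain dictionary keyed by dog ID with hypothetical keys `name`, `sire_id` and `dam_id`; it illustrates the traversal only and is not the DOG-SPOT form logic.

```python
def pedigree_lines(dogs, dog_id, generations=3, label="dog", depth=0):
    """List a dog and its ancestors as indented text lines.

    `generations` counts the dog itself as the first generation, so the
    default shows the dog, its parents and its grandparents.
    """
    if generations == 0:
        return []
    indent = "  " * depth
    if dog_id is None or dog_id not in dogs:
        return [f"{indent}{label}: unknown"]
    dog = dogs[dog_id]
    lines = [f"{indent}{label}: {dog['name']}"]
    # Recurse into each parent with one fewer generation left to show.
    lines += pedigree_lines(dogs, dog.get("sire_id"), generations - 1,
                            "sire", depth + 1)
    lines += pedigree_lines(dogs, dog.get("dam_id"), generations - 1,
                            "dam", depth + 1)
    return lines
```

Bounding the recursion by the number of generations also keeps the traversal finite if a data entry error makes a dog its own ancestor.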

The 'amplicons' table stores the sequence of each forward and reverse oligonucleotide and each amplicon's chromosome, start and end position. The 'sequence' table stores a record for each sequencing reaction and a true/false value assessing overall read quality, so that sequence coverage will not count as "completed" unless it is of sufficient quality. After the user aligns sequences outside the database (e.g. with phred/phrap and consed) and identifies new microsatellite, indel and single nucleotide polymorphisms, the new markers are uploaded in batch to the 'markers' table. The markers table stores each marker's start and end position, chromosome, flank sequence and type. When lab members collect data, they write a record to the 'experiment' table that briefly summarizes the bench work performed. Genotype records are then uploaded from text or Excel files with the experiment ID, sample ID, marker ID and genotype value.

Suppose, for example, that a lab member aims to collect sequence data spanning a coding exon of a gene. They design PCR primers spanning the exon and create an amplicon record in DOG-SPOT that contains the sequences of the forward and reverse primers, the Tm of each, and the amplicon's chromosome, start and end position in the CanFam2 assembly of the dog reference sequence. Once the primers are synthesized and optimized, the lab member would modify the record to indicate the PCR conditions to use. They would then select a set of dogs for sequencing by viewing dogs, pedigrees and DNA samples in the database's dog form. After wet-lab work and sequence alignment using, for example, phred/phrap and consed, the lab member would write a set of three text files. First, they would record which sequences were high quality. Second, they would describe every marker discovered in the sequence contig by writing the marker's flank sequences, type, alleles and position. Third, they would record the genotype of each sequenced sample at each marker. These three flat files would then be uploaded/appended to the DOG-SPOT sequence, marker and genotype tables, respectively.
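The third flat file in this example can be checked before it is appended to the genotype table. The sketch below assumes a tab-delimited file with four columns in the order experiment ID, sample ID, marker ID and genotype; the column order, the function name `load_genotypes` and the rejection messages are assumptions for illustration, not the DOG-SPOT import format.

```python
import csv
import io

FIELDS = ("experiment_id", "sample_id", "marker_id", "genotype")


def load_genotypes(text, known_samples, known_markers):
    """Split tab-delimited genotype lines into accepted and rejected records.

    A line is rejected if it does not have exactly four fields or if it
    refers to a sample or marker ID that is not already known.
    """
    accepted, rejected = [], []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row:  # skip blank lines
            continue
        if len(row) != len(FIELDS):
            rejected.append((line_no, "expected 4 tab-separated fields"))
            continue
        record = dict(zip(FIELDS, (value.strip() for value in row)))
        if record["sample_id"] not in known_samples:
            rejected.append((line_no, "unknown sample_id"))
        elif record["marker_id"] not in known_markers:
            rejected.append((line_no, "unknown marker_id"))
        else:
            accepted.append(record)
    return accepted, rejected
```

Reporting rejected lines by number, rather than discarding them silently, lets the lab member correct the flat file and upload it again.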

To leverage the ease with which custom data can be uploaded and viewed in the UC Santa Cruz genome browser, DOG-SPOT stores amplicon and marker position information in a format compatible with upload: chromosome, start position and end position. To generate data for viewing in the genome browser, the user runs queries that write text files of the amplicons, sequence traces and markers. The user then executes the make_bed.pl Perl script to generate a BED-formatted file of custom data tracks that can be uploaded directly to the genome browser (see additional files 3 and 4 for the Perl script and README, respectively). This visual overlay with UCSC tracks is a powerful tool for assessing the progress of sequencing through a candidate gene, for example, or for verifying the positioning of newly discovered markers within a gene (see Figure 2).
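For readers unfamiliar with the BED format, the conversion from position records to colored custom tracks can be sketched as follows. This is not the make_bed.pl script; the function name `to_bed`, the track names and the input layout are hypothetical. Whether DOG-SPOT positions are 1-based is also an assumption here, exposed as the `one_based` parameter, because BED start coordinates are 0-based.

```python
def to_bed(tracks, one_based=True):
    """Render features as BED9 text with one colored custom track per group.

    `tracks` is a list of (track_name, rgb, features) tuples, where `rgb` is
    an "R,G,B" string and `features` is a list of (chrom, start, end, label).
    """
    lines = []
    for track_name, rgb, features in tracks:
        # itemRgb="On" asks the browser to honor the color in column 9.
        lines.append(f'track name="{track_name}" '
                     f'description="DOG-SPOT {track_name}" itemRgb="On"')
        for chrom, start, end, label in features:
            # BED starts are 0-based and ends are exclusive, so a 1-based
            # inclusive interval only needs its start shifted down by one.
            bed_start = start - 1 if one_based else start
            fields = [chrom, bed_start, end, label, 0, ".",
                      bed_start, end, rgb]
            lines.append("\t".join(str(f) for f in fields))
    return "\n".join(lines) + "\n"
```

For example, passing one group for all amplicons with color "128,128,128" and another for markers with a purple color would produce two tracks in a single uploadable text file.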

Figure 2

Screen shot of the UC Santa Cruz genome browser showing uploaded DOG-SPOT amplicon and marker data. The user runs a macro in DOG-SPOT that queries tables to generate text files of: all amplicon records (shown in gray), amplicon records for which sequencing has been attempted but is of poor quality or incomplete (orange), amplicon records with sufficient high quality sequence data (green) and discovered markers (purple). The user runs the make_bed.pl Perl script in a folder with these four text files to generate the BED-formatted file "bed.txt" that can then be uploaded to the genome browser. This visual overlay with UCSC tracks enables convenient assessment of candidate gene sequence coverage, completeness and the positioning of markers relative to genes.

At present, DOG-SPOT is designed to store purebred dog records, breeds, samples and genotypic data. However, a user could readily convert it for use with cats, horses or another species of interest. We implemented DOG-SPOT in Microsoft Access to provide an easy interface for lab members, some of whom had no prior database experience. However, because the database is not implemented in a more robust system, such as MySQL or PostgreSQL, it would likely perform poorly if loaded with very large datasets, such as millions of array-generated genotypes.

To assist in the management of diverse types of lab data, we have developed the DOG-SPOT database for the canine genetics research laboratory. By storing all of our laboratory's dog, owner, sample, amplicon, marker and genotype data in an instance of DOG-SPOT, we have managed these disparate data centrally in a rational and organized way. Finally, by relying on contact management functions and modularization of work within the database, we have been able to employ undergraduate workers efficiently for all aspects of sample solicitation, owner communication, sample data entry and biomaterial banking.