Introduction

Hemoglobin (Hb), a tetrameric protein with four polypeptide chains, typically two α and two β, is the most abundant protein in erythrocytes. It plays an important role in oxygen transportation [1]. Modifications and mutations on the amino acids of the hemoglobin sequences may reveal health states of individuals. For example, glycated hemoglobin is commonly used for diagnosis of diabetes [2, 3]. Severe hemoglobinopathies, including sickle cell disease, may involve only one amino acid substitution or deletion on hemoglobin chains [4]. Mutations, together with thalassemia, can lead to a more severe anemia [5]. Also, the fully sequenced hemoglobin can be utilized in sequence evolution analysis and ancestral sequence reconstruction [6]. The ability to identify small variations in hemoglobin sequences unambiguously and quickly is therefore desirable in a broad clinical context.

Tick larvae often take their first blood meal from small mammals and birds; hence, pathogenic microorganisms can be transmitted from those initial hosts to the tick larvae. After feeding, the larvae leave their host and molt into nymphs. Later on, nymphs need to obtain a second blood meal from a larger host to molt into adults. The viruses and/or bacteria survive in the midgut of the tick for months without infecting the tick itself. As a consequence, disease acquired from the first host, if any exists, can be transmitted to a second host [7]. Many bird species are known to be host reservoirs for tick-borne diseases [8, 9]. The ability to identify major reservoirs in one geographical region is useful for disease prevention and intervention. It has been shown that host blood proteins still persist in ticks months after feeding and molting [10]. Because hemoglobin is the most abundant protein in erythrocytes and its sequence varies among species [11], host Hb remnants in ticks can be identified to reveal host identities [12]. We have obtained blood samples from 33 bird species (142 individuals) that are common in the U.S. as hosts for ticks but that have unreported Hb sequences. Hb from most bird species contains one major component, HbA (αA2β2), and one minor component, HbD (αD2β2) [13–16]. Therefore, most bird species have three types of Hb chains: αA, αD, and β, with αA and β chains being the most abundant Hb chains.

Bottom-up mass spectrometry, in which proteins are digested to form peptides that are fragmented in the mass spectrometer, is a very powerful tool in Hb identification. Gatlin and co-workers identified six variants of human hemoglobin by bottom-up proteomics and the SEQUEST-SNP algorithm, which dynamically generates all possible single-nucleotide polymorphisms (SNPs) [17]. Laskay and co-workers utilized a bottom-up mass spectrometry approach to sequence hemoglobin from nine mammalian species [12]. Guo and co-workers combined bottom-up mass spectrometry and X-ray diffraction to achieve complete sequence coverage of hemoglobin from two feline species [18]. Onder et al. employed spectral matching in a proteome profiling technique to identify vertebrate hosts of ticks without a priori knowledge of protein sequence information [19, 20]. The method that Onder et al. adopted is similar to the spectral library searching approach for peptide identification in bottom-up mass spectrometry, which involves matching query spectra to library spectra [21–23]. The most common method for identifying digested peptide ions from bottom-up experiments is database searching. During this process, the mass and fragmentation pattern of the ions are matched with in silico generated peptide ions from known protein sequences in a database [24]. One drawback of this database searching strategy is that the existing databases may not cover the protein/peptide under study. To overcome this disadvantage, several de novo sequencing algorithms have been developed [25, 26]. However, de novo sequencing has lower throughput, and its output can be hard to organize into interconnected peptides because each tandem mass spectrum returns several candidate peptide sequences and a typical run usually results in tens of thousands of tandem mass spectra. A different bottom-up data processing method was therefore needed for this study.

Top-down mass spectrometry, which accurately measures the intact molecular weight of a multiply-protonated protein and directly fragments the protein ions [27] without prior digestion to peptides, is becoming increasingly popular because of advances in mass spectrometry instrumentation, such as higher resolution and accuracy [27–29], and improvements in top-down sequencing software [30, 31]. No digestion is performed in the top-down approach; thus, the connectivity of the peptides is not lost. The use of multiple fragmentation methods, such as collision-induced dissociation (CID) and electron transfer dissociation (ETD), facilitates better fragmentation coverage [32]. The advantages of top-down MS include precise determination of protein isoforms and localization of amino acid substitutions and post-translational modifications (PTMs). Edwards and coworkers have successfully applied a top-down approach to screen neonatal dried blood spot samples for hemoglobin variants [33, 34].

Here, we developed a top-down-assisted bottom-up mass spectrometry approach with a customized search database to successfully sequence the αA and β hemoglobin chains from the 33 bird species under investigation. The customized database was produced by aligning the known avian hemoglobin sequences from the National Center for Biotechnology Information (NCBI, 06/12/12), keeping the invariant positions constant, and generating peptide sequences in which variable amino acids were successively replaced with each canonical amino acid. The final sequences, obtained by using this customized database to sequence the Hb of birds with unknown sequences and confirmed with the top-down approach, will be available to append to existing hemoglobin databases so that LC-MS/MS data from protein digests, or ESI or MALDI MS data from intact proteins, can be searched against them to reveal host identities.

Experimental

Full Scan Mass Spectrometry Analysis of Bird Blood

Blood samples from 33 bird species were collected in purple Vacutainers containing anticoagulants and stored in a –80°C freezer. The birds were trapped as part of ongoing vector competence studies at Stone Lakes National Wildlife Refuge. The area is composed of seasonal and permanent wetlands, riparian forest, grasslands, and freshwater lakes, and is home to several colonial and migratory water bird species. Within those, five species were represented by only one bird per species. The other 28 species were represented by multiple birds per species serving as biological replicates. To determine the difference of the intact mass of Hb that may exist between individuals within the same species, the blood samples were lysed, diluted, and analyzed by a SYNAPT G2S Q-IM-TOF mass spectrometer (Waters Corporation, Manchester, UK). To prepare the spray solution, the blood aliquot was centrifuged at 14,000 g for 10 min to separate plasma (supernatant) and packed red blood cells. Two microliters of red blood cells were incubated with 25 μL of water (Optima LC/MS water, Thermo Fisher Scientific, Waltham, MA, USA) for 5 min for cell lysis. Cell lysate was centrifuged at 14,000 g for 5 min. The supernatant from the red blood cell lysate was diluted 200-fold with 0.1% formic acid (Thermo Fisher Scientific) in water in preparation for analysis. Multiply charged hemoglobin subunits were observed and deconvoluted to zero charge mass with the SYNAPT’s embedded MaxEnt 1 algorithm. Some settings for MaxEnt 1 were: mass range 5000–20,000, resolution 1 Da/channel, damage model uniform Gaussian width at half height 0.5 Da, minimum intensity ratios left 33% right 33%.

Bottom-Up Experiments

All the chemicals were purchased from Sigma-Aldrich (St. Louis, MO, USA) unless otherwise specified. For 16 bird species, the bottom-up sequencing approach was performed after in-gel tryptic digestion of bird hemoglobin, with LC-MS/MS analysis on an LTQ mass spectrometer (Thermo Fisher Scientific) coupled to a capillary-flow LC (Surveyor, Thermo Finnigan, San Jose, CA, USA). Fifteen microliters of red blood cell lysate was mixed with 5 μL 4 × Laemmli protein sample buffer (250 mM Tris-HCl, pH 6.8, 8% SDS, 30% glycerol, 0.02% bromophenol blue, 1% β-mercaptoethanol) followed by heating at 95°C for 5 min. The sample was loaded on a 4%–20% precast protein gel (Mini-PROTEAN TGX; Bio-Rad, Hercules, CA, USA) together with 10 μL Precision Plus Protein standard (Bio-Rad) in the adjacent lane. The gel was run at 90 V constantly for approximately 1.5 h until the dye front reached the reference line. The gel was rinsed three times in pure water (Millipore, Billerica, MA, USA), and stained with Bio-Safe Coomassie (Bio-Rad) for 1 h, then background destained with pure water overnight on a shaker (Boekel, Feasterville, PA, USA). The hemoglobin band (around 15 kDa) was excised, diced into 1 mm3 cubes, and transferred to a 1.5 mL Eppendorf tube. The gel pieces were washed with 200 μL pure water for 30 s and destained twice with 200 μL of methanol:50 mM ammonium bicarbonate (1:1 v/v) for 1 min with vortex mixing. The gel pieces were then incubated for 5 min in 200 μL of acetonitrile:50 mM ammonium bicarbonate (1:1 v/v) with vortex mixing. The gel pieces were further dehydrated in 200 μL acetonitrile then dried in a SpeedVac (Savant, Irvine, CA, USA) for 15 min. The dried gel pieces were reduced with 200 μL freshly prepared 25 mM dithiothreitol for 20 min at 56°C, then alkylated with 200 μL 55 mM iodoacetamide at room temperature for 20 min in the dark. The gel pieces were washed twice with 500 μL of pure water by vortex mixing briefly. 
The dehydration and drying steps described above were repeated. In-gel digestion was performed by incubating the gel pieces in 50 μL of 20 ng/μL Trypsin Gold (Promega, Madison, WI, USA) in 0.01% ProteaseMAX (Promega) at 50°C for 1 h. The digestion was stopped by adding trifluoroacetic acid to 0.5% (final v/v) to inactivate trypsin. The solution was transferred to a new 1.5 mL Eppendorf tube and centrifuged at 14,000 g for 10 min. The supernatant was subjected to LC-MS/MS analysis. The LC system was modified from a microbore HPLC (Surveyor) to achieve capillary flow. The column (8 cm × 100 μm i.d.) was packed with 5 μm Zorbax C18 resin (Agilent, Santa Clara, CA, USA). Peptides were separated with a gradient using mobile phase A (0.1% formic acid in water) and B (0.1% formic acid in acetonitrile). The gradient began with 5 min at 1.25 μL/min and 5% B, followed by 100 min at 0.5 μL/min with B ramping linearly from 5% to 45%. A final wash step of 5 min at 1.25 μL/min and 95% B was used. An LTQ was used, with the full MS range set to 400 to 1600, and the three most intense ions were automatically selected by Xcalibur software for fragmentation.

In addition to the above approach, which was used at the University of Arizona for approximately 16 bird species, the bottom-up sequencing approach for 17 bird species was performed at OSU by in-solution tryptic digestion of bird blood, with LC-MS/MS analysis on a dual linear ion trap Velos Pro mass spectrometer (Thermo Fisher Scientific) coupled to a nanoACQUITY UPLC (Waters). Fifteen microliters of blood cell lysate obtained as described above was mixed with 28.5 μL 100 mM ammonium bicarbonate, 1.5 μL 1% ProteaseMAX (Promega), and 5 μg Trypsin Gold (Promega). Digestion was performed at 37°C for 3 h and stopped by adding in 0.5% (final v/v) trifluoroacetic acid. After centrifuging the digestion solution at 14,000 g for 10 min, supernatant was subjected to LC-MS/MS analysis. A Thermo EASY-Spray Column (particle size: 3 μm; 150 × 0.075 mm i.d.) was used. Flow rate was kept at 0.4 μL/min and the gradient started at 1% B, ramping to 35% B in 84 min. The full MS range was set to 400 to 2000, and the 10 most intense ions were automatically selected using Xcalibur software for fragmentation.

Bottom-Up Data Analysis

Because most of the bird species examined here have unreported hemoglobin sequences, we decided that a statistical analysis of known avian Hb sequences should first be performed to identify conserved regions in avian hemoglobin. The conserved regions could then be used in a customized database. We chose to use our in-house protein identification algorithm SQID [35] for peptide identification because it showed an improved number of identified peptides, attributable to incorporating pairwise fragmentation intensity information. The workflow of the data analysis is shown in Figure 1. The statistical analysis of avian Hb in the National Center for Biotechnology Information (NCBI, 06/12/12) bank indicates that basic residue (arginine and lysine) sites are 98.3% and 99.8% conserved among 1045 avian Hb α chains and 760 avian Hb β chains, respectively. This gave us confidence in segmenting the Hb chain into tryptic peptides ending with K or R. Peptides with enumerated mutations at variable sites were then generated to assemble hypothetical position-labeled peptide groups. The customization step is elaborated in the Results section. A general sequence of a Hb chain can be obtained by combining the SQID top hits from each hypothetical position-labeled peptide group in numerical order. The peptide group name (e.g., A1–A9; B1–B12) represents the position of the peptide in the Hb chain.
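The segmentation into tryptic peptides ending with K or R can be illustrated with a minimal Python sketch. This is not the SQID indexing code: the function name `tryptic_digest`, the absence of a proline rule, and the toy sequence are assumptions made purely for illustration.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Split a sequence after every K or R (simplified: no proline rule)."""
    # Fully cleaved pieces: cut after each K or R.
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece lacking K/R

    # Join up to `missed_cleavages` + 1 adjacent pieces to model
    # incomplete digestion.
    peptides = []
    for n_joined in range(1, missed_cleavages + 2):
        for i in range(len(pieces) - n_joined + 1):
            peptides.append("".join(pieces[i:i + n_joined]))
    return peptides


# Toy sequence for illustration only; it is not a real hemoglobin chain.
toy = "AAKGGRCC"
fully_cleaved = tryptic_digest(toy)
with_one_missed = tryptic_digest(toy, missed_cleavages=1)
```

Numbering the fully cleaved pieces in order (1, 2, 3, ...) would give position labels analogous to the A1–A9 and B1–B12 group names.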

Figure 1

Flow chart of bottom-up data analysis

Top-Down Experiments and Data Analysis

Top-down data acquisition was performed on an Orbitrap Elite mass spectrometer (Thermo Fisher Scientific) with ESI infusion of bird blood lysate diluted in water/acetonitrile (1:1 v/v) containing 0.1% formic acid via a Heated Electrospray Ionization (HESI) source. Typical parameters used were: flow rate 4 μL/min, spray voltage 3.5 kV, sheath gas flow rate 5, aux. gas flow rate 0, sweep gas flow rate 0, capillary temperature 275°C. Collision-induced dissociation (CID) and electron transfer dissociation (ETD) were performed on the multiply charged (mostly from +14 to +18) ions of the intact proteins. The isolation window was 2 Da. The CID energy was 20%–30% and the ETD reaction time was 10–15 ms, depending on precursor ion charge state and mass. Top-down data were charge-deconvolved and deisotoped by Thermo Xtract software. The Xtract settings were: generate MH+ masses mode, resolution 240,000 @400, S/N threshold 3, fit factor 44%, remainder 25%, AveragineLowSulfur model, max. charge 30. After the fragments with S/N above 3 were deconvoluted and deisotoped, they were matched manually, within 10 ppm mass accuracy, against the sequences obtained from the bottom-up results. To show the abundance of products produced in top-down sequencing, product maps were generated by an in-house MATLAB script that takes the final sequences and the charge-deconvoluted MS2 peak lists as input. The MATLAB script is available in Online Resource 3.
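The 10 ppm matching criterion can be expressed as a short Python sketch. The matching in this work was done manually; the function names `ppm_error` and `match_fragments` and the example masses below are hypothetical and serve only to show how a ppm tolerance is applied.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_fragments(observed_masses, theoretical_masses, tol_ppm=10.0):
    """Pair each theoretical mass with the first observed mass within tol_ppm."""
    matches = []
    for theo in theoretical_masses:
        for obs in observed_masses:
            if abs(ppm_error(obs, theo)) <= tol_ppm:
                matches.append((theo, obs))
                break
    return matches


# Made-up masses: the first pair differs by 5 ppm, the second by 50 ppm.
example_matches = match_fragments([1000.005, 2000.1], [1000.0, 2000.0])
```

With these made-up values only the first theoretical mass is matched, because the second observed mass lies outside the 10 ppm window.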

Phylogenetic analysis of the Hb αA chains was done via phylogeny.fr [36, 37]. Two sequenced bird species in the NCBI database were subjected to analysis together with 35 sequenced αA chains (two out of 33 species have two αA isoforms) in this work.

Results and Discussions

Intact Mass of Hemoglobin Chains

Blood samples from 28 out of the 33 bird species under study were collected from more than one individual bird. To check the heterogeneity of Hb within a species, the masses of the Hb chains were measured on the SYNAPT G2S mass spectrometer. Figure 2 and Online Resource 1 “ESM_1. Intact average mass of hemoglobin chains” show the intact average mass of αA, αD, and β chains for the 28 bird species represented by more than one individual bird. For most of the species, the same masses were present for the individual sampled birds in a given species, whereas some species, namely No.7 Catharus ustulatus, No.12 Icterus bullockii, No.13 Junco hyemalis, No.14 Melospiza melodia, No.15 Myiarchus cinerascens, No.18 Pheucticus melanocephalus, and No.19 Picoides nuttallii, showed one or two outlier(s). We assume that hemoglobin chains within a specific species have the same amino acid sequence if they have the same mass. The different masses observed in the outliers may be due to sampling of different phenotypes, misassignment of the species at sampling, or environmentally-induced hemoglobin mutations [38]. The sequences reported for a given species in this paper are from the individual bird samples that all provided the same masses. Outlier mass values are included in Online Resource 1.

Figure 2

Intact mass of hemoglobin chains (a) β, (b) αD, (c) αA from 28 bird species containing multiple bird individuals. Individual bird masses within a species are represented by “X” symbols offset on the x-axis to improve readability. Most bird species have a consensus mass among different individuals. Several species, namely 7, 12, 13, 14, 15, 18, and 19, contain outliers

Customizing a Database for Searching

Bottom-up proteomics typically utilizes sequence database searching [24], a popular approach to identify the sequences of the digested peptides, although spectral library searching [21–23] is also used. In sequence database searching, the intact mass and fragmentation pattern of the experimental peptide ions are matched to fragmentation spectra or m/z lists of peptide fragment ions produced for peptide sequences generated by in silico digestion of protein sequences in a database. This approach is a reasonably fast and high-throughput method to identify proteins. Because the Hb chains from the bird species in this study have not been sequenced before, a customized database was created in order to adopt the database searching approach.
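A minimal Python sketch of this matching idea follows. It is a deliberate simplification and not the SQID scoring function: it only counts theoretical singly charged b and y ions that have an observed peak within a tolerance. The function names are hypothetical, the residue masses are standard monoisotopic values, and the 1.5 Da and 0.8 Da defaults mirror the tolerances used for the ion trap data in this study.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def by_ions(peptide):
    """Singly charged b and y m/z values for each backbone cleavage.

    b[k] and y[k] come from the same cleavage site, so b runs b1..b(n-1)
    while y runs y(n-1)..y1.
    """
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(MONO[aa] for aa in peptide[:i]) + PROTON)
        y.append(sum(MONO[aa] for aa in peptide[i:]) + WATER + PROTON)
    return b, y


def precursor_candidates(peptides, precursor_mass, tol=1.5):
    """Keep database peptides whose mass is within tol Da of the precursor."""
    return [p for p in peptides if abs(peptide_mass(p) - precursor_mass) <= tol]


def count_matches(observed_mz, peptide, tol=0.8):
    """Count theoretical b/y ions with at least one observed peak within tol."""
    b, y = by_ions(peptide)
    return sum(any(abs(t - o) <= tol for o in observed_mz) for t in b + y)
```

A real search engine would additionally consider higher charge states, neutral losses, and peak intensities.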

The concept of customizing a sequence database has been applied to many studies to increase peptide detection sensitivity [39, 40]. Khatun et al. generated a database by 3-frame translation of a transcript database and 6-frame translation of genomic sequences [41]. Wang and coworkers created a protein sequence database from abundant RNA-seq data of a specific sample and further incorporated high-quality single nucleotide variations into the customized protein database [42, 43]. Park et al. considered single nucleotide variants, insertions, deletions, and stop-codon mutations derived from Exome-seq and RNA-seq to build a database for a specific sample [44].

In our study, the steps of generating a customized database are shown in Figure 1. First, a statistical analysis of amino acids in already sequenced avian Hb chains was performed. From the NCBI database, 760 entries of avian Hb β chains were available; these were aligned and all were 146 amino acids in length. Arginine and lysine sites were 99.8% conserved among the 760 entries, so the β chain can be segmented into 12 peptides ending with K or R. For example, for the first peptide VHWTAEEK, with variants noted at H2 and T4, and no R, K, or I included (because I and L were treated as one amino acid, labeled L), there are 17 (20 minus 3) potential amino acid substitutions on the second and the fourth positions. The second site mutation and the fourth site mutation were considered independently. Therefore, a total of 289 (17 multiplied by 17) VXWXAEEK peptide segments are generated as hypothetical position-labeled peptide group “B1,” in which X represents amino acids except for K, R, and I. In a similar way, 12 (B1 to B12) position-labeled peptide groups were constructed in the database. Each peptide group B1–B12 exists as 2 to 83,521 sequences in the database. Hb α chains are more varied because there are two copies of the α chain, one major component αA, and one minor component αD. Both are 141 amino acids long. In order to compromise between computational expense and the numerous variations of the peptides, some variable sites were limited to a certain number of substitutes that have over 1% occurrence in the sequenced Hb chains in the NCBI database. This compromise was justified by the results: more than 60% sequence coverage in bottom-up alone for αA from any species was achieved. Top-down experiments performed later on enabled us to fully differentiate αA from αD. Nine (A1 to A9) position-labeled peptide groups for Hb α chain were generated for the database along with the 12 for the Hb β chain.
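The enumeration of a position-labeled peptide group such as B1 can be sketched in Python as follows. This is an illustrative sketch rather than the code used to build the database; the function name `enumerate_group` and the use of `X` as a wildcard character in a template string are assumptions.

```python
from itertools import product

# The 20 canonical amino acids minus K and R (tryptic cleavage sites)
# and minus I (treated as L), leaving 17 residues.
ALLOWED = "ACDEFGHLMNPQSTVWY"


def enumerate_group(template, allowed=ALLOWED, wildcard="X"):
    """Expand every wildcard position in `template` independently."""
    slots = [allowed if ch == wildcard else ch for ch in template]
    return ["".join(combo) for combo in product(*slots)]


# Two independent variable sites give 17 * 17 = 289 sequences.
b1_group = enumerate_group("VXWXAEEK")
```

Because each variable site multiplies the group size by 17, four fully variable sites already yield 17^4 = 83,521 sequences, which corresponds to the largest group size reported above and motivates limiting the substitutions at some sites.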

The composition of each position-labeled peptide group is listed in Table 1. In this table, variable site X represents all amino acids except for K, R, and I. Variable site Xk represents all amino acids except for R and I (due to the occurrence of K at these specific sites in some of the birds of known sequence). Sites with limited substitution have all the possible amino acids listed vertically. The limited substitution is justified by the fact that, in 99% of known avian hemoglobin sequences, one of the listed residues occupies those sites. Peptides with enumerated mutations at variable sites (highlighted in yellow) were generated to assemble hypothetical position-labeled peptide groups (A1–A9, B1–B12). The number of sequences in each peptide group is listed in the third column of Table 1. Each experimental peptide mass and its MS/MS data were searched by the SQID proteomics search engine against each peptide in the hypothetical database. Because the peptide name represents the position of the peptide in the chain, a general sequence of a Hb chain can be obtained by combining the most confident peptide from each hypothetical position-labeled peptide group in numerical order (A1–A9; B1–B12).

Table 1 Peptide Components in the Customized Database (Variable Sites are Highlighted)

Bottom-Up Approach

The software SQID [35] first indexed the database into peptides ending with K or R. The allowed peptides had mass ranges from 700 to 4500 Da and up to one missed cleavage. One missed cleavage is allowed during indexing of the database to accommodate the detection of regions containing internal R or K, such as A4, A7, and A8. Variable modifications included oxidation of methionine and carbamidomethylation on cysteine because reduction and alkylation were performed only on about half the hemoglobin samples (those subjected to in-gel digestion). Thermo Discoverer (V1.4) was used to convert the LC-MS/MS raw data to .dta files. Each .dta file contained the precursor mass (calculated based on measured m/z), charge, fragments’ m/z values, and fragments’ intensities. In order to determine the peptide sequence of each experimental tandem mass spectrum or .dta file, the precursor mass was compared with those of the in silico digested peptides and sequence candidates were chosen within 1.5 Da mass error of the precursor mass. A series of b and y ions and their neutral loss ions were then calculated for each of the candidates. To evaluate the accuracy of the match between the candidates and the experimental tandem mass spectrum, a scoring function considering the number of matched ions (within 0.8 Da mass error), the number of consecutively matched ions, and a bonus for strong pairwise fragmentation intensity was utilized. The sequence candidate with the highest score was assigned to the experimental tandem spectrum. The output of SQID for each sample was a spreadsheet that included information such as spectral name, corresponding sequence, the group number that contains the peptide, etc. For each peptide group, which is a collection of all the possible sequences at one tryptic cleavage region of the hemoglobin chains, the sequence with the highest SQID score is more likely to be the true sequence of the region. 
To be more specific, if more than one sequence was identified for a position-labeled peptide group, a minimum SQID score that corresponds to an FDR of 5% and a maximum delta SQID score of 0.3 were set to choose the sequence candidates. The FDR was established by searching against reverse decoys appended to our customized database, and the SQID cutoff was determined to be 16–21 for different species. The delta SQID score is represented by the normalized difference between the SQID score of the top ranked sequence for the position-labeled peptide group and current sequence as shown in the following equation:

$$ \mathrm{Delta\ SQID\ score}=\frac{\mathrm{SQID\ score}_{\mathrm{rank\ 1}}-\mathrm{SQID\ score}}{\mathrm{SQID\ score}_{\mathrm{rank\ 1}}} $$

Sequences with small, but not 0, delta SQID scores are potential sequences for a specific Hb region, and thus should be considered when constructing the general sequences of Hb from the bottom-up approach.
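The candidate selection described above can be sketched in Python. This is a hedged illustration, not the SQID implementation: the function names, the decision to take the top score among hits that pass the minimum score, and the example hits (placeholder sequences and scores) are all assumptions.

```python
def delta_sqid(score, top_score):
    """Normalized difference between the top-ranked score and this score."""
    return (top_score - score) / top_score


def select_candidates(hits, min_score, max_delta=0.3):
    """Filter (sequence, score) hits for one position-labeled peptide group.

    A hit is kept if its score reaches `min_score` (the FDR-derived cutoff)
    and its delta score relative to the best passing hit is <= `max_delta`.
    """
    passing = [(seq, sc) for seq, sc in hits if sc >= min_score]
    if not passing:
        return []
    top = max(sc for _, sc in passing)
    return [(seq, sc) for seq, sc in passing if delta_sqid(sc, top) <= max_delta]


# Placeholder hits for one peptide group; sequences and scores are made up.
hits = [("SEQA", 30.0), ("SEQB", 24.0), ("SEQC", 18.0), ("SEQD", 10.0)]
kept = select_candidates(hits, min_score=16.0)
```

In this made-up example the delta scores of the three passing hits are 0, 0.2, and 0.4, so only the first two would be carried forward when constructing the general sequence.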

Online Resource 2 “ESM_2. Peptide MS/MS Spectra Identified by SQID with Customized Database” contains all the hemoglobin peptides successfully identified by this method for each of the bird species. The confidence of assigning these sequences is strengthened by top-down analysis. Information on each page of the .PDF file includes species number, species’ common name, peptide sequence, hypothetical protein name (which represents the location of the peptide in the hemoglobin chain), raw data file identifier, SQID score, the peptide spectrum, and a three-row sequence cell. In each spectrum, b and y ion peaks are labeled and plotted in blue and red, respectively. The green peaks represent neutral losses from the b/y ions. The three-row sequence cell facilitates counting the b and y ions.

Top-Down Completion of Partial Bottom-Up Sequences

Top-down data analysis was used to fill in missing peptides and to confirm the sequences identified in the bottom-up approach. Both collision-induced dissociation (CID) and electron transfer dissociation (ETD) methods were performed to improve the sequence coverage. All the precursor masses and fragment masses are within 10 ppm error. Figure 3 shows the product ions of intact hemoglobin αA and β chains from Zonotrichia atricapilla species generated by CID (b, y ions; blue mark) and ETD (c, z ions; red arrow). The abundance of fragment ion information generated from the top-down approach provided sequence information that connected peptides identified in the bottom-up approach. For example, Figure 4 compares the result obtained from the bottom-up approach only to the result from the top-down-assisted bottom-up approach. The result from the top-down-assisted bottom-up approach is shown on the rows with amino acid number labels (1, 31, 61, 91, and 121) and the amino acids codes are in black or gray. A1 to A9 and B1 to B12 correspond to regions of hypothetical position-labeled peptide groups. A1, B7, B9, and B11 do not have confident hits returned from bottom-up database searching because all hits had SQID scores lower than 16 (FDR 5%). B2 has a confident hit “QLIPSGWGK” returned from the bottom-up approach (SQID score 29), but the sequence does not explain the fragment ions in the top-down experiments. A2 through A9, B1, B3, B4, B5, B6, B8, B10, and B12 are partially or completely sequenced correctly by the bottom-up approach. For the Hb αA and β chains of Zonotrichia atricapilla, 67% and 66% sequence coverages, respectively, were achieved without the assistance of the top-down method. However, the top-down-assisted bottom-up method achieved 100% confidence sequence coverage with ambiguity only at I and L. Moreover, the top-down method can provide post-translational modification information. 
For example, top-down data showed that K16 of the Corvus brachyrhynchos Hb αA chain was methylated. This modification was unexpected and, thus, was not included in the database searching of the bottom-up data. The bottom-up approach is also necessary because there are some regions in the middle of the hemoglobin sequences that do not have much fragmentation in the top-down approach, for example as shown in Figure 3, not much fragmentation has been observed in the β67–76 (VLTSFGEAVK) region, but this region was successfully detected in the bottom-up approach as a top hit for the hypothetical position-labeled peptide group B6. All the product ion maps of αA and β from the 33 bird species are in Online Resource 3. The resulting sequences and deconvoluted top-down mass spectra were further subjected to ProSight Lite [45] matching and the P-score and the protein characterization score (PCS) were obtained. For all the Hb sequences, the median P-score is 2.4E-71 and the median PCS is 822.68. These scores add confidence to our sequencing results.

Figure 3

MS/MS product ion map of Hb αA chain (a) and β chain (b) from Zonotrichia atricapilla. (c) Legend for fragmentation symbols

Figure 4

Peptides aligned from a SQID search with a customized database in the bottom-up method for Zonotrichia atricapilla αA chain (a) and β chain (b). The amino acid letters show the sequencing results from the top-down-assisted bottom-up method. The boxes with the blue A1-A9 or B1-B12 show the Hb sequence segmentation in the bottom-up method. Peptides in black are correctly identified by bottom-up. The gray regions are only identified in the top-down approach

Phylogenetic analysis (Figure 5) on the 35 sequenced Hb αA chains (two species have two αA isoforms) together with two known sequences was performed by uploading the sequences to the server phylogeny.fr [36, 37] in “One click” phylogeny analysis mode, and the result is consistent with the conventional classification of the avian organisms. In Figure 5, the bird name is in the format of “Order_Family_Genus_Species”, which is the conventional classification. The phylogenetic tree is indicative of how similar the sequences are: dissimilar sequences diverge early on the tree while similar sequences stay in the same branch. For example, the three species Colaptes auratus, Melanerpes formicivorus, and Picoides nuttallii are known to come from the same order, PICIFORMES. They are distant from all the other species, which belong to the PASSERIFORMES order or the GALLIFORMES order. This is confirmed by the phylogenetic tree produced from the protein sequences of the 37 Hb αA chains. Hb sequences from the species Vireo gilvus, Vireo cassinii, and Vireo huttoni are from the same genus, Vireo, and they are grouped together. Similarly, Hb sequences from the species Catharus guttatus and Catharus ustulatus are from the same family, TURDIDAE, and the same genus, Catharus, and they are grouped into the same branch. Turdus migratorius is from the same family (TURDIDAE) as Catharus guttatus and Catharus ustulatus, but is not from the same genus (Turdus versus Catharus), and it diverged a little earlier on the tree. Thryomanes bewickii and Troglodytes aedon from the TROGLODYTIDAE family; Haemorhous mexicanus and Haemorhous purpureus from the FRINGILLIDAE family and Haemorhous genus; Euphagus cyanocephalus and Icterus bullockii from the ICTERIDAE family; and Junco hyemalis, Melospiza melodia, Melozone crissalis, Passerella iliaca, Pipilo maculatus, Zonotrichia atricapilla, and Zonotrichia leucophrys from the EMBERIZIDAE family were all grouped separately based on their family.
Departing from the classical classification are the species from the CARDINALIDAE family and the CORVIDAE family. These species were grouped closely, but not within exactly the same branch. The conventional classification is based on phylogenetic tree construction using genetic data, which comprise 19 loci, topological constraints, and fossil constraints [46]. Our phylogenetic tree is based on only one parameter, the hemoglobin sequence, which may explain the aberrant results. The already sequenced Gallus gallus, which belongs to the GALLIFORMES order, branched together with Callipepla californica from our study, adding confidence to our sequencing results.

Figure 5

Phylogenetic tree of birds based on the sequences of their Hb αA chains, generated by using phylogeny.fr. The tree matches very well with the conventional classification (reflected by the name “Order_Family_Genus_Species”). The two known sequences from NCBI, which were inserted as controls, are highlighted

Conclusions

We have presented a top-down-assisted bottom-up sequencing approach. Building a customized database based on homologous and variable regions of already sequenced avian hemoglobin for the bottom-up approach expedited the alignment of the peptide data, thus providing a reliable model for top-down data analysis. The top-down approach not only served as quality control, confirming the peptide alignment from the customized-database bottom-up approach, but also provided sequence in regions where bottom-up failed. The combined approach has been successfully applied to sequence αA and β hemoglobin chains from 33 avian species. The sequences obtained in this study will be appended to current hemoglobin databases for tick/host studies. In the future, the top-down-assisted bottom-up sequencing approach can be applied to facilitate fast and complete sequencing of other homologous proteins.