
DNA sequencing is one of the great leaps in the advance of the life sciences. The research in these two sequencing articles came from two chapters of my PhD thesis at Berkeley, and it was my good fortune to connect my life with DNA sequencing through my wonderful thesis advisor, Professor Terry Speed.

Several factors motivated the selection of DNA sequencing as my thesis topic. First, I had an ambition to be an applied mathematician when I was young. In my last year of college, however, I was tortured by fatigue and infection. My mood was very low then, and I had no appetite for more mathematics at all. With the help of my family, I gradually recovered through a therapy of Chinese herbs, and I went on studying mathematics hard. From then on, I held a vague yet deep thought that someday I should apply my mathematical knowledge to the understanding of life and medicine. This was one reason why, in graduate school, I looked for an applied topic related to the life sciences. Second, Terry had been working on statistical genetics and was very enthusiastic about new statistical problems in genomics. He gave me a physical mapping problem as a start-up project, and I quickly made some progress that helped me pass the oral exam. Third, in 1995 the Human Genome Project accelerated, and researchers from many disciplines, such as chemistry, engineering, computer science, mathematics, and statistics, jumped into the field. Terry brought me into the adventure with goodwill.

Among the interesting mathematical problems associated with genomics, I picked DNA sequencing, or more exactly DNA base-calling, as my thesis topic, as Terry suggested. In 1994–95, DNA sequencing was based on Sanger's dideoxy chain-termination method, fluorescent dye labeling, and electrophoresis. In the beginning I knew nothing about molecular biology, and Terry helped me understand the basic ideas with great patience. His former student, David Nelson, was participating in DNA sequencing research at that time too, and provided us with a fairly complete background on electrophoresis [8, 9, 10]. Another source of collaboration was Professor Richard Mathies' group in the chemistry department at Berkeley, who were conducting research on capillary DNA sequencing. Through the statistical consulting service that Terry ran for the Statistics Department during one semester in 1995, Dr. Indu Kheterpal, then a PhD student in Professor Mathies' group, brought in an interesting estimation problem involving the fluorescent dyes. Terry set up a good collaboration with them, and through the interaction I learned a lot of the chemistry related to DNA sequencing.

Sanger DNA sequencing generates a signal trace from each template DNA, and base-calling is the data analysis part of DNA sequencing, aimed at reconstructing the nucleotide sequence with high fidelity. We decomposed the problem into three parts: color correction, deconvolution, and base-calling, and then tried to work out solutions to each of them. In my opinion, the work on color correction and deconvolution is mathematically and statistically more elegant and original, and we put a lot of effort into publishing it. In comparison, the solution to the last step, base-calling, is more engineering-like in flavor. Terry introduced me to the technique of the hidden Markov model (HMM), which was not so widely known then as it is now. I was intrigued by the idea, and we designed an HMM for base-calling. In genome research a good idea alone is not sufficient to have an impact; a good implementation is equally important, if not more so. The implementation of HMM base-calling required model training and a lot of serious software programming. Owing to graduation and my limited programming strength, I only tested the idea and did not develop a real software solution. A little later, Dr. Green and his team published their famous work on base-calling. In the meantime, microarray technology gradually caught people's attention, and our HMM base-calling idea was not pursued further [3].
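The flavor of HMM base-calling can be conveyed with a deliberately simplified sketch. The actual model in [3] was far richer; here the hidden states are just the four bases, the observations are noisy per-peak color indices, and Viterbi decoding recovers the most probable base sequence. All the probabilities below are invented for illustration.

```python
import numpy as np

bases = "ACGT"
n_states = 4

# Toy transition matrix: any base follows any other with equal probability.
trans = np.full((n_states, n_states), 0.25)

# Toy emission model: the observed color usually matches the hidden base.
emit = np.full((n_states, n_states), 0.05)
np.fill_diagonal(emit, 0.85)

def viterbi(obs, trans, emit):
    """Most probable hidden base sequence given observed color indices."""
    n = len(obs)
    logp = np.log(np.full(n_states, 0.25)) + np.log(emit[:, obs[0]])
    back = np.zeros((n, n_states), dtype=int)
    for t in range(1, n):
        scores = logp[:, None] + np.log(trans)      # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0) + np.log(emit[:, obs[t]])
    path = [int(logp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return "".join(bases[s] for s in reversed(path))
```

With diagonal-dominant emissions and uniform transitions, the decoded sequence simply follows the observed colors; the interesting cases arise when transitions and emissions encode real trace structure.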

To date, our most influential contribution to DNA sequencing is color correction [4]. A few years ago, Terry told me that Solexa, now owned by Illumina and one of the major next-generation sequencing platforms, had adopted our scheme. This is encouraging, and yet not surprising, because we showed that, at least from one important perspective, the color correction scheme we proposed is optimal. In capillary Sanger sequencing, four dyes, which emit different colors when excited by a laser, are used to distinguish the four kinds of nucleotides. The purpose of color correction is to remove the cross-talk among the four dyes' emission spectra. One key idea of our work is that the cross-talk needs to be estimated adaptively from each experiment. Another key idea is that our estimation makes use of the "canonical" distribution of data without any cross-talk. As a PhD student, I was enthusiastic about the solution when it was first discovered. Late one afternoon, as we walked home down Hearst Avenue, Terry asked me a serious question: "How do we know our solution is right?" I gave him an answer: "If we estimate the cross-talk matrix properly, the distribution of the corrected data should match the nominal one." Terry agreed. After I graduated, I went through several interesting problems in engineering and science, and realized that they share a common nature with the color correction problem. I wrote an article about this class of blind inversion problems in the festschrift for Professor Terry Speed's 60th birthday [6], because Terry's question partially inspired the formulation of this notion.
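In matrix terms, the observed four-channel signal is approximately a cross-talk matrix times the true dye concentrations, and correction amounts to inverting an adaptively estimated matrix. The sketch below is only a crude caricature of the estimator in [4]: the cross-talk matrix is made up, and each column is estimated naively from time points where one dye clearly dominates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 cross-talk matrix: column j is the spectral response
# of dye j across the four detection channels.
M = np.array([[1.00, 0.30, 0.05, 0.01],
              [0.25, 1.00, 0.20, 0.05],
              [0.05, 0.25, 1.00, 0.30],
              [0.01, 0.05, 0.20, 1.00]])

# Simulated truth: at each of n time points, a single dye is active.
n = 500
true = np.zeros((4, n))
labels = rng.integers(0, 4, size=n)
true[labels, np.arange(n)] = rng.uniform(0.5, 1.5, size=n)

observed = M @ true + 0.01 * rng.standard_normal((4, n))

# Crude adaptive estimate of each column of M: average the observed
# spectra at time points where channel j is the clear maximum, then
# scale each column so its peak channel equals 1.
M_hat = np.zeros((4, 4))
argmax = observed.argmax(axis=0)
for j in range(4):
    M_hat[:, j] = observed[:, argmax == j].mean(axis=1)
M_hat /= M_hat.max(axis=0)

# Color correction: invert the estimated cross-talk.
corrected = np.linalg.solve(M_hat, observed)
```

Because the matrix is estimated from the data themselves, the corrected channels line up with the underlying dye concentrations, which is the spirit of the adaptive scheme.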

In standard DNA sequencing, light intensities at four wavebands are measured, since four dyes are used. Interestingly, Dr. Kheterpal and Professor Mathies asked us whether we could instead use only three light intensities for base-calling. After some struggles, we designed a procedure consisting of a series of nonnegative least squares fits and a model selection scheme [1]. Professor Mathies was very pleased with the result.
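A toy version of the idea: with only three channels, each candidate base is fitted by nonnegative least squares against its three-channel dye spectrum, and the fit with the smallest residual wins. This single-dye comparison is only a stand-in for the actual series of fits and the model selection scheme in [1]; the spectra below are invented.

```python
import numpy as np
from scipy.optimize import nnls

# Invented 3-channel emission spectra; column j belongs to dye j.
S = np.array([[1.0, 0.6, 0.2, 0.1],
              [0.3, 1.0, 0.7, 0.3],
              [0.1, 0.3, 1.0, 0.9]])

def call_base(y, spectra=S):
    """Return the dye index whose nonnegative one-dye fit leaves the
    smallest residual (a crude model selection among four candidates)."""
    best, best_rss = -1, np.inf
    for j in range(spectra.shape[1]):
        _, rss = nnls(spectra[:, [j]], y)   # fit with dye j alone
        if rss < best_rss:
            best, best_rss = j, rss
    return best
```

A pure signal from any one dye is then attributed to the correct base even though only three intensities are observed.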

The work on deconvolution was also motivated by Sanger sequencing and is more technical than color correction. Each base in a Sanger sequencing trace can roughly be represented by a Gaussian-shaped peak on a continuous scale, and the four kinds of nucleotides, namely A, G, C and T, are represented by four different colors. The motion of DNA molecules in a capillary is usually explained by reptation theory, and the aggregation of molecules of the same size can approximately be described by Brownian motion; that explains why each peak looks like a normal density. In Sanger sequencing, most base-calling errors come from regions with runs of the same kind of nucleotide, and they lead to insertions and deletions, or simply indels. Once an error of this type occurs in base-calling, it often causes more trouble in an alignment than a substitution error does. How to separate these peaks, or in other words, how to count the bases in a run correctly, is a problem that we solved with the deconvolution technique.

The parametric deconvolution [5] was something we worked out without much prior knowledge of the literature on the topic. Terry suggested that we do a literature survey. In 1995, I searched the keyword "deconvolution" on Yahoo (I am sure it was not Google then), got over one thousand hits, and found that the early work went back to the nineteenth century. Obviously deconvolution is a common problem in many areas. I read almost all the relevant papers I could find, and discussed them with Terry over a long period of time, up until 2000 in Melbourne. One issue that puzzled us was whether deconvolution is an ill-posed problem, a notion postulated by Hadamard in 1902. Without any constraint on the solution space, deconvolution is ill-posed, and it had been classified so in applied mathematics. Nevertheless, in many cases the signals to be reconstructed are positive and "sparse". In parametric deconvolution, we formulate the unknown signal as a finite mixture of Dirac spikes, and we can estimate it well in a regular sense, see Theorems 4.1 and 4.2 in Li and Speed [5], although the dimension of the solution space needs to be estimated too by model selection, see Algorithm 5.2 in Li and Speed [5] and Proposition 3.3 in Li [2]. Thus Terry and I came to the conclusion: if the signal to be reconstructed is positive and sparse, then deconvolution is well-posed.
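The positivity-plus-sparsity point can be demonstrated numerically. Below, a spike train (the idealized bases) is blurred by a Gaussian point-spread function, and a plain nonnegative least squares inversion over the grid of candidate spike positions recovers it; without the positivity constraint, the same inversion would be hopelessly unstable. This is a discretized toy, not the parametric estimator of [5], and the peak width and spike positions are arbitrary choices.

```python
import numpy as np
from scipy.optimize import nnls

n = 200
t = np.arange(n)

# Positive, sparse truth: three spikes, two of them in a close run.
true = np.zeros(n)
true[[40, 50, 120]] = [1.0, 0.8, 1.2]

# Gaussian blur matrix: each spike shows up as a Gaussian-shaped peak.
sigma = 3.0
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / sigma) ** 2)
observed = K @ true

# Nonnegative least squares over the spike grid: the positivity
# constraint is what keeps this badly conditioned inversion stable.
est, _ = nnls(K, observed)
```

The recovered spike train concentrates its mass back at the three true positions, separating the run of two nearby peaks that ordinary unconstrained inversion would smear.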

This well-posedness explains why, historically, some nonparametric deconvolvers, such as Jansson's method and the folk iteration (5.2) in Li and Speed [7], obtained in different scenarios by either EM algorithms or Bayesian methods in the literature, work quite well in their respective applications. Furthermore, Terry and I investigated the general linear inverse problem with positivity constraints (LININPOS) that underlies the folk iteration. We discovered that the iteration in fact minimizes the Kullback-Leibler divergence between the target and the fit, and this result clarifies the core structure of the LININPOS solution.
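The folk iteration and its KL-minimizing property are easy to verify numerically. The sketch below implements the familiar multiplicative update (the EM / Richardson-Lucy form) for a positively constrained linear system with a made-up random matrix; along the iterations, the divergence between the target y and the fit Ax decreases.

```python
import numpy as np

def kl(p, q):
    """Extended Kullback-Leibler divergence: sum p*log(p/q) - sum p + sum q."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])) - p.sum() + q.sum())

def folk_iteration(A, y, n_iter):
    """Multiplicative update x_j <- x_j * (sum_i a_ij y_i/(Ax)_i) / (sum_i a_ij),
    which drives down KL(y || Ax) over the nonnegative cone."""
    x = np.ones(A.shape[1])
    col_sums = A.sum(axis=0)
    for _ in range(n_iter):
        x *= (A.T @ (y / (A @ x))) / col_sums
    return x

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(30, 10))   # strictly positive kernel
y = A @ rng.uniform(0.0, 2.0, size=10)     # consistent positive target
x_hat = folk_iteration(A, y, n_iter=5000)
```

Because this system is consistent, the divergence can be driven essentially to zero, and the iterate stays nonnegative by construction.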

The work described here has been a source of both enlightenment and enjoyment to me. As I wrote down these words, the scenes of Terry and me walking down Hearst Avenue and chatting about various issues came to my mind as if it were yesterday. I am sure that Terry's other students and colleagues have had their own pleasant study and work experiences with him as well. His spirit is no doubt the source of many good things. In addition to his passion for science and mathematics, his respect for the interests and talents of each student and each collaborator, as well as his own, may partly explain his wide research spectrum.