Historical and evolutionary perspectives

The view through the window starts with a single cytochrome P450 monooxygenase (P450) identified and cloned in a long series of plants, beginning with Jerusalem artichoke (Benveniste et al. 1977; Gabriac et al. 1991), pea (Benveniste et al. 1978; Stewart and Schuler 1989) and eventually extending to Arabidopsis thaliana (mouse ear cress) (Mizutani et al. 1997). In retrospect, its discovery was not surprising, since this ubiquitous P450 protein, dubbed t-cinnamic acid hydroxylase (t-CAH) and cinnamic 4-hydroxylase (C4H) by various sets of investigators, is the most abundant and constitutively expressed monooxygenase present in all plants. Mediating a critical reaction in the phenylpropanoid pathway, this particular P450 was shown to control flux from t-cinnamic acid (t-CA), a phenylalanine derivative, into a collection of branched pathways leading to the synthesis of lignin, flavonoids, anthocyanins and phytoalexins. Representative in its catalytic core of a superfamily of P450 proteins capable of incorporating oxygen into aliphatic and aromatic molecules using electron transfer partners that are either membrane-bound (NADPH-dependent P450 reductase and cytochrome b5/cytochrome b5 reductase for ER-localized P450s) or soluble (ferredoxin and ferredoxin reductase for chloroplast P450s), this protein was initially purified because of its high constitutive abundance. Later, it became the standard against which other plant P450 proteins were measured because it was recognized as a highly selective and essential enzyme capable of yielding only p-coumaric acid, the precursor needed for subsequent branches in the phenylpropanoid pathway.

From a genomics perspective, this particular P450 and its transcripts represent just one of the many P450s that exist in plants. Annotations in the completely sequenced genomes of Arabidopsis and Oryza sativa (rice) have indicated that 245 full-length genes with 27 pseudogenes are contained in the Arabidopsis genome (Paquette et al. 2000; Werck-Reichhart et al. 2002; Schuler and Werck-Reichhart 2003) and that 334 full-length genes, 7 unresolved partial genes and 100 pseudogenes are contained in the Oryza genome (Nelson et al. 2004 and more recent annotations). Reflecting a highly diverse set of reactive sites, the P450 proteins existing in each of these species are encoded by a divergent gene superfamily that maintains significant conservation in secondary and tertiary structures with relatively low levels of primary sequence conservation. Amino acid conservations among the most divergent members of this superfamily in these species are typically in the range of 15–20% and sometimes as low as 14% (as between Arabidopsis CYP707A2 and rice CYP723A2). Analysis of P450 sequences in many different phyla has indicated that the most diagnostic signature motif for a P450 protein is a short sequence (F-G-R-C-G) surrounding the heme cysteine ligand positioned approximately 55 a.a. from the C-terminus (Nelson et al. 1993, 1996). But, even this signature is not strictly conserved in all members of the P450 superfamily; some of the most divergent P450s (e.g., allene oxide synthases (AOS), hydroperoxide lyases (HPL)) contain only three of these conserved amino acids.

Within any one organism such as Arabidopsis, the superfamily of P450 sequences has evolved to contain a spectrum of families that differ substantially in their coding sequences, intron positions and regulatory elements. To avoid the acronyms used earlier that designated P450s according to their substrate and/or historical source, a universal nomenclature system evolved that annotates P450 sequences with a CYP (CYtochrome P450) designator followed by numerical and alphabetic characters identifying family and subfamily groupings based on identities in their amino acid sequence (Nelson et al. 1993, 1996). In this, the most highly related monooxygenase proteins are grouped into gene families designated with numbers (CYP1, CYP2, etc.) indicating sequences sharing greater than 40% amino acid identity with subfamilies designated with alphabetical characters (A, B, C, etc.) indicating sequences sharing greater than 55% amino acid identity and individual loci designated with additional numbers following the subfamily designation (CYP1A1, CYP1A2, CYP1A3, etc.). In organisms where it is not yet clear if closely related sequences sharing more than 97% amino acid identity are derived from different loci, individual sequences are designated as allelic variants (v1, v2, etc.) following the locus designation. In organisms with complete genomic information (e.g., Arabidopsis, Oryza), closely related sequences sharing this level of identity are designated as independent loci unless they represent mutants or ecotype variants of a single locus.
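The identity thresholds in this nomenclature can be illustrated with a minimal Python sketch. The function name and the returned labels are hypothetical and introduced only for illustration; the cut-offs (40%, 55%, 97%) are those stated above, and actual CYP name assignments also rely on curator judgment and phylogenetic context that this sketch does not capture.

```python
def cyp_relationship(identity_percent: float) -> str:
    """Map a pairwise amino acid identity (in percent) to the grouping
    suggested by the thresholds quoted in the text.

    Illustrative only: the labels are hypothetical and real assignments
    are curated rather than purely threshold-based.
    """
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be between 0 and 100 percent")
    if identity_percent > 97.0:
        # May be allelic variants or independent loci, depending on
        # whether complete genomic information is available.
        return "possible allelic variant"
    if identity_percent > 55.0:
        return "same subfamily"
    if identity_percent > 40.0:
        return "same family"
    return "different family"


# The 14% identity cited for Arabidopsis CYP707A2 vs. rice CYP723A2
print(cyp_relationship(14.0))
```

Under these thresholds, the most divergent pairs mentioned earlier (15–20% identity) fall well outside any shared family.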

Current Arabidopsis P450 annotations available at two evolving databases (http://Arabidopsis-P450.biotec.uiuc.edu; http://www.p450.kvl.dk//p450.shtml) indicate that, among the 44 P450 families and 69 subfamilies represented in the Arabidopsis genome, several single-gene P450 families exist. These include CYP73A5 (t-CAH/C4H) in phenylpropanoid synthesis (Mizutani et al. 1997), CYP75B1 (F3′H) in flavonoid/anthocyanin synthesis (Schoenbohm et al. 2000), CYP701A3 (ent-kaurene oxidase) in gibberellin synthesis (Helliwell et al. 1998, 1999), CYP734A1 (brassinolide 26-hydroxylase) in brassinosteroid degradation (Neff et al. 1999; Turk et al. 2003) and a collection of highly divergent genes that represent the first and sometimes sole members of new P450 families and subfamilies (e.g., CYP93D1, CYP711A1, CYP718A1, CYP720A1, CYP721A1, etc.). Duplication and diversification in other families have resulted in an array of other subfamilies containing between two members (CYP51G, CYP79F, CYP85A, etc.), 16 members (CYP71A) and 37 members (CYP71B).

Comparison with the array of P450 loci existing in rice has highlighted a number of lineage-specific P450 families maintained and lost during evolution of these monocot (rice) and dicot (Arabidopsis) species (Nelson et al. 2004). Arabidopsis P450 families clearly absent from the rice include CYP82, CYP83, CYP702, CYP705, CYP708, CYP712, CYP716, CYP718 and CYP720 but, with the exception of CYP705, all of these correspond to single gene or small multigene P450 families (2–6 members) that may mediate functions particular to Arabidopsis and/or functions replaceable by more divergent enzymes existing in rice. Interestingly, five of these “Arabidopsis-specific” families are grouped with the CYP85 clan, a phylogenetically larger grouping that was originally designated for its members mediating the modification of sterols and cyclic terpenes in brassinosteroid (BL), abscisic acid (ABA) and gibberellic acid (GA) biosynthesis (Nelson 1999). Others, such as the CYP82 and CYP83, appear to be divergent offshoots of the prolific CYP81 and CYP71 families, respectively, whose members have not been extensively characterized at this time.

The number and diversity of these many P450 loci provide special challenges in characterizing their expression patterns and physiological functions that are discussed further in this review. Even considering these challenges, the breadth of their biochemical activities and location in many essential plant pathways indicate that they can serve as important reporters for visualizing the intricacies of plant biochemistry and its integrated network of interacting pathways. Their role as reporters is especially evident when one considers that many of the proteins within this gene family exist at critical nodes in pathways responsible for synthesizing hormones (GA, BL, ABA, IAA) and plant signaling molecules (jasmonic acid (JA), salicylic acid (SA)), at branchpoints in pathways leading to the synthesis of plant defense molecules (lignin, flavonoids, phytoalexins) and at the termini of these pathways where, as targets for these defense signaling cascades, they are responsible for the direct synthesis of defense molecules. With their clear roles in the synthesis of hormones and signaling molecules, their networking exemplifies the range of integrated events occurring at the level of signal transduction especially as related to stress. With their existence in at least two different cellular compartments, the endoplasmic reticulum and chloroplasts, their networking also exemplifies the types of integration needing to occur between these cellular compartments.

Subcellular locations

With so many loci in this gene family, any categorization process that is aimed at grouping those with similar functions (e.g., subcellular location, tissue distributions, transcriptional response times, range of inducers, enzymatic activities) has potential for distinguishing one P450 protein and its corresponding locus from the next and limiting the range of functions predicted for each. In terms of subcellular location, it is clear that most of the Arabidopsis P450s are targeted to the endoplasmic reticulum (ER) using an amino-terminal signal sequence of 25–30 amino acids that, after insertion in this membrane, is not cleaved from the initial translation product. Positioned in this manner, the remainder of their structure remains on the cytoplasmic side of the membrane situated in proximity to ER-anchored NADPH P450 reductases that act as their electron transfer partners. A significantly smaller set of Arabidopsis P450s are targeted to chloroplasts using longer and more hydrophilic amino-terminal transit sequences. Analysis of amino-terminal sequences using the ChloroP program (Emanuelsson et al. 1999; http://www.cbs.dtu.dk/services/ChloroP) has identified a total of 42 Arabidopsis P450s that, according to these algorithms, are predicted to be targeted to the chloroplast because they contain a putative cleavage site for a chloroplast transit sequence (Table 1). Closer inspection of the positions of prolines, serines and threonines within these putative transit sequences, which vary in length from 10 to 97 amino acids, indicates that many of these contain clustered prolines approximately 30–35 amino acids from their amino-terminus, are not especially rich in Ser/Thr in their preceding amino acids and have sequence compositions more like endoplasmic reticulum-localized P450s. 
Elimination of these sequences and retention of those containing substantial numbers of Ser and Thr (>14%) in their amino-terminal sequences suggest that only 11 of those predicted to be chloroplast-localized by ChloroP may contain actual chloroplast targeting sequences (underlined in Table 1). Of these, CYP74A1 (AOS in JA synthesis), CYP74B2 (HPL in hexenal synthesis), CYP86B1 (undefined function), CYP97A3 (carotene β-hydroxylase in carotenoid synthesis), CYP97C1 (carotene ɛ-hydroxylase in carotenoid synthesis) and CYP701A3 (kaurene oxidase in GA synthesis) have actually been identified as chloroplast-localized (double asterisks in Table 1) (Froehlich et al. 2001; Helliwell et al. 2001; Watson et al. 2001; Tian et al., 2004; Kim and DellaPenna 2006). But, the final destinations of these differ considerably with one (CYP74A1) localized to the inner chloroplast membrane facing the stroma, another (CYP74B2) localized in the outer chloroplast membrane facing the intermembrane space, two (CYP86B1, CYP701A3) localized to the outer chloroplast membrane facing the cytoplasm and the remaining two (CYP97A3, CYP97C1) targeted to undefined locations in the chloroplast. Comparisons of the six proteins known to be chloroplast targeted indicate that the amino-termini of the four targeted into the chloroplast have 3–8 prolines scattered among the serines and threonines of their first 30 amino acids and the two targeted to the outside of the chloroplast have 0–1 prolines in their first 40 amino acids and 16–33% Ser/Thr in their first 30 amino acids. Evaluation of the others underlined in the ChloroP list (not eliminated based on the presence of a proline hinge) against these standards suggests that CYP78A5, CYP94B1, CYP94D1 and CYP97B3 all have features of proteins targeted into the chloroplasts. 
Further analysis of the remaining Arabidopsis P450s against this more elaborate set of criteria indicates that CYP72A8 and CYP72A9 lack proline clusters but have high Ser/Thr contents as do several targeted to the outside of the chloroplast.
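The composition criteria used in this filtering step can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for Table 1: the function names and the default 30-residue window are assumptions, and only the >14% Ser/Thr cut-off and the counting of prolines come from the text.

```python
def n_terminal_profile(protein: str, window: int = 30) -> dict:
    """Summarize Ser/Thr content and proline count in the first
    `window` residues of a protein sequence (one-letter codes).

    The 30-residue default is an assumption chosen to match the
    comparisons described in the text.
    """
    seq = protein.upper()[:window]
    if not seq:
        raise ValueError("empty sequence")
    ser_thr = seq.count("S") + seq.count("T")
    return {
        "window": len(seq),
        "ser_thr_percent": 100.0 * ser_thr / len(seq),
        "prolines": seq.count("P"),
    }


def passes_ser_thr_filter(protein: str, window: int = 30,
                          min_percent: float = 14.0) -> bool:
    """True if the amino-terminal Ser/Thr content exceeds the cut-off."""
    profile = n_terminal_profile(protein, window)
    return profile["ser_thr_percent"] > min_percent


# Hypothetical amino-terminal sequence, not a real P450
example = "MSSTT" + "SP" * 5 + "A" * 15
print(n_terminal_profile(example))
```

A sequence passing the Ser/Thr filter would still need to be checked for the clustered prolines (the "hinge") that characterize ER-localized P450s; the proline count reported here is only a first indication of that feature.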

Table 1 P450s with chloroplast or mitochondrial targeting signals

Two of those in the ChloroP list, CYP79B2 and CYP79B3 involved in the synthesis of glucosinolates that are thought to be chloroplast-localized (described further in Nafisi et al. (2006) in this volume) display activities when expressed in E. coli and reconstituted with purified sorghum or rat microsomal P450 reductases (Hull and Celenza 2000; Mikkelsen et al. 2000). Others predicted by ChloroP to be chloroplast-localized, such as CYP707A1 and CYP707A3 mediating ABA degradation (Kushiro et al. 2004; Saito et al. 2004), have been expressed in yeast and insect cells in the presence of the ER-localized Arabidopsis P450 reductase and are likely to be ER-localized. The lengths of their amino-terminal sequences and their lower Ser/Thr contents are more consistent with this localization. The range of P450s predicted to be targeted to the chloroplasts by the TargetP program (Table 1) (Emanuelsson et al. 2000) overlaps to some extent those predicted by the ChloroP program but some notable omissions occur. Among these, the omission of CYP86B1 and CYP701A3 known to be targeted to the exterior surface of the chloroplast suggests that TargetP predictions are less useful in predicting proteins targeted to the outer chloroplast membrane.

Although no plant P450s have yet been localized to mitochondria, as is the case for some mammalian P450s, it remains conceivable that some plant P450s are targeted to this organelle. TargetP predicts that as many as fifteen Arabidopsis P450s might be targeted to this organelle. But, further analyses of these indicate that two are also predicted to be chloroplast targeted by the alternate ChloroP program (blue in Table 1), twelve have amino-terminal sequence compositions more reminiscent of ER-localized P450s (e.g., 2–5 prolines in a short “hinge” region separating the signal sequence from the body of the protein) and two have ambiguous proline-hinge regions. Thus, it is unclear whether any of these Arabidopsis P450s are mitochondrially targeted.

Transcripts represented in databases

Our detailed BLAST analyses (Altschul et al. 1990) of available full-length cDNA and EST collections for the 272 Arabidopsis P450 genes and pseudogenes have identified 438 full-length P450 cDNAs in Genbank (http://www.ncbi.nih.gov/Genbank/index.html) and 1267 ESTs in the Arabidopsis thaliana Gene Index (AtGI) (http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=arab). Alignments of these full-length sequences with their corresponding genomic sequences shown on individual P450 locus pages at http://Arabidopsis-P450.biotec.uiuc.edu/cgi-bin/p450.pl have provided supporting information for 166 of the 245 P450 loci with an additional eight P450 loci confirmed by our cloning of RT-PCR products.

With the caveat that current databases contain many P450 cDNAs derived from normal or stressed leaf tissues and small numbers of RT-PCR products cloned in directed searches for particular transcripts, enumeration of the number of full-length cDNAs for each locus indicates that substantial differences exist in the pools of different P450 subfamily and family transcripts (Table 2). Not unexpectedly, several loci with defined functions are represented by high numbers of full-length cDNAs (e.g., seven for CYP51G1 in sterol synthesis, eight for CYP74A1 in JA synthesis, eight for CYP83A1 in glucosinolate synthesis, seven for CYP90A1 in BL synthesis) and some with as-yet-undefined functions (e.g., seven for CYP71B6, eight for CYP81F1, five for CYP705A19). In total, 20 of 245 P450 loci are represented by five or more full-length cDNAs in databases. Presumably reflecting the abundance of their transcripts in the types of RNA samples used for construction of these cDNA libraries, 53 other loci are represented by three or four full-length cDNAs, 93 other loci are represented by one or two full-length cDNAs and 106 other loci have no available full-length cDNAs. Transcripts for the 27 full-length pseudogenes and pseudogene fragments in the genome are discussed below. These full-length P450 cDNA counts reflect sequences in the dbEST (http://www.ncbi.nlm.nih.gov/Genbank/index.html), RIKEN (http://rarge.gsc.riken.go.jp/) and CERES (ftp://ftp.tigr.org/pub/data/a_thaliana/ceres/) databases as of February 2006.

Table 2 Arabidopsis P450 full-length cDNAs in current databases

Analyses of these databases as well as validated and provisional REFSEQ sequences (Pruitt et al. 2002) have identified, quite surprisingly, an unusual set of five P450 transcripts in the RIKEN database spanning two adjacent loci in the Arabidopsis genome. Three of these transcripts represent bicistronic transcripts spanning adjacent P450 loci that are potentially capable of coding for two complete P450 open reading frames (ORFs) (CYP71B34/CYP71B35), nearly complete ORFs (CYP705A15/CYP705A16) or adjacent P450 and O-methyltransferase ORFs (CYP97C1/OMT) (Thimmapuram et al. 2005). Two other unusual P450 transcripts represent monocistronic transcripts that splice two full-length P450 sequences to generate dimeric P450s not yet identified in another organism (CYP96A9/CYP96A10, CYP71A27/CYP71A28). The fact that splicing in these fused monocistronic transcripts occurs just upstream from the translation stop in the first ORF to just downstream from the signal sequence needed for ER-localization has suggested that these dimeric P450 fusion proteins may be functionally relevant for sequential modifications on hydrophobic substrates. Realizing that the identification of these unusual transcripts from adjacent loci can only be appreciated in plant species whose genomes have been completely sequenced, their existence has been verified with gene-specific P450 primers and probes and control and environmentally stressed Arabidopsis RNAs (Thimmapuram et al. 2005). As a result of this analysis, it is now apparent that the bicistronic and fused monocistronic transcripts exist side-by-side with monocistronic transcripts from each of the adjacent loci. For example, transcripts capable of coding for the dimeric CYP96A9/CYP96A10 protein exist in flowers along with abundant transcripts coding for CYP96A9 and rarer transcripts coding for CYP96A10 (Thimmapuram et al. 2005). Transcripts for the bicistronic CYP71B34/CYP71B35 and CYP97C1/OMT proteins exist in cold- and drought-stressed seedlings.
Given that these unusual transcripts could not possibly have been predicted by current annotation algorithms, these transcripts have fogged existing definitions of genetic loci in the Arabidopsis genome and highlighted a number of P450 loci whose transcript profiles must take into account the fact that they are represented by both monocistronic and bicistronic transcripts.

In addition to providing support for existing gene models and some novel transcripts, our database curations have identified a few loci with alternative splicing variants. One (CYP51G1) contains an intron in its 5′ untranslated region while others contain cryptically spliced introns whose excision causes transcripts to code for prematurely truncated proteins (CYP71B2, CYP97C1), inefficiently spliced introns whose retention causes transcripts to code for prematurely truncated proteins (CYP71B29, CYP71B35, CYP72A13, CYP83A1, CYP707A3), introns with alternative 3′ splice sites whose variations cause transcripts to code for either full-length or amino-terminally truncated (missing 83 a.a.) proteins (CYP711A1) or alternative polyadenylation sites which cause intron retention and production of truncated proteins (CYP76C7). In most cases, analyses of the splice sites surrounding these aberrantly spliced introns indicate that they are nonoptimal and prone to being retained. In contrast with this, the CYP708A2 locus contains an upstream transcription start whose usage causes the transcript to code for an unusually long (76 a.a.) signal sequence rather than its shorter (25 a.a.) and more typical signal sequence. Our database curations have also identified a natural 10 bp deletion in the coding region of the CYP74B2 gene in one commonly used ecotype (Col-0) of Arabidopsis that prevents this gene from expressing HPL activity (Duan et al. 2005). As a consequence, this particular ecotype contains an additional pseudogene (CYP74B2P) and is defective in C6-volatile production.

Even with the available cDNA clones, transcription start sites are not well defined in many of these P450 transcription units; those that lack full-length cDNAs often have no ESTs in current collections or only ESTs corresponding to the 3′ ends of loci. Support for current P450 gene models will come only from additional clonings, if low level P450 transcripts can be individually targeted in RT-PCR strategies, or sequence comparisons, if their derived sequences can be aligned with similar P450 proteins to localize deletions and/or insertions relative to structurally important regions.

Transcripts detected by microarray and oligoarray profiling

Various transcript profiling strategies have been used to identify the range of P450s expressed in different tissues and those induced or repressed in response to a particular stress regime. The high degree of evolutionary duplication in this large gene family has created special challenges for defining these expression patterns and subsets of coordinately regulated genes. The one predominating complication in this analysis arises from high degree of nucleic acid identity that, if not carefully monitored against, causes related P450 sequences to cross-hybridize and leads to inaccurate expression profiling. In the time since our previous review (Schuler and Werck-Reichhart 2003) discussed the cDNA/EST-based strategies being used to evaluate P450 expression patterns, several oligoarray and microarray platforms have become available for either full-genome profiling or more detailed analysis of P450 and other stress-response genes. The oligoarray platforms now include an Affymetrix ATH1 array (Redman et al. 2004) that contains 226 elements representing 226 P450 loci, a 70-mer oligoarray (http://www.ag.arizona.edu/microarray/) that contains 243 elements representing 237 P450 loci (with elements for 15 loci potentially detecting closely related transcripts), an Agilent 60-mer oligoarray (http://www.agilent.com/chem/DNA) that contains 304 elements representing 252 P450 loci and a more focused 50-mer array that contains elements for 246 P450 loci and 112 UGT loci (Kristensen et al. 2005). The microarray platforms now include a CATMA GST (gene-specific tag) microarray (Allemeersch et al. 2005) that contains 148 elements representing 141 P450 loci and a P450 gene-specific microarray (built at the University of Illinois in collaboration with Genoplante) that contains 265 P450 loci alongside 365 biochemical pathway and physiological function marker loci. 
To facilitate interpretations of various datasets, updated annotations have been assigned to the 70-mer oligoarray and the P450 gene-specific microarray identifying probe elements capable of hybridizing to two different regions of the same P450 locus as well as probe elements potentially capable of hybridizing with other Arabidopsis loci (both P450 and non-P450 loci) sharing >95% identity across a 70 nt. oligomer or across more than 100 nt. of a microarray element.

With these annotations in place to highlight potentially problematic loci, the process of categorizing P450 loci based on their tissue-specificity and inducibilities has begun using the more focused P450 gene-specific arrays and, to a more limited extent, the global oligoarray systems. One distinct advantage of the smaller arrays is that, because of their cost-effectiveness, it is possible to record P450 transcript levels in samples with many more datapoints per RNA sample as well as tissues and induction timepoints analyzed. With samples representing both technical and biological replicates and data analysis procedures that statistically identify all transcripts at least three-fold over background at P < 0.05, even very low P450 transcript levels can be statistically documented as being expressed (Kristensen et al. 2005; Ali et al. 2006a, 2006b). Comparisons between these small and large array systems have indicated that, often, transcript profilings done with more limited sets of 3–4 datapoints per sample in the global arrays fail to detect low abundance P450 transcripts. Exemplifying the sensitivity of the more focused P450 arrays, transcript profiles for shoots and roots of 7-day-old seedlings vs. flowers, stems and leaves of 1-month-old plants defined on our P450 microarrays have identified a significant fraction (86–93%) of the P450 loci that are expressed at some level in seedlings and mature flowering plants with significant variations in the abundance of individual transcripts in different tissues and in different P450 subfamilies (Ali et al. 2006a). Examples of these differences exist in the 5-member CYP86A subfamily that contains functionally characterized fatty acid hydroxylases (Benveniste et al. 1998; Wellesen et al. 2001; Duan and Schuler 2005; Rupasinghe et al. 2006), the 37-member CYP71B subfamily that contains CYP71B15 in camalexin synthesis (Zhou et al. 1999; Schuhegger et al. 
2006) and 36 uncharacterized members and the 17-member CYP71A subfamily that contains several flower-specific transcripts. Expression patterns for these subfamilies normalized to the transcript levels in a universal control (e.g., RNA from all aerial tissues of 1-month-old plants and root tissue from 7-day-old seedlings) are shown in Table 3 with blue designating normalized ratios higher than 2.0, green designating ratios less than 0.5 and ND (not detectable) designating loci that have no signal over background in any of 8 datapoints. Ratios designated in yellow are those derived from a small number of datapoints (less than four of eight datapoints) that are not statistically significant and often represent transcripts whose signal levels are close to the background levels on these P450 microarrays; all of these should be viewed as statistically nondetectable. At this level of comparison, it is evident that members within individual subfamilies are independently regulated, with examples in the CYP71A subfamily including the flower-specific CYP71A24 and root-specific CYP71A12 and CYP71A28. Examples in the CYP71B subfamily include CYP71B15, which is overrepresented in seedling shoots and roots but not in stems and flowers, CYP71B14 and CYP71B26, which are expressed in all tissues analyzed, and CYP71B9, CYP71B18 and CYP71B25, which are undetectable in all tissues. An expanded table showing the tissue-specificity of these Arabidopsis P450s exists at http://arabidopsis-P450.biotec.uiuc.edu.
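The colour coding described for Table 3 amounts to a small decision rule, sketched below in Python. The function name, the category labels and the use of an arithmetic mean over detected datapoints are assumptions made for illustration; the cut-offs (ratios above 2.0, below 0.5, fewer than four of eight datapoints) are those given in the text.

```python
from typing import Optional, Sequence


def classify_ratio(ratios: Sequence[Optional[float]],
                   min_points: int = 4) -> str:
    """Assign a Table 3-style category to one locus in one tissue.

    `ratios` holds the normalized ratio for each datapoint, with None
    where no signal over background was recorded. Averaging with an
    arithmetic mean is an assumption of this sketch.
    """
    detected = [r for r in ratios if r is not None]
    if not detected:
        return "ND"                     # no signal in any datapoint
    if len(detected) < min_points:
        return "not significant"        # shown in yellow
    mean_ratio = sum(detected) / len(detected)
    if mean_ratio > 2.0:
        return "high"                   # shown in blue
    if mean_ratio < 0.5:
        return "low"                    # shown in green
    return "unchanged"


# Hypothetical datapoints for a single locus
print(classify_ratio([3.1, 2.8, None, 3.4, 2.9, 3.0, None, 3.2]))
```

Treating the "not significant" category as a separate outcome, rather than averaging whatever datapoints exist, mirrors the recommendation that such ratios be viewed as statistically nondetectable.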

Table 3 Tissue specificity of P450s within some of the larger P450 subfamilies

Side-by-side comparisons of the average raw scores and normalized ratios for the CYP86A subfamily shown in Table 4 indicate the significant range of signal intensities detected for members of individual subfamilies. In particular, CYP86A2 is exceptionally abundant and expressed in most tissues while CYP86A1 is exceptionally abundant in root and marginally detectable in other tissues (Duan and Schuler 2005). Similar comparisons of the signal levels obtained for all P450 loci indicate that several P450 transcripts accumulate at extremely high levels in all tissues while others accumulate at high levels in more limited sets of tissues. Using an arbitrary average signal cut-off of 1000, two P450 transcripts (CYP73A5, CYP705A16) appear to be constitutively expressed at significantly higher levels than other P450 transcripts (Table 5). The high signal intensities of the CYP73A5 element are consistent with its significant transcript levels observed in previous studies (Bell-Lelong et al. 1997; Mizutani et al. 1997). The high signal intensities of the CYP705A16 element are likely due to its existence in the long bicistronic CYP705A15/CYP705A16 transcript from this region of the genome (Thimmapuram et al. 2005). Other transcripts that are abundant in many tissues include CYP51G1 in sterol synthesis (Kushiro et al. 2001; Kim et al. 2005b) that has high signal in all except mature leaves (where its signal falls just below 1000), CYP81G1 (function undefined), CYP706A1 (function undefined) and CYP86A2 in fatty acid synthesis (Xiao et al. 2004; Duan and Schuler 2005) that have high signals in all except roots, CYP83B1 in indole glucosinolate synthesis (Bak et al. 2001; Bak and Feyereisen 2001) that has high signal in all except flowers, CYP74A1 in JA synthesis (Laudert et al. 1996) that has high signal in all except roots and flowers and CYP90A1 in BL synthesis (Szekeres et al. 1996) that has high signal in seedling shoots and mature leaves.
Without detailing each and every locus, the numbers of moderately abundant P450 transcripts (signal intensities in the 200–1000 range) are 48 for seedling shoots, 30 for seedling roots and mature stems, 33 for mature leaves and 49 for flowers. Many other loci exist in the low abundance or undetectable range (with signal intensities below 200).
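The abundance classes used above can be expressed as a short Python sketch. The function names and the example signal values are hypothetical; the cut-offs of 1000 and 200 are those stated in the text, and treating a signal of exactly 1000 as "high" is an assumption, since the text does not say how boundary values were handled.

```python
from collections import Counter
from typing import Dict


def abundance_class(signal: float) -> str:
    """Bin an average microarray signal into the classes used in the text."""
    if signal < 0:
        raise ValueError("signal intensity cannot be negative")
    if signal >= 1000:          # boundary handling is an assumption
        return "high"
    if signal >= 200:
        return "moderate"
    return "low or undetectable"


def tally_classes(signals: Dict[str, float]) -> Counter:
    """Count how many loci fall into each abundance class for one tissue."""
    return Counter(abundance_class(s) for s in signals.values())


# Hypothetical signals for one tissue; the values are invented
root_signals = {"locus1": 1520.0, "locus2": 640.0, "locus3": 210.0, "locus4": 35.0}
print(tally_classes(root_signals))
```

Running the tally once per tissue would reproduce the kind of per-tissue counts of moderately abundant transcripts listed above.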

Table 4 Tissue-specificity and transcript variations in the CYP86A subfamily
Table 5 Tissue-specificity of P450 transcripts abundant under normal growth conditions

For evaluative purposes, some of the datasets obtained from the focused P450 microarray have been compared with those obtained using the more expensive 22,745 element ATH1 Affymetrix arrays and 27,216 element 70-mer arrays (Kristensen et al. 2005; Ali et al. 2006a). Using seedling root transcript profiles as a point of comparison, we have compared in Table 6 the root transcript profiles for all P450 genes and pseudogenes with root cell-specific transcript profiles for 6-day-old seedlings (Birnbaum et al. 2003). Because this particular Affymetrix dataset details expression levels in five root cell types (stele, endodermis, endodermis plus cortex, epidermal atrichoblast, lateral root cap) abundant in primary roots but not quiescent center or columellar root cap cells (Nawy et al. 2005), comparisons with our seedling root datasets have been done by scoring for loci having signal levels over 75 on the Affymetrix arrays for at least one type of root cell or on microarrays for intact roots. Clearly indicating the discrepancies between these array systems, 47 (of the 224 P450s represented on both types of arrays) are scored as expressed over background in roots in both array systems (designated in blue in Table 6), 31 are scored as expressed in our P450 microarrays but not in oligoarrays (designated in white) and 20 are scored as expressed in oligoarrays but below the signal cut-offs used in our microarrays (listed at the bottom of Table 6). While not directly comparing expression levels in these array systems, these comparisons highlight the large number of P450 loci (47) whose expression agrees and the larger number of P450 loci (51) whose expression differs between these two array formats.
With the RT-PCR gel blot analyses in Duan and Schuler (2005) supporting detection of root-expressed transcripts for CYP86A2, CYP72A7, CYP74A1 and many others in both microarrays and Affymetrix array formats, there are very notable discrepancies between these two datasets. Among the differences noted for these five cell types are the CYP71B7, CYP86A1 and other transcripts (Duan and Schuler 2005). The high signals detected on our microarrays for these last two examples and the confirmation of their Affymetrix element sets suggest that factors other than low transcript levels or differences in RNA preparation methods contribute to the failure of Affymetrix arrays to detect these abundant P450 transcripts. The high degree of sequence identity that exists between some of the most recently duplicated P450 loci may explain some of these discrepancies since close identities of this sort have the tendency to cause recently duplicated genes to be scored as “absent” on Affymetrix arrays when they are in fact expressed. Although it does not factor into the detection problems detailed above, it is important for other researchers to note that a number of P450 elements on the ATH1 oligoarray have misleading locus annotations and CYP designations potentially complicating descriptions of the biochemical processes affected by any given treatment; the correct CYP designations for these should be: 246620_at (CYP81D1), 253101_at (CYP81F1), 251988_at (CYP71B31), 252674_at (CYP71B38), 264470_at (CYP735A2), 250838_at (CYP77A9). Apart from these problematic sets of P450 elements, the Affymetrix array elements that accurately record root P450 transcript levels demonstrate the extent of cell-specific expression of many individual P450 transcripts and, again, serve to group sets of P450s coordinately expressed and colocalized for common metabolic processes.
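The scoring scheme behind the Table 6 comparison can be sketched in Python as a set comparison. This is a simplified illustration rather than the analysis actually performed: the function name, the data layout and the example values are hypothetical, and only the signal cut-off of 75 and the "at least one root cell type" rule come from the text.

```python
from typing import Dict, List, Set


def compare_array_calls(microarray: Dict[str, float],
                        oligoarray: Dict[str, List[float]],
                        cutoff: float = 75.0) -> Dict[str, Set[str]]:
    """Partition loci present on both platforms by where they score as
    expressed.

    `microarray` maps each locus to its signal in intact roots;
    `oligoarray` maps each locus to its signals in individual root cell
    types. A locus counts as expressed on the oligoarray if any single
    cell type exceeds the cut-off.
    """
    shared = set(microarray) & set(oligoarray)
    micro_on = {loc for loc in shared if microarray[loc] > cutoff}
    oligo_on = {loc for loc in shared
                if any(s > cutoff for s in oligoarray[loc])}
    return {
        "both": micro_on & oligo_on,
        "microarray_only": micro_on - oligo_on,
        "oligoarray_only": oligo_on - micro_on,
        "neither": shared - micro_on - oligo_on,
    }


# Hypothetical loci and signals, invented for illustration
micro = {"locusA": 500.0, "locusB": 300.0, "locusC": 10.0, "locusD": 20.0}
oligo = {"locusA": [10.0, 80.0, 5.0], "locusB": [1.0, 2.0, 3.0],
         "locusC": [100.0, 0.0, 0.0], "locusD": [5.0, 5.0, 5.0]}
print({name: sorted(loci) for name, loci in compare_array_calls(micro, oligo).items()})
```

In the terms used above, the "both" set corresponds to the 47 concordant loci, while the "microarray_only" and "oligoarray_only" sets correspond to the 31 and 20 discordant loci, respectively.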

Table 6 Comparison of P450 microarray and Affymetrix ATH1 array datasets

Similar comparisons between P450 microarrays and Affymetrix arrays done for ABA- and IAA-treatment of 7-day-old seedlings (Ali et al. 2006b) also indicate that there are significant numbers of P450 transcripts whose expression is not accurately recorded on Affymetrix arrays. One notable discrepancy on these Affymetrix arrays (NASC array 176 for ABA induction, NASC array 175 for IAA induction; http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl) is the recorded absence of induction for the CYP707A4 locus responsible for ABA inactivation after ABA treatment. Similar comparisons between the P450 and UGT 50-mer array and the full-genome 70-mer array have provided additional evidence that the focused array formats allow better detection of low abundance P450 transcripts (Kristensen et al. 2005). Continued comparative analyses of this sort are needed to define the range of P450 loci accurately and inaccurately reported on full-genome Affymetrix arrays.

Comparisons of focused P450 array datasets with previous cDNA/EST microarray datasets are difficult, if not impossible, given the different gene specificities of the shorter microarray elements that are now being used and the longer cDNA/EST microarray elements that had been used in earlier studies (Xu et al. 2001; Narusaka et al. 2004). As noted in these earlier works, signals from cDNA/EST elements sharing a high degree (>80%) of identity over the length of the longer probes represent the summed expression levels for P450 subfamilies containing several closely related members. Signals from the shorter microarray elements are locus-specific and, where potential for cross-reactivity exists, have been annotated (as shown in Tables 5 and 6 with underlines and asterisks) to highlight this possibility and emphasize the need to verify the expression profiles of these particular elements with independent methods.

Categorization of P450s by their tissue profiles defined on P450 microarrays has identified ten clusters designated according to the tissues displaying the highest normalized ratios relative to universal controls. Not including pseudogene elements, the numbers of P450s in these clusters are: (1) constitutive (27), (2) shoot (23), (3) stem/flower (30), (4) stem/shoot (7), (5) root (46), (6) flower (20), (7) stem (18), (8) root/shoot (20), (9) leaf/stem/shoot (21), (10) flower/shoot (22) with only eight in a group of unclassified loci. Again using root expression data to demonstrate the complexities of plant biochemistry occurring in individual tissues, the root-specific cluster (Table 7) includes many P450s and biochemical pathway markers involved in production of aliphatic and indole glucosinolates (CYP79B2, CYP79B3, CYP79F2), fatty acids (CYP86A1, CYP94B1), sterols (3-hydroxy-3-methylglutaryl CoA reductase (HMG1)), carotenoids (CYP97C1), flavonoids (chalcone isomerase (CHI2)) and cytokinins (CYP735A1) as well as an unusually large number of CYP705A subfamily members (13 of 26 total). When compiled with those in clusters 1 (constitutive) and 8 (root/shoot), the range of P450s expressed in roots can be expanded to include additional members involved in the synthesis of cytokinins (CYP735A1), glucosinolates (CYP83B1), camalexin (CYP71B15), flavonoids (chalcone synthase (CHS1, CHS2), CHI1), sterols (CYP51G1), isoprenoids and carotenoids (1-deoxy-D-xylulose 5-phosphate synthase (DXS1)), terpenes (IPP2), degradation of abscisic acid (CYP707A3) and two members of the CYP705A subfamily. More important than simply visualizing the complexities of plant biochemistries, these types of cluster analyses narrow the range of P450 candidates mediating functions in this tissue and limit the scope of prospective substrates for each of the functionally undefined P450s expressed in roots.

Table 7 Root-specific P450 cluster from tissue profiling of 7-day-old seedlings and 1-month-old plants

Transcripts detected from pseudogenes

The 27 Arabidopsis P450 pseudogenes, identified because they contain a P450 signature motif embedded within an open reading frame, have open reading frames ranging in length from 102 to 1509 bp (Table 8). The curations of full-length P450 cDNAs described above have indicated that the CYP72A12P pseudogene, which sits immediately downstream of the CYP72A11 locus, is transcribed and terminated at alternative polyadenylation sites upstream or downstream of the CYP72A12P pseudogene, yielding transcripts that terminate either 150 or 400 nt downstream from the CYP72A11 stop codon (Table 8). Sequencing of RT-PCR products derived from this transcription unit has indicated that the longer RT-PCR product corresponds to the CYP72A12P pseudogene embedded in the 3′ UTR of the CYP72A11 transcript. The existence of cDNAs/ESTs for others indicates that the CYP77A5P pseudogene is expressed as part of its adjacent At3g18270 transcription unit (a mandelate racemase family protein) while others are expressed as full-length transcripts containing prematurely truncated P450 ORFs (CYP51G2P; Kim et al. 2005b) or abbreviated transcripts containing partial ORFs (CYP79A3P, CYP705A17P, CYP705A29P).

Table 8 Pseudogene organization and transcripts detected

P450 microarray profiling has made it apparent that transcripts spanning several of the P450 pseudogene elements accumulate to significant levels in vivo. Using an arbitrary cut-off for average signal intensity of 75, four elements (CYP51G2P, CYP72A12P, CYP77A5P, CYP96A14P) stand out as having statistically significant signal intensities greater than this cut-off (Ali et al. 2006a). Average detectable signal intensities for these are 447–1086 for CYP51G2P, 438–788 for CYP72A12P, 125–175 for CYP77A5P and 94–114 for CYP96A14P (Table 8; Ali et al. 2006a). Detection of transcripts derived from these four loci in seedling shoots as well as other tissue samples is consistent with the existence of ESTs for these loci and/or their adjacent transcription units.

P450 microarray profiling also indicates that some nearly full-length P450 pseudogenes are not expressed at any discernible level in any tissue or chemical treatment analyzed. The CYP79A3P pseudogene that is capable of generating a prematurely truncated 467 a.a. protein produces no detectable transcripts in any of the five tissues analyzed despite the existence of a cDNA for this locus. The CYP71B30P and CYP96A9P pseudogenes that lack start codons upstream of their 448 and 275 a.a. ORFs also generate no transcripts as is consistent with the absence of cDNAs/ESTs in current databases.

Complexities of responses to chemical and environmental stresses

More complex than the expression patterns of individual P450 loci in control plant tissues are the responses of these loci to hormones, signaling molecules and environmental stresses. Taking into account the previous cautionary notes on detection of some closely related and low copy P450 transcripts on the global arrays, the expression patterns of P450 loci that are accurately monitored on Affymetrix ATH1 arrays can be assessed in the datasets compiled for different investigators available on the websites for Genevestigator (Zimmerman et al. 2004; https://www.genevestigator.ethz.ch/at/), TAIR (Rhee et al. 2003; http://www.arabidopsis.org/) and GEO (Barrett et al. 2005; http://www.ncbi.nlm.nih.gov/projects/geo/). These profiles, with examples for the MeJ-inducible CYP74A1, ABA-inducible CYP707A1 and root-expressed, MeJ-inducible CYP81F4 shown in Fig. 1, highlight the range of regimes modulating each P450 and the magnitude of their different responses. Sometimes, these datasets are limited in the number of timepoints available for an inducer, the number of chemicals tested individually and the tissues analyzed after a particular treatment. As examples, datasets are available for SA treatment of mature leaves for 2 h and 7-day-old seedlings for 3 h as well as MeJ treatment of seedlings for 30 min, 1 h and 3 h but not for any longer times or for SA and MeJ applied in combination. Virtually no datasets compare treatments with multiple chemicals to those with each of the individual chemicals. And, because these datasets are compiled from many different sources, comparisons of the magnitudes of individual responses are limited between the datasets of different investigators due to variations in labeling conditions and the various manners in which the normalizations have been performed.

Fig. 1 Genevestigator data showing inducibilities for one or two P450s. Induction ratios taken from the website for Genevestigator (Zimmerman et al. 2004; https://www.genevestigator.ethz.ch/at/) are shown for the CYP74A1 (At5g42650), CYP81F4 (At4g37410) and CYP707A1 (At4g19230) loci

Analysis on the focused P450 microarrays of the responses to selected sets of chemicals such as MeJ, SA and BION (1,2,3 benzothiadiazole-7-thiocarboxylic acid S-methyl ester) administered to 7-day-old seedlings individually or in combination and monitored for up to 30 h (Ali et al. 2006b) has demonstrated that P450 loci are modulated in a variety of interacting ways. Using just these datasets, it is possible to find subsets of P450 loci induced additively, synergistically and combinatorially by two or three of these chemicals while other subsets are antagonistically affected by competing responses to these signaling molecules and fungal defense activators. These and other datasets monitoring responses to ABA, IAA, BL, phenobarbital (a mimic for environmental pollutants), cold, drought and osmotic stress have now been able to detail “expression signatures” specific for each of these Arabidopsis P450 loci with induction/repression magnitudes that are statistically significant and intercomparable between experiments. Using several hormone-responsive P450s as examples in Table 9, it becomes clear from the similarities in these expression signatures that P450 loci potentially coding for protein activities in the same or related pathways can be identified as coordinately regulated over a range of inducers and induction regimes. For example, CYP71A19, CYP71B19, CYP71B20, CYP71B26, CYP71B28, CYP76C2, CYP86B1, CYP89A9 and CYP94B3 are induced in response to 3 to 24 h ABA treatments, 3 h IAA treatment and 3 h osmotic stress along with CYP707A1 that is known to mediate ABA inactivation (Table 9). With these similarities clustering these genes in common response groups, differences in their response to other treatments and variations in the timings of their inductions/repressions allow one to discriminate subgroups that are likely to be involved in the same pathway or response.
Another example of the selectivity of these response patterns is CYP78A7, which is the only P450 transcript besides CYP72C1 and CYP734A1 induced in response to short and long term IAA and BL treatments. Profiling at this level against multiple treatments has significant potential for discriminating between P450s that, although similar, moderate different branches in complex synthetic pathways as is the case with CYP85A1 and CYP85A2 in BL synthesis (Shimada et al. 2001; Kim et al. 2005a; Nomura et al. 2005).
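
The idea of grouping loci by shared expression signatures can be sketched in a few lines. This is an illustration only, not the procedure used to build Table 9: the two-fold threshold, the function names, the treatment labels and the placeholder loci and ratios are all assumptions introduced for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Signature = Tuple[int, ...]


def signature(ratios: Dict[str, float], treatments: Sequence[str],
              fold: float = 2.0) -> Signature:
    """Encode one locus as +1 (induced), -1 (repressed) or 0 per treatment.

    ratios maps treatment -> treated/control signal ratio; `fold` is an
    assumed threshold, not a value taken from the text.
    """
    calls = []
    for treatment in treatments:
        ratio = ratios[treatment]
        if ratio >= fold:
            calls.append(1)
        elif ratio <= 1.0 / fold:
            calls.append(-1)
        else:
            calls.append(0)
    return tuple(calls)


def group_by_signature(profiles: Dict[str, Dict[str, float]],
                       treatments: Sequence[str],
                       fold: float = 2.0) -> Dict[Signature, List[str]]:
    """Collect loci that share an identical signature across all treatments."""
    groups: Dict[Signature, List[str]] = defaultdict(list)
    for locus in sorted(profiles):
        groups[signature(profiles[locus], treatments, fold)].append(locus)
    return dict(groups)


# Hypothetical ratios for placeholder loci (illustration only).
TREATMENTS = ["ABA_3h", "IAA_3h", "osmotic_3h"]
DEMO_PROFILES = {
    "locus_A": {"ABA_3h": 5.0, "IAA_3h": 2.5, "osmotic_3h": 3.0},
    "locus_B": {"ABA_3h": 8.0, "IAA_3h": 2.0, "osmotic_3h": 4.1},
    "locus_C": {"ABA_3h": 0.4, "IAA_3h": 1.1, "osmotic_3h": 0.9},
}
DEMO_GROUPS = group_by_signature(DEMO_PROFILES, TREATMENTS)
```

Loci falling into the same group are candidates for coordinate regulation; adding further treatments or timepoints to the treatment list splits these groups into the finer subgroups described in the text.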

Table 9 Expression signature for coordinately regulated genes

These comparisons also make it evident that the rapidity of responses to particular chemicals has significant potential for identifying P450 loci mediating the synthesis of regulatory molecules. The usefulness of evaluating expression kinetics has been especially apparent in the case of CYP94B1, where transcripts have been shown to be rapidly and transiently induced after MeJ treatment and whose protein has been shown to hydroxylate the plant signaling compound 9,10-epoxystearic acid (Civjan et al. 2006). Other examples of the rapid induction of P450s regulating hormones and signaling molecules exist in the set of four CYP707A proteins that inactivate ABA (Kushiro et al. 2004; Saito et al. 2004) and the CYP734A1 and CYP72C1 proteins that inactivate BL (Neff et al. 1999; Turk et al. 2003; Nakamura et al. 2005; Takahashi et al. 2005). Because of their important roles in maintaining hormone homeostasis, these loci respond rapidly and, in some cases, quite transiently after hormone treatment (Table 9).

Determinants of substrate specificities

The story describing the functional diversities of these many up- and down-regulated P450s evolves when one begins to compare the secondary and tertiary structures of P450s not just in Arabidopsis and other plants but in all organisms that contain them. These comparisons, which are further detailed in Rupasinghe and Schuler (2006) in this volume and Graham and Petersen (1999), indicate that most P450s have maintained secondary and tertiary structural conservations that are manifested in a core structure containing α-helices (labeled A-K) and β-pleated sheets (labeled 1–4) surrounding a buried catalytic site (Graham and Petersen 1999; Stout 2004; Poulos and Johnson 2005). Site-directed mutagenesis studies on closely related P450 proteins in the vertebrate CYP2 family have identified several substrate recognition sequences (termed SRS1-6 by Gotoh (1992)) as important for substrate metabolism as well as substrate access (Domanski and Halpert 2001). Among these, SRS1 corresponds to the loop region between the B- and C-helices positioned over the heme, SRS2 and SRS3 correspond to the F- and G-helices comprising part of the substrate access channel, SRS4 corresponds to the I-helix extending over the heme pyrrole ring B, SRS5 and SRS6 correspond to the amino-terminus of β-strand 1–4 and the β-turn at the end of β-sheet 4, respectively, which both protrude into the catalytic site.

Viewed from the perspective of these three-dimensional structures, substrate specificity in the Arabidopsis P450s is actually defined by a small number of regions that encompass the catalytic site as fingers on your hand might hold a space-filling model of a compound. Variations in the length of your fingers and/or their position change the position of the structure relative to the fixed plane that it sits above and any supports that surround it. Returning from this analogy to the protein sequences, increases and decreases in the lengths of the protein backbone as well as changes in the charges and sizes of a few catalytic site loops significantly impact the type of compounds that can be positioned over the heme plane and their position relative to the catalytically important I-helix that extends through the catalytic site much like a barre in a ballet studio.

Analyzed at this level, Arabidopsis P450 catalytic sites exhibit varying degrees of sequence diversity that do not necessarily map to their phylogenetic classifications (i.e., family, subfamily designations). There exist examples of closely related P450s that differ in a few residues within most of their SRS regions and mediate similar reactions (e.g., CYP86A subfamily, Rupasinghe et al. 2006; CYP707A subfamily, Kushiro et al. 2004; Saito et al. 2004) and examples of divergent P450s in completely different families that modify related aromatic substrates in the same manner (e.g., CYP73A5, CYP75B1, CYP84A1, CYP98A3; Rupasinghe et al. 2003). Examples of the most closely related P450s that differ in only one or two residues in a single SRS and mediate different reactions, such as Mentha spicata CYP71D15 and M. piperita CYP71D18 (Schalk and Croteau 2000), have not yet been identified in Arabidopsis. ClustalW alignments of Arabidopsis P450 representatives from each of its subfamilies have indicated that the length variations potentially affecting substrate interactions are largely limited to the region between SRS2 and SRS3 where the loop between the F- and G-helices possibly interacts with the membrane and/or affects substrate access (Rupasinghe and Schuler 2006). Many other sequence variations that occur in the SRS1, SRS4, SRS5 and SRS6 regions (that do not vary in length) directly impact the binding properties of substrates and it is in these regions that sequence variations in closely related subfamily members allow individual proteins to metabolize the same substrate at different positions.

P450 functions defined by in vitro expression strategies

The membrane-bound nature of these proteins has created special challenges for defining their functionalities. One predominating complication arises at the protein level from the need for ER-localized P450s to pair with co-localized membrane-bound electron transfer partners such as NADPH P450 reductase and cytochrome b5/b5 reductase. Soluble P450s targeted to other subcellular locations (i.e., chloroplasts and mitochondria) utilize soluble electron transfer partners that are not restricted in their quantities or location. Details on the strategies being used for expression analysis of these P450s are covered in Duan and Schuler (2006) in this volume.

Because of these potential problems, the functions of only a small number of P450 genes present in plants have been defined by clearly establishing enzyme specificity at a biochemical level and relating it to one or more biological functions in planta. Even so, Arabidopsis ranks among the species with the most P450s functionally defined (Schuler and Werck-Reichhart 2003; http://arabidopsis-p450.biotec.uiuc.edu/functions.pdf; http://arabidopsis.org/info/genefamily/P450_functions.html) with currently 41 of its full-length genes having discrete functions (Table 10) assigned using heterologous expression systems or T-DNA knockout analyses. Functions for the remaining loci are being defined with strategies that combine knowledge of their expression profiles with predictive modeling of their catalytic sites and substrate binding assays.

Table 10 P450 functions defined in Arabidopsis

P450 functions defined by genetic mutations

Characterized genetic mutations in P450 loci remain limited with all currently published mutants listed in Table 11. Not unexpectedly, many of the mutant lines with obvious morphological defects have resulted from T-DNA insertions within their coding regions that effectively silence P450 transcript production and/or accumulation. Examples of these include the earliest CYP90A1 (cpd, cbb3) and CYP90B1 (dwf4) knockout lines characterized for their involvement in brassinosteroid synthesis (Kauschmann et al. 1996; Szekeres et al. 1996; Choe et al. 1998; Azpiroz et al. 1998; Fujita et al. 2006), the CYP84A1 (fah1) knockout line characterized for its involvement in sinapoyl ester synthesis (Meyer et al. 1996), the CYP83B1 (sur2) knockout line characterized for its involvement in indole glucosinolate synthesis (Winkler et al. 1998) as well as the more recently identified CYP51G1 (cyp51a2) knockout lines disrupting obtusifoliol 14α-demethylase and, hence, sterol production (Kim et al. 2005b).

Table 11 Arabidopsis P450 mutants updated

From the perspective of the P450 molecular models mentioned previously, the mutant lines carrying EMS-derived codon changes are even more interesting. Lending support to various models, eleven changes have been identified as disrupting functions in eight P450s (Table 11). Examples of these that exist in predicted SRS regions are the R309C change in the CYP86A2 att1–1/hsr2–1 mutant (Xiao et al. 2004; M. Bevan, personal communication) that immediately precedes the highly conserved (D/E)T in the I-helix (SRS4) and the P380S change in the hsr2–2 mutant that occurs in SRS5 and is predicted to interfere with positioning of the adjacent S381-V382 side chains for catalytic site contacts with fatty acid substrates (Rupasinghe et al. 2006). Other examples are the G176E change in the CYP71B15 pad3–2 mutant that occurs in the F-helix (at the beginning of SRS2), the A291V change in the CYP83B1 atr4–2 mutant (Smolen and Bender 2002) that occurs in SRS4, the E283K change in the CYP97A3 lut5–2 mutant (Kim and DellaPenna 2006) that occurs in SRS2 and the P117L change in the CYP711A1 max1–1 mutant (Booker et al. 2005) that occurs in the B’-helix (SRS1). Others existing in recognizable structural components of these proteins outside of the SRS are the G444E change in the CYP83A1 ref2–4 mutant (Hemm et al. 2003) that occurs immediately downstream from the heme cysteine and the G444D change in the CYP98A3 ref8 mutant (Franke et al. 2002) that occurs in the L-helix that interacts with the heme. Yet others exist in regions not obviously affecting catalytic site binding and include the G58E change in the CYP90C1 rot3–2 mutant (Kim et al. 1998) that occurs in the region preceding the A’-helix where it possibly affects the structure of the adjacent β-strand 1 and the R438W change in the CYP83B1 atr4–1 mutant (Smolen and Bender 2002) that occurs in the K”-helix loop which potentially interacts with P450 reductase.

The availability of large collections of insertion lines from the SALK Institute (T-DNA insertions: http://signal.salk.edu/tabout.html), SAIL (T-DNA insertions: http://www.arabidopsis.org/abrc/sail.jsp), GABI-Kat (T-DNA insertions: http://www.gabi-kat.de/), FLAG (T-DNA insertions: http://urgv.evry.inra.fr/projects/FLAGdb++/HTML/index.shtml), Wisconsin (Ds-Lox insertions: http://www.hort.wisc.edu/krysan/Ds-Lox/), RIKEN (Ds transposon insertions: http://rarge.gsc.riken.jp/dsmutant/index.pl), GARNet-JIC (Ds-Spm insertions: http://garnet.arabidopsis.org.uk/transposons_for_functional_genomics.htm) and CSHL (gene trap and enhancer trap insertions: http://genetrap.cshl.org/) has made it possible to begin the characterization of knockout lines for the large number of remaining P450 loci whose transcripts are constitutively or inducibly expressed at some level in Arabidopsis. With the high level of insertional saturation in the genome and the expectation that all genes will be targeted with equal efficiency, it is notable that 23 full-length P450 genes still have no insertions identified within the body of their coding and intron sequences. With T-DNA knockout lines existing for several critical single-copy P450 loci (e.g., CYP73A5, CYP74A1, CYP90A1) that can be propagated as heterozygotes, the absence of insertions within these other P450 loci suggests that hemizygous knockouts containing even a single copy insertion are not viable in the processes used to construct and propagate these collections.

Summary

The view of P450-catalyzed reactions through the window of Arabidopsis biochemistry is becoming significantly more complex than originally thought when the very first P450 proteins were being purified and characterized. Rather than falling into a rabbit hole (terrier lapin, kaninchenhoehle or usagi no ana depending on your linguistic perspectives) full of confounding chemical substances and interconnecting pathways, explorations of the P450 molecular landscape are being enhanced by the large number of tools now available for monitoring P450 transcript levels, predicting protein structures and measuring chemical affinities as well as the genetic tools tying biochemistry to physiological functions.

The range of P450 genes and pseudogenes in other plant genomes is significantly less clear since, without comprehensive sequencing projects, sequences in many of these species have been identified individually as researchers have attempted to clone cDNAs coding for particular metabolic reactions. Their successes have uncovered an ever-expanding collection of P450 proteins and diverse metabolic reactions (Schuler and Werck-Reichhart 2003) that provide further evidence of the complex biochemistries that exist outside the window of Arabidopsis biochemistry. With the Oryza genome representing the only available annotated genome whose sequences can be compared with those in Arabidopsis (Nelson et al. 2004), it is already clear that many single copy P450 gene families in Arabidopsis have been duplicated to create series of related loci whose proteins may or may not have functions related to those already characterized in Arabidopsis. With the range of genetic engineering tools more limited in Oryza and other monocots, defining functions for many of these will depend on building connections to their closest Arabidopsis relatives via molecular modeling of their catalytic sites, heterologous expression and reconstitution of their activities and, potentially, complementation analysis of Arabidopsis knockout lines that are now being characterized. Although complex, the view through the looking glass is clearing to reveal a set of monooxygenases integrally tied to diversification in plant biochemical pathways and defense responses.