Comparative studies of vertebrate iduronate 2-sulfatase (IDS) genes and proteins: evolution of a mammalian X-linked gene

IDS is responsible for the lysosomal degradation of heparan sulfate and dermatan sulfate and is linked to an X-linked lysosomal storage disease, mucopolysaccharidosis 2 (MPS2), which results in neurological damage and early death. Comparative IDS amino acid sequences and structures and IDS gene locations were examined using data from several vertebrate genome projects. Vertebrate IDS sequences shared 60–99% identities with each other. Human IDS showed 47% sequence identity with fruit fly (Drosophila melanogaster) IDS. Sequence alignments, key amino acid residues, N-glycosylation sites and conserved predicted secondary and tertiary structures were also studied, including signal peptide, propeptide and active site residues. Mammalian IDS genes usually contained 9 coding exons. The human IDS gene promoter contained a large CpG island (CpG46) and 5 transcription factor binding sites, whereas the 3′-UTR region contained 5 miRNA target sites. These may contribute to the regulation of IDS gene expression in the brain and other neural tissues of the body. An IDS pseudogene (IDSP1) was located proximally to the IDS gene on the X-chromosome in primate genomes. Phylogenetic analyses examined the relationships and potential evolutionary origins of the vertebrate IDS gene. These suggested that IDS originated in an invertebrate ancestral genome, was retained throughout vertebrate evolution and was conserved on marsupial and eutherian X-chromosomes, with the exception of rat Ids on chromosome 8.
Electronic supplementary material The online version of this article (doi:10.1007/s13205-016-0595-3) contains supplementary material, which is available to authorized users.

Introduction
families which catalyze the hydrolysis of sulfate esters in the body derived from several catabolic pathways (Ratzka et al. 2010). Many IDS gene mutations and IDS deficiencies have been studied in human populations; these result in the lysosomal storage of glycosaminoglycans and in Hunter syndrome, an X-linked disease referred to as mucopolysaccharidosis type 2 (MPS2) (Wilson et al. 1990; Rathmann et al. 1996; Chistiakov et al. 2014; Kosuga et al. 2016). Major clinical features of this rare genetic disease (1:100,000 births) include obstructive and restrictive airway disease, skeletal deformations, cardiac disease, joint contractures and mental retardation (Beck 2011; Tylki-Szymańska 2014; Anekar et al. 2015). Mouse and zebrafish animal models have been used to study the disease in more detail, including studies of Ids−/Ids− knock-out mice, which have shown that IDS deficiency generates many of the defects reported for human MPS2 (Garcia et al. 2007). In addition, possible treatments for the disease by enzyme replacement therapy have been investigated (Garcia et al. 2007; Moro et al. 2010; Fusar Poli et al. 2013; Cho et al. 2015; Parini et al. 2015) and a phase I/II clinical trial of intrathecal IDS replacement therapy in children with severe MPS2 has recently been reported (Muenzer et al. 2016).
The gene encoding IDS (IDS in primates; Ids in rodents) is expressed at high levels in neural tissues, particularly in the cortex, hippocampus, other brain and eye tissues, and is also widely expressed throughout the body (Smith et al. 2014). The enzyme catalyzes the first step in the degradation of the glycosaminoglycans dermatan sulfate and heparan sulfate. Human IDS is expressed as three major isoforms which have distinct C-terminal sequences: IDSa, encoding a 550 amino acid protein, expressed in brain tissues and with a wide tissue distribution; IDSb, encoding a 460 amino acid protein also expressed in brain tissues; and IDSc, encoding a 446 amino acid enzyme expressed in ductal carcinoma cells and pancreas (Thierry-Mieg and Thierry-Mieg 2006). The genomic organization of the human and mouse IDS/Ids genes has been reported, with 9 exons observed over 24 kb and 22 kb of DNA, respectively (Wilson et al. 1993; Thierry-Mieg and Thierry-Mieg 2006).
This paper reports the predicted gene structures and amino acid sequences for several vertebrate IDS genes and proteins, the predicted structures for vertebrate IDS proteins, a number of potential sites for regulating human IDS gene expression and the structural, phylogenetic and evolutionary relationships for these genes and enzymes.

Methods
Vertebrate IDS gene and protein identification
BLAST studies were undertaken using web tools from NCBI (http://www.ncbi.nlm.nih.gov/) (Camacho et al. 2009). Protein BLAST analyses used human and mouse IDS amino acid sequences previously described (Garcia et al. 2007) (Table 1). Protein sequence databases for several vertebrate genomes were examined using the blastp algorithm (see Holmes 2016). Predicted IDS protein sequences were obtained in each case and subjected to analyses of predicted protein and gene structures.
BLAT analyses were subsequently undertaken for each of the predicted IDS amino acid sequences using the UC Santa Cruz (UCSC) Genome Browser with the default settings to obtain the predicted locations for each of the vertebrate IDS genes, including predicted exon boundary locations and gene sizes (Kent et al. 2002). BLAT analyses were similarly undertaken for other vertebrate IDS genes using previously reported sequences in each case (Table 2). Structures for human isoforms (splicing variants) were obtained using the AceView website to examine predicted gene and protein structures (Thierry-Mieg and Thierry-Mieg 2006).

Predicted structures and properties of vertebrate IDS
Predicted secondary and tertiary structures for vertebrate IDS proteins were obtained using the SWISS-MODEL web-server (http://swissmodel.expasy.org/) (Schwede et al. 2003), based on the reported tertiary structure for human arylsulfatase A (ARSA) (Lukatela et al. 1998; Chruszcz et al. 2003) (PDB:1n2kA), with a modeling range of residues 35-549 for human IDS. Molecular weights, N-glycosylation sites and signal peptide cleavage sites for vertebrate IDS proteins were obtained using Expasy web tools (http://au.expasy.org/tools/pi_tool.html). The identification of conserved domains for IDS was conducted using NCBI web tools (Marchler-Bauer et al. 2011).

Human IDS tissue expression
RNA-seq gene expression profiles across 53 selected tissues (or tissue segments) were examined from the public database for human IDS, based on expression levels for 175 individuals (GTEx Consortium 2015).

Amino acid sequence alignments and phylogenetic analyses
Alignments of vertebrate and Drosophila melanogaster IDS sequences were undertaken using Clustal Omega, a multiple sequence alignment program (Sievers and Higgins 2014) (Table 1). Percentage identities were derived from the results of these alignments (Table 1). Phylogenetic analyses used several bioinformatic programs, coordinated using the http://www.phylogeny.fr/ bioinformatic portal, to enable alignment (MUSCLE), curation (Gblocks), phylogeny (PhyML) and tree rendering (TreeDyn), to reconstruct phylogenetic relationships (Dereeper et al. 2008). Sequences were identified as vertebrate IDS members and a proposed primordial Drosophila melanogaster IDS gene and protein (Tables 1, 2).
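The percentage identities described above can be illustrated with a minimal sketch. It assumes one common convention, in which identical residues are counted over the alignment columns where neither sequence carries a gap; the exact denominator used by Clustal Omega may differ. The function name and the two short aligned fragments are hypothetical and are not taken from the IDS alignments.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap.

    Assumption: gaps are written as '-' and both strings come from the same
    multiple sequence alignment, so they have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical aligned fragments, for illustration only.
toy_a = "MKT-LLAVG"
toy_b = "MRTALL-VG"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

Applying such a function to every pair of rows in an alignment would give a matrix of the kind summarized in Table 1.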

Alignments of vertebrate IDS amino acid sequences
The deduced amino acid sequences for frog (Xenopus tropicalis) and zebrafish (Danio rerio) IDS are shown in Fig. 1 together with previously reported sequences for human and mouse IDS (Garcia et al. 2007) (Table 1). Human IDS shared between 60 and 99% sequence identity with the other vertebrate IDS sequences examined, suggesting that these are products of the same family of genes, whereas comparisons of vertebrate IDS proteins with other human ARS proteins exhibited ≤27% identities, indicating that these are members of distinct ARS-like gene families (Table 1; Supplementary Table 1).
The amino acid sequences for vertebrate IDS proteins contained 550-561 amino acids (Fig. 1; Table 1). Previous studies have reported several key regions and residues for human and mouse IDS proteins (human IDS amino acid residues were identified in each case). These included an N-terminus leader peptide (24 residues excluding the N-terminus methionine) followed by a propeptide 8-residue segment (residues 25-33) (Wilson et al. 1990). A comparison of 10 mammalian IDS sequences for these N-terminal exon 1 regions revealed species-specific variability in these sequences, with the signal peptides containing multiple proline and hydrophobic residues, and the propeptides exhibiting distinct mammalian sequences (see Figs. 1, 2). In contrast, amino acid sequences located further downstream within exon 2, nearer to the active site catalytic residues (Asp45; Asp46), were predominantly invariant among the mammalian and other vertebrate sequences examined (Figs. 1, 2). One of the conserved active site residues observed for these mammalian and other vertebrate IDS sequences was a catalytic residue (Cys84) which undergoes post-translational modification by sulfatase modifying factor 1 (SUMF1) to form Cα-formylglycine (FGly), required at the active site of many sulfatases (Sardiello et al. 2005). Other invariant active site residues included Asp334/His335, which are likely to be involved in Ca2+ binding, based on predictions derived from 3D structures of other human sulfatases (Bond et al. 1997; Hernandez-Guzman et al. 2003). An internal proteolytic cleavage has been proposed for this enzyme as a result of the presence of 42- and 14-kD polypeptides in enzyme preparations derived from human liver, kidney, lung and placenta extracts (Wilson et al. 1990) (Fig. 1). It should be noted that the 42-kD polypeptide contains the N-terminal sequence with all of the active site regions, whereas the 14-kD polypeptide contains the catalytically inactive C-terminus region of human IDS.
Fig. 1 Amino acid sequence alignments for vertebrate IDS sequences. See Table 1 for sources of IDS sequences; asterisk shows identical residues for IDS subunits; colon similar alternate residues; dot dissimilar alternate residues; predicted phosphoresidues are in pink; predicted N-glycosylated Asn sites are in green; the active site residues (for human IDS) are shown in blue; active site residue subject to modification is shown as A; predicted α-helices for human IDS are in shaded yellow and numbered in sequence; predicted β-sheets are in shaded gray and also numbered in sequence from the N-terminus; bold underlined font shows residues corresponding to known or predicted exon start sites; exon numbers refer to human IDS gene exons; leader peptide is in brown; propeptide in red
Five N-glycosylation sites were consistently found for vertebrate IDS sequences (human IDS amino acid residues identified in each case): Asn115-Phe116-Ser117 (site 1); Asn144-His145-Thr146 (site 2); Asn246-Ile247-Thr248 (site 3); Asn280-Ile281-Ser282 (site 4); and Asn513-Phe514-Ser515 (site 5). Two other N-glycosylation sites were observed for human IDS which were not commonly shared with other vertebrate IDS sequences: Asn325-Ser326-Ser327 (site 6) and Asn537-Asp538-Ser539 (site 7), the latter restricted to mammalian IDS sequences (Fig. 1; Table 1). Mutation analysis of the human IDS gene has shown that amino acid substitution of Asn115 (Asn→Tyr) (for site 1) resulted in Hunter's disease, reflecting the key role of this N-glycosylation site in supporting the structure of this enzyme (Vafiadaki et al. 1998). Figure 1 also shows predicted phosphosites that may contribute to regulating downstream cellular processes, molecular functions and protein-protein interactions (Hornbeck et al. 2015). Five of these were strictly conserved among the vertebrate IDS sequences examined (human IDS residues: Ser282; Tyr285; Thr409; Tyr490; and Tyr497), supporting a role for these residues that is as yet unknown.
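The N-glycosylation sites listed above follow the general Asn-X-Ser/Thr sequon pattern, where X is any residue other than proline. As a minimal sketch, such candidate sites can be located in a one-letter protein sequence with a regular expression. The function name and the short test peptide are hypothetical; this scan is not the Expasy tool used in this study, and a real prediction would be run on the full IDS sequences from Table 1.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(protein: str) -> list[tuple[int, str]]:
    """Return (1-based Asn position, tripeptide) for each N-X-S/T match, X != P."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein.upper())]


# Hypothetical peptide, for illustration only (not an IDS sequence).
toy_peptide = "MANFSGNPTQNITK"
for position, tripeptide in find_sequons(toy_peptide):
    print(f"Asn{position}: {tripeptide}")
```

The Asn-Pro-Thr tripeptide in the toy peptide is skipped because proline at the middle position is excluded by the pattern. A sequence match only indicates a potential site and does not show that the site is glycosylated.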

Predicted secondary and tertiary structures for vertebrate IDS
A predicted secondary structure for the human IDS sequence was examined (Fig. 1) using the known structure reported for human ARSA (Lukatela et al. 1998). Ten predicted α-helix and 21 β-sheet structures were observed for human IDS. Of particular interest were the β-sheet structures (β1 and β11) and α-helix (α2) which were located proximate to the predicted active site residues for human IDS. The C-terminal end of human IDS contained a sequence of β-sheet structures (β15-β21), in addition to the α-helix (α10) located at the C-terminus. A predicted tertiary structure for human IDS is shown in Fig. 3. Two major domains were observed for this enzyme, which enclose a large cavity previously shown to contain the enzyme's active site. The more N-terminal of these domains contained the active site residues and comprised the bulk of the 42-kD polypeptide chain previously reported (Wilson et al. 1990), whereas the other domain comprised most of the 14-kD polypeptide, including the β-sheet structures (β15-β21) and the C-terminal α-helix (α10).
See Table 1 for sources of IDS sequences; asterisk shows identical residues for IDS subunits; colon similar alternate residues; dot dissimilar alternate residues; the active site residues (for human IDS) are shown in blue; leader peptide is in brown; propeptide in red; bold underlined font shows residues corresponding to known or predicted exon start sites; exon numbers refer to human IDS gene exons
org/workspace/. The rainbow color code describes the 3-D structures from the N- (blue) to C-termini (red) for residues 35-549 of human IDS; predicted α-helices, β-sheets, proposed active site cleft, and N- and C-termini are shown

Comparative human IDS tissue expression
RNA-seq gene expression profiles for human IDS were examined across tissues or tissue segments for 175 individuals (GTEx Consortium 2015) (Data Source: GTEx Analysis Release V6p (dbGaP Accession phs000424.v6.p1)) (http://www.gtex.org). These data supported high levels of gene expression for human IDS in regions of the brain, particularly within the cortex, amygdala, hippocampus, hypothalamus and basal ganglia, but with lower levels in the cerebellum and spinal cord. IDS expression was also widely distributed at low levels among all other tissues examined. IDS is therefore predominantly expressed in brain and nerve tissues of the body, which may reflect a specific role for IDS in neural glycosaminoglycan (GAG) metabolism, involving the efficient clearance of GAG sulfate residues within the extracellular matrix of nervous tissue.

Gene locations, exonic structures and regulatory sequences for vertebrate IDS genes
Table 2 summarizes the predicted locations for vertebrate and fruit fly (Drosophila melanogaster) IDS genes based upon BLAT interrogations of several genomes using the reported sequence for human IDS (Wilson et al. 1990), the predicted sequences for other IDS enzymes and the UCSC genome browser (Kent et al. 2002). The predicted vertebrate IDS genes were transcribed on either the negative strand (primates, mouse, rat, cow, marsupial and zebrafish genomes) or the positive strand (sheep, chicken, lizard and frog genomes). Of particular interest is the X-chromosome location of IDS for all eutherian and marsupial mammals examined, with the exception of the rat Ids gene, which is located on an autosome (chromosome 8). This is indicative of a chromosomal transfer between the common ancestral X-chromosome and chromosome 8 during rat evolution. An IDS pseudogene (designated as IDSP1) was also observed for human and other primate genomes. Figure 1 summarizes the predicted exonic start sites for human, mouse, frog and zebrafish IDS genes, with each having 9 coding exons in identical or similar positions to those predicted for the human IDS gene. In each case, exon 1 encoded the leader peptide and propeptide, with exons 2, 3 and 7 encoding the predicted active site regions for this enzyme.
Figure 5 shows the predicted structures for the three major human IDS transcripts (IDSa, IDSb and IDSc) together with CpG46 and several transcription factor binding sites (TFBS), which are located at the 5′ end of the gene, consistent with roles in regulating the transcription of this gene and forming part of the IDS gene promoter.
The human IDSa transcript was 6088 bp in length with an extended 3′-untranslated region (UTR) containing 5 microRNA target sites; the human IDSb transcript was 5808 bp in length, also containing 5 microRNA target sites; whereas the IDSc transcript was much shorter (2213 bp), comprising only 8 coding exons with no microRNA target sites present. The presence of a miR-200 target site within the 3′-UTR of the human IDS gene was of special interest because this miR family is induced and has a specific role during the late stages of neuronal differentiation (Beclin et al. 2016). In addition, the presence of a miR-7 target site in this region may also be significant given that miR-7 inhibits neuronal apoptosis in a cellular Parkinson's disease model (Li et al. 2016) and contributes to the alteration of neuronal morphology and function. Moreover, miR-203 has a proposed role as a stemness inhibitor of glioblastoma stem cells and may contribute to the increased expression of glial and neuronal differentiation markers (Deng et al. 2016).
The human IDS genome sequence also contained several predicted transcription factor binding sites (TFBS) and a large CpG island (CpG46) located in the 5′-untranslated promoter region of human IDS on the X-chromosome. CpG46 contained 432 bp with a C plus G count of 279 bp, a C or G content of 65% and a ratio of observed to expected CpG of 1.02. Similar CpG islands were observed in the IDS gene promoters for other primate, eutherian mammal, marsupial (opossum) and bird (chicken) genomes (Table 3). It is therefore likely that these IDS CpG islands play a key role in regulating this gene and may contribute to the very high level of gene expression observed in neural tissues (Fig. 4) (Saxanov et al. 2006). At least 5 TFBS were co-located with CpG46 in the human IDS promoter region, which may contribute to the high expression of this gene in human nerve and brain tissues (Table 4). Of special interest among these transcription factor binding sites were the following: BACH1 and BACH2 have been recognized as members of the BTB-basic region leucine zipper transcription factor family which downregulate cell proliferation of neuroblastoma cells (Shim et al. 2006); AP1 is constitutively upregulated in activated microglia and during the pathogenesis of Parkinson's disease (Pal et al. 2016); NFE2 has been shown to participate in the developmental regulation of the brain in zebrafish embryos (Williams et al. 2013); and XBP1 has been identified as a risk factor for Alzheimer's disease and bipolar disorders, contributing to impairment of contextual memory formation (Martinez et al. 2016).
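The CpG island statistics quoted above can be checked with a short sketch. It assumes the widely used Gardiner-Garden and Frommer ratio, observed/expected CpG = (number of CpG × N) / (number of C × number of G), where N is the island length. Two inputs are assumptions rather than values from the text: reading the name CpG46 as a count of 46 CpG dinucleotides, and splitting the reported C plus G count of 279 into 139 C and 140 G. The function names and the toy sequence are hypothetical.

```python
def obs_exp_cpg(cpg_count: int, length: int, c_count: int, g_count: int) -> float:
    """Observed/expected CpG ratio: (CpG * N) / (C * G)."""
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * length / (c_count * g_count)


def cpg_island_stats(sequence: str) -> dict[str, float]:
    """Length, C+G fraction and observed/expected CpG ratio for a DNA sequence."""
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    length = len(seq)
    return {
        "length": length,
        "gc_fraction": (c_count + g_count) / length if length else 0.0,
        "obs_exp": obs_exp_cpg(cpg_count, length, c_count, g_count),
    }


# Values reported for CpG46: 432 bp, C plus G count of 279.
gc_percent = 100 * 279 / 432
# Assumed inputs (not given in the text): 46 CpG dinucleotides, 139 C and 140 G.
ratio = obs_exp_cpg(cpg_count=46, length=432, c_count=139, g_count=140)
print(f"C+G content: {gc_percent:.0f}%, observed/expected CpG: {ratio:.2f}")

# Toy sequence showing the sequence-based calculation.
print(cpg_island_stats("ACGTCGCG"))
```

Under these assumptions the sketch reproduces the reported 65% C or G content and the ratio of 1.02, which suggests that the reported figures are internally consistent.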

Phylogeny and divergence of vertebrate IDS
A phylogenetic tree (Fig. 6) was calculated by the progressive alignment of 15 vertebrate IDS amino acid sequences with several other human ARS-like sequences (see Table 3). The IDS phylogram was 'rooted' with the fruit fly (Drosophila melanogaster) IDS sequence (see Table 1). The phylogram showed clustering of the IDS sequences into a single group which is represented throughout vertebrate evolution and has apparently evolved from an invertebrate IDS gene ancestor.
shown with capped 5′- and 3′-ends for the predicted mRNA sequences; NM refers to the NCBI reference sequence; coding exons are in pink; the direction for transcription is shown as 5′ → 3′; a large CpG46 island at the gene promoter is shown (see Table 4 for details of CpG islands for human and other vertebrate IDS gene promoters); 5 predicted transcription factor binding sites (TFBS) for human IDS are shown (see Table 1s for details); 5 predicted miRNA target sites were identified within the extended 3′-UTR region of human IDSa and IDSb transcripts
The identification of TFBS within the IDS promoter region was undertaken using the human genome browser (http://genome.ucsc.edu); UNIPROT refers to UniprotKB/Swiss-Prot IDs for individual TFBS sequences (see http://kr.expasy.org); ER refers to endoplasmic reticulum

Conclusions
The current results indicate that vertebrate IDS genes and encoded proteins represent a distinct gene and protein family of ARS-like proteins. IDS has a distinct property among human arylsulfatases in being responsible for the lysosomal degradation of the glycosaminoglycans heparan sulfate and dermatan sulfate, by hydrolysing the 2-sulfate groups of the L-iduronate 2-sulfate units. IDS is encoded by a single gene among the vertebrate genomes examined, is highly expressed in human brain and other nerve tissues, and contains 9 coding exons on the negative strand of the human genome. Primate genomes contained an IDS pseudogene (IDSP1) located in a proximal position on the X-chromosome. The promoter region of the human IDS gene contained a large CpG island together with at least 5 TFBS, which may contribute to the high level of gene expression in the brain. In addition, 5 microRNA target sites were observed within the extended 3′-UTR of the human IDS gene, which may be implicated in regulating gene expression during brain development. Predicted secondary and tertiary structures for human IDS showed strong similarities with other ARS-like proteins. Several major structural domains were apparent for mammalian IDS, including the N-terminal leader peptide and propeptide regions; the active site (including a calcium binding site), which is responsible for arylsulfatase activity; and five conserved N-glycosylation sites. Phylogenetic studies using 15 vertebrate and one invertebrate (Drosophila melanogaster) IDS sequences indicated that the IDS gene appeared early in evolution, prior to the appearance of bony fish.

Compliance with ethical standards
Conflict of interest The author declares that he has no conflicts of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.