Introduction

With an estimated 1,171 million inhabitants, India is second only to China in population numbers and currently accounts for over 17% of the global population (PRB 2009). Unlike China where some 90% of the population are of Han origin (Black et al. 2007), India has multiple geographical, ethnic, religious and language divisions (Bittles 2002). As the peoples of India have traditionally married and reproduced within these sub-divisions, major problems are encountered in estimating the impact of genetic disease at national, regional, state or even local levels. Data of this nature are, however, essential as despite the current national infant mortality rate of 55/1000 (PRB 2009), there is an increasingly rapid transition in the burden of disease across all age groups from a primarily communicable to a non-communicable pattern, with non-communicable diseases already estimated to account for 42% of deaths (Census of India 2001–2003).

The haemoglobinopathies typify these issues. It has been estimated that the prevalence of pathological haemoglobinopathies in India is 1.2/1,000 live births (Christianson et al. 2006), and with approximately 27 million births per year (PRB 2009) this would suggest the annual birth of 32,400 babies with a serious haemoglobin disorder. Within this overall disease classification a 1989 WHO Working Group on guidelines for the control of haemoglobin disorders estimated a 3.9% carrier frequency for β-thalassaemia in India, encompassing all types of β-thalassaemia trait (WHO 1989). This estimate was mainly derived from data collected prior to 1984 and relied on basic haematological methods of analysis supplemented by information sourced from Livingstone (1985). However, in the absence of more comprehensive, quantitative epidemiological information it continues to be widely cited as the baseline national prevalence for β-thalassaemia in India.

A WHO update on β-thalassaemia in India indicated a similar overall carrier frequency of 3–4%, which given the current national population would translate to between 35.1 and 46.8 million carriers of the disorder nationwide (WHO 2008; PRB 2009). At the same time, a screening project based on 56,814 college students and pregnant women recruited in the states of Maharashtra, Gujarat, Punjab, Karnataka, West Bengal and Assam indicated a carrier rate of 2.78% (Mohanty et al. 2008). These different carrier frequency estimates have been used to approximate the numbers of new affected births per year, which have been calculated to range from 10,000 to 15,000 cases (Edison et al. 2008; Sheth et al. 2008; Tamhankar et al. 2009), of which 8,000–10,000 would present with a severe form of the disease (Colah et al. 2009). If accurate, the figures would indicate a cumulative total of 100,000 children with thalassaemia major in India (WHO 2008).
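The arithmetic behind the birth and carrier figures quoted above can be checked in a few lines. The following Python sketch uses only values cited in the text (PRB 2009; Christianson et al. 2006; WHO 2008). The function and variable names are illustrative assumptions and are not taken from any of the cited sources.

```python
# Inputs are the figures quoted in the text; all names are illustrative only.
POPULATION = 1_171_000_000      # estimated inhabitants (PRB 2009)
BIRTHS_PER_YEAR = 27_000_000    # approximate annual births (PRB 2009)
PREVALENCE_PER_1000 = 1.2       # pathological haemoglobinopathies per 1,000 live births


def affected_births(births: int, prevalence_per_1000: float) -> float:
    """Expected annual number of affected births."""
    return births * prevalence_per_1000 / 1000


def carrier_range(population: int, low_pct: float, high_pct: float) -> tuple[float, float]:
    """Carrier numbers implied by a carrier-frequency range given in percent."""
    return population * low_pct / 100, population * high_pct / 100


births = affected_births(BIRTHS_PER_YEAR, PREVALENCE_PER_1000)
low, high = carrier_range(POPULATION, 3, 4)
print(f"Affected births per year: {births:,.0f}")
print(f"Carriers: {low / 1e6:.1f} to {high / 1e6:.1f} million")
```

Run as written, the sketch reproduces the 32,400 affected births and the 35.1 to 46.8 million carriers stated in the text.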

Unfortunately, there are no adequately representative data sets to confirm or deny these approximations, and with 50,000–60,000 strictly endogamous communities in India (Gadgil et al. 1998), it is dubious whether any average disease prevalence estimate could realistically be applied to each and every community and sub-population. This contention is supported by estimates that the carrier frequency for β-thalassaemia ranges from 0.3 to 17% in different local communities (Agarwal and Mehta 1982; Weatherall and Clegg 2001; WHO 2008).

The initial studies on β-thalassaemia in Indian populations were undertaken among overseas migrant communities and so primarily established the presence of thalassaemia mutations in individuals from the states of Gujarat and Punjab, and in the Sindhi community, many of whom originated in Pakistan (Kazazian et al. 1984; Thein et al. 1988). Five mutations, IVSI-5(G>C), IVSI-1(G>T), 619-bp del, Codon 41/42(−TCTT) and Codon 8/9(+G) accounted for 90% of all mutations (Kazazian et al. 1984; Thein et al. 1988). The results were replicated in follow-up collaborative studies undertaken in Indian and Western centres, mainly focused on the populations of Gujarat, Punjab and Maharashtra (Varawalla et al. 1991a, 1992; Garewal et al. 1994). On the basis of these findings it therefore was assumed that in India the prevalence of β-thalassaemia was highest in the Sindhi and Punjabi communities, and it was only towards the end of the twentieth century that reports from other Indian states demonstrated the wide distribution and extensive heterogeneity of β-thalassaemia mutations in different Indian sub-populations.

Given the partial nature of the available information, the establishment of effective national and regional treatment and prevention programmes for a disorder such as β-thalassaemia is extremely difficult, especially with 229 mutations so far described for the disorder in the locus-specific HbVar database (Giardine et al. 2007), 184 of which are β+ or β0 mutations (http://globin.bx.psu.edu/hbvar). The primary aim of the present study was to systematically collate and critically assess the data so far published on β-thalassaemia in India and within the Indian diaspora, and from the results of this meta-analysis to identify the predominant causative mutations at national, regional and state levels. In acknowledgement of the size of the Indian population and the genetic complexity which follows from the numerous sub-divisions (Bittles 2002; Reich et al. 2009), attention also was directed to mutations that to date have been reported as being largely community-specific in their distribution.

Subjects and methods

The geographical locations of the states and regions of India are shown in Fig. 1. To minimize undue bias towards sample collection from individuals of specific geographical or ethnic origin, and to encourage more representative future sampling across states and regions, only studies reporting allelic frequencies for at least 10 β-globin gene mutations and with a minimum of 50 subjects specifically identified by their state of origin were selected for inclusion. Seventeen published studies met these criteria and were accepted for inclusion, with rigorous cross-checking of data to avoid duplicate entries (Table 1). The information on β-globin chain mutations was initially entered by state of origin (n = 28), with subsequent collation into six geographical regions as defined in Fig. 1.
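The two inclusion thresholds described above amount to a simple filter over candidate studies. The Python sketch below is one possible way of expressing that filter. The `Study` record, its field names and the three example entries are hypothetical and were invented for illustration; they do not correspond to the studies listed in Table 1.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # Hypothetical record layout; the fields are assumptions for illustration.
    label: str
    n_mutations_reported: int    # β-globin mutations with allelic frequencies
    n_subjects_with_state: int   # subjects identified by state of origin


MIN_MUTATIONS = 10   # "at least 10 β-globin gene mutations"
MIN_SUBJECTS = 50    # "a minimum of 50 subjects"


def meets_inclusion_criteria(study: Study) -> bool:
    """Return True when a study satisfies both thresholds stated in the text."""
    return (study.n_mutations_reported >= MIN_MUTATIONS
            and study.n_subjects_with_state >= MIN_SUBJECTS)


# Invented placeholder entries, not real studies.
candidates = [
    Study("study A", 12, 140),
    Study("study B", 6, 300),    # too few mutations reported
    Study("study C", 15, 35),    # too few subjects with a stated origin
]
included = [s.label for s in candidates if meets_inclusion_criteria(s)]
print(included)
```

The cross-checking for duplicate entries mentioned in the text is not modelled here, since the procedure used for it is not described.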

Fig. 1 Map of India by state and region

Table 1 Profile of studies included in the meta-analysis of β-thalassaemia mutations in India

Data were excluded from the analysis where information on the regional, state or community origins of subjects was unclear, including 1,150 alleles omitted from persons identified only as being of Sindhi or Punjabi origin but lacking any other identifying details (Supplementary Information, S1). The mean IVSI-5(G>C) allele frequency among these excluded individuals was just 12.7%, compared with the national average figure of 54.7%, raising major doubts as to their provenance. Results from the seven Union Territories (the Andaman and Nicobar Islands, Chandigarh, Dadra and Nagar Haveli, Daman and Diu, Delhi, Lakshadweep, and Pondicherry) also were excluded because of the mixed and highly mobile populations in Delhi, the national capital, and Chandigarh, the joint capital of the states of Punjab and Haryana, and the local and numerically small populations of the other five Territories.

Only 46 alleles were reported for the populations of the Northeast region, which comprises eight individual states with a combined population of 39.0 million (Census of India 2001a), and is home to many tribal communities of Tibeto-Burmese origin. Of these Northeast alleles, 34 (73.9%) were IVSI-5(G>C), while the remaining 12 alleles comprised five rare mutations and two uncharacterized alleles. Given the small and unrepresentative number of alleles tested, the Northeast data were not separately presented by region in Table 1 and Fig. 2, but the results were incorporated into the All-India data analysis.

Fig. 2 Regional distributions of the most common β-thalassaemia alleles in India (n = 52)

As might have been expected in studies conducted over an extended time period, the methods of genomic analysis employed in the 17 studies varied quite widely and included gap-polymerase chain reaction (PCR), denaturing gradient gel electrophoresis (DGGE), temporal temperature gradient gel electrophoresis (TTGE), amplification refractory mutation system (ARMS), reverse dot blot hybridization (RDB), and direct DNA sequencing (Varawalla et al. 1991a; Verma et al. 1997; Vaz et al. 2000; Old et al. 2001; Agarwal et al. 2003; Bashyam et al. 2004; Sheth et al. 2008; Edison et al. 2008; Colah et al. 2009). For this reason, some variability may inadvertently have been introduced into the mutation profiles reported by individual study centres.

Results

National profile of β-thalassaemia mutations

Information on 8,505 alleles was collated, with 64 β-globin gene mutations causing β-thalassaemia identified in the Indian population. The profile of the 52 most prevalent and widespread disease alleles, representing 97.5% of the total β-thalassaemia alleles reported at national level, is portrayed by region from the 3′ to 5′ end of the β-globin gene (Fig. 2). Equivalent information on the β-globin mutations identified at individual state level is reproduced in Supplementary Information (S2).

The ten most common β-thalassaemia mutations reported for All-India and by region are listed in Table 2. Nationally, IVSI-5(G>C) was the single most common mutant allele and represented 54.7% of all β-thalassaemia mutations reported. IVSI-5(G>C), 619-bp del, IVSI-1(G>T), Codon 41/42(−TCTT) and Codon 8/9(+G) comprised the five most common disease mutations at the national level and totalled 82.5% of all mutations, with Codon 15(G>A), Codon 30(G>C), Cap site +1(A>C), Codon 5(−CT) and Codon 16(−C) accounting for an additional 11.0% of all mutant alleles (Table 2).

Table 2 National and regional frequencies (%) of the most common β-thalassaemia mutations in India

It is important to note that 47.0% of the alleles analysed nationally were from subjects who originated either in the western states of Maharashtra and Gujarat or the northern state of Punjab. Furthermore, 15.8% of the national β-thalassaemia allele profile describes persons specifically identified as belonging to Sindhi or Punjabi ethnic groups, which collectively account for just 3.1% of the total population of India (Census of India 2001b). Therefore, as discussed below in terms of regional mutation profiles, over-sampling of these groups significantly influenced the national β-thalassaemia mutation profile reported in previous studies. To complete the national profile of the β-thalassaemia mutations so far described in India the remaining 12 alleles, a number of which have been reported in one or several subjects only, are listed in Table 3.
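The degree of over-sampling described above can be expressed as the ratio of a group's share of the analysed alleles to its share of the national population. The Python sketch below is a minimal illustration using the two percentages quoted in the text. The function name and the ratio itself are illustrative choices and are not a measure reported by the studies analysed.

```python
def representation_ratio(sample_share_pct: float, population_share_pct: float) -> float:
    """Ratio of a group's share of the sample to its share of the population.

    Values above 1 indicate over-sampling, values below 1 under-sampling.
    """
    if population_share_pct <= 0:
        raise ValueError("population share must be positive")
    return sample_share_pct / population_share_pct


# Shares quoted in the text for persons identified as Sindhi or Punjabi.
ratio = representation_ratio(sample_share_pct=15.8, population_share_pct=3.1)
print(f"Over-representation: {ratio:.1f}-fold")
```

On these figures the Sindhi and Punjabi groups are represented roughly five times more heavily in the allele data than in the population as a whole.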

Table 3 Less common β-thalassaemia mutations reported in the population of India

Regional profile of β-thalassaemia mutations in India

The percentage distribution of five representative β-thalassaemia mutations is illustrated in Fig. 3 according to state of origin. IVSI-5(G>C) accounts for 54.7% of all β-thalassaemia alleles nationally, and the majority of subjects with this mutation originate from or are resident in the major states of Maharashtra and Gujarat (West region), Uttar Pradesh (North region) and West Bengal (East region). Codon 15(G>A) also has a widespread national distribution, but with 35.3% of all subjects resident in Maharashtra. The high percentage of −88(C>T) alleles in cases from Punjab (74.3%) can be ascribed to the frequency of this mutation in the Jat-Sikh community (Garewal et al. 2005). Likewise, the high prevalence of Codon 5(−CT) in Gujarat (79.7%) is associated with the Lohana and Prajapati communities in that state (Sheth et al. 2008). Although the Poly A(T>C) allele has been reported in the populations of nine states, 65.6% of cases were subjects who originated in the adjacent southern states of Tamil Nadu and Karnataka (Edison et al. 2008; Colah et al. 2009).

Fig. 3 Percentage distributions of five illustrative β-thalassaemia mutations at state level

The West region, comprising the major states of Maharashtra, Gujarat and Rajasthan and the small state of Goa, had a combined population in 2001 of 205.4 million (Census of India 2001a). The West is the most widely represented region in terms of sampling with 3,238 alleles analysed (38.1% of the total sample), and IVSI-5(G>C) accounts for 50.7% of all β-thalassaemia mutations. However, the West region deviates from the national pattern of five common mutations in the somewhat higher prevalence of the 619-bp deletion (14.2%) and IVSI-1(G>T) (8.7%), and with Codon 15(G>A) as the fourth commonest regional mutation with a frequency of 7.6%.

The North region is genetically heterogeneous and ranges from Uttar Pradesh on the Gangetic Plain in the east to Punjab, the westernmost state which adjoins the Pakistani province of Punjab. Haryana with a large agricultural community of Jats, and the Himalayan states of Himachal Pradesh, Uttarakhand and Jammu and Kashmir are the remaining four states in the region. No data are available from Jammu and Kashmir because of ongoing civil unrest. Sampling across the region was non-uniform. Of the 2,484 alleles reported (29.2% of the total sample), 997 were obtained from residents of the state of the Punjab which has a population of 24.3 million, as opposed to the 1,368 alleles representing the 166.1 million strong population of Uttar Pradesh. Although IVSI-5(G>C) accounts for just 44.8% of β-thalassaemia alleles the five most common mutations reported closely match the national pattern, probably due to the high representation of samples from Punjab, but with Codon 16(−C) and −88(C>T) in the list of ten common mutations along with Codon 15(G>A), Codon 30(G>C) and Cap site +1(A>C).
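The non-uniform sampling within the North region can be made concrete by expressing the allele counts relative to state population. The Python sketch below uses the allele counts and population figures quoted above. The "alleles per million inhabitants" measure and the function name are illustrative assumptions rather than quantities reported in the source studies.

```python
def alleles_per_million(alleles: int, population_millions: float) -> float:
    """Alleles analysed per million inhabitants of a state."""
    if population_millions <= 0:
        raise ValueError("population must be positive")
    return alleles / population_millions


# Allele counts and populations (in millions) as quoted in the text.
punjab = alleles_per_million(997, 24.3)
uttar_pradesh = alleles_per_million(1368, 166.1)

print(f"Punjab: {punjab:.1f} alleles per million")
print(f"Uttar Pradesh: {uttar_pradesh:.1f} alleles per million")
print(f"Relative sampling intensity: {punjab / uttar_pradesh:.1f}-fold")
```

On these figures Punjab was sampled about five times more intensively per head of population than Uttar Pradesh.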

The Central region consisting of the quite sparsely populated states of Madhya Pradesh and Chhattisgarh is grossly under-represented with only 259 reported alleles. Importantly, the Central region is home to many indigenous Scheduled Tribes which in the 2001 Census of India constituted 26.0% of the total regional population of 81.2 million, and with another 13.4% of the population belonging to Scheduled Castes. There is no evidence that either of these predominantly rural and impoverished communities is represented in the regional data set analysed.

Four states, Andhra Pradesh, Karnataka, Tamil Nadu and Kerala, make up the South region which has a predominantly Dravidian population, ethnically and culturally quite distinct from the largely Indo-European populations of northern, central and western India that represent later population flows into the Indian sub-Continent (Reich et al. 2009). In the 2001 Census of India 20.8% of people countrywide indicated a Dravidian mother tongue, which closely parallels the 21.7% of the national population resident in the South region. IVSI-5(G>C) has a prevalence of 67.9% in the South, suggesting that it may have been the ancestral mutation in the Dravidian founder population of the sub-Continent. The other alleles among the five and ten most common in the South region differ significantly from the overall national pattern; the 619-bp deletion is present in only 1.8% of cases, whereas Codon 15(G>A) is the second most common southern disease allele (8.8%), Poly A site (T>C) is the third most common allele (4.7%) and in 6.3% of cases the disease mutation is rare or unknown.

The East region exhibited by far the highest prevalence of IVSI-5(G>C) at 71.4%, with Codon 30(G>C) and Codon 15(G>A) the second and third most common alleles, accounting for 5.8% and 5.4% of the total respectively, followed by Codon 41/42(−TCTT) with a prevalence of 4.3%. The data for the East region are mainly drawn from West Bengal, with the other three constituent states, Bihar, Jharkhand and Orissa, contributing just 311 alleles to the regional total of 1,410 disease alleles. As in the South region there are a large number of alleles (8.0%) which are rare or unknown nationally, probably indicative of the substantial Scheduled Tribal populations in Jharkhand (26.3%) and Orissa (22.1%), the Scheduled Caste communities in West Bengal (23.1%) (Census of India 2001c), and very substantial population movement into the region from Bangladesh (formerly East Pakistan) to the east, during and following the Independence of India in 1947 and the establishment of Bangladesh in 1971.

Discussion

Although it is estimated that more than 300,000 babies are born each year with a major inherited haemoglobin disorder (Christianson et al. 2006) and the lives of many millions of children, adolescents and adults are adversely affected, until quite recently these diseases were rarely included in the health priorities of national governments or international health agencies (Weatherall and Clegg 2001). This situation changed in 2006 with recognition by the Executive Board of the World Health Organization that thalassaemia and sickle cell anaemia were major global health problems which needed to be urgently addressed (WHO 2006), a move reinforced by their inclusion in the current Global Burden of Disease Study (http://globalburden.org).

Given the demonstrated high frequency of β-thalassaemia alleles in India and the immense size of the national population, the present study is necessarily preliminary and any conclusions drawn need to be assessed in that light. As previously noted, with a total population of 1,171 million and a rate of natural population increase of 1.6% (PRB 2009), collecting accurate and representative health information in India is a major problem. The highly endogamous nature of Indian society, traditionally based on castes which claim long and unbroken genealogical histories, means that each community effectively functions as a separate breeding pool, with the consequent probability that recent mutations may be unique to single communities (Bittles 2008, 2009; Bittles and Black 2010). Representative sampling can therefore become extremely difficult, given the population stratification that results from the multiple ethnic, social and religious subdivisions which are a central facet of everyday existence.

When dealing with an autosomal recessive disorder such as β-thalassaemia, an additional important factor that has to be considered is the widespread preference for intra-familial unions in the southern Dravidian states of Andhra Pradesh, Karnataka and Tamil Nadu, where 30+% of marriages are consanguineous, mainly uncle-niece (F = 0.125) or first cousin (F = 0.0625), and with substantial levels of consanguineous unions in neighbouring Kerala and southern Maharashtra (Bittles et al. 1991; Bittles 2002; www.consang.net). In these states it would be expected that a high proportion of β-thalassaemia cases would be homozygous for the causative mutation, and indeed 98% of affected subjects investigated in Andhra Pradesh were homozygotes for a specific mutant allele (Bashyam et al. 2004).
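The effect of consanguinity on the expected frequency of homozygotes can be illustrated with the standard population-genetic relation q² + Fq(1 − q), where q is the allele frequency and F the coefficient of inbreeding. The Python sketch below applies this textbook relation to the F values quoted above. It is not a calculation performed in the studies cited, and the allele frequency of 0.02 is an assumption chosen for illustration (approximately half of a 4% carrier frequency). The Fq(1 − q) term arises from alleles that are identical by descent, so the additional affected individuals it predicts would carry two copies of the same mutation.

```python
def homozygote_frequency(q: float, f: float = 0.0) -> float:
    """Expected homozygote frequency for an allele of frequency q
    in offspring with inbreeding coefficient f: q^2 + f*q*(1 - q)."""
    if not (0.0 <= q <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("q and f must lie between 0 and 1")
    return q * q + f * q * (1.0 - q)


Q = 0.02  # ASSUMED allele frequency, for illustration only

baseline = homozygote_frequency(Q)  # random mating, F = 0
for label, f in [("first cousin", 0.0625), ("uncle-niece", 0.125)]:
    freq = homozygote_frequency(Q, f)
    print(f"{label}: {freq:.5f} ({freq / baseline:.1f}-fold vs random mating)")
```

Under this assumed allele frequency, first-cousin and uncle-niece unions raise the expected homozygote frequency about four-fold and seven-fold respectively. The size of the increase depends strongly on q and is larger for rarer alleles.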

At first sight the situation might be considered different in the other regions of India where exogamy in the Hindu population is practised at gotra level, i.e. involving extended male lineages, but with marital endogamy at caste level. However, since an overwhelming percentage of marriages continue to be contracted on an intra-caste and intra-community basis, even though spouses may not be known to be biologically related, there is a very strong chance that they have a large proportion of their genes in common (Bittles 2008). Thus, even in West, North, and East India, a higher than expected proportion of patients with β-thalassaemia probably are homozygous for a single mutant allele rather than being compound heterozygotes. This probability is increased by the high frequency of the IVSI-5(G>C) mutation in each region (Table 2), and by the 15.8% to 33.0% prevalence of consanguineous marriage at state level in the large Muslim minority population (Bittles and Hussain 2000).

The prevailing wisdom has been that β-thalassaemia in India principally affects the Sindhi, Gujarati, Bengali, Punjabi and Muslim communities (Agarwal 2005), although this supposition has been strongly influenced by the more extensive testing undertaken in these sub-populations. As a large majority of communities have yet to be sampled, especially among the Scheduled Castes and Scheduled Tribes and the group of lower caste communities collectively defined by the Government of India as Other Backward Classes, this opinion may well require significant future revision, and it seems highly probable that previously uncharacterized mutations remain to be identified. In the interim, it is important that public education programmes, in combination with opportunities for premarital and prenatal screening, should be made available to as wide a range of couples, families and communities as possible.

Table 2 showed that while IVSI-5(G>C) was the predominant mutation throughout India, the prevalence varied from 44.8% in the North to 71.4% in the East. It also was apparent from Table 2 and Fig. 3 that the profile of other mutations showed significant inter-regional variation, to the extent that this variation merited serious consideration in the design and implementation of future screening programmes. The higher the mutation detection rate achieved with as small a number of markers as possible, the more efficient the testing protocol will be in terms of staff time expended and the costs involved.

As summarized in Table 4, this type of approach already appears feasible at regional level. Importantly, testing for the five most common mutations at national level would detect 82.5% of cases, and for the ten most common mutations 93.5% of cases would be identified. But by changing the testing protocols to incorporate the most appropriate mutation profiles identified at regional level, the potential levels of detection could be increased to 87.7% (North) for the five most common mutations, and 97.6% (Central) for the ten most common β-thalassaemia mutations. Given the size of the potential β-thalassaemia case-load in India, due accommodation for these differences in the potential efficiency of screening programmes could produce substantial savings in both time and costs.
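The detection rates discussed above are cumulative frequencies of the most common mutations in a given profile. The Python sketch below shows one way of computing such a figure and the corresponding proportion of alleles that a panel would miss. The function names are illustrative assumptions. The example profile contains only the four West region frequencies quoted in the Results, and the coverage values passed to `missed_pct` are the national and North region five-mutation figures quoted above.

```python
def top_k_coverage(frequencies_pct: dict[str, float], k: int) -> float:
    """Summed frequency (%) of the k most common mutations in a profile."""
    ranked = sorted(frequencies_pct.values(), reverse=True)
    return round(sum(ranked[:k]), 1)


def missed_pct(coverage_pct: float) -> float:
    """Percentage of disease alleles left undetected by a panel."""
    return round(100.0 - coverage_pct, 1)


# West region frequencies quoted in the Results (four commonest alleles only).
west = {
    "IVSI-5(G>C)": 50.7,
    "619-bp del": 14.2,
    "IVSI-1(G>T)": 8.7,
    "Codon 15(G>A)": 7.6,
}
print(f"West, four commonest alleles: {top_k_coverage(west, 4)}%")

# Five-mutation panels: national profile versus the North regional profile.
print(f"Missed nationally: {missed_pct(82.5)}%")
print(f"Missed with North profile: {missed_pct(87.7)}%")
```

On the quoted figures, a regionally tailored five-mutation panel in the North would leave 12.3% of alleles undetected, compared with 17.5% for the national five-mutation panel.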

Table 4 National and regional frequencies (%) of the five and ten most common β-thalassaemia mutations in India

Could this level of performance be further improved if community-based rather than state or regional mutation data were available? To answer this question, previously unreported data on 1,031 β-thalassaemia alleles in the large northern state of Uttar Pradesh (S Agarwal, unpublished) were examined in a separate analysis. As indicated in Table 5 the results have been subdivided into seven categories, corresponding to the main religious, caste and socioeconomic subdivisions within the population of the state.

Table 5 Community-specific profiles of β-thalassaemia mutations in Uttar Pradesh, India

The ten most common β-thalassaemia mutations identified are as listed for the North region in Table 2. Although the numbers within each sub-division are small and significant mutation overlap exists between a number of the communities, such as the Hindu upper caste Brahmins and Kshatriyas, there also are major differences in community mutation profiles, e.g., comparing the Brahmin community in which Cap site +1(A>C), IVSI-5(G>C) and Codon 41/42(−TCTT) are the three most common disease alleles, with the communities classified as Other Backward Classes where ‘Other mutations’, Codon 16(−C), IVSI-5(G>C), and Codon 15(G>A) alleles predominate.

There also is clear evidence of over-sampling of economically more privileged groups. Thus while the four Hindu upper and middle castes, the Brahmins, Kshatriyas, Vaishyas and Kayasthas comprise ~19% of the population of Uttar Pradesh they account for 56.8% of the β-thalassaemia alleles tested, whereas the Hindu Other Backward Classes who form ~31% of the total state population (NSSO 2005) comprised just 11.4% of alleles.

From a genetic screening and genetic counselling perspective the data do indicate that community-specific mutation profiles could be highly effective in helping to screen for and prevent β-thalassaemia. At the same time it has to be acknowledged that to establish similar community-specific mutation profiles throughout India would be an extremely difficult logistic task within the near future. But the potential benefits are very high in health, social and economic terms, and the creation of more detailed databases of β-thalassaemia alleles will facilitate better focused, more efficient, and cost-effective testing and treatment protocols that can concentrate on individual communities and sub-populations.

Conclusions

The outcomes derived from the basic data collated in the present study should provide a sound platform on which future health care planning for the prevention and treatment of β-thalassaemia in India can be undertaken. The need for a paradigm shift in β-thalassaemia-related research is, however, indicated. While determination of the broad-based geographical distribution of causative mutations has been an important initial step, there is a clear need for structured sampling programmes to be planned and instituted to provide representative information on regions, such as Central India and the Northeast, for which data are currently inadequate. Additionally, in a country with a population as large and ethnically and socially diverse as India, the further extension of sampling to facilitate state, district and village registers of persons with β-thalassaemia and carriers of the disorder is warranted (WHO 2008). Indeed, given the continuing marked hereditary sub-divisions within Indian society that result from intra-caste and intra-community marriage, community-specific mutation testing would provide the basis for the optimum delivery of genetic education, screening and prevention programmes.