Typically, genotype imputation runs are started on a per-chromosome basis. Earlier versions of IMPUTE required a huge amount of working memory (RAM) for this task. Since version 0.4.0, it has been possible (and recommended) to carry out genome-wide imputation in chromosomal sub-regions instead of imputing whole chromosomes. Input files do not have to be split manually for this purpose; instead, the region to be imputed is specified by command line arguments. An additional option avoids edge effects at the borders of the imputed sub-regions. Afterwards, the imputed sub-regions can easily be concatenated to generate imputed files for complete chromosomes, for example by using the 'cat' command under Linux or Mac OS X and redirecting the output into a text file. BEAGLE and MACH do not offer the imputation of particular chromosomal regions with a special treatment of region borders. Their memory requirements are much lower than those of IMPUTE, however, and both implement alternative algorithms that reduce memory usage at the cost of a longer runtime. The BEAGLE software also supplies the user with a tool and detailed instructions on how to divide the sample cohort (not the reference panel) into sub-samples and perform imputation on each sub-sample separately.
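The chunking and concatenation workflow described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the default chunk size of 10 Mb and the chromosome length in the usage note are assumptions made for the example, and the sketch deliberately does not reproduce any program's actual command line options.

```python
import shutil


def chunk_intervals(chrom_length_bp, chunk_size_bp=10_000_000):
    """Return 1-based, inclusive (start, end) intervals covering a chromosome.

    Both arguments are in base pairs. The last interval is shortened so that
    it ends exactly at the chromosome end.
    """
    if chrom_length_bp <= 0 or chunk_size_bp <= 0:
        raise ValueError("chromosome length and chunk size must be positive")
    intervals = []
    start = 1
    while start <= chrom_length_bp:
        end = min(start + chunk_size_bp - 1, chrom_length_bp)
        intervals.append((start, end))
        start = end + 1
    return intervals


def concatenate_chunks(chunk_paths, output_path):
    """Concatenate per-chunk output files in the given order.

    This mirrors 'cat chunk1 chunk2 ... > output' and assumes that the chunk
    files are already listed in positional order.
    """
    with open(output_path, "wb") as out:
        for path in chunk_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For a hypothetical chromosome length of 171 Mb, `chunk_intervals(171_000_000)` yields 18 intervals of at most 10 Mb each. Each interval would then be passed to the imputation program through its region arguments, and the resulting files merged with `concatenate_chunks`.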
In general, it is difficult for non-technical users to predict the working memory and runtime requirements for particular datasets. MACH and IMPUTE report an estimate of their main memory allocation at runtime, so users should check this message when the programs start. BEAGLE does not show memory information while running, but devotes a short but informative chapter of its documentation to this problem. The main memory allocations for chromosome 6 did not exceed 2 gigabytes (GB) for BEAGLE (<1 GB in memory-saving mode), 14 GB for IMPUTE (<1 GB for each of the 18 chunks of ~10 megabases [Mb] in size) and 7 GB for MACH (<1 GB in memory-saving mode).
We used the data from chromosome 6 to illustrate the runtime differences between the programs. All programs ran on a single AMD-Shanghai 2.4 GHz processor machine, providing a maximum of 32 GB shared RAM, using the AMD64 variant of CentOS-5 (a Linux distribution based on Red Hat Enterprise Linux) and the batch processing system PBSPro (Altair Engineering). BEAGLE's cumulative runtime was the shortest of all three programs (350 minutes; 366 minutes in memory-saving mode [5 per cent increase]). IMPUTE required a considerably longer time (433 minutes [24 per cent higher than that of BEAGLE]; 464 minutes when split into 18 chromosomal segments of ~10 Mb [7 per cent increase]), while MACH was by far the slowest program (2781 minutes, that is, about two days [695 per cent higher than that of BEAGLE]; 4421 minutes in memory-saving mode [59 per cent increase]). The percentages given for the memory-saving and chunked runs are relative to the same program's default run.
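The percentages quoted above follow from the reported runtimes by simple arithmetic. The following short sketch recomputes them; the dictionary keys and the function name are illustrative labels chosen for this example, while the runtimes themselves are the values reported in the text.

```python
def percent_increase(value, baseline):
    """Relative increase of value over baseline, in per cent."""
    return 100.0 * (value - baseline) / baseline


# Runtimes in minutes for chromosome 6, as reported in the text.
runtimes_min = {
    "BEAGLE": 350,
    "BEAGLE memory-saving": 366,
    "IMPUTE": 433,
    "IMPUTE 18 chunks": 464,
    "MACH": 2781,
    "MACH memory-saving": 4421,
}

# Each comparison is (run, baseline run).
comparisons = [
    ("BEAGLE memory-saving", "BEAGLE"),
    ("IMPUTE", "BEAGLE"),
    ("IMPUTE 18 chunks", "IMPUTE"),
    ("MACH", "BEAGLE"),
    ("MACH memory-saving", "MACH"),
]

for run, baseline in comparisons:
    pct = percent_increase(runtimes_min[run], runtimes_min[baseline])
    print(f"{run} vs {baseline}: {round(pct)} per cent increase")

# Convert MACH's default runtime from minutes to days.
mach_days = runtimes_min["MACH"] / (60 * 24)
print(f"MACH default runtime: {mach_days:.2f} days")
```

Rounded to whole numbers, the computed increases are 5, 24, 7, 695 and 59 per cent, and MACH's default runtime corresponds to about 1.93 days.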
Strand orientation of the alleles has to be consistent between the observed genotypes and the haplotype reference data; ensuring this is the responsibility of the user. All three programs check for strand concordance; however, SNP markers with C/G and A/T alleles cannot be tested for orientation, because their alleles are complementary to each other and look the same on both strands. BEAGLE automatically stops when strand errors occur. A Python script from the author can be used to switch the respective alleles if necessary. IMPUTE and MACH can automatically flip the alleles of SNP markers to the complementary strand when called with an additional option. By default, IMPUTE drops erroneous markers, while MACH quits when strand errors occur. IMPUTE additionally provides the user with strand files for the Affymetrix GeneChip 500K Mapping Array Set and SNP Array 6.0. When run with such a strand file, IMPUTE automatically flips SNP markers where necessary. With Illumina genotype data, which contain hardly any C/G and A/T SNP markers, the use of the auto-flip option with IMPUTE and MACH is sufficient for automatic correction.
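The logic of a strand check can be illustrated with a minimal Python sketch. It is not the implementation used by any of the three programs; the function names and the four result labels are assumptions made for this example. It shows why A/T and C/G markers cannot be resolved from the allele codes alone: complementing both alleles gives back the same pair.

```python
# Watson-Crick complements of the four nucleotides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G markers, whose alleles complement each other."""
    return COMPLEMENT[allele1] == allele2


def resolve_strand(study_alleles, reference_alleles):
    """Compare a study marker with the reference panel.

    Both arguments are pairs of single-letter alleles, e.g. ("A", "G").
    Returns one of:
      "ambiguous": A/T or C/G marker, orientation cannot be tested
      "match":     same alleles, no action needed
      "flip":      alleles agree after complementing the study alleles
      "mismatch":  alleles disagree on either strand
    """
    if is_strand_ambiguous(*study_alleles):
        return "ambiguous"
    if set(study_alleles) == set(reference_alleles):
        return "match"
    flipped = {COMPLEMENT[allele] for allele in study_alleles}
    if flipped == set(reference_alleles):
        return "flip"
    return "mismatch"
```

For instance, `resolve_strand(("A", "G"), ("T", "C"))` reports a marker that needs flipping, whereas `resolve_strand(("A", "T"), ("A", "T"))` is reported as ambiguous even though the allele codes agree. Such markers need external information, such as a strand file, to be oriented.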
Proper error handling improves the usability of software enormously, since the user is not forced into a sometimes lengthy search for the cause of an error. Adequate handling of errors should include helpful warnings and, when fatal errors occur, the reason(s) for program termination. In general, we found only a few errors that are, in our view, mishandled by the programs. The error handling of BEAGLE is exemplary; in our experience, the program always stopped with an appropriate error message when running with incorrect input. IMPUTE does not (sufficiently) check genotype probabilities, accepting negative values or those exceeding 1.0. Also, it does not terminate when the genotype input file cannot be found (eg due to an incorrect path or filename). Instead, IMPUTE enters an infinite loop, requiring manual termination. If MACH is unable to find the reference input files, it gives a warning but does not terminate. Instead, MACH starts to infer haplotypes from the genotypes without any reference, resulting in the allocation of more than ten times the amount of main memory usually used. In many instances, this will cause the computer to crash if the warning by MACH is overlooked by the user, which can easily happen when MACH runs in a batch processing system, as is generally the case for large computing clusters. We are, of course, aware that more errors might have escaped our attention, and this list of issues is not likely to be comprehensive.
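Users can guard against the issues described above by validating their input before starting a run. The following is a minimal pre-flight sketch, not part of any of the three programs. The function names are hypothetical, and the code assumes that each genotype is given as three probabilities (one per genotype class) whose sum should not exceed 1.

```python
import os


def require_readable_file(path):
    """Fail early with a clear message if an input file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"input file not found: {path}")
    return path


def validate_genotype_probabilities(probs, tol=1e-6):
    """Check one genotype given as three probabilities.

    Assumption for this sketch: every value must lie in [0, 1] and the
    three values must not sum to more than 1 (within a small tolerance).
    """
    if len(probs) != 3:
        raise ValueError(f"expected 3 probabilities, got {len(probs)}")
    for p in probs:
        if p < 0.0 or p > 1.0:
            raise ValueError(f"probability out of range [0, 1]: {p}")
    if sum(probs) > 1.0 + tol:
        raise ValueError(f"probabilities sum to more than 1: {sum(probs)}")
    return probs
```

In a batch processing environment, calling `require_readable_file` on every genotype and reference file in the job script, before the imputation program is launched, turns a silent infinite loop or an unexpectedly large memory allocation into an immediate and explicit failure.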