
The feasibility of using low-density marker panels for genotype imputation and genomic prediction of crossbred dairy cattle of East Africa

Open Access. Published: August 01, 2018. DOI: https://doi.org/10.3168/jds.2018-14621

      ABSTRACT

      Cost-effective high-density (HD) genotypes of livestock species can be obtained by genotyping a proportion of the population using an HD panel and the remainder using a cheaper low-density panel, and then imputing the missing genotypes that are not directly assayed in the low-density panel. The efficacy of genotype imputation can be strongly affected by the structure and history of the specific target population, and it should be checked before incorporating imputation in routine genotyping practices. Here, we investigated the efficacy of imputation in crossbred dairy cattle populations of East Africa using 4 different commercial single nucleotide polymorphism (SNP) panels, 3 reference populations, and 3 imputation algorithms. We found that Minimac combined with a reference population that included a mixture of crossbred and ancestral purebred animals provided the highest imputation accuracy compared with other scenarios of imputation. The accuracies of imputation, measured as the correlation between real and imputed genotypes averaged across SNP, were around 0.76 and 0.94 for 7K and 40K SNP, respectively, when imputed up to a 770K panel. We also present a method to maximize the imputation accuracy of low-density panels, which relies on the pairwise (co)variances between SNP and the minor allele frequency of SNP. The performance of the developed method was tested in a 5-fold cross-validation process where various densities of SNP were selected using the (co)variance method and also by alternative SNP selection methods and then imputed up to the HD panel. The (co)variance method provided the highest imputation accuracies at almost all marker densities, with accuracies being up to 0.19 higher than with random selection of SNP. The accuracies of imputation from 7K and 40K panels selected using the (co)variance method were around 0.80 and 0.94, respectively. The presented method also achieved higher accuracy of genomic prediction at lower densities of selected SNP. The squared correlation between genomic breeding values estimated using imputed genotypes and those from the real 770K HD panel was 0.95 when the accuracy of imputation was 0.64. The presented method for SNP selection is straightforward in its application and can ensure high accuracies in genotype imputation of crossbred dairy populations in East Africa.

      INTRODUCTION

      Selection of animals based on genomic estimated breeding values (GEBV), that is, genomic selection (GS), is now a standard practice for genetic improvement of many livestock species. Genomic selection exploits the linkage disequilibrium (LD) between known markers and unknown causal mutations in estimation of GEBV. Genome-wide SNP are usually used as genomic markers to estimate GEBV of selection candidates that have genotypes only, based on a prediction equation that is derived from a large reference population with both genotypes and phenotypes (Meuwissen et al., "Genomic selection: A paradigm shift in animal breeding").
      Genomic selection is especially important in situations where traditional genetic evaluations based on pedigrees are not available because of the absence of pedigree information. Smallholder dairy farmers in East Africa rear crossbred cattle to combine the adaptation features of indigenous animals with the high milk yield potential of exotic dairy breeds. These farmers do not record pedigrees and there is no current genetic evaluation to aid them in making informed breeding decisions. Based on new phenotype recording programs, GS could help East African smallholder dairy farmers to establish an effective genetic improvement program.
      The accuracy of GEBV can increase when a larger reference population and high-density (HD) SNP panels are used for estimation of marker effects in the reference population. Denser panels are more effective in capturing LD between markers and QTL, and although medium-density 50K assays are dense enough for reaching a useful level of LD within a breed, HD panels are required for multi-breed applications (e.g., de Roos et al., "Linkage disequilibrium and persistence of phase in Holstein–Friesian, Jersey and Angus cattle"). This is especially important in situations where the size of the reference population for a breed is small and genotypes from other larger breeds are incorporated in genomic prediction. Although the cost of genotyping has decreased dramatically since the technology emerged, HD SNP panels are still very costly for routine use in genetic improvement of livestock species, especially in smallholder systems. A cost-effective alternative is to genotype animals with cheaper low-density panels and then to infer the missing genotypes that have not been directly assayed, based on information from a reference population genotyped by an HD panel; a method called genotype imputation.
      The optimal number of SNP to include in a low-density array that can later be used in imputation to HD genotypes, and an appropriate algorithm for selecting them, are unknown. Habier et al. ("Genomic selection using low-density marker panels") suggested genotyping selection candidates using a sparse panel of evenly spaced markers across the genome and then imputing the missing genotypes using co-segregation information within families. Although their proposed method could work across different breeds and traits and is independent of the genetic architecture of the trait of interest, it requires availability of pedigree and HD genotypes on both parents of selection candidates. Other attempts to design low-density SNP panels have been mostly based on the use of evenly spaced markers and maximization of minor allele frequency (MAF) with some enrichment at chromosomal ends (e.g., Boichard et al., "Design of a bovine low-density SNP array optimized for imputation"; Bolormaa et al., "Design of a low-density SNP chip for the main Australian sheep breeds and its effect on imputation and genomic prediction accuracy"). Corbin et al. ("The utility of low-density genotyping for imputation in the Thoroughbred horse") showed that when low-density panels are designed to optimize equidistant spacing of markers based on LD units and to increase MAF, they can provide higher imputation accuracy and lower variation in accuracy of individual SNP than equidistant selection of SNP on base pair positions. Wu et al. ("Optimal design of low-density SNP arrays for genomic prediction: Algorithm and applications") described a multiple-objective optimization algorithm to select SNP for low-density panels that achieved substantially higher imputation accuracies than selecting SNP solely based on uniform distribution of map position.
      Knowledge of the level and extent of LD between genome-wide markers is important because it can help to determine the required number of SNP markers for fine mapping of quantitative trait loci, GS, and genotype imputation (e.g., Sargolzaei et al., "Extent of linkage disequilibrium in Holstein cattle in North America"; Corbin et al., "The utility of low-density genotyping for imputation in the Thoroughbred horse"; Mathew et al., "A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction"). The structure of LD differs between populations. It is expected that in populations with smaller effective population size (Ne) and higher average LD between markers, such as commercial dairy cattle breeds, a lower number of markers will suffice. It has also been suggested that HD panels with at least 300,000 SNP are required for multi-breed applications (de Roos et al., "Linkage disequilibrium and persistence of phase in Holstein–Friesian, Jersey and Angus cattle").
      Existing SNP assays have been mainly designed for use in pure breeds and methods of imputation have been tested mostly in purebred populations. East African crossbred dairy cattle populations are complex admixtures of dairy Bos taurus breeds and indigenous African breeds. Therefore, the objectives of this study were to assess the accuracy of genotype imputation and subsequent genomic prediction in crossbred dairy cattle populations of East Africa. We compared existing arrays and methods of imputation and various methods of selecting SNP for customized arrays, including a new method based on (co)variances between SNP that are weighted by their MAF.

      MATERIALS AND METHODS

      Data

      Population

      The crossbred dairy cattle in East Africa form an admixed population resulting from many generations of crossing of African indigenous cattle to several exotic dairy breeds, mainly Friesian, Holstein, Ayrshire and related red breeds, and Jersey. These animals are kept by smallholder dairy farmers, typically in herds of 1 to 5 cows, and produce almost all of the milk consumed in East Africa. The majority of East African crossbred dairy cattle are bred via natural mating, with a small proportion of matings by AI to imported and locally bred purebred dairy bulls. Very few animals have pedigree records and no genetic evaluation systems or systematic breeding programs are used to aid farmers. The Dairy Genetics East Africa (DGEA) project collected a wide range of smallholder cow performance, animal genotype, and household data in 4 East African countries (Kenya, Uganda, Ethiopia, and Tanzania) between 2010 and 2014 to determine the needs and provide feasible solutions for short- and long-term genetic improvement of smallholder crossbred dairy cattle populations.
      The genetic diversity of the crossbred cattle in relation to the indigenous breeds of the region and global reference breeds was previously presented in principal component plots by Strucken et al. ("Genetic tests for estimating dairy breed proportion and parentage assignment in East African crossbred cattle"). They showed that the East African indigenous breeds are ancient admixtures of Bos indicus and African Bos taurus cattle, where the latter is a lineage that is genetically very distinct from European Bos taurus. The crossbred dairy cattle were all clearly shown in the principal component plots as crosses between exotic dairy breeds and the local indigenous breeds of the country in which they were sampled, with the proportion of exotic dairy content ranging from almost 0 to almost 100% (Strucken et al., "Genetic tests for estimating dairy breed proportion and parentage assignment in East African crossbred cattle").

      Genotypes

      Genotype data were available on 3,513 animals (3,124 crossbreds and 389 indigenous breed animals) genotyped for 777,962 SNP markers using the Illumina BovineHD BeadChip (Illumina, San Diego, CA). Animals consisted of indigenous breeds from East Africa as well as crossbred cows (from Kenya, Uganda, Ethiopia, and Tanzania) and bulls (only from Kenya and Uganda) in those countries. Another data set containing 26 British Friesian and 519 Canadian Ayrshire cows genotyped on the same SNP panel was also added to the data to increase the number of purebred genotypes. Quality controls applied to the combined raw data were as follows: only SNP with GC score >0.6 and call rate >95% were kept; mitochondrial, unmapped, and duplicate-map-position SNP, and SNP located on sex chromosomes (X and Y), were removed. Further, SNP with a MAF less than 0.01 were excluded. Animals were also required to have genotypes for more than 90% of SNP. These controls resulted in 691,230 SNP over 29 Bos taurus autosomes based on the UMD 3.1 genome assembly (Zimin et al., "A whole-genome assembly of the domestic cow, Bos taurus"). To increase the size of the data further and to have more purebred animals that could be used as the reference population for genotype imputation, 197 animals representing Holstein, Jersey, Guernsey, Nelore, and N'Dama breeds genotyped by the bovine HapMap consortium (http://bovinegenome.org) were added to the data. The publicly available HapMap genotypes were post quality control. Only those SNP in common between the HapMap and DGEA genotypes (after quality control) were included. The 5 dairy breeds included in the data set represented the main dairy breeds reported to have been used for crossbreeding in the region. Finally, there were 4,207 animals whose SNP marker genotypes were coded as 0, 1, and 2, respectively, for AA, AB, and BB allele combinations. Table 1 contains the details on the number of animals in different breeds and the sources of data for this study.
      Table 1. Number of genotyped animals for different breeds and breed groups and the sources they were obtained from

      Breed (1)                     Size     Source (2)   Breed group          Breed group size
      Ethiopian Begait Barka        30       DGEA         African Zebu         285
      Ethiopian Boran               28       DGEA         African Zebu
      Ethiopian Central Highland    28       DGEA         African Zebu
      Ethiopian Danakil Harar       30       DGEA         African Zebu
      Ethiopian Fogera              28       DGEA         African Zebu
      Kenyan Boran                  28       DGEA         African Zebu
      Kenyan Zebu                   58       DGEA         African Zebu
      Tanzanian Boran               20       DGEA         African Zebu
      Tanzanian Iringa Red          13       DGEA         African Zebu
      Tanzanian Singida White       22       DGEA         African Zebu
      Ugandan Ankole                43       DGEA         Sanga Zebu           60
      Ugandan Nganda                17       DGEA         Sanga Zebu
      Ethiopian crossbred           545      DGEA         African Crossbred    3,083
      Kenyan crossbred bull         97       DGEA         African Crossbred
      Kenyan crossbred              1,378    DGEA         African Crossbred
      Tanzanian crossbred           462      DGEA         African Crossbred
      Ugandan crossbred bull        46       DGEA         African Crossbred
      Ugandan crossbred             555      DGEA         African Crossbred
      Kenyan Sahiwal                38       DGEA         Bos indicus          73
      Nelore                        35       HapMap       Bos indicus
      N'Dama                        24       HapMap       African taurine      24
      Guernsey                      21       HapMap       Bos taurus           682
      Holstein                      71       HapMap       Bos taurus
      Jersey                        46       HapMap       Bos taurus
      British Friesian              25       SRUC         Bos taurus
      Ayrshire                      519      CDN          Bos taurus

      (1) All animals were females except 2 crossbred male populations from Kenya and Uganda.
      (2) DGEA = Dairy Genetics East Africa; HapMap = the Bovine HapMap consortium; SRUC = Scottish Rural University College; and CDN = Canadian Dairy Network.
      After quality controls, 0.58% of genotypes were sporadically missing. To have a complete data set for all animals at all loci, these sporadically missing genotypes were imputed using FImpute v2.2 (Sargolzaei et al., "A new approach for efficient genotype imputation using information from relatives"). To test the accuracy of this imputation, we randomly masked 5% of the known genotypes and then all the missing genotypes (∼5.58%) were predicted together by FImpute. The correlation between the imputed genotypes and the masked genotypes (5%) was higher than 99%, indicating a very high accuracy of imputation.
      To explore the genetic diversity in different breeds under investigation, LD was computed for phased genotypes of animals as the squared correlation coefficient of haplotypes of syntenic loci that were up to 1 Mbp apart:
      \[ r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A \, p_a \, p_B \, p_b}, \]

      where A and a, and B and b, are alleles of SNP A and SNP B, respectively; $p_A$, $p_a$, $p_B$, and $p_b$ are the corresponding allele frequencies; and $p_{AB}$ is the frequency of the AB haplotype.
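      As a minimal sketch of the r2 formula above (not the software used in the study), the function below computes LD for one pair of loci from phased haplotypes. The function name `ld_r2` and the 0/1 allele coding are assumptions made for illustration only.

```python
def ld_r2(haplotypes):
    """Squared haplotype correlation between two loci.

    haplotypes: list of (allele_A, allele_B) tuples, one per phased haplotype,
    with 1 coding allele A (or B) and 0 coding allele a (or b).
    This coding is an illustrative assumption, not taken from the paper.
    """
    n = len(haplotypes)
    p_A = sum(a for a, _ in haplotypes) / n           # frequency of allele A
    p_B = sum(b for _, b in haplotypes) / n           # frequency of allele B
    p_AB = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # AB haplotype
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)         # p_A * p_a * p_B * p_b
    if denom == 0:
        return float("nan")                           # undefined for a monomorphic locus
    return (p_AB - p_A * p_B) ** 2 / denom


# Two loci in complete LD, then two loci in linkage equilibrium.
print(ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)]))
print(ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

      In practice the value would be averaged over all syntenic pairs within a distance bin (up to 1 Mbp here) to obtain an LD decay curve.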

      Phenotypes

      Test-day milk yield records (TDMY) were available from the first 3 parities of 1,034 smallholder crossbred cows aged between 4 and 8 yr in Kenya. A fixed regression test-day model was fitted with contemporary group effects of random management group-year-season and fixed parity. Fixed lactation curves of animals were modeled by Legendre polynomials of order 4 within days in milk interacting with dairy group. Animals were assigned to 5 dairy groups based on their percentage of dairy breed ancestry estimated from an admixture analysis (Ojango et al., "A novel use of high density SNP assays to optimize choice of different crossbred dairy cattle genotypes in small-holder systems in East Africa"). The model also included a random additive animal effect, $N(0, \mathbf{G}\sigma_a^2)$, where $\mathbf{G}$ is the additive relationship matrix (described later), plus a permanent environmental effect, $N(0, \mathbf{I}\sigma_{pe}^2)$, with $\mathbf{I}$ being the identity matrix. Milk yield deviations (MYD) were calculated by correcting TDMY for fixed effects, management group-year-season effects, and permanent environmental effects estimated from the fixed regression test-day model. For each animal, a single MYD was calculated as the average of all MYD of the animal. Different animals had different numbers of TDMY, and to account for the effect of the number of TDMY on the accuracy of the calculated MYD, a weight was assigned to each MYD based on the inverse of its standard error. The standard error of each MYD was calculated as the standard deviation of MYD divided by the square root of n, with n being the number of TDMY used for calculation of MYD.

      SNP Selection

      To design lower density SNP panels that can be efficiently used for genotype imputation to higher densities or used directly in genetic evaluation of genotyped animals, a method of selecting SNP based on the pairwise SNP (co)variance was developed. Consider n SNP from which we want to select k SNP such that the selected k SNP together explain a higher proportion of the variance of the n SNP than any other set of k SNP. To start the SNP selection process, SNP genotypes are scaled so that the mean and variance at each SNP are 0 and 1, respectively:
      \[ x_k^{adj} = \frac{x_k - \bar{x}_k}{\sigma(x_k)}, \]

      where $x_k$ is the genotype at the $k$th SNP and $\bar{x}_k$ and $\sigma(x_k)$ are the average and standard deviation of the $k$th SNP genotype, respectively.
      Then the covariances between all pairs of scaled SNP genotypes are calculated and stored in a matrix (V), which is an n × n (co)variance matrix, and Vij is the covariance between SNP i and SNP j. The diagonal elements of matrix V are variances of SNP, and initially all are equal to 1. The sum of the diagonal elements or the trace of V is the total variance of n SNP, which is equal to the total number of SNP in the beginning.
      The developed method for SNP selection (COV) is a sequential process where in each round:
      • (1)
        For a given SNP the strength of its correlation with all other SNP is calculated. This is summed up across all pairs for each SNP and is weighted by the MAF of the given SNP:
        \[ i = 1, \ldots, n; \quad j = 1, \ldots, n \quad \text{and} \quad i \neq j, \]

        \[ E_{ij} = V_{ii} - \frac{V_{ij}^2}{V_{jj}}, \]

        \[ D_j = \sum_{i=1}^{n} E_{ij}, \quad \text{and} \]

        \[ D_j^{adj} = D_j \times [1 - (w \times \mathrm{MAF}_j)], \]

        where $E_{ij}$ is the unexplained variance (UNV) for SNP $i$ after accounting for SNP $j$; $D_j$ is the sum of UNV across all SNP for SNP $j$; and $w$ is the weight on the MAF, which can be between 0 and 1. If $w = 0$, MAF is ignored in selection of SNP, whereas with $w = 1$ the same weight is put on both UNV and MAF. We used a weighting factor $w = 1$ in the current study.
        The SNP with the lowest $D^{adj}$, say SNP $k$, is selected because it has the highest average covariance with all the other SNP, so it explains more variance than any other SNP, and it is also highly informative because it is highly polymorphic.
      • (2)
        The pairwise (co)variances between the remaining SNP are corrected by removing the amount of (co)variance explained by covariance of each SNP with the selected SNP, k:
        \[ i = 1, \ldots, n - 1 \quad \text{and} \quad j = 1, \ldots, n - 1, \]

        \[ V_{ij}^{adj} = V_{ij} - \frac{V_{ik} \times V_{jk}}{V_{kk}}, \]

        where $V_{ij}^{adj}$ is the (co)variance between SNP $i$ and SNP $j$ corrected for the selected SNP $k$. The adjusted (co)variance matrix $\mathbf{V}^{adj}$ then has dimensions of $(n - 1) \times (n - 1)$. The SNP in perfect LD with the selected SNP will have the same D value, and therefore they are removed at this stage because 100% of the information they contributed is already explained by the selected SNP.
      • (3)
        At this stage, it is determined whether the selected SNP have explained enough variance and the SNP selection process should be stopped or more SNP are required. The proportion of variance explained by the selected SNP is calculated as
        \[ \omega^2_{exp_t} = \frac{\sigma_0^2 - \sigma_t^2}{\sigma_0^2}, \]

        where $\omega^2_{exp_t}$ is the proportion of total variance explained by the selected SNP after selecting $t$ SNP, $\sigma_0^2$ is the total variance with no SNP selected, and $\sigma_t^2$ is the remaining variance after $t$ SNP have been selected, calculated as the trace of $\mathbf{V}^{adj}$.
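      The three steps above can be sketched in a few lines of Python. This is an illustrative reading of the COV procedure for a single set of SNP, not the authors' implementation: the function names, the use of the population standard deviation for scaling, the tolerance used to drop SNP with no remaining variance, and the first-wins tie-breaking are all assumptions, and every SNP is assumed to be polymorphic.

```python
from statistics import mean, pstdev


def standardize(genos):
    """Scale each SNP (a list of 0/1/2 genotypes across animals) to mean 0, variance 1."""
    scaled = []
    for col in genos:
        m, s = mean(col), pstdev(col)  # assumes s > 0, i.e. a polymorphic SNP
        scaled.append([(g - m) / s for g in col])
    return scaled


def covariance_matrix(scaled):
    """n x n (co)variance matrix V of the scaled SNP genotypes."""
    n_animals = len(scaled[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n_animals for cj in scaled]
            for ci in scaled]


def maf(col):
    """Minor allele frequency from 0/1/2 genotypes."""
    p = sum(col) / (2 * len(col))
    return min(p, 1 - p)


def select_snps_cov(genos, target=0.9, w=1.0, tol=1e-9):
    """Sequentially pick SNP until `target` proportion of variance is explained."""
    V = covariance_matrix(standardize(genos))
    mafs = [maf(col) for col in genos]
    active = list(range(len(genos)))
    total = sum(V[i][i] for i in active)  # sigma_0^2, equal to n at the start
    selected, explained = [], 0.0
    while active and explained < target:
        # Step 1: MAF-weighted sum of unexplained variance, D_j^adj, for each SNP j.
        best, best_d = None, None
        for j in active:
            d = sum(V[i][i] - V[i][j] ** 2 / V[j][j] for i in active if i != j)
            d_adj = d * (1 - w * mafs[j])
            if best_d is None or d_adj < best_d:
                best, best_d = j, d_adj
        k = best
        selected.append(k)
        # Step 2: remove the (co)variance explained by SNP k from the remaining SNP.
        rest = [i for i in active if i != k]
        for i in rest:
            for j in rest:
                V[i][j] -= V[i][k] * V[j][k] / V[k][k]
        # SNP fully explained by k (perfect LD) have no variance left and are dropped.
        active = [i for i in rest if V[i][i] > tol]
        # Step 3: proportion of the total variance explained so far.
        remaining = sum(V[i][i] for i in active)  # sigma_t^2, trace of V^adj
        explained = (total - remaining) / total
    return selected, explained


# Toy data: SNP 0 and SNP 1 are identical, SNP 2 is uncorrelated with both.
toy = [
    [0, 1, 2, 0, 1, 2, 1, 1],
    [0, 1, 2, 0, 1, 2, 1, 1],
    [0, 0, 0, 2, 2, 2, 1, 1],
]
print(select_snps_cov(toy, target=0.6))
print(select_snps_cov(toy, target=0.9))
```

      In the toy data, selecting one of the two identical SNP also accounts for the other, so a single selection explains about two-thirds of the total variance, and the uncorrelated SNP is needed to explain the rest.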
      We used a sliding window approach in which SNP were selected within overlapping intervals of 1 Mbp. The interval moved forward by 500 kbp until it reached the end of the chromosome. The number of SNP selected from each window was determined based on the proportion of variance that was required to be explained by the selected SNP. Different thresholds for the proportion of explained variance ($\omega^2_{exp}$) were set to achieve different densities of SNP panels. To account for the edge effect, twice the number of SNP required for explaining the variance was selected from the first and last 1-Mbp interval in each chromosome. Within each interval, we also selected the same number of SNP as selected by the (co)variance method (COV), either based on highest MAF (MAFI) or randomly (RANI). The SNP were also selected randomly (RANC) or based on highest MAF (MAFC) across the whole chromosome without accounting for their map position on the chromosome, matching the number of SNP selected by the (co)variance method at each density.
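      The window layout described above can be sketched as follows. This is only an illustration of 1-Mbp intervals advancing by 500 kbp; the function name, the choice to start the first window at position 0, and the half-open interval convention are assumptions, and the sketch does not cover how selections from overlapping windows were combined, which the text does not specify.

```python
def sliding_windows(positions, size=1_000_000, step=500_000):
    """Yield (start, end, snp_indices) for overlapping intervals on one chromosome.

    positions: base-pair positions of the SNP on the chromosome.
    Windows are half-open [start, end) and begin at 0 (an assumption).
    """
    last = max(positions)
    start = 0
    while start <= last:
        end = start + size
        idx = [i for i, p in enumerate(positions) if start <= p < end]
        yield start, end, idx
        if end > last:  # this window already reaches the end of the chromosome
            break
        start += step


for window in sliding_windows([100_000, 600_000, 1_200_000, 1_900_000]):
    print(window)
```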
      The SNP selection was carried out using only crossbred animals. To implement the SNP selection and validation in independent populations, a cross-validation approach was implemented for the COV, MAFI, and MAFC methods. Animals were randomly divided into 5 groups such that the number of animals in each group was as similar as possible and animals from all countries were represented in each group (∼617 animals in each fold). Then at each rotation, 4 folds were used to select SNP and 1 fold was excluded from SNP selection and was only used in the validation processes (genomic prediction and imputation). The random selections of SNP within interval (RANI) or across chromosome (RANC) were also repeated 5 times to minimize the sampling error.
      To assess the efficiency of the developed method for selecting SNP, the selected SNP by the 5 different methods were used in turn for genome-enabled best linear unbiased prediction (GBLUP) of breeding values of animals with both genotypes and phenotypes (n = 1,034) and for genotype imputation of crossbred animals to the HD panel (details below). The imputation and GBLUP accuracies obtained from SNP selected by the 5 selection methods were averaged across the 5 folds.

      Genotype Imputation

      Four different commercially available SNP chips, Illumina BovineLD v2 (m = 7,931), BovineSNP50 v3 BeadChip (m = 53,218; Illumina), GeneSeek-Genomic-Profiler (GGP) Bovine 50K (m = 47,843), and Indicus 35K v1.03 (m = 34,000; Neogen Corporation, Lincoln, NE), with m being the number of SNP in the original panel, were used to find the optimal strategy for genotype imputation. Different scenarios of imputation, in which different groups of animals were included in the reference population to predict the genotypes of crossbred animals, were tested. The reference population consisted of only crossbred (scenario 1), only purebred (scenario 2), or both crossbred and purebred animals (scenario 3). At each rotation of cross-validation in scenario 1, the reference and validation sets included around 2,466 and 617 animals, respectively. Scenario 2 used 1,124 reference animals to impute 3,083 crossbreds. The reference set in scenario 3 was around 3,590 animals at each rotation of cross-validation to impute around 617 animals. Further, we also investigated whether the choice of imputation algorithm can affect the imputation accuracy. Three different programs, FImpute v2.2 (Sargolzaei et al., "A new approach for efficient genotype imputation using information from relatives"), Beagle v4.1 (Browning and Browning, "Genotype imputation with millions of reference samples"), and Minimac v3 (Das et al., "Next-generation genotype imputation service and methods"), were used as imputation software. For FImpute and Beagle, genotype phasing was done using their embedded algorithms during imputation, whereas for Minimac, Eagle v2.3.5 (Loh et al., "Fast and accurate long-range phasing in a UK Biobank cohort") was used for pre-phasing genotypes before imputation. The SNP in common between each of the 4 commercial panels and the HD panel were extracted and then used in imputation to the HD panel. The best imputation strategy was defined as the scenario that resulted in the highest imputation accuracy. The accuracy of imputation was measured as the proportion of correctly imputed genotypes (i.e., concordance), as well as the correlation between real and imputed genotypes averaged across SNP. The optimal imputation strategy was then used for imputing the SNP selected by the different SNP selection methods up to the HD panel.
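      The two accuracy measures named above, concordance and the per-SNP correlation averaged across SNP, can be illustrated with a short sketch. The function names and the decision to skip SNP that are monomorphic in either the real or the imputed genotypes (where the correlation is undefined) are assumptions for illustration, not details given in the text.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation, or None when either vector has no variation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def imputation_accuracy(real, imputed):
    """Return (concordance, mean per-SNP correlation).

    real, imputed: lists of animals, each a list of 0/1/2 genotypes per SNP,
    restricted to the SNP that were masked and then imputed.
    """
    n_animals, n_snp = len(real), len(real[0])
    matches = sum(r == i for rr, ii in zip(real, imputed) for r, i in zip(rr, ii))
    concordance = matches / (n_animals * n_snp)
    cors = []
    for s in range(n_snp):
        r = pearson([row[s] for row in real], [row[s] for row in imputed])
        if r is not None:  # skip SNP with undefined correlation (assumption)
            cors.append(r)
    return concordance, mean(cors)


real = [[0, 1], [1, 2], [2, 0], [1, 1]]
imputed = [[0, 1], [1, 2], [2, 0], [1, 2]]
print(imputation_accuracy(real, imputed))
```

      Concordance counts every matching genotype equally, whereas the correlation penalizes errors at low-MAF SNP more heavily, which is one reason both measures are usually reported.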

      Relationship Between Animals in Target and Reference Set

      To investigate the effect of connectedness between reference and target populations on imputation accuracy, various measures of genomic relationship between animals in reference and validation sets were calculated for different scenarios of imputation. For each animal in the target set, we calculated the maximum, the average of the top 5 and top 10, and the average of all genomic relationship coefficients between that animal and the animals from the reference set that were used in the imputation of the given animal. Further, we also calculated an average value across all validation animals for all replications of cross-validation to compare the imputation scenarios in terms of relationships between reference and target sets. For these comparisons, the allele frequency was set to 0.5 for all SNP so that differences in allele frequencies between breeds did not affect the relationships and the genomic relationships were driven only by the differences in actual genotypes.

      Genomic Prediction

      The GEBV of animals with both genotypes and phenotypes (n = 1,034) were estimated using a linear mixed model in a GBLUP context:

      \[ \mathbf{y} = \mathbf{1}_n \mu + \mathbf{W}\mathbf{u} + \mathbf{e}, \]

      where $\mathbf{y}$ is the vector of MYD; $\mathbf{1}_n$ is a vector of ones; $\mu$ is the population mean term; $\mathbf{u}$ is a vector that contains genomic breeding values of animals and is assumed to be distributed as $\mathbf{u} \sim N(0, \mathbf{G}\sigma_a^2)$, where $\mathbf{G}$ is the additive relationship matrix based on SNP genotypes; $\mathbf{e}$ is the vector of random residual terms distributed as $\mathbf{e} \sim N(0, \mathbf{S}\sigma_e^2)$, with $\mathbf{S}$ being a diagonal matrix with weight values for each MYD; and $\mathbf{W}$ is the incidence matrix relating phenotypes to animals. $\mathbf{G}$ was constructed according to VanRaden ("Efficient methods to compute genomic predictions"):

      \[ \mathbf{G} = \frac{\mathbf{Z}\mathbf{Z}'}{\sum_{k=1}^{m} 2 p_k q_k}, \]

      where $\mathbf{Z}$ is the matrix of additive marker covariates and contains $0 - 2p_k$, $1 - 2p_k$, and $2 - 2p_k$ for AA, AB, and BB genotypes, respectively; $p_k$ is the frequency of allele B at marker $k$; and $q_k = 1 - p_k$.
      The GEBV were computed for animals using the HD panel (GEBV_HD) in a 5-fold cross-validation process treating 1 fold as the target set and the rest as reference at each rotation. The GEBV were also estimated using different reduced subsets of selected SNP attained by various methods (GEBV_SEL) as well as using genotypes imputed to the HD panel (GEBV_IMP) in the same cross-validation setting. The correlations between GEBV_HD and GEBV_SEL and those between GEBV_HD and GEBV_IMP were calculated and averaged across the 5 folds to assess the performance of the different SNP selection methods.
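      As a small worked illustration of the G matrix formula above, the sketch below builds G from 0/1/2 genotypes. The function name `vanraden_g` and the optional `freqs` argument are assumptions for illustration; passing `freqs=[0.5] * m` mirrors the fixed allele frequency of 0.5 described for the reference and target relationship comparisons.

```python
def vanraden_g(genotypes, freqs=None):
    """Genomic relationship matrix G = ZZ' / sum_k(2 p_k q_k).

    genotypes: list of animals, each a list of 0/1/2 counts of allele B per SNP.
    freqs: optional allele B frequencies p_k; estimated from the data if omitted.
    """
    n, m = len(genotypes), len(genotypes[0])
    if freqs is None:
        freqs = [sum(row[k] for row in genotypes) / (2 * n) for k in range(m)]
    # Z holds 0 - 2p_k, 1 - 2p_k, or 2 - 2p_k for AA, AB, and BB, respectively.
    Z = [[row[k] - 2 * freqs[k] for k in range(m)] for row in genotypes]
    denom = sum(2 * p * (1 - p) for p in freqs)
    return [[sum(a * b for a, b in zip(zi, zj)) / denom for zj in Z] for zi in Z]


G = vanraden_g([[0, 2], [2, 0], [1, 1]])
for row in G:
    print(row)
```

      For real panels with hundreds of thousands of SNP, the same computation would normally be done with a matrix library rather than nested lists; the pure-Python form is used here only to keep the example self-contained.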

      RESULTS

      Genetic Diversity

      Figure 1 illustrates the decay in pairwise LD between SNP located at varying distances on the genome for crossbreds compared with purebred populations. As expected, LD is higher between SNP in close proximity and decreases as the distance between SNP increases. The average LD was lower in crossbreds than in purebred animals at all distances. In addition, LD in crossbreds declines more rapidly over distance than in purebreds (Figure 1). Among the exotic purebred populations, Guernsey and Jersey had stronger LD between pairwise markers at all distances and British Friesians showed the lowest levels of LD. Iringa Red and Kenyan Boran showed substantially higher LD than other indigenous Zebu breeds, which were similar to the levels of LD that were observed in Nganda.
      Figure 1. Average linkage disequilibrium (r2) of pairwise SNP over varying genomic distances in crossbred (XX) versus (top) 5 Bos taurus (GU = Guernsey; JE = Jersey; HO = Holstein; AY = Ayrshire; and BF = British Friesian) breeds; (middle) 2 Bos indicus (NL = Nelore and SH = Sahiwal), 2 Sanga (NG = Nganda and AK = Ankole), and 1 African taurine (ND = N'Dama) breeds; and (bottom) 10 African Zebu (IR = Iringa Red; KB = Kenyan Boran; TB = Tanzanian Boran; SW = Singida White; EC = Central Highland; EB = Ethiopian Boran; DH = Danakil Harar; FG = Fogera; BB = Begait Barka; and ZB = Zebu) breeds.

      Optimal Imputation Strategy

      The total number of SNP in the 4 commercially available arrays retained from the HD panel after quality control, as well as the number of common SNP between the different panels, are included in Table 2. Illumina Bovine50 had the highest number of SNP in common with the HD panel, followed by the GGP Bovine 50K. The GGP Indicus 35K had the lowest number of SNP in common with all the other panels, reflecting the selection of SNP with high MAF in Bos indicus on the GGP Indicus 35K. The accuracies of imputation of crossbred genotypes obtained from the 3 different imputation algorithms are shown in Tables 3 and 4. As expected, marker panels with a higher number of SNP generally achieved higher imputation accuracies in all scenarios of imputation. However, the GGP Bovine 50K always achieved higher imputation accuracies than the Illumina Bovine50 though it contained around 2.6K fewer SNP. The difference between imputation accuracy of the GGP Bovine 50K and the Illumina Bovine50 was higher when Beagle was used as the imputation software. For FImpute and Minimac, most of the accuracy in imputation of crossbred genotypes came from including crossbred animals in the reference data (scenario 1), and adding purebreds to the reference set (scenario 3) added little accuracy. This was not the case in imputations carried out by Beagle, where scenario 3 achieved up to 0.11 higher accuracy than scenario 1 (Table 4). In scenario 1, Beagle achieved the lowest imputation accuracy of the 3 imputation algorithms, with larger differences for low-density panels. Scenario 2 resulted in the lowest imputation accuracies for all commercial panels and imputation programs, whereas scenario 3 achieved the highest imputation accuracies in all cases except the correlation for Illumina BovineLD imputed with Beagle (Table 4). Among the 3 imputation software programs used in this study, Minimac outperformed FImpute and they both performed better than Beagle. The difference between imputation software programs was larger at lower densities, such that in scenario 3 Minimac achieved up to 0.33 and 0.15 higher correlations in imputation of Illumina BovineLD than Beagle and FImpute, respectively (Table 4).
      Table 2. Total number of SNP (diagonals) and number of common SNP between different commercial panels (lower triangle) retained from the quality controlled high-density genotypes (with proportion of missing SNP to be imputed in parentheses)
      Panel¹               Illumina BovineLD    GGP Indicus 35K    GGP Bovine 50K    Illumina Bovine50
      Illumina BovineLD    7,154 (0.99)
      GGP Indicus 35K      1,252                30,586 (0.96)
      GGP Bovine 50K       6,998                3,532              39,480 (0.94)
      Illumina Bovine50    7,063                2,345              13,873            42,147 (0.94)
      ¹ GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA).
      Table 3. Concordance¹ values for imputation of commercial panels in different scenarios of imputation from Beagle, FImpute, and Minimac (with SE in parentheses)
      Panel²               Software³    Imputation scenario⁴
                                        1                  2         3
      Illumina BovineLD    Beagle       0.6175 (0.0004)    0.6001    0.6222 (0.0001)
                           FImpute      0.7766 (0.0008)    0.6437    0.7779 (0.0007)
                           Minimac      0.8556 (0.0006)    0.7072    0.8564 (0.0006)
      GGP Indicus 35K      Beagle       0.7753 (0.0001)    0.6946    0.8531 (0.0002)
                           FImpute      0.9141 (0.0004)    0.8077    0.9175 (0.0004)
                           Minimac      0.9483 (0.0003)    0.8644    0.9519 (0.0002)
      GGP Bovine 50K       Beagle       0.8837 (0.0004)    0.7367    0.9456 (0.0003)
                           FImpute      0.9366 (0.0004)    0.8519    0.9386 (0.0004)
                           Minimac      0.9564 (0.0003)    0.8921    0.9597 (0.0003)
      Illumina Bovine50    Beagle       0.8456 (0.0004)    0.7238    0.9151 (0.0003)
                           FImpute      0.9274 (0.0004)    0.8387    0.9295 (0.0004)
                           Minimac      0.9499 (0.0004)    0.8812    0.9534 (0.0003)
      ¹ Concordance was defined as the proportion of correctly imputed genotypes.
      ² GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA).
      ³ FImpute v2.2 (Sargolzaei et al., 2014), Beagle v4.1 (Browning and Browning, 2016), and Minimac v3 (Das et al., 2016).
      ⁴ Imputation scenarios differed based on the inclusion of animals in the reference population where in scenarios (1) only crossbred, (2) only purebred, and (3) all purebred and crossbred were included in the reference set.
      Table 4. Correlations¹ for imputation of commercial panels in different scenarios of imputation from Beagle, FImpute, and Minimac (with SE in parentheses)
      Panel²               Software³    Imputation scenario⁴
                                        1                  2         3
      Illumina BovineLD    Beagle       0.4501 (0.0003)    0.3709    0.4344 (0.0002)
                           FImpute      0.6136 (0.0012)    0.3657    0.6177 (0.0012)
                           Minimac      0.7637 (0.0007)    0.4734    0.7638 (0.0007)
      GGP Indicus 35K      Beagle       0.6384 (0.0003)    0.5692    0.7569 (0.0003)
                           FImpute      0.8630 (0.0006)    0.6825    0.8688 (0.0006)
                           Minimac      0.9217 (0.0004)    0.7847    0.9274 (0.0003)
      GGP Bovine 50K       Beagle       0.7958 (0.0003)    0.6204    0.9078 (0.0004)
                           FImpute      0.8944 (0.0004)    0.7500    0.8979 (0.0004)
                           Minimac      0.9296 (0.0004)    0.8227    0.9350 (0.0003)
      Illumina Bovine50    Beagle       0.7516 (0.0003)    0.6026    0.8587 (0.0003)
                           FImpute      0.8822 (0.0004)    0.7356    0.8858 (0.0004)
                           Minimac      0.9209 (0.0004)    0.8104    0.9265 (0.0004)
      ¹ Correlations between masked and imputed genotypes averaged across SNP.
      ² GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA).
      ³ FImpute v2.2 (Sargolzaei et al., 2014), Beagle v4.1 (Browning and Browning, 2016), and Minimac v3 (Das et al., 2016).
      ⁴ Imputation scenarios differed based on the inclusion of animals in the reference population where in scenarios (1) only crossbred, (2) only purebred, and (3) all purebred and crossbred were included in the reference set.
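The 2 accuracy measures reported in Tables 3 and 4 follow directly from their footnote definitions and can be sketched as below. The 0/1/2 genotype coding, the function names, and the choice to skip SNP that are monomorphic in either the true or the imputed data (where a correlation is undefined) are assumptions made for this illustration.

```python
import numpy as np

def concordance(true_geno, imputed_geno):
    """Proportion of masked genotypes that were imputed correctly."""
    t = np.asarray(true_geno)
    i = np.asarray(imputed_geno)
    return float(np.mean(t == i))

def mean_snp_correlation(true_geno, imputed_geno):
    """Correlation between true and imputed genotypes, averaged across SNP.

    Both inputs are (n_animals, n_snp) arrays of 0/1/2 allele counts.
    SNP without variation in either array are skipped (an assumption).
    """
    t = np.asarray(true_geno, dtype=float)
    i = np.asarray(imputed_geno, dtype=float)
    cors = []
    for j in range(t.shape[1]):
        if t[:, j].std() > 0 and i[:, j].std() > 0:
            cors.append(np.corrcoef(t[:, j], i[:, j])[0, 1])
    return float(np.mean(cors)) if cors else float("nan")
```

Concordance counts every exact match equally, whereas the per-SNP correlation rewards only calls that track the variation at a locus, which is why the two measures can diverge for low-density panels.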
      The combination of Minimac and scenario 3 was chosen as the optimal imputation strategy because it provided the highest imputation accuracy for all SNP panels compared with other imputation algorithms and scenarios. Figure 2 shows the concordance and correlation of imputations for different chromosomes obtained from the optimal imputation strategy. Accuracies for individual chromosomes were similar to the overall imputation accuracies reported for the different panels (Tables 3 and 4). However, chromosomal accuracies fluctuated less when the overall imputation accuracy was higher. Different chromosomes achieved the highest and lowest imputation accuracies in different panels. For example, chromosomes 5, 6, 2, and 8 showed the highest concordance in imputation of GGP Bovine 50K, Illumina Bovine50, GGP Indicus 35K, and Illumina BovineLD, respectively. Although correlation and concordance showed very similar trends across chromosomes, correlations were lower than concordances, especially for Illumina BovineLD (Figure 2).
      Figure 2. Concordance values (top) and correlations (bottom) of individual chromosomes obtained from the optimal imputation of different commercial arrays. GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA). Color version available online.
      The values of concordance and correlations of imputed SNP against their MAF are illustrated in Figure 3. For all panels, SNP with the lowest MAF showed the highest concordance values, and concordance decreased as MAF increased. The rate of decline in concordance values was highest for Illumina BovineLD compared with other panels, and the panel with the highest average concordance across all SNP (i.e., GGP Bovine 50K) showed the smallest reduction in SNP concordance values. In contrast, correlations showed a moderate ascending trend from low to high MAF and also less variation across MAF for all panels.
      Figure 3. Concordance values (top) and correlations (bottom) of individual SNP against their minor allele frequency (MAF) obtained from the optimal imputation of different commercial arrays. GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA).

      Reference and Target Population Connectedness

      Table 5 contains various measures of genomic relationships used to evaluate the connectedness between reference and target animals in different scenarios of imputation. The values of connectedness across the 3 scenarios agreed with the imputation accuracies obtained from each scenario (Tables 3 and 4), where higher connectedness always led to higher imputation accuracy.
      Table 5. Connectedness between animals in target and reference sets measured by genomic relationships in different scenarios of imputation
      Connectedness measure¹    Scenario
                                1       2       3
      Max                       0.73    0.65    0.73
      Top 5                     0.67    0.65    0.68
      Top 10                    0.65    0.65    0.67
      Mean                      0.55    0.53    0.55
      ¹ Max = average of maximum relationships; top 5 = average of top 5 relationships; top 10 = average of top 10 relationships; and mean = average of all relationships between each individual in the target set and all the reference individuals.
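The connectedness measures defined in the Table 5 footnote can be computed from a block of genomic relationships between target and reference animals, as sketched below. The function name and input layout are hypothetical, and the relationship values themselves are taken as given because their calculation is not described in this section.

```python
import numpy as np

def connectedness(rel_target_by_ref, k_values=(5, 10)):
    """Summarize target-to-reference genomic relationships.

    rel_target_by_ref : (n_target, n_reference) array; entry [t, r] is the
                        genomic relationship between target t and reference r.
    Returns the average of each target's maximum relationship, the average
    of each target's top-k relationships, and the overall mean.
    """
    rel = np.asarray(rel_target_by_ref, dtype=float)
    # Sort each target's relationships from largest to smallest
    sorted_desc = -np.sort(-rel, axis=1)
    summary = {
        "max": float(sorted_desc[:, 0].mean()),
        "mean": float(rel.mean()),
    }
    for k in k_values:
        # Mean of the k largest relationships per target, averaged over targets
        summary[f"top{k}"] = float(sorted_desc[:, :k].mean())
    return summary
```

For example, `connectedness(rel, k_values=(5, 10))` returns a dictionary with the keys `max`, `top5`, `top10`, and `mean`, matching the rows of Table 5.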
      Figures 4 and 5 show the mean values of concordance and correlations, respectively, for target animals grouped according to their average genomic relationship with the reference set used in imputation of their genotypes. For all panels and in all scenarios of imputation, concordance and correlation increased as the relationship between target and reference sets increased. The accuracies of the imputation from Illumina BovineLD to HD genotypes benefitted the most from the increase in genomic relationships between target and reference animals. For example, in scenario 3, the difference between concordance values of animals with the lowest and highest genomic relationship with the reference set was around 0.26 for imputation from Illumina BovineLD, whereas other panels showed an average difference of 0.15.
      Figure 4. Concordance values of imputed genotypes of crossbred animals against their relationship with the reference set obtained from Minimac in different scenarios of imputation. GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA). Color version available online.
      Figure 5. Correlations between real and imputed genotypes of crossbred animals against their relationship with the reference set obtained from Minimac in different scenarios of imputation. GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA). Color version available online.

      Imputation Accuracies from Different SNP Selection Methods

      Table 6 shows the number of selected SNP at different thresholds for the total variance of SNP explained, as well as the accuracies of imputation obtained from different SNP selection methods. Selection of SNP based on the (co)variance method achieved the highest imputation accuracy at almost all thresholds, such that it provided 0.03, 0.19, 0.17, and 0.16 higher correlations compared with SNP selection based on MAFI, RANI, RANC, and MAFC, respectively, when around 4K SNP were selected (5% of total SNP variance explained). The values of concordance and correlations within a SNP selection method could differ to a large extent, especially at lower densities of selected SNP, but they became closer as more SNP were selected at higher densities. The difference between the accuracy of imputation from the (co)variance method and those of other SNP selection methods was also highest at lower marker densities, and there was little difference in accuracy of imputation between methods at high marker densities, where all accuracies were high. Selection of SNP based on highest MAF within interval (MAFI) provided the second highest correlations after the (co)variance method at lower densities. Random selection of SNP at lowest densities provided the poorest accuracies, with little difference between RANI and RANC. As more SNP were selected, random selection of SNP started to outperform selection based on MAF such that at densities higher than 20K, RANI and RANC always provided higher accuracies than MAFI and MAFC, though the differences were marginal. The standard errors of imputation accuracies (results not shown) were always smaller than 0.02, with lower values at higher densities.
      Table 6. The number of selected SNP at different thresholds for the proportion of variance explained and imputation accuracies obtained from different methods of SNP selection
                                   Imputation accuracy¹ from SNP selection methods²
      Variance         No. of      COV               MAFI              RANI              RANC              MAFC
      explained (%)    SNP¹        Con      Cor      Con      Cor      Con      Cor      Con      Cor      Con      Cor
      1                3,757       0.7905   0.6379   0.7733   0.6077   0.6970   0.4611   0.6974   0.4680   0.7024   0.4886
      5                4,013       0.8109   0.6778   0.7930   0.6458   0.7134   0.4920   0.7194   0.5094   0.7244   0.5225
      7                4,725       0.8393   0.7316   0.8218   0.7020   0.7695   0.6072   0.7646   0.5961   0.7716   0.6086
      10               6,166       0.8684   0.7834   0.8511   0.7553   0.8263   0.7166   0.8204   0.7033   0.8217   0.7005
      12               7,162       0.8810   0.8049   0.8636   0.7772   0.8462   0.7522   0.8422   0.7431   0.8413   0.7361
      15               8,738       0.8954   0.8292   0.8778   0.8016   0.8663   0.7871   0.8632   0.7803   0.8606   0.7701
      20               11,773      0.9140   0.8599   0.8963   0.8325   0.8907   0.8281   0.8893   0.8245   0.8838   0.8101
      25               15,373      0.9283   0.8832   0.9107   0.8561   0.9087   0.8578   0.9078   0.8552   0.9008   0.8385
      30               19,812      0.9398   0.9017   0.9228   0.8756   0.9235   0.8816   0.9232   0.8802   0.9155   0.8626
      35               25,410      0.9495   0.9170   0.9336   0.8928   0.9365   0.9021   0.9364   0.9015   0.9281   0.8830
      40               32,573      0.9576   0.9299   0.9437   0.9088   0.9480   0.9203   0.9482   0.9199   0.9394   0.9010
      45               41,383      0.9644   0.9405   0.9525   0.9226   0.9579   0.9355   0.9579   0.9351   0.9490   0.9161
      50               52,134      0.9700   0.9495   0.9602   0.9344   0.9661   0.9481   0.9661   0.9478   0.9573   0.9291
      55               64,907      0.9747   0.9567   0.9666   0.9444   0.9729   0.9585   0.9727   0.9579   0.9643   0.9401
      60               79,831      0.9787   0.9630   0.9720   0.9526   0.9782   0.9665   0.9778   0.9658   0.9701   0.9490
      65               97,613      0.9823   0.9686   0.9765   0.9596   0.9825   0.9731   0.9820   0.9722   0.9748   0.9563
      70               119,120     0.9853   0.9734   0.9805   0.9657   0.9859   0.9783   0.9855   0.9775   0.9791   0.9629
      75               144,995     0.9880   0.9776   0.9840   0.9710   0.9887   0.9825   0.9883   0.9818   0.9827   0.9684
      80               177,382     0.9904   0.9814   0.9869   0.9756   0.9909   0.9859   0.9905   0.9853   0.9858   0.9733
      85               220,109     0.9926   0.9848   0.9896   0.9797   0.9928   0.9888   0.9924   0.9882   0.9887   0.9777
      90               281,030     0.9947   0.9881   0.9922   0.9836   0.9944   0.9913   0.9941   0.9907   0.9915   0.9819
      95               378,216     0.9969   0.9917   0.9946   0.9869   0.9958   0.9935   0.9955   0.9930   0.9942   0.9857
      ¹ Averaged across 5 folds.
      ² Proportion of correctly imputed genotypes (Con) and average correlations between real and imputed genotypes (Cor) obtained from imputation of subsets of selected SNP based on COV = (co)variance method; MAFI = minor allele frequency within interval; RANI = random within interval; RANC = random across chromosome; and MAFC = minor allele frequency across chromosome.

      Genomic Predictions Using Reduced and Imputed SNP Panels

      The relative accuracies of genomic predictions obtained from the reduced sets of SNP and from the imputed genotypes are shown in Table 7 for different SNP selection methods and different thresholds placed on the variance explained. Genomic predictions based on SNP selected by the (co)variance method achieved higher accuracies compared with other methods of selecting SNP, especially when fewer SNP were selected. When the threshold on variance explained was 40% or higher, RANI and RANC achieved marginally higher genomic prediction accuracies compared with other methods of SNP selection, but all methods gave r² higher than 0.99 after imputation. Accuracies of genomic prediction were always higher when the reduced SNP were imputed to the HD panel and then used to estimate GEBV rather than being used directly in genomic evaluations. The prediction accuracies from imputed genotypes were consistent with but substantially higher than accuracies achieved in imputations. Using the best method of SNP selection (i.e., COV), an imputation accuracy of 0.64 was achieved when 3.8K SNP were selected and this subsequently gave an accuracy of 0.95 for GEBV. Using the same selected SNP in genomic evaluation without imputation gave a substantially lower accuracy of 0.70 for GEBV.
      Table 7. Squared correlations between genomic breeding values estimated using the real high-density panel and those from reduced (Sel) and imputed (Imp) genotypes at different SNP densities selected by different methods of SNP selection
                                   Genomic prediction accuracy¹ from SNP selection methods²
      Variance         No. of      COV               MAFI              RANI              RANC              MAFC
      explained (%)    SNP¹        Sel      Imp      Sel      Imp      Sel      Imp      Sel      Imp      Sel      Imp
      1                3,757       0.6990   0.9474   0.6520   0.9279   0.6775   0.8669   0.6592   0.8706   0.5854   0.8710
      5                4,013       0.7102   0.9558   0.6485   0.9398   0.6761   0.8890   0.6512   0.8979   0.5855   0.8932
      7                4,725       0.7364   0.9666   0.6635   0.9536   0.6771   0.9266   0.6860   0.9242   0.6009   0.9196
      10               6,166       0.7778   0.9770   0.6920   0.9662   0.7457   0.9579   0.7114   0.9571   0.6468   0.9489
      12               7,162       0.8017   0.9803   0.7159   0.9698   0.7438   0.9678   0.7440   0.9661   0.6619   0.9608
      15               8,738       0.8250   0.9849   0.7393   0.9763   0.7907   0.9732   0.7911   0.9740   0.7017   0.9688
      20               11,773      0.8547   0.9907   0.7886   0.9830   0.8183   0.9811   0.8311   0.9825   0.7421   0.9793
      25               15,373      0.8748   0.9929   0.8166   0.9870   0.8282   0.9864   0.8568   0.9873   0.7726   0.9847
      30               19,812      0.8907   0.9949   0.8421   0.9901   0.8817   0.9907   0.8839   0.9902   0.8049   0.9885
      35               25,410      0.9072   0.9961   0.8598   0.9927   0.8993   0.9931   0.9018   0.9931   0.8366   0.9915
      40               32,573      0.9161   0.9967   0.8797   0.9944   0.9188   0.9949   0.9295   0.9956   0.8620   0.9945
      45               41,383      0.9251   0.9976   0.8956   0.9955   0.9339   0.9964   0.9385   0.9970   0.8856   0.9957
      50               52,134      0.9320   0.9982   0.9062   0.9967   0.9463   0.9977   0.9515   0.9980   0.9048   0.9966
      55               64,907      0.9400   0.9986   0.9177   0.9975   0.9558   0.9984   0.9611   0.9985   0.9157   0.9977
      60               79,831      0.9466   0.9989   0.9295   0.9982   0.9603   0.9990   0.9696   0.9990   0.9273   0.9982
      65               97,613      0.9528   0.9992   0.9396   0.9987   0.9690   0.9993   0.9761   0.9993   0.9358   0.9986
      70               119,120     0.9610   0.9994   0.9467   0.9989   0.9758   0.9995   0.9805   0.9995   0.9449   0.9990
      75               144,995     0.9676   0.9996   0.9564   0.9993   0.9793   0.9997   0.9850   0.9997   0.9541   0.9992
      80               177,382     0.9723   0.9998   0.9628   0.9996   0.9827   0.9998   0.9885   0.9998   0.9626   0.9995
      85               220,109     0.9787   0.9999   0.9704   0.9997   0.9876   0.9999   0.9912   0.9999   0.9702   0.9997
      90               281,030     0.9855   0.9999   0.9795   0.9999   0.9905   0.9999   0.9937   0.9999   0.9794   0.9998
      95               378,216     0.9927   1        0.9895   0.9999   0.9942   1        0.9966   1        0.9901   0.9999
      ¹ Averaged across 5 folds.
      ² Selection of SNP based on COV = (co)variance method; MAFI = minor allele frequency within interval; RANI = random within interval; RANC = random across chromosome; and MAFC = minor allele frequency across chromosome.

      DISCUSSION

      The results of the current study confirm that genotype imputation can be successfully applied in crossbred dairy cattle populations of East Africa. An imputation scenario including both crossbred and ancestral purebred animals in the reference set and using Minimac as the imputation software achieved the highest accuracy in imputation of commercial SNP panels and was regarded as the optimal imputation strategy. The accuracy of imputation and of GEBV from reduced SNP assays was higher when low-density SNP were selected using a method based on the (co)variance between SNP and weighted by their MAF. At higher SNP densities, however, accuracies of imputation and GEBV differed only marginally between SNP selection methods. The presented method can be used as a guide to select the required number of SNP from different regions of the genome based on the varying structure of LD between markers, to design low-density arrays optimized for imputation to higher densities.
      The accuracy of imputation with low-density arrays largely depends on the strategy used to impute genotypes. This includes the choice of reference population and imputation algorithm, among other factors influencing the imputation accuracy. It has been shown that a larger reference population generally increases the imputation accuracy (e.g., Hozé et al., "High-density marker imputation accuracy in sixteen French cattle breeds"). However, having immediate ancestors of the target individuals in the reference set is more important than the sample size (e.g., Ventura et al., "Impact of reference population on accuracy of imputation from 6K to 50K single nucleotide polymorphism chips in purebred and crossbreed beef cattle"). Crossbred dairy cattle in East Africa form an admixed population of animals with different breed proportions from various indigenous and exotic dairy breeds (Strucken et al., "Genetic tests for estimating dairy breed proportion and parentage assignment in East African crossbred cattle"). To achieve a high level of imputation accuracy, therefore, it is necessary to create a reference set with high relationships to the crossbred animals and also an optimal contribution to the breed composition of crossbred animals. In the present study, a mixture of animals from crossbred, indigenous, and exotic groups when used as the reference set (i.e., scenario 3) resulted in higher top relationships between target and reference populations (Table 5) and consequently higher accuracies of imputation compared with other scenarios (Tables 3 and 4). The inclusion of crossbreds in the reference set provides closely related individuals to the target set, which allows high imputation accuracies (scenario 1). Adding purebred indigenous and exotic animals representative of the ancestral breeds contributing to the imputed crossbred animals improves the accuracy of imputation only marginally (scenario 3 vs. scenario 1). The crossbred dairy cattle in East Africa result from many generations of crossing of animals with different breed proportions, and the individual purebred animals used as reference in imputation are not themselves ancestors of the crossbred animals. Thus, the long-range haplotypes in crossbreds will likely differ from those in reference purebreds, explaining why use of just the reference purebreds in imputation (i.e., scenario 2) performed poorly. Similar to our findings, Ventura et al. ("Assessing accuracy of imputation using different SNP panel densities in a multi-breed sheep population") reported that a large reference population including all the available data from all breeds is preferred over smaller within-breed reference sets in imputation of multi-breed sheep populations of New Zealand.
      A positive relationship between relatedness to the reference set and imputation accuracy was observed in all scenarios of imputation (Figures 4 and 5). This confirms the importance of maintaining a high level of connectedness between reference and target animals to achieve a high imputation accuracy. The relationship between connectedness to the reference set and the realized imputation accuracy can also be used to predict an expected imputation accuracy for target animals before undertaking an imputation (e.g., Ventura et al., "Assessing accuracy of imputation using different SNP panel densities in a multi-breed sheep population"). Genomic relationships between different populations can be similarly used to predict accuracy of across-population imputations. Here we observed that the accuracies of across-population imputations were related to the average genomic relationships between the populations (Supplemental Tables S1 and S2; https://doi.org/10.3168/jds.2018-14621).
      In addition to the structure of the reference population, the imputation algorithm can affect the imputation results. Several imputation algorithms are available that either use the LD in the population [population imputation: e.g., Beagle (Browning and Browning, 2016); Minimac (Howie et al., "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing")] or additionally incorporate family relationships between individuals [family imputation: e.g., AlphaImpute (Hickey et al., "A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation"); FImpute (Sargolzaei et al., 2014)] to implement imputation. Several studies have compared the performance of different imputation software in livestock species and found that different methods are preferred in different situations (e.g., Hayes et al., "Accuracy of genotype imputation in sheep breeds"; Khatkar et al., "Strategies and utility of imputed SNP genotypes for genomic analysis in dairy cattle"). In pedigreed populations, the incorporation of family information allows the tracking of long haplotypes that run within families and have low frequency in the population. This is particularly beneficial in imputation of rare alleles, which have a low imputation accuracy from population-based imputations (Sargolzaei et al., 2014). The observed trend between the MAF of SNP and the 2 measures of accuracy of imputation used in this study (Figure 3) has been reported in previous studies (e.g., Hayes et al., "Accuracy of genotype imputation in sheep breeds"; Hickey et al., "Factors affecting the accuracy of genotype imputation in populations from several maize breeding programs") and the reasons behind these trends have been documented (Calus et al., "Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: A review of livestock applications"). The SNP with low MAF show high concordance values (Figure 3) simply because a missing genotype at such loci has a high chance of being inferred correctly from population allele frequencies alone, by assigning the major allele. Correlation accounts for the MAF of SNP, so it can be used to compare imputation accuracy across loci with different MAF (Calus et al., same review).
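The effect of MAF on concordance described above can be made concrete with a short worked example. It assumes Hardy-Weinberg proportions, which an admixed crossbred population need not satisfy, and considers a naive rule that calls every missing genotype as the major-allele homozygote; the function name is hypothetical.

```python
def naive_concordance(maf):
    """Expected concordance of always calling the major-allele homozygote.

    Under Hardy-Weinberg proportions, the major-allele homozygote has
    frequency (1 - maf)^2, which is the share of calls this rule gets right.
    """
    return (1.0 - maf) ** 2

for maf in (0.01, 0.05, 0.10, 0.25, 0.50):
    print(f"MAF = {maf:.2f}  expected concordance = {naive_concordance(maf):.4f}")
```

At a MAF of 0.05 this uninformative rule is already correct for about 90% of genotypes, yet its calls are constant and so carry no information about variation at the locus, which is why correlation is the more comparable measure across loci.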
      The accuracy of imputation for SNP that have specific applications in genetic studies (e.g., parentage tests and calculation of breed composition) is of additional interest. Using similar data, Strucken et al. ("Genetic tests for estimating dairy breed proportion and parentage assignment in East African crossbred cattle") showed that SNP with the largest allele frequency difference between European dairy breeds and a combined Nelore plus N'Dama population (as representatives of the ancient ancestors of indigenous East African zebu cattle) are the most informative for calculation of breed proportions of East African crossbred animals. Here, we compared the imputation accuracy of SNP grouped according to the difference in average allele frequencies between indigenous versus exotic dairy breeds (Figure 6). We found that accuracies were highest for loci with high allele frequency difference between the ancestral breeds, confirming that the imputed genotypes can be safely used for calculation of breed proportions of crossbred animals. These loci are also the most powerful SNP for undertaking genome-wide association studies to detect regions of the genome that cause the very large differences in performance and adaptation traits of exotic dairy versus indigenous breeds. Because such loci are very poorly represented on all commercial assays other than the Illumina BovineHD BeadChip, this property of the imputation is particularly valuable.
      Figure 6. Average concordance values (top) and correlations (bottom) of SNP obtained from optimal imputation of crossbred animals and grouped based on allele frequency difference between ancestral pure breeds. GeneSeek-Genomic-Profiler (GGP) Bovine 50K and GGP Indicus 35K (Neogen Corporation, Lincoln, NE); Illumina Bovine50 and Illumina BovineLD (Illumina, San Diego, CA). Color version available online.
      The sensitivity of the imputation algorithm to the size of the reference population and the density of the imputed and reference arrays should also be considered. In our study, Beagle gave the lowest imputation accuracies in scenario 1 compared with FImpute and Minimac. The inclusion of purebreds in addition to crossbreds in the reference (i.e., scenario 3) increased the accuracies from Beagle, especially at higher SNP densities, but only slightly improved the accuracies for FImpute and Minimac (Table 3, Table 4). Although this could be partly due to the change in the structure of the reference population (as discussed above), it can also be related to the increase in the reference sample size, which could imply that Beagle is more dependent on the size of the reference population, as reported by Ventura et al. (2014). On the other hand, no increase occurred in imputation accuracy using Beagle from scenario 1 to scenario 3 for the Illumina BovineLD, which could also indicate that the accuracy from Beagle is sensitive to the density of the imputed panel. Similar results were reported by Sargolzaei et al. (2014), who showed that imputation from lower densities using Beagle is more dependent on the size of the reference set.
      Cost-effective genotyping requires an appropriate balance between the size of reference and target populations. In an additional analysis, we examined the effect of having different proportions of reference and target animals in imputation. We found that when only 10% of the population (≈308 animals) was used as the reference to impute the remaining 90% as the target, the accuracy of imputation was still reasonably high. The increase in imputation accuracy was not proportional to the increase in reference sample size (Supplemental Table S3; https://doi.org/10.3168/jds.2018-14621), with only a modest increase in accuracy, especially when more than 30% of animals were in the reference set. As long as the reference set is a representative sample of the target population, the imputation accuracy should be related to the number of animals rather than the proportion of animals in the reference set. That is, the current reference set could be used to impute, with the same accuracy as observed here, the genotypes of an unlimited number of animals sampled from the same population in the future. The relationship of future sampled animals to the current reference set will provide a test of whether the reference set is properly representative of newly sampled animals.
      In addition to the factors discussed above, the accuracy of imputation depends on the characteristics of SNP included in the low-density panel. It is expected that if SNP in the low-density panel have high LD with the missing genotypes, they should achieve a high imputation accuracy, especially in population imputation methods. However, if markers in the low-density panel are in strong LD with each other, they will provide redundant information for imputation. It is also known that SNP with high MAF have more information content for imputation (Hayes et al., 2012). The (co)variance method for selecting SNP presented in the current study is formulated to maximize LD with missing loci while minimizing LD among SNP on the assay and maintaining high MAF.
      We applied our SNP selection method to select SNP within overlapping intervals of an arbitrary length of 1 Mbp across the whole genome, which was equivalent to a block with approximately 1% recombination per generation. Ideally, SNP should be selected within LD blocks separated by recombination hotspots. This would minimize the crossing over that weakens the LD between markers so that the selected markers remain useful for imputation over many generations. However, there is currently no high-accuracy recombination map for the bovine HD genotypes that can be used to define SNP selection intervals. Alternatively, given that LD varies across the bovine genome (e.g., Sargolzaei et al., 2008), the interval for selecting markers could also be varied to match the observed pattern of LD. The (co)variance method will account for the variation in LD across the genome because it selects SNP until a target proportion of the variation in SNP is accounted for. More SNP will be selected in regions where LD is low than where LD is high.
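      The behavior described above (more SNP chosen where LD is low) can be illustrated with a simplified greedy tagging sketch. This is explicitly not the (co)variance method of this study, whose formulation is given in the Materials and Methods; it is a stand-in that picks SNP within one interval until the mean squared correlation between every SNP and its best selected tag reaches a target. The function name, the `target` threshold, and the use of `w` as an exponent on 2 × MAF are assumptions made for illustration only.

```python
import numpy as np

def greedy_tag_selection(geno, target=0.8, w=1.0):
    """Greedy SNP selection within one interval (illustrative stand-in).

    geno: animals x SNP array coded 0/1/2.
    target: stop once the mean r^2 between each SNP and its best tag >= target.
    w: hypothetical weight; score is multiplied by (2 * MAF) ** w.
    """
    geno = np.asarray(geno, dtype=float)
    n_snp = geno.shape[1]
    freq = geno.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    # Pairwise squared correlations; monomorphic SNP yield NaN, set to 0.
    with np.errstate(invalid="ignore", divide="ignore"):
        r2 = np.atleast_2d(np.corrcoef(geno, rowvar=False)) ** 2
    r2 = np.nan_to_num(r2)
    explained = np.zeros(n_snp)          # best r^2 of each SNP with any selected SNP
    taken = np.zeros(n_snp, dtype=bool)
    selected = []
    while explained.mean() < target and not taken.all():
        # Gain in total explained r^2 if candidate j (column) were added.
        gain = np.maximum(r2, explained[:, None]).sum(axis=0) - explained.sum()
        score = gain * (2.0 * maf) ** w
        score[taken] = -np.inf
        best = int(np.argmax(score))
        taken[best] = True
        selected.append(best)
        explained = np.maximum(explained, r2[:, best])
    return selected

rng = np.random.default_rng(1)
n_animals = 200
# High-LD interval: 10 identical copies of one SNP (complete LD).
base = rng.binomial(2, 0.5, n_animals)
high_ld = np.tile(base[:, None], (1, 10))
# Low-LD interval: 10 independently sampled SNP.
low_ld = rng.binomial(2, 0.5, size=(n_animals, 10))

n_high = len(greedy_tag_selection(high_ld))
n_low = len(greedy_tag_selection(low_ld))
```

      With these simulated intervals a single SNP is enough to tag the high-LD block, whereas the low-LD block needs several, mirroring the qualitative point made in the text.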
      The selection of SNP could be undertaken within larger intervals or even across the whole chromosome. This would decrease the number of selected SNP required to explain a certain amount of variation between SNP compared with choosing SNP from shorter intervals. However, selection of SNP from long intervals would lead to selections based on lower levels of LD that are less useful for imputation and more likely to decay over time than selection from short intervals (de Roos et al., 2008). Genomic intervals could be defined so that the average LD between markers within the interval passes a predefined threshold. For example, Meuwissen et al. (2001) suggested that a squared correlation between adjacent markers equal to 0.2 or 0.3 is useful for GS. The selection of SNP within small intervals will be particularly important in crossbred populations such as the one we examined, where there is no pedigree information, very little family structure, many generations of recombination that break up ancestral haplotypes, and low levels of LD (Figure 1 and Supplemental Table S4; https://doi.org/10.3168/jds.2018-14621) and hence large effective population size.
      The (co)variance method for SNP selection provided the highest accuracy of imputation of low-density genotypes of East African crossbred dairy cattle compared with other methods tested in this study (Table 6). Selection of SNP based on MAFI was inferior to the COV method because it does not account for the LD between SNP and hence can select subsets of SNP that have high LD with each other and generally contain lower information. This problem is worse when SNP are selected based on highest MAF across the whole chromosome (MAFC) because MAFC is not optimized for uniformity across the chromosome and hence can leave gaps with little information for imputation. Random selection of SNP within intervals (RANI) or across chromosomes (RANC) provided very similar accuracies to each other at all densities. This suggests that even at the lowest densities used here, uniformity of marker spacing is not a particularly important factor for accuracy of imputation if SNP are selected at random.
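      The four alternative selection schemes compared above can be sketched as follows, based only on how they are described in this section (highest MAF or random choice, within intervals or across the chromosome). The function names and toy data are hypothetical, and the sketch uses non-overlapping intervals for simplicity, whereas the study used overlapping 1-Mbp intervals.

```python
import numpy as np

def select_across_chromosome(maf, n_select, by_maf, rng):
    """MAFC-style (by_maf=True) or RANC-style (by_maf=False) selection."""
    maf = np.asarray(maf, dtype=float)
    if by_maf:
        idx = np.argsort(-maf, kind="stable")[:n_select]
    else:
        idx = rng.choice(len(maf), size=n_select, replace=False)
    return np.sort(idx)

def select_within_intervals(pos_bp, maf, interval_bp, per_interval, by_maf, rng):
    """MAFI-style (by_maf=True) or RANI-style (by_maf=False) selection."""
    pos_bp = np.asarray(pos_bp, dtype=np.int64)
    maf = np.asarray(maf, dtype=float)
    interval_id = pos_bp // interval_bp
    chosen = []
    for iv in np.unique(interval_id):
        members = np.flatnonzero(interval_id == iv)
        k = min(per_interval, len(members))
        if by_maf:
            pick = members[np.argsort(-maf[members], kind="stable")[:k]]
        else:
            pick = rng.choice(members, size=k, replace=False)
        chosen.extend(int(p) for p in pick)
    return np.sort(np.array(chosen, dtype=np.int64))

rng = np.random.default_rng(0)
# Toy chromosome: 5 SNP spread over three 1-Mbp intervals (hypothetical values).
pos = [100, 200, 1_500_000, 1_600_000, 2_100_000]
maf = [0.10, 0.40, 0.45, 0.50, 0.05]

mafc = select_across_chromosome(maf, 2, by_maf=True, rng=rng)
mafi = select_within_intervals(pos, maf, 1_000_000, 1, by_maf=True, rng=rng)
ranc = select_across_chromosome(maf, 2, by_maf=False, rng=rng)
rani = select_within_intervals(pos, maf, 1_000_000, 1, by_maf=False, rng=rng)
```

      In the toy data, the MAFC-style pick takes both SNP from the middle interval and leaves the other two intervals uncovered, whereas the MAFI-style pick places one SNP in every interval; this is the kind of gap the text attributes to MAFC.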
      The imputation accuracies in Table 6 are somewhat lower than accuracies reported in the literature for purebred dairy populations but are within the same range as those from populations with greater genetic diversity (e.g., Hozé et al., 2013). Crossbred populations resulting from many generations of admixture are expected to have larger Ne and weaker long-distance LD compared with purebred populations (e.g., Lu et al., 2012). Hence, larger reference populations are required to capture the haplotype diversity in crossbreds and to achieve an accuracy of imputation similar to that in purebreds. Hozé et al. (2013) reported lower imputation accuracies in beef breeds compared with dairy breeds, where the former group in general showed a higher rate of decay of LD across their genome. Bolormaa et al. (2015) also reported lower imputation accuracies for a crossbred sheep population than those obtained for purebred sheep breeds.
      The squared correlations (r2) between GEBVHD and GEBVIMP shown in Table 7 confirm that genotype imputation is an effective approach for obtaining accurate GEBV from low-density genotypes in the crossbred dairy cattle populations of East Africa. The r2 between GEBVHD and GEBVIMP was high even when a small number of SNP were used in imputation and the accuracy of imputation was lowest. The high r2 obtained from GEBVIMP also implies that the loss in genetic gain from using imputed genotypes instead of real genotypes is minimal in East African crossbred dairy cattle. Imputation will increase the accuracy of GEBV further by allowing a bigger reference population for a given genotyping budget. We obtained an r2 of 0.95 between GEBVHD and GEBVIMP when around 4K SNP were imputed up to 700K HD genotypes. Similar patterns have been reported in previous studies. Weigel et al. (2010) showed that the predictive ability of imputed genotypes is 95% of that from the real genotypes when 3K genotypes are imputed up to 50K in Jersey cattle. Daetwyler et al. (2011) found that the accuracy of GEBV from imputed genotypes was 95% of that from real genotypes when 87.8% of genotypes were correctly imputed. Cleveland and Hickey (2013) reported a correlation of 0.95 between GEBV from genotypes imputed from 3K to 50K compared with the real 50K genotypes in pigs.
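      The comparison between GEBV from real and imputed genotypes can be sketched with simulated data. The example below is a generic ridge-regression (SNP-BLUP) stand-in and should not be read as the prediction model of this study; the genotypes, phenotypes, shrinkage parameter `lam`, and the 5% random error rate used to mimic imputation error are all assumptions made for illustration.

```python
import numpy as np

def snp_blup_gebv(geno_train, y_train, geno_target, lam):
    """Ridge-regression (SNP-BLUP) predictions for target animals.

    Uses the dual form, so only an n_train x n_train system is solved.
    """
    mean = geno_train.mean(axis=0)
    z_train = geno_train - mean
    z_target = geno_target - mean
    n = z_train.shape[0]
    alpha = np.linalg.solve(z_train @ z_train.T + lam * np.eye(n),
                            y_train - y_train.mean())
    snp_effects = z_train.T @ alpha
    return z_target @ snp_effects

def squared_correlation(a, b):
    """Squared Pearson correlation between two prediction vectors."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

rng = np.random.default_rng(0)
n_train, n_target, n_snp = 150, 50, 400
freq = rng.uniform(0.1, 0.9, n_snp)
geno_train = rng.binomial(2, freq, size=(n_train, n_snp)).astype(float)
geno_target_hd = rng.binomial(2, freq, size=(n_target, n_snp)).astype(float)
true_effects = rng.normal(0.0, 1.0, n_snp)
y = geno_train @ true_effects + rng.normal(0.0, 10.0, n_train)

# Mimic imputation error: replace 5% of target genotypes with random calls.
geno_target_imp = geno_target_hd.copy()
wrong = rng.random(geno_target_imp.shape) < 0.05
geno_target_imp[wrong] = rng.integers(0, 3, size=int(wrong.sum()))

lam = 100.0  # hypothetical shrinkage parameter
gebv_hd = snp_blup_gebv(geno_train, y, geno_target_hd, lam)
gebv_imp = snp_blup_gebv(geno_train, y, geno_target_imp, lam)
r2_hd_imp = squared_correlation(gebv_hd, gebv_imp)
```

      The quantity `r2_hd_imp` plays the role of the r2 between GEBVHD and GEBVIMP in Table 7: it measures how closely predictions from error-containing genotypes track predictions from the real HD genotypes, using the same estimated SNP effects for both.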
      The r2 between GEBVHD and GEBVSEL, however, was small when low-density SNP panels were directly used in estimation of GEBV without imputation, and it only went above 0.95 when at least 80K SNP were included. Therefore, the best strategy for obtaining GEBV from low-density panels is to first impute them to HD genotypes and then incorporate the imputed genotypes in GS. Even when the imputations have high error rates, the bias from the wrongly inferred genotypes will not propagate into the accuracy of subsequent genomic prediction (Wu et al., 2016), and the effectiveness of selection on GEBVIMP would be very similar to selection based on GEBVHD.
      The (co)variance method presented in the current study was optimized to achieve a high accuracy of genotype imputation, yet it can also be used to select reduced SNP assays for GS. Because GS relies on LD between markers and QTL to estimate GEBV, the selection of SNP for direct use in GS might not need an adjustment for MAF. To compare genomic prediction accuracies obtained from SNP selected with (w = 1) and without (w = 0) adjustment for MAF, we also selected different densities of SNP with no adjustment for MAF and subsequently used them for prediction of GEBV. The accuracy of genomic prediction from the (co)variance method with no adjustment for MAF was highest at almost all densities of selected SNP compared with the other SNP selection methods (Supplemental Table S5; https://doi.org/10.3168/jds.2018-14621). The accuracies from the (co)variance method with an adjustment for MAF (w = 1) were higher when fewer than around 15K SNP were selected, but higher accuracies were obtained with no adjustment for MAF (w = 0) at higher densities of selected SNP. Genomic prediction methods that rely on information from individual SNP genotypes rather than a genomic relationship matrix (e.g., Bayesian methods) might achieve higher accuracies of genomic prediction when SNP from the (co)variance method with no adjustment for MAF are used to predict GEBV.

      CONCLUSIONS

      The results of the current study confirm that the genotypes of East African crossbred dairy cattle can be imputed with sufficiently high accuracy to achieve highly accurate GEBV, thus providing large savings in genotyping costs, or increased effectiveness of selection for a fixed genotyping budget. These results will be applied to optimize genotyping within the current Africa Dairy Genetic Gains program, which aims to demonstrate effective and sustainable genetic improvement for smallholder crossbred cattle in East Africa. There is no great advantage of including the current purebred reference animals in the crossbred imputation, but if the actual purebred ancestors were available, using them for imputation of crossbred animals might provide a greater increase in accuracy. The presented method for SNP selection is straightforward in its application and can be useful in any population, especially those that rely on population-based methods of imputation.

      ACKNOWLEDGMENTS

      The authors thank the Bill & Melinda Gates Foundation for funding the Dairy Genetics East Africa (DGEA) project. Illumina (Illumina, San Diego, CA) and Geneseek (Neogen Corporation, Lincoln, NE) kindly provided contributions to genotyping costs. Special thanks to Ed Rege (PICO Eastern Africa, Nairobi, Kenya), who co-designed and helped lead the DGEA project, and to Julie Ojango, James Rao, Denis Mujibi, and Tadelle Dessie of the International Livestock Research Institute (ILRI, Kenya and Ethiopia), who facilitated and undertook the field sampling that allowed this research to be undertaken. The British Friesian genotype data were kindly provided by Scottish Rural University College (SRUC, Scotland), and the Ayrshire genotypes were kindly supplied by the Canadian Dairy Network (CDN, Canada). We also thank the smallholder farmers who participated in the DGEA project and provided samples and data on their animals.

      REFERENCES

        • Boichard D.
        • Chung H.
        • Dassonneville R.
        • David X.
        • Eggen A.
        • Fritz S.
        • Gietzen K.J.
        • Hayes B.J.
        • Lawley C.T.
        • Sonstegard T.S.
        • Van Tassell C.P.
        • VanRaden P.M.
        • Viaud-Martinez K.A.
        • Wiggans G.R.
        • Bovine LD Consortium
        Design of a bovine low-density SNP array optimized for imputation.
        PLoS One. 2012; 7: e34130
        • Bolormaa S.
        • Gore K.
        • van der Werf J.H.J.
        • Hayes B.J.
        • Daetwyler H.D.
        Design of a low-density SNP chip for the main Australian sheep breeds and its effect on imputation and genomic prediction accuracy.
        Anim. Genet. 2015; 46 (26360638): 544-556
        • Browning B.L.
        • Browning S.R.
        Genotype imputation with millions of reference samples.
        Am. J. Hum. Genet. 2016; 98 (26748515): 116-126
        • Calus M.P.L.
        • Bouwman A.C.
        • Hickey J.M.
        • Veerkamp R.F.
        • Mulder H.A.
        Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: A review of livestock applications.
        Animal. 2014; 8 (25045914): 1743-1753
        • Cleveland M.A.
        • Hickey J.M.
        Practical implementation of cost-effective genomic selection in commercial pig breeding using imputation.
        J. Anim. Sci. 2013; 91 (23736050): 3583-3592
        • Corbin L.J.
        • Kranis A.
        • Blott S.C.
        • Swinburne J.E.
        • Vaudin M.
        • Bishop S.C.
        • Woolliams J.A.
        The utility of low-density genotyping for imputation in the Thoroughbred horse.
        Genet. Sel. Evol. 2014; 46 (24495673): 9
        • Daetwyler H.D.
        • Wiggans G.R.
        • Hayes B.J.
        • Woolliams J.A.
        • Goddard M.E.
        Imputation of missing genotypes from sparse to high density using long-range phasing.
        Genetics. 2011; 189 (21705746): 317-327
        • Das S.
        • Forer L.
        • Schönherr S.
        • Sidore C.
        • Locke A.E.
        • Kwong A.
        • Vrieze S.I.
        • Chew E.Y.
        • Levy S.
        • McGue M.
        • Schlessinger D.
        • Stambolian D.
        • Loh P.-R.
        • Iacono W.G.
        • Swaroop A.
        • Scott L.J.
        • Cucca F.
        • Kronenberg F.
        • Boehnke M.
        • Abecasis G.R.
        • Fuchsberger C.
        Next-generation genotype imputation service and methods.
        Nat. Genet. 2016; 48 (27571263): 1284
        • de Roos A.P.W.
        • Hayes B.J.
        • Spelman R.J.
        • Goddard M.E.
        Linkage disequilibrium and persistence of phase in Holstein–Friesian, Jersey and Angus cattle.
        Genetics. 2008; 179 (18622038): 1503-1512
        • Habier D.
        • Fernando R.L.
        • Dekkers J.C.M.
        Genomic selection using low-density marker panels.
        Genetics. 2009; 182 (19299339): 343-353
        • Hayes B.J.
        • Bowman P.J.
        • Daetwyler H.D.
        • Kijas J.W.
        • van der Werf J.H.J.
        Accuracy of genotype imputation in sheep breeds.
        Anim. Genet. 2012; 43 (22221027): 72-80
        • Hickey J.M.
        • Crossa J.
        • Babu R.
        • de los Campos G.
        Factors affecting the accuracy of genotype imputation in populations from several maize breeding programs.
        Crop Sci. 2012; 52: 654-663
        • Hickey J.M.
        • Kinghorn B.P.
        • Tier B.
        • van der Werf J.H.
        • Cleveland M.A.
        A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation.
        Genet. Sel. Evol. 2012; 44 (22462519): 9
        • Howie B.
        • Fuchsberger C.
        • Stephens M.
        • Marchini J.
        • Abecasis G.R.
        Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.
        Nat. Genet. 2012; 44 (22820512): 955
        • Hozé C.
        • Fouilloux M.-N.
        • Venot E.
        • Guillaume F.
        • Dassonneville R.
        • Fritz S.
        • Ducrocq V.
        • Phocas F.
        • Boichard D.
        • Croiseau P.
        High-density marker imputation accuracy in sixteen French cattle breeds.
        Genet. Sel. Evol. 2013; 45 (24004563): 33
        • Khatkar M.S.
        • Moser G.
        • Hayes B.J.
        • Raadsma H.W.
        Strategies and utility of imputed SNP genotypes for genomic analysis in dairy cattle.
        BMC Genomics. 2012; 13 (23043356): 538
        • Loh P.-R.
        • Palamara P.F.
        • Price A.L.
        Fast and accurate long-range phasing in a UK Biobank cohort.
        Nat. Genet. 2016; 48 (27270109): 811
        • Lu D.
        • Sargolzaei M.
        • Kelly M.
        • Li C.
        • Vander Voort G.
        • Wang Z.
        • Plastow G.
        • Moore S.
        • Miller S.
        Linkage disequilibrium in Angus, Charolais, and Crossbred beef cattle.
        Front. Genet. 2012; 3 (22912646)
        • Mathew B.
        • Léon J.
        • Sillanpää M.J.
        A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction.
        Heredity. 2018; 120 (29238077): 356-368
        • Meuwissen T.
        • Hayes B.
        • Goddard M.
        Genomic selection: A paradigm shift in animal breeding.
        Anim. Front. 2016; 6: 6-14
        • Meuwissen T.H.E.
        • Hayes B.J.
        • Goddard M.E.
        Prediction of total genetic value using genome-wide dense marker maps.
        Genetics. 2001; 157 (11290733): 1819-1829
        • Ojango J.M.K.
        • Marete A.
        • Mujibi D.
        • Rao J.
        • Pool J.
        • Rege J.E.O.
        • Gondro C.
        • Weerasinghe W.M.S.P.
        • Gibson J.P.
        • Okeyo A.M.
        A novel use of high density SNP assays to optimize choice of different crossbred dairy cattle genotypes in small-holder systems in East Africa.
        in: Proc. 10th World Congr. Genet. Appl. to Livest. Prod. Am. Soc. Anim. Sci., Champaign, IL. 2014: 2-4
        • Sargolzaei M.
        • Chesnais J.P.
        • Schenkel F.S.
        A new approach for efficient genotype imputation using information from relatives.
        BMC Genomics. 2014; 15 (24935670): 478
        • Sargolzaei M.
        • Schenkel F.S.
        • Jansen G.B.
        • Schaeffer L.R.
        Extent of linkage disequilibrium in Holstein cattle in North America.
        J. Dairy Sci. 2008; 91 (18420642): 2106-2117
        • Strucken E.M.
        • Al-Mamun H.A.
        • Esquivelzeta-Rabell C.
        • Gondro C.
        • Mwai O.A.
        • Gibson J.P.
        Genetic tests for estimating dairy breed proportion and parentage assignment in East African crossbred cattle.
        Genet. Sel. Evol. 2017; 49 (28899355): 67
        • VanRaden P.M.
        Efficient methods to compute genomic predictions.
        J. Dairy Sci. 2008; 91 (18946147): 4414-4423
        • Ventura R.V.
        • Lu D.
        • Schenkel F.S.
        • Wang Z.
        • Li C.
        • Miller S.P.
        Impact of reference population on accuracy of imputation from 6K to 50K single nucleotide polymorphism chips in purebred and crossbreed beef cattle.
        J. Anim. Sci. 2014; 92 (24663187): 1433-1444
        • Ventura R.V.
        • Miller S.P.
        • Dodds K.G.
        • Auvray B.
        • Lee M.
        • Bixley M.
        • Clarke S.M.
        • McEwan J.C.
        Assessing accuracy of imputation using different SNP panel densities in a multi-breed sheep population.
        Genet. Sel. Evol. 2016; 48 (27663120): 71
        • Weigel K.A.
        • de los Campos G.
        • Vazquez A.I.
        • Rosa G.J.M.
        • Gianola D.
        • Van Tassell C.P.
        Accuracy of direct genomic values derived from imputed single nucleotide polymorphism genotypes in Jersey cattle.
        J. Dairy Sci. 2010; 93 (20965358): 5423-5435
        • Wu X.-L.
        • Xu J.
        • Feng G.
        • Wiggans G.R.
        • Taylor J.F.
        • He J.
        • Qian C.
        • Qiu J.
        • Simpson B.
        • Walker J.
        • Bauck S.
        Optimal design of low-density SNP arrays for genomic prediction: Algorithm and applications.
        PLoS One. 2016; 11 (27583971): e0161719
        • Zimin A.V.
        • Delcher A.L.
        • Florea L.
        • Kelley D.R.
        • Schatz M.C.
        • Puiu D.
        • Hanrahan F.
        • Pertea G.
        • Van Tassell C.P.
        • Sonstegard T.S.
        • Marçais G.
        • Roberts M.
        • Subramanian P.
        • Yorke J.A.
        • Salzberg S.L.
        A whole-genome assembly of the domestic cow, Bos taurus.
        Genome Biol. 2009; 10 (19393038): R42