If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, DK-8830 Tjele, DenmarkAnimal Breeding and Genomics, Wageningen University and Research, Wageningen, 6700 AH, the Netherlands
Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture of China, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture of China, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
In genome-wide association studies (GWAS), sample size is the most important factor affecting statistical power that is under control of the investigator, posing a major challenge in understanding the genetics underlying difficult-to-measure traits. Combining data sets available from different populations for joint or meta-analysis is a promising alternative to increasing sample sizes available for GWAS. Simulation studies indicate statistical advantages from combining raw data or GWAS summaries in enhancing quantitative trait loci (QTL) detection power. However, the complexity of genetics underlying most quantitative traits, which itself is not fully understood, is difficult to fully capture in simulated data sets. In this study, population-specific and combined-population GWAS as well as a meta-analysis of the population-specific GWAS summaries were carried out with the objective of assessing the advantages and challenges of different data-combining strategies in enhancing detection power of GWAS using milk fatty acid (FA) traits as examples. Gas chromatography (GC) quantified milk FA samples and high-density (HD) genotypes were available from 1,566 Dutch, 614 Danish, and 700 Chinese Holstein Friesian cows. Using the joint GWAS, 28 additional genomic regions were detected, with significant associations to at least 1 FA, compared with the population-specific analyses. Some of these additional regions were also detected using the implemented meta-analysis. Furthermore, using the frequently reported variants of the diacylglycerol acyltransferase 1 (DGAT1) and stearoyl-CoA desaturase (SCD1) genes, we show that significant associations were established with more FA traits in the joint GWAS than the remaining scenarios. However, there were few regions detected in the population-specific analyses that were not detected using the joint GWAS or the meta-analyses. Our results show that combining multi-population data set can be a powerful tool to enhance detection power in GWAS for seldom-recorded traits. Detection of a higher number of regions using the meta-analysis, compared with any of the population-specific analyses also emphasizes the utility of these methods in the absence of raw multi-population data sets to undertake joint GWAS.
In genome-wide association studies (GWAS), sample size is the most important factor that affects statistical power and is under control of the investigator. Limited sample size is hence a major hurdle in GWAS for traits that are difficult or expensive to measure. In the livestock breeding industry, emerging phenotypes of interest for selective breeding are often expensive or difficult to measure. Measurements for such traits are limited to experimental samples from different populations. One option to deal with the limitation of sample size in understanding the genetics underlying such traits could be to combine the available smaller data sets for joint GWAS (mega-analysis) or to combine summaries of the individual GWAS for meta-analysis.
Combining data sets for large-scale joint GWAS has been used as an effective method to increase GWAS power in human disease association studies (
Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium A mega-analysis of genome-wide association studies for major depressive disorder.
Genome-wide associations for feed utilisation complex in primiparous Holstein-Friesian dairy cows from experimental research herds in four European countries.
) has been the meta-analysis of individual GWAS. Different methods of meta-analysis have been proposed, depending on the sources of information used and assumptions regarding SNP effects in the different populations (
). The most common approaches are to combine P-values or transformed P-values, as in the weighted z-score method, or to use SNP effects, as implemented in either fixed or random effects models (
). The relative performance of the different meta-analyses approaches depends on the existence and extent of heterogeneity between studies and differences in sample sizes (
). With heterogeneity between individual studies, the random effects approach, which implicitly assumes different effect sizes in the null hypothesis, is commonly applied (
pointed out that assuming heterogeneity under the null hypothesis in the random effects approach makes the P-values overly conservative and suggested a modified random effects model.
Theoretical illustrations and simulation studies have indicated statistical advantages from combining data sets and GWAS summaries in enhancing QTL detection power (
). In human disease association studies, combining summaries of small GWAS in meta-analysis has become popular than combining raw data for mega-analysis, due to restrictions in sharing individual-level data (
). Using simulated and real case-control data, Lin and Zheng (2010) showed that meta-analysis of GWAS summaries can be as statistically efficient as mega-analysis combining individual-level data. Similarly, an empirical comparison by
reported comparable performance between mega- and meta-analysis in identifying gene × environment interactions. Such comparisons are not reported in livestock GWAS. Given relaxed restrictions in sharing individual-level data in the livestock genetics frontier compared with human disease studies, the choice of methods in joint GWAS for livestock traits might hinge solely on statistical efficacy, and thus comparison of meta- and mega-analysis might be crucial. Often, such comparisons are based on simulation studies, as real effects remain unknown. However, the complexity of genetics underlying most quantitative traits, which itself is not fully understood, is difficult to fully capture in simulated data sets. In this context, it is worth investigating the utility of these different approaches of combining data sets, using real data and comparing results—for instance, using frequently reported markers that are highly likely in cases of strong linkage disequilibrium (LD) with validated causative variants.
In this study, we investigated the advantages and challenges pertinent to combining multi-population data sets for joint GWAS and meta-analysis of population-specific studies using milk fatty acids (FA) measured on Dutch, Danish, and Chinese Holstein Friesian cows as example traits. Milk FA composition traits have attracted growing interest, mainly in relation to implications for human health. Better understanding of the genetics underlying these traits could help implement selective breeding for milk with specific fat compositions. Detailed milk fat composition is not routinely recorded. Gas chromatography analysis is currently the method of choice in determination of milk fat composition with high accuracy. However, this method is expensive and time-consuming, limiting the measurement of milk fat composition to experimental samples.
Combining multi-population data sets is not straightforward and comes with its own challenges. Heterogeneity of samples from the different populations is a major hurdle. Such heterogeneity might arise, for instance, due to genetic distance between the populations, differences between trait measurements, different environmental exposures, and different genotyping chips (
The main objective of this study was to assess the advantages and challenges of combining multi-population data or GWAS summaries in detection of genomic regions over individual within-population analyses, and to compare the relative performance of combining raw data (mega-analysis) versus combining within-population GWAS summaries (meta-analysis) using milk FA composition traits as example. Detection of significant associations with the frequently reported variants within the DGAT1 (
Positional candidate cloning of a QTL in dairy cattle: Identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition.
) regions, in connection to milk FA composition, received due emphasis in comparing results of the different GWAS scenarios. Possible sources of heterogeneity in the FA between the sample populations and potential implications of these on the different GWAS scenarios are discussed.
MATERIALS AND METHODS
Animals and Phenotypes
Measurements for 13 FA traits including C8:0, C10:0, C12:0, C14:0, C14:1, C15:0, C16:0, C16:1, C18:0, C18:1 cis-9, C18:2n-6, C18:3n-3, and C18:2 cis-9,trans-11 (CLA) were obtained from test-day milk samples of 784 Chinese, 675 Danish, and 1,566 Dutch Holstein cows. Quantification of the FA traits was based on the GC method, as previously presented in detail by
for Dutch samples. Desaturation indexes were also calculated based on the FA measurements as follows: C14 index = C14:1/(C14:1 + C14:0) × 100; C16 index = C16:1/(C16:1 + C16:0) × 100; and C18 index = C18:1 cis-9/(C18:1 cis-9 + C18:0) × 100.
Cows were sampled from 18 herds in China, 22 herds across Denmark, and 398 herds in the Netherlands. Stage of lactation in sampled cows ranged from 3 to 700 DIM in the Chinese population, 9 to 481 DIM in the Danish population, and 60 to 278 DIM in the Dutch population. To standardize the samples from the 3 countries, only cows at DIM 60 and above were considered for the association analyses. Thus, 700 Chinese, 614 Danish, and 1,566 Dutch samples were available for the association analyses. The reason to standardize the data set by lactation stage is that the genetic determination of milk fat composition traits might be different in the early stages of lactation. Evidence suggests that effects of genes in early lactation differ from those later in lactation (
). By excluding early-lactation records, we eliminate this heterogeneity issue.
Genotypes and Imputation
The BovineSNP50 BeadChip (50K; Illumina, San Diego, CA) was used to genotype cows in the Chinese data set. Imputation of the 50K genotypes to high density (HD) was then performed with the Fimpute software package (
), using a reference population of 96 Chinese Holstein bulls genotyped with BovineHD BeadChip (777K; Illumina) that included all sires of the imputation target cows in the Chinese data set. In the Danish data set, 278 cows were genotyped using the BovineSNP50 BeadChip. The remaining Danish Holstein cows were genotyped using the BovineHD BeadChip and used as reference to impute the 50K genotypes of the first part of the Danish cows to HD, as described by
A custom 50K SNP beadchip, designed by CRV (Arnhem, the Netherlands), was used to genotype all cows in the Dutch data set. A reference population of 1,333 animals from the Dutch Holstein data set, with HD genotypes, was then used to impute the 50K genotypes to HD, as described in
. Those SNP with minor allele frequencies (MAF) less than 0.05, or with a count of 1 of the genotypes of less than 10 in each population, were excluded from the association analyses, population-specific as well as joint GWAS. A total of 464,130 SNP were available in common for the population-specific as well as the combined-population analyses. Positions on the bovine genome assembly UMD3.1 (
Principal component analysis (PCA) was used to investigate the genetic structure of the 3 populations. The PCA was implemented using a genomic relationship matrix constructed with all HD markers. The proportion of variance explained by, and the relationship between, the first 2 principal components was examined to show the relationship between the populations.
Genome-Wide Bin-Wise LD and MAF Analysis
Genome-wide pair-wise LD was calculated between the SNP markers within a 1-Mbp window along the genome using the r and r2 values as a measure in the Plink program (
). The average within-population r2 and the correlation of r across populations were studied as functions of genomic distance, to determine the decay of LD with increasing distance within populations and to assess consistency in LD phases between the populations, respectively. Accordingly, marker pairs were first sorted by distance, with the maximum distance being 1 Mbp. Marker pairs with distance within each 1-kb range were then grouped together such that marker pairs with less than 1-kb distance were assigned to one group, marker pairs with distance from 1 to 2 kb were assigned to another group, and so on, forming in total 1,000 marker-pair groups within the 1-Mbp window. The r2 values were averaged for marker pairs for each group. Similarly, correlation in r values between the populations were computed for each of these 1,000 groups of marker pairs. Additionally, correlation of MAF in the 3 populations was assessed for non-overlapping bins of 100 SNP throughout the genome.
Statistical Analysis
Test for Phenotypic Differences
A 2-tailed t-test was carried out, testing the differences in phenotypic means of FA between the 3 populations using the t.test default function in R (
). Similarly, an F-test was carried out to test differences in standard deviations (SD) pair-wise. The test criterion was F = σ1/σ2, where σ1 is the larger of the 2 SD, with degrees of freedom f1 for the larger SD and f2 for the smaller SD (σ2).
Association Analysis
A single-SNP association test was implemented, using a mixed linear model in the GCTA program (
where yijkl is the phenotype of cow l; µ is the fixed effect of mean; parityi and herdj are the fixed effects of parity and herd, respectively; b1 is the regression coefficient for DIM; DIMijkl is a covariate of DIM (because only cows with more than 60 DIM were included in the analyses, a linear adjustment for DIM was sufficient); b2 is the allele substitution effect for SNP; SNPk is a covariate indicating the number of copies of a specific allele (0, 1, or 2) of the SNP; and animall is the random additive genetic effect. Animal effects were assumed to be distributed as
where G is the genomic relationship matrix constructed excluding the chromosome on which the SNPk is located and
is the genetic variance. Residuals were assumed to be distributed as
where I is the identity matrix and
is the residual variance.
Homogeneity of residuals was assessed by plotting the residuals against predicted phenotypes from the model used to estimate heritability; that is, model [1] without the SNP effect. For some FA, especially for C18:2n-6, C18:3n-3, and CLA, residuals tend to increase with the mean, indicating heterogeneity of the residual variance (Figure 1). Therefore, records for these FA were log-transformed in both the combined and the population-specific analyses.
Figure 1Residuals computed from the mixed linear model with G-matrix constructed from all SNP plotted against predicted phenotypes in combined data set, colored according to population (blue asterisks = Dutch samples, red triangles = Danish samples, green circles = Chinese samples) in C18:2n-6 before (A) and after (B) log-transformation, in C18:3n-3 before (C) and after (D) transformation, and in CLA before (E) and after (F) transformation.
Significance of association was determined using a threshold −log10P-value of 5. Choice of this significance threshold was based on false discovery rate of 5% for which the corresponding −log10P-values ranged between 3.4 and 5, depending on the trait and population. Therefore, to apply a common cutoff point across FA traits and populations, we used a −log10P-value of 5 as the genome-wide significance threshold.
To determine whether a region harbored 1 or more QTL, iterative approaches fitting the effect of SNP with the highest −log10P-values were employed. In this approach, the SNP with the highest −log10P-value for the studied FA trait was considered the lead SNP. The allelic dosage of such a lead SNP was then fitted as fixed effect for a second round of chromosome-wide analyses. If other SNP, also significantly associated in the first-round GWAS, were still found to have −log10 P-value >5 in the second-round analysis, the SNP with the highest −log10P-value in the second analysis was taken as the second lead SNP and its allelic dosage fitted as fixed effect for a third round of analysis. This procedure was iterated until no further SNP with −log10P-value >5 was observed. The SNP that showed significant association in a round of GWAS but showed −log10P-value <5 upon fitting the allelic dosage of the lead SNP were then considered as part of a region around that lead SNP. The position of the first and last such SNP before and after the lead SNP were considered as the boundaries of the region.
Heritability (h2) estimates were computed for the populations separately, as well as for the combined data set, as follows:
where
is the additive genetic variance estimated using model [1] but without fitting an effect for SNP and using G constructed from all SNP, and
is the residual variance.
Meta-Analysis of Population-Specific GWAS Results
We compared the performance of our joint GWAS with meta-analysis of summaries from the population-specific studies. A modified random effects meta-analysis model proposed by
was implemented, using METASOFT program. In short, the method assumes that the effect size of a variant is different among individual studies and follows a probability distribution with mean μ and variance τ2, which are assumed to be 0 under the null hypothesis. The method involves a likelihood-ratio test, with the test statistic
built as follows:
where βi is the effect estimated for population i, Vi is the square standard error (SE) of βi, and
and
are the maximum likelihood estimates of μ and τ2 that are derived using iterative procedure.
) also provides heterogeneity statistics, with Cochran's test statistic (Q) and I2 values estimated thus:
where B is a combined effect across population, computed as
and wi is the weight for the corresponding population, computed as
As a heterogeneity test statistics, it has been suggested that Cochran's test statistic (Q) is underpowered when the number of studies used for meta-analysis is small. Other robust test statistics have been proposed (
). In our study, to examine the effect of heterogeneity under the different meta-analysis scenarios, one of these heterogeneity statistics (I2) was computed as
Power Calculations
To quantify the theoretical expected gain in power as a result of combining the data sets for GWAS analysis, we ran a power test based on the sample size of each population and the combined data set for varying scenarios of QTL-explained variance, assuming similar LD structures, using the package ldDesign in R (
). The parameter settings used for the ldDesign package were allele frequencies of P = q = 0.5, assuming the same LD between markers and QTL in the different populations, with r2 value of 0.3 and a significance threshold −log10P-value of 5.
RESULTS
Descriptive Statistics and Genetic Parameters
Table 1 presents phenotypic means and SD for FA traits in the 3 populations. Significant differences between the 3 populations were observed in phenotypic means for several FA. Phenotypic means in the Chinese samples were, in general, lower for short- and medium-chain FA and higher for most long-chain FA. The largest difference in phenotypic means between the 3 populations was observed for C18:2n-6, which was 3 times higher in the Chinese samples (3.99) compared with the mean values in the Dutch (1.11) and Danish (1.74) samples. Large differences in phenotypic means were also shown for C8:0 and C18:1 cis-9 between the Chinese samples compared with the Dutch and Danish samples. Significant differences in SD occurred between the populations but not to the same extent as for the means. In only 3 FA did all 3 populations differ significantly. Standard deviations were generally lower for most FA in the Chinese sample compared with those for the Dutch and Danish samples.
Table 1Phenotypic means and SD for the studied milk fatty acid (FA) traits in the different population samples
Table 2 presents additive genetic variances and heritability estimates for the studied milk FA traits in the Dutch, Danish, and Chinese Holsteins as well as in the combined data set. Due to relatively small sample sizes, estimates of additive genetic variances in general showed large SE, some of which were larger than the estimates. For most FA, additive genetic variances in Dutch and Danish samples were similar, but additive genetic variances in the Chinese data were significantly lower than the other 2 populations. Heritability estimates were higher for most FA in the Dutch samples compared with the Danish and Chinese samples. Heritability estimates from the combined analysis were moderate to high, with the highest value estimated for the C14 index (0.53) and the lowest for C18:3n-3 (0.21).
Table 2Genetic parameters (±SE) for milk fatty acids (FA) in 1,566 Dutch, 614 Danish, and 700 Chinese Holstein samples, with combined analysis of data sets
Parameter estimates in the combined analysis were made before any data transformation. NL = Dutch Holstein; DK = Danish Holstein; CN = Chinese Holstein. σ2a = additive genetic variance.
FA
NL
DK
CN
Combined
σ2a
h2
σ2a
h2
σ2a
h2
σ2a
h2
Saturated FA
C8:0
0.01 (0.04)
0.48 (0.06)
0.01 (0.07)
0.33 (0.10)
0.002 (0.04)
0.06 (0.05)
0.008 (0.04)
0.27 (0.03)
C10:0
0.09 (0.11)
0.51 (0.06)
0.09 (0.17)
0.36 (0.10)
0.02 (0.09)
0.16 (0.07)
0.07 (0.09)
0.39 (0.04)
C12:0
0.12 (0.14)
0.40 (0.06)
0.10 (0.19)
0.30 (0.10)
0.04 (0.13)
0.21 (0.07)
0.09 (0.11)
0.33 (0.04)
C14:0
0.27 (0.21)
0.39 (0.06)
0.15 (0.32)
0.14 (0.10)
0.21 (0.28)
0.22 (0.08)
0.21 (0.17)
0.25 (0.03)
C15:0
0.006 (0.04)
0.29 (0.06)
0.007 (0.05)
0.27 (0.10)
0.001 (0.02)
0.10 (0.07)
0.004 (0.02)
0.23 (0.04)
C16:0
2.79 (0.66)
0.48 (0.06)
0.75 (0.76)
0.12 (0.09)
0.77 (0.51)
0.27 (0.08)
1.80 (0.48)
0.34 (0.04)
C18:0
0.78 (0.39)
0.37 (0.06)
0.53 (0.49)
0.23 (0.10)
0.54 (0.43)
0.25 (0.08)
0.52 (0.29)
0.25 (0.04)
Unsaturated FA
C14:1
0.03 (0.07)
0.55 (0.06)
0.03 (0.09)
0.49 (0.10)
0.01 (0.06)
0.35 (0.09)
0.03 (0.05)
0.47 (0.04)
C16:1
0.05 (0.08)
0.65 (0.05)
0.07 (0.13)
0.42 (0.10)
0.02 (0.09)
0.26 (0.09)
0.05 (0.07)
0.46 (0.04)
C18:1 cis-9
1.90 (0.58)
0.41 (0.06)
0.37 (0.66)
0.07 (0.08)
1.33 (0.68)
0.24 (0.08)
1.38 (0.46)
0.27 (0.04)
C18:2n-6
0.007 (0.04)
0.27 (0.06)
0.01 (0.07)
0.17 (0.09)
0.03 (0.12)
0.26 (0.10)
0.01 (0.05)
0.18 (0.03)
C18:3n-3
0.002 (0.02)
0.27 (0.06)
0.0004 (0.02)
0.05 (0.08)
0.0001 (0.01)
0.05 (0.06)
0.005 (0.01)
0.19 (0.03)
CLA
0.009 (0.04)
0.32 (0.06)
0.002 (0.04)
0.11 (0.09)
0.001 (0.02)
0.15 (0.07)
0.004 (0.02)
0.21 (0.04)
Desaturation indexes
C14 index
1.81 (0.47)
0.62 (0.05)
2.10 (0.65)
0.59 (0.10)
0.89 (0.51)
0.36 (0.09)
1.57 (0.37)
0.53 (0.03)
C16 index
0.39 (0.23)
0.55 (0.06)
0.46 (0.37)
0.37 (0.10)
0.17 (0.25)
0.24 (0.08)
0.32 (0.19)
0.38 (0.04)
C18 index
6.99 (1.03)
0.49 (0.06)
3.46 (1.18)
0.26 (0.10)
1.90 (0.83)
0.21 (0.07)
3.95 (0.73)
0.31 (0.04)
1 Parameter estimates in the combined analysis were made before any data transformation. NL = Dutch Holstein; DK = Danish Holstein; CN = Chinese Holstein. σ2a = additive genetic variance.
Figure 2 presents the PCA plot of the genomic relationships between the 3 populations. The first 2 principal components together explained 32.5% of the variation. Overlapping clusters of the 3 populations indicate that the populations are genetically similar.
Figure 2Principal component analysis plot showing the relationship between the first 2 principal components (PC1, PC2) and the proportions of genetic variance explained (% explained var.) among Chinese (red circles), Danish (green triangles), and Dutch (blue asterisks) Holstein Friesian cows.
The genome-wide LD analysis showed that the 3 populations have similar LD structures across the genome. Figure 3 shows mean LD (r2 values) between pairs of SNP within 1-Mbp bins across the genome. The maximum mean bin-wise r2 was 0.71 for the 3 populations, and the minimum mean bin-wise r2 were 0.07 for the Dutch and Danish populations and 0.06 for the Chinese population. Figure 4 presents the correlations of r values between the 3 populations for the same pair of markers in relation to genomic distance. The correlations in r values for closely located pairs of markers (<10 kb) was greater than 0.8 between all 3 populations but declined with increasing marker distance, with the correlation between Chinese and Dutch Holsteins having the lowest value across the 1-Mbp bins. Correlation in MAF between the populations averaged for bins of 100 SNP throughout the genome was 0.87 between the Danish and Dutch populations and between the Danish and Chinese populations, and 0.81 between the Chinese and Dutch populations.
Figure 3Linkage disequilibrium (LD) for the Dutch (NL, blue triangles), Danish (DK, red squares), and Chinese (CN, green crosses) Holstein Friesian genotypes. The y-axis is the mean bin-wise LD, and the x-axis is the physical distance between pair-wise markers in megabase pairs (Mbp).
Figure 4Correlations in r values for pair of SNP between Chinese and Danish (CN_vs_DK, blue points), Chinese and Dutch (CN_vs_NL, red points), and Danish and Dutch (DK_vs_NL, green points) Holstein Friesian cows, as a function of marker pair-wise distance in megabase pairs (Mbp).
Genomic Regions Detected Across the Different Analyses
Table 3 shows genomic regions significantly associated with at least 1 FA trait using the different analyses. Using the different scenarios (population-specific, combined population, and meta-analyses), a total of 67 genomic regions were found to be significantly associated with the studied FA. Regions were identified on all the chromosomes except BTA 18. Only 3 regions (14a, 14b, and 26) were commonly detected with significant association to at least 1 FA across the different scenarios—that is, all population-specific analyses and joint GWAS, as well as the 3 different meta-analyses.
Table 3Genomic regions associated with milk fatty acid (FA) traits detected using population-specific analysis, combined population analysis, and meta-analysis
Number of FA, out of the 16 traits studied, with significant associations in the respective analyses. NL = Dutch Holstein; DK = Danish Holstein; CN = Chinese Holstein. Joint GWAS = combined genome-wide association study.
NL
DK
CN
Joint GWAS
Meta-analysis
1a
19.92
19.93
—
—
—
1
1
1b
60.0
60.0
—
—
—
—
1
1c
101.0
101.0
—
—
—
1
—
1d
132
132
—
2
—
—
—
1e
141.3
142.5
—
—
—
1
1
2a
12.5
19.8
—
—
—
2
3
2b
54.9
59.8
1
—
—
4
2
2c
64.1
67.8
—
—
—
2
2
2d
106.5
135.6
1
—
—
4
2
3a
8.5
8.7
1
—
—
—
—
3b
104.2
104.2
—
—
—
—
1
3c
116.2
119.4
—
—
—
2
—
4
15.59
15.6
—
—
—
1
1
5a
10.33
10.36
—
—
1
1
1
5b
65.7
82.8
—
—
1
2
2
5c
87.4
100.0
7
2
—
10
9
6
41.4
41.4
—
—
—
1
—
7a
5.05
5.09
1
—
—
—
—
7b
14.6
15.5
—
—
—
2
1
7c
78.4
78.4
—
—
1
1
1
7d
81.6
83.2
—
—
—
2
2
8a
57.5
59.7
—
—
—
3
2
8b
79.9
98.4
—
—
—
3
3
9a
25.5
25.6
—
—
—
1
—
9b
81.3
81.3
—
—
—
1
—
9c
97.1
98.8
1
—
—
—
2
10a
1.1
8.6
2
1
—
2
2
10b
11.7
12.9
1
—
—
2
—
10c
73.4
73.5
—
—
—
—
1
10d
78.1
80.1
1
—
—
1
1
10e
87.3
93.1
1
—
—
3
2
11a
24.7
26.7
—
—
—
1
1
11b
58.81
58.89
—
—
—
1
1
12a
17.1
17.1
—
—
—
1
1
12b
24.0
24.8
—
—
—
1
—
12c
70.0
77.4
—
—
1
2
2
12d
86.4
86.4
—
—
1
—
—
13
64.6
65.7
2
—
—
1
2
14a
1.5
5.0
13
8
4
14
14
14b
5.2
20
11
5
1
12
13
14c
44.7
49.9
1
—
—
4
2
15a
27.2
31.2
—
—
3
3
3
15b
46.9
65.9
—
—
1
1
3
16a
23.8
25.22
—
—
—
2
2
16b
57.53
57.58
2
—
—
2
2
17a
17.4
22.6
—
—
—
2
2
17b
27.8
44.1
4
—
—
5
3
19
37.3
61.3
6
—
2
8
7
20a
1.8
11.0
2
—
—
—
—
20b
32.4
34.2
—
—
1
2
2
20c
36.7
36.9
—
—
1
2
—
20d
55.3
60.4
—
—
—
2
1
21
53.8
59.1
1
—
—
4
3
22
59.12
59.13
—
—
—
1
—
23a
21.22
21.23
2
—
—
—
—
23b
26.7
32.7
—
—
—
1
2
23c
33.5
36.5
2
—
—
1
2
23d
40.7
43.5
2
—
1
3
2
24a
6.82
6.85
—
—
1
—
—
24b
10.2
10.2
—
—
—
1
—
25a
9.8
9.9
—
—
—
1
—
25b
24.7
24.7
—
—
—
1
1
25c
41.4
41.7
—
—
—
2
—
26
2.9
43.0
6
4
2
11
8
27
37.0
42.2
—
—
1
1
1
28
36.6
37.2
—
—
—
2
—
29
32.9
40.5
2
—
—
2
1
1 BTA number plus letters to denote multiple regions within a chromosome.
2 Number of FA, out of the 16 traits studied, with significant associations in the respective analyses. NL = Dutch Holstein; DK = Danish Holstein; CN = Chinese Holstein. Joint GWAS = combined genome-wide association study.
The largest number of significantly associated regions was identified using the joint GWAS on the combined data set: 56 regions were detected with significant associations with at least 1 of the 16 FA traits studied. The detected regions were spread across all the chromosomes except BTA 18. Of all regions detected using the joint GWAS, 28 regions were not detected in any of the population-specific analyses, suggesting increased detection power from combining the data sets. Our computation of theoretical detection power, as a function of sample size and proportion of explained variance by a QTL, also shows that a QTL explaining more than 5% of genetic variance can be detected with a power of 0.97 in the combined data set, compared with a power of 0.57 in the Dutch data set, 0.08 in the Chinese, and 0.05 in the Danish (Figure 5).
Figure 5Theoretical expectations of genome-wide association detection power based on sample sizes for the Dutch (NL, blue line), Danish (DK, red line), and Chinese (CN, green line) data sets and the combined multi-population data set (black line). The y-axis represents the detection power using the different scenarios (sample sizes), and the x-axis is the simulated QTL variance.
The separate analysis of the Dutch samples detected 24 regions with SNP significantly associated with at least 1 of the 16 FA traits, except C18:0. Four of these regions were detected only in the Dutch data and not in any of the other scenarios, including the joint GWAS and meta-analyses. These regions exclusively detected for the Dutch samples were significantly associated with C18:3n-3 (region 3a), C16:0 (region 7a), C16:1 and the C16 index (region 20a), and with the C18 index and CLA (region 23a). Separate analysis based on the Danish samples detected significant associations between the FA traits and SNP at 6 regions found on BTA 1, 5, 10, 14, and 26. Significant associations in the Danish sample were limited to 9 FA traits, with no significant association detected for C8:0, C10:0, C12:0, C14:0, C18:0, the C18 index, or CLA. One of the regions detected in the separate analysis for the Danish population (region 1d) was significantly associated with C14:1 and the C14 index but was not detected in any of the other scenarios. The separate GWAS for the Chinese population detected 16 regions. Significant associations detected in the Chinese sample were limited to C14:1, the C14 index, C18:0, C18:1, and C18:2n-6. Significant associations detected in the separate analysis for the Chinese population with C18:2n-6 (region 12d) and C18:0 (region 24a) were not detected in the other population-specific analyses or in the combined GWAS and meta-analyses. Supplemental Files S1–S5 (https://doi.org/10.3168/jds.2019-16676) show Q-Q plots and the genomic inflation factor (λ) values for all the within-population analyses and the meta-analysis.
In this study, meta-analysis of summaries from the within-population GWAS was implemented based on the modified random effects approach of
. Heterogeneity statistics from the meta-analysis are presented in Supplemental File S6 (https://doi.org/10.3168/jds.2019-16676) for all significantly associated SNP detected in the meta-analysis. Significant associations were detected with the meta-analysis in 43 of the 56 regions detected with the joint GWAS. In total, significant associations were detected with at least 1 of the studied FA in 47 regions of the bovine autosomes, including 4 regions that were not detected using the joint GWAS.
Apart from differences in the number of regions detected for at least 1 FA across scenarios, we also discovered differences in the number of FA traits significantly associated with the detected regions. For region 14a, for instance, only 4 FA traits were found to have significant associations in the Chinese analysis, whereas analysis of the Dutch sample resulted in detection of significant associations with 13 FA traits. The same number of FA traits were found to have significant associations with regions 14a (14 FA) using the joint GWAS and meta-analysis. Interestingly, association with C14:1 of region 14b was found only in the separate analysis of the Chinese data. For BTA 26, significant associations were detected with 11 FA traits in the joint GWAS compared with 8 FA traits in the meta-analysis, followed by 6 FA in the Dutch, 4 in the Danish, and 2 in the Chinese separate analyses.
SNP Effects Across the Scenarios
Table 4 presents the estimated regression coefficients and −log10P-values for lead SNP on BTA 14 (SNP within the DGAT1 gene) and 26 (SNP within the SCD1 gene) found to have the strongest associations with the studied FA traits. The results also show that the combined-population analysis resulted in substantially increased −log10P-values for the significant regions in most of the traits compared with the population-specific GWAS. For instance for the C14 index, −log10P-value for the lead SNP on chromosome 26 increased from 70.9 in the Dutch analysis to 126.1 in the combined analysis. These results also show that when the associations were significant, directions of SNP effects were similar for the 3 populations. Apart from direction of effects, we have compared estimated effects of the DGAT1 (ARS-BFGL-NGS-4939) and SCD1 (BovineHD2600005461) loci standardized with the SD of the FA in the combined data set (by dividing the estimates by the SD of the FA) in the different populations. Figure 6 shows correlations between standardized effects of DGAT1 marker in the 3 populations for all FA except C12:0, C14:1, and C18:0, which are not significantly affected by DGAT1 in the combined analysis. The plots indicate that not only were directions of SNP effects similar but the estimates of DGAT1 effect in the Dutch and Danish population are very similar (high correlation and regression coefficient of about 1). The Chinese population showed a different pattern: the correlation between effects in the Dutch and Chinese population is high, as is the correlation between effects in the Danish and Chinese populations. However, the regression coefficients are approximately 0.5 (0.57 and 0.52), indicating that the standardized effect sizes of DGAT1 in the Chinese population are about half of those observed in the Dutch and Danish populations. Looking at effects in individual FA, lower SNP effects were consistently estimated for the FA where phenotypic averages in the Chinese sample significantly differed compared with the other populations (C8:0, C18:1 cis-9, and C18:2n-6). Effect of DGAT1 loci on C8:0 was not significant in the Chinese or Danish data sets; therefore valid comparison cannot be made. For C18:1 cis-9 and C18:2n-6, however, the standardized effects of the DGAT1 loci were the lowest in the Chinese data set compared with the Dutch and Danish samples.
Table 4Population-specific and combined-population regression coefficients [b(±SE)] and log-transformed P-values (−log10P) for lead SNP on BTA 14 and 26
Figure 6Correlations in the effects of DGAT1 locus (ARS-BFGL-NGS-4939) between the 3 studied populations (DK = Danish, NL = Dutch, CN = Chinese), standardized by dividing the SNP effects by the SD of the fatty acid trait in the combined data set: relationship of DGAT1 effects in the (A) Danish and Dutch populations; (B) Chinese and Dutch populations, and (C) Chinese and Danish populations.
For SCD1, the number of FA detected across analyses and significantly affected in the joint GWAS was much lower: 5 FA. However, the correlations of loci effects in these FA showed similar trends with that of the DGAT1 effect, such that we found high correlations of effects between the populations but lower effect sizes for the Chinese population (Figure 7).
Figure 7Correlations in the effects of SCD1 locus (BovineHD2600005461) between the 3 studied populations (DK = Danish, NL = Dutch, CN = Chinese), standardized by dividing the SNP effects by the SD of the fatty acid trait in the combined data set: relationship of SCD1 effects in the (A) Danish and Dutch populations; (B) Chinese and Dutch populations; and (C) Chinese and Danish populations.
Detection of Genomic Regions Under the Different Scenarios
Our combined-population GWAS resulted in detection of 28 regions significantly associated with 1 or more of the studied FA traits in addition to those detected in the population-specific analyses altogether. We have also shown that −log10P-values increased up to 2 fold in the joint GWAS compared with the population-specific analyses. Apart from detection of more regions, we also found more FA significantly associated with the identified regions in the joint GWAS compared with the number of FA associated with similar regions in the population-specific studies. As theoretically expected, under the assumption that traits are genetically the same in the different studies, these results demonstrate that combining data sets can substantially increase detection power.
The regions detected on BTA 14 (region 14a) and BTA 26 are known to contain the DGAT1 and SCD1 genes, respectively. Studies have frequently reported significant associations between several milk FA traits and the DGAT1 (
) genes. Detections in the DGAT1 region were similar for the joint GWAS and the meta-analysis, with significant associations established for 14 FA traits. In the separate analysis with the Dutch samples, significant associations were also detected with these traits, except C14:1. Previous studies have shown that the K allele of DGAT1 polymorphisms has a reducing effect in C14:0 (
). Through reduction of C14:0, DGAT1 is also expected to affect the concentrations (%wt/wt) of C14:1. Therefore, significant associations with C14:1 were expected.
The SCD enzyme is involved in the synthesis of monounsaturated FA, primarily by introducing a double bond in the delta-9 position of C14:0, C16:0, and C18:0 (
). The significant associations we detected across the different analyses with C14:1, C16:1, and the desaturation indexes of these FA are thus expected. However, through the desaturation process, SCD1 also affects the concentrations (%wt/wt) of C14:0, C16:0, and C18:0 and, thus, significant associations with these FA are also to be expected. However, significant associations for C16:0 were detected using the joint GWAS only. Likewise, significant associations detected in the region with C8:0 and C12:0 were limited to the joint GWAS.
These results indicate that the joint GWAS resulted in more power to detect associations with frequently reported variants within the validated DGAT1 and SCD1 regions compared with the rest of the scenarios, including the meta-analysis. Comparison of number of regions detected across the BTA for the studied FA also indicate that the joint GWAS showed enhanced power compared with the meta-analysis. In a multi-breed scenario,
Comparing power and precision of within-breed and multi-breed genome-wide association studies of production traits using whole-genome sequence data for 5 French and Danish dairy cattle breeds.
showed that the difference in detection between joint GWAS with combined raw data and meta-analysis of GWAS summaries increases with inclusion of distantly related breeds of cattle.
Of the 28 additional regions detected by the joint GWAS compared with the Dutch within-population analysis, only 15 were re-detected using the meta-analysis. This indicates the comparative performance of joint GWAS over meta-analysis, given similar model components. One advantage of joint GWAS with combined raw data sets might be the flexibility to employ different transformation and standardization strategies on the raw data sets and the ability to fit common model components as opposed to summaries of different studies, often with different model components, in meta-analyses. In our study, such data transformation and standardization strategies have remedied, to a certain extent, some observed heterogeneity between the samples, as discussed below.
Heterogeneity Between Samples
Few regions detected within each of the population-specific analyses were not detected in the remaining populations or in the joint GWAS and meta-analyses. Theoretically, it is expected that combining data, by increasing sample size, will enhance detection power and enable detection of regions with effects that are too small to meet the thresholds in the population-specific analyses. This is also demonstrated in our computation of theoretical expectations of detection power, as presented in Figure 5. However, these calculations assume that genotypic effects in the different populations are the same. In reality, this assumption of homogeneity might be incorrect for several reasons.
Differences in the pattern of LD structure over chromosomal regions of interest across populations might cause heterogeneity in the genetic effects (
). In our study, estimates of pair-wise LD within bins of 1-Mbp size indicate that the genome-wide LD pattern is similar between the populations. High correlations in r values between closely located marker pairs among the 3 populations also indicate consistency in LD phases. This is in agreement with a previous study by
Consistency of linkage disequilibrium between Chinese and Nordic Holsteins and genomic prediction for Chinese Holsteins using a joint reference population.
, which reported high consistency in LD between adjacent markers of the Chinese and Danish Holstein populations.
Phenotypic means and SD were significantly different among the 3 populations for most FA. The Chinese samples, especially, showed larger differences in phenotypic means compared with the Dutch and Danish samples. Such differences might partly be explained by differences in management systems, such as feeding, between the populations. In genetic analysis, known sources of variation can be accounted for in the statistical analysis. In our analyses, phenotypic differences between the populations are accounted for by fitting herd as a fixed effect. Because herds were unique for each country, the effect of herd is expected to also account for differences between countries. However, interaction of genotype with these sources of variation (e.g., genotype × feed interaction) might still have consequences in association analyses. Genotype × environment interactions can be quantified by calculating genetic correlations between the populations. This was not possible in our study due to small sample sizes within populations, leading to high SE of the correlation estimates.
The SNP effects in our analyses were similar in direction in the 3 populations whenever the associations were significant. Comparison of the standardized effects of DGAT1 and SCD1 loci on the FA traits also shows strong correlation between estimated effects for both loci in the 3 populations. However, effect sizes seem to be different in the Chinese data compared with the Dutch and Danish data. These differences might point to genotype × environment interaction. However, high correlations and similar direction of SNP effects between the populations suggest that this interaction is mostly due to scaling rather than re-ranking of genotype effects (strong correlation in effects and similar direction of effects). Because of high correlation between estimated effects, the data from the 3 populations do not contradict but support each other. We found no indications that the GWAS signal might disappear by combining data, due to effects that differ in direction (re-ranking). The estimated SNP effects imply that for at least the DGAT1 and SCD1 loci, the value of an observation from the Chinese population contributes less to the joint GWAS signal than does an observation from the other 2 populations, due to the smaller effect sizes.
Genomic Inflation Factor
Across the implemented analysis, evidence of inflated genome-wide statistics was observed with genomic inflation factor greater than 1 for most of the studied traits. This is unlikely to be caused by population structure, because in all scenarios the association analyses were adjusted for genetic relatedness estimated from the HD marker data. Studies based on theoretical demonstration and analysis of simulated data indicate that in the presence of polygenic inheritance, as is the case in milk FA traits, considerable genomic inflation is to be expected even in the absence of population structure and other technical artifacts (
Data Transformation and Standardization for Joint GWAS
In this study, different data transformation and standardization approaches are implemented to address differences in residuals, SD, and differences in stages of lactation between the samples. Residuals of some FA (namely, C18:2n-6, C18:3n-3, and CLA) tended to increase with the mean, indicating heterogeneity of the residual variances and violating the assumptions underlying significance testing. Logarithmic transformation was therefore applied for these traits, and the transformed values were used for both the population-specific analyses and the joint GWAS. Residuals plotted against predicted phenotypes in these FA traits (presented in Figure 1) indicate that the problem of heterogeneity is corrected by the transformation.
Milk FA traits can also differ between populations due to differences in lactation stages of the cows. In our study, an attempt was made to address this source of differences by restricting lactation stages to 60 DIM and above. Evidence suggests that effects of genes in early lactation differ from those later in lactation. For instance,
have reported significant DGAT1 × lactation stage interaction for milk production traits including fat content, and showed that the DGAT1 effect shows a large increase during early lactation (from the start of lactation to d 50 to 150) and tends to decrease later in lactation.
Some significant differences in SD were also observed between the samples from the 3 populations. To test sensitivity of such differences in relation to the joint GWAS outcomes, we standardized all the FA measurements within population to have mean 0 and SD 1, and a separate test joint GWAS was undertaken with this standardized data set. The joint GWAS using the standardized data set led to detection of significant association with 1 additional FA on BTA 13, but detections in the rest of the regions remained unchanged after the standardization, indicating that our joint GWAS is not substantially affected by differences in SD and highlighting the stability of our results.
CONCLUSIONS
Joint GWAS using multi-population data sets detected the highest number of regions and the highest number of associated FA traits compared with the population-specific analyses as well as the meta-analysis of population-specific GWAS summaries. However, detection of higher numbers of regions using the meta-analysis, compared with any of the population-specific analyses, emphasizes the utility of meta-analysis in the absence of raw multi-population data sets to undertake joint GWAS.
ACKNOWLEDGMENTS
GG was enrolled in the Erasmus-Mundus joint doctorate European Graduate School in Animal Breeding and Genetics and was also supported by the Center for Genomic Selection in Animals and Plants (GenSAP), funded by the Danish Council for Strategic Research (Copenhagen). The study used data from the Dutch Milk Genomics Initiative (www.milkgenomics.nl), High-Technology R&D Program of China (2013AA102504) and Beijing Dairy Industry Innovation Team (BAIC06-2018, China), and the Danish-Swedish Milk Genomics Initiative and the Milk Levy Fund (Denmark) projects “Phenotypic and Genetic Markers for Specific Milk Quality Parameters” and “The Importance of the Metagenome for Milk Composition and Quality.” All samples in the Dutch and Chinese data sets were collected in accordance with the guidelines and protocols for the care and use of animals approved by the Ethical Committee on Animal Experiments of Wageningen University (protocol: 200523.b) and the Animal Welfare Committee of China Agricultural University (Permit Number: DK996), respectively. All procedures to collect the Danish samples followed the protocols approved by the National Guidelines for Animal Experimentation and the Danish Animal Experimental Ethics Committee, and all sampling was restricted to routine on-farm procedures that did not cause any inconvenience or stress to the animals, and hence, no specific permission was required. The authors declare that they have no competing interests. List of SNP significantly associated with the studied traits in the meta-analysis, together with heterogeneity statistics and allelic frequencies in the Chinese, Danish, and Dutch Holstein herds, are available in Supplemental File S5 (https://doi.org/10.3168/jds.2019-16676). The raw data sets analyzed in the study are not publicly available but can be made available, on reasonable request, from the co-authors responsible for the different data sets (bart.buitenhuis@mbg.au.dk for the Danish, henk.bovenhuis@wur.nl for the Dutch, and sundx@cau.edu.cn for the Chinese). GG and HB conceived of the study, GG implemented analysis and drafted the manuscript, NAP and HJFV collected milk samples, and AJB and HB attracted funding. All authors contributed to the discussion of results and approved the final manuscript.
Positional candidate cloning of a QTL in dairy cattle: Identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition.
Comparing power and precision of within-breed and multi-breed genome-wide association studies of production traits using whole-genome sequence data for 5 French and Danish dairy cattle breeds.
Genome-wide associations for feed utilisation complex in primiparous Holstein-Friesian dairy cows from experimental research herds in four European countries.
Consistency of linkage disequilibrium between Chinese and Nordic Holsteins and genomic prediction for Chinese Holsteins using a joint reference population.