Identification of candidate novel production variants on the Bos taurus chromosome X

Chromosome X is often excluded from bovine genetic studies due to complications caused by the sex specific nature of the chromosome. As chromosome X is the second largest cattle chromosome and makes up approximately 6% of the female genome, finding ways to include chromosome X in dairy genetic studies is important. Using female animals and treating chromosome X as an autosome, we performed X chromosome inclusive genome-wide association studies in the selective breeding environment of the New Zealand dairy industry, aiming to identify chromosome X variants associated with milk production traits. We report on the findings of these genome-wide association studies and their potential effect within the dairy industry. We identify missense mutations in the MOSPD1 and CCDC160 genes that are associated with decreased milk volume and protein production and increased fat production. Both of these mutations are exonic SNP that are more prevalent in the Jersey breed than in Holstein-Friesians. Of the 2 candidates proposed it is likely that only one is causal, though we have not been able to identify which is more likely.


INTRODUCTION
Genome-wide association studies have become a popular method for examining genomes and identifying QTL and causal variants for target phenotypes.Until recently, chromosome X (ChrX) has been largely excluded from these GWAS due to the complications it introduces.In cattle, the inclusion of ChrX could be particularly important because ChrX is the second largest chromosome, representing approximately 6% of the genome in females.In Taurine cattle (Bos taurus) ChrX is a large chromosome (139 Mb), containing 1,222 genes, with a gene/Mbp ratio of 8.79 (Czech et al., 2020;Rosen et al., 2020).The inclusion of ChrX in genetic studies is important because of the core principles and methods of genetic improvement in the New Zealand dairy industry.For dairy farmers, while many factors contribute to profit generated from cows, phenotypes associated with milk production are some of the most important.
Expression of genes from ChrX is more complicated than from autosomes, and unlike ChrX inheritance patterns, dosage compensation is hard to predict.Females of XY sex-determined species undergo ChrX inactivation (XCI) to compensate for the female having 2 whole copies of ChrX while the male has only one.The XCI is the process whereby one of the female's ChrX is condensed to the point where (most of) the genes on the chromosome are not expressed (Heard et al., 1997).In the past this was believed to be a random process, with a 50/50 chance in every cell of either the maternal or paternal ChrX being inactivated.However, research shows that there are several instances where, for a variety of reasons, this is not the case (Hatakeyama et al., 2004;Migeon, 1998).For example, ChrX imprinting occurs in certain cell and tissue types, meaning that in that cell type only the maternal or paternal copy of certain genes on ChrX can be active (Sado and Ferguson-Smith, 2005).It has also been shown that in some tissues (e.g., mammary), one or the other ChrX is favored depending on the inactivation status of the original blastocyst cell(s) which that tissue grew from (Couldrey et al., 2017).While XCI applies across the entire chromosome and is maintained during cell division, some genes escape inactivation and are expressed even when they are on the inactivated copy of ChrX, resulting in females potentially having a double dose of ChrX product from that region compared with males.Patterns of escape from XCI are not equivalent across species or even across an individual animal, and certain genes are more pre-disposed to it than others, including genes in the ChrX pseudoautosomal region and genes with Y-homologs (Posynick and Brown, 2019).
In spite of the challenges in working with ChrX genotypes, specialized software toolkits similar to XWAS (Gao et al., 2015) have been developed to facilitate genetic studies of ChrX.We make use of genetic analysis programs such as PLINK (Purcell et al., 2007) and BOLT-LMM (Loh et al., 2018) which have had features added to allow them to cope with or at least recognize sex chromosome input.
GWAS of the cattle genome have been used to identify (among other things) milk composition QTL (Littlejohn et al., 2016, p. 1), milk production candidate genes and loci (Liu et al., 2020), novel cattle syndromes (Reynolds et al., 2021), candidate white spotting mutations (Jivanji et al., 2019), fertility variants in US Holstein Dairy bulls (Pacheco et al., 2020), and recurrent infertility in Japanese Black cattle (Arishima et al., 2017).Due to the 1,222 genes encoded on ChrX, it was expected that the chromosome would likely play a role in the expression of dairy production phenotypes.The recent imputation of DNA sequence variants on ChrX by Wang et al. (2021) allowed us to include ChrX in large scale sequence level GWAS and other genetic analyses.
Our main aim was to determine whether any significant relationships existed between ChrX variants and milk production phenotypes.We also wanted to determine whether a representative GWAS including ChrX could be achieved when separate GWAS were performed for individual breeds.The results of this study have the potential to affect and influence the breeding objectives and methods of the NZ dairy industry.

MATERIALS AND METHODS
No human or animal subjects were used, so this analysis did not require approval by an Institutional Animal Care and Use Committee or Institutional Review Board.

Study Population
The animals examined in this study are derived from a set of 91,872 female cattle born in New Zealand between 1990 and 2018.We considered animals from 3 breeds; Holstein-Friesian, Jersey, and crosses of these 2 breeds (henceforth referred to as HFxJ).To classify breeds, we used the NZ animal evaluation classification (DairyNZ, 2021) based on breed 16ths from a 4-generation family pedigree.Animals with at most 13/16ths of either Holstein-Friesian or Jersey are classified as HFxJ (n = 41,218).When specifically analyzing purebred 835) or purebred Jersey (n = 11,734) animals we are examining the "purebred" animals with 16/16 of their pedigree belonging to one breed.This leaves animals with 14/16ths or 15/16ths of their pedigree belonging to 186) or Jersey (n = 1,672) which are included in the analyses of all animals but not in that of purebred animals.It is worth noting that in the case of the 16/16ths purebred animals this pedigree-based breed classification does not necessarily represent a genetic purebred.Due to the speed of generation turnover in the NZ dairy industry an animal could very conceivably have all animals on its 4-generation pedigree belong to one breed classification, but a previous crossbred could have contributed DNA that persisted through the generations.A full breakdown of the number of animals with data for each phenotype can be seen in Supplemental Table S1 (https: / / data .mendeley.com/datasets/ 5yzy4p5zr5/ 1; Trebes, 2023).Mao et al. (2016), the imputation accuracy of the X chromosome was improved if the pseudoautosomal (PAR) region was treated as autosomal and both non-PAR and PAR regions were imputed separately.Thus, in all sets, genotype data were separated into non-PAR (ChrX: 0-133300518) and PAR (ChrX: 133300518-139009144) regions (Johnson et al., 2019).

Consolidation of SNP-Chip Panels for Sequence Imputation. According to
The genotyping panels were grouped into 3 sets: GGP panels (GGPv3, GGPv3.1, and GGPv4, totaling 1,577 ChrX variants), 50K panels (BovineSNP50v1 and BovineSNP50v2, totaling 1,177 ChrX variants), and GGP50k panels (GGP50kv1 and GGP50kv1.1,totaling 2,275 ChrX variants).All 3 sets were imputed separately to the BovineHD genotype panel, which contained 3,769 animals genotyped on Illumina 539 ChrX variants).The imputed and physically genotyped panels were then combined and imputed to sequence resolution using the sequencebased imputation reference population described below.Beagle 5.1 (Browning et al., 2018) was used for all phasing and imputation steps with the default parameters except for the effective population size that was set at 400, and a window size of 20 Mb.
Reference Population for Sequence-Based Imputation.Whole-genome sequence data were available for 1,298 animals which include 306 Holstein-Friesian, 219 Jersey, and 717 HFxJ crossbred animals or other breeds and crossbreeds (n = 56).The raw sequence data were mapped to the bovine genome build ARS-UCD1.2bovine reference genome (Rosen et al., 2018) using BWA-MEM algorithm (Li, 2013).The average sequence coverage across the complete reference population was 15.17 fold.850 animals with a depth above 10 were treated as high-depth sequence animals.The detected variants were then filtered using BCFtools (Li, 2011) and Plink 1.9 for high-depth sequenced animals, retaining variants with allele count of 2 (-i 'AC > 1', '-max-alleles 2'), variants with map quality higher than 50 (-i 'MQ > = 50'), and mendelian error rate below 5%.In the end, singletons and nonvariants were removed again which resulted in 589,484 variants for the X chromosome, in which 492,072 variants are located in the non-PAR region and 97,412 are located in the PAR region.The genotypes at the positions of those filtered variants were then extracted from the sequence data of all 1,298 animals and were phased using the software Beagle 5.1 to generate the sequence-based imputation reference panel.
The Final Genotype Set.A final stage of quality control was performed on the imputed genotypes, where all SNP with a minor allele frequency (MAF) below 0.005 or dosage R 2 below 0.9 were removed to bring ChrX more in-line with the autosomes.This resulted in a final imputed set of 16,122,289 autosomal variants and 368,647 ChrX variants.As we are examining only female animals in this study, we were able to re-join the imputed non-PAR and PAR and treat X as an autosome in our analyses.

Phenotypes
We examined the association between ChrX SNP and 12 production phenotypes.Throughout this report the nomenclature for these phenotypes is made up of 2 parts.First is the type of phenotype, this can be "volume" referring to milk volume (liters), "fat" referring to milk fat yield (kilograms), or "protein" referring to milk protein yield (kilograms).The trait is followed by the lactation (parity) denoted by the numbers "1" through to "4" where lactations 1, 2, and 3 represent lactations following a cow's first, second or third calf, while "lactation 4" is a binned aggregate of the fourth lactation and all latter lactations.
For each phenotype being studied, the animals with a phenotype record were selected from the total set of all valid genotyped animals (not all animals had records for all phenotypes, see Supplemental Table S1).Phenotype values were adjusted from the raw yield deviation based on the methods described in Reynolds et al. (2021).Briefly, the raw yield deviation phenotypes were adjusted using the national (New Zealand) genetic evaluation models.This method includes adjustments for age at calving, stage of lactation, record type (lactation traits may be recorded as a.m.milkings, p.m. milkings, or both), effect of induced calving, and pairwise heterosis between breeds.

GWAS Preparation, Process, and Parameters
GWAS were performed using the BOLT-LMM package (version 2.3.2,BOLT-LMM) using a leave one segment out approach, with the size of the segment to be left out set to 5Mbp, as in Tiplady et al. (2021).Infinitesimal mixed model association statistics were evaluated to assess the additive effect of each SNP.A set of autosomal SNP to be used in the BOLT-LMM mixed model (parameter: modelSnps) to account for population structure in the GWAS was generated based on the 50K SNP-chip imputation reference as described in Tiplady et al. (2021).To control genomic inflation (λ) and account for population structure on ChrX we selected 2,222 ChrX SNP to include in the modelSnps file.These ChrX SNP consist of 698 SNP from the Illumina 50K SNP-chip reference with a MAF >0.1 in the reference population, this was supplemented with 1,524 ChrX SNP from the imputed variants with a MAF >0.1.The number of SNP to include from ChrX (2,222) was chosen based on the size of ChrX being similar to the size of chromosome 2 which was represented by 2,221 SNP in the modelSnps file.
All phenotypes were first analyzed with a BOLT-LMM GWAS including all animals with no fixed effects applied, henceforth the (12) base GWAS.A second GWAS was then run for each phenotype using BOLT-LMM parameters to apply breed as a fixed effect.The purebred breed classification was also used to separate Holstein-Friesians and Jerseys and perform a third GWAS for each purebred breed with no fixed effects.
All phenotypes were then run through iterative GWAS as described in Tiplady et al. (2021) to attempt to distinguish between multiple QTL.Briefly, the iterative GWAS ran multiple rounds of GWAS for a phenotype and in each round included the most significant ChrX SNP from the previous GWAS round(s) as a fixed effect, this continued until there were no longer any significant (P < 5E-8) association effects observed on ChrX.The first iteration used the output of the base GWAS described above as input, i.e., iteration 1 occurred after the base GWAS (see Table 1) and accounted for the most significant ChrX SNP from that GWAS.
It must be noted that the GWAS significance threshold referred to throughout this study is at a Pvalue of 5E-8 and that this is not necessarily the best significance threshold for all or any of these phenotypes.The 5E-8 significance threshold was proposed by Risch and Merikangas (1996) for human GWAS and it has previously been applied to cattle studies (Jivanji et al., 2019;Xiang et al., 2021).We have used this P-value significance threshold to allow easy com-parison with other publications, but this was always used in conjunction with a visual assessment of the GWAS results.

Analysis of Areas of Interest
Regions of interest identified by the GWAS were examined and analyzed with a variety of approaches as described below.As the production phenotypes of milk volume, fat and protein are known to be highly correlated we employed a somewhat holistic approach to classify our candidate SNP of interest.For a SNP to be classified as a candidate it had to meet the below conditions across the phenotypes and lactations (though not necessarily in all phenotypes and lactations).First the SNP must be within the region of interest identified by the GWAS and must have a significant association in several of the 12 phenotypes.Second the SNP must be in high linkage disequilibrium (LD R 2 > 0.85) with the SNP at the top of the production peak.Finally, the SNP must have a deleterious effect as classified by the Ensembl Variant Effect Predictor (described below).
To further assess the quality of the imputed sequence around regions of interest, and to confirm that the SNP we saw were not due to genotyping or sequencing artifacts we examined the region of interest using the Integrated Genomics Viewer (IGV) (Thorvaldsdóttir et al., 2013).We know from other research in our group that there are regions of the genome that are poorer quality which could potentially have been enriched by the genotyping and imputing to sequence pipelines.Therefore, inspecting the genome with IGV is standard practice for us to mitigate these risks.The IGV was also used to identify recombinants.
We then used the Ensembl Variant Effect Predictor (VEP) tool (McLaren et al., 2016) to analyze the potential effects of the SNP around and including the SNP of interest.The VEP was also used in this manner to access the SIFT score (Vaser et al., 2016) for SNP to assess whether a SNP was likely to have a deleterious effect.Homologene (Sayers et al., 2021) was used to assess the homology and pairwise alignment scores of the genes of interest.SNAP2 scores (Hecht et al., 2015) were used to assess the effect of sequence variants on the resulting protein.
While GWAS detect associations between all genotypes and a phenotype (similar to an ANOVA), we were also interested in whether there are differences in phenotype between specific genotypes.For example, knowing if there is a significant difference between heterozygotes and either homozygote could indicate potential escape from ChrX inactivation.To determine if these significant differences between genotypes exist we used the R function "TukeyHSD" to calculate Tukey Honest Significant Differences (Tukey HSD; Miller, 1981;Yandell, 2017) based on ANOVA results generated by the R "aov" function (Chambers et al., 1992).These tests were performed for all animals, and for the purebred subsamples separately.
To further test whether the SNP of interest could be causing the production peak we performed a round of iterative GWAS for each phenotype, with one of the SNP of interest forced to be the first SNP accounted for as a fixed effect.Some phenotypes are shown on 2 lines, one with * to denote that line as the most significant production peak SNP, in these cases the SNP shown was either the second or third most significant ChrX SNP.Gene and class values determined from manual inspection of SNP location using IGV (Thorvaldsdóttir et al., 2013).Minor allele frequency shown in MAF column rounded to 5 decimal places.The MAF is specific to the phenotype or GWAS run and is not necessarily equal to the MAF in all imputed animals.
As part of investigating genes of high interest, we used the same bovine lactating mammary RNaseq data set (n = 423) as Davis et al. (2022) to perform a GWAS of ChrX for expression phenotypes (eQTL analysis) of genes of interest using the methods described in Prowse-Wilkins et al. (2022) employing GCTA (Yang et al., 2011) leave one chromosome out analysis.All animals used for the eQTL analysis were female, and all 3 breeds were represented.

Genome-Wide Association Studies
Our GWAS results showed significant associations between ChrX SNP and all phenotypes and lactations of the 12 base GWAS.In all the base GWAS a peak can be seen centered roughly around ChrX: 18000000 (see Figures 1 and 2).The peak formed in this region is henceforth referred to as the "production peak."The full range of most significant production peak SNP from these base GWAS can be seen in Table 1.Of these 12 base GWAS, the most significant production peak SNP falls in an exon of MOSPD1 5 times (always ChrX: 18584126C > G), an intron of FAM122B 5 times (see Table 1), and at 2 separate intergenic loci (ChrX: 18501994A > C, ChrX: 18348881C > T).
The P-values for the most significant production peak SNP ranged between 1.3E-10 (protein lactation 4) and 8E-37 (Volume lactation 2).There were 3 SNP that were significant in all 12 of the base GWAS, 2 of which were intergenic (ChrX: 18501994A > C and ChrX: 18504680C > T) and one within an exon of MOSPD1 (ChrX: 18584126C > G).This is likely a result of high LD between these SNP as the 2 intergenic SNP (ChrX: 18501994A > C and ChrX: 18504680C > T) have an LD R 2 of 0.997 and the same MAF of 0.057, the LD R 2 between both these SNP and ChrX: 18584126C > G is 0.803 (3dp).
In the base GWAS for fat lactations 1 to 3 and protein lactation 1 we observed that the most significantly associated SNP (ChrX: 34002922T > C, MAF = 0.009) appears on its own with no other SNP forming a peak up to it (see Figure 2).The highest LD observed between this SNP (ChrX: 34002922T > C) and any other was R 2 = 0.04.The ChrX: 34002922T > C was significant (P < 5E-8) in 11 of the 12 base GWAS but did not show a peak shape in any.A similar SNP was observed topping the production peak for protein lactation 4 at ChrX: 16697687C > T. While this SNP is within the bounds of the production peak we have ruled it out from being a true part of the production peak and having an effect on the GWAS itself.This is due to it being in very low LD (highest R 2 with any SNP is 0.002) and having little effect on the iterative GWAS as explained later.For these reasons we have ruled out these SNP (ChrX: 34002922T > C and ChrX: 16697687C > T) as belonging to the production peak.Figure 1 shows the results of the volume lactation 2 base GWAS across the genome.The Chromosome 14 DGAT1 peak (Grisart et al., 2002;Spelman et al., 2002) can be clearly seen dominating the genome in Figure 1 (this peak has been truncated for plotting, its true most significant P-value is 2.8E-808) with the ChrX production peak (top SNP ChrX: 18584126C > G) shown at the far right.Figure 2 shows the P-values across ChrX for all lactations or all phenotypes from the base GWAS.The production phenotypes are known to be highly correlated so seeing a common peak across the 3 phenotypes was not surprising.
Iterative GWAS.Performing iterative GWAS indicated that the production peak observed across the Manhattan plots is likely the result of a single QTL.The number of iterations required until no significant (P < 5E-8) associations were observed ranged between 2 and 5 and can be seen in Supplemental Table S3 (https: / / data .mendeley.com/datasets/ 5yzy4p5zr5/ 1; Trebes, 2023).
We observed 2 scenarios occurring in the iterative GWAS that enabled us to further define the production peak and surmise that this peak is likely caused by a single QTL.For those phenotypes where the base GWAS most significant ChrX SNP was ChrX: 34002922T > C (fat lactation 1 to 3 and protein lactation 1) or ChrX: 16697687C > T (protein lactation 4) the first iteration showed the production peak with significance unchanged or increased.In these cases all the most significant iteration 1 SNP were within the production peak in the range ChrX: 18475739G > C -ChrX: 18584126C > G.For the phenotypes where the base GWAS most significant ChrX SNP was atop the production peak, the first iteration revealed the whole of the production peak had dropped into insignificance and only a few (1-3) lone SNP (no peaks leading up to them) remained above the significance threshold (5E-8) spread across the chromosome.In these cases, the iteration 1 most significant SNP was either ChrX: 34002922T > C or ChrX: 16697687C > T.
Effect Size of SNP. Figure 3 shows the BOLT-LMM predicted effect size of SNP across ChrX.The SNP found to be strongly associated with volume and protein are predicted (by the GWAS) to have a (relatively) strong negative effect on the phenotype, while the fat effect is predicted to be positive.To provide context, Figure 3 also shows the GWAS predicted effect sizes from chromosome 14 volume lactation 1, as this shows the strong and well-known predicted effects of SNP within the DGAT1 gene (Grisart et al., 2002;Spelman et al., 2002).The SNP effect sizes shown in Figure 3 are BOLT-LMM predictions of the change to an animals' phenotype for each (active) copy of the alternate allele.Therefore, a positive value indicates a predicted increase in the phenotype while a negative value indicates a decrease.
Covariates.Performing separate GWAS for the purebred breeds showed a clear interaction between breed and the relationship between ChrX and the production phenotypes.Manhattan plots for the GWAS of separate breeds can be seen in Figure 4. We observed that for volume, all lactations showed a clear significant production peak in the Jerseys and a complete absence of this peak in Holstein-Friesians.For protein we observed the tell-tale shape of the production peak across all lactations of Jersey animals; however, this peak was only significant (P < 5E-8) in lactations 1 and 2, and again this peak was absent in Holstein-Friesians.For fat the production peak appears to be absent in both breeds for all lactations except in lactation 2 for Holstein-Friesians.
BOLT-LMM flags were also used to run GWAS with breed included in the model as a fixed effect.The results of this can be seen in Supplemental Figure S1 (https: / / data .mendeley.com/datasets/ 5yzy4p5zr5/ 1; Trebes, 2023) and show that there was little difference in the presence and strength of the production peak with or without breed applied as a fixed effect.

Identification of Candidate Causal Variants
Given the significance, novelty, and potential effects of the production peak, we prioritized SNP within this region for further investigation.To identify SNP of interest, we define the production peak as ChrX: 15 ,000 ,000-20 ,000 ,000 as this contains the visible peak on the Manhattan plots (Figure 2) and 98% of all significant ChrX SNP.Of the SNP in this region, 7 were predicted to have deleterious effects (VEP;McLaren et al., 2016), of which 2 (ChrX: 18584126C > G and ChrX: 18028733A > G, henceforth SNP of interest) were significantly associated with any of the phenotypes.The ChrX: 18584126C > G was significantly associated (P < 5E-8) with all 12 of the phenotypes base GWAS, and ChrX: 18028733A > G was significantly associated (P < = 5E-8) with 8 of the 12 phenotypes base GWAS (Volume all lactations, fat lactation 4, and protein lactations 1 to 3).
The SNP at ChrX: 18584126C > G is a missense mutation in the third exon (McLaren et al., 2016) of the gene Motile Sperm Domain-Containing protein 1 (MOSPD1).The Arginine to Proline substitution at MOSPD1 AA 96 caused by ChrX: 18584126C > G has a very strong SNAP2 (Hecht et al., 2015) signal for effect of 91.Variant Effect Predictor (McLaren et al., 2016) showed that of the 39 SNP in MOSPD1, ChrX: 18584126C > G was the only predicted deleterious SNP and had a SIFT score of "0.0" indicating maximum confidence in the deleterious prediction (Vaser et al., 2016).
The ChrX: 18028733A > G SNP is a missense mutation in the third exon of the gene Coiled Coil Domain-Containing protein 160 (CCDC160) (McLaren et al., 2016) and causes a Lysine to Glutamic Acid substitution.The ChrX: 18028733A > G SNP is predicted deleterious by VEP (SIFT score = 0.04) with the missense mutation receiving a SNAP2 score of 52 (indicating a strong signal for effect on the resulting protein).
The MAF for all GWAS animals for the ChrX: 18028733A > G SNP is 0.101 and for ChrX: 18584126C > G is 0.047 (both rounded to 3dp).The MAF of ChrX: 18028733A > G and ChrX: 18584126C > G in Jerseys are 0.091 and 0.086, respectively, and in Holstein-Friesians are 0.1 and 0.008, respectively.We can see in Figure 5 that the frequency of the alternate alleles for the 2 SNP of interest have been gradually decreasing over time.This coincides with the number of heterozygotes becoming gradually greater through the years compared with the homozygous alternate genotype.The SNP of interest show great variation in their LD between the breed populations.In all the GWAS animals, the LD R 2 between these SNP is 0.393.The LD R 2 between the SNP of interest in the purebred Jersey animals is 0.89 while for Holstein-Friesians the R 2 is 0.055.shows that ChrX: 18584126C > G is consistently in high LD (R 2 > 0.79) with the most significant production peak SNP in all GWAS animals.Supplemental Table S5 (https: / / data .mendeley.com/datasets/ 5yzy4p5zr5/ 1; Trebes, 2023) shows that ChrX: 18028733A > G is consistently in high LD (R 2 > 0.88) with the most significant production peak SNP in Jersey cattle, but not in Holstein-Friesians or all GWAS animals (Table 2).
Further to the predicted effect sizes shown across ChrX in Figure 3, Table 3 shows the predicted effect sizes per lactation of the 2 SNP of interest (ChrX: 18584126C > G, ChrX: 18028733A > G).The proportion of genetic variance and total phenotype variance attributed to each SNP of interest can be seen in Table 4.The proportion of total variance explained by ChrX: 18584126C > G ranged between 0.0006 and 0.00234 and for ChrX: 18028733A > G ranged between 0.00025 and 0.00121.The proportion of genetic variance explained by each SNP ranged between 0.00276 and 0.00891 for ChrX: 18284126 C > G and between 0.0011 and 0.00501 for ChrX: 18028733A > G.
As shown in Figure 6 we observed that heterozygotes for the SNP of interest produce less milk, and protein than homozygous reference animals, and that the homozygous alternate animals produce less milk again.These production differences were more obvious at the SNP ChrX: 18584126C > G than at ChrX: 18028733A > G.We tested the significance of these observed differences using Tukey HSD tests, the results of which are shown in Table 5 and expanded upon in Supplemental Tables S6 and S7 (https: / / data .mendeley.com/datasets/ 5yzy4p5zr5/ 1; Trebes, 2023).For the  .Filled sections of the plots show the allele frequencies in the total GWAS population, lines show the allele frequencies of the reference allele in the purebred Jersey and Holstein-Friesian animals.Note that before 1995 each year had less than 100 animals.
SNP ChrX: 18584126C > G, of the 36 Tukey HSD tests 32 were significant (P < 0.05), with the 4 insignificant results coming from the test between homozygous reference animals and heterozygotes for each lactation of the fat phenotype.For ChrX: 18028733A > G 26 of the 36 Tukey HSD tests were significant (P < 0.05).Again, all the insignificant Tukey HSD results came from the fat phenotype, with only fat lactation 2 showing the homozygous alternate animals had significantly different production to the homozygous reference and heterozygous animals.
To assess whether the genes MOSPD1 and CCDC1160 were undergoing differential expression we performed an eQTL analysis.The eQTL GWAS showed no ChrX peaks, indicating that no variants on ChrX were causing differential expression of either MOSPD1 or CCDC160.
Table 6 shows the top production peak SNP in iteration 1 after fitting the SNP of interest as fixed effects.Fitting ChrX: 18584126C > G as a fixed effect caused the production peak to drop into insignificance across all phenotypes and lactations with all P-values >6E-7.Fitting ChrX: 18028733A > G caused a reduction in the significance of the production peak in all phenotypes and lactations with all P-values >1.6E-17.For all the fat lactations and protein lactations 1 to 3, the same SNP topped the production peak after 1 iteration regardless of which of the SNP of interest was fitted as a fixed effect.Fitting ChrX: 18028733A > G for volume  lactations 1 to 3 and protein lactation 1 resulted in the iteration 1 most significant production peak SNP being the other SNP of interest ChrX: 18584126C > G.

DISCUSSION
In this paper we describe the use of GWAS to identify 2 candidate ChrX variants associated with milk production traits in dairy cattle.We used this study as an opportunity to evaluate different approaches for performing GWAS with an imputed ChrX and found that with female animals ChrX could be treated as an autosome, and that separating breeds can have important effects on the outcome of GWAS analysis of this type.

GWAS in Subsets of the Population
Previous work by our group has indicated significant challenges in accounting for all population structure in the NZ dairy population (B.Harris, personal communication, 2021).It was therefore important to understand whether the base GWAS results were due to population structure or whether the peaks observed are breed specific.We therefore ran separate GWAS for each breed and saw significant differences in the GWAS results with all animals (reflecting the HFxJ population that makes up most of the animals studied) and purebred Jerseys clearly showing a strong peak which was entirely absent in purebred Holstein-Friesians except in fat lactation 2. The differing associations observed between the breeds could be a result of differing allele frequencies of the SNP of interest.The MAF of the SNP of interest ChrX: 18584126C > G is 10 times higher in Jerseys than in Holstein-Friesians, with 2020 animals carrying the alternate allele in purebred Jerseys (MAF = 0.086) compared with 2999 in Holstein-Friesians (MAF = 0.008).However, it must be kept in mind that as well as MAF in the GWAS population, the sample sizes also differ between the breeds and phenotypes (see Supplemental Table S1) and that reducing the sample sizes by excluding the HFxJ and separating the purebred animals will have greatly reduced the power of the GWAS.For example, the base GWAS of all animals for fat lactation 2 had data from 65,822 animals, while the purebred analyses had only 12,691 and 9,476 animals for Holstein-Friesians and Jerseys, respectively.

SNP and Genes of Interest
While we have not yet proven that either of the SNP of interest (ChrX: 18584126C > G and ChrX: 18028733A > G) are causing the effects of the production peak, they are both rare, exonic missense SNP that have a very significant association with the production phenotypes in Jerseys and HFxJ thus making them strong candidates.It is possible that these SNP could be linked to Jersey milk having a higher fat content than Holstein-Friesian milk.Changes in the way NZ farmers are paid for milk, and the relative value placed on fats, proteins and milk volume in the NZ dairy industry could cause a change in the allele frequency of these variants.
All analyses showed a significant GWAS association between ChrX variants and the production phenotypes.Based on the results of the iterative GWAS (drastic drops in production peak significance when SNP from this region are accounted for as fixed effects, and low numbers of iterations required for all SNP to be below the P = 5E-8 significance threshold) it is likely that the production peak as it appears in each individual phenotype is the result of a single, or a very few QTL.Based on observations of ChrX: 34002922T > C and ChrX: 16697687C > T from the base GWAS and iteration 1, we believe that these 2 SNP are separate from the production peak.The ChrX: 34002922T > C is separate because of its distance from the peak, and the possibility of a sequencing error is ruled out by having ChrX: 34002922T > C as a fixed effect in the GWAS having little to no effect on the production peak.The ChrX: 16697687C > T is located within the bounds of the production peak, however when it is set as a fixed effect in the GWAS the production peak shows no great change.The inverse for both ChrX: 34002922T > C and ChrX: 16697687C > T also supports these SNP not belonging to the production peak in that the first GWAS iteration fitting a production peak SNP has little to no effect on the significance of these variants.
As both ChrX: 34002922T > C and ChrX: 16697687C > T appear as lone SNP in the initial base GWAS or throughout the iterations we believe that their significance is an artifact or error with both SNP having very low MAF (0.009 and 0.01 for ChrX: 34002922T > C and ChrX: 16697687C > T, respectively) and being in very low LD with nearby SNP (highest LD with any SNP being 0.4 and 0.002 for ChrX: 34002922T > C and ChrX: 16697687C > T, respectively).Due to the abnormalities with the LD, low MAF, lack of peak shape (SNP often appeared on their own) and the evidence presented from the GWAS and iterative GWAS, we discounted these 2 SNP as belonging to the production peak and from being SNP of interest.
We have identified animals with a recombination event in the production peak between the 2 SNP of interest (ChrX: 18584126C > G and ChrX: 18028733A > G).At present the number of animals with this recombination is small; however, once more recombinants are identified we may be better able to test if the production peak we observed is the result of a single QTL.To facilitate the discovery of more potential recombinants, we have begun a breeding trial inseminating cattle with one or more of the SNP of interest with sperm from bulls with one or more of the SNP of interest.The offspring resulting from these matings will potentially give us the opportunity to observe cattle with the SNP of interest throughout their lives on a more day-to-day scale than our current data provides, as well as enabling us to potentially examine changes to the traits of animals with a recombination in the production peak region.
The function of the CCDC160 protein that contains the SNP ChrX: 18028733A > G is unknown (Swiss Institute of Bioinformatics, 2023) though it is associated with ChrX-linked hypothyroidism and kidney disease.According to Homologene (Sayers et al., 2021) CCDC160 is conserved across tetrapods with pairwise alignment scores of 70.9% (protein) and 83.3% (DNA) between the Bos taurus and human genes.
ChrX: 18584126C > G lies within MOSPD1 which encodes a transmembrane tether protein localized to the Endoplasmic Reticulum (Cabukusta et al., 2020).The protein coded by this gene is proposed to be involved in Epithelial-to-Mesenchymal cell transition (Kara et al., 2015;Thaler et al., 2011) and differentiation or proliferation of mesenchymal stem cells.NCBI's tool Homologene (Sayers et al., 2021) reports MOSPD1 as conserved in Bilateria (animals with bilaterally symmetric embryos).The pairwise alignment score of the Bos taurus and human gene is reported as 97.2% (protein) and 93.6% (DNA; Sayers et al., 2021).
MOSPD1 is known to duplicate in humans as part of Xq25q26 duplication syndrome, potentially causing double outlet right ventricle (Hirota et al., 2017).In our data, duplications could potentially present as males showing heterozygous diploid for SNP on these genes (where they would normally be hemizygous).However, due to the methods with which our imputed genotypes were calculated, males were forced to hemizygosity -so this method would not show in the GWAS genotypes.None of the eQTL Manhattan plots generated for the genes of interest (MOSPD1, CCDC160) showed significant peaks on ChrX, indicating that if the effects described in this study are caused by either of these genes, it is likely not due to differing gene expression levels.Again, if either of these genes are causative, the lack of eQTL peaks also increases the likelihood the variant/s behind the differences in production are exonic, rather than intronic or regulatory as these would be more likely to effect expression levels and produce significant eQTL peaks.Hence the predicted effects seen in Table 3 are likely the result of the changes to the protein structure resulting from the missense mutations.
As a point of comparison for the ChrX chromosome effect sizes reported in Table 3, the effect size of average allele substitution of 2 DGAT1 SNP on chromosome 14 reported by Spelman et al. (2002) was −134 L/lactation and −110 L/lactation of milk volume for Holstein-Friesians and Jerseys, respectively.Spelman Phenotypes (rows) denoted followed by lactation number.The lactations shown are those with the most significant ChrX production peak SNP from the base GWAS.Analysis of variance performed using the R "aov" function.Tukey HSD performed with R function "TukeyHSD."Hyphenated numbers (e.g., 1-0, 2-0, and 2-1) indicate the 2 genotypes being tested for significance of difference in phenotype using Tukey HSD.Values of <2E-16 and 0 are the result of the R aov and TukeyHSD packages reporting values so small they were outside the precision capabilities of the package.This table is expanded in Supplemental Table S6 to show the results across all phenotypes, and in Supplemental Table S7 (https: / / data .mendeley.com/datasets/ 5yzy4p5zr5/ 1; Trebes, 2023) to show the results for the phenotypes shown here for the separate purebred breeds.
et al. ( 2002) also predicted the average change to milk fat yield being 5.76 kg/lactation and 3.3 kg/lactation for Holstein-Friesians and Jerseys, respectively, and milk protein changing by an average of −2.45 kg/lactation and −2.48 kg/lactation for Holstein-Friesians and Jerseys, respectively.These predictions, examining 2 alleles, are comparable in magnitude to the results we obtained for the effects of the SNP of interest reported here as ChrX: 18584126C > G and ChrX: 18028733A > G in the MOSPD1 and CCDC160 genes, respectively.

CONCLUSIONS
We found significant associations between 2 ChrX missense SNP and milk production phenotypes in New Zealand grazing HFxJ and purebred Jersey cattle.We observed differences in the existence/significance of genetic associations with production traits between the breeds, and so believe that for ChrX analysis in this context it is wise to examine the breeds separately as well as together.As no functional trials have been performed, we cannot say whether the SNP identified (ChrX: 18584126C > G; MOSPD1, ChrX: 18028733A > G; CCDC160) are indeed responsible for the GWAS production peaks we are seeing.Breeding trials are being carried out to attempt to further assess the actual effects of these SNP, functional trials will likely also be necessary.

Figure 2 .
Figure 1.Manhattan plot of volume lactation 2 no-fixed effect GWAS P-values.ChrX is shown in light gray on the far right.GWAS performed with BOLT-LMM including animals of all breeds.Plot generated using R CMplot package.This phenotype was chosen to exemplify the ChrX production peak as it has the most significant P-value of the no-fixed effect all-animals GWAS and is strongly related to both the other phenotypes.

Figure 4 .
Figure3.SNP effect sizes predicted by BOLT-LMM for the total GWAS population on ChrX and Chromosome 14 (Chr14).GWAS log 10 P-value indicated by point color.Chromosome 14 included for comparison with the DGAT1 peak.For each phenotype, the lactation that yielded the most significant ChrX production peak QTL in the base GWAS is shown.Y-axes units are liters/lactation for volume, and kilograms/lactation for fat and protein.A positive effect indicates a predicted increase in phenotype, while a negative effect predicts a decrease in phenotype.Effect is for each active copy of the alternate/minor allele.

Figure 5 .
Figure 5. Allele frequency (as a proportion of the GWAS population) for SNP of interest (ChrX: 18584126C > G and ChrX: 18028733A > G) over time.Filled sections of the plots show the allele frequencies in the total GWAS population, lines show the allele frequencies of the reference allele in the purebred Jersey and Holstein-Friesian animals.Note that before 1995 each year had less than 100 animals.

Figure 6 .
Figure 6.Boxplot of phenotype yields for animals of each genotype at the SNP ChrX: 18028733A > G and ChrX: 18584126C > G for the 3 phenotypes volume lactation 2, fat lactation 4, and protein lactation 2. The genotype (0, 1, or 2) refers to the number of copies of the minor or alternate allele the animal has at the specified SNP.The lactations shown were chosen as they had the most significant associations in the base GWAS.Numbers above each box indicate the number of animals in that group.Boxes represent the interquartile ranges (IQR), with the bottom of the box being the 25th percentile (Q1) and the top of the box being the 75th percentile (Q3), the median/50th percentile (Q2) is shown by a horizontal line within each box.Vertical lines (whiskers) from the top and bottom of each box represent the ranges Q3 + 1.5 × IQR and Q1 − 1.5 × IQR, respectively.Dots represent potential outliers as defined by values being outside the range Q1 − 1.5 × IQR to Q3 + 1.5 × IQR.These boxplots were generated using the R package ggplot and the geom geom_boxplot, which calculates whether each value falls into the ranges of the boxes, whiskers, or dots shown.
Trebes et al.: CHROMOSOME X AFFECTS CATTLE PRODUCTION TRAITS

Table 1 .
Trebes et al.: CHROMOSOME X AFFECTS CATTLE PRODUCTION TRAITS Most significant chromosome X (ChrX) and production peak SNP from the GWAS of all animals, for each phenotype with no fixed effects (base GWAS) 1

Table 2 .
Trebes, 2023): CHROMOSOME X AFFECTS CATTLE PRODUCTION TRAITS Linkage disequilibrium (LD) between the SNP of interest and the top SNP of each of the base GWAS, or with the most significant production peak SNP as indicated by an asterisk 1 This table is expanded and separated into LD between breeds in Supplemental TablesS4 and S5(https: / / data .mendeley.com/datasets/5yzy4p5zr5/1;Trebes, 2023).

Table 4 .
Trebes et al.: CHROMOSOME X AFFECTS CATTLE PRODUCTION TRAITS The proportion of genetic and total variance in a phenotype explained by each of the SNP of interest Chr: 18584126 C > G and ChrX: 18028733 A > G 1

Table 5 .
Trebes et al.: CHROMOSOME X AFFECTS CATTLE PRODUCTION TRAITS Analysis of variance and Tukey honest significant difference (Tukey HSD) P-values for phenotype difference between genotype combinations for the 2 SNP of interest 1