Department of Agronomy, Food, Natural Resources, Animals and Environment (DAFNAE), University of Padova, Viale dell' Università 16, 35020 Legnaro, Italy
Department of Animal Science, Food and Nutrition (DIANA) and Research Center Romeo and Enrica Invernizzi for Sustainable Dairy Production (CREI), Faculty of Agricultural, Food and Environmental Sciences, Università Cattolica del Sacro Cuore, 29122 Piacenza, Italy
In recent years, increasing attention has been focused on the genetic evaluation of protein fractions in cow milk with the aim of improving milk quality and technological characteristics. In this context, advances in high-throughput phenotyping by Fourier transform infrared (FTIR) spectroscopy offer the opportunity for large-scale, efficient measurement of novel traits that can be exploited in breeding programs as indicator traits. We took milk samples from 2,558 Holstein cows belonging to 38 herds in northern Italy, operating under different production systems. Fourier transform infrared spectra were collected on the same day as milk sampling and stored for subsequent analysis. Two sets of data (i.e., phenotypes and FTIR spectra) collected in 2 different years (2013 and 2019–2020) were compiled. The following traits were assessed using HPLC: true protein, major casein fractions [αS1-casein (CN), αS2-CN, β-CN, κ-CN, and glycosylated-κ-CN], and major whey proteins (β-lactoglobulin and α-lactalbumin), all of which were measured both in grams per liter (g/L) and proportion of total nitrogen (% N). The FTIR predictions were calculated using the gradient boosting machine technique and tested by 3 different cross-validation (CRV) methods. 
We used the following CRV scenarios: (1) random 10-fold, which randomly split the whole data set into 10 folds of equal size (9 folds for training and 1 fold for validation); (2) herd/date-out CRV, which assigned 80% of the herd/dates to the training set and the remaining, independent 20% of the herd/dates to the validation set; (3) forward/backward CRV, which split the data set into training and validation sets according to the year of milk sampling (FTIR and gold standard data assessed in 2013 or 2019–2020), using the “old” and “new” databases for training and validation, and vice versa, with independence between them; and (4) the CRV for genetic parameters (CRV-gen), in which animals without pedigree information were assigned to a fixed training population and animals with pedigree information were split into 5 folds, of which 1 fold was added to the fixed training population and 4 folds were assigned to the validation set (independent of the training set). The results (i.e., measures and predictions) of CRV-gen were used to infer the genetic parameters for gold standard laboratory measurements (i.e., proteins assessed with HPLC) and FTIR-based predictions from a bi-trait animal model using single-step genomic BLUP. We found that the prediction accuracies of the gradient boosting machine equations differed according to the way in which the proteins were expressed, with higher accuracy when expressed in g/L than when expressed as % N in all CRV scenarios. Concerning the reproducibility of the equations over the different years, the results showed no relevant differences in predictive ability between using “old” data as the training set and “new” data as the validation set, and vice versa. Comparing the additive genetic variance estimates for milk protein fractions between the FTIR predictions and the HPLC measures, we found reductions of −19.7% for milk protein fractions expressed in g/L, and −21.19% for those expressed as % N.
Although we found reductions in the heritability estimates, they were small, with values ranging from −1.9 to −7.25% for g/L, and −1.6 to −7.9% for % N. The posterior distributions of the additive genetic correlations (ra) between the FTIR predictions and the laboratory measurements were generally high (>0.8), even when the milk protein fractions were expressed as % N. Our results show the potential of using FTIR predictions in breeding programs as indicator traits for the selection of animals to enhance milk protein fraction contents. We expect acceptable responses to selection due to the high genetic correlations between HPLC measurements and FTIR predictions.
Milk protein composition is a key characteristic of milk that has an important biological effect on its quality and technological traits, such as milk coagulation and cheesemaking aptitude (
). However, assessing milk protein fractions at the individual level in dairy cattle is difficult and time-consuming due to high phenotyping costs, which are a deterrent to large-scale quantification.
From a technological point of view, advances in milk Fourier transform infrared (FTIR) spectroscopy for high-throughput phenotyping of dairy cattle allow the assessment of complex traits that are difficult and expensive to measure on a large scale. Milk FTIR spectra have been used for direct prediction of different phenotypes in milk, such as fat (
The effect of the number of observations used for Fourier transform infrared model calibration for bovine milk fat composition on the estimated genetic parameters of the predicted data.
Development of Fourier transform mid-infrared calibrations to predict acetone, β-hydroxybutyrate, and citrate contents in bovine milk through a European dairy network.
Genetic parameters of measures and population-wide infrared predictions of 92 traits describing the fine composition and technological properties of milk in Italian Simmental cattle.
Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data.
). Recently, increasing attention has been directed to the potential usefulness of milk FTIR spectroscopy for large-scale phenotyping, as the technique is cost-effective, fast, nondestructive, and able to phenotype a large number of animals (
Genetic analysis of rennet coagulation time, curd-firming rate, and curd firmness assessed over an extended testing period using mechanical and near-infrared instruments.
The main concern regarding the use of FTIR spectroscopy to predict milk composition is its predictive ability. However, appropriate statistical methods using rank-reduction and variable selection can be used to identify the relevant FTIR wavelengths and capture the nonlinear relationships between predictor variables and target traits (
Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data.
Mid-Infrared spectroscopy coupled with chemometrics: A tool for the analysis of intact food systems and the exploration of their molecular structure−Quality relationships—A review.
) and provide helpful support for farm managers to make decisions on several aspects of management of the farm. Further improvements in FTIR prediction ability have come from statistical approaches that better capture and describe the complex relationship between chemical bonds and milk components related to specific wavelengths (
Genetic parameters of differential somatic cell count, milk composition, and cheese-making traits measured and predicted using spectral data in Holstein cows.
have pointed out that differences in the spectrometers used to measure the FTIR spectra could result in prediction bias and less accurate predictions. Furthermore, prediction accuracy is also affected by time as a result of changes in the FTIR spectrometer parameters, such as light source intensity, detector sensitivity, and laser stability, although zero-set calibration and weekly calibration adjustments for milk components (i.e., fat, lactose, protein, and TS) can minimize these changes in the signal intensity over time (
The potential application of FTIR-predicted traits (i.e., indicator traits) for breeding purposes depends on their genetic correlations with measured traits. Several authors have reported high genetic correlations between gold standard measurements (i.e., measured by HPLC) and FTIR predictions for different traits, such as milk coagulation aptitude, fatty acid profiles, and other milk components (
Genetic parameters of cheese yield and curd nutrient recovery or whey loss traits predicted using Fourier-transform infrared spectroscopy of samples collected during milk recording on Holstein, Brown Swiss, and Simmental dairy cows.
Genetic parameters for milk protein composition predicted using mid-infrared spectroscopy in the French Montbéliarde, Normande, and Holstein dairy cattle breeds.
). Nevertheless, even moderate predictive ability provides valuable information for breeding programs, because the breeding value of a sire is based on a rather large amount of data on many relatives, which allows noise in the estimated breeding value to be corrected (
Production Comparison of Genetic Parameters Estimation of Fatty Acids from Gas Chromatography and FT-IR in Holsteins. Proc. 10th World Congress of Genetics Applied to Livestock.
The effect of the number of observations used for Fourier transform infrared model calibration for bovine milk fat composition on the estimated genetic parameters of the predicted data.
showed that the size of training set data strongly affects the FTIR predictive ability and the correlation between prediction and gold standard phenotype measurement. As a solution,
used pooled multibreed data to increase the training set size.
In this work, therefore, we investigated the potential use of FTIR predictions of milk protein fractions in Holstein cattle as indicator traits for breeding purposes. The specific aims were (1) to assess the predictive ability of gradient boosting machine (GBM) using random 10-fold and leave-one-batch-out CRV methods for true proteins (TP), specifically the casein fractions β-CN, κ-CN, αS1-CN, and αS2-CN, total casein (TCN), the whey proteins α-LA and β-LG, and total whey proteins (TWP), expressed as proportions of total nitrogen (% N) and contents in milk (g/L); (2) to measure FTIR predictive ability using calibration and validation databases collected in different years, thereby testing the reproducibility of GBM equations over time; and (3) to estimate the genetic parameters for FTIR predictions and milk protein fractions measured by the gold standard method (i.e., HPLC), based on bi-trait genomic model analysis.
MATERIALS AND METHODS
The animal procedures in this study were approved by the Organismo Preposto al Benessere Degli Animali (OPBA; Organization for Animal Welfare) of the Università Cattolica del Sacro Cuore (Piacenza, Italy) and by the Italian Ministry of Health (protocol number 510/2019-PR of 19/07/2019). The study was also carried out following the ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines.
Field Data
For this study, we integrated data from previous research projects. The data set was compiled by the LATSAN and BENELAT projects, whose aims are to develop new strategies and innovative tools to improve animal welfare and milk quality in dairy farming (
Genetic parameters of differential somatic cell count, milk composition, and cheese-making traits measured and predicted using spectral data in Holstein cows.
). The phenotypic information from the COWPLUS project was obtained from specialized (Holstein and Brown Swiss) and dual-purpose breeds (Alpine Grey, Rendena, and Simmental) belonging to 32 multibreed dairy farms (each with 2 or 5 breeds in the herd) located in the province of Trentino (northeastern Italy).
Milk samples were collected once during the evening milking from specialized dairy breeds, including Holstein (1,618 cows from 30 herds) and Brown Swiss (586 cows from 30 herds), and from the dual-purpose breeds Alpine Grey (80 cows from 14 herds), Rendena (116 cows from 9 herds), and Simmental (158 cows from 16 herds). The cows belonged to 38 herds managed under different dairy systems in northern Italy (Table 1). The cows were housed mostly in sand-bedded freestalls and fed twice a day on TMR based on corn and sorghum silage or hay (Emilia-Romagna and Trentino Region herds) supplemented with concentrates. The cows were sampled once after health checks; specifically, animals with clinical disease or on medical treatment were excluded from the study. Further details on the multibreed data set measured in 2013 are available in
Genetic parameters of differential somatic cell count, milk composition, and cheese-making traits measured and predicted using spectral data in Holstein cows.
Table 1. Schematic representation of the number of animals with phenotypic and Fourier transform infrared information across the herds explored in this study

| Trait | Unit | Herd 1: Lombardy | Herd 2: Emilia-Romagna | Herd 3: Emilia-Romagna | Herd 4: Veneto | Herd 5: Veneto | Herd 6: Veneto | Herds 7–38: Trentino | Total |
|---|---|---|---|---|---|---|---|---|---|
| TP | g/L | 22 | 70 | 927 | 133 | 17 | 67 | 1,174 | 2,410 |
| TP | % N | 21 | 69 | 917 | 129 | 17 | 67 | 1,168 | 2,388 |
| Casein | | | | | | | | | |
| αs1-CN | g/L | 20 | 70 | 927 | 131 | 17 | 67 | 1,169 | 2,401 |
| αs1-CN | % N | 19 | 69 | 921 | 128 | 17 | 66 | 1,170 | 2,390 |
| αs2-CN | g/L | 21 | 69 | 911 | 133 | 17 | 64 | 1,188 | 2,403 |
| αs2-CN | % N | 21 | 70 | 908 | 133 | 17 | 67 | 1,183 | 2,399 |
| β-CN | g/L | 21 | 70 | 920 | 133 | 17 | 67 | 1,183 | 2,411 |
| β-CN | % N | 20 | 69 | 905 | 130 | 17 | 66 | 1,188 | 2,395 |
| κ-CN | g/L | 21 | 69 | 921 | 133 | 17 | 65 | 1,177 | 2,403 |
| κ-CN | % N | 20 | 69 | 905 | 133 | 17 | 67 | 1,190 | 2,401 |
| Glyco-κ-CN | g/L | 20 | 69 | 911 | 133 | 17 | 65 | 1,154 | 2,369 |
| Glyco-κ-CN | % N | 20 | 69 | 908 | 133 | 17 | 67 | 1,166 | 2,380 |
| TCN | g/L | 22 | 70 | 924 | 132 | 17 | 66 | 1,172 | 2,403 |
| TCN | % N | 21 | 70 | 919 | 127 | 17 | 67 | 1,178 | 2,399 |
| Whey protein | | | | | | | | | |
| α-LA | g/L | 20 | 70 | 915 | 133 | 17 | 67 | 1,193 | 2,415 |
| α-LA | % N | 21 | 71 | 910 | 133 | 17 | 67 | 1,186 | 2,405 |
| β-LG | g/L | 22 | 70 | 929 | 133 | 17 | 67 | 1,164 | 2,402 |
| β-LG | % N | 20 | 68 | 918 | 131 | 17 | 66 | 987 | 2,207 |
| TWP | g/L | 22 | 70 | 923 | 133 | 17 | 67 | 1,168 | 2,400 |
| TWP | % N | 21 | 70 | 921 | 132 | 17 | 67 | 1,169 | 2,397 |

TP = true protein; glyco-κ-CN = glycosylated-κ-CN; TCN = total casein; TWP = total whey protein; g/L = protein fraction contents in grams per liter of milk; % N = protein fraction in percentage of nitrogen.
Milk samples were collected in 55 batches (i.e., herd/date combinations, where each cow was sampled once and each herd was sampled on a specific date): 32 batches in 2013 (1,197 cows), 17 in 2019 (1,011 cows), and 6 in 2020 (350 cows). The large herd (herd 3; Table 1) was sampled in 2019 in 14 batches (856 cows) and in 2020 in 2 batches (80 cows), considering an experimental design where each batch was included in the analysis of milk coagulation properties on fresh milk, and the laboratory could only process a maximum of roughly 65 milk samples per day, as described in
Genetic parameters of differential somatic cell count, milk composition, and cheese-making traits measured and predicted using spectral data in Holstein cows.
. All procedures and protocols were identical for both databases assessed in 2013 and 2019 to 2020. The individual milk samples were maintained at 4°C until laboratory analysis (within 24 h). Each sample was divided into the following 2 aliquots: bronopol preservative was added to 1 sample, which was then transferred to the laboratories of the Breeders' Association of the Veneto or of the Province of Trento for analyses of milk quality and composition, and the other sample, without preservative, was transported to the University of Padova (Legnaro, Padova, Italy) for analysis of milk protein fractions by validated reversed-phase HPLC (
Detection and quantification of αS1-, αS2-, β-, κ-casein, α-lactalbumin, β-lactoglobulin and lactoferrin in bovine milk by reverse-phase high- performance liquid chromatography.
The following milk protein traits were measured: true protein (TP), the major casein fractions αS1-CN, αS2-CN, κ-CN, and β-CN, total casein (TCN; the sum of all casein fractions), the major whey proteins β-LG and α-LA, and TWP (the sum of all whey protein fractions). The milk protein fraction traits were expressed as grams per liter of milk (g/L), calculated by multiplying each milk protein fraction determined by HPLC by the milk casein contents obtained by FTIR and as a percentage of the total milk nitrogen content (% N).
Milk FTIR spectra were recorded on 2,558 Holstein cows and analyzed with a MilkoScan FT6000 (Foss Electric); specifically, they consisted of the transmittance values measured at 1,060 wavenumbers ranging from 5,011 to 925 (cm−1). The 2 spectra obtained were averaged before the data analysis, expressed as an absorbance value [log(1/transmittance)], and standardized to mean zero and standard deviation equal to one. Principal component analysis of the FTIR spectral information was performed, and the Mahalanobis distance was calculated; particularly, animals were considered outliers when they exhibited a Mahalanobis distance based on FTIR information from the average spectral population greater than 3.5 standard deviations (Figure 1). The principal component analysis results pointed out a similarity between old and new FTIR files, indicating that no preprocessing was required to mitigate possible biases due to differences in baseline absorbance over time. After FTIR quality control, milk spectral data from 2,496 Holstein cows remained in the data set. Following phenotypic quality control of the milk protein fractions, observations outside the interval between 3 standard deviations below and above the mean of each batch were removed. After phenotypic quality control, 2,437 cows remained for the analysis; specifically, we had 1,197 cows sampled in 2013, 1,011 cows sampled in 2019, and 229 in 2020, all under similar conditions. A summary of the records for the milk protein fractions by the herd is shown in Table 1. The average (± SD) DIM was 188.26 ± 112.47, parity 2.3 ± 1.51, milk yield 29.30 ± 10.01 kg, and the number of cows per herd/date ranged from 17 to 930. Descriptive statistics for the milk protein fractions expressed in g/L and % N are summarized in Table 2; the boxplots for the milk protein fractions across herds are shown in Supplemental Figure S1 for g/L and Supplemental Figure S2 for % N (https://doi.org/10.6084/m9.figshare.21864596.v1;
Predicting milk protein fraction using infrared spectroscopy and a gradient boosting machine for breeding purposes in Holstein cattle. Figshare. Figure.
Figure 1. (A) Average Fourier transform infrared (FTIR) spectrum expressed as absorbance (the solid line represents the average, and the colored region represents the mean ± 3 SD) and (B) principal components (PC) of the FTIR spectral data of milk samples. Colors represent the years of FTIR assessment: old population (2013; n = 1,197 cows) and new population (2019 and 2020; n = 1,241 cows).
TP = true protein; glyco-κ-CN = glycosylated-κ-CN; TCN = total casein; TWP = total of whey protein; g/L = protein fraction contents in grams per liter of milk; % N = protein fraction in the percentage of nitrogen.
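The spectral and phenotypic preprocessing described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the logarithm is assumed to be base 10, sample standard deviations are assumed, and the principal component and Mahalanobis distance step is deliberately left out.

```python
import math
import statistics


def to_absorbance(transmittance):
    """Convert one spectrum from transmittance to absorbance, log10(1/T) (base 10 assumed)."""
    return [math.log10(1.0 / t) for t in transmittance]


def standardize_columns(spectra):
    """Scale each wavenumber (column) to mean 0 and SD 1 across samples (rows)."""
    columns = list(zip(*spectra))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in spectra]


def within_batch_filter(records, n_sd=3.0):
    """Keep (batch, value) records lying within mean +/- n_sd SD of their own batch."""
    by_batch = {}
    for batch, value in records:
        by_batch.setdefault(batch, []).append(value)
    kept = []
    for batch, value in records:
        values = by_batch[batch]
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        if abs(value - mean) <= n_sd * sd:
            kept.append((batch, value))
    return kept
```

Each batch needs at least 2 records for the sample standard deviation to be defined, and a constant column cannot be standardized; a real pipeline would have to handle both cases explicitly.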
The 1,067 cows whose phenotypic information was acquired in 2019 and 2020 were genotyped with the GeneSeek Genomic Profiler Bovine 100K SNP chip assay (Neogen). Non-autosomal regions were excluded before genotypic quality control. Autosomal markers with a minor allele frequency below 0.05, a significant deviation from Hardy–Weinberg equilibrium (P ≤ 10⁻⁵), or a call rate below 0.95 were removed. After quality control, 1,056 cows and 81,274 SNP markers remained in the data set.
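As a hedged illustration of these marker filters, the sketch below applies the three thresholds to a single SNP. The function name and the 0/1/2 genotype coding are assumptions, and the Hardy–Weinberg check uses a 1-df chi-square approximation, which may differ from the test actually used in the study.

```python
import math


def snp_passes_qc(genotypes, min_maf=0.05, min_call_rate=0.95, hwe_p_threshold=1e-5):
    """Return True if a SNP passes the call rate, MAF, and HWE filters.

    genotypes: allele counts coded 0/1/2, with None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    n = len(called)
    p = sum(called) / (2 * n)  # frequency of the counted allele
    maf = min(p, 1 - p)
    if call_rate < min_call_rate or maf < min_maf:
        return False
    # Chi-square goodness-of-fit test against Hardy-Weinberg proportions (1 df).
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [called.count(0), called.count(1), called.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df
    return p_value > hwe_p_threshold
```

Markers are kept only when all three conditions hold, which mirrors the removal rule stated in the text.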
CRV Scenarios
The FTIR prediction for each milk protein fraction was assessed using random 10-fold cross-validation (CRV) and 3 batch-independent CRV scenarios [i.e., herd/date-out, forward/backward (F/B), and 5-fold genetic parameters].
Random 10-Fold
In the random 10-fold CRV, the data set, which included an admixture of breeds, was split into 10 folds of equal size within each breed, with 9 folds used as the training set and the remaining fold as the validation set to assess model performance. This CRV scenario was replicated 10 times, and the average over the 10 replicates was used to determine the predictive ability of the model.
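A minimal sketch of a breed-stratified random 10-fold split is given below for illustration. The function names and the dictionary input are hypothetical, and the study's own implementation in R may have assigned animals to folds differently.

```python
import random


def stratified_k_fold(breed_of, k=10, seed=0):
    """Assign cows to k folds so that each breed is spread evenly across folds.

    breed_of: dict mapping a cow ID to its breed (hypothetical input format).
    """
    rng = random.Random(seed)
    by_breed = {}
    for cow, breed in breed_of.items():
        by_breed.setdefault(breed, []).append(cow)
    folds = [[] for _ in range(k)]
    for breed in sorted(by_breed):
        cows = sorted(by_breed[breed])
        rng.shuffle(cows)
        for i, cow in enumerate(cows):
            folds[i % k].append(cow)
    return folds


def train_validation_pairs(folds):
    """Yield (training, validation) lists, using each fold once for validation."""
    for i, validation in enumerate(folds):
        training = [cow for j, fold in enumerate(folds) if j != i for cow in fold]
        yield training, validation
```

Replicating the scenario 10 times would correspond to calling `stratified_k_fold` with 10 different seeds and averaging the resulting accuracy statistics.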
Herd/Date Out
In the herd/date-out CRV, which was based on the herd and date on which samples were collected, 80% of the population was randomly assigned to the training set (44 herd/dates) and the other 20% to the validation set (11 herd/dates). Given the variability in herd size, random sampling was carried out within groups to ensure greater homogeneity in the number of animals in the training and validation sets: batches were grouped into 5 similar groups based on the number of cows (ranging from 43 to 45 cows) and then divided into 80% for training and 20% for validation within each group. This CRV scenario was repeated 10 times, as for the random 10-fold CRV. The herd/dates in the training set (each included as an entire herd/date, which encompasses the production level) were independent of the herd/dates in the validation set.
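One possible reading of this grouped split is sketched below. It assumes that the 5 groups are formed by ranking batches on their number of cows, which the text does not state explicitly; the function name and input format are hypothetical, and the rounding used here will not necessarily reproduce the 44/11 allocation reported above.

```python
import random


def herd_date_out_split(batch_sizes, n_groups=5, train_fraction=0.8, seed=0):
    """Split whole herd/date batches into training and validation sets.

    batch_sizes: dict mapping a batch ID to its number of cows (hypothetical input).
    Batches are ranked by size, cut into n_groups strata, and split within each stratum.
    """
    rng = random.Random(seed)
    ordered = sorted(batch_sizes, key=lambda b: (batch_sizes[b], b))
    n = len(ordered)
    groups = [ordered[i * n // n_groups:(i + 1) * n // n_groups] for i in range(n_groups)]
    training, validation = [], []
    for group in groups:
        members = list(group)
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        training.extend(members[:n_train])
        validation.extend(members[n_train:])
    return training, validation
```

Because entire batches are moved together, no herd/date can contribute animals to both sets, which is the property this scenario is designed to guarantee.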
F/B
In this CRV scenario, we assessed the predictive ability of the GBM equations across the different years of sampling to test their reproducibility over time. The training and validation sets were subsets of animals classified according to the year in which the FTIR spectral data were recorded, and the herds in the “old” (2013; multibreed herds, 1,197 cows) and “new” (2019–2020; 1,240 Holstein cows) data sets were completely independent of each other. For the forward CRV, the “old” FTIR data were used as the training set and the “new” data as the validation set. For the backward CRV, the “new” FTIR data were used as the training set and the “old” data as the validation set.
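For illustration, the forward and backward splits can be expressed as a simple partition by sampling year. The function name and the dictionary input below are hypothetical.

```python
def forward_backward_splits(year_of):
    """Build the forward and backward (training, validation) pairs by sampling year.

    year_of: dict mapping a cow ID to its sampling year (hypothetical input format).
    """
    old = sorted(cow for cow, year in year_of.items() if year == 2013)
    new = sorted(cow for cow, year in year_of.items() if year in (2019, 2020))
    return {
        "forward": (old, new),   # train on 2013, validate on 2019-2020
        "backward": (new, old),  # train on 2019-2020, validate on 2013
    }
```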
CRV for Genetic Parameters
We used CRV for genetic parameters (CRV-gen) to assess the usefulness of FTIR predictions as a potential indicator trait for breeding purposes. In the first step, we assigned a fixed training population considering animals without pedigree information from the multibreed data set (138 Holstein cows, 537 Brown Swiss, 74 Alpine Grey, 101 Rendena, and 107 Simmental) to exploit all the available FTIR information efficiently. Next, the data set that considers animals with known pedigree and genomic information (i.e., the new FTIR data set sampled between 2019 and 2020; Supplemental Figure S3, https://doi.org/10.6084/m9.figshare.21864596.v1;
Predicting milk protein fraction using infrared spectroscopy and a gradient boosting machine for breeding purposes in Holstein cattle. Figshare. Figure.
), which encompasses 23 herd/dates, was used as the base to split the population randomly into 5 folds by herd/date. Each herd/date was kept entire within a fold, giving approximately 4 herd/dates in 2 of the folds and 5 herd/dates in the other 3. In each round, 1 fold was added to the fixed training population and the other 4 folds were assigned to the validation set (independent of the training set), to guarantee a large number of animals in the validation set. Thus, the training set comprised the fixed population (957 cows without pedigree information) plus 1 fold, and the validation set comprised 4 folds. Finally, we repeated the process 5 times, and the predictions obtained on each validation set (a total of 5 different validation sets) were used to estimate the genetic parameters using a bi-trait animal model for the FTIR predictions and the HPLC measurements of milk protein fractions.
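The CRV-gen design, in which training uses the fixed population plus 1 fold and validation uses the other 4, can be sketched as follows. This is an illustrative reading of the description; the function name and input format are hypothetical.

```python
import random


def crv_gen_splits(fixed_training, cows_by_batch, k=5, seed=0):
    """Yield k (training, validation) pairs of cow IDs for the CRV-gen design.

    fixed_training: cow IDs always used for training (animals without pedigree).
    cows_by_batch: dict mapping a herd/date ID to its cow IDs (animals with pedigree).
    """
    batches = sorted(cows_by_batch)
    random.Random(seed).shuffle(batches)
    folds = [batches[i::k] for i in range(k)]  # whole herd/dates stay together
    for i in range(k):
        training = list(fixed_training)
        validation = []
        for j, fold in enumerate(folds):
            target = training if j == i else validation
            for batch in fold:
                target.extend(cows_by_batch[batch])
        yield training, validation
```

With 23 herd/dates and 5 folds, this slicing yields 3 folds of 5 herd/dates and 2 folds of 4, matching the fold sizes given in the text.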
FTIR Calibration Equations
We selected the GBM statistical method because previous results indicated that this method achieved the highest accuracy of FTIR-based prediction of different phenotypic traits compared with the partial least squares (PLS;
Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data.
). We compared the GBM performance against the gold-standard method (PLS) for milk protein fractions in grams per liter (g/L) and percentage of nitrogen (% N) for the following CRV scenarios: 10-folds, herd/date-out, F/B (Supplemental Table S1, https://doi.org/10.6084/m9.figshare.21864596.v1;
Predicting milk protein fraction using infrared spectroscopy and a gradient boosting machine for breeding purposes in Holstein cattle. Figshare. Figure.
The milk protein fractions were predicted using the GBM, an ensemble learning approach that converts weak learners into strong learners through a sequential combination of different regression tree models, with bias and variance reduced by shrinkage and variable selection (
Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data.
). We implemented GBM with a random tuning of the 4 main hyperparameters [i.e., the number of trees, learning rate, maximum tree depth, and minimum samples per leaf;
]. We performed the random tuning by splitting the training subset into 5 folds to optimize the hyperparameters and increase model performance [i.e., higher accuracy and lower root mean square error (RMSE)]. The number of trees was searched in the range 100 to 5,000 in intervals of 100, the learning rate in the range 0.01 to 1 in intervals of 0.01, the maximum tree depth in the range 5 to 80 in intervals of 5, and the minimum samples per leaf in the range 10 to 100 in intervals of 10. The GBM model was built with a random search using the h2o.grid function in the R h2o package (https://cran.r-project.org/web/packages/h2o). The relative importance of each FTIR wavelength as a predictor was determined by calculating its relative influence on improvements in predictive ability during the regression tree building process, this being the sum of the squared improvements over all internal nodes of the tree for which the FTIR wavelength was chosen as the partitioning variable (
). The predictive ability of the GBM approach was assessed in the validation set of each CRV scenario using the Pearson correlation (r) between the observed and predicted phenotypes and the RMSE. The RMSE was calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{lab,i} - y_{pred,i}\right)^{2}},$$

where ylab is the measured phenotype, ypred is the predicted phenotype, and n is the number of observations in the validation set. We assessed model unbiasedness by the slope of the linear regression of the gold standard laboratory measurements on the predicted values in each CRV scenario for milk protein fractions.
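The three validation statistics named above (Pearson correlation, RMSE, and the slope of the regression of measured on predicted values) can be computed as in the following sketch. The function names are hypothetical, and the RMSE divides by the number of validation records, which is the usual definition and is assumed here.

```python
import math


def rmse(y_lab, y_pred):
    """Root mean square error between measured and predicted phenotypes."""
    n = len(y_lab)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_lab, y_pred)) / n)


def _centered_sums(x, y):
    """Return the sums of squares and cross-products of x and y around their means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy


def pearson_r(y_lab, y_pred):
    """Pearson correlation between measured and predicted phenotypes."""
    sxx, syy, sxy = _centered_sums(y_lab, y_pred)
    return sxy / math.sqrt(sxx * syy)


def slope_lab_on_pred(y_lab, y_pred):
    """Slope of the linear regression of measured values on predicted values."""
    s_pred, _, s_cross = _centered_sums(y_pred, y_lab)
    return s_cross / s_pred
```

A slope of 1 indicates no scale bias; in the example used for testing, predictions that are exactly twice the measured values give a correlation of 1 but a slope of 0.5.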
Genetic Parameters
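The genetic parameters for the gold-standard laboratory measurements (ylab) and the FTIR-based predictions from the CRV-gen scenario (ypred; i.e., 5 different validation sets) were estimated for milk protein fractions expressed in g/L and % N with a bi-trait animal model using single-step genomic BLUP, separately for each fold, as follows:

$$\begin{bmatrix} y_{lab} \\ y_{pred} \end{bmatrix} = \begin{bmatrix} X_{lab} & 0 \\ 0 & X_{pred} \end{bmatrix} \begin{bmatrix} b_{lab} \\ b_{pred} \end{bmatrix} + \begin{bmatrix} W_{lab} & 0 \\ 0 & W_{pred} \end{bmatrix} \begin{bmatrix} a_{lab} \\ a_{pred} \end{bmatrix} + \begin{bmatrix} e_{lab} \\ e_{pred} \end{bmatrix},$$

where ylab is the gold standard laboratory measurement and ypred is the FTIR-based prediction from the CRV-gen scenario for milk protein fractions; blab and bpred are the vectors of the fixed effects of DIM (6 classes: class 1, less than 60 d; class 2, 60–120 d; class 3, 121–180 d; class 4, 181–240 d; class 5, 241–300 d; class 6, more than 300 d), parity (4 classes: 1, 2, 3, ≥4), and herd-date for gold standard laboratory measurement and FTIR-based prediction, respectively; alab and apred are the vectors of additive genetic effects for gold standard laboratory measurement and FTIR-based prediction, respectively; Xlab, Xpred, Wlab, and Wpred are the incidence matrices relating ylab and ypred to the fixed effects (blab and bpred) and the additive effects (alab and apred), respectively; and elab and epred are the residual effects for gold standard laboratory measurement and FTIR-based prediction, respectively. The single-step genomic BLUP model was fitted under the following assumptions for the random effects:

$$\begin{bmatrix} a_{lab} \\ a_{pred} \end{bmatrix} \sim N\left(0,\ \begin{bmatrix} \sigma^{2}_{a_{lab}} & \sigma_{a_{lab},a_{pred}} \\ \sigma_{a_{lab},a_{pred}} & \sigma^{2}_{a_{pred}} \end{bmatrix} \otimes H\right)$$

and

$$\begin{bmatrix} e_{lab} \\ e_{pred} \end{bmatrix} \sim N\left(0,\ R \otimes I\right),$$

where $\sigma^{2}_{a_{lab}}$, $\sigma^{2}_{a_{pred}}$, and $\sigma_{a_{lab},a_{pred}}$ are the genetic variances in the gold standard measurements, the FTIR-based predictions, and the covariance between them, respectively; R is the residual (co)variance matrix

$$R = \begin{bmatrix} \sigma^{2}_{e_{lab}} & \sigma_{e_{lab},e_{pred}} \\ \sigma_{e_{lab},e_{pred}} & \sigma^{2}_{e_{pred}} \end{bmatrix},$$

where $\sigma^{2}_{e_{lab}}$, $\sigma^{2}_{e_{pred}}$, and $\sigma_{e_{lab},e_{pred}}$ are the residual variances in the gold standard measurements, the FTIR-based predictions, and the covariance between them, respectively; I is the identity matrix, and the symbol ⊗ represents the Kronecker product. H is a matrix that combines pedigree and genomic information, with inverse

$$H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix},$$

where A−1 is the inverse of the pedigree relationship matrix, $A_{22}^{-1}$ is the inverse of the pedigree relationship matrix for genotyped animals, and G−1 is the inverse of the genomic relationship matrix obtained according to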
. We assumed a flat prior distribution for the fixed effects and used an inverse Wishart distribution as a prior for the random effects.
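The class definitions for the fixed effects of DIM and parity given above translate directly into code. The sketch below is illustrative only; the function names are hypothetical.

```python
def dim_class(dim):
    """Map days in milk to the 6 classes used as a fixed effect."""
    if dim < 60:
        return 1
    if dim <= 120:
        return 2
    if dim <= 180:
        return 3
    if dim <= 240:
        return 4
    if dim <= 300:
        return 5
    return 6


def parity_class(parity):
    """Map parity to 4 classes: 1, 2, 3, and 4 or more."""
    return min(parity, 4)
```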
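Heritability (h2) was calculated based on the posterior (co)variance estimates for each trait as

$$h^{2} = \frac{\sigma^{2}_{a}}{\sigma^{2}_{a} + \sigma^{2}_{e}},$$

where $\sigma^{2}_{a}$ is the additive genetic variance, and $\sigma^{2}_{e}$ is the residual variance for the gold standard measurements (ylab) or FTIR-based predictions (ypred) of milk protein fractions expressed in % N and g/L. Genetic (rg) and phenotypic (rp) correlation estimates were calculated as follows:

$$r_{g} = \frac{\sigma_{a_{lab},a_{pred}}}{\sqrt{\sigma^{2}_{a_{lab}}\,\sigma^{2}_{a_{pred}}}} \quad \text{and} \quad r_{p} = \frac{\sigma_{p_{lab},p_{pred}}}{\sqrt{\sigma^{2}_{p_{lab}}\,\sigma^{2}_{p_{pred}}}},$$

where $\sigma^{2}_{p_{lab}}$ and $\sigma^{2}_{p_{pred}}$ denote the phenotypic variances, each calculated as the sum of $\sigma^{2}_{a}$ and $\sigma^{2}_{e}$ for the corresponding trait, and $\sigma_{p_{lab},p_{pred}}$ is the phenotypic covariance between traits, calculated as the sum of the additive genetic and residual covariances between the gold standard measurements (ylab) and FTIR-based predictions (ypred).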
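A short sketch of these calculations from a single set of (co)variance components is given below. The function names are hypothetical; in a Bayesian analysis the same functions would typically be applied to each retained posterior sample and then summarized.

```python
import math


def heritability(var_a, var_e):
    """Additive genetic variance divided by phenotypic variance (var_a + var_e)."""
    return var_a / (var_a + var_e)


def genetic_correlation(cov_a, var_a_lab, var_a_pred):
    """Additive genetic correlation between the measured and predicted trait."""
    return cov_a / math.sqrt(var_a_lab * var_a_pred)


def phenotypic_correlation(cov_a, cov_e, var_a_lab, var_e_lab, var_a_pred, var_e_pred):
    """Phenotypic correlation, with phenotypic (co)variances as genetic plus residual parts."""
    cov_p = cov_a + cov_e
    var_p_lab = var_a_lab + var_e_lab
    var_p_pred = var_a_pred + var_e_pred
    return cov_p / math.sqrt(var_p_lab * var_p_pred)
```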
The model was implemented in the R package BGLR 1.0.9 (
). The genetic parameters were obtained from the posterior distribution using the Markov Chain Monte Carlo method via Gibbs sampling. We ran a single chain of 500,000 cycles, with a burn-in of the first 50,000 iterations, with samples stored every 10 cycles. Hence, the posterior means were obtained from 45,000 samples, and the analysis converged through visual inspection using the Bayesian output analysis (
Bernado J.M. Berger J.O. Smith A.P. Dawid A.F.M. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. 4th ed. Clarendon Press, 1992
), the convergence was attained for the evaluated traits with a P-value > 0.15.
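The number of retained samples follows from the chain length, burn-in, and thinning interval stated above: (500,000 − 50,000) / 10 = 45,000. The sketch below shows this arithmetic and a posterior mean computed after burn-in and thinning; the function names are hypothetical and no claim is made about how BGLR stores its samples.

```python
def n_retained(n_iter, burn_in, thin):
    """Number of samples kept after discarding burn-in and thinning."""
    return len(range(burn_in, n_iter, thin))


def posterior_mean(chain, burn_in, thin):
    """Mean of the samples that remain after burn-in and thinning."""
    kept = chain[burn_in::thin]
    return sum(kept) / len(kept)
```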
RESULTS
FTIR Prediction
The fitting statistics of the GBM model for milk protein fractions using the different CRV strategies are shown in Table 3 (g/L) and Table 4 (% N). Predictive ability was lower when the milk protein fractions were expressed in % N than in g/L (Supplemental Figure S4, https://doi.org/10.6084/m9.figshare.21864596.v1). These reductions in prediction accuracy ranged from −27 to −51% using the 10-fold CRV scenario [except total casein (TCN), which increased by 3.61%], −2.47 to −51% using herd/date-out, −7.59 to 52.31% for F/B, and −1.2 to −45% using CRV-gen (Supplemental Figure S4). On the other hand, comparing the average RMSE, we observed a mean reduction of 36% (range −1 to −88%), whereas RMSE increased by 23% for αS2-CN, 28% for glycosylated (glyco)-κ-CN, 19% for α-LA, 26% for β-LG, and 20% for TWP (Table 3, Table 4).
Predictive ability for milk protein fractions expressed in g/L ranged from 0.63 ± 0.069 to 0.88 ± 0.020 with random 10-fold CRV, 0.62 ± 0.063 to 0.83 ± 0.077 with herd/date-out, 0.60 ± 0.063 to 0.78 ± 0.053 with F/B, and 0.62 ± 0.067 to 0.87 ± 0.031 with CRV-gen (Table 3). True protein and TCN showed better prediction accuracy than α-LA, which had the lowest predictive ability across all the CRV scenarios evaluated. With F/B, the predictive ability of the models based on FTIR milk spectra was lower than that obtained with the random 10-fold CRV scenario, with reductions ranging from −4.8% for α-LA to −19.8% for β-LG (Table 3). Prediction bias, assessed as the regression slope of the gold standard measures of milk protein fractions on the FTIR-predicted values, differed greatly across the CRV scenarios. The slope values obtained with the F/B CRV scenario were generally greater than 1, with deviations ranging from 10% for glyco-κ-CN to 27% for β-CN, and a decrease of 8% for TP. The slope estimates obtained with the random 10-fold and 5-fold genetic CRV scenarios were less biased than those obtained with the herd/date-out and F/B CRV scenarios. These results agree with the assessment of model fit by RMSE, which showed that the random 10-fold CRV and CRV-gen scenarios led to lower prediction errors than herd/date-out and F/B, with reductions in RMSE ranging from 4% for κ-CN to 55% for α-LA, and from 8% for β-CN to 73% for α-LA, respectively (Table 3).
The predictive ability of milk protein fractions expressed in % N ranged from 0.34 ± 0.034 to 0.86 ± 0.023 with random 10-fold CRV, 0.33 ± 0.094 to 0.79 ± 0.071 with herd/date-out, 0.31 ± 0.024 to 0.73 ± 0.084 with F/B, and 0.36 ± 0.036 to 0.81 ± 0.022 with CRV-gen (Table 4). The best predictive abilities were obtained for TCN (R2 = 0.73–0.86) and TP (R2 = 0.60–0.64), and the lowest for β-CN (R2 = 0.31–0.36) across all the CRV scenarios evaluated (Table 4). With the F/B CRV scenario, the predictive ability of FTIR predictions was lower than that of the random 10-fold CRV scenario, with reductions ranging from −6.3% for TP to −15.79% for αS1-CN (Table 4). CRV-gen also exhibited lower predictive ability than random 10-fold CRV, with reductions ranging from −1.56% for TP to −8.33% for glyco-κ-CN; however, β-CN and TWP showed an increased predictive ability of 4.7 and 2.63%, respectively. Inflation, estimated as the regression slope of the measured milk protein fractions on the FTIR-predicted values, varied only slightly between random 10-fold CRV and CRV-gen, with slope values ranging from 0.97 ± 0.038 for TWP to 1.12 ± 0.011 for αS2-CN, and from 0.97 ± 0.024 for β-CN to 1.12 ± 0.01 for glyco-κ-CN, respectively (Table 4). In contrast, the slopes of the F/B CRV scenario showed a tendency toward biased predictions, with values ranging from 1.08 ± 0.034 for TP to 1.26 ± 0.060 for TWP. Overall, FTIR prediction using the herd/date-out and F/B CRV scenarios produced more biased predictions than random 10-fold CRV and 5-fold CRV-gen.
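The fitting statistics discussed above (R2, the regression slope of measured on predicted values, and RMSE) can be illustrated with a few lines of Python. This is a minimal sketch on made-up numbers, and the function names are hypothetical; it shows only how a slope above 1 arises when predictions are less variable than the measurements.

```python
import statistics


def slope_and_r2(measured, predicted):
    """Slope of measured on predicted values and squared Pearson correlation."""
    mean_y = statistics.fmean(measured)
    mean_x = statistics.fmean(predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, measured))
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    syy = sum((y - mean_y) ** 2 for y in measured)
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def rmse(measured, predicted):
    """Root mean squared error of prediction."""
    n = len(measured)
    return (sum((y - x) ** 2 for x, y in zip(predicted, measured)) / n) ** 0.5


# Made-up values for illustration only (not data from this study).
predicted = [3.0, 3.5, 4.0, 4.5, 5.0]
measured = [2.6, 3.6, 3.9, 4.7, 5.2]

slope, r2 = slope_and_r2(measured, predicted)
error = rmse(measured, predicted)
print(f"slope={slope:.2f} R2={r2:.3f} RMSE={error:.3f}")
```

Here the slope exceeds 1 because the predicted values span a narrower range than the measured ones, which is the pattern interpreted as biased prediction in the F/B scenario.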
Associations Between FTIR Wavelength Absorbance and Milk Protein Fractions
Overall, the milk protein fractions fell in the same FTIR wavelength regions, whether expressed in g/L (Figure 2) or % N (Figure 3). For milk protein fractions expressed in g/L, 3 main regions were found to explain more than 0.5% of importance in the GBM approach (Figure 2). The number of significant individual FTIR wavelengths ranged from 37 for glyco-κ-CN to 68 for αS2-CN (Supplemental Figure S5, https://doi.org/10.6084/m9.figshare.21864596.v1), covering 3 main regions (3,200–2,900 cm−1, 1,750–1,500 cm−1, and 1,250–950 cm−1). Twenty-nine wavelengths were shared by at least 5 milk proteins expressed in g/L, including 1,680 cm−1 (9 traits), 1,727, 2,972, 2,975, and 3,149 cm−1 (8 traits), 1,677 and 2,979 cm−1 (7 traits), 1,619, 1,653, 1,684, 2,968, 2,983, 3,018, 3,022, 3,184, and 3,191 cm−1 (6 traits), and 1,006, 1,561, 1,603, 1,615, 1,646, 1,715, 2,918, 2,987, 2,991, 2,995, 3,029, 3,122, and 3,241 cm−1 (5 traits). These wavelengths contributed between 0.61 and 2.21% of predictive ability in the GBM approach. Consistent with these results, the main milk FTIR wavelength regions were highly correlated with the target milk protein traits expressed in g/L, with values ranging from −0.25 to −0.97 and from 0.23 to 0.99 (r; Supplemental Figure S6, https://doi.org/10.6084/m9.figshare.21864596.v1).
Figure 2. Variable importance for single-wavelength absorbance associations across the entire Fourier transform infrared spectrum (1,060 wavelengths) for milk protein fractions [true protein (TP); major casein fractions: αS1-CN, αS2-CN, κ-CN, glycosylated-κ-CN (glyco-κ-CN), β-CN; total casein (TCN); major whey proteins: β-LG and α-LA; and total whey protein (TWP)], expressed in grams per liter of milk.
Figure 3. Variable importance for single-wavelength absorbance associations across the entire Fourier transform infrared spectrum (1,060 wavelengths) for milk protein fractions [true protein (TP); major casein fractions: αS1-CN, αS2-CN, κ-CN, glycosylated-κ-CN (glyco-κ-CN), β-CN; total casein (TCN); major whey proteins: β-LG and α-LA; and total whey protein (TWP)], expressed as the percentage of the total milk nitrogen content (% N).
For milk proteins expressed in % N, 4 main regions (for TP, αS2-CN, κ-CN, glyco-κ-CN, β-LG, and TWP) or 5 main regions (for αS1-CN, β-CN, TCN, and α-LA) were found to explain more than 0.5% of importance in the GBM approach (Figure 3). The number of significant individual FTIR wavelengths ranged from 44 for β-CN to 63 for αS1-CN, κ-CN, and TCN (Supplemental Figure S5), covering the following regions: 4,900 to 4,650 cm−1, 3,600 to 3,350 cm−1, 3,200 to 2,900 cm−1, 2,550 to 2,400 cm−1, 1,750 to 1,500 cm−1, and 1,250 to 950 cm−1. These regions are related to overtones and combinations of the vibrations of some chemical bonds, such as C–O symmetric stretching, C=O stretching, C–H, N–H, O–H, and S–H. Some peaks in these regions exhibited moderate to strong associations with milk protein fractions expressed as % N (Figure 3). The major wavelength shared by at least 6 milk proteins was 1,603 cm−1 [variable importance (VI) > 0.90%], which was shared by TP, αS1-CN, αS2-CN, TCN, β-LG, and TWP. The wavelengths 3,245 cm−1 (VI 0.71% to 1.66%) and 1,688 cm−1 (VI 0.60% to 3.03%) were each shared by 6 milk protein fractions: TP, αS1-CN, αS2-CN, β-CN, TCN, and β-LG in the former case, and β-CN, κ-CN, glyco-κ-CN, α-LA, β-LG, and TWP in the latter case. Fourteen wavelengths (i.e., 3,041, 1,611, 971, 3,091, 3,234, 1,607, 3,207, 3,049, 1,646, 1,665, 1,580, 3,026, 3,211, and 3,029 cm−1) were shared by a group of 5 milk proteins (Supplemental Figure S5B) and explained 0.61 to 2.70% of the predictive ability of the GBM approach. The major milk FTIR wavelength regions were highly correlated with the target milk protein fractions expressed as % N, with Pearson correlations ranging from −0.18 to −0.98 and from 0.17 to 0.99 (Supplemental Figure S7, https://doi.org/10.6084/m9.figshare.21864596.v1).
Genetic Parameters of Laboratory-Measured and FTIR-Predicted Milk Protein Fractions
Table 5 reports the genetic parameter estimates for laboratory-measured and FTIR-predicted milk protein fractions expressed in g/L. Heritability estimates were either moderate (TCN, TP, glyco-κ-CN, αS2-CN, and α-LA) or high (β-CN, αS1-CN, β-LG, κ-CN, and TWP). Heritability estimates for the FTIR-based predictions were slightly lower than those obtained for the laboratory measurements (Table 5), with reductions ranging from −1.93% for β-CN to −7.25% for α-LA, meaning that FTIR-based predictions effectively capture the variability in milk protein fractions (Figure 4A). On the other hand, the additive genetic, residual, and phenotypic variances for the FTIR-based predictions were considerably lower than for the laboratory measurements; specifically, between −6.62% (αS1-CN) and −33.33% (αS2-CN) for genetic variance, between −1.25% (αS1-CN) and −29.17% (αS2-CN) for residual variance, and between −3.01% (αS1-CN) and −30.47% (αS2-CN) for phenotypic variance (Figure 4A).
Table 5. Average and, in parentheses, range of the SD of the genetic parameter estimates across the 5 folds of the cross-validation for genetic parameters, for milk protein fractions expressed as grams per liter, for the gold-standard measurement (laboratory) and the Fourier transform infrared prediction
Figure 4. Relative difference for genetic ($\sigma^2_a$), residual ($\sigma^2_e$), and phenotypic ($\sigma^2_p$) variance estimates and heritability (h2) for Fourier transform infrared (FTIR) prediction and gold standard measurement of milk proteins expressed as g/L (A) and % N (B). The relative difference (%) was calculated as $[(\mathrm{genpar}_{pred} - \mathrm{genpar}_{meas})/\mathrm{genpar}_{meas}] \times 100$, where $\mathrm{genpar}_{pred}$ and $\mathrm{genpar}_{meas}$ are the genetic parameters ($\sigma^2_a$, $\sigma^2_e$, $\sigma^2_p$, and h2) for FTIR-predicted and measured milk protein fractions, respectively. TP = true protein; glyco-κ-CN = glycosylated-κ-CN; TCN = total casein; TWP = total whey protein.
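The relative difference defined in the Figure 4 caption is simple to reproduce. The short Python sketch below applies the formula to a pair of hypothetical heritability values (not estimates from this study); the function name is an assumption made for the example.

```python
def relative_difference(pred, meas):
    """Relative difference (%) = [(pred - meas) / meas] x 100."""
    return (pred - meas) / meas * 100.0


# Hypothetical heritability estimates, for illustration only.
h2_meas, h2_pred = 0.50, 0.47
diff = relative_difference(h2_pred, h2_meas)
print(f"{diff:.2f}%")  # -6.00%
```

Because variances and heritabilities cannot be negative, this relative difference is bounded below by −100%, which is a useful sanity check when reading the values in Figure 4.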
For the milk protein fractions expressed as % N, the heritability estimates observed for α-LA (h2 = 0.266), αS2-CN (h2 = 0.293), and TCN (h2 = 0.375) were moderate, whereas those observed for TP, αS1-CN, β-CN, κ-CN, glyco-κ-CN, β-LG, and TWP were high, with values ranging from 0.434 to 0.798 (Table 6). The heritability estimates for the FTIR predictions displayed the same trend as the laboratory measurements, although they were slightly lower (Figure 4B). The differences were smaller for αS2-CN (−1.68%), glyco-κ-CN (−1.77%), and β-LG (−1.87%), and larger for αS1-CN (−6.63%) and TWP (−7.86%; Figure 4B). However, we observed considerably smaller additive genetic, residual, and phenotypic variances in the FTIR-based predictions compared with the laboratory measurements (Figure 4B). The differences ranged from −7.72% (TP) to −41.83% (αS2-CN) for genetic variance, from −1.75% (TWP) to −40.48% (αS2-CN) for residual variance, and from −5.07% (TP) to −40.88% (αS2-CN) for phenotypic variance (Figure 4B).
Table 6. Average and, in parentheses, range of the SD of the genetic parameter estimates across the 5 folds of the cross-validation for genetic parameters, for milk protein fractions expressed as a percentage of nitrogen, for the gold-standard measurement (laboratory) and the Fourier transform infrared prediction
Genetic and Phenotypic Correlations Between Laboratory Measurements and FTIR Predictions
The estimated posterior densities of the genetic and phenotypic correlations between the laboratory measurements and FTIR-based predictions of milk proteins are reported in Figure 5 (g/L) and Figure 6 (% N). For the protein fractions expressed in g/L, the averages of the posterior genetic correlations were high, ranging from 0.88 ± 0.033 for α-LA to 0.98 ± 0.005 for TP. The phenotypic correlations were lower, with values ranging between 0.64 ± 0.034 for α-LA and 0.86 ± 0.012 for TP (Figure 5). For the genetic correlations, the posterior densities were skewed and their shape was similar across subsets of the data, whereas slightly different densities were observed for the phenotypic correlations (Figure 5). The genetic correlations were also high for the protein fractions expressed as % N, ranging from 0.87 ± 0.017 for α-LA to 0.97 ± 0.013 for TP (Figure 6). However, the phenotypic correlations were lower than 0.80, except for TP (0.81 ± 0.0212) and TCN (0.87 ± 0.0153; Figure 6). The posterior distributions of the genetic correlations were skewed with small differences across the subsets, whereas large differences were observed in the posterior distributions of the phenotypic correlations (Figure 6).
Figure 5. Posterior distribution of genetic and phenotypic correlation between gold standard and Fourier transform infrared (FTIR) prediction using the cross-validation for genetic parameter scenarios for milk protein fraction expressed in grams per liter of milk. TP = true protein; glyco-κ-CN = glycosylated-κ-CN; TCN = total casein; TWP = total whey protein.
Figure 6. Posterior distribution of genetic and phenotypic correlation between gold standard and Fourier transform infrared (FTIR) prediction for milk protein fraction expressed as the percentage of the total milk nitrogen content (% N). TP = true protein; glyco-κ-CN = glycosylated-κ-CN; TCN = total casein; TWP = total whey protein.
DISCUSSION
The magnitude of the predictive ability of FTIR determines its effectiveness for farm management and breeding purposes. Traditionally, FTIR predictive ability is assessed by phenotyping a small number of animals using gold-standard measurements, which results in under-optimistic evaluations of complex phenotypes. A previous study evaluated CRV performed on a training set comprising specialized and dual-purpose breeds to ensure a sufficiently large population size and observed improvements in predictive ability over a single-breed population. In the present study, we evaluated different CRV scenarios and a multibreed population (specialized and dual-purpose breeds) to evaluate the prediction performance of the model for milk protein fractions. Comparing the performances of the different CRV scenarios, larger reductions in predictive ability relative to random 10-fold CRV were observed as the independence between the training and validation sets increased. These reductions in the coefficient of determination (R2) were around −9.01% (−4.76 to −19.75%) and −10.97% (−6.25 to −15.79%) for F/B, −4.92% (−1.39 to −12.66%) and −6.55% (−2.94 to −10.42%) for herd/date-out, and −2.80% (−1.14 to −6.17%) and −1.54% (−8.33 to 4.71%) for CRV-gen, for milk protein fractions expressed in g/L (Table 3) and % N (Table 4), respectively.
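The role of independence between training and validation sets can be made concrete with a grouped split. The Python sketch below assumes that a herd/date-out scenario holds out whole herd/date groups, so that no group contributes records to both sets; the function name, the toy herd labels, and the balancing rule are hypothetical and are not taken from the procedure used in this study.

```python
from collections import defaultdict


def group_folds(groups, n_folds):
    """Assign whole groups (e.g., herd/date) to folds, so that no group is
    split between training and validation."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Largest groups first, each placed in the currently smallest fold.
    for g in sorted(by_group, key=lambda k: (-len(by_group[k]), k)):
        min(folds, key=len).extend(by_group[g])
    return folds


# Toy herd labels for 10 records (illustration only).
herds = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "E"]
folds = group_folds(herds, n_folds=3)

for k, val_idx in enumerate(folds):
    held_out = set(val_idx)
    train_idx = [i for i in range(len(herds)) if i not in held_out]
    val_herds = sorted({herds[i] for i in val_idx})
    print(f"fold {k}: validation herds {val_herds}, training size {len(train_idx)}")
```

With a random split, records from the same herd and sampling date can fall on both sides, which tends to make the validation look easier than prediction for an entirely new herd.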
Overall, milk protein fractions expressed in g/L had higher predictive ability across the CRV scenarios. However, we observed a greater reduction in R2 with the F/B scenario compared with a random 10-fold CRV, from −19.8 to −4.8% in g/L and from −15.8 to −6.3% as % N, which could be ascribed to differences in FTIR acquisition over time (i.e., oldest, 2013 vs. newest, 2019–2020), this being an extreme case of independence between the training and validation sets. In addition, FTIR measurements taken over time can present variations in the interferometer signal, leading to changes in the vibrational bands through altered shapes, intensities, and relative intensities. A previous study evaluated the day-to-day variation in FTIR spectra and observed a significant effect on accuracy; the authors used variance-simultaneous component analysis to monitor spectral variation, which allowed them to correct shifts in peak intensity or band shape that would otherwise reduce predictive ability. Pretreatments for spectral noise reduction are very common and often important for obtaining robust predictive models, mainly the Savitzky–Golay smoothing algorithm, which attenuates high-frequency signals coming from noise while tending to retain important chemical signals. Principal component analysis of the milk FTIR spectra is useful for detecting possible differences in spectral values over time, which can then be removed across files using a noise reduction strategy. When the principal component analysis indicates dissimilarity among FTIR data, samples separate into potential clusters, and the associated differences in baseline absorbance can bias the predictive model. However, we found no significant differences between the old and new FTIR data sets (P > 0.05); these differences therefore did not contribute to biased or lower FTIR predictions.
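To illustrate the kind of pretreatment mentioned above, the Python sketch below applies a 5-point quadratic Savitzky–Golay filter, whose standard convolution weights are (−3, 12, 17, 12, −3)/35. The window size, the handling of the end points (left unsmoothed), and the toy signals are assumptions made for the example; this is not the pretreatment applied to the spectra in this study.

```python
# Standard 5-point quadratic Savitzky-Golay weights (to be divided by 35).
SG_5_QUADRATIC = (-3, 12, 17, 12, -3)


def savgol5(y):
    """5-point quadratic Savitzky-Golay smoothing.

    Assumption: the first and last two points are returned unsmoothed.
    """
    out = list(y)
    for i in range(2, len(y) - 2):
        window = y[i - 2 : i + 3]
        out[i] = sum(c * v for c, v in zip(SG_5_QUADRATIC, window)) / 35
    return out


# A smooth quadratic band shape is left unchanged by the filter ...
quadratic = [float(x * x) for x in range(7)]
smoothed_quadratic = savgol5(quadratic)

# ... whereas an isolated single-point spike (noise) is attenuated.
spike = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
smoothed_spike = savgol5(spike)

print(smoothed_quadratic)
print([round(v, 3) for v in smoothed_spike])
```

The two toy signals show the design intent of the filter: local polynomial shapes, such as absorbance bands, pass through, whereas point-to-point noise is reduced.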
The FTIR-based predictive abilities for milk proteins expressed as % N ranged from 0.38 to 0.86 for random 10-fold CRV, 0.35 to 0.79 for herd/date-out, 0.32 to 0.73 for F/B, and 0.38 to 0.81 for CRV-gen; these R2 values were higher than those previously obtained using different statistical approaches, which were in the range 0.14 to 0.82 (Effectiveness of mid-infrared spectroscopy for the prediction of detailed protein composition and contents of protein genetic variants of individual milk of Simmental cows). Higher predictive abilities were obtained for TP, TCN, and TWP in both g/L and % N compared with the other traits, which might be due to their higher concentrations in milk (Table 3). The lower predictive ability observed for milk protein fractions expressed as % N compared with g/L agrees with the results of the same study. This suggests that FTIR information captures different biological information related to variation in milk protein fractions depending on how these traits are expressed.
For practical application in breeding, the usefulness of FTIR predictions as potential indicator traits for genetic evaluation rests on obtaining FTIR-predicted values and gold standard measurements from a large number of samples. The CRV-gen scenario we devised to make FTIR predictions for genetic analyses was performed on large training (n = 1,253 cows) and validation sets (n = 1,184 cows) and gave predictive ability values ranging from moderate to high (Table 3, Table 4). Similarly, the study "The effect of the number of observations used for Fourier transform infrared model calibration for bovine milk fat composition on the estimated genetic parameters of the predicted data" demonstrated that assembling large reference populations makes it possible to improve the accuracy of FTIR-based predictions intended for estimating genetic parameters. We obtained moderate to high FTIR-based predictive abilities (R2) for milk protein fractions expressed in both g/L and % N, indicating their potential use for breeding purposes. In addition, the study "Production Comparison of Genetic Parameters Estimation of Fatty Acids from Gas Chromatography and FT-IR in Holsteins" (Proc. 10th World Congress of Genetics Applied to Livestock Production) observed that moderate predictive ability also provides valuable information for breeding programs. In this case, when FTIR predictive ability is moderate, basing the breeding value of a bull on information from many progeny allows prediction noise to be corrected. The greater predictive ability obtained here may be due to the GBM selecting the regions of the milk spectra that capture greater variability in milk chemical composition (Figure 2, Figure 3) and to its flexibility in mapping the complex associations between predictors and target phenotypes (Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data). That study, comparing machine learning and penalized regression against PLS regression, observed a superior ability of GBM to predict difficult-to-measure milk traits. A similar pattern was found in the present study, in which GBM was superior to PLS, with R2 increasing by 2 to 49% for protein fractions (Supplemental Table S1, https://doi.org/10.6084/m9.figshare.21864596.v1).
Associations Between Milk FTIR Wavelength Absorbances and Milk Protein Fractions
The FTIR wavelength absorbance is characterized by the effect of electromagnetic radiation, which is correlated with the stretching and bending vibrations of specific chemical bonds within a molecule (
Mid-Infrared spectroscopy coupled with chemometrics: A tool for the analysis of intact food systems and the exploration of their molecular structure−Quality relationships—A review.
). The number of spectral regions associated with milk protein fractions varies according to how the proteins are expressed: 3 main regions for g/L and 4 or 5 for % N. This suggests that the biological background of milk protein fractions expressed as % N is more complex and requires more wavelength regions for its prediction. In addition, milk protein fractions expressed in g/L and % N shared 3 wavelength regions (3,200–2,900 cm−1, 1,750–1,500 cm−1, and 1,250–950 cm−1), which are related to the fingerprint region (C–O, C–C, C=C, C–H, N–O, C–N, C=CH2, O–H, amide II, and amide III bands), corresponding to common chemical bonds present in milk components such as fat, protein, lactose, carbohydrates, and organic acids. In particular, casein profiles are expected to be associated with absorption peaks at the wavenumbers 1,250 cm−1 (amide III), 1,550 cm−1 (amide II), and 1,650 cm−1 (amide I). However, vibrations at the water wavelengths related to the O–H groups are sensitive to interactions between water and the polar lipids and proteins present in milk, affecting the contribution of water to spectrum variability (Determination of the secondary structure content of proteins in aqueous solutions from their amide I and amide II infrared bands. Comparison between classical and partial least-squares methods). These regions, located at 4,900 to 4,650 cm−1, 3,600 to 3,350 cm−1, and 2,550 to 2,400 cm−1, had a significant effect on the prediction of milk protein fractions expressed as % N. The region from 4,500 to 5,000 cm−1 reflects vibrations of the N–H and C=O groups in proteins. The wavelength region 3,600 to 3,350 cm−1 consists of absorbance from stretching vibrations of hydroxyl groups (O–H) and amide A of proteins (N–H). Overall, these wavenumbers are known to contain information on milk components, and statistical approaches that can perform variable selection (such as GBM) have the advantage of being able to map the complex associations (e.g., nonlinearities and interactions) between the FTIR wavelengths and the target trait.
Genetic Parameters for Laboratory Measurements and FTIR-Based Predictions of Milk Proteins
Phenotyping milk protein fractions is still a bottleneck, so techniques for recording them precisely and reliably are required to improve the efficiency of selection in breeding programs. Increasing the rate of genetic gain using FTIR technologies can reduce the cost of measuring complex phenotypes on a large scale during different stages of lactation. However, it is important to quantify their genetic variation. The heritability estimates for milk protein fractions expressed in g/L and % N, assessed by gold standard laboratory measurements and FTIR-based predictions, show that these traits are influenced by additive genetic effects. Notable reductions in the genetic parameters were observed for αS2-CN and β-LG expressed in g/L and for αS1-CN and αS2-CN expressed as % N. In contrast, the largest differences in heritability estimates were observed for α-LA and TCN in g/L, and for TWP and αS1-CN as % N. Our findings show that robust predictive models that include a larger number of samples in the training data set and a more complex algorithm may be able to capture the relationships between milk FTIR spectra and milk protein fractions more accurately, corroborating the suitability of FTIR prediction of milk protein fractions for genetic evaluation purposes.
The observed reductions in the genetic parameters between FTIR-based predictions and gold standard measurements are consistent with results from previous studies (The effect of the number of observations used for Fourier transform infrared model calibration for bovine milk fat composition on the estimated genetic parameters of the predicted data), where reductions ranged from −32 to −81%. This difference may be explained by the different abilities of the statistical models used to deal with complex associations between infrared spectra and the target phenotype in the calibration equations. The study "Genetic parameters of measures and population-wide infrared predictions of 92 traits describing the fine composition and technological properties of milk in Italian Simmental cattle" estimated the genetic parameters for FTIR prediction of different milk-related traits and observed a significant association between predictive ability and reductions in the genetic parameters for FTIR-predicted traits. These reductions were smaller for traits predicted with an R2 higher than 0.90 than for those with an R2 lower than 0.80.
Correlations Between Laboratory Measurements and FTIR Predictions of Milk Protein Fractions
The magnitude of the genetic correlations between FTIR-based predictions and gold-standard laboratory measures is the main parameter for determining the feasibility of including such indicator traits in animal selection for breeding purposes. Successful incorporation into a breeding program depends on the degree of genetic gain attained through indirect selection, which is directly related to the strength of the genetic correlation between the target and FTIR-predicted trait. Milk protein fractions are important for the dairy industry because they influence milk's technological properties, mainly during the coagulation process, in which αS1-CN and κ-CN lead to reductions in coagulation time. Our estimates of the genetic correlations between the gold standard measurements and FTIR predictions were high and ranged from 0.87 to 0.99 (Figure 5, Figure 6). The strength of the genetic correlations varied as a function of predictive ability: as FTIR predictive ability increased, the genetic correlation between the predicted values and the gold standard measures also increased, as shown in Supplemental Figure S8A (https://doi.org/10.6084/m9.figshare.21864596.v1). Compared with a previous study (Genetic parameters of measures and population-wide infrared predictions of 92 traits describing the fine composition and technological properties of milk in Italian Simmental cattle), the genetic correlations differed especially for β-CN (0.95 vs. 0.63), α-LA (0.88 vs. 0.57), and β-LG (0.94 vs. 0.77). Slight differences in the genetic correlations were observed for TP (0.99 vs. 0.98), αS1-CN (0.91 vs. 0.94), αS2-CN (0.91 vs. 0.87), and κ-CN (0.95 vs. 0.90), whereas no difference was observed for TCN. On the other hand, the genetic correlations for milk protein fractions expressed as % N ranged from 0.88 to 0.97, strikingly different from the results of that study, which ranged from 0.23 to 0.90. Furthermore, although we observed moderate predictive ability for αS1-CN, αS2-CN, β-CN, κ-CN, glyco-κ-CN, α-LA, β-LG, and TWP expressed as % N, the genetic correlation between the gold standard measures and the FTIR predictions was greater than 0.80. This high correlation indicates that there is little or no reranking of the animals with respect to their expected breeding value based on gold-standard measurements.
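The link between the genetic correlation and the gain from indirect selection can be illustrated with the classical quantitative genetics result for correlated response: with equal selection intensity, the response in the target trait when selecting on an indicator, relative to direct selection on the target, is rg multiplied by the ratio of the square roots of the two heritabilities. The Python sketch below uses this textbook formula, which is not taken from the analyses in this study, and the input values are hypothetical.

```python
import math


def relative_efficiency(r_g, h2_indicator, h2_target):
    """Correlated response via the indicator trait relative to direct
    selection on the target, assuming equal selection intensity:
    r_g * sqrt(h2_indicator / h2_target)."""
    return r_g * math.sqrt(h2_indicator / h2_target)


# Hypothetical values: FTIR prediction as indicator, laboratory trait as target.
efficiency = relative_efficiency(r_g=0.90, h2_indicator=0.36, h2_target=0.40)
print(f"{efficiency:.3f}")
```

Under these assumptions, a high genetic correlation combined with only a slightly lower heritability of the predicted trait keeps indirect selection close to direct selection in efficiency, while the indicator can be recorded on far more animals.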
A previous study assessed the genetic parameters of FTIR predictions and milk proteins expressed as % N and found predictive ability to vary from 0.18 (αS1-CN) to 0.56 (β-LG), resulting in genetic correlations ranging from 0.62 (β-CN) to 0.97 (TWP), which are good enough for exploitation in breeding programs. Concerning milk technological traits, another study found R2 values from 0.46 to 0.52 for infrared predictions of curd firming, with genetic correlations between the measures and the predictions ranging from 0.71 to 0.87, and R2 values from 0.61 to 0.69 for infrared predictions of coagulation time, with genetic correlations ranging from 0.91 to 0.96.
The phenotypic correlations between the gold standard measurements and FTIR predictions of milk protein fractions expressed in g/L and % N ranged from 0.63 to 0.87 and were dependent on FTIR predictive ability (Supplemental Figure S8B). Differences in the association between the phenotypic correlations and the model's predictive ability according to whether milk proteins were expressed in g/L or % N can be explained by differences in the contributions of the genetic and environmental effects. The same trend has been observed in dairy cattle (The effect of the number of observations used for Fourier transform infrared model calibration for bovine milk fat composition on the estimated genetic parameters of the predicted data; Genetic parameters of measures and population-wide infrared predictions of 92 traits describing the fine composition and technological properties of milk in Italian Simmental cattle; Prediction of meat quality traits in the abattoir using portable near-infrared spectrometers: heritability of predicted traits and genetic correlations with laboratory-measured traits), where the genetic correlations between the infrared predictions and measured traits were less dependent on predictive ability than the phenotypic correlations. Milk protein fractions with the highest predictive abilities in the calibration equation exhibited the highest genetic and phenotypic correlations with the respective gold standard measurement. However, a low to moderate R2 can also give rise to acceptable genetic and phenotypic correlations. Therefore, our results support the potential application of the developed prediction equations for breeding purposes to enhance milk quality and cheesemaking aptitude.
CONCLUSIONS
This study showed that FTIR spectra can be successfully exploited for the prediction of milk protein fractions expressed both in g/L and % N, although the predictions were in general more reliable when proteins were expressed in g/L, because expression in % N requires more FTIR wavelengths to capture the phenotypic variability. Similar regions of the FTIR spectra were found to explain the variability of traits expressed in g/L and in % N, confirming that they share the same biological background. The heritability estimates for milk protein fractions assessed by laboratory measurements and FTIR predictions followed the same trend, with only slight differences between them. The high genetic correlations between the FTIR predictions and the laboratory measurements found in our study provide evidence for their potential use as indicator traits in breeding programs aimed at altering protein fractions and improving milk quality and cheesemaking ability. Further studies could apply the FTIR calibrations to a population database, provided that FTIR spectra are available, and estimate genetic parameters and genomic breeding values by exploiting longitudinal data and random regression models.
ACKNOWLEDGMENTS
The research was part of the projects (1) LATSAN funded by the Ministero delle politiche agricole alimentari, forestali e del turismo (MIPAAF, Rome, Italy), (2) BENELAT, Interventi a breve e lungo termine per il miglioramento del benessere, dell'efficienza e della qualità delle produzioni dei bovini da latte della Lombardia, Bando per il finanziamento di progetti di ricerca in campo agricolo e forestale 2018 (d.d.s. 28 marzo 2018, n. 4403), and (3) AGER 2 Project under Grant number 2017-1130 funded by the Fondazione CARIPLO. The authors are also grateful to the Italian Holstein-Friesian and Jersey Cattle Breeders Association (ANAFIJ, Cremona, Italy) for collaborating in the research activities and for providing pedigree information. The authors have not stated any conflicts of interest.
REFERENCES
Aguilar I. Misztal I. Johnson D.L. Legarra A. Tsuruta S. Lawlor T.J. A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score.
Effectiveness of mid-infrared spectroscopy for the prediction of detailed protein composition and contents of protein genetic variants of individual milk of Simmental cows.
Genetic parameters of measures and population-wide infrared predictions of 92 traits describing the fine composition and technological properties of milk in Italian Simmental cattle.
Genetic parameters of cheese yield and curd nutrient recovery or whey loss traits predicted using Fourier-transform infrared spectroscopy of samples collected during milk recording on Holstein, Brown Swiss, and Simmental dairy cows.
Genetic analysis of rennet coagulation time, curd-firming rate, and curd firmness assessed over an extended testing period using mechanical and near-infrared instruments.
Determination of the secondary structure content of proteins in aqueous solutions from their amide I and amide II infrared bands. Comparison between classical and partial least-squares methods.
Bernardo J.M. Berger J.O. Dawid A.P. Smith A.F.M. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. 4th ed. Clarendon Press, 1992
Development of Fourier transform mid-infrared calibrations to predict acetone, β-hydroxybutyrate, and citrate contents in bovine milk through a European dairy network.
Mid-Infrared spectroscopy coupled with chemometrics: A tool for the analysis of intact food systems and the exploration of their molecular structure−Quality relationships—A review.
Detection and quantification of αS1-, αS2-, β-, κ-casein, α-lactalbumin, β-lactoglobulin and lactoferrin in bovine milk by reverse-phase high-performance liquid chromatography.
Predicting milk protein fraction using infrared spectroscopy and a gradient boosting machine for breeding purposes in Holstein cattle. Figshare. Figure.
Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data.
Genetic parameters of differential somatic cell count, milk composition, and cheese-making traits measured and predicted using spectral data in Holstein cows.
Production Comparison of Genetic Parameters Estimation of Fatty Acids from Gas Chromatography and FT-IR in Holsteins. Proc. 10th World Congress of Genetics Applied to Livestock.
The effect of the number of observations used for Fourier transform infrared model calibration for bovine milk fat composition on the estimated genetic parameters of the predicted data.
Genetic parameters for milk protein composition predicted using mid-infrared spectroscopy in the French Montbéliarde, Normande, and Holstein dairy cattle breeds.
Prediction of meat quality traits in the abattoir using portable near-infrared spectrometers: heritability of predicted traits and genetic correlations with laboratory-measured traits.