Can unsupervised learning methods applied to milk recording big data provide new insights into dairy cow health?

Among the dairy sector’s current concerns, the assessment of global animal health status is a complex challenge. Its multidimensionality means that global monitoring tools are rarely considered. Instead, specific disease detection is often studied separately and, due to financial and ethical issues, uses small-scale data sets focusing on few biomarkers. Several studies have already been conducted using milk Fourier transform mid-infrared (FT-MIR) spectroscopy to detect mastitis and lameness or to quantify health-related biomarkers in milk or blood. Those studies are relevant but they focus mainly on one biomarker or disease. To solve this issue and the small-scale data set, in this study, we proposed a holistic approach using big data obtained from milk recording, including milk yield, somatic cell count, and 27 FT-MIR–based predictors related to milk composition and animal health status. Using 740,454 records collected from 114,536 first-parity Holstein cows in southern Belgium, we performed repeated unsupervised learning algorithms based on Ward’s agglomerative hierarchical clustering method to find potential interesting patterns. A divide-and-conquer approach was used to overcome the limitation of computational resources in clustering a relatively large data set. Five groups of records were identified. Differences observed in the fourth group suggested a relationship to metabolic disorders. The fifth group seemed to be related to mastitis. In a second step, we performed a partial least squares discriminant analysis (PLS-DA) to predict the probability of belonging to those specific groups for the entire data set. The obtained global accuracy was 0.77 and the balanced accuracy (i.e., the mean between sensitivity and specificity) of discriminating the fourth and fifth groups was 0.88 and 0.96, respectively. Then, a validation of the interpretation of those groups was performed using 204 milk and blood reference records. The predicted probability associated with the metabolic disorders issue had significant correlations of 0.54 with blood β-hydroxybutyrate, 0.44 with blood nonesterified fatty acids, −0.32 with blood glucose, −0.23 with milk glucose-6-phosphate, and 0.38 with milk isocitrate. In contrast, the predicted probability of belonging to the mastitis group had correlations of 0.69 with milk lactate dehydrogenase, 0.46 with milk N -acetyl-β-d-glucosaminidase, −0.18 with milk free glucose, and 0.16 with milk glucose-6-phosphate. Consequently, these results suggest that the obtained quantitative traits indirectly reflect some of the main health disorders in dairy farming and could be used to monitor dairy cows on a large scale. By using unsupervised learning on large-scale milk recording data and then validating the pattern using reference laboratory measures, we propose a new approach to quickly assess dairy cow health status.


INTRODUCTION
Monitoring dairy cow health is complex because it involves many aspects such as disease, energy balance, and heat stress.However, dairy cow health, being a part of animal welfare, is important for dairy farming sustainability.Stakeholders (producers, policymakers, and consumers) need warranties about the ethics of the product.In particular, the welfare status of cows is of increasing concern (Fraser, 1995;Blokhuis et al., 2003).
Currently, cow health monitoring tools mostly rely on milk and blood analyses, which provide healthrelated biomarkers.However, defining health indicators is a lengthy and costly process because it requires many potential biomarkers to be scanned among an animal population that has health issues (Foldager et al., 2020;Krogh et al., 2020).Depending on the issue, the process can raise ethical questions when the disorder is induced in animals or the measurement is invasive.In other cases, detecting animals with health issues is complicated because the farms participating in the studies are already aware of the problem, making the sample unrepresentative and biased toward healthy animals.
Animal scientists have developed strategies to predict health indicators using milk Fourier transform mid-infrared (FT-MIR) spectrometry to avoid ethical issues and the cost of routine measurements of biomarkers.This spectroscopic technology has many advantages as it is noninvasive and reflects part of the animal's metabolism by assessing the global milk composition.Moreover, milk FT-MIR spectra, along with other dairy production phenotypes, are collected by DHI organizations on average once a month in participating herds.From the FT-MIR spectra, statistical models can predict biomarkers of interest such as acetone, BHB, citrate, lactoferrin, and SCC to assess and, therefore, monitor dairy cow health (Soyeurt et al., 2012;Luke et al., 2019).This technique enables routine individual assessment of cow health status on farms after applying appropriate prediction equations related to the biomarkers to the recorded spectral data.
A standard procedure to obtain prediction models consists of collecting blood and milk samples from as many healthy and sick animals as possible, analyzing those milk samples using FT-MIR spectrometry to record the spectral data, and measuring the contents of the studied biomarkers in milk and blood using certified chemical methodologies.Afterward, supervised machine learning methods such as partial least squares (PLS) regression can be used to build the model, predicting the desired biomarker (Soyeurt et al., 2020).This process has been largely used in the last decades to increase the number of phenotypes predicted by FT-MIR and not only to predict health issues (De Marchi et al., 2014;Bastin et al., 2016;Gengler et al., 2016).Table 1 presents the characteristics of some models published in the literature predicting different health biomarkers naturally present in milk and blood.A large range of cross-validation accuracy can be observed: R 2 varied between 0.21 and 0.88 (Table 1).More information on key factors affecting the quality of predictions is available in Grelet et al. (2021).Model performance and validation accuracy can sometimes be poorer for several reasons.First, for practical, ethical, or financial issues, the calibration data set may not be representative enough, with too few animals or herds.However, to maximize the prediction quality, the modeling procedure requires sampling among the whole animal population to maximize the variability of the calibration data used to develop the equation (Grelet et al., 2021).Thus, making a prediction outside of the calibration variability is detrimental to the prediction quality due to inner extrapolation.In addition, maximizing the calibration set for health indicators requires both healthy and sick animals to be included.Sick animals are relatively scarce, which explains why obtaining high BHB levels is more difficult, as mentioned by Grelet et al. (2016), Fleming et al. (2017), and Bonfatti et al. (2019).This unbalanced situation between sick and healthy animals can affect the final prediction accuracy of the developed models.Finally, most of the metabolites of interest have a very low concentration in milk or are even predicted in another matrix (e.g., blood).These indirect predictions rely more on global milk changes than on measuring a specific molecule.Consequently, the disorder cannot be simulated artificially, such as spiking milk with molecules of interest, because it would interfere with the natural relationships existing between the milk components.All these aspects explain the difficulty in developing prediction models related to health indicators.To solve these issues, it would be interesting to combine different health biomarker predictions, different models derived from different populations, and usual production traits and indicators to provide insights into cow health status (e.g., milk production or BW).
In their review of infrared spectrometry as a highthroughput phenotyping technology, Bresolin and Dórea (2020) suggest using larger data sets and modern data-mining approaches to improve the models.In the current study, the innovative aspect consists of realizing an unsupervised learning approach that profits from the large-scale data set collected from DHI and combining several FT-MIR-based predictions and usual dairy traits to discriminate groups of cows that have a different data pattern.This approach allows the combination of several indicators of health issues (i.e., multidimensional aspect).Furthermore, the exploration of a large DHI database ensures the consideration of high variability and warrants the representativeness of the sample set.Therefore, the context of this study can be considered as a "big data" case; the 5 Vs representing aspects of big data are found: volume, variety, veracity, velocity, and value (Gandomi and Haider, 2020).
Moreover, unsupervised methods have the advantage of not introducing any bias about the definition of health.Among those methods, clustering, and particularly hierarchical clustering, is used as it allows patterns to be observed at any scale of the database and also cuts the data into groups.Previous studies have considered multiple reference health measures together and even clustering, but they were conducted using small-scale reference data sets (LeBlanc, 2010;Grelet et al., 2019;Atashi et al., 2020) and not from a routine DHI database.Therefore, the current approach presents the distinction of having no reference analysis of health biomarkers to construct the discrimination between healthy and sick animals.However, this discrimination will be validated using real reference measurements to attest to the relevancy of the discovered groups.

MATERIALS AND METHODS
All data treatments and analyses were performed using R software (version 4.0.2;https: / / www .r-project .org/).All data were collected as per standard routine by the Walloon Breeders Association, which is accredited to perform such collection.

Calibration Data
The data set to be used with unsupervised learning methods, hereafter called the calibration data set, contained 740,454 DHI records collected by Elevéo (Awé groupe, Ciney, Belgium) in 1,077 herds located in southern Belgium between January 2012 and March 2020 from 114,536 primiparous Holstein cows with DIM ranging between 5 and 365 d.Milk analysis was performed by the milk laboratory Comité du Lait (Battice, Belgium) on Foss Milkoscan FT6000, FT+, and FT7 standardized spectrometers.Among the large collection of possible phenotypes for this study, we kept only milk yield, milk FT-MIR spectra, SCC, and contents of fat, protein, urea, and lactose predicted by the spectrometer.Furthermore, predictive equations developed from previous studies (Table 2) were applied to the recorded spectral data to predict 20 additional biomarkers related to health issues.Although initially predicted in grams per deciliter (g/dL) of milk, the contents of SFA, C18:1 cis-9, MUFA, and PUFA were expressed in grams per 100 grams of fat using the fat content predicted by the spectrometer, as this unit is more relevant for assessing the biological pathways behind fatty acid production.Then, DMI and fat-and protein-corrected milk (FPCM) were computed using the following equations (NRC, 2001;Yan et al., 2011): where BW is the predicted BW using the equation of Tedde et al. (2021), WOL is the week of lactation, MY is milk yield, and FAT and PROT are fat and protein contents expressed as percentages.The consumption index was defined as the ratio of DMI to FPCM.Finally, the calibration data set used for the clustering contained 29 traits that were directly or indirectly related to health issues (Table 3).

Clustering Method
Patterns were found in the calibration data set using hierarchical clustering.This agglomerative algorithm works step by step to combine the closest records or groups of records until the whole data set is considered.Ward's agglomerative hierarchical clustering method with Euclidean distance was chosen because this is widely used to distinguish groups in a multivariate Euclidean space (Ward, 1963).However, this type of clustering has the constraint of requiring the computing of a distance matrix between every observation or cluster at each step.So, for n records, the algorithm computes 1 2 distances.For 740,454 records, this represents 274 × 10 9 distances, which would make the computation too slow, and storing the distance matrix in memory would be impossible with standard computational infrastructure.The volume of data is a common hindrance in big data methodologies, and many approaches exist to solve this issue (Wang et al., 2016).In this study, we decided to apply a divide-and-conquer approach, which consisted of dividing the data into more manageable subsets, analyzing them separately, and recombining them afterward.The calibration data were first split randomly into 100 subsets of ~7,500 records.For each subset, the records were clustered, and the obtained dendrogram was cut to divide the subset into 500 groups.A mean sample (i.e., a centroid) was created for each group by averaging all features.Therefore, 50,000 centroids (i.e., 500 centroids from 100 subsets) were computed to summarize the information contained in the initial data set.Those centroids were then clustered using the same hierarchical clustering method.As the approach depends on the splitting of the data set, the procedure was replicated 10 times to estimate the robustness of the final clustering obtained.From this final clustering, the inertia (i.e., the ratio between the between-class and within-class variance) of the dendrogram was estimated to define the best number of groups (i.e., selection of several groups before a large drop of inertia gain).This can be done by looking at the graph of the inertia according to the number of groups.

Cluster Interpretation
Standard statistical methods to test the significance of the effects, such as ANOVA, are useless here because of the number of records.There are too many observations, so the tests are too powerful, making every effect significant.Moreover, the main objective was to interpret the clusters to understand what they represent.Thus, the differences between clusters were represented by calculating their least squares means (LSM) per group, which were visualized using bar plots on a standardized scale.

Expanding the Result to the Original Data Set
For the previous step, the large data set was simplified into 50,000 centroids clustered into several groups.To expand this result to the whole data set, a cluster prediction model was needed.A PLS discriminant analysis (PLS-DA) model was built from the centroids using the 29 biomarkers.It is a predicting discriminant model using linear combinations of the original variables called latent variables, which are optimized to best represent  the variability of the response and the original variables.
The output of the model is the probability of belonging to a specific cluster.The hyperparameter of the model is the number of latent variables, which was fixed based on the performances obtained from a 10-fold crossvalidation.As our aim was to develop a model able to predict every cluster, but more specifically the clusters associated with sick animals, we compared the global accuracy and the balanced accuracy for each cluster, which is the mean between sensitivity and specificity.
The PLS-DA model is sensitive to unbalanced data as it will most likely predict the most observed cluster.In a case where the proportions of observations in each cluster vary widely, it is recommended to generate a balanced training data set.The considered up-sampling approach consists of sampling with replacement of the original data set but with the constraint to reach 10,000 records for each cluster.Moreover, we expected that a quantitative trait would better represent the continuum of the animal health status instead of a binary division (healthy vs. sick).Compared with the k-nearest neighbors algorithm, PLS-DA gives more quantitative information: the scores on specific latent variables and the probability of being in a cluster.

Validation
The innovative aspect of our approach was to use a large-scale database to find patterns potentially related to animal health to keep the maximum variability existing in the population.Therefore, no reference measurements for health status were used.This is why validation was performed using reference biomarkers measured in milk and blood samples.This approach aimed to validate the interest of obtained clusters and the potential of an associated quantitative trait such as a PLS-DA probability of belonging to a cluster.
The validation health data set was created from the European Union FP7 Genotype plus Environment project (http: / / www .gpluse.eu).The data set contained 380 records collected between 2013 and 2018 for primiparous and multiparous Holsteins with DIM between 10 and 50.As the clusters are defined from primiparous Holsteins, a first data set containing only primiparous cows was created but only 50 records were usable as SCC measures were missing.Therefore, a second data set containing 204 records from primiparous and multiparous cows was developed to increase the number of observations available.Several biomarkers   2019) describe the data collection and the applied protocol in more detail.All phenotypes included in the calibration data set were also collected to allow prediction of the clusters, PLS-DA probabilities, and scores on latent variables for those observations.To validate the relationships between the probabilities or the scores on latent variables and the reference values measured in the laboratory, Pearson correlations were calculated between those quantitative and reference traits.Then, statistical t-tests for association between paired samples were performed to assess the significance of the estimated correlation values.

Developing the Clusters
As mentioned previously, standard hierarchical clustering methodologies could not be used and a specific approach was developed.As this approach involved randomness, the process was repeated 10 times to assess its robustness.To determine the similarity, we looked at whether the groups obtained and the division order between them were equivalent.Nine dendrograms out of 10 repetitions were similar.Thus, only one representative dendrogram is shown in Figure 1.Based on inertia, the dendrogram was cut into 5 groups.The graph is structured into 2 large groups divided into one sub-group containing cluster 1 and cluster 4 and another sub-group divided into 3 sub-groups (clusters 2, 3, and 5; Figure 1).Among these 3 sub-groups, cluster 5 was very small and well separated from the others.The distribution of the centroids among the clusters was uneven, with 15,271 centroids in cluster 1 (30.54%),19,573 in cluster 2 (39.15%), 5,452 in cluster 3 (10.90%),6,856 in cluster 4 (13.71%), and 2,848 in cluster 5 (5.70%).Clusters 1 and 2 covered most of the records, with 69.69% of observations.
As mentioned earlier, testing for the significance of the effect is irrelevant here.Instead, the LSM of the 29 studied traits were computed and are illustrated in Figure 2 on a standardized scale.From the visual analysis, 2 clusters differed from the others: cluster 4 and cluster 5. Clusters 1 and 4 had similar behavior, with both means of each prediction being positive or negative.However, the means of cluster 4 were larger than those of cluster 1, suggesting that clusters 1 and 4 represented the same biological phenomena but with severe cases being in cluster 4. Those groups were characterized by negative values for the predictions of protein content, BW, CH 4 , P, DMI, energy balance (EB), residual feed intake, consumption index, and SFA.There were positive means for the predictions of Na, nitrogen use efficiency, citrate, acetone, BHB, and the predicted percentages of C18:1 cis-9, MUFA, and PUFA in milk fat.Based on those results, we hypothesized that cluster 4 referred to cows with metabolic disorders induced by negative EB, up to ketosis.Although we mainly used predictions in this study, the results are similar to those of Gross et al. (2011), who observed reduced milk production  with a lower percentage of protein, combined with a lower BW and DMI for Holstein cows in negative EB.Moreover, cluster 4 was related to lower content of SFA and higher contents of MUFA and PUFA.Vranković et al. (2017) and Fiore et al. (2020) showed that similar changes in milk fatty acid composition, with decrease of de novo fatty acids, are associated with negative EB and body fat mobilization.
Cluster 5 was characterized mostly by a large positive value for SCC.Moreover, Na and lactoferrin means were high, and that of lactose was low, suggesting that cluster 5 was related to mastitis issues (Figure 2).Although we used predictions for Na and lactose, these findings can be corroborated by previous works.Bansal et al. (2005) made the same observation for cows with subclinical mastitis using reference measurements.Fernando et al. (1985) also mentioned the use of Na as an indicator of subclinical mastitis.Lactose as an indicator of subclinical mastitis was identified by Ebrahimie et al. (2018).Cluster 3 was defined by low production of milk associated with high estimated means for fat, protein, and minerals.

Predicting the Cluster for the Calibration Data Set
At this stage, we discovered 2 potentially interesting groups for disease detection (i.e., cluster 4 and cluster 5).However, those clusters were only defined using the centroids.Therefore, we needed to expand the results of the clustering to the whole data set.As each of the centroids was defined from 500 groups for each of the 100 subsets, this could be performed by recovering the records for each centroid.However, this method prevents the extrapolation of the results to future records.Instead, a PLS-DA model was built to allow discrimination of the found clusters from the 50,000 centroids.As the number of centroids by cluster was greatly unbalanced, which risks impairing the predictions, we used an up-sampling approach based on bootstrapping, wherein we sampled with higher probabilities for the smaller cluster.As a result, the training data set was balanced with around the same number of observations for each cluster, creating a model that was not biased toward the prediction of the most represented clusters.The number of latent variables was fixed at 6, as the highest overall cross-validation accuracy was found for this model (0.77).The accuracy decreased to 0.70 when the model was applied to the original centroid data.The balanced accuracy of discriminating cluster 4 and cluster 5 was 0.88 and 0.96, respectively.With use of the PLS-DA model, we were able to extend the prediction of clusters to the entire data set.We observed that cluster 1 was assigned to 21.39% of the records, cluster 2 to 40.89%, cluster 3 to 18.01%, cluster 4 to 16.62%, and cluster 5 to 3.1%.This repartition is more or less the same as the repartition observed using the 50,000 centroids, confirming the model, with a lower percentage of records belonging to cluster 4 and cluster 5.This is expected as those groups should be related to cows with metabolic disorders or mastitis.
Although we had clusters for all records, these categorical variables do not fully represent the reality of a sick animal.Animal health status is a continuous variable from healthy to sick.By observing the PLS-DA latent variables, we can observe this continuum and the overlap between clusters (Figure 3).Therefore, the probabilities of being in cluster 4 or cluster 5 seemed to be more relevant compared with the previous hard clustering, where each observation belongs to only one cluster.However, it is important to verify whether the found clusters were still pertinent using reference biomarker measurements.

Validating Against Reference Values
Based on the GplusE project, it was possible to create a data set containing reference measurements of biomarkers related to animal health.As all predictors needed to predict the clusters were available for these records, we were able to estimate for each record the probability of being in a cluster from the PLS-DA (Table 5).As the variabilities from the data set composed of primiparous Holsteins were low, especially for the probability related to cluster 5, we assume that this small data set contained very few cases of mastitis.If we consider a proportion of 6% for cluster 5, we would expect 3 observations for this cluster.To solve this issue, we extended the result to subsequent lactations.The extension allowed the consideration of a larger number of observations, where more cows with health issues could be expected.This example shows the difficulty in finding representative reference data when working with health data.
As all of the PLS-DA probabilities were quantitative, Pearson correlations could be computed between these outputs and the reference contents of biomarkers measured in blood and milk.Table 6 shows these correlations and their significance.The correlations confirmed the previous interpretation of the clusters.Among the significant correlations, the probability associated with cluster 4 was positively correlated with blood and milk BHB, isocitrate, and NEFA and negatively correlated with blood fructosamine, blood glucose, milk free glucose, and milk glucose-6-phosphate.As reported in many studies, these biomarkers reflect EB issues up to ketosis through a lower level of circulating glucose in blood, reflected in the long term through a lower content of fructosamine, and leading to release of NEFA because of body fat mobilization and ultimately production of ketone bodies in the liver (Ingvartsen, 2006;Suthar et al., 2013;Esposito et al., 2014).Larsen (2014) and Larsen and Moyes (2015) also proved that free glucose, glucose-6-phosphate, and isocitrate are related to EB issues or risk of diseases.
For the correlation between the probability of belonging to cluster 5 and the biomarkers, we observed that they differed between the first and the subsequent lactations.We assumed that this was related to the low variability in the first-lactation data set, suggesting that there were no severe mastitis issues in the 50 observations.However, we do not have the on-farm information about mastitis so we cannot validate this assumption.Thus, we also considered the correlations from the other data set.The probability of belonging to cluster 5 was negatively correlated with free glucose in milk but not in blood, and positively correlated with NAGase and LDH.Similarly, Larsen et al. (2010) and Kitchen et al. (1980) showed that LDH and NAGase were relevant indicators of mastitis.
As shown by the validation step based on reference laboratory data, the probabilities were strongly correlated with relevant biomarkers of health issues.Thus, our results highlighted that even using predictions from models with low or fair quality (9 models had an R 2 cv < 0.7), trends and patterns can be found in a large data set and validated through external data.As mentioned by McParland and Berry (2016) or Bonfatti et al. (2017), the current results validate the idea that models with low or fair accuracy can be valuable for breeding and monitoring if they can highlight trends or genetic correlation through large-scale data sets.Moreover, the measures could be exploited in a longitudinal approach by computing the variance, the mean value for lactation, or the expected curves.The measures, frequencies, and means of measures could be indicators of the resilience of the animals.However, the method requires more frequently collected spectral data, not a monthly test, to be used as a continuous monitoring tool.
The current approach has the limitation of requiring previous biomarker models to constitute the clusters.However, the cluster prediction method, being based only on DHI records, opens up many avenues to use already available data.The method offers a way to summarize information from multiple models, to find consistency between all the predictions, and ultimately to detect health patterns in large databases.

CONCLUSIONS
This research aimed to consider the multidimensionality of animal health and its associated traits to detect potential health issues in a routine environment.Instead of using reference laboratory data, the innovative aspect of this study was to make use of a DHI data set containing records related to usual dairy traits and FT-MIR-based predictions that have direct or indirect links with health issues, to discover groups of cows with different data patterns.Even though the method had some computational and validation limitations, this novel approach allowed detection of global patterns from a large data set that is more representative of the population.The results are promising, showing that the approach is worth developing.Two clusters were shown to represent cows with metabolic disorders or that have mastitis issues.From the probability of belonging to 1 of those 2 clusters estimated through a PLS-DA model, we propose a quantitative gradient of sickness related to metabolic disorders or mastitis.These traits are directly usable in many applications; for example, as a tool to monitor the animal health status of every productive cow during routine milk recording or to develop new estimated breeding values due to the large amount of directly available data.

Franceschini
et al.: UNSUPERVISED METHODS ON MILK RECORDING BIG DATA

Franceschini
et al.: UNSUPERVISED METHODS ON MILK RECORDING BIG DATA

Franceschini
et al.: UNSUPERVISED METHODS ON MILK RECORDING BIG DATA

Franceschini
et al.: UNSUPERVISED METHODS ON MILK RECORDING BIG DATA

FranceschiniFigure 1 .
Figure 1.Dendrograms for the 50,000 centroids.Height indicates the value of the Ward's criterion for each agglomeration of centroids or groups of centroids.Agglomerations with low heights represent centroids, or groups of centroids, that are close.Clusters are identified by color.

FranceschiniFigure 3 .
Figure 3. Scatterplot representing the distribution of the 740,454 records on the first and second latent variables from the partial least squares discriminant analysis.

Franceschini
et al.: UNSUPERVISED METHODS ON MILK RECORDING BIG DATA

Table 1 .
Franceschini et al.: UNSUPERVISED METHODS ON MILK RECORDING BIG DATA Published prediction models for health biomarkers

Table 2 .
Performance of the equations used 1 1R 2 c = coefficient of determination of calibration; R 2 cv = coefficient of determination of cross validation; SEc = standard error in calibration; SEcv = standard error in cross validation; RPDcv = ratio of performance to deviation in cross validation.2EB = energy balance; DMI2 = predicted DMI; NUE = nitrogen use efficiency; RFI = residual feed intake (alternate calculation via RFI2); C18:1 cis-9 = oleic acid.

Table 5 .
Descriptive statistics for the probability of being in a cluster for the validation records