combi-Exploiting machine learning methods with monthly routine milk recording data and climatic information to predict subclinical mastitis in Italian Mediterranean buffaloes

1942 ABSTRACT Mastitis has detrimental effects on the world’s dairy industry, reducing animal health, milk production and quality, as well as income for farmers. In addition, consumers’ growing interest in food safety and rational usage of antibiotics highlights the need to develop novel strategies to improve mastitis detection, prevention, and management. In the present study we applied machine learning (ML) analyses to predict presence or absence of subclinical mastitis in Italian Mediterranean buffaloes, exploiting information collected the previous month during routine milk recording procedures, as well as climatic data. The data set included 3,891 records of 1,038 buffaloes from 6 herds located in Basilicata Region (South Italy). Prediction models were developed using 4 different ML algorithms (Generalized Linear Model, Support Vector Machines, Random Forest, and Neural Network) and 2 data set splitting approaches for the creation of the training and test sets (by record or by animal ID number, always with 80% of the data used for model training and the remaining 20% for model testing). Support Vector Machine was the best method to predict high or low somatic cell count at the subsequent test-day record in the validation set, and therefore it was used to estimate the contribution of each feature to the best model. Independently from the data set splitting approach, the most important features were somatic cell score, differential somatic cell count, electrical conductivity, and milk production. Among climatic data, the most informative were temperature and relative humidity. When the data were split by animal ID, an improvement in models’ predictive performance on the test set was observed, suggesting this as the most appropriate data splitting approach in data sets with repeated measures to avoid data leakage. According to different metrics, Neural Network was the best method for making predictions on the test set. Our findings confirmed the promising role of ML methods to improve prevention and surveillance of subclinical mastitis, exploiting the large amount of data currently available to identify animals that would possibly have high somatic cell count the subsequent month.


INTRODUCTION
Mastitis, an inflammatory condition of the udder, has become a critical issue in the world's dairy industry, affecting animal health, milk production and quality, and income for farmers (Halasa et al., 2007).Mediterranean buffaloes (Bubalus bubalis) have been generally considered less susceptible to udder infections compared with dairy cows, thanks to morphological characteristics of the teat canal and sphincter that reduce the possible invasion of mastitis-causing pathogens (Fagiolo and Lai, 2007).Nevertheless, mastitis has a detrimental effect also on the buffalo dairy sector, which suffers from poor scientific knowledge about this disease in comparison to the bovine dairy sector (Puggioni et al., 2020).Recently, efforts have been made to improve mastitis detection, management and selection in dairy buffaloes.Indeed, novel indicators of mammary gland inflammation derived from traditional SCC, previously developed for improving selection for mastitis resistance in Italian Holstein cattle (Bobbo et al., 2018), were investigated in dairy buffaloes (Costa et al., 2021).Moreover, the dynamics of the different cell types (e.g., macrophages and neutrophils) that compose total SCC have been explored (Alterisio et al., 2021).Differential somatic cell count (DSCC), a novel parameter that represents the proportion of lymphocytes and neutrophils on the total SCC, has been recently introduced in the routine milk recording scheme of dairy buffaloes.The combi-nation of SCC and DSCC has been demonstrated to better define the udder health status of dairy cattle and enhance a rational use of antibiotics (Bobbo et al., 2020).In addition, a novel cathelicidin ELISA has been developed for detecting buffalo mastitis (Puggioni et al., 2020).However, there is still a need for filling the gap in knowledge, possibly by using information that is currently available and not fully exploited.For instance, great advantage could be taken of the large amount of data provided by automatic milking recording systems, as well as by monthly test-day (TD) milk recording procedures.Such information, easily accessible, could be used to train machine learning (ML) algorithms for the prediction of specific traits of interest, such as phenotypes that are difficult to measure, or the possible occurrence of a disease.Machine learning offers a new approach for data analysis and has already been applied in several areas of dairy research (e.g., feeding, behavior, reproduction, and health) for supporting management of farms (Cockburn, 2020).Early detection and prevention of mastitis would represent a valuable asset from both the economic and health point of view.Previous studies reported in the literature have attempted to predict mastitis in dairy cattle, defined by the presence of high milk SCC (Ebrahimi et al., 2019;Anglart et al., 2020;Bobbo et al., 2021) or of mastitis-causing pathogens (Sharifi et al., 2018;Hyde et al., 2020), by applying different ML algorithms.Nevertheless, in livestock research, where data sets with repeated measures are often used for ML data analysis, there has been little discussion on the issue of data leakage related to data splitting and model overfitting (Satoła and Bauer, 2021;Ji et al., 2022).Data leakage occurs when the training set used to create the model contains information about the target to be predicted.
Following the approach reported by Bobbo et al. (2021), in the present study we exploited information already collected in the frame of the monthly routine milk recording procedure of Italian Mediterranean buffaloes, as well as climatic data (features at time t − 1), to predict which animals will present high or low milk SCC level at the subsequent TD (outcome at time t).In addition, we compared results obtained using 2 different data splitting approaches to evaluate the possible effects of data leakage.

Ethics Statement
Animal welfare and use committee approval was not needed for this study because data sets were obtained from pre-existing databases based on routine animal recording procedures.

Data Collection and Editing
Buffaloes involved in the current study were reared on commercial farms and were not subjected to any invasive procedure.Test-day data, recorded during monthly routine milk recording procedures, were provided by the Italian Breeders Association (Rome, Italy).Data included information about herd, animals (ID number, date of calving, stage of lactation, and parity order), date of sampling, daily milk production (kg/d), milk composition [fat (%), protein (%), casein (%), lactose (%), pH, and urea (mg/100 mL)], SCC (cells/mL), DSCC (%), BHB (mmol/L), electrical conductivity (EC, mS), and milk coagulation properties [rennet coagulation time (min) and curd firmness 30 min after rennet addition (mm)].The original data set, which included records collected from August 2019 to February 2021, was edited to select animals with at least 2 TD records within lactation, and with less than 360 DIM.In addition, only consecutive TD records separated by a time interval lower than 6 wk were selected.This approach, also applied by Bobbo et al. (2021), was adopted to reduce data fragmentation over time.Among milk traits, outliers beyond 4 standard deviations, possibly resulting from errors in sampling or recording procedures, were considered as missing values, and only full records were selected for subsequent analysis.Average daily milk production and SCC of contemporary groups-that is, animals sampled in the same herd and day (herd-test-date, HTD)-were also determined (milk_HTD and SCC_HTD, respectively).Finally, the 2 SCC-related traits (SCC and SCC_HTD) were log-transformed to SCS and SCS_HTD to achieve normality, whereas no transformation was required for DSCC.The outcome to be predicted-that is, presence or absence of subclinical mastitis at the subsequent monthly TD-was coded as a binary trait and was based on SCC: animals were classified as healthy (SCC ≤200,000 cells/mL) or mastitic (SCC >200,000 cells/ mL).The threshold of 200,000 cells/mL was selected based on the published literature (Moroni et al., 2006;Costa et al., 2020Costa et al., , 2021)).The prevalence of subclinical mastitis (SCC >200,000 cells/mL) was 40.3%.After editing, the data set included 3,891 records of 1,038 buffaloes in 6 herds.Each record included information of 2 subsequent monthly TD: animal and milk data collected at the previous TD and outcome (healthy vs. mastitic) at the subsequent TD.
In addition, climatic information of the sampling location and date were retrieved from the NASA Pre-  (Sparks, 2018), which allowed access to daily averaged data by providing latitude and longitude values of the 6 herds and the desired date range.In particular, parameters of interest were as follows: All Sky Surface Shortwave Downward Irradiance (MJ/m 2 per day), All Sky Surface UV Index (dimensionless), Temperature at 2 Meters (°C), Relative Humidity at 2 Meters (%), Precipitation Corrected (mm/day), Surface Pressure (kPa), Wind Speed at 2 Meters (m/s), and Wind Direction at 2 Meters (Degrees).For a detailed description of climatic variables see Supplemental Table S1 (https://data.mendeley.com/datasets/pdmy7czpz4/1; Bobbo et al., 2022).

Data Processing, Recursive Feature Elimination, and Model Building
Four different ML methods were adopted to develop subclinical mastitis prediction models: Generalized Linear Models (GLM; Nelder and Wedderburn, 1972), Support Vector Machine (SVM; Cortes and Vapnik, 1995), Random Forest (RF; Breiman, 2001), and Neural Network (NN;McCulloch and Pitts, 1943).Two approaches were used for splitting the data, to evaluate whether results could be biased by possible overfitting due to data leakage in time series data sets: Recursive feature elimination using a 10-fold crossvalidation repeated 100 times with the RF method (Svetnik et al., 2004) was applied to eventually reduce the number of features, automatically selecting the most predictive ones to identify the most parsimonious model with best performance-that is, with highest accuracy of prediction.Then, a stratified 10-fold cross-validation repeated 100 times was employed to train and evaluate the models.In particular, the training data set was divided into 10 subsets of equal size.
Splitting the data by record, partitions of the 10-fold cross-validation were randomly selected; splitting by animal ID, data were split into the 10 subsets based on groups (ID).At each of the 10 iterations, prediction models were trained on 9 subsets and evaluated on the last one, changing the validation subset every time.This entire process was repeated 100 times, for a total of 1,000 iterations.Therefore, 100 mean accuracy and kappa values of each 10-fold cross-validation were then averaged to obtain the final metrics of each method reported in the tables.Data standardization was performed within cross-validation.Tuning details of each model are reported in the supplemental information file (https://data.mendeley.com/datasets/pdmy7czpz4/1;Bobbo et al., 2022).Data analysis was performed using Caret v. 6.0-86 (Kuhn, 2021) and Tidyverse v. 1.3.1 (Wickham et al., 2019) packages of R software v. 4.1.2(R Core Team, 2021).

Comparison of Methods Predicting Performance on Validation and Test Sets
Comparison of methods predicting performance on the validation set was first performed by means of accuracy and Cohen's kappa values.Feature importance (i.e., the estimation of the contribution of each variable to the best model) was then computed.Importance values were then scaled to 0 (least important) and 100 (most important).Predictive ability of all models on the test set was then assessed, and method comparisons were based on different metrics: sensitivity, specificity, accuracy, positive predictive value, negative predictive value, Cohen's kappa value, and F1 score.False positive, false negative, and total error rates of each method were also calculated.Receiver operating characteristic curve analysis was performed using pROC package v. 1.17.0.1 (Robin et al., 2011), and area under the receiver operating characteristic curve (AUC) was measured.Finally, Matthew's correlation coefficient (MCC) was calculated according to the following formula: where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.

Data Processing, Recursive Feature Elimination, and Model Building
Four ML methods (GLM, SVM, RF, and NN) were applied to develop subclinical mastitis prediction models, using animals and milk information collected during monthly routine milk recording procedures and climatic data.Training and test sets were obtained using 2 approaches: dividing the original data set by records or by animal ID, so that the same animals could or could not be present in the 2 sets of data; that is, they could be totally unknown or not when testing the model.
Before model building and training, a recursive feature elimination was applied to eventually reduce the number of features and remove uninformative ones.Splitting the data set both by record and by animal ID, all 27 features were retained in the most parsimonious yet accurate model (Figures 1a and 1b).Nevertheless, a sort of plateau can be reached with the first 7 most important features (SCS, SCS_HTD, milk_HTD, EC, milk, parity, and DSCC).

Comparison of Methods Predicting Performance on Validation Set
Evaluation and comparison of the predicting performance of the 4 ML algorithms on the validation set was based on accuracy and kappa value.Splitting the data set by record, accuracy ranged between 75.4% (NN) and 76.1% (SVM), and kappa between 0.476 (NN) and 0.489 (SVM) (Figure 2a).Splitting the data by animal ID, slightly lower values were reported, with accuracy ranging from 74.8% (RF) to 75.3% (SVM), and kappa from 0.446 (RF) to 0.457 (GLM; Figure 2b).In both cases, SVM was the best method to predict presence or absence of subclinical mastitis in the validation set, and therefore it was used for estimating the contribution of each variable to the best model.Results of the feature importance using SVM on the validation set suggested that, independently from the data set splitting approach, SCS at the previous TD was the most important feature, followed by SCS_HTD and DSCC (Figures 3a and 3b).Two other important variables were milk_HTD and EC.Among climatic data, the most informative were temperature and relative humidity.

Comparison of Methods Predicting Performance on Test Set
Comparison of the prediction performance of the 4 ML algorithms on test set, obtained by splitting the original data set with 2 different approaches, was based on several metrics, summarized in Table 1.Splitting the data set by record, accuracy of prediction ranged from 73.9% (SVM) to 75.4% (NN), whereas kappa values were ranged between 0.447 (SVM) and 0.480 (NN).The NN method also showed the highest F1 score (0.676) and MCC (0.482), followed by GLM (0.670 and 0.476, respectively).Similar findings but with slightly greater scores were obtained by splitting the data set by animal ID.Indeed, NN proved to be the best-performing method, with prediction accuracy of 76.2%, kappa value of 0.518, F1 score of 0.726, and MCC of 0.522.The SVM method, which was the most accurate in predicting subclinical mastitis on the validation set, was instead the worst-performing on the test set.
Considering all 4 methods, the greatest AUC values were observed by splitting the data set by animal ID rather than by record: 84.1% versus 81.2% for GLM, 83.3% versus 80.2% for SVM, 84.1% versus 79.0% for RF, and 84.0%versus 81.4% for NN (Figures 4a and  4b).

DISCUSSION
In the current study, we predicted whether Italian Mediterranean buffaloes will present high or low SCC in the milk collected at the subsequent TD, applying ML analyses on easily accessible and already available information (i.e., milk data collected the previous month during monthly routine milk recording, as well as climatic data related to the sampling location).Although Mediterranean buffaloes seem to be more robust and resistant to diseases than dairy cows, their health and production are also affected by mastitis (Puggioni et al., 2020).Therefore, strategies for early detection and prevention of subclinical mastitis are of paramount importance for both economic and health aspects.From this perspective, our study highlighted the pivotal role of ML analysis for exploiting the large amounts of data that are available nowadays, with the aim of improving disease surveillance and, consequently, farm management strategies.
Subclinical mastitis prediction models were developed using 4 different ML methods, one linear (GLM), one with a distance-based approach (SVM), an algorithm based on decision trees (RF), and one that works like the human brain trying to perform pattern recognition (NN).We decided to compare results obtained using 2 different data set splitting approaches.Indeed, train-ing and test sets were created by dividing the original data set by records (i.e., the same animals, but with different TD records, can be found in both sets of data) or by animal ID (i.e., animals in the test set were not included in model building and were totally unknown).A common approach during model building is to randomly divide the data set into multiple subsets, so that training and fine-tuning of the model are performed using a k-fold cross-validation as resampling procedure (Ebrahimi et al., 2019;Anglart et al., 2020;Bobbo et al., 2021).In addition, data sets can also be split to Evaluated features, using Support Vector Machine as the predictive method, are as follows: individual SCS and SCS of contemporary group (scs and scs_htd), differential SCC (DSCC), electrical conductivity (EC), individual milk production and milk production of contemporary group (milk and milk_HTD), parity, stage of lactation (DIM), milk composition traits (urea, pH, lactose, fat, casein, protein), BHB, year and month of sampling (yms), year and month of calving (ymc), rennet coagulation time (r), curd firmness 30 min after rennet addition (a30), and climatic data (temperature, relative humidity, UV index, irradiance, pressure, precipitation, wind speed, and wind direction).use part of the data for training with cross-validation and to hold out a portion of the data as external test set (e.g., 80/20%, 90/10%, 50/50%).The test set is important in order to obtain non-inflated estimates due to possible overfitting; indeed, model predictive performance on test sets is generally lower.In such cases, data are typically divided by randomly selecting a certain proportion of records (Anglart et al., 2020;Bobbo et al., 2021) or of farms (Hyde et al., 2020), or numbers of milkings (Ankinakatte et al., 2013).Nevertheless, records in time series data sets or in data sets with repeated measures of the same individual (e.g., animals with several TD) might be highly correlated; therefore special attention should be paid to choosing the most appropriate data splitting approach.In such cases, data should be split based on ID rather than by records, to avoid possible overfitting due to data leakage.Indeed, the aim of predictive modeling is to develop a model that makes accurate predictions on novel unseen data.Splitting by record data sets with repeated measures, data leakage might occur; that is, the data you are using for model training might contain the information you are trying to predict.In our study, when splitting by record, we observed slightly better predictive performances on the validation set and lower performance on the test set.This can be the result of overfitting, although, in our study, to minimize data leakage, recursive feature elimination as well as data standardization were performed within cross-validation.When the data were split by animal ID (both in the creation of the training and test sets and during cross-validation), an improvement in models' predictive performance on the test set was observed, suggesting this as the most appropriate data splitting approach according to our data structure.
Comparisons of the predicting performance of the 4 ML algorithms on both validation and test sets were based on several metrics, including F1score, AUC, and MCC, which are independent from the outcome rate.Results of the feature importance based on the most accurate method (SVM) on the validation set revealed that, independently from how the data set was split, SCS recorded at the previous TD was, as expected, the most important feature for predicting the presence or absence of subclinical mastitis at the subsequent TD, followed by the other 2 SCC-related traits (SCS_HTD and DSCC).In addition to individual SCS, average SCS of contemporary groups was included to represent herd hygiene conditions.Our results confirmed the important information provided by DSCC, a novel indicator of udder health status, to be used in combination with SCC to better screen for udder health status, as previously observed for dairy cattle (Bobbo et al., 2020).Indeed, DSCC and SCS are different traits, as their phenotypic and genetic correlations differ from unity (i.e., 0.66), as reported by (Bobbo et al., 2019).Other important variables were milk_HTD, a proxy for herd management level, individual milk production, and EC.The negative correlation between buffaloes' milk production and SCS has already been reported in the literature (Tripaldi et al., 2010;Costa et al., 2020).In addition, previous ML studies on dairy cows (Ebrahimie et al., 2018;Ebrahimi et al., 2019) have found EC to be one of the most important features in the prediction of subclinical mastitis based on automatic milking parameters.Indeed, udder infection alters the volume of milk produced, as well as its ionic composition due to leakage of components through the blood-milk barrier.Parity order and stage of lactation also showed relevant contributions to the best model; indeed, they are well known factors affecting SCC variation (Cerón-Muñoz et al., 2002).Among climatic data, the most informative were temperature and relative humidity.In livestock, heat stress is known to negatively affect both milk production and animal health (Bernabucci et al., 2010(Bernabucci et al., , 2014)).The temperature-humidity index,  which represents the combined effect of air temperature and humidity, is a parameter commonly used to evaluate the degree and the consequences of heat stress (Bernabucci et al., 2014;Matera et al., 2022).A recent study conducted on Italian Mediterranean buffaloes (Matera et al., 2022) has confirmed the negative effect of temperature-humidity index variation on udder health, defined by SCC.In the present study, traits related to solar radiation (UV_index and irradiance) also showed moderate relevance.Climate variables such as temperature, relative humidity, and solar radiation have previously been found to slightly affect milk production and composition (Sharma et al., 1988).In addition, the inclusion of meteorological parameters (e.g., precipitation, sunshine hours, and soil temperature) in milk production forecast models resulted in a slight improvement in the prediction accuracy, with sunshine hours having the largest effect (Zhang et al., 2020).

CONCLUSIONS
The findings of our study confirmed ML methods to be a promising tool to improve prevention and surveillance of subclinical mastitis, exploiting the large amount of data currently available.Given consumers' growing concerns about food safety, quality, and antibiotic usage, further studies are needed to advance mastitis detection, management, and selection.Indeed, given the high economic value of Protected Designation of Origin (PDO) Mozzarella di Bufala cheese, special attention should be paid to the health and well-being of Italian Mediterranean buffaloes and their milk quality.We are confident that our research will serve as a basis for practical implementation of these methodologies in dairy management systems, as well as in the application of complex phenotypes in genetic and genomic evaluations.

Figure 1 .
Figure 1.Results of the recursive feature elimination, a function that implements backward feature selection, incorporating 27 to 1 features in the model, using the training set obtained by splitting the original data set (a) by record and (b) by animal ID.The number of features is reported on the x-axis, and the model accuracy from the 10-fold cross-validation repeated 100 times on the y-axis.

Figure 3 .
Figure 2. Metrics (accuracy and Cohen's kappa value) for the comparison of methods predicting performance on the validation set, obtained by splitting the original data set (a) by record and (b) by animal ID.Prediction models were developed using 4 machine learning methods: Generalized Linear Model (glmnet), Support Vector Machines (svmRadial), Random Forest (rf), and Neural Network (nnet).Error bars represent 95% CI.

Figure 4 .
Figure 4. Receiver operating characteristic (ROC) curves of 4 machine learning methods [Generalized Linear Model (GLM), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN)] run for predicting the presence or absence of subclinical mastitis on the test set, obtained splitting the original data set (a) by record and (b) by animal ID.In each plot, area under the curve (AUC) and 95% CI are reported.
Bobbo et al.: MACHINE LEARNING FOR MASTITIS PREDICTION diction of Worldwide Energy Resource (POWER) Data Access Viewer Bobbo et al.: MACHINE LEARNING FOR MASTITIS PREDICTION

Table 1 .
Bobbo et al.:MACHINE LEARNING FOR MASTITIS PREDICTION Metrics for the comparison of methods predicting performance on test set, obtained by splitting the original data set by record and by animal ID: accuracy, 95% CI, sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), Cohen's kappa value, F1 score, and Matthew's correlation coefficient (MCC) 1 Bobbo et al.: MACHINE LEARNING FOR MASTITIS PREDICTION