
Exploiting machine learning methods with monthly routine milk recording data and climatic information to predict subclinical mastitis in Italian Mediterranean buffaloes

Open Access. Published: December 29, 2022. DOI: https://doi.org/10.3168/jds.2022-22292

      ABSTRACT

      Mastitis has detrimental effects on the world's dairy industry, reducing animal health, milk production and quality, as well as income for farmers. In addition, consumers' growing interest in food safety and rational usage of antibiotics highlights the need to develop novel strategies to improve mastitis detection, prevention, and management. In the present study we applied machine learning (ML) analyses to predict presence or absence of subclinical mastitis in Italian Mediterranean buffaloes, exploiting information collected the previous month during routine milk recording procedures, as well as climatic data. The data set included 3,891 records of 1,038 buffaloes from 6 herds located in Basilicata Region (South Italy). Prediction models were developed using 4 different ML algorithms (Generalized Linear Model, Support Vector Machines, Random Forest, and Neural Network) and 2 data set splitting approaches for the creation of the training and test sets (by record or by animal ID number, always with 80% of the data used for model training and the remaining 20% for model testing). Support Vector Machine was the best method to predict high or low somatic cell count at the subsequent test-day record in the validation set, and therefore it was used to estimate the contribution of each feature to the best model. Independently from the data set splitting approach, the most important features were somatic cell score, differential somatic cell count, electrical conductivity, and milk production. Among climatic data, the most informative were temperature and relative humidity. When the data were split by animal ID, an improvement in models' predictive performance on the test set was observed, suggesting this as the most appropriate data splitting approach in data sets with repeated measures to avoid data leakage. According to different metrics, Neural Network was the best method for making predictions on the test set. 
Our findings confirmed the promising role of ML methods to improve prevention and surveillance of subclinical mastitis, exploiting the large amount of data currently available to identify animals that would possibly have high somatic cell count the subsequent month.

      INTRODUCTION

      Mastitis, an inflammatory condition of the udder, has become a critical issue in the world's dairy industry, affecting animal health, milk production and quality, and income for farmers (
      • Halasa T.
      • Huijps K.
      • Østerås O.
      • Hogeveen H.
      Economic effects of bovine mastitis and mastitis management: A review.
      ). Mediterranean buffaloes (Bubalus bubalis) have been generally considered less susceptible to udder infections compared with dairy cows, thanks to morphological characteristics of the teat canal and sphincter that reduce the possible invasion of mastitis-causing pathogens (
      • Fagiolo A.
      • Lai O.
      Mastitis in buffalo.
      ). Nevertheless, mastitis has a detrimental effect also on the buffalo dairy sector, which suffers from poor scientific knowledge about this disease in comparison to the bovine dairy sector (
      • Puggioni G.M.G.
      • Tedde V.
      • Uzzau S.
      • Guccione J.
      • Ciaramella P.
      • Pollera C.
      • Moroni P.
      • Bronzo V.
      • Addis M.F.
      Evaluation of a bovine cathelicidin ELISA for detecting mastitis in the dairy buffalo: Comparison with milk somatic cell count and bacteriological culture.
      ). Recently, efforts have been made to improve mastitis detection, management and selection in dairy buffaloes. Indeed, novel indicators of mammary gland inflammation derived from traditional SCC, previously developed for improving selection for mastitis resistance in Italian Holstein cattle (
      • Bobbo T.
      • Penasa M.
      • Finocchiaro R.
      • Visentin G.
      • Cassandro M.
      Alternative somatic cell count traits exploitable in genetic selection for mastitis resistance in Italian Holsteins.
      ), were investigated in dairy buffaloes (
      • Costa A.
      • De Marchi M.
      • Neglia G.
      • Campanile G.
      • Penasa M.
      Milk somatic cell count-derived traits as new indicators to monitor udder health in dairy buffaloes.
      ). Moreover, the dynamics of the different cell types (e.g., macrophages and neutrophils) that compose total SCC have been explored (
      • Alterisio M.C.
      • Ciaramella P.
      • Guccione J.
      Dynamics of macrophages and polymorphonuclear leukocytes milk-secreted by buffaloes with udders characterized by different clinical status.
      ). Differential somatic cell count (DSCC), a novel parameter that represents the combined proportion of neutrophils and lymphocytes in the total SCC, has recently been introduced into the routine milk recording scheme of dairy buffaloes. The combination of SCC and DSCC has been shown to better define the udder health status of dairy cattle and to support the rational use of antibiotics (
      • Bobbo T.
      • Penasa M.
      • Cassandro M.
      Combining total and differential somatic cell count to better assess the association of udder health status with milk yield, composition and coagulation properties in cattle.
      ). In addition, a novel cathelicidin ELISA has been developed for detecting buffalo mastitis (
      • Puggioni G.M.G.
      • Tedde V.
      • Uzzau S.
      • Guccione J.
      • Ciaramella P.
      • Pollera C.
      • Moroni P.
      • Bronzo V.
      • Addis M.F.
      Evaluation of a bovine cathelicidin ELISA for detecting mastitis in the dairy buffalo: Comparison with milk somatic cell count and bacteriological culture.
      ). However, there is still a need for filling the gap in knowledge, possibly by using information that is currently available and not fully exploited. For instance, great advantage could be taken of the large amount of data provided by automatic milking recording systems, as well as by monthly test-day (TD) milk recording procedures. Such information, easily accessible, could be used to train machine learning (ML) algorithms for the prediction of specific traits of interest, such as phenotypes that are difficult to measure, or the possible occurrence of a disease. Machine learning offers a new approach for data analysis and has already been applied in several areas of dairy research (e.g., feeding, behavior, reproduction, and health) for supporting management of farms (
      • Cockburn M.
      Review: Application and prospective discussion of machine learning for the management of dairy farms.
      ). Early detection and prevention of mastitis would represent a valuable asset from both the economic and health point of view. Previous studies reported in the literature have attempted to predict mastitis in dairy cattle, defined by the presence of high milk SCC (
      • Ebrahimi M.
      • Mohammadi-Dehcheshmeh M.
      • Ebrahimie E.
      • Petrovski K.R.
      Comprehensive analysis of machine learning models for prediction of sub-clinical mastitis: Deep learning and gradient-boosted trees outperform other models.
      ;
      • Anglart D.
      • Hallén-Sandgren C.
      • Emanuelson U.
      • Rönnegård L.
      Comparison of methods for predicting cow composite somatic cell counts.
      ;
      • Bobbo T.
      • Biffani S.
      • Taccioli C.
      • Penasa M.
      • Cassandro M.
      Comparison of machine learning methods to predict udder health status based on somatic cell counts in dairy cows.
      ) or of mastitis-causing pathogens (
      • Sharifi S.
      • Pakdel A.
      • Ebrahimi M.
      • Reecy J.M.
      • Fazeli Farsani S.
      • Ebrahimie E.
      Integration of machine learning and meta-analysis identifies the transcriptomic bio-signature of mastitis disease in cattle.
      ;
      • Hyde R.M.
      • Down P.M.
      • Bradley A.J.
      • Breen J.E.
      • Hudson C.
      • Leach K.A.
      • Green M.J.
      Automated prediction of mastitis infection patterns in dairy herds using machine learning.
      ), by applying different ML algorithms. Nevertheless, in livestock research, where data sets with repeated measures are often used for ML data analysis, there has been little discussion on the issue of data leakage related to data splitting and model overfitting (
      • Satoła A.
      • Bauer E.A.
      Predicting subclinical ketosis in dairy cows using machine learning techniques.
      ;
      • Ji B.
      • Banhazi T.
      • Phillips C.J.C.
      • Wang C.
      • Li B.
      A machine learning framework to predict the next month's daily milk yield, milk composition and milking frequency for cows in a robotic dairy farm.
      ). Data leakage occurs when the training set used to create the model contains information about the target to be predicted.
      Following the approach reported by
      • Bobbo T.
      • Biffani S.
      • Taccioli C.
      • Penasa M.
      • Cassandro M.
      Comparison of machine learning methods to predict udder health status based on somatic cell counts in dairy cows.
      , in the present study we exploited information already collected in the frame of the monthly routine milk recording procedure of Italian Mediterranean buffaloes, as well as climatic data (features at time t − 1), to predict which animals will present high or low milk SCC level at the subsequent TD (outcome at time t). In addition, we compared results obtained using 2 different data splitting approaches to evaluate the possible effects of data leakage.

      MATERIALS AND METHODS

      Ethics Statement

      Animal welfare and use committee approval was not needed for this study because data sets were obtained from pre-existing databases based on routine animal recording procedures.

      Data Collection and Editing

      Buffaloes involved in the current study were reared on commercial farms and were not subjected to any invasive procedure. Test-day data, recorded during monthly routine milk recording procedures, were provided by the Italian Breeders Association (Rome, Italy). Data included information about herd, animals (ID number, date of calving, stage of lactation, and parity order), date of sampling, daily milk production (kg/d), milk composition [fat (%), protein (%), casein (%), lactose (%), pH, and urea (mg/100 mL)], SCC (cells/mL), DSCC (%), BHB (mmol/L), electrical conductivity (EC, mS), and milk coagulation properties [rennet coagulation time (min) and curd firmness 30 min after rennet addition (mm)]. The original data set, which included records collected from August 2019 to February 2021, was edited to select animals with at least 2 TD records within lactation, and with less than 360 DIM. In addition, only consecutive TD records separated by a time interval lower than 6 wk were selected. This approach, also applied by
      • Bobbo T.
      • Biffani S.
      • Taccioli C.
      • Penasa M.
      • Cassandro M.
      Comparison of machine learning methods to predict udder health status based on somatic cell counts in dairy cows.
      , was adopted to reduce data fragmentation over time. Among milk traits, outliers beyond 4 standard deviations, possibly resulting from errors in sampling or recording procedures, were considered as missing values, and only full records were selected for subsequent analysis. Average daily milk production and SCC of contemporary groups—that is, animals sampled in the same herd and day (herd-test-date, HTD)—were also determined (milk_HTD and SCC_HTD, respectively). Finally, the 2 SCC-related traits (SCC and SCC_HTD) were log-transformed to SCS and SCS_HTD to achieve normality, whereas no transformation was required for DSCC. The outcome to be predicted—that is, presence or absence of subclinical mastitis at the subsequent monthly TD—was coded as a binary trait and was based on SCC: animals were classified as healthy (SCC ≤200,000 cells/mL) or mastitic (SCC >200,000 cells/mL). The threshold of 200,000 cells/mL was selected based on the published literature (
      • Moroni P.
      • Sgoifo Rossi C.
      • Pisoni G.
      • Bronzo V.
      • Castiglioni B.
      • Boettcher P.J.
      Relationships between somatic cell count and intramammary infection in buffaloes.
      ;
      • Costa A.
      • Neglia G.
      • Campanile G.
      • De Marchi M.
      Milk somatic cell count and its relationship with milk yield and quality traits in Italian water buffaloes.
      ,
      • Costa A.
      • De Marchi M.
      • Neglia G.
      • Campanile G.
      • Penasa M.
      Milk somatic cell count-derived traits as new indicators to monitor udder health in dairy buffaloes.
      ). The prevalence of subclinical mastitis (SCC >200,000 cells/mL) was 40.3%. After editing, the data set included 3,891 records of 1,038 buffaloes in 6 herds. Each record included information of 2 subsequent monthly TD: animal and milk data collected at the previous TD and outcome (healthy vs. mastitic) at the subsequent TD.
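The record construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text states only that SCC was log-transformed, so the conventional formula SCS = 3 + log2(SCC/100,000) is an assumption, and the function and key names are hypothetical. The 200,000 cells/mL threshold and the 6-wk maximum interval between consecutive TD come from the text.

```python
import math

SCC_THRESHOLD = 200_000  # cells/mL, threshold stated in the text


def somatic_cell_score(scc_cells_per_ml: float) -> float:
    """Assumed conventional transform: SCS = 3 + log2(SCC / 100,000)."""
    return 3.0 + math.log2(scc_cells_per_ml / 100_000)


def udder_status(scc_cells_per_ml: float) -> str:
    """Binary outcome: healthy (SCC <= 200,000) or mastitic (SCC > 200,000)."""
    return "mastitic" if scc_cells_per_ml > SCC_THRESHOLD else "healthy"


def pair_test_days(test_days, max_gap_days=42):
    """Pair features at time t-1 with the outcome at time t within one lactation.

    test_days: list of dicts with keys 'day' (days) and 'scc' (cells/mL),
    sorted by day. Only consecutive test days less than 6 wk apart are kept.
    """
    records = []
    for prev, curr in zip(test_days, test_days[1:]):
        if curr["day"] - prev["day"] < max_gap_days:
            records.append(
                {
                    "scs_prev": somatic_cell_score(prev["scc"]),
                    "outcome": udder_status(curr["scc"]),
                }
            )
    return records
```

In a full record, the remaining features measured at the previous TD would be added alongside `scs_prev`.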
      In addition, climatic information for the sampling location and date was retrieved from the NASA Prediction of Worldwide Energy Resources (POWER) Data Access Viewer (
      • Sparks A.H.
      nasapower: A NASA POWER global meteorology, surface solar energy and climatology data client for R.
      ), which allowed access to daily averaged data by providing latitude and longitude values of the 6 herds and the desired date range. In particular, parameters of interest were as follows: All Sky Surface Shortwave Downward Irradiance (MJ/m² per day), All Sky Surface UV Index (dimensionless), Temperature at 2 Meters (°C), Relative Humidity at 2 Meters (%), Precipitation Corrected (mm/day), Surface Pressure (kPa), Wind Speed at 2 Meters (m/s), and Wind Direction at 2 Meters (Degrees). For a detailed description of climatic variables see Supplemental Table S1 (
      • Bobbo T.
      • Matera R.
      • Pedota G.
      • Manunza A.
      • Cotticelli A.
      • Neglia G.
      • Biffani S.
      Supplementary_Information_file_JDS.2022–22292. Mendeley Data, V1.
      ).
      Finally, a total of 27 features were considered: parity (from 1 to ≥6), stage of lactation (DIM: 10 classes, 9 of 30 d each and the last one including DIM >300 d), year and month of calving (18 levels), year and month of sampling (10 levels), milk production, fat, protein, casein, lactose, pH, urea, SCS, DSCC, BHB, EC, milk_HTD, SCS_HTD, the 2 milk coagulation properties, and the 8 climatic parameters.

      Data Processing, Recursive Feature Elimination, and Model Building

      Four different ML methods were adopted to develop subclinical mastitis prediction models: Generalized Linear Models (GLM;
      • Nelder J.A.
      • Wedderburn R.W.M.
      Generalized linear models.
      ), Support Vector Machine (SVM;
      • Cortes C.
      • Vapnik V.
      Support-vector networks.
      ), Random Forest (RF), and Neural Network (NN;
      • McCulloch W.S.
      • Pitts W.
      A logical calculus of the ideas immanent in nervous activity.
      ). Two approaches were used for splitting the data, to evaluate whether results could be biased by possible overfitting due to data leakage in time series data sets:
      • (1)
        Splitting by record. The data set was randomly split into 2 subsets: 80% of the data was used to train and evaluate the models, and the remaining 20% was excluded from model building and held out as a test set. Random sampling was performed within each outcome class, thus preserving the original outcome rate in training, validation, and test sets. When splitting by record, the same animals, with different TD records, can appear in all of the subsets created.
      • (2)
        Splitting by animal ID. The data set was randomly split so that 80% of the animals (and all their TD records) were included in the training subset used for model building and evaluation, and the remaining 20% were included in the test set. The original class distribution of the outcome was preserved. When splitting by animal ID, buffaloes in the test set were not included in the training subset.
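The difference between the 2 splitting approaches can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' caret workflow: stratification by outcome class is omitted for brevity, and the function names and the `animal_id` key are hypothetical.

```python
import random


def split_by_record(records, test_fraction=0.2, seed=0):
    """Random 80/20 split of individual records (an animal may land in both sets)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_animal(records, test_fraction=0.2, seed=0):
    """Random 80/20 split of animals; all records of an animal stay together."""
    rng = random.Random(seed)
    animal_ids = sorted({r["animal_id"] for r in records})
    rng.shuffle(animal_ids)
    n_test = round(len(animal_ids) * test_fraction)
    test_ids = set(animal_ids[:n_test])
    train = [r for r in records if r["animal_id"] not in test_ids]
    test = [r for r in records if r["animal_id"] in test_ids]
    return train, test


def shared_animals(train, test):
    """Animals present in both sets, i.e., a potential source of leakage."""
    return {r["animal_id"] for r in train} & {r["animal_id"] for r in test}
```

Calling `shared_animals` on the output of `split_by_animal` always returns an empty set, whereas with `split_by_record` animals with several TD records will typically appear on both sides of the split.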
      Recursive feature elimination using a 10-fold cross-validation repeated 100 times with the RF method (
      • Svetnik V.
      • Liaw A.
      • Tong C.
      • Wang T.
      Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules.
      ) was applied to reduce the number of features where possible, automatically selecting the most predictive ones to identify the most parsimonious model with the best performance (that is, the highest prediction accuracy). Then, a stratified 10-fold cross-validation repeated 100 times was employed to train and evaluate the models. In particular, the training data set was divided into 10 subsets of equal size. When splitting by record, the partitions of the 10-fold cross-validation were randomly selected; when splitting by animal ID, the data were assigned to the 10 subsets by group (ID), so that all records of an animal fell in the same subset. At each of the 10 iterations, prediction models were trained on 9 subsets and evaluated on the remaining one, changing the validation subset every time. This entire process was repeated 100 times, for a total of 1,000 iterations. The 100 mean accuracy and kappa values, one per repetition of the 10-fold cross-validation, were then averaged to obtain the final metrics of each method reported in the tables. Data standardization was performed within cross-validation. Tuning details of each model are reported in the supplemental information file (
      • Bobbo T.
      • Matera R.
      • Pedota G.
      • Manunza A.
      • Cotticelli A.
      • Neglia G.
      • Biffani S.
      Supplementary_Information_file_JDS.2022–22292. Mendeley Data, V1.
      ). Data analysis was performed using Caret v. 6.0-86 (
      • Kuhn M.
      caret: Classification and regression training. R package version 6.0-86.
      ) and Tidyverse v. 1.3.1 (
      • Wickham H.
      • Averick M.
      • Bryan J.
      • Chang W.
      • McGowan L.D.
      • François R.
      • Grolemund G.
      • Hayes A.
      • Henry L.
      • Hester J.
      • Kuhn M.
      • Pedersen T.L.
      • Miller E.
      • Bache S.M.
      • Müller K.
      • Ooms J.
      • Robinson D.
      • Seidel D.P.
      • Spinu V.
      • Takahashi K.
      • Vaughan D.
      • Wilke C.
      • Woo K.
      • Yutani H.
      Welcome to the Tidyverse.
      ) packages of R software v. 4.1.2 (
      • R Core Team
      R: A language and environment for statistical computing.
      ).
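The resampling scheme described above (grouped k-fold cross-validation, repeated, with standardization fitted inside each fold) can be sketched as follows. This is an illustration under stated assumptions rather than the caret implementation used in the study: folds are balanced by number of animals and are not stratified by outcome, and `evaluate` is a hypothetical caller-supplied function that fits a model on the training folds and returns its accuracy on the validation fold.

```python
import random
import statistics


def grouped_k_folds(animal_ids, k=10, seed=0):
    """Assign every animal to one of k folds so its records never straddle folds."""
    rng = random.Random(seed)
    unique_ids = sorted(set(animal_ids))
    rng.shuffle(unique_ids)
    return {animal: i % k for i, animal in enumerate(unique_ids)}


def standardize(train_values, other_values):
    """Scale both lists using the mean and SD of the training values only."""
    mean = statistics.fmean(train_values)
    sd = statistics.stdev(train_values)
    return (
        [(v - mean) / sd for v in train_values],
        [(v - mean) / sd for v in other_values],
    )


def repeated_grouped_cv(records, evaluate, k=10, repeats=100):
    """Average a metric over `repeats` runs of grouped k-fold cross-validation."""
    repeat_means = []
    for rep in range(repeats):
        fold_of = grouped_k_folds([r["animal_id"] for r in records], k, seed=rep)
        scores = []
        for fold in range(k):
            train = [r for r in records if fold_of[r["animal_id"]] != fold]
            valid = [r for r in records if fold_of[r["animal_id"]] == fold]
            scores.append(evaluate(train, valid))
        # One mean per repetition, as in the text
        repeat_means.append(statistics.fmean(scores))
    return statistics.fmean(repeat_means)
```

Inside `evaluate`, each numeric feature would be passed through `standardize` so that the scaling parameters come from the training folds only, which is what keeps the standardization step from leaking validation information.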

      Comparison of Methods Predicting Performance on Validation and Test Sets

      The predictive performance of the methods on the validation set was first compared by means of accuracy and Cohen's kappa values. Feature importance (i.e., the estimated contribution of each variable to the best model) was then computed, and importance values were scaled from 0 (least important) to 100 (most important). The predictive ability of all models on the test set was then assessed, and method comparisons were based on different metrics: sensitivity, specificity, accuracy, positive predictive value, negative predictive value, Cohen's kappa value, and F1 score. False positive, false negative, and total error rates of each method were also calculated. Receiver operating characteristic curve analysis was performed using the pROC package v. 1.17.0.1 (
      • Robin X.
      • Turck N.
      • Hainard A.
      • Tiberti N.
      • Lisacek F.
      • Sanchez J.-C.
      • Müller M.
      pROC: An open-source package for R and S+ to analyze and compare ROC curves.
      ), and area under the receiver operating characteristic curve (AUC) was measured. Finally, the Matthews correlation coefficient (MCC) was calculated according to the following formula:
      $$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}},$$


      where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
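As an illustration of how the metrics listed above follow from the 4 cells of a confusion matrix, the sketch below computes them with their standard definitions. The function name and the example counts in the usage note are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for agreement expected by chance
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "kappa": kappa,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": mcc,
    }
```

For example, `classification_metrics(tp=40, tn=45, fp=5, fn=10)` gives an accuracy of 0.85, a kappa of 0.70, and an MCC of about 0.70.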

      RESULTS

      Data Processing, Recursive Feature Elimination, and Model Building

      Four ML methods (GLM, SVM, RF, and NN) were applied to develop subclinical mastitis prediction models, using animal and milk information collected during monthly routine milk recording procedures and climatic data. Training and test sets were obtained using 2 approaches: dividing the original data set by record, so that the same animals could appear in both sets, or by animal ID, so that animals in the test set were entirely unseen during model building.
      Before model building and training, recursive feature elimination was applied to reduce the number of features where possible and remove uninformative ones. With both splitting approaches (by record and by animal ID), all 27 features were retained in the most parsimonious yet accurate model (Figures 1a and 1b). Nevertheless, accuracy approached a plateau with the 7 most important features (SCS, SCS_HTD, milk_HTD, EC, milk, parity, and DSCC).
      Figure 1. Results of the recursive feature elimination, a function that implements backward feature selection, incorporating 27 to 1 features in the model, using the training set obtained by splitting the original data set (a) by record and (b) by animal ID. The number of features is reported on the x-axis, and the model accuracy from the 10-fold cross-validation repeated 100 times on the y-axis.

      Comparison of Methods Predicting Performance on Validation Set

      Evaluation and comparison of the predicting performance of the 4 ML algorithms on the validation set was based on accuracy and kappa value. Splitting the data set by record, accuracy ranged between 75.4% (NN) and 76.1% (SVM), and kappa between 0.476 (NN) and 0.489 (SVM) (Figure 2a). Splitting the data by animal ID, slightly lower values were reported, with accuracy ranging from 74.8% (RF) to 75.3% (SVM), and kappa from 0.446 (RF) to 0.457 (GLM; Figure 2b). In both cases, SVM was the best method to predict presence or absence of subclinical mastitis in the validation set, and therefore it was used for estimating the contribution of each variable to the best model. Results of the feature importance using SVM on the validation set suggested that, independently from the data set splitting approach, SCS at the previous TD was the most important feature, followed by SCS_HTD and DSCC (Figures 3a and 3b). Two other important variables were milk_HTD and EC. Among climatic data, the most informative were temperature and relative humidity.
      Figure 2. Metrics (accuracy and Cohen's kappa value) for the comparison of methods predicting performance on the validation set, obtained by splitting the original data set (a) by record and (b) by animal ID. Prediction models were developed using 4 machine learning methods: Generalized Linear Model (glmnet), Support Vector Machines (svmRadial), Random Forest (rf), and Neural Network (nnet). Error bars represent 95% CI.
      Figure 3. Plot of the feature importance, scaled from 0 (least important) to 100 (most important), showing the ranking for the prediction of presence or absence of subclinical mastitis in the validation set, obtained by splitting the original data set (a) by record and (b) by animal ID. Evaluated features, using Support Vector Machine as the predictive method, are as follows: individual SCS and SCS of contemporary group (scs and scs_htd), differential SCC (DSCC), electrical conductivity (EC), individual milk production and milk production of contemporary group (milk and milk_HTD), parity, stage of lactation (DIM), milk composition traits (urea, pH, lactose, fat, casein, protein), BHB, year and month of sampling (yms), year and month of calving (ymc), rennet coagulation time (r), curd firmness 30 min after rennet addition (a30), and climatic data (temperature, relative humidity, UV index, irradiance, pressure, precipitation, wind speed, and wind direction).

      Comparison of Methods Predicting Performance on Test Set

      Comparison of the prediction performance of the 4 ML algorithms on the test set, obtained by splitting the original data set with 2 different approaches, was based on several metrics, summarized in Table 1. Splitting the data set by record, accuracy of prediction ranged from 73.9% (SVM) to 75.4% (NN), whereas kappa values ranged between 0.447 (SVM) and 0.480 (NN). The NN method also showed the highest F1 score (0.676) and MCC (0.482), followed by GLM (0.670 and 0.476, respectively). Similar findings but with slightly greater scores were obtained by splitting the data set by animal ID. Indeed, NN proved to be the best-performing method, with prediction accuracy of 76.2%, kappa value of 0.518, F1 score of 0.726, and MCC of 0.522. The SVM method, which was the most accurate in predicting subclinical mastitis on the validation set, was instead the worst-performing on the test set.
      Table 1. Metrics for the comparison of methods predicting performance on test set, obtained by splitting the original data set by record and by animal ID: accuracy, 95% CI, sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), Cohen's kappa value, F1 score, and Matthews correlation coefficient (MCC)
      Prediction models were developed using 4 machine learning methods: Generalized Linear Model (GLM), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN).

      Method   Accuracy   95% CI        Se      Sp      PPV     NPV     Kappa   F1 score   MCC
      Splitting by record
       GLM     0.752      0.720–0.782   0.624   0.838   0.723   0.767   0.473   0.670      0.476
       SVM     0.739      0.707–0.770   0.618   0.821   0.700   0.760   0.447   0.656      0.450
       RF      0.740      0.708–0.771   0.627   0.817   0.699   0.764   0.452   0.661      0.453
       NN      0.754      0.723–0.784   0.634   0.836   0.724   0.771   0.480   0.676      0.482
      Splitting by animal ID
       GLM     0.760      0.729–0.789   0.661   0.845   0.787   0.742   0.512   0.719      0.518
       SVM     0.749      0.717–0.778   0.651   0.834   0.772   0.734   0.489   0.706      0.495
       RF      0.759      0.728–0.788   0.645   0.857   0.796   0.736   0.509   0.713      0.517
       NN      0.762      0.731–0.791   0.677   0.836   0.781   0.749   0.518   0.726      0.522

      Considering all 4 methods, the greatest AUC values were observed by splitting the data set by animal ID rather than by record: 84.1% versus 81.2% for GLM, 83.3% versus 80.2% for SVM, 84.1% versus 79.0% for RF, and 84.0% versus 81.4% for NN (Figures 4a and 4b).
      Figure 4. Receiver operating characteristic (ROC) curves of 4 machine learning methods [Generalized Linear Model (GLM), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN)] run for predicting the presence or absence of subclinical mastitis on the test set, obtained splitting the original data set (a) by record and (b) by animal ID. In each plot, area under the curve (AUC) and 95% CI are reported.
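The AUC values reported above can be read as the probability that a randomly chosen mastitic record receives a higher predicted score than a randomly chosen healthy one. A minimal sketch of this pairwise definition is given below; it is not the pROC implementation used in the study, and the function name and the 0/1 label coding are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for mastitic, 0 for healthy (assumed coding).
    scores: predicted probability or score of being mastitic.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, 3 of the 4 positive-negative pairs are ranked correctly, so the function returns 0.75. The pairwise loop is quadratic in the number of records, which is acceptable for a test set of a few hundred records.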

      DISCUSSION

      In the current study, we predicted whether Italian Mediterranean buffaloes will present high or low SCC in the milk collected at the subsequent TD, applying ML analyses on easily accessible and already available information (i.e., milk data collected the previous month during monthly routine milk recording, as well as climatic data related to the sampling location). Although Mediterranean buffaloes seem to be more robust and resistant to diseases than dairy cows, their health and production are also affected by mastitis (
      • Puggioni G.M.G.
      • Tedde V.
      • Uzzau S.
      • Guccione J.
      • Ciaramella P.
      • Pollera C.
      • Moroni P.
      • Bronzo V.
      • Addis M.F.
      Evaluation of a bovine cathelicidin ELISA for detecting mastitis in the dairy buffalo: Comparison with milk somatic cell count and bacteriological culture.
      ). Therefore, strategies for early detection and prevention of subclinical mastitis are of paramount importance for both economic and health aspects. From this perspective, our study highlighted the pivotal role of ML analysis for exploiting the large amounts of data that are available nowadays, with the aim of improving disease surveillance and, consequently, farm management strategies.
      Subclinical mastitis prediction models were developed using 4 different ML methods: one linear (GLM), one with a distance-based approach (SVM), one based on decision trees (RF), and one loosely inspired by networks of biological neurons and designed for pattern recognition (NN). We decided to compare results obtained using 2 different data set splitting approaches. Indeed, training and test sets were created by dividing the original data set by records (i.e., the same animals, but with different TD records, can be found in both sets of data) or by animal ID (i.e., animals in the test set were not included in model building and were totally unknown).
      A common approach during model building is to randomly divide the data set into multiple subsets, so that training and fine-tuning of the model are performed using a k-fold cross-validation as resampling procedure (
      • Ebrahimi M.
      • Mohammadi-Dehcheshmeh M.
      • Ebrahimie E.
      • Petrovski K.R.
      Comprehensive analysis of machine learning models for prediction of sub-clinical mastitis: Deep learning and gradient-boosted trees outperform other models.
      ;
      • Anglart D.
      • Hallén-Sandgren C.
      • Emanuelson U.
      • Rönnegård L.
      Comparison of methods for predicting cow composite somatic cell counts.
      ;
      • Bobbo T.
      • Biffani S.
      • Taccioli C.
      • Penasa M.
      • Cassandro M.
      Comparison of machine learning methods to predict udder health status based on somatic cell counts in dairy cows.
      ). In addition, data sets can also be split to use part of the data for training with cross-validation and to hold out a portion of the data as external test set (e.g., 80/20%, 90/10%, 50/50%). The test set is important in order to obtain non-inflated estimates due to possible overfitting; indeed, model predictive performance on test sets is generally lower. In such cases, data are typically divided by randomly selecting a certain proportion of records (
      • Anglart D.
      • Hallén-Sandgren C.
      • Emanuelson U.
      • Rönnegård L.
      Comparison of methods for predicting cow composite somatic cell counts.
      ;
      • Bobbo T.
      • Biffani S.
      • Taccioli C.
      • Penasa M.
      • Cassandro M.
      Comparison of machine learning methods to predict udder health status based on somatic cell counts in dairy cows.
      ) or of farms (
      • Hyde R.M.
      • Down P.M.
      • Bradley A.J.
      • Breen J.E.
      • Hudson C.
      • Leach K.A.
      • Green M.J.
      Automated prediction of mastitis infection patterns in dairy herds using machine learning.
      ), or numbers of milkings (
      • Ankinakatte S.
      • Norberg E.
      • Løvendahl P.
      • Edwards D.
      • Højsgaard S.
      Predicting mastitis in dairy cows using neural networks and generalized additive models: A comparison.
      ). Nevertheless, records in time series data sets or in data sets with repeated measures of the same individual (e.g., animals with several TD) might be highly correlated; therefore special attention should be paid to choosing the most appropriate data splitting approach. In such cases, data should be split based on ID rather than by records, to avoid possible overfitting due to data leakage. Indeed, the aim of predictive modeling is to develop a model that makes accurate predictions on novel unseen data. When data sets with repeated measures are split by record, data leakage might occur; that is, the data used for model training might contain information about the outcome to be predicted. In our study, when splitting by record, we observed slightly better predictive performance on the validation set and lower performance on the test set. This may be the result of overfitting, even though, to minimize data leakage, recursive feature elimination and data standardization were performed within cross-validation. When the data were split by animal ID (both in the creation of the training and test sets and during cross-validation), an improvement in models' predictive performance on the test set was observed, suggesting this as the most appropriate data splitting approach according to our data structure.
      Comparisons of the predictive performance of the 4 ML algorithms on both validation and test sets were based on several metrics, including F1 score, AUC, and MCC, which are independent from the outcome rate. Feature importance, estimated with the most accurate method on the validation set (SVM), revealed that, independently from how the data set was split, SCS recorded at the previous TD was, as expected, the most important feature for predicting the presence or absence of subclinical mastitis at the subsequent TD, followed by the other 2 SCC-related traits (SCS_HTD and DSCC). In addition to individual SCS, average SCS of contemporary groups was included to represent herd hygiene conditions. Our results confirmed the important information provided by DSCC, a novel indicator of udder health status that can be used in combination with SCC to better screen for udder health status, as previously observed for dairy cattle (
      • Bobbo T.
      • Penasa M.
      • Cassandro M.
      Combining total and differential somatic cell count to better assess the association of udder health status with milk yield, composition and coagulation properties in cattle.
      ). Indeed, DSCC and SCS are different traits, because their phenotypic and genetic correlations differ from unity (i.e., 0.66), as reported by (
      • Bobbo T.
      • Penasa M.
      • Cassandro M.
      Short communication: Genetic aspects of milk differential somatic cell count in Holstein cows: A preliminary analysis.
      ). Other important variables were milk_HTD, a proxy for herd management level, individual milk production, and EC. The negative correlation between buffaloes' milk production and SCS has already been reported in the literature (
      • Tripaldi C.
      • Palocci G.
      • Miarelli M.
      • Catta M.
      • Orlandini S.
      • Amatiste S.
      • Di Bernardini R.
      • Catillo G.
      Effects of mastitis on buffalo milk quality.
      ;
      • Costa A.
      • Neglia G.
      • Campanile G.
      • De Marchi M.
      Milk somatic cell count and its relationship with milk yield and quality traits in Italian water buffaloes.
      ). In addition, previous ML studies on dairy cows (
      • Ebrahimie E.
      • Ebrahimi F.
      • Ebrahimi M.
      • Tomlinson S.
      • Petrovski K.R.
      A large-scale study of indicators of sub-clinical mastitis in dairy cattle by attribute weighting analysis of milk composition features: Highlighting the predictive power of lactose and electrical conductivity.
      ;
      • Ebrahimi M.
      • Mohammadi-Dehcheshmeh M.
      • Ebrahimie E.
      • Petrovski K.R.
      Comprehensive analysis of machine learning models for prediction of sub-clinical mastitis: Deep learning and gradient-boosted trees outperform other models.
      ) have found EC to be one of the most important features in the prediction of subclinical mastitis based on automatic milking parameters. Indeed, udder infection alters the volume of milk produced, as well as its ionic composition due to leakage of components through the blood-milk barrier. Parity order and stage of lactation also showed relevant contributions to the best model; indeed, they are well-known factors affecting SCC variation (
      • Cerón-Muñoz M.
      • Tonhati H.
      • Duarte J.
      • Oliveira J.
      • Muñoz-Berrocal M.
      • Jurado-Gámez H.
      Factors affecting somatic cell counts and their relations with milk and milk constituent yield in buffaloes.
      ). Among climatic data, the most informative were temperature and relative humidity. In livestock, heat stress is known to negatively affect both milk production and animal health (
      • Bernabucci U.
      • Lacetera N.
      • Baumgard L.H.
      • Rhoads R.P.
      • Ronchi B.
      • Nardone A.
      Metabolic and hormonal acclimation to heat stress in domesticated ruminants.
      ;
      • Bernabucci U.
      • Biffani S.
      • Buggiotti L.
      • Vitali A.
      • Lacetera N.
      • Nardone A.
      The effects of heat stress in Italian Holstein dairy cattle.
      ). The temperature-humidity index, which represents the combined effect of air temperature and humidity, is a parameter commonly used to evaluate the degree and the consequences of heat stress (
      • Bernabucci U.
      • Biffani S.
      • Buggiotti L.
      • Vitali A.
      • Lacetera N.
      • Nardone A.
      The effects of heat stress in Italian Holstein dairy cattle.
      ;
      • Matera R.
      • Cotticelli A.
      • Gómez Carpio M.
      • Biffani S.
      • Iannacone F.
      • Salzano A.
      • Neglia G.
      Relationship among production traits, somatic cell score and temperature–humidity index in the Italian Mediterranean buffalo.
      ). A recent study conducted on Italian Mediterranean buffaloes (
      • Matera R.
      • Cotticelli A.
      • Gómez Carpio M.
      • Biffani S.
      • Iannacone F.
      • Salzano A.
      • Neglia G.
      Relationship among production traits, somatic cell score and temperature–humidity index in the Italian Mediterranean buffalo.
      ) has confirmed the negative effect of temperature-humidity index variation on udder health, defined by SCC. In the present study, traits related to solar radiation (UV_index and irradiance) also showed moderate relevance. Climate variables such as temperature, relative humidity, and solar radiation have previously been found to slightly affect milk production and composition (
      • Sharma A.K.
      • Rodriguez L.A.
      • Wilcox C.J.
      • Collier R.J.
      • Bachman K.C.
      • Martin F.G.
      Interactions of climatic factors affecting milk yield and composition.
      ). In addition, the inclusion of meteorological parameters (e.g., precipitation, sunshine hours, and soil temperature) in milk production forecast models resulted in a slight improvement in the prediction accuracy, with sunshine hours having the largest effect (
      • Zhang F.
      • Upton J.
      • Shalloo L.
      • Shine P.
      • Murphy M.D.
      Effect of introducing weather parameters on the accuracy of milk production forecast models.
      ).
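      Two of the metrics used above to compare the algorithms, the F1 score and the Matthews correlation coefficient (MCC), can be computed directly from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions and is only illustrative: it is written in plain Python rather than with the R packages cited by the authors, the function names are hypothetical, and the labels are made-up toy values (1 = high SCC at the subsequent TD, 0 = low SCC), not results from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up toy labels for 8 test-day records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)

print(round(f1, 3))         # 0.667
print(round(mcc_value, 3))  # 0.467
```

      Unlike the F1 score, the MCC also takes the true negatives into account, which is one reason it is often reported alongside F1 for binary outcomes.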

      CONCLUSIONS

      The findings of our study confirmed ML methods to be a promising tool to improve prevention and surveillance of subclinical mastitis, exploiting the large amount of data currently available. Given consumers' growing concerns about food safety, quality, and antibiotic usage, further studies are needed to advance mastitis detection, management, and selection. Indeed, given the high economic value of Protected Designation of Origin (PDO) Mozzarella di Bufala cheese, special attention should be paid to the health and well-being of Italian Mediterranean buffaloes and their milk quality. We are confident that our research will serve as a basis for practical implementation of these methodologies in dairy management systems, as well as in the application of complex phenotypes in genetic and genomic evaluations.

      ACKNOWLEDGMENTS

      This research was funded by the Italian Ministry of Agriculture (MIPAAF – DISR 07) – Programma di Sviluppo Rurale Nazionale 2014/2020 (Rome, Italy). Caratterizzazione delle risorse genetiche animali di interesse zootecnico e salvaguardia della biodiversità. Sottomisura: 10.2 – Sostegno per la conservazione, l'uso e lo sviluppo sostenibili delle risorse genetiche in agricoltura. Project: “Bufala Mediterranea Italiana – Tecnologie innovative per il miglioramento Genetico – BIG” Prot. N. 0215513 11/05/2021. CUP ANASB: J29J21003720005; CUP UNINA: J69J21003020005. Climatic data were obtained from the NASA Langley Research Center POWER Project funded through the NASA Earth Science Directorate Applied Science Program. The authors thank the Associazione Nazionale Allevatori Specie Bufalina (ANASB; Caserta, Italy) for providing the data. The authors have not stated any conflicts of interest.

      REFERENCES

        • Alterisio M.C.
        • Ciaramella P.
        • Guccione J.
        Dynamics of macrophages and polymorphonuclear leukocytes milk-secreted by buffaloes with udders characterized by different clinical status.
        Vet. Sci. 2021; 8 (34679034): 204
        • Anglart D.
        • Hallén-Sandgren C.
        • Emanuelson U.
        • Rönnegård L.
        Comparison of methods for predicting cow composite somatic cell counts.
        J. Dairy Sci. 2020; 103 (32564958): 8433-8442
        • Ankinakatte S.
        • Norberg E.
        • Løvendahl P.
        • Edwards D.
        • Højsgaard S.
        Predicting mastitis in dairy cows using neural networks and generalized additive models: A comparison.
        Comput. Electron. Agric. 2013; 99: 1-6
        • Bernabucci U.
        • Biffani S.
        • Buggiotti L.
        • Vitali A.
        • Lacetera N.
        • Nardone A.
        The effects of heat stress in Italian Holstein dairy cattle.
        J. Dairy Sci. 2014; 97 (24210494): 471-486
        • Bernabucci U.
        • Lacetera N.
        • Baumgard L.H.
        • Rhoads R.P.
        • Ronchi B.
        • Nardone A.
        Metabolic and hormonal acclimation to heat stress in domesticated ruminants.
        Animal. 2010; 4 (22444615): 1167-1183
        • Bobbo T.
        • Biffani S.
        • Taccioli C.
        • Penasa M.
        • Cassandro M.
        Comparison of machine learning methods to predict udder health status based on somatic cell counts in dairy cows.
        Sci. Rep. 2021; 11 (34211046): 13642
        • Bobbo T.
        • Matera R.
        • Pedota G.
        • Manunza A.
        • Cotticelli A.
        • Neglia G.
        • Biffani S.
        Supplementary_Information_file_JDS.2022–22292. Mendeley Data, V1.
        • Bobbo T.
        • Penasa M.
        • Cassandro M.
        Short communication: Genetic aspects of milk differential somatic cell count in Holstein cows: A preliminary analysis.
        J. Dairy Sci. 2019; 102 (30827547): 4275-4279
        • Bobbo T.
        • Penasa M.
        • Cassandro M.
        Combining total and differential somatic cell count to better assess the association of udder health status with milk yield, composition and coagulation properties in cattle.
        Ital. J. Anim. Sci. 2020; 19: 697-703
        • Bobbo T.
        • Penasa M.
        • Finocchiaro R.
        • Visentin G.
        • Cassandro M.
        Alternative somatic cell count traits exploitable in genetic selection for mastitis resistance in Italian Holsteins.
        J. Dairy Sci. 2018; 101 (30146278): 10001-10010
        • Breiman L.
        Random forests.
        Mach. Learn. 2001; 45: 5-32
        • Cerón-Muñoz M.
        • Tonhati H.
        • Duarte J.
        • Oliveira J.
        • Muñoz-Berrocal M.
        • Jurado-Gámez H.
        Factors affecting somatic cell counts and their relations with milk and milk constituent yield in buffaloes.
        J. Dairy Sci. 2002; 85 (12487456): 2885-2889
        • Cockburn M.
        Review: Application and prospective discussion of machine learning for the management of dairy farms.
        Animals (Basel). 2020; 10 (32962078): 1690
        • Cortes C.
        • Vapnik V.
        Support-vector networks.
        Mach. Learn. 1995; 20: 273-297
        • Costa A.
        • De Marchi M.
        • Neglia G.
        • Campanile G.
        • Penasa M.
        Milk somatic cell count-derived traits as new indicators to monitor udder health in dairy buffaloes.
        Ital. J. Anim. Sci. 2021; 20: 548-558
        • Costa A.
        • Neglia G.
        • Campanile G.
        • De Marchi M.
        Milk somatic cell count and its relationship with milk yield and quality traits in Italian water buffaloes.
        J. Dairy Sci. 2020; 103 (32229124): 5485-5494
        • Ebrahimi M.
        • Mohammadi-Dehcheshmeh M.
        • Ebrahimie E.
        • Petrovski K.R.
        Comprehensive analysis of machine learning models for prediction of sub-clinical mastitis: Deep learning and gradient-boosted trees outperform other models.
        Comput. Biol. Med. 2019; 114 (31605926): 103456
        • Ebrahimie E.
        • Ebrahimi F.
        • Ebrahimi M.
        • Tomlinson S.
        • Petrovski K.R.
        A large-scale study of indicators of sub-clinical mastitis in dairy cattle by attribute weighting analysis of milk composition features: Highlighting the predictive power of lactose and electrical conductivity.
        J. Dairy Res. 2018; 85 (29785910): 193-200
        • Fagiolo A.
        • Lai O.
        Mastitis in buffalo.
        Ital. J. Anim. Sci. 2007; 6: 200-206
        • Halasa T.
        • Huijps K.
        • Østerås O.
        • Hogeveen H.
        Economic effects of bovine mastitis and mastitis management: A review.
        Vet. Q. 2007; 29 (17471788): 18-31
        • Hyde R.M.
        • Down P.M.
        • Bradley A.J.
        • Breen J.E.
        • Hudson C.
        • Leach K.A.
        • Green M.J.
        Automated prediction of mastitis infection patterns in dairy herds using machine learning.
        Sci. Rep. 2020; 10 (32152401): 4289
        • Ji B.
        • Banhazi T.
        • Phillips C.J.C.
        • Wang C.
        • Li B.
        A machine learning framework to predict the next month's daily milk yield, milk composition and milking frequency for cows in a robotic dairy farm.
        Biosyst. Eng. 2022; 216: 186-197
        • Kuhn M.
        caret: Classification and regression training. R package version 6.0-86.
        • Matera R.
        • Cotticelli A.
        • Gómez Carpio M.
        • Biffani S.
        • Iannacone F.
        • Salzano A.
        • Neglia G.
        Relationship among production traits, somatic cell score and temperature–humidity index in the Italian Mediterranean buffalo.
        Ital. J. Anim. Sci. 2022; 21: 551-561
        • McCulloch W.S.
        • Pitts W.
        A logical calculus of the ideas immanent in nervous activity.
        Bull. Math. Biophys. 1943; 5: 115-133
        • Moroni P.
        • Sgoifo Rossi C.
        • Pisoni G.
        • Bronzo V.
        • Castiglioni B.
        • Boettcher P.J.
        Relationships between somatic cell count and intramammary infection in buffaloes.
        J. Dairy Sci. 2006; 89 (16507694): 998-1003
        • Nelder J.A.
        • Wedderburn R.W.M.
        Generalized linear models.
        J. R. Stat. Soc. [Ser A]. 1972; 135: 370-384
        • Puggioni G.M.G.
        • Tedde V.
        • Uzzau S.
        • Guccione J.
        • Ciaramella P.
        • Pollera C.
        • Moroni P.
        • Bronzo V.
        • Addis M.F.
        Evaluation of a bovine cathelicidin ELISA for detecting mastitis in the dairy buffalo: Comparison with milk somatic cell count and bacteriological culture.
        Res. Vet. Sci. 2020; 128 (31783263): 129-134
        • R Core Team
        R: A language and environment for statistical computing.
        R Foundation for Statistical Computing, 2021
        • Robin X.
        • Turck N.
        • Hainard A.
        • Tiberti N.
        • Lisacek F.
        • Sanchez J.-C.
        • Müller M.
        pROC: An open-source package for R and S+ to analyze and compare ROC curves.
        BMC Bioinformatics. 2011; 12 (21414208): 77
        • Satoła A.
        • Bauer E.A.
        Predicting subclinical ketosis in dairy cows using machine learning techniques.
        Animals (Basel). 2021; 11 (34359259): 2131
        • Sharifi S.
        • Pakdel A.
        • Ebrahimi M.
        • Reecy J.M.
        • Fazeli Farsani S.
        • Ebrahimie E.
        Integration of machine learning and meta-analysis identifies the transcriptomic bio-signature of mastitis disease in cattle.
        PLoS One. 2018; 13 (29470489): e0191227
        • Sharma A.K.
        • Rodriguez L.A.
        • Wilcox C.J.
        • Collier R.J.
        • Bachman K.C.
        • Martin F.G.
        Interactions of climatic factors affecting milk yield and composition.
        J. Dairy Sci. 1988; 71 (3372822): 819-825
        • Sparks A.H.
        nasapower: A NASA POWER global meteorology, surface solar energy and climatology data client for R.
        J. Open Source Softw. 2018; 3: 1035
        • Svetnik V.
        • Liaw A.
        • Tong C.
        • Wang T.
        Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules.
        in: Roli F. Kittler J. Windeatt T. Multiple Classifier Systems. Springer, 2004: 334-343
        • Tripaldi C.
        • Palocci G.
        • Miarelli M.
        • Catta M.
        • Orlandini S.
        • Amatiste S.
        • Di Bernardini R.
        • Catillo G.
        Effects of mastitis on buffalo milk quality.
        Asian-Australas. J. Anim. Sci. 2010; 23: 1319-1324
        • Wickham H.
        • Averick M.
        • Bryan J.
        • Chang W.
        • McGowan L.D.
        • François R.
        • Grolemund G.
        • Hayes A.
        • Henry L.
        • Hester J.
        • Kuhn M.
        • Pedersen T.L.
        • Miller E.
        • Bache S.M.
        • Müller K.
        • Ooms J.
        • Robinson D.
        • Seidel D.P.
        • Spinu V.
        • Takahashi K.
        • Vaughan D.
        • Wilke C.
        • Woo K.
        • Yutani H.
        Welcome to the Tidyverse.
        J. Open Source Softw. 2019; 4: 1686
        • Zhang F.
        • Upton J.
        • Shalloo L.
        • Shine P.
        • Murphy M.D.
        Effect of introducing weather parameters on the accuracy of milk production forecast models.
        Inf. Process. Agric. 2020; 7: 120-138