Performance comparison of machine learning models used for predicting subclinical mastitis in dairy cows: Bagging, boosting, stacking, and super-learner ensembles versus single machine learning models

Mastitis has a substantial impact on the dairy industry across the world, causing dairy producers to suffer losses due to the reduced quality and quantity of produced milk. A further problem related to this issue is the excessive use of antibiotics, which leads to the development of resistance in different bacterial strains. The growing consumer awareness oriented toward food safety and rational use of antibiotics has promoted the search for new methods of early identification of cows that may be at risk of developing the disease. Subclinical mastitis does not cause any visible changes to the udder or milk, and therefore it is more difficult to detect than clinical mastitis. The collection of large amounts of data related to the milk performance of cows allows using machine learning (ML) methods to build models that could be used for classifying cows into healthy and at risk of subclinical mastitis. The data used for the purpose of this study included information from routine milk recording procedures. The dataset consisted of 19,856 records of 2,227 Polish Holstein-Friesian cows from 3 herds. The authors decided to use the approach of building ensemble ML models, in particular bagging, boosting, stacking, and super-learner models, and comparing them for accuracy of identification of disease-affected cows against single ML models based on the support vector machines, logistic regression, Gaussian Naive Bayes, k-nearest neighbors, and decision tree algorithms. The models were trained and evaluated based on the information recorded for herd 1, using an 80:20 train-test split ratio according to animal ID (to avoid data leakage). The information recorded for herds 2 and 3 was used only to evaluate the models developed on the herd 1 dataset against unseen data.
Among the single ML models, the support vector machines model was found to be the most accurate in predicting subclinical mastitis at the subsequent test day, both on the training set (mean F1-score of 0.760) and on the testing sets containing data for herds 1, 2, and 3 (F1-scores of 0.778, 0.790, and 0.741, respectively). The gradient boosting model was found to be the best performing model among the ensemble ML models (F1-scores of 0.762, 0.779, 0.791, and 0.723 for the training set and the testing sets for herds 1, 2, and 3, respectively). The super-learner model, featuring the most advanced design and logistic regression in the meta layer, achieved the highest mean F1-score of 0.775 during the cross validation; however, it was characterized by a slightly worse prediction accuracy on the testing sets (F1-scores of 0.768, 0.790, and 0.693 for herds 1, 2, and 3, respectively). The study findings confirm the promising role of ensemble ML methods, which were found to be slightly superior to most of the single ML models.


INTRODUCTION
Mastitis is one of the most common diseases of dairy cattle; it affects milk quality, leads to milk production losses, and is one of the causes of cow culling (Seegers et al., 2003; Halasa et al., 2007; Dürr et al., 2008; Forsbäck et al., 2009; Schukken et al., 2009; Hogeveen et al., 2011; Hertl et al., 2014, 2018). It also affects the overall health status of a herd and is associated with a greater use of antibiotics (Pol and Ruegg, 2007). The use of antibiotics helps to reduce the prevalence of infections as well as to prevent their incidence (Unakal and Kaliwal, 2010); however, excessive use or misuse of antibiotics can lead to the development of resistance in different bacterial strains (Vanderhaeghen et al., 2010). Clinical mastitis causes changes to the glandular tissue of the udder as well as physical and, as a general rule, bacteriological changes in milk. In contrast, subclinical mastitis does not cause visible changes in the udder or milk, but the milk SCC is increased. Even though subclinical mastitis is much more common than clinical mastitis (Ebrahimie et al., 2018), it is more difficult to detect due to the absence of visible symptoms.
In recent years, interest has increased in the use of data collected both with automatic milking recording systems and from routine milk recording procedures. Such data are readily available to farmers and, given the large volume of data, can also be used for training machine learning (ML) models designed for classification purposes or regression models designed for assessing traits that can be difficult to measure. Machine learning models oriented toward dairy cattle have found use in the early identification of cows at risk of diseases such as mastitis or ketosis (Kamphuis et al., 2010; Mammadova and Keskin, 2015; Sitkowska et al., 2017; Ebrahimie et al., 2018, 2021; Ebrahimi et al., 2019; Bobbo et al., 2021; Fadul-Pacheco et al., 2021; Satoła and Bauer, 2021; Bauer and Jagusiak, 2022; Pakrashi et al., 2023). They are also used for predicting the time to calving in cows (Miller et al., 2020) and for the identification of cows at risk of heat stress (Becker et al., 2021).
The reports published to date have typically used classification models for the identification of healthy cows and those at risk of subclinical mastitis, which was defined as a high SCC in milk. The most commonly used threshold value is 200,000 cells/mL of milk (Hiitiö et al., 2017; Bobbo et al., 2021) or 250,000 cells/mL of milk (Ebrahimi et al., 2019; Ebrahimie et al., 2021). Only Anglart et al. (2020), although using SCC, decided to refrain from creating a categorical variable.
In their recently published paper on predicting subclinical mastitis in Mediterranean buffaloes, Bobbo et al. (2023) pointed out the issue of data leakage, which occurs when the training dataset used for developing a ML model contains information about the target parameter to be predicted and there is a direct link between the past and the future data. This is related to the use of data from routine milk recording procedures or automated milk performance monitoring systems, which contain results of repeated measurements for individual animals. The standard approach used for training ML models divides the input dataset into a training dataset and a testing dataset and randomly assigns records to one of them. As a result, data recorded for a specific animal can be assigned to both the training and the testing sets, thus causing data leakage. To deal with this problem, the authors of that study proposed a method that involves dividing the input dataset according to the animal ID, so that the information recorded for a specific animal can only be assigned to one of the sets: either the training set or the testing set.
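A minimal sketch of such an ID-based split, assuming scikit-learn's GroupShuffleSplit and synthetic placeholder data (the record contents are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical test-day records: 6 cows, 4 records each.
# Each record carries the ID of the cow it belongs to.
rng = np.random.default_rng(42)
cow_ids = np.repeat(np.arange(6), 4)          # animal ID per record
X = rng.normal(size=(len(cow_ids), 3))        # milk traits (placeholder)
y = rng.integers(0, 2, size=len(cow_ids))     # 0 = healthy, 1 = mastitic

# 80:20 split grouped by animal ID: all records of a given cow
# end up in exactly one of the two subsets, preventing leakage.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=cow_ids))

train_cows = set(cow_ids[train_idx])
test_cows = set(cow_ids[test_idx])
assert train_cows.isdisjoint(test_cows)       # no cow appears in both sets
```

Grouping by animal ID is what distinguishes this from an ordinary random split, where records of the same cow could land in both subsets.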
Single or deep ML models have typically been used for classifying cows into healthy and at risk of subclinical mastitis. The most common models include those based on decision trees, margin-based models (support vector machines, SVM), instance-based models (k-nearest neighbors), linear discriminant analysis, models based on neural networks, and generalized linear models (logistic regression; Mammadova and Keskin, 2015; Sitkowska et al., 2017; Ebrahimi et al., 2019; Bobbo et al., 2021).
This study presents a more comprehensive approach oriented toward building ensemble ML models (bagging, boosting, stacking, and super-learner ensembles) that are subsequently used for classifying cows into healthy and at risk of subclinical mastitis. The accuracy of classification is compared between ensemble and single ML models, given that predictions from ensemble ML models are typically more accurate than those originating from any single ML model within a group. Data from routine milk recording procedures were used for the identification of cows with subclinical mastitis (defined as an elevated SCC in milk, exceeding 200,000 cells/mL of milk) at the subsequent test day (TD). The identification of cows at risk of subclinical mastitis could support farmers in implementing appropriate preventive measures.

MATERIALS AND METHODS

Animal Welfare and Ethics Statement
The requirement of obtaining an approval from an animal welfare and ethics committee was waived with respect to this study, given that no invasive procedures were used and the study data were extracted from an existing database containing information from routine milk recording procedures.

Initial Dataset
The raw dataset consisted of 26,786 TD records for 2,403 Holstein-Friesian cows (9,306, 6,956, 5,730, and 4,794 TD records from parities 1, 2, 3, and ≥4, respectively). The cows calved in 3 herds. The study data were provided by a Polish commercial dairy farm. Test-day records were collected from January 2010 to December 2011. The recorded information included 6 milk traits: daily milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCC. In addition, each TD record included the cow number, herd number, date of sampling, date of birth of the cow, date of calving, and the lactation number.
The dataset only included records containing data from test milkings performed between 5 and 360 DIM, with a minimum of 2 TD records during the lactation period. For each record, new variables were created: lactation number, season of calving, season of sampling, and stage of lactation. These variables covered the following groups: 4 classes of lactation (1, 2, 3, and ≥4, having 6,953, 5,062, 4,306, and 3,535 records, respectively); 2 seasons of calving: April to September (9,583 records) and October to March (10,273 records); and 4 seasons of sampling: spring (March to May, 5,946 records), summer (June to August, 1,934 records), autumn (September to November, 6,118 records), and winter (December to February, 5,858 records). The stages of lactation were created as follows: 9 classes of 30 d each, class 10: between 271 and 305 DIM, and class 11: >305 DIM (having from 1,575 to 1,987 records).
The SCC values in the dataset were divided by 1,000 and log10-transformed to SCS to achieve a normal distribution (Anglart et al., 2020; Bobbo et al., 2021). All milk variables (milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS) were recorded as continuous traits. The lactation number, season of calving, season of sampling, and stage of lactation were used as categorical variables. The target feature was coded as a binary trait and defined as the presence or absence of subclinical mastitis at the subsequent TD performed at an interval of no more than 6 weeks. This approach to defining the dependent variable was also used by Bobbo et al. (2021, 2023). The predefined SCC threshold of 200,000 cells/mL was used to classify the cows as healthy (SCC < 200,000 cells/mL) or mastitic (SCC ≥ 200,000 cells/mL).
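The SCS transformation and target definition described above can be sketched as follows (a simplified illustration on hypothetical records for a single cow; the 6-week interval check is omitted for brevity):

```python
import numpy as np
import pandas as pd

# Hypothetical test-day records for one cow (SCC in cells/mL).
df = pd.DataFrame({
    "cow_id": [1, 1, 1, 1],
    "test_day": pd.to_datetime(
        ["2010-01-10", "2010-02-12", "2010-03-15", "2010-04-20"]),
    "scc": [150_000, 320_000, 180_000, 90_000],
})

# SCC divided by 1,000 and log10-transformed to SCS.
df["scs"] = np.log10(df["scc"] / 1000)

# Binary target: subclinical mastitis (SCC >= 200,000 cells/mL)
# at the *subsequent* test day, computed within each cow.
df = df.sort_values(["cow_id", "test_day"])
next_scc = df.groupby("cow_id")["scc"].shift(-1)
df["target"] = (next_scc >= 200_000).astype("Int64")
df.loc[next_scc.isna(), "target"] = pd.NA   # last record: no subsequent TD

print(df[["scs", "target"]])
```

Each record thus pairs the current TD's milk traits with the health status observed at the next TD.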
Once edited, the dataset for further ML processing included 19,856 records for 2,227 cows: 6,402 records for 761 cows from herd 1; 9,282 records for 1,006 cows from herd 2; and 4,172 records for 460 cows from herd 3 (Table 1). Each record included information concerning the relevant animal, milk traits from the considered TD, and the test outcome determined on the subsequent TD, given in binary format (0 = healthy, 1 = mastitic). Finally, a total of 10 independent traits were considered: lactation number, season of calving, season of sampling, stage of lactation, milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS. The prevalence of subclinical mastitis was 49% in herd 1, 50% in herd 2, and 35% in herd 3.

Method
To ensure that the procedures for evaluating and testing the study models were consistent and fair (identical datasets and train-test split ratios, data preprocessing steps, evaluation methods, and generation of final results), the data preprocessing, modeling, and final test were performed using the exact same procedure for all models examined in this study. Figure 1 presents an overview of the model evaluation procedure. The analysis and modeling were performed using Python (3.10.19) with the Pandas (1.5.3), NumPy (1.23.5), scikit-learn (1.2.1), LightGBM (3.3.5), XGBoost (1.7.3), and Matplotlib (3.6.3) libraries.
Data Preprocessing. The data preprocessing was performed for the herd 1, 2, and 3 datasets using 2 main steps: elimination of outliers and selection of features. To ensure the best performance of the ML models, 6 versions of the input dataset were prepared using different methods for outlier elimination and feature selection.
Two approaches, analytical and numerical, were used for the elimination of outliers. The analytical approach involved removing records with milk trait (milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS) values beyond 3 SD from the mean. The numerical approach was aimed at detecting and removing outliers using the unsupervised one-class classification approach based on the scikit-learn local outlier factor (LOF) ML method. The LOF algorithm computes the local density deviation of a given data point with respect to its neighbors. Samples that have a substantially lower density than their neighbors are considered to be outliers.
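The numerical approach can be sketched with scikit-learn's LocalOutlierFactor; the data below are synthetic and the parameter settings are illustrative assumptions, not the study's configuration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Placeholder milk-trait matrix with a few injected extreme records.
X = rng.normal(loc=0.0, scale=1.0, size=(500, 6))
X[:5] += 12.0                                  # obvious outliers

# Unsupervised one-class detection: LOF flags samples whose local
# density is substantially lower than that of their neighbors.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # 1 = inlier, -1 = outlier

X_clean = X[labels == 1]
print(X.shape, "->", X_clean.shape)
```

The cleaned matrix `X_clean` would then replace the raw records in the downstream modeling pipeline.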
To select the best input feature sets for modeling, the method of recursive feature elimination with cross validation (RFECV) was used, implemented using the scikit-learn RFECV class, XGBClassifier as the classifier, and RepeatedStratifiedKFold as the scikit-learn cross validation procedure, with 10 splits repeated 3 times. The RFECV method uses the cross validation approach to select the optimal number of features. Table 2 summarizes the differences between the 6 datasets generated using different methods for the elimination of outliers and selection of features.

ML Methods. To select the best performing models, 5 groups of ML models were used: single, bagging, boosting, and 2 metalearners: stacking and super learner. Models from each group were compared empirically, using training and testing datasets and the k-fold cross validation resampling method.
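A runnable sketch of RFECV-based feature selection on synthetic data; to keep it dependency-free and quick, a scikit-learn GradientBoostingClassifier stands in for the XGBClassifier used in the study, and a smaller cross-validation scheme (5 splits, 2 repeats) replaces the 10 × 3 scheme:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic stand-in for the test-day data: 10 features, of which
# only a handful are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=1)

# Recursive feature elimination with cross validation: features are
# dropped one at a time and the CV score picks the best subset size.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
selector = RFECV(estimator=GradientBoostingClassifier(n_estimators=50,
                                                      random_state=1),
                 step=1, cv=cv, scoring="f1")
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```

`selector.support_` gives the boolean mask of retained columns, which would map back to trait names such as lactation number or SCS in the real dataset.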
For the group of single ML models, the following scikit-learn implementations of algorithms were used: dummy classifier, logistic regression, decision tree, SVM, Gaussian Naive Bayes, and k-nearest neighbors. To develop a model in this group, only one learner was built based on the training data. The dummy classifier was used to establish the performance baseline (point of reference) for all other modeling methods.
Ensemble ML methods combine predictions from a group of 2 or more diverse ML models or datasets with a view to improving the overall predictive performance. It is often observed that predictions originating from an aggregate of learners are more accurate, useful, or correct than predictions generated by any single ML model within the group. In this study, 4 ensemble ML methods (bagging, boosting, stacking, and super learner) were used for developing models that could potentially achieve the best performance in predicting subclinical mastitis.
The bagging (bootstrap aggregating) method involves fitting several decision trees over different samples (bootstrapped replicas) from one training dataset and aggregating predictions by majority voting (Zhou, 2012; Rokach, 2019). The group of bagging ensembles was represented by scikit-learn bagging classifiers using decision tree (DecisionTreeRDSE and DecisionTreeRSE) and k-nearest neighbors (KNeighborsRDSE and KNeighborsRSE) as estimators, trained both on random subsets of the original dataset with the full feature set (random data subset ensemble, RDSE) and on random subsets of features (random subspace ensemble, RSE). In addition, the group included scikit-learn implementations of the random forest ensemble (RandomForest) and the extra trees ensemble (ExtraTrees). Figure 2 shows an example of a bagging ensemble designed for fitting several decision trees over different samples (bootstrapped replicas) from one training dataset and aggregating predictions by majority voting.
The boosting methods add ensemble members sequentially to correct the prediction errors made by prior models (weak learners), placing greater focus on training examples that the previously fitted models interpreted wrongly; the output is a weighted average of the predictions from all weak learners in proportion to their performance (Zhou, 2012; Rokach, 2019). The group of boosting ensembles consisted of scikit-learn AdaBoost classifiers with decision tree (DT) and logistic regression (LR) as estimators (AdaBoostDT and AdaBoostLR), and a gradient boosting classifier (GradientBoosting). In addition, the group included the extreme gradient boosting (XGB) classifier and the light gradient-boosting machine (LGBM) classifier from the xgboost and lightgbm packages. Figure 3 shows an example of a boosting ensemble in which weak learners are fitted and added sequentially to the ensemble, with a focus on correcting the wrong predictions previously generated by weak learners. The predictions combined from all weak learners result in a strong learner.
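A minimal sketch of the boosting approach on synthetic data, using scikit-learn's AdaBoost (with its default decision-stump estimator) and gradient boosting classifiers; settings are illustrative, not the study's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

# Each boosting stage fits a weak learner that focuses on the examples
# the previous stages handled poorly; the final prediction combines all
# stages in a performance-weighted fashion.
models = {
    "AdaBoostDT": AdaBoostClassifier(n_estimators=100, random_state=3),
    "GradientBoosting": GradientBoostingClassifier(random_state=3),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 3))
```

The XGB and LGBM classifiers mentioned above expose a near-identical `fit`/`score` interface and could be dropped into the same loop.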
Stacking (stacked generalization) involves fitting a diverse group of model types (built using different methods, making different assumptions, and therefore having less correlated prediction errors) on the same training dataset (first-level learners), together with another model (second-level, metalearner) that learns how to best combine the predictions generated by the first-level members of an ensemble (Zhou, 2012; Rokach, 2019). The stacking group included the VotingEnsembleKNN, VotingEnsembleRFC, and StackingEnsemble models.

Evaluation Procedure. To avoid data leakage, before modeling, each input dataset was split into training and testing subsets according to animal ID, using the GroupShuffleSplit scikit-learn class and an 80:20 split ratio. For the second level of training used for metalearner models (stacking and super learner), each input dataset was split into training and testing subsets using the scikit-learn stratified train_test_split function, which led to the generation of training (70% of input vectors) and testing (30% of input vectors) subsets that had the same proportion of class labels as the initial second-level dataset. Each dataset was scaled using 4 scikit-learn feature-scaling methods: MinMaxScaler, StandardScaler, RobustScaler, and Normalizer. For comparison, a nonscaled version of all datasets was also used. The same training and testing datasets were used to evaluate all models.
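A stacking ensemble of this kind can be sketched with scikit-learn's StackingClassifier; the first-level members and the logistic-regression meta-learner below are illustrative assumptions rather than the study's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

# Diverse first-level learners; a logistic-regression meta-learner
# is trained on their cross-validated predictions to combine them.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=4)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print("stacking accuracy:", round(stack.score(X_te, y_te), 3))
```

Using cross-validated first-level predictions for the meta-learner (the `cv=5` argument) is what keeps the second level from overfitting to the first level's training-set outputs.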
The 24 binary classification methods were evaluated using their default hyperparameters: 5 single ML algorithms and 6 bagging, 5 boosting, 3 stacking, and 5 super-learner ensembles. Each method was evaluated using 6 input training datasets containing information recorded for herd 1 and their feature-scaling variations. As a result, a total of 720 models were evaluated (24 modeling techniques × 6 datasets × 5 feature-scaling methods = 720 models). The evaluation was carried out using the scikit-learn 10-fold RepeatedStratifiedKFold cross validation method repeated 10 times and the mean cross validation metrics (described in the next section). Their SD were also recorded for comparison.
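The repeated stratified cross-validation procedure can be sketched as follows (synthetic data and a single illustrative classifier; 3 repeats are used here instead of the study's 10, for speed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=5)

# Stratified 10-fold CV repeated several times: each repeat reshuffles
# the folds, and the mean and SD of the scores summarize the model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1")
print(f"mean F1 = {scores.mean():.3f} (SD = {scores.std():.3f})")
```

Ranking models by mean F1-score (higher is better) and breaking ties by the SD (lower is better) mirrors the selection rule described below.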
The best performing models were fitted over the entire training dataset to generate predictions at a later stage (using unseen data represented by the holdout test dataset for herd 1). The final ranking of the models was carried out separately for each of the 5 groups of models (single, bagging, boosting, stacking, and super learner) based on the mean F1-score as the main metric (with higher values corresponding to better performing models) and its SD from the cross validation (with lower values corresponding to better performing models). The best performing model was selected from each group for the final assessment of prediction performance using datasets containing information recorded for herds 2 and 3. The main metric for the best performing models used for herds 2 and 3 was the F1-score.
Evaluation Metrics. Six metrics were used for the evaluation of the classification models: sensitivity (recall, true positive rate), specificity (true negative rate), accuracy, Matthews correlation coefficient (MCC), F1-score, and area under the receiver operating characteristic (ROC) curve (AUC).
Sensitivity indicates the proportion of cows with subclinical mastitis that were correctly predicted, and specificity indicates the proportion of healthy cows that were correctly predicted. Accuracy is the percentage of correct predictions in a dataset.
The MCC (Matthews, 1975) was calculated according to the following formula:

MCC = (TP × TN − FP × FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)],

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The MCC values range from −1 to 1, where −1 represents total disagreement between the predicted and the actual values, and 1 indicates that the predictions generated by a model correspond exactly to the actual values.
The F1-score was calculated as follows:

F1 = 2TP / (2TP + FP + FN),

that is, the harmonic mean of precision and recall. The F1-score values range from 0 to 1, where values closer to 1 represent better performing models. The F1-score was used as the most important metric for performance evaluation because it works better than other metrics (e.g., accuracy) when there is a serious downside to predicting false negatives (a cow developed the disease but the model predicted otherwise).
The AUC metric measures the 2-dimensional area under an ROC curve. The AUC value shows the degree to which a model is capable of distinguishing between classes. An AUC value closer to 1 indicates a better performing model.
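The six metrics can be computed with scikit-learn, as sketched below on a small set of hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, recall_score, roc_auc_score)

# Hypothetical outcomes for 10 cows (1 = mastitic, 0 = healthy).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
y_prob = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.4, 0.1, 0.3, 0.6, 0.55])
y_pred = (y_prob >= 0.5).astype(int)           # hard labels at a 0.5 cutoff

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = recall_score(y_true, y_pred)     # TP / (TP + FN)
specificity = tn / (tn + fp)                   # TN / (TN + FP)

print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))   # uses probabilities, not labels
```

Note that the AUC is computed from the predicted probabilities, whereas the remaining five metrics are computed from the thresholded labels.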

RESULTS

Evaluation of Single and Ensemble Models Using Training Dataset
Table 3 shows one best performing model from each ensemble model group (boosting, bagging, stacking, and super learner) and, for comparison, the best performing single ML models of each type. The best performing models in each group were selected using the mean F1-score obtained during the cross validation. The F1-score values ranged from 0.678 (the single decision tree model) to 0.775 (the SuperLearnerLR model). Taking the F1-score into account, slightly worse than the SuperLearnerLR model were the boosting model (GradientBoosting; 0.762), the bagging ensemble (RandomForest; 0.758), and 2 single models: SVM (0.760) and logistic regression (0.757). The best performing stacking model (VotingEnsembleRFC; 0.750) was worse than these 2 single models. The SD for the F1-score from the cross validation were similar at about 0.02 for all models (Table 3).
Considering the average accuracy resulting from the cross validation, the GradientBoosting model (0.767) and the SVM single model (0.767) performed slightly better than the SuperLearnerLR model (0.765). In contrast, the mean MCC was highest at 0.534 for the SuperLearnerLR, GradientBoosting, and SVM models. Similarly, according to the AUC criterion, the SuperLearnerLR, GradientBoosting, and SVM models were the best performing, with cross validation mean AUC values of 0.765, 0.767, and 0.767, respectively. These models (SVM, GradientBoosting, and SuperLearnerLR) can be used for predicting subclinical mastitis. For these recommended models, outliers were removed using the LOF ML method and, for the SuperLearnerLR model, 8 independent variables were used (lactation number, stage of lactation, milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS), but only 3 independent variables (lactation number, stage of lactation, and SCS) were used for the GradientBoosting and SVM models.

Comparison of the Best Performing Models Using Testing Datasets Collected for Different Herds
Using the same metrics as those used for the cross validation, the performance in predicting subclinical mastitis was tested for the best models from each group (single, bagging, boosting, stacking, and super learner) using testing datasets of previously unseen data collected for herds 1, 2, and 3 (Table 4). For herd 1, the highest F1-score values were achieved by the GradientBoosting and SVM models (0.779 and 0.778, respectively), and the F1-score for the SuperLearnerLR model was slightly lower (0.768). Similarly, the MCC and AUC values achieved on the herd 1 testing dataset by the GradientBoosting model (0.574 and 0.786, respectively) and the SVM model (0.570 and 0.785, respectively) were higher than those achieved by the SuperLearnerLR model (0.518 and 0.757, respectively).
Table 4 also shows the values of the analyzed metrics achieved on the testing datasets containing records for cows from herds 2 and 3, respectively. Information recorded for cows from these herds was not taken into account at the model-training stage. As with the testing set for herd 1, the highest F1-score on the testing set for herd 2 was achieved by the GradientBoosting model (0.791), followed by the SuperLearnerLR and SVM models (0.790 in both cases). However, for herd 3, the highest F1-score on the testing set was obtained by the SVM model (0.741). Considering metrics such as accuracy, MCC, and AUC, the best performing model applied to the testing set for herd 2 was the SuperLearnerLR model (0.785, 0.573, and 0.786, respectively), and for herd 3, it was the single SVM model (0.796, 0.573, and 0.787, respectively). The study results confirm that any one of the SuperLearnerLR, GradientBoosting, or SVM models can be validly used to predict the occurrence of subclinical mastitis in cows from other herds.

ROC Curves
Figures 5, 6, and 7 show the ROC curves (determined using the testing datasets for herds 1, 2, and 3) for the single, boosting, bagging, stacking, and super-learner models, as well as the best performing models (included in Table 4). Analysis of the ROC curves (Figure 5) and the AUC values in the group of single models showed that SVM was the best performing model (AUC of 0.785 for herd 1, 0.779 for herd 2, and 0.787 for herd 3; Table 4), with the logistic regression and Gaussian Naive Bayes models being slightly worse.
In the group of bagging models, the DecisionTreeRSE model was clearly inferior to the others (Figure 6), and the highest AUC value was obtained for the RandomForest bagging model (AUC of 0.773 for herd 1, 0.766 for herd 2, and 0.772 for herd 3; Table 4). Among the boosting models, the AdaBoostDT model was the weakest (Figure 6), and the highest AUC value was obtained for the GradientBoosting model (AUC of 0.786 for herd 1, 0.777 for herd 2, and 0.772 for herd 3; Table 4). Among the stacking models, it was only possible to plot ROC curves for 2 models, StackingEnsemble and VotingEnsembleRFC (Figure 6). The shape of the ROC curves was similar for both models, as were the AUC values, ranging from 0.751 to 0.763. In contrast, among the super-learner models, the ROC curves were similar regardless of the method used for building the model (LGBM, k-nearest neighbors, extra trees, logistic regression, and SVM; Figure 6), with AUC values ranging from 0.725 to 0.786.
The shapes of the ROC curves fitted over the testing datasets in the group of the best performing models (SuperLearnerLR, GradientBoosting, SVM, RandomForest, and VotingEnsembleRFC) were similar for each herd they were plotted for (Figure 7). The AUC values for these models were also similar: from 0.757 to 0.786 for herd 1, from 0.751 to 0.786 for herd 2, and from 0.763 to 0.787 for herd 3 (Table 4).

DISCUSSION
The aim of the study was to predict whether Polish Holstein-Friesian cows would develop subclinical mastitis (defined as an elevated count of somatic cells in the milk, exceeding the threshold value of 200,000 cells/mL) at the subsequent TD. The authors used information on milk yield traits from the TD immediately preceding the milking for which the prediction was generated. These are data collected on a regular basis for cows subjected to milk recording procedures in Poland. Large datasets make it possible to use ML methods for building classification models. To date, authors of papers relating to the prediction of mastitis in dairy cattle (Kamphuis et al., 2010; Mammadova and Keskin, 2015; Sitkowska et al., 2017; Ebrahimie et al., 2018; Ebrahimi et al., 2019; Bobbo et al., 2021; Fadul-Pacheco et al., 2021) have compared different types of single ML models (e.g., SVM, logistic regression, k-nearest neighbors, decision tree) and more complex models (e.g., random forest, gradient boosted tree, and neural network) for their performance in predicting the disease. We proposed a more systematic approach for comparing single ML models (SVM, logistic regression, Gaussian Naive Bayes, k-nearest neighbors, and decision tree) with each other and for selecting the best performing model. In the next step, ensemble models were compared by dividing them into groups according to their design: bagging, boosting, and stacking models (Figures 2, 3, and 4). Finally, the most advanced models, that is, super-learner models, were built, which were also assessed for the best accuracy in predicting subclinical mastitis. The study also provides a summary of the models from each group that achieved the highest F1-score on the training set (Table 3). The aim was to test whether model ensembles, and in particular super-learner models, are more accurate in predicting subclinical mastitis than the commonly used single models.
The use of ML methods is based on dividing the dataset into a training set and a testing set. The input dataset is most often divided using an 80:20 split ratio for records falling into the training and testing datasets, respectively (Anglart et al., 2020; Bobbo et al., 2021, 2023), and less commonly a 75:25 (Fadul-Pacheco et al., 2021) or a 70:30 (Satoła and Bauer, 2021) split ratio. For this study, an 80:20 split ratio was used. Typically, this division is carried out randomly (Ebrahimi et al., 2019; Bobbo et al., 2021; Fadul-Pacheco et al., 2021), so that the predetermined proportion of records is maintained. When a dataset contains results from repeated measurements carried out on a single animal (and this is the case with TD data), a random selection of records for the training and testing sets can cause the testing set to contain information that had already been included in the training set, and this may cause data leakage. When the primary objective of building an ML model is to generate predictions that are as accurate as possible for new, previously unseen data, other methods are sought for selecting data for the training and testing sets. This study used the approach proposed by Bobbo et al. (2023), which involves selecting records for the training and testing sets based on the animal ID so that the information recorded for an individual animal is only included in one of the 2 sets. Bobbo et al. (2023) tested the usefulness of 4 ML methods (generalized linear model, SVM, random forest, and neural network) for predicting subclinical mastitis in Italian Mediterranean buffaloes. They used 2 methods for selecting data for the training and testing sets: random selection and assignment according to the animal identifier. When the authors split the training and testing sets based on the animal ID, they observed an improvement in the predictive capability of the models on the testing set.
The next important step in model building is the selection of explanatory variables.Variables for the study models were selected using the RFE ML method and 3 sets of variables were extracted.The first set included all the available traits: lactation number, season of calving, season of sampling, stage of lactation, milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS (10 traits).The second set consisted of 8 traits: lactation number, stage of lactation, milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS.The third set featured 3 traits: lactation number, stage of lactation, and SCS.The best performing models from each group (single, bagging, boosting, stacking, and super learner), included in Table 3 and characterized by the highest F 1 -score from the training set, were built based on the datasets containing 3 or 8 features.Other studies using ML methods for predicting subclinical mastitis mostly used all available independent variables and, for example, Bobbo et al. (2021) included 15 traits (parity, stage of lactation, year of sampling, season of sampling, milk yield, fat, protein, casein, lactose, urea, log 10 SCC, pH, differential SCC, log 10 SCC for cows sampled within the same herd and on the same day, and milk yield of cows sampled within the same herd and on the same day).Bobbo et al. 
(2023), while continuing their research into the usefulness of ML methods for predicting subclinical mastitis in Italian Mediterranean buffaloes, expanded the set of traits (independent variables) to 27 by additionally including traits such as electrical conductivity and milk coagulation properties, year and month of calving, and 8 climatic parameters. However, the authors pointed out that a sort of plateau can be reached with the 7 most important traits (SCS, SCS sampled within the same herd and on the same day, milk yield of cows sampled within the same herd and on the same day, electrical conductivity, milk yield, parity, and differential SCC). In contrast, Ebrahimi et al. (2019) used data from automated systems to test the usefulness of ML models for predicting subclinical mastitis in New Zealand dairy cattle. Their dataset included 8 milk yield traits, such as milk volume and weight, fat, protein, and lactose, as well as electrical conductivity, peak flow, and milking time. However, the authors pointed out that electrical conductivity and lactose percentage had the greatest effect on predicting subclinical mastitis.
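Feature selection of the kind described above can be sketched with scikit-learn's RFE class. This is a minimal illustration on synthetic data; logistic regression as the ranking estimator is assumed here, since the study does not specify the estimator used inside RFE.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the TD dataset: 10 candidate features, as in the study.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=42)

# Recursive feature elimination: repeatedly fit the estimator and drop the
# weakest feature until the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3)
rfe.fit(X, y)

print("Selected feature indices:",
      [i for i, kept in enumerate(rfe.support_) if kept])
```

Running RFE with n_features_to_select set to 10, 8, and 3 would yield the three nested feature sets described in the text.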
Each of the sets of independent features that we recommend included SCS from the TD before the TD for which the prediction was generated, similarly to the studies published by Bobbo et al. (2021, 2023). The lactation stage and lactation number (traits present in each of the proposed trait sets) are also known to influence the variability of SCC in milk (Sharma et al., 2011; Hiitiö et al., 2017). In the future (provided that appropriate data are available), when selecting variables and building a model for predicting subclinical mastitis, it would be advisable to take into account a characteristic such as electrical conductivity. As pointed out by Ebrahimi et al. (2019), electrical conductivity indicates the leakage of blood components into milk that occurs during mastitis. This feature was included in the models by Ebrahimi et al. (2019), Anglart et al. (2020), and Bobbo et al. (2023).
It should be noted that in the herd 1 dataset, for which cross validation was performed, the prevalence of subclinical mastitis was 49%. This means that the dataset was balanced and no over- or undersampling was required.
The study models were evaluated with the most commonly used metrics (F1-score, accuracy, MCC, and AUC). The value of the F1-score on the training set for the best performing model (SuperLearnerLR; Table 3) was 0.775, and the F1-scores for the next 2 models (GradientBoosting and SVM) were slightly lower at 0.762 and 0.760, respectively. It should be noted that only 3 traits (lactation number, stage of lactation, and SCS) were considered within the GradientBoosting and SVM models. The MCC value from the training set for the best performing study models (SuperLearnerLR, GradientBoosting, and SVM) was 0.534 (Table 3). The mean accuracy and AUC for the 3 models referred to above ranged between 0.765 and 0.767 (Table 3), and mean accuracy was lower than the maximum values for these metrics obtained by Bobbo et al. (2021), with an accuracy of 0.805 for the linear discriminant analysis, neural network, and generalized linear models. For the analyzed models, the values of the metrics (F1-score, accuracy, MCC, and AUC) used in the study clearly indicate the superiority of the 3 models (SuperLearnerLR, GradientBoosting, and SVM) over the other models analyzed for the capability of predicting subclinical mastitis. The study also included the assessment of the values of these metrics for each model using the testing set for herd 1. The highest F1-score, accuracy, MCC, and AUC values were obtained for the GradientBoosting and SVM models (Table 4), 2 out of the 3 models identified as the best performing models on the training set. The third best performing model on the training set, the SuperLearnerLR model, provided lower values for the analyzed metrics on the testing set, which may indicate a weaker capability to correctly classify cows into healthy and at risk of the disease based on previously unseen data. The values of the metrics obtained using the testing sets for herds 2 and 3 are also included in Table 4 for comparison. For herd 3 (as for herd 1), taking into account the values of F1-score, accuracy, MCC, and AUC, the best performing models were GradientBoosting (0.723, 0.783, 0.545, and 0.772, respectively) and SVM (0.741, 0.796, 0.573, and 0.787, respectively); for herd 2, SuperLearnerLR (0.790, 0.785, 0.573, and 0.786, respectively) proved to perform better than these models. Therefore, it can be expected that the proposed study models (GradientBoosting, SVM, and SuperLearnerLR) can be applied to other herds with data like the ones used to build the study models.
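The four evaluation metrics reported in Tables 3 and 4 are all available in scikit-learn. The sketch below uses toy labels and predictions purely for illustration; the values are not the study's results.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Toy labels and predictions; in the study these would come from a fitted
# model scored on the herd test sets.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3, 0.85, 0.15]

print("F1:      ", round(f1_score(y_true, y_pred), 3))       # 0.8
print("Accuracy:", round(accuracy_score(y_true, y_pred), 3)) # 0.8
print("MCC:     ", round(matthews_corrcoef(y_true, y_pred), 3))  # 0.6
print("AUC:     ", round(roc_auc_score(y_true, y_prob), 3))  # 0.96
```

Note that AUC is computed from predicted probabilities (y_prob), whereas the other three metrics use the hard class labels.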
Two out of the recommended models, the boosting model (GradientBoosting) and the single SVM model, were the best performing models on the training and testing sets (for herds 1 and 3), whereas Bobbo et al. (2023) reported that the SVM model, which was the best performing model on the validation set, turned out to be the worst performing on the testing set. In contrast, Ebrahimi et al. (2019), who had data from an electronic automated monitoring system, showed that in terms of accuracy, the best performing model applied to the validation set was also the gradient boosted tree model (0.849), and in terms of the AUC value, the deep learning model was the best performing one (0.826). It should be added that Ebrahimi et al. (2019) used a different approach to define the dependent variable compared with Bobbo et al. (2021), Bobbo et al. (2023), or this study. They predicted the occurrence of subclinical mastitis, defined as an elevated SCC (≥250,000 cells/mL), using automatic milking traits recorded at the same milking. Their models did not include other information about the study cows, such as lactation number or lactation stage.
It should be noted that for the best performing study models (GradientBoosting, SVM, and SuperLearnerLR), the mean sensitivity on the training set ranged from 0.746 to 0.816, and the specificity ranged from 0.714 to 0.787 (Table 3). This demonstrates a similar (and relatively high) percentage of disease-affected and healthy cows that were correctly classified. Other studies (Ebrahimi et al., 2019; Bobbo et al., 2021) relating to predicting subclinical mastitis in dairy cows reported fairly high specificity values (0.819-0.919) on the validation set and relatively low sensitivity values (0.381-0.616; Bobbo et al., 2021) or, as in the study of Ebrahimi et al. (2019), high sensitivity values (>0.93) with relatively low specificity values (<0.50). Ebrahimi et al. (2019) explained that their models might be less capable of identifying healthy cows. In contrast, the models proposed by Bobbo et al. (2021) will be inferior in terms of identifying cows with subclinical mastitis.
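Sensitivity and specificity can be obtained directly from the confusion matrix; a small sketch with toy predictions (1 = subclinical mastitis, 0 = healthy), not the study's data:

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # share of diseased cows correctly flagged
specificity = tn / (tn + fp)  # share of healthy cows correctly cleared
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

A model with high specificity but low sensitivity (as in Bobbo et al., 2021) misses diseased cows, while the reverse pattern (Ebrahimi et al., 2019) misclassifies healthy cows.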
Udder health monitoring in dairy herds is usually done once a month using the recorded SCC in milk. However, there is a risk of udder health misclassification due to the natural variability of the SCC, which can affect the results (Quist et al., 2008). Therefore, predictions for the occurrence of subclinical mastitis at the subsequent TD can provide (in addition to the SCC from the current TD) additional information for udder health decision-support tools. Machine learning models are also well suited to dealing with large datasets and to detecting complex and unobvious relationships between variables.

CONCLUSIONS
In recent years, more and more farmers have been using data collected both with automatic milking recording systems and from monthly TD milk recording procedures. Such data are readily available and, given the large amount of information they contain, can be used for training ML models. Having data from the current TD and using ML models can additionally provide information regarding the occurrence of subclinical mastitis in the future. The ML models proposed in this study can help farmers identify in advance cows that are likely to present a high SCC at the subsequent TD. The 3 ML models (one single and 2 ensembles), namely SVM, gradient boosting, and super learner, proved able to accurately predict subclinical mastitis based on milk variables and animal information.

NOTES
This research was financed by the Ministry of Science and Higher Education of the Republic of Poland (SUB/020012-D015; Warsaw, Poland). The requirement of obtaining an approval from an animal welfare and ethics committee was waived with respect to this study, given that no invasive procedures were used and the study data were extracted from an existing database containing information from routine milk recording procedures. The authors have not stated any conflicts of interest.

Figure 1. Overview of the model evaluation procedure using cross validation for model fitting.
Figure 2. Diagram of the bagging ensemble that fits several decision trees to different samples (bootstrapped replicas) from the same training dataset and combines their predictions by majority voting.

Figure 3. Diagram of the boosting ensemble, where weak learners are fit and added to the ensemble sequentially, each one focusing on improving the records wrongly predicted by the previous weak learner. The predictions combined from all weak learners result in a strong learner.
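The bagging and boosting designs sketched in Figures 2 and 3 correspond to standard scikit-learn estimators. The sketch below uses synthetic data and default hyperparameters, not the study's tuned configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Toy data standing in for the TD records.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Bagging (Figure 2): decision trees (the default base learner) are fit on
# bootstrapped replicas of the training data and combined by majority voting.
bag = BaggingClassifier(n_estimators=100, random_state=42).fit(X, y)

# Boosting (Figure 3): shallow trees are added sequentially, each one fit to
# correct the errors of the ensemble built so far.
boost = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X, y)

print("bagging training accuracy: ", round(bag.score(X, y), 3))
print("boosting training accuracy:", round(boost.score(X, y), 3))
```

The key design difference is that bagging trains its trees independently and in parallel, whereas boosting trains them sequentially, with each tree depending on the residual errors of its predecessors.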

Figure 4. Diagram of the stacking ensemble that involves fitting a diverse group of model types on the same training dataset (first-level learners) and using another model (second-level meta-learner) to learn how to best combine the predictions generated by the first-level members of the ensemble.
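The stacking design in Figure 4 can be sketched with scikit-learn's StackingClassifier. The first-level learners below (SVM, k-nearest neighbors, Gaussian Naive Bayes) and the logistic regression meta-learner mirror algorithm families named in the study, but the exact configuration is assumed for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy data standing in for the TD records.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# First-level learners plus a logistic regression meta-learner, echoing the
# SuperLearnerLR design described in the text.
stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier()),
                ("gnb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions feed the meta-learner
)
stack.fit(X, y)
print("training accuracy:", round(stack.score(X, y), 3))
```

Setting cv=5 means the meta-learner is trained on out-of-fold predictions of the first-level learners, which keeps it from simply memorizing their in-sample fits.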
TP, TN, FP, and FN are true positive, true negative, false positive, and false negative values, respectively.

Figure 5. Receiver operating characteristic curves for the best single machine learning models for herds 1, 2, and 3 based on test datasets.

Figure 6. Receiver operating characteristic curves for the best ensemble machine learning models (bagging, boosting, stacking, and super learner) for herds 1, 2, and 3 based on test datasets.

Figure 7. Receiver operating characteristic curves for the best machine learning models (one per group: single, bagging, boosting, stacking, and super learner) for herds 1, 2, and 3 based on test datasets.

Satoła and Satoła: ENSEMBLE MACHINE LEARNING FOR MASTITIS PREDICTION
Table 1. Characteristics of the dataset for further machine learning processing, according to herd number. 1For each method used for the elimination of outliers, the 3 best feature sets were chosen, resulting in 6 input datasets for modeling. Sets 1 and 4 included all 10 features: lactation number, season of calving, season of sampling, stage of lactation, milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS. Sets 2 and 5 contained 8 features: lactation number, stage of lactation, milk yield, fat percentage, protein percentage, lactose percentage, milk urea concentration, and SCS. Sets 3 and 6 consisted of 3 features: lactation number, stage of lactation, and SCS.

Table 2. Characteristics of input datasets used for modeling, derived from the Table 1 dataset using a feature selection method and outlier detection methods, and the independent features included in the datasets1