3D pose estimation to detect posture transition in free-stall housed dairy cows

Free-stall comfort is reflected in various indicators, including the ability of dairy cattle to display unhindered posture transition movements in the cubicles. To ensure farm animal welfare, it is instrumental for farm management to be able to continuously monitor occurrences of abnormal motions. Advances in computer vision have enabled accurate kinematic measurements in several fields, such as human, equine and bovine biomechanics. An important step upstream of measuring displacement during posture transitions is to determine that the behavior is accurately detected. In this study, we propose a framework for detecting lying-to-standing posture transitions from 3D pose estimation data. A multi-view computer vision system recorded posture transitions between December 2021 and April 2022 in a Swedish stall housing 183 individual cows. The output data consisted of the 3D coordinates of specific anatomical landmarks. Sensitivity of posture transition detection was 88.2%, while precision reached 99.5%. Analyzing those transition movements, breakpoint detection identified the timestamp of onset of the rising motion, which was compared with that annotated by observers. Agreement between observers, measured by intra-class correlation, was 0.85 between 3 human observers and 0.81 when adding the automated detection. The intra-observer mean absolute difference in annotated timestamps ranged from 0.4 s to 0.7 s. The mean absolute difference between each observer and the automated detection ranged from 1.0 s to 1.3 s. There was a significant difference in annotated timestamp between all observer pairs but not between the observers and the automated detection, leading to the conclusion that the automated detection does not introduce a distinct bias. We conclude that the model is able to accurately detect the phenomenon of interest and that it is comparable to a human observer.


INTRODUCTION
All cubicles in a dairy barn are usually identical; however, a natural variability exists both in animal size relative to the cubicle (Dirksen et al., 2020) and in individual motion patterns and locomotor activity (Shepley et al., 2020). A factor of stall comfort, which affects lesion prevalence and lying time, is how easy it is for the cow to get up and down in the cubicle (Zambelis et al., 2019). Ease of movement during posture transition was highlighted as an evaluation criterion for stall quality regarding cow comfort by Lidfors (1989), who noted that cows in cubicles were more regularly seen performing abnormal motions (such as sideways lunging or horse-like rising) than on pasture. Ceballos et al. (2004) analyzed the kinematics of posture transitions and found that cows used less longitudinal space when rising in a cubicle than on an open pack. Given the evidence for the link between restrictive movements and signs of reduced welfare (Beaver et al., 2021), the quality of posture transitions is included as an indicator in welfare assessment schemes such as Welfare Quality (Blokhuis et al., 2013).
Assessing ease of posture transition per se, rather than indirect signs of reduced comfort such as hock lesions (Dirksen et al., 2020) or reduced lying time (Shewbridge Carter et al., 2021), is more challenging, and practical objective methods are needed (Brouwers et al., 2023). Visual observations noting the occurrence of abnormal behaviors are commonplace in farm management and welfare assessment schemes. Alternatively, ease of movement can be assessed quantitatively by measuring the displacement of anatomical landmarks throughout bouts of posture transition (Ceballos et al., 2004). Drawbacks exist for both approaches. The visual method relies on time-consuming, sporadic human observations. Although Zambelis et al. (2019) found excellent agreement between observers (kappa of 0.93 for getting-up movement ease), a degree of subjectivity always exists in visual scoring of animal movements (Chaplin & Munksgaard, 2001; Vasseur, 2017). The acquisition of 3D kinematics data by Ceballos et al. (2004) relied on fitting motion-capture reflectors on cows, requiring lengthy preparation and exposing the equipment to damage. These limitations might be a reason behind the low sample size (n = 5 cows with at least 2 bouts per cow) in the latter study.
Considering the variability in cow sizes and kinematic profiles, and the need for objective methods to assess ease of movement, we propose a framework to detect lying-to-standing (LTS) posture transitions from 3D pose estimation data. As a step in validating the potential of this method, the aim of this study was to measure the performance of a feature extractor in detecting the onset of LTS posture transitions compared with the human eye.

MATERIALS AND METHODS
The study presented here was approved by the ethical committee Uppsala djurförsöksetiska nämnd under approval 5.8.18-13069/2021. With regard to the 3Rs in animal research, the study relied exclusively on already collected, non-invasive video material.

LOCATION AND ANIMALS
Recordings were obtained at the Swedish Livestock Research Centre's dairy barn (Uppsala, Sweden). The herd comprises Swedish Holstein and Swedish Red cattle housed indoors with access to pasture 120 d a year, between May and September. Video was recorded on 30 separate days (midnight to midnight), sampled for convenience, between 2021-12-08 and 2022-04-28. As the barn is lit at all times, recordings were obtained at all times of day. An average of 51 cows were present simultaneously in the pen, with individuals being added and removed throughout, for a total of 183 different individuals having visited the pen during the study period. Seven RGB cameras (G3 Bullet, Ubiquiti) were placed around an area approximately 1/4 of the pen, located closest to the sorting gate to the milking robot, and oriented toward the rows of cubicles so that all cubicles in the study ward, including the forward lunge room defined as the 60 cm beyond the head rail, were visible by at least 2 cameras. The study ward comprised the 12 cubicles (Cubicle divider CC1800 with rigid head bar, DeLaval) for which video coverage was optimal, out of 66 total in the pen. The cameras were installed on fixed metal rails, part of the barn's infrastructure, between 2.8 and 3.6 m high. The location of each camera, as well as the stall layout, are shown in Figure 1.
Cows had access to feeding troughs with ad libitum mixed feed as well as 2 rotary brushes, and concentrate was dispensed both at the milking robot and at concentrate dispensers. Passage through the milking robot's sorting gate was compulsory for access to the feed. Milking was done by one milking robot (VMS V300, DeLaval), which cows had access to on a voluntary basis. Cows were brought to the robot by farm staff if they had not been milked in over 12 h.

KEY-POINT ACQUISITION IN 3 DIMENSIONS
This study used a 3D pose estimation software (Sony Multi-Camera System, Sony Nordic). The software estimates the 3D pose by finding cross-view correspondences across inferred 2D poses of the same object on synchronized views. It then creates a track for each object based on spatial continuity in the 3D location. The initial synchronization is achieved by reading the timestamp of each frame, and taking the first full-second transition at a common timestamp across all video recordings as the initial synchronized frame. The initial frame synchronization is provided as an input to the multi-camera system. Synchronization is maintained using the estimated time of arrival of each frame in the processing buffer.
The 2D object detector and pose estimator use convolutional neural networks to detect cows and specific anatomical landmarks on RGB images, in the form of bounding boxes and key-points, respectively. The landmarks used in this study were limited to the center top of the poll, the highest point of the withers, the spine at the 13th thoracic vertebra, and the top of the sacrum taken immediately behind the uppermost part of the ilium (referred to, respectively, as head, withers, t13 and sacrum).
The output data consist of one key-point per anatomical landmark, with X, Y and Z coordinates for each object and given frame. Figure 2 shows the estimated 3D position of the key-points, linked to create a visual structure, for 2 objects during an LTS transition, as well as the video frames used to generate them.

DETECTION OF POSTURE TRANSITIONS
The recordings were sampled visually by one observer with the aim of finding 1000 sequences containing LTS transitions. When a cow was observed fully getting up from a lying position, the timestamp was annotated, and a video sequence corresponding to a window of ±15 s around the annotated timestamp was extracted. In the final data set, 979 sequences were eventually identified. These sequences were then processed with the 3D pose estimation software.
When the cow rises, the line formed by linking the sacrum and t13 key-points increases its angle with the horizontal plane, as the cow's back is at an angle with the ground. By calculating the difference between the sacrum height and the withers height, and following this difference through time, we identified peaks corresponding to LTS motions. When a peak above 0.4 (in the coordinates' arbitrary spatial reference system) was detected, the frame was considered to be within a potential rising motion. The mean withers Z position in the 120 frames located 330 frames after the peak was then compared with the mean withers Z position in the last 120 frames of the sequence. If the ratio of the height difference after and before the peak was higher than 140%, the track was classified as an LTS motion. Figure 3 illustrates this by showing the vertical position of the key-points. At 16 s, there is an important difference between the withers (orange) and sacrum (green) heights. This difference points toward a potential rising bout. Calculating the difference in withers position between the 5 s and 27 s marks, we determine that the animal has transitioned from a low, lying posture to a high, standing posture.
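For readers wishing to reproduce the rule, the classification logic above can be sketched as follows. This is our own minimal reading of the text, not the authors' code: the function name, the windowing details (in particular, we take the 120 frames immediately preceding the peak as the pre-peak reference, since the ratio is described as comparing heights after and before the peak), and the handling of short tracks are assumptions.

```python
import numpy as np

PEAK_THRESHOLD = 0.4  # sacrum-withers height difference (arbitrary spatial units)
HEIGHT_RATIO = 1.40   # post-peak / pre-peak withers height ratio

def is_lying_to_standing(sacrum_z, withers_z):
    """Classify a track as a lying-to-standing (LTS) motion (illustrative sketch).

    sacrum_z, withers_z: 1-D arrays with the Z coordinate of the key-point
    at each frame of the track.
    """
    diff = np.asarray(sacrum_z, float) - np.asarray(withers_z, float)
    peaks = np.flatnonzero(diff > PEAK_THRESHOLD)
    if peaks.size == 0:
        return False  # no frame qualifies as a potential rising motion
    peak = int(peaks[0])
    # Mean withers height in the 120 frames located 330 frames after the peak...
    after = np.asarray(withers_z, float)[peak + 330 : peak + 450]
    # ...compared with the 120 frames preceding the peak (our interpretation).
    before = np.asarray(withers_z, float)[max(peak - 120, 0) : peak]
    if after.size == 0 or before.size == 0:
        return False  # track too short around the peak
    return float(np.mean(after)) / float(np.mean(before)) > HEIGHT_RATIO
```

The thresholds are taken directly from the text; they are expressed in the pose estimator's arbitrary coordinate units and frame counts, so they would need re-tuning for a system with a different scale or frame rate.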
In these 979 sequences, this method initially detected 493 LTS motions for which the cow was tracked at each consecutive frame. For the remainder (486 sequences), the track was interrupted for several frames and the motion was captured in several separate tracks. Detections were stitched together if they fit the following criteria:
• The tracks are found in the same 30-s sequence.
• The second track starts after the first track vanishes, and within an interval of 30 frames.
• The Euclidean distance in the 3D pose estimator's coordinate system between the last point in the vanishing track and the first point of the starting one is lower than 0.2.
There was no limit on the number of tracks appended together to form one single track, as long as the above conditions were fulfilled. The resulting stitched track was kept if it contained more than 700 frames, and discarded otherwise. Using this method, an additional 370 rising sequences were detected by applying the height difference rule to stitched tracks, giving a total of 863 predicted positives. For the remaining 116 sequences, either the animal was not detected by the pose estimation software, the posture transition detector failed to identify the occurrence, or the motion was split between different tracks that could not be related due to noise or an interruption across more than 30 frames. Visual inspection of the predicted LTS motions revealed 4 false positives.
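The stitching criteria and the 700-frame length rule can be expressed compactly in code. The sketch below is illustrative only: the track representation (a dict with start/end frames and first/last 3D points), the greedy chaining strategy, and all names are our own assumptions, with the numeric thresholds taken from the text.

```python
import math

MAX_GAP_FRAMES = 30     # second track must start within 30 frames of the first vanishing
MAX_JUMP = 0.2          # max Euclidean distance between track ends (arbitrary units)
MIN_TRACK_LENGTH = 700  # stitched tracks shorter than this are discarded

def can_stitch(track_a, track_b):
    """Check the stitching criteria for two tracks of the same 30-s sequence.

    Each track is a dict with 'start'/'end' frame indices and
    'first_xyz'/'last_xyz' 3D points (a hypothetical representation).
    """
    gap = track_b["start"] - track_a["end"]
    if not (0 < gap <= MAX_GAP_FRAMES):
        return False
    return math.dist(track_a["last_xyz"], track_b["first_xyz"]) < MAX_JUMP

def stitch(tracks):
    """Greedily chain tracks (sorted by start frame) that meet the criteria.

    Returns the chained tracks, or None if the stitched result is too short.
    There is no limit on how many tracks may be appended.
    """
    tracks = sorted(tracks, key=lambda t: t["start"])
    chain = [tracks[0]]
    for t in tracks[1:]:
        if can_stitch(chain[-1], t):
            chain.append(t)
    length = chain[-1]["end"] - chain[0]["start"] + 1
    return chain if length > MIN_TRACK_LENGTH else None
```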
Twenty-two true positives were discarded from the data set, because the posture transition was initiated before the start of the video snippet and thus not captured in its entirety.

SIGNAL PROCESSING
Each series of raw coordinates was processed to attenuate noise. A low-pass filter with a cut-off frequency of 10 Hz was applied, to remove high-frequency noise resulting from key-point jittering. This cut-off was chosen based on the recommendations by Hamäläinen et al. (2011) and Riaboff et al. (2020) for noise removal on animal activity data. The filter was applied separately to each key-point and its respective X, Y and Z coordinate time series. The filter was implemented in Python 3.9 using the function "butter" from the SciPy package (Virtanen et al., 2020). Figure 3 illustrates the filtered Z coordinate time series during a rising sequence.
From the processed signal, consisting of the coordinates of each key-point in 3 dimensions, we detected the timestamp at which the cow starts rising. Considering solely the kinematic features available through the 4 key-points, this is most clearly reflected by the change in the withers' position, as rising on the elbows will cause the withers to move upwards slightly, visible as an increase in the withers' Z (vertical) coordinate. When doing so, the cow aligns its back along the length of the cubicle, which is reflected in a change of the withers' Y coordinate (axis perpendicular to the cubicle's length). Although, from a behavioral perspective, there is more to the LTS transition than solely the withers' movement, the system was blind to all but the positions of 4 anatomical landmarks. The withers were chosen for the stability of the key-point (low jittering) and for their consistent motion pattern in the LTS transition across sequences. To detect the exact onset of rising motions, we used linearly penalized segmentation (Pelt), implemented in the Python library "Ruptures" (Truong et al., 2020). Pelt was applied to the bivariate series of the Y (lateral, perpendicular to the cubicles) and Z (height) positions of the withers to identify breakpoints in the time series. No restrictions were set on the number of breakpoints to be detected. A baseline height (Z coordinate) was calculated for each sequence as the median withers height in the sequence's first 30 frames. The breakpoints detected by Pelt were iterated through. If the median withers height in the 30 frames following a breakpoint was higher than the baseline, that breakpoint was considered to be the start of the rising motion. If not, we iterated to the next breakpoint and applied the same logic.
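The breakpoint-selection logic at the end of this paragraph can be sketched as follows. We assume candidate breakpoints have already been obtained from Ruptures, e.g. via something like `rpt.Pelt(model="l2").fit(signal).predict(pen=...)` (the cost model and penalty are not stated in the text and are assumptions); only the baseline-and-iterate rule from the text is implemented here, with names of our own choosing.

```python
import numpy as np

def find_rising_onset(withers_z, breakpoints, win=30):
    """Pick the onset of the rising motion among candidate breakpoints.

    withers_z: filtered Z series of the withers key-point.
    breakpoints: candidate frame indices, e.g. from Ruptures' Pelt.
    Following the text: the baseline is the median withers height in the
    sequence's first `win` frames; the first breakpoint after which the
    median height over the next `win` frames exceeds the baseline is
    taken as the onset. Returns None if no breakpoint qualifies.
    """
    withers_z = np.asarray(withers_z, float)
    baseline = np.median(withers_z[:win])
    for bp in breakpoints:
        segment = withers_z[bp : bp + win]
        if segment.size and np.median(segment) > baseline:
            return bp
    return None
```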

VALIDATION EXPERIMENT
To evaluate the performance of the tool in detecting the occurrence of LTS bouts, we compared the timestamps automatically detected to those annotated by 3 human observers, considered the gold standard for behavioral observations. Observers were provided with the following definition: "The cow is lying down and rises on its breastbone and elbows, which causes the withers to rise visibly above the rest of the back." This definition is based on that of Lidfors (1989), but it adds the withers' position as an indicator. The animals were seen to initiate the movement by centering their elbows under the body, which in turn causes the withers to rise slightly. This motion of the withers was used to determine the exact onset of the rising motion. The description was accompanied by illustrations taken from Schnitzer (1971) and Cermak (1988), as well as an ethogram describing the sequence of movements in the LTS transition, in which the movement to label was explicitly identified. This ethogram described the stages of the posture transition based on Lidfors (1989) and on Schnitzer (1971). Observers all received the same training, in which the ethogram was explained and examples were showcased; they reviewed 5 videos of different cows rising and agreed on the exact frame to label as the onset of the rising motion. These 5 videos were taken from the original data set and used solely for training observers.
The validation data set was sampled randomly from the 471 complete LTS sequences captured in a single track. In total, 60 unique LTS sequences were annotated by at least 1 observer. This number was determined a priori, as no prior data was available on observer variability in posture transition detection. These sequences were the original 30-s synchronized video snippets from which the key-points were detected. The video was available to the observers from all 7 cameras used for key-point detection, plus one additional ceiling-mounted camera. Observers were free to choose the camera offering the best view of the animal performing the bout. Every observer was provided with a total of 55 randomly selected video clips. Of these 55 sequences, 30 were common to all observers and 10 were unique to each observer (40 different sequences per observer). The remaining 15 sequences were randomly resampled from the prior 40 and re-annotated by the same observer, to measure intra-observer reliability.
All sequences were blinded, with a different label each time the sequence appeared.

STATISTICAL ANALYSIS
The mean absolute difference (MAD) in annotated timestamp was calculated for each observer to quantify intra-observer reliability as

MAD_i = (1/S) Σ_s |t_{s,i,1} − t_{s,i,2}|,

with t_{s,i,1} and t_{s,i,2} being the timestamp of the s:th sequence provided at the 1st and 2nd assessment occasion, respectively, by observer i. The inter-rater MAD was calculated analogously as

MAD_{i,j} = (1/S) Σ_s |t_{s,i} − t_{s,j}|.

The following mixed effects models were fitted using statsmodels.formula.api.Mixedlm (Seabold & Perktold, 2010) in Python 3.9, to evaluate the observer effect and intra-class correlation (ICC) with or without the automated detection:

t_{s,i,r} = β_0 + u_s + β_1 I_1 + β_2 I_2 + ε_{s,i,r},   (1)
t_{s,i,r} = β_0 + u_s + β_1 I_1 + β_2 I_2 + β_3 M + ε_{s,i,r},   (2)

where β_0 is the (fixed) intercept, u_s ~ N(0, σ_u²) is a random sequence effect, β_1 and β_2 are fixed observer effects, β_3 is a fixed effect corresponding to the automated detection taken as an additional observer (referred to as the "Model" or M), and ε_{s,i,r} ~ N(0, σ_e²) is a (random) error term. The sequence number is indicated by the subscript s, I_i are indicators for the observers, and r is the index for repeated sequences annotated 1 to 2 times by the same observer. The observer effects were tested using ANOVA. The ICC of each model was calculated as a measure of inter-observer agreement. A post-hoc pairwise t-test with Bonferroni correction for 6 tests was then computed to test the pairwise differences between observers. The annotated timestamps were not normalized because a 1 s difference between observers, for example, has the same practical meaning in this context regardless of whether the annotation is done at the 4 s mark or the 12 s mark.
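The intra- and inter-observer MAD described above reduce to the same elementwise computation; a minimal NumPy sketch (function names are ours) is:

```python
import numpy as np

def mad_intra(t_first, t_second):
    """Intra-observer MAD: mean |t_{s,i,1} - t_{s,i,2}| over sequences s,
    for one observer i annotating each sequence twice."""
    return float(np.mean(np.abs(np.asarray(t_first, float) - np.asarray(t_second, float))))

def mad_inter(t_obs_i, t_obs_j):
    """Inter-rater MAD between observers i and j: same formula, applied to
    the two observers' annotations of the common sequences."""
    return float(np.mean(np.abs(np.asarray(t_obs_i, float) - np.asarray(t_obs_j, float))))
```

The mixed models themselves would be fitted with `statsmodels.formula.api.mixedlm` as the text states, with the sequence as the grouping (random-effect) factor.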
The performance of the algorithm was assessed in the same way, by treating the algorithm as an additional observer and testing whether it differed from the human observers. Differences were calculated between the algorithm's detections (denoted T_M) and the observer annotations, T_H. Bland-Altman plots were prepared for each observer pair (t_{s,i}, t_{s,j}), and also comparing T_H with T_M, with a view to checking for the absence of a pattern and of points beyond 1.96 standard deviations. MAD(H, M) and the mean difference MD(H, M) were calculated.
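The quantities behind a Bland-Altman plot (bias, 1.96-SD limits of agreement, and points outside them) can be computed as in the sketch below; the function name and return structure are our own.

```python
import numpy as np

def bland_altman(t_a, t_b):
    """Bland-Altman statistics for two sets of annotated timestamps.

    Returns the mean difference (bias), the 1.96-SD limits of agreement,
    and a boolean mask of the points falling outside those limits.
    """
    t_a = np.asarray(t_a, float)
    t_b = np.asarray(t_b, float)
    diff = t_a - t_b
    bias = diff.mean()
    sd = diff.std(ddof=1)                       # sample standard deviation
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # limits of agreement
    outside = (diff < loa[0]) | (diff > loa[1])
    return bias, loa, outside
```

Plotting `diff` against the pairwise means `(t_a + t_b) / 2`, with horizontal lines at the bias and the limits of agreement, reproduces the layout of Figure 4.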

RESULTS
A total of 836 rising bouts were detected out of 979 visually selected sequences, equating to a sensitivity of 88.5% or a false negative rate of 11.5%. Four sequences were wrongly classified as rising motions, giving a precision of 99.5% or a false discovery rate of 0.5%.
Model (1), comparing only human observers, gave ICC = 0.85. There was a significant observer effect on the annotated timestamps of LTS onset (P < 0.001) according to the ANOVA. When model (2) was fitted to assess the performance of the prediction, the ICC decreased to 0.81, remaining at a similarly satisfactory level of agreement. There was, however, no significant difference between the predicted timestamp ("model") and each observer's annotations according to the post-hoc pairwise t-test with Bonferroni correction of the type-1 error at α = 0.0083. There was a significant difference between all observer pairs.

Mean absolute differences MAD(H, M) are summarized in Table 1. These values indicate good inter-observer agreement and good agreement between humans and machine. The magnitude of MAD(H, M) is comparable to that of the inter-observer MAD, meaning that T_M could be used in further research, as the model does not deviate from the observers more than they do from one another. Figure 5 shows the timestamp annotated by each observer (including the model and repeat sequences) for each sequence.
Intra-observer reliability was assessed using the mean absolute difference in seconds, and consistency using the standard deviation (σ). Observer 1 had a MAD of 0.55 ± 0.88 s (µ ± σ), observer 2 had 0.68 ± 1.47 s, and observer 3 had 0.36 ± 0.48 s. The pooled standard error was 0.27 s. The standard deviation is preferred here over the standard error, to quantify the variability in the differences between and within observers in annotated timestamp, independently of the number of samples. These results indicate very good intra-observer reliability, under 1 s on average. Finally, we compared the annotations to the automated detections visually using the Bland-Altman plots in Figure 4. The upper-left plot shows most points to be centered around 0, without signs of consistent bias from the model. More importantly, the spread was similar when comparing observers to the algorithm and observers to one another.

DISCUSSION
The ICC values show a good agreement between the automated model detection and the human observers in detecting the onset of cows' rising motions, according to previous research on the use of ICC as a reliability metric in animal motion scoring (Kaler et al., 2009). The ANOVA demonstrated a significant observer effect, strengthening the claim that observations of cows' movements are prone to individual variations. The post-hoc test showed a significant difference in annotated timestamp between all pairs of observers, but the difference between the model and the observers was not significant. We conclude from this that the model's detection lies somewhere in between the observers' annotations. The mean difference of −0.06 s between the observers and the model, in Figure 4, and the proximity of the points to 0 show that there is no systematic bias introduced by the automated detection. This latter finding is also supported by Figure 5, showing the timestamp annotated by each observer at each sequence, in which there is no evidence of the detection being consistently divergent from the human annotations, as the triangular points (model) are not systematically above or below the circular ones (observers). We also see that the predictions do not tend to be further from the annotations than the annotations are from each other.
This agreement is a crucial step in validating the capability of 3D computer vision to accurately identify this specific kinematic feature of bovine behavior. Notably, the findings suggest that the model's performance does not considerably differ from that of human observers when compared with the variability among human observers. This suggests that the model does not introduce a distinct source of error in the detection process. While there are discrepancies between the model and human observations, the magnitude of these divergences is not meaningful in comparison to the overall duration of the LTS transition.
Some limitations are important to mention, however. One such limitation is the likely over-representation of specific individuals. The animals were filmed in a limited area of the barn, and we can expect a degree of site fidelity from the animals (Vázquez Diosdado et al., 2018), leading to some individuals being over-represented. As there was no individual identification, correcting for individual was not possible. It is also unlikely that all recorded bouts were spontaneous; some may have been triggered by human intervention or by the presence of agonistic individuals. Bout motivation could introduce changes in kinematic patterns and velocity, and potentially impact the accuracy of the automated detection.
Limitations also exist regarding external validity, as the study was conducted with a single cubicle design, over a limited period of time, and using manually selected video sequences. This manual selection work, upstream of the automated processing, is an important limitation which drove the high sensitivity and precision. The same system should be tested on continuous recordings. To counterbalance this limitation, however, the posture transition is an evident behavior, with a large difference in key-point height before and after, which would easily be captured even with noisy key-points by simply following the height of the cow's back.
The scope of this study was determined retrospectively; the decision to compare the automated detection to manual annotations was made after collecting the data and visually identifying LTS motions. The inclusion criteria were based on data quality and not on experimental considerations. The exclusion of 22 longer bouts discarded important information with implications for the most vulnerable individuals when it comes to stall comfort, as a long pause during the posture transition is associated with adverse welfare outcomes (Zambelis et al., 2019). The study's gold standard was human observation, which is known to be variable across observers, due to individual subjectivity. Although there is a bias incorporated in the model, this bias is consistent across observations. The accuracy of the model could be improved both by altering the ethograms to make them more "machine-learnable" (Brouwers et al., 2023), and by diversifying the data. Importantly, although human observations are biased, humans are rarely "completely off," especially when the phenomenon at hand, such as a posture transition, is evident. Algorithms, on the other hand, sometimes produce unexpected results, and monitoring and understanding their occurrence is essential for practical application. For instance, a difference of 6 s was found between the model and observer 2 in one sequence. Despite such outliers, this technology is still able to deliver meaningful information either at herd or at cubicle level. The automated detection through 3D computer vision could, after further validation, serve as a new gold standard for the task of detecting LTS transitions (and other movements), similarly to how interpreting accelerometer data has become standard in behavior classification of ruminants (Riaboff et al., 2022).

CONCLUSION
In summary, our results demonstrate good agreement between human observations and automated detection of cows' rising motions. Notably, they indicate that the model introduces no more bias than human observers. This finding validates the use of multi-view 3D pose estimation for detecting the onset of rising motions in bovine behavior, albeit under the conditions of a single farm. Automating the task with computer vision presents an opportunity to scale up bovine kinematic measurements and behavior monitoring, and to apply objective methods to their study.

Notes
The authors thank the Swedish Research Council (Formas) for funding this research, the personnel of the Swedish Livestock Research Centre at Lövsta for their outstanding support, and Sony Nordic for their extensive collaboration. The authors declare that Sony Nordic has contributed to this research in kind and in staff hours. Sony Nordic provided the technology to generate the 3D pose. Conceptualization, study design, statistical analysis and presentation of the results were decided by researchers at the Swedish University of Agricultural Sciences. Sony Nordic contributed to drafting the methods section regarding key-point acquisition in 3 dimensions, and to revising the final manuscript. This study was not conducted with the purpose of supporting a commercial claim.

Figure 1. Schematic of the portion of the stall where recordings were obtained. The greyed-out areas are passageways unavailable to cows. Thick borders mark the stall boundaries, while dashed lines indicate that there is a continuing area accessible to the cows beyond that shown here. Cameras are represented by red circles, placed between 2.8 and 3.6 m high. The parallel rectangles are cubicles; data was collected in the ones marked with a star.
Kroese et al.: 3D pose estimation for cow bout detection

Figure 2. This figure shows both the 2D pose estimation and the 3D fusion of 2 cows. As a header is the 2D result, showing the synchronized frames from cameras 0 to 6, onto which predicted bounding boxes and key-points are overlaid. The rest of the scene shows the projection of the 2 cows from key-points in 3D. Cameras 4 and 6 are represented as magenta and gray cuboids, respectively, in their spatial position relative to each other and to the cows. A projection of the frames from cameras 4 and 6 (identical to those in the 2D images above) is shown in front of each camera's 3D representation. The 5 other camera representations are not displayed from this angle, and camera 4 occludes the view from camera 0 because of the choice of angle. Only 4 of the key-points shown in this figure were used in the study.
Figure 3. 3D pose estimation tracked the coordinates of anatomical landmarks of dairy cows. This figure shows the Z coordinate (height) of a cow's head, withers and sacrum throughout a lying-to-standing motion. Initially, the low variability on the vertical axis indicates that the cow is lying still. At about 11 s, the withers (orange) rise gently as the cow sits on its carpi, followed by lunging with vertical bobbing of the head (blue) from 12 to 17 s. The sacrum (green) rises rapidly soon after, describing a sigmoid. There is a pause on the carpi, with the sacrum already up, from 16 to 20 s. The cow has risen by the 22 s mark. The vertical dotted line shows the onset of the posture transition detected using linearly penalized segmentation. This example was cherry-picked for clarity.

Figure 4. Bland-Altman plots comparing the timestamp of onset of cows' rising motions annotated by human observers to that predicted by the model. 3D pose estimation provided the coordinates of cows' anatomical landmarks. Detecting breakpoints in the key-point motion enabled detection of the onset of rising.

Figure 5. Annotated timestamp by each observer and by the model. The discrete x-axis shows each lying-to-standing sequence. On the y-axis is the timestamp of the onset of the posture transition, annotated by each observer or predicted. The earliest timestamp in each sequence is subtracted from each annotation.

Table 1. Inter-observer agreement. The table contains the mean absolute difference (MAD) ± σ between the annotations of all pairs of observers (including the model). Note that pairs between observers calculate the MAD on 30 sequences, whereas pairs with the model include an additional 10 annotations, unique to each observer.