Invited review: Assessing experimental designs for research conducted on commercial dairies☆
Article Outline
- Abstract
- Introduction
- Experimental Design Concepts
- Factors Influencing Statistical Power
- Power Determinations for Continuous Responses
- Statistical Power for Binary Responses
- Eliciting VC Specifications
- Scope of Inference
- Conclusions
- Supplementary data
- References
- Copyright
Abstract
Because of increasing constraints placed on conducting large studies at universities, more research is being conducted on commercial dairies, thereby raising some implications for experimental designs and data analysis. For example, experimental units are often specified to be pens of animals in on-farm studies, thereby requiring that at least 2 pens be used per treatment group in a single-dairy study. Even when treatments are compared within pens, the precision of inference on treatment differences is still primarily limited by the number of pens in the study, rather than the number of cows per treatment in each pen. Other challenges with on-farm studies include proper blocking and randomization of cows or pens to treatments. On the other hand, multiple farm studies are attractive, because they facilitate a broader scope of inference on treatment effects across a wider range of management or climatic conditions and genetic backgrounds compared with single-site university studies. Furthermore, studies based on multiple farms or multiple pens within a single large farm can facilitate greater power for treatment comparisons on binary reproduction or health responses than can be achieved at a smaller research herd. Because quantitative geneticists have been analyzing commercial dairy data for decades, they have developed useful data analysis techniques that should be harnessed to facilitate even greater statistical scope and power for on-farm studies.
Key words: experimental design, experimental unit, ecope of inference
Introduction
Dairy cattle research conducted at university experiment stations has been and will continue to be vitally important in terms of providing knowledge that can be readily used by the dairy industry. University herds are generally best suited for basic fundamental research geared toward determining various physiological mechanisms or modes of action that are important for dairy production (St-Pierre and Jones, 1999). This is particularly true when constant vigilance, intensive sampling, proximity to laboratories, and compliance to the experimental plan are vital.
Recently, various funding and environmental regulatory pressures have increasingly redirected more applied dairy research away from universities to commercial dairy venues. On-farm studies are often less expensive than applied research at universities on a per-herd basis; furthermore, they may be sometimes more relevant to modern production systems than studies conducted in smaller and often very differently managed research herds (St-Pierre and Jones, 1999). Nevertheless, there are several critical issues to consider when conducting on-farm research in commercial dairies. First, control of randomization of cows to treatments may be compromised as dependent upon the concerns of the farm manager, particularly when it is rather uncertain whether or not some treatments may have a detrimental effect on various aspects of dairy cattle performance. Second, and concomitant with increasing average herd sizes, a greater number of farms house and manage animals together in pens, thereby complicating issues for true experimental replication (St-Pierre, 2007).
Quantitative genetics researchers have been analyzing on-farm dairy data, primarily DHIA records, for many decades and have generally been quick to adopt cutting edge statistical tools for the analysis of such data. As the clearest example of this, the mixed model is the most commonly used statistical model for the analysis of many university and experimental research designs, with its development being particularly inspired by problems in dairy cattle breeding (Henderson, 1953, Henderson et al., 1959). In fact, it was within the context of these applications that optimal methods of statistical inference for the mixed model were proposed [i.e., an efficient algorithm for providing generalized least squares estimates of treatment effects and BLUP of random effects (Henderson et al., 1959;
Goldberger, 1962)]. Mixed models do not only provide the backbone for virtually all genetic evaluation systems for livestock today, but generalized least squares is used in statistical software like SAS PROC MIXED/GLIMMIX (Littell et al., 2006) to properly analyze dairy research data.
It is the intent of this review to provide some direction for the design and analysis of on-farm dairy research, primarily providing an extensive power analysis for some common designs that involve pens or groups of animals, whether it be for continuous production responses or for binary health-reproduction responses. It will be particularly emphasized that different sizes of experimental units (cows versus pens versus herds) will determine the appropriate scope of inference for the on-farm study at hand. The potential utility of statistical model extensions, previously proposed by quantitative geneticists, for facilitating a broader scope of inference is discussed further as well.
Experimental Design Concepts
There are several basic principles that are essential for proper experimental design, whether they be applied to university research stations or commercial dairies. First, treatments should be randomly assigned to experimental units. When individual cows are randomly assigned to treatments, such that cows in the same treatment are not grouped or managed together, cows then serve as the experimental units as in these completely randomized designs (CRD). Furthermore, when the experiment can be conducted over several different time periods such that all cows receive all treatments but in a different time order or sequence, as in a Latin square or crossover design, then cows serve as a blocking factor along with time periods. It is well established that the statistical power of the crossover design generally exceeds that of the CRD for the same number of cows, as treatments are compared within cows in the crossover. It is important to note, however, that even before randomization for such studies, whether based on CRD or block designs, the cows chosen for the study are assumed to be a random or representative sample from a conceptually larger population of cows that represents the target population for inference (e.g., primiparous Holsteins).
When cows are grouped together in pens or paddocks as in large dairies, then the same randomization principles essentially apply except that now pens of cows, rather than individual cows themselves, represent the experimental units. In other words, it is essential that pens are randomly assigned to treatments in a CRD. Similarly, different time sequences of treatments are randomly assigned to each pen in a crossover study, just as they would for cows in individual cow studies. For example, if there are 2 treatments, A and B, to compare within each pen across 2 time periods, then one-half of the pens would be randomly assigned to the A → B sequence (i.e., receiving A in the first period and B in the second period), whereas the other half of the pens would, by default, be assigned to the B A sequence.
Of course, several experimental units per treatment are required to establish experimental error such that at least 2 pens must be randomly assigned to each treatment; however, we should be quick to note that 2 pens per treatment may not generally ensure sufficient statistical power, even if one could indefinitely increase the number of cows within each pen. Cox (1980) likely encountered many difficult statistical consulting sessions with dairy scientists when he writes: “One could merely assume that the effects of pens were negligible. This is equivalent to assuming that, when feeding a single diet, that the gains of two animals in the same pen are no more alike than the gains of two animals each in different pens.” This temptation to then define individual cows as the experimental unit in pen studies has too often slipped into peer-reviewed literature, as also noted by St-Pierre and Jones (1999), such that Cox (1980) later bluntly retracts: “…pen-to-pen variation has been important too often to make such an assumption credible.” The Journal of Dairy Science in its Instructions to Authors clearly stipulates that “The experimental unit is the smallest unit to which an individual treatment is imposed. For group-fed animals, the group of animals in the pen or the paddock is the experimental unit; therefore, groups must be replicated.” It is important to further indicate that simply including pens as a source of variation in the model does not necessarily resolve this issue, particularly if pens are treated as fixed effects. If this approach is taken, then the scope of inference would be narrow (McLean et al., 1991); that is, any inferences derived from such an analysis would only apply to those specific pens used in the study. Further warning on biased estimates and type I error rates when treating individual cows as the experimental units for pen studies was also recently provided by St-Pierre (2007).
Factors Influencing Statistical Power
A pressing issue for on-farm studies, in which pens are the experimental units, involves an appropriate choice for the numbers of pens and cows per pen. That is, how many pens or groups of animals per treatment should be considered when pens are the experimental units? Would it ever be worthwhile to conduct an experiment on one particular farm if there were only 2 or 3 pens per treatment?
To properly address this question, it is important to review what influences classical statistical power. Recall that statistical power is the probability that one would correctly conclude that 2 or more treatments have different mean responses. For the purposes of this review, we’ll consider power to be sufficient if this probability exceeds 80%; however, depending on the context, anywhere between 70 to 99% power might be deemed acceptable or absolutely necessary.
Most researchers understand that the larger the true mean difference between treatment groups, the greater the statistical power will be. Power also depends on the specification of a desired type I error rate, denoted as α, which is the probability that one will conclude that 2 or more treatments have different mean responses when in fact there are no treatment effects. Typically, journals like the Journal of Animal Science or the Journal of Dairy Science dictate α to be less than or equal to 0.05, and certainly no greater than 0.10, for a researcher to be able to claim a statistically significant result. Beyond the true mean difference and the type I error rate, a third primary determinant of power is the number of experimental units. Again, depending upon the design and scope of the study, this may be defined by either the number of cows or pens or even the number of farms as described later.
However, statistical power vitally depends upon a fourth and perhaps least understood component, at least with respect to the design of pen studies. Variability from different sources of random effects as measured by the corresponding variance components (VC) may have the greatest effect on comparisons between various experimental designs for statistical power. As an aside, VC estimation has been a primary interest of dairy quantitative geneticists since Henderson (1953). First, there is the variability,
, between multiple measurements of the same response within cows, which is often characterized as the within-cow or measurement error VC; quantitative geneticists typically refer to
as the temporary environmental variance (Van Vleck, 1993). The VC characterizing variability between cows
describes the biological or innate variability between cows and is generally referred to as the permanent environmental variance (Van Vleck, 1993). Let us further denote
as the VC measuring variability between pens and
as the VC measuring variability among herds (or among herds during a particular season within a particular year) as might be important in a multiherd study. It is important to realize that the arrangement of the design factors, specifically cows, pens, herds, or all three, define the design structure for the experiment. Elements of these design factors are typically specified to be random. That is, the cows, pens, or herds actually used for a particular study are believed to be a random subset of all cows, pens, or herds that could have been used for the study such that it is the general intent of the investigator that conclusions drawn from a particular experiment apply to all cows, pens, or herds from which these entities are considered to be representative of, not just to those that were actually used in the study.
There are other sources of variability due to interactions that are critical to specify in some on-farm experiments; in fact, by definition, treatment × design structure interactions define the nature of replication in experimental designs (Milliken and Johnson, 1984). For example, denote
as the VC due to pen × treatment interaction. That is,
can be used to express the VC of true treatment differences across pens such that if treatment differences are not truly variable across pens, then
. Similarly,
can be used to define the VC due to cow × treatment interaction (i.e., the true treatment difference is variable or random across cows), whereas
defines the VC due to herd × treatment interaction (i.e., true treatment difference is variable across herds).
Depending on the design and scope of inference for the study as discussed later, any of the aforementioned VC could combine to determine the experimental variability and hence the statistical power for a particular study. Define the corresponding composite VC as
. Note that for all possible experimental designs,
is always a function of
; furthermore,
may be a function of any of
,
,
,
,
, and
for different sizes of experimental units (i.e., cows, pens, and perhaps even farms). Along these same lines, the number, n, of experimental units per treatment may be defined by cows, pens, or farms, again depending upon the design and scope of inference.
Power Determinations for Continuous Responses
This section concentrates on power determinations for continuously recorded responses, like milk production or weight gain, and assumes that all such responses are normally distributed. Suppose that a study is conducted to compare 2 treatment groups, A and B, having respective unknown true means of μA and μB. Hence, the true unknown mean difference is δ
=
μA
−
μB, and it is desired to have sufficient power to conclude that H1: δ
≠
0 if indeed δ
≠
0. For a particular type I error rate of α, a specific number of experimental replicates (n) per treatment, and a particular design structure, the single most important function of δ and
influencing statistical power is the effect size:

defines the standard deviation between experimental units and may be a function of any number of constituent VC as previously noted. As a side note, for the joint comparison of t > 2 treatments having means μ1, μ2,…, μt, the use of Δ might be replaced by a particular function of
(O’Brien and Lohr, 1984;On many university experiment research stations involving individually treated cows, Latin square designs are typically used with cow as 1 blocking factor and period as the second blocking factor. Hence, the number of cows defines the number of experimental replicates with
given that treatments are compared within cows. For the case of replicated 4
×
4 Latin squares and double crossover designs for individually treated cows, statistical power determinations were determined for various numbers of cows by Tempelman (2004), who noted that, despite the wide myriad of possible choices for δ or σe, statistical power is simply a function of their ratio
. The following sections characterize statistical power for 3 common designs based on the use of pens of animals as being the experimental units.
CRD
Consider the power of test for a CRD where pen is the experimental unit; that is, a number (n) of pens are randomly assigned to each of 2 or more treatments within a herd. In this case,

and
refer to the VC characterizing the variability between cows within treatments and between pens within treatments, respectively. Note that here
defines the variability between the pen sample means; that is, the ANOVA expected mean square (EMS) for pens within treatments divided by the number of observations (d) on each pen or experimental unit. If the study is conducted within only 1 herd, then
does not depend upon
or
as indicated above; however, then the scope of inference is substantially limited to that herd as described later. For the situation where there is d
.Figure 1a displays the power curves for the CRD design for different number of pens per each of 2 treatments and for different values of Δ ranging from 0.75 to 2.5 and α
=
0.05. For example, suppose that it is desired to detect a true mean difference of δ
=
2
kg of milk, and it is known from previous studies or experience at a particular commercial dairy that
. Suppose that this dairy groups animals in pens such that there are d
=
8 animals per pen. Then the experimental variability between pens, being the experimental units, is defined by



Figure 1.
Statistical power of test versus number of experimental units for effect sizes (△) equal to 0.75 (○), 1.00 (●), 1.25 (□), 1.50 (■), 1.75 (♢), 2.00 (♦), 2.25(△), or 2.50 (▴) for comparing 2 treatments in a a) completely randomized design, b) single crossover design, and c) randomized complete block design with subsampling.
Now suppose it was possible to have 20 instead of 8 animals per pen. Then using the same VC specifications as provided,


It should be always clearly noted that
; that is, experimental variability will never be less than the variability that exists between pen effects, regardless of the number of animals per pen. That is, for the particular mean difference δ and VC specifications provided above, Δ is always less than 2, for instance,

below 1Now suppose that this study could be conducted such that animals are individually penned and fed (i.e., d
=
1). Then, using the VC values as provided,
such that

would be the same for individually fed versus group-fed animals. Using the Δ
may be substantially smaller for individual animals in tie-stall facilities than for grouped animals in large commercial dairies, especially if pens in the latter are large or rather distal to each other. In spite of this possibility, one would likely need far fewer cows in total for individual cow studies than for pen studies. For example, based on the given VC specifications, 16 individually fed cows per treatment would provide 80% power to detect a mean difference of δ
and
are not separately estimable in individual cow CRD studies such that their sum is used and relabeled as the between cow VC
.Single Crossover Design
Consider a 2-period simple crossover (i.e., 2
×
2 Latin square) study comparing the 2 treatments A and B involving n pens in which one-half (n/2) of the pens are assigned to the A→B treatment sequence with the remaining pens assigned to the B→A sequence over the 2 periods. With d animals per pen,

would be generally combined and labeled together as the within-cow VC, because its 2 components are not separately estimable. Here,
is derived as the ANOVA EMS for pen × treatment, divided by the number (d) of observations on each experimental unit or pen × treatment combination. Note that, unlike the CRD,
no longer depends upon either the between-cow
or between-pen variability
, because treatment comparisons are made both within cows and pens. However,
does depend upon the variability in true treatment differences across pens as measured by
. If d
; hence, it would not be further possible to separate any of the VC in
such that
would then be labeled as the within-cow VC. Nevertheless, as noted similarly for the CRD design, the magnitude of
for individually treated animals may not be the same as that for group-treated animals because of the potentially different nature of pen effects for individually treated versus penned animals.Statistical power versus n for the simple crossover design is provided in Figure 1b for various specifications of Δ ranging from 0.75 to 2.50. Note that n refers to the total number of pens used in the crossover study. Suppose that
,
, and each pen is to have d
=
6 cows. With an expected treatment mean difference of δ
=
2
kg and




always would be greater than
such that the effect size for these particular specifications would have an upper bound of
if δ
, or some combination of an increase in δ or a decrease in
thereof. Please note that, regardless of the VC values, a minimum of 2 pens per treatment sequence (i.e., a total of nRandomized Complete Block Design with Subsampling
There has been another increasingly common misconception with studies on commercial dairies that if different treatments could be randomly assigned to cows within the same pen (e.g., using Calan gates), then cows would define the experimental units for comparing treatments. However, even then, it is still the number of pens that defines the number of experimental units. This would certainly be the case if all cows inadvertently received the same treatment within a pen. The experimental variability for this randomized complete block design with subsampling (RCBDS) design is inherently based on pen × treatment interaction, as also so clearly exposited by St-Pierre (2007).
With d animals assigned to each treatment within each pen (i.e., a pen size of 2 d animals), the experimental variability for the RCBDS is

. As for the single crossover design,
is derived as the ANOVA EMS for pen
is slightly larger with the RCBDS compared with the crossover design for 2 primary reasons: 1)
is now included, because treatments are not compared within cows as they were with the crossover design, and 2) d = number of animals per pen per treatment in the RCBDS, whereas d = number of cows per pen in the crossover design. That is, the pen density would be twice as great in the RCBDS compared with the crossover design for the same value of d.Statistical power as a function of the number of pens (n) and effect size (Δ) for the RCBDS is presented in Figure 1c. From a quick comparison of Figures 1b and 1c, it might seem that for pen studies the single crossover design is more powerful compared with the RCBDS for the same value of Δ. This seems particularly so, because the size of
is larger with the RCBDS, thereby requiring a larger mean difference of δ for the RCBDS to compensate in determining Δ. Furthermore, the 2 different definitions of d reflect that greater pen densities would be required for the RCBDS to be competitive with the single crossover design. However, one practical disadvantage of the single crossover design is that the experiment needs to be conducted over 2 periods in contrast to the CRD or the RCBDS. Suppose that
, whereas
in the RCBDS,
being larger in the RCBDS than that specified for the single crossover example because of the additional
. Furthermore, suppose each treatment is to have d
=
6 cows within each pen (i.e., pen size of 12 in the RCBDS). With an expected mean treatment difference of δ
=
2
kg and


Statistical Power for Binary Responses
Many responses of interest involving research on commercial dairies involve reproduction or health characteristics that are measured on a binary scale (e.g., pregnant versus nonpregnant, diseased versus healthy). Generalized linear mixed models (GLMM) have been used by quantitative geneticists to provide genetic evaluations on animals for nonnormal or categorical responses such as calving ease in cattle (Gianola and Foulley, 1983). The probit or threshold mixed model is one such GLMM that is particularly well-suited and biologically interpretable for the analysis of various binary reproduction and health characters in dairy cattle (Tempelman, 1998). This model assumes that an observed binary outcome, for example, disease or pregnancy, is determined by an underlying and unobserved normally distributed variable called liability. If the liability for a certain animal exceeds a specific threshold, then the animal is observed to be diseased; otherwise, it is healthy (Falconer, 1986). For example, suppose that the binary outcome (yes or no) of clinical mastitis observed on any particular cow is determined by an underlying normally distributed disease load. If the disease load or liability of the animal exceeds a certain threshold, then a clinical mastitis outcome is observed; otherwise, it is not.
Labeling disease as the “success” outcome, the threshold mixed model is typically defined such that the probability of success for a treatment i is pi
=
Φ(μi), where Φ(·) indicates the standard normal distribution function and μi indicates the mean response for treatment i on the underlying normal scale. Note that tabulated values of this function are commonly provided as the first appendix table near the back of most statistical methods textbooks. For example, Φ(1.96)
=
0.975 is well known for its use to provide 95% confidence intervals or 2-tailed z-tests on the means or mean differences. The inverse of this function is referred to as the probit function Φ−1(·); for instance, Φ−1 (0.975)
=
1.96.
Suppose, as before, one uses the CRD design whereby several pens are assigned at random to each treatment, each pen having d animals per pen. For an experimental unit or pen j assigned to a particular treatment i, the expected proportion of the d cows having the “success” outcome can be written using the GLMM as pij
=
Φ(μi
+
rij), where rij = the random experimental effect associated with pen j within treatment i. Here, rij is a random draw from a normal distribution with variance

models experimental variability, often referred to as overdispersion in GLMM (Tempelman, 1998), on the underlying liability scale.Suppose that a binary health response is recorded and one wishes to know whether or not there is sufficient power to be able to reject Ho: p1
=
p2 in favor of H1: p1
≠
p2, where p1 and p2 represent the incidence rates of disease or pregnancy for 2 treatments (1 and 2), respectively. The mixed model power determination procedures of Stroup (2002) can be adapted for computing power for the threshold model. In a threshold model analysis using software such as SAS PROC GLIMMIX, the number of diseased cows over the total number of animals per each experimental unit (i.e., pen) defines the response variable (Littell et al., 2006); SAS PROC GLIMMIX will also estimate VC for any specified random effects. In a power analysis, one would substitute dpi over d for the total number of diseased cows over all cows, respectively, for any pens associated with treatment i, specifying any VC contributing to
to be known or fixed. The resulting output F-test statistics can then be used to determine the noncentrality parameters that are subsequently used to approximate the power, similar to what is done in Stroup (2002) and Kononoff and Hanford (2006). Recently, Walt Stroup (University of Nebraska, Lincoln; personal communication) has determined based on simulation analyses that these power approximations are reliable for GLMM analysis of binary data.
Figure 2 has power curves to detect a difference between 2 incidence rates of p1
=
0.05 and p2 ranging from 0.10 to 0.45 for a CRD where cows are the experimental units (i.e., cows are individually treated). Furthermore, no overdispersion or experimental variability is assumed on the underlying scale, for instance,
. In other words, the power determinations using Figure 2 are likely to be slightly overstated for real applications. Nevertheless, one can clearly see that at least 425 cows per treatment would be required to find a difference in incidence rates between 5 and 10%, whereas at least 150 cows per treatment would be required to find a difference in incidence rates between 5 and 15%. Clearly, individual cow studies pose substantial logistical and cost difficulties when conducted on this scale.

Figure 2.
Statistical power of test for comparing 2 treatments for differences in incidence rates of binary responses given an incidence rate of 0.05 for 1 treatment and various incidence rates for the second treatment (0.10 ○, 0.15 ●, 0.20 □, 0.25 ■, 0.30 ♢, 0.35 ♦, 0.40 △, and 0.45 ▴) as a function of number of individually treated cows per treatment.
Figure 3 displays a panel of power curves in a CRD to detect a difference between 2 incidence rates of p1
=
0.05 and p2 ranging from 0.10 to 0.45 and as a function of the number of pens per treatment for d
=
24 or 48 cows per pen and

is relatively small
, there would be no situation in which 2 pens per treatment with 24 cows per pen would provide sufficient power of test to detect a difference in incidence rates as large as that between p1
between pens (Figure 3b). However, not even 4 pens per treatment with 48 cows per pen would be sufficient to detect a difference in incidence rates between 0.05 versus 0.15 with sufficient power. Of course, as
increases, power of test would further deteriorate as illustrated for
(Figure 3c and d) and
(Figure 3e and f). Nevertheless, pen studies have merit from a management viewpoint, because they require substantially fewer experimental units to achieve the same statistical power compared with large individually treated cow studies. For example, from Figure 3a, 5 pens per treatment, each pen with 24 cows, would be required to detect a difference in incidence rates between 0.05 and 0.20 with sufficient power, whereas from Figure 2, eighty-five individually treated cows per treatment would be required. Although the pen study requires more animals per treatment (5
Figure 3.
Statistical power of test for comparing 2 treatments for differences in incidence rates of binary responses given an incidence rate of 0.05 for 1 treatment and various incidence rates for the second treatment (0.10 ○, 0.15 ●, 0.20 □, 0.25 ■, 0.30 ♢, 0.35 ♦, 0.40 △, and 0.45 ▴) as a function of the number of pens per treatment. Experimental variability
and number of cows (d) per pen set equal to a)
, d
=
24, b)
, d
=
48, c)
, d
=
24, d)
, d
=
48, e)
, d
=
24, and f)
, d
=
48.
The unfortunate limitation of Figure 3 is that it only applies to specifically chosen values of p1, p2, d, and
, whereas many readers may desire to consider alternative specifications. It is possible to derive an approximate effect size for a threshold mixed model analysis based on any combination of values of p1, p2, d, and
.
[1]
, defining variability between experimental units, but also pure binomial variability. That is, even if there was no experimental variability between pens, such that the probability of having the success outcome was identical for all such pens, the resulting outcomes for different pens will still differ because of binomial sampling variability. Recall that
is a measure of overdispersion that takes into account additional biological or environmental sources of variability. Exploring equation [1] further, one could determine for the same difference between p1 and p2 that Δ, and hence statistical power, will be larger for detecting a particular difference between 2 extreme proportions (e.g., between 0.05 and 0.10) than between 2 intermediate proportions (e.g., between 0.45 and 0.50).Once Δ is computed using equation [1] for binary responses in the CRD design, then it can be used to determine power using Figure 1a. For example, from Figure 3d, a CRD with n
=
5 pens per treatment, based on d
=
48 cows per pen and an experimental variability between pens of
on the underlying liability scale, would provide nearly 80% power for detecting a difference between incidence rates of 0.05 and 0.20. Using these same specifications in equation [1], the numerator is


Similarly, one can use equation [1] for evaluating power for binary data for different single crossover scenarios in Figure 1b, remembering, however, that now n is equal to the total number of pens rather than the number of pens per treatment, and that the specification for
will also depend upon the study. For example, suppose that a single crossover design is used, and it is anticipated that the experimental variability is

, recalling that d refers to the number of cows per pen used over the 2 periods. Also, suppose that dEliciting VC Specifications
One often troubling question is what are appropriate VC values to specify for a statistical power analysis for pen studies? On many commercial dairies, it should be possible to use historical records for some traits such as milk production to have some indication of the magnitude of the various VC. Furthermore, based on extensive quantitative genetics research using test-day models, reasonable specifications for
vary from 2
kg2 in late lactation to 10
kg2 in early lactation, with
generally increasing with parity (Jensen, 2001;
Tempelman, 2004). Given that the reported repeatabilities
(Van Vleck, 1993) for milk production often approach 0.50 (Dong et al., 1988), this also suggests that
is of a similar magnitude as
However, it seems that values of
and
might be herd-dependent (Weigel et al., 1993), and it is even more likely that values of
might vary across herds for pen studies. In other words, there is likely greater variability between pen performance within some herds than in others due to differences in management or orientation of pens.
Because not enough dairy studies may have been conducted with the necessary precision to determine suitable values for various VC of interest, one reasonable strategy for eliciting VC specifications is based on the commonly used range approximation as exposited in various statistical methods textbooks (Gill, 1978;
Ott and Longnecker, 2001). The empirical rule for the normal distribution dictates that 95% of the effects for a particular random factor (e.g., cows or pens or herds) should fall within ± 2 standard deviations of the mean, or equivalently, within a range of 4 standard deviations. Here, standard deviations represent the square root of the VC of interest. Hence, the range (R) of the effects divided by 4 roughly approximates the standard deviation or square root of the respective VC for the random effects in question. For example, suppose that on a particular farm, one expects the range of true mean pen performances for milk production (on the same treatment) to be as large as 2
kg between pens of greatest and least production. Then, σp
≈
R / 4
=
2 / 4
=
0.5
kg such that
would be a reasonable specification for that VC.
One could also elicit VC for random interaction effects involving treatments like
or
in a similar manner. Suppose that Rδ denotes the range of differences between 2 treatments across levels of the random effects factor (e.g., pens); then,

). For example, suppose that the true mean difference between 2 treatments is specified to be δ

Eliciting appropriate values for VC in threshold mixed models is a little more challenging, because that variability is characterized on an underlying unobserved scale. Furthermore, the observed effect of a particular VC value at one incidence rate is not necessarily the same as it is for another incidence rate. Extending the empirical rule, Figure 4 demonstrates the magnitude of 4σr on the true range in incidence rates across experimental units by plotting 3 sets of effective range bands (corresponding to 3 different overall incidence rates: p
=
0.05, p
=
0.25, and p
=
0.50) such that the width of the bands are defined by

. For example, the effect of an experimental variability of
on the underlying liability scale leads to an effective range of pen effects in incidence rates of [0.02, 0.12] when p
is large. So, for instance, if one anticipated a range in true incidence rates between experimental units (i.e., pens) on the same treatment to be as great as [0.13, 0.41], then
would be a reasonable specification.
Figure 4.
Empirical ranges (curves) of observed incidence rates of binary responses between experimental units (e.g., pens) based on overall incidence rates of p
=
0.05 (solid line), p
=
0.25 (short-dash line), and p
=
0.50 (long dash line) as a function of the experimental variability
on the underlying liability scale.
Scope of Inference
Thus far, this paper has concentrated entirely on ordinary replications; that is, the replication that ensures that treatments are replicated across several experimental units within a particular herd, whether they be pens or individually treated animals. However, if the intent of the study is that inferences from the study are applied to all herds, from which the study herd is considered to be somewhat representative, then herd effects are considered to be random effects. Hence, a single herd study provides no replication. That is, the scope of inference for a single herd study is such that any conclusions derived from such a study would strictly only apply to that particular herd. Of course, the same criticism could be levied against university-based experiment station research such that there should be continued support for multistate or regional university projects that involve replicating experiments at several experiment stations (Stevenson et al., 2006). Nevertheless, research station effects should be specified as random in multiuniversity projects such that treatment × station interaction serves as the error term for testing treatment effects; this would effectively specify research station as the experimental unit, much like what is done for meta-analysis when studies are treated as experimental units (St-Pierre, 2001).
If an on-farm study involves multiple herds, such that each treatment is used within each herd, and it is desired that the scope of inference includes all herds from which those used in the study are a random sample, then the treatment × herd interaction serves as the experimental error term for treatment. This specification is analogous to a RCBDS except that now each pen within a herd is considered to be the subsample. Therefore, for the situation in which a CRD is considered within each of several herds, the experimental variability to allow the broader scope of study should be modified to be

= the VC due to herd × treatment interaction as previously noted; b = the number of pens per treatment per herd; and

The importance of this level of metareplication across herds is further borne out in a more general discussion by Johnson (2006). Metareplication within a somewhat different context, involving the joint or meta-analysis of results from different dairy research studies, is discussed further by St-Pierre (2001), who also noted that each study (i.e., herd) should be treated as an experimental unit. Although the scope of inference is much greater with a multiherd study, far more resources are obviously required. Note that the experimental variability
will be affected primarily by either
or
, which represent variability between herds and their interactions with treatment, respectively. Hence, it would be vitally important to model potentially important herd-specific covariates that might lower either or both VC to increase statistical power. For example, it is well known that ambient temperature and humidity (Ravagnolo and Misztal, 2002) and herd size (Oleggini et al., 2001) influence various measures of dairy performance such that modeling their effects and their interactions with treatments could effectively decrease
or
, respectively. Consequently,
would be lowered such that Δ and hence statistical power would be enhanced.
Given the currently high inbreeding rates with dairy cattle (Weigel, 2001) and the concomitant increase in genetic relationships between animals both within and across herds, it might seem useful to consider modeling genetic effects to account for additional variability both within and between pens or herds. In fact, SAS PROC MIXED can be modified to do this provided one has a pedigree file available on all animals in the study (Tempelman and Rosa, 2004). Because of the complexity of the covariance structure in this case, it would be important to use the Satterthwaite or Kenward-Rogers option for estimating denominator degrees of freedom, as available in SAS PROC MIXED/GLIMMIX, for testing treatment differences (Littell et al., 2006). As multiherd databases accumulate and SNP genetic markers become progressively cheaper for cattle (Van Tassell et al., 2008), there will be more opportunities to pursue on-farm research that tailors drugs or feedstuffs for particular cow genotypes (Swanson, 2008), thereby providing some direction toward potentially more optimal strategies for grouping dairy cattle in pens. The scale of SNP association studies would likely require the need for thousands of animals (Iles, 2008) and hence the need for participating commercial dairies.
The use of multivariate mixed model analysis (Thompson and Meyer, 1986) would also facilitate the use of multiherd studies in which several performance traits (e.g., milk yield and its various components) are highly correlated. The benefits of greater precision on treatment effects for any one trait within a multivariate analysis over a series of trait-specific analyses was recently reviewed by Riley et al. (2007). The advantages of multivariate analysis are particularly noticeable when measurements on some traits are missing or not extensively recorded.
There are some complications associated with pen studies on commercial dairies that go beyond the issue of scope of inference and the definition of the experimental unit. One is the issue of commingling dairy cows from different groups, particularly if it happens on a regular (i.e., weekly) basis. Some commingling might be accommodated with the RCBDS as treatments are compared within pens. However, it is difficult to know how to accommodate this mixing with a CRD design. If mixing is common, it would be useful to measure a response that can be recorded as regularly as the mixing itself (e.g., daily rather than monthly milk yields) such that adjustments for important covariates (e.g., DIM) can be modeled using polynomial (Jensen, 2001) or spline (White et al., 1999) relationships. Another related complication relates to the allelomimetic nature of dairy cattle behavior (Nordlund and Cook, 2006) whereby, for example, cows are stimulated to eat when other cows are eating, such that pen feeding might not be entirely representative of individual feeding for its biological effect. However, this phenomenon might be somewhat negated by competition effects due to antagonistic interactions, particularly in overstocked pens, which may further complicate the proper specification of between-pen and between-cow VC (Van Vleck and Cassady, 2005). Inferences then on effects of interest (e.g., parity × treatment interaction) may be biased or too complex to estimate if animals of different parities are grouped together, realizing that multiparous animals are generally socially dominant to primiparous animals (Nordlund and Cook, 2006). Finally, some dairies may have only 1 group per production level or stage of lactation, further necessitating the use of a multiherd study, given the inability to replicate within herds.
Conclusions
Based on power analyses presented in this paper, it is difficult to imagine any scenario involving the comparison of 2 treatments in which 4 pens would provide enough replication, even when treatments are compared within pens as they are in the single crossover or the RCBDS. In fact, 4 pens would be minimally necessary just to estimate experimental error in the single crossover design. Naturally, an even greater number of pens would be required for comparing more than 2 treatments. Given that it might be particularly difficult to attain more than 1 pen per treatment per dairy, and that replication across herds provide a scope of inference that is broader than what is possible with a single dairy study, multiherd studies should be encouraged. These studies would then provide a scope of inference that is broader than what is possible with a single herd or single experiment station study.
Supplementary data
Interpretive summary.
References
- . Statistical Power Analysis for the Behavioral Sciences. 2nd ed.. Hillsdale, NJ: Erlbaum; 1988;
- . Design and analysis in nutritional and physiological experimentation. J. Dairy Sci. 1980;63:313–321
- . Effect of relationships on estimation of variance components with an animal model and restricted maximum likelihood. J. Dairy Sci. 1988;71:3047–3052
- . Introduction to Quantitative Genetics. 3rd ed.. New York, NY: Longman Scientific and Technical; 1986;
- . Sire evaluation for ordered categorical data with a threshold model. Genet. Sel. Evol. 1983;15:201–223
- . Design and Analysis of Experiments in the Animal and Medical Sciences. Ames: Iowa State University Press; 1978;
- . Best linear unbiased prediction in the generalized linear regression model. J. Am. Stat. Assoc. 1962;57:369–375
- . Estimation of variance and covariance components. Biometrics. 1953;9:226–252
- . The estimation of environmental and genetic trends from records subject to culling. Biometrics. 1959;15:192–218
- . What can genome-wide association studies tell us about the genetics of common disease?. PLoS Genet. 2008;4:e33
- . Genetic evaluation of dairy cattle using test-day models. J. Dairy Sci. 2001;84:2803–2812
- . The many faces of replication. Crop Sci. 2006;46:2486–2491
- . Estimating statistical power of mixed models used in dairy nutrition experiments. J. Dairy Sci. 2006;89:3968–3971
- . SAS for Mixed Models. 2nd ed.. Cary, NC: SAS Institute Inc; 2006;
- . A unified approach to mixed linear models. Am. Stat. 1991;45:54–64
- . Analysis of Messy. Data Designed Experiments. Volume I. Belmont, CA: Wadsworth Inc; 1984;
- . Commingling dairy cows: Pen moves, stocking density, and health. In: Pages 36–42 in The American Association of Bovine Practitioners Proceedings. Stillwater, OK: Frontier Printers; 2006;
- . Power analysis for univariate linear models: The SAS system makes it easy. In: Pages 840–846 in Proceedings of the Ninth Annual Conference, Cary, NC. Cary, NC.: SAS Users’ Group International, SAS Institute Inc; 1984;
- . Effect of region and herd size on dairy herd performance parameters. J. Dairy Sci. 2001;84:1044–1050
- . An Introduction to Statistical Methods and Data Analysis. 5th ed.. Pacific Grove, CA: Duxbury; 2001;
- . Effect of heat stress on nonreturn rate in Holsteins: Fixed-model analyses. J. Dairy Sci. 2002;85:3101–3106
- . An evaluation of bivariate random-effects meta-analysis for the joint synthesis of two correlated outcomes. Stat. Med. 2007;26:78–97
- . Invited review: Integrating quantitative findings from multiple studies using mixed model methodology. J. Dairy Sci. 2001;84:741–755
- . Design and analysis of pen studies in the animal sciences. J. Dairy Sci. 2007;90:E87–E99
- . Interpretation and design of nonregulatory on-farm feeding trials. J. Anim. Sci. 1999;77:177–182
- . Treatment of cycling and noncycling lactating dairy cows with progesterone during ovsynch. J. Dairy Sci. 2006;89:2567–2578
- . Power analysis based on spatial effects mixed models: A tool for comparing design and analysis strategies in the presence of spatial variability. J. Agric. Biol. Environ. Stat. 2002;7:491–511
- . Using genomic biology to study liver metabolism. J. Anim. Physiol. Anim. Nutr. (Berl.). 2008;92:246–252
- . Generalized linear mixed models in dairy cattle breeding. J. Dairy Sci. 1998;81:1428–1444
- . Experimental design and statistical methods for classical and bioequivalence hypothesis testing with an application to dairy nutrition studies. J. Anim. Sci. 2004;82:E162–E172
- Tempelman, R. J., and G. J. M. Rosa. 2004. Empirical Bayes approaches to mixed model inference in quantitative genetics. Pages 149–176 in Genetic Analysis of Complex Traits Using Sas. A. M. Saxton, ed. SAS Institute Inc., Cary, NC.
- . A review of theoretical aspects in the estimation of breeding values for multi-trait selection. Livest. Prod. Sci. 1986;15:299–313
- . SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods. 2008;5:247–252
- . Selection Index and Introduction to Mixed Model Methods. Boca Raton, FL: CRC Press Inc; 1993;
- . Unexpected estimates of variance components with a true model containing genetic competition effects. J. Anim. Sci. 2005;83:68–74
- . Controlling inbreeding in modern breeding programs. J. Dairy Sci. 2001;84:E177–E184
- . Identification of factors causing heterogeneous within-herd variance components using a structural model for variances. J. Dairy Sci. 1993;76:1466–1478
- . Genetic and environmental smoothing of lactation curves with cubic splines. J. Dairy Sci. 1999;82:632–638
☆ Support from the Michigan Agricultural Experiment Station (Project MICL01822) is gratefully acknowledged.
PII: S0022-0302(09)70304-2
doi:10.3168/jds.2008-1404
© 2009 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

