library(FSA) # for headtail(), hist(), col2rgbt()
library(dplyr) # for select(), mutate(), filter()
library(emmeans) # for emtrends(), emmeans()
Weight-Length Relationship for 3+ Groups
Dummy variable regression (DVR) was introduced in Chapter 7 of Ogle (2016) in the context of determining if the slope and y-intercept parameters of weight-length relationship regressions differed between fish captured in two different years. That analysis may be extended to more than two groups, though more dummy variables are required and special methods are needed to determine which pairs of groups (if any) differ. This supplement demonstrates how to extend the DVR to more than two groups.
Setup
Packages
Functions used in this supplement require the following packages. Also note that one function from lubridate and two functions from car are used below with :: so that the whole packages do not need to be loaded.
Data
Weights (g) and total lengths (mm) from Ruffe (Gymnocephalus cernuus) captured in the St. Louis River Harbor (Minnesota and Wisconsin) were used in Chapter 7 of Ogle (2016) and will also be used in this supplement. These data are from Ogle and Winfield (2009) and are in RuffeSLRH.csv. To eliminate within-season variability, only Ruffe captured in July are used here. Additionally, a factored version of year (fYear) and the common logarithms of weight and length were created, and the fishID, month, and day variables were removed to save space in the output.
ruf <- read.csv("../scripts/RuffeSLRH.csv") |>
  filter(month==7) |>
  mutate(fYear=factor(year),logW=log10(wt),logL=log10(tl)) |>
  select(-fishID,-month,-day)
headtail(ruf)
#R| year tl wt fYear logW logL
#R| 1 1988 78 6.0 1988 0.7781513 1.892095
#R| 2 1988 81 7.0 1988 0.8450980 1.908485
#R| 3 1988 82 7.0 1988 0.8450980 1.913814
#R| 1936 2004 137 28.0 2004 1.4471580 2.136721
#R| 1937 2004 143 31.4 2004 1.4969296 2.155336
#R| 1938 2004 174 82.4 2004 1.9159272 2.240549
For the first example below, only fish from 1990, 1995, and 2000 were examined. In the second example below, fish from 1992 to 1995 were examined.
ruf1 <- ruf |>
  filter(year %in% c(1990,1995,2000)) |>
  droplevels()
ruf2 <- ruf |>
  filter(year %in% 1992:1995) |>
  droplevels()
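As a quick check (a sketch using base R's xtabs(); this step is not part of the original code), the years retained in each subset and their sample sizes can be confirmed with the following.
## Confirm the years retained in each subset and the number of fish per year
xtabs(~fYear,data=ruf1)
xtabs(~fYear,data=ruf2)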
Full Model
The number of dummy variables required to represent \(k\) groups is \(k-1\). Thus, in Chapter 7 of Ogle (2016), only one dummy variable was required to represent the two groups. In this example, three groups (the years) are being examined and, thus, two dummy variables are needed. For example, lm() will ultimately treat the “1990” group as the reference group and create two dummy variables as follows:
\[ fYear1995 = \left\{\begin{array}{l} 1 \text{, if captured in 1995 }\\ 0 \text{, if NOT captured in 1995 } \end{array} \right. \]
\[ fYear2000 = \left\{\begin{array}{l} 1 \text{, if captured in 2000 }\\ 0 \text{, if NOT captured in 2000 } \end{array} \right. \]
These dummy variables are each combined with the \(log_{10}(L)\) covariate to construct the following ultimate full model
\[ \begin{split} log_{10}(W_{i}) &= log_{10}(\alpha) + \beta log_{10}(L_{i}) \\ &\mspace{16mu}+ \delta_{1} fYear1995 + \delta_{2} fYear2000 \\ &\mspace{16mu}+ \gamma_{1} fYear1995*log_{10}(L_{i})+ \gamma_{2} fYear2000*log_{10}(L_{i}) + \epsilon_{i} \end{split} \tag{1}\]
Substituting the appropriate values for the dummy variables into Equation 1 shows how it simultaneously represents the weight-length relationship regressions for all three years, as shown below.
Year | fYear1995 | fYear2000 | Submodel (\(log_{10}(W_{i})=\)) |
---|---|---|---|
1990 | 0 | 0 | \(log_{10}(\alpha) + \beta log_{10}(L_{i})\) |
1995 | 1 | 0 | \((log_{10}(\alpha) + \delta_{1}) + (\beta + \gamma_{1})*log_{10}(L_{i})\) |
2000 | 0 | 1 | \((log_{10}(\alpha) + \delta_{2}) + (\beta + \gamma_{2})*log_{10}(L_{i})\) |
From these submodels, it is apparent that \(\alpha\) is the y-intercept for the reference (e.g., 1990) group, \(\beta\) is the slope for the reference group, \(\delta_{i}\) is the difference in y-intercepts between the \(i\)th and reference groups, and \(\gamma_{i}\) is the difference in slopes between the \(i\)th and reference groups. By extension, the interaction variables measure differences in slopes and the dummy variables measure differences in y-intercepts.
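The dummy-variable coding that lm() will use can also be examined directly. The sketch below (using base R's contrasts() and model.matrix(); not shown in the original) displays the treatment contrasts for fYear and the first few rows of the model matrix that lm() builds from logL*fYear.
## Treatment-contrast coding for the three years (1990 is the reference group)
contrasts(ruf1$fYear)
## Dummy and interaction variables created by lm() for logL*fYear
head(model.matrix(~logL*fYear,data=ruf1))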
The model in Equation 1 is fit with lm() using a formula of the form y~x*factor exactly as described in Chapter 7 of Ogle (2016) (again noting that lm() will create the appropriate dummy and interaction variables).
fit1 <- lm(logW~logL*fYear,data=ruf1)
The linearity, homoscedasticity, and normality assumptions (Figure 1) appear to be largely met with this model.
par(mfrow=c(1,2))
plot(resid(fit1)~fitted(fit1),xlab="Fitted Values",ylab="Residuals")
hist(~resid(fit1),xlab="Residuals",breaks=12)
An ANOVA table is constructed (using Anova() from car) and interpreted (sequentially, starting at the bottom of the table) as described in Chapter 7 of Ogle (2016).
car::Anova(fit1)
#R| Anova Table (Type II tests)
#R|
#R| Response: logW
#R| Sum Sq Df F value Pr(>F)
#R| logL 43.529 1 22278.435 < 2.2e-16 ***
#R| fYear 0.451 2 115.429 < 2.2e-16 ***
#R| logL:fYear 0.052 2 13.395 2.159e-06 ***
#R| Residuals 0.971 497
#R| ---
#R| Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From these results it is apparent that the interaction term is a significant predictor in the model. In relation to Equation 1 this suggests that at least one of \(\gamma_{1}\) or \(\gamma_{2}\) is significantly different than zero, which implies that the slope of the relationship for fish captured in 1995, 2000, or both differs significantly from the slope for fish captured in 1990. Additionally, it is possible that the slopes for fish captured in 1995 and 2000 also differ, but this cannot be assessed from this output.
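For reference, the coefficient table from the full model shows estimates and tests of \(\delta_{1}\), \(\delta_{2}\), \(\gamma_{1}\), and \(\gamma_{2}\), i.e., comparisons of 1995 and 2000 only to the 1990 reference group (a sketch; the pairwise comparisons in the next section are preferred for identifying all specific differences).
## Each coefficient t-test compares a year only to the 1990 reference group
round(coef(summary(fit1)),4)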
The ANOVA table for the fit of the full model is useful for determining if there is some difference in the regression model parameters among groups, but it cannot specifically identify where those differences occur. Specific differences are identified in the next section.
Specific Differences Among Slopes
The use of emtrends() from emmeans is described in this fishR post. Results for pairwise comparisons of slopes are obtained with emtrends() using the fitted lm() object as the first argument, a specs= argument with pairwise~ followed by the name of the factor variable from the lm() model (fYear in this case), and var= followed by the name of the covariate from the lm() model (logL in this case), which must be in quotes. The results should be assigned to an object so that specific results can be extracted.
cs <- emtrends(fit1,specs=pairwise~fYear,var="logL")
The object saved from emtrends() is then given as the first argument to summary(), which also requires infer=TRUE if you would like p-values to be calculated.1
1 emtrends() does not compute p-values by default.
css <- summary(cs,infer=TRUE)
The $contrasts component in this saved object contains the results for comparing all pairs of slopes. Each paired comparison is a row with the groups compared under contrast, the difference in sample slopes under estimate, the standard error of the difference in sample slopes under SE, the degrees-of-freedom under df, a 95% confidence interval for the difference in slopes under lower.CL and upper.CL, and the t test statistic and p-value (adjusted for multiple comparisons) for testing a difference in slopes under t.ratio and p.value, respectively.
css$contrasts
#R| contrast estimate SE df lower.CL upper.CL t.ratio p.value
#R| fYear1990 - fYear1995 0.1719 0.0502 497 0.0539 0.290 3.424 0.0019
#R| fYear1990 - fYear2000 0.2285 0.0452 497 0.1222 0.335 5.052 <.0001
#R| fYear1995 - fYear2000 0.0567 0.0487 497 -0.0579 0.171 1.163 0.4759
#R|
#R| Confidence level used: 0.95
#R| Conf-level adjustment: tukey method for comparing a family of 3 estimates
#R| P value adjustment: tukey method for comparing a family of 3 estimates
These results show that the slope for 1990 is significantly greater than the slopes for 1995 and 2000, and that the slopes for 1995 and 2000 do not differ significantly.
The $emtrends component contains results for each slope with the groups under the name of the factor variable (fYear in this example), the sample slopes under xxx.trend (where xxx is replaced with the name of the covariate, logL in this example), standard errors of the sample slopes under SE, degrees-of-freedom under df, 95% confidence intervals for the slope under lower.CL and upper.CL, and t test statistics and p-values (adjusted for multiple comparisons) for testing that the slope is not equal to zero under t.ratio and p.value, respectively.
css$emtrends
#R| fYear logL.trend SE df lower.CL upper.CL t.ratio p.value
#R| 1990 3.03 0.0331 497 2.96 3.09 91.366 <.0001
#R| 1995 2.85 0.0377 497 2.78 2.93 75.629 <.0001
#R| 2000 2.80 0.0308 497 2.74 2.86 90.762 <.0001
#R|
#R| Confidence level used: 0.95
In this case, this output is included primarily for completeness, as these hypotheses are not generally of interest with weight-length relationship regressions.
A Summary Plot
A plot that shows the transformed weight-length data with best-fit lines for each year superimposed (Figure 2) is constructed with the code below. This code follows that found in Ogle (2016), with the exception that col2rgbt() from FSA is used to add transparency to each color in a vector of colors.
## Base plot
clr1 <- c("black","red","blue")
clr2 <- col2rgbt(clr1,1/4)
plot(logW~logL,data=ruf1,pch=19,col=clr2[fYear],
     xlab="log(Total Length)",ylab="log(Weight)")
## Fitted lines
( cfs <- coef(fit1) )
#R| (Intercept) logL fYear1995 fYear2000 logL:fYear1995
#R| -4.9144676 3.0251636 0.2817809 0.3942964 -0.1718633
#R| logL:fYear2000
#R| -0.2285159
mnx <- min(ruf1$logL)
mxx <- max(ruf1$logL)
curve(cfs[1]+cfs[2]*x,from=mnx,to=mxx,col=clr1[1],lwd=2,add=TRUE)
curve((cfs[1]+cfs[3])+(cfs[2]+cfs[5])*x,
      from=mnx,to=mxx,col=clr1[2],lwd=2,add=TRUE)
curve((cfs[1]+cfs[4])+(cfs[2]+cfs[6])*x,
      from=mnx,to=mxx,col=clr1[3],lwd=2,add=TRUE)
## Add legend
legend("topleft",levels(ruf1$fYear),pch=19,col=clr1)
Specific Differences Among Intercepts
When a difference in slopes exists, as in the previous example, it generally does not make sense to compare intercepts. However, if the slopes do not differ, then testing for differences in intercepts becomes important because, with parallel lines (i.e., same slopes), a difference in intercepts implies that the mean value of the response variable differs at every value of the explanatory variable.
The example below fits a DVR of the weight-length relationship regressions for the years from 1992 to 1995.2 In this example, the interaction term is not a significant predictor, which indicates that none of the slopes differ. However, the term related to the factor variable is significant, which implies that at least one of the \(\delta_{i}\) is different from zero. Thus, the y-intercept for at least one of 1993, 1994, or 1995 differs from the y-intercept for 1992 (the reference level).
2 Note the change in data frames from the previous section.
fit2 <- lm(logW~logL*fYear,data=ruf2)
car::Anova(fit2)
#R| Anova Table (Type II tests)
#R|
#R| Response: logW
#R| Sum Sq Df F value Pr(>F)
#R| logL 75.798 1 45594.0523 <2e-16 ***
#R| fYear 0.326 3 65.4373 <2e-16 ***
#R| logL:fYear 0.008 3 1.6373 0.1793
#R| Residuals 1.235 743
#R| ---
#R| Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Before comparing each pair of intercepts, a model without the interaction term, which was not significant, is fit.
fit2_noint <- lm(logW~logL+fYear,data=ruf2)
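Because fit2_noint forces a common slope, the predicted difference in \(log_{10}(W)\) between any two years is the same at every \(log_{10}(L)\). The sketch below (the two lengths and the pair of years used are arbitrary choices, not part of the original) illustrates this with predict().
## Predicted log10(W) for 1992 and 1993 at two arbitrary log10(L) values
nd <- data.frame(logL=c(1.9,2.2),fYear=factor("1992",levels=levels(ruf2$fYear)))
p92 <- predict(fit2_noint,newdata=nd)
nd$fYear <- factor("1993",levels=levels(ruf2$fYear))
p93 <- predict(fit2_noint,newdata=nd)
p92-p93   # same difference at both lengths because the lines are parallel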
The use of emmeans() from emmeans is described in this fishR post. Results for the comparison of all intercepts are obtained with emmeans() using the fitted lm() object (without the interaction term) as the first argument and a specs= argument with pairwise~ followed by the name of the factor variable from the lm() model (fYear in this case).
ci <- emmeans(fit2_noint,specs=pairwise~fYear)
The object saved from emmeans() is then given as the first argument to summary(), which also requires infer=TRUE if you would like p-values to be calculated.
cis <- summary(ci,infer=TRUE)
The $contrasts component in this saved object contains the results for comparing all pairs of predicted means at the overall mean of the covariate. Each paired comparison is a row with the groups compared under contrast, the difference in predicted means under estimate, the standard error of the difference in predicted means under SE, the degrees-of-freedom under df, a 95% confidence interval for the difference in predicted means under lower.CL and upper.CL, and the t test statistic and p-value (adjusted for multiple comparisons) for testing a difference in predicted means under t.ratio and p.value, respectively.
cis$contrasts
#R| contrast estimate SE df lower.CL upper.CL t.ratio p.value
#R| fYear1992 - fYear1993 0.056085 0.00444 746 0.04466 0.0675 12.638 <.0001
#R| fYear1992 - fYear1994 0.060984 0.00519 746 0.04761 0.0744 11.741 <.0001
#R| fYear1992 - fYear1995 0.061607 0.00516 746 0.04833 0.0749 11.947 <.0001
#R| fYear1993 - fYear1994 0.004900 0.00405 746 -0.00553 0.0153 1.210 0.6207
#R| fYear1993 - fYear1995 0.005522 0.00407 746 -0.00495 0.0160 1.358 0.5262
#R| fYear1994 - fYear1995 0.000622 0.00481 746 -0.01176 0.0130 0.129 0.9992
#R|
#R| Confidence level used: 0.95
#R| Conf-level adjustment: tukey method for comparing a family of 4 estimates
#R| P value adjustment: tukey method for comparing a family of 4 estimates
These results show that the y-intercept for 1992 is significantly greater than the y-intercepts for all other years, which did not differ significantly.
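Because the model is on the \(log_{10}\) scale, a difference in intercepts can be back-transformed to a multiplicative difference in weight at a given length. For example (a sketch using the first contrast from the output above):
## Ratio of 1992 to 1993 weights at a given length (10^0.056085)
10^cis$contrasts$estimate[1]
In other words, Ruffe captured in 1992 were roughly 14% heavier than Ruffe of the same length captured in the other years.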
The $emmeans component contains results for the predicted mean for each group with the groups under the name of the factor variable (fYear in this example),3 the predicted means under emmean, standard errors of the predicted means under SE, degrees-of-freedom under df, 95% confidence intervals for the predicted mean under lower.CL and upper.CL, and t test statistics and p-values (adjusted for multiple comparisons) for testing that the predicted mean is not equal to zero under t.ratio and p.value, respectively.
3 By default these are the predicted mean of the response variable at the mean of the explanatory variable.
cis$emmeans
#R| fYear emmean SE df lower.CL upper.CL t.ratio p.value
#R| 1992 1.172 0.003831 746 1.164 1.179 305.932 <.0001
#R| 1993 1.116 0.002209 746 1.112 1.120 505.130 <.0001
#R| 1994 1.111 0.003403 746 1.104 1.118 326.423 <.0001
#R| 1995 1.110 0.003417 746 1.104 1.117 324.965 <.0001
#R|
#R| Confidence level used: 0.95
These results show, for example, that the mean \(log_{10}(W)\) when \(log_{10}(L)=2.013\) for fish captured in 1992 is 1.172.
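The 2.013 is simply the mean of logL in ruf2, and the predicted mean \(log_{10}(W)\) can be back-transformed to the gram scale (a sketch; the values should match those shown above).
mean(ruf2$logL)   # mean of the covariate (approximately the 2.013 noted above)
10^1.172          # back-transformed predicted mean log10(W) for 1992 fish (about 15 g)
Finally, a summary plot analogous to Figure 2, but for these four years, is constructed with the code below.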
<- c("black","red","blue","green")
clr1 <- col2rgbt(clr1,1/4)
clr2 plot(logW~logL,data=ruf2,pch=19,col=clr2[fYear],
xlab="log(Total Length)",ylab="log(Weight)")
<- coef(fit2)
cfs <- min(ruf2$logL)
mnx <- max(ruf2$logL)
mxx curve(cfs[1]+cfs[2]*x,from=mnx,to=mxx,col=clr1[1],lwd=2,add=TRUE)
curve((cfs[1]+cfs[3])+(cfs[2]+cfs[6])*x,from=mnx,to=mxx,col=clr1[2],lwd=2,add=TRUE)
curve((cfs[1]+cfs[4])+(cfs[2]+cfs[7])*x,from=mnx,to=mxx,col=clr1[3],lwd=2,add=TRUE)
curve((cfs[1]+cfs[5])+(cfs[2]+cfs[8])*x,from=mnx,to=mxx,col=clr1[4],lwd=2,add=TRUE)
## Add legend
legend("topleft",levels(ruf2$fYear),pch=19,col=clr1)