Background

Researchers measured (among other things) the canine tooth height (cm) of two subspecies of hoary bats (Lasiurus cinereus cinereus and Lasiurus cinereus semotus) found in Hawaii. They wanted to determine if canine tooth height (hereafter, just tooth height) differed between subspecies and, more importantly to them, whether tooth height could be used to predict the subspecies of a bat. In this lecture we will focus on their primary goal – can tooth height be used to predict the subspecies of a bat? With this in mind,

  • What are the response and explanatory variables?1
  • What type of analysis should be used?2

The data are loaded into R below. For class demonstration purposes only, the data.frame was reduced to only the two variables of interest. In addition, canine was converted from cm to mm so that the slope would be more meaningful.3 Neither of these decisions is required for a logistic regression.

> bat <- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/Batmorph.csv")
> bat <- bat[,c("subsp","canine")]   # for class demo purposes only
> bat$canine <- bat$canine*10        # convert cm to mm
> xlbl <- "Canine Tooth Height (mm)"
> ylbl <- "Subspecies Code"
> str(bat)
'data.frame':   118 obs. of  2 variables:
 $ subsp : Factor w/ 2 levels "cinereus","semotus": 2 2 2 2 2 2 2 2 2 2 ...
 $ canine: num  3.26 3.08 2.91 2.87 3.01 3.05 2.77 3.13 2.89 2.93 ...
  • Recall that R lists levels alphabetically and codes the first item as 0. Note then that cinereus is listed as the first level for subsp and, thus, will be coded with a 0 (and semotus will be coded with a 1). This ordering is important below as R will consider the ‘1’ group to be a “success.”

Before beginning this analysis, I like to examine the data to see if it is reasonable to distinguish between the two subspecies based on tooth height. The histograms below show some overlap but also considerable separation between the subspecies. Thus, it may be reasonable to separate the two subspecies for many tooth heights.4

> hist(canine~subsp,data=bat,w=0.1,xlim=c(2.6,3.8),ymax=20,xlab=xlbl,nrow=2,ncol=1)


Preparing to Model

All models that we have fit in class have been linear (at least after transformation) and represented the mean of the response variable (recall that all models had μY) on the left-hand-side. This may seem like an issue here because the response variable is categorical. How do you compute the mean of “words”? However, recall that behind the scenes R has converted the levels into numbers – cinereus as a 0 and semotus as a 1.

Suppose that you have five hoary bats, two of which are cinereus and three of which are semotus. Behind the scenes this is the same as having two 0s and three 1s. The mean of these values is thus \(\frac{3 \text{ (sum of 0s and 1s)}}{5 \text{ (total number of bats)}}\)=0.6. This is ALSO precisely the proportion of those five hoary bats that are semotus (i.e., \(\frac{3 \text{ (number of semotus)}}{5 \text{ (total number of bats)}}\)=0.6).

  • The mean of a binary categorical variable is exactly the same as the proportion of individuals in the second level of the variable.
  • The second level of a binary categorical variable is generically considered a “success” (and, thus, the first level is a “failure”).
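
This equivalence is easy to verify directly in R. Below is a quick sketch using the five hypothetical bats from above (the 0/1 codes are made up to match that example).

> x <- c(0,0,1,1,1)       # two cinereus (0) and three semotus (1)
> mean(x)                 # mean of the 0s and 1s
[1] 0.6
> sum(x==1)/length(x)     # proportion of semotus ... the same value
[1] 0.6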

One way to visualize logistic regression data is to plot the categorical response (as its numeric codes) against the quantitative explanatory variable (see below). Because of the nature of categorical data there will be many points plotted on top of each other. Thus, points are plotted with transparency such that darker “points” actually represent more points. In the plot below you can see that all bats with tooth heights below about 3.0 mm are semotus, there is a mix of semotus and cinereus between about 3.0 and 3.4 mm (where there are more bats), and all bats with tooth heights above about 3.4 mm are cinereus.

This plot can be modified by thinking of narrow vertical “windows” (dashed lines below). The mean of the points within each of these windows is computed and plotted in the center of the window with a “blue plus sign.” Recall that these means are also the proportions of “successes”; thus, these blue plusses also represent the proportion of semotus within each window. In the plot below, the first four “windows” up to about 3.0 mm have 100% semotus, the percent of semotus declines in windows between about 3.0 and 3.4 mm, and the percent that are semotus is 0% in the “windows” larger than 3.4 mm.
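
The blue plusses can be approximated by hand with code like the sketch below (the 0/1 coding and the window breaks are choices made here for illustration; output not shown).

> code <- as.numeric(bat$subsp)-1                     # cinereus=0, semotus=1
> window <- cut(bat$canine,breaks=seq(2.6,3.8,0.1))   # narrow vertical "windows"
> tapply(code,window,mean)                            # proportion of semotus per window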

Logistic regression tries to fit a model that best represents the blue plusses (i.e., the means as with other models, but remembering that these are also proportions of a categorical variable in logistic regression). However, the blue plusses are clearly not linear. What do we try to do when the data are non-linear?

  • A logistic regression model attempts to fit a linear model between the proportion of “successes” and the quantitative explanatory variable.
  • The proportion of “successes” is very rarely linear. The easiest way to see this is to realize that the proportion of “successes” must be between 0 and 1. Thus, the line must bend at the edges to stay vertically between 0 and 1.
  • The bending at the edges to stay between 0 and 1 leads to an “S-curve” that is often referred to as a “logistic curve” (think of population growth from your ecology class), which is where the name “logistic regression” comes from. A sketch of this curve is shown below.
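
Below is a minimal sketch of such an S-curve, using a made-up intercept and slope (these are not the values fit to the bat data later).

> a <- 0; b <- 2                            # made-up intercept and slope
> curve(exp(a+b*x)/(1+exp(a+b*x)),from=-4,to=4,
+       xlab="X",ylab="Proportion of successes")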


Transformations

The transformation required to linearize the proportion of “successes” is a two-step process; the two steps are discussed separately below.

Odds

The proportion of successes in a “window” of the plot above is labelled as p. For example, the fourth “window” had 80% semotus, so p for that “window” is 0.80. Similarly, the seventh “window” had 40% semotus, so p for that window is 0.40. These p are interpreted as the probability of “success” in logistic regression. For example, the probability of being a semotus in the fourth window is 0.8. Thus, 1-p is the probability of “failure.” For example, the probability of being a cinereus (i.e., not a semotus) in the fourth “window” is 1-0.8=0.2.

The odds of a “success” are defined as \(\frac{\text{p}}{1-\text{p}}\). This is interpreted as the ratio of the probability of success to the probability of failure. If the odds are 1 then there are equal probabilities of “success” and “failure” (think of flipping a fair coin). If the odds are greater than 1 then there is a higher probability of a “success” and if the odds are less than 1 then there is a higher probability of a “failure.”

In the examples above, the odds for the fourth “window” are \(\frac{0.8}{1-0.8}\)=\(\frac{0.8}{0.2}\)=4. Thus, in the fourth “window” it is four times more likely for the bat to be a semotus than a cinereus. In contrast, the odds for the seventh “window” are \(\frac{0.4}{1-0.4}\)=\(\frac{0.4}{0.6}\)=0.67. Thus, in the seventh “window” it is 0.67 times as likely for the bat to be a semotus as a cinereus. It is often easier when describing odds that are less than 1 to flip them. For example, the inverse of the odds for the seventh “window” is \(\frac{1}{0.67}\)=1.5, which can be interpreted as the probability that the bat is a cinereus (i.e., a failure) is 1.5 times the probability that it is a semotus.
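
These odds calculations are trivial in R. A quick check using the two “windows” from above:

> p <- 0.8
> p/(1-p)       # odds of semotus in the fourth "window"
[1] 4
> p <- 0.4
> p/(1-p)       # odds of semotus in the seventh "window"
[1] 0.6666667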

Odds are useful in general, but they are particularly useful here as the first step in transforming the non-linear probabilities. Consider the following table.

p     0.999   0.99   0.9   0.75   0.5   0.25     0.1      0.01     0.001
odds  999     99     9     3      1     0.3333   0.1111   0.0101   0.0010

This implies that as the probabilities (p) get closer and closer to 1 then the odds go to positive infinity. On the other side, as the probabilities get closer and closer to 0 then the odds also get closer and closer to 0.
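
The table can be reproduced with one line of R (a quick check):

> p <- c(0.999,0.99,0.9,0.75,0.5,0.25,0.1,0.01,0.001)
> round(p/(1-p),4)
[1] 999.0000  99.0000   9.0000   3.0000   1.0000   0.3333   0.1111   0.0101
[9]   0.0010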

  • Probabilities are constrained to be between 0 and 1.
  • Odds are constrained to be between 0 and infinity.
  • Odds and probabilities are not synonyms, though they are sometimes used as such in everyday language (but not by statisticians). The probability of something happening describes how likely that something is to happen. A probability of 0.6 that a woman walks through the door means that 60% of the people that walk through the door will be women. Odds of 1.5 that a woman walks through the door means that 1.5 women will walk through the door for every man that walks through the door. Odds and probabilities are related, but not equal.

Thus, converting the probabilities that were the blue plusses in the plot above to odds produces the plot below. At first glance, computing the odds does not seem to be much of an improvement because the plot of odds is still non-linear. However, it now resembles an exponential function that you are familiar with. How should you transform this?5

  • Transforming the probabilities to odds changes the curve from logistic to exponential.

Log Odds or Logits

The exponential form of the odds versus X function can be linearized by transforming the odds to log(odds), as shown below.

Thus, a linear relationship is observed if the probabilities are transformed to log(odds). Therefore, the model that will be fit in logistic regression is

\[ \text{log}\left(\frac{\text{p}}{1-\text{p}}\right) = \alpha+\beta X \]
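
The left-hand side of this model is the logit of p. Unlike p itself, the logit is unbounded in both directions, which is what allows a line to fit it. A quick sketch (qlogis() is base R’s built-in logit function):

> p <- c(0.001,0.5,0.999)
> log(p/(1-p))        # logits by hand
[1] -6.906755  0.000000  6.906755
> qlogis(p)           # same values from base R's logit function
[1] -6.906755  0.000000  6.906755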


Model Fitting

Using glm()

Fitting this linear model in R requires using glm() rather than lm().6 The first two arguments to glm() are the same as for lm() – i.e., a formula of the form response~explanatory and the data frame in data=. However, to force glm() to fit a logistic regression to the log(odds)-transformed probabilities, you must also include family=binomial.

> glm1 <- glm(subsp~canine,data=bat,family=binomial)

Interpreting the Slope and Back-Transformed Slope

Parameter estimates are extracted from the glm() object with coef() and confint() as with lm(). These results are organized as before – i.e., values associated with the y-intercept are in the (Intercept) row and those associated with the slope are in the row labeled with the quantitative explanatory variable (i.e., canine in this example).

> ( cfs <- cbind(Ests=coef(glm1),confint(glm1)) )
Waiting for profiling to be done...
                 Ests     2.5 %   97.5 %
(Intercept)  35.51574  24.21685 49.66132
canine      -11.11193 -15.52430 -7.58941

As with linear models, interpretation of the slope is most important. In logistic regression, the slope measures how much the LOG(ODDS) change for a one unit increase in the explanatory variable. Thus, in this case, the LOG(ODDS) of being a semotus decrease by between 7.6 and 15.5 for every 1 mm increase in tooth height. This is visualized below for an increase from 2.6 to 3.6 mm of tooth height.7

point   tooth height   log(odds)
1       2.6            35.51574-11.11193×2.6 = 6.624722
2       3.6            35.51574-11.11193×3.6 = -4.487208
DIFF    1.0            -4.487208-6.624722 = -11.11193 (see that this is β)
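
This arithmetic can be verified with the fitted coefficients (a quick sketch; output not shown, but compare to the table above):

> a <- coef(glm1)[[1]]; b <- coef(glm1)[[2]]   # intercept and slope
> ( lo26 <- a+b*2.6 )                          # log(odds) at 2.6 mm ... ~6.625
> ( lo36 <- a+b*3.6 )                          # log(odds) at 3.6 mm ... ~-4.487
> lo36-lo26                                    # the difference is the slope ... ~-11.112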

It is hard to interpret results on the log scale. Thus, the slope is often back-transformed with \(e^{\beta}\). The back-transformed slope provides a MULTIPLE for how the ODDS change for a one unit increase in the explanatory variable. For example, all of the parameter estimates are back-transformed as shown below.

> exp(cfs)
                    Ests        2.5 %       97.5 %
(Intercept) 2.656377e+15 3.290379e+10 3.695180e+21
canine      1.493306e-05 1.810853e-07 5.057793e-04

The back-transformed slope means that when the tooth height increases by 1 mm, the odds of being a semotus are multiplied by between 0.0000002 and 0.0005058. In other words, if the tooth height increases by 1 mm then it becomes much less likely that the bat is a semotus (and, thus, much more likely that it is a cinereus). This is visualized below for an increase from 2.6 to 3.6 mm of tooth height.

point   tooth height   odds (from log(odds) above)
1       2.6            \(e^{6.624722}\) = 753.4947
2       3.6            \(e^{-4.487208}\) = 0.01125202
RATIO   1.0            \(\frac{0.01125202}{753.4947}\) = 0.00001493 (see that this is \(e^{\beta}\))
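
Similarly, the ratio of the back-transformed odds equals the back-transformed slope (a quick sketch continuing from the code above; output not shown):

> exp(lo36)/exp(lo26)   # ratio of the odds ... ~0.00001493
> exp(b)                # the same value ... the back-transformed slope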

Default Tests for the Slope

As before, the parameter estimates, standard errors, and a test of whether each parameter is equal to zero are obtained by submitting the glm() object to summary(). As above, information for the slope is in the row labeled with the quantitative explanatory variable. The output from summary() below the “Coefficients:” table can be ignored.

> summary(glm1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   35.516      6.428   5.525 3.29e-08
canine       -11.112      2.005  -5.543 2.97e-08

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 163.040  on 117  degrees of freedom
Residual deviance:  97.178  on 116  degrees of freedom
AIC: 101.18

Number of Fisher Scoring iterations: 5

The p-value for the slope is a test of whether the slope is equal to 0 or not. This tests whether a relationship (i.e., a non-zero slope) between the LOG(ODDS) and the explanatory variable exists. If H0 is not rejected then there is no evidence for a relationship, and the blue plusses in the previous plots would all be at the same height and represented by a flat line.

As usual, fitPlot() can be used to visualize these results. For logistic regression it is best to also include breaks=, which controls the number of vertical “windows” at which the proportion of successes is calculated (and, thus, how many blue plusses are included on the plot). Finding an appropriate number is usually a matter of trial-and-error, but usually around 10 is a good idea. In the code below seq() is used to create a sequence of numbers that starts at 2.6, ends at 3.8, and goes in steps of 0.1 (i.e., 2.6, 2.7, 2.8, …, 3.7, 3.8).

> fitPlot(glm1,breaks=seq(2.6,3.8,0.1),xlim=c(2.6,3.8),xlab=xlbl,ylab=ylbl)

There are no formal assumptions to check with logistic regression. Rather, we simply assess whether the fitted model (red line above) fits the proportion of “successes” (the blue plusses) reasonably closely. The fit displayed above is quite good.


Predicting Odds and Probabilities

The equation for the best-fit line from the results above is

\[ \text{log}\left(\frac{\text{p}}{1-\text{p}}\right) = 35.51574-11.11193\times\text{Height} \]

Thus, plugging a value for tooth height into this equation results in a predicted value of the log(odds). For example, the predicted log(odds) for a tooth height of 3.1 mm is

\[ \text{log}\left(\frac{\text{p}}{1-\text{p}}\right) = 35.51574-11.11193\times3.1 = 1.068757 \]

Of course, interpreting the log(odds) is difficult; thus, this value can be back-transformed to the odds by raising to the power of e.

\[ e^{\text{log}\left(\frac{\text{p}}{1-\text{p}}\right)} = e^{1.068757} \] \[ \frac{\text{p}}{1-\text{p}} = 2.911758 \]

Thus, a hoary bat with a 3.1 mm tooth height is 2.91 times more likely to be a semotus than a cinereus subspecies. Look at the fitplot above to convince yourself that this calculation makes sense.8

Of course, the researchers really would like to predict the probability that a bat is a semotus. Algebra on the best-fit line equation will result in an equation for making this prediction. Steps for the algebra are below

  • Reminder of the equation of the best-fit line.

\[ \text{log}\left(\frac{\text{p}}{1-\text{p}}\right) = \alpha+\beta X \]

  • Both sides of the equation are raised to the power of e (i.e., back-transforming the log). Note that this gives an equation for predicting the odds, as was demonstrated above; thus, odds can be substituted for \(e^{\alpha+\beta X}\).

\[ e^{\text{log}\left(\frac{\text{p}}{1-\text{p}}\right)} = e^{\alpha+\beta X} \] \[ \frac{\text{p}}{1-\text{p}} = e^{\alpha+\beta X} \] \[ \frac{\text{p}}{1-\text{p}} = \text{odds} \]

  • Multiply both sides by 1-p (which removes 1-p from the left-hand-side).

\[ \text{p} = \text{odds}(1-\text{p}) \]

  • Distribute the odds through the 1-p (I added the explicit × for clarity).

\[ \text{p} = \text{odds}-\text{p}\times\text{odds} \]

  • Add \(\text{p}\times\text{odds}\) to both sides (which removes \(\text{p}\times\text{odds}\) from the right-hand-side).

\[ \text{p}+\text{p}\times\text{odds} = \text{odds} \]

  • Factor out the p.

\[ \text{p}(1+\text{odds}) = \text{odds} \]

  • Divide both sides by \(1+\text{odds}\) (which removes \(1+\text{odds}\) from the left-hand-side).

\[ \text{p} = \frac{\text{odds}}{1+\text{odds}} \]

Thus, the probability of “success” can be obtained as the ratio of the odds of “success” to 1 plus the odds of “success”. The odds that a hoary bat with a tooth height of 3.1 mm was semotus was calculated above as 2.911758. Using the equation for p just derived, the probability that a bat with a tooth height of 3.1 mm is semotus is \(\frac{2.911758}{1+2.911758}\)=0.744361. Again, check the fitplot above to make sure that this value makes sense.
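
This calculation is easy to check in R (a quick sketch using the odds from above):

> odds <- 2.911758    # odds of semotus at 3.1 mm (from above)
> odds/(1+odds)       # probability of semotus at 3.1 mm
[1] 0.7443606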

These predictions can be made in R with predict() very similarly to what you have done before. For example, the log(odds) that a hoary bat with a 3.1 mm tooth height is semotus is computed with

> predict(glm1,data.frame(canine=3.1))
       1 
1.068746 

The probability (rather than log(odds)) that a hoary bat with a 3.1 mm tooth height is semotus is computed by including type="response" in predict(). [Compare this to the hand-calculated results above.]

> predict(glm1,data.frame(canine=3.1),type="response")
        1 
0.7443584 

The odds cannot be computed directly with predict(). The odds can be computed by either back-transforming the log(odds)

> exp(predict(glm1,data.frame(canine=3.1)))
       1 
2.911727 

or computing the odds from the predicted probability

> p <- predict(glm1,data.frame(canine=3.1),type="response")
> p/(1-p)
       1 
2.911727 


Predicting X with Certain Probability

Researchers will also commonly use logistic regression results to predict the value of the quantitative explanatory variable (X) that would have a certain probability of “success.” For example, researchers may ask what tooth height gives an even probability that the bat is semotus or cinereus (i.e., the probability of being semotus is 0.5), or what tooth height gives a probability of 0.9 of being a semotus. Visualize this as choosing the given probability on the y-axis, moving right until you hit the best-fit line, and then moving down to find the corresponding point on the x-axis (see below for the examples of 0.5 and 0.9).

Of course, we want to be exact with this prediction. Again, we can perform some algebra on the equation of the line to solve for X.

  • Reminder of the equation of the best-fit line.

\[ \text{log}\left(\frac{\text{p}}{1-\text{p}}\right) = \alpha+\beta X \]

  • Subtract α from both sides (α will disappear from the right-hand side).

\[ \text{log}\left(\frac{\text{p}}{1-\text{p}}\right) - \alpha = \beta X \]

  • Divide both sides by β (β will disappear from the right-hand side)

\[ \frac{\text{log}\left(\frac{\text{p}}{1-\text{p}}\right) - \alpha}{\beta} = X \]

  • Simply flip the sides so that it looks like an equation for X.

\[ X = \frac{\text{log}\left(\frac{\text{p}}{1-\text{p}}\right) - \alpha}{\beta} \]

Thus, the predicted tooth height for a probability of 0.5 is 3.196 as computed with

\[ X = \frac{\text{log}\left(\frac{0.5}{1-0.5}\right) - 35.51574}{-11.11193} \]

Make sure that this makes sense to you from the plot above.

Similarly, the predicted tooth height for a probability of 0.9 is 2.998 as computed with

\[ X = \frac{\text{log}\left(\frac{0.9}{1-0.9}\right) - 35.51574}{-11.11193} \]

Again, make sure that this makes sense to you from the plot above.
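
Both predictions can be made from the fitted coefficients (a quick sketch; output not shown, but compare to the hand calculations above):

> a <- coef(glm1)[[1]]; b <- coef(glm1)[[2]]   # intercept and slope
> ( log(0.5/(1-0.5)) - a ) / b                 # tooth height where p=0.5 ... ~3.196
> ( log(0.9/(1-0.9)) - a ) / b                 # tooth height where p=0.9 ... ~2.998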


Confidence Intervals for Predictions

Bootstrapping

As always, these predictions should be accompanied by a confidence interval. Unfortunately, the predictions within a logistic regression do not follow the normal distribution theory used for other linear models. Confidence intervals can be constructed, however, with a method called bootstrapping.

Bootstrapping is a process that develops an approximate sampling distribution of a statistic by repeatedly sampling from the original data. Specifically, one bootstrap step randomly selects n individuals from the original data with replacement and computes the desired statistic. It then repeats this process many times (usually on the order of 5000-10000 times). All of the statistics from these repeated samples are then ordered from smallest to largest. The values of the statistic at the 2.5 and 97.5 percentiles9 are then found to form a 95% confidence interval for the statistic.
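
To make this concrete, below is a minimal hand-rolled sketch of the bootstrap for the slope (the number of samples B is a made-up choice; bootCase(), used next, automates this process).

> B <- 1000                                  # number of bootstrap samples (made-up choice)
> boot.slopes <- numeric(B)
> for (i in 1:B) {
+   rows <- sample(nrow(bat),replace=TRUE)   # resample rows, with replacement
+   tmp <- glm(subsp~canine,data=bat[rows,],family=binomial)
+   boot.slopes[i] <- coef(tmp)[[2]]         # save the slope from this resample
+ }
> quantile(boot.slopes,c(0.025,0.975))       # 95% bootstrap CI for the slope

Some resamples may trigger warnings if the two subspecies happen to separate perfectly; such quirks are one reason to use a canned routine like the one below.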

The bootstrapped samples can be generated with bootCase() which requires the glm() object as its only argument.10

> bc1 <- bootCase(glm1)      # bootstrapping, be patient!
'bootCase' is provided here only for backward compatibility.
Consider using 'Boot' from the 'car' package instead.

As partially seen below, bootCase() returns parameter estimates for each bootstrapped sample (each row is a separate bootstrapped sample).

     (Intercept)    canine
[1,]    34.21969 -10.64698
[2,]    33.53490 -10.45157
[3,]    33.13709 -10.34672
[4,]    39.08206 -12.21312
[5,]    40.54202 -12.67247

A histogram of the slopes from all of the bootstrapped samples is shown below. The vertical red lines show the values that have 2.5% and 97.5% of the samples smaller and, thus, show the endpoints of a 95% bootstrapped confidence interval. Thus, one would be 95% confident that the slope for this logistic regression is between -15.60 and -8.28.
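
That interval can be computed directly from the bootstrapped slopes with quantile() (a quick sketch; output not shown, but it should match the interval just given):

> quantile(bc1[,"canine"],c(0.025,0.975))   # 95% bootstrap CI for the slope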

CIs for Predicted Probabilities

Bootstrapping is more useful for making confidence intervals for the predictions discussed above.11 First, however, we have to write a function that can be used to make predictions for each of the bootstrapped samples. For example, the following function can be used to predict the probability of a “success” given a particular value of X.12

> predProb <- function(x,alpha,beta) exp(alpha+beta*x)/(1+exp(alpha+beta*x))

For example, this function can be used to find the probability of a semotus given a tooth height of 3.1 (and the observed values of the intercept and slope). [Compare this result to what was calculated above.]

> predProb(3.1,alpha=35.51574,beta=-11.11193)
[1] 0.7443605

More importantly, the alpha= and beta= arguments can be the intercept and slope columns (first and second columns, respectively; thus, the use of “1” and “2” below) from the bootstrapped samples object. This would then predict the probability of being semotus if the tooth height is 3.1 for each bootstrapped sample (the first five are shown below).

> p31 <- predProb(3.1,bc1[,1],bc1[,2])
> p31[1:5]     # show the first five bootstrapped predictions
[1] 0.7710168 0.7567687 0.7431242 0.7723065 0.7785726

The quantile() function is used to identify the values in the 2.5% and 97.5% positions.

> quantile(p31,c(0.025,0.975))
     2.5%     97.5% 
0.6347614 0.8500010 

Thus, one is 95% confident that the probability of being a semotus for a hoary bat with a 3.1 mm tooth height is between 0.63 and 0.85.

CIs for Predicted Values of X for a Given Probability

The same process can be followed for making a confidence interval for the value of the quantitative explanatory variable for a certain probability. First, make a function to compute the value of X for a given probability.13

> predX <- function(p,alpha,beta) (log(p/(1-p))-alpha)/beta

This is then applied to the bootstrapped samples (here for a probability of 0.5).

> x05 <- predX(0.5,bc1[,1],bc1[,2])
> quantile(x05,c(0.025,0.975))
    2.5%    97.5% 
3.151589 3.240603 

Thus, one is 95% confident that the tooth height at which there is an equal probability that the hoary bat is a semotus or a cinereus is between 3.15 and 3.24.


Footnotes


  1. The researchers are trying to predict subspecies so it is the response variable. Thus, tooth height is the explanatory variable.

  2. The subspecies response variable is categorical (and binary) and the tooth height explanatory variable is quantitative. Thus, this question requires a (binary) logistic regression.

  3. The range of tooth heights was less than 1 cm. Thus, when interpreting the slope, a “1 cm increase in tooth height” would not be realistic. Therefore, this variable was multiplied by 10 to convert the cm to mm so that the slope would be for a “1 mm increase in tooth height,” which is not larger than the range of the data.

  4. There are several arguments used in this hist() that you may not have seen before. The w= controls how wide the bins are, ymax= sets a common maximum value for the two y-axes, ncol= sets how many columns the plots will be placed in, and nrow= sets how many rows the plots will be placed in.

  5. Because this follows an exponential function, our work in the SLR module suggests that only the response variable should be log-transformed. Thus, the log of the odds should be computed.

  6. The “g” in glm() stands for “generalized.” If the family= argument is not used then glm() behaves exactly like lm(); i.e., all other models we have fit in this class could have also been fit with glm(). However, glm() is more general in the sense that it allows for residuals that are not normally distributed, which is the case with logistic regression.

  7. The log(odds) are predicted by plugging the tooth height values into the best-fit line produced by glm() with coefficients shown by coef(). Remember that the best-fit line predicts log(odds).

  8. When I look at the fitplot it appears to me that the probability that the bat is a semotus is around 0.7 or 0.75. If the probability was 0.75 then the odds would be \(\frac{0.75}{0.25}\)=3, which is pretty close to the calculated 2.91 value.

  9. Percentiles are the values of X that have the given percent lower. For example, the 97.5 percentile would have 97.5% of values smaller.

  10. By default bootCase() will take 999 bootstrap samples, which will be adequate for the analyses in this class. For research-grade analyses, I will increase this number to at least 5000.

  11. Note that confidence or prediction intervals can NOT be computed with predict() as was done in previous modules. In addition, there is no mathematical way to derive a confidence interval for the value of X at a certain probability value as this calculation reverses the order of the best-fit model (i.e., solves for X given a value of Y).

  12. This function is simply an R version of \(\text{p} = \frac{odds}{1+odds} = \frac{e^{\alpha+\beta X}}{1+e^{\alpha+\beta X}}\).

  13. This function is simply an R version of \(X = \frac{\text{log}\left(\frac{\text{p}}{1-\text{p}}\right) - \alpha}{\beta}\).