Note:
  • Tables (and figures) should be labeled as described in the homework format description. Table labels go ABOVE the table and figure labels go BELOW the figure. Tables (and figures) should be referred to in your answers. See the key below for a model of this.
  • Use complete sentences to answer questions.
  • Use an R appendix to show the code you used to produce results. Do not include R code in any of your other answers.
  • Keep “many” decimals in intermediate calculations … i.e., don’t round until the final answer.
  • Note explanations in the key below.
  • In question 2 below, make sure to use the word “mean”, as the hypotheses are testing that the “mean pH”, not “the pH”, differs between the two rivers. Hypotheses are about summaries, not individual values. Also make sure to say what is different … say “mean pH differs between river A and river B”, not “the group means differ.” Don’t just say “the null hypothesis is rejected” … explain what that means about mean pH.
  • Always demonstrate your answers. As one example, don’t just say that \(F=t^{2}\) … show it with the values from the tables.
  • There are three possible MS values – MSTotal, MSWithin, and MSAmong. Don’t use “residuals MS” (e.g., last question) even though that is how R labels it.
  • Don’t say that something is equal if it clearly is not. Don’t claim that your algebra in Question 7 equals MSWithin if it does not. If you know that it is supposed to equal MSWithin and you can’t get your algebra straight, then SEEK HELP FROM ME.

pH in Two Rivers

  1. The p-values for the two-sample t-test (\(p=0.0000016\); Table 1), from the ANOVA table (\(p=0.0000016\); Table 2), and for the slope coefficient (\(p=0.0000016\); Table 3) are all the same. These p-values are all equivalent because the 2-sample t-test null hypothesis of equal means (or difference in means equals zero) is the same as the null hypothesis for the slope (see below about the slope representing the difference in means), which is the same as the null hypothesis for the ANOVA table (i.e., simple model of one mean representing both groups). The alternative hypotheses are likewise equivalent across the 2-sample t-test, the slope, and the ANOVA table (i.e., full model with a separate mean for each group).
  2. With these p-values, very strong evidence to reject the null hypothesis exists. Thus, the mean pH appears to differ between the two rivers.
  3. The mean of the first (A) group in the 2-sample t-test (8.662; Table 1) is the same as the intercept coefficient from the linear model (8.662; Table 3). This occurs because an intercept is defined as the “value of \(Y\) when \(X\)=0, on average”. In this case, \(Y\) is pH and \(X=0\) represents river A because river A is coded with a zero in lm() (because the levels are coded alphabetically). Thus, the intercept is the mean pH (\(Y\)) for river A (\(X=0\)).
  4. The difference in the means (i.e., 6.408-8.662 = -2.254; Table 1) is the same as the slope coefficient in the linear model (i.e., -2.254; Table 3). This is because the slope coefficient shows the change in \(Y\) for a one unit change in \(X\). As noted above, river A is coded with a 0 in lm() as it is alphabetically first in the river names. River B is thus coded as a 1 in lm() as it is alphabetically second. Thus, a one unit change in \(X\) is simply a move from river A to river B (i.e., from 0 to 1), and the slope is the difference in mean pH between the two rivers (i.e., change in \(Y\)).
  5. The df from the two-sample t-test (18; Table 1) and the within-group df from the ANOVA table (18; Table 2) are identical. The within-group df are equal to the total number of individuals (\(n=n_{1}+n_{2}\)) minus the number of groups (\(I=2\)), which is the same as for the 2-sample t-test (i.e., \(n_{1}+n_{2}-2\)).
  6. The F test statistic (48.789; Table 2) is equal to the square of the t test statistic (\(6.9849^{2}=48.789\); Table 1). This relationship occurs when the numerator df for the ANOVA is equal to one (i.e., there are only two groups).
  7. Because the t test statistic is the difference in means divided by its SE, the SE for the difference in means is equal to \(\frac{\bar{x}_{1}-\bar{x}_{2}}{t}\) = \(\frac{8.662-6.408}{6.9849}\) = 0.3227. The pooled variance (\(s_{p}^{2}\)) is then equal to this value squared and divided by the sum of the reciprocals of the sample sizes (i.e., \(\frac{0.3227^{2}}{\frac{1}{10}+\frac{1}{10}}\) = 0.5207).
  8. The \(s_{p}^{2}\) computed in the previous question (0.5207) is the same as \(MS_{within}\) (0.5207; Table 2).
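
The relationships in answers 3, 4, 6, and 7 can be sketched algebraically. This is only a sketch that assumes the standard equal-variance 2-sample t-test formulas and the 0/1 coding of river described above; the symbols \(\alpha\), \(\beta\), \(\mu_{A}\), and \(\mu_{B}\) are introduced here for illustration and are not part of the original answers.

\[
\mu=\alpha+\beta X:\qquad X=0\ \Rightarrow\ \mu_{A}=\alpha,\qquad X=1\ \Rightarrow\ \mu_{B}=\alpha+\beta,\qquad\text{so}\quad \beta=\mu_{B}-\mu_{A}
\]

\[
t=\frac{\bar{x}_{1}-\bar{x}_{2}}{SE},\qquad SE=\sqrt{s_{p}^{2}\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}\qquad\Rightarrow\qquad s_{p}^{2}=\frac{SE^{2}}{\frac{1}{n_{1}}+\frac{1}{n_{2}}}
\]

\[
F=\frac{MS_{among}}{MS_{within}}=\frac{(\bar{x}_{1}-\bar{x}_{2})^{2}\Big/\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}{s_{p}^{2}}=\left(\frac{\bar{x}_{1}-\bar{x}_{2}}{SE}\right)^{2}=t^{2}
\]

The last line uses the fact that, with only two groups, the among-group df is one and \(MS_{among}\) reduces to \((\bar{x}_{1}-\bar{x}_{2})^{2}/(\frac{1}{n_{1}}+\frac{1}{n_{2}})\). With the values from Table 1, this is \(\frac{(8.662-6.408)^{2}}{\frac{1}{10}+\frac{1}{10}}\approx 25.40\), which agrees with \(MS_{among}\) in Table 2.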

Table 1: Results from 2-sample t-test of pH by river.

t = 6.9849, df = 18, p-value = 1.599e-06
95 percent confidence interval:
 1.576042 2.931958 
sample estimates:
mean in group A mean in group B 
          8.662           6.408 

Table 2: Analysis of variance table for pH by river.

          Df  Sum Sq Mean Sq F value    Pr(>F)
river      1 25.4026 25.4026  48.789 1.599e-06
Residuals 18  9.3719  0.5207                  

Table 3: Coefficient results from the one-way ANOVA for pH by river.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.6620     0.2282  37.961  < 2e-16
riverB       -2.2540     0.3227  -6.985  1.6e-06
---
Residual standard error: 0.7216 on 18 degrees of freedom
Multiple R-squared: 0.7305, Adjusted R-squared: 0.7155 
F-statistic: 48.79 on 1 and 18 DF,  p-value: 1.599e-06 

R Appendix.

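# Read the data (pH and river variables)
d <- read.csv("RiverPh.csv")
# 2-sample t-test assuming equal variances (Table 1)
t.test(pH~river,data=d,var.equal=TRUE)
# Linear model with river as the explanatory variable
lm1 <- lm(pH~river,data=d)
anova(lm1)    # ANOVA table (Table 2)
summary(lm1)  # coefficient results (Table 3)
coef(lm1)     # coefficients only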
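
The checks described in answers 1 and 3 through 8 can also be reproduced from the summary values alone. The following is a minimal sketch, not part of the original analysis: it copies the reported values from Tables 1 and 2 rather than rereading the data, and all object names (e.g., `t_stat`, `sp2`, `river_levels`) are hypothetical names chosen for illustration.

```r
# Summary values copied from Tables 1 and 2 (not recomputed from raw data).
xbar_A <- 8.662        # mean pH in river A
xbar_B <- 6.408        # mean pH in river B
n_A <- 10; n_B <- 10   # sample sizes used in answer 7
t_stat <- 6.9849       # t test statistic (Table 1)
F_stat <- 48.789       # F test statistic (Table 2)
df_within <- n_A + n_B - 2   # total n minus the number of groups (answer 5)

# Answer 1: the t-test and the F-test give the same p-value.
p_t <- 2 * pt(abs(t_stat), df = df_within, lower.tail = FALSE)
p_F <- pf(F_stat, df1 = 1, df2 = df_within, lower.tail = FALSE)

# Answer 6: F equals t squared (up to rounding in the tables).
t_squared <- t_stat^2

# Answers 7 and 8: recover the SE and the pooled variance from t.
SE_diff <- (xbar_A - xbar_B) / t_stat
sp2 <- SE_diff^2 / (1/n_A + 1/n_B)

# Answers 3 and 4: R codes the alphabetically first level as 0.
# 'river_levels' is a hypothetical two-value factor, not the real data.
river_levels <- factor(c("A", "B"))
X <- model.matrix(~ river_levels)

print(signif(c(p_t = p_t, p_F = p_F, t_squared = t_squared,
               SE_diff = SE_diff, sp2 = sp2), 4))
print(X)
```

The printed design matrix has a 0 in the second column for river A and a 1 for river B, which is the coding that makes the intercept the river A mean and the slope the difference in means. Because the inputs are rounded table values, the results should be expected to agree with the tables only to about three or four significant digits.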