Module 14 Hypothesis Testing - Errors

In Module 13 we examined the concept of hypothesis testing and showed how a p-value is calculated and used to reject or not reject H0. The basic concept there is that if the p-value is small (<α), then it is unlikely that the observed statistic came from a population with the parameter given in H0. It is important to note the word unlikely in this statement. It is possible that the observed statistic came from this population, just unlikely. In the case where this unlikely event actually happened, we would make an error in our conclusion about H0. These errors are discussed in this module.

14.1 Error Types

The goal of hypothesis testing is to make a decision about H0. Unfortunately, because of sampling variability, there is always a risk of making an incorrect decision. Two types of incorrect decisions can be made (Table 14.1). A Type I error occurs when a true H0 is falsely rejected. In other words, even if H0 is true, there is a chance that a rare sample will occur and H0 will be deemed incorrect. A Type II error occurs when a false H0 is not rejected.

Table 14.1: Types of decisions that can be made from a hypothesis test (DNR = do not reject).

Truth about Population   Decision from Test   Error Type
H0 is true               Reject H0            Type I (\(\alpha\))
H0 is true               DNR H0               none (correct decision)
HA is true               Reject H0            none (correct decision)
HA is true               DNR H0               Type II (\(\beta\))


It is important to be able to articulate what Type I and Type II errors are within the context of a specific situation. For example, suppose that if the mean abundance of a rare plant in a particular area drops below 0.5 plants per m² then there is a real concern that the plant will become extinct. A researcher may conduct a study to determine whether the mean abundance of this plant has dropped below the 0.5 plants per m² threshold. The null and alternative hypotheses would be

  • H0: μ=0.5, where μ is the mean abundance of the plant per m²
  • HA: μ<0.5

Before articulating what the errors would be in this example, it is useful to write in words what the hypotheses are within the context of the situation.

  • H0: “abundance at or above the threshold; population not at risk of extinction”
  • HA: “abundance below the threshold; population at risk of extinction”

From the definition of a Type I error (i.e., incorrectly rejecting a true H0), a Type I error occurs if we conclude that HA is true when H0 is really true. In this scenario then, a Type I error would be concluding that the abundance of the plant is low (think HA is true) when the abundance is really not low (H0 is really true) or concluding that the population is at risk of extinction when the population is really not at risk of extinction.

From the definition of a Type II error (i.e., incorrectly not rejecting a false H0), a Type II error is concluding that H0 is true when HA is really true. In this scenario then, a Type II error is concluding that the abundance of the plant is not low (think H0 is true) when the abundance is really low (HA is really true) or concluding that the population is not at risk of extinction when the population is really at risk of extinction.

14.2 Error Rates

The decision in the Square Lake example of Module 13 was a Type II error because H0:μ=105 was not rejected even though we know that μ=98.06 (Table 2.1). Unfortunately, in real life, it will never be known exactly when a Type I or a Type II error has been made because the true μ is not known. However, the rates at which these errors are made can be considered.

The probability of making a Type I error is set when α is chosen. Thus, the researcher can largely choose the rate at which they will make a Type I error. The probability of a Type II error is denoted by β, which is never known because calculating β requires knowing the true but unknown μ. Decisions can be made, however, that affect the magnitude of β (Figure 14.1).
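These two rates can be seen directly with a short simulation. The sketch below is a minimal illustration in Python; the function name z_test_rejects() and the choice of a one-tailed (lower) Z-test with known σ are our assumptions for illustration, not part of this module. It repeatedly samples from a normal population (using the Square Lake values) and tabulates how often a true H0 is rejected (which should be close to α) and how often a false H0 is not rejected (an estimate of β).

```python
import random
from statistics import NormalDist

def z_test_rejects(xs, mu0, sigma, alpha):
    """One-tailed (lower) Z-test: reject H0 when the test statistic
    falls below the lower alpha critical value of the standard normal."""
    n = len(xs)
    z = (sum(xs) / n - mu0) / (sigma / n ** 0.5)
    return z < NormalDist().inv_cdf(alpha)

mu0, mu_actual, sigma, n = 105, 98.06, 31.49, 50  # Square Lake values
alpha, reps = 0.05, 10_000
rng = random.Random(1)

# Type I error rate: sample with H0 true; the rejection rate should be near alpha.
type1 = sum(z_test_rejects([rng.gauss(mu0, sigma) for _ in range(n)],
                           mu0, sigma, alpha) for _ in range(reps)) / reps

# Type II error rate: sample with HA true; count the failures to reject.
type2 = sum(not z_test_rejects([rng.gauss(mu_actual, sigma) for _ in range(n)],
                               mu0, sigma, alpha) for _ in range(reps)) / reps

print(f"estimated Type I error rate: {type1:.3f} (alpha = {alpha})")
print(f"estimated Type II error rate (beta): {type2:.3f}")
```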

There are two items that affect β that a researcher can control: the size of α and n. The β decreases as α increases (Figure 14.1); i.e., the researcher reduces Type II errors by allowing for more Type I errors. In other words, the researcher is simply “trading errors,” which may be appropriate if a Type II error is more egregious than a Type I error. The β also decreases with increasing n (Figure 14.1); i.e., fewer errors are made as more information is gathered. Of these two choices, reducing β by increasing n is generally more beneficial because it does not result in an increase in Type I errors as would occur with increasing α.
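For a simple case, β can also be computed directly rather than simulated. Below is a minimal sketch, again assuming a one-tailed (lower) Z-test with σ known; the function beta_lower_z() is our illustrative name, not from this module. The loop over α shows the “trading errors” effect described above.

```python
from statistics import NormalDist

def beta_lower_z(mu0, mu_actual, sigma, n, alpha):
    """Type II error rate for a one-tailed (lower) Z-test with known sigma.

    H0 is not rejected when the sample mean falls at or above the critical
    value, so beta is the probability of that event computed under the
    actual mean (mu_actual) rather than the hypothesized mean (mu0)."""
    se = sigma / n ** 0.5
    crit = mu0 + NormalDist().inv_cdf(alpha) * se  # rejection cutoff for the mean
    return 1 - NormalDist(mu_actual, se).cdf(crit)

# "Trading errors": allowing more Type I errors (larger alpha) lowers beta.
for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(beta_lower_z(105, 98.06, 31.49, 50, alpha), 3))
```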

Figure 14.1: The relationship between one-tailed (lower) β and α, n, the actual mean (μA), and σ. Unless a quantity is being varied within a panel, μ0=105, μA=98.06, σ=31.49, n=50, and α=0.05.


The value of β is also related to two items that a researcher cannot control. The β increases as the difference between the hypothesized mean (μ0) and the actual mean (μA) decreases (Figure 14.1). This means that more Type II errors will be made when the hypothesized and actual means are close together; in other words, when it is harder to distinguish the hypothesized mean from the actual mean.

In addition, β increases with increasing amounts of natural variability (i.e., σ; Figure 14.1). In other words, more Type II errors are made when there is more variability among individuals.

A researcher cannot control the difference between μ0 and μA or the value of σ. However, it is important to know that if a “large” amount of variability is encountered, or the difference to be detected is small, then the researcher will need to increase n to reduce β. For example, if n could be doubled in the Square Lake example to 100, then β (for H0:μ=105) would decrease to approximately 0.18 (Figure 14.1).
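Using the hypothetical beta_lower_z() from the sketch above with the Square Lake values illustrates the benefit of increasing n. Because that simplified Z-test formulation is only our assumption about how Figure 14.1 was constructed, the computed values are illustrative and may not match the figure exactly.

```python
# Effect of sample size (Square Lake values): beta shrinks as n grows.
for n in (50, 100, 200):
    print(n, round(beta_lower_z(105, 98.06, 31.49, n, 0.05), 3))
```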

Statistical Power

A concept that is very closely related to decision-making errors is the idea of statistical power, or just power for short. Power is the probability of correctly rejecting a false H0. In other words, it is the probability of detecting a difference from the hypothesized value if a difference really exists. Power is used to demonstrate how sensitive a hypothesis test is for identifying a difference. High power related to an H0 that is not rejected implies that the H0 really should not have been rejected. Conversely, low power related to an H0 that was not rejected implies that the test was very unlikely to detect a difference, so not rejecting H0 is neither surprising nor particularly conclusive. Power is equal to 1-β.
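Since power is simply the complement of β, it can be computed from either of the β sketches above, e.g.:

```python
# Power is the probability of correctly rejecting a false H0: 1 - beta.
power = 1 - beta_lower_z(105, 98.06, 31.49, 50, 0.05)
print(f"power at n=50: {power:.3f}")
```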


14.3 Test Statistics and Effect Sizes

Instead of reporting the observed statistic and the resulting p-value, it may be of interest to know how “far” the observed statistic was from the hypothesized value of the parameter. This is easily calculated with

\[ \text{Observed Statistic}-\text{Hypothesized Parameter} \]

where “Hypothesized Parameter” represents the specific value in H0. However, the meaning of this difference is difficult to interpret without an understanding of the standard error of the statistic. For example, a difference of 10 between the observed statistic and the hypothesized parameter seems “very different” if the standard error is 3 but does not seem “different” if the standard error is 15 (Figure 14.2).

Figure 14.2: Sampling distributions of sample means with SE=3 (Left) and SE=15 (Right). A single observed sample mean of 90 (a difference of 10 from the hypothesized mean of 100) is shown by the red dot and arrow.


The difference between the observed statistic and the hypothesized parameter is standardized to a common scale by dividing by the standard error of the statistic. The result is called a test statistic and is generalized with

\[ \text{Test Statistic} = \frac{\text{Observed Statistic}-\text{Hypothesized Parameter}}{\text{SE}_{\text{Statistic}}} \]

Thus, the test statistic measures how many standard errors the observed statistic is away from the hypothesized parameter.¹ A relatively large value of the test statistic is indicative of a difference that is likely not due to randomness (i.e., sampling variability) and suggests that the null hypothesis should be rejected.

The test statistic in the Square Lake example is \(\frac{100-105}{31.49/\sqrt{50}}=-1.12\). Thus, the observed mean total length of 100 mm is 1.12 standard errors below the null hypothesized mean of 105 mm. From our experience, a little over one SE from the mean is not “extreme” and, thus, it is not surprising that the null hypothesis was not rejected.
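This calculation is easy to reproduce; a minimal sketch with the Square Lake values:

```python
from math import sqrt

# Test statistic: how many SEs the observed statistic lies from the
# hypothesized parameter (Square Lake values).
observed, hypothesized, sigma, n = 100, 105, 31.49, 50
se = sigma / sqrt(n)                        # SE of the sample mean
test_stat = (observed - hypothesized) / se
print(round(test_stat, 2))                  # -1.12, as computed above
```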

There are other forms for calculating test statistics, but all test statistics retain the general idea of scaling the difference between what was observed and what was expected under the null hypothesis by the sampling variability. Even though there is a one-to-one relationship between a test statistic and a p-value, a test statistic is often reported with a hypothesis test to give another sense of the magnitude of the difference between what was observed and what was predicted.



  1. This is very closely related to the concept of Z-scores and standardization introduced in Section 7.4.