Hypothesis testing

Null hypothesis testing (often described simply as hypothesis testing) is very commonly used in epidemiological investigations, and may be used both in analytic studies (for example, assessing whether disease experience differs between exposure groups) and in descriptive studies (for example, assessing whether disease experience differs from some suspected value). This page focuses on the use of hypothesis testing in analytic studies. Because most studies take only a sample of individuals, it is not possible to state definitively whether or not there is a difference between the two exposure groups. Hypothesis tests provide a method of assessing the strength of evidence for or against a true difference in the underlying population. However, despite their widespread use, the results of hypothesis tests are often misinterpreted.

Concept behind null hypothesis testing

Hypothesis tests provide a systematic, objective method of data analysis, but do not actually answer the main question of interest (which is commonly along the lines of 'is there a difference in disease experience between individuals with exposure x and individuals without exposure x?'). Rather, hypothesis tests answer the question 'if there is no difference in disease experience between individuals with or without exposure x, what is the probability of obtaining the current data (or data more 'extreme' than this)?' As such, hypothesis tests do not inform us whether or not there is a difference, but instead offer varying degrees of evidence against a situation where there is no difference in the population under investigation. This situation of 'no difference' is known as the 'null hypothesis' (denoted H₀). Along with this null hypothesis, an alternative hypothesis (H₁) should be stated, which is the statement that will be made if 'sufficient' evidence against H₀ is found. An example of a null and an alternative hypothesis is: H₀: there is no association between the prevalence of disease and exposure to factor x; H₁: there is an association between the prevalence of disease and exposure to factor x. However, on some occasions, the null and alternative hypotheses may have a direction - for example: H₀: the prevalence of disease amongst animals exposed to factor x is not higher than amongst animals not exposed to factor x; H₁: the prevalence of disease amongst animals exposed to factor x is higher than amongst animals not exposed to factor x. The difference between these two forms determines whether a one-tailed or a two-tailed test will be used.

Null hypothesis testing relies on the fact that the likely values obtained in a sample taken from a population in which the null hypothesis is true (i.e. there is no difference in disease experience between groups) can be predicted. That is, although it is most likely that a sample from this population will also show little or no difference, it is not impossible that, through chance, a sample will contain more diseased animals in the 'exposed' group than in the 'unexposed' group, for example. Through knowledge of the number of animals sampled and the prevalence of disease in the population, the probability of getting any particular pattern of data from this sample can be ascertained. The probability of obtaining the observed pattern (or one 'more extreme') when the null hypothesis is true is known as the p-value, and is the main outcome of interest from a hypothesis test. A p-value of 1.00 would suggest that if there was no difference in disease experience according to exposure in the population, and repeated samples were taken from this population and a hypothesis test performed on each, all of these samples would be expected to show this pattern (or 'more extreme' - i.e. show more of a difference). However, a p-value of 0.001 would suggest that there is only a 0.1% chance of seeing these data (or more extreme) if there was no true difference in the population. Note that although 0.001 is a very low p-value, it cannot be used to prove that the null hypothesis is false, or that the alternative hypothesis is true - it can only be stated that it gives 'strong evidence against the null hypothesis'.
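As a rough illustration of how a p-value can be obtained from the number of animals sampled and the number diseased, the sketch below uses Fisher's exact test, which is only one of several tests that could be applied to such data. The function name and the counts (12 of 40 exposed and 4 of 40 unexposed animals diseased) are hypothetical and chosen purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]] = [[exposed diseased, exposed healthy],
                        [unexposed diseased, unexposed healthy]]."""
    n_exposed = a + b
    n_unexposed = c + d
    n_diseased = a + c
    total = n_exposed + n_unexposed

    def table_probability(x):
        # Probability of x diseased animals in the exposed group when the
        # group sizes and total number diseased are fixed and H0 is true.
        return (comb(n_exposed, x) * comb(n_unexposed, n_diseased - x)
                / comb(total, n_diseased))

    observed = table_probability(a)
    lowest = max(0, n_diseased - n_unexposed)
    highest = min(n_exposed, n_diseased)
    # 'As or more extreme' is taken here to mean every possible table that
    # is no more probable than the observed one.
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical data: 12/40 exposed and 4/40 unexposed animals diseased.
print(fisher_exact_two_sided(12, 28, 4, 36))
```

A table in which both groups have identical disease experience (for example, 5 of 10 diseased in each) gives a p-value of 1.00 with this function, matching the interpretation given above.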

Significance levels

Although the approach described above (of varying degrees of evidence against the null hypothesis) is the most statistically correct interpretation of p-values, it is often not practical to apply this in epidemiological analysis. For example, an investigator may want to identify a number of exposures which appear to be associated with the outcome in order to investigate these further - the approach described above would not allow this (as it will never lead to an association being proven). As such, many studies use significance levels as a method of interpreting p-values as 'significant' or 'not significant'. Commonly, a p-value of 0.05 or less is used to denote a 'significant' association. This means that if there is a 5% chance or less of observing the data in question (or more extreme) if the null hypothesis is true, then the association will be denoted as 'significant'. Of course, when there is no true difference in the population, there remains a 5% chance of obtaining a 'significant' result through chance alone, and this can be a problem when testing large numbers of exposures (as shown in the cartoon here). Because of this, great care should be taken whenever using significance levels to interpret hypothesis tests, and the actual p-value should always be presented.
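The problem with testing large numbers of exposures can be sketched numerically. The short example below assumes, for simplicity, that none of the exposures is truly associated with the outcome and that the tests are independent of one another (assumptions that will rarely hold exactly in practice); the function name is hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent hypothesis
    tests is 'significant' at level alpha when every null hypothesis
    is in fact true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(n_tests, round(prob_any_false_positive(n_tests), 3))
```

Under these assumptions, testing 20 exposures at the 0.05 level gives roughly a 64% chance of at least one 'significant' result even though no exposure has any true effect.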

Limitations of null hypothesis tests

One considerable limitation of hypothesis testing is described above: namely, hypothesis tests do not relate to the main question of interest (whether or not there is a true difference in the population), and only provide degrees of evidence for or against there being no true difference. Another limitation is that there will almost always be a difference of some magnitude between the two groups, even if this is of no relevance. Consider a cohort study where 1 million non-diseased individuals are followed up to see whether or not exposure to substance x is associated with disease. It may be that in this whole population of 1 million individuals, 10.0% of exposed individuals develop the disease and 9.9% of unexposed individuals develop the disease. Of course, this difference is not of any biological relevance, and yet there is a difference there (as this is a whole population rather than a sample, we would not conduct a hypothesis test). As the size of any sample increases, the ability to detect a true difference increases. As there will be a 'true difference' (however small) in most populations, this means that hypothesis tests on large sample sizes will tend to give low p-values (indeed, some statisticians view hypothesis testing as a method of determining whether or not the sample size is sufficient to detect a difference). This problem can be reduced by ensuring that the appropriate measure of effect is always presented along with the hypothesis test p-value. In the example above, the incidence risk of disease amongst exposed individuals was 0.100, and that amongst unexposed individuals was 0.099, giving a risk ratio of 0.100/0.099 = 1.01. Therefore, regardless of the result of hypothesis testing, there is very little association between exposure and disease in this case.
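The effect of sample size on the p-value can be illustrated with a minimal sketch. It assumes that samples of various (hypothetical) sizes happen to show exactly the risks quoted above, 0.100 and 0.099, and applies a two-proportion z-test based on the normal approximation, which is one of several tests that could be used. The function and parameter names are illustrative only.

```python
from math import erfc, sqrt


def two_proportion_p_value(risk_exposed, risk_unexposed, n_per_group):
    """Two-sided p-value for the difference between two observed risks,
    using the normal approximation with equal-sized groups."""
    pooled = (risk_exposed + risk_unexposed) / 2
    standard_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (risk_exposed - risk_unexposed) / standard_error
    # erfc(|z| / sqrt(2)) equals twice the upper tail area of the
    # standard normal distribution beyond |z|.
    return erfc(abs(z) / sqrt(2))


risk_ratio = 0.100 / 0.099  # the measure of effect does not depend on n
for n_per_group in (1_000, 100_000, 10_000_000):
    p = two_proportion_p_value(0.100, 0.099, n_per_group)
    print(n_per_group, round(risk_ratio, 2), p)
```

The risk ratio is identical in every row, while the p-value falls steadily as the sample size grows, which is why the measure of effect should be reported alongside the p-value.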

Errors in hypothesis testing

In any hypothesis test, there is a risk that an incorrect conclusion is reached, which will take the form of either a type I or a type II error, as described below. Note that no single hypothesis test can be affected by both type I and type II errors, as they are each based on different assumptions regarding the source population. However, as the true state of the source population will not be known, both types of error should be considered when interpreting a hypothesis test (and when calculating the required sample size).

Type I error

This type of error refers to the situation where it is concluded that a difference between the two groups exists, when in fact it does not. The probability of a type I error is often denoted with the symbol α. As this type of error is based on a situation in which the null hypothesis is correct, it is associated with the significance level against which the p-value from a hypothesis test is compared, which is often set at 0.05. This means that, if the null hypothesis is correct, there is a 5% chance of a type I error. (In terms of hypothesis testing, a p-value of 0.05 is interpreted as 'if the null hypothesis was correct, we would expect to see this difference or greater only 5% of the time', meaning that there is [weak] evidence against the null hypothesis being correct.)
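The link between a 0.05 significance level and the type I error rate can be checked by simulation. The sketch below repeatedly draws two groups from a population in which the null hypothesis is true (the same risk of disease in both groups) and counts how often a two-proportion z-test is 'significant'. The true risk, group size, number of simulated studies and random seed are all arbitrary assumptions made for illustration.

```python
import random
from math import erfc, sqrt


def two_sided_p_value(cases_a, cases_b, n_per_group):
    """Two-sided p-value from a two-proportion z-test (normal approximation)."""
    pooled = (cases_a + cases_b) / (2 * n_per_group)
    standard_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    if standard_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (cases_a - cases_b) / n_per_group / standard_error
    return erfc(abs(z) / sqrt(2))


def simulated_type_i_error_rate(true_risk=0.3, n_per_group=200,
                                n_studies=2000, alpha=0.05, seed=1):
    """Proportion of simulated studies declared 'significant' when both
    groups share the same true risk (i.e. the null hypothesis is true)."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_studies):
        cases_a = sum(rng.random() < true_risk for _ in range(n_per_group))
        cases_b = sum(rng.random() < true_risk for _ in range(n_per_group))
        if two_sided_p_value(cases_a, cases_b, n_per_group) <= alpha:
            significant += 1
    return significant / n_studies


print(simulated_type_i_error_rate())
```

The proportion printed should be close to, though not exactly, 0.05, as it is subject to simulation error and to the approximation used by the test.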

Type II error

This type of error refers to the situation where it is concluded that no difference between two groups exists, when in fact it does. The probability of a type II error is often denoted with the symbol β. The 'power' of a study is defined as the probability of detecting a difference when it does exist, and so can be calculated as (1-β).
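To show how β and power relate to sample size, the sketch below uses a common normal-approximation formula for comparing two risks with equal-sized groups. The risks (0.2 and 0.1), the group sizes and the function name are hypothetical, and the formula is an approximation intended for use when a true difference exists, not an exact calculation.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(risk_exposed, risk_unexposed, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-proportion z-test
    with equal group sizes, assuming the stated true risks."""
    norm = NormalDist()
    z_critical = norm.inv_cdf(1 - alpha / 2)
    pooled = (risk_exposed + risk_unexposed) / 2
    # Standard error assumed by the test (null hypothesis true).
    se_null = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    # Standard error of the difference under the stated true risks.
    se_alternative = sqrt(
        (risk_exposed * (1 - risk_exposed)
         + risk_unexposed * (1 - risk_unexposed)) / n_per_group
    )
    difference = abs(risk_exposed - risk_unexposed)
    # Ignores the very small chance of a 'significant' result in the
    # opposite direction to the true difference.
    return norm.cdf((difference - z_critical * se_null) / se_alternative)


for n_per_group in (50, 100, 200, 400):
    power = approximate_power(0.2, 0.1, n_per_group)
    print(n_per_group, round(power, 2), "beta =", round(1 - power, 2))
```

With these assumed risks, around 200 individuals per group gives a power of approximately 0.8, and the power rises (so β falls) as the groups get larger.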

One-tailed and two-tailed tests

Another issue to consider when conducting a hypothesis test is whether a one-tailed or a two-tailed test is required. The decision of which to use will depend upon the null and alternative hypotheses stated. A two-tailed test will allow the detection of a difference in either direction (for example, either a lower prevalence of disease in the exposed group or a lower prevalence of disease in the unexposed group). One-tailed tests are used when the null and alternative hypotheses have a direction, and will only detect a difference if it is in that particular direction.

For example, consider a clinical trial of a drug which is thought to reduce the risk of death. However, it may be found that the drug actually increases the risk of death. If H₀ was defined as 'there is no difference in risk of death according to treatment status', then H₁ would be 'there is a difference in risk of death according to treatment status', and a two-tailed test would be performed. This hypothesis test would be expected to find evidence against the null hypothesis (the direction of which would be quantified through a measure of effect such as the risk ratio). However, if the investigators were convinced that the drug would not increase the risk of death, then H₀ may be stated as 'the risk of death is not reduced amongst treated animals', in which case H₁ would be 'the risk of death is reduced amongst treated animals'. In this case, a one-tailed hypothesis test would be performed, which would fail to find evidence against the null hypothesis (since the null is in fact correct). For this reason, two-tailed tests are used in the vast majority of cases.
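The clinical trial example can be put into numbers with a small sketch. It assumes a test statistic z that follows a standard normal distribution under the null hypothesis and is defined as (risk in treated - risk in untreated) divided by its standard error, so that a positive z means a higher risk of death amongst treated animals. The value z = 2.5 and the function name are hypothetical.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (two-tailed p-value, one-tailed p-value for H1: 'risk of
    death is reduced amongst treated animals') for a z statistic."""
    norm = NormalDist()
    two_tailed = 2 * (1 - norm.cdf(abs(z)))
    # For H1 'risk is reduced', only large negative z values count as
    # evidence against H0, so the p-value is the lower tail area.
    one_tailed_reduction = norm.cdf(z)
    return two_tailed, one_tailed_reduction


# Hypothetical trial in which the drug actually increases the risk of death.
two_tailed, one_tailed = p_values_from_z(2.5)
print("two-tailed:", round(two_tailed, 4))
print("one-tailed (H1: risk reduced):", round(one_tailed, 4))
```

In this hypothetical case the two-tailed test gives a low p-value, whereas the one-tailed test gives a p-value close to 1 and so provides no evidence against its null hypothesis, even though the drug is harmful.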