Hypothesis tests

This page is intended as an introduction to some commonly used basic hypothesis tests. Before using any of these, it is important that the concepts behind hypothesis testing are understood. These concepts are explained on this page.

Hypothesis tests are very commonly used in epidemiological investigations, and a wide range of tests is available. These can be classified according to the data types in question, according to whether a specific underlying distribution is assumed when performing the test (in which case the test is known as a parametric test), and according to whether the data are matched or independent (i.e. whether comparisons are being made at the individual level or the group level). As described earlier, qualitative data are not numerical in nature, and include categorical and ordinal data (such as the breed of a dog, or the body condition score of a cow). Quantitative data are numerical, and include variables such as weight, age and height.

Comparing a qualitative variable between different groups

Chi-square test

The chi-square test is one of the most commonly used hypothesis tests, and allows the comparison of any qualitative exposure with any qualitative outcome (given that certain assumptions are met). As a simple example, it may be used to investigate the effect of previous exposure to substance x on disease experience amongst a group of animals - as classified in the 2x2 contingency table below:

Disease status	Exposed to x	Unexposed to x	Total
Diseased	a₁	a₀	m₁
Non-diseased	b₁	b₀	m₀
Total	n₁	n₀	n

A chi-square test could also be used to investigate whether the body condition score of a horse was associated with its lameness score, as classified in the rxc contingency table below:

Lameness score	Body condition score 1-3	Body condition score 4-6	Body condition score 7-9	Total
1	b₁	b₂	b₃	m_b
2	c₁	c₂	c₃	m_c
3	d₁	d₂	d₃	m_d
4	e₁	e₂	e₃	m_e
5	f₁	f₂	f₃	m_f
Total	n₁	n₂	n₃	n

In either case, the chi-square test is based upon the comparison of the observed results with the results which would be expected if there were no association between the exposure and outcome of interest. For each cell, the 'expected' value is calculated as the row total multiplied by the column total, divided by the overall total. This expected value is subtracted from the observed value, the difference is squared, and the result is divided by the expected value. The process is repeated for all other cells, and the results are summed to give a test statistic, which can be interpreted using a table of the chi-squared distribution (or by using a computer program) in order to give a p-value. The number of cells involved in the calculation of the test statistic will have an impact upon its magnitude, and this is accounted for in the form of 'degrees of freedom' (which can be a difficult concept to understand, but which relate in this context to the number of cells which are free to take any value, given that the row and column totals are fixed). The number of degrees of freedom can be calculated by subtracting 1 from the number of rows, subtracting 1 from the number of columns, and multiplying these together, meaning that a 2x2 table has one degree of freedom.
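The calculation described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names and the counts in the example table are hypothetical, and the p-value helper applies only to tables with one degree of freedom (it relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable).

```python
import math


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under no association: row total * column total / overall total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected


def p_value_one_dof(statistic):
    """Upper-tail p-value, valid ONLY for one degree of freedom (a 2x2 table)."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: rows = diseased / non-diseased, columns = exposed / unexposed
observed = [[20, 10],
            [30, 40]]
statistic, dof, expected = chi_square_statistic(observed)
p_value = p_value_one_dof(statistic)
print(statistic, dof, p_value)
```

For larger tables the same statistic and degrees of freedom apply, but the p-value would need to be read from a chi-squared table or obtained from a statistics package.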

The main assumptions of a chi-square test are:

the data are derived from a simple random sample
observations are independent of each other (i.e. there are no repeated measures or other forms of clustering)
at least 80% of all cells (i.e. all cells in a 2x2 table, or eight cells in a 2x5 table) have expected values of greater than 5, with no cells having an expected value of zero.
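The expected-count assumption can be checked before running the test. The sketch below follows the rule exactly as worded in the list above (at least 80% of cells with an expected value greater than 5, and no expected value of zero); the function names and example counts are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under no association, returned as a flat list."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [r * c / n for r in row_totals for c in col_totals]


def meets_expected_count_rule(table, threshold=5, min_fraction=0.8):
    """True if enough cells have expected counts above the threshold and none is zero."""
    expected = expected_counts(table)
    large = sum(1 for e in expected if e > threshold)
    return large / len(expected) >= min_fraction and all(e > 0 for e in expected)


# Hypothetical 2x2 tables: the first is comfortably large, the second is sparse
print(meets_expected_count_rule([[20, 10], [30, 40]]))
print(meets_expected_count_rule([[3, 1], [2, 14]]))
```

When the check fails, an exact method such as Fisher's exact test is the usual alternative.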

Fisher's exact test

Fisher's exact test is most commonly used instead of the chi-square test when the sample size is small and/or when expected cell counts are less than 5. This test generally requires the variables of interest to be dichotomous (i.e. a 2x2 contingency table), although methods are available for applying the test to larger contingency tables. Instead of assuming that the test statistic approximately follows a particular distribution (as is the case with the chi-square test), the exact probability of the observed arrangement of data, together with that of all 'more extreme' arrangements, is calculated from the hypergeometric distribution, treating the row and column totals of the contingency table as fixed.
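As an illustration, the exact calculation for a 2x2 table can be sketched as follows. The function name and the example counts are hypothetical. The sketch assumes one common convention for a two-sided test, in which 'more extreme' means any table with the same margins whose probability is no greater than that of the observed table; other conventions exist.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    # Possible values of the top-left cell once the margins are fixed
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Hypergeometric numerators; the shared denominator is comb(n, col1)
    weights = {x: comb(row1, x) * comb(row2, col1 - x)
               for x in range(lowest, highest + 1)}
    # Sum every arrangement that is no more probable than the observed one
    extreme = sum(w for w in weights.values() if w <= weights[a])
    return extreme / comb(n, col1)


# Hypothetical small study: 3 of 4 exposed and 1 of 4 unexposed animals diseased
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```

Comparing integer numerators, and dividing only at the end, avoids floating-point ties when deciding which arrangements count as 'more extreme'.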

Comparing a quantitative variable between two groups

t-test

The t-test (also known as 'Student's t-test') is the most commonly used test for comparing the means of two normally distributed variables, and can also be used to assess whether the mean of a single normally distributed variable differs from a particular value. As for many hypothesis tests, it involves the calculation of a test statistic which is assumed to follow a particular distribution (in this case, the t distribution). The general approach to the calculation of the test statistic is to divide the difference of interest (whether that is the difference between the mean of interest and a particular value, or the difference between two different means of interest) by the standard error of this difference. The method of calculating the standard error differs depending upon the characteristics of the data in question:

When comparing the mean of a group with a particular value, the difference between the mean and the value in question is divided by the product of the standard deviation of the group and the reciprocal of the square root of the number of individuals in the sample.
When comparing the means of two groups, the approach used depends on other characteristics of the data:
- when both groups have approximately equal variances and there are equal numbers of individuals in each group, the difference in mean values between the two groups is divided by the product of the pooled standard deviation and the square root of the quantity 'two divided by the number of individuals in each group'.
- when both groups have approximately equal variances but there are different numbers of individuals in each group, the difference in mean values between the two groups is divided by the product of the pooled standard deviation and the square root of the sum of the reciprocals of the group sizes.
- when the groups have different variances, the difference in mean values between the two groups is divided by the square root of the sum, over the two groups, of each group's variance divided by its group size.
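The standard errors described above translate directly into code. The following is a minimal sketch using Python's standard library; the function names and the example values are hypothetical, and only the test statistic is computed (converting it to a p-value requires the t distribution, from tables or a statistics package). Because the sum of the reciprocals of two equal group sizes is two divided by that size, a single pooled function covers both equal-variance cases.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_t(sample, value):
    """t statistic for comparing the mean of one group with a particular value."""
    n = len(sample)
    return (mean(sample) - value) / (stdev(sample) * (1 / sqrt(n)))


def pooled_t(x, y):
    """t statistic for two groups assumed to have approximately equal variances."""
    n1, n2 = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / (sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2))


def unequal_variance_t(x, y):
    """t statistic for two groups with different variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


# Hypothetical measurements for two groups of five animals
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(one_sample_t(group_a, 2.0))
print(pooled_t(group_a, group_b))
print(unequal_variance_t(group_a, group_b))
```

With equal group sizes, as in this example, the pooled and unequal-variance statistics coincide; they differ in the degrees of freedom used to interpret them.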

Mann-Whitney U test

The Mann-Whitney U test (also known as the Wilcoxon rank sum test) is commonly used for the comparison of two groups where the variable of interest is continuous but not normally distributed in at least one group. The basic concept behind the test is that the observations from both groups are combined and then ranked in order of magnitude. The ranks are then reassigned to their original groups, and the sums of the ranks in each group are calculated and compared. Equivalently, the test statistic can be obtained by counting, for each observation in one group, the number of observations in the other group which have a lower rank, and summing these counts. Alternative methods are available for those cases where this approach would not be feasible.
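Both routes to the U statistic can be sketched briefly. The function names and data below are hypothetical, and the sketch assumes the usual conventions of counting a tie between groups as one half and giving tied values the average of the ranks they span. Only the statistic is computed, not the p-value.

```python
def u_by_counting(x, y):
    """U for group x: count of (x, y) pairs where the x value is larger (ties = 0.5)."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )


def midranks(values):
    """Rank values from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def u_by_rank_sum(x, y):
    """U for group x from its rank sum in the combined sample."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Hypothetical measurements from two independent groups
group_a = [1.1, 2.3, 3.0]
group_b = [2.0, 4.5, 5.2, 6.1]
print(u_by_counting(group_a, group_b), u_by_rank_sum(group_a, group_b))
```

The U values for the two groups always sum to the product of the group sizes, which provides a quick check on the arithmetic.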

Comparing a quantitative variable between more than two groups

F-test

The F-test is used when comparing the mean of a quantitative variable between two or more groups, as is the case when conducting an ANOVA (analysis of variance). The basic concept of an ANOVA is that the total variation in the data can be viewed as being composed of the variation between the groups (i.e. 'explained variation') and the variation within the groups (i.e. 'unexplained variation'). The F statistic is the ratio of a measure of the variance between groups (the 'mean square of treatment', MSTR) to a measure of the variance within groups (the 'mean square error', MSE).
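The partition of variation and the resulting F statistic for a one-way ANOVA can be sketched as follows. The function name and the data are hypothetical; the sketch returns the statistic and its two degrees of freedom, and leaves the p-value to an F table or a statistics package.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, (df_between, df_within)) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    group_means = [mean(g) for g in groups]
    # Between-group ('explained') sum of squares
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group ('unexplained') sum of squares
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    mstr = ss_between / (k - 1)  # mean square of treatment
    mse = ss_within / (n - k)    # mean square error
    return mstr / mse, (k - 1, n - k)


# Hypothetical measurements from three groups of three animals
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic, degrees_of_freedom = one_way_f(groups)
print(f_statistic, degrees_of_freedom)
```

A large F statistic indicates that the variation between group means is large relative to the variation within groups.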

Kruskal-Wallis test

Comparing a categorical outcome between matched observations

McNemar's chi-square test

Comparing a quantitative outcome between matched observations

Paired t-test

Comparing two quantitative outcomes between matched observations

Pearson's correlation coefficient

Spearman's rank correlation coefficient