Sampling strategies

The information gained through the study of disease in populations will be increased if more members of the population are sampled. However, sampling every individual in a population is rarely feasible from either a logistical or an economic perspective (except in the case of very small-scale studies). Censuses are a form of descriptive study which aim to systematically collect information about every member of the population of interest (the source population), and are carried out in many countries for both livestock and humans (although information regarding disease may not be collected). Statistical surveys are another type of descriptive study, which aim to select a sample (known as the study sample) from the source population, with the intention of extrapolating the information about these individuals to the source population. Similarly, in most analytic studies, a sample of the population must be selected for the same reasons.

This process of sampling from populations poses potential problems, as it must both select a sufficient number of individuals in order to be useful for the purposes of the study (whilst not sampling more than is required), and must also ensure that any biases in the selection process are minimised.

Populations and samples

When sampling from populations and when interpreting the results of studies involving sampling, it is important to consider what can be inferred from the results. A number of concepts are presented here, using the example of a descriptive study investigating the prevalence of bovine tuberculosis amongst beef cattle in England.

Target population

The target population is the population to which the results of the study may be extrapolated, even if not all members of this population were eligible for sampling. It is often not clearly defined. In the example given here, it may be that the target population is viewed as all cattle (beef, dairy and noncommercial) in England, or all beef cattle in Great Britain, or all cattle in Great Britain. The decision regarding which population the results can be extrapolated to will depend on the knowledge and experience of the person interpreting the study, and the suitability of this extrapolation is described as the external validity of the study.

Source population

The source population is the population from which the sample was taken, and therefore all members of this population should have a chance of being selected for inclusion in the study. In the case of the example given here, the source population may be all registered beef herds in England. As such, the results obtained from the sample should relate to this population - if this is not the case, then there are likely to be considerable problems in the interpretation of the results. This is known as the internal validity of the study.

Study sample

The study sample includes those animals which are included in the final study. It is important to remember that in most epidemiological studies, we are not interested in this sample per se - rather, we are interested in using it to make statements regarding the source population (and possibly the target population). Because not all members of the source population have been sampled, statistical techniques need to be applied to the results from the study sample in order to estimate what the characteristics of the source population are expected to be. Due to this extrapolation, there is always a possibility that any estimates from a sample are incorrect due to random variation in the sample. Although this random variation cannot be reduced without increasing the sample size (or redefining the source population), the accuracy of the estimate can be maximised by ensuring that sources of bias are minimised.

Approaches to sampling

Probability sampling

In probability sampling, every individual in the source population has a calculable, non-zero probability of being randomly selected for the study sample. As such, it is the only appropriate method of sampling for descriptive studies (since the ability to extrapolate the results to the source population is of integral importance in these cases). Some types of probability sampling are described below.

Simple random sampling

Simple random sampling is the optimal method of sampling from a population, from a statistical viewpoint. It requires the formation of a sampling frame, which is a list of all the individuals in the source population. From this, a randomisation procedure is used to select animals for further study. As such, if a sampling frame is not available and cannot be created, simple random sampling cannot be used. A very basic example of a simple random sampling procedure could involve labelling each of the members of the source population on pieces of paper, and randomly selecting a number of these out of a bag - however, computerised techniques involving random numbers are more commonly used nowadays.
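
As an illustration of the computerised approach mentioned above, the following is a minimal sketch using Python's standard random module. The sampling frame of ear-tag identifiers and the function name simple_random_sample are hypothetical and are only intended to show the idea.

```python
import random

def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample (without replacement) from a sampling frame."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(sampling_frame, sample_size)

# Hypothetical sampling frame: identifiers for a source population of 200 animals.
frame = [f"UK{n:04d}" for n in range(1, 201)]

# Select 20 animals; every animal in the frame has the same chance of selection.
sample = simple_random_sample(frame, 20, seed=1)
print(sample)
```

Because selection is made from the list itself, this approach is only possible when the sampling frame exists.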

Systematic random sampling

Systematic random sampling does not require a sampling frame, but does require each individual in the source population to be identifiable, and requires the individuals to be randomly ordered in some way. A member of the population is initially selected, and then other individuals are selected based on a set sampling interval (calculated by dividing the size of the source population by the required sample size). A common application of systematic random sampling is when animals are lined up to pass through a race or when dairy cattle are entering the milking parlour (note, however, that recently calved animals may be excluded from the sample in the latter example, which may result in selection bias). For example, if you wanted to take a sample of 20 animals from a sheep flock containing 200 animals which are all due to pass through a race in order to be dosed with anthelmintics, you would calculate the sampling interval (200/20 = 10), and then randomly select a number within this interval to indicate the first sheep. Assume that the number selected is four: you would then sample the fourth animal passing through the race, followed by every 10th animal after it (i.e. 14th, 24th, 34th...184th, 194th) - giving you a total sample of 20 animals. Given that the order of passing through the race is random, every sheep has an equal probability of being selected (prior to selection of the starting animal!).
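
The sheep example above can be sketched in a few lines of Python. The function name systematic_positions is hypothetical, and the sketch assumes that the population size is an exact multiple of the sample size.

```python
import random

def systematic_positions(population_size, sample_size, start=None, seed=None):
    """Return the positions (1-based) of the animals to sample.

    Assumes population_size is an exact multiple of sample_size.
    """
    interval = population_size // sample_size
    if start is None:
        # Random starting point between 1 and the sampling interval.
        start = random.Random(seed).randint(1, interval)
    return [start + i * interval for i in range(sample_size)]

# 200 sheep, 20 required, and a randomly chosen start of 4 (as in the text).
positions = systematic_positions(200, 20, start=4)
print(positions)  # 4, 14, 24, ... 184, 194
```

Leaving start unset lets the function choose the starting animal at random, which is what would be done in practice.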

Stratified random sampling

This form of sampling is based on simple or systematic random techniques, but prior to selection of the study sample, the source population is divided into a number of strata (often according to factors considered to be associated with disease). Most commonly, the proportion of animals within each stratum in the source population is used as the proportion of the total sample size to be taken from each stratum (and therefore, the number of animals to be selected per stratum). This approach ensures that every animal has an equal probability of selection. However, other approaches may be used which produce a 'weighted sample' (for example, animals from one particular stratum may be oversampled) - it is important to note that even in these cases, the sampling strategy is still a probability sample, as the probability for each animal within each stratum can still be calculated (even if the probability differs between strata). In these cases, additional approaches must be applied in the analysis stage in order to 'unweight' the sample.
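
Proportional allocation, as described above, can be sketched as follows. The strata, the herd composition and the function names are hypothetical; simple random sampling is used within each stratum.

```python
import random

def proportional_allocation(stratum_sizes, total_sample_size):
    """Number of animals to sample per stratum, proportional to stratum size."""
    total = sum(stratum_sizes.values())
    # Rounding may make the allocations sum to slightly more or less than
    # total_sample_size when the proportions are not whole numbers.
    return {name: round(total_sample_size * size / total)
            for name, size in stratum_sizes.items()}

def stratified_sample(strata, total_sample_size, seed=None):
    """Simple random sample within each stratum, using proportional allocation."""
    rng = random.Random(seed)
    sizes = {name: len(animals) for name, animals in strata.items()}
    allocation = proportional_allocation(sizes, total_sample_size)
    return {name: rng.sample(animals, allocation[name])
            for name, animals in strata.items()}

# Hypothetical herd of 500 animals divided into three age strata.
strata = {
    "calves": [f"calf{i}" for i in range(50)],
    "heifers": [f"heifer{i}" for i in range(100)],
    "cows": [f"cow{i}" for i in range(350)],
}
sample = stratified_sample(strata, 50, seed=1)
print({name: len(animals) for name, animals in sample.items()})
```

In this example each animal has a selection probability of 50/500 = 0.1, whichever stratum it belongs to.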

Cluster sampling

Cluster sampling is used in cases where the individual animals of interest are 'clustered' within other groupings (such as animals within farms), and it is easier to sample many animals from a smaller number of clusters than it would be to sample small numbers of animals from many clusters (as would be the likely situation if simple random sampling was used), or if a sampling frame of the clusters (known as the primary sampling units) but not the individual animals is available. A random sample of clusters is first made (using simple or systematic random sampling techniques), followed by sampling of every individual within the selected cluster. As each cluster has an equal probability of being selected, and as every animal within these clusters is then sampled, the probability of selection of any individual animal is constant. It should however be noted that variation in the outcome of interest is likely to be lower within clusters than between clusters, meaning that this must be accounted for when calculating the sample size and when interpreting the results.
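
A minimal sketch of this one-stage approach is given below, assuming a hypothetical sampling frame of farms (the primary sampling units) in which the animals on each farm are only listed once the farm has been selected.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select clusters, then include every individual within them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # simple random sample of clusters
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical source population: six farms of differing sizes.
farms = {
    f"farm{k}": [f"farm{k}-animal{i}" for i in range(size)]
    for k, size in enumerate([30, 45, 60, 25, 80, 50], start=1)
}

sample = cluster_sample(farms, 3, seed=1)
print({name: len(animals) for name, animals in sample.items()})
```

The number of animals finally sampled is not fixed in advance, since it depends on the sizes of the farms which happen to be selected.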

Multistage sampling

This sampling approach extends the concepts used in cluster sampling in order to avoid sampling every individual within each cluster (since this may be impractical, for example in the case of large sheep farms containing thousands of animals, or there may be very little variation between animals within the clusters). In order to ensure that the probability of selection of each individual (known as a secondary sampling unit, or SSU) is constant, the sampling of SSUs within the primary sampling units (PSUs) can either be based on sampling with a probability proportional to size, or can use the same process as described for cluster sampling, with the selection of a set proportion of all animals within each cluster. Sampling with probability proportional to size involves weighting the larger clusters in order to increase their chance of selection, followed by the selection of a set number of animals within each selected cluster. This approach is often used as it is logistically simpler (since the exact same number of animals is sampled, regardless of the size of the farm). Methods of weighting larger clusters are described elsewhere.
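
One possible way of sampling with probability proportional to size is sketched below. It uses a systematic selection along the cumulative cluster sizes, which is only one of several methods and is not necessarily the method referred to above. The farm names, farm sizes and function names are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_select_clusters(cluster_sizes, n_clusters, seed=None):
    """Select clusters with probability proportional to size (systematic method).

    A cluster larger than the sampling interval may be selected more than once.
    """
    rng = random.Random(seed)
    names = list(cluster_sizes)
    cumulative = list(accumulate(cluster_sizes[name] for name in names))
    interval = cumulative[-1] / n_clusters
    start = rng.random() * interval  # random start within the first interval
    points = [start + i * interval for i in range(n_clusters)]
    # Each point falls within the cumulative range of exactly one cluster.
    return [names[min(bisect_left(cumulative, p), len(names) - 1)] for p in points]

def two_stage_sample(clusters, n_clusters, animals_per_cluster, seed=None):
    """Stage 1: PPS selection of clusters. Stage 2: fixed number of animals each."""
    rng = random.Random(seed)
    sizes = {name: len(animals) for name, animals in clusters.items()}
    selected = pps_select_clusters(sizes, n_clusters, seed=seed)
    return [(name, rng.sample(clusters[name], animals_per_cluster)) for name in selected]

# Hypothetical farms (PSUs) containing 100, 400, 250 and 250 animals (SSUs).
farms = {name: [f"{name}-{i}" for i in range(size)]
         for name, size in {"A": 100, "B": 400, "C": 250, "D": 250}.items()}

sample = two_stage_sample(farms, n_clusters=2, animals_per_cluster=10, seed=1)
print([(name, len(animals)) for name, animals in sample])
```

Larger farms are more likely to be selected, but an animal on a large farm is less likely to be chosen once its farm is selected, so the overall probability of selection is approximately the same for every animal.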

Non-probability sampling

Non-probability sampling methods do not use random selection techniques, and so are not ideal from a statistical viewpoint. However, they are commonly used in analytic investigations, where the internal validity does not need to be as high as for descriptive studies.

Judgement sampling

In this type of sampling, the members of the study sample are chosen because the investigator considers them to be representative of the source population. As such, the criteria for selection (and therefore the definition of the source population) are not stated, making it very difficult to assess the validity of the results obtained. This sampling method is therefore not advised.

Convenience sampling

The selection of the study sample in these cases is based on convenience - for example, ease of access, availability of data or the presence of a sampling frame. Despite the limitations of this approach, the criteria for selection can often be clearly defined, and in some cases the results of an analytic study using convenience sampling can be extrapolated to a larger target population.

Purposive sampling

This sampling strategy involves the inclusion of a subset of the total population in the sample - for example, only including infected or exposed animals. It can therefore be viewed as a change in the definition of the source population - meaning that if random sampling techniques are used in order to select the study sample, it is actually a probability sample involving only a subset of the population.

Sample size calculation

As mentioned earlier, it is important in any study not only that bias is minimised, but that the sample has sufficient precision and power (in the case of analytic studies) to answer the question(s) for which the study is intended. Both of these are closely related to the random variability in any sample taken from a population. Although this can be reduced by increasing the sample size, a number of other considerations (usually logistical and economic) will limit the number of samples which can realistically be taken. Statistical techniques are therefore available in order to calculate the required sample size. Counterintuitively, these require assumptions to be made regarding the final results of the study, as well as information regarding the required level of confidence, precision or power of the study. Sample size formulae are not given here, but can be found in most statistical textbooks.

Expected variation in the data

The variability of an outcome of interest in the sample collected will have a considerable effect on the precision and power of a study. When the outcome is a continuous variable, this variability can be measured as the variance in the source population. However, in the case of binary outcomes, the concept of variability can be more difficult to comprehend. In these cases, the binomial distribution is used to estimate the variance - calculated as the proportion of animals with the outcome of interest multiplied by the proportion of animals without the outcome of interest. This can be viewed as the expected variation in the proportion estimate of a sample if a number of samples were repeatedly taken from the source population, rather than the variation in the proportion estimate in the source population itself.
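
The calculation for a binary outcome can be illustrated with a short sketch. The variance follows the description above; the standard error of the proportion (the square root of the variance divided by the sample size) is a standard result which is not given in the text, and is included here as an assumption in order to show how the variance relates to repeated sampling.

```python
import math

def binary_variance(p):
    """Variance of a binary outcome with proportion p: p multiplied by (1 - p)."""
    return p * (1 - p)

def standard_error_of_proportion(p, n):
    """Expected variation of the sample proportion over repeated samples of size n.

    Assumption: standard textbook formula, sqrt(p * (1 - p) / n).
    """
    return math.sqrt(binary_variance(p) / n)

# The variance is greatest when the proportion is 0.5, and falls towards 0 or 1.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_variance(p), 4), round(standard_error_of_proportion(p, 100), 4))
```

This is why an expected prevalence of 50% is often used when no prior estimate is available: it gives the largest variance and therefore the most conservative sample size.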

Required precision

In the case of descriptive studies, this relates to the width of the 95% confidence interval. For example, you may want to estimate the seroprevalence to Bluetongue virus to within ±10% of the true population seroprevalence, or you may want to estimate the mean skin thickness of a group of cattle following tuberculin testing to within ±1mm of the true population mean. The concept of precision is also used in analytic studies, in the form of the difference between groups which you wish to detect. As this is closely associated with power calculations, it is mentioned in the 'power' section below.

Level of confidence

This is used in descriptive studies in order to indicate the level of confidence that the confidence interval of the estimate produced will contain the true population value. Usually, a confidence level of 95% is used. The concept of confidence intervals is explained further in the section on random variation.
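
To show how the expected variation, the required precision and the level of confidence combine, the following sketch uses a commonly quoted formula for estimating a proportion, n = z² × p × (1 - p) / d². This formula is an assumption (it is not given in the text), it relies on a normal approximation, and it assumes simple random sampling from a large population.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(expected_p, precision, confidence=0.95):
    """Sample size to estimate a proportion to within +/- precision.

    Assumption: normal approximation, n = z^2 * p * (1 - p) / d^2.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n = z ** 2 * expected_p * (1 - expected_p) / precision ** 2
    return math.ceil(n)  # round up to a whole animal

# Expected seroprevalence of 50%, to be estimated to within +/- 10%.
print(sample_size_for_proportion(0.5, 0.10))   # 97
# Expected seroprevalence of 20%, to be estimated to within +/- 5%.
print(sample_size_for_proportion(0.2, 0.05))   # 246
```

Halving the acceptable error roughly quadruples the required sample size, since the precision term is squared.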

Power

This relates to the ability to detect a difference in a parameter of interest between two groups, and so relates to analytic studies. The power indicates the probability that a study will detect a 'significant' difference between groups (using a specified p-value [usually 0.05] to indicate significance), assuming that a difference of a specified size does exist. For example, if there is a true difference in mean annual milk yield of 500 litres between two groups of cows, a study with a power of 80% will detect a statistically significant difference 80% of the time. That is, if the same study was repeated again and again, selecting the calculated required number of cows from each group, 80% of these studies would detect a difference between groups and 20% would not.
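
The milk yield example can be turned into an approximate sample size using a commonly quoted formula for comparing two means, n per group = 2 × SD² × (z for significance + z for power)² / difference². This formula is an assumption (it is not given in the text) and relies on a normal approximation with equal group sizes; the standard deviation of 1000 litres is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_two_means(difference, sd, alpha=0.05, power=0.80):
    """Animals needed per group to detect a difference between two means.

    Assumption: normal approximation, two-sided test, equal group sizes and
    a common standard deviation (sd) in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    n = 2 * sd ** 2 * (z_alpha + z_power) ** 2 / difference ** 2
    return math.ceil(n)

# True difference of 500 litres, hypothetical standard deviation of 1000 litres.
print(sample_size_two_means(difference=500, sd=1000))              # 63 per group
print(sample_size_two_means(difference=500, sd=1000, power=0.90))  # more animals needed
```

Smaller differences, greater variability and higher power all increase the number of animals required in each group.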

Clustering

When cluster or multistage sampling techniques are used, clustering of the outcome of interest will affect the required sample size, since animals within the same cluster would be expected to be more similar to each other than to those from other clusters. Therefore, formulas are available in order to calculate the 'design effect' (or DEFF), which indicates the factor by which the calculated sample size needs to be increased in order to account for this.
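
One widely used expression for the design effect is DEFF = 1 + (m - 1) × ICC, where m is the number of animals sampled per cluster and ICC is the intracluster correlation coefficient. This formula is an assumption (it is not given in the text), and the values used below are hypothetical.

```python
import math

def design_effect(animals_per_cluster, icc):
    """Design effect, assuming DEFF = 1 + (m - 1) * ICC."""
    return 1 + (animals_per_cluster - 1) * icc

def clustered_sample_size(simple_random_n, animals_per_cluster, icc):
    """Inflate a sample size calculated for simple random sampling by the DEFF."""
    return math.ceil(simple_random_n * design_effect(animals_per_cluster, icc))

# Hypothetical values: 97 animals needed under simple random sampling,
# 20 animals sampled per farm, intracluster correlation of 0.1.
print(design_effect(20, 0.1))              # approximately 2.9
print(clustered_sample_size(97, 20, 0.1))  # 282
```

When animals within a cluster are no more alike than animals from different clusters (an ICC of zero), the design effect is 1 and no increase is needed.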

Sampling fraction

This relates to the proportion of the source population which is sampled. In most epidemiological studies, samples are collected without replacement (i.e. an individual animal cannot be selected twice), although many of the calculations used are based on the concept of sampling with replacement. This does not cause a problem when (as in most cases) the sampling fraction is low (less than about 5%). However, if the sampling fraction is high, a correction known as the finite population correction should be made to account for this in the calculation of the required sample size (and in the final estimates).
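
A commonly quoted form of the finite population correction for sample size is n adjusted = n / (1 + (n - 1) / N), where n is the uncorrected sample size and N is the size of the population being sampled. This formula is an assumption (it is not given in the text), and the figures below are hypothetical.

```python
import math

def sampling_fraction(sample_size, population_size):
    """Proportion of the population which is sampled."""
    return sample_size / population_size

def finite_population_correction(sample_size, population_size):
    """Adjusted sample size, assuming n_adj = n / (1 + (n - 1) / N)."""
    adjusted = sample_size / (1 + (sample_size - 1) / population_size)
    return math.ceil(adjusted)

# Uncorrected sample size of 97 from a herd of 200: sampling fraction well above 5%.
print(sampling_fraction(97, 200))                 # 0.485
print(finite_population_correction(97, 200))      # 66
# The same sample size from a population of 100,000: the correction has no effect.
print(finite_population_correction(97, 100_000))  # 97
```

The correction only makes a noticeable difference when the sample would otherwise form a large share of the population.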

Multivariable studies

When the effect of confounding or interaction is to be accounted for in the study, the sample size needs to be increased accordingly.