|
|
(10 intermediate revisions by the same user not shown) |
Line 1: |
Line 1: |
− | In many epidemiological studies, it is not possible to include every individual in a population. Rather, a [[Sampling strategies|sample]] of individuals is collected. This may take the form of a [[Study design#Surveys|survey]], a [[Study design#Cross sectional studies|cross-sectional study]], a [[Study design#Experimental studies|randomised controlled trial]], and so on. The important issue is that '''not every individual in the [[Sampling strategies#Source population|source population]] is included''', which means that [[Random error|random, or sampling, error]] and [[Bias#Selection bias|biases]] may be introduced. These affect our ability to extrapolate our results (whether [[Study design#Descriptive studies|descriptive]] or [[Study design#Analytic studies|analytic]] in nature) to the source population. However, the aim of most studies is to draw some conclusion about the source population, using the results obtained from the sample. This requires the use of statistical methodology in a process known as '''inferential statistical analysis''', and is commonly used in epidemiological investigations.<br> | + | In many epidemiological studies, it is not possible to include every individual in a population. Rather, a [[Sampling strategies|sample]] of individuals is collected. This may take the form of a [[Study design#Surveys|survey]], a [[Study design#Cross sectional studies|cross-sectional study]], a [[Study design#Experimental studies|randomised controlled trial]], and so on. The important issue is that '''not every individual in the [[Sampling strategies#Source population|source population]] is included''', which means that [[Random error|random, or sampling, error]] and [[Bias#Selection bias|biases]] may be introduced. These affect our ability to extrapolate our results (whether [[Study design#Descriptive studies|descriptive]] or [[Study design#Analytic studies|analytic]] in nature) to the source population. 
However, the aim of most studies is to draw some conclusion about the source population, using the results obtained from the sample. This process is known as '''inferential statistics''', and is commonly used in epidemiological investigations.<br> |
| | | |
− | Inferential statistical methods cannot be used to correct for the presence of '''selection biases''' introduced during sample collection. These should either be minimised during data collection, or if they cannot be avoided, they should be discussed in the analysis report. However, statistical methods are available in order to account for '''random error''' in a sample. Due to their random nature, these errors can be quantified if the underlying data is known. Of course, as this is never the case when sampling from populations, sample estimates are used to approximate the underlying population parameters. The most common application of inferential statistics is in the calculation of '''[[Random error#confidence intervals|confidence intervals]]'''.<br>
| + | '''Selection biases''' introduced during sample collection cannot be accounted for in the analysis, and so should either be avoided from the start, or discussed in the analysis report. Statistical methodology employed during analysis of sample data is aimed at accounting for '''random error''' in the sample. |
− | | |
− | ==Confidence intervals==
| |
− | As mentioned [[Random error#confidence intervals|earlier]], a 95% confidence interval for a parameter (such as a proportion or a mean) gives a range of values which we can be confident will include the true population parameter. That is, if samples were repeatedly taken from this population and 95% confidence intervals were calculated for each, 95% of these intervals would contain the true population parameter. We do not know whether ''our'' (single) confidence interval is one of the 95% which contain the true population parameter or one of the 5% which do not, but we are more confident that it does contain the parameter than we are that it doesn't.
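The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normally distributed with a known mean and standard deviation (values chosen purely for illustration), and the function name <code>coverage_of_95_ci</code> is hypothetical.

```python
import math
import random
import statistics


def coverage_of_95_ci(true_mean=10.0, true_sd=2.0, n=50, repeats=2000, seed=1):
    """Return the fraction of repeated-sample 95% CIs that contain the true mean.

    All parameter values are illustrative assumptions, not data from a study.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Draw one sample of n animals from the assumed normal population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        # Standard error estimated from the sample itself.
        se = statistics.stdev(sample) / math.sqrt(n)
        # Count the interval if it contains the true population mean.
        if mean - 1.96 * se <= true_mean <= mean + 1.96 * se:
            hits += 1
    return hits / repeats


coverage = coverage_of_95_ci()
print(f"Proportion of intervals containing the true mean: {coverage:.3f}")
```

The proportion printed should be close to 0.95, although any single interval either contains the true mean or does not.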
| |
− | | |
− | ==Calculation of confidence intervals==
| |
− | The calculation of confidence intervals for means or proportions follows the same basic procedure, whereas a slightly different approach is used for other measures such as rates and ratios. Most commonly, confidence intervals will not be calculated manually, but it is useful to know the general approach used. This is based upon the estimation of the '''standard error''' of the parameter in question, which is the '''standard deviation of the sampling distribution''' of the parameter. That is, if samples of a specified size were taken from the population repeatedly (with replacement after each sampling) and the parameter was estimated from each, the standard deviation of all of these estimates is the standard error. It is known that with large sample sizes, the sampling distribution will be approximately normally distributed (a result known as the '''central limit theorem'''), and its standard deviation (the standard error) can be calculated by dividing the standard deviation of the variable in the population by the square root of the number of animals sampled (or by dividing the variance by the number of animals sampled, and taking the square root of the result).<br>
| |
− | However, as these parameters in the '''population''' are not known, they must be approximated using data from the '''sample'''.
| |
− | | |
− | ===Approach for means and proportions===
| |
− | The general approach for the calculation of confidence intervals in these cases consists of the following steps. The method of estimating the sample standard deviation for means and for proportions is described in the subsections below:
| |
− | * Calculate the '''sample standard deviation''', as an approximation of the population standard deviation.
| |
− | * Estimate the '''standard error of the mean''' or proportion by dividing the sample standard deviation by the square root of the number of animals sampled.
| |
− | * Calculate the '''sample mean''' or '''proportion''' (as an approximation of the population mean or proportion).
| |
− | * Decide upon the confidence level required (usually 95%), and '''multiply the estimate of the standard error by the appropriate percentile point of the normal distribution''', which is 1.96 for a 95% confidence interval. If you are estimating the confidence interval for the mean of a small sample, the corresponding percentile point of the t-distribution should be used instead.
| |
− | * Subtract the resultant number from the sample mean or proportion to give the '''lower confidence limit'''.
| |
− | * Add the number to the sample mean or proportion to give the '''upper confidence limit'''.
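The steps listed here can be sketched as a short function. This is a minimal illustration rather than a definitive implementation: the name <code>normal_ci</code> and the example figures are assumptions, and the normal-distribution multiplier is only appropriate for large samples.

```python
import math


def normal_ci(estimate, sample_sd, n, z=1.96):
    """Confidence limits for a mean or proportion using the normal approximation.

    estimate  -- sample mean or proportion
    sample_sd -- sample standard deviation (approximating the population value)
    n         -- number of animals sampled
    z         -- percentile point of the normal distribution (1.96 for 95%)
    """
    standard_error = sample_sd / math.sqrt(n)
    margin = z * standard_error
    return estimate - margin, estimate + margin


# Hypothetical example: mean of 50, sample standard deviation of 10, 100 animals.
lower, upper = normal_ci(estimate=50.0, sample_sd=10.0, n=100)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

In this hypothetical example the standard error is 10 / √100 = 1, so the limits are 50 ± 1.96.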
| |
− | | |
− | ====Means====
| |
− | As described [[Data description|earlier]], means are the most appropriate measure of central tendency for normally distributed continuous variables, and the standard deviation is the most appropriate measure of 'spread' in these cases. An adjusted form of the sample standard deviation is used to approximate the standard deviation in the population, and is calculated by dividing the sum of the squared differences from the sample mean by the number of animals sampled minus 1, and taking the square root of the answer.
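As a sketch of the adjusted calculation described here, the following computes the sample standard deviation with the "number sampled minus 1" divisor and compares it with <code>statistics.stdev</code> from the Python standard library, which uses the same divisor. The data values and the function name <code>sample_standard_deviation</code> are hypothetical.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Sample standard deviation using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared differences from the sample mean.
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (n - 1))


# Hypothetical measurements from five animals (illustration only).
weights = [4.0, 5.0, 6.0, 7.0, 8.0]
sd = sample_standard_deviation(weights)
se = sd / math.sqrt(len(weights))  # standard error of the mean

print(f"sample SD = {sd:.4f}, statistics.stdev = {statistics.stdev(weights):.4f}")
print(f"standard error of the mean = {se:.4f}")
```

With only five animals, a confidence interval for this mean would use the t-distribution rather than the value of 1.96.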
| |
− | | |
− | ====Proportions====
| |
− | Proportions are the most appropriate method of description of [[Data description|categorical and binary]] variables. Although the concept of a 'variance' or 'standard deviation' for a proportion is difficult to comprehend, it is based upon the 'binomial distribution'. The variance can be estimated by multiplying the proportion of positive animals by the proportion of negative animals; and the square root of this will give the standard deviation.
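A minimal sketch of this calculation for a proportion follows, assuming a sample large enough for the normal approximation to be reasonable. The function name <code>proportion_ci</code> and the counts used are hypothetical.

```python
import math


def proportion_ci(positives, n, z=1.96):
    """Normal-approximation confidence limits for a proportion."""
    p = positives / n
    variance = p * (1 - p)           # proportion positive x proportion negative
    sd = math.sqrt(variance)         # standard deviation
    se = sd / math.sqrt(n)           # standard error of the proportion
    return p - z * se, p + z * se


# Hypothetical example: 20 positive animals out of 100 sampled.
lower, upper = proportion_ci(positives=20, n=100)
print(f"Proportion = 0.20, 95% CI: {lower:.4f} to {upper:.4f}")
```

Here the variance is 0.2 × 0.8 = 0.16, the standard deviation is 0.4 and the standard error is 0.4 / √100 = 0.04.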
| |
− | | |
− | ===Approach for rates and ratios===
| |
− | Although the general concept behind confidence intervals for rates and ratios is the same as that for means and proportions, the method of calculation is different. The number of 'outcomes' (i.e. the numerator of the rate) can be considered to follow a 'Poisson distribution'. This facilitates the estimation of the standard error for this count, as the standard error of a Poisson variable is the square root of the expected value (that is, the square root of the number of outcomes, in our case). From this, confidence intervals can be estimated as above (i.e. estimate the standard error, multiply this by 1.96 [for 95% confidence limits], and add and subtract this value to/from the number of outcomes). Finally, to convert these confidence limits into a rate (rather than a count), the confidence limits themselves should be divided by the total amount of animal-time under observation (i.e. the denominator of the rate).
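The approach for a rate can be sketched in the same way. This is an illustration of the approximation described in the text, not a definitive method: the function name <code>rate_ci</code> and the figures are assumptions, and the approximation is likely to be poor when the number of outcomes is small.

```python
import math


def rate_ci(outcomes, animal_time, z=1.96):
    """Approximate confidence limits for a rate, treating the count as Poisson."""
    se_count = math.sqrt(outcomes)        # standard error of the count
    lower_count = outcomes - z * se_count
    upper_count = outcomes + z * se_count
    # Divide the limits for the count by the animal-time at risk to obtain a rate.
    return lower_count / animal_time, upper_count / animal_time


# Hypothetical example: 25 outcomes over 500 animal-years of observation.
lower, upper = rate_ci(outcomes=25, animal_time=500.0)
print(f"Rate = {25 / 500.0:.3f} per animal-year, 95% CI: {lower:.4f} to {upper:.4f}")
```

In this hypothetical example the standard error of the count is √25 = 5, giving limits of 25 ± 9.8 outcomes before division by the 500 animal-years.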
| |
| | | |
| [[Category:Veterinary Epidemiology - Statistical Methods|D]] | | [[Category:Veterinary Epidemiology - Statistical Methods|D]] |
In many epidemiological studies, it is not possible to include every individual in a population. Rather, a sample of individuals is collected. This may take the form of a survey, a cross-sectional study, a randomised controlled trial, and so on. The important issue is that not every individual in the source population is included, which means that random, or sampling, error and biases may be introduced. These affect our ability to extrapolate our results (whether descriptive or analytic in nature) to the source population. However, the aim of most studies is to draw some conclusion about the source population, using the results obtained from the sample. This process is known as inferential statistics, and is commonly used in epidemiological investigations.
Selection biases introduced during sample collection cannot be accounted for in the analysis, and so should either be avoided from the start, or discussed in the analysis report. Statistical methodology employed during analysis of sample data is aimed at accounting for random error in the sample.