1,981 bytes added ,  09:24, 10 May 2011
no edit summary
Line 1: Line 1: −
A central concept in any epidemiological investigation is that of appropriate data description. A number of methods are available for describing data, and the most appropriate one will depend upon both the [[Data types|type of data]] available and the aims of the investigation. If these issues are not considered, useful information may be lost, or more seriously, a misleading estimate may be made.<br>
+
All epidemiological investigations require some form of data description. A number of methods are available for describing data, and the most appropriate one will depend upon both the [[Data types|type of data]] available and the aims of the investigation. If these issues are not considered, useful information may be lost, or more seriously, a misleading estimate may be made.<br>
    
==Measures of central tendency==
 
Line 17: Line 17:     
===Variance and standard deviation===
 
The variance of a set of data is calculated by adding together the squared differences of each value from the mean and dividing this by the number of observations. The ''square'' of each difference is used because if the difference itself were used, the values higher than the mean and the values lower than the mean would cancel each other out, meaning that the resulting number would be zero. However, as the squares are used, the variance is expressed in terms of the square of the units of measurement (for example, the variance of the weights of a sample of animals may be 25kg<sup>2</sup>). As this is not easy to relate back to the original units of measurement, the ''square root'' of the variance is often used - which is known as the '''standard deviation'''.
+
The variance of a set of data is calculated by adding together the squared differences of each value from the mean and dividing this by the number of observations. The ''square'' of each difference is used because if the difference itself were used, the values higher than the mean and the values lower than the mean would cancel each other out, meaning that the resulting number would be zero. However, as the squares are used, the variance is expressed in terms of the square of the units of measurement (for example, the variance of the weights of a sample of animals may be 25kg<sup>2</sup>). As this is not easy to relate back to the original units of measurement, the ''square root'' of the variance is often used - which is known as the '''standard deviation'''. The variance and standard deviation should generally only be used in cases where the mean is used as a measure of central tendency, as they are calculated relative to this mean. Like the mean, they are also affected by the presence of outliers.
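The calculation described above can be sketched in Python. This is only an illustration: the weights are hypothetical values, and the variable names are not part of the original text. The sketch divides by the number of observations, as described; note that many statistical packages divide by one less than the number of observations when the data are a sample, which is what <code>statistics.variance</code> does.

```python
import math
import statistics

# Hypothetical weights (kg) of five animals; illustrative values only.
weights = [20.0, 25.0, 30.0, 35.0, 40.0]

mean = statistics.fmean(weights)  # 30.0 kg

# The raw differences from the mean cancel each other out, so their sum is zero.
sum_of_differences = sum(w - mean for w in weights)

# Variance as described in the text: sum of squared differences
# divided by the number of observations (units: kg squared).
variance = sum((w - mean) ** 2 for w in weights) / len(weights)  # 50.0

# Standard deviation: square root of the variance, back in kg (about 7.07).
std_dev = math.sqrt(variance)

# The standard library gives the same result with pvariance();
# variance() divides by n - 1 instead, giving 62.5 for these data.
assert math.isclose(variance, statistics.pvariance(weights))
sample_variance = statistics.variance(weights)
```

The squared units of the variance (kg<sup>2</sup>) and the original units of the standard deviation (kg) are the reason the latter is usually the one reported.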
    
===Interquartile range===
 
Line 29: Line 29:  
   
 
   
 
===Qualitative data===
 
Qualitative data may or may not have an intrinsic order, and can always be described using proportions (i.e. the proportion of animals in each 'category'). The '''mode''' can also be a useful measure of central tendency, and the '''median''' may be appropriate in some cases of numerical ordinal data, such as body condition score (although careful consideration should be given to the usefulness of this before using this measure). There are no meaningful measures of spread for qualitative data, as the difference between adjacent categories is not standard, although the '''range''' of ordinal values may be useful.
+
Qualitative data may or may not have an intrinsic order, and can always be described using proportions (i.e. the proportion of animals in each 'category'). The '''mode''' can also be a useful measure of central tendency, and the '''median''' may be appropriate in some cases of numerical ordinal data, such as body condition score (although careful consideration should be given to the usefulness of this before using this measure). The only meaningful measure of spread which may be used for qualitative data is the '''range''', which can only be used in some cases of numerical ordinal data.
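As a rough illustration of these descriptions, the sketch below summarises a set of hypothetical body condition scores (the values and variable names are assumptions made for the example). It uses <code>statistics.median_low</code> so that the reported median is always one of the observed categories rather than an average of two of them.

```python
from collections import Counter
import statistics

# Hypothetical body condition scores (ordinal scale, 1 to 5) for seven animals.
scores = [2, 3, 3, 3, 4, 4, 5]

# Proportion of animals in each category.
counts = Counter(scores)
proportions = {
    category: count / len(scores)
    for category, count in sorted(counts.items())
}

# Mode: the most frequently observed category (3 here).
mode = statistics.mode(scores)

# Median for ordinal data: median_low() always returns an observed category (3 here).
median = statistics.median_low(scores)

# Range of the ordinal values, reported as (lowest, highest) rather than a difference,
# because the distance between adjacent categories is not standard.
score_range = (min(scores), max(scores))
```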
    
===Quantitative data===
 
These data can be described according to a '''measure of central tendency''', '''spread''' and the '''shape''' of their distribution.
+
These data can be described according to a '''measure of central tendency''', their '''spread''' and the '''shape''' of their distribution. The shape of the distribution is important in deciding upon the most appropriate method of description, and can be described according to '''skew''' (asymmetry of the distribution) and '''kurtosis''' ('pointiness' of the distribution). A '''normal distribution''' (shown below) has a skew of zero and an excess kurtosis of zero (that is, kurtosis measured relative to the normal distribution), and is a very commonly used distribution in statistics. If data follow a normal distribution, then they can be completely described using only the '''mean''' and the '''standard deviation'''.<br>
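One common way of quantifying skew and kurtosis is through the moments of the data. The sketch below is a minimal example under that assumption; the function name and datasets are hypothetical, and statistical packages often apply additional small-sample corrections that are omitted here. Kurtosis is reported relative to the normal distribution (by subtracting 3), which is the convention under which a normal distribution scores zero.

```python
import statistics


def skew_and_excess_kurtosis(values):
    """Moment-based skew and excess kurtosis, without small-sample corrections."""
    n = len(values)
    mean = statistics.fmean(values)
    # Second, third and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Hypothetical datasets: one symmetric, one with a tail on the right.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

symmetric_skew, symmetric_kurtosis = skew_and_excess_kurtosis(symmetric)
right_skew, right_kurtosis = skew_and_excess_kurtosis(right_skewed)
```

The symmetric dataset gives a skew of zero, while the dataset with a tail on the right gives a positive skew.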
 +
 
 +
However, data may be skewed to the right (where there is a 'tail' on the right, also known as a positive skew) or to the left (where there is a 'tail' on the left, also known as a negative skew). In these cases, the observations in the tail can affect the estimate of the mean, and make it less useful as a measure of central tendency. This (and the lack of symmetry in the distribution) will also reduce the usefulness of the standard deviation as a measure of spread. In these cases, it is more appropriate to describe the data using the '''median''' and the '''interquartile range''' (as these measures are more ''robust'' against these extreme values).<br>
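The effect of a tail on these measures can be seen in a small example. The data below are hypothetical, and the quartiles are calculated with the default method of Python's <code>statistics.quantiles</code>; other software may use slightly different quartile definitions.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one lies far out in the tail.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

mean = statistics.fmean(values)     # 6.7, pulled upwards by the value of 40
median = statistics.median(values)  # 3.0

# Quartiles and interquartile range (default 'exclusive' method).
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Standard deviation, which is inflated by the extreme value.
std_dev = statistics.pstdev(values)

# Dropping the extreme value changes the mean considerably but not the median.
without_extreme = values[:-1]
mean_without_extreme = statistics.fmean(without_extreme)      # 3.0
median_without_extreme = statistics.median(without_extreme)   # 3
```

Here the single extreme observation more than doubles the mean, whereas the median is unchanged, which is what is meant by the median being ''robust''.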
 +
 
 +
In some cases (such as a bimodal distribution), the median may not be an appropriate measure of central tendency either, and the mode(s) may be more appropriate. This demonstrates that careful consideration of the usefulness of the available measures should be given whenever describing data, and 'common sense' should be used to select the most appropriate one. For example, although there is nothing statistically 'wrong' with using the mean to describe a highly skewed dataset, it does not offer the same amount of information as the median would, and risks misrepresenting the data.<br>
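A small, hypothetical example of a bimodal dataset is sketched below. It assumes discrete values, so that the modes can be found simply as the most frequent values using <code>statistics.multimode</code>; for continuous data the peaks would usually be identified from a histogram instead.

```python
import statistics

# Hypothetical bimodal data: one cluster of low values and one of high values.
values = [2, 2, 2, 3, 8, 9, 9, 9]

mean = statistics.fmean(values)      # 5.5
median = statistics.median(values)   # 5.5

# multimode() returns every value tied for the highest frequency.
modes = statistics.multimode(values)  # [2, 9]
```

Both the mean and the median fall between the two clusters, at a value that no animal in the dataset actually has, whereas the two modes describe where the observations lie.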
    
[[Category:Veterinary Epidemiology - Statistical Methods|A]]
 