Changes

34 bytes added, 16:34, 20 January 2013
Line 17:
===Variance and standard deviation===
 
The variance of a set of data is calculated by adding together the squared differences of each value from the mean and dividing this by the number of observations. The ''square'' of each difference is used because if the difference itself were used, the values higher than the mean and the values lower than the mean would cancel each other out, meaning that the resulting number would be zero. However, as the squares are used, the variance is expressed in terms of the square of the units of measurement (for example, the variance of the weights of a sample of animals may be 25kg<sup>2</sup>). As this is not easy to relate back to the original units of measurement, the ''square root'' of the variance is often used - which is known as the '''standard deviation'''. The variance and standard deviation should generally only be used in cases where the mean is used as a measure of central tendency, as they relate to this mean in their calculation. As for the mean, they are also affected by the presence of outliers.
+
The variance of a set of data is calculated by adding together the squared differences of each value from the mean and dividing this sum by the number of observations minus one (the degrees of freedom). The ''square'' of each difference is used because if the differences themselves were used, those above the mean and those below the mean would cancel each other out, and the resulting sum would be zero. However, as the squares are used, the variance is expressed in terms of the square of the units of measurement (for example, the variance of the weights of a sample of animals may be 25kg<sup>2</sup>). As this is not easy to relate back to the original units of measurement, the ''square root'' of the variance is often used instead, and this is known as the '''standard deviation'''. The variance and standard deviation should generally only be used in cases where the mean is used as a measure of central tendency, as both are calculated from the mean. Like the mean, they are also affected by the presence of outliers.
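The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name <code>sample_variance</code> and the weights are hypothetical and do not come from any real dataset.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1)


# Hypothetical weights of five animals, in kg (illustrative values only).
weights_kg = [20.0, 25.0, 30.0, 35.0, 40.0]

mean_kg = sum(weights_kg) / len(weights_kg)
# The unsquared differences cancel out, which is why squares are used.
raw_diff_sum = sum(x - mean_kg for x in weights_kg)

variance_kg2 = sample_variance(weights_kg)  # in kg squared
std_dev_kg = math.sqrt(variance_kg2)        # back in kg

print(raw_diff_sum, variance_kg2, std_dev_kg)
```

For these values the unsquared differences sum to zero, while the variance is 62.5 kg<sup>2</sup> and the standard deviation is roughly 7.9 kg. The standard library function <code>statistics.variance</code> uses the same n - 1 divisor and should agree with the sketch.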
    
===Interquartile range===
 
Line 35:
These data can be described according to a '''measure of central tendency''', their '''spread''' and the '''shape''' of their distribution. The shape of the distribution is important in deciding upon the most appropriate method of description, and can be described according to '''skew''' (asymmetry of the distribution) and '''kurtosis''' ('pointiness' of the distribution). A '''normal distribution''' (shown below) has a skew of zero and an excess kurtosis of zero (that is, kurtosis measured relative to the normal distribution), and is a very commonly used distribution in statistics. If data follow a normal distribution, then they can be completely described using only the '''mean''' and the '''standard deviation'''.<br>
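As a rough illustration of how skew and kurtosis can be quantified, the sketch below uses the simple moment-based definitions, with kurtosis expressed relative to the normal distribution so that a normal distribution scores zero (often called excess kurtosis). Other estimators with small-sample corrections exist and statistical packages may report slightly different values. The function names and data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values):
    """Moment-based skew: zero for symmetric data, positive for a right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data (illustrative values only).
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric data give a skew of zero, while the data with a long right tail give a positive skew.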
    +
[[File:Skewed.png|thumb|left|upright=2.0|An example of data with a right skew (above) and data with a left skew (below).]]
 
However, data may be skewed to the right (where there is a 'tail' on the right, also known as a positive skew) or to the left (where there is a 'tail' on the left, also known as a negative skew). In these cases, the observations in the tail pull the mean towards them, making it less useful as a measure of central tendency. This, together with the lack of symmetry in the distribution, also reduces the usefulness of the standard deviation as a measure of spread. It is then more appropriate to describe the data using the '''median''' and the '''interquartile range''', as these measures are more ''robust'' against extreme values.<br>
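The robustness of the median and interquartile range can be illustrated with a small sketch. The data and the helper name <code>summarise</code> are hypothetical, and the quartiles come from Python's <code>statistics.quantiles</code> with its default method; other quartile conventions can give slightly different values.

```python
import statistics


def summarise(values):
    """Return mean, standard deviation, median and interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Hypothetical observations (illustrative values only).
without_tail = [2, 3, 3, 4, 4, 5, 5, 6]
with_tail = without_tail + [40]  # one extreme value in the right tail

before = summarise(without_tail)
after = summarise(with_tail)
print(before)
print(after)
```

Adding the single extreme value doubles the mean (from 4 to 8) and greatly inflates the standard deviation, while the median stays at 4 and the interquartile range changes very little.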
[[File:Skewed.png|thumb|left|upright=2.0|An example of data with a right skew (above) and data with a left skew (below).]]
+
 
    
In some cases (such as a bimodal distribution), the median may also not be an appropriate measure of central tendency, and the mode(s) may be more appropriate. This demonstrates that the usefulness of each available measure should be considered carefully whenever data are described, and 'common sense' should be used to select the most appropriate one. For example, although there is nothing statistically 'wrong' with using the mean to describe a highly skewed dataset, it does not offer the same amount of information as the median would, and risks misrepresenting the data.<br>
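A short sketch of the bimodal case, using hypothetical counts: the mean and median both fall between the two peaks, where few observations actually lie, whereas the modes identify the peaks themselves. <code>statistics.multimode</code> requires Python 3.8 or later.

```python
import statistics

# Hypothetical bimodal data with peaks at 2 and 9 (illustrative values only).
counts = [2, 2, 2, 3, 5, 8, 9, 9, 9]

mean_value = statistics.mean(counts)
median_value = statistics.median(counts)
modes = statistics.multimode(counts)  # every value tied for most frequent

print(mean_value, median_value, modes)
```

Here the median is 5, a value that occurs only once, while the modes are 2 and 9.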
    
[[Category:Veterinary Epidemiology - Statistical Methods|A]]
 