Difference between revisions of "Data description"


Revision as of 08:36, 10 May 2011

A central concept in any epidemiological investigation is that of appropriate data description. A number of methods are available for describing data, and the most appropriate one will depend upon both the type of data available and the aims of the investigation. If these issues are not considered, useful information may be lost, or more seriously, a misleading estimate may be made.

Measures of central tendency

In many cases, an estimate of the 'average' value of the parameter of interest within the population is desired. Such an estimate is known as a measure of central tendency. Three main measures of central tendency are used in epidemiological studies: the mean, the median and the mode. These are described below.

Mean

The mean of a set of numbers is what most people consider the 'average', and is calculated by adding all the numbers together and dividing by the number of values. There are a number of different types of mean, all based upon the same calculation but with different transformations applied before and after. The arithmetic mean is that described above. The geometric mean is calculated using the natural logs of the numbers, so the antilog of the resultant estimate must be taken. The harmonic mean uses the reciprocals of the numbers, so the reciprocal of the final estimate must be taken. Although the proportion of individuals experiencing a binary event (classified as 1 or 0) is calculated in the same way as the arithmetic mean, it is not itself considered a measure of central tendency.
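The three means described above can be sketched in Python as follows. This is only an illustration: the function names and the example values are assumptions made here and do not come from the original text. The geometric and harmonic versions assume that all values are strictly positive.

```python
import math


def arithmetic_mean(values):
    # Add all the numbers together and divide by how many there are.
    return sum(values) / len(values)


def geometric_mean(values):
    # Take natural logs, average them, then take the antilog of the result.
    # Assumes every value is strictly positive.
    logs = [math.log(v) for v in values]
    return math.exp(arithmetic_mean(logs))


def harmonic_mean(values):
    # Take reciprocals, average them, then take the reciprocal of the result.
    # Assumes every value is strictly positive.
    reciprocals = [1 / v for v in values]
    return 1 / arithmetic_mean(reciprocals)


# Hypothetical measurements, invented purely for illustration.
example_values = [2, 4, 8]

print(arithmetic_mean(example_values))
print(geometric_mean(example_values))
print(harmonic_mean(example_values))
```

For these example values the arithmetic mean is 14/3, the geometric mean is 4 (up to floating-point rounding) and the harmonic mean is 24/7, which shows that the three measures can differ noticeably on the same data.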

Median

The median is the exact midpoint in a series of data which have been placed in ascending order, and is also known as the 50th percentile. Therefore, approximately 50% of observations lie below the median and approximately 50% lie above. Where the number of observations is even, the mean of the middle two values is taken as the median.
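A minimal sketch of the median calculation, covering both the odd and the even case, might look like this. The function name and the example values are assumptions introduced for illustration.

```python
def median(values):
    # Place the observations in ascending order.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the middle value is the median.
        return ordered[mid]
    # Even number of observations: take the mean of the middle two values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical observations, invented for illustration.
print(median([3, 1, 7]))     # odd number of observations
print(median([3, 1, 7, 5]))  # even number of observations
```

Python's standard library also provides `statistics.median`, which handles the even case in the same way.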

Mode

The mode is the most common value, and as such is the only measure of central tendency which may have more than one value. It is also the only measure of central tendency which can be used for non-numerical (categorical) data.
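Because the mode may take more than one value and can be applied to categorical data, a sketch of its calculation should return a list of values. The following is one possible approach. The function name and the example categories are hypothetical.

```python
from collections import Counter


def modes(values):
    # Count how often each distinct value occurs.
    counts = Counter(values)
    highest = max(counts.values())
    # Return every value that reaches the highest count, so ties are kept.
    return [value for value, count in counts.items() if count == highest]


# Hypothetical herd types, invented for illustration.
herd_types = ["dairy", "beef", "dairy", "beef", "sheep"]
print(modes(herd_types))

# The same function works for numerical data.
print(modes([1, 2, 2, 3]))
```

In the first example both 'dairy' and 'beef' occur twice, so the data have two modes.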

Choice of descriptive measure

As mentioned above, the descriptive measures available will depend upon the aim of the study and the data type in question. The options available for non-numerical (categorical) data are quite limited, but for numerical data, a measure of central tendency and a measure of 'spread' are often presented.

Qualitative data

Qualitative data may or may not have an intrinsic order, and can always be described using proportions (i.e. the proportion of animals in each 'category'). The mode can also be a useful measure of central tendency, and the median may be appropriate in some cases of numerical ordinal data, such as body condition score (although careful consideration should be given to its usefulness before using this measure). There are no meaningful measures of spread for qualitative data, as the difference between adjacent categories is not standard, although the range of ordinal values may be useful.
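As a rough illustration of describing qualitative data, the proportion of animals in each category and the range of an ordinal score could be computed as below. The function name and the body condition scores are assumptions made for this sketch, not data from the original text.

```python
from collections import Counter


def category_proportions(values):
    # Proportion of observations falling in each category.
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Hypothetical body condition scores (ordinal), invented for illustration.
scores = [2, 3, 3, 3, 4]

print(category_proportions(scores))

# The range of ordinal values is simply the lowest and highest score observed.
print(min(scores), max(scores))
```

The proportions sum to 1 across categories, and for these scores the most common category (3) accounts for three of the five animals.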

Quantitative data

These data can be described according to a measure of central tendency, spread and the shape of their distribution.