WikiVet English - User contributions [en]

Study design

2013-01-20T16:56:53Z

DRLUDI: /* Case-control studies */

[[Category:Veterinary Epidemiology - General Concepts|D]]

Epidemiological studies can be described as belonging to one of two categories: descriptive or analytical. Descriptive studies involve detailed investigations of individuals in order to improve knowledge of disease. Descriptive studies often have no prior hypotheses and are opportunistic studies of disease whereas analytical studies are used to test hypotheses by selection and comparison of groups. However, data obtained from analytical studies can be used in a descriptive manner and vice versa.

==Descriptive studies==

Descriptive studies include case-series, case-reports and surveys. Although unable to test hypotheses, as they do not involve the comparison of groups, they improve knowledge and understanding of disease and are useful for generating hypotheses. The overarching aims of descriptive studies are covered in [[Descriptive epidemiological studies|another page]].

====Case reports====
These are descriptions of disease in individual animals (or in a very small number of cases). Although the small number of animals included in these types of studies limits the ability to relate the results to larger populations, they provide useful information for further studies - in particular, in the case of rare or emerging diseases.

====Case series====
These include greater numbers of individuals than case reports (and can in fact include greater numbers of individuals than surveys in some cases), and therefore provide more information regarding the '''animal, place, time''' pattern of disease. However, these studies are often not planned out in advance, and the data may have been collected for reasons other than the study in question. This means that the individuals included may not be representative of external populations, and that data on factors of interest may be missing.

====Surveys====
These are carefully planned studies with clear, specific aims and a defined source population (for example, aiming to estimate the prevalence of disease X in country Y at time Z), which differentiates them from case series studies. Surveys will have a clear sampling strategy and a method of data collection (such as a questionnaire or serological test) used specifically for the study in question. If data regarding both outcomes (such as disease status) and exposures of interest are collected, then the study would be more accurately described as a '''cross sectional analytic study''' (see below).

==Analytic studies==

Analytical studies aim to identify different 'subpopulations' of animals (defined by the presence or absence of exposures of interest) amongst which disease experience differs, in an attempt to identify risk factors or protective factors for disease. The ultimate aim is to draw conclusions regarding possible causative associations between exposures and disease (although, as mentioned earlier, causation is impossible to prove). Depending on the study design, this may be achieved by comparing 'disease outcome' between groups of animals with or without the exposure of interest, or by comparing 'exposure' between groups of animals with or without disease. Analytical studies can be viewed as '''observational''' or '''experimental''' in nature. In the case of observational studies, the investigator has no control over the exposure status of the animals, whereas in experimental studies, the investigator allocates exposures to a selection of the animals. This has important repercussions for the interpretation of the results, as in the case of observational studies, the groups of animals defined by the exposure of interest may differ from each other in ways other than just the exposure of interest.

===Observational studies===
As mentioned above, observational studies are based on the investigator observing the real-life situation and drawing inferences from this. Therefore, there is potential for biases and confounding, which must be considered when interpreting the results. Observational studies can be classified as one of three types, according to the method of selection of participants (although some studies may use aspects of different study designs). The study design will affect which measures of disease are possible.

====Cross sectional studies====
Cross sectional studies involve the selection of a sample of the population, regardless of their exposure or outcome status. As the sample is collected at one point in time, the '''prevalence''' of disease can be estimated, and this must be considered when identifying associations. As the prevalence of disease at any one point in time is dependent upon both the incidence of disease and the duration of disease, this can cause problems when trying to identify causal associations - primarily because it often cannot be shown that the exposure preceded the outcome, which is a prerequisite for causation (the exposure may have been different at the time the animal actually developed the disease). Another problem for cross sectional studies is that of '''selection bias''', which may result from a refusal of some individuals to participate.

Cross sectional approaches can also be used to follow up a population over time, by repeatedly sampling from the population - known as a '''repeated cross sectional''' design. Although this may appear to be similar to a '''cohort study''' (see below), they differ in that in a repeated cross sectional study, the same individual animals are not necessarily sampled each time, and so are not followed up over time.

====Cohort studies====
Cohort studies, as mentioned above, involve following animals over time in order to record whether or not they experience the outcome of interest. The selection of animals may be based upon exposure status (in which case, the study design is a cohort study, sensu stricto), or a selection of disease-negative animals may be made, with the exposure status determined after selection (which is strictly known as a longitudinal study). Cohort studies allow the measurement of the '''incidence''' of disease, as all animals are negative at the start of the study and are then followed up over time, which solves some of the problems described above when investigating prevalence in analytic studies. As for cross sectional studies, failure of some individuals to enrol may result in selection bias, as can individuals dropping out of the study as time progresses ('losses to follow-up'). One other problem with cohort studies is that they can be costly and can take a long period of time to complete.
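
The measurement of incidence described above can be illustrated with a minimal sketch. It calculates the incidence risk (new cases divided by the number of disease-negative animals at the start of follow-up) for an exposed and an unexposed group. The function name and all counts are hypothetical, and the sketch assumes that no animals are lost to follow-up.

```python
def incidence_risk(new_cases, at_risk_at_start):
    """Incidence risk: new cases divided by animals at risk at the start.

    Assumes complete follow-up (no losses), which is a simplification.
    """
    if at_risk_at_start <= 0:
        raise ValueError("at_risk_at_start must be positive")
    if not 0 <= new_cases <= at_risk_at_start:
        raise ValueError("new_cases must lie between 0 and at_risk_at_start")
    return new_cases / at_risk_at_start


# Hypothetical cohort: all animals are disease-negative at enrolment.
exposed_risk = incidence_risk(new_cases=20, at_risk_at_start=100)
unexposed_risk = incidence_risk(new_cases=10, at_risk_at_start=200)

print(f"Incidence risk in exposed animals:   {exposed_risk:.2f}")
print(f"Incidence risk in unexposed animals: {unexposed_risk:.2f}")
```

If animals do drop out during the study, the denominator is no longer simply the number enrolled, and this simple calculation would need to be adapted.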

====Case-control studies====
Case-control studies are based upon the identification of two populations of individuals - one ('cases') comprising those who have experienced the outcome of interest (e.g. disease) and one ('controls') comprising those who have not experienced the outcome of interest (but who are otherwise comparable to the cases). As the outcome itself is involved in the selection of participants, no measures of disease frequency can be made from a case-control study. Case-control studies are therefore only used for the identification of exposures associated with disease (through the estimation of 'odds ratios'). Although case-control studies can be very useful in the investigation of rare diseases, there can be considerable difficulties in ensuring that the control group is comparable to the case group. If these groups are not comparable, there can be considerable selection bias, which can invalidate the results of the study.
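
As a rough illustration of how an odds ratio is estimated from case-control data, the sketch below compares the odds of exposure amongst cases with the odds of exposure amongst controls. The function name and the counts are hypothetical, and no confidence interval is calculated.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure in cases divided by odds of exposure in controls.

    Algebraically equal to (a * d) / (b * c) for the usual 2 x 2 table.
    """
    if cases_unexposed == 0 or controls_exposed == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (cases_exposed * controls_unexposed) / (controls_exposed * cases_unexposed)


# Hypothetical study: 100 cases and 100 controls.
estimate = odds_ratio(
    cases_exposed=30,
    cases_unexposed=70,
    controls_exposed=10,
    controls_unexposed=90,
)
print(f"Estimated odds ratio: {estimate:.2f}")
```

An odds ratio greater than one suggests that the exposure is more common amongst cases than amongst controls, although bias and confounding must still be considered before interpreting this as an association with disease.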

===Experimental studies===
In experimental studies (also known as '''intervention studies'''), the investigator allocates the exposure of interest to a selection of the participants prior to following them up. The allocation of the exposure should be randomised, and there should be a clear control group which does not receive the exposure (known as a '''Randomised Controlled Trial'''). Ideally, 'blinding' of participants and investigators to the treatment allocation should also be performed whenever possible in order to reduce any differences between the groups. As biases and confounding are reduced through the randomised allocation of exposure, these studies provide the best quality of evidence of any single study type. However, in a similar fashion to cohort studies, they can be very costly, both in terms of money and time. Additionally, in the case of suspected harmful exposures, randomised controlled trials may not be an ethical option, and so they are commonly used when investigating interventions which are suspected to be beneficial.
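
One simple way to carry out the randomised allocation described above is to shuffle the list of participants and split it in two. The sketch below is an illustration only: the function name, the animal identifiers and the seed are hypothetical, and real trials often use more elaborate schemes (for example, allocation within blocks or strata).

```python
import random


def allocate_randomly(animal_ids, seed=None):
    """Shuffle the animals and split them into treatment and control groups."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(animal_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers for 20 enrolled animals.
animal_ids = [f"animal_{i:02d}" for i in range(1, 21)]
treatment_group, control_group = allocate_randomly(animal_ids, seed=2013)

print("Treatment:", sorted(treatment_group))
print("Control:  ", sorted(control_group))
```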

Data description

2013-01-20T16:34:11Z

DRLUDI: /* Variance and standard deviation */

All epidemiological investigations require some form of data description. A number of methods are available for describing data, and the most appropriate one will depend upon both the [[Data types|type of data]] available and the aims of the investigation. If these issues are not considered, useful information may be lost, or more seriously, a misleading estimate may be made. 

==Measures of central tendency==
In many cases, some estimate of an 'average' of the parameter of interest within the population is desired - also known as a measure of central tendency. There are three main measures of central tendency used in epidemiological studies - known as the '''mean''', the '''median''' and the '''mode'''. These will be described below.

===Mean===
The mean of a set of numbers is what most people consider the 'average', and is calculated by adding all the numbers together and dividing by the number of individuals. There are a number of different types of means available, although they are all based upon the same calculation, but with different ''transformations'' applied before and after (the '''arithmetic mean''' is that described above; the '''geometric mean''' is calculated using the natural logs of the numbers, and so the antilog must be taken of the resultant estimate; and the '''harmonic mean''' uses the reciprocals of the numbers, and so the reciprocal of the final estimate should be taken). It should be noted that the mean can be considerably affected by extreme values (known as 'outliers'), and so generally should be avoided if these are present in the dataset. Although the '''proportion''' of individuals experiencing a binary event (classified as 1 or 0) is calculated in the same way as the arithmetic mean, it is not itself considered a measure of central tendency.
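
The three types of mean described above can be sketched as follows, with the geometric and harmonic means written explicitly as 'transform, take the arithmetic mean, transform back'. The function names and the example values are hypothetical, and the geometric and harmonic means assume that all values are greater than zero.

```python
import math


def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Arithmetic mean of the natural logs, followed by the antilog."""
    logs = [math.log(v) for v in values]
    return math.exp(arithmetic_mean(logs))


def harmonic_mean(values):
    """Arithmetic mean of the reciprocals, followed by the reciprocal."""
    reciprocals = [1 / v for v in values]
    return 1 / arithmetic_mean(reciprocals)


# Hypothetical measurements (all positive).
measurements = [1.0, 2.0, 4.0]

print(f"Arithmetic mean: {arithmetic_mean(measurements):.3f}")
print(f"Geometric mean:  {geometric_mean(measurements):.3f}")
print(f"Harmonic mean:   {harmonic_mean(measurements):.3f}")
```

For these values the arithmetic mean is the largest of the three and the harmonic mean the smallest.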

===Median===
The median is the exact midpoint in a series of data which have been placed in ascending order, and is also known as the '''50th percentile'''. Therefore, approximately 50% of observations lie below the median and 50% lie above. Where the number of observations n is odd, it can be found by identifying the observation lying in place (n+1)/2 in the dataset, ordered from smallest to largest. In the situation where the number of observations is even, the ''mean'' of the middle two values (those in places n/2 and (n/2)+1) is taken to indicate the median.
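
A minimal sketch of the median calculation is given below. It uses the middle observation when n is odd, and the mean of the two middle observations (in places n/2 and (n/2)+1 of the ordered data) when n is even. The function name and example values are hypothetical.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # Place (n + 1) / 2, converted to a zero-based index.
        return ordered[(n + 1) // 2 - 1]
    # Places n / 2 and n / 2 + 1, converted to zero-based indices.
    lower_middle = ordered[n // 2 - 1]
    upper_middle = ordered[n // 2]
    return (lower_middle + upper_middle) / 2


odd_example = median([7, 3, 9, 1, 5])   # ordered: 1, 3, 5, 7, 9
even_example = median([7, 3, 9, 1])     # ordered: 1, 3, 7, 9

print(odd_example, even_example)
```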

===Mode===
The mode is the most common value in the dataset, and as such is the only measure of central tendency which may have more than one value. It is also the only measure of central tendency which can be used for non-numerical (categorical) data.
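
As an illustration, the sketch below finds the mode(s) of a small set of categorical data using <code>multimode</code> from Python's standard <code>statistics</code> module (available from Python 3.8). The data are hypothetical, and have been chosen so that two values are equally common.

```python
from statistics import multimode

# Hypothetical categorical data: species of animals seen at a clinic.
species = ["dog", "cat", "dog", "cat", "horse"]

# multimode returns every value sharing the highest frequency.
modes = multimode(species)
print(modes)
```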

==Measures of spread==
A variety of measures of the ''spread'' of the data are available, and include the '''standard deviation''', the '''variance''', the '''interquartile range''' and the '''range'''.

===Variance and standard deviation===
The variance of a set of data is calculated by adding together the squared differences of each value from the mean and dividing this by the number of observations minus one (= degrees of freedom). The ''square'' of each difference is used because if the difference itself were used, the values higher than the mean and the values lower than the mean would cancel each other out, meaning that the resulting number would be zero. However, as the squares are used, the variance is expressed in terms of the square of the units of measurement (for example, the variance of the weights of a sample of animals may be 25 kg<sup>2</sup>). As this is not easy to relate back to the original units of measurement, the ''square root'' of the variance is often used - which is known as the '''standard deviation'''. The variance and standard deviation should generally only be used in cases where the mean is used as a measure of central tendency, as they relate to this mean in their calculation. As for the mean, they are also affected by the presence of outliers.
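
The calculation described above can be sketched as follows. The function names and the weights are hypothetical, and the sketch assumes a sample of at least two observations.

```python
import math


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    squared_differences = [(v - mean) ** 2 for v in values]
    return sum(squared_differences) / (n - 1)


def sample_standard_deviation(values):
    """Square root of the variance, in the original units of measurement."""
    return math.sqrt(sample_variance(values))


# Hypothetical weights (kg) of five animals; the mean is 14 kg.
weights_kg = [10.0, 12.0, 14.0, 16.0, 18.0]

# The unsquared differences from the mean cancel each other out.
mean_weight = sum(weights_kg) / len(weights_kg)
sum_of_differences = sum(w - mean_weight for w in weights_kg)

variance = sample_variance(weights_kg)                      # in kg squared
standard_deviation = sample_standard_deviation(weights_kg)  # in kg

print(sum_of_differences, variance, round(standard_deviation, 3))
```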

===Interquartile range===
The interquartile range is based upon percentile points in the data. One of these has already been described - the 50th percentile (also known as the median). In the same way as the 50th percentile separates the lower 50% of observations from the upper 50% of observations, the 25th percentile separates the lower 25% of observations from the upper 75%, and the 75th percentile separates the lower 75% of observations from the upper 25%. The 25th percentile is also known as the '''lower quartile''', and the 75th percentile as the '''upper quartile''', and by subtracting the lower quartile from the upper quartile, the ''interquartile range'' can be calculated.
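
The sketch below calculates the quartiles and the interquartile range using <code>quantiles</code> from Python's standard <code>statistics</code> module (available from Python 3.8). Several slightly different conventions exist for locating percentiles in a dataset, and the <code>inclusive</code> method used here is only one of them, so other software may give slightly different values. The data are hypothetical.

```python
from statistics import quantiles

# Hypothetical observations.
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters, returning the 25th, 50th and
# 75th percentiles. The 'inclusive' method is an assumption here.
lower_quartile, median_value, upper_quartile = quantiles(
    observations, n=4, method="inclusive"
)
interquartile_range = upper_quartile - lower_quartile

print(lower_quartile, median_value, upper_quartile, interquartile_range)
```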

===Range===
The range is a very basic measure of spread, and is the difference between the lowest value and the highest value in the dataset. It can be strongly affected by outliers, and so care should be taken in its interpretation.

==Choice of descriptive measure==
As mentioned above, the descriptive measures available will depend upon the aim of the study and the data type in question. The options available for non-numerical (categorical) data are quite limited, but for numerical data, a measure of central tendency and a measure of 'spread' are often presented.

===Qualitative data===
Qualitative data may or may not have an intrinsic order, and can always be described using proportions (i.e. the proportion of animals in each 'category'). The '''mode''' can also be a useful measure of central tendency, and the '''median''' may be appropriate in some cases of numerical ordinal data, such as body condition score (although careful consideration should be given to the usefulness of this before using this measure). The only meaningful measure of spread which may be used for qualitative data is the '''range''', which can only be used in some cases of numerical ordinal data.

===Quantitative data===
[[File:Normal.png|thumb|upright=2.0|An example of normally distributed data.]]
These data can be described according to a '''measure of central tendency''', their '''spread''' and the '''shape''' of their distribution. The shape of the distribution is important in deciding upon the most appropriate method of description, and can be described according to '''skew''' (symmetry of the distribution) and '''kurtosis''' ('pointyness' of the distribution). A '''normal distribution''' (shown below) has a skew of zero and a kurtosis of zero (when kurtosis is expressed as 'excess kurtosis', that is, relative to the normal distribution), and is a very commonly used distribution in statistics. If data follow a normal distribution, then they can be completely described using only the '''mean''' and the '''standard deviation'''.

[[File:Skewed.png|thumb|left|upright=2.0|An example of data with a right skew (above) and data with a left skew (below).]]
However, data may be skewed to the right (where there is a 'tail' on the right, also known as a positive skew) or to the left (where there is a 'tail' on the left, also known as a negative skew). In these cases, the observations in the tail can affect the estimate of the mean, and make it less useful as a measure of central tendency. This (and the lack of symmetry in the distribution) will also reduce the usefulness of the standard deviation as a measure of spread. In these cases, it is more appropriate to describe the data using the '''median''' and the '''interquartile range''' (as these measures are more ''robust'' against these extreme values). 
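
The effect of a right skew on the mean can be seen in the following sketch, which compares the mean and the median of a small, hypothetical dataset containing a single extreme value in the right-hand tail.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, with one
# extreme observation in the tail.
skewed_data = [1, 2, 2, 3, 3, 3, 4, 4, 50]

skewed_mean = mean(skewed_data)
skewed_median = median(skewed_data)

# Removing the extreme value changes the mean considerably,
# but leaves the median unchanged in this example.
without_tail = skewed_data[:-1]
trimmed_mean = mean(without_tail)
trimmed_median = median(without_tail)

print(skewed_mean, skewed_median)
print(trimmed_mean, trimmed_median)
```

In this example the mean is larger than all but one of the observations, whereas the median remains close to the bulk of the data.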

In some cases (such as a bimodal distribution), the median may also not be an appropriate measure of central tendency, and the mode(s) may be more appropriate. This demonstrates that careful consideration of the usefulness of the available measures should be given whenever describing data, and 'common sense' should be used to select the most appropriate one. For example, although there is nothing statistically 'wrong' with using the mean to describe a highly skewed dataset, it does not offer the same amount of information as the median would do, and risks misrepresenting the data. 

[[Category:Veterinary Epidemiology - Statistical Methods|A]]