Robert E. Fromm, Jr., M.D., M.P.H.
Department of Medicine
Baylor College of Medicine, Houston, Texas
USA
and
Department of Anesthesiology & Critical Care
The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
Although not commonly considered a clinical subject, statistics and epidemiology form the cornerstone of clinical practice. An understanding of statistical principles is necessary to comprehend the published literature and practice in a rational manner. The purpose of this manuscript is to review some of the basic statistical principles and formulas. More in-depth discussion can be obtained in texts of biostatistics.
Prevalence is the most frequently used measure of disease frequency and is defined as:
$$\text{Prevalence} = \frac{\text{Number of existing cases of a disease}}{\text{Total population at a given point in time}}$$
Incidence quantifies the number of new cases that develop in a population at risk during a specific time interval:
$$\text{Cumulative incidence} = \frac{\text{Number of new cases of a disease during a given time period}}{\text{Total population at risk}}$$
Cumulative incidence (CI) reflects the probability that an individual develops a disease during a given time period.
Incidence density (ID) allows one to account for varying periods of follow-up and is calculated as:
$$\text{ID} = \frac{\text{New cases of the disease during a given period of time}}{\text{Total person-time of observation}}$$
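The three frequency measures above can be sketched in a few lines of Python. This is only an illustration: the function names and the cohort figures are hypothetical and do not come from any study.

```python
def prevalence(existing_cases, population):
    """Proportion of the population with the disease at one point in time."""
    return existing_cases / population


def cumulative_incidence(new_cases, population_at_risk):
    """Probability of developing the disease during the study period."""
    return new_cases / population_at_risk


def incidence_density(new_cases, person_time):
    """New cases per unit of person-time (for example, per person-year)."""
    return new_cases / person_time


# Hypothetical cohort: 1,000 people, 50 already diseased at baseline,
# so 950 are at risk; 19 new cases arise over 3,600 person-years.
print(prevalence(50, 1000))           # 0.05
print(cumulative_incidence(19, 950))  # 0.02
print(incidence_density(19, 3600))    # about 0.0053 cases per person-year
```

Note that the people who already have the disease are removed from the denominator of the incidence measures, because they are no longer at risk of becoming a new case.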
Several special types of incidence and prevalence measures are commonly reported.
Mortality rate is an incidence measure:
$$\text{Mortality rate} = \frac{\text{Number of deaths}}{\text{Total population}}$$
Case-fatality rate is another incidence measure:
$$\text{Case-fatality rate} = \frac{\text{Number of deaths from the disease}}{\text{Number of cases of the disease}}$$
Attack rate is also an incidence measure:
$$\text{Attack rate} = \frac{\text{Number of cases of the disease}}{\text{Total population at risk for a given time period}}$$
The performance of a laboratory test is commonly reported in terms of sensitivity and specificity defined as:
$$\text{Sensitivity} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$

$$\text{Specificity} = \frac{\text{True negatives}}{\text{True negatives} + \text{False positives}}$$
Thus, sensitivity measures the proportion of people who truly have the disease who test positive.
Specificity measures the proportion of people who do not have the disease who test negative.
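As a small illustration of these two definitions, the following Python sketch computes both measures from the four cell counts of a test-versus-disease table. The function names and the counts are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of people with the disease who test positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of people without the disease who test negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical validation study: 100 diseased patients (90 test positive)
# and 900 non-diseased patients (855 test negative).
print(sensitivity(true_pos=90, false_neg=10))   # 0.9
print(specificity(true_neg=855, false_pos=45))  # 0.95
```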
These crude measurements of laboratory performance do not take into account the level at which a test is determined to be positive.
Receiver operating characteristic (ROC) curves examine the performance of a test throughout its range of values. An area under the ROC curve of 1.0 indicates a perfect test, while a test that is no better than flipping a coin has an area under the ROC curve of 0.5.
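One way to see how the area under the curve arises is to sweep the cutoff across every observed test value, record sensitivity against (1 - specificity) at each cutoff, and add up the area with the trapezoidal rule. The sketch below does this in Python under two assumptions that are not in the original text: higher test values indicate disease, and a value at or above the cutoff counts as positive. The function names and the test values are hypothetical.

```python
def roc_points(diseased, healthy):
    """Return (1 - specificity, sensitivity) pairs, one per cutoff."""
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]  # a cutoff above every value calls no one positive
    for cutoff in cutoffs:
        true_pos_rate = sum(v >= cutoff for v in diseased) / len(diseased)
        false_pos_rate = sum(v >= cutoff for v in healthy) / len(healthy)
        points.append((false_pos_rate, true_pos_rate))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test values for four diseased and four healthy patients.
diseased = [8.1, 6.4, 5.9, 4.2]
healthy = [5.0, 3.8, 3.1, 2.7]
print(area_under_curve(roc_points(diseased, healthy)))  # 0.9375
```

Here 15 of the 16 possible diseased-healthy pairs are ranked correctly by the test, which is why the area is 15/16 = 0.9375.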
As clinicians examining a positive test result, we are most interested in determining whether the patient actually has the disease.
The positive predictive value (PPV) provides this probability:
$$\text{PPV} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$

$$\text{PPV} = \frac{\text{Prevalence} \times \text{Sensitivity}}{\text{Prevalence} \times \text{Sensitivity} + (1 - \text{Prevalence}) \times (1 - \text{Specificity})}$$
Negative predictive value (NPV) describes the probability that a patient who tests negative truly does not have the disease:
$$\text{NPV} = \frac{\text{True negatives}}{\text{True negatives} + \text{False negatives}}$$

$$\text{NPV} = \frac{(1 - \text{Prevalence}) \times \text{Specificity}}{(1 - \text{Prevalence}) \times \text{Specificity} + \text{Prevalence} \times (1 - \text{Sensitivity})}$$
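The prevalence-based forms of PPV and NPV show that predictive values depend heavily on how common the disease is in the population tested. The Python sketch below applies both formulas to a hypothetical test with 90% sensitivity and 95% specificity at three assumed prevalences; the function names and all figures are illustrative only.

```python
def ppv(prevalence, sensitivity, specificity):
    """Probability of disease given a positive test."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def npv(prevalence, sensitivity, specificity):
    """Probability of no disease given a negative test."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


# Same hypothetical test applied at three assumed prevalences.
for prev in (0.01, 0.10, 0.50):
    print(prev, round(ppv(prev, 0.90, 0.95), 3), round(npv(prev, 0.90, 0.95), 3))
# 0.01 0.154 0.999
# 0.1 0.667 0.988
# 0.5 0.947 0.905
```

With a prevalence of 1%, fewer than one in six positive results reflects true disease even though the test itself performs well.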
A large collection of data cannot be fully appreciated by simple inspection. Summary or descriptive statistics help to describe the data succinctly. Two measures are usually employed: a measure of central tendency and a measure of dispersion.
Measures of central tendency include mean, median and mode.
Mean is the common arithmetic average.
Median is the middle value, that is, the value such that one half of the data points fall below it and one half fall above it.
Mode is the most frequently occurring data point.
Measures of dispersion include the range, interquartile range, variance and standard deviation.
Range = Greatest value - Least value
The interquartile range (IQR) is the range of the middle 50% of the data.
IQR = U75 - L25
where U75 is the 75th percentile (upper quartile) and L25 is the 25th percentile (lower quartile).
Variance is the average of the squared distances between each of the values and the mean. Standard deviation is the square root of the variance.
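These summary measures can be computed with Python's standard `statistics` module, as in the sketch below. The data values are hypothetical. Two points are assumptions worth flagging: quartiles can be defined in several ways, so other software may report slightly different cut points, and `pvariance` matches the definition given above (dividing by the number of values), whereas the sample variance commonly reported by statistical packages divides by one less than the number of values.

```python
import statistics

# Hypothetical sample of seven measurements (illustrative values only).
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion.
value_range = max(data) - min(data)

# quantiles(..., n=4) returns the three quartile cut points.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Population variance: the average squared distance from the mean.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(mean, median, mode, value_range, iqr)
print(round(variance, 3), round(std_dev, 3), sample_variance)
```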
Hypothesis testing involves conducting a test of statistical significance, quantifying the degree to which random variability may account for the observed results. In performing hypothesis testing two types of error can be made:
Type I errors refer to situations in which statistical significance is found when no difference actually exists. The probability of making a type I error is commonly represented by the Greek letter α (alpha) and is fixed in advance as the significance level of the test; a result is declared statistically significant when its p value falls below α. Traditionally, an α level of 0.05 is used for statistical significance. Type II errors refer to failure to declare that a difference exists when there is a real difference between the study groups. The probability of making a type II error is represented by the Greek letter β (beta). The power of a test is calculated as 1 - β and is the probability of declaring a statistically significant difference if one truly exists.
The following methods are the most frequently used tests for biological data.
Chi-square test (χ²) is used for discrete data such as counts.
The general form of a chi-square test is:
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
Chi-square is commonly used in contingency tables:
| | Diseased | Not Diseased | Totals |
|---|---|---|---|
| Exposed | a | b | a + b |
| Not Exposed | c | d | c + d |
| Totals | a + c | b + d | a + b + c + d |
Yates correction: When the expected value of any particular cell is less than 5, the Yates correction is used. This is calculated as:
$$\chi^2_{\text{Yates}} = \sum \frac{(|\text{Observed} - \text{Expected}| - 0.5)^2}{\text{Expected}}$$
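The chi-square calculation for a 2 x 2 contingency table, with and without the Yates correction, can be sketched in Python as follows. The sketch assumes the expected count in each cell is (row total x column total) / grand total, the usual choice when testing for no association. The function names, the cell counts, and the flooring of the corrected difference at zero are illustrative assumptions rather than part of the original text.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Assumption: never let the correction push the difference below 0.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    return statistic


def p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical study: 20 of 100 exposed and 10 of 100 unexposed are diseased.
plain = chi_square_2x2(20, 80, 10, 90)
corrected = chi_square_2x2(20, 80, 10, 90, yates=True)
print(round(plain, 2), round(p_value_1df(plain), 3))
print(round(corrected, 2), round(p_value_1df(corrected), 3))
```

In this example the uncorrected statistic (about 3.92) exceeds 3.84, the critical value for one degree of freedom at the 0.05 level, while the corrected statistic (about 3.18) does not, which illustrates that the correction makes the test more conservative.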
Relative risk: The data within a contingency table is commonly summarized in measures such as the relative risk. If we gather groups based on their exposure status, relative risk can be calculated as:
$$\text{Relative risk} = \frac{a / (a + b)}{c / (c + d)}$$
This figure represents the risk of becoming diseased if you are exposed, a/(a+b), divided by the risk if you are not exposed, c/(c+d), which is why it is called relative risk. If the relative risk is calculated as 4.0, then the risk of becoming diseased if you are exposed is four times that of people who are not exposed.
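A minimal Python sketch of the relative risk calculation, using the cell labels a, b, c and d from the contingency table. The function name and the counts are hypothetical and were chosen so that the result is 4.0.

```python
def relative_risk(a, b, c, d):
    """Risk of disease in the exposed divided by risk in the unexposed."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Hypothetical cohort: 40 of 100 exposed and 10 of 100 unexposed become diseased.
print(relative_risk(a=40, b=60, c=10, d=90))  # 4.0
```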
Odds ratio: If we gather groups based on disease status, the odds ratio is calculated as an approximation to the relative risk:
$$\text{Odds ratio} = \frac{a / b}{c / d} = \frac{ad}{bc}$$
This measure is the ratio of the odds of becoming diseased if you are exposed to the odds of becoming diseased if you are not.
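The sketch below computes the odds ratio in Python and compares it with the relative risk for two hypothetical tables, one in which the disease is rare and one in which it is common. The function names and counts are illustrative assumptions. The comparison shows why the odds ratio is a good approximation to the relative risk only when the disease is rare.

```python
def odds_ratio(a, b, c, d):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    return (a / b) / (c / d)


def relative_risk(a, b, c, d):
    """Risk of disease in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


# Rare disease: 4 of 1,000 exposed and 1 of 1,000 unexposed are diseased.
print(round(relative_risk(4, 996, 1, 999), 2), round(odds_ratio(4, 996, 1, 999), 2))
# 4.0 4.01

# Common disease: 40 of 100 exposed and 10 of 100 unexposed are diseased.
print(round(relative_risk(40, 60, 10, 90), 2), round(odds_ratio(40, 60, 10, 90), 2))
# 4.0 6.0
```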
t-test: Usually used in comparing means.
Normal approximation for comparing two proportions: A method for comparing whether two proportions are significantly different.
Analysis of variance: This method is commonly used to compare means across more than 2 categories.
Regression techniques: Generally obtained via computer programs; can be used to predict a continuous variable from single or multiple regressors which are either categorical, continuous or both.
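As an illustration of the normal approximation for comparing two proportions listed above, the following Python sketch computes a z statistic and a two-sided p value. It assumes the common pooled-variance form of the test, which the original text does not spell out, and it is reasonable only when the samples are large enough for the normal approximation to hold. The function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p value) for comparing x1/n1 with x2/n2."""
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (x1 + x2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical trial: 20 of 100 events in one group, 10 of 100 in the other.
z, p_value = two_proportion_z_test(20, 100, 10, 100)
print(round(z, 2), round(p_value, 3))
```

For a 2 x 2 table, the square of this z statistic equals the uncorrected chi-square statistic, so the two approaches give the same p value.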
All pages copyright © Priory Lodge Education Ltd 1994, 1995, 1996.
Last amended: 3:16 PM on 07/07/96