Epidemiologists ultimately want to draw conclusions about causation, but most epidemiologic studies focus on establishing associations. Association: Is a specified health outcome more likely in people with a particular "exposure"? Is there a link? Association is a statistical relationship between two variables. Two variables may be associated without a causal relationship. For example, there is a statistical association between the number of people who drowned by falling into a pool in a given year and the number of films Nicolas Cage appeared in that year.
However, there is obviously no causal relationship. Jewish women have a higher risk of breast cancer, while Mormons have a lower risk.
However, one's religion is not a cause of breast cancer. There are other explanations. It has been convincingly demonstrated that people of lower socioeconomic status (SES) have a higher risk of lung cancer, but low SES is unlikely to be a direct cause. A more plausible explanation is that people of lower SES are more likely to smoke and to be chronically exposed to air pollution, and that exposure of the respiratory tract to these contaminants causes mutations in bronchial cells that can eventually produce a cancer.
Causation: Causation means that the exposure produces the effect. It can be the presence of an adverse exposure.
Descriptive statistics of association are limited though, because a single number can never summarise every aspect of the relationship between two variables. This is why we always visualise the relationship between two variables. The standard graph for displaying associations among numeric variables is a scatter plot, using horizontal and vertical axes to plot two variables as a series of points.
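As a rough sketch, a scatter plot of this kind could be built with ggplot2 as follows. The data frame toy_storms, its values, and the variable names pressure and wind are invented for illustration and are not taken from the original data.

```r
library(ggplot2)

# Toy data: hypothetical values, for illustration only
toy_storms <- data.frame(
  pressure = c(1010, 1005, 998, 990, 985, 975, 960, 950),
  wind     = c(25, 30, 40, 50, 60, 75, 95, 110)
)

# Map one numeric variable to each axis and draw one point per observation
scatter_plot <- ggplot(toy_storms, aes(x = pressure, y = wind)) +
  geom_point()

print(scatter_plot)
```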
There are a few other options beyond the standard scatter plot. Take note: it may be necessary to round numeric variables first. Numerically exploring associations between pairs of categorical variables is not as simple as the numeric variable case. The general approach is to cross-tabulate the two variables, counting the observations that fall into each combination of categories. The resulting table is called a contingency table. The counts in the table are sometimes referred to as frequencies. For example, the frequencies of each storm category and month combination are given by:
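A minimal sketch of such a cross-tabulation, assuming base R's xtabs function is the one meant and using a small made-up data frame (toy_storms, with invented values) in place of the real storms data:

```r
# Toy data: hypothetical values, one row per storm observation
toy_storms <- data.frame(
  type  = c("Hurricane", "Hurricane", "Tropical Storm", "Tropical Storm",
            "Tropical Storm", "Tropical Depression", "Hurricane",
            "Tropical Depression"),
  month = c(8, 9, 8, 9, 9, 7, 9, 8)
)

# First argument: a formula naming the variables to cross-tabulate
# Second argument: the data set that contains those variables
storm_counts <- xtabs(~ type + month, data = toy_storms)

# Rows are storm types, columns are months, cells are counts
print(storm_counts)
```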
The first argument sets the variables to cross-tabulate. The second argument tells the function which data set to use. What does this tell us?
It shows us how many observations are associated with each combination of values of type and month. More severe storms occur in the middle of the storm season, which is perhaps not all that surprising. If both variables are ordinal we can also calculate a descriptive statistic of association from a contingency table. It makes no sense to do this for nominal variables because their values are not ordered.
For ordinal variables, we have to use some kind of rank correlation coefficient that accounts for the categorical nature of the data. The sign of the coefficient tells us about the direction of the association.
Bar charts can be used to summarise the relationship between two categorical variables.
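Before moving on to bar charts, here is one way the rank correlation idea could be sketched in base R. This is an illustration under assumptions: the two ordered factors and their values are invented, and Spearman's and Kendall's coefficients are used as two common examples of rank correlation, not necessarily the one the original analysis had in mind.

```r
# Toy ordinal data: hypothetical values, for illustration only
severity <- factor(
  c("Tropical Depression", "Tropical Storm", "Tropical Storm",
    "Hurricane", "Hurricane", "Hurricane"),
  levels = c("Tropical Depression", "Tropical Storm", "Hurricane"),
  ordered = TRUE
)
season_stage <- factor(
  c("early", "early", "mid", "mid", "late", "late"),
  levels = c("early", "mid", "late"),
  ordered = TRUE
)

# as.integer() turns each ordered category into its position in the ordering
severity_rank <- as.integer(severity)
stage_rank    <- as.integer(season_stage)

# Rank correlations: the sign gives the direction of the association
rho <- cor(severity_rank, stage_rank, method = "spearman")
tau <- cor(severity_rank, stage_rank, method = "kendall")

print(c(spearman = rho, kendall = tau))
```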
The basic idea is to produce a separate bar for each combination of categories in the two variables. The lengths of these bars are proportional to the values they represent, which are either the raw counts or the proportions in each category combination. This is the same information displayed in a contingency table. Using ggplot2 to display this information is not very different from producing a bar graph to summarise a single categorical variable. As always, we start by using the ggplot function to construct a graphical object containing the necessary default data and aesthetic mapping:
We mapped the year variable to the x axis, and the storm category type to the fill colour. We want to display information from two categorical variables, so we have to define two aesthetic mappings.
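A minimal sketch of these two steps, assuming a small invented data frame (toy_storms) in place of the real storms data. The object name bar_plt is also just an illustrative choice. The second step adds geom_bar, which stacks the bars within each x value by default.

```r
library(ggplot2)

# Toy data: hypothetical values, for illustration only
toy_storms <- data.frame(
  year = c(1995, 1995, 1995, 1996, 1996, 1997, 1997, 1997),
  type = c("Hurricane", "Tropical Storm", "Tropical Storm", "Hurricane",
           "Tropical Depression", "Hurricane", "Hurricane", "Tropical Storm")
)

# Step 1: default data plus two aesthetic mappings (x axis and fill colour)
bar_plt <- ggplot(toy_storms, aes(x = year, fill = type))

# Step 2: add a bar layer; geom_bar() counts the rows in each combination
bar_plt <- bar_plt + geom_bar()

print(bar_plt)
```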
This is called a stacked bar chart. We have all the right information in this graph, but it could be improved. Look at the labels on the x axis. Not every bar is labelled. This occurs because year is stored as a numeric vector in storms, yet we are treating it as a categorical variable in this analysis; ggplot2 has no way of knowing this, of course. We need a new trick here. One way to do this is to convert year to a character vector. We can convert a numeric vector to a character vector with the as.character function.
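One way this conversion might look, assuming dplyr's mutate function is used to overwrite the column. The data frame toy_storms and the name toy_storms_chr are invented for illustration.

```r
library(dplyr)

# Toy data: hypothetical values, for illustration only
toy_storms <- data.frame(
  year = c(1995, 1995, 1996, 1997),
  type = c("Hurricane", "Tropical Storm", "Hurricane", "Tropical Depression")
)

# Replace the numeric year column with a character version of itself
toy_storms_chr <- mutate(toy_storms, year = as.character(year))

# year is now a character vector, so ggplot2 will treat it as categorical
str(toy_storms_chr)
```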
We must load and attach dplyr to make this work. Now we just need to construct and display the ggplot2 object again using this new data frame. However, the ordering of the storm categories is not ideal because the order in which the different groups are presented does not reflect the ordinal scale we have in mind for storm category.
Time for a new trick. We need to somehow embed the information about the required category order of type into our data. Factors let us do this: a factor stores a categorical variable together with the order of its categories. To make use of this we need to know how to convert something into a factor. We use the factor function, setting its levels argument to be a vector of category names in the correct order. Factors are very useful. They crop up all the time in R. Unfortunately, they are also a pain to work with and a frequent source of errors.
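The factor conversion described above can be sketched as follows. The example values and the particular category order in type_order are assumptions made for illustration. Substitute whichever order reflects the intended ordinal scale.

```r
# Toy data: hypothetical storm categories, for illustration only
type_chr <- c("Hurricane", "Tropical Depression", "Extratropical",
              "Tropical Storm", "Hurricane")

# Without a levels argument, the levels are simply the sorted unique values
default_levels <- levels(factor(type_chr))

# Assumed ordering of categories, supplied explicitly
type_order <- c("Tropical Depression", "Extratropical",
                "Tropical Storm", "Hurricane")

# The levels argument fixes the order that plots and tables will use
type_fct <- factor(type_chr, levels = type_order)

print(default_levels)
print(levels(type_fct))
```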
One problem with this kind of chart is that it can be hard to spot associations between the two categorical variables. We snuck in one more tweak. This final figure shows that on average, storm systems spend more time as hurricanes and tropical storms than tropical depressions or extratropical systems.
Other than that, the story is a little messy. For example, one year was odd, with few storm events and relatively few hurricanes.
The horizontal line inside the box is the sample median. This is our measure of central tendency. It allows us to compare a typical value of the numeric variable across the different categories.
The boxes display the interquartile range (IQR) of the numeric variable in each category, i.e. the range spanned by the middle 50% of the values. This allows us to compare the spread of the numeric values in each category. The interpretation of the whiskers (the vertical lines extending from each box) depends on which kind of box plot we are making. By default, ggplot2 produces a traditional Tukey box plot.
Each whisker is drawn from one end of the box (the upper or lower quartile) to a well-defined point. To find where the upper whisker ends we have to find the largest observation that is no more than 1.5 times the IQR away from the upper quartile. The lower whisker ends at the smallest observation that is no more than 1.5 times the IQR away from the lower quartile. Any points that do not fall inside the whiskers are plotted individually. These may be outliers, although they could also be perfectly consistent with the wider distribution.
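The whisker rule can be checked by hand. The following sketch uses an invented numeric vector x and base R's quantile function with its default settings; the variable names are illustrative only.

```r
# Toy data: hypothetical values, with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Lower and upper quartiles (the two ends of the box)
q <- quantile(x, probs = c(0.25, 0.75))
lower_q <- q[[1]]
upper_q <- q[[2]]
iqr <- upper_q - lower_q

# Upper whisker: largest observation within 1.5 * IQR above the upper quartile
upper_whisker <- max(x[x <= upper_q + 1.5 * iqr])

# Lower whisker: smallest observation within 1.5 * IQR below the lower quartile
lower_whisker <- min(x[x >= lower_q - 1.5 * iqr])

# Anything beyond the whiskers would be drawn as an individual point
outside_points <- x[x < lower_whisker | x > upper_whisker]

print(c(lower = lower_whisker, upper = upper_whisker))
print(outside_points)
```

In this toy vector the quartiles are 3.25 and 7.75, so the IQR is 4.5 and the whiskers end at 1 and 9, leaving 30 to be plotted on its own.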