Here is an interesting plot showing the number of live births in the United States each day of 1978. We are going to use it to learn how to create plots using the ggformula package.
To get R (or any software) to create this plot (or do anything else, really), there are two important questions you must be able to answer. Before continuing, see if you can figure out what they are.
To get R (or any software) to create this plot, there are two important questions you must be able to answer:
1. What do you want R to do? (What is the goal?)
2. What must R know to do that?
To make this plot, the answers to our questions are
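A. Make a scatter plot (i.e., a plot consisting of points).
A. The data used for the plot: the y-axis variable (births), the x-axis variable (date), and the data set that contains them (Births1978).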
We just need to learn how to tell R these answers.
We will provide answers to our two questions by filling in the boxes of this important template:
goal( y ~ x, data = mydata )
We just need to identify which portions of our answers go into which boxes.
It is useful to provide names for the boxes:
goal( y ~ x, data = mydata, ... )
These names can help us remember which things go where. (The ... indicates that there are some additional arguments we will add eventually.)
Sometimes we will add or subtract a bit from our formula. Here are some other forms we will eventually see.
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z, data = mydata )
# unified version
goal( formula , data = mydata )
box      fill in with   purpose
goal     gf_point       plot some points
y        births         y-axis variable
x        date           x-axis variable
mydata   Births1978     name of data set
Put each piece in its place in the template below and then run the code to create the plot.
goal(y ~ x, data = mydata)
If you get an “object not found” or “could not find function” error message, that indicates that you have not correctly filled in one of the four boxes from the template.
Note: R is case sensitive, so watch your capitalization.
For the record, here are the first few rows of Births1978.
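A minimal sketch of how to look at those rows yourself. The tutorial supplies Births1978 ready-made; the construction below, which filters the Births data from the mosaicData package down to a single year, is only an assumption about how an equivalent data set could be built outside the tutorial.

```r
library(mosaicData)   # assumed to be installed; provides Births (1969 to 1988)

# Hypothetical reconstruction of the tutorial's Births1978 data set:
# keep only the rows for the year 1978.
Births1978 <- subset(Births, year == 1978)

# Show the first six rows.
head(Births1978)
```

The columns used throughout this tutorial (births, date, day_of_year, and wday) should all be visible in that output.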
The most distinctive feature of ggformula plots is the use of formulas to describe the positional information of a plot. Formulas in R always involve the tilde character, which is easy to overlook. It looks like this: ~
The position of ~ on the keyboard varies from brand to brand. (Image: location of the tilde key on an Apple keyboard.)
Most gf_ functions take a formula that describes the positional attributes of the plot. Using one of these functions with no arguments will show you the “shape” of the formula it requires.
Run this code to see the formula shape for gf_point().
gf_point()
You should see that gf_point()’s formula has the shape y ~ x, so the \(y\)-variable name goes before the tilde and the \(x\)-variable name goes after. (Think: “y depends on x”. Also note that the \(y\)-axis label appears farther left than the \(x\)-axis label.)
Reverse the roles of the variables – changing births ~ date to date ~ births – to see how the plot changes.
gf_point(births ~ date, data = Births1978)
Change date to day_of_year and see how the plot changes. (If you do this on a separate line, you will see both plots at once.)
gf_point(births ~ date, data = Births1978)
This tutorial includes similar data sets for the years 1969 to 1988. To use data from a different year, just change the name of the data set to indicate the year.
Fill in the “blanks” below to create plots for three different years. To what extent do the patterns from 1978 persist over time?
gf_point(births ~ date, data = Births1978)
gf_point(births ~ date, data = Births____)
gf_point(births ~ date, data = __________)
(Note: This creates three separate plots. We will learn how to combine them into a single plot shortly.)
Our plots have points because we have used gf_point(). But there are many other gf_ functions that create different types of plots.
Experiment with some other plot types by changing gf_point() to one of the following:
- gf_line(): connect the dots
- gf_smooth(): smoothed version of gf_line()
- gf_lm(): regression line (same as gf_smooth() with method = "lm")
- gf_spline(): another type of smoother (using splines)
gf_point(births ~ date, data = Births1978)
If you created the spline plot, you probably found it “too wiggly”. If you created the smoothed plot, you might prefer that the gray bands not be displayed. For any of these plots you might have preferred different colors or sizes of things. Such global characteristics of a plot can be adjusted with additional arguments to the gf_ function. These go in the ... part of the template.
The general form for these is option = value.
For example,
- spar = 0.5 (or any number between 0 and 1) controls the amount of smoothing in gf_spline().
- se = TRUE (note capitalization) turns on the “error band” in gf_smooth() and gf_lm().
- color = "red" or fill = "navy" (note quotes) can be used to change the colors of things. (fill is typically used for regions that are “filled in” and color for dots and lines.)
- alpha = 0.5 (or any number between 0 and 1) will set the opacity (0 is completely transparent and 1 is completely opaque).
Here are some examples. Adjust some of the options to see how things change. We’ve inserted some line breaks to make the options easier to locate in the code.
gf_spline(births ~ date, data = Births1978,
          spar = 0.5,
          color = "navy")
gf_smooth(births ~ date, data = Births1978,
          se = TRUE,
          color = "red")
gf_point(births ~ date, data = Births1978,
         size = 3, shape = 18,
         alpha = 0.5, color = "purple")
You can learn about the attributes available for a given layer using the “quick help” for a layer function. You can find out more by reading the help file produced with ? (for example, ?gf_point).
# "quick help" for gf_point()
gf_point()
Curious to know all the available color names? Run this code.
colors()
We said that gf_point() creates a plot with points. This isn’t quite true. Technically, it creates a layer with points. A plot may have multiple layers. To create a multilayered plot, simply append %>% at the end of the code for one layer and follow that with another layer.
- Append %>% at the end of the first two lines.
- Replace date with day_of_year so that all three years are aligned.
- Try gf_smooth() or gf_spline().
gf_point(births ~ date, data = Births1978, color = "navy")
gf_point(births ~ date, data = Births1969, color = "red")
gf_point(births ~ date, data = Births1988, color = "skyblue")
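To illustrate layering without giving away the exercise above, here is a small sketch that uses the KidsFeet data instead. It assumes the mosaicData package is installed; the choice of variables and colors is ours, purely for illustration. A layer of points is followed by a regression-line layer, joined with %>%.

```r
library(ggformula)
library(mosaicData)   # assumed to be installed; provides KidsFeet

# First layer: points. Second layer: a regression line drawn on top.
# The formula and data are repeated in the second layer to be explicit.
p <- gf_point(length ~ width, data = KidsFeet, color = "navy") %>%
  gf_lm(length ~ width, data = KidsFeet, color = "red")

p   # printing the object draws both layers in a single plot
```

Without the %>%, the two calls would produce two separate plots rather than one plot with two layers.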
The births data in 1978 contains two clear “waves” of dots. One conjecture is that these are weekdays and weekends. We can test this conjecture by putting different days in different colors.
In the lingo of ggformula, we need to map color to the variable wday. Mapping and setting attributes are different in an important way.
- color = "navy" sets the color to “navy”. All the dots will be navy.
- color = ~ wday maps color to wday. This means that the color will depend on the values of wday. A legend (aka, a guide) will be automatically included to show us which days are which.
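The difference between setting and mapping can be sketched side by side on a different data set. This example assumes the mosaicData package is installed and uses its KidsFeet data; the variable sex is chosen only for illustration.

```r
library(ggformula)
library(mosaicData)   # assumed to be installed; provides KidsFeet

# Setting: every point gets the same fixed color (quoted color name).
p_set <- gf_point(length ~ width, data = KidsFeet, color = "navy")

# Mapping: the color depends on the value of the variable sex
# (one-sided formula with a tilde), so a legend is needed.
p_map <- gf_point(length ~ width, data = KidsFeet, color = ~ sex)

p_set
p_map
```

Only the mapped version needs a legend, because only there does color carry information about the data.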
Change the color argument so that it maps to wday. Don’t forget the tilde (~).
Try some other plot types: gf_line(), gf_smooth(), etc. Which do you like best? Why?
gf_point(births ~ date, data = Births1978, color = "navy")
If we want to look at all 20 years of birth data, overlaying the data is likely to put too much information in too little space and make it hard to tell which data is from which year. (Deciphering 20 colors or 20 shapes can be hard, too.) Instead, let’s put each year in a separate facet or subplot. Facets are better than 20 separate plots because the coordinate systems are shared across the facets, which saves space and makes comparisons across facets easier.
There are two ways to create facets. The simplest way is to add a vertical bar | to our formula.
gf_point(births ~ day_of_year | year, data = Births, size = 0.5)
The second way is to add on a facet command using %>%:
gf_point(births ~ day_of_year, data = Births, size = 0.5) %>%
  gf_facet_wrap( ~ year)
Edit one of the plots above to do the following:
wday
Now edit one of the plots above to use date instead of day_of_year.
What advantage do we get from using facets in this case?
The faceting we did on the previous page is called facet wrapping. If the facets don’t fit nicely in one row, the facets continue onto additional rows.
A facet grid uses rows, or columns, or both in a fixed way.
gf_point(births ~ day_of_year | year ~ wday, data = Births, size = 0.5)
Recreate the plot above using gf_facet_grid(). This works much like gf_facet_wrap() and accepts a formula with one of three shapes:
- y ~ x (facets along both axes)
- ~ x (facets only along the x-axis)
- y ~ . (facets only along the y-axis; notice the important dot in this one)
(These three formula shapes can also be used on the right side of |.)
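The three formula shapes can be tried out on a smaller data set before tackling the births plot. The sketch below assumes the mosaicData package is installed and uses its KidsFeet data; the faceting variables sex and domhand are chosen only for illustration.

```r
library(ggformula)
library(mosaicData)   # assumed to be installed; provides KidsFeet

# A single layer of points that each facet specification builds on.
pts <- gf_point(length ~ width, data = KidsFeet)

p_both <- pts %>% gf_facet_grid(sex ~ domhand)  # rows by sex, columns by domhand
p_cols <- pts %>% gf_facet_grid( ~ sex)         # columns only
p_rows <- pts %>% gf_facet_grid(sex ~ .)        # rows only (note the dot)

p_both
```

Printing p_cols or p_rows in place of p_both shows the other two layouts.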
Some gf_ functions only require the user to supply one variable. The y-positions for these plots can be calculated from the x-variable.
Here are two examples. Run the code below to see the formula shapes used by these plots.
gf_histogram()
gf_density()
Notice that when there is only one variable it is always on the right side. Other examples of functions that take just one variable include gf_dens(), gf_freqpoly(), gf_dotplot(), gf_bar(), and gf_qq().
Create some “one-variable” plots using the functions listed above and the template below. Which variable should you use?
gf____( ~ _____, data = Births1978)
Time to explore on your own. Here are three data sets you can use.
- HELPrct has data from a study of people addicted to alcohol, cocaine, or heroin.
- KidsFeet has information about some kids’ feet.
- NHANES has lots of physiologic and other measurements from 10,000 subjects in the National Health and Nutrition Examination Survey. (This data set will only be available if the NHANES package is installed where this tutorial is running.)
To find out more about the data sets use ?HELPrct, ?KidsFeet, or ?NHANES. To see the first few rows of the data, you can use head().
To get a list of functions available in ggformula, run this code chunk.
# list all functions starting gf_
apropos("gf_")
Make some plots to explore one or more of these data sets.
gf_point(length ~ width, data = KidsFeet)
head(KidsFeet)
?KidsFeet
df_stats() works much like the plotting functions in ggformula, but it produces a data frame of summary information rather than a plot. These summary data frames can be useful for plotting or for obtaining some numerical precision for the patterns we see in our plots.
Consider this plot
We can see clear trends, but what if we want to know the mean and standard deviation, or the median and IQR for the number of births on each of the seven weekdays?
Change gf_boxplot() to df_stats().
gf_boxplot(births ~ wday, data = Births1978)
The default use of df_stats() computes some of the most common summary statistics. But we can use df_stats() to create other numerical summaries as well.
df_stats(births ~ wday, data = Births1978, median, iqr)
Modify the code above to compute some other summaries of your choosing.
Births1978 %>%
df_stats(births ~ wday, median, iqr)
In addition to the usual plot elements (lines, points, etc.), sometimes it is useful to add text with gf_text() or gf_label().
See if you can figure out what this does before you execute it.
Births_summary <-
  df_stats(births ~ wday, data = Births1978)
gf_boxplot(births ~ wday, data = Births1978) %>%
  gf_point(mean ~ wday, color = "red", data = Births_summary) %>%
  gf_label(6400 ~ wday, label = ~ round(median), data = Births_summary) %>%
  gf_text(6700 ~ wday, label = ~ round(mean), data = Births_summary,
          color = "red")
If you are fussy about your plots, you may be wondering how to have more control over things like:
As you can imagine, all of these things can be adjusted pretty much however you like. But we’ll save that for a separate tutorial on customizing your plots.
Not biological anatomy, but plot anatomy. In order to talk about plots, it is handy to have words for their various components.
A frame is the bounding rectangle in which the plot is constructed. This is essentially what you may think of as the x and y axes, but the plot isn’t actually required to have visible “axes”.
glyphs or marks are the particular symbols placed on the plot (examples: dots, lines, smiley faces, your favorite emoji, or whatever gets drawn).
Each glyph has a number of attributes or properties.
scales map raw data over to attributes of the plot.
guides go in the other direction and help the human map graphical attributes back to the raw data. Guides include what you might call a legend, but also things like axis labeling.
facets are coordinated subplots.
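These terms can be tied to a concrete plot. The sketch below assumes the mosaicData package is installed and uses its KidsFeet data; the variables are chosen only for illustration, and the comments mark which part of the code corresponds to which piece of plot anatomy.

```r
library(ggformula)
library(mosaicData)   # assumed to be installed; provides KidsFeet

p <- gf_point(
  length ~ width,      # positions within the frame (y ~ x)
  data = KidsFeet,     # glyphs: one point per row of the data
  color = ~ sex,       # scale: values of sex are mapped to colors
  size = 2             # attribute set to a fixed value for every glyph
) %>%
  gf_facet_wrap( ~ domhand)   # facets: one coordinated subplot per dominant hand

p
# Guides: the axis labels and the color legend are added automatically
# so that a reader can map positions and colors back to the data.
```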