Tidy Plotting in R

The aim of this tutorial is to introduce you to tidy visualization in R. Think of it as the seamless transition from tidy data wrangling to tidy data visualization.

While there are many, extremely diverse packages that can be used for plotting in R, the one I use most is undoubtedly ggplot2.

For dynamic plotting and financial series plots, there are also packages like dygraphs and plotly.

Here follows a very basic, high level view of ggplot2 plotting functionality. I suggest supplementing this tutorial by reading this post with examples, and saving or printing out this ggplot cheatsheet.

A great source is also this e-book by Hadley Wickham and others.

There is also a great RStudio add-in package called ggplotAssist, which provides an easy user interface for building ggplot figures.

As always, though, I encourage understanding the fundamentals of a platform as opposed to using simple (and easy) solutions.

Get your data tidy

Very important: ggplot2 wants your data to be in a tidy format as discussed in the previous tutorial.

You should have practiced this by now…

Once your data is tidy, planning your plot framework is essential.
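As a reminder of what tidy means for plotting, here is a minimal sketch. The toy data frame and its numbers are invented purely for illustration, and I assume the tidyr package is installed (pivot_longer() does the same job as the gather() calls used later in this tutorial):

```r
library(tidyr)

# Toy wide data, invented for illustration: one column per country.
wide <- data.frame(
  Date = as.Date(c("2000-01-14", "2000-01-21")),
  brz  = c(100, 101),
  chn  = c(50, 52)
)

# Long (tidy) form: one row per Date-Country pair, which is what ggplot2 wants.
long <- pivot_longer(wide, cols = -Date, names_to = "Country", values_to = "TRI")

long
```

In the long form, Country can be mapped directly to colour or used for faceting.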

How ggplot2 thinks:

tidy data as input
ggplot() creates an open canvas.
geometries (is it a boxplot, lineplot, scatterplot, etc.) draw on the canvas, with aesthetics mapping columns to x, y, colour, etc.
faceting (repeating a plot type in a grid, e.g.) splits plots by some attribute.
Theme - what to add / change theme-wise.

Note - ggplot calls are separated by +, not pipes ( %>% )
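The steps above can be sketched on a built-in dataset. This is only a minimal illustration, assuming ggplot2 is installed; mtcars and the columns chosen are my own stand-ins, not part of this tutorial's data:

```r
library(ggplot2)

# mtcars ships with R; it is used here purely as a stand-in dataset.
p <- ggplot(mtcars) +                      # 1. open an empty canvas
  geom_point(aes(x = wt, y = mpg,          # 2. a geometry, with aesthetics mapping
                 color = factor(cyl))) +   #    columns to x, y and colour
  facet_wrap(~gear) +                      # 3. one panel per value of gear
  theme_bw()                               # 4. theme adjustments

p
```

Every step is joined with +, and each one adds to the same plot object.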

Let’s create a quick plot:

pacman::p_load(tidyverse)

plot_df <- left_join(fmxdat::Indexes %>% gather(Country, TRI, 
    -Date), fmxdat::Indexes_Labs, by = "Country") %>% mutate(Label = coalesce(Label, 
    "Other")) %>% arrange(Date)

print(plot_df)

## # A tibble: 16,590 x 4
##    Date       Country    TRI Label
##    <date>     <chr>    <dbl> <chr>
##  1 2000-01-14 brz     1420.  Brics
##  2 2000-01-14 chl       NA   Other
##  3 2000-01-14 chn       36.7 Brics
##  4 2000-01-14 col      105.  Other
##  5 2000-01-14 cze       91.7 Other
##  6 2000-01-14 egt      352.  HEGP 
##  7 2000-01-14 grc     1215.  Other
##  8 2000-01-14 hun      369.  HEGP 
##  9 2000-01-14 ind      175.  Brics
## 10 2000-01-14 ino      280.  Other
## # ... with 16,580 more rows

With our data now tidied up, we can proceed to plot countries as follows:

Plot all the Brics countries together in one plot:

dfPlotData <- plot_df %>% filter(Label == "Brics")
g <- dfPlotData %>% 
# Initialize the canvas:
ggplot() + 
# Now add the aesthetics:
geom_line(aes(x = Date, y = TRI, color = Country), alpha = 0.8, 
    size = 1)

# Notice that I used alpha to adjust the line opacity (useful
# when having overlapping lines e.g.)

print(g)

From the above code, note:

ggplot() starts the canvas environment, followed by ‘+’, which plays the role of the pipe.

Then specify geometries (line, point, density, bars, etc.)

In the geometry:

Specify the aesthetics of the geometry (e.g. x = …, y = …, color = …)

Follow this with other specs, like e.g., alpha = 0.8 (opacity), size = 2 (e.g. thickness of line)

Follow this with theme specs, like: labs(title = “Some Title”, x = “date”, y = “TRI”)

Graphically, the process can be thought of as:

Also notice that the plot is now an object, g, that we can pass down to a function, save in dataframes, etc.
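To make the "plot as an object" idea concrete, here is a small sketch. The helper add_title() and the mtcars stand-in data are hypothetical, and I assume the ggplot2 and tibble packages are installed:

```r
library(ggplot2)

# Hypothetical helper: takes any ggplot object and adds a title to it.
add_title <- function(plot_obj, title) {
  plot_obj + labs(title = title)
}

base_plot <- ggplot(mtcars) + geom_point(aes(wt, mpg))

# Plots are ordinary R objects, so they can live in a list...
plots <- list(raw = base_plot, titled = add_title(base_plot, "Weight vs mpg"))

# ...or in a list-column of a data frame.
plot_tbl <- tibble::tibble(name = names(plots), plot = plots)

# Pull a plot back out and print it:
plot_tbl$plot[[2]]
```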

Let’s adjust g to make it a bit nicer. Take note of every chunk below and what it does:

g <- g + theme_bw() + theme(legend.position = "bottom") + labs(x = "", 
    y = "Prices", title = "TRI of Brics Countries", subtitle = "Total Return Index using net dividends reinvested", 
    caption = "Note:\nBloomberg data used")

g

Looks better right? Let’s do a few more things:

(Please see this section as well first)

Let’s place each line in its own plot:

g <- g + facet_wrap(~Country)

g

Almost there… let’s scale the axes to be free for each y, and also change the colours a bit using ggthemes:

pacman::p_load(ggthemes)
g <- g + facet_wrap(~Country, scales = "free_y") + ggthemes::scale_color_wsj()

g

Another theme library to consider is ggsci. Here’s the gallery, and you can use this as shown here.

Let’s use e.g. the npg palette for colour:

pacman::p_load("ggsci")
g + scale_color_npg()

Another very, very nice feature of this plotting platform is that your figures are alive.

E.g., see below:

g %+% subset(dfPlotData, Date > as.Date("2009-01-31")) + labs(title = "Post-Crisis TRI")

g %+% subset(dfPlotData, Date < as.Date("2009-01-31")) + labs(title = "Pre-Crisis TRI")

You could also retrieve the data of a ggplot object simply as:

Plot_Data_Recovered <- g$data

You could also add another dataframe and some labels to your plot as:

pacman::p_load(lubridate)
Another_df <- fmxdat::findata %>% gather(Stocks, Px, -Date) %>% 
    filter(Stocks == "JSE.SLM.Close")

g + geom_line(data = Another_df, aes(Date, Px, color = Stocks), 
    colour = "steelblue", alpha = 0.3) + geom_label(data = Another_df %>% 
    filter(Date == last(Date)), aes(Date, Px, label = Stocks), 
    size = 3, alpha = 0.1) + 
geom_label(data = dfPlotData %>% filter(Date == last(Date)), 
    aes(Date, TRI, label = Country), size = 3, alpha = 0.1)

Saving

Saving your plots is simple too:

# file.path() inserts the separator that paste0() would omit:
rmsfuns::build_path(file.path(getwd(), "Figures"))
ggsave(filename = "Figures/Plot.png", plot = g, width = 6, height = 6, 
    device = "png")

Other types of plots

While this section is by no means exhaustive, I show a few other types of plots you may use as well:

Density plots

plot_data <- dfPlotData %>% 
group_by(Country) %>% 
mutate(Return = TRI/lag(TRI) - 1) %>% ungroup()

gdens <- plot_data %>% ggplot() + geom_density(aes(x = Return, 
    fill = Country))

gdens

Note - here we set the fill to be given by ‘Country’. Fill works similarly to colour, where the latter would’ve simply coloured the line, not the area under the curve.

Notice that we definitely need to adjust opacity using alpha. I also colour the lines using colour below:

# Rebuild the plot, rather than stacking a second layer on top of the opaque one:
gdens <- plot_data %>% ggplot() + 
geom_density(aes(x = Return, fill = Country, colour = Country), 
    alpha = 0.2, size = 1.25) + 
ggthemes::theme_economist_white()

gdens

Subsetting to look only at the deep left-tail (say returns below -10%) is also super easy:

gdens %+% subset(plot_data, Return < -0.1) + labs(title = "Left tail Plot")

Notice that subset requires a logical condition (i.e. it must give TRUE / FALSE for every row). The following therefore won’t work, as ifelse() here returns numbers (a wrangle), not a logical condition:

gdens %+% subset(plot_data, ifelse(Return < -0.1, -0.1, Return)) + 
    labs(title = "Left tail Plot")
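The difference is easy to see on a toy example in base R. The data frame below is invented for illustration only:

```r
# Toy data, invented for illustration only.
toy <- data.frame(Return = c(-0.15, -0.02, 0.03, -0.12))

# A logical condition keeps the rows where it is TRUE:
left_tail <- subset(toy, Return < -0.1)
nrow(left_tail)

# A numeric expression is not a condition, so subset() refuses it.
# tryCatch() is used here only to capture the error message:
res <- tryCatch(
  subset(toy, ifelse(Return < -0.1, -0.1, Return)),
  error = function(e) conditionMessage(e)
)
res
```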

So, let’s bring the tails in a bit to not skew the graph as much. This requires a re-wrangling, not simply a subsetting:

Let’s winsorise our returns to lie between the 5% and 95% quantiles of returns only:

# Winsorising to be between 5% and 95% can be done as
# follows:
plot_data <- plot_data %>% 
filter(!is.na(Return)) %>% group_by(Country) %>% 
mutate(q05 = quantile(Return, na.rm = T, probs = 0.05), q95 = quantile(Return, 
    na.rm = T, probs = 0.95), Return = ifelse(Return >= q95, 
    q95, ifelse(Return <= q05, q05, Return))) %>% 
select(-starts_with("q")) %>% ungroup()

gdens <- ggplot(plot_data) + 
geom_density(aes(x = Return, fill = Country, colour = Country), 
    alpha = 0.2, size = 1.25) + 
ggthemes::theme_economist_white() + 
labs(y = "Density (winsorized)", x = "Return", title = "Return density plot") + 
    
theme(legend.position = "bottom")

gdens
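The capping logic in the mutate() call above can also be wrapped in a small helper. This is just a sketch: winsorise_vec() is a hypothetical name of my own, and it uses pmin() / pmax() in place of the nested ifelse():

```r
# Hypothetical helper: cap a numeric vector at its lower and upper quantiles.
winsorise_vec <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Values below q[1] become q[1]; values above q[2] become q[2].
  pmin(pmax(x, q[1]), q[2])
}

# Toy vector with one extreme value in each tail:
x <- c(-10, 1, 2, 3, 4, 5, 6, 7, 8, 100)
y <- winsorise_vec(x)
y
```

Inside a grouped pipeline this would be used as mutate(Return = winsorise_vec(Return)), which avoids creating and then dropping the helper quantile columns.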

Can figures be in functions?

Of course! Ideally you should write templates for your figures as functions, and source them like you would any other function.

E.g., in the following code chunk I introduce you to boxplotting and jitter plotting combined in ggplot. I do this in a function, where I:

Make my function safe to illustrate an intentional error with a clear message to solve it
Show you how a figure’s design can be flexibly incorporated into a function:

pacman::p_load(purrr)

Jitter_Boxplot <- function(data_frame, Title, Subtitle, Caption, Xlab, Ylab, Alpha_Set = 0.5) {

  # Dataframe should be tidy, and be of the form: 
      # date  | Ticker  |   Identifyer  |  Return  

  if( !"Identifyer" %in% names(data_frame) ) stop("\n\nERROR:::::>Please provide valid Identifier column!\n\n")
  if( !"Date" %in% names(data_frame) ) stop("Please provide valid Date column ")
  # inherits() is safer than class(x) != "Date", as class() can return several values:
  if( !inherits(data_frame$Date, "Date") ) stop("Date column not of class Date ")
  
g1 <-   
ggplot(data_frame) + 
  
  geom_boxplot(aes(x = Identifyer, y = Return, fill = Country), alpha = Alpha_Set) + 
  
  # alpha is a fixed setting here, so it goes outside aes():
  geom_jitter(aes(x = Identifyer, y = Return, color = Country), alpha = Alpha_Set) + 
  
    theme_bw() + 
  
  # Drop the legends ("none" replaces the deprecated FALSE):
  guides(color = "none", fill = "none") +
  # Add titles:
  labs(title = Title, 
       subtitle = Subtitle,
       caption = Caption,
       x = Xlab, y = Ylab) + 
  
  scale_color_npg() + # Colours of the jittered points...
  
  scale_fill_npg() # ...and fills of the boxes.

g1  

}

SafeJitter <- purrr::safely(Jitter_Boxplot)
# Let's first break it to see the error produced:
Result <-SafeJitter(data_frame = plot_data, 
               Title = "BRICS Return Histograms", 
               Subtitle = "Transparency is key", 
               Caption = "Data was downloaded from Bloomberg", 
               Xlab = "", Ylab = "Distribution",
               Alpha_Set = 0.4)

  print(Result$error)

## <simpleError in .f(...): 
## 
## ERROR:::::>Please provide valid Identifier column!
## 
## >

# And now let's use the function correctly based on the informative error message:    
Result <- 
SafeJitter(data_frame = plot_data %>% mutate(Identifyer = Country), 
               Title = "BRICS Return Histograms", 
               Subtitle = "Transparency is key", 
               Caption = "Data was downloaded from Bloomberg", 
               Xlab = "", Ylab = "Distribution")
  
g <- Result$result

g
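If purrr::safely() is new to you, the following sketch shows what it returns. The function half() is a hypothetical example of my own; only safely() itself comes from purrr:

```r
library(purrr)

# Hypothetical function that fails on bad input.
half <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  x / 2
}

# safely() wraps a function so that it never stops your script:
# it always returns a list with two slots, result and error.
safe_half <- safely(half)

ok  <- safe_half(10)     # result is filled in, error is NULL
bad <- safe_half("ten")  # result is NULL, error holds the condition

ok$result
conditionMessage(bad$error)
```

This is why the code above reads Result$error after the failed call and Result$result after the successful one.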

Point Plots

Let’s create a mean-variance plot of various crypto currencies over the last 52 weeks - with the size of each bubble given by its market cap (the market column) at the last date.

(Notice this requires a bit of planning and wrangling before we plot)…

Bub_Size <- fmxdat::cryptos %>% filter(date == last(date)) %>% 
    
select(name, market) %>% rename(Size = market) %>% unique

MeanVar <- fmxdat::cryptos %>% 
filter(date > last(date) %m-% weeks(52)) %>% 
group_by(name) %>% mutate(Return = close/lag(close) - 1) %>% 
    
summarise_at(vars(Return), list(Mean = ~mean(., na.rm = T), SD = ~sd(., 
    na.rm = T)))

# Let's append the size column:
plot_df <- left_join(MeanVar, Bub_Size, by = "name")

g <- plot_df %>% ggplot() + 
geom_point(aes(Mean, SD, size = Size), color = "steelblue", alpha = 0.6) + 
    guides(size = F) + 
theme_bw() + 
labs(title = "Terrible scales")
g

Almost there, but clearly the scales are terrible - let’s cap Mean at 0.15 and SD at 0.25:

plot_df <- 
plot_df %>% 
mutate(Mean = ifelse(Mean > 0.15, 0.15, Mean)) %>% 
mutate(SD = ifelse(SD > 0.25, 0.25, SD))

g <- plot_df %>% ggplot() + 
geom_point(aes(Mean, SD, size = Size), color = "steelblue", alpha = 0.6) + 
    guides(size = F) + 
theme_bw() + 
labs(title = "Terrible Size Spread")
g

Let’s rather improve the sizing by bucketing Size on some quantiles:

plot_df_Size_Adj <- plot_df %>% mutate(Q1 = quantile(Size, 0.1)) %>% 
    
mutate(Q2 = quantile(Size, 0.8)) %>% 
mutate(Size = ifelse(Size < Q1, 5, ifelse(Size < Q2, 6, ifelse(Size >= 
    Q2, 7, NA_real_)))) %>% 
mutate(ColorSet = ifelse(Size == 5, "small", ifelse(Size == 6, 
    "medium", "large")))

g <- plot_df_Size_Adj %>% ggplot() + 
geom_point(aes(Mean, SD, size = Size, color = ColorSet), alpha = 0.6) + 
    
scale_color_manual(values = c(small = "darkgreen", medium = "steelblue", 
    large = "darkred")) + 
guides(size = F, color = guide_legend(title = "Size Guide")) + 
    
theme_bw() + 
labs(title = "Size and spread sorted")
g

Notice how I set colours based on a column, and size based on a value, above. Also note how I changed the legend title.
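As an aside, the nested ifelse() used for the size buckets above can also be expressed with base R's cut(). This is a sketch on an invented toy vector, not the crypto data:

```r
# Toy sizes, invented for illustration.
sizes <- c(1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000)

# The same 10% and 80% quantile cut-offs as above:
breaks <- quantile(sizes, probs = c(0.1, 0.8), names = FALSE)

# right = FALSE makes the buckets [lower, upper), matching
# the Size < Q1 and Size < Q2 conditions.
bucket <- cut(sizes,
              breaks = c(-Inf, breaks, Inf),
              labels = c("small", "medium", "large"),
              right  = FALSE)

table(bucket)
```

The result is a factor, so it can be mapped to colour directly and the legend will follow the small / medium / large order.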

Barplots

To produce barplots in ggplot you have to be somewhat careful, as the notation can initially be a bit confusing.

I give you an example below of a simple bar plot:

df_barplot <- 
dplyr::storms %>% 
group_by(year) %>% summarize(max_wind = max(wind), max_pressure = max(pressure)) %>% 
    
ungroup() %>% 
mutate(max_wind = max_wind/max(max_wind), max_pressure = max_pressure/max(max_pressure)) %>% 
# Let's make it tidy...
gather(Type, Value, -year) %>% 
# Let's make column year a valid date object:
mutate(year = as.Date(paste0(year, "-01-01")))


df_barplot %>% 
ggplot() + 
geom_bar(aes(x = year, y = Value, fill = Type), stat = "identity") + 
    # scale_
facet_wrap(~Type, scales = "free_y", nrow = 2) + 
scale_x_date(labels = scales::date_format("%b '%y"), date_breaks = "2 years") + 
    
theme_bw() + 
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
labs(x = "", y = "") + guides(fill = F)

For the above I show you a few tricks:

Notice how I transform a numeric column into a date column.
Notice how I calculate the max wind per year, after which I express it as a fraction of the all-time max. Same for pressure.
In the plot code, notice how I make the axis show only every second year, and in doing so show the month and year as e.g. Jan ’82.
Notice how I rotate the x-axis labels to be vertical.
Notice that the legends are dropped, as they are redundant: the facet titles are descriptive enough.
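On the confusing notation: geom_bar() counts rows by default, which is why stat = "identity" is needed when you already have the bar heights. A minimal sketch on invented toy data, assuming ggplot2 is installed, shows that geom_col() is the shorthand for that case:

```r
library(ggplot2)

# Toy data, invented for illustration.
toy <- data.frame(year = c(2001, 2002, 2003), value = c(0.4, 0.9, 0.6))

# stat = "identity" tells geom_bar() to use y as given, not to count rows:
p_bar <- ggplot(toy) + geom_bar(aes(x = year, y = value), stat = "identity")

# geom_col() does the same without the extra argument:
p_col <- ggplot(toy) + geom_col(aes(x = year, y = value))

p_col
```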

Star wars plotting

For ordering plots, we make use of the forcats package in the code below. First, let’s plot the frequency of eye-colours for people in Star Wars:

pacman::p_load(forcats)

SW <- dplyr::starwars %>% 
select(name, height, mass, hair_color, gender, eye_color) %>% 
    
mutate(height = as.numeric(height))

SW %>% 
mutate(eye_color = forcats::fct_infreq(eye_color)) %>% 
ggplot() + 
geom_bar(aes(x = eye_color), fill = "steelblue", alpha = 0.7) + 
    
coord_flip() + 
# labs() refers to the original aesthetics, so x is still eye_color after the flip:
labs(x = "Eye-Colour", y = "Count", title = "Eye-Colour spread in Star Wars movies", 
    caption = "Data from dplyr package in R")
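To see what the forcats ordering helpers do on their own, here is a small sketch on invented toy vectors (fct_reorder() is used in the next plot):

```r
library(forcats)

# Toy vector, invented for illustration.
eyes <- factor(c("brown", "blue", "brown", "yellow", "brown", "blue"))

# fct_infreq(): order levels by how often they occur, most frequent first.
infreq_levels <- levels(fct_infreq(eyes))
infreq_levels

# fct_reorder(): order levels by another variable (here a made-up height).
name   <- factor(c("A", "B", "C"))
height <- c(180, 96, 202)
reorder_levels <- levels(fct_reorder(name, height))
reorder_levels
```

ggplot2 draws discrete axes in level order, so reordering the factor levels is what reorders the bars.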

Let’s plot the height of masculine characters in Star Wars, arranged by height, with a dotted line indicating the height of Darth Vader:

plot_df <- 
SW %>% filter(!is.na(height)) %>% 
filter(gender == "masculine") %>% 
mutate(name = as_factor(name)) %>% 
mutate(name = fct_reorder(name, height)) %>% 
mutate(VaderHeight = max(ifelse(name == "Darth Vader", height, 
    NA_real_), na.rm = T)) %>% 
mutate(Taller_Than_Vader = ifelse(height > VaderHeight, "Taller", 
    "Shorter"))

plot_df %>% 
ggplot() + 
geom_bar(aes(name, height, fill = Taller_Than_Vader), stat = "identity") + 
    
scale_fill_manual(values = c(Taller = "darkred", Shorter = "darkgreen")) + 
    
theme_bw() + 
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
geom_hline(aes(yintercept = max(VaderHeight)), linetype = "dotted", 
    size = 2, alpha = 0.8, color = "steelblue") + 
geom_label(data = plot_df %>% filter(height == median(height, 
    na.rm = T)) %>% head(1), aes(name, VaderHeight, label = glue::glue("Vader's height: {max(VaderHeight)} cm")), 
    alpha = 0.5) + guides(fill = F) + labs(title = "Height of Star Wars characters relative to Darth Vader's height", 
    caption = "Data from dplyr package", subtitle = "Seems Darth Vader is not the tallest....")

And that is how awesomely easy it is to plot in ggplot….

Bonus

Two bonus features are worth highlighting quickly.

First, how to combine various ggplots into a single graph.

There are a few ways to do this - probably the easiest is to use patchwork:

pacman::p_load(patchwork)

df_plot <- 
  
left_join(
  
fmxdat::Indexes %>% gather(Country, TRI, -Date),

fmxdat::Indexes_Labs,

by = "Country"

) %>% 
  
  mutate(Label = coalesce(Label, "Other")) %>% 
  
  arrange(Date)


p1 <- 
  df_plot %>% 
  
  filter(Label == "Brics") %>% 
  
  ggplot() + 
  
  geom_line(aes(Date, TRI, color = Country)) + 
  
  theme_bw() + 
  
  theme(legend.position = "bottom")

p2 <- 
SafeJitter(data_frame = plot_data %>% mutate(Identifyer = Country), 
           
               Title = "BRICS Return Histograms", 
           
               Subtitle = "Transparency is key", 
           
               Caption = "Data was downloaded from Bloomberg", 
           
               Xlab = "", Ylab = "Distribution")$result

p3 <- 
    plot_df %>% 
  
    ggplot() + 
  
    geom_bar(aes(name, height, fill = Taller_Than_Vader), stat= 'identity') + 
  
    scale_fill_manual(values = c(Taller = "darkred", Shorter = "darkgreen")) + 
  
    theme_bw() + 
  
    theme(axis.text.x=element_text(angle = 90, hjust = 1)) + 
  
    geom_hline(aes(yintercept = max(VaderHeight)), linetype = "dotted", size = 2, alpha = 0.8, color = "steelblue") + 
  
    geom_label( data = plot_df %>% filter(height == median(height, na.rm=T)) %>% head(1), aes(name, VaderHeight, label = glue::glue("Vader's height: {max(VaderHeight)} cm")), alpha = 0.5) + 
    guides(fill = F)

p3/ (p2 +p1) + 
  plot_annotation(
  title = 'Patching figures together',
  caption = 'Source: @littlemissdata'
) & 
  # The first argument of element_text() is the font family, so name face explicitly:
  theme(text = element_text(face = 'bold'))
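The patchwork operators used above can be tried out in isolation. This is a minimal sketch with mtcars as stand-in data, assuming ggplot2 and patchwork are installed:

```r
library(ggplot2)
library(patchwork)

# Three simple stand-in plots:
pa <- ggplot(mtcars) + geom_point(aes(wt, mpg))
pb <- ggplot(mtcars) + geom_histogram(aes(mpg), bins = 10)
pc <- ggplot(mtcars) + geom_boxplot(aes(factor(cyl), mpg))

side_by_side <- pa + pb         # + places plots next to each other
stacked      <- pa / pb         # / stacks plots vertically
combined     <- pc / (pa + pb)  # brackets nest layouts: one on top, two below

# & applies a theme to every panel in the patchwork:
combined & theme_bw()
```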

Second, how to embed a static graphic (jpg / png) into a ggplot figure.

Again, several ways of skinning this particular cat.

An easy way though is to use the magick & ggpubr packages as follows:

pacman::p_load(cowplot, magick, ggpubr)
Img_embed <- image_read("https://i.imgur.com/eDCUyUql.jpg")

Img_embed <- 
ggplot() + 
background_image(Img_embed) + coord_fixed()

# The plots are stacked with /, so it is the row heights that matter:
Img_embed/p3 + plot_layout(heights = c(1, 1))

More examples

Let’s use the dslabs package for some datasets that we can use to practice plotting on:

pacman::p_load("dslabs", "tidyverse", "ggthemes", "ggrepel", 
    "scales")

US Murders

Let’s create a plot for each region (Northeast, South, North Central, West) that:

shows the number of murders per 1 million people for every state;
shows the name and murder rate of the best and worst states (tip: use ggrepel to not have text boxes overlap);
ensures the four plots have the same axes;
ensures axes are log-scaled;
has the size of bubbles reflect the number of murders.

murdr <- 
dslabs::murders %>% 
mutate(mrate = total/population * 10^6) %>% 
mutate(pop_print = population/1e+06)


murdr %>% ggplot() + 
geom_point(aes(pop_print, mrate, size = total, color = region)) + 
    
facet_wrap(~region) + 
scale_x_log10() + 
scale_y_log10() + 
theme_bw() + 
# Let's alter the strip text and background a bit:
theme(strip.background = element_rect(fill = "steelblue"), strip.text = element_text(face = "bold", 
    colour = "black", size = 10)) + 
scale_color_manual(values = c("red", "blue", "darkgreen", "orange")) + 
    
labs(title = "Murders per Million by Region", x = "Population million (Log Scaled)", 
    y = "Murders per million (Log Scaled)") + 
ggrepel::geom_label_repel(data = murdr %>% group_by(region) %>% 
    arrange(mrate) %>% filter(mrate == max(mrate)), aes(pop_print, 
    mrate, label = glue::glue("{state}:\n{round(mrate, 1)} per mn")), 
    size = 4, alpha = 0.35, fill = "red") + 
ggrepel::geom_label_repel(data = murdr %>% group_by(region) %>% 
    arrange(mrate) %>% filter(mrate == min(mrate)), aes(pop_print, 
    mrate, label = glue::glue("{state}:\n {round(mrate, 1)} per mn")), 
    size = 4, alpha = 0.35, fill = "green") + 
guides(color = "none") + 
# The size legend title is set through labs(), not theme():
labs(size = "Total Murders")

Let’s reminisce on how the pollsters got the 2016 election so badly wrong...

Let’s colour Dems (and CNN) blue, and Reps (and Fox) red.

Let’s show Fox’s poll for the Donald as a highlighted dot, and the same for Clinton and CNN.

df_plot <- 
  dslabs::polls_us_election_2016 %>%
  
  tibble::as_tibble() %>% 
  
  rename(date = enddate) %>% 
  
  filter(state == "U.S." & date>="2016-07-01") %>%
  
  select(date, pollster, rawpoll_clinton, rawpoll_trump) %>%
  
  rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
  
  gather(candidate, percentage, -date, -pollster) %>% 
  
  mutate(Highlight = 
           ifelse(grepl("Fox", pollster) & grepl("Trump", candidate), "Fox", 
                            
                  ifelse(grepl("CNN", pollster) & grepl("Clinton", candidate), "CNN", "Other")) ) %>% 
  
  arrange(date)

# Specify the colours manually:
cols <- c("Trump" = "red", "Clinton" = "blue", "Fox" = "pink", "CNN" = "lightblue")

df_plot %>% 
  ggplot() + 
  
    geom_smooth(aes(date, percentage, color = candidate), method = "loess", span = 0.15, size = 3, alpha = 0.86) +
  
  geom_point(aes(date, percentage, color = candidate), size = 2, alpha = 0.5) + 
  
  geom_point(data = df_plot %>% filter(Highlight != "Other"), aes(date, percentage, color = Highlight), size = 8, alpha = 0.85) + scale_color_manual(values = cols) + 
  
  labs(x = "date", y = "Election Ratings", title = "Election Ratings for US Presidential Candidates 2016", caption = "Spoiler: they got it wrong...") + 
  
  theme(legend.position = "top", legend.title = element_blank()) + 
  
  # Let's make the date nicer: using the scales package
  scale_x_date(labels=scales::date_format("%b '%y"), date_breaks = "1 month")

End.

Applied Data Science

Tidy Visualization

NF Katzke