More experimenting with SQL in R Markdown: Pivoting data, outputting results to R, and creating a summary table using gt
Following on from my previous post, I use SELECT and JOIN statements to pivot the Taylor Swift and Beyoncé Tidy Tuesday data using RSQLite (after normalizing the underlying database), output the results to R, and create a relatively simple summary table using gt.
Setup
As in the previous post, load packages, read in data, set up a connection to the database and copy in the data, and clean release dates:
#load packages
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
library(gt) # for tables
#load data
sales <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv')
charts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/charts.csv')
#create connection to database and copy data into database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, sales)
copy_to(con, charts)
#clean release dates as per previous post
dbExecute(con, "
UPDATE charts
SET released = TRIM(SUBSTR(released, 1, INSTR(released, ' (')))
WHERE INSTR(released, ' (')>0;
")
## [1] 20
Previewing tables using SELECT and LIMIT queries
Let’s quickly refresh our understanding of how this data is structured (focusing only on columns relevant for this tutorial):
dbGetQuery(con, '
SELECT artist, title, country, sales, released
FROM sales
LIMIT 5
')
## artist title country sales released
## 1 Taylor Swift Taylor Swift US 5720000 October 24, 2006
## 2 Taylor Swift Fearless WW 12000000 November 11, 2008
## 3 Taylor Swift Fearless US 7180000 November 11, 2008
## 4 Taylor Swift Fearless AUS 500000 November 11, 2008
## 5 Taylor Swift Fearless UK 609000 November 11, 2008
dbGetQuery(con, '
SELECT artist, title, chart, chart_position, released
FROM charts
LIMIT 5
')
## artist title chart chart_position released
## 1 Taylor Swift Taylor Swift US 5 October 24, 2006
## 2 Taylor Swift Taylor Swift AUS 33 October 24, 2006
## 3 Taylor Swift Taylor Swift CAN 14 October 24, 2006
## 4 Taylor Swift Taylor Swift FRA — October 24, 2006
## 5 Taylor Swift Taylor Swift GER — October 24, 2006
Database normalizing
Please note that my goal in these examples is to work with RSQLite in order to learn more about using SQL in R, rather than presenting this as the best way to conduct these analyses.
The task I set myself at the end of the last post was to join and pivot this data so that each album is represented by a single row of data, and there are separate columns for sales and chart information for each country of interest.
This process will be easier to follow if the underlying data are normalized. This is also best practice for any database working with relational data, and involves minimizing redundancies in data storage and ensuring each table serves only a single purpose. In this case, we should create a separate albums table to record the details of each album (i.e., artist, title and release dates), so edits and additions to this information need only be made in one place. We then give each album a unique ID (a primary key) that can be used to reference this album in the charts and sales tables.
First, let’s create the new albums
table, adding an id
field to this table (as a primary key) to store a unique id number for each distinct album:
dbExecute(con, "
CREATE TABLE albums
(id INTEGER NOT NULL PRIMARY KEY,
artist TEXT,
title TEXT,
released TEXT
);
")
## [1] 0
We now populate this table with the unique albums found in the charts
table (we could just as easily have done this from the sales table):
dbExecute(con, "
INSERT INTO albums (artist, title, released)
SELECT DISTINCT artist, title, released FROM charts;
")
## [1] 14
Fourteen records have been added to this table, and viewing results with the query below shows that this has functioned as intended. Distinct albums have been added to the table, and automatically assigned a unique sequential id number. (I’m using LIMIT 5 after testing when presenting these steps to minimize display space. Omitting this statement returns the full 14 rows of data expected.)
dbGetQuery(con, '
SELECT *
FROM albums
LIMIT 5
')
## id artist title released
## 1 1 Taylor Swift Taylor Swift October 24, 2006
## 2 2 Taylor Swift Fearless November 11, 2008
## 3 3 Taylor Swift Speak Now October 25, 2010
## 4 4 Taylor Swift Red October 22, 2012
## 5 5 Taylor Swift 1989 October 27, 2014
Let’s rename the original sales
table to sales_original
, so we can create a new sales
table in which albums are referenced by using their unique id number, rather than their title.
dbExecute(con, "
ALTER TABLE sales
RENAME TO sales_original;
")
## [1] 0
To check that we’ve renamed the table correctly, let’s list the tables accessible via the connection we defined in the setup phase:
dbListTables(con)
## [1] "albums" "charts" "sales_original" "sqlite_stat1"
## [5] "sqlite_stat4"
We see here that sales
has been renamed to sales_original
, freeing us to create our new and improved sales
table. (Note that the sqlite_stat1
and sqlite_stat4
tables are internal tables created to record the presence of keys and indexes in our database, to optimize database performance.)
Recall that we created an id
column as a primary key in the albums
table. We are now going to create a column for these unique id numbers as a foreign key (FK_Sales_Albums
) in our replacement sales
table (where these id numbers will be used to point back to the album details stored in the albums
table):
dbExecute(con, "
CREATE TABLE sales
(id INTEGER NOT NULL PRIMARY KEY,
FK_Sales_Albums,
country,
sales
);
")
## [1] 0
Next, we need to extract records to add to this table. First, let’s design and test our query and check that it works as intended. We want to join the albums
table to the sales_original
table by matching on albums.title = sales_original.title
, and extract sales by country details for each album with the correct id attached. So we select the id field from the albums table (albums.id
) and label this as our foreign key (AS FK_Sales_Albums
). To complete our query, we also select the country and sales fields from the original sales table:
dbGetQuery(con, '
SELECT albums.id AS FK_Sales_Albums, sales_original.country, sales_original.sales
FROM sales_original
LEFT JOIN albums
ON albums.title = sales_original.title
LIMIT 5
')
## FK_Sales_Albums country sales
## 1 1 US 5720000
## 2 2 WW 12000000
## 3 2 US 7180000
## 4 2 AUS 500000
## 5 2 UK 609000
This process of drafting queries, verifying results, and then executing updates is one that works well when designing multi-stage queries. Let’s now combine our JOIN query with an INSERT INTO query, to populate the replacement sales
table:
dbExecute(con, "
INSERT INTO sales (FK_Sales_Albums, country, sales)
SELECT albums.id AS FK_Sales_Albums, sales_original.country, sales_original.sales
FROM sales_original
LEFT JOIN albums
ON albums.title = sales_original.title;
")
## [1] 48
We can repeat the same process with the charts data. First, rename the original charts
table to charts_original
:
dbExecute(con, "
ALTER TABLE charts
RENAME TO charts_original;
")
## [1] 0
Next, create the structure of the new and improved charts
table:
dbExecute(con, "
CREATE TABLE charts
(id INTEGER NOT NULL PRIMARY KEY,
FK_Charts_Albums,
chart,
chart_position
);
")
## [1] 0
Next, draft and test a query extracting peak chart positions for each album by album id (from the original charts table):
dbGetQuery(con, '
SELECT albums.id AS FK_Charts_Albums, charts_original.chart, charts_original.chart_position
FROM charts_original
LEFT JOIN albums
ON albums.title = charts_original.title
LIMIT 5
')
## FK_Charts_Albums chart chart_position
## 1 1 US 5
## 2 1 AUS 33
## 3 1 CAN 14
## 4 1 FRA —
## 5 1 GER —
Now, combine this JOIN query with an INSERT INTO query and execute, to insert data into our new table and check that the full 140 records have been created:
dbExecute(con, "
INSERT INTO charts (FK_Charts_Albums, chart, chart_position)
SELECT albums.id AS FK_Charts_Albums, charts_original.chart, charts_original.chart_position
FROM charts_original
LEFT JOIN albums
ON albums.title = charts_original.title;
")
## [1] 140
Pivoting charts and sales data using SELECT and JOIN ON queries
To pivot data so that charts and sales data for each country is presented in a new column (using RSQLite), we can use a series of JOIN … ON statements.
The following code excerpt (see full context below) creates a temporary table called US_charts
which selects rows from the charts
table where chart = "US"
(for matching album IDs):
LEFT JOIN charts AS US_charts
ON US_charts.FK_Charts_Albums = albums.id
AND US_charts.chart = "US"
This temporary table can then be referenced in the SELECT statement at the beginning of the query, which specifies we want to extract US_charts.chart_position AS US_chart
.
Note that specific code for pivoting data varies depending on SQL vendor, and Microsoft SQL server uses PIVOT and UNPIVOT operators.
We can also add calculated fields to this query to show what percentage of albums sold worldwide were sold in a specified country (e.g., SELECT US_sales.sales/WW_sales.sales*100 AS US_percent
creates a US_percent
column, by dividing US sales by worldwide sales, then multiplying by 100). The round()
wrapper rounds the result of the formula defining US_percent
to the number of decimal points specified after the comma. Note that column names included in calculations must be specified in full (e.g., US_sales.sales
), and not by aliases that will not be assigned until the SELECT statement runs (e.g., US_sales
).
What follows should be seen as a proof of concept, which I’ve used to extract just US charts and sales data:
dbGetQuery(con, '
SELECT albums.artist, albums.title, SUBSTR(albums.released, -4) AS year,
US_charts.chart_position AS US_chart,
US_sales.sales AS US_sales,
WW_sales.sales AS WW_sales,
round(US_sales.sales/WW_sales.sales*100, 1) AS US_percent
FROM albums
LEFT JOIN charts AS US_charts
ON US_charts.FK_Charts_Albums = albums.id
AND US_charts.chart = "US"
LEFT JOIN sales AS US_sales
ON US_sales.FK_Sales_Albums = albums.id
AND US_sales.country = "US"
LEFT JOIN sales AS WW_sales
ON WW_sales.FK_Sales_Albums = albums.id
AND (WW_sales.country IN ("WW", "World"))
')
## artist title year US_chart US_sales WW_sales US_percent
## 1 Taylor Swift Taylor Swift 2006 5 5720000 NA NA
## 2 Taylor Swift Fearless 2008 1 7180000 12000000 59.8
## 3 Taylor Swift Speak Now 2010 1 4694000 5000000 93.9
## 4 Taylor Swift Red 2012 1 4465000 6000000 74.4
## 5 Taylor Swift 1989 2014 1 6215000 10100000 61.5
## 6 Taylor Swift Reputation 2017 1 2300000 4500000 51.1
## 7 Taylor Swift Lover 2019 1 1085000 3200000 33.9
## 8 Taylor Swift Folklore 2020 1 NA NA NA
## 9 Beyoncé Dangerously in Love 2003 1 5100000 11000000 46.4
## 10 Beyoncé B'Day 2006 1 3610000 8000000 45.1
## 11 Beyoncé I Am... Sasha Fierce 2008 1 3380000 8000000 42.3
## 12 Beyoncé 4 2011 1 1500000 NA NA
## 13 Beyoncé Beyoncé 2013 1 2512000 5000000 50.2
## 14 Beyoncé Lemonade 2016 1 1554000 2500000 62.2
One more SQL trick: UPDATE query using CASE WHEN statements
While writing the query above, I realized that the country codes in the sales table aren’t consistent, with worldwide sales sometimes being abbreviated as “WW” and sometimes as “World”. The country codes in this table also aren’t consistent with those used in ggflags
, which may become an issue when we use ggplot2
and ggflags
to graph data by country.
So let’s update multiple country codes in the sales
table, using an UPDATE query and a series of CASE WHEN statements:
dbExecute(con, "
UPDATE sales
SET country = CASE WHEN country = 'World' THEN 'WW'
WHEN country = 'WW' THEN 'WW'
WHEN country = 'AUS' THEN 'AU'
WHEN country = 'JPN' THEN 'JP'
WHEN country = 'UK' THEN 'GB'
WHEN country = 'CAN' THEN 'CA'
WHEN country = 'FRA' THEN 'FR'
WHEN country = 'FR' THEN 'FR'
WHEN country = 'US' THEN 'US'
WHEN country = 'NA' THEN 'NA'
END;
")
## [1] 48
Using SQL directly in R Markdown code chunks, and outputting for manipulation in R
What if I now want to output this data to R to create a prettily formatted table or figure (e.g., in an R Markdown report or in a blogdown post like this one)? In the following code chunk, I draw on Andrew Couch’s tutorial again, to write SQL code directly into the R Markdown code chunk (by replacing the {R}
prefix with the following {sql, connection = con, output.var = "df"}
). The first part of this statement tells R Markdown that the chunk uses SQL code and specifies the connection to access the database. The second part of the statement (output.var = "df"
) is used to save output as a dataframe named df.
The query below uses the same statements as the example above, but captures UK and US chart data, US and WW sales data (now with consistent use of “WW” for worldwide sales), and adds a new column (other_sales
) calculating sales in countries other than the US. For ease of presentation, sales figures are now calculated in millions:
SELECT albums.artist, albums.title, SUBSTR(albums.released, -4) AS year,
US_charts.chart_position AS US_chart,
UK_charts.chart_position AS UK_chart,
round(US_sales.sales/1000000, 1) AS US_sales,
round(WW_sales.sales/1000000, 1) AS WW_sales,
round((WW_sales.sales - US_sales.sales)/1000000, 1) AS other_sales,
round(US_sales.sales/WW_sales.sales*100, 1) AS US_percent,
round((WW_sales.sales - US_sales.sales)/WW_sales.sales*100, 1) AS other_percent
FROM albums
LEFT JOIN charts AS US_charts
ON US_charts.FK_Charts_Albums = albums.id
AND US_charts.chart = "US"
LEFT JOIN charts AS UK_charts
ON UK_charts.FK_Charts_Albums = albums.id
AND UK_charts.chart = "UK"
LEFT JOIN sales AS US_sales
ON US_sales.FK_Sales_Albums = albums.id
AND US_sales.country = "US"
LEFT JOIN sales AS WW_sales
ON WW_sales.FK_Sales_Albums = albums.id
AND (WW_sales.country = "WW");
Let’s look at the output:
head(df)
## artist title year US_chart UK_chart US_sales WW_sales
## 1 Taylor Swift Taylor Swift 2006 5 81 5.7 NA
## 2 Taylor Swift Fearless 2008 1 5 7.2 12.0
## 3 Taylor Swift Speak Now 2010 1 6 4.7 5.0
## 4 Taylor Swift Red 2012 1 1 4.5 6.0
## 5 Taylor Swift 1989 2014 1 1 6.2 10.1
## 6 Taylor Swift Reputation 2017 1 1 2.3 4.5
## other_sales US_percent other_percent
## 1 NA NA NA
## 2 4.8 59.8 40.2
## 3 0.3 93.9 6.1
## 4 1.5 74.4 25.6
## 5 3.9 61.5 38.5
## 6 2.2 51.1 48.9
Creating a simple summary table using gt
There are a plethora of packages capable of creating high quality tables using R Markdown and blogdown. Some resources for exploring these options can be found on my resources page. As the purpose of this particular blog post is primarily to explore the use of SQL, I’ve made a relatively simple table here. In creating this table (using gt as shown below) I drew on:
-
specific instructions for creating summary lines in gt
-
10+ guidelines for better tables in R, in which Thomas Mock adapts for R (using gt) tables used as examples in Jon Schwabish’s Ten guidelines for better tables
-
code used by gkaramanis to make a much more visually appealing table using gt summarizing this data
# define a function to create totals for summary rows, excluding na
fns_labels <- list(Total = ~sum(., na.rm = TRUE))
# start with the data exported from SQL database
df %>%
# select variables to include in table
select(title, artist, year, US_chart, UK_chart, US_sales, other_sales, WW_sales) %>%
# create a table using gt, grouping by artist and using title for row name
gt(rowname_col = "title", groupname_col = "artist") %>%
# set column labels
cols_label(
year = "Released",
US_chart = "US",
UK_chart = "UK",
US_sales = "US",
other_sales = "Other",
WW_sales = "Total") %>%
# transform NA values in all columns to "-"
text_transform(
locations = cells_body(columns = gt::everything()),
fn = function(x) {
str_replace(x, "NA", "–")
}
) %>%
# create headings spanning multiple columns
tab_spanner(label = "Chart position", columns = vars(US_chart,
UK_chart)) %>%
tab_spanner(label = "Sales (millions)", columns = vars(US_sales,
other_sales,
WW_sales)) %>%
# create title and subtitle for table, use md formatting
tab_header(
title = md("**Taylor Swift has sold more albums in the US and overall, Beyoncé has sold more internationally**"),
subtitle = md("*Peak chart position and number of copies sold by album and location*"))%>%
# create summary rows for each group
summary_rows(groups = TRUE,
columns = vars(US_sales, other_sales, WW_sales),
fns = fns_labels,
formatter = fmt_number, decimals = 1) %>%
# create source note for table
tab_source_note("Source: Billboard via Wikipedia, October 2020")
Taylor Swift has sold more albums in the US and overall, Beyoncé has sold more internationally | ||||||
---|---|---|---|---|---|---|
Peak chart position and number of copies sold by album and location | ||||||
Released | Chart position | Sales (millions) | ||||
US | UK | US | Other | Total | ||
Taylor Swift | ||||||
Taylor Swift | 2006 | 5 | 81 | 5.7 | – | – |
Fearless | 2008 | 1 | 5 | 7.2 | 4.8 | 12.0 |
Speak Now | 2010 | 1 | 6 | 4.7 | 0.3 | 5.0 |
Red | 2012 | 1 | 1 | 4.5 | 1.5 | 6.0 |
1989 | 2014 | 1 | 1 | 6.2 | 3.9 | 10.1 |
Reputation | 2017 | 1 | 1 | 2.3 | 2.2 | 4.5 |
Lover | 2019 | 1 | 1 | 1.1 | 2.1 | 3.2 |
Folklore | 2020 | 1 | 1 | – | – | – |
Total | — | — | — | 31.7 | 14.8 | 40.8 |
Beyoncé | ||||||
Dangerously in Love | 2003 | 1 | 1 | 5.1 | 5.9 | 11.0 |
B'Day | 2006 | 1 | 3 | 3.6 | 4.4 | 8.0 |
I Am... Sasha Fierce | 2008 | 1 | 2 | 3.4 | 4.6 | 8.0 |
4 | 2011 | 1 | 1 | 1.5 | – | – |
Beyoncé | 2013 | 1 | 2 | 2.5 | 2.5 | 5.0 |
Lemonade | 2016 | 1 | 1 | 1.6 | 0.9 | 2.5 |
Total | — | — | — | 17.7 | 18.3 | 34.5 |
Source: Billboard via Wikipedia, October 2020 |
Next steps and a note on editing
As outlined above, the next steps for this particular project for me are to further refine my output table using gt
(I’m wanting to include a barplot visualizing US sales as a percentage of total sales), and using ggplot2
and ggflags
to create bar plots showing sales data by country, assigning a flag to each bar. Once again, however, these are questions for a future post.
Finally, remember to disconnect
At the end of this process, best practice is always to disconnect from the database.
dbDisconnect(con)