More experimenting with SQL in R Markdown: Pivoting data, outputting results to R, and creating a summary table using gt

Following on from my previous post, I use SELECT and JOIN statements to pivot the Taylor Swift and Beyoncé Tidy Tuesday data using RSQLite (after normalizing the underlying database), output the results to R, and create a relatively simple summary table using gt.

Setup

As in the previous post, load packages, read in data, set up a connection to the database and copy in the data, and clean release dates:

#load packages
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
library(gt) # for tables

#load data
sales <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv')
charts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/charts.csv')

#create connection to database and copy data into database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, sales)
copy_to(con, charts)

#clean release dates as per previous post
dbExecute(con, "
UPDATE charts
SET released = TRIM(SUBSTR(released, 1, INSTR(released, ' (')))
WHERE INSTR(released, ' (')>0; 
                          ")
## [1] 20
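
If you want to confirm the cleanup worked, a quick look at a few distinct release dates should do it (an optional check I've added here, not part of the original workflow):

dbGetQuery(con, '
SELECT DISTINCT released
FROM charts
LIMIT 5
                          ')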

Previewing tables using SELECT and LIMIT queries

Let’s quickly refresh our understanding of how this data is structured (focusing only on columns relevant for this tutorial):

dbGetQuery(con, '
SELECT artist, title, country, sales, released
FROM sales
LIMIT 5
                          ')
##         artist        title country    sales          released
## 1 Taylor Swift Taylor Swift      US  5720000  October 24, 2006
## 2 Taylor Swift     Fearless      WW 12000000 November 11, 2008
## 3 Taylor Swift     Fearless      US  7180000 November 11, 2008
## 4 Taylor Swift     Fearless     AUS   500000 November 11, 2008
## 5 Taylor Swift     Fearless      UK   609000 November 11, 2008
dbGetQuery(con, '
SELECT artist, title, chart, chart_position, released
FROM charts
LIMIT 5
                          ')
##         artist        title chart chart_position         released
## 1 Taylor Swift Taylor Swift    US              5 October 24, 2006
## 2 Taylor Swift Taylor Swift   AUS             33 October 24, 2006
## 3 Taylor Swift Taylor Swift   CAN             14 October 24, 2006
## 4 Taylor Swift Taylor Swift   FRA              — October 24, 2006
## 5 Taylor Swift Taylor Swift   GER              — October 24, 2006

Database normalizing

Please note that my goal in these examples is to work with RSQLite in order to learn more about using SQL in R, rather than presenting this as the best way to conduct these analyses.

The task I set myself at the end of the last post was to join and pivot this data so that each album is represented by a single row, with separate columns for sales and chart information for each country of interest.

This process will be easier to follow if the underlying data are normalized. Normalization is also best practice for relational databases: it minimizes redundancy in data storage and ensures that each table serves a single purpose. In this case, we should create a separate albums table to record the details of each album (i.e., artist, title and release date), so edits and additions to this information need only be made in one place. We then give each album a unique ID (a primary key) that can be used to reference that album in the charts and sales tables.
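
For reference, a more fully specified version of this target schema might look something like the sketch below, with explicit column types and FOREIGN KEY clauses (the tables created later in this post omit these for simplicity; the column types here are my assumptions, and note that SQLite only enforces foreign keys when PRAGMA foreign_keys = ON is set):

-- sketch of the normalized schema (not executed in this post)
CREATE TABLE albums (
  id INTEGER NOT NULL PRIMARY KEY,
  artist TEXT,
  title TEXT,
  released TEXT
);

CREATE TABLE sales (
  id INTEGER NOT NULL PRIMARY KEY,
  FK_Sales_Albums INTEGER REFERENCES albums(id),
  country TEXT,
  sales REAL
);

CREATE TABLE charts (
  id INTEGER NOT NULL PRIMARY KEY,
  FK_Charts_Albums INTEGER REFERENCES albums(id),
  chart TEXT,
  chart_position TEXT
);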

First, let's create the new albums table, with an id field (a primary key) to store a unique id number for each distinct album:

dbExecute(con, "
CREATE TABLE albums
  (id INTEGER NOT NULL PRIMARY KEY,
  artist TEXT,
  title TEXT,
  released TEXT
  );
                          ")
## [1] 0

We now populate this table with the unique albums found in the charts table (we could just as easily have done this from the sales table):

dbExecute(con, "
INSERT INTO albums (artist, title, released)
  SELECT DISTINCT artist, title, released FROM charts; 
                            ")
## [1] 14

Fourteen records have been added to this table, and the query below confirms this has worked as intended: distinct albums have been added and automatically assigned sequential unique id numbers. (I'm using LIMIT 5 here purely to save display space; omitting it returns the full 14 rows as expected.)

dbGetQuery(con, '
SELECT *
FROM albums
LIMIT 5
                          ')
##   id       artist        title          released
## 1  1 Taylor Swift Taylor Swift  October 24, 2006
## 2  2 Taylor Swift     Fearless November 11, 2008
## 3  3 Taylor Swift    Speak Now  October 25, 2010
## 4  4 Taylor Swift          Red  October 22, 2012
## 5  5 Taylor Swift         1989  October 27, 2014

Let's rename the original sales table to sales_original, so we can create a new sales table in which albums are referenced by their unique id number rather than by their title.

dbExecute(con, "
ALTER TABLE sales
  RENAME TO sales_original;
                          ")
## [1] 0

To check that we’ve renamed the table correctly, let’s list the tables accessible via the connection we defined in the setup phase:

dbListTables(con)
## [1] "albums"         "charts"         "sales_original" "sqlite_stat1"  
## [5] "sqlite_stat4"

We see here that sales has been renamed to sales_original, freeing us to create our new and improved sales table. (Note that sqlite_stat1 and sqlite_stat4 are internal tables that SQLite uses to store statistics about tables and indexes, which help the query planner optimize performance.)

Recall that we created an id column as a primary key in the albums table. We are now going to create a column for these unique id numbers as a foreign key (FK_Sales_Albums) in our replacement sales table (where these id numbers will be used to point back to the album details stored in the albums table):

dbExecute(con, "
CREATE TABLE sales
  (id INTEGER NOT NULL PRIMARY KEY,
  FK_Sales_Albums,
  country,
  sales
  );
                          ")
## [1] 0

Next, we need to extract records to add to this table. First, let’s design and test our query and check that it works as intended. We want to join the albums table to the sales_original table by matching on albums.title = sales_original.title, and extract sales by country details for each album with the correct id attached. So we select the id field from the albums table (albums.id) and label this as our foreign key (AS FK_Sales_Albums). To complete our query, we also select the country and sales fields from the original sales table:

dbGetQuery(con, '
SELECT albums.id AS FK_Sales_Albums, sales_original.country, sales_original.sales 
FROM sales_original
LEFT JOIN albums
ON albums.title = sales_original.title
LIMIT 5
                          ')
##   FK_Sales_Albums country    sales
## 1               1      US  5720000
## 2               2      WW 12000000
## 3               2      US  7180000
## 4               2     AUS   500000
## 5               2      UK   609000
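
Because this join matches on album titles, it's also worth confirming that no rows in sales_original failed to find a matching album (which would leave a NULL foreign key). A quick check along these lines (my addition, not strictly necessary) would be:

dbGetQuery(con, '
SELECT COUNT(*) AS unmatched
FROM sales_original
LEFT JOIN albums
ON albums.title = sales_original.title
WHERE albums.id IS NULL
                          ')

A result of 0 means every sales record can be traced back to an album id.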

This process of drafting queries, verifying results, and then executing updates is one that works well when designing multi-stage queries. Let’s now combine our JOIN query with an INSERT INTO query, to populate the replacement sales table:

dbExecute(con, "
INSERT INTO sales (FK_Sales_Albums, country, sales)
  SELECT albums.id AS FK_Sales_Albums, sales_original.country, sales_original.sales 
  FROM sales_original
  LEFT JOIN albums
  ON albums.title = sales_original.title;
                            ")
## [1] 48

We can repeat the same process with the charts data. First, rename the original charts table to charts_original:

dbExecute(con, "
ALTER TABLE charts
  RENAME TO charts_original;
                          ")
## [1] 0

Next, create the structure of the new and improved charts table:

dbExecute(con, "
CREATE TABLE charts
  (id INTEGER NOT NULL PRIMARY KEY,
  FK_Charts_Albums,
  chart,
  chart_position
  );
                          ")
## [1] 0

Next, draft and test a query extracting peak chart positions for each album by album id (from the original charts table):

dbGetQuery(con, '
SELECT albums.id AS FK_Charts_Albums, charts_original.chart, charts_original.chart_position 
FROM charts_original
LEFT JOIN albums
ON albums.title = charts_original.title
LIMIT 5
                          ')
##   FK_Charts_Albums chart chart_position
## 1                1    US              5
## 2                1   AUS             33
## 3                1   CAN             14
## 4                1   FRA              —
## 5                1   GER              —

Now, combine this JOIN query with an INSERT INTO query and execute it to populate the new charts table, checking that the full 140 records have been created:

dbExecute(con, "
INSERT INTO charts (FK_Charts_Albums, chart, chart_position)
  SELECT albums.id AS FK_Charts_Albums, charts_original.chart, charts_original.chart_position 
  FROM charts_original
  LEFT JOIN albums
  ON albums.title = charts_original.title;
                            ")
## [1] 140
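
If you'd like confirmation beyond the values returned by dbExecute(), a simple count of the rows now in each new table does the job (a quick optional check):

dbGetQuery(con, 'SELECT COUNT(*) AS n_charts FROM charts')
dbGetQuery(con, 'SELECT COUNT(*) AS n_sales FROM sales')

These should return counts of 140 and 48 respectively, matching the numbers reported by dbExecute() above.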

Pivoting charts and sales data using SELECT and JOIN ON queries

To pivot the data so that the charts and sales figures for each country of interest appear in their own columns (using RSQLite), we can use a series of JOIN … ON statements.

The following code excerpt (see the full query below) joins the charts table to albums under the alias US_charts, keeping only the rows where chart = "US" for each matching album id:

  LEFT JOIN charts AS US_charts 
  ON US_charts.FK_Charts_Albums = albums.id 
  AND US_charts.chart = "US"

The US_charts alias can then be referenced in the SELECT statement at the beginning of the query, which specifies that we want to extract US_charts.chart_position AS US_chart.

Note that the specific code for pivoting data varies by SQL vendor; Microsoft SQL Server, for example, provides dedicated PIVOT and UNPIVOT operators (see the sketch below).
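
Purely for comparison (and not runnable against our SQLite connection), a pivot of the chart positions in Microsoft SQL Server might look roughly like this untested sketch, reusing the table and column names from this post:

-- T-SQL sketch only: will not run in RSQLite
SELECT title, [US] AS US_chart, [UK] AS UK_chart
FROM (
  SELECT albums.title, charts.chart, charts.chart_position
  FROM charts
  LEFT JOIN albums ON albums.id = charts.FK_Charts_Albums
) AS src
PIVOT (
  MAX(chart_position) FOR chart IN ([US], [UK])
) AS p;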

We can also add calculated fields to this query to show what percentage of the albums sold worldwide were sold in a specified country (e.g., SELECT US_sales.sales/WW_sales.sales*100 AS US_percent creates a US_percent column by dividing US sales by worldwide sales and multiplying by 100). The round() wrapper rounds the result of the formula defining US_percent to the number of decimal places specified after the comma. Note that columns used in calculations must be referred to in full (e.g., US_sales.sales), not by aliases that are only assigned when the SELECT statement runs (e.g., US_sales).
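
As a minimal illustration of that calculation, here it is applied to literal values taken from the Fearless row (7,180,000 US sales out of 12,000,000 worldwide; the .0 forces floating-point division):

dbGetQuery(con, 'SELECT round(7180000.0/12000000*100, 1) AS US_percent')

This should return 59.8, matching the US_percent value for Fearless in the full query below.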

What follows should be seen as a proof of concept, which I’ve used to extract just US charts and sales data:

dbGetQuery(con, '

SELECT albums.artist, albums.title, SUBSTR(albums.released, -4) AS year,
  US_charts.chart_position AS US_chart,
  US_sales.sales AS US_sales,
  WW_sales.sales AS WW_sales,  
  round(US_sales.sales/WW_sales.sales*100, 1) AS US_percent
FROM albums
LEFT JOIN charts AS US_charts
  ON US_charts.FK_Charts_Albums = albums.id 
  AND US_charts.chart = "US"
LEFT JOIN sales AS US_sales
  ON US_sales.FK_Sales_Albums = albums.id 
  AND US_sales.country = "US"
LEFT JOIN sales AS WW_sales
  ON WW_sales.FK_Sales_Albums = albums.id 
  AND (WW_sales.country IN ("WW", "World"))
                          ')
##          artist                title year US_chart US_sales WW_sales US_percent
## 1  Taylor Swift         Taylor Swift 2006        5  5720000       NA         NA
## 2  Taylor Swift             Fearless 2008        1  7180000 12000000       59.8
## 3  Taylor Swift            Speak Now 2010        1  4694000  5000000       93.9
## 4  Taylor Swift                  Red 2012        1  4465000  6000000       74.4
## 5  Taylor Swift                 1989 2014        1  6215000 10100000       61.5
## 6  Taylor Swift           Reputation 2017        1  2300000  4500000       51.1
## 7  Taylor Swift                Lover 2019        1  1085000  3200000       33.9
## 8  Taylor Swift             Folklore 2020        1       NA       NA         NA
## 9       Beyoncé  Dangerously in Love 2003        1  5100000 11000000       46.4
## 10      Beyoncé                B'Day 2006        1  3610000  8000000       45.1
## 11      Beyoncé I Am... Sasha Fierce 2008        1  3380000  8000000       42.3
## 12      Beyoncé                    4 2011        1  1500000       NA         NA
## 13      Beyoncé              Beyoncé 2013        1  2512000  5000000       50.2
## 14      Beyoncé             Lemonade 2016        1  1554000  2500000       62.2

One more SQL trick: UPDATE query using CASE WHEN statements

While writing the query above, I realized that the country codes in the sales table aren’t consistent, with worldwide sales sometimes being abbreviated as “WW” and sometimes as “World”. The country codes in this table also aren’t consistent with those used in ggflags, which may become an issue when we use ggplot2 and ggflags to graph data by country.

So let’s update multiple country codes in the sales table, using an UPDATE query and a series of CASE WHEN statements:

dbExecute(con, "
UPDATE sales
SET country = CASE WHEN country = 'World' THEN 'WW'
                    WHEN country = 'WW' THEN 'WW'
                    WHEN country = 'AUS' THEN 'AU'
                    WHEN country = 'JPN' THEN 'JP'
                    WHEN country = 'UK' THEN 'GB'
                    WHEN country = 'CAN' THEN 'CA'
                    WHEN country = 'FRA' THEN 'FR'
                    WHEN country = 'FR' THEN 'FR'
                    WHEN country = 'US' THEN 'US'
                    WHEN country = 'NA' THEN 'NA'
                    END;
                          ")
## [1] 48
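
To confirm the recoding worked, we can list the distinct country codes now present in the sales table (an optional check; the order of results isn't guaranteed):

dbGetQuery(con, 'SELECT DISTINCT country FROM sales')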

Using SQL directly in R Markdown code chunks, and outputting for manipulation in R

What if I now want to output this data to R to create a nicely formatted table or figure (e.g., in an R Markdown report or a blogdown post like this one)? In the following code chunk, I draw on Andrew Couch's tutorial again and write SQL directly into an R Markdown code chunk, by replacing the usual {r} chunk header with {sql, connection = con, output.var = "df"}. The first part of this header tells R Markdown that the chunk contains SQL code and specifies the connection used to access the database; the second part (output.var = "df") saves the output as a dataframe named df. A minimal sketch of such a chunk is shown below.
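
As a minimal sketch, such a chunk looks something like this in the .Rmd source (the query here is just a placeholder; the real query is shown in the next chunk):

```{sql, connection = con, output.var = "df"}
SELECT artist, title FROM albums;
```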

The query below uses the same statements as the example above, but captures UK and US chart data, US and WW sales data (now with consistent use of “WW” for worldwide sales), and adds a new column (other_sales) calculating sales in countries other than the US. For ease of presentation, sales figures are now calculated in millions:


SELECT albums.artist, albums.title, SUBSTR(albums.released, -4) AS year,
  US_charts.chart_position AS US_chart,
  UK_charts.chart_position AS UK_chart,
  round(US_sales.sales/1000000, 1) AS US_sales, 
  round(WW_sales.sales/1000000, 1) AS WW_sales, 
  round((WW_sales.sales - US_sales.sales)/1000000, 1) AS other_sales,
  round(US_sales.sales/WW_sales.sales*100, 1) AS US_percent,
  round((WW_sales.sales - US_sales.sales)/WW_sales.sales*100, 1) AS other_percent
FROM albums
LEFT JOIN charts AS US_charts
  ON US_charts.FK_Charts_Albums = albums.id 
  AND US_charts.chart = "US"
LEFT JOIN charts AS UK_charts
  ON UK_charts.FK_Charts_Albums = albums.id 
  AND UK_charts.chart = "UK"
LEFT JOIN sales AS US_sales
  ON US_sales.FK_Sales_Albums = albums.id 
  AND US_sales.country = "US"
LEFT JOIN sales AS WW_sales
  ON WW_sales.FK_Sales_Albums = albums.id 
  AND (WW_sales.country = "WW");

Let’s look at the output:

head(df)
##         artist        title year US_chart UK_chart US_sales WW_sales
## 1 Taylor Swift Taylor Swift 2006        5       81      5.7       NA
## 2 Taylor Swift     Fearless 2008        1        5      7.2     12.0
## 3 Taylor Swift    Speak Now 2010        1        6      4.7      5.0
## 4 Taylor Swift          Red 2012        1        1      4.5      6.0
## 5 Taylor Swift         1989 2014        1        1      6.2     10.1
## 6 Taylor Swift   Reputation 2017        1        1      2.3      4.5
##   other_sales US_percent other_percent
## 1          NA         NA            NA
## 2         4.8       59.8          40.2
## 3         0.3       93.9           6.1
## 4         1.5       74.4          25.6
## 5         3.9       61.5          38.5
## 6         2.2       51.1          48.9

Creating a simple summary table using gt

There are a plethora of packages capable of creating high-quality tables using R Markdown and blogdown; some resources for exploring these options can be found on my resources page. As the purpose of this particular blog post is primarily to explore the use of SQL, I've kept things relatively simple here, creating the table below using gt:

# define a named list of summary functions to create totals for summary rows, excluding NAs

fns_labels <- list(Total = ~sum(., na.rm = TRUE))

# start with the data exported from SQL database

df %>%
  
  # select variables to include in table  
  select(title, artist, year, US_chart, UK_chart, US_sales, other_sales, WW_sales) %>%
  
  # create a table using gt, grouping by artist and using title for row name
  gt(rowname_col = "title", groupname_col = "artist") %>%
  
    # set column labels 
    cols_label(
      year = "Released",
      US_chart = "US",
      UK_chart = "UK",
      US_sales = "US",
      other_sales = "Other",
      WW_sales = "Total") %>%

    # transform NA values in all body cells to "–"
    text_transform(
      locations = cells_body(columns = gt::everything()),
      fn = function(x) {
      str_replace(x, "NA", "–")
      }
    ) %>%
  
    # create headings spanning multiple columns
    tab_spanner(label = "Chart position", columns = vars(US_chart, 
                                                         UK_chart)) %>%
    tab_spanner(label = "Sales (millions)", columns = vars(US_sales,
                                                            other_sales,
                                                            WW_sales)) %>%

    # create title and subtitle for table, use md formatting
    tab_header(
      title = md("**Taylor Swift has sold more albums in the US and overall, Beyoncé has sold more internationally**"),
      subtitle = md("*Peak chart position and number of copies sold by album and location*"))%>%
  
    # create summary rows for each group 
    summary_rows(groups = TRUE, 
                 columns = vars(US_sales, other_sales, WW_sales),
                 fns = fns_labels,
                 formatter = fmt_number, decimals = 1) %>%
  
    
    # create source note for table
    tab_source_note("Source: Billboard via Wikipedia, October 2020") 

Taylor Swift has sold more albums in the US and overall, Beyoncé has sold more internationally
Peak chart position and number of copies sold by album and location

                                      Chart position    Sales (millions)
                           Released     US      UK      US   Other   Total
Taylor Swift
  Taylor Swift                 2006      5      81     5.7
  Fearless                     2008      1       5     7.2     4.8    12.0
  Speak Now                    2010      1       6     4.7     0.3     5.0
  Red                          2012      1       1     4.5     1.5     6.0
  1989                         2014      1       1     6.2     3.9    10.1
  Reputation                   2017      1       1     2.3     2.2     4.5
  Lover                        2019      1       1     1.1     2.1     3.2
  Folklore                     2020      1       1
  Total                                               31.7    14.8    40.8
Beyoncé
  Dangerously in Love          2003      1       1     5.1     5.9    11.0
  B'Day                        2006      1       3     3.6     4.4     8.0
  I Am... Sasha Fierce         2008      1       2     3.4     4.6     8.0
  4                            2011      1       1     1.5
  Beyoncé                      2013      1       2     2.5     2.5     5.0
  Lemonade                     2016      1       1     1.6     0.9     2.5
  Total                                               17.7    18.3    34.5

Source: Billboard via Wikipedia, October 2020

Next steps and a note on editing

As outlined above, my next steps for this project are to further refine the output table using gt (I'd like to include a barplot visualizing US sales as a percentage of total sales), and to use ggplot2 and ggflags to create bar plots showing sales data by country, with a flag assigned to each bar. Once again, however, these are questions for a future post.

Finally, remember to disconnect

At the end of this process, best practice is always to disconnect from the database.

dbDisconnect(con)