Table 1: Factors for converting milliseconds to larger time units.

Unit | Multiply milliseconds by |
---|---|
Second | 1.00e-03 |
Minute | 1.67e-05 |
Hour | 2.78e-07 |
Day | 1.16e-08 |
1 INTRODUCTION
The European Union’s GDPR legislation ushered in a new era of data privacy. One of its spillover effects is the introduction of the Protection of Personal Information Act (POPIA) in South Africa and similar legislation elsewhere. Spotify subscribers who reside in the Europe, Middle East and Africa (EMEA) region can request their data, including an archive of their listening history. In this post, we will work through submitting a data request to Spotify, exporting the data to a local directory, reading it into R and exploring the listening history in the spirit of ‘Spotify Wrapped’. The goal is to demonstrate the ease with which a person can complete these tasks in R.
To follow along, you will need:
- A Spotify Premium account.
- The data export function to be available in your region.
- Access to R, whether through your local computer or {webr}.
- Spotify API access.
To get started, log in to your Spotify account, navigate to your settings, scroll to the data download option and select the datasets you want, making sure to include your listening history (archive). These steps initiate a data request, which you confirm through your email address. The confirmation email serves as an authentication layer and redirects you back to Spotify. It is worth noting that a data request may take up to a day to process and, once the data is ready, the download link will have an expiry date.
Once you get access to the dataset, a download button lets you export a zip file containing several JSON files. The JSON format is similar to a dictionary in Python in that it contains key-value pairs that can be stored in a nested structure. It is also a lightweight, widely supported storage format. We can manipulate JSON files in R using a host of packages; here, we rely on the jsonlite package. Using jsonlite, we can import a list of files and store them in a data.frame. Figure 1 illustrates how to go about importing the JSON files.
# Load the required packages, silencing start-up warnings and messages
lapply(c("tidyverse", "janitor", "fishualize", "reactable", "reactablefmtr",
         "jsonlite", "htmltools"),
       require,
       character.only = TRUE) |>
  suppressWarnings() |>
  suppressMessages()
theme_set(theme_minimal())
# List every JSON file in the export folder (the dot is escaped so it matches literally)
Data_Files <- list.files(full.names = TRUE, recursive = TRUE,
                         path = "posts/2024_Spotify_Analysis",
                         pattern = "\\.json$")
# Drop the video streaming history
Data_Files <- Data_Files[!grepl("Video", Data_Files)]
Data_Files <- data.frame(data_path = Data_Files)
# Read each file into a data.frame and stack the results
Data_Files$dataset <- lapply(Data_Files$data_path, read_json, simplifyVector = TRUE)
Listening_History <- do.call(rbind, Data_Files$dataset)
## Remove Episode Data
Listening_History <- Listening_History[!is.na(Listening_History$master_metadata_track_name),
                                       !grepl("episode", colnames(Listening_History))] |>
  mutate(ts = ymd_hms(ts),
         year = year(ts))
## Clean Column Names
colnames(Listening_History) <- gsub(pattern = "master_metadata_|_name",
                                    replacement = "",
                                    colnames(Listening_History))
- Apply the require function through a list of packages while suppressing package warnings and messages.
- List all files that end with the json file extension in the specified path.
- Search through the Data_Files vector and return all values that do not contain the term “Video”.
- Create a data.frame with a column named “data_path”.
- Using lapply (apply a function through a list), apply the read_json function from the jsonlite package with the simplifyVector option set to TRUE.
- Iteratively bind rows (do.call and rbind) of the list of data.frames contained in the dataset column and store the result in a variable called “Listening_History”.
- Remove rows with missing values in the column master_metadata_track_name and remove columns that contain the term “episode” (they refer to podcasts).
- Remove terms such as “master_metadata_” and “_name” from the column names.
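To make the import step concrete, here is a minimal sketch of how jsonlite turns an array of JSON records into a data.frame. The two records are made up for illustration; only the field names ts, ms_played and master_metadata_track_name come from the code above, and a real export file contains many more fields.

```r
library(jsonlite)

# Two made-up streaming records in the shape of a Spotify export file
json_text <- '[
  {"ts": "2024-01-01T08:00:00Z", "ms_played": 180000,
   "master_metadata_track_name": "Track A"},
  {"ts": "2024-01-01T08:03:00Z", "ms_played": 60000,
   "master_metadata_track_name": null}
]'

# simplifyVector = TRUE collapses the array of objects into a data.frame
toy_history <- fromJSON(json_text, simplifyVector = TRUE)

str(toy_history)
# A JSON null becomes NA, which is what the is.na() filter above relies on
sum(is.na(toy_history$master_metadata_track_name))
```

Records without a track name, such as podcast episodes, are the rows that the missing-value filter removes.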
The listening history object is ready for analysis. There are several insights to extract from the dataset, including overall streaming time and personalised charts across years, months, seasons and so on. The Spotify API has deprecated the audio features and recommendations endpoints. The platform’s Terms and Conditions expressly forbid the use of the data for training Machine Learning or AI models. As a result, we will not be able to create a custom playlist using ML. Nonetheless, there is sufficient data to create meaningful metrics.
Listening time is recorded in milliseconds. We can therefore use the conversion factors in Table 1 to aggregate at different levels. We create a new object called Cumulative_History to tally the listening hours over time.
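The code that builds Cumulative_History is not shown in this post, so the following is only a sketch of one way to do it under stated assumptions: the toy data, the helper name ms_to_hours and the column name total_hours are all hypothetical.

```r
# Conversion from milliseconds to hours: 1 hour = 3.6e6 milliseconds
ms_to_hours <- function(ms) ms / 3.6e6

# Made-up plays; a real Listening_History has one row per stream
toy <- data.frame(
  ts = as.POSIXct(c("2024-01-01 08:00:00", "2024-01-01 09:00:00",
                    "2024-01-02 10:00:00"), tz = "UTC"),
  ms_played = c(1800000, 1800000, 3600000)
)

# Sum the milliseconds per calendar day, then accumulate over time
daily <- aggregate(ms_played ~ date,
                   data = transform(toy, date = as.Date(ts)),
                   FUN = sum)
daily$total_hours <- cumsum(ms_to_hours(daily$ms_played))
daily
```

Dividing by 3.6e6 is equivalent to multiplying by the hourly factor of roughly 2.78e-07.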
2 VISUALISATION
2.0.1 Spotify Streaming Hours
Total streaming time: 6653.586 hours
Figure 2 illustrates the cumulative time spent listening to Spotify since 2018. Without context, this number is meaningless. I often stream Spotify while going about normal day-to-day activities, including driving, reading, working and washing dishes. The account is also linked to eight devices, including a PlayStation, the web application, desktop application(s) and a cellphone. We can expand on this information by measuring the rate of change from one year to the next.
Cumulative_History |>
  ggplot() +
  geom_line(aes(date, total_minutes),
            color = "#0f204b") +
  # alpha is a fixed setting rather than a mapping, so it sits outside aes()
  geom_area(aes(date, total_minutes), alpha = 0.8, fill = "#0f204b",
            show.legend = FALSE) +
  labs(
    x = "Date",
    y = "Hours",
    caption = "Cumulative Listening Hours for the period 2018-04-26 to 2024-12-09"
  ) +
  scale_fill_fish() +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(face = "italic")
  )
There is notable variation in year-on-year listening behaviour. The most striking is the 114 percent jump in streaming hours from 2018 to 2019. This change is largely explained by the fact that the premium account was activated in April 2018, so the periods under review are 8 months and 12 months respectively. In 2020, there is a 79.87 percent increase from the previous year (COVID-19 lockdown). The remaining years are marked by moderate changes, ranging from -15.78 percent (2020 - 2021) to 9.9 percent (2022 - 2023). Figure 3 illustrates these differences across the years.
Yearly_Hours |>
  mutate(year = as.character(year)) |>
  ggplot(aes(year, change, fill = year)) +
  geom_col(show.legend = FALSE) +
  # Label each bar with its percentage change
  geom_text(aes(year, change, label = paste0(round(change, 2), "%")),
            vjust = 1.9,
            fontface = "bold") +
  scale_fill_fish_d("Pseudochromis_aldabraensis") +
  labs(
    x = NULL,
    y = "year-on-year change (%)"
  )
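The definition of Yearly_Hours is not shown either. A minimal sketch of the year-on-year calculation follows, using made-up yearly totals. The column names year and change are taken from the plotting code above, while hours and the object name Yearly_Toy are assumptions.

```r
# Made-up yearly totals in hours
Yearly_Toy <- data.frame(
  year  = 2018:2021,
  hours = c(100, 200, 300, 270)
)

# Percentage change relative to the previous year; the first year has no baseline
previous <- c(NA, head(Yearly_Toy$hours, -1))
Yearly_Toy$change <- (Yearly_Toy$hours - previous) / previous * 100
Yearly_Toy
```

A partial first year inflates the first computed change, which is the effect described for 2018 to 2019.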
We now have a clearer picture of overall streaming behaviour. We can drill a bit further by looking at artists over time. The album_artist column contains the main artist for each song. As such, we can aggregate the duration of songs across artist and year. This step is easily achieved with a group-by statement in R or SQL. Figure 4 contains a code chunk detailing the implementation.
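# Hours streamed per artist and year (1 ms = 2.7778e-7 hours)
Top_20_Artist <- Listening_History[, c("year", "album_artist", "ms_played")] |>
  filter(!is.na(ms_played)) |>
  group_by(year, album_artist) |>
  summarise(ms_played = sum(ms_played) * 2.7778e-7, .groups = "drop") |>
  pivot_wider(names_from = year,
              values_from = ms_played) |>
  # Keep only artists streamed in every year (rows with any NA are dropped)
  na.omit() |>
  mutate(total_time = `2018` + `2019` + `2020` + `2021` + `2022` + `2023` + `2024`) |>
  # Top 20 artists by total hours, in descending order
  slice_max(total_time, n = 20)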
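One consequence of the na.omit() step is easy to miss: after pivoting, an artist who was not streamed in some year has an NA in that year's column and is dropped entirely. The sketch below uses made-up data to show this, together with a hypothetical alternative that treats missing years as zero hours through the values_fill argument of pivot_wider.

```r
library(dplyr)
library(tidyr)

# Made-up hours: Artist B was not streamed in 2019
toy <- tibble(
  year = c(2018, 2019, 2018),
  album_artist = c("Artist A", "Artist A", "Artist B"),
  hours = c(10, 5, 40)
)

# As in the post: artists with a missing year are removed
strict <- toy |>
  pivot_wider(names_from = year, values_from = hours) |>
  na.omit()

# Alternative: fill missing years with 0 so every artist is ranked
lenient <- toy |>
  pivot_wider(names_from = year, values_from = hours, values_fill = 0) |>
  mutate(total_time = `2018` + `2019`) |>
  arrange(desc(total_time))

strict$album_artist
lenient$album_artist
```

Whether to drop or zero-fill depends on the question being asked. The chart in this post ranks only artists streamed in every year from 2018 to 2024.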
The table of artists can be styled according to preference. Several R packages can achieve this objective, including but not limited to reactable and gt. We will need some assets from the Spotify API. The Terms and Conditions of the API contain some restrictions on the use and attribution of content extracted from the service. Suffice it to say that the API is used with caution here.
2.1 SPOTIFY API
library(httr2)
# Request a client-credentials token; req_body_form() URL-encodes each field
Spotify_Token <- request("https://accounts.spotify.com/api/token") |>
  req_method("POST") |>
  req_body_form(
    grant_type = "client_credentials",
    client_id = Sys.getenv("SPOTIFY_CLIENT_ID"),
    client_secret = Sys.getenv("SPOTIFY_CLIENT_SECRET")
  ) |>
  req_perform() |>
  resp_body_json()
SPOTIFY_ACCESS_TOKEN = Spotify_Token$access_token
collect_artist <- function(artist_name){
artist_result = request("https://api.spotify.com/v1/search") |>
req_method("GET") |>
req_url_query(
q = paste0(artist_name),
type = "artist",
limit = 1,
market = "ZA"
) |>
req_auth_bearer_token(token = SPOTIFY_ACCESS_TOKEN) |>
req_perform() |>
resp_body_json()
return(artist_result)}
Artist_Details <- map(Top_20_Artist$album_artist,collect_artist)
Artist_Df <- lapply(Artist_Details,function(a){
lapply(a[["artists"]][["items"]],\(b){
b[sapply(b, is.null)] <- NA
c = unlist(b) |> t() |> data.frame()
return(c)})
})
Artist_Df <- do.call(bind_rows,Artist_Df)[,c("name","genres1","popularity","followers.total","images.url.2")] |>
mutate(genres1 = case_when(is.na(genres1) & name == "Daniel Caesar" ~ "r&b",
TRUE ~ genres1))
colnames(Artist_Df) <- gsub(pattern = "\\d{1,}$|[[:punct:]]\\d{1,}$",
replacement = "",
colnames(Artist_Df)) |>
gsub(pattern = "\\.",
replacement = "_")
Artist_Df$popularity <- as.integer(Artist_Df$popularity)
Artist_Df$genres <- toupper(Artist_Df$genres)
Top_20_Artist <- Top_20_Artist |>
pivot_longer(starts_with("20")) |>
group_by(album_artist,total_time) |>
summarise(values = list(value),.groups="drop") |>
suppressWarnings() |>
suppressMessages()
- Obtain an access token. The Spotify API requires a time-limited (1 hour) access token that can be retrieved from the API by providing the client id and client secret. These credentials are provided upon signing up on the Spotify developer page. Using the httr2 package, we send a POST request to the API and receive the access token in return.
- We store the access token in a variable called SPOTIFY_ACCESS_TOKEN. A better approach would be to store the access token in a specific environment and obtain it when required. For our purposes, that approach is not necessary.
- Collect artist information function. We create a function that performs a number of steps. The Spotify API has a search endpoint. Our goal is to pass an artist name to the query parameter of the API request and authenticate the request. Since twenty artists are needed, we iteratively pass each artist name as the search query. This can be achieved through an lapply or map function.
- Apply the function through a list of artist names and store the result in a variable called Artist_Details.
- We coerce the nested lists to a data.frame using an approach from this post.
- We combine the list of data.frames into a single data.frame, extract our variables of interest and replace an NA value with the correct value.
- We clean the column names by removing trailing values.
- Finally, the Top_20_Artist variable is pivoted longer (similar to Stata’s reshape).
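The suggestion to keep the token in a dedicated environment and refresh it on demand can be sketched as follows. This is an illustration rather than the post's implementation: token_cache, get_token, fetch_fun and fake_fetch are hypothetical names, and the fetcher is passed in as a function so that the sketch runs without network access. It assumes the token response carries an expires_in field giving the lifetime in seconds.

```r
# A private environment that holds the current token and its expiry time
token_cache <- new.env()

get_token <- function(fetch_fun, now = Sys.time()) {
  expired <- is.null(token_cache$token) || now >= token_cache$expires_at
  if (expired) {
    # fetch_fun must return a list with access_token and expires_in (seconds)
    resp <- fetch_fun()
    token_cache$token <- resp$access_token
    token_cache$expires_at <- now + resp$expires_in
  }
  token_cache$token
}

# Stand-in fetcher for illustration; a real one would wrap the httr2 request above
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  list(access_token = paste0("token-", calls), expires_in = 3600)
}

t0 <- as.POSIXct("2024-12-04 12:00:00", tz = "UTC")
first  <- get_token(fake_fetch, now = t0)         # fetches a new token
second <- get_token(fake_fetch, now = t0 + 60)    # reuses the cached token
third  <- get_token(fake_fetch, now = t0 + 3601)  # expired, fetches again
```

With this pattern, a long-running session requests a fresh token only when the cached one has lapsed, instead of failing once the hour is up.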
Below, we combine the artist details and our top 20 artist variables to create the dataset that we will use for the chart. The reactable package allows us to define a theme. Here, we rely on Spotify’s corporate guidelines to create a theme for the 2018 - 2024: Top 20 Chart.
library(reactable)
library(reactablefmtr)
Sivu_theme <- reactable::reactableTheme(
color = "#FFFFFF",
backgroundColor = "#121212",
borderColor = "#121212",
highlightColor = "#1ED760",
inputStyle = list(backgroundColor = "#121212"),
selectStyle = list(backgroundColor = "#121212"),
pageButtonHoverStyle = list(backgroundColor = "#1ED760"),
pageButtonActiveStyle = list(backgroundColor = "#1ED760"),
style = list(
fontFamily = "-Arial,Helvetica,Helvetica Neue,sans-serif",
padding= 12,
`border-radius` = 12)
)
# Attach the yearly streaming hours to each artist's details
Expanded_Df <- Artist_Df[, c("images_url", "name", "genres")] |>
  rename(album_artist = name) |>
  left_join(Top_20_Artist, by = "album_artist")
reactable::reactable(Expanded_Df,
theme = Sivu_theme,
columns= list(
images_url = colDef(name = "ALBUM ART",
cell = embed_img(height = 160,
width = 160)),
album_artist = colDef(name = "ARTIST NAME"),
genres = colDef(name = "GENRE",
sortable = TRUE),
total_time = colDef(name = "TOTAL STREAMING (HOURS)",
format = colFormat(digits = 2)),
values = colDef(name = "STREAMING HOURS",
cell = react_sparkline(Expanded_Df,
show_area = TRUE,
height = 33,
show_line = TRUE,
point_size = 1.5,
labels = "last",
area_color = "#FFFFFF",
line_color = "#FFFFFF",
area_opacity = 1,
tooltip_color= "#1ED760",
tooltip = TRUE,
tooltip_type = 2))
),
sortable = TRUE,
searchable = TRUE,
highlight = TRUE,
bordered = FALSE,
outlined = TRUE,
onClick = "select",
striped = FALSE) |>
add_source(
htmltools::div(htmltools::tagAppendAttributes(shiny::icon("spotify",lib="font-awesome"),
style = "color: #1EF760; font-size: 24px; background-color: #FFFFFF"), "Data Request on 2024-12-04"),
background_color = "#1ED760",
font_style = "italic",
text_decoration = "underline",
font_color = "#FFFFFF") |>
suppressWarnings() |>
suppressMessages()
3 CONCLUSION
Figure 6 illustrates my personal top 20 artists for the period 2018 - 2024. The chart is quintessentially mainstream, including some of the most popular artists worldwide in recent memory. The listening pattern varies over time, with some of the top artists having accumulated most of their streaming hours in earlier years. This pattern is in line with changes in streaming preferences (an artifact of age, perhaps). To gain a better understanding of these changes in listening history, it is possible to generate a new playlist using the Spotify API. The playlist relies on tracks rather than artists. Consequently, it is a far more eclectic set of artists and genres. While Hip Hop dominates the playlist, it is also peppered with Jazz, RnB, Ballads, Xhosa Traditional Music, uMbhaqanga (Zulu Traditional Music) and Yacht Rock.
4 REFERENCES
Cuilla K (2022). reactablefmtr: Streamlined Table Styling and Formatting for Reactable. R package version 2.0.0, https://CRAN.R-project.org/package=reactablefmtr.
Firke S (2023). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.2.0, https://CRAN.R-project.org/package=janitor.
Iannone R, Cheng J, Schloerke B, Hughes E, Lauer A, Seo J, Brevoort K, Roy O (2024). gt: Easily Create Presentation-Ready Display Tables. R package version 0.11.1.9000, commit 9bc92cd152a7853e52e8623aced82c92c8c73504, https://github.com/rstudio/gt.
Lin G (2023). reactable: Interactive Data Tables for R. R package version 0.4.4, https://CRAN.R-project.org/package=reactable.
Mock T (2024). gtExtras: Extending ‘gt’ for Beautiful HTML Tables. R package version 0.5.0.9005, commit dc2da410b00aed73add6018b799bf45d8d86a4c4, https://github.com/jthomasmock/gtExtras.
Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.
Schiettekatte N, Brandl S, Casey J (2022). fishualize: Color Palettes Based on Fish Species. R package version 0.2.3, https://CRAN.R-project.org/package=fishualize.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.