2024 Spotify Data Analysis

We provide a walk-through for reading, cleaning and aggregating JSON files of a listener’s history from Spotify.
data cleaning
visualisation
Author
Affiliation

University of the Free State

Published

12 December 2024

1 INTRODUCTION

The European Union’s GDPR legislation ushered in a new data privacy epoch. One of its spillover effects is the introduction of the Protection of Personal Information Act (POPIA) in South Africa and similar legislation elsewhere. Spotify subscribers who reside in the Europe, Middle East and Africa (EMEA) region can request their data, including an archive of their listening history. In this post, we will work through submitting a request to Spotify for your data, exporting it to a local directory, reading it into R and exploring your listening history in the spirit of ‘Spotify Wrapped’1. The goal is to demonstrate the ease with which a person can complete these tasks in R.

Assumptions:
  1. A Spotify Premium account.
  2. The export function is available in your region.
  3. Access to R, whether through your local computer or {webr}.
  4. Spotify API access.

To get started, log in to your Spotify account, navigate to your settings, scroll to the data download option and select the dataset, making sure to include your listening history (archive). These steps will initiate a data request, which is confirmed through the user’s email address. The confirmation email serves as an authentication layer and will redirect you back to Spotify. It is worth noting that a data request may take up to a day to process and, once the data is ready, the download link will have an expiry date.

Once you get access to the dataset, you will have a download button that can be used to export the zip file containing several JSON files. The JSON format is similar to a dictionary in Python in that it contains key-value pairs that can be stored in a nested structure. It is also a lightweight, text-based storage format. We can manipulate JSON files in R using a host of packages; for our purposes, we will rely on the jsonlite package. Using jsonlite, we can import a list of files and store them in a data.frame. Figure 1 illustrates how to go about importing the JSON files.

lapply(c("tidyverse","janitor","fishualize","reactable","reactablefmtr",
         "jsonlite","htmltools"),
       require,
       character.only = TRUE) |> 
  suppressWarnings() |>
  suppressMessages()


theme_set(theme_minimal())

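# List every JSON file under the post directory; the dot is escaped so that
# only a literal ".json" extension matches.
Data_Files <- list.files(full.names = TRUE,recursive = TRUE,
           path = "posts/2024_Spotify_Analysis",
           pattern = "\\.json$")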

Data_Files <- Data_Files[!grepl("Video",Data_Files)]

Data_Files <- data.frame(data_path = Data_Files)

Data_Files$dataset <- lapply(Data_Files$data_path,read_json,simplifyVector = TRUE)

Listening_History <- do.call(rbind,Data_Files$dataset)

## Remove Episode Data

Listening_History <- Listening_History[!is.na(Listening_History$master_metadata_track_name),!grepl("episode",colnames(Listening_History))] |> 
  mutate(ts = ymd_hms(ts),
         year = year(ts))

## Clean Column Names
colnames(Listening_History) <- gsub(replacement = "",
     pattern = "master_metadata_|_name",
     colnames(Listening_History))
Figure 1: Import and Preprocess Data
  1. Apply the require function over a list of packages while suppressing package warnings and messages.
  2. List all files (list.files) that end with the json file extension in the specified path.
  3. Search through the Data_Files vector and keep all values that do not contain the term “Video”.
  4. Create a data.frame with a column named “data_path”.
  5. Use lapply (apply a function over a list) to call the read_json function from the jsonlite package, with the simplifyVector option set to TRUE, on each file path.
  6. Bind the rows (do.call and rbind) of the list of data.frames contained in the dataset column and store the result in a variable called “Listening_History”.
  7. Remove rows with missing values in the column master_metadata_track_name and remove columns that contain the term “episode” (they refer to podcasts), then parse the ts timestamp and extract the year.
  8. Remove the prefix “master_metadata_” and the suffix “_name” from the column names.
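
To make steps 5 to 8 concrete, the sketch below runs the same operations on a small inline JSON string instead of the exported files. The two records and their values are invented for illustration; only the field names (ts, ms_played, the master_metadata_ prefix and the episode column) follow the export described in this post.

```r
library(jsonlite)

# Two invented records; the second mimics a podcast episode, which has no track name.
toy_json <- '[
  {"ts": "2024-01-01T10:00:00Z", "ms_played": 180000,
   "master_metadata_track_name": "Song A",
   "master_metadata_album_artist_name": "Artist A",
   "episode_name": null},
  {"ts": "2024-01-01T11:00:00Z", "ms_played": 60000,
   "master_metadata_track_name": null,
   "master_metadata_album_artist_name": null,
   "episode_name": "Episode 1"}
]'

# simplifyVector = TRUE turns the array of records into a data.frame;
# JSON null values become NA.
toy <- fromJSON(toy_json, simplifyVector = TRUE)

# Keep rows that have a track name and drop the episode columns.
toy <- toy[!is.na(toy$master_metadata_track_name),
           !grepl("episode", colnames(toy))]

# Strip "master_metadata_" and "_name" from the column names.
colnames(toy) <- gsub("master_metadata_|_name", "", colnames(toy))

colnames(toy)
```

After these steps the toy data.frame keeps a single row with the columns ts, ms_played, track and album_artist.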

The listening history object is ready for analysis. There are several insights to extract from the dataset, including overall streaming time and personalised charts across years, months, seasons etc. The Spotify API has deprecated the audio features and recommendations endpoints. The platform’s Terms and Conditions expressly forbid the use of the data for training Machine Learning or AI models. As a result, we will not be able to create a custom playlist using ML. Nonetheless, there is sufficient data to create meaningful metrics.

Listening time is recorded in milliseconds. Thus we can use the multipliers in Table 1 below to aggregate at different levels. We create a new object called Cumulative_History to tally the listening hours over time.

Unit     Multiplier (applied to milliseconds)
Second   1.00e-03
Minute   1.67e-05
Hour     2.78e-07
Day      1.16e-08
Table 1: Conversion Table
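
As a quick check on the conversion table, the multipliers can be derived from the number of milliseconds in each unit. The sketch below is illustrative only; the names ms_per_unit and ms_to are hypothetical helpers and are not used elsewhere in this post.

```r
# Number of milliseconds in each unit.
ms_per_unit <- c(
  second = 1000,
  minute = 60 * 1000,
  hour   = 60 * 60 * 1000,
  day    = 24 * 60 * 60 * 1000
)

# Multipliers that turn a millisecond count into each unit.
multiplier <- 1 / ms_per_unit
signif(multiplier, 3)

# Hypothetical helper: convert milliseconds to the requested unit.
ms_to <- function(ms, unit = "hour") {
  ms / ms_per_unit[[unit]]
}

ms_to(3.6e6, "hour")   # 3,600,000 ms is exactly one hour
```

The hour multiplier (about 2.78e-07) is the constant 2.7778e-7 that appears in the aggregation code in this post.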
list2env(readRDS(file ="Final_Db/2018 - 2024 Spotify Data 2024-12-13.rds"),
         envir = .GlobalEnv)
<environment: R_GlobalEnv>
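# NOTE: 2.7778e-7 (= 1 / 3.6e6) converts milliseconds to hours, so despite
# their names, minutes_played and total_minutes are measured in hours.
Cumulative_History <- Listening_History[,c("ts","ms_played")] |> 
  mutate(minutes_played = (ms_played*2.7778e-7),
         date = date(ts)) |> 
  # one row per calendar day
  group_by(date) |> 
  summarise(minutes_played = sum(minutes_played),
            .groups = "drop") |> 
  # running total across days
  mutate(total_minutes = cumsum(minutes_played),
         year = year(date)) 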
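
The same daily aggregation and running total can be traced on a tiny invented dataset. This is a minimal sketch, assuming the dplyr package is installed; the three plays and the column names hours_played and total_hours are made up for illustration.

```r
library(dplyr)

# Three invented plays: two on the first day, one on the second.
toy <- data.frame(
  ts = c("2024-01-01T10:00:00Z", "2024-01-01T18:30:00Z", "2024-01-02T09:15:00Z"),
  ms_played = c(1.8e6, 1.8e6, 3.6e6)
)

toy_cumulative <- toy |>
  mutate(date = as.Date(substr(ts, 1, 10)),   # calendar day of each play
         hours_played = ms_played / 3.6e6) |> # milliseconds to hours
  group_by(date) |>
  summarise(hours_played = sum(hours_played), .groups = "drop") |>
  mutate(total_hours = cumsum(hours_played))  # running total across days

toy_cumulative
```

Each day contributes one hour here, so the running total moves from one hour on the first day to two hours on the second.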

2 VISUALISATION

2.0.1 Spotify Streaming Hours

6653.586

Figure 2 illustrates the cumulative time spent listening to Spotify since 2018. This number is meaningless without context. Often, I stream Spotify while going about normal day-to-day activities, including driving, reading, working and washing dishes. The account is also linked to eight devices, including a PlayStation, the web application, desktop application(s) and a cellphone. We can expand on this information by measuring the rate of change from one year to the next.

# total_minutes holds cumulative hours (2.7778e-7 converts ms to hours)
Cumulative_History |> 
  ggplot()+
  geom_line(aes(date,total_minutes),
                color = "#0f204b")+
  # alpha is a fixed value, so it is set outside aes()
  geom_area(aes(date,total_minutes),fill = "#0f204b",
            alpha = 0.8,
            show.legend = FALSE)+
  labs(
    x = "Date",
    y = "Hours",
    caption = "Cumulative Listening Hours for the period 2018-04-26 to 2024-12-09"
  )+
  scale_fill_fish()+
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(face = "italic")
  )

Figure 2: Cumulative Listening History (Hours)

There is notable variation in year-on-year listening behaviour. The most notable change is the 114 percent jump in streaming hours from 2018 to 2019. This change is largely explained by the fact that the premium account was activated in April 2018; thus, the periods under review are 8 months and 12 months respectively. In 2020, there is a 79.87 percent increase from the previous year (COVID-19 lockdown). The remaining years are marked by moderate changes in the range of -15.78 percent (2020 - 2021) to 9.9 percent (2022 - 2023). Figure 3 illustrates these differences across the years.

Yearly_Hours |> 
  mutate(year = as.character(year)) |> 
  ggplot(aes(year,change,fill=year))+
  geom_col(show.legend = FALSE)+
  geom_text(aes(year,change,label = paste0(round(change,2),"%")),
            vjust =1.9,
            fontface="bold")+
  scale_fill_fish_d("Pseudochromis_aldabraensis")+
  labs(
    x= NULL,
    y = "year-on-year change (%)"
  )

Figure 3: Streaming History : Year-on-Year Change (%)
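
The construction of Yearly_Hours is not shown in this post. One plausible way to arrive at a table with year and change columns is sketched below on invented data, assuming the dplyr package is installed; the names toy_history, toy_yearly and hours are hypothetical, and the author's actual object may have been built differently.

```r
library(dplyr)

# Invented listening records: 100 hours in 2018 and 214 hours in 2019.
toy_history <- data.frame(
  year = c(2018, 2018, 2019, 2019),
  ms_played = c(1.8e8, 1.8e8, 3.6e8, 4.104e8)
)

toy_yearly <- toy_history |>
  group_by(year) |>
  summarise(hours = sum(ms_played) / 3.6e6, .groups = "drop") |>
  # percentage change relative to the previous year; the first year has
  # no predecessor, so its change is NA
  mutate(change = (hours / lag(hours) - 1) * 100)

toy_yearly
```

With these toy figures the change for 2019 is roughly 114 percent, which mirrors the size of the first jump discussed above; the first year is NA and would need to be dropped or handled before plotting.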

We now have a clearer picture of overall streaming behaviour. We can drill a bit further by looking at artists over time. The album_artist column contains the main artist of each song. As such, we can aggregate the duration of songs across artist and year. This step can be achieved easily with a group-by statement in R or SQL. Figure 4 contains a code chunk detailing the implementation.

Top_20_Artist <- Listening_History[,c("year","album_artist","ms_played")] |>
  filter(!is.na(ms_played)) |> 
  group_by(year,album_artist) |> 
  # 2.7778e-7 converts milliseconds to hours
  reframe(ms_played = sum(ms_played)*2.7778e-7) |> 
  pivot_wider(names_from = year,
              values_from = ms_played) |> 
  # na.omit() keeps only artists streamed in every year from 2018 to 2024
  na.omit() |> 
  mutate(total_time = `2018`+`2019`+`2020`+`2021`+`2022`+`2023`+`2024`) |> 
  arrange(desc(total_time)) |> 
  # slice_max() replaces the superseded top_n() and names the ranking column
  slice_max(total_time, n = 20)
Figure 4: Top 20 Artists
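
One consequence of the na.omit() step in Figure 4 is that an artist with no plays in any single year is dropped from the ranking. The sketch below, on invented data and assuming dplyr (1.1.0 or later, for pick()) and tidyr are installed, contrasts that behaviour with a hypothetical alternative that fills missing years with zero. It is not the approach used in this post.

```r
library(dplyr)
library(tidyr)

# Invented data: artist "B" has no plays in 2018.
toy <- data.frame(
  year = c(2018, 2019, 2019),
  album_artist = c("A", "A", "B"),
  hours = c(10, 5, 20)
)

# Approach in the post: artists with a missing year are removed.
strict <- toy |>
  pivot_wider(names_from = year, values_from = hours) |>
  na.omit()

# Alternative (an assumption, not the author's method): treat missing years as 0.
lenient <- toy |>
  pivot_wider(names_from = year, values_from = hours, values_fill = 0) |>
  mutate(total_time = rowSums(pick(-album_artist))) |>
  arrange(desc(total_time))

strict$album_artist
lenient$album_artist
```

In the toy data, the strict version keeps only artist A, while the lenient version ranks B (20 hours) ahead of A (15 hours). Which behaviour is preferable depends on whether the chart should reward consistency across all years or total listening time.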

The table of artists can be styled according to preference. Several R packages can achieve this objective, including but not limited to reactable and gt. We will need some assets from the Spotify API. The Terms and Conditions of the API contain some restrictions on the use and attribution of content extracted from the service. Suffice it to note that we exercise caution in our use of the API.

2.1 SPOTIFY API

library(httr2)

Spotify_Token <- request("https://accounts.spotify.com/api/token/") |> 
  req_method("POST") |>
  req_body_raw(paste0("grant_type=client_credentials&client_id=",Sys.getenv("SPOTIFY_CLIENT_ID"),"&",
    "client_secret=",Sys.getenv("SPOTIFY_CLIENT_SECRET")),"application/x-www-form-urlencoded") |>
  req_perform() |> 
  resp_body_json()

SPOTIFY_ACCESS_TOKEN = Spotify_Token$access_token

collect_artist <- function(artist_name){
artist_result =   request("https://api.spotify.com/v1/search") |>
  req_method("GET") |> 
  req_url_query(
    q = paste0(artist_name),
    type = "artist",
    limit = 1,
    market = "ZA"
  ) |> 
  req_auth_bearer_token(token = SPOTIFY_ACCESS_TOKEN) |> 
  req_perform() |> 
  resp_body_json() 

return(artist_result)}

Artist_Details <- map(Top_20_Artist$album_artist,collect_artist)

Artist_Df <- lapply(Artist_Details,function(a){
  lapply(a[["artists"]][["items"]],\(b){
     b[sapply(b, is.null)] <- NA
  c = unlist(b) |> t() |> data.frame()
  return(c)})
})


Artist_Df <- do.call(bind_rows,Artist_Df)[,c("name","genres1","popularity","followers.total","images.url.2")] |> 
  mutate(genres1 = case_when(is.na(genres1) & name == "Daniel Caesar" ~ "r&b",
                             TRUE ~ genres1))

colnames(Artist_Df) <- gsub(pattern = "\\d{1,}$|[[:punct:]]\\d{1,}$",
     replacement = "",
     colnames(Artist_Df)) |> 
  gsub(pattern = "\\.",
       replacement = "_")

Artist_Df$popularity <- as.integer(Artist_Df$popularity)
Artist_Df$genres <- toupper(Artist_Df$genres)

Top_20_Artist <- Top_20_Artist |> 
  pivot_longer(starts_with("20")) |> 
  group_by(album_artist,total_time) |> 
  summarise(values = list(value),.groups="drop") |> 
  suppressWarnings() |> 
  suppressMessages()
Figure 5: Accessing the Spotify API
  1. Obtain an access token. The Spotify API requires a time-limited (1 hour) access token that can be retrieved from the API by providing the client id and client secret. These credentials are provided upon signing up on the Spotify developer page. Using the httr2 package, we send a POST request to the API and receive the access token in return.
  2. We store the access token in a variable called SPOTIFY_ACCESS_TOKEN. A better approach would be to store the access token in a dedicated environment and retrieve it when required. For our purposes, that approach is not necessary.
  3. Collect artist information function. We create a function that performs a number of steps. The Spotify API has a search endpoint. Our goal is to pass an artist name to the query parameter of the API request and authenticate the request. Since twenty artists are needed, we pass each artist name in turn as the search query. This can be achieved through an lapply or map function.
  4. Apply the function over the list of artist names and store the result in a variable called Artist_Details.
  5. We coerce the nested lists to a data.frame using an approach from this post.
  6. We combine the list of data.frames into a single data.frame, extract our variables of interest and replace an NA value with the correct value.
  7. We clean the column names by removing trailing digits and replacing dots with underscores.
  8. Finally, the Top_20_Artist variable is pivoted longer (similar to Stata’s reshape) and the yearly values are collected into a list column for each artist.
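
Step 5 is the least obvious part of Figure 5, so the sketch below reproduces the flattening on a hand-built nested list. The list only imitates the shape of one search result item; the artist name, genres, counts and image URLs are invented, and a real response contains more fields.

```r
# Invented item shaped like one entry of a search result.
item <- list(
  name = "Example Artist",
  genres = list("genre a", "genre b"),
  popularity = 80,
  followers = list(href = NULL, total = 1000),
  images = list(list(url = "u640"), list(url = "u320"), list(url = "u160"))
)

# Replace top-level NULL entries with NA so that they survive unlist().
item[sapply(item, is.null)] <- NA

# unlist() flattens the nesting into a named character vector, t() turns it
# into a one-row matrix and data.frame() makes the column names unique.
flat <- unlist(item) |> t() |> data.frame()

colnames(flat)
```

Repeated names are numbered by the flattening: the two genres become genres1 and genres2, and the three image URLs become images.url, images.url.1 and images.url.2, which is why those column names appear in the selection in Figure 5. Every value is coerced to character along the way, hence the later as.integer() call on popularity.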

Below, we combine the artist details and our top 20 artists to create the dataset behind the chart. The reactable package allows us to define a theme. Here, we rely on Spotify’s corporate guidelines to create a theme for the 2018 - 2024: Top 20 Chart.

library(reactable)
library(reactablefmtr)

Sivu_theme <-  reactable::reactableTheme(
  color = "#FFFFFF",
  backgroundColor = "#121212",
  borderColor = "#121212",
  highlightColor = "#1ED760",
  inputStyle = list(backgroundColor = "#121212"),
  selectStyle = list(backgroundColor = "#121212"),
  pageButtonHoverStyle = list(backgroundColor = "#1ED760"),
  pageButtonActiveStyle = list(backgroundColor = "#1ED760"),
  style = list(
    fontFamily = "-Arial,Helvetica,Helvetica Neue,sans-serif",
    padding= 12,
    `border-radius` = 12)
  )

Expanded_Df <- Artist_Df[,c("images_url","name","genres")] |> 
  rename(album_artist = name) |> 
  # join explicitly on the artist name instead of relying on the implicit key
  left_join(Top_20_Artist, by = "album_artist")
reactable::reactable(Expanded_Df,
                     theme = Sivu_theme,
                     columns= list(
                       images_url = colDef(name = "ALBUM ART",
                                           cell = embed_img(height = 160,
                                                            width = 160)),
                      album_artist = colDef(name = "ARTIST NAME"),
                      genres = colDef(name = "GENRE",
                                      sortable = TRUE),
                      total_time = colDef(name = "TOTAL STREAMING (HOURS)",
                                          format = colFormat(digits = 2)),
                      values = colDef(name = "STREAMING HOURS",
                                     cell = react_sparkline(Expanded_Df,
                                                            show_area = TRUE,
                                                            height = 33,
                                                            show_line = TRUE,
                                                            point_size = 1.5,
                                                            labels = "last",
                                                            area_color = "#FFFFFF",
                                                            line_color = "#FFFFFF",
                                                            area_opacity = 1,
                                                            tooltip_color= "#1ED760",
                                                            tooltip = TRUE,
                                                            tooltip_type = 2))
                     ),
                     sortable = TRUE,
                     searchable = TRUE,
                     highlight = TRUE,
                     bordered = FALSE,
                     outlined = TRUE,
                     onClick = "select",
                     striped = FALSE) |> 
  add_source(
    htmltools::div(htmltools::tagAppendAttributes(shiny::icon("spotify",lib="font-awesome"),
                                            style = "color: #1ED760; font-size: 24px; background-color: #FFFFFF"), "Data Request on 2024-12-04"),
    background_color = "#1ED760",
    font_style = "italic",
    text_decoration = "underline",
    font_color = "#FFFFFF") |> 
  suppressWarnings() |> 
  suppressMessages()

Data Request on 2024-12-04

Figure 6: 2018 - 2024: Top 20 Artists

3 CONCLUSION

Figure 6 illustrates my personal top 20 artists for the period 2018 - 2024. The chart is quintessentially mainstream, including some of the most popular artists worldwide in recent memory. The listening pattern varies over time, with some of the top artists having accumulated most of their streaming hours in earlier years. This pattern is in line with changes in streaming preferences (an artifact of age perhaps). To gain a better understanding of these changes in listening history, it is possible to generate a new playlist using the Spotify API. The playlist relies on tracks rather than artists. Consequently, it is a far more eclectic set of artists and genres. While Hip Hop dominates the playlist, it is also peppered with Jazz, RnB, Ballads, Xhosa Traditional Music, uMbhaqanga (Zulu Traditional Music) and Yacht Rock.

4 REFERENCES

Cuilla K (2022). reactablefmtr: Streamlined Table Styling and Formatting for Reactable. R package version 2.0.0, https://CRAN.R-project.org/package=reactablefmtr.

Firke S (2023). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.2.0, https://CRAN.R-project.org/package=janitor.

Iannone R, Cheng J, Schloerke B, Hughes E, Lauer A, Seo J, Brevoort K, Roy O (2024). gt: Easily Create Presentation-Ready Display Tables. R package version 0.11.1.9000, commit 9bc92cd152a7853e52e8623aced82c92c8c73504, https://github.com/rstudio/gt.

Lin G (2023). reactable: Interactive Data Tables for R. R package version 0.4.4, https://CRAN.R-project.org/package=reactable.

Mock T (2024). gtExtras: Extending ‘gt’ for Beautiful HTML Tables. R package version 0.5.0.9005, commit dc2da410b00aed73add6018b799bf45d8d86a4c4, https://github.com/jthomasmock/gtExtras.

Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.

Schiettekatte N, Brandl S, Casey J (2022). fishualize: Color Palettes Based on Fish Species. R package version 0.2.3, https://CRAN.R-project.org/package=fishualize.

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

Footnotes

  1. See Spotify Wrapped here