About


While Spotify presents us with Spotify Wrapped at the end of each year, which summarizes our top songs, artists, and minutes spent listening, I was curious about how my Spotify listening patterns have evolved over a much longer time frame. Furthermore, the data analysis presented through Spotify Wrapped is fairly shallow. Spotify will provide your personal listening data on request, free to use and analyze, so I requested mine and analyzed it.

I was mainly curious about the time of day I listened to songs, my top artists and songs over time, and the number of songs I listened to each day. I also wanted to explore some other variables, such as how tracks end (e.g., stop button, next song) and how far I get through a song.

Download the data and follow along: Download

Load Packages

require(pacman)
p_load(tidyverse, jsonlite, knitr, gghighlight, plotly, hms, lubridate, here, gt, webshot, janitor, spotidy)

Exporting your Spotify Data

To export your Spotify data, you will need to request it from the Spotify app. Once your data is requested, you will receive an email with instructions on how to download it. Depending on the length of your Spotify history, there will be one or more data files. In my case, there were seven files (endsong_0.json through endsong_6.json), which I loaded into R using the jsonlite package and then combined using rbind.

It can take up to 30 days to retrieve your Spotify data. In my case it took 30 days exactly, so plan ahead of time!

Reading your Spotify Data into R

To read your Spotify data into R, we are going to use the jsonlite package with the flatten = TRUE parameter, which automatically flattens nested data in the file into regular columns.

StreamHistory0 <- fromJSON(here("Zip/MyData/endsong_0.json"), flatten = TRUE)
StreamHistory1 <- fromJSON(here("Zip/MyData/endsong_1.json"), flatten = TRUE)
StreamHistory2 <- fromJSON(here("Zip/MyData/endsong_2.json"), flatten = TRUE)
StreamHistory3 <- fromJSON(here("Zip/MyData/endsong_3.json"), flatten = TRUE)
StreamHistory4 <- fromJSON(here("Zip/MyData/endsong_4.json"), flatten = TRUE)
StreamHistory5 <- fromJSON(here("Zip/MyData/endsong_5.json"), flatten = TRUE)
StreamHistory6 <- fromJSON(here("Zip/MyData/endsong_6.json"), flatten = TRUE)

streamingData <- rbind(StreamHistory0, StreamHistory1, StreamHistory2, StreamHistory3, StreamHistory4, StreamHistory5, StreamHistory6)
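The seven near-identical fromJSON() calls above could also be written as a loop over the files. The sketch below is only an illustration: it writes two tiny stand-in files to a temporary directory (the file contents and the read_endsong() helper are made up for the example, not part of the Spotify export) and then reads and stacks every file whose name matches endsong_<number>.json. It assumes all files share the same columns, which rbind requires.

```r
library(jsonlite)

# Hypothetical helper: read every endsong_<n>.json file in `dir` and stack the rows.
read_endsong <- function(dir) {
  files <- list.files(dir, pattern = "^endsong_[0-9]+\\.json$", full.names = TRUE)
  pieces <- lapply(files, fromJSON, flatten = TRUE)
  do.call(rbind, pieces)
}

# Stand-in data so the example runs without a real Spotify export.
demo_dir <- file.path(tempdir(), "endsong_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_json(data.frame(ts = c("2022-01-01T10:00:00Z", "2022-01-01T10:04:00Z"),
                      ms_played = c(240000L, 900L)),
           file.path(demo_dir, "endsong_0.json"))
write_json(data.frame(ts = "2022-01-02T08:30:00Z", ms_played = 180000L),
           file.path(demo_dir, "endsong_1.json"))

streaming_demo <- read_endsong(demo_dir)
nrow(streaming_demo) # one row per play across both files
```

With a real export, pointing read_endsong() at the folder that holds the endsong files would avoid editing the code when the number of files changes.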

spotify <- streamingData %>% 
  as_tibble() %>%
  separate(col = "ts", 
           into = c("date","time"),
           sep = "T") %>% 
  separate(col = "time",
           into = "time",
           sep = "Z") 

datetime <- as.POSIXct(paste(spotify$date, spotify$time), format = "%Y-%m-%d %H:%M:%S")
spotify$datetime <- datetime

spotify <- spotify %>% 
  mutate(datetime = datetime - hours(6)) %>% # Convert time zones to CST
  mutate(date = floor_date(datetime, "day") %>% # Creating date 
           as_date, minutes = ms_played / 60000) %>% # Convert ms played to minutes played
  mutate(time = floor_date(datetime, "minutes")) # Remove seconds from time
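Subtracting a fixed six hours treats every timestamp as Central Standard Time, so plays during daylight saving time end up one hour early. A possible alternative, sketched below, is to parse the ts strings as UTC and let lubridate convert them to a named time zone. The zone "America/Chicago" and the two example timestamps are assumptions made for illustration; substitute the zone you actually listen in.

```r
library(lubridate)

# Two UTC timestamps in the format used by the export's `ts` column.
ts <- c("2022-01-15T18:30:00Z", "2022-07-15T18:30:00Z")

utc_time   <- ymd_hms(ts, tz = "UTC")              # parse the ISO 8601 strings as UTC
local_time <- with_tz(utc_time, "America/Chicago") # same instants, Central clock time

hour(local_time)          # 12 in January (CST), 13 in July (CDT)
hour(utc_time - hours(6)) # 12 for both with a fixed six-hour offset
```

Parsing with an explicit tz = "UTC" also avoids depending on the time zone of the machine running the script.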

The following columns will now be present in your data frame.

spotify %>% 
  glimpse(24)

Rows: 99,835
Columns: 24
$ date                              <date>
$ time                              <dttm>
$ username                          <chr>
$ platform                          <chr>
$ ms_played                         <int>
$ conn_country                      <chr>
$ ip_addr_decrypted                 <chr>
$ user_agent_decrypted              <chr>
$ master_metadata_track_name        <chr>
$ master_metadata_album_artist_name <chr>
$ master_metadata_album_album_name  <chr>
$ spotify_track_uri                 <chr>
$ episode_name                      <chr>
$ episode_show_name                 <chr>
$ spotify_episode_uri               <chr>
$ reason_start                      <chr>
$ reason_end                        <chr>
$ shuffle                           <lgl>
$ skipped                           <lgl>
$ offline                           <lgl>
$ offline_timestamp                 <dbl>
$ incognito_mode                    <lgl>
$ datetime                          <dttm>
$ minutes                           <dbl>

Listening Patterns Over Time

I am filtering out plays that lasted less than 1000 milliseconds (1 second), as these songs were not actually listened to.

Daily Songs

spotify %>% 
  filter(ms_played >= 1000) %>% 
  group_by(date)  %>%
  # group_by(date = floor_date(date, "day")) %>% # This does not work after updates. 
  summarize(songs = n()) %>% 
  arrange(date) %>% 
  ggplot(aes(x = date, y = songs)) +
  geom_col(aes(fill = songs, colour = songs)) + # Use `colour = ` here because using `fill = ` does not work with small lines.
  scale_x_date(breaks = "1 year", 
                   date_labels = "%Y",
                   expand = c(0, 0)) + # Removes white space around the plot on the x axis 
  scale_fill_gradient(high = "#0b3e34", low = "#1db954") +
  scale_colour_gradient(high = "#0b3e34", low = "#1db954") + 
  labs(x = "Date",
       y = "Number of Songs",
       colour = "Songs") +
  guides(fill = "none") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) 

Monthly Songs

spotify %>% 
  filter(ms_played >= 1000) %>% 
  group_by(date = floor_date(date, "month")) %>% # Group plays by calendar month
  summarize(songs = n()) %>% 
  arrange(date) %>% 
  ggplot(aes(x = date, y = songs)) +
  geom_col(aes(fill = songs, colour = songs)) + # Use `colour = ` here because using `fill = ` does not work with small lines.
  scale_x_date(breaks = "1 year", 
                   date_labels = "%Y",
               expand = c(0, 0)) + # Removes white space around the plot on the x axis 
  scale_fill_gradient(high = "#0b3e34", low = "#1db954") +
  scale_colour_gradient(high = "#0b3e34", low = "#1db954") + 
  labs(x = "Date",
       y = "Number of Songs",
       colour = "Songs") +
  guides(fill = "none") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) 

Listening Calendar

This figure shows the total number of minutes I spent listening to Spotify during one-hour blocks throughout the week from 2014 to 2022. Interestingly, I listen to the most music from 3PM to about 1AM. This makes sense because I often go to sleep at about 1AM and I have usually worked until somewhere between 3PM and 5PM.

spotify %>% 
  group_by(date, hour = hour(datetime), weekday = wday(date, label = TRUE)) %>% 
  summarize(minuteslistened = sum(minutes)) %>% 
  mutate(year = format(date, "%Y")) %>% 
  group_by(hour, weekday) %>% 
  summarize(minuteslistened = sum(minuteslistened)) %>% 
  drop_na() %>% # NA Weekday value of 181 dropped. 
  ggplot(aes(weekday, hour, fill = minuteslistened)) +
  geom_tile(colour = "white", size = 0.1) + 
  scale_fill_gradient(high = "#0b3e34", low = "#1db954") +
  scale_y_continuous(trans = "reverse") +
  theme_bw() +
  labs(x = "Weekday",
       y = "Hour of the Day (24-Hour, CST)",
       fill = "Minutes",
       title = "Spotify Weekly Listening Heatmap",
       subtitle = "2014 to 2022") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())
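The heatmap depends on one aggregation step: total minutes per weekday and hour. A minimal sketch of that step on made-up data is shown below, using count() with a weight instead of two summarize() passes. The plays tibble and its values are invented for illustration; only the column names datetime and minutes mirror the real data frame.

```r
library(dplyr)
library(lubridate)

# Made-up plays: three on a Monday afternoon, one on a Tuesday night.
plays <- tibble(
  datetime = ymd_hms(c("2022-03-07 15:05:00", "2022-03-07 15:40:00",
                       "2022-03-07 16:10:00", "2022-03-08 23:30:00")),
  minutes  = c(3.5, 4.0, 2.5, 5.0)
)

heat <- plays %>%
  mutate(weekday = wday(datetime, label = TRUE), # ordered factor of weekday labels
         hour    = hour(datetime)) %>%           # hour of day, 0 to 23
  count(weekday, hour, wt = minutes, name = "minuteslistened") # sum minutes per cell

heat
```

Each row of heat corresponds to one tile in the heatmap, so it can be passed straight to geom_tile().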

Artists

Top 20 Artists - All Time

spotify %>% 
  mutate(year = floor_date(date, "year")) %>% 
  mutate(year = format(year, "%Y")) %>% 
  group_by(master_metadata_album_artist_name) %>% 
  summarize(hours = sum(minutes) / 60) %>% 
  drop_na() %>% 
  arrange(desc(hours)) %>% 
  head(20) %>% # Keep the 20 artists with the most hours
  gt() %>% 
  tab_header(title = "Top 20 Artists All Time") %>% 
  cols_label(
    master_metadata_album_artist_name = "Artist",
    hours = "Hours Listened") %>% 
  fmt_number(columns = hours,
             rows = everything(),
             decimals = 0) %>% 
  as_raw_html()
  
Top 20 Artists All Time

Artist                 Hours Listened
Armin van Buuren       434
GAIA                   89
Above & Beyond         86
Gareth Emery           70
Alex M.O.R.P.H.        69
Gustav Mahler          49
Lady Gaga              44
ReOrder                43
Halsey                 39
Aly & Fila             38
Craig Connelly         37
Ilan Bluestone         31
Super8 & Tab           30
BT                     30
Giuseppe Ottaviani     30
Ian Taylor             30
Andrew Rayel           29
Solarstone             29
Cosmic Gate            28
John Williams          27

spotify %>% 
  group_by(master_metadata_album_artist_name, date = floor_date(date, "month")) %>% 
  filter(master_metadata_album_artist_name %in% c("Armin van Buuren", 
                                                  "GAIA",
                                                  "Above & Beyond",
                                                  "Gareth Emery",
                                                  "Alex M.O.R.P.H.",
                                                  "Gustav Mahler",
                                                  "Lady Gaga",
                                                  "ReOrder",
                                                  "Halsey",
                                                  "Aly & Fila")) %>% 
  summarize(hours = sum(minutes) / 60) %>% 
  drop_na() %>% 
  ggplot(aes(x = date, y = hours, group = master_metadata_album_artist_name, 
             colour = master_metadata_album_artist_name)) +
  geom_line() +
  scale_x_date(breaks = "1 year", date_labels = "%Y") +
  # `||` on vectors longer than one is an error in R 4.3 and later, so test group membership with %in% instead
  gghighlight(any(master_metadata_album_artist_name %in% c("Armin van Buuren",
                                                            "Gustav Mahler",
                                                            "Lady Gaga",
                                                            "Halsey"))) +
  labs(title = "Hours Listened to Most Listened Artists Over Time",
       subtitle = "Data for top 10 artists only \nData aggregated by month. Other grey lines are top Trance artists",
       x = "Date", 
       y = "Hours Listened") +
  theme_bw()  +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())

Listening Device



device_time <- spotify %>% 
  mutate(device = case_when(str_detect(platform, 'Android') ~ 'Android',
                            str_detect(platform, 'iOS') ~ 'iPhone',
                            str_detect(platform, 'Windows|windows|web') ~ 'Windows',
                            str_detect(platform, 'google') ~ 'Google Home'
                            )) %>% 
  group_by(device) %>% 
  summarise_at(vars(minutes), sum) %>% 
  mutate(hours = minutes/60) %>% 
  mutate(days = hours/24)

device_time %>% 
  bind_rows(
    device_time %>% 
      summarise(across(minutes:days, sum)) %>% # Column totals across all devices
      mutate(device = "Total")
  ) %>% 
  gt() %>% 
  tab_header(title = "Time Spent Listening by Platform") %>% 
  cols_label(
    device = "Platform",
    minutes = "Minutes",
    hours = "Hours",
    days = "Days") %>% 
  fmt_number(columns = c(minutes, hours, days),
             rows = everything(),
             decimals = 0) %>% 
  as_raw_html()
Time Spent Listening by Platform

Platform       Minutes    Hours    Days
Android        93,496     1,558    65
Google Home    56         1        0
iPhone         65,375     1,090    45
Windows        130,415    2,174    91
Total          289,342    4,822    201
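One thing to keep in mind with case_when() is that any platform string matching none of the patterns becomes NA. The sketch below repeats the same four patterns on made-up platform strings and adds a catch-all "Other" level so unmatched platforms stay visible. The example strings are hypothetical and only loosely modelled on the platform column.

```r
library(dplyr)
library(stringr)

# Hypothetical platform strings, loosely modelled on the export's `platform` column.
platforms <- tibble(platform = c("Android OS 11 API 30", "iOS 15.1 (iPhone12,1)",
                                 "Windows 10", "web_player windows 10",
                                 "Partner google cast_tv", "Linux"))

devices <- platforms %>%
  mutate(device = case_when(
    str_detect(platform, "Android")             ~ "Android",
    str_detect(platform, "iOS")                 ~ "iPhone",
    str_detect(platform, "Windows|windows|web") ~ "Windows",
    str_detect(platform, "google")              ~ "Google Home",
    TRUE                                        ~ "Other" # catch-all instead of NA
  ))

count(devices, device) # number of example strings per device label
```

Conditions are checked in order and the first match wins, so the catch-all has to come last.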

Track Ends

Spotify provides data that indicates how each track was stopped. For example, if the user selects the play button, or the track completes and moves onto the next track, that information is recorded.

  • appload: Track stopped upon opening of the app.

  • backbtn: User pressed the back button.

  • clickrow: User selected another song within an album/playlist.

  • fwdbtn: User pressed the forward button.

  • persisted:

  • playbtn: User pressed the play button.

  • popup:

  • remote:

  • trackdone: The track completed in its entirety.

  • trackerror: An error caused the track to stop.

  • unknown: Unknown reason.

Note that the table below is grouped by reason_start, so its counts describe how each play began. For the majority of plays (72,814; 72.93%), the track started because the previous song finished (trackdone). For 17,754 plays (17.78%), the track started because I clicked a new song in an album or playlist (clickrow). I use the forward button far more than the back button: 3.72% of plays started via the forward button, most likely because a song came on that I wished to skip, while only 0.93% started via the back button, which suggests I wanted to hear a song again.

spotify %>%
  mutate(reason_start = replace(reason_start, reason_start == "", "unknown")) %>% # Label blank reasons before grouping
  group_by(reason_start) %>% 
  dplyr::summarise(n = n()) %>% 
  mutate(percent = (n / sum(n))*100) %>% 
  
  gt() %>% 
    fmt_number(
      columns = percent,
      rows = everything(),
      decimals = 2
    ) %>% 
  as_raw_html()
reason_start    n        percent
appload         2716     2.72
backbtn         926      0.93
clickrow        17754    17.78
fwdbtn          3717     3.72
persisted       1        0.00
playbtn         582      0.58
popup           44       0.04
remote          298      0.30
trackdone       72814    72.93
trackerror      941      0.94
unknown         42       0.04
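The percentages in a table like this are each reason's count divided by the total number of plays. A compact way to compute them is sketched below on a handful of made-up reason codes, using count() in place of group_by() and summarise(). The reasons tibble is invented for the example.

```r
library(dplyr)

# Made-up start reasons; the empty string stands in for a missing value.
reasons <- tibble(reason_start = c("trackdone", "trackdone", "clickrow", "", "fwdbtn",
                                   "trackdone", "clickrow", "trackdone"))

reason_summary <- reasons %>%
  mutate(reason_start = if_else(reason_start == "", "unknown", reason_start)) %>%
  count(reason_start, sort = TRUE) %>% # one row per reason, largest count first
  mutate(percent = 100 * n / sum(n))   # share of all plays

reason_summary
```

Because sum(n) is taken over the whole summary, the percent column always adds up to 100.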

I wanted to see if I preferentially used the back button at certain points (e.g., within an album or playlist) to replay songs. As seen below, 911 and Chromatica I by Lady Gaga are the tracks most often ended via the back button.

spotify %>% 
  filter(reason_end == "backbtn") %>% 
  group_by(master_metadata_track_name) %>% 
  dplyr::summarise(n = n()) %>% 
  drop_na() %>% 
  arrange(desc(n)) %>% 
  head(10) %>% 
  gt() %>% 
  as_raw_html()
master_metadata_track_name                                   n
911                                                          7
Chromatica I                                                 7
How Can I                                                    6
Love Again                                                   5
Sound of Walking Away                                        5
Without Me                                                   5
You're A Mean One, Mr. Grinch                                5
Beautiful Life (feat. Cindy Alma) - Kat Krazy Radio Edit     4
Daydream - Extended Mix                                      4
Free Woman                                                   4

Next, I wanted to see how the way a track ends relates to how long it was played. As seen below, plays that last only a few seconds mostly end via the stop button, forward button, or back button. Past a threshold of roughly 30 seconds, most tracks are played to completion. This suggests that if I dislike a song, I typically stop or skip it within the first 30 seconds.

spotify %>% 
  mutate(reason_end = replace(reason_end, reason_end == "unknown", "Unknown"),
         reason_end = replace(reason_end, reason_end == "", "Unknown"),
         reason_end = replace(reason_end, reason_end == "unexpected-exit", "Misc"),
         reason_end = replace(reason_end, reason_end == "unexpected-exit-while-paused", "Misc"),
         reason_end = replace(reason_end, reason_end == "remote", "Misc"),
         reason_end = replace(reason_end, reason_end == "trackerror", "Misc"),
         reason_end = replace(reason_end, reason_end == "logout", "Misc"),
         reason_end = replace(reason_end, reason_end == "popup", "Misc"),
         reason_end = replace(reason_end, reason_end == "appload", "App Load"),
         reason_end = replace(reason_end, reason_end == "backbtn", "Back Button"),
         reason_end = replace(reason_end, reason_end == "fwdbtn", "Forward Button"),
         reason_end = replace(reason_end, reason_end == "endplay", "Stop Button"),
         reason_end = replace(reason_end, reason_end == "trackdone", "Track Done"),
         reason_end = replace(reason_end, reason_end == "clickrow", "New Song"),
         ) %>% 
  ggplot(aes(x = minutes)) + 
  geom_histogram(aes(fill = reason_end), bins = 150) +
  scale_x_continuous(limits = c(0,10), expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0)) +
  scale_fill_brewer("Track End \nReason", palette = "Paired")+
  theme_bw() +
  labs(x = "Minutes Played",
       y = "Track End Instances",
       title = "Reason for Spotify Track Ends - Personal Data",
       subtitle = "For songs played for 10 minutes or less",
       caption = "Viz: Cole Baril - colebaril.ca | Data: Spotify | Software: R") +
  theme(title = element_text(size = 16, face = "bold"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1))
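The roughly 30-second threshold read off the histogram could also be checked numerically by comparing the share of plays that end with trackdone on either side of it. The sketch below does this on a few made-up plays. The data are invented, and the 30,000 ms cutoff is simply the visual estimate from the plot rather than a value provided by Spotify.

```r
library(dplyr)

# Made-up plays: four short ones that were cut off, four longer ones.
plays <- tibble(
  ms_played  = c(4000, 12000, 25000, 28000, 95000, 210000, 230000, 245000),
  reason_end = c("fwdbtn", "endplay", "fwdbtn", "backbtn",
                 "fwdbtn", "trackdone", "trackdone", "trackdone")
)

end_by_window <- plays %>%
  mutate(window = if_else(ms_played < 30000, "under 30 s", "30 s or more")) %>%
  group_by(window) %>%
  summarise(plays = n(),
            share_trackdone = mean(reason_end == "trackdone")) # fraction played to completion

end_by_window
```

Running the same summary on the real spotify data frame would show whether the jump in completed tracks really sits near 30 seconds.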

References

library(purrr)
c("tidyverse", "jsonlite", "knitr", "gghighlight",
  "hms", "lubridate", "here", "gt", "webshot") %>%
  map(citation) %>%
  print(style = "text")

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.

Xie Y (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.40, https://yihui.org/knitr/.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, https://yihui.org/knitr/.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, http://www.crcpress.com/product/isbn/9781466561595.

Yutani H (2022). gghighlight: Highlight Lines and Points in ‘ggplot2’. R package version 0.4.0, https://CRAN.R-project.org/package=gghighlight.

Müller K (2022). hms: Pretty Time of Day. R package version 1.1.2, https://CRAN.R-project.org/package=hms.

Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.

Müller K (2020). here: A Simpler Way to Find Your Files. R package version 1.0.1, https://CRAN.R-project.org/package=here.

Iannone R, Cheng J, Schloerke B, Hughes E (2022). gt: Easily Create Presentation-Ready Display Tables. R package version 0.7.0, https://CRAN.R-project.org/package=gt.

Chang W (2022). webshot: Take Screenshots of Web Pages. R package version 0.5.4, https://CRAN.R-project.org/package=webshot.